In this series, we’re articulating practices to consistently engineer useful features. Last time, we reviewed the intuition behind feature engineering: what it is, why we might use it, and how it might apply.
Today we’ll dive into an example: experimenting with feature engineering for recognition of handwritten numbers. This is a classification task—the task of using machine learning to sort examples into buckets.
Classification tasks represent about 70% of today’s customer-facing production machine learning applications, so the feature engineering principles we’ll cover with this example are relevant to a large portion of commercial automation teams.
We’ll start by looking at the most extreme case of feature engineering: the case where the features can totally replace the base attributes as inputs to a classification model. Many real-world models cannot achieve good performance (accuracy) under these conditions, but we can start there as a theoretical exercise to understand the characteristics of features. Once we’ve covered these theoretical principles of feature engineering, we’ll see how to apply them in practice in the next post.
I like to drive my research through questions, so that’s how we’ll proceed.
Question 1: What characterizes each of our classes?
Here we have ten different classes: the numbers 0-9.
Let’s individually consider each class and the characteristics that allow us to recognize it.
What makes a handwritten one (1) recognizable as a 1?
Here’s a sample of what I have:
- One vertical line
- No horizontal lines, or maybe one horizontal line on the bottom
- One pen stroke
- Not that much ink
- Vertically symmetric
- Horizontally symmetric
- No curves
- Maybe a pointy thing on the end
We can repeat this process for each of our classes and come up with a list of characteristics that we associate with them. Suppose we brainstormed for each of our classes and made a list of class characteristics. Let’s add some structure to that data:
number | ink (pixel density) | horizontal symmetry | vertical symmetry | curves | points |
---|---|---|---|---|---|
0 | some | yes | yes | 2 | 0 |
1 | little | yes | yes | 0 | 0 |
2 | some | no | no | 1 | 1 |
3 | some | no | yes | 2 | 1 |
4 | some | no | no | 0 | 1 or 2 |
5 | some | no | no | 1 | 2 |
6 | some | no | no | 2 | 0 |
7 | some | no | no | 0 | 1 |
8 | a lot | yes | yes | 4 | 0 |
9 | some | no | no | 2 | 0 |
Instead of filling this table with every single item on all my lists, I have listed a few characteristics that showed up on many of my class lists. You can do this too: your table doesn’t necessarily have to have every single thing on it.
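If it helps to see those hypotheses in a form a program can poke at, here’s a minimal sketch of one way to transcribe the table. The dictionary name and the True/False encoding are my own bookkeeping, not anything that comes with the dataset:

```python
# A transcription of the brainstormed class characteristics as data.
# Nothing here is computed from images yet; these are just our hypotheses.
digit_traits = {
    #      ink      h-sym  v-sym  curves  points
    "0": ("some",   True,  True,  2, "0"),
    "1": ("little", True,  True,  0, "0"),
    "2": ("some",   False, False, 1, "1"),
    "3": ("some",   False, True,  2, "1"),
    "4": ("some",   False, False, 0, "1 or 2"),
    "5": ("some",   False, False, 1, "2"),
    "6": ("some",   False, False, 2, "0"),
    "7": ("some",   False, False, 0, "1"),
    "8": ("a lot",  True,  True,  4, "0"),
    "9": ("some",   False, False, 2, "0"),
}
```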
Question 2: For what feature set does each class have a unique combination of values?
This determines how small, how simple, and how streamlined our model can be while doing this classification task.
The smallest possible set of features on which we could base a model is one feature. Let’s see how that breaks down for number classes. Do any of our features (columns) have a unique value in them for every single one of our classes?
The answer is no. Look at “Ink,” for example. With the exception of 1 and 8, every single class has the same value here (“some”).
As it turns out, there is a little more stratification than this, but still not enough for this feature to shatter the classes on its own. Check out the 95% confidence intervals for pixel density in a sample of the MNIST dataset; we see a lot of overlap for several of our classes:
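If you want to reproduce that overlap yourself, here’s a minimal sketch, assuming the OpenML copy of MNIST that scikit-learn can fetch (“mnist_784”) and using a normal approximation for the interval:

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Pull MNIST: 70,000 flattened 28x28 images with pixel values 0-255.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data, mnist.target  # y holds the digit labels as strings

# "Ink" per image: the mean pixel intensity across all 784 pixels.
ink = X.mean(axis=1)

for digit in sorted(set(y)):
    values = ink[y == digit]
    mean = values.mean()
    # 95% confidence interval for the class mean (normal approximation).
    half_width = 1.96 * values.std(ddof=1) / np.sqrt(len(values))
    print(f"{digit}: {mean:.1f} ± {half_width:.1f}")
```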
Let’s update our table to better reflect pixel density, and let’s also see if we can shatter the classes with two features. I have added colored backgrounds to make it easier to see quickly if two classes have matching feature values.
number | ink (pixel density) | horizontal symmetry | vertical symmetry | curves | points |
---|---|---|---|---|---|
0 | very heavy, apparently | yes | yes | 2 | 0 |
1 | little | yes | yes | 0 | 0 |
2 | heavy | no | no | 1 | 1 |
3 | heavy | no | yes | 2 | 1 |
4 | heavy | no | no | 0 | 1 or 2 |
5 | medium | no | no | 1 | 2 |
6 | medium | no | no | 2 | 0 |
7 | heavy | no | no | 0 | 1 |
8 | heavy | yes | yes | 4 | 0 |
9 | heavy | no | no | 2 | 0 |
- What if we use both pixel density and horizontal symmetry? The value combination for these features is still identical for 2, 3, 4, and 7, and also for 5 and 6.
- What if we add vertical symmetry? In that case 2, 4, and 7 still have identical values, as do 5 and 6.
- What if we add number of curves? Still no. The 4 and the 7 have identical feature values.
- In fact, no combination of these features completely separates these classes. That’s because the open-top 4 (with 1 point on that left corner, but not at the top) and the 7 have identical feature values across all five features. Looking at the sample image grid above, I see 15 open-top 4s in the row of 16 4s. Suffice it to say they’re common enough that this would be an issue.
This table lists every combination of these features, along with example classes that end up with identical feature values under that combination.
number of features | ink (pixel density) | horizontal symmetry | vertical symmetry | curves | points | example identical values | shatter? |
---|---|---|---|---|---|---|---|
1 | x | | | | | 2,3,4,7; 5,6 | no
1 | | x | | | | 0,1,8; rest | no
1 | | | x | | | 0,1,3,8; rest | no
1 | | | | x | | 0,3,6,9; 4,7 | no
1 | | | | | x | 0,1,6,8,9; 4,7 | no
2 | x | x | | | | 2,3; 4,7 | no
2 | x | | x | | | 5,6; 4,7 | no
2 | x | | | x | | 3,9; 4,7 | no
2 | x | | | | x | 2,3; 4,7 | no
2 | | x | x | | | 0,1; 4,7 | no
2 | | x | | x | | 3,6; 4,7 | no
2 | | x | | | x | 2,3; 4,7 | no
2 | | | x | x | | 0,3; 4,7 | no
2 | | | x | | x | 0,1; 4,7 | no
2 | | | | x | x | 0,9; 4,7 | no
3 | x | x | x | | | 2,4; 4,7 | no
3 | x | x | | x | | 3,9; 4,7 | no
3 | x | x | | | x | 2,3; 4,7 | no
3 | x | | x | x | | 4,7 | no
3 | x | | x | | x | 4,7 | no
3 | x | | | x | x | 4,7 | no
3 | | x | x | x | | 2,5; 4,7 | no
3 | | x | x | | x | 4,7 | no
3 | | x | | x | x | 4,7 | no
3 | | | x | x | x | 4,7 | no
4 | x | x | x | x | | 4,7 | no
4 | x | x | x | | x | 2,4; 4,7 | no
4 | x | x | | x | x | 4,7 | no
4 | x | | x | x | x | 4,7 | no
4 | | x | x | x | x | 4,7 | no
5 | x | x | x | x | x | 4,7 | no
Again, note that the answer to “shatter?” for every single one of our feature combinations is “no.” What does this mean? It means that, even if we could encode each of these features with perfect accuracy for the vast majority of our examples (an enormous if that we will return to later), we still could not rely on these features alone to distinguish all of our classes.
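For the curious, here’s a brute-force sketch of the check that table summarizes. The `digit_features` dictionary and the `shatters` helper are hypothetical names of mine; the values are transcribed from the updated table, using the open-top 4 (one point) so that it collides with the 7:

```python
from itertools import combinations

FEATURES = ("ink", "h_symmetry", "v_symmetry", "curves", "points")
digit_features = {
    "0": ("very heavy", True,  True,  2, 0),
    "1": ("little",     True,  True,  0, 0),
    "2": ("heavy",      False, False, 1, 1),
    "3": ("heavy",      False, True,  2, 1),
    "4": ("heavy",      False, False, 0, 1),  # open-top 4: one point, same as a 7
    "5": ("medium",     False, False, 1, 2),
    "6": ("medium",     False, False, 2, 0),
    "7": ("heavy",      False, False, 0, 1),
    "8": ("heavy",      True,  True,  4, 0),
    "9": ("heavy",      False, False, 2, 0),
}

def shatters(feature_indices):
    """True if every digit gets a unique value combination for these features."""
    projected = [tuple(row[i] for i in feature_indices)
                 for row in digit_features.values()]
    return len(set(projected)) == len(projected)

# Try every subset of the five features. Because the 4 and the 7 share every
# value, each of the 31 subsets prints "no".
for size in range(1, len(FEATURES) + 1):
    for subset in combinations(range(len(FEATURES)), size):
        names = ", ".join(FEATURES[i] for i in subset)
        print(f"{names}: {'shatters' if shatters(subset) else 'no'}")
```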
That doesn’t mean our work here is useless. In fact, in most datasets, you won’t have the opportunity to rely only on engineered features to produce acceptable accuracy. So here we have a choice about what to try next:
- Think of additional features that would shatter the classes
- Begin discussing the quality and utility of the engineered features we already have for classifying things
- Consider how badly we need all of these classes, and whether we can combine the ones that don’t differentiate well.
We’re going with number 1, for a few reasons:
- We’re doing #2 in the next blog post, so I don’t want to steal content from that post
- For #3 the question would essentially be “do we care about being able to differentiate 4 and 7?” This is a bit of a contrived situation: if we’re identifying numbers, of course we care about differentiating those two. We’ll talk about real-world examples in a future post, and that will clarify why #3 might be an option in some cases.
- I’m having fun with this feature thing, so let’s take another crack at coming up with features that shatter the classes.
Question 3: What differentiates a hard-to-distinguish class from its next closest class?
The open-top 4 and the 7 are giving us some trouble: their feature values match across all five features. According to our feature map, they’re very similar. But when we look at 4s and 7s with our eyeballs, we tell them apart quite easily. Why? Let’s look more closely at 4s and 7s.
You know, I think of these numbers as easy to differentiate, but now that I’m looking at them, I start to understand the confusion. Compare the first 7 and the eleventh 4. Now compare the eighth 4 and the seventh 7. If you made that 7 just a little more concave on top, I wouldn’t necessarily know for sure that it wasn’t supposed to be a 4.
Most of these cases, though, I could tell pretty fast what they are. What characteristics might I use to distinguish them? Here are some options:
- 4s have one horizontal line. 7s also have one, or two in the case of that first 7.
- 7s have one diagonal line. 4s have none if they’re open-top and two if they’re closed-top.
- 7s have no vertical lines. 4s have two if they’re open-top and one if they’re closed-top.
- 7s have more pixel density at the top than anywhere else. 4s have more pixel density in the middle.
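That last idea lends itself to a quick sketch. Assuming 28x28 MNIST-style images, here’s one hypothetical way to measure where the ink sits; the function name and the top/middle/bottom split are my own choices, not an established recipe:

```python
import numpy as np

def ink_by_region(image: np.ndarray) -> dict:
    """Fraction of total ink in the top, middle, and bottom bands of a 28x28 image."""
    rows = image.reshape(28, 28).sum(axis=1)  # total ink in each row of pixels
    total = rows.sum() or 1                   # guard against an all-blank image
    return {
        "top": rows[:9].sum() / total,
        "middle": rows[9:19].sum() / total,
        "bottom": rows[19:].sum() / total,
    }
```

On real examples, you’d expect a 7 to put most of its ink in the top band and a 4 to spread it more toward the middle, which is exactly the kind of separation the last bullet is after.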
Let’s add some features to our table and see how we do:
number | ink (pixel density) | horizontal symmetry | vertical symmetry | curves | points | vertical lines | horizontal lines | pixel density distribution |
---|---|---|---|---|---|---|---|---|
0 | very heavy, apparently | yes | yes | 2 | 0 | 0 | 0 | even |
1 | little | yes | yes | 0 | 0 | 1 | 0 | even |
2 | heavy | no | no | 1 | 1 | 0 | 1 | bottom |
3 | heavy | no | yes | 2 | 1 | 0 | 0 | even |
4 | heavy | no | no | 0 | 1 or 2 | 1 or 2 | 1 | even |
5 | medium | no | no | 1 | 2 | 1 | 1 | even |
6 | medium | no | no | 2 | 0 | 0 | 0 | bottom |
7 | heavy | no | no | 0 | 1 | 0 | 1 | top |
8 | heavy | yes | yes | 4 | 0 | 0 | 0 | even |
9 | heavy | no | no | 2 | 0 | 0 | 0 | top |
These added features now theoretically allow us to shatter our classes: that is, if we can extract these feature values from the attributes of the dataset with perfect accuracy, then we can also classify the examples based on these features alone rather than the base attributes. This is the aforementioned enormous if.
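And here’s the same kind of uniqueness check as before, re-run on a transcription of the expanded table (again using the open-top 4, now with two vertical lines and one point):

```python
# Each digit mapped to (ink, h-sym, v-sym, curves, points,
#                       vertical lines, horizontal lines, density distribution).
expanded = {
    "0": ("very heavy", True,  True,  2, 0, 0, 0, "even"),
    "1": ("little",     True,  True,  0, 0, 1, 0, "even"),
    "2": ("heavy",      False, False, 1, 1, 0, 1, "bottom"),
    "3": ("heavy",      False, True,  2, 1, 0, 0, "even"),
    "4": ("heavy",      False, False, 0, 1, 2, 1, "even"),   # open-top 4
    "5": ("medium",     False, False, 1, 2, 1, 1, "even"),
    "6": ("medium",     False, False, 2, 0, 0, 0, "bottom"),
    "7": ("heavy",      False, False, 0, 1, 0, 1, "top"),
    "8": ("heavy",      True,  True,  4, 0, 0, 0, "even"),
    "9": ("heavy",      False, False, 2, 0, 0, 0, "top"),
}
rows = list(expanded.values())
print(len(set(rows)) == len(rows))  # True: every class now has a unique combination
```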
How could we go about extracting these features? How accurate would those extractions be? And even if we could do it, would extracting these features give us any practical advantage over training on the base attributes for this data?
These are the types of questions we’ll ask in the next post, where we apply our theoretical understanding of the features to the data itself.
Conclusion
We have introduced a classification problem—recognition of handwritten numbers—to help us articulate some practices for feature engineering.
We began with a theoretical exercise: attempting to identify the features that characterize each of our classes and distinguish them from one another. If we can identify a feature set for which each of our classes has a unique combination of values, then knowing these features theoretically allows us to classify our data.
To determine this, we ask some questions about our problem:
- What characterizes each of our classes?
- For what feature set does each class have a unique combination of values?
- What differentiates a hard-to-distinguish class from its next closest class?
Once we have answers to these questions, it’s time to get practical. In the next post, we’ll talk about how we might extract these features from our data, and under what conditions it would be useful to do so.
If you liked this post, you might also like:
Behind the Scenes: Syllabus Design (a fun look at structuring a roadmap for formal learning)
Design Patterns for Data Science (an ongoing series on the “engineering” part of machine learning engineering)
“Smart” is not a Hiring Criterion (a good reminder for propeller heads everywhere)