In this series, we’re articulating practices to consistently engineer useful features. Last time, we experimented with feature engineering for recognition of handwritten numbers.
Now, we’ll talk about the practical considerations of using the features we extracted.
In the last post, we wanted to classify the handwritten digits 0-9. As we stepped through the theory behind feature engineering, we identified a set of characteristics about the drawings that allow us to differentiate the digits on sight:
- ink (pixel density)
- horizontal symmetry
- vertical symmetry
- vertical lines
- horizontal lines
- pixel density distribution
This example requires all of our features to shatter our classes, but it does so on the assumption that we’re only using features. That’s because, so far, we have treated feature engineering as a theoretical exercise.
Most real data science projects don’t rely solely on engineered features. Instead, they attempt classification with the base attributes of the data, and then they move toward a combination of base attributes and features as model inputs to achieve the best accuracy.
If we make this allowance—if we allow base attributes to help us classify these numbers—then we may no longer need all of these features. That means we’ll need to prioritize which ones to use.
We can ask some questions to help us get a sense of those priorities.
1. How much would each of these features contribute to a correct classification?
We want to prioritize the features that will individually move us the furthest toward accomplishing our goal of classifying the numbers. If training a model on the base attributes of the data gets us most of the way to an acceptable accuracy level, it’s possible we only need one or a few engineered features to reach that level.
Let’s look at our feature candidates again:
| number | ink (pixel density) | horizontal symmetry | vertical symmetry | curves | points | vertical lines | horizontal lines | pixel density distribution |
|---|---|---|---|---|---|---|---|---|
| 0 | very heavy, apparently | yes | yes | 2 | 0 | 0 | 0 | even |
| 4 | heavy | no | no | 0 | 1 or 2 | 1 or 2 | 1 | even |
Which features in this list have the most opportunity as class differentiators?
To figure this out, we can look for a few characteristics.
A. Features that differentiate between more classes. Horizontal symmetry doesn’t tell us much about an example: it tells us that a number either is a 0, 1, or 8, or it’s not. Compare that to number of curves. This feature doesn’t differentiate all the classes on its own, but knowing an example’s number of curves considerably narrows down which class it could be. So if we could obtain only one of those two features, number of curves would advance us further toward the classification goal than horizontal symmetry would.
B. Features that differentiate more strongly between classes. For both pixel density and vertical symmetry, the numbers 5 and 8 have different values. For all the 5 and 8 examples in our dataset, how sure would each of these features make us of the example’s class? Pixel density isn’t fully trustworthy: even if we normalize for the size of the handwritten digits, someone might draw a 5 with a very full curve on the bottom and a long horizontal line at the top, and someone else might draw an 8 with an S and a /, such that the 5 has higher pixel density than the 8. The pixel density value would be measured correctly, but the class it points to would be wrong. Compare that to vertical symmetry. It’s pretty hard to make a 5 vertically symmetric and still make it look like a 5. It’s also pretty hard to make an 8 vertically asymmetric and still make it look like an 8. So vertical symmetry is a stronger differentiator for these two digits than pixel density.
C. Features that help distinguish between hard-to-distinguish classes. As we considered feature candidates, we ran into a problem: 4 and 7 had the same feature values for a lot of the features. So it’s imperative to add features that can differentiate these two if we want a model to distinguish 4s from 7s. That makes pixel density distribution a particularly valuable feature, if we can get it.
That’s the final feature contribution characteristic we’ll cover, and it brings us to another question: these contribution characteristics assume that we can obtain these features. Now we have to determine if that’s true.
2. How hard is it to obtain or approximate each of these features?
As with any development task, the complexity of each of these features factors into their priority. How could we extract each of these features from the base attributes, and what would it take to do so?
Pixel density is one of the less complex features to extract. The images in the MNIST dataset are represented as arrays of black-to-white pixel intensities, so we can sum the values of the pixels in each array.
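Here is a minimal sketch of that sum, assuming each image arrives as a 2D NumPy array in which larger values mean more ink. The function name `pixel_density` and the tiny 4x4 array are made up for illustration; a real MNIST digit would be 28x28.

```python
import numpy as np


def pixel_density(image: np.ndarray) -> float:
    """Total ink in one image: the sum of all pixel intensities."""
    return float(image.sum())


# Hypothetical 4x4 "image" standing in for a 28x28 MNIST digit.
# 0 is background; larger values mean more ink.
toy_digit = np.array([
    [0, 255, 255, 0],
    [0, 255,   0, 0],
    [0, 255,   0, 0],
    [0, 255, 255, 0],
], dtype=np.uint8)

density = pixel_density(toy_digit)
```

If the digits vary a lot in size, dividing by the number of pixels (or by the area of the digit's bounding box) would be one way to normalize the raw sum.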
What about horizontal symmetry and vertical symmetry? We could attempt to approximate these by folding the array in half horizontally or vertically and taking the difference between the two sides. Squaring the difference means its sign does not matter: both black-on-white and white-on-black asymmetry should raise the asymmetry score.
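A rough sketch of the fold-and-compare idea might look like the following. The function names are my own, and they are named after the fold rather than after "horizontal" or "vertical," since which fold corresponds to which label is an assumption you would need to settle for your own data. The toy arrays are invented for illustration.

```python
import numpy as np


def left_right_asymmetry(image: np.ndarray) -> float:
    """Fold along the vertical center line and compare the two halves."""
    img = image.astype(np.float64)            # avoid unsigned-integer wraparound
    half = img.shape[1] // 2                  # a middle column, if any, is skipped
    left = img[:, :half]
    right_mirrored = img[:, -half:][:, ::-1]  # right half, flipped to face the left
    return float(((left - right_mirrored) ** 2).sum())


def top_bottom_asymmetry(image: np.ndarray) -> float:
    """Fold along the horizontal center line: the same test on the transpose."""
    return left_right_asymmetry(image.T)


# Invented examples: one array symmetric under both folds, one lopsided.
symmetric = np.array([[0, 9, 9, 0],
                      [9, 0, 0, 9],
                      [0, 9, 9, 0]])
lopsided = np.array([[9, 0, 0, 0],
                     [9, 0, 0, 0]])

scores = {
    "symmetric": (left_right_asymmetry(symmetric), top_bottom_asymmetry(symmetric)),
    "lopsided": (left_right_asymmetry(lopsided), top_bottom_asymmetry(lopsided)),
}
```

A score of zero means the two halves match exactly; larger scores mean more asymmetry. The sketch assumes images at least two pixels wide and tall.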
These are trickier, though: a simple fold-and-compare method basically relies on the images being perfectly centered and the handwritten digits being perfectly formed. If the 8 is written with an S and a /, it will get a higher asymmetry score than an 8 “should.” So a cute 4-line method belies the complexity that would creep into a robust, production-ready extraction.
How about pixel density distribution? We might implement that in several ways, for example:
- One column representing the percentage of pixel density that occurs in the top half of the image
- Three columns for top, middle, and bottom of the image, each containing a double representing the percentage of total pixel density that happens in that portion of the image
- Individual columns for every horizontal row of the image with the same idea as the above options
For this feature, it might make sense to test out a few different implementations and see what works best.
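One way to try those options side by side is a single function with a configurable number of bands. This is only a sketch: `density_by_band` is a hypothetical name, and it assumes a 2D NumPy array whose rows run from the top of the image to the bottom. With `n_bands=2` the first value is the top-half share, with `n_bands=3` you get top, middle, and bottom, and with `n_bands` equal to the image height you get one value per row.

```python
import numpy as np


def density_by_band(image: np.ndarray, n_bands: int = 3) -> np.ndarray:
    """Share of total ink falling in each of n_bands horizontal bands."""
    img = image.astype(np.float64)
    row_ink = img.sum(axis=1)                 # ink per horizontal row
    total = row_ink.sum()
    if total == 0:                            # blank image: no ink to distribute
        return np.zeros(n_bands)
    bands = np.array_split(row_ink, n_bands)  # top to bottom, near-equal sizes
    return np.array([band.sum() / total for band in bands])


# Invented 6x2 image with ink at the top and near the bottom.
toy = np.array([[1, 1],
                [1, 1],
                [0, 0],
                [0, 0],
                [2, 2],
                [0, 0]])

halves = density_by_band(toy, n_bands=2)
thirds = density_by_band(toy, n_bands=3)
per_row = density_by_band(toy, n_bands=toy.shape[0])
```

Because every variant returns shares that sum to one, the columns are comparable across digits drawn with heavier or lighter strokes.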
None of these seem so bad, right? Maybe not. For more complex feature extraction examples, we can return to the scenario from our first post in this series: deciding where to eat for dinner.
We want to choose a restaurant for dinner, and we care about convenience to our house, flavor, and spiciness. Our data tells us each restaurant’s address and its cuisine.
How would we translate those attributes into our features?
To quantify convenience to our house, we might:
- Plug the address into a map API
- Use a public transit API to determine transit routes between the address and our house
- Figure out an equation to factor in number of transfers and number of minutes in transit to quantify the convenience of the location
- Output the result as our feature.
That takes some doing.
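To make the last two steps concrete, here is one possible scoring equation, sketched without any real API calls. Everything in it is an assumption: the `TransitRoute` class stands in for whatever a transit API would actually return, and the 10-minute penalty per transfer and the 60-minute scale are arbitrary placeholders, not tuned values.

```python
from dataclasses import dataclass


@dataclass
class TransitRoute:
    """Hypothetical summary of one route, as a transit API might return it."""
    minutes: float
    transfers: int


def convenience_score(route: TransitRoute,
                      minutes_per_transfer: float = 10.0) -> float:
    """Map a route to a score in (0, 1]; higher means more convenient.

    Each transfer is treated as costing extra minutes of hassle.
    """
    effective_minutes = route.minutes + minutes_per_transfer * route.transfers
    # An hour of effective travel halves the score; zero travel scores 1.0.
    return 1.0 / (1.0 + effective_minutes / 60.0)


# Invented routes for two restaurants.
direct_bus = TransitRoute(minutes=20, transfers=0)
three_trains = TransitRoute(minutes=30, transfers=3)

direct_score = convenience_score(direct_bus)
trains_score = convenience_score(three_trains)
```

Even this toy version forces decisions (how much is a transfer worth?) that would need to be checked against how we actually feel about getting to dinner.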
And it gets more complex. To determine flavor and spiciness, for example, all we have is the cuisine. But within a cuisine, both flavor and spiciness vary by dish and even by individual chef. So our approximations would have some uncertainty, which we’d probably have to account for in some sort of probabilistic model. Plus, we want to consider the impact of bias on our methodology: we’d want to carefully consider the way we’re making generalizations based on cuisine, and we’d also need to consider what sources are informing our assumptions about the relationship between cuisines and tastes.
Sometimes the hardest features to extract are the most worthwhile. But it’s worth considering how hard the work will be when we’re deciding whether, or how, to do it.
3. Do these features contain different information from the base attributes?
When we talked about feature complexity, we skipped over some of the features in our numerical digit example: curves, points, vertical lines, and horizontal lines, for example. This seems odd, right? After all, those features differentiate between a lot of classes, differentiate between some classes fairly strongly, and also play a large role in our ability to separate hard-to-distinguish classes like 4 and 7.
Maybe we skipped them because they’re really complex to extract? Not really: they’re tricky, to be sure, but it would be possible to pass a curve-shaped or point-shaped window over each image and take the max value of pixel densities in our window.
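As a rough illustration of that sliding-window idea, the sketch below passes a small template over an image, scores each position by the ink that falls under the template, and keeps the best score. Treating "max value in our window" as "maximum template response" is my interpretation, and `max_template_response` plus the tiny line templates are hypothetical stand-ins for real curve-shaped or point-shaped windows.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def max_template_response(image: np.ndarray, template: np.ndarray) -> float:
    """Slide template over image and return the strongest match score."""
    img = image.astype(np.float64)
    # Shape: (positions down, positions across, template rows, template cols)
    windows = sliding_window_view(img, template.shape)
    responses = (windows * template).sum(axis=(-2, -1))  # one score per position
    return float(responses.max())


# Invented 4x4 image containing a short vertical stroke.
stroke = np.array([[0, 1, 0, 0],
                   [0, 1, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 0, 0]])

vertical_line = np.ones((3, 1))    # 3 pixels tall, 1 wide
horizontal_line = np.ones((1, 3))  # 1 pixel tall, 3 wide

vertical_score = max_template_response(stroke, vertical_line)
horizontal_score = max_template_response(stroke, horizontal_line)
```

The vertical template scores higher on this image than the horizontal one, which is the signal we would want. A raw sum like this rewards any ink under the template, so a sturdier version would probably normalize the response.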
Does that idea sound familiar? If you’re familiar with convolution kernels, it might. Neural networks for image processing work by attempting to reduce the characteristics of each image to a series of more complex but less numerous quantifiable features. The Learn OpenCV blog is the most articulate resource I have found on the details of that process (OpenCV is an image processing library, but the blog talks about computer vision in general as well).
But if convolutional neural networks do this kind of thing automatically, then why would we manually extract the features?
Exactly. Curves, points, and lines are exactly the kind of thing that we’d expect our model to extract from our base data attributes without our help.
In fact, the deeper layers of an image processing network can also pick up on more complex features like symmetry. A big fancy one trained on pet images, for example, might have neurons that specifically activate for features as complex as, say, the face of a dog versus the face of a cat.
So how do we know which information our existing model will automatically extract from our base attributes?
A common approach is to start with some training rounds on the base attributes of the data, then look for patterns in what the model gets wrong. If our model has 0s and 1s figured out but it struggles with 4s and 7s, then we know it did not find enough information in the base attributes to differentiate 4s and 7s. So we can prioritize developing features that help the model detect this distinction more clearly.
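One common way to look for those patterns is a confusion matrix. The sketch below builds one with plain NumPy and lists the most frequently confused class pairs. The function names are my own, and the labels and predictions are invented to mimic a model that mixes up 4s and 7s; they are not results from a real model.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes: int = 10) -> np.ndarray:
    """counts[t, p] = number of examples with true class t predicted as p."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(counts, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return counts


def most_confused_pairs(counts: np.ndarray, top: int = 3):
    """Return the largest off-diagonal cells as (true, predicted, count)."""
    errors = counts.copy()
    np.fill_diagonal(errors, 0)                     # ignore correct predictions
    order = np.argsort(errors, axis=None)[::-1][:top]
    rows, cols = np.unravel_index(order, errors.shape)
    return [(int(t), int(p), int(errors[t, p]))
            for t, p in zip(rows, cols) if errors[t, p] > 0]


# Invented labels and predictions for illustration only.
y_true = [0, 1, 4, 4, 4, 7, 7, 7, 7]
y_pred = [0, 1, 4, 7, 7, 7, 4, 7, 7]

counts = confusion_matrix(y_true, y_pred)
worst = most_confused_pairs(counts, top=2)
```

Here `worst` surfaces the 4-versus-7 confusion in both directions, which is the cue to prioritize a feature that separates those two digits.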
That seems straightforward. Why do all this theoretical exercise, shattergramming, and feature evaluation if we can just do that?
Without a practical understanding of our problem, we can end up adding unnecessary and unhelpful complexity to our model.
Say we do several rounds of error analysis, and each time we try tacking on a new feature that we think might work. When we continually do this without taking a step back to look at our classes, we might end up with a lot of features that make the model harder to interpret but don’t add much to the accuracy. We might also end up with multiple features that all try, in different ways, to code the same distinction, which adds noise that can hurt model accuracy.
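One quick, imperfect check for features that encode the same distinction is to look at how strongly they correlate with each other. The sketch below is an assumption-laden illustration, not a full redundancy analysis: `redundant_feature_pairs` is a hypothetical helper, the 0.9 threshold is arbitrary, the feature values are invented, and Pearson correlation only catches linear overlap.

```python
import numpy as np


def redundant_feature_pairs(features: np.ndarray, names, threshold: float = 0.9):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = np.corrcoef(features, rowvar=False)  # columns are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j]))
    return pairs


# Invented feature values for five examples.
top_half_share = np.array([0.8, 0.7, 0.3, 0.2, 0.5])
bottom_half_share = 1.0 - top_half_share   # carries the same information
ink = np.array([10.0, 30.0, 20.0, 40.0, 15.0])

names = ["top_half_share", "bottom_half_share", "ink"]
features = np.column_stack([top_half_share, bottom_half_share, ink])

redundant = redundant_feature_pairs(features, names)
```

In this toy data the two half-share columns are flagged because one is just one minus the other, so keeping both adds a column without adding information.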
Finally, if I might insert a fundamentalist perspective into the mix, it’s immensely harder to move closer to solving a problem that we refuse to understand. So in many cases spending the time to understand the problem can yield far greater rewards with far less frustration than yanking back and forth on the model’s feature set.
Once we’ve thought about the features we could engineer to separate the classes in our classification problem, we’ll want to evaluate the utility of each of those features. To do that, we can ask a few questions about each one.
- How much would each of these features contribute to a correct classification? A contribution can come from a few places, so we want to think about a few contribution characteristics for each feature:
  - How many classes does it differentiate?
  - How strongly does it differentiate between classes?
  - Does it distinguish between hard-to-distinguish classes?
- How hard is it to obtain or approximate each of these features? It’s important to consider, before attempting to extract a feature, how we might extract it and how complex that might be. Some features take more work than others.
- Do these features contain different information than the base attributes? If our model will already pick up on information from the base attributes, then we may not need features to make that information more explicit. We can figure out what additional information the model needs by training it on base attributes and seeing what it still gets wrong.
Data scientists often rely on a cycle of training models, validating them, analyzing errors, and then tweaking. We can do that for feature engineering as well. So why talk about any of this theory or strategy behind engineering features? In short, because it helps us more accurately understand our problem. And when we work to understand a problem, we have a better chance of using our practical tools to develop an elegant solution.