I’m building a machine learning library. Machine learning, like cryptography, is not a roll-your-own endeavor if you know what’s good for you. My goal isn’t to make another scikit-learn, but rather to examine the intuition behind different cost functions.
Side benefits: I get to experiment with API design in the context of machine learning solutions, and I get to breathe life into the data visualizations. I’m sharing the first visualization with you, which graphs predictions and outcomes for multidimensional data.
The thing about humans and computers is that, while computers have no trouble working with data with thousands of dimensions, humans struggle to wrap their heads around more than three. This has led to a whole subfield of machine learning dedicated to the representation of high-dimensional data in ways that our puny brains can understand.
The difficulty comes from the way we customarily map data dimensions onto orthogonal dimensions in space. We can only visualize three dimensions this way: there aren’t enough spatial dimensions to make this work for high-dimensional data.
What if we tried a different approach? Instead of mapping data dimensions onto orthogonal axes, we can represent each dimension by its contribution to our predicted outcome.
The graph shows the predicted and actual outcomes for a linear regression function run on a toy dataset. Each bar represents a predicted outcome for one data point. The dot represents the actual outcome for that data point.
This was a two-dimensional dataset: there are two features for each data point, and those features are represented by the blue and green bars that you see on the graph. The blue bar for each data point represents the value of some_feature for that point, multiplied by the weight (or slope) the regressor assigned to that feature. The green bar does the same for some_other_feature. The red bar at the bottom represents the intercept, which is a constant added to every prediction rather than multiplied by any feature value. That’s why all the red bars are the same height.
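To make the bar heights concrete, here is a minimal sketch of the arithmetic using NumPy. The function name `feature_contributions` and the toy numbers are my own placeholders for illustration, not values from the dataset in the graph.

```python
import numpy as np

def feature_contributions(X, weights):
    """Return an (n_points, n_features) array of feature value * weight products."""
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Broadcasting multiplies every column of X by that column's weight.
    return X * weights

# Hypothetical toy data: three points, two features each.
X = np.array([[1.0, 2.0],
              [3.0, 0.5],
              [2.0, 4.0]])
weights = np.array([2.0, 0.5])   # assumed fitted weights, one per feature
intercept = 1.0                  # assumed fitted intercept

contributions = feature_contributions(X, weights)   # one bar segment per feature
predictions = contributions.sum(axis=1) + intercept # total bar height per point
```

Each row of `contributions` holds the colored segments for one data point, and adding the intercept to the row sum gives the same number the regressor would predict for that point.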
The data points appear in order of increasing predicted outcome. They are not arranged according to the value of any feature. Instead, each feature is represented by its bar. The dots show the actual outcome for each data point. Notice that, taken together, the predicted outcomes do seem to hang in the middle of the scatter of actual outcomes. This is what we would expect for a regressor fit to this data.
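The ordering described above can be sketched in a few lines. This is one way to do it under my own assumptions: the helper name `order_by_prediction` and the sample numbers are hypothetical, and the actual outcomes are reordered with the same index so each dot stays over its own bar.

```python
import numpy as np

def order_by_prediction(contributions, intercept, actual):
    """Sort data points by predicted outcome, keeping actual outcomes aligned."""
    contributions = np.asarray(contributions, dtype=float)
    actual = np.asarray(actual, dtype=float)
    predictions = contributions.sum(axis=1) + intercept
    order = np.argsort(predictions)  # indices from lowest to highest prediction
    return contributions[order], predictions[order], actual[order]

# Hypothetical per-feature contributions and observed outcomes for three points.
contributions = np.array([[2.0, 1.0],
                          [6.0, 0.25],
                          [4.0, 2.0]])
actual = np.array([4.5, 6.8, 7.4])

sorted_contribs, sorted_preds, sorted_actual = order_by_prediction(
    contributions, 1.0, actual
)
```

Sorting by the prediction rather than by any single feature is what makes the bar tops rise steadily from left to right, so the scatter of dots around them is easy to read.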
The data is toy data. But the point of this visualization is that if the data had more dimensions, it would still work. Take, for example, this four-dimensional data:
The red is, again, the intercept term, this time at the top of the bars. You can see how each dimension contributes to the prediction.
In this case, we also see what can happen when a feature has an inverse correlation with the predicted outcome in some cases. Negative feature value × weight products appear below the horizontal axis. Think of the bars as buoys, and the horizontal axis as a water line: the part above the water line comprises all the features that drive the prediction higher, and the part below the water line comprises all the features that drag the prediction lower.
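Here is a rough sketch of the geometry behind the water line, assuming positive segments stack upward from zero and negative segments stack downward from zero. The function name `stacked_bar_bottoms` is hypothetical, and this is one possible layout rather than necessarily the one used for the graph.

```python
import numpy as np

def stacked_bar_bottoms(contributions):
    """Return the starting y value of every bar segment.

    Positive segments stack upward from 0; negative segments stack
    downward from 0, so nothing above the axis hides anything below it.
    """
    contributions = np.asarray(contributions, dtype=float)
    positive = np.clip(contributions, 0.0, None)  # zero out negative segments
    negative = np.clip(contributions, None, 0.0)  # zero out positive segments
    # A segment starts where the earlier same-sign segments in its row end.
    pos_bottoms = np.cumsum(positive, axis=1) - positive
    neg_bottoms = np.cumsum(negative, axis=1) - negative
    return np.where(contributions >= 0, pos_bottoms, neg_bottoms)

# Hypothetical contributions for one data point with four features.
row = np.array([[3.0, -2.0, 1.0, -1.0]])
bottoms = stacked_bar_bottoms(row)
```

These arrays can be handed column by column to a bar-chart call that accepts a height and a starting position, such as matplotlib’s `bar(x, height, bottom=...)`. The intercept can be appended as one more constant column. Note that with this layout the prediction is the net of the part above the water line and the part below it, not the top of the bar.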
The advantage of a visualization like this over something like principal component analysis is that it doesn’t abstract away the meaning of the data itself. Each bar still represents exactly one feature: you look at it and understand how that feature is affecting your prediction. By contrast, PCA reduces dimensionality by combining features into new components, each of which is a weighted mix of the original features. So we end up with a small enough number of components to fit into our mental model of orthogonal axes (maybe), but the components obscure the meaning of the data we have collected.
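To see why PCA components are harder to read, here is a small sketch that computes them with a plain singular value decomposition in NumPy. The random data and the variable names are made up for illustration, and the SVD route is just one common way to get the components.

```python
import numpy as np

# Hypothetical data: 200 points, 4 features, with feature 1 partly driven by feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]

# PCA via SVD: center each feature, then take the right singular vectors.
X_centered = X - X.mean(axis=0)
_, _, components = np.linalg.svd(X_centered, full_matrices=False)

# Each row of `components` is one principal component, expressed as a
# unit-length vector of loadings over the original four features.
first_component = components[0]

# Projecting onto the first two components gives a 2-D view of the data.
X_reduced = X_centered @ components[:2].T
```

Printing `first_component` shows a loading for every original feature, so an axis in the reduced plot no longer corresponds to any single thing that was measured.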
There’s value in techniques that make data visualizable while preserving the original structure of the inputs: they make it easier for the folks running the models to collaborate with folks who understand the data itself. These are frequently separate groups of people, and to get the most out of the data they need to work together. A domain expert can pick out insights that a generalist might miss. A business exec can see overall trends without drawing conclusions about one specific case.
That’s my optimistic vision, anyway. In the meantime, I think there’s more room for all of us, software engineers and data scientists alike, to keep thinking about how to visualize high-dimensional data without losing its meaning.