Two months ago I shared some ideas about visualizing multivariate linear regression in a way that allows SMEs to interpret their data and understand what the models are doing. I recommend reading that post before this one because it introduces the problem space and explains my idea for visualizing high-dimension data in a two-dimensional space while preserving the meaning of each of the portrayed features.
I have worked on the charts some more. I want to share some progress I have made, show you some examples of how the chart might be used, and tell you about some additional challenges I’m facing in hopes that you will have some suggestions about how to address them.
Example 1: Visualizing Fast Food Nutrition Data
I ran a linear regressor on a set of nutrition data, fit for the number of calories in each food item. Here is the visualization of the resulting model:
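For readers who want to reproduce the setup, here's a minimal sketch of the fitting step. The real dataset's columns differ; the stand-in data below just bakes in the macronutrient arithmetic so the example is self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in data: grams of fat, carbs, and protein for 114 items.
X = rng.uniform(0, 50, size=(114, 3))
# True calories: 9 cal/g fat + 4 cal/g carbs + 4 cal/g protein, plus noise.
y = X @ np.array([9.0, 4.0, 4.0]) + rng.normal(0, 10, size=114)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print(model.coef_)                   # learned weights, one per feature
print(model.score(X_test, y_test))   # R^2 on the held-out test set
```

With clean inputs like these, the learned weights land close to the true calories-per-gram values, which is exactly the behavior discussed below.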
I have rounded out the legend to show that the outcomes (test set y values) appear as red dots and the predictions (model’s estimated y values) appear as blue dots. The blue dots are a little smaller than the red dots so that you can still see the original outcome even if the prediction is 100% accurate. That way you don’t have to guess if the prediction is perfect or so far off that the original outcome doesn’t even appear on the chart 😂
The data points appear in order by increasing predicted value, rather than on a scale for any one input feature. This visualization represents eight input features (or eight dimensions), and it shows the contribution of each feature toward the outcome.
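The quantities behind the chart can be computed like this (a sketch with stand-in data; in the real chart the model and test matrix come from the fit above): each bar is made of per-feature contributions, weight × feature value, and the data points are ordered by increasing prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Stand-in model and test set with eight input features.
model = LinearRegression().fit(rng.uniform(0, 10, size=(50, 8)),
                               rng.uniform(0, 100, size=50))
X_test = rng.uniform(0, 10, size=(20, 8))

contributions = X_test * model.coef_   # shape: (n_points, n_features)
predictions = contributions.sum(axis=1) + model.intercept_

order = np.argsort(predictions)        # left-to-right chart order
contributions = contributions[order]
predictions = predictions[order]
```

Each row of `contributions` becomes one stacked bar, drawn left to right in `order`.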
The predictions are pretty darn accurate: the blue dots (predictions) all appear quite close to the red dots (outcomes). This model had an R² score of 0.995 (1 is perfect). By the way, if every weight × feature contribution is positive, the blue dot (prediction) will always appear right on top of each data point's bar: the bar visualizes how each feature value, multiplied by its weight, adds up to the model's prediction.
You’ll notice that the three most prominent features for predicting calories are protein, carbs, and total fat. There’s a reason for this. The actual caloric value of a food is calculated by summing up the grams of carbs x 4 cals/gram, the grams of protein x 4 cals/gram, and the grams of fat x 9 cals/gram. They literally add up. This is reflected in the model: the coefficient (weight) for total fat is 8.11, the weight for carbs is 3.64, and the weight for protein is 3.57. With a fairly small training set (114 examples), the model almost figured out how many calories to expect to add from each gram of macronutrient.
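As a quick sanity check of that arithmetic for one hypothetical item (the gram amounts below are made up, and the learned-weight comparison omits the model's intercept):

```python
# One hypothetical food item: 30 g carbs, 20 g protein, 15 g fat.
carbs, protein, fat = 30, 20, 15

# Actual caloric value from the standard conversion factors.
true_calories = 4 * carbs + 4 * protein + 9 * fat       # = 335

# The same item scored with the model's learned weights.
learned = 3.64 * carbs + 3.57 * protein + 8.11 * fat    # = 302.25
```

The learned weights undershoot a bit, consistent with the small training set, but they clearly recovered the structure of the formula.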
Example 2: Visualizing Automobile Data
I ran a linear regressor on a set of automobile data, fit for the miles per gallon. Here is the visualization of the resulting model:
Here you can see two things about my graphing method: the first is a visual triumph, and the second is a visual question mark that I still need to work out.
The triumph: When your test examples are evenly distributed among your prediction values, you can see the line that the linear regressor predicted (blue) with the actual outcomes scattered around it (red). Luckily, random sampling is on our side here: any dataset large enough to support a data-driven decision is also large enough that a randomly drawn test set is extremely likely to spread out this way. The fact that it's starting to coalesce even on this borderline-toy dataset (400 examples) is very encouraging.
The question mark: all the weights, multiplied by their features, should add up to the prediction value such that the prediction dot sits exactly on top of the bar…right?
Yes, if all the weight × feature contributions are positive. But in this case, some features were inversely correlated with mpg: the higher a vehicle's weight and the more cylinders it has, the lower its mpg. The chart reflects this by drawing the bar components for those features below the axis. The intercept (purple) also happens to be negative for this regression. Those negative values, summed with the positive values from features like acceleration and model year, produce the prediction, but you can still see the entirety of the positive values in the bar components above the prediction. If you were to take each bar component that falls below the axis and subtract its height from the top of the bar, you'd arrive at the prediction value.
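A minimal check of that claim, with made-up weights and feature values loosely shaped like the mpg problem (these numbers are illustrative, not the model's actual coefficients):

```python
import numpy as np

# Hypothetical weights for: model year, vehicle weight, cylinders, acceleration.
coef = np.array([0.7, -0.006, -0.5, 1.2])
intercept = -15.0
x = np.array([76.0, 3500.0, 8.0, 12.0])   # one made-up test vehicle

contrib = coef * x
pos = contrib[contrib > 0].sum()                 # bar components above the axis
neg = contrib[contrib < 0].sum() + intercept     # components below the axis

prediction = pos + neg   # positives minus the negative components' heights
```

However the components are split between above and below the axis, their signed sum is always exactly the model's prediction.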
I'm not sure of the best way to depict this so it makes sense. I have considered giving each data point two adjacent bars: the first starting at the axis and summing all the positive values, and the second starting at the top of the positive-values bar and extending downward by the total of the negative values, leaving empty space between its lowest point and the axis. For negative prediction values, the paradigm would flip top to bottom. Would this be clearer? I don't know. If you or any of your visual design friends have a better idea for how to do this, I'm actively soliciting your (or their) ideas.
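For what it's worth, the geometry of that two-bar idea has one nice property (sketched below with made-up sums, for a point with a positive prediction): the second bar's lowest point lands exactly on the prediction value.

```python
pos_sum = 67.6    # total of the positive weight x feature components
neg_sum = -40.0   # total of the negative components (incl. a negative intercept)

bar1_bottom, bar1_top = 0.0, pos_sum   # first bar: positives, axis upward
bar2_top = bar1_top                    # second bar hangs from the first's top...
bar2_bottom = bar2_top + neg_sum       # ...down by the negatives' magnitude

prediction = pos_sum + neg_sum
```

So the gap between the second bar's bottom and the axis isn't wasted space; its lower edge marks the prediction.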
Example 3: Home Price Data
I ran a linear regressor on a set of home data, fit for the most recent sale price. Here is the visualization of the resulting model:
We don't see that perfect prediction line coalesce like we did in the last example. This comes down to dataset size, as discussed above: randomly choosing a test set that matches the trend of the whole dataset gets much harder as the dataset shrinks, and this one (200 examples) is too small to expect it.
Look at all those features! Home price is all about location. The dataset has a column called ‘location’ that lists the neighborhood of each of the homes. I one-hot encoded that column to include it in the model, so every neighborhood has its own feature. Luckily, each house is in only one neighborhood, so each data point will only have a nonzero bar component for one of those features. And indeed the outcome is instructive: evidently being located in King City drives home prices down (test examples 10 and 12), while being located in San Luis Obispo drives home prices up a little (test examples 13 and 14) and being located in Santa Ynez drives home prices up a lot (test examples 17 and 19). A quick Google search reveals that Santa Ynez is known for its world-class wineries. That explains it, then!
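The encoding step looks something like this (a sketch; the column names and values here are hypothetical stand-ins for the real dataset's):

```python
import pandas as pd

homes = pd.DataFrame({
    "sqft": [1200, 2400, 1800],
    "location": ["King City", "Santa Ynez", "San Luis Obispo"],
})

# Each neighborhood becomes its own 0/1 feature; each house has exactly
# one nonzero value among the location columns.
encoded = pd.get_dummies(homes, columns=["location"])
```

After this, `encoded` feeds straight into the regressor, and each `location_*` column picks up its own weight, which is why each neighborhood gets its own bar component in the chart.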
Some weird stuff happens when I graph this model, though. When I run the same model on the same data several times, more than half of the time I get a weird result:
I don't know why these glitches happen, so if you have any insight, I would love to hear it.
Additional Limitations Worth Discussing
These visualizations intend to represent high-dimension data in a way that preserves the meaning of each feature. That said, they still can't represent millions of dimensions very legibly. Imagine if each of those bars contained millions of different colored components; anyone analyzing that chart would have a hard time extracting takeaways. Then again, I know of no visualization schema that can legibly depict all of that without resorting to dimensionality reduction (if you know of one, please clue me in). In the meantime, I'd estimate that this type of visualization provides utility up to maybe two dozen features. Luckily, data with millions of features usually doesn't start as a spreadsheet but rather as text or high-fidelity sensor data (audio or visual). Usually when we're running models on this data, they're not regression models but rather some kind of classifier or replicator. That doesn't mean we're not going to run into the visualization problem with them; we are. We're just not running into it yet. In the next post on this topic, where I adapt the visualizer to depict a classification model, we'll poke at this challenge a little more.
Fast food nutrition data: Nutrition Data for Fast Food 2017. https://www.statcrunch.com/app/index.php?dataid=2323899

Car mpg data: Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml/datasets/Auto+MPG

Home price data: California Home Prices, 2009. https://www.statcrunch.com/app/index.php?dataid=2188686