Almost four and a half years ago, I wrote about takeaways from the first 14 chapters of Andrew Ng’s book Machine Learning Yearning.
The book came out chapter by chapter. Back in 2017, I had caught up to the book’s release schedule, so I set it aside to accumulate more releases. By now the thing has long been finished, and I’ve finally come back to it.
My notes from the back half of this book span about two pages. Here’s page one:

Here’s page two:

I know, I know. My handwriting. Moving on.
Let’s refresh on who the book is for and what it’s about:
The book features terse chapters that succinctly explain how to think about the high-level decisions associated with a machine learning product.
That is, the book does not delve into which specific algorithms to use for what. Rather, it focuses on a set of practices that have aided the success of machine learning projects of all stripes.
– Me, four and a half years ago
Having worked as a data scientist, then a machine learning engineer, then a software engineer whose clients were data scientists and machine learning engineers, I can confirm that this book covers the majority of an individual contributor’s decision making in this field. Or rather, it should: it’s bad if a team gets so focused on the details of their shiny new BEAM instance that they train on test data and fail to notice. (This is called data leakage, it’s bad, and I’ve seen it with my eyeballs. More here.)
That said, MLY is a high-level survey book. On top of that, like most books in the tech field, it devotes the last few chapters to bits and bobs that didn’t fit well into the remainder of the book’s didactic framework. The symptom of this is usually several unrelated topics, at vastly different levels of abstraction, all presented together. For example, we get a chapter on the relatively low-level task of debugging inference algorithms with the optimization verification test, followed by a chapter about the unrelated and very high-level topic of whether to build an end-to-end deep learning model or a pipeline of components. I don’t fault the book for these chapters any more than I can fault a kitchen for having a junk drawer or an application for having a utils directory. I, too, have a list of random blog topics entitled “This Goes Somewhere.” But I don’t want to focus on the “bits and bobs” part of the book here.
I do want to highlight Andrew’s framework for error analysis.
When data professionals talk about error analysis, we’re usually comparing metrics like accuracy, precision, recall, and F1 score. Andrew instead pops up a level to stratify “types of wrongness” according to where the wrongness is coming from, which then informs what to do about it.
Suppose we have our data split into two sections: the training set for training the model, and the development set for evaluating it. We’re dealing with four types of inaccuracy:
TYPE 1: Optimal error rate (sometimes called the Bayes Error Rate; half the terms in statistics are called Bayes this or Gini that): This is the error rate inherent in the labels of the labeled data that a model trains on. For example, if humans classified the training data, this is the error rate of the human classifications. It probably isn’t zero.
TYPE 2: Avoidable Bias: The difference between the model’s error when predicting on the training set and the optimal error rate. This happens when the model did not capture all the relevant information for predicting the target variable. We approach this problem by vertically scaling the model (bigger/more layers) or reducing regularization (though that last one puts us at risk of overfitting).
Adding examples to the existing training set does not help here, because the wrongness is coming from the model failing to find patterns that predict the target variable. This becomes clearer in some diagrams we’ll see later: training error only rises as the number of examples increases; it’s dev error that decreases. If the training set has a problem at this stage, it’s that the features don’t capture all the information that the “optimal error rate” labeler (usually a human) used to label the data. More examples like the existing ones won’t help, but replacing them with examples of a different format that better capture the relevant information might. In other words, if this is happening, a wider training data table (additional features) might help, but a taller one (more examples) will not.
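To make that learning-curve point concrete before moving on, here’s a minimal sketch of my own (not from the book): it retrains a simple classifier on progressively larger slices of a synthetic training set. The dataset, model, and slice sizes are arbitrary stand-ins I picked for illustration, but you should generally see training error creep up and dev error come down as the slices grow.

```python
# Minimal sketch with made-up data: watch training error rise and dev error fall
# as the training slice grows. The dataset and model are arbitrary stand-ins.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

for n in (100, 500, 1000, 2000, len(X_train)):
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    train_error = 1 - model.score(X_train[:n], y_train[:n])  # error on data it saw
    dev_error = 1 - model.score(X_dev, y_dev)                # error on held-out data
    print(f"n={n:>4}  training error={train_error:.3f}  dev error={dev_error:.3f}")
```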
TYPE 3: Variance: The difference between the model’s error when predicting on the development set and the model’s error when predicting on the training set. This assumes that the training and the development sets came from the same distribution (i.e., we did not take a bunch of data that was meaningfully different from what we had in the corpus that we split for training and test and then shove all that in development). We address this one by adding training data or, to be frank, adding regularization by various means—this could be early stopping, could be decreasing the number of input features, could be adding a regularization parameter.
It’s worth noting that we might address either avoidable bias or variance by re-architecting the model or by analyzing the errors and engineering features to address them. Feature engineering is one means by which we might make the training data table wider, as mentioned for avoidable bias. That said, we also listed decreasing the number of input features, making the data table narrower, as an option to ameliorate variance. We could do both at once by essentially attempting to replace the noisy features on which the model overfits with more instructive, engineered features that better capture the patterns we’re looking for.
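As a quick illustration of the variance diagnostic (again my own sketch on made-up data, not a recipe from the book): the gap between training error and dev error should generally shrink as we turn up the regularization strength. In scikit-learn’s LogisticRegression, a smaller C means a stronger L2 penalty.

```python
# Minimal sketch, arbitrary synthetic data: the train/dev gap (variance) should
# generally shrink as regularization gets stronger (smaller C in scikit-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=100, n_informative=10,
                           random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.5,
                                                  random_state=0)

for C in (100.0, 1.0, 0.01):  # smaller C = stronger L2 penalty
    model = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
    train_error = 1 - model.score(X_train, y_train)
    dev_error = 1 - model.score(X_dev, y_dev)
    print(f"C={C:>6}  train error={train_error:.3f}  dev error={dev_error:.3f}  "
          f"gap={dev_error - train_error:.3f}")
```

Of course, the strongest setting may close the gap by underfitting, which just trades variance for avoidable bias.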
TYPE 4: Data mismatch: This is what we get from a development set with a different distribution than the training set, and it’s the difference between the model’s error on that different-distribution development set and its error on a holdout drawn from the same distribution as the training set. The fix sounds old-fashioned, but it’s to try to understand what properties of the data differ between the training distribution and the new, different development set, and then, if those properties matter for the use case, find more data like that to include in the training set.
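Putting the four types together, the stratification is mostly subtraction over a handful of measured error rates. Here’s a tiny sketch with made-up numbers; the variable names (and the same-distribution holdout I’m calling training_dev_error) are my framing for illustration, not necessarily the book’s exact terminology.

```python
# Made-up error rates; in practice each comes from evaluating on the relevant split.
optimal_error = 0.02       # estimated best achievable error (e.g., human labelers)
training_error = 0.08      # model's error on the training set
training_dev_error = 0.10  # error on a holdout drawn from the training distribution
dev_error = 0.17           # error on the dev set, possibly a different distribution

avoidable_bias = training_error - optimal_error    # bigger model, better features
variance = training_dev_error - training_error     # more data, more regularization
data_mismatch = dev_error - training_dev_error     # add training data that looks like dev

print(f"avoidable bias: {avoidable_bias:.2f}")
print(f"variance:       {variance:.2f}")
print(f"data mismatch:  {data_mismatch:.2f}")
```

Whichever of those three terms dominates tells you which of the levers above is worth pulling first.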
I found the visual representations of optimal error (here called desired performance), avoidable bias (here called training error), and variance (here called dev error) instructive for identifying and responding appropriately to different error cases. Here’s one example:

Here’s another one:

I can refer to diagrammatic representations like this while talking with stakeholders about machine learning models, which makes a huge difference. I go more into the weeds on error analysis for an individual case in this piece over here, if you’re interested (it’s part of a six-post case study).
Once upon a time I hoped to put together a set of one-page pre-flight checklists for data science project stages, including error analysis at various levels. Shouldn’t a professional know the steps, Chelsea? Well, I’ve seen people with very fancy degrees and very highfalutin’ credentials miss very basic steps. In fact, I’ve been one of these people. There are just a lot of steps. A checklist isn’t there to supplant data science knowledge—I’d expect the professional to know, for example, what each step entails without elaboration—but just having the steps helps. Frankly, if it’s good enough for airlines and aerospace engineers, it’s definitely good enough for me.
Lately, though, I’ve also gravitated toward tools that build guardrails into the error messages (see this piece on documentation for a couple of software examples). For data-specific examples, I’m a fan of this data debugging output from spaCy:

…or this one from Roboflow:

We’ll see what I end up doing. In any case, my hope is to keep Andrew’s stratification top of mind for talking about the process of refining models.
If you liked this piece, you might also like:
My latest soapbox about documentation (I probably don’t say what you’re expecting!)
This piece on critique, which is the closest thing to “what art school taught me about programming” that you’ll ever get from me
The issue with “Data-Driven Innovation” (I realize I’m recommending this to a crowd whose jobs might depend on their boss buying into data-driven innovation, so wear your sword as long as you can I guess)