Andrew Ng knows a thing or two about machine learning. He cofounded Coursera and instructed the seminal machine learning Coursera class, then served as the VP and Chief Scientist for Baidu before recently launching a new project called Deeplearning.ai. Ng has made an enormous impact on modern machine learning and machine learning education.
Now he is writing a book, each chapter of which you can get for free as he finishes writing it! Machine Learning Yearning aims at an engineer or a technical product manager for a team building a machine learning solution. The book features terse chapters that succinctly explain how to think about the high-level decisions associated with a machine learning product.
That is, the book does not delve into which specific algorithms to use for which problems. Rather, it focuses on a set of practices that have aided the success of machine learning projects of all stripes. Ng discusses a few overarching themes that I want to note for our future reference.
1. Development and Test Data Selection
First, some definitions.
Training set: data used to train your model.
Dev set: data used to make decisions about changes to the model.
Test set: data used to evaluate the model—but not to change it, lest we begin to teach our model to the test set.
Ng notes the importance of choosing development and test data that reflect the data on which you want your model to perform well. For example, if you want your product to distinguish photos of cats taken with mobile phones, the development set should not comprise solely photos of cats from the web. A mismatch here makes it hard to tell why a model performs well on the dev set but poorly in the real world. If the dev set does accurately reflect real-world data, the diagnosis is simpler: we have probably overfit our model to the dev set. In that case we need more data, or fresh data, in our dev set, so it's time to change it out.
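Here is a minimal sketch of that idea in Python. The function name build_splits, the file names and the set sizes are all hypothetical, and sending the leftover mobile photos plus the web photos to the training set is my own assumption rather than something stated above. The only point it illustrates is that the dev and test sets are drawn exclusively from data that looks like what the product will see.

```python
import random


def build_splits(target_examples, other_examples, dev_size, test_size, seed=0):
    """Draw dev and test sets from target-distribution data only.

    target_examples: data resembling what the product will see (e.g. mobile photos).
    other_examples: extra data from another source (e.g. web photos).
    """
    if dev_size + test_size > len(target_examples):
        raise ValueError("not enough target examples for dev and test sets")

    shuffled = list(target_examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    # Assumption: leftover target data joins the other data for training.
    train = shuffled[dev_size + test_size:] + list(other_examples)
    return train, dev, test


# Hypothetical file names, purely for illustration.
mobile = [f"mobile_{i}.jpg" for i in range(300)]
web = [f"web_{i}.jpg" for i in range(1000)]

train, dev, test = build_splits(mobile, web, dev_size=100, test_size=100)
print(len(train), len(dev), len(test))  # 1100 100 100
```

Because the dev and test sets never contain web photos, a model that scores well on them has at least been judged on the kind of data it will meet in production.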
Ng recommends taking no longer than a week to scrape together a dev and test dataset for a brand new product. We want to get going on building something, and we can iterate on what we have moving forward to get something better. The one-week guideline does not hold for mature models, like email spam filters. Dev sets and test sets for these types of applications require more data and more care to choose a representative set, so they take more time to build.
2. Measuring Models—Single Number Metric. Optimizing and Satisficing Metrics
Ng recommends choosing a single numerical metric on which to base evaluation of how ‘good’ a model is. If the team uses multiple metrics to judge their models, then how do we compare a model that performs better than another model on one metric but worse on a different metric?
Often there are multiple criteria for evaluating models: accuracy, precision (if false positives are expensive), recall (if false negatives are expensive), speed (for models that need frequent updates) and size (for models that need to fit inside mobile apps). How do we balance all of these?
Ng recommends two approaches. The first applies to multiple numerical criteria of similar scale, like precision and recall: combine them into one number. One option is to take an average. A team could even weight the average according to the importance of each metric: for example, weight precision more heavily if false positives have big consequences (say, for illegal drug tests), or weight recall more heavily if false negatives have big consequences (say, in diagnosis of virulent or infectious diseases).
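To make the weighting concrete, here is a small sketch in Python. The function weighted_score, the precision_weight parameter and the two models' numbers are made up for illustration. It shows how the choice of weight can change which model wins.

```python
def weighted_score(precision, recall, precision_weight=0.5):
    """Combine precision and recall into one number with a weighted average.

    precision_weight=0.5 is a plain average; values above 0.5 favor precision,
    values below 0.5 favor recall.
    """
    if not 0.0 <= precision_weight <= 1.0:
        raise ValueError("precision_weight must be between 0 and 1")
    return precision_weight * precision + (1.0 - precision_weight) * recall


# Hypothetical models: A has better recall, B has better precision.
model_a = {"precision": 0.95, "recall": 0.90}
model_b = {"precision": 0.98, "recall": 0.85}

for weight in (0.5, 0.8):
    score_a = weighted_score(precision_weight=weight, **model_a)
    score_b = weighted_score(precision_weight=weight, **model_b)
    winner = "A" if score_a > score_b else "B"
    print(f"precision_weight={weight}: A={score_a:.3f} B={score_b:.3f} -> {winner}")
```

With a plain average, model A comes out ahead. Once precision counts for 80% of the score, model B does. Either way the team ends up comparing one number per model.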
That works for different criteria around accuracy. But what about, say, speed or size? Ng recommends treating these as satisficing criteria: decide what speed or size is good enough, and require only that a model meet that bar. For example, maybe any model that runs in under 100 ms is fast enough, or any program size under 2 MB is small enough for the team's purposes. All models that meet these criteria, whether they run in 2 ms or 95 ms, or take up 0.5 MB or 1.9 MB, then get judged together on a single numeric metric, the optimizing metric: accuracy, or precision, or recall, or some combination of a few metrics.
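In code, that selection rule might look like the sketch below. The Candidate class, the pick_best function and the four models are hypothetical. The 100 ms and 2 MB thresholds are the examples from the paragraph above, and accuracy stands in for whatever single metric the team optimizes.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float    # optimizing metric: higher is better
    latency_ms: float  # satisficing metric: must be under a threshold
    size_mb: float     # satisficing metric: must be under a threshold


def pick_best(candidates, max_latency_ms=100.0, max_size_mb=2.0):
    """Return the most accurate candidate that meets every satisficing criterion."""
    eligible = [
        c for c in candidates
        if c.latency_ms < max_latency_ms and c.size_mb < max_size_mb
    ]
    if not eligible:
        return None  # no model is good enough on speed and size
    return max(eligible, key=lambda c: c.accuracy)


# Made-up candidates for illustration.
candidates = [
    Candidate("A", accuracy=0.90, latency_ms=2, size_mb=0.5),
    Candidate("B", accuracy=0.92, latency_ms=95, size_mb=1.9),
    Candidate("C", accuracy=0.97, latency_ms=150, size_mb=1.0),  # too slow
    Candidate("D", accuracy=0.95, latency_ms=50, size_mb=3.0),   # too large
]

best = pick_best(candidates)
print(best.name)  # B
```

Note that model A gets no credit for being far faster and smaller than model B. Once both clear the bar, only accuracy matters.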
3. Error Analysis
Ng recommends error analysis for making decisions about how to prioritize our approaches to training a model. When we first begin to train a model, we can make quick gains by ramming the data through several different types of models with different tuning parameters. But once we have a general idea of which models are likely to work, we need a more sophisticated strategy to improve our accuracy.
To get some ideas, Ng recommends taking a sample of the data that the model got wrong, and sorting the wrongness into some buckets to identify potential approaches for improvement. For example, in 100 misclassifications of cats in the data, maybe 5 misclassifications are actually of dogs, 25 are blurry and another 30 have Instagram filters. Now we have some ideas about why the model is misclassifying these images.
We also have a rough priority order for tackling these misclassifications because we know the most we can reduce our error by eliminating any one of them. If we have a good way to remove Instagram filters, for example, we stand to eliminate up to 30% of our errors. Comparatively, we cannot gain so much from fixing the dog misclassifications because they account for only 5% of the errors to begin with. In addition to the incidence rate of a misclassification, the team must take into account how confident they are that they can reduce that misclassification and how much effort it will take to do so.
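The tallying step is easy to sketch in Python. The function error_buckets and the tag names are hypothetical, and I am assuming someone has already looked at each misclassified example and written down its tags by hand. The counts reproduce the 100-error example above, with the remaining 40 errors left untagged.

```python
from collections import Counter


def error_buckets(tags_per_error):
    """Count how many sampled errors fall in each bucket.

    tags_per_error: one list of tags per misclassified example. An example may
    carry several tags or none. Returns (tag, count, share_of_errors) tuples,
    largest bucket first.
    """
    total = len(tags_per_error)
    if total == 0:
        return []
    counts = Counter(tag for tags in tags_per_error for tag in set(tags))
    return [(tag, n, n / total) for tag, n in counts.most_common()]


# The example from the text: 100 sampled errors, 60 of them explained.
sampled_errors = (
    [["dog"]] * 5
    + [["blurry"]] * 25
    + [["instagram_filter"]] * 30
    + [[]] * 40  # errors with no obvious cause
)

for tag, count, share in error_buckets(sampled_errors):
    print(f"{tag}: {count} errors, at most {share:.0%} of errors fixable")
```

Each share is a ceiling on the improvement, since a fix for that bucket rarely repairs every example in it. The ordering is a starting point that still has to be weighed against confidence and effort.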
This is exactly the kind of book that I would keep on my desk at work. I could see myself consulting it while making decisions. More than that, though, I see it as an excellent resource for quickly getting team members onto the same page about team practices. I'm excited to find out what the remainder of this book will recommend.
If you’re new to machine learning, or you’re a product manager looking to understand machine learning for your product, this post about how machine learning works is for you.