Notes and Illustrations: Grokking Machine Learning

I used to do a single page of illustrations and annotations for books I was reading, then publish them on this blog as “One Page Notes.” I can’t do that for Luis Serrano’s book Grokking Machine Learning because I ended up drawing and annotating for four (arguably five) pages.

In this book, Serrano attempts to translate foundational concepts of machine learning for a lay audience. This is tough to do—and also, by the way, somewhat stigmatized:

You can check out my piece on approaching academic papers to learn more about why and how academia incentivizes overcomplicating concepts in papers, but I’m skipping it here because it’s not the highest-risk failure mode for work like Serrano’s. Overcomplication tends to happen with papers aimed specifically at an audience of colleagues and practitioners.

Instead, when folks write about machine learning for a lay audience, the result is usually reductive.

I get wary anytime a piece says “ANYONE can do machine learning!”

It’s not that I don’t think people can do machine learning. It’s that titles and taglines like these correlate with pieces that hand-wave all the math with phrases like “this doesn’t matter, the concept is what matters” and then, after they’ve done that, try to convince the reader that they’re ready to implement machine learning models.

The truth is, if you have absolutely no understanding of how the model works, you are not, in fact, ready to deploy one. To talk about one, sure. To be a product manager for a team with machine learning engineers on it, okay [1]. To actually build and deploy machine learning models? Absolutely not. Machine learning models are notorious for unexpected and harmful outcomes. There’s a desperate need for rigor in the field, and while improving accessibility to the field can help with that, lying to people about how much knowledge they need to responsibly use ML hurts it.

I think Grokking Machine Learning approaches the subject at the appropriate complexity for programmers with a light (or rusty) math background, who want to understand how machine learning models work, and who are starting a journey toward implementing one. Serrano starts with illustrated, straightforward high-level explanations, but he also introduces the math and expects readers to invest time in understanding it. Each chapter walks through relevant toy problems, replacing Greek symbols with numeric values so readers can trace examples without an existing command of mathematical notation. The chapter conclusions include exercises (with solutions, not just “this is left as an exercise to the reader”).

Before someone @s me about it: yeah, the marketing for the book does have some reductiveness red flags. Take, for example, this pull quote from the publisher’s landing page:

So machine learning is, in fact, complicated, and does take a lot of work to master, and we’re not doing people a service by suggesting that “It’s not!” Also, for the sake of context (and I happen to believe the publisher should have disclosed this), the person who wrote this pull quote is the founder of a for-profit, not-for-credit paid online course platform that Serrano worked for at the time of the book’s publication. This marketing is not exactly oozing legitimacy. Take it from me: I read the actual book, and it’s better than the marketing would suggest.

You can download a PDF of all my notes for the book:

Below you’ll also find jpg screenshots from the PDF if you’d prefer to look at the notes here in the browser without downloading a file.

This first screenshot illustrates a high-level flowchart for choosing, using, and evaluating a model.

Believe it or not, you’ve probably seen a version of this illustration before. It’s just that the types of models, the error functions they use, and the comparisons of their evaluation metrics usually get illustrated in the aggregate as a table. I do not like this. The reason I do not like this is that my brain does not make decisions in a tabular way. It makes decisions in a decision tree way. To me, tables are for looking up information, and flowcharts are for making decisions. Organic diagrams like the above also make it easier to track the relationships between concepts: for example, the Venn diagram in the upper right demonstrates that a perceptron algorithm and a logistic classifier handle correct classifications differently during training, but they handle incorrect classifications similarly. The box below that compares the sigmoid and softmax functions that we use for translating a model’s numeric prediction into a probability of a given datum’s class. A table makes sigmoid and softmax look like totally disparate operations. In fact, they’re the same operation performed for binary or multiclass classification, respectively.
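To make the sigmoid/softmax relationship concrete, here’s a minimal Python sketch. It isn’t from the book, and the function names and the example score are made up for illustration. It shows that softmax over two scores, with the second score pinned at zero, gives the same probability as sigmoid applied to the first score.

```python
import math


def sigmoid(score):
    """Squash one score into a probability for the positive class."""
    return 1.0 / (1.0 + math.exp(-score))


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    top = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


score = 1.5  # hypothetical raw model output
binary = sigmoid(score)
two_class = softmax([score, 0.0])  # second class pinned at a score of 0

print(binary, two_class[0])  # the two values agree up to rounding
print(softmax([2.0, 1.0, 0.1]))  # three classes, probabilities sum to 1
```

With more than two classes there is no single score to feed to sigmoid, which is where softmax takes over.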

After talking through the modeling process for seven or so chapters, the book gets into different specific types of model. It starts with the Naive Bayes Classifier:

The box on the right follows a specific numeric example from the book about how to identify email spam—though I think I changed the probabilities from the book to make the mental math easier for this illustration. This classifier helps us figure out “if we know the probability that a spam email or a not-spam email contains each of these words, what is the probability that a new email containing some combination of those words is spam?”
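Here’s a small sketch of that calculation in Python. Every number in it is invented for illustration (these are not the book’s probabilities, and not the ones in the notes), and the function name is hypothetical. The “naive” part is the assumption that words appear independently of one another within each class, which is what lets us simply multiply the per-word probabilities.

```python
def spam_probability(words, prior_spam, word_given_spam, word_given_ham):
    """Probability that an email containing `words` is spam (naive Bayes)."""
    spam_score = prior_spam
    ham_score = 1.0 - prior_spam
    for word in words:
        # Naive assumption: each word is independent given the class.
        spam_score *= word_given_spam[word]
        ham_score *= word_given_ham[word]
    # Normalize so the spam and not-spam outcomes sum to 1.
    return spam_score / (spam_score + ham_score)


# Made-up numbers: 20% of all email is spam.
prior_spam = 0.2
word_given_spam = {"lottery": 0.5, "sale": 0.4}  # P(word | spam)
word_given_ham = {"lottery": 0.05, "sale": 0.1}  # P(word | not spam)

p = spam_probability(["lottery", "sale"], prior_spam,
                     word_given_spam, word_given_ham)
print(round(p, 3))  # 0.04 / (0.04 + 0.004), roughly 0.909
```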

After this classifier, we go to decision trees.

I got to the end of the page here, hence the split illustration. The illustration starts with a comparison of the Gini Impurity Index and Entropy, both of which numerically represent “how diverse is this collection of items?”
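For a feel of how the two measures behave, here’s a minimal sketch using their standard definitions (the function names and toy collections are assumptions for illustration). Both return 0 for a collection with only one kind of item and grow as the mix gets more even.

```python
import math
from collections import Counter


def gini_impurity(items):
    """Chance that two items drawn at random (with replacement) differ."""
    total = len(items)
    return 1.0 - sum((n / total) ** 2 for n in Counter(items).values())


def entropy(items):
    """Average number of bits needed to name the class of a random item."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())


pure = ["spam"] * 4
mixed = ["spam", "spam", "ham", "ham"]

print(gini_impurity(mixed), entropy(mixed))  # 0.5 and 1.0 for an even split
print(gini_impurity(["spam", "spam", "spam", "ham"]))  # 0.375
```

A decision tree picks the split whose child groups score lowest on whichever measure you chose.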

The choice of error function, of numeric representation, and even of model type is often a judgment call. And sometimes, the judgment call is based on things like “this function gives us a nice, easily differentiable result and the other one is more annoying to work with” rather than “this is the better, more accurate one.”

Other times, the choices in machine learning are almost purely empirical. When it comes to which model to use, the standard approach is to take several different types of model that make sense, throw them all at the dataset, and see which one does best based on precision, recall, F1 score, or accuracy.
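As a rough sketch of what “see which one does best” looks like, here are those four scores computed by hand in Python. The labels and both sets of predictions are invented, and `classification_metrics` is a hypothetical helper, not a library function.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute precision, recall, F1, and accuracy for one set of predictions."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = len(pairs) - tp - fp - fn
    # Fall back to 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


actual = [1, 1, 1, 0, 0, 0, 0, 0]
model_a = [1, 1, 0, 1, 0, 0, 0, 0]  # hypothetical model that tries
model_b = [0, 0, 0, 0, 0, 0, 0, 0]  # hypothetical model that always says 0

scores_a = classification_metrics(actual, model_a)
scores_b = classification_metrics(actual, model_b)
print(scores_a)
print(scores_b)
```

The always-says-0 model still gets 62.5% accuracy here while its recall is zero, which is one reason to look at more than one metric before picking a winner.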

From Decision Trees, we move on to neural networks.

I realized this later—that says “nodes.” It doesn’t say “nudes.” Nudes wouldn’t even make sense here.

Earlier parts of the book shed light on some of the reasons to choose a particular optimization function for a model. With neural networks here, the book discusses the reasons for using different activation functions at different steps in a neural network’s training process. The book focuses specifically on neural networks for classification, but it touches on the adjustments one would need to use them on a regression problem.

From there, the book moves on to support vector machines.

The unique thing about a support vector machine is that, by default, even before regularization, it balances two terms in the optimization function instead of just one: the error, and the distance between the two parallel margin lines that pass through the support vectors (the classification boundary sits midway between them).
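Here’s a rough sketch of that two-term error. It assumes the common hinge-loss formulation with labels of -1 and +1, which may not match the book’s notation; the function names, the parameter `c`, and the toy points are all made up for illustration. The distance term is the squared length of the weight vector: shrinking it widens the gap between the two margin lines, which sit 2 / ||w|| apart.

```python
import math


def svm_error(weights, bias, points, labels, c=1.0):
    """Two-term SVM error: c * classification error + distance error."""
    classification_error = 0.0
    for point, label in zip(points, labels):
        score = sum(w * x for w, x in zip(weights, point)) + bias
        # Hinge loss: zero only when the point is on the correct side
        # of its margin line.
        classification_error += max(0.0, 1.0 - label * score)
    # Smaller weights mean a wider margin, so large weights are penalized.
    distance_error = sum(w * w for w in weights)
    return c * classification_error + distance_error


def margin_width(weights):
    """Distance between the two margin lines."""
    return 2.0 / math.sqrt(sum(w * w for w in weights))


points = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [1, -1, 1]  # the last point sits inside the margin
weights, bias = [1.0, 0.0], 0.0

print(svm_error(weights, bias, points, labels))  # 0.5 (hinge) + 1.0 (distance)
print(margin_width(weights))  # 2.0
```

Raising `c` tells the model to care more about classifying points correctly; lowering it favors a wider margin.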

The Support Vector Machine section is where I did some of my finest illustration work to create visualizations for “The Kernel Trick”: the technique of adding columns to a dataset that is not linearly separable to increase its dimensionality such that it becomes linearly separable. Observe and congratulate me:
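For readers who prefer code to drawings, here’s a tiny sketch of the same idea on an invented one-dimensional dataset (the helper names are hypothetical). No single cutoff on x splits the two classes, but after adding an x squared column, a single cutoff on the new column does.

```python
def add_squared_column(xs):
    """Lift 1-D data into 2-D by adding an x**2 column."""
    return [(x, x * x) for x in xs]


def threshold_separates(values, labels):
    """True if a single cutoff on `values` puts each class on its own side."""
    zeros = [v for v, y in zip(values, labels) if y == 0]
    ones = [v for v, y in zip(values, labels) if y == 1]
    return max(zeros) < min(ones) or max(ones) < min(zeros)


xs = [-3, -2, -1, 0, 1, 2, 3]
labels = [1, 1, 0, 0, 0, 1, 1]  # class 0 is sandwiched in the middle

expanded = add_squared_column(xs)
squares = [row[1] for row in expanded]

print(threshold_separates(xs, labels))       # False
print(threshold_separates(squares, labels))  # True
```

This sketch adds the column explicitly. Kernel SVM implementations typically get the same effect through a kernel function, without ever building the extra columns.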

That wraps up the individual types of model covered in the book. From there, it moves on to groups of models. We call the coordinated use of several models ensembling. Yes, like the term for the group that does a coordinated song or dance performance in a musical. I tried to give the notes structure, but I think it got a little lost among the drawings. The idea: two general strategies for ensemble learning are bagging (grouping several models trained on overlapping subsets of the data) and boosting (running the model on the training set, and then focusing later iterations on the points the previous iterations got wrong by various means). The book covers three (well, really two) means of doing that: AdaBoost, Gradient Boosting, and then XGBoost, which is Gradient Boosting with a pruning step.
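Here’s a minimal sketch of the bagging half only. It assumes the usual setup where each model trains on a random sample drawn with replacement and the models vote; the one-threshold “stump” learner, the function names, and the toy data are invented for illustration and are not the book’s code.

```python
import random
from collections import Counter


def train_stump(points):
    """Weak learner: choose the cutoff on x that misclassifies the fewest points."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x >= threshold) != bool(label) for x, label in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagging(points, n_models, seed=0):
    """Train each stump on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    return [train_stump(rng.choices(points, k=len(points)))
            for _ in range(n_models)]


def predict(thresholds, x):
    """Majority vote across all the stumps."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


points = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)]  # (x, label)
models = bagging(points, n_models=11)

print(train_stump(points))  # 4: the cutoff a single stump finds on all the data
print(predict(models, 6))   # 1
```

Boosting differs in that the models are trained one after another, with each round paying more attention to the points earlier rounds got wrong.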

This last part I editorialized on a little. The idea is to provide a series of steps for a machine learning engineer to take, from when they get the data to when they have a model they’re prepared to ship. In my humble opinion, this part of the book is too high-level to be especially useful, but it’s a starting point for making one such list. For example, in “splitting the data” here, I added “balance the classes,” which the book didn’t mention, and which I have personally been dinged in an applied ML interview for not mentioning.

There are certainly more items to add here. My intention, at some point, is to draft a checklist I trust [2], so look out for that in the future. In the meantime, please don’t use this checklist as a project skeleton. You can use the diagram of k-fold cross-validation, if it’s helpful.
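If a code version of k-fold cross-validation is more helpful than a diagram, here’s a minimal sketch (the function name is hypothetical). It splits row indices into k chunks and lets each chunk take one turn as the validation set while the rest is used for training. It doesn’t shuffle, so it assumes the rows are already in random order.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) once per fold."""
    # Spread any remainder over the first few folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validate:", validation)
```

You would train and score the model once per fold, then average the scores.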


  1. Like I mentioned, this book is best suited (in my view) for programmers with a light (or rusty) math background, who want to understand how machine learning models work, and who are starting a journey toward implementing one. If you’re preparing to be a product manager for a team with machine learning engineers on it, I’d be more likely to recommend you this book than Grokking Machine Learning.
  2. The book I’m using for supplementary material on that project (not this book, a different book) is pretty annoying and spends a lot of time reminding readers that the author used to work at Google, so it’s taking me a while to get through.

If you liked this piece, you might also like:

My latest soapbox about documentation (I probably don’t say what you’re expecting!)

This piece on critique, which is the closest thing to “what art school taught me about programming” that you’ll ever get from me

The issue with “Data-Driven Innovation” (I realize I’m recommending this to a crowd whose jobs might depend on their boss buying into data-driven innovation, so wear your sword as long as you can I guess)
