In preparation for an upcoming role, I recently re-read Natural Language Processing with PyTorch, which I skimmed a couple of years ago but never got around to writing about.
I am not going to evaluate this book. I can’t: I’m not the target audience. What I will do:
- Share who I think this book is for, and why
- Explain my recommendations for how to read this book
- Offer my favorite parts of this book
Who this book is for:
This book is designed for a math major who needs to use a specific variety of DL NLP for a specific project. This is why I’m not the target audience: I am a software engineer and a data scientist who has practical experience with machine learning models and a self-taught math background.
When I read mathematical notation, it takes me an interim step. I can read formulae, but I have to write or describe them in words to truly understand what’s going on. When I see a formula followed by the word “obviously,” I know that the “obviously” is predicated on mathematical notation coming easily to the reader [1]. To wit, here, from Chapter 6:
The assumption that the reader can read a formula like it’s a comic strip contrasts, to me, with the assumption that the reader has not done any machine learning before. Chapter 3 explains what L1 and L2 regularization are. Someone with some machine learning experience probably knows that. Chapter 4 promises examples that will help the reader “learn what it means to do gradient descent.” Someone with machine learning experience absolutely knows that. In fact, I’d argue that if a person is brand new to these concepts, a book about deep learning for natural language processing is probably not the book for them yet.
It’s possible that this wasn’t the authors’ idea of “must-include” information, but rather a recommendation from an editor without a machine learning background. I have had editors ask me to include explanation for things that a general audience wouldn’t know, but my audience would. So that might be the context behind this choice.
How not to read this book:
I would not recommend reading this book sequentially from start to finish.
- Chapter 1 provides a general overview of some PyTorch operations. In practice, I would look up these operations in the PyTorch documentation on an as-needed basis.
- Chapter 2 moves on to an overview of some data representation concepts in NLP. In practice, I’d more likely be referring to NLTK documentation for these.
- Chapter 3 describes a perceptron with this illustration:
Perceptrons are one of those things that, in my humble view, every single text makes out to be conceptually harder than they are. So I can’t say there’s precedent for a clearer explanation than this. But I’m chuckling a little at the idea that taking a mathematical formula and putting it on some marshmallows with toothpicks sticking out of them will make it clearer.
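To make the "simpler than it looks" point concrete, here is a minimal sketch of a single perceptron in plain Python: a weighted sum of the inputs plus a bias, passed through an activation function. This is not the book's code. The function names, the choice of a sigmoid activation, and the example numbers are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def perceptron(inputs, weights, bias, activation=sigmoid):
    """Compute activation(w . x + b) for a single unit."""
    # Multiply each input by its weight and add the results up.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    # Shift by the bias, then apply the activation function.
    return activation(weighted_sum + bias)


# Hypothetical inputs and weights, chosen so the weighted sum is zero.
output = perceptron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(output)
```

In words: multiply, add, shift, squash. Everything else about a perceptron is how the weights and bias get chosen, which is the training part.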
The diagram is followed (Chapters 4 & 5) by a catalog of activation and optimization functions, which is useful to someone with machine learning background. If you don’t have that background, skip these for now: later sections tell you which one they’re using in context. You can come back to the “why” and “how” later.
The remaining chapters discuss specific types of NLP model and how/why you would use them. They each do this with an illustrative example, which I appreciate as a pedagogical choice. In particular, the code samples have several standout qualities. More on that in the next section, but the short version is this: the code samples make this book almost by themselves.
The text borrows from academic fashion in several places (“proving that this function is just a transformation of that function is an exercise left to the reader”, “this example is a straightforward extension to what the thorough reader will have seen in previous chapters”), which makes it read as by/for folks with academic backgrounds. The thing about academic fashion is that it doesn’t optimize for information transfer for several reasons, not all of them good. This means that books that follow said fashion are also not optimized for information transfer, particularly for a general purpose audience.
That’s not an insurmountable issue: it just means this book isn’t really designed to be read cover to cover like a crime thriller. There is a way to read it. We’ll get to it in a second.
But first, obtaining the book: don’t use Kindle.
The code samples in the text, though theoretically generous, are useless on Kindle because Kindle’s UI does not allow you to copy and paste, which is what a resource like this one should reasonably expect its readers to want to do. I installed a browser extension that does allow you to copy and paste stuff from Kindle cloud reader, but it doesn’t preserve the formatting.
This is an issue because the code samples in this book are in Python, where indentation is part of the syntax: code that loses its formatting will not run until the indentation is restored. In addition, the font size of the code samples is huge relative to the font size of the text that explains them, such that optimizing the viewing experience for one makes the other illegible (I tried both ways).
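A small sketch of why lost formatting matters in Python. The two snippets below are hypothetical (they are not from the book); the only difference between them is the indentation of the function body, and the built-in `compile()` is used to check whether Python accepts each one.

```python
# Hypothetical snippet, as it might look after a paste that dropped indentation.
flattened = "def double(x):\nreturn x * 2\n"

# The same snippet with the function body indented as Python requires.
indented = "def double(x):\n    return x * 2\n"


def is_valid_python(source):
    """Return True if the source parses, False if Python rejects it."""
    try:
        compile(source, "<pasted>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


flattened_ok = is_valid_python(flattened)
indented_ok = is_valid_python(indented)
print(flattened_ok, indented_ok)
```

The flattened version is rejected before any of it runs, so every pasted sample has to be re-indented by hand.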
What I’d probably have done here is provide, everywhere that the book contains a code sample, a link to that specific file in the public repository where the sample code is stored.
This public repo is gold, and the table of contents plus this repo alone would be worth the price of the book.
If I were to go back and read this book again, I would not read it sequentially from cover to cover. Instead, I would use the table of contents to find the chapter containing the topic I wanted to learn about. Then, I would go to the repo and dig through the chapter file to find the code example demonstrating that topic. I would study the code example, referring to the inline comments for tactical understanding, and referring back to the book for strategic and theoretical understanding.
Like, these docstrings are as good as docstrings on actively maintained and widely used open source libraries:
The authors also do a much better job of naming variables, extracting methods, and dividing responsibilities between classes than most data science/machine learning code I’ve seen (I’ve written, adjusted, and audited machine learning models as part of my job for several years). If I came onsite and this was the code that the data science team had to show me, I would jump for joy. The other really lovely thing about these examples is that the authors share them in Jupyter notebooks such that I can fork the repo and run the models myself.
I would go so far as to say that I would pay $$BANK$$ for this exact book as a series of Jupyter notebooks instead of a Kindle book. I could read along and then actually run the code samples inline. The book becomes really valuable to me when it:
- Explains the formulae verbally as well as numerically
- Links out to proofs, rather than leaving them as an exercise for the reader
- Links out to explanations of prerequisite knowledge, rather than explaining it inline
- Ensures copious logging in the code samples, so that readers who run them see what’s happening in real time.
This last one gets tricky if we want to add any explanations or visualizations that don’t come with the PyTorch modules. Here’s why: logging functionality adds code that doesn’t have to do with the thing the authors are trying to demonstrate (the models in PyTorch). Anytime you add code that doesn’t contribute to the core functionality, you risk distracting people with the non-core part. Logging is particularly notorious for this because it’s tough to abstract away: you sorta have to intersperse lines in your core code to do it.
Now, we could get around that in this code with a clever combination of inheritance and decorating. But I’m using the word “clever” deliberately, and in programming, when we say that word, it’s not a good thing. And if we’re including in the target audience a math specialist who knows just enough Python to get their work done, it’s a bad, bad thing. So we’d have to think really carefully about our audience and the tradeoffs of logging for procedural model clarity versus actual code clarity.
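As a rough sketch of the decorating half of that idea, using only the standard library: the logging lives in a decorator, and the function being demonstrated contains nothing but its core logic. The names `logged` and `training_step`, the logger name, and the numbers are hypothetical and do not come from the book or its repository.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("training_demo")  # hypothetical logger name


def logged(func):
    """Log each call's arguments and result without touching the function body."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        logger.info("%s(args=%r, kwargs=%r) -> %r",
                    func.__name__, args, kwargs, result)
        return result
    return wrapper


@logged
def training_step(weight, gradient, learning_rate=0.1):
    """Core logic only: one plain gradient descent update."""
    return weight - learning_rate * gradient


# Calling the function logs the call as a side effect.
new_weight = training_step(1.0, 0.5)
```

The core function stays readable, but the reader now has to understand decorators, closures, and `functools.wraps` to know where the log line comes from. That is exactly the tradeoff in question.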
I could talk about that all day, but I won’t. I want to get to one other fantastic thing about this book.
The book includes references in a thoughtful, contextually relevant way.
Each chapter includes recommendations for additional reading. Books commonly do this with a superscripted footnote and that’s it. But this book goes further: it explicitly says “if you want to learn more about X, check out this book and/or this paper.” It gives an idea of when various innovations in machine learning were made, and why. It also provides community context such as “Check out this book by the person who came up with this idea.” [2]
So by the time I get to the list of sources, I know which ones I’m looking for. For me, this goes a long way toward making information stick. I only wish the book had taken advantage of its online format and linked these additional resources directly, rather than with citations alone. That’s another thing that a Jupyter Notebook version of the publication could do with ease.
As a practitioner, I’d take a code-focused approach to this book.
That is, in my view, its strongest asset, and one that could be better leveraged if the book were published in formats that centered the code samples more than the text. That said, the interspersed text provides valuable context—especially the references, which would benefit (in my view) from direct links.
1. I give very few pieces of prescriptive advice about instructive writing, but this is one: words like simply, obviously, easy, just, and straightforward are almost never necessary and almost never add anything to your instructive writing. Remove them.
2. Obligatory reminder here that when a piece of writing says “So-and-so came up with the idea,” what it usually means is “So-and-so is the least disputed candidate publicly credited with coming up with the idea.” Those two things are different, and historical memory of who came up with something is wildly political, frequently inaccurate, and laughably reductionist in most cases. We’ve talked a bit about why that happens right here (warning: don’t read this if you’re an Elon stan), but for more context, I can’t recommend Mar Hicks’ work highly enough.
If you liked this piece, you might also like:
This recent piece, though it’s not my favorite, about how to be a 10x developer