Natural Language Processing: A Project-First Approach

Reading Time: 7 minutes

Last month I read Emmanuel Ameisen’s post called How to solve 90% of NLP Problems: a Step-by-Step Guide. It sounds like y’all have been enjoying my one-page summaries, so I thought I’d share my notes from this blog post with you.

What is a Project-First Approach?

Why am I calling this a project-first approach? Because, drawing on his own project experience, Ameisen describes the 80-20 of NLP projects: the small subset of techniques that does the vast majority of the real-world work.

Consider the alternative: the majority of NLP overview texts begin from a theoretical basis. What are all the things we could, theoretically, do with text, and what are all the ways we might, theoretically, do those things? When you’re in the field building NLP apps for customers, you aren’t doing all of those things: you’re doing a very small subset of them.

Allow me to compare it to the Android stack. If you follow the exercises in Android Programming: The Big Nerd Ranch Guide, you learn to build custom layouts, deploy fun animations, integrate with fingerprint authentication, and even test out the Mobile Vision and Android Voice APIs. But in a year of building Android apps for customers, you’ll spend nine months doing exactly one thing: fetching data from somewhere and displaying it in a list view. You spend 80% of your time doing 20% of what you learned.

Same with NLP. Ameisen wrote a bunch of apps for paying customers, and he didn’t find himself summarizing, generating, or editing text all that much. Instead, he used NLP to put things in buckets: sort feedback into positive and negative buckets. Sort requests into urgent versus general/not urgent buckets. Sort people, based on their reviews, into customer lifetime value buckets. We call this type of machine learning problem classification, and you can make a career out of this one thing.

Ameisen describes a general approach for doing this one thing. In large part, I agree with the approach he describes.

Step 1: Data Cleaning

The blog post provides a handy checklist for cleaning data prior to running an NLP classifier on it. It includes removing irrelevant characters and words, tokenizing the remaining text, and unifying words that should be evaluated the same way—like upper and lower case versions of the same word, different stems of the same lemma, and different spellings of the same word.
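
To make that checklist concrete, here’s a minimal sketch of what those steps might look like in Python with spaCy. The regex and the en_core_web_sm model are my choices, not Ameisen’s:

```python
# A minimal cleaning pass: strip irrelevant characters, lowercase,
# tokenize, and lemmatize so different forms of a word are unified.
# Assumes spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import re

import spacy

nlp = spacy.load("en_core_web_sm")

def clean(text):
    # Remove anything that isn't a letter, digit, whitespace, or apostrophe.
    text = re.sub(r"[^A-Za-z0-9\s']", " ", text)
    # Lowercasing unifies 'Light' and 'light'; lemmatizing unifies
    # 'lights', 'lighted', and 'light'.
    return [tok.lemma_ for tok in nlp(text.lower()) if not tok.is_space]

print(clean("The LIGHTS flickered, then flickered again!"))
```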

From my experience I would add a procedure to this list: concatenating words with their part of speech. ‘I light a candle’ and ‘Turn on the light’ and ‘This is a light package’ all contain the word ‘light,’ but it means entirely different things in each case. To give the model a shot at evaluating them differently, we can append the part of speech to each one, as it’s a verb in the first case, a noun in the second, and an adjective in the third. This takes time to do manually, but there are some automated tools that can help. Some of the tools get fancy and try to identify prepositional phrases, parse the direct object from the indirect object, and separate transitive and intransitive verbs. The fancier they get, the less accurate they get, so I don’t go this deep. Instead I say ‘any object, direct or indirect, is a noun’ and append that to the word. ‘Any verb, transitive or intransitive, is a verb.’ And so on. It’s still not perfect (‘I have light packages’ and ‘I have light coloring’ still get evaluated the same even though light means different things in each), but it’s a start on which I can iterate in later steps.
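
Here’s a rough sketch of that word-plus-POS trick using spaCy’s coarse part-of-speech tags, which conveniently already treat any object as a NOUN and any verb, transitive or intransitive, as a VERB. This is my illustration, not a tool from the original post:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def with_pos(text):
    # Append the coarse POS tag so 'light' the verb, the noun, and the
    # adjective become three distinct tokens.
    return [f"{tok.text.lower()}_{tok.pos_}" for tok in nlp(text)]

for sent in ["I light a candle", "Turn on the light", "This is a light package"]:
    print(with_pos(sent))
# 'light' should surface as light_VERB, light_NOUN, and light_ADJ respectively.
```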

Step 2: Train

The post describes thinking about how you want to represent your data and visualizing it under that representation to see if you can visually separate some ‘classes.’ I’m ‘meh’ on this strategy because I’ve seen very few cases where you can visually separate classes in NLP data. Presumably Ameisen has seen it work, and the visualization takes about one line of code, so you might as well do it in case it bears fruit. Just don’t count on it working (to be fair, Ameisen makes it clear that you shouldn’t).
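
For what it’s worth, here’s roughly what that one-liner-ish visualization looks like. The TF-IDF-plus-PCA combination and the toy data are my stand-ins; the original post uses its own dataset and embeddings:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["great product", "love it", "works well",
         "terrible support", "refund please", "never again"]
labels = [1, 1, 1, 0, 0, 0]  # toy positive/negative buckets

# Vectorize the documents, then project to two dimensions for plotting.
X = TfidfVectorizer().fit_transform(texts).toarray()
coords = PCA(n_components=2).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="coolwarm")
plt.title("Documents projected to 2D, colored by class")
plt.show()
```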

This section also recommends thinking about what kinds of models you’d like to try. Logistic regression is a popular first choice for text classification. You can pick a few candidates and run a grid search to see their results compared side by side. Worth mentioning: this step will take a while. You should be fine without a GPU unless you’re working with a truly large amount of text (I’ve vectorized the entire literary corpus of George R.R. Martin on a normal laptop), but it still took over half an hour to run four different models. Make a cup of tea.
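
One way to get that side-by-side comparison in scikit-learn is to treat the classifier itself as a grid-search parameter. The specific models and the toy data below are my assumptions, not Ameisen’s exact setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = ["great product", "love it", "works well",
         "terrible support", "refund please", "never again"]
labels = [1, 1, 1, 0, 0, 0]

pipe = Pipeline([("vec", TfidfVectorizer()), ("clf", LogisticRegression())])
# Each dict swaps a different classifier into the pipeline's 'clf' slot;
# GridSearchCV cross-validates them all and records a score for each.
grid = GridSearchCV(pipe, [
    {"clf": [LogisticRegression(max_iter=1000)]},
    {"clf": [MultinomialNB()]},
    {"clf": [LinearSVC()]},
], cv=3)
grid.fit(texts, labels)

for params, score in zip(grid.cv_results_["params"],
                         grid.cv_results_["mean_test_score"]):
    print(type(params["clf"]).__name__, round(score, 3))
```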

Step 3: Evaluate and Understand

This step gives us the information we need to iterate on our approach to make our model more accurate. What kinds of mistakes are our models making? How costly are those mistakes? The piece goes on to describe a couple of specific tactics. Systematic error analysis is the cornerstone of effective learning for both humans and machines, so I’m encouraged to see it emphasized here. If you’re looking for more high-level information on this particular step of a machine learning project, I recommend the ‘Error Analysis’ chapter in Machine Learning Yearning by Andrew Ng.
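
As a concrete starting point, a confusion matrix plus a read-through of the actual misclassified examples covers the basics. The data here is hypothetical model output, just to show the shape of the workflow:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical held-out labels and model predictions.
texts  = ["ship it today", "just curious", "server is down", "nice weather"]
y_true = ["urgent", "general", "urgent", "general"]
y_pred = ["urgent", "general", "general", "general"]

# Which classes does the model mix up, and how often?
print(confusion_matrix(y_true, y_pred, labels=["urgent", "general"]))
print(classification_report(y_true, y_pred))

# The raw material of error analysis: read the mistakes themselves.
for text, truth, guess in zip(texts, y_true, y_pred):
    if truth != guess:
        print(f"expected {truth!r}, got {guess!r}: {text!r}")
```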

Additional Questions

When I’m done taking notes on a piece, I like to write out additional questions and, depending on the piece, go searching for answers. I had a few questions about technical details in this post, but the one most worth highlighting is the ubiquity of the Bag of Words method as the example representation in NLP write-ups. It’s simple, easy to explain, and indeed the basis of most first attempts at an NLP problem. But it’s worth remembering that this representation disregards word order entirely. If you’re running classifiers on a very particular type of text, you can partially work around the word-order problem by tokenizing common phrases in your domain so they are evaluated together. For example, if you train models on a lot of legal text, your particular type of legal text will have phrases in it that mean a very specific thing to lawyers. Find out what those phrases are and make them their own tokens. If this isn’t enough for your model, then it may be worthwhile to consider representations that try to preserve word order.
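
A crude but serviceable version of that phrase trick: fold known domain phrases into single tokens before vectorizing. The legal phrases below are hypothetical examples; a real list would come from domain experts or a phrase-mining tool like gensim’s Phrases:

```python
# Merge multi-word domain phrases into single tokens so a bag-of-words
# model evaluates them together rather than word by word.
LEGAL_PHRASES = ["force majeure", "habeas corpus", "without prejudice"]

def merge_phrases(text, phrases=LEGAL_PHRASES):
    text = text.lower()
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text

print(merge_phrases("The delay falls under the Force Majeure clause."))
# -> 'the delay falls under the force_majeure clause.'
```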

Conclusion

Ameisen’s post offers a pithy, informed, step-by-step approach to beginning the type of text classification problem that dominates the NLP-for-business space right now. While my project experience has led me to a few slightly different conclusions than Ameisen’s, I agree with a lot of what he’s saying here, and I think the piece is a worthwhile read for ML engineers and product managers on ML-focused products.

If you liked this post, you might also like:

This review of Everybody Lies by Seth Stephens-Davidowitz

This review of Weapons of Math Destruction by Cathy O’Neil

This review of Machine Learning Yearning by Andrew Ng

One comment

  1. Thanks for sharing! I found it very helpful.

    Please, can you share your thoughts on how to make domain-specific word embeddings, for example, as you said, for text related to law? I know how to train word2vec on a dataset, but my data is very small. I’m using en_core_web_md from spaCy, but I find it too generic. Is there a way to update word representations with external data without retraining from scratch?
