In the last Design Patterns for Data Science post, we talked about the relative advantages of Jupyter notebooks and scripts, and we moved some code from a preprocessing notebook to a method inside a python file. When we did that, we discovered that the limitations we faced changed from tooling limitations to programming limitations—problems we can solve with design patterns from software engineering.
But before that, I want to talk for a moment about data science, programming, and cups.
I had dinner with a friend the other night, a principal data scientist with a software engineering background. He shared some valuable insight on the programming philosophies of software engineers versus data scientists. Data scientists, he pointed out, tend to build code to have a short lifespan: a script to change the data once, or a helper method to use for a few weeks until a model is finished. Software engineers, by contrast, imagine their code having a longer lifespan: months or, if it’s really built well, years. Many old-hand software engineers have a story about some embarrassing code they wrote that accidentally stayed in use for decades. Decades.
So it’s not that one group is better at writing code than the other. It’s that one group has optimized on programming paper cups, and the other group has optimized on programming ceramic mugs.
But there’s a lesson in the stories from the old-hand software engineers: code often sticks around far longer than anybody thought it would. That’s true in software engineering, and it has resulted in a number of patch-job solutions, from the cute ones (like Android’s nullColumnHack) to the scary ones (like security holes in the web itself).
It’s also true in data science: a temporary solution ends up getting passed around and reused without context. The data science workspace ends up littered with paper cups—dented, coffee-stained, oft-reused paper cups, each of them well past their intended lifespan.
Data science can benefit from learning some of the techniques that software engineers use to program mugs. Engineering techniques help us better prepare our code for the reuse, repurposing, and context sharing that it can expect to withstand.
That’s why I’d like to share the object structure I ultimately came up with for solving the case study from the last post.
Review from the Last Post
Here’s our case study:
Suppose we have 50,000 rows of labeled data. Each row has a paragraph of legal text in a column called base_content. We want to use this base_content to classify the rows into one of twelve categories of specific types of legal text. We need to extract the data out of large CSV files, and then we need to do a little preprocessing to get them ready for testing models.
That post walked through several options:
- Doing the preprocessing at the top of the analysis notebook
- Importing a notebook into another notebook (please don’t)
- Saving preprocessing steps to a csv
- Copying preprocessing code into a method in a python file
The last solution has a number of advantages, but it also has two drawbacks, and that’s where we move from tooling limitations to programming limitations:
1. It starts the extraction process from scratch every time we run it. We’re dealing with 48,000 rows of text, each of which has a text attribute that needs tf-idf vectorization. The method from the last post takes roughly 20 seconds to run. That feedback loop is long enough to add friction to our workflow. And every time we need to undo our notebook changes and get a fresh copy of the dataframe, we have to wait those 20 seconds again. The ideal solution would be much faster than this.
2. We do not get to make any choices about how our preprocessing is done for each analysis step. Back at Option 1, we talked about how these data analysis projects go. Each analytical step requires some unique modifications to the data. It’s possible that we want to start with different pre-processing steps depending on the analysis we’re doing. With this method, though, we’re always getting the same steps—including tf-idf, which is computationally expensive. Ideally, inside a client notebook, we could compose together the pre-processing steps we need without any additional ones we don’t need.
So what did I end up doing with this data access and pre-processing code?
Option 5. Singleton data storage wrapped in a class with a fluent interface
Singleton: The data exists exactly once in a shared data store.
Class: A distinct instance of the data, but all of the instances reach into the single shared data store.
Fluent interface: Each method on an instance returns the instance itself so you can chain methods together.
You have used a fluent interface before if you have chained pandas methods together like this:
df = pd.read_csv('some.csv').rename(columns=str.lower).drop('unnamed: 36', axis=1).pipe(custom_method)
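To make the "returns the instance itself" idea concrete, here is a minimal sketch of a fluent interface in plain python. The class `TextCleaner` and its method names are hypothetical, invented only for illustration; they are not part of the case study code.

```python
class TextCleaner:
    """Hypothetical example class, used only to illustrate method chaining."""

    def __init__(self, text):
        self._text = text

    def lowercase(self):
        self._text = self._text.lower()
        return self  # returning self is what makes chaining possible

    def strip_punctuation(self):
        self._text = "".join(
            ch for ch in self._text if ch.isalnum() or ch.isspace()
        )
        return self

    def collapse_whitespace(self):
        self._text = " ".join(self._text.split())
        return self

    def to_string(self):
        # Terminal method: hands back the plain value instead of the wrapper.
        return self._text


# The client picks only the steps it wants, in the order it wants them.
cleaned = (
    TextCleaner("  The Party,  of the FIRST part!  ")
    .lowercase()
    .strip_punctuation()
    .collapse_whitespace()
    .to_string()
)
print(cleaned)
```

Every step except the last returns the `TextCleaner`, so the chain can be as long or as short as the client needs. The last step returns a plain string, which plays the same role that a dataframe-returning method would play for a data class.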
Let’s take a look at the code itself to get an idea of what is happening. I have added comments within the individual methods in case you’re interested in a deep dive:
```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


class Content:
    # Lives on the class, not on self, so every instance shares it.
    _cached_data = None

    def __init__(self):
        if Content._cached_data is None:
            # extract my data into a dataframe. I'm pulling from a file,
            # but you might do this by fetching data from an endpoint
            # or pulling it from a database.
            df = pd.read_csv('all_my_data.csv')

            # example data cleaning. I'm dropping nulls and filling in defaults,
            # but you might do things like calculate means
            # or do other preprocessing steps you know you will want for later.
            df = df.dropna(subset=['column_where_null_values_render_row_useless_for_analysis'])
            df['column_where_null_values_mess_up_my_analysis'] = df['column_where_null_values_mess_up_my_analysis'].fillna('')

            Content._cached_data = df

    # example method that does additional computationally-expensive pre-processing on the data.
    # I made a separate method for this instead of the 'cheap' pre-processing (the dropna and fillna calls in __init__)
    # because it takes a while (15 seconds). So we only invest that time if the client explicitly _needs_ this done.
    # NOTE: the method name with_tfidf is an assumption; only to_dataframe is named in the text of this post.
    def with_tfidf(self):
        if 'tfidf' not in Content._cached_data.columns:
            df = Content._cached_data
            base_contents = np.array(df['base_content'])
            vocab_length = 40000
            tfidf_transformer = TfidfVectorizer(ngram_range=(1, 3), max_features=vocab_length)
            tfidf_encodings = tfidf_transformer.fit_transform(base_contents)
            df['tfidf'] = list(tfidf_encodings.toarray())
            Content._cached_data = df
        # Returning self is what lets the client keep chaining.
        return self

    def to_dataframe(self):
        # Fluent interface methods return the object calling the method so the client can chain methods together.
        # In the 'original' fluent interfaces in Java, this method is often named .get().
        # I need it because I want the value at the end of my fluent statement to be a dataframe, not a Content object,
        # but initializers must return None (Python itself hands back the newly initialized object).
        # This method hands over the internal state as a dataframe.
        return Content._cached_data.copy()  # Many thanks to Bijay Gurung for catching an issue with this! Now updated to work properly 🙂


# You might note that the above method is not needed in the pandas fluent interface.
# That's because the pandas calls that return dataframes are _not_ initializers on an object you already hold.
# pd.read_csv() and pd.read_excel() are module-level functions that return a dataframe, and pd.DataFrame() constructs one directly.
# There is no "pd" instance; pd is a module. This setup is convenient for users, but it's tough to mock and test.
```
As you can see, when we instantiate a Content(), we fill a data cache if it is still empty. _cached_data lives on the Content class itself, rather than on any given instance, so once one instance has fetched it, all instances have access to it. For this reason, the first time one of our notebooks makes a Content(), it takes some time to fetch the data. The second time and every time thereafter, though, the data is already loaded for use! Check out these two time trials:
The first time takes about 20 seconds. The second time? 0.0004 seconds.
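If you want to see the same effect without the real dataset, here is a small sketch of the class-level cache on its own. `SlowStore` is a hypothetical stand-in, and the `time.sleep` call is an assumption that simulates a slow read; your own timings will differ from the ones reported above.

```python
import time


class SlowStore:
    """Hypothetical stand-in for a class that loads data on first use."""

    _cached_rows = None  # class attribute: one copy shared by all instances
    load_count = 0       # counts how many times the slow load actually ran

    def __init__(self):
        if SlowStore._cached_rows is None:
            time.sleep(0.2)  # simulates a slow read from disk or a database
            SlowStore._cached_rows = [{"base_content": "sample text"}]
            SlowStore.load_count += 1


def seconds_to_construct():
    """Time how long it takes to create one SlowStore instance."""
    start = time.perf_counter()
    SlowStore()
    return time.perf_counter() - start


first = seconds_to_construct()   # pays the loading cost
second = seconds_to_construct()  # finds the cache already filled
print(f"first: {first:.4f}s, second: {second:.4f}s")
print(f"slow load ran {SlowStore.load_count} time(s)")
```

The second construction skips the body of the `if` entirely, which is why it costs almost nothing. In a notebook, the `%timeit` or `%time` magics would do the same job as `seconds_to_construct`.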
Singletons introduce a risk: if one instance modifies the store, then every instance now has a modified store—even if they did not want the data modified. Our class here mitigates that risk because all methods that modify the underlying data are both idempotent and agglutinative.
Idempotent: We can run it 1 time or 400 times, and the result will be the same.
Agglutinative: All changes add to the data store. No methods change or remove data in the data store. Data doesn’t get re-represented in the notebook no matter what order a notebook user chooses to run their cells.
Furthermore, we know that tf-idf takes a long time. So if we don’t need that pre-processing step, we run Content().to_dataframe() and we don’t have to wait for it to finish at all, not even the first time we access the data.
We can take advantage of python’s built-in testing tools to write tests for this class and ensure that we don’t cause regressions (break things that were previously working) if we make changes to the existing code.
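As a rough illustration of what such tests could look like, here is a sketch using python's built-in `unittest` module. It checks the two properties described above (idempotent and agglutinative) plus the protection that returning a copy provides. `RowStore`, `with_word_count`, and `to_rows` are hypothetical names, and the store is a plain list of dicts rather than a dataframe so the example runs with no dependencies.

```python
import copy
import unittest


class RowStore:
    """Hypothetical miniature of the pattern, with a list of dicts as the store."""

    _cached_rows = None

    def __init__(self):
        if RowStore._cached_rows is None:
            RowStore._cached_rows = [
                {"base_content": "the lessee shall pay rent"},
                {"base_content": "the seller warrants title"},
            ]

    def with_word_count(self):
        # Additive: only writes a new key. Idempotent: skipped if already there.
        for row in RowStore._cached_rows:
            if "word_count" not in row:
                row["word_count"] = len(row["base_content"].split())
        return self

    def to_rows(self):
        # Deep copy so callers cannot alter the shared store.
        return copy.deepcopy(RowStore._cached_rows)


class RowStoreTest(unittest.TestCase):
    def setUp(self):
        RowStore._cached_rows = None  # start each test from an empty cache

    def test_running_a_step_twice_changes_nothing(self):
        once = RowStore().with_word_count().to_rows()
        twice = RowStore().with_word_count().with_word_count().to_rows()
        self.assertEqual(once, twice)

    def test_step_only_adds_data(self):
        before = RowStore().to_rows()
        after = RowStore().with_word_count().to_rows()
        for old, new in zip(before, after):
            self.assertEqual(old["base_content"], new["base_content"])
            self.assertEqual(set(new) - set(old), {"word_count"})

    def test_editing_a_copy_leaves_the_store_alone(self):
        rows = RowStore().to_rows()
        rows[0]["base_content"] = "overwritten in a notebook"
        self.assertEqual(
            RowStore().to_rows()[0]["base_content"],
            "the lessee shall pay rent",
        )
```

Saved to a file, these tests can be run with `python -m unittest`. Resetting the cache in `setUp` matters: because the store lives on the class, a test that forgot to reset it would inherit whatever the previous test left behind.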
Software engineers spend a lot of time thinking about how to make code flexible and reusable. The data science community can benefit from that thought. Here we have taken a method inside a python file (which gets the job done, albeit inefficiently) and converted it into a python class that keeps a single shared copy of expensive-to-access data. We have also added a fluent interface so that clients can customize which pre-processing steps they run on the data. In future posts, hopefully we’ll get to see additional examples of ways that the software engineering and data science worlds can cross-pollinate to make everybody happier.