Are you particular about your programming workspace?
I am. I like code to be neat and well-documented so I understand what it does. I like it well-tested so I can run experiments and avoid regressions. I like the code to run fast, where possible. I also like it to run only as many times as I need.
Software engineers have IDEs to help them get those things. In my experience, data science tools aren’t as sophisticated. (If this is news to you, Joel Grus’s JupyterCon NYC talk will give you a taste. His talk is titled ‘I Don’t Like Notebooks.’)
I value the same things in data science code that I do in app code. Other data scientists value them, too. So how do we create a workspace that works well for data science tasks?
I don’t have all the answers yet. But I thought I’d share my process as I find solutions that work for me.
This case study assumes that you are somewhat familiar with python and Jupyter notebooks. It also assumes that you have some experience extracting data from files to perform data analysis.
Processing My Data with Jupyter Notebooks
Suppose we have 50,000 rows of labeled data. Each row has a paragraph of legal text in a column called base_content. We want to use this base_content to classify each row into one of twelve specific categories of legal text.
We need to extract the data out of large CSV files, and then we need to do a little preprocessing to get them ready for testing models. There are a number of ways that we can do this.
Option 1. Do all our preprocessing in the same notebook as our analysis.
This is what the vast majority of Jupyter notebooks do. The top of the notebook pulls data out of CSVs, moves columns around, chains together a bunch of find-and-replace, drops some null values, fills in some other null values, et cetera et cetera. It might even do some fancier stuff like vectorize text data or perform part-of-speech tagging. All of this pre-processing may or may not be documented along the way with comments or markdown cells, like so:
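To make this concrete, here is a hedged sketch of what such a preprocessing preamble often looks like. The cleaning steps, the sample data, and the category column are illustrative assumptions; only the base_content column comes from our scenario, and the inline buffer stands in for the real CSV so the sketch runs on its own:

```python
import io
import pandas as pd

# Stand-in for pd.read_csv("our_big_file.csv") -- inline sample data
raw = io.StringIO(
    "base_content,category\n"
    "  The party OF the first part...,contract\n"
    ",statute\n"
    "WHEREAS the tenant agrees...,\n"
)
df = pd.read_csv(raw)

df = df.dropna(subset=["base_content"])            # drop rows with no text
df["base_content"] = (
    df["base_content"]
    .str.strip()
    .str.lower()                                    # normalize case
    .str.replace(r"\s+", " ", regex=True)           # collapse whitespace
)
df["category"] = df["category"].fillna("unknown")   # fill other nulls
```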
From there, the same notebook goes straight into the analysis, like so:
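A hedged sketch of what that in-notebook analysis often looks like. The vectorizer, the model choice, the split, and the tiny stand-in dataframe are all assumptions for illustration, not the article's actual pipeline:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tiny stand-in for the preprocessed dataframe from the cells above
df = pd.DataFrame({
    "base_content": ["the tenant agrees to pay rent",
                     "the party of the first part",
                     "rent is due on the first",
                     "the second part shall indemnify"] * 5,
    "category": ["lease", "contract", "lease", "contract"] * 5,
})

X = TfidfVectorizer().fit_transform(df["base_content"])
y = df["category"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on the held-out rows
```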
Sometimes the same variable name gets used twice along the way, but that shouldn't matter, because we're supposed to be running all these cells in order from top to bottom anyway.
This keeps everything in one place, which is nice. But there are a few downsides:
1. There is a high risk of things getting messed up if the cells run out of order. In notebooks, you can run cells in whatever order you like. Theoretically this feature allows you to move incrementally through code, or back and forth, while saving all your values in memory for later use in your exploration.
In practice, when we combine this functionality with data stores that we modify in place (like dataframes) or variables that we reassign several times (like y_test), we expose ourselves to the risk of using the wrong data in the wrong places. This happens especially when we modify something and then run a cell above it, expecting those modifications not to have been made yet.
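A minimal illustration of the hazard. The "cells" here are simulated as plain statements, and the dataframe is made up; in a real notebook, the reset happens invisibly in kernel memory:

```python
import pandas as pd

# "Cell 1": load the raw data
df = pd.DataFrame({"base_content": ["some text", None, "more text"]})

# "Cell 2": clean it by reassigning the same name
df = df.dropna(subset=["base_content"])
assert len(df) == 2   # downstream cells now assume clean data

# Re-running "Cell 1" silently resets df, undoing the cleaning --
# but any cells below that already ran still reflect the cleaned data.
df = pd.DataFrame({"base_content": ["some text", None, "more text"]})
assert len(df) == 3   # the nulls are back
```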
2. Notebooks are hard to test.
I’ll be frank. If there is a good way to write automated tests for notebook code, I have not found it. Automated tests are an important piece of documentation for code that sees regular use, especially as the users become more distant from the writer (either different team members, or the same team member at a different point in time). When we ignore them, we expose ourselves to the risk of having zero information about how the system works if something ever breaks.
3. This approach does not scale to multiple notebooks.
What if we need to do the same pre-processing for several models? Do we copy and paste this same litany of pre-processing steps into every notebook, and change it in all those places each time we modify our pre-processing regimen?
That’s annoying, of course, so often we see a different solution: jam all our analysis into one behemoth notebook. The notebook is tough to interpret. Finding a piece of information in it feels like searching for a needle in a haystack. And of course, each additional analytical step requires some modification to the dataframe (in place), increasing the risk that we screw things up when we inevitably run the cells out of order.
A brief aside: I include a lot of markdown documentation in my notebooks.
It surprises me how rarely data scientists take advantage of the markdown in notebooks. In my view, the ability to mix code cells with markdown cells is the greatest strength of the notebook, and it is the primary advantage of the notebook over a command line REPL or a python script.
To integrate with any kind of software or business function, we have to socialize our decisions: the choices we made, the alternatives we tried, the strengths and limitations of our approaches.
If nothing else, notebooks make it uncommonly easy to do that. We can write some markdown above a particularly opaque line of pandas manipulation to explain that we had to replace certain values with other values to make our data valid (and why). We can write out the questions we're trying to answer, then write the code below the question to help us find the answer. We can include images and links that blend seamlessly with the charts we generate in our code cells.
When we don’t use this feature of notebooks, we’re riding extremely close to using notebooks as a script IDE. In which case, why not write our code in a python file and run it as a script? We’d get the same opportunity to incrementally advance our code by printing out the result of the last line each time we run the script. Plus, we would not be subject to the memory limitations of the Jupyter notebook’s local browser-based environment, nor would we encounter the discouraging snafus of running notebook cells out of order.
Notebooks give us a rarefied opportunity to fully, completely, and clearly document our scientific process. If we’re going to use them, we should use that.
Option 2. Save processing in its own notebook and import that notebook into other notebooks.
Ok. I super do not recommend this.
In the above mentioned talk, Joel Grus addresses why. I’m copying his slides directly into this post so you can see that I’m not the only person who feels this way about the notebook-importing option:
Option 3. Save processing in a csv that gets pulled by other notebooks
What if we did our pre-processing in one notebook, put a cell at the end of the notebook that did dataframe.to_csv('our_data.csv'), and then did pd.read_csv('our_data.csv') at the beginning of each analysis notebook? We would just need to run the pre-processing notebook before the others (or even run it just once, ever, and then commit the resulting file).
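For simple scalar columns, that hand-off looks something like this. The dataframe contents are made up, and an in-memory buffer stands in for our_data.csv so the sketch is self-contained:

```python
import io
import pandas as pd

# End of the pre-processing notebook:
processed = pd.DataFrame({"base_content": ["some text", "more text"],
                          "label": ["a", "b"]})
buffer = io.StringIO()                  # stands in for 'our_data.csv'
processed.to_csv(buffer, index=False)

# Top of each analysis notebook:
buffer.seek(0)
df = pd.read_csv(buffer)                # same as pd.read_csv('our_data.csv')
print(df.equals(processed))             # True for simple scalar data
```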
This might work in some cases.
In our case, though, we are creating tf-idf vectors for our text data, and the result is a sparse array: a very large collection of numbers, most of them zeros. As it turns out, .to_csv() in pandas converts collections to string representations…sort of. I say "sort of" because the representation is only faithful if the collection is short. If it is a large collection—say, one containing all the numbers from 1 to 23458—then the string version in the .csv shows "[1, 2, 3...23456, 23457, 23458]", with the ellipsis (dots) in the middle. Essentially, the data is lost.
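You can reproduce the loss with a large array stored in a dataframe cell. The column name is made up, and an in-memory buffer stands in for the file; to_csv stringifies the array, and the string representation is truncated:

```python
import io
import numpy as np
import pandas as pd

# One cell holding a large collection, like a tf-idf row
df = pd.DataFrame({"vector": [np.arange(1, 23459)]})

buffer = io.StringIO()          # stands in for 'our_data.csv'
df.to_csv(buffer, index=False)

buffer.seek(0)
restored = pd.read_csv(buffer)
cell = restored["vector"][0]
print(type(cell))               # a string now, not an array
print("..." in cell)            # True: the middle values are gone
```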
So unfortunately this particular solution doesn’t work for our use case.
Option 4. Save processing in its own python file with utility methods
What if we have a notebook that explains all of our pre-processing steps, and then we stick that code inside a method in a .py file and call that in the notebook?
See the example below. On the left, we have the top of our explanatory notebook. On the upper right, inside prepare_data.py, I have copied the python lines of the notebook into a method called get_data_with_tfidf(). On the lower right, I import that method into the notebook, call it, and assign the result to a dataframe.
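A hedged sketch of what prepare_data.py might contain. The function name comes from the article; the cleaning steps, the vectorizer settings, and how the vectors are attached to the dataframe are assumptions:

```python
# prepare_data.py
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def get_data_with_tfidf(source):
    """Load the labeled data and attach tf-idf features.

    `source` is a path or file-like object accepted by pd.read_csv.
    """
    df = pd.read_csv(source)
    df = df.dropna(subset=["base_content"]).reset_index(drop=True)
    df["base_content"] = df["base_content"].str.lower()

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(df["base_content"])
    df["tfidf"] = list(tfidf.toarray())   # one vector per row
    return df
```

In the client notebook, the call then collapses to a couple of lines: `from prepare_data import get_data_with_tfidf` followed by `df = get_data_with_tfidf('our_data.csv')`.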
This strategy addresses some of the issues we talked about in Option 1. First, a user cannot accidentally run the pre-processing steps out of order now because they happen in order in the method called from the client. Second, we have lots of good ways to write automated tests for a method inside a python file. Third, the approach scales to multiple notebooks because we can import and call this same method in as many notebooks as we want.
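For example, with pytest we can pin down the behavior of the extracted code. Both the helper and the test below are illustrative stand-ins, not the article's code; in a real project the helper would live in prepare_data.py and the test in its own test file:

```python
import io
import pandas as pd


def drop_empty_content(source):
    """Hypothetical helper: load rows and discard ones with no text."""
    df = pd.read_csv(source)
    return df.dropna(subset=["base_content"]).reset_index(drop=True)


# test_prepare_data.py -- discovered and run with `pytest`
def test_drop_empty_content_removes_null_rows():
    csv = io.StringIO("base_content,label\nsome text,a\n,b\n")
    df = drop_empty_content(csv)
    assert len(df) == 1
    assert df["label"].iloc[0] == "a"
```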
This code organization strategy feels like it’s moving in the right direction to me. We have our rough draft, notebook version where we explain all our choices, and then we have our final draft python file version with its own automated tests. The chief concern I have about this approach: code duplication. The lines of the notebook that we end up using for pre-processing get copied wholesale into the .py file. Stylistically DRYness isn’t everything, but it’s not ideal that we would have to make any changes to our pre-processing in two places or else let the notebook get out of date. Please let me know if you have a solution for this: as of yet, I don’t have one.
Our method approach also has a couple of limitations that we’d like to address for our use case:
1. It starts the extraction process from scratch every time we run it.
We’re dealing with 50,000 rows of text, each of which has a base_content attribute that needs tf-idf vectorization. The method you see above takes roughly 20 seconds to run. That feedback loop is long enough to add friction to our workflow. And every time we need to undo our notebook changes and get a fresh copy of the dataframe, we have to wait those 20 seconds again. The ideal solution would be much faster than this.
2. We do not get to make any choices about how our preprocessing is done for each analysis step.
Back at Option 1, we talked about how these data analysis projects go. Each analytical step requires some unique modifications to the data. It’s possible that we want to start with different pre-processing steps depending on the analysis we’re doing. With this method, though, we’re always getting the same steps—including tf-idf, which is computationally expensive. Ideally, inside a client notebook, we could compose together the pre-processing steps we need without any additional ones we don’t need.
There are a number of programming design patterns that we can use to address these limitations. We do exactly that in part 2. But for now, let’s notice something important: the limitations we’re facing now are fundamentally different from the ones we faced back in Option 1.
When we took the all-notebook approach to organizing our code, we left ourselves vulnerable to problems with running order, risks of untested code, and limitations with scaling. Progress can feel frustrating and slow in an environment where your code might not work and you cannot reuse anything without glomming it all into one place.
That said, notebooks have a valuable strength: the opportunity to document our choices in markdown with runnable code and charts that exemplify our decisions. Python scripts do not offer us that same opportunity.
We currently start out in notebooks to ask questions and document our answers. We then switch to scripts to finalize and reproduce our answers. When we do this, we shift our limitations. We remove risks around the running order of data writes. We create opportunities to document and quality-check our code through automated tests. We open up more options for where to put our code, since we can reuse shared dependencies without keeping all our analysis in one notebook. That flexibility gives us the opportunity to address the next tier of limitations—client side performance (running speed) and query customization.
If you’ve enjoyed this piece and want to read more about data science, you might like:
Visualizing the regression and classification process (for presenting models to businessfolk/helping SMEs with error analysis)
A two-part series on all my notes from Statistics Done Wrong (I highly recommend this book)
This notebook comparing stock prices across ESG metrics (sounds boring, but I think there is some interesting statistics happening here.)