In this series, we’re walking through a data science business problem from start to finish. We have introduced the business case, assessed the data available to us, and performed some exploratory analysis.
During that exploratory analysis, we needed to do some time-intensive data processing. We have a dataframe with three columns:
number_of_patients. It contains the number of times each physician has performed each procedure. If a physician performs multiple procedures, their id will appear in the
physician_id column once for each procedure they perform.
We’d like to convert this into a dataframe with one row for each physician and one column for each procedure, with the intersection displaying the number of times the physician has performed that procedure.
The first iteration of this transformation was time-intensive. This post shares how I improved the performance of that transformation.
A special thank-you goes out to my mentor and friend, Q Ethan McCallum, for recommending I try the thing I ultimately did to solve this. If you’re looking for someone to help determine your company’s data strategy, you’ve gotta give him a call.
Here is the part of the last notebook where we run the aforementioned conversion:
from specialty_data import extract_procedure_features combined_numerized_procedures = extract_procedure_features(physician_procedures) combined_numerized_procedures.head(10)
10 rows × 3142 columns
I wrote a function called
extract_procedure_features and stored it in a file called
specialty_data. That function takes in the dataframe that tells us how many of each procedure a physician has performed, and it returns a dataframe with a row for each physician and a column for each procedure, with the number of times the physician has performed that procedure at the intersection.
First Implementation: 40 Minute Run Time
This is the way that I originally wrote this function:
Why does this code take so long to run?
.apply on line 12? Pandas doesn’t parallelize that; it’s an iterative operation. In our case, it’s iterating through about 6 million rows. So even though the apply method,
numerize, is not computationally expensive, it has to happen a lot of times.
I didn’t want this process to take 40 minutes, so I messed around for a day or two to increase the efficiency of the process. Most of the time I ran into the same obstacle: I couldn’t figure out how to count vectorize by a number instead of by a collection of instances of a
procedure_id. Could I have dropped into C and gotten this to go a lot faster, the way scikit-learn’s
TfidfVectorizer do? Maybe I could have, but I needed to move on from this. If you’d like to implement that, I’d love to see the result!
Second Implementation: <10 Minute Run Time
Instead, I parallelized the process described above across the cores of my local machine using
dask, a library for parallel processing in pandas.
Here’s how that looks:
We’ve passed our operation over to dask to perform (see lines 16-19). Steps:
dd.from_pandasconverts a pandas dataframe into the dask format that allows parallelization. The first argument is the dataframe (though dask also supports other types of collections). The second argument,
num_partitions, indicates the number of partitions into which the collection should be divided. We divide across the number of cores on the machine we’re using, drawn from
cpu_count()in the python
map_partitionsindicates the operation to perform. Its name comes from
map, the operation we perform on collections to do something to each element and return a collection with the transformed value. Our transformation,
numerize, will be performed on all the partitions.
computeoperation comes built into dask collections, and it runs the transformation specified in
map_partitions. They’re separate calls because you can have
computerun several transformations at once, if you like.
getcall specifies which scheduler to use for the computation. We are using a scheduler backed by a process pool that we got from
dask.multiprocessing. The other options are, from the documentation:
dask.threaded.get: a scheduler backed by a thread pool
dask.get: a synchronous scheduler, good for debugging
distributed.Client.get: a distributed scheduler for executing graphs
When we run this dask version, the transformation still takes some time, but even on the smallest EC2 instances it takes fewer than 10 minutes.
I found the dask documentation articulate, helpful, and up to date as I executed this change.
But what about pandas parallelization?
I want to emphasize something here, because I have had four separate data scientists tell me that pandas speeds up operations by running them in parallel. Pandas does not do this.Not for
str methods, not for
apply, none of it. Pandas shells some stuff out to
numpy, which iterates quickly for reasons that are also not parallelization (we discussed them here). In order to execute parallel processing, we need something more than pandas.
As far as I can tell, the reason people think pandas does parallelization comes from the changes to in version 0.18 that released the global interpreter lock (GIL) on certain operations like
What is the GIL? Python implements a global interpreter lock, which means that it only allows access to the Python interpreter on one thread at a time. The maintainers made this decision to simplify memory management, and it means that parallelizing across multiple cores doesn’t work in pure Python.
pandas released the GIL on some operations, that means that those operations use (or were changed to use) collections in a language other than python—usually C, via cython. Cython still assumes that python operations are not thread-safe unless the programmer explicitly releases the GIL. Thomas Nyberg explains this really well.
This change allows parallelization to work on those specific operations, but it doesn’t do parallelization. Data scientists and engineers need to execute their own parallelization, either through
ThreadPoolExecutor or via a library that does it, like
Friends don’t let friends tell people that pandas parallelizes operations. It doesn’t.
In this case, though, we needed to perform a combination of one-hot encoding and multiplying by an instance count on a dataframe. Our individual steps are small and fast, but the dataframe is huge. With millions of rows of data, the transformation still took a long time.
To speed it up, I went with a parallel processing library called
dask. It has excellent documentation and, in my view, an intuitive API. Could more clever things be done to speed this up? Maybe. To me, this was the best bang-for-buck solution.
Now it’s time to get back to our data science cast study! Are you ready to move on to prototyping models? Here we go.
If you liked this post, you might also like:
Another time I talked about performance — example in ruby
A risk-oriented testing strategy that mentions performance — example, again, in ruby
This keynote about refactoring — which discusses the other qualities of code that I think are important