Applied Data Science Case Study, Part 3: Exploratory Analysis

Reading Time: 12 minutes

In this series, we’re walking through a data science business problem from start to finish. We have introduced the business case: a list of leads for the sales team at a company that makes continuing education courses for cardiologists.

We have also assessed the data available to us by asking questions like:

  • Are we allowed to use this data?
  • Does it have the fields we need?
  • How can we protect people’s privacy while using this data?
  • What biases and ethical concerns arise with this data?

In that assessment, we decided to take some steps to clean this data and prepare it for exploratory analysis. The CMS, NPPES, and NUC all publish relatively nice datasets, so we don’t have a ton of processing to do. I downloaded the datasets as CSVs from the sites linked above and saved them to my machine. Let’s get to work.

Initial Cleaning and Processing

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

I don’t have much additional comment here because I’d like to get to the analysis. It should be noted, though, that the steps in this notebook do take up the majority of a data scientist’s time. We can look to this handy chart for a rough breakdown.


I will point out one thing about the data intake code: if you want to chain pandas methods in a jupyter notebook and have the invocations line up vertically, you can achieve that by adding a backslash to the end of each line.

Exploratory Analysis

Let’s approach this with a series of questions. That seems to work well for us ;).

Do we have enough data for this problem?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
physicians = pd.read_csv('data/physicians.csv')
procedures = pd.read_csv('data/procedures.csv')
In [3]:
(36025, 2)
['id' 'specialty']

36,000 rows. Not all of these are labeled so we cannot train on all of them, but it’s an encouraging number for a binary classification problem, particularly if we’re talking about tabular data with scalar features that demonstrate a clear pattern.

Now let’s look at the balance of our classes—particularly how many physicians are classified as cardiologists versus other specialties. When we go to train the model we’ll take unknown specialties out of the training set, but we’ll leave them in up until then so that they get included in any feature manipulation we do.

In [4]:
specialties = physicians.groupby('specialty')\
    .sort_values(by='id', ascending=False)\
    .rename(columns={'id': 'count'})

COLORBLIND_FRIENDLY_COLORS=['#225ea8', '#41b6c4', '#a1dab4', '#ffffcc']

specialties.plot.pie(y='count', figsize=(10, 10), colors=COLORBLIND_FRIENDLY_COLORS, fontsize=16, legend=False)
<matplotlib.axes._subplots.AxesSubplot at 0x126d7d6a0>

Cardiology represents a large chunk of this pie: almost a quarter, even before we remove unknowns. This tells me that we have enough examples of this class to get a good idea of its feature patterns.

This also clues me in to what kind of model we’ll need and what kind of data we’ll feed it. In this first pass, I probably won’t rebalance the data classes since cardiology makes up about a third of the labeled data. If, instead, I were trying to identify anesthesiologists that make up such a small chunk of the data, I’d look at finding a way to get more anesthesiology data, undersampling the other classes in the training data, or oversampling the anesthesiology class (roughly in that order of preference) for my training data.

Additionally, for a class as large as cardiology, my experience tells me that we’re likely to get good results with a traditional classification model. If the class were very small, I’d timebox my attempts with a traditional classification model and consider, instead, an anomaly detection model of some stripe.

What features can we use?

In [5]:
(587774, 4)
['physician_id' 'procedure_code' 'procedure' 'number_of_patients']

For each physician we have procedure codes (along with the procedures they represent) and the number of patients on whom they have performed the procedure. We could look for a number of patterns with this data:

  1. Cardiologists may perform some procedures at least once that other physicians never perform (ascertained by one hot encoding the procedures)
  2. Cardiologists may perform some procedures lots of times that other physicians perform less or never (ascertained by multiplying the above by procedure count)
  3. Cardiology is a very specialty form of medicine that congregates in places with large research hospitals. They may perform a different total number of procedures than other physicians (ascertained by summing all procedure counts for each physician).
  4. Also because of the specialization, they may perform a more specific range of procedures than physicians in other specialties (ascertained by creating some kind of vectorization of each procedure to determine which procedures are similar to each other, and looking at the difference in the size and density of the “neighborhood” of procedures each physician performs. My first attempt at this might be tokenizing the procedure descriptions and starting with a vectorization of that).

I think that feature set number 2 out of these is likely to give us the most information about the specialty patterns for the lowest up-front time investment. So I’d like to start with that. Let’s see if we can get an idea of what a physician’s procedure breakdown might look like:

In [6]:
physician_procedures = procedures.groupby(['physician_id', 'procedure_code'])\
physician_procedures[physician_procedures['physician_id'] == 0]
physician_id procedure_code number_of_patients
0 0 99202 14
1 0 99203 15
2 0 99205 12
3 0 99212 27
4 0 99213 16
5 0 99221 13
6 0 99232 12

So we have here a dataframe that tells us how many of each procedure a physician has performed. For example, physician 0 has performed seven procedures, each between a dozen and two dozen times.

Let’s convert this into a dataframe with a row for each physician and a column for each procedure, with the number of times the physician has performed that procedure at the intersection.

The way I wrote this conversion takes about ten minutes to complete. We’ll take a closer look at this procedure in an upcoming post.

In a permanent project for this model, we’d want to think about how to obtain, update, and store this information. We could keep it in a csv (which would be large, with a lot of zeros), keep it in a sparse matrix separate from the physician and procedure data (which would save space but be harder to index against the physician ids), or combine all our data in some kind of database. If we do it in a row-store database like MySQL or Postgres, we run into one problem we also have with csvs: we’ll be using a lot of storage on zeros. So we might consider a column-store database like Cassandra or similar, which are built for sparse data like this. We’d want to carefully consider that though, since inserts and updates are more time-consuming for a database like that, and these counts would need to be updated every time we have new information about a physician doing a procedure. We could store sparse features in one database and dense arrays like physician id in another database, then cross-index as needed, but that comes with its own technical challenges. A discussion of the tradeoffs of how to store this data is probably out of scope for this code challenge.

So for this first pass, I stick to storing it in CSV.

In [7]:
from specialty_data import extract_procedure_features

combined_numerized_procedures = extract_procedure_features(physician_procedures)
Unnamed: 0 physician_id procedure_00100 procedure_00103 procedure_00104 procedure_00120 procedure_00140 procedure_00142 procedure_00144 procedure_00145 procedure_Q9961 procedure_Q9962 procedure_Q9963 procedure_Q9965 procedure_Q9966 procedure_Q9967 procedure_Q9969 procedure_Q9970 procedure_Q9974 procedure_S0281
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 5 5 0 0 0 0 0 0 0 0 0 0 0 0 0 25 0 0 0 0
6 6 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 7 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 8 8 0 0 0 0 0 17 0 0 0 0 0 0 0 0 0 0 0 0
9 9 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

10 rows × 3142 columns

Get Binary Data with Unknowns Removed

Now we’ll add our target variables to our feature dataframe. We’ll also filter down our data for training to the physicians whose specialty we know. If we leave them in, our binary classifier’s cost function will optimize on differentiating between known cardiologists and unknown cardiologists and will lean toward classifying unknown physicians as not-cardiologists.

In [8]:
combined_numerized_procedures['specialty'] = physicians.specialty
combined_numerized_procedures['binary_cardiology'] = physicians.specialty.apply(lambda x: 1 if x == 'Cardiology' else 0)
(36025, 3144)

I like to check shapes after I perform some transformations to make sure that everything worked as I expected.

In [9]:
data_without_unknowns = combined_numerized_procedures.loc[(combined_numerized_procedures.specialty != 'Unknown'), :].copy()
(24933, 3144)

We have 25,000ish labeled rows. This is still likely enough data to make headway on our classification problem.

In [10]:
unknown_data = combined_numerized_procedures.loc[(combined_numerized_procedures.specialty == 'Unknown'), :].copy()
(11092, 3144)

We have about 11,000 cardiologists of unknown specialty.

Do the Labeled and Unlabeled Data Look Similar?

In order to predict anything about the unknown physicians from the ones whose specialties we know, the patterns in the two sets of data need to be similar: otherwise, the labeled data isn’t telling us about the unlabeled data. Let’s do a quick check that the procedure patterns in the labeled and unlabeled data are similar:

In [11]:
procedure_columns = list(filter(lambda x: x.startswith('procedure'), data_without_unknowns.columns.values))
procedure_codes = [title[-5:] for title in procedure_columns]
In [12]:
labeled_procedure_sums = data_without_unknowns[procedure_columns].sum(axis=0)
unlabeled_procedure_sums = unknown_data[procedure_columns].sum(axis=0)
In [13]:
labeled_data_sorted_indices = np.flip(np.argsort(labeled_procedure_sums))
procedure_labels = np.asarray(procedure_columns)[labeled_data_sorted_indices]

known_specialty_procedures = labeled_procedure_sums[labeled_data_sorted_indices] / data_without_unknowns.shape[0]
unknown_specialty_procedures = unlabeled_procedure_sums[labeled_data_sorted_indices] / unknown_data.shape[0]

What we’ve done here is divide the total procedure count for each procedure by the total number of physicians for both labeled and unlabeled data. We’ll put them next to each other on a bar graph and see if we notice any big differences in which procedures were performed between the two groups.

Look for any “jagged” areas where the orange bar sticks up way over the black bar or vice versa.

In [14]:
indices = np.arange(len(procedure_columns)) 
width = 0.4     
spacing = 0.2

fig = plt.figure(figsize=(20,10))

axes = plt.gca()
ax = fig.add_subplot(111)

known_bars =, known_specialty_procedures, width, color='black')  
unknown_bars =, unknown_specialty_procedures, width, color='orange')

ax.set_ylabel('Number of Procedures per Physician')
ax.set_xlabel('Procedure Types')
ax.set_title('Physician Procedures by Type')
ax.legend((known_bars[0], unknown_bars[0]), ('Specialty Known', 'Specialty Unknown'))

I’m not in love with this color scheme, but I wanted something colorblind-friendy that also had enough contrast with each other and with the white background to easily see any jaggedness.

I played with the zoom on this chart a little so I could see things better because I was having trouble getting the interactive pan/zoom setting on matplotlib to work in the notebook. If you happen to know the trick to getting that working, I’d love to hear it.

The orange and black bars aren’t exactly the same height, but for the brunt of this data they’re pretty close. It’s still possible that these two overall patterns are the same and the patterns on a per-physician basis are completely different. That could happen if, say, all the unlabeled physicians came from hospitals and clinics where they divide up procedures completely differently than the institutions that label the physician specialties.

To ensure that that is not the case, I’d want to find out how this data was collected. Without that assurance, I’m still going to consider it unlikely enough to disregard for now based on my understanding of how the medical field works.


Perhaps my favorite course I have ever taken on the topic of machine learning is Yaser Abu-Mostafa’s MOOC called Learning from Data. In it, he explains that machine learning is useful when:

  1. There is a pattern
  2. Without a clear mathematical representation
  3. Represented in (a sufficient amount of) data.

In this exploratory analysis, we have determined that we should have enough data to approach this, and that data demonstrates patterns that, while not categorizable as cardiologist or not based on some obvious condition, are consistent across our labeled and unlabeled data.

Next, we experiment with using parallelization to speed up our processing for the procedure data. After that we try a machine learning approach to find the unlabeled cardiologists.

If you liked this post, you might also like:

This three-part thought experiment about feature engineering

Design patterns for data science (aka why this notebook doesn’t have my parallelized processing code in it)

A look at the Numpy API (this was written before I knew how used gists for code)

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.