In this series, we’re walking through a data science business problem from start to finish. We have introduced the business case, assessed the data available to us, and performed some exploratory analysis (with a fun detour into parallelizing pandas operations!).
The moment you’ve been waiting for has arrived. It’s time to fit some models.
Random Baseline
Okay, let’s get a baseline for accuracy with a “model” that randomly assigns specialties as cardiology or not with the same frequency as we see in the labeled data. We’ll use this baseline to determine whether our models perform any better than random chance.
target_frequency = specialties.loc['Cardiology', 'count'] / data_without_unknowns.shape[0]
target_frequency
About a third of our labeled data is labeled “Cardiology.” We’ll run a random classifier 100 times and take the mean accuracy of those runs so we don’t get fooled by some random outlier that happens to be more or less accurate than random would normally be.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
accuracies = []
for i in range(100):
    combined_numerized_procedures['binary_random_baseline_cardiology'] = physicians.specialty.apply(
        lambda x: 1 if np.random.rand() < target_frequency else 0)
    accuracies.append(accuracy_score(
        combined_numerized_procedures.binary_cardiology,
        combined_numerized_procedures.binary_random_baseline_cardiology))
np.asarray(accuracies).mean()
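As a quick sanity check on that simulated mean: a guesser that predicts “Cardiology” with probability p, against data whose true cardiology rate is also p, is correct with probability p² + (1 − p)², which we can compute directly from the target frequency.
# Analytic expectation for the random baseline: correct on positives with
# probability p * p and correct on negatives with probability (1 - p) * (1 - p).
expected_baseline_accuracy = target_frequency ** 2 + (1 - target_frequency) ** 2
expected_baseline_accuracy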
So our models have to get an accuracy over 60-ish percent to do better than random guessing.
Try some Models
I’m going to start with two models: a logistic regression model (used for classification) and a random forest classifier.
Why these:
- Both pretty fast
- Both relatively interpretable
- I have seen both do well on this kind of problem many times
If they both do really poorly, we can tweak their hyperparameters, try different features, or try a different kind of model. This is a starting point with a fast iteration loop, so we’ll quickly learn whether we need to change something.
from sklearn.model_selection import train_test_split
num_examples = data_without_unknowns.shape[0]
def train_validate_test_split(series):
    probability = np.random.random()
    if probability < 0.65:
        return "train"
    elif probability < 0.85:
        return "validate"
    else:
        return "test"
data_without_unknowns['group'] = data_without_unknowns.apply(train_validate_test_split, axis=1)
assert data_without_unknowns.shape[0] == num_examples
Once again, we end with an assertion to check the shape of our end product. I tend to do this especially when the axis parameter is involved, because in my experience with pandas it’s not consistent whether 0 or 1 refers to rows or columns.
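If that axis ambiguity bites you too, here’s a quick illustration on a toy DataFrame (hypothetical, not part of our dataset):
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# axis=0 aggregates down the rows, producing one value per column...
toy.sum(axis=0)                                      # a -> 3, b -> 7

# ...while apply with axis=1 hands the function one row at a time.
toy.apply(lambda row: row['a'] + row['b'], axis=1)   # 4 and 6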
It should also be noted: sklearn has a train_test_split function (sketched below for comparison). I forwent that function in favor of my own because:
- train_test_split doesn’t give us a validation group to iterate on our model before testing. I want this so that I can be sure my changes are improving the model’s generalizable results rather than teaching to the test set.
- I want easy access to which examples gave which results so I can do error analysis and look for patterns in the mistakes.
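For comparison, here’s a minimal sketch of the sklearn route, assuming the same data_without_unknowns, procedure_columns, and binary_cardiology names used just below. It produces three groups, but it leaves group membership implicit in which variable a row landed in, rather than in a column we can join against later:
from sklearn.model_selection import train_test_split

X = data_without_unknowns[procedure_columns]
y = data_without_unknowns['binary_cardiology']

# First carve off the training set (65% of the data)...
sk_X_train, X_rest, sk_y_train, y_rest = train_test_split(
    X, y, train_size=0.65, random_state=42)  # seed chosen arbitrarily

# ...then split the remainder into validate (20% overall) and test (15% overall).
sk_X_validate, sk_X_test, sk_y_validate, sk_y_test = train_test_split(
    X_rest, y_rest, train_size=0.20 / 0.35, random_state=42)
We stick with the row-level group column instead, built just above, because it keeps that bookkeeping on the DataFrame itself.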
X_train = data_without_unknowns.loc[(data_without_unknowns.group == 'train'), procedure_columns]
X_validate = data_without_unknowns.loc[(data_without_unknowns.group == 'validate'), procedure_columns]
X_test = data_without_unknowns.loc[(data_without_unknowns.group == 'test'), procedure_columns]
y_train = data_without_unknowns.loc[(data_without_unknowns.group == 'train'), 'binary_cardiology']
y_validate = data_without_unknowns.loc[(data_without_unknowns.group == 'validate'), 'binary_cardiology']
y_test = data_without_unknowns.loc[(data_without_unknowns.group == 'test'), 'binary_cardiology']
data_without_unknowns.groupby('group')['physician_id'].count()\
.reset_index()\
.rename(columns={'physician_id':'count'})
^ A note: to a data scientist, the amount of manipulation I do on that groupby might seem odd. I could do the groupby without the index resetting or column renaming and get the same result. I do it because, with the manipulation, the output display looks nicer and is easier to read for folks who are unfamiliar with the pandas library.
We have our train, validate, and test groups. It’s time to run some models on that training set.
from sklearn.linear_model import LogisticRegression
lc = LogisticRegression()
lc.fit(X_train, y_train)
lc_predictions = lc.predict(X_validate)
lc_probabilities = [p[1] for p in lc.predict_proba(X_validate)]
Let’s save off these results for error analysis later.
data_without_unknowns.loc[data_without_unknowns.group == 'validate', 'lc_predictions'] = lc_predictions
data_without_unknowns.loc[data_without_unknowns.group == 'validate', 'lc_probabilities'] = lc_probabilities
So how does our accuracy look?
print(accuracy_score(y_validate, lc_predictions))
Let’s take a look at the features it highlights and see if they make sense.
codes_and_procedures = procedures.groupby('procedure_code')['procedure'].unique().reset_index()
CODE_TO_NAME = dict(zip(codes_and_procedures.procedure_code, codes_and_procedures.procedure))
def code_to_name(code):
    return CODE_TO_NAME[code]
We’ll use these helpers above to match up interpretable feature names to their weights in the models. We could theoretically pull this out into a helper class or utility file (a rough sketch of what that might look like follows this list). I haven’t done that yet because:
- The methods rely very closely on the features we used, so I do not want to hide that dependency for now
- Notebooks are most useful for two things: exploration and presentation. Right now, we are using this method for exploration. Eventually, we might want to use it for replication, say, of an experiment. In these cases, having this method in a helper file or even a state-aware class (that knows which features were used in this training run) might suit us better.
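For context, here’s a rough sketch of what a helper like labeled_sorted_weights might look like; the real implementation lives in specialty_data and may differ, but the idea is to pair each weight with its procedure code and a human-readable name, sorted from most to least positive:
import pandas as pd

def labeled_sorted_weights(weights, procedure_codes, code_to_name):
    # Pair each weight with its code and readable name, largest weights first,
    # so .head() shows the strongest indicators and the tail shows the strongest
    # contraindicators. (Sketch only -- the real helper may differ.)
    df = pd.DataFrame({
        'procedure_code': list(procedure_codes),
        'procedure': [code_to_name(code) for code in procedure_codes],
        'weight': weights,
    })
    return df.sort_values('weight', ascending=False).reset_index(drop=True)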
from specialty_data import labeled_sorted_weights
labeled_sorted_weights(lc.coef_[0], procedure_codes, code_to_name).head()
We’re seeing here that the five features that most strongly indicate a cardiology classification all appear to have to do with the heart. That makes sense. Let’s also look at the procedures that most strongly contraindicate a cardiology classification:
labeled_sorted_weights(lc.coef_[0], procedure_codes, code_to_name)[-5:]
These features don’t sound anything like cardiology, so they also make sense.
Before we start any error analysis here, let’s look at the random forest classifier and see how it makes this decision.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc_predictions = rfc.predict(X_validate)
rfc_probabilities = [p[1] for p in rfc.predict_proba(X_validate)]
Let’s save off these results for error analysis later.
data_without_unknowns.loc[data_without_unknowns.group == 'validate', 'rfc_predictions'] = rfc_predictions
data_without_unknowns.loc[data_without_unknowns.group == 'validate', 'rfc_probabilities'] = rfc_probabilities
So how does our accuracy look?
accuracy_score(y_validate, rfc_predictions)
What features is the model using?
rfc_weights_df = labeled_sorted_weights(rfc.feature_importances_, procedure_codes, code_to_name)
rfc_weights_df.head()
Similar to the logistic regression model, our random forest classifier seems to make decisions chiefly based on procedures that have to do with the heart (EKG stands for electrocardiogram, a test that records the heart’s electrical activity).
Feature importance in a random forest classifier doesn’t work quite the same way as weights in a logistic classifier. Rather, the feature importance tells us how much this feature contributes to the classification, in either direction. Lower numbers mean that those features contribute less, all the way down to zero (not contributing), as you see here:
rfc_weights_df[-10:]
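One quick way to see that difference in the numbers themselves: scikit-learn’s impurity-based importances are non-negative and normalized to sum to one, so, unlike the logistic regression coefficients, they carry no sign to tell us which direction a feature pushes.
# Feature importances are all >= 0 and sum to 1; logistic coefficients carry a sign.
print(rfc.feature_importances_.min() >= 0)   # True
print(rfc.feature_importances_.sum())        # ~1.0
print((lc.coef_[0] < 0).any())               # True -- some procedures push away from cardiology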
That means that, if we go far enough down the list of high feature importances, we’re likely to see some of the features that the logistic regression model gave its lowest coefficients, because they are very important indicators of who isn’t a cardiologist. Look:
rfc_weights_df['rank'] = range(0, rfc_weights_df.shape[0])
contraindicative_feature = rfc_weights_df[rfc_weights_df.procedure == 'cervical_or_vaginal_cancer_screening;_pelvic_and_clinical_breast_examination']
contraindicative_feature
This feature has a pretty high rank.
Gauging Accuracy
The overall accuracy on these models is pretty high for a first pass.
A lot of the stuff I work on day-to-day starts out with an accuracy of 60-80%, and we need to get it into the high 90s. This is because my work revolves around the interpretation of legal text, with critical liability risks if the model gets certain things wrong. The larger the consequences of a misclassification, the closer a model needs to get to 100% accuracy. Another situation that might demand perfect accuracy is the use of a model on enormous data sets—the kind with millions of examples to classify, such that 97% accuracy would represent tens of thousands of mistakes.
Our use case has no catastrophic consequences of misclassification (i.e. no deaths, injuries, or crimes). We’re also not in the realm of big data: with about 11,000 examples to classify, a 1% accuracy difference equates to only 110 examples. For us, 96-97% accuracy might be good enough.
Nevertheless, in the next post, we look at some examples of how we might approach raising the accuracy of a model like this.
Conclusion
In this post, we established a random baseline against which to judge the accuracy of our models on our dataset.
Then we chose two different types of model and tried them both out on our dataset.
We opened up those models and examined why they made classifications the way they did, and we talked about the difference between a weight and a feature importance.
In the next post, we do some error analysis on our models—complete with some awesome data visualizations 😎
If you liked this post, you might also like:
- The rest of the posts in this applied data science case study (I recommend reading them in the order they are numbered to get the full experience of a data science project)
- This keynote I gave about refactoring (complete with a fun 26 minute video of the first half of the talk)
- This post on getting mentors (for when you need someone to take a look at your own data science case study!)