We’ve come a long way. In this series, we’re walking through a data science business problem from start to finish. We have:
- introduced the business case,
- assessed the data available to us,
- performed some exploratory analysis,
- stopped off to improve our data processing performance,
- then fit some models on our data.
We have now arrived at the most context-intensive step in our process: error analysis.
This is where we look closely at our models to answer a few questions:
- How do they make decisions?
- What do they get right, and why?
- What do they get wrong, and why?
- What next steps could we take with either the data or the model to improve their utility?
This post is even more into-the-weeds than the other five because it covers precisely the step where data scientists really get into the weeds.
Are you up for that?
Okay then, let’s dive in.
Error Analysis: what do the models get wrong?
Let’s address an elephant in the room: the overall accuracy of the models I’m about to show you is pretty high for a first pass.
A lot of the stuff I work on day-to-day starts out with an accuracy of 60-80%, and we need to get it into the high 90s. This is because my work revolves around the interpretation of legal text, with critical liability risks if the model gets certain things wrong. The larger the consequences of a misclassification, the closer a model needs to get to 100% accuracy. Another situation that might demand perfect accuracy is the use of a model on enormous data sets—the kind with millions of examples to classify, such that 97% accuracy would represent tens of thousands of mistakes.
Our use case has no catastrophic consequences of misclassification (i.e. no deaths, injuries, or crimes). We’re also not in the realm of big data: with about 11,000 examples to classify, a 1% accuracy difference equates to only 110 examples. For us, 96-97% accuracy might be good enough to ship as-is with no further work, especially as a first pass.
Nevertheless, I’m going to demo some error analysis methods here so you can see how I would approach raising the accuracy of a model like this.
Let’s start with the logistic regression classifier.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix, classification_report, roc_curve, precision_recall_curve
print(classification_report(y_validate, lc_predictions))
It overwhelmingly gets the classification correct, but it struggles the most with recall of cardiologists. That tells me that some cardiologists are slipping through our classifier undetected. Let’s compare some raw numbers to get an idea of how many:
lc_false_negatives = data_without_unknowns.loc[(data_without_unknowns.binary_cardiology == 1) & (data_without_unknowns.lc_predictions == 0), :]
lc_false_negatives.shape
lc_false_positives = data_without_unknowns.loc[(data_without_unknowns.binary_cardiology == 0) & (data_without_unknowns.lc_predictions == 1), :]
lc_false_positives.shape
So we have a lot more false negatives than false positives with no tweaking to the out-of-the-box model. Returning to our conversation with the product manager at the top of this investigation, what we do now depends on the cost of a false negative versus a false positive.
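Since we already imported confusion_matrix above, here’s a quick sanity check that shows all four counts at once (rows are true classes, columns are predicted classes):
# Off-diagonal cells: row 1/column 0 holds the false negatives (true cardiologists
# predicted as non-cardiologists); row 0/column 1 holds the false positives.
print(confusion_matrix(y_validate, lc_predictions))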
Let’s pretend, for the sake of argument, that a false positive costs us little: non-cardiologists who get our missive mostly ignore it, with a very low likelihood of unsubscribing from our client. Meanwhile, a false negative costs a lot: our client is selling a very exciting (and expensive!) cardiology subscription product that has gotten a lot of hype from early adopters on Instagram and YouTube ;). Customer lifetime value is high, and the likelihood of buying is relatively high. We want to capture all of those subscriptions for our client. In that case, we’d want a very sensitive model that catches every cardiologist, at the cost of some additional false positives.
Maybe this is a small-scope change: maybe we can capture those additional cardiologists by changing the prediction probability classification threshold (the default value is 0.5). Let’s take a look at how that threshold affects our precision and recall:
from sklearn.metrics import precision_recall_fscore_support
thresholds = np.linspace(0,1,101)
predictions = [list(map(lambda x: 1 if x > threshold else 0, lc_probabilities)) for threshold in thresholds]
performance_metrics = [precision_recall_fscore_support(y_validate, prediction) for prediction in predictions]
def plot_threshold_performance(predictions, performance_metrics):
plt.figure(figsize=(20,10))
plt.plot(thresholds, [p[0][0] for p in performance_metrics], 'r', label='precision on non-cardiologists')
plt.plot(thresholds, [p[0][1] for p in performance_metrics], 'g', label='precision on cardiologists')
plt.plot(thresholds, [p[1][0] for p in performance_metrics], 'd', label='recall on non-cardiologists')
plt.plot(thresholds, [p[1][1] for p in performance_metrics], 'p', label='recall on cardiologists')
plt.title("Precision and Recall by Threshold")
plt.ylabel("Precision and Recall")
plt.xlabel("Threshold")
plt.xticks(np.linspace(0, 1, 11))
plt.legend(loc="lower left", prop={'size': 16})
plt.ylim([0.5,1])
plt.xlim([0,1])
plot_threshold_performance(predictions, performance_metrics)
Again, I’ve futzed with the zoom a bit on this graph so that the interesting part fills the frame.
I’m seeing an important tradeoff area near the 0.4 threshold. The cost of false negatives versus false positives determines where we want to put this tradeoff.
Out of curiosity, which threshold gives us the highest raw accuracy?
accuracies = [accuracy_score(y_validate, prediction) for prediction in predictions]
max(accuracies)
max_index = accuracies.index(max(accuracies))
thresholds[max_index]
So if false positives and false negatives cost us the same, we can tweak the threshold to raise the raw accuracy of this model.
Other options at our disposal for getting a more accurate model here:
- Grid searching over a range of hyperparameters (like penalty type or regularization value (C)) to tune the model to the hyperparameters that give the best accuracy.
- Analyzing the specific cases where the model misclassified things to ask “How do these cases differ from the correctly classified cases?” and designing features to help the model better capture those cases.
Before we do more optimization on this model, though, let’s see how the random forest classifier did.
print(accuracy_score(y_validate, rfc_predictions))
Out of the box, this raw accuracy is higher than that of our logistic regression model, even the threshold-optimized version.
Random forest classifiers pick up accuracy from ensembling (combining the signals of those trees, a collection of weak classifiers, into one strong classifier). The tradeoff: their results, while still interpretable, are a little harder to interpret than the logistic regression ones. This is because a) you can draw a picture of the entire model as a tree chart, but it’s hard to see the whole thing at once and it’s a bit confusing with all the individual decision trees in there, and b) feature importance does not tell us whether the feature indicates or contraindicates the class. We can still make an educated guess at this information by looking at how a given feature correlates with the random forest’s predict_proba probabilities, but it’s not as automatic as it is with a logistic regression model’s coefficients.
So it’s not unusual for the random forest classifier to fit a little better, but error analysis can be a little more involved.
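For example, here’s a minimal sketch of that educated-guess approach for a single feature, assuming X_validate is still a DataFrame with our procedure columns and rfc is the fitted forest from before (the specific column name below is just a placeholder):
# Probability of the cardiologist class for each validation example.
rfc_validation_probabilities = rfc.predict_proba(X_validate)[:, 1]
# A strongly positive correlation suggests the feature indicates the class;
# a strongly negative one suggests it contraindicates it.
example_feature = X_validate['some_procedure_code']  # hypothetical column name
print(np.corrcoef(example_feature, rfc_validation_probabilities)[0, 1])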
Let’s see where the model is getting things wrong.
print(classification_report(y_validate, rfc_predictions))
rfc_false_negatives = data_without_unknowns.loc[(data_without_unknowns.binary_cardiology == 1) & (data_without_unknowns.rfc_predictions == 0), :]
rfc_false_negatives.shape
rfc_false_positives = data_without_unknowns.loc[(data_without_unknowns.binary_cardiology == 0) & (data_without_unknowns.rfc_predictions == 1), :]
rfc_false_positives.shape
This model starts off with a couple dozen fewer mistakes than the logistic regression classifier. Let’s take a look at the physicians it falsely flagged as cardiologists:
rfc_false_positives.groupby('specialty')['physician_id'].count()
So here’s the first thing that strikes me: two of the specialties of the false positive physicians, cardiac electrophysiology and cardiac surgery, sound like they would probably perform a lot of the same procedures as a cardiologist (I looked this up—a cardiologist might refer a patient to a physician with one of those two specialties depending on their assessment of the patient’s condition).
This again raises the question: what is our campaign for? If it’s for some heart-treatment-related thing, we might actually want to add all cardiac specialties to our target list. That might give us more likely prospects in addition to knocking out some of these false positives.
I’m also seeing a relatively large number of false positives in the internal medicine department. That raises another question: how similar are the procedures performed by an internal medicine physician and a cardiologist, on average?
We can get an answer to that by calculating the cosine similarity of the mean procedure vectors for each specialty. (Yes, this is absolutely overkill for a model with a first-pass accuracy of 97%, but now I’m interested).
mean_procedure_vectors = data_without_unknowns.groupby('specialty')[procedure_columns].mean()
from scipy import spatial
def cosine_similarity(specialty, other_specialty):
return 1 - spatial.distance.cosine(specialty, other_specialty)
For non-negative vectors like our procedure counts, cosine similarity varies between 0 (no similarity at all) and 1 (these things are the same, direction-wise). So if we get the cosine similarity of a specialty with itself, we get 1 (plus some floating point fun).
cosine_similarity(mean_procedure_vectors.loc['Addiction Medicine', :], mean_procedure_vectors.loc['Addiction Medicine', :])
All right, time to answer the question I only just realized I’ve had all my life: which physician specialties are most similar to others by procedure pattern?
from sklearn.metrics import pairwise_distances
arrayed_procedures = np.asarray(mean_procedure_vectors[procedure_columns])
cosine_similarities = 1 - pairwise_distances(arrayed_procedures, metric="cosine")
def cosine_similarity_heatmap(cosine_similarities):
fig, ax = plt.subplots(figsize=(20,20))
im = ax.imshow(cosine_similarities, cmap="YlGnBu")
ax.set_xticks(np.arange(len(mean_procedure_vectors.index.values)))
ax.set_yticks(np.arange(len(mean_procedure_vectors.index.values)))
ax.set_xticklabels(mean_procedure_vectors.index.values)
ax.set_yticklabels(mean_procedure_vectors.index.values)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
plt.title("Heatmap of Cosine Similarity Between Mean Procedure Counts by Specialty")
plt.show()
cosine_similarity_heatmap(cosine_similarities)
cosine_similarity(mean_procedure_vectors.loc['Cardiology', :], mean_procedure_vectors.loc['Internal Medicine', :])
The procedures are not totally different, but they’re not especially similar either. So what gives? Let’s pull up the procedure records of these 17 physicians and see if we notice anything.
internists_labeled_as_cardiologists = rfc_false_positives.physician_id
internists_labeled_as_cardiologists_procedures = physician_procedures[physician_procedures.physician_id.isin(internists_labeled_as_cardiologists)].copy()
grouped_internist_cardiology_procedures = internists_labeled_as_cardiologists_procedures.groupby('procedure_code')['number_of_patients']\
.sum()\
.reset_index()\
.sort_values(by=['number_of_patients'], ascending=False)
grouped_internist_cardiology_procedures['procedure'] = grouped_internist_cardiology_procedures.procedure_code.apply(code_to_name)
grouped_internist_cardiology_procedures.head(20)
We’re looking at the most-performed procedures among our misclassified internal medicine physicians. I get why they got labeled as cardiologists: a lot of these are also procedures that a cardiologist would perform a lot. The top one, for example, is an EKG procedure. Internal medicine doctors also do heart and circulatory ultrasounds as well as blood tests. I’ll bet those rank high in feature importance for cardiologists.*
*Admittedly, feature importance in a random forest can mean the feature is either strongly indicative or strongly contraindicative, but the data patterns and our understanding of medicine both point to these being indicative.
Let’s see if I’m right:
internists_labeled_cardiologists_top_procedures = [p[0] for p in grouped_internist_cardiology_procedures.head(10).procedure]
rfc_weights_df[rfc_weights_df.procedure.isin(internists_labeled_cardiologists_top_procedures)]
Boom. The most-performed procedures among the internal medicine physicians falsely classified as cardiologists carry top ranks in the model.
I suspect the separating signal here, if there is one, is going to be which procedures the internal medicine doctors don’t perform: that is, what makes them different from a cardiologist who was correctly classified as one. We could get at this by grouping the top procedures for correctly classified cardiologists, or by looking at the top-ranked procedures in the model that these internal medicine doctors don’t do and seeing whether they differ from the ones they do.
rfc_weights_df.head(20)
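To make that comparison concrete, here’s a rough sketch of the second approach: which of the model’s 20 top-ranked procedures do these internists never perform at all? (This assumes rfc_weights_df is sorted by feature importance, as it appears to be, and reuses the p[0] naming convention from a few cells up.)
# Every procedure name these misclassified internists perform at least once.
all_internist_procedures = set(p[0] for p in grouped_internist_cardiology_procedures.procedure)
# Top-ranked procedures in the model that never show up in the internists' records.
top_20_procedures = rfc_weights_df.head(20)
top_20_procedures[~top_20_procedures.procedure.isin(all_internist_procedures)]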
You know what I’m noticing? The procedures that both these internal medicine physicians and cardiologists do a lot are not super invasive and not super specialized: they’re blood tests, stress tests, ultrasounds, and routine EKGs.
What gets top ranks in the model that these internal medicine doctors don’t do? Inserting a catheter into the heart for imaging. Inserting probes into the esophagus for heart ultrasounds. Doppler ultrasounds. Continuous monitoring over long periods. These are more invasive, more specialized procedures.
So here are the features I would consider trying to engineer:
If it’s really, really important for some reason that we scoop up these distinctions our model missed, I might try to generate features that capture the level of invasiveness, specialization, or diversity of the procedures physicians are performing. Internal medicine doctors handle all kinds of stuff (high diversity of procedures), but in general when it’s time to go really deep they refer to someone else (lower specialization and lower invasiveness of procedures).
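As a sketch of what the simplest of those might look like, a procedure-diversity feature could just count how many distinct procedures each physician bills (the invasiveness and specialization features would need a hand-built mapping from procedure code to score, which I won’t invent here):
# Hypothetical feature: the number of distinct procedure codes each physician performs.
# Generalists like internists should tend toward higher counts than narrow specialists.
procedure_diversity = physician_procedures.groupby('physician_id')['procedure_code'].nunique()
data_without_unknowns['procedure_diversity'] = (
    data_without_unknowns.physician_id.map(procedure_diversity).fillna(0)
)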
So we looked at the specialties with the most false positives.
We only see one or two false positives in other specialties, so we’ll leave those alone for now and turn to the false negatives.
missed_cardiologists = rfc_false_negatives.physician_id
missed_cardiologists_procedures = physician_procedures[physician_procedures.physician_id.isin(missed_cardiologists)].copy()
grouped_missed_cardiologist_procedures = missed_cardiologists_procedures.groupby('procedure_code')['number_of_patients']\
.sum()\
.reset_index()\
.sort_values(by=['number_of_patients'], ascending=False)
grouped_missed_cardiologist_procedures['procedure'] = grouped_missed_cardiologist_procedures.procedure_code.apply(code_to_name)
grouped_missed_cardiologist_procedures.head(20)
I’m showing the top 20 here, but I looked through about the first 200 of these. I notice that:
- A lot of these are quite vague: office visits, for example, and hospital care for varying lengths of time. A lot of physicians, across many specialties, do those.
- The cardiologists do appear to perform fewer of their most-performed procedures than, say, the internists we looked at before. The misclassified internists performed 80% more routine EKGs than these cardiologists, even though there are more than 5 times as many cardiologists contributing to this grouping.
These findings dovetail with the possibilities mentioned in the false positive section above: if we need to inch this model closer to 100% accuracy so badly that it’s worth a significant time investment, we could look into emphasizing those highly specialized procedures that cardiologists do relatively few of, but a different specialty would do none of.
I’m not going to jump into engineering those features now. I’ll leave it as a future option, if we need it, and move on.
Once we had all the features we wanted in the model, we could try some hyperparameter tuning to tweak it further.
For example, let’s check whether the default threshold is the optimal one for this model, the way we did with the logistic regression model.
predictions = [list(map(lambda x: 1 if x > threshold else 0, rfc_probabilities)) for threshold in thresholds]
performance_metrics = [precision_recall_fscore_support(y_validate, prediction) for prediction in predictions]
plot_threshold_performance(predictions, performance_metrics)
accuracies = [accuracy_score(y_validate, prediction) for prediction in predictions]
max(accuracies)
max_index = accuracies.index(max(accuracies))
best_threshold = thresholds[max_index]
best_threshold
Let’s keep running with this model and do a little more tuning:
rfc.get_params()
Let’s do a grid search over some of these hyperparameters. We’ll try this out with n_estimators (the number of trees in the model), max_features (the number of features to consider at each split point), and max_depth (the number of levels each tree can have). We could tune some of the others, too, if we liked, but these three, in my experience, give the most bang for the buck, with buck being time invested.
Random forest classifiers pick up speed from bagging (running decision tree classifiers on subsets of the data) and then combine the predictions of those trees at the end (ensembling). Because each individual tree trains on a subset of the data and only considers a subset of features at each split, and all those trees can train concurrently, a random forest classifier can train surprisingly fast, sometimes even faster in wall-clock time than a single deep decision tree on the same dataset.
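As an aside, scikit-learn exposes that concurrency directly through n_jobs; here’s a minimal sketch (not part of our actual pipeline) of telling a forest to use every available core while training:
from sklearn.ensemble import RandomForestClassifier
# n_jobs=-1 fits the individual trees in parallel across all available cores.
parallel_rfc = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=1)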
We’re going to take advantage of that speed and run this thing a whole bunch of times to compare some tuning options.
We are not planning to do this constantly. We’ll get the optimal parameters from this, then try them on our classifier.
from sklearn.model_selection import RandomizedSearchCV
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 5)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 6)]
search_options = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth}
rfc_search = RandomizedSearchCV(estimator = rfc, param_distributions = search_options, n_iter = 60, cv = 3, verbose=1, random_state=1, n_jobs = -1)
rfc_search.fit(X_train, y_train)
rfc_best_params = rfc_search.best_params_
rfc_hyperparameters_optimized = RandomForestClassifier(n_estimators=rfc_best_params['n_estimators'], max_features=rfc_best_params['max_features'], max_depth=rfc_best_params['max_depth'])
rfc_hyperparameters_optimized.fit(X_train, y_train)
rfc_optimized_probabilities = rfc_hyperparameters_optimized.predict_proba(X_validate)[:,1]
rfc_optimized_predictions = list(map(lambda x: 1 if x > best_threshold else 0, rfc_optimized_probabilities))
accuracy_score(y_validate, rfc_optimized_predictions)
As you can see, all that tuning snagged us a small accuracy gain (for a model that takes much longer to run, I might add). I have rarely seen this kind of tuning make an enormous difference, particularly compared to feature engineering, class consolidation, or additional data cleaning. So I optimize for iteration speed at the beginning of a project with default settings, fewer trees, or less max depth, cycle on those other optimization strategies, then swing back to hyperparameter tuning at the end.
But, it’s the highest accuracy we’ve seen. Let’s run this on the test set and see how we do.
rfc_optimized_test_probabilities = rfc_hyperparameters_optimized.predict_proba(X_test)[:,1]
rfc_optimized_test_predictions = list(map(lambda x: 1 if x > best_threshold else 0, rfc_optimized_test_probabilities))
accuracy_score(y_test, rfc_optimized_test_predictions)
That’s our accuracy on our test set.
Time to run our model on our unlabeled data and hope that it correctly labels a lot of cardiologists (our validation and test results tell us it should). To give it a little extra edge, we can fit it on the entirety of the labeled data first.
rfc_hyperparameters_optimized.fit(data_without_unknowns[procedure_columns], data_without_unknowns['binary_cardiology'])
model_name = 'binary_cardiology_model_1'
rfc_production_probabilities = rfc_hyperparameters_optimized.predict_proba(unknown_data[procedure_columns])[:,1]
rfc_production_predictions = list(map(lambda x: 1 if x > best_threshold else 0, rfc_production_probabilities))
unknown_data['predicted_cardiologist'] = rfc_production_predictions
unknown_data['predicted_by'] = model_name
unknown_data[['physician_id', 'predicted_cardiologist', 'predicted_by']].to_csv('results/production_cardiology_classifications.csv')
The production_cardiology_classifications.csv file contains the ids of the unlabeled physicians and the model predictions of whether or not they are cardiologists. We can use this to plan who to contact for this campaign.
from persistence import Experiment
first_version = Experiment(
name=model_name,
description="First version of cardiology binary model",
feature_names=procedure_columns,
training_data_path="data/processed_input_data.csv",
classification_threshold=best_threshold
)
first_version.save()
Finally, we save off our model in case we need to access it later. We might need it later for a lot of things:
- To get predictions for new data
- To use as a baseline comparison while trying to improve this model with new architecture, features, or hyperparameters
- To examine as part of a bias audit or technical audit
- To keep track of which model produced which predictions, in the event that multiple models apply multiple ontologies to different parts of the same table over time
That last point is why, in addition to saving our production predictions, we saved the name of the model that made them.
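One more note: the Experiment record above captures metadata (name, features, training data path, threshold) rather than the fitted estimator itself. If we also wanted the trained model object on disk, a minimal sketch with joblib (my addition, not part of the persistence module) could look like this:
import joblib
# Persist the fitted classifier next to the experiment metadata so future
# predictions or audits can reload exactly this model.
joblib.dump(rfc_hyperparameters_optimized, 'results/{}.joblib'.format(model_name))
# Later: restored_model = joblib.load('results/binary_cardiology_model_1.joblib')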
For now, we’re done with our cycle on this specific model.
Recommendations on using this approach for other specialties
Earlier in this investigation, we looked at the breakdown of specialties in the labeled data. We saw that cardiology makes up by far the largest part of this data. Here’s the plot again, for reference:
specialties.plot.pie(y='count', figsize=(10, 10), colors=COLORBLIND_FRIENDLY_COLORS, fontsize=16, legend=False)
print("Number of Internal Medicine Physicians: {}".format(data_without_unknowns[data_without_unknowns['specialty'] == "Internal Medicine"].shape[0]))
print("Number of Family Practice Physicians: {}".format(data_without_unknowns[data_without_unknowns['specialty'] == "Family Practice"].shape[0]))
For example, we do have almost 3,000 examples for internal medicine and more than 2,000 examples for family practice physicians.
We can work with this, though it’s likely to be worth balancing the samples to train classifiers on classes like these.
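For what that balancing could look like, one low-effort option is the class_weight parameter, which reweights examples inversely to class frequency instead of resampling (a sketch, not something we ran here):
from sklearn.ensemble import RandomForestClassifier
# 'balanced' weights each class inversely to its frequency in the training data,
# so the smaller specialty classes aren't drowned out by the larger ones.
balanced_rfc = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=1)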
For some of these specialties, though:
print("Number of Colorectal Surgery (formerly proctology) physicians: {}".format(data_without_unknowns[data_without_unknowns['specialty'] == "Colorectal Surgery (formerly proctology)"].shape[0]))
print("Number of Registered Dietician/Nutrition Professionals: {}".format(data_without_unknowns[data_without_unknowns['specialty'] == "Registered Dietician/Nutrition Professional"].shape[0]))
print("Number of CRNA Physicians: {}".format(data_without_unknowns[data_without_unknowns['specialty'] == "CRNA"].shape[0]))
We have relatively few examples of these. Out of 25,000 labeled rows, each of these specialties makes up a tiny fraction. These classes are small enough that, if we absolutely must try to detect them given this training data, even the most sample-conscious classifier might not catch them. We’d likely need some combination of clever feature engineering and/or an anomaly-detection type of model to get the result we need on these.
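To make the anomaly-detection idea a bit more concrete, here’s a rough illustration (not something we validated): fit a model only on the rare specialty’s procedure vectors, then score everyone else by how closely they resemble that pattern.
from sklearn.ensemble import IsolationForest
# Fit only on the handful of labeled CRNA physicians; rows that look 'normal'
# to this model are candidates for the rare class.
crna_rows = data_without_unknowns[data_without_unknowns.specialty == 'CRNA']
rare_class_model = IsolationForest(random_state=1)
rare_class_model.fit(crna_rows[procedure_columns])
# Higher scores mean 'more like the CRNA procedure pattern'.
unknown_data['crna_similarity_score'] = rare_class_model.score_samples(unknown_data[procedure_columns])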
Conclusion
We’ve done quite a bit in this series, haven’t we?
We started with a relatively large, relatively clean dataset. We looked at its size and class breakdown.
We talked about features we might use to do this classification, and we chose a starting set of features. We did some data processing to represent those features. Then we examined the similarity of the patterns in those features between labeled and unlabeled data to gain some confidence that there was a consistent pattern for our model to detect.
We tried two types of model that offer us fast iteration times, relatively high interpretability, and (in my experience) good results on feature-rich, data-rich classification problems like this one: a logistic regression model and a random forest classifier. We talked a little bit about the attributes of those models, their speed and accuracy, and how to interpret them. Then we divided our labeled data into train, validate, and test sets, ran both models, and took a peek inside at the features they emphasized in their classifications.
Finally, we analyzed the errors: we took a look at what our models got wrong. We chose to run with the random forest classifier for our production model. We learned some more about the relationships between the classes in our data and developed hypotheses as to why the model made the errors it made. We came up with action steps: features we could engineer to test those hypotheses and possibly improve the models. Then we tweaked the hyperparameters on our model to eke out a little more accuracy. We ran our final version on our test set, then trained it on all the labeled data and ran it on the unlabeled data for a final product to inform our contact list.
This general approach might work for some of the other specialties, but probably not for all of them. As is often the case in data science, much of it depends on the problem we’re solving, the value we’re creating, the costs of our mistakes…
…and, of course, the data!
If you liked this piece (and series!) you might also like:
This other whole notebook of data science excitement on a completely different data question
The Framework for Feature Engineering Series (featuring our friend the MNIST dataset)
The Design Patterns for Data Science Series (where engineering practices meet data science challenges)
Exploring Numpy Vectorization (in case this post didn’t get far enough under the hood for you)