Applied Data Science Case Study, Part 6: Error Analysis

Reading Time: 23 minutes

We’ve come a long way. In this series, we’re walking through a data science business problem from start to finish. We have:

  1. introduced the business case,
  2. assessed the data available to us
  3. performed some exploratory analysis ,
  4. stopped off to improve our data processing performance,
  5. then fit some models on our data.

We have now arrived at the most context-intensive step in our process: error analysis.

This is where we look closely at our models to answer a few questions:

  • How do they make decisions?
  • What do they get right, and why?
  • What do they get wrong, and why?
  • What next steps could we take with either the data or the model to improve their utility?

This post is even more into-the-weeds than the other five because it covers precisely the step where data scientists really get into the weeds.

Are you up for that?

Okay then, let’s dive in.

Error Analysis: what do the models get wrong?

Elephant. Prints available for sale by SumaREE art on Etsy. Original image at https://i.etsystatic.com/7785542/r/il/802dbc/1207671500/il_570xN.1207671500_9az2.jpg

Let’s address an elephant in the room: The overall accuracy on the models I’m about to show you is pretty high for a first pass.

A lot of the stuff I work on day-to-day starts out with an accuracy of 60-80%, and we need to get it into the high 90s. This is because my work revolves around the interpretation of legal text, with critical liability risks if the model gets certain things wrong. The larger the consequences of a misclassification, the closer a model needs to get to 100% accuracy. Another situation that might demand perfect accuracy is the use of a model on enormous data sets—the kind with millions of examples to classify, such that 97% accuracy would represent tens of thousands of mistakes.

Our use case has no catastrophic consequences of misclassification (i.e. no deaths, injuries, or crimes). We’re also not in the realm of big data: with about 11,000 examples to classify, a 1% accuracy difference equates to only 110 examples. For us, 96-97% accuracy might be accurate enough to ship as is with no further work, especially as a first pass.

Nevertheless, I’m going to demo some error analysis methods here so you can see how I would approach raising the accuracy of a model like this.

Let’s start with the logistic regression classifier.

In [31]:
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, classification_report, roc_curve, precision_recall_curve

print(classification_report(y_validate, lc_predictions))
             precision    recall  f1-score   support

          0       0.96      0.99      0.97      3351
          1       0.98      0.91      0.94      1668

avg / total       0.97      0.96      0.96      5019

It overwhelmingly gets the classification correct, but it struggles the most with recall of cardiologists. That tells me that some cardiologists are slipping through our classifier undetected. Let’s compare some raw numbers to get an idea of how many:

In [32]:
lc_false_negatives = data_without_unknowns.loc[(data_without_unknowns.binary_cardiology == 1) & (data_without_unknowns.lc_predictions == 0), :]
lc_false_negatives.shape
Out[32]:
(154, 3149)
In [33]:
lc_false_positives = data_without_unknowns.loc[(data_without_unknowns.binary_cardiology == 0) & (data_without_unknowns.lc_predictions == 1), :]
lc_false_positives.shape
Out[33]:
(25, 3149)

So we have a lot more false negatives than false positives with no tweaking to the out-of-box model. Returning to our conversation with the product manager at the top of this investigation, what we do now depends on the cost of a false negative fersus a false positive.

Let’s pretend, for the sake of argument, that a false positive costs us little: non-cardiologists who get our missive mostly ignore it, with a very low likelihood of unsubscribing from our client. Meanwhile, the false negative costs a lot: our client is selling a subscription for a very exciting (and expensive!) cardiology subscription product that has gotten a lot of hype from early adopters on Instagram and Youtube ;). Customer lifetime value is high, and likelihood of buying is relatively high. We want to capture all those subscriptions for our client. In that case, we’d want a very sensitive model that catches every cardiologist, plus some more false positives.

Maybe this is a small-scope change: maybe we can capture those additional cardiologists by changing the prediction probability classification threshold (the default value is 0.5). Let’s take a look at how that threshold affects our precision and recall:

In [34]:
from sklearn.metrics import precision_recall_fscore_support

thresholds = np.linspace(0,1,101)
predictions = [list(map(lambda x: 1 if x > threshold else 0, lc_probabilities)) for threshold in thresholds]
performance_metrics = [precision_recall_fscore_support(y_validate, prediction) for prediction in predictions]

def plot_threshold_performance(predictions, performance_metrics): 
    ax = plt.figure(figsize=(20,10))

    plt.plot(thresholds, [p[0][0] for p in performance_metrics], 'r', label='precision on non-cardiologists')
    plt.plot(thresholds, [p[0][1] for p in performance_metrics], 'g', label='precision on cardiologists') 
    plt.plot(thresholds, [p[1][0] for p in performance_metrics], 'd', label='recall on non-cardiologists') 
    plt.plot(thresholds, [p[1][1] for p in performance_metrics], 'p', label='recall on cardiologists')

    plt.title("Precision and Recall by Threshold")
    plt.ylabel("Precision and Recall")
    plt.xlabel("Threshold")
    plt.xticks(np.linspace(0, 1, 11))

    plt.legend(loc="lower left", prop={'size': 16})
    plt.ylim([0.5,1])
    plt.xlim([0,1])
    
plot_threshold_performance(predictions, performance_metrics)
//anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

Again, I’ve futzed with the zoom a bit on this graph so that the interesting part fills the viewspace.

I’m seeing an important tradeoff area near the 0.4 threshold. The cost of false negatives versus false positives determines where we want to put this tradeoff.

Out of curiosity, which threshold gives us the highest raw accuracy?

In [35]:
accuracies = [accuracy_score(y_validate, prediction) for prediction in predictions]
max(accuracies)
Out[35]:
0.9663279537756525
In [36]:
max_index = accuracies.index(max(accuracies))
thresholds[max_index]
Out[36]:
0.45

So if false positives and false negatives cost us the same, we can tweak the threshold to raise the raw accuracy of this model.

Other options at our disposal for getting a more accurate model here:

  1. Grid searching over a range of hyperparameters (like penalty type or regularization value (C)) to tune the model to the hyperparameters that give the best accuracy.
  2. Analyzing the specific cases where the model misclassified things to ask “How do these cases differ from the correctly classified cases?” and desiging features to help the model better capture those cases.

Before we do more optimization on this model, though, let’s see how the random forest classifier did.

In [37]:
print(accuracy_score(y_validate, rfc_predictions))
0.9735006973500697

Out of the box, this raw accuracy is higher than that of our logistic regression model, even the threshold-optimized version.

Random forest classifiers pick up accuracy from ensembling (combining the signals of those trees, a collection of weak classifiers, into one strong classifier). The tradeoff: their results, while still interpretable, are a little harder to interpret than the logistic regression ones. This is because a) you can draw a picture of the entire model as a tree chart, but it’s hard to see the whole thing at once and it’s a bit confusing with all the individual decision trees in there, and b) feature importance does not tell us whether the feature indicates or contraindicates the class. We can still make an educated guess at this information by looking at how a given feature correlates with the random forest’s predict_proba probabilities, but it’s not as automatic as it is with a logistic regression model’s coefficients.

So it’s not unusual for the random forest classifier to fit a little better, but error analysis can be a little more involved.

Let’s see where the model is getting things wrong.

In [38]:
print(classification_report(y_validate, rfc_predictions))
             precision    recall  f1-score   support

          0       0.97      0.99      0.98      3351
          1       0.98      0.94      0.96      1668

avg / total       0.97      0.97      0.97      5019

In [39]:
rfc_false_negatives = data_without_unknowns.loc[(data_without_unknowns.binary_cardiology == 1) & (data_without_unknowns.rfc_predictions == 0), :]
rfc_false_negatives.shape
Out[39]:
(95, 3149)
In [40]:
rfc_false_positives = data_without_unknowns.loc[(data_without_unknowns.binary_cardiology == 0) & (data_without_unknowns.rfc_predictions == 1), :]
rfc_false_positives.shape
Out[40]:
(38, 3149)

This model starts off with a couple dozen fewer mistakes than the logistic regression classifier. Let’s take a look at some of those cardiologists that the model didn’t catch:

In [41]:
rfc_false_positives.groupby('specialty')['physician_id'].count()
Out[41]:
specialty
Anesthesiology                1
Cardiac Electrophysiology     8
Emergency Medicine            1
Endocrinology                 1
Family Practice               6
Infectious Disease            1
Internal Medicine            19
Thoracic Surgery              1
Name: physician_id, dtype: int64

So here’s the first thing that strikes me: two of the specialties of the false positive physicians, cardiac electrophysiology and cardiac surgery, sound like they would probably perform a lot of the same procedures as a cardiologist (I looked this up—a cardiologist might refer a patient to a physician with one of those two specialties depending on their assessment of the patient’s condition).

This again begets the question: what is our campaign for? If it’s for some heart-treatment-related thing, we might actually want to add all cardiac specialties to our target list. That might give us more likely prospects in addition to knocking out some of these false positives.

I’m also seeing a relatively large number of false positives in the internal medicine department. This begets a question: how similar are the procedures performed by an internal medicine physician and a cardiologist, on averge?

We can get an answer to that by calculating the cosine similarity of the mean procedure vectors for each specialty. (Yes, this is absolutely overkill for a model with a first-pass accuracy of 97%, but now I’m interested).

In [42]:
mean_procedure_vectors = data_without_unknowns.groupby('specialty')[procedure_columns].mean()
In [43]:
from scipy import spatial

def cosine_similarity(specialty, other_specialty):
    return 1 - spatial.distance.cosine(specialty, other_specialty)

Cosine similarity varies between 0 (no similarity at all) and 1 (these things are the same). So if we get the cosine similarity of a specialty with itself, we get 1 (plus some floating point fun).

In [44]:
cosine_similarity(mean_procedure_vectors.loc['Addiction Medicine', :], mean_procedure_vectors.loc['Addiction Medicine', :])
Out[44]:
1.0000000000000002

All right, time to answer the question I only just realized I’ve had all my life: which physician specialties are most similar to others by procedure pattern?

In [45]:
from sklearn.metrics import pairwise_distances

arrayed_procedures = np.asarray(mean_procedure_vectors[procedure_columns])
cosine_similarities = 1 - pairwise_distances(arrayed_procedures, metric="cosine")

def cosine_similarity_heatmap(cosine_similarities):
    fig = plt.figure(figsize=(20,20))

    fig, ax = plt.subplots(figsize=(20,20))
    im = ax.imshow(cosine_similarities, cmap="YlGnBu")

    ax.set_xticks(np.arange(len(mean_procedure_vectors.index.values)))
    ax.set_yticks(np.arange(len(mean_procedure_vectors.index.values)))
    ax.set_xticklabels(mean_procedure_vectors.index.values)
    ax.set_yticklabels(mean_procedure_vectors.index.values)
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
    plt.title("Heatmap of Cosine Similarity Between Mean Procedure Counts by Specialty")

    plt.show()

cosine_similarity_heatmap(cosine_similarities)

This heatmap is sort of hard to see. I can put it all in the frame at once (like it is now), or I can make it big enough to read those labels. Not both. Another common way to express this information pictographically is to perform some kind of dimensionality reduction technique (in this case, for example, t-SNE would be an appropriate choice) and then display all the specialties as points in a two or three dimensional space, with more similar specialties’ points appearing close to each other. It’s also a good approach. One downside of it is that it’s confusing that the X and Y axes don’t represent dimensions. Instead, we have a lot of dimensions (in this case 3,140 dimensions) that our human brains can’t visualize all at once, so we take the salient information—distance between these points—and arrange them in a sort of neighborhood based on who should be close to whom.
See that diagonal line of black squares? Those cosine similarities are all the max value (1) because that’s where we see each specialty’s cosine similarity with itself. Light color squares show where the two specialties had no similarity. The darker the square, the more similar the procedure patterns of these two specialties. While we’re here, take a look at Cardiology relative to Cardiac Surgery and Cardiac Electrophysiology. those squares are all pretty dark, meaning those specialties do a lot of the same procedures. But this all started because we saw a bunch of internal medicine physicians labeled as cardiologists. That square is darkish, but not ultra-dark. In fact, let’s get the number:
In [46]:
cosine_similarity(mean_procedure_vectors.loc['Cardiology', :], mean_procedure_vectors.loc['Internal Medicine', :])
Out[46]:
0.5852663095224359

The procedures are not totally different, but they’re not precisely similar either. So what gives? Let’s pull up the procedure records of thes 17 physicians and see if we notice anything.

In [47]:
internists_labeled_as_cardiologists = rfc_false_positives.physician_id
internists_labeled_as_cardiologists_procedures = physician_procedures[physician_procedures.physician_id.isin(internists_labeled_as_cardiologists)].copy()
grouped_internist_cardiology_procedures = internists_labeled_as_cardiologists_procedures.groupby('procedure_code')['number_of_patients']\
    .sum()\
    .reset_index()\
    .sort_values(by=['number_of_patients'], ascending=False)
grouped_internist_cardiology_procedures['procedure'] = grouped_internist_cardiology_procedures.procedure_code.apply(code_to_name)
grouped_internist_cardiology_procedures.head(20)
Out[47]:
procedure_code number_of_patients procedure
83 93010 6027 [routine_electrocardiogram_(ekg)_using_at_least_12_leads_with_interpretation_and_report]
177 99214 5195 [established_patient_office_or_other_outpatient,_visit_typically_25_minutes]
81 93000 3889 [routine_ekg_using_at_least_12_leads_including_interpretation_and_report]
116 93306 3464 [ultrasound_examination_of_heart_including_color-depicted_blood_flow_rate,_direction,_and_valve_function]
176 99213 3246 [established_patient_office_or_other_outpatient_visit,_typically_15_minutes]
14 36415 1914 [insertion_of_needle_into_vein_for_collection_of_blood_sample]
112 93296 1736 [remote_evaluations_of_single,_dual,_or_multiple_lead_pacemaker_or_cardioverter-defibrillator_transmissions,_technician_review,_support,_and_distribution_of_results_up_to_90_days]
30 80053 1637 [blood_test,_comprehensive_group_of_blood_chemicals]
98 93280 1478 [evaluation,_testing,_and_programming_adjustment_of_permanent_dual_lead_pacemaker_system_with_physician_analysis,_review,_and_report]
31 80061 1337 [blood_test,_lipids_(cholesterol_and_triglycerides)]
64 85025 1262 [complete_blood_cell_count_(red_cells,_white_blood_cell,_platelets),_automated_test]
59 84443 1237 [blood_test,_thyroid_stimulating_hormone_(tsh)]
185 99232 1235 [subsequent_hospital_inpatient_care,_typically_25_minutes_per_day]
178 99215 1056 [established_patient_office_or_other_outpatient,_visit_typically_40_minutes]
198 G0008 1026 [administration_of_influenza_virus_vaccine]
110 93294 955 [remote_evaluations_of_single,_dual,_or_multiple_lead_pacemaker_with_physician_analysis,_review,_and_report_up_to_90_days]
111 93295 912 [remote_evaluations_of_single,_dual,_or_multiple_lead_cardioverter-defibrillator_with_physician_analysis,_review,_and_report_up_to_90_days]
57 84436 845 [thyroxine_(thyroid_chemical)_measurement]
172 99204 773 [new_patient_office_or_other_outpatient_visit,_typically_45_minutes]
74 90662 687 [vaccine_for_influenza_for_injection_into_muscle]

We’re looking at the most performed procedures among our mislcassified internal medicine physicians. I get why they got labeled cardiologists: a lot of these are also procedures that a cardiologist would perform a lot. The top one, for example, is an ekg procedure. Internal medicine doctors do heart and circulatory ultrasounds as well as blood tests. I’ll bet those rank high in feature importance for cardiologists.*

*Admittedly feature importance in a random forest can mean the feature is either strongy indicative or strongly contraindicative, but the data patterns and our understanding of medicine both point to these being indicative.

Let’s see if I’m right:

In [48]:
internists_labeled_cardiologists_top_procedures = [p[0] for p in grouped_internist_cardiology_procedures.head(10).procedure]
rfc_weights_df[rfc_weights_df.procedure.isin(internists_labeled_cardiologists_top_procedures)]
Out[48]:
procedure weight rank
0 ultrasound_examination_of_heart_including_color-depicted_blood_flow_rate,_direction,_and_valve_function 0.176591 0
2 routine_ekg_using_at_least_12_leads_including_interpretation_and_report 0.073315 2
5 routine_electrocardiogram_(ekg)_using_at_least_12_leads_with_interpretation_and_report 0.042066 5
13 evaluation,_testing,_and_programming_adjustment_of_permanent_dual_lead_pacemaker_system_with_physician_analysis,_review,_and_report 0.014275 13
16 established_patient_office_or_other_outpatient_visit,_typically_15_minutes 0.013168 16
19 established_patient_office_or_other_outpatient,_visit_typically_25_minutes 0.011293 19
22 remote_evaluations_of_single,_dual,_or_multiple_lead_pacemaker_or_cardioverter-defibrillator_transmissions,_technician_review,_support,_and_distribution_of_results_up_to_90_days 0.009366 22
73 insertion_of_needle_into_vein_for_collection_of_blood_sample 0.001366 73
114 blood_test,_comprehensive_group_of_blood_chemicals 0.000565 114
134 blood_test,_lipids_(cholesterol_and_triglycerides) 0.000442 134

Boom. Our most-performed procedures among internal medicine physicians falsely classified as cardiologists carry top ranks in the model.

I suspect the separating agent here, if there is one, is going to be which procedures the internal medicine doctors dont perform: that is, what makes them different from a cardiologist correctly classified as a cardiologist. We could do this by grouping top procedures for cardiologists or by looking at the top-ranked procedures in the model that internal medicine doctors don’t do and seeing if they’re different from the ones they do.

In [49]:
rfc_weights_df.head(20)
Out[49]:
procedure weight rank
0 ultrasound_examination_of_heart_including_color-depicted_blood_flow_rate,_direction,_and_valve_function 0.176591 0
1 exercise_or_drug-induced_heart_and_blood_vessel_stress_test_with_ekg_monitoring,_physician_interpretation_and_report 0.088641 1
2 routine_ekg_using_at_least_12_leads_including_interpretation_and_report 0.073315 2
3 nuclear_medicine_study_of_vessels_of_heart_using_drugs_or_exercise_multiple_studies 0.054052 3
4 exercise_or_drug-induced_heart_and_blood_vessel_stress_test_with_ekg_monitoring_and_physician_supervision 0.044443 4
5 routine_electrocardiogram_(ekg)_using_at_least_12_leads_with_interpretation_and_report 0.042066 5
6 exercise_or_drug-induced_heart_and_blood_vessel_stress_test_with_ekg_monitoring,_physician_supervision,_interpretation,_and_report 0.041578 6
7 insertion_of_catheter_in_right_and_left_heart_for_imaging_of_blood_vessels_or_grafts_and_left_lower_heart 0.018026 7
8 initial_hospital_inpatient_care,_typically_50_minutes_per_day 0.017725 8
9 injection,_regadenoson,_0.1_mg 0.016546 9
10 technetium_tc-99m_sestamibi,_diagnostic,_per_study_dose 0.016075 10
11 doppler_ultrasound_study_of_heart_blood_flow,_valves,_and_chambers 0.015891 11
12 insertion_of_catheter_in_left_heart_for_imaging_of_blood_vessels_or_grafts_and_left_lower_heart 0.015382 12
13 evaluation,_testing,_and_programming_adjustment_of_permanent_dual_lead_pacemaker_system_with_physician_analysis,_review,_and_report 0.014275 13
14 follow-up_or_limited_ultrasound_examination_of_heart 0.013578 14
15 insertion_of_probe_in_esophagus_for_heart_ultrasound_examination_including_interpretation_and_report 0.013374 15
16 established_patient_office_or_other_outpatient_visit,_typically_15_minutes 0.013168 16
17 remote_evaluations_of_single,_dual,_or_multiple_lead_pacemaker_with_physician_analysis,_review,_and_report_up_to_90_days 0.011943 17
18 insertion_of_catheter_in_left_heart_for_imaging_of_blood_vessels_or_grafts_and_left_lower_heart 0.011860 18
19 established_patient_office_or_other_outpatient,_visit_typically_25_minutes 0.011293 19

You know what I’m noticing? The procedures that both these internal medicine physicians and cardiologists do a lot are not super invasiveand not super specialized: they’re blood tests, stress tests, ultrasounds, and routine ekgs.

What gets top ranks in the model that these internal medicine doctors don’t do? Inserting a catheter into the heart for imaging. Inserting probes into the esophagus for heart ultrasounds. Doppler ultrasounds. Constant tests over long periods. These are more invasive, more specialized procedures.

So here are the features I would consider trying to engineer here:

If it’s really, really important for some reason that we scoop up these distinctions our model missed, I might try to generate features that capture the level of invasiveness, specialization, or diversity of the procedures physicians are performing. Internal medicine doctors handle all kinds of stuff (high diversity of procedures), but in general when it’s time to go really deep they refer to someone else (lower specialization and lower invasiveness of procedures).

So we looked at the specialties with the most false positives.

We only see one or two false positives in other specialties, so we’ll leave those alone for now and turn to the false negatives.

In [50]:
missed_cardiologists = rfc_false_negatives.physician_id
missed_cardiologists_procedures = physician_procedures[physician_procedures.physician_id.isin(missed_cardiologists)].copy()
grouped_missed_cardiologist_procedures = missed_cardiologists_procedures.groupby('procedure_code')['number_of_patients']\
    .sum()\
    .reset_index()\
    .sort_values(by=['number_of_patients'], ascending=False)
grouped_missed_cardiologist_procedures['procedure'] = grouped_missed_cardiologist_procedures.procedure_code.apply(code_to_name)
grouped_missed_cardiologist_procedures.head(20)
Out[50]:
procedure_code number_of_patients procedure
138 93010 5138 [routine_electrocardiogram_(ekg)_using_at_least_12_leads_with_interpretation_and_report]
184 99214 5002 [established_patient_office_or_other_outpatient,_visit_typically_25_minutes]
183 99213 4236 [established_patient_office_or_other_outpatient_visit,_typically_15_minutes]
195 99232 2590 [subsequent_hospital_inpatient_care,_typically_25_minutes_per_day]
136 93000 2524 [routine_ekg_using_at_least_12_leads_including_interpretation_and_report]
30 36415 2248 [insertion_of_needle_into_vein_for_collection_of_blood_sample]
192 99223 1972 [initial_hospital_inpatient_care,_typically_70_minutes_per_day]
196 99233 1709 [subsequent_hospital_inpatient_care,_typically_35_minutes_per_day]
162 93880 1432 [ultrasound_scanning_of_blood_flow_(outside_the_brain)_on_both_sides_of_head_and_neck]
167 93970 1049 [ultrasound_scan_of_veins_of_both_arms_or_legs_including_assessment_of_compression_and_functional_maneuvers]
83 80053 990 [blood_test,_comprehensive_group_of_blood_chemicals]
84 80061 947 [blood_test,_lipids_(cholesterol_and_triglycerides)]
121 85025 947 [complete_blood_cell_count_(red_cells,_white_blood_cell,_platelets),_automated_test]
198 99238 902 [hospital_discharge_day_management,_30_minutes_or_less]
144 93306 737 [ultrasound_examination_of_heart_including_color-depicted_blood_flow_rate,_direction,_and_valve_function]
212 G0008 725 [administration_of_influenza_virus_vaccine]
179 99204 707 [new_patient_office_or_other_outpatient_visit,_typically_45_minutes]
185 99215 586 [established_patient_office_or_other_outpatient,_visit_typically_40_minutes]
191 99222 578 [initial_hospital_inpatient_care,_typically_50_minutes_per_day]
97 82570 541 [creatinine_level_to_test_for_kidney_function_or_muscle_injury]

I’m showing the top 20 here, but I looked through about the first 200 of these. I notice that:

  1. A lot of these are quite vague—for example, office visits and hospital care for varying lengths of time. A lot pf physicians do those, of many specialties.
  2. The cardiologists do appear to do fewer of their most-performed procedures than, say, the internists we looked at before. The misclassified internists performed more routine ekgs—80% more, in fact—than these cardiologists, even though there are more than 5 times as many cardiologists contributing to this grouping.

These findings dovetail with the possibilities mentioned in the false positive section above: if we need to inch this model closer to 100% accuracy so badly that it’s worth a significant time investment, we could look into emphasizing those highly specialized procedures that cardiologists do relatively few of, but a different specialty would do none of.

I’m not going to jump into engineering those features now. I’ll leave it as a future option, if we need it, and move on.

Once we had all the features we wanted in the model, we could try some hyperparameter tuning to tweak it further.

For example, let’s see if the threshold on this one is the optimal one, the way we did with the logistic regression model.

In [51]:
predictions = [list(map(lambda x: 1 if x > threshold else 0, rfc_probabilities)) for threshold in thresholds]
performance_metrics = [precision_recall_fscore_support(y_validate, prediction) for prediction in predictions]

plot_threshold_performance(predictions, performance_metrics)
//anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
In [52]:
accuracies = [accuracy_score(y_validate, prediction) for prediction in predictions]
max(accuracies)
Out[52]:
0.9735006973500697
In [53]:
max_index = accuracies.index(max(accuracies))
best_threshold = thresholds[max_index]
best_threshold
Out[53]:
0.5

Let’s keep running with this model and do a little more tuning:

In [54]:
rfc.get_params()
Out[54]:
{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_split': 1e-07,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': 1,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

Let’s do a grid search over some of these hyperparameters. We’ll try this out with n_estimators (the number of trees in the model), max_features (the number of features to consider at each split point), and max depth (the number of levels each tree can have). We could tune some of the others, too, if we liked, but these ones in my experience give the most bang for buck, with buck being time invested.

Random forest classifiers pick up speed from bagging (running decision tree classifiers on subsets of the data). It combines the predictions of those trees at the end (ensembling). Because each individual tree trains on a subset of the data and all those trees can train concurrently, a ramdon forest classifier often runs even faster than a single decision tree on the same dataset.

We’re going to take advantage of that speed and run this thing a whole bunch of times to compare some tuning options.

We are not planning to do this constantly. We’ll get the optimal parameters from this, then try them on our classifier.

In [55]:
from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 5)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 6)]

search_options = {'n_estimators': n_estimators, 
               'max_features': max_features,
               'max_depth': max_depth} 

rfc_search = RandomizedSearchCV(estimator = rfc, param_distributions = search_options, n_iter = 60, cv = 3, verbose=1, random_state=1, n_jobs = -1)
rfc_search.fit(X_train, y_train)

rfc_best_params = rfc_search.best_params_
Fitting 3 folds for each of 60 candidates, totalling 180 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed: 59.1min finished
In [56]:
rfc_hyperparameters_optimized  = RandomForestClassifier(max_depth=rfc_best_params['n_estimators'], max_features=rfc_best_params['max_features'], n_estimators=rfc_best_params['max_depth'])
rfc_hyperparameters_optimized.fit(X_train, y_train)

rfc_optimized_probabilities = rfc_hyperparameters_optimized.predict_proba(X_validate)[:,1]
rfc_optimized_predictions = list(map(lambda x: 1 if x > best_threshold else 0, rfc_optimized_probabilities))
accuracy_score(y_validate, rfc_optimized_predictions)
Out[56]:
0.9744969117354054

As you can see, all that tuning snagged us a small additional accuracy (for a model that takes much longer to run, I might add). I have rarely seen this kind of tuning make an enormous difference, particularly compared to feature engineering, class consolidation, or additional data cleaning. So I’ll optimize on iteration speed at the beginning of a project with default settings, fewer trees, or less max depth, cycle on those other optimization strategies, then swing back to hyperparameter tuning at the end.

But, it’s the highest accuracy we’ve seen. Let’s run this on the test set and see how we do.

In [57]:
rfc_optimized_test_probabilities = rfc_hyperparameters_optimized.predict_proba(X_test)[:,1]
rfc_optimized_test_predictions = list(map(lambda x: 1 if x > best_threshold else 0, rfc_optimized_test_probabilities))
accuracy_score(y_test, rfc_optimized_test_predictions)
Out[57]:
0.9787460148777896

That’s our accuracy on our test set.

Time to run our model on our unlabeled data and hope that it correctly labels a lot of cardiologists (our info tells us it will). To give it a little extra chance, we can fit it on the entirety of the labeled data first.

In [58]:
rfc_hyperparameters_optimized.fit(data_without_unknowns[procedure_columns], data_without_unknowns['binary_cardiology'])
model_name = 'binary_cardiology_model_1'

rfc_production_probabilities = rfc_hyperparameters_optimized.predict_proba(unknown_data[procedure_columns])[:,1]
rfc_production_predictions = list(map(lambda x: 1 if x > best_threshold else 0, rfc_production_probabilities))

unknown_data['predicted_cardiologist'] = rfc_production_predictions
unknown_data['predicted_by'] = model_name
unknown_data[['physician_id', 'predicted_cardiologist', 'predicted_by']].to_csv('results/production_cardiology_classifications.csv')

The production_cardiology_classifications.csv file contains the ids of the unlabeled physicians and the model predictions of whether or not they are cardiologists. We can use this to plan who to contact for this campaign.

In [59]:
from persistence import Experiment

first_version = Experiment(
    name=model_name,
    description="First version of cardiology binary model", 
    feature_names=procedure_columns, 
    training_data_path="data/processed_input_data.csv", 
    classification_threshold=best_threshold
)    
    
first_version.save()

Finally, we save off our model in case we need to access it later. We might need it later for a lot of things:

  • To get predictions for new data
  • To use as a base line comparison while trying to improve this model with new architecture, features, or hyperparameters
  • To examine as part of a bias audit or technical audit
  • To keep track, in the event of multiple models exercising multiple ontologies on diffferent parts of the same table over time

Paired with this, in addition to saving our production predictions, we saved the name of the model with which we made those predictions.

For now, we’re done with our cycle on this specific model.

Recommendations regarding using this approach for other specialties

Earlier in this investigation, we looked at the breakdown of specialties in the labeled data. We saw that cardiology makes up by far the largest part of this data. Here’s the plot again, for reference:

In [60]:
specialties.plot.pie(y='count', figsize=(10, 10), colors=COLORBLIND_FRIENDLY_COLORS, fontsize=16, legend=False)
Out[60]:

The next largest slice, internal medicine, is less than half the size of cardiology. We have a lot fewer examples from which to extract information about how any other specialties’ procedure pattern differs from the others. To do this same approach on another specialty, my first choice would be to collect more data in the target specialty. If we can’t do that, then in some of these larger classes, we might have enough samples to still get a useful result.
In [61]:
print("Number of Internal Medicine Physicians: {}".format(data_without_unknowns[data_without_unknowns['specialty'] == "Internal Medicine"].shape[0]))
print("Number of Family Practice Physicians: {}".format(data_without_unknowns[data_without_unknowns['specialty'] == "Family Practice"].shape[0]))
Number of Internal Medicine Physicians: 2925
Number of Family Practice Physicians: 2197

For example, we do have almost 3,000 examples for internal medicine and more than 2,000 examples for family practice physicians.

We can work with this, though it’s likely to be worth balancing the samples to train classifiers on classes like these.

For some of these specialties, though:

In [62]:
print("Number of Colorectal Surgery (formerly proctology) physicians: {}".format(data_without_unknowns[data_without_unknowns['specialty'] == "Colorectal Surgery (formerly proctology)"].shape[0]))
print("Number of Registered Dietician/Nutrition Professionals: {}".format(data_without_unknowns[data_without_unknowns['specialty'] == "Registered Dietician/Nutrition Professional"].shape[0]))
print("Number of CRNA Physicians: {}".format(data_without_unknowns[data_without_unknowns['specialty'] == "CRNA"].shape[0]))
Number of Colorectal Surgery (formerly proctology) physicians: 48
Number of Registered Dietician/Nutrition Professionals: 9
Number of CRNA Physicians: 1

We have relatively few examples of these. Out of 25,000 labeled rows, each of these specialties makes up such a teeny number. These classes are small enough that, if we absolutely must try to detect them given this training data, even the most sample-conscious classification might not catch them. We’d likely have to do some combination of clever feature engineering and/or an anomaly detection type of model to get the result we need on these.

Conclusion

We’ve done quite a bit in this series, haven’t we?

We started with a relatively large, relatively clean dataset. We looked at its size and class breakdown.

We talked about features we might use to do this classification, and we chose a starting set of features. We did some data processing to represent those features. Then we examined the similarity of the patterns in those features between labeled and unlabeled data to gain some confidence that there was a consistent pattern for our model to detect.

We tried two types of model that offer us fast iteration times, relatively high interpretability, and (in my experience) good results on feature-rich, data-rich classification problems like this one: a logistic regression model and a random forest classifiers. We talked a little bit about the attributes of those models, their speed and accuracy, and how to interpret them. Then we divided our labeled data into train, validate, and test sets, ran both models, and took a peek inside at the features they emphasized in their classifications.

Finally, we analyzed the errors: we took a look at what our models got wrong. We chose to run with the random forest classifier for our production model. We learned some more about the relationships between the classes in our data and developed hypotheses as to why the model would make the errors it made. We came up with action steps: features we could engineer to test those hypotheses and possibly improve the models. Finally, we tweaked the hyperparameters on our model to eke out a little more accuracy. We ran our final version on our test set, then trained it with the labeled data and ran it on the unlabeled data for a final product to inform our contact list.

This general approach might work for some of the other specialties, but probably not for all of them. As is often the case in data science, much of it depends on the problem we’re solving, the value we’re creating, the costs of our mistakes…

…and, of course, the data!

If you liked this piece (and series!) you might also like:

This other whole notebook of data science excitement on a completely different data question

The Framework for Feature Engineering Series (featuring our friend the MNIST dataset)

The Design Patterns for Data Science Series (where engineering practices meet data science challenges)

Exploring Numpy Vectorization (in case this post didn’t get far enough under the hood for you)

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.