Applied Data Science Case Study, Part 5: Fitting Models

Reading Time: 8 minutes

In this series, we’re walking through a data science business problem from start to finish. We have introduced the business caseassessed the data available to us, and performed some exploratory analysis (with a fun detour into parallelizing pandas operations!)

The moment you’ve been waiting for has arrived. It’s time to fit some models.

curve fitting xkcd

Random Baseline

Okay, let’s get a baseline for accuracy with a “model” that randomly assigns specialties as cardiology or not with the same frequency as we see in the labeled data. We’ll use this baseline to determine whether our models perform any better than random chance.

In [15]:
target_frequency = specialties.loc['Cardiology', 'count'] / data_without_unknowns.shape[0]
target_frequency
Out[15]:
0.3318493562748165

About a third of our labaled data is labeled “Cardiology.” We’ll run a random classifier 100 times and take the mean accuracy of those runs so we don’t get fooled by some random outlier that happens to be more or less accurate than random would normally be.

In [16]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

accuracies = []

for i in range(100):
    combined_numerized_procedures['binary_random_baseline_cardiology'] = physicians.specialty.apply(lambda x: 1 if np.random.rand() < target_frequency else 0)
    accuracies.append(accuracy_score(combined_numerized_procedures.binary_cardiology, combined_numerized_procedures.binary_random_baseline_cardiology))

np.asarray(accuracies).mean()
Out[16]:
0.5906484385843164

So our models have to get an accuracy over 60-ish percent to do better than random guessing.

Try some Models

I’m going to start with two models: a logistic regression model (used for classification) and a random forest classifier.

Why these:

  • Both pretty fast
  • Both relatively interpretable
  • I have seen both do well on this kind of problem many times

If they both do really poorly, we can either tweak their hyperparamenters, try different features, or try a different kind of model. This is a starting point with a fast iteration loop, so we’ll get more information fast about whether we need to change something.

In [17]:
from sklearn.model_selection import train_test_split

num_examples = data_without_unknowns.shape[0]

def train_validate_test_split(series):
    probability = np.random.random()
    if probability < 0.65:
        return "train"
    elif probability < 0.85:
        return "validate"
    else:
        return "test"

data_without_unknowns['group'] = data_without_unknowns.apply(train_validate_test_split, axis=1)
assert data_without_unknowns.shape[0] == num_examples

Once again, we end with an assertion to check the shape of our end product. I tend to do this especially when the axis parameter is involved, because in my experience with pandas it’s not consistent whether 0 or 1 refers to rows or columns.

it should be noted also: sklearn has a train_test_split function. I forewent using that function in favor of my own because:

  1. train_test_split doesn’t give us a validation group to iterate on our model before testing. I want this so that I can be sure my changes are improving the model’s generalizable results rather than teaching to the test set.
  2. I want easy access to which examples gave which results so I can do error analysis and look for patterns in the mistakes.
In [18]:
X_train = data_without_unknowns.loc[(data_without_unknowns.group == 'train'), procedure_columns] 
X_validate = data_without_unknowns.loc[(data_without_unknowns.group == 'validate'), procedure_columns]  
X_test = data_without_unknowns.loc[(data_without_unknowns.group == 'test'), procedure_columns] 

y_train = data_without_unknowns.loc[(data_without_unknowns.group == 'train'), 'binary_cardiology'] 
y_validate = data_without_unknowns.loc[(data_without_unknowns.group == 'validate'), 'binary_cardiology']  
y_test = data_without_unknowns.loc[(data_without_unknowns.group == 'test'), 'binary_cardiology'] 

data_without_unknowns.groupby('group')['physician_id'].count()\
    .reset_index()\
    .rename(columns={'physician_id':'count'})
Out[18]:
group count
0 test 3764
1 train 16150
2 validate 5019

^ A note: to a data scientist, the amount of manipulation I do on that groupby might seem odd. I can do the groupby without the index resetting or column naming and get the same result. I do it because the output display with the manipulation looks nicer is easier for folks to read who are unfamiliar with the pandas library.

We have our train, validate, and test groups. It’s time to run some models on that training set.

In [19]:
from sklearn.linear_model import LogisticRegression

lc = LogisticRegression()
lc.fit(X_train, y_train)

lc_predictions = lc.predict(X_validate)
lc_probabilities = [p[1] for p in lc.predict_proba(X_validate)]

Let’s save off these results for error analysis later.

In [20]:
data_without_unknowns.loc[data_without_unknowns.group == 'validate', 'lc_predictions'] = lc_predictions
data_without_unknowns.loc[data_without_unknowns.group == 'validate', 'lc_probabilities'] = lc_probabilities

So how does our accuracy look?

In [21]:
print(accuracy_score(y_validate, lc_predictions))
0.9643355250049811

Let’s take a look at the features it highlights and see if they make sense.

In [22]:
codes_and_procedures = procedures.groupby('procedure_code')['procedure'].unique().reset_index()

CODE_TO_NAME = dict(zip(codes_and_procedures.procedure_code, codes_and_procedures.procedure))

def code_to_name(code):
     return CODE_TO_NAME[code]

We’ll use these helper methods above to match up some interpretable feature names to their weights in the models. We could theoretically pull this out into a helper class or utility file. I haven’t done that yet because:

  • The methods rely very closely on the features we used, so do not want to hide that dependency for now
  • Notebooks are most useful for two things: exploration and presentation. Right now, we are using this method for exploration. Eventually, we might want to use it for replication, say, of an experiment. In these cases, having this method in a helper file or even a state-aware class (that knows which features were used in this training run) might suit us better.
In [23]:
from specialty_data import labeled_sorted_weights

labeled_sorted_weights(lc.coef_[0], procedure_codes, code_to_name).head()
Out[23]:
procedure weight
0 insertion_of_catheter_for_diagnostic_evaluation_of_right_heart_structures 0.083004
1 evaluation_of_lower_heart_chamber_assist_device_with_physician_analysis 0.071833
2 insertion_of_catheter_for_imaging_of_heart_blood_vessels_or_grafts 0.061003
3 insertion_of_catheter_in_left_heart_for_imaging_of_blood_vessels_or_grafts_and_left_lower_heart 0.060338
4 exercise_or_drug-induced_heart_and_blood_vessel_stress_test_with_ekg_monitoring_and_physician_supervision 0.056542

We’re seeing here that the five features that most strongly indicate a cardiology classification all appear to have to do with the heart. That makes sense. Let’s also look at the procedures that most strongly contraindicate a cardiology classification:

In [24]:
labeled_sorted_weights(lc.coef_[0], procedure_codes, code_to_name)[-5:]
Out[24]:
procedure weight
3135 insertion_of_arterial_catheter_for_blood_sampling_or_infusion,_accessed_through_the_skin -0.077253
3136 psychiatric_diagnostic_evaluation_with_medical_services -0.088021
3137 anesthesia_for_lens_surgery -0.088499
3138 anesthesia_for_procedure_on_gastrointestinal_tract_using_an_endoscope -0.102577
3139 cervical_or_vaginal_cancer_screening;_pelvic_and_clinical_breast_examination -0.138399

These features don’t sound anything like cardiology, so they also make sense.

Before we start any error analysis here, let’s look at the random forest classifier and see how it makes this decision.

In [25]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

rfc_predictions = rfc.predict(X_validate)
rfc_probabilities = [p[1] for p in rfc.predict_proba(X_validate)]
//anaconda/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d

Let’s save off these results for error analysis later.

In [26]:
data_without_unknowns.loc[data_without_unknowns.group == 'validate', 'rfc_predictions'] = rfc_predictions
data_without_unknowns.loc[data_without_unknowns.group == 'validate', 'rfc_probabilities'] = rfc_probabilities

So how does our accuracy look?

In [27]:
accuracy_score(y_validate, rfc_predictions)
Out[27]:
0.9735006973500697

What features is the model using?

In [28]:
rfc_weights_df = labeled_sorted_weights(rfc.feature_importances_, procedure_codes, code_to_name)
rfc_weights_df.head()
Out[28]:
procedure weight
0 ultrasound_examination_of_heart_including_color-depicted_blood_flow_rate,_direction,_and_valve_function 0.176591
1 exercise_or_drug-induced_heart_and_blood_vessel_stress_test_with_ekg_monitoring,_physician_interpretation_and_report 0.088641
2 routine_ekg_using_at_least_12_leads_including_interpretation_and_report 0.073315
3 nuclear_medicine_study_of_vessels_of_heart_using_drugs_or_exercise_multiple_studies 0.054052
4 exercise_or_drug-induced_heart_and_blood_vessel_stress_test_with_ekg_monitoring_and_physician_supervision 0.044443

Similar to the logistic regression model, our random forest classifier seems to make decisions chiefly based on procedures that have to do with the heart (ekg stands for electrocardiogram, which is a heart evaluation device).

Feature importance in a random forest classifier doesn’t work quite the same way as weights in a logistic classifier. Rather, the feature importance tells us how much this feature contributes to the classification, in either direction. Lower numbers mean that those features contribute less, all the way down to zero (not contributing), as you see here:

In [29]:
rfc_weights_df[-10:]
Out[29]:
procedure weight
3130 radiation_treatment_delivery,_superficial 0.0
3131 radiation_treatment_delivery,_single_treatment_area 0.0
3132 radiation_treatment_delivery,_two_treatment_areas,_3_or_more_ports 0.0
3133 radiation_treatment_delivery,_three_or_more_treatment_areas 0.0
3134 radiation_treatment_delivery,_three_or_more_treatment_areas 0.0
3135 radiation_treatment_delivery,_three_or_more_treatment_areas 0.0
3136 intensity_modulated_radiation_treatment_delivery_per_session 0.0
3137 radiation_treatment_management,_1_or_2_treatments 0.0
3138 stereotactic_radiation_treatment_management_of_1_or_more_lesions_using_imaging_guidance,_per_treatment_course 0.0
3139 anesthesia_for_procedure_on_salivary_gland_with_biopsy 0.0

That means that, if we go far enough down on the high feature importances, we’re likely to see some of the things that the logistic regression model had the lowest coefficients for, because they are very important indicators of who isn’t a cardiologist. Look:

In [30]:
rfc_weights_df['rank'] = range(0, rfc_weights_df.shape[0])
contraindicative_feature = rfc_weights_df[rfc_weights_df.procedure == 'cervical_or_vaginal_cancer_screening;_pelvic_and_clinical_breast_examination']
contraindicative_feature
Out[30]:
procedure weight rank
155 cervical_or_vaginal_cancer_screening;_pelvic_and_clinical_breast_examination 0.000322 155

This feature has a pretty high rank.

Gauging Accuracy

The overall accuracy on these models is pretty high for a first pass.

A lot of the stuff I work on day-to-day starts out with an accuracy of 60-80%, and we need to get it into the high 90s. This is because my work revolves around the interpretation of legal text, with critical liability risks if the model gets certain things wrong. The larger the consequences of a misclassification, the closer a model needs to get to 100% accuracy. Another situation that might demand perfect accuracy is the use of a model on enormous data sets—the kind with millions of examples to classify, such that 97% accuracy would represent tens of thousands of mistakes.

Our use case has no catastrophic consequences of misclassification (i.e. no deaths, injuries, or crimes). We’re also not in the realm of big data: with about 11,000 examples to classify, a 1% accuracy difference equates to only 110 examples. For us, 96-97% accuracy might be good enough.

Nevertheless, in the next post, we’ll look at some examples of how we might approach raising the accuracy of a model like this.

Conclusion

In this post, we established a random baseline against which to judge the accuracy of our models on our dataset.

Then we chose two different types of model and tried them both out on our dataset.

We opened up those models and examined why they made classifications the way they did, and we talked about the difference between a weight and a feature importance.

In the next post, we’ll do some error analysis on our models—complete with some awesome data visualizations 😎

If you liked this post, you might also like:

The rest of the posts in this applied data science case study (I recommend reading them in the order they are numbered to get the full experience of a data science project)

This keynote I gave about refactoring (complete with a fun 26 minute video of the first half of the talk)

This post on getting mentors (for when you need someone to take a look at your own data science case study!)

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.