Data Safety: The “Knife Skills” of Quantitative Programming


I teach programming classes, and I send each new student body a survey before class begins. The survey asks questions like “What are you hoping to get out of this class?” (Fun fact: if I ask “why are you taking this class,” half a dozen wise guys say “because I need it to graduate.”)

For my Python Programming class, several students respond that they’re preparing for careers in data science or data engineering. A couple mention machine learning. These roles require the ability to interpret data and draw conclusions from it.

That doesn’t mean people in these roles have this ability.

I spent the last year and a half of my career supporting scientific research projects. It’s super common for researchers to lack rigor in their data analysis: using means with no error bars. Making claims without the requisite statistical power. Asking test subjects the wrong questions. I’ve seen these things in astronomical, biological, environmental, and medical research.

The data professions themselves do even worse. I worked as a data scientist before I went indie. The company had one main natural language processing model with moderate accuracy beyond random chance. One day, my colleague and I were pairing on something, and we discovered that the model—which we had hired two self-styled subject matter expert data scientists to build—included test data in the training set. When we took the test data out, retrained, and ran against the test set, the model performed no better than random chance. We flagged the issue. The veep closed it without reply after we had both left the company.

There is power, and therefore danger, and therefore responsibility, in being a data professional.

  • The power: we use numbers to influence decisions at scary-powerful corporations, on products that touch millions of lives, and on public opinion itself.
  • The danger: there are a lot of ways to generate numbers, and one number doesn’t necessarily look any more suspicious than another number.
  • The responsibility: to understand the implications of a number and make sure to represent it accurately.

It’s easy for research organizations with different slants, for example, to draw two different conclusions from the same data. But what we’re going to talk about today is not intentional slant. Instead, we’re going to talk about honest but costly mistakes that often happen in bona fide scientific papers. The researchers did not intentionally slant the paper—they just didn’t have statistical safeguards in place to come to a sound and accurate conclusion.

So, just as chefs need knife skills, technologists need statistical safety skills.

This will be the first in a two-part series on the subject.

The written version is below, but if you prefer your info in video format, here ya go (this video covers parts 1 and 2 of the blog series).

We use data to build models.

In the literal sense, we use them to build machine learning models, but we also use them to shape our mental models—our understandings of cause and effect. Models are almost never 100% accurate—rather, they’re a simplified representation of a phenomenon that gives us a way to wrap our heads around it.

When we analyze a set of numbers, we have to keep in mind that that set doesn’t represent all the numbers in the world. Take, for example, height. How tall are you? Could you use that information to extrapolate that everybody in the world is the same height as you?

Of course not! That’s just one data point. Now imagine we took the heights of everyone in your neighborhood. Would that give us a better idea of how tall people are, in the world, in general, than just your height? It would, but it still wouldn’t be the full picture. Your neighborhood probably doesn’t have the tallest or the shortest person in the world. The average height also might not really match the average height of everyone in the world, especially because huge parts of the world’s population are completely not represented in your neighborhood.

Most data analysis attempts to draw conclusions about something based on an incomplete subset of the total data about that thing. Statistical methods give us a way to quantify how wrong we might be about our extrapolations.
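Here is a minimal sketch of the height example. The "world" is a made-up population drawn from a random number generator, not real height data, and the neighborhood is a random sample of it. That is the best case: a real neighborhood would also miss whole groups of people.

```python
import random
import statistics

# Hypothetical "world": 100,000 heights in cm, made up for illustration only.
rng = random.Random(42)
population = [rng.gauss(165, 10) for _ in range(100_000)]
population_mean = statistics.fmean(population)

# A "neighborhood" is a small random subset of that population.
neighborhood = rng.sample(population, 50)
neighborhood_mean = statistics.fmean(neighborhood)

# One person is the smallest possible sample.
one_person = neighborhood[0]

print(f"Population mean:   {population_mean:.1f} cm")
print(f"Neighborhood mean: {neighborhood_mean:.1f} cm")
print(f"One person:        {one_person:.1f} cm")
print(f"Tallest in neighborhood vs. world: {max(neighborhood):.1f} vs. {max(population):.1f} cm")
```

The neighborhood mean lands near the population mean but not on it, and the neighborhood's extremes can never exceed the population's.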

Enter the distribution and the standard deviation.

I want to stay focused on statistical safety for this post, and the distribution and standard deviation are baseline knowledge for that discussion. Here’s a review of those concepts from Math is Fun. Also, here’s the four-minute clip from the video linked above where I introduce these two concepts.
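As a quick refresher in code, here is a small sketch with two made-up datasets that share a mean but differ in spread. It uses Python's standard statistics module, whose stdev function computes the sample standard deviation (dividing by n - 1), the same convention pandas uses by default.

```python
import statistics

# Two made-up datasets with the same mean but different spread.
tight = [9, 10, 10, 10, 11]
spread = [2, 6, 10, 14, 18]

for name, values in [("tight", tight), ("spread", spread)]:
    mean = statistics.mean(values)
    # statistics.stdev is the sample standard deviation (divides by n - 1).
    std = statistics.stdev(values)
    print(f"{name}: mean={mean}, std={std:.2f}")
```

The mean alone cannot tell these two datasets apart; the standard deviation can.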

We need those two concepts to understand the next important concept, which is confidence intervals. As I mentioned, this piece is part one in a two-part series, and only one data safety calculation made it into the first part because it’s that important. It’s confidence intervals.

Confidence Intervals

A confidence interval helps us answer the question: “Based on how much data we have sampled and how much the values in that sample differ from each other, how far away from our sample mean could the true mean be?”

The 95% confidence interval is a range of values that you can be 95% certain contains the true mean of the population. As the sample size increases, the range of interval values will narrow, meaning that you know that mean with much more accuracy compared with a smaller sample. – Simply Psychology

Image Credit: Stanley Chan, Purdue University, 2017

In the above image, the top two pictures compare the confidence intervals for data with a small standard deviation versus a large one. When the numbers are more spread out, the confidence interval around their mean is larger.

The bottom two pictures compare confidence intervals for a large dataset versus a small one. When we have fewer numbers, we can be less sure exactly where our mean is, so our confidence interval around it is larger.
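Both effects fall out of the usual formula for a confidence interval around a mean, sketched here in LaTeX. The symbols are notation assumed for this sketch: x-bar is the sample mean, s is the sample standard deviation, n is the sample size, alpha is one minus the confidence level, and t is the critical value from the t-distribution with n - 1 degrees of freedom.

```latex
\bar{x} \;\pm\; t_{1-\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}}
```

A larger s widens the interval directly. A larger n narrows it twice over: it grows the denominator and it shrinks the critical value.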

Typically we use the normal distribution for calculating confidence intervals when we have more than about 120 samples. For smaller samples (under 120), we can use the wider, flatter t-distribution, which becomes nearly indistinguishable from the normal distribution at about 120 samples and above.
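To see how quickly the two distributions converge, here is a small sketch that compares their critical values for a 95% interval. The sample sizes are arbitrary choices for illustration.

```python
from scipy.stats import norm, t

confidence = 0.95
upper_tail = 1 - (1 - confidence) / 2  # 0.975 for a 95% interval

# Critical value from the normal distribution (does not depend on sample size).
z = norm.ppf(upper_tail)
print(f"normal: {z:.3f}")

# Critical values from the t-distribution for a range of sample sizes.
for n in [3, 10, 30, 120, 1000]:
    t_crit = t.ppf(upper_tail, n - 1)
    print(f"n={n:>4}: t={t_crit:.3f} ({100 * (t_crit / z - 1):.1f}% wider than normal)")
```

With only a few samples the t-based interval is much wider; by 120 samples the difference is about one percent.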

This code calculates confidence intervals based on the mean, the size, and the standard deviation of several sets of data. It has a default confidence level of 95%, but you can change it. It returns the lower and upper bounds of the confidence intervals around the means you passed in.

Precisely, that means “Given this mean, standard deviation, and number of data points, there is a probability equal to the confidence value (95% by default) that the true mean for all the data that this dataset represents is between this lower number and this higher number.”

import math
from scipy.stats import t

def confidence_interval_for_collection(sample_size=(), standard_deviation=(), mean=(), confidence=0.95):
    # Degrees of freedom for each dataset: one less than its sample size.
    degrees_freedom = [count - 1 for count in sample_size]
    # Probability left in each tail outside the interval (0.025 for 95%).
    outlier_tails = (1.0 - confidence) / 2.0
    # Critical value of the t-distribution for each dataset (negated lower-tail quantile).
    t_distribution_number = [-1 * t.ppf(outlier_tails, df) for df in degrees_freedom]

    # Standard error of each mean, then the margin of error around it.
    standard_error = [std / math.sqrt(count) for std, count in zip(standard_deviation, sample_size)]
    margin = [error * t_number for error, t_number in zip(standard_error, t_distribution_number)]

    low_end = [mean_num - margin_num for mean_num, margin_num in zip(mean, margin)]
    high_end = [mean_num + margin_num for mean_num, margin_num in zip(mean, margin)]

    return low_end, high_end

Programming Note: This method is called a collection method because it takes in collections of values and does an operation on each one. It’s useful, for example, for calculating confidence intervals for every mean in a collection of means stored in a DataFrame. You can also use it on just one set of summary data, but you have to pass in the sample size, standard deviation, and mean of that dataset inside of brackets to make a one-item collection, like this:

confidence_interval_for_collection(sample_size=[217], standard_deviation=[0.05], mean=[0.62])
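If you want a sanity check on numbers like these, one option is to compare a by-hand calculation against scipy's own t.interval. This is a sketch of that cross-check using the summary numbers from the call above. It is standalone and does not call the function defined earlier.

```python
import math
from scipy.stats import t

n, std, mean = 217, 0.05, 0.62  # the summary numbers from the call above

# Manual calculation: mean ± critical value × standard error.
standard_error = std / math.sqrt(n)
critical = t.ppf(0.975, n - 1)
manual = (mean - critical * standard_error, mean + critical * standard_error)

# scipy's built-in equivalent; the first argument is the confidence level.
builtin = t.interval(0.95, n - 1, loc=mean, scale=standard_error)

print(manual)
print(builtin)
```

Both approaches should agree to within floating point error, with bounds of roughly 0.613 and 0.627.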

Let’s see it in action by calculating confidence intervals around the number of days that various Chicago municipal maintenance operations took:

import pandas as pd

# Label each row "Activity (year)", then aggregate the duration metrics per label.
aggregation = pd.read_csv('metrics.csv') \
        .assign(year=lambda df: df["Period Start"].apply(lambda x: x[-4:])) \
        .assign(activity_year=lambda df: df["Activity"] + " (" + df["year"] + ")") \
        .assign(average_days_to_complete_activity=lambda df: df["Average Days to Complete Activity"].apply(lambda x: float(x))) \
        .groupby('activity_year') \
        .agg({
             'Target Response Days': 'max',
             'average_days_to_complete_activity': ['mean','std'],
             'Activity' : 'count'
        })

This example code produces some aggregate metrics from Chicago’s municipal maintenance data. In particular, for a given activity in a given year, we can see how many days the job took on average. We can also see how spread out the individual measurements were with the standard deviation:

It would be easy to draw conclusions based on these averages. But that’s not the entire story. What if 2011 only had three alley grading projects, and the year 2013 had 23? For the same reason that twenty-three people’s heights extrapolate much better for predicting human heights than three people’s heights, the confidence interval around three data points will be wider than the one around a lot more data points. Let’s calculate some confidence intervals around these averages based on the size of the data sets:
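To put rough numbers on the three-versus-23 comparison, here is a sketch with made-up summary statistics, not values from the Chicago data. Both hypothetical years have the same average and the same standard deviation, so only the sample size differs. The helper mean_confidence_interval is defined just for this illustration.

```python
import math
from scipy.stats import t

def mean_confidence_interval(mean, std, n, confidence=0.95):
    """Return (low, high) bounds of the t-based confidence interval for a mean."""
    critical = t.ppf(1 - (1 - confidence) / 2, n - 1)
    margin = critical * std / math.sqrt(n)
    return mean - margin, mean + margin

# Made-up numbers: both years average 10 days with a standard deviation of 4.
few = mean_confidence_interval(mean=10.0, std=4.0, n=3)
many = mean_confidence_interval(mean=10.0, std=4.0, n=23)

print(f"n=3:  {few[0]:.1f} to {few[1]:.1f} days")
print(f"n=23: {many[0]:.1f} to {many[1]:.1f} days")
```

With three projects the interval spans nearly 20 days; with 23 it spans about three and a half.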

aggregation.columns = [' '.join(col).strip() for col in aggregation.columns.values]
aggregation["conf_interval_bottom"], aggregation["conf_interval_top"] = confidence_interval_for_collection(
    sample_size=aggregation["Activity count"],
    standard_deviation=aggregation["average_days_to_complete_activity std"],
    mean=aggregation["average_days_to_complete_activity mean"],
)

aggregation["average_slippage"] = aggregation["average_days_to_complete_activity mean"] - aggregation["Target Response Days max"]
aggregation["slippage_corrected"] = aggregation["conf_interval_top"] - aggregation["Target Response Days max"]


For this table, slippage_corrected refers not to the difference between the target and the average duration of a job, but to the difference between the target and the top of the 95% confidence interval around that average. We can be about 97.5% confident that the true average duration falls below that top bound. Negative numbers in the table below mean that even the top of the confidence interval has jobs finishing ahead of schedule, on average:
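The case worth watching for is an activity that looks ahead of schedule on average while its interval still reaches past the target. Here is a sketch of how to flag it, using the same column names as above on two hypothetical rows. The activity names and numbers are made up, not taken from the Chicago data.

```python
import pandas as pd

# Hypothetical rows; the activity names and numbers are made up for illustration.
example = pd.DataFrame({
    "activity_year": ["Activity A (2012)", "Activity B (2012)"],
    "Target Response Days max": [10.0, 10.0],
    "average_days_to_complete_activity mean": [9.0, 6.0],
    "conf_interval_top": [11.5, 7.0],
})

example["average_slippage"] = (
    example["average_days_to_complete_activity mean"] - example["Target Response Days max"]
)
example["slippage_corrected"] = example["conf_interval_top"] - example["Target Response Days max"]

# Ahead of schedule on average, but the interval still reaches past the target.
uncertain = example[(example["average_slippage"] < 0) & (example["slippage_corrected"] > 0)]
print(uncertain["activity_year"].tolist())
```

Only Activity A is flagged: its average beats the target by a day, but the top of its interval misses the target by a day and a half.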

For something like maintenance job durations, a confidence interval might seem like overkill.

A lot of the time, businesspeople are going to make decisions based on averages and nothing else.

For business decisions, this is often fine, and it’s not worth the effort to be really statistically rigorous. However, there are situations where you need to be using the confidence interval, such as:

Any situation where people’s health, safety, and quality of life depend on how you act on this data.

In the next installment, we’ll discuss the T-test as well as some study design flaws to watch out for.

If you liked this piece, you might also like:

This six-part case study demonstrating a full data science cycle

The Design Patterns for Data Science Series (where engineering practices meet data science challenges)

Exploring Numpy Vectorization (in case this post didn’t get far enough under the hood for you)

One comment

  1. I like the “Knife Skills” metaphor here. Here’s my attempt to extend it to two things I’m interested in:

    Prediction is Baking: It’s tempting to look at words like confidence and then lazily apply them to make a prediction of the future, but in reality it’s a whole different technique. Dicing things to a uniform consistency for even cooking is to working with living beings (yeast) as assuming a normal distribution is to the many real-world things that are nowhere near normally distributed. Really liked Nassim Taleb’s book The Black Swan on this.

    Bayesian statistics are Sous-Vide: Fancy new technique enabled by modern technology (cheap devices to control temperature of water and more computational power with which to run simulations). Just as Sous Vide takes the uncertainty out of the equation on whether food is cooked, Bayesian analysis always encodes uncertainty as bias as a core part of the methodology. I like Allen Downey’s writing and Bayesian Methods for Hackers, though I worry that the very practical focus of the 2nd one might be vulnerable to StackOverflow style copy/pasting without understanding.
