Researchers, data engineers, data scientists, and machine learning engineers all need the ability to interpret data and draw conclusions from it. But even academic papers often display a startling lack of statistical rigor—let alone the average industry project.
So, just as chefs need knife skills, technologists need statistical safety skills.
In a previous piece, I shared why I think this is important, introduced you to confidence intervals, and provided code for you to calculate them yourself.
Now, let’s learn some more data safety knife skills 😈
A t-test helps us answer the question: is the difference between two means (averages) statistically significant? It can be, even if their confidence intervals overlap.
This code performs a t-test for two datasets given their sizes, standard deviations, and means. It pools the two variances, so it assumes both datasets have roughly the same spread. It has a default confidence of 95%, but you can change it. It returns a tuple of whether to accept the null hypothesis, and then the t-value.
“Accept the null hypothesis” means “the difference between these two means is not statistically significant” or, more precisely, “at the chosen confidence level, the data does not give us enough evidence to say that these two datasets come from populations with different means.” (Statisticians usually phrase this as “failing to reject” the null hypothesis.)
import math
from scipy.stats import t

def t_test_for(num_samples_1, standard_deviation_1, mean_1, num_samples_2, standard_deviation_2, mean_2, confidence=0.95):
    alpha = 1 - confidence
    total_degrees_freedom = num_samples_1 + num_samples_2 - 2
    # Two-sided critical value, since we compare it against abs(t_value) below
    t_distribution_number = t.ppf(1 - alpha / 2, total_degrees_freedom)

    degrees_freedom_1 = num_samples_1 - 1
    degrees_freedom_2 = num_samples_2 - 1
    sum_of_squares_1 = (standard_deviation_1 ** 2) * degrees_freedom_1
    sum_of_squares_2 = (standard_deviation_2 ** 2) * degrees_freedom_2

    # Pooled variance: assumes both datasets have similar variance
    combined_variance = (sum_of_squares_1 + sum_of_squares_2) / (degrees_freedom_1 + degrees_freedom_2)
    first_dividend_addend = combined_variance / float(num_samples_1)
    second_dividend_addend = combined_variance / float(num_samples_2)

    # Standard error of the difference between the two means
    denominator = math.sqrt(first_dividend_addend + second_dividend_addend)
    numerator = mean_1 - mean_2
    t_value = float(numerator) / float(denominator)

    accept_null_hypothesis = abs(t_value) < abs(t_distribution_number)  # results are not significant
    return accept_null_hypothesis, t_value
Here’s how we use that method:
t_test_for(20, 0.05, 0.62, 10, 0.05, 0.63)
Here, we have descriptive statistics on two datasets: one with a size of 20, a mean of 0.62, and a standard deviation of 0.05, and one with a size of 10, a mean of 0.63, and a standard deviation of 0.05. Because these averages are so similar and the datasets so small (which means the true mean could differ quite a bit from the mean among the samples), the t-test tells us to accept the null hypothesis: at 95% confidence, the gap between these two means is not enough to conclude that the two populations truly differ.
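To see where that result comes from, here is the same arithmetic done step by step using only the standard library. Treat it as a sketch rather than a replacement for the function above: the critical value of 2.048 is an assumption read from a standard t table (two-sided, 95% confidence, 28 degrees of freedom) instead of being computed.

```python
import math

# Descriptive statistics from the example above.
n1, sd1, mean1 = 20, 0.05, 0.62
n2, sd2, mean2 = 10, 0.05, 0.63

# Pooled variance: each sample variance weighted by its degrees of freedom.
pooled_variance = ((sd1 ** 2) * (n1 - 1) + (sd2 ** 2) * (n2 - 1)) / (n1 + n2 - 2)

# Standard error of the difference between the two means.
standard_error = math.sqrt(pooled_variance / n1 + pooled_variance / n2)

t_value = (mean1 - mean2) / standard_error

# Assumption: two-sided critical value for 95% confidence and 28 degrees
# of freedom, taken from a t table rather than computed with scipy.
critical_value = 2.048

difference_is_significant = abs(t_value) > critical_value
print(round(t_value, 3), difference_is_significant)
```

The t-value lands at roughly -0.52, nowhere near the critical value, which is why the difference does not count as significant.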
So far, between confidence intervals and t-tests, we have covered two mathematical models for understanding your data. Now, let’s talk about some data analysis pitfalls and how you can avoid them.
Don’t extrapolate based on a non-representative subset.
A confounding variable is a factor that affects your results but that you don’t account for in your analysis. There’s an effect in statistics called Simpson’s Paradox that sounds very fancy and ends up being basically this. The effect happens when subgroups of data show one trend, but the entirety of the data shows the opposite trend. Let’s talk about an example:
A 1973 study of gender bias at UC Berkeley revealed that, on the whole, women were admitted to the school at a lower rate than men. It turned out, when the data was broken down by department, that in most departments women were admitted at a similar or higher rate than men. What happened: women were disproportionately applying to the programs that were harder to get into. It’s really important to track down your confounding variables as much as possible.
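Here is a small sketch of how that reversal can happen. The counts are invented for illustration (they are not the Berkeley figures), and the department names and the `admission_rate` and `overall_rate` helpers are hypothetical.

```python
# Hypothetical admissions counts, invented for illustration only.
# Layout: department -> group -> (applicants, admitted)
applications = {
    "easy_department": {"men": (800, 480), "women": (200, 130)},
    "hard_department": {"men": (200, 40), "women": (800, 200)},
}

def admission_rate(applicants, admitted):
    return admitted / applicants

def overall_rate(group):
    # Pool applicants and admits across every department for one group.
    total_applicants = sum(dept[group][0] for dept in applications.values())
    total_admitted = sum(dept[group][1] for dept in applications.values())
    return admission_rate(total_applicants, total_admitted)

for name, dept in applications.items():
    print(name, {group: round(admission_rate(*counts), 2) for group, counts in dept.items()})
print("overall", {group: round(overall_rate(group), 2) for group in ("men", "women")})
```

In this made-up data, women have the higher admission rate inside each department, yet the lower rate overall, because most of them applied to the department that admits fewer people.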
Here’s a common medical example that has been an issue for ages: BMI. Body Mass Index scores are used to determine how “healthy” people are based on their weight-to-height ratio. The problem is that the BMI scale was developed based on a sample of white people and overrepresents white men in particular. It doesn’t extrapolate well to other people’s bodies (and it especially pathologizes black women’s bodies). As a result, doctors deny care to people who need it because they “need to lose weight first” according to a metric that was not built for their bodies.
One way to track down confounding variables is to make sure that the distribution of your sample set on each variable you have on record matches the distribution of the population you’re trying to extrapolate to. Doctors right now are trying to do this with COVID vaccines: they want the trials to include people of all ethnicities so they don’t make a safety claim based only on white people and then start injecting a substance into people’s arms with no info about how it might affect them. Getting study participants has been difficult, though, because this country’s history of performing reckless medical experiments on black people in particular (for example, the Tuskegee Study) gives BIPOC really good reasons to be wary of the medical establishment.
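One rough way to check a single variable for representativeness is a chi-square goodness-of-fit comparison between your sample counts and the population shares. The sketch below uses invented group names and numbers, a hypothetical `chi_square_statistic` helper, and a critical value of 7.815 read from a chi-square table (3 degrees of freedom, 0.05 level). You would repeat it for each variable you have on record.

```python
# Hypothetical shares and counts, invented for illustration only.
population_share = {"group_a": 0.60, "group_b": 0.20, "group_c": 0.15, "group_d": 0.05}
sample_counts = {"group_a": 410, "group_b": 50, "group_c": 30, "group_d": 10}

def chi_square_statistic(observed_counts, expected_share):
    """Pearson goodness-of-fit statistic: sum of (observed - expected)^2 / expected."""
    total = sum(observed_counts.values())
    statistic = 0.0
    for group, share in expected_share.items():
        expected = share * total
        statistic += (observed_counts[group] - expected) ** 2 / expected
    return statistic

statistic = chi_square_statistic(sample_counts, population_share)

# Assumption: critical value for 3 degrees of freedom at the 0.05 level,
# taken from a chi-square table rather than computed.
critical_value = 7.815

sample_looks_representative = statistic < critical_value
print(round(statistic, 2), sample_looks_representative)
```

A statistic far above the critical value, as in this made-up sample, suggests the sample over-represents some groups and under-represents others on that variable.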
Don’t unwittingly make multiple comparisons.
You may have 95% certainty that any one comparison isn’t statistically significant by fluke, but when you run a bunch of comparisons, eventually one of them will be a fluke. In fact, when you run 100 independent comparisons at that level, the likelihood that none of them turns up a fluke drops to a measly 0.6% (that’s 0.95 multiplied by itself 100 times).
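The arithmetic behind that drop is short enough to check yourself. This sketch assumes the comparisons are independent and that no real effect exists in any of them; `chance_of_no_flukes` is a hypothetical helper name.

```python
def chance_of_no_flukes(num_comparisons, confidence=0.95):
    """Probability that none of the comparisons is a false positive,
    assuming independent comparisons and no real effect in any of them."""
    return confidence ** num_comparisons

for num_comparisons in (1, 5, 20, 100):
    print(num_comparisons, round(chance_of_no_flukes(num_comparisons), 4))
```

Even at 20 comparisons, the chance of getting through without a single fluke is already well under half.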
There’s a comic about jelly beans that illustrates this.
When you compare a whole bunch of different variables to your target variable while looking for a cause, eventually you’re pretty likely to come across one that correlates just by random chance rather than any true causal mechanism.
How do we reduce the likelihood of reporting chance correlations as evidence of a causal mechanism? Statisticians sometimes counterbalance the use of multiple comparisons with the Bonferroni correction: divide your intended p-value threshold by the number of comparisons you’re running. So, for example, suppose you were looking for variables related to incidence of arthritis, and you looked at consumption of 5 different foods. Assuming you start with a threshold p-value of 0.05 to declare a relationship, you’d divide that by 5 (for the number of comparisons) to arrive at the actual threshold you would use to declare a relationship: 0.01, which holds each individual comparison to a 99% confidence bar.
It’s hard to find a relationship that sure, so the Bonferroni correction gets criticism for being too harsh and missing important findings. Sometimes folks temper it by lowering the p-value threshold, but not all the way to the Bonferroni correction prescription.
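As a sketch, here is the arthritis example with the correction applied. The food names and p-values are invented for illustration, and `bonferroni_threshold` is a hypothetical helper.

```python
def bonferroni_threshold(alpha, num_comparisons):
    """Per-comparison threshold under the Bonferroni correction."""
    return alpha / num_comparisons

# Hypothetical p-values for five foods, invented for illustration only.
food_p_values = {
    "food_a": 0.004,
    "food_b": 0.03,
    "food_c": 0.2,
    "food_d": 0.049,
    "food_e": 0.6,
}

threshold = bonferroni_threshold(0.05, len(food_p_values))
significant_foods = [food for food, p_value in food_p_values.items() if p_value < threshold]
print(threshold, significant_foods)
```

Without the correction, three of these foods would clear the 0.05 bar. With it, only one does, which shows both the protection the correction offers and why people call it harsh.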
Recognize and avoid continuity errors.
This type of data representation error happens when someone takes continuous data that does not fall into discrete categories (like body mass index) and either forces it into discrete categories or interprets it in some other way that isn’t true to the data.
For example, with body mass index, we frequently see two categories: ‘normal weight’ and ‘overweight.’ So first of all, body mass index as a metric in the first place has been demonstrated to be a poor measure of health and fitness. So, we already have some issues. But let’s stick to continuity errors specifically.
Frequently body mass index data gets categorized as ‘normal weight’ (24.9 or lower) or ‘overweight’ (above 24.9). Where is underweight? Also, what is the difference between a 24.8 and a 25.1? When these middle values get averaged together with extremes on either end, it looks like this 0.3 difference in the middle is a night and day difference. It’s not. We’re just representing a wide range with a tiny number of categories.
It’s worth examining whether and why we need to categorize continuous variables before we do it. There are good reasons (get evenly sized buckets of points to compare means, draw meaningful visualizations, et cetera). But don’t do this by default.
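To make the bucketing problem concrete, here is a sketch with a handful of invented BMI values and the two-category cutoff described above. The `two_bucket_label` helper is hypothetical.

```python
from statistics import mean

# Hypothetical BMI values, invented for illustration only.
bmi_values = [17.5, 21.0, 24.8, 25.1, 31.0, 38.5]

def two_bucket_label(bmi):
    # The two-category split described above: 24.9 or lower vs. above 24.9.
    return "normal weight" if bmi <= 24.9 else "overweight"

buckets = {}
for bmi in bmi_values:
    buckets.setdefault(two_bucket_label(bmi), []).append(bmi)

bucket_means = {label: round(mean(values), 1) for label, values in buckets.items()}
print(bucket_means)
```

The values 24.8 and 25.1 sit right next to each other, but they land in different buckets, and the bucket averages end up about ten points apart because each bucket also absorbs the extremes on its side.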
Perhaps most importantly, document and share your analysis when you share your findings!
Mistakes happen, and you can’t be expected to know everything. But you should provide your data, your exact process, your analytical calculations, and the conclusions you have drawn when you share a finding. This gives others the chance to understand the basis of your claims and determine if it’s rigorous enough for their use case. A reader should be able to run the same analysis that you did and verify what you’ve said about it.
Each decision you make—about confidence intervals or t-tests, about comparisons, extrapolations, or data representations—benefits others more when you share it transparently.
If you liked this piece, you might also like:
This six-part case study demonstrating a full data science cycle
The Design Patterns for Data Science Series (where engineering practices meet data science challenges)
Exploring Numpy Vectorization (in case this post didn’t get far enough under the hood for you)