This past week I got around to a deeper reading of Statistics Done Wrong: The Woefully Complete Guide by Alex Reinhart.
This is the book that I wish Seth Stephens-Davidowitz had picked up at some point.
It’s all about how to screw up your experimental design and data analysis. It draws from copious examples, the majority of which come from published studies in reputable journals 😲.
Reinhart covers why and how a groundbreaking discovery might not be real. For example, that thing about women’s periods syncing up? There’s no definitive evidence to support that.
He covers how and why scientists might miss small but important effects—like additional pedestrian fatalities after legalizing right turns at red lights.
He covers why news articles depict doctors frequently changing their minds about which foods are good for you: several studies’ means might all live in the same ballpark, but two of their confidence intervals might include zero (so the data can’t rule out no effect) while the other three just barely exclude zero (so the effect registers as statistically significant).
This isn’t really a one-page set of notes, because I have three pages on this book: two about the book itself, then a third on some follow-up resources. So we’ll talk about this book in a couple of different parts.
Here we’ll cover the first six chapters. These six chapters’ examples come chiefly from classification problems—we have this mean in the control group, we have this other mean in the experimental group, and we need to classify our treatment as effective or ineffective based on the distance between those two means relative to our certainty about those means.
Confidence Intervals and Statistical Power
So we start off with some metrics to help us quantify certainty: confidence intervals and statistical power. The more examples we have in our dataset, the more certain we can be about the means we extract (thanks, law of large numbers!).
It turns out that confidence intervals can give us a much clearer picture of our certainty in our data than the commonly referenced p-values. A confidence interval says ‘We calculated this mean. Based on how much data we collected, the true mean of the phenomenon that this data samples could plausibly be anywhere in this range, at an X percent confidence level.’
For some reason scikit-learn models don’t come standard with confidence interval calculations. The library has a number of pull requests and open issues about this, with a lot of discussion in them around how this works for high-dimensional data, whether it applies here or here, et cetera et cetera. (Start here, if you’re interested). So far I found only one released bolt-on solution for scikit-learn that gives you confidence intervals, and it only works for `RandomForestRegressor` and `RandomForestClassifier`.
SciPy’s `stats` module gives you a method for calculating a confidence interval, but it does so with the normal distribution, which assumes that you have a large number of examples in your dataset (at least 120, and preferably thousands).
What if your dataset is teeny?
You can still use a confidence interval that employs the t-distribution, a flatter, wider distribution than the normal distribution that gives us a starting point for small sample sizes (generally described as 30 or fewer). The t-distribution’s critical values start to line up with the normal distribution’s z-scores right around 120 samples. Keep that in mind as you’re tallying up your example totals.
Here is some code to help you do that:
Suppose you have some data that you’ve grouped together like so:
And you want confidence intervals for all of your groupings.
I gotchu, boo:
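A minimal sketch of the idea, assuming SciPy is installed. The `groups` dictionary, its values, and the helper name `t_confidence_interval` are all made up for illustration; swap in whatever your own groupings look like.

```python
import math
from scipy import stats


def t_confidence_interval(values, confidence=0.95):
    """Return (mean, lower, upper) for the mean of `values`, using the t-distribution."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    # Sample variance, with n - 1 in the denominator.
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    standard_error = math.sqrt(variance / n)
    # Two-sided critical value from the t-distribution with n - 1 degrees of freedom.
    t_critical = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    margin = t_critical * standard_error
    return mean, mean - margin, mean + margin


# Hypothetical grouped data: group label -> observations.
groups = {
    "control": [4.1, 3.9, 4.4, 4.0, 4.2],
    "treatment": [4.8, 5.1, 4.6, 5.0, 4.9, 5.3],
}

intervals = {name: t_confidence_interval(vals) for name, vals in groups.items()}
for name, (mean, lower, upper) in intervals.items():
    print(f"{name}: mean={mean:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

With groups this small, the t-based interval comes out noticeably wider than a normal-based one would, which is the whole point of using it.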
Statistical power, it turns out, is more complicated to calculate. To fully explain it, Reinhart points us to a 560-page book called Statistical Power Analysis for the Behavioral Sciences. To his credit (as well as the author’s and whichever soul scanned all 560 of those pages), it’s available for free online. I haven’t gotten through it yet. In the meantime, SAS and a number of other statistical analysis programs can help you with this, as could a statistical consultant. I’ll give you code when I have it.
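Until then, here is a rough sketch for the simplest case only: a two-sided, two-sample t-test with equal group sizes and equal variances. It is not taken from the power analysis book. It uses the textbook noncentral-t formulation, and the function name and example numbers are my own assumptions.

```python
import math
from scipy import stats


def two_sample_t_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, equal-variance, two-sample t-test.

    effect_size is the difference in means divided by the shared
    standard deviation (Cohen's d).
    """
    df = 2 * n_per_group - 2
    # Under the alternative, the t statistic follows a noncentral t-distribution.
    noncentrality = effect_size * math.sqrt(n_per_group / 2)
    t_critical = stats.t.ppf(1 - alpha / 2, df)
    # Probability of landing in either rejection region.
    upper_tail = stats.nct.sf(t_critical, df, noncentrality)
    lower_tail = stats.nct.cdf(-t_critical, df, noncentrality)
    return float(upper_tail + lower_tail)


# Hypothetical scenario: a medium-sized effect (d = 0.5) at several sample sizes.
for n in (10, 20, 64, 100):
    print(f"n per group = {n}: power = {two_sample_t_power(0.5, n):.3f}")
```

The useful takeaway is the shape of the curve: small groups leave you with a poor chance of detecting a real, moderate effect.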
We also have an important metric for quantifying distance between our means: t-tests.
You can have two means whose confidence intervals overlap that still possess a meaningful difference. Enter the t-test!
We have written here a method that will tell us, firstly, whether to accept or reject the null hypothesis, which assumes no meaningful difference between the two sets of data we want to compare. I have named that output `accept_null_hypothesis` because I don’t love the ubiquitous use of the confusing phrase ‘reject the null hypothesis’ in scientific inquiry. It’s a double negative (reject the absence of meaningful difference), which adds an unnecessary piece of mental acrobatics to the (already frequently herculean) task of determining what, exactly, the scientists are trying to say in their conclusion paragraph.
We are going with ‘accept the absence of meaningful difference’ as the variable name for two reasons. First of all, we remove the double negative this way. Second of all, accepting the null hypothesis is (or should be) the outcome of the vast majority of scientific inquiry. Scientists, collectively, test a whole bunch of stuff to see what has an effect. Most of the things tried, it turns out, don’t have that effect. So our `accept_null_hypothesis` value will usually be true. When it’s false, we should sit up and pay attention.
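A minimal sketch of such a method, assuming SciPy. Only the `accept_null_hypothesis` name comes from the text above. The function name, the other keys, the sample data, and the choice of Welch’s t-test (which doesn’t assume equal variances) are my assumptions.

```python
from scipy import stats


def compare_means(control, experimental, alpha=0.05):
    """Run a two-sided Welch's t-test and report the result in plain terms."""
    result = stats.ttest_ind(control, experimental, equal_var=False)
    return {
        "t_statistic": float(result.statistic),
        "p_value": float(result.pvalue),
        # True means we found no significant difference at this alpha.
        # Strictly speaking that is "failed to reject", not proof of no effect.
        "accept_null_hypothesis": bool(result.pvalue >= alpha),
    }


# Hypothetical measurements for illustration only.
control = [4.1, 3.9, 4.4, 4.0, 4.2]
experimental = [4.8, 5.1, 4.6, 5.0, 4.9, 5.3]

outcome = compare_means(control, experimental)
print(outcome)
```

Here the two made-up groups sit far apart relative to their spread, so `accept_null_hypothesis` comes back `False`. Feed it two samples drawn from the same source and it will usually come back `True`.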
…which brings us to multiple comparisons
This is an important topic, and Reinhart devotes three chapters of this first part of the book to it.
Multiple comparisons: when scientists compare the same examples to their dependent variable multiple times to look for correlates or causes.
When a study uses 1,000 cell samples, all of which came from one of the same two mice, we have multiple comparisons by autocorrelation (Reinhart’s term for this is pseudoreplication).
When we study a variety of stock prices and include each of their year-over-year returns as separate data points in our model, we have multiple comparisons by taking multiple measurements.
And then there’s this:
Ah, yes. You may have 95% certainty that any one comparison isn’t statistically significant by fluke, but when you run a bunch of comparisons, eventually one of them will be a fluke. In fact, when you run 100 separate comparisons, the likelihood that none of them comes up significant by fluke drops to a measly 0.6% or so (that’s 0.95 multiplied by itself 100 times).
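For a sense of scale, here is the arithmetic, assuming the comparisons are independent and that none of them reflects a real effect. The function name is mine. For 100 comparisons the answer lands a little under 1%.

```python
def prob_no_false_positives(n_comparisons, alpha=0.05):
    """Chance that none of n independent null comparisons looks significant."""
    return (1 - alpha) ** n_comparisons


for n in (1, 10, 20, 100):
    p_clean = prob_no_false_positives(n)
    print(f"{n:>3} comparisons: {p_clean:.1%} chance of zero flukes, "
          f"{1 - p_clean:.1%} chance of at least one")
```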
This is also what’s happening when you take a group of people with some incidence rate of cancer and you hand them a questionnaire with 100 questions on it: meat consumption, egg consumption, dairy consumption, soybean consumption, exercise, sleep habits, etc. Even if none of these questions has anything to do with cancer, at some point one of the answer sets will line up with who got cancer by pure fluke. Your questions could be about the most ridiculous garbage imaginable: favorite casual reading genre, self-described level of narcissism, third favorite pizza topping, color of favorite pair of shoes—and this is still going to happen.
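You can watch this happen with a quick simulation. This is a sketch under made-up assumptions: 200 people, a 30% incidence rate, and 100 questionnaire answers that are pure random noise with no connection to who got cancer.

```python
import random
from scipy import stats

random.seed(0)  # fixed seed so the run is repeatable

n_people, n_questions, alpha = 200, 100, 0.05

# Cancer status is assigned at random, unrelated to any answer.
has_cancer = [random.random() < 0.3 for _ in range(n_people)]

false_positives = 0
for _ in range(n_questions):
    # Each "question" is just noise: third favorite pizza topping, etc.
    answers = [random.gauss(0, 1) for _ in range(n_people)]
    with_cancer = [a for a, sick in zip(answers, has_cancer) if sick]
    without_cancer = [a for a, sick in zip(answers, has_cancer) if not sick]
    p_value = stats.ttest_ind(with_cancer, without_cancer, equal_var=False).pvalue
    if p_value < alpha:
        false_positives += 1

print(f"{false_positives} of {n_questions} meaningless questions "
      f"looked significant at alpha = {alpha}")
```

On average you should expect around five of the hundred nonsense questions to clear the significance bar, give or take a few depending on the seed.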
In fact, Reinhart cites a memorable example in which researchers demonstrated exactly that. They determined, by making a staggering number of bogus comparisons and picking the most fortuitously significant-looking one, that listening to the song “When I’m Sixty-Four” made a randomly-assigned group of undergraduates a year and a half younger than another group that listened to the song “Kalimba,” when controlling for their fathers’ ages. In addition to demonstrating the problem with multiple comparisons, that study fattens my growing collection of evidence that, by comparison to plenty of other people, I’m not even that salty. Huzzah!
I’ve glossed over some of the important concepts in the first part of Reinhart’s book, so if you’re looking for something specific I recommend checking for it in the handwritten notes in the first photograph or, even better, picking up the book yourself! I went through the revised and expanded version, but the free version available online is also quite good and, in fact, I have consulted it on a few occasions during projects.
Bottom line: the distance between your control and experimental outcomes only matters relative to your certainty about their locations. Much of the math that goes into establishing the meaningfulness of study results boils down to striking a balance between those two things. And even when that balance is struck, there is still a chance that the seemingly meaningful result is a fluke or the too-small-to-establish-meaning difference is, in fact, meaningful.
The world is a messy place, and our data and models represent only approximations of it. Keep that in mind.