In July, Bloomberg promoted an article on Twitter: “Drinking more than a small shot glass of beer a day could pose risks to health for men under the age of 40, a new study suggests.”
You might not have seen this exact title, but you’ve seen titles like it. “Study suggests <something that the publisher is pretty sure will make you click>.” And sometimes, the claims are even true! But often, they ain’t.
This post isn’t gonna be about the truth of the impact of alcohol on health. But it is gonna be about how you might decide whether to trust an article like this. For the article to be trustworthy, two things have to be true:
- The study’s design, data strategy, and results analysis have to be rigorous enough to make the claim.
- The study has to actually make the claim that the article says it makes.
Now, I’m a dork who goes and analyzes the rigor of actual papers for fun.1 That activity lives pretty squarely in the territory of Item 1. I don’t realistically expect most people to go and do this. In 99% of cases, a reader will learn only whatever Bloomberg or whomever said about the study.
But that’s (mostly) okay, because you can often learn enough from the article alone to identify a shaky claim. If it’s really an internally and externally valid claim, that probably won’t be clear from the article and you’d need to do further research to establish that. And truth be told, most of these articles are based on shaky claims.
So what if you’re not a study-reading dork and you still want to identify these shaky claims?
Warning Signs in “Studies Have Shown” Articles
warning sign #1: any title designating one specific food as a miracle/curse food.
For almost any headline like “Does X specific food cause Y plethora of different fatal ailments?” the study behind it is some observational study rife with multiple comparisons.
What does that mean? Time to learn some experimental design and statistics! Don’t you worry, I’ll keep it light ;).
1. Observational study: as opposed to a controlled experimental environment, this was “a bunch of people lived their normal lives and then answered surveys about it.”
We therefore don’t actually know all the differences in the groups under study.
Observationals are also sketchy because they often rely on people answering survey questions. That’s not an accurate proxy for what actually happened because 1. People lie on surveys and 2. People genuinely don’t know. There is plenty of actual, rigorous scientific evidence to support both of these assumptions.
In particular for drinking, we know people tend to answer with lower numbers than the truth— particularly when they are younger and feel more shame about bad habits.
Also, when folks have had enough to drink, they may have had drinks they don’t even remember having. Moreover, people pretty regularly under-count the number of servings of alcohol they’ve had. For example, they’ll count a long island iced tea as “one drink.” Former Miami bartender here: the recipe for that drink takes 3-5 servings of alcohol, depending on where you get it. This is an extreme example, but cocktails in general tend to come with double shots by default at a lot of purveyors. To build a solid dataset about drinking habits, you’d need a jigger at a minimum, and probably a subject matter expert too. You can’t just survey a bunch of randoms and treat that as the ground truth.
Now that you don’t trust observational studies, let’s move onto multiple comparisons.
2. Multiple comparisons is what happens when you take two groups with different outcomes and check ALL THE THINGS that differed between their lives, collectively.
Let’s say you take two groups of people: one whose members have received cancer diagnoses in their lives and one whose members have not. You compare a laundry list of things about their lives and preferences: drinking, smoking, exercise, coffee habits, tea habits, how much they practice playing guitar, their favorite color, and whether they like dogs or cats more.
Some of those sound plausible and some ridiculous, right? So the thing is, when you check a BUNCH of different things for differences between two outcome groups, some of them might differ by pure chance.
Take two middle school kickball teams whose captains chose people based on perceived kickball skill. Now start comparing random things about those two teams: hair color, eye color, color of shirt worn today, number of freckles, style of earrings, brand of sneakers, favorite cafeteria dessert. How many things do you think you’d have to compare before you found one where, purely by chance, one team’s answer distribution differed from the other team’s? Five? Ten? When I put it like this, people realize—probably not that many. You’d probably find one by the time you got to twenty, right? We’ve all been on randomly or captain-selected recreational teams where, one day, it just so happened that most of one team was wearing grey or something. But it would be laughable to claim “People who wear grey shirts more likely to win a game of kickball.”2
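You don’t have to take my word for it: this is easy to simulate. Here’s a toy sketch with all the numbers invented by me (two teams of 10 players, 20 coin-flip traits, and a 7-vs-3 split or worse counting as a “noticeable” difference):

```python
import random

def noticeable_split_found(n_traits=20, team_size=10, gap=4):
    """One 'kickball day': flip a coin per player for each trait, then
    check whether any trait splits the teams by `gap` players or more
    (e.g. 7 players on one team have it, only 3 on the other do)."""
    for _ in range(n_traits):
        team_a = sum(random.random() < 0.5 for _ in range(team_size))
        team_b = sum(random.random() < 0.5 for _ in range(team_size))
        if abs(team_a - team_b) >= gap:
            return True
    return False

trials = 2000
hits = sum(noticeable_split_found() for _ in range(trials))
print(f"{hits / trials:.0%} of simulated days had a 'grey shirt' trait")
```

On those invented numbers, the math works out to roughly a 91% chance that at least one of the twenty traits shows a lopsided split on any given day, so the simulation should print a figure in that neighborhood.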
That happens in “scientific studies” all the time. It can be hard, in experiments, to distinguish a coincidence from an actual effect. Even if each variable in a whole bunch has individually been shown to have less than a 5% chance of randomly varying alongside some target variable (this is what “p-value < 0.05” means in a paper), if you put together twenty of those, you’ve got a probability of about 64 percent that at least one of them varies with the target variable by random chance.
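That figure is one line of arithmetic. If each of twenty independent comparisons has a 5% chance of being a fluke, the chance that at least one of them flukes is:

```python
def familywise_error_rate(k, alpha=0.05):
    """Chance that at least one of k independent comparisons, each
    with a false-positive rate of alpha, flukes by pure chance."""
    # Each comparison avoids a fluke with probability (1 - alpha);
    # all k of them avoid one with probability (1 - alpha) ** k.
    return 1 - (1 - alpha) ** k

print(round(familywise_error_rate(20), 3))  # 0.642
```

That’s how a five percent false-positive rate quietly becomes an almost-two-in-three chance when you check twenty things.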
Now, the likelihood of the two groups’ characteristics differing by random chance does decrease as the number of study subjects increases. For two kickball teams that each had a million players on them, differences in random characteristics would become less likely. That’s a concept called statistical power. Most studies, to be frank, possess nowhere near the statistical power they would require to overcome the spurious correlation risk presented by multiple comparisons.
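To put a number on it: for a 50/50 coin-flip trait, the typical gap between two teams’ proportions shrinks with the square root of team size. A sketch, with team sizes I made up:

```python
import math

def typical_gap(team_size):
    """Standard deviation of the gap between two teams' proportions
    on a 50/50 coin-flip trait: sqrt(2 * 0.25 / n)."""
    return math.sqrt(0.5 / team_size)

for n in (10, 1_000, 1_000_000):
    print(f"{n:>9} players/team: typical gap {typical_gap(n):.2%}")
```

With 10 players per team, a 22-point gap on some random trait is routine noise. With a million, anything past a tenth of a point starts to mean something. That’s what statistical power buys you.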
So that’s usually how these “BROCCOLI GIVES YOU CANCER” studies are done. Just, like, for reference.
I wrote more about that here if you want to go a little deeper. I deliberately made that post accessible to people without a ton of math background.
So that’s warning sign number one, the general case for articles like these. Now let’s talk about this specific article, and articles like it that you might come across in the wild.
Warning Sign #2: Numbers that are clearly extrapolations.
An extrapolation means that no one actually measured this directly; rather, they derived the number by taking some different measurement and applying some sort of mathematical operation to attempt to convert it into the measurement of interest.
In this article, the big example is “shotglass of beer per day.”
No one drinks like that. People drink much larger quantities of beer than that. This study took those much larger drinking numbers, compared them to risk factors, fit a formula to the trend, and then looked for where that formula indicated a theoretical risk threshold of zero.
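Here’s a toy version of that move, with completely made-up data. Note that nobody in this “dataset” drinks fewer than two drinks a day, yet the fitted line confidently reports a zero-risk point way down below that:

```python
# Entirely invented data: observed drinkers and their excess risk.
drinks = [2, 4, 6, 8, 10]                  # drinks per day (observed range)
excess_risk = [1.8, 3.9, 6.1, 8.0, 10.2]   # made-up risk units

# Ordinary least-squares line fit, by hand.
n = len(drinks)
mean_x = sum(drinks) / n
mean_y = sum(excess_risk) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(drinks, excess_risk))
    / sum((x - mean_x) ** 2 for x in drinks)
)
intercept = mean_y - slope * mean_x

# Extrapolate to where the line says risk would hit zero: a point
# far below anything anyone in the data actually reported drinking.
zero_risk_drinks = -intercept / slope
print(f"'theoretical minimum risk' at {zero_risk_drinks:.2f} drinks/day")
```

The headline version then becomes “the safe limit is about a quarter of a drink per day,” a quantity that no one in the data ever actually reported drinking.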
The thing about statistics is, it’s the art & science of using fake numbers to model real things.
Even when measuring directly, we gotta deal with this. And the more you f**k with the numbers you measured, the faker they are.
Like, come on:
“For women aged 15-39 the “theoretical minimum risk exposure level” was 0.273 drinks – about a quarter of a standard drink per day.”
“theoretical minimum risk exposure level” deserves 19 air quotes. Ain’t nobody serving quarter-glasses of wine for science.
Warning Sign #3: Treating continuous variables as categorical ones
The big example in this article is the way it treats subjects under 40 vs over 40. It acts like there’s this skyscraper of risk that’s the same for a 15 year old and a 39 year old, and then at 40 that skyscraper just accordions down to a two flat. The way it’s reported suggests that, for someone trying to minimize risk of health effects, you basically can’t drink until you’re 40, and then suddenly you can drink some, and then at 70ish you can drink quite a lot.
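Here’s a sketch of what bucketing does to a made-up, smoothly declining risk curve: the two-bucket version manufactures exactly that skyscraper-to-two-flat cliff at 40.

```python
# Invented smooth trend: risk declines gently with every year of age.
ages = list(range(15, 70))
smooth_risk = [10 - 0.18 * (age - 15) for age in ages]

# Now squash it into the two buckets the reporting uses.
under_40 = [r for age, r in zip(ages, smooth_risk) if age < 40]
over_40 = [r for age, r in zip(ages, smooth_risk) if age >= 40]

print(f"under-40 bucket average: {sum(under_40) / len(under_40):.1f}")
print(f"40-plus bucket average:  {sum(over_40) / len(over_40):.1f}")
# A 39-year-old gets reported the under-40 number, even though the
# smooth curve puts them right next to a 41-year-old.
```

The smooth curve and the bucketed version describe the same made-up people; only the reporting differs.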
That is not how aging works. You don’t suddenly get a different GI tract from G*d on your 40th birthday. Age is a continuous variable. H*ll, people the same calendar age don’t even have the same amount of effects of aging on the body.
Now the reason studies do this is often that they don’t have enough participants to treat the variable as continuous, so instead they make buckets, and then it gets reported like this.
But that’s another thing to look for:
Warning Sign #4: Study Size
The more subjects a study has, the more likely it is that their outcomes represent some larger population like them. We saw how this works earlier in the example with the kickball teams. I don’t expect laypeople to actually calculate statistical power for studies: though the concept is simple, the math can get complicated, and it’s doubtful that most articles provide you with enough information about the study of interest to do this from the article alone anyway.
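For the curious, though, a back-of-envelope version fits in a few lines. This sketch uses the standard normal approximation for comparing two outcome rates, with rates (10% vs. 15%) that I invented for illustration:

```python
import math
from statistics import NormalDist

def power_two_proportions(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided test for a difference between
    two proportions, with n subjects per group (normal approximation)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se = math.sqrt(2 * p_bar * (1 - p_bar) / n)
    return norm.cdf(abs(p1 - p2) / se - z_crit)

# The same 5-point difference in outcome rates, three study sizes:
for n in (50, 500, 5000):
    print(f"n = {n:>4} per group: power {power_two_proportions(0.10, 0.15, n):.2f}")
```

With 50 people per group, the test catches that real 5-point difference only about one time in nine; with 5,000 per group it’s near-certain. Most studies sit much closer to the first row.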
I have written about this in case you are curious about how it works: again, accessible to folks without a current, fresh, or deep math background.
Warning Sign #5: Results Interpretations from Not-A-Statistician
I won’t belabor this, but I think about it a lot: people with a science background that is NOT specifically statistics will often misinterpret their own results or the results of others in their field.
Doctors do this all the time. You can plainly see the treatment of continuous variables as categorical ones in universal guidelines like “Start skin checks at 30, mammograms at 40, and colonoscopies at 45.” And those are some of the more respectable ones. There’s also “A pregnancy is geriatric at 35” and “It makes sense to scare everyone about their STI risk but then only check for like 5 of the 9 most common STIs in the standard STI panel.” Then there’s my personal favorite, “always shame fat people, regardless of what actual problem brought them to the doctor.” Doctors will try to tell you there’s data behind these choices, and they’re correct. I’ve read a lot of the “data behind these choices.” The guidelines that have fallen out of them suffer from either misinterpretation of that data or poor coordination between different medical recommendation administrators (often both).
To recap, watch out for:
- Bold claims based on observational studies
- Numbers that are clearly extrapolated
- Continuous variables in fake buckets—particularly a SMALL number of buckets like 2
- Few study participants
- Researchers who aren’t statisticians
OK, two final things:
1. Often the article’s fudging isn’t the researchers’ fault: the paper heavily hedges its claims, and some clickbait influenster seizes it and magics it into “BROCCOLI GIVES YOU CANCER” with a combo of hustle fantasy and total lack of stats knowledge.
Publishers often incentivize these writers to produce “pieces that drive engagement,” which sounds very highfalutin’ and education-positive until you learn that “engagement” typically means “clicks” or “getting further in the article purely because that means we have more fallow scrolling space on the sidebar to put ads.” I’ve discussed the problems with engagement as an optimization metric on this blog before.
In fact, sometimes even the researchers themselves are forced to…uh, fudge their experimental rigor in service of getting published. It’s seriously f*cked up! I have written about that on this blog before too, for those curious.
2. I’m NOT refuting the claim that alcohol carries risks. I’m saying this specific study does not confirm specific alcohol thresholds for specific ages. Please do not go on the internet and claim that Chelsea Troy the statistician told you it’s totally physically safe to throw regular ragers. I did nothing of the sort—”keep it under a shotglass of beer until the exact day you turn 40″ was just too good to pass up as a blatant example of the problems with these types of articles.
- When I was a teenager, my mother and I would sit down together on Thursday nights with our Lean Cuisine meals (it was the early aughts, after all) to watch the fashion design reality show competition Project Runway. We watched the show to pregame our main event: reading Tom and Lorenzo. This was before Tom and Lorenzo got big and diversified on the fashion blogging scene: at the time, they were just two married gays making their internet hay in snarky criticism (or, more rarely, praise) of the looks produced by the contestants on that week’s episode. Now, as a Fully Grown Gay myself, I confess that I fantasize about having my own column à la T&L, but instead of fashion design I’d do study design. Every week we take a research paper and some articles written about it, and we just rip. If I managed to be funny enough to pull that off, I think it could have a remarkably positive impact on statistical literacy among laypeople reading articles and, frankly, subject matter experts who struggle to interpret their own data or their colleagues’ data. It would be a rip blog, though, so I suspect I’d also make some enemies unless I somehow figured out how to be both hilarious and tactful. If you trust me with that challenge and you have some money to pay me to make this column happen, you know where to find me.
- I just know that this passage has snapped a lot of the programmers reading this post into a laser-focused hunt for some arcane counterpoint like “maybe the grey shirts make them harder to throw balls at.” Please, lovelies, I assure you: the genius points you might earn by successfully putting me in rhetorical check like this get far overdrawn by the genius points you lose in completely missing the point of the example. Let’s keep it moving, nerds.
If you liked this piece, you might also like:
These two pieces on statistical safety! Complete with videos. I tried to look hottish for you.
This piece called “Best Practice is not a Reason to Do Something” about the problems with applying an aggregate study result to an individual case. Features Oscar the Grouch.
This piece, which discusses the bad parts of engagement as an optimization metric. Content Warning: Gay.