What should we be asking about machine learning studies?


Last month I attended the Princeton Reproducibility in Machine Learning Summit: twelve researchers in machine learning, data science, and social sciences presented papers that I’d more or less aggregate under the heading “What We’re Doing Wrong, from a Data Integrity Standpoint, as an Industry.”

I’ve already written a couple of times about how we misinterpret data as technical practitioners. I’ve also done some videos and posts on basic data safety (part 1 here, part 2 here). I thought it would be interesting to hear from a group of researchers tackling questions of data integrity in machine learning because, frankly, a lot of work in the field right now ignores those questions. We’re like fourteenth-century doctors, chopping off limbs and seeing what happens.

I’m also—as are many of you—a practitioner in industry.

Academic machine learning crowds tend to do a lot of bellyaching about all the ways that academia incentivizes researchers to eschew scientific and statistical rigor in their write-ups. They have to show an effect; they have to show large effect sizes; they have to show something surprising; they hurt their chances by pointing out study weaknesses. “Publish or perish,” plus the irresponsible characteristics that make a study purportedly publishable, create a whole incentive system researchers don’t have power in.

And it is bad, and researchers should bellyache about it. Also, I’m not convinced that perverse publication incentives are the only reason a lot of data science gets messed up. I think, just as often, it’s a lack of rigor, attention, and knowledge on the part of the researchers. Scientists who are not subject matter experts in statistics make grievous errors in data analysis and interpretation with some regularity. And because that’s the case, we have an opportunity to gain ground in the area of data-driven study integrity regardless of the researchers’ particular incentive structure.

That said, industry might have a less perverse incentive structure specifically when it comes to study integrity.

In private industry, our compensation doesn’t depend on our publication schedule; it depends on getting the thing to work. That comes with its own whole set of problems, but the incentive scheme at least doesn’t directly contraindicate analytical rigor. Shitty research means the thing works less well, which means, at least in theory, that the product must be sold for less. A lot of machine learning papers come from a place with this incentive scheme.

In fact, according to Mozilla’s 2022 Internet Health Report:

What values are advanced by researchers? Frequently, they are commercial ones.

Today, nearly half of the most influential machine learning research papers — and many top AI university faculties—are funded by big tech.


That’s a pretty significant change from a decade ago:

Source: Mozilla Internet Health Report

And, it’s worth noting, people lament this change because “industries care about money, not ethics.” The idea here is that industry pays too little attention to how a model works relative to whether it makes money. The idea also includes a conflation of ethics and data integrity that I take issue with, but we’ll get to that later.

First of all, to be candid, academia isn’t doing better than industry on this ethics front. Look how much of research has to do with ethics right now:

Source: Mozilla Internet Health Report

Even if every single one of the Ethical Principle papers in ML came from academia (which they do not: I work at Mozilla and we publish some of them, quod erat demonstrandum), the vast majority of academic ML output would still have to be in the 97% of papers that focus on “technical performance.”

Second of all, this isn’t just an “ethics” thing. Data integrity is, and we would do well to treat and discuss it as, a “performance” consideration. Science is cumulative, so we hamstring ourselves as an industry by building on top of shoddy work. It’s not just “Use actual correct statistics in your statistics because it’s morally superior.” It’s also…literally our job. Not noble ambition, but table stakes.

If the academic “publish or perish” directive prevents data scientists and machine learning engineers from doing our literal jobs, then industry provides an opportunity to imagine what we would do or can do as machine learning researchers with the particular incentive barriers of academia aside.

And researchers are imagining that, but so far what we’ve got is a disorganized collection of assorted ideas rather than, as far as I can tell, a framework for applying analytical rigor to study interpretation or design. So I’ve distilled, from the talks I heard at this summit, some questions I think it is helpful for us to ask in evaluating machine learning research, both our own and others’.

Question 1: Where is the ground truth coming from?

Michael Roberts [1] brought up the primacy of the data collection methodology while sharing his work analyzing the validity of COVID studies from the past two years. It’s easy to read a paper and glance at where the data came from, but that information can gravely impact the interpretive power of the analysis.

Roberts brought up a couple of examples:

  • If doctors labeled the data, then it bakes in human biases (and—addition mine—therefore includes some amount of error worth accounting for)
  • Even if the results come from testing blood, different machine calibrations, sample ages, and other clinical practice variables can introduce variability in the readings. To fix this would require standardization across the machines, which gets harder as we attempt to scale a study. But we don’t make our “big data” better by ignoring inconsistencies like these. At the very least, we owe it to our colleagues as researchers to document them, and when we’re analyzing a paper, we’d do well to note them.
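To put a rough number on why label error is “worth accounting for,” here is a minimal sketch of how noisy ground truth distorts a reported accuracy. It rests on assumptions of mine, not Roberts’s: binary labels, and label errors that are symmetric and independent of the model’s own mistakes. The function name `observed_accuracy` is hypothetical.

```python
def observed_accuracy(true_accuracy: float, label_error_rate: float) -> float:
    """Accuracy we would measure against noisy binary labels.

    Illustrative assumptions (not from the talk): labels are binary, and
    label errors are symmetric and independent of the model's errors.
    """
    if not (0.0 <= true_accuracy <= 1.0 and 0.0 <= label_error_rate <= 1.0):
        raise ValueError("both arguments must be in [0, 1]")
    # The model matches the recorded label when it is right and the label is
    # right, or when it is wrong and the label happens to be wrong too.
    agree_when_right = true_accuracy * (1.0 - label_error_rate)
    agree_when_wrong = (1.0 - true_accuracy) * label_error_rate
    return agree_when_right + agree_when_wrong


if __name__ == "__main__":
    for error_rate in (0.0, 0.05, 0.10):
        score = observed_accuracy(1.0, error_rate)
        print(f"label error {error_rate:.0%}: a perfect model scores {score:.0%}")
```

Under those assumptions, even a flawless model cannot score above one minus the label error rate, which is one reason a paper’s description of who labeled the data, and how, deserves more than a glance.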

As Roberts points out, clinical trials are slow, and in phases, for a reason.

Often, though, machine learning studies focus on data that the researchers did not, themselves, collect. Inioluwa Deborah Raji discussed some examples of this at the summit, among other consequences of irreproducible ML practice:

  • Andrej Karpathy labeled ImageNet alone and it became the standard (I have since talked about this with someone who pointed out that the bulk of the work came from the organization and research efforts of Fei-Fei Li, so it’s possible I misinterpreted this portion of the talk or something got elided here)
  • Twitter data gets used for a lot of natural language processing work, but Twitter communication is gamified in a way that incentivizes people to use language quite unnaturally: specifically, to confirm what others believe for Twitter high scores. That limits the conclusions researchers can draw from this specific text, because they do not generalize to other contexts.

Question 2: What would be an appropriate benchmark for our use case?

Raji’s slide dubbed it “The Benchmark Lottery”: every paper judges its model on different benchmarks, making it difficult at best, and impossible at worst, to confidently compare the results of different studies or models. Now, for such a comparison to even be possible, we’d likely need to adopt a Kagglesque Common Task Framework like Jake Hofman describes in his presentation about integrative study designs. That is, we’d have to agree on what question we’re trying to answer, agree on the metrics of evaluation, even agree on a test set with which to perform comparison evaluations, and have researchers situate their work within one of these.
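A minimal sketch of what that agreement might look like in code, assuming a toy setup of my own invention: the frozen test set, the accuracy metric, and the two threshold “models” below are all hypothetical stand-ins, not anything Hofman or Raji presented. The point is only that every model gets scored by the same function on the same held-out examples.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical shared test set: (input, label) pairs that every
# participant agrees to evaluate on and never trains on.
SHARED_TEST_SET: Sequence[Tuple[float, int]] = (
    (0.10, 0), (0.40, 0), (0.45, 1), (0.60, 1), (0.90, 1), (0.30, 0),
)


def accuracy(predictions: List[int], labels: List[int]) -> float:
    """The agreed-upon metric: fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def evaluate(model: Callable[[float], int]) -> float:
    """Score any model on the shared test set with the shared metric."""
    inputs = [x for x, _ in SHARED_TEST_SET]
    labels = [y for _, y in SHARED_TEST_SET]
    return accuracy([model(x) for x in inputs], labels)


# Two toy "submissions" (hypothetical): classify by thresholding the input.
def threshold_half(x: float) -> int:
    return int(x >= 0.5)


def threshold_low(x: float) -> int:
    return int(x >= 0.42)


if __name__ == "__main__":
    submissions: Dict[str, Callable[[float], int]] = {
        "threshold_half": threshold_half,
        "threshold_low": threshold_low,
    }
    for name, model in submissions.items():
        print(f"{name}: {evaluate(model):.3f}")
```

Because both submissions go through the same `evaluate` function, their scores are directly comparable. That comparability is exactly what gets lost when each paper picks its own benchmark.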

But even before we tried that, I’m not convinced that folks analyzing these papers are noticing how badly we need it. Are we reading papers with an eye toward what would be an appropriate benchmark for our use case and then trying to find work that took such a benchmark, or are we taking any related paper and attempting to intellectual pyrotechnics* our way from whatever conclusion they reached to whatever benchmark they probably would have gotten that’s more related to what we need? Because that’s a dangerous game. I think we shouldn’t play it.

*Any analytical method predicated on the a priori assumption that We Are Very Smart is unsuitable for finding answers to questions that matter. Save it for those icebreaker debate games where you argue about whether a hot dog is a sandwich or whatever, please.

Question 3: Does this work offer a self-contained reproduction for review, reading, and critique?

I appreciated Odd Erik Gundersen’s description of reproducibility here: individual investigators can follow the documentation and draw the same conclusions.

Gundersen also shared a reproducibility taxonomy based on the presence of three pieces of information in a study: Description of what to do, Code used to do it, and Data the code is run on. He then prescribed four labels: R1 for studies that include only a description, R2 for those that include description and code, R3 for those that include only description and data, and R4 for those that include all three. (There isn’t a label for just code and data, but I’m keeping in mind that we’re talking about papers here, for the most part. The only time I regularly see code and data with no description is in GitHub repos, not papers.)
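The taxonomy as I’ve summarized it is small enough to write down as a lookup. This is a sketch of my summary above, not Gundersen’s own code, and the function name `reproducibility_label` is hypothetical.

```python
from typing import Optional


def reproducibility_label(description: bool, code: bool, data: bool) -> Optional[str]:
    """Map which artifacts a study shares to the R1-R4 labels described above.

    Returns None when there is no description, since the taxonomy as
    summarized here has no label for that case (for example, code and
    data alone).
    """
    if not description:
        return None
    if code and data:
        return "R4"  # description, code, and data
    if data:
        return "R3"  # description and data
    if code:
        return "R2"  # description and code
    return "R1"      # description only


if __name__ == "__main__":
    print(reproducibility_label(description=True, code=True, data=False))
    print(reproducibility_label(description=False, code=True, data=True))
```

When reading a paper, I find it useful to ask which of the three inputs I can actually get my hands on before deciding how much weight to give its conclusions.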

This is an area where data scientists might look to the software engineering community. My metric for success with setup documentation, for example, is that a new contributor can get up and running without my intervention.

The pull request system in software engineering also codifies a starting framework for streamlining thoughtful critique. Is it perfect? Absolutely not. I believe the ubiquitous drive-by pull request review is wildly irresponsible. Jess Kerr’s fantastic talk at Philly ETE this year also addresses the incentives of pull requests pitting reviewer and submitter as adversaries. But we do have a system that would theoretically make reviewer replicability possible, if the reviewer considered it mandatory to replicate while reviewing. And we’re nowhere near that point in data science now. A Machine Learning Reproducibility Checklist like this one from McGill may also be helpful in this regard.

Question 4: What question are we trying to answer, and how far proxied are we from that within this study?

Brandon Stewart described the difference between a theoretical estimand (the purpose of the methodology and what we want to get out of this) and an empirical estimand (the thing we are estimating or predicting). The more different those two things are, the less the latter means for the former.

I went on a rant recently apropos of a lay example: an article about a study that made the rounds saying that people under 40 faced health risks the moment they consume alcohol in greater quantity than “a shotglass of beer”:

Look for numbers that are clearly extrapolations.

Extrapolation = no one actually measured this directly, they multiplied or divided some numbers to get it

The BIG example in this article is “shotglass of beer.”

No one drinks like that.

That means that this study took some much LARGER drinking numbers than that, compared them to risk factors, and then extrapolated a THEORETICAL risk threshold of zero.

The thing about statistics is, it’s the art & science of using fake numbers to model real things. I’m serious.

Even when measuring directly, we gotta deal with this. And the more you f**k with the numbers you measured, the faker they are.

“For women aged 15-39 the “theoretical minimum risk exposure level” was 0.273 drinks – about a quarter of a standard drink per day.”

“theoretical minimum risk exposure level” deserves 19 air quotes. Ain’t nobody serving quarter-glasses of wine for science.

– Me, big mad on Twitter

At the summit, Jessica Hullman discussed this somewhat more vaguely in a call for researchers to advertise their uncertainty, particularly when it comes to weak theses (theses from hypotheses that are easier to support than refute, common in psychology) and data and engineering uncertainty (hallmarks in machine learning research). Here’s an excerpt:

Consider a typical supervised ML paper that shows that an innovative algorithm, architecture, or model achieves some accuracy on a benchmark dataset. Even if we assume the reported accuracy is not optimistic for the various reasons discussed above, the researcher has contributed an engineering artifact, a tool that the practicing engineer can carry in their toolbox based on its superior performance to the state-of-the-art on a particular learning problem. New observations based on additional data cannot refute the performance claim of the given algorithm on the dataset, because the population from which benchmark datasets are drawn are rarely specified to the detail needed for another sample to be drawn [128]. Attempts to collect a different sample from an implied population to refute claims are rare; when they have been attempted, researchers have found that the original claims no longer hold [175]. Further, when researchers have tried to compare model performance across benchmark datasets, they have found that results on one benchmark rarely generalize to another, and can be fragile [59, 208].

It’s common to see a massive gap between the claim and the estimand used to make the claim. In fact, I’d definitely put this in a list of “top 3 offenders on incendiary study headlines that go viral” alongside “multiple comparisons, sometimes more comparisons than data points” and “p > 0.05 but statistical power is comically low.”
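For the multiple comparisons offender, the arithmetic is worth seeing once. The sketch below assumes the textbook simplification that every test is independent and every null hypothesis is actually true; real studies rarely satisfy either, so treat the numbers as intuition rather than a correction procedure. The function name `familywise_error_rate` is mine.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across num_tests comparisons.

    Assumes (for illustration only) independent tests, all with a true
    null hypothesis, each run at significance level alpha.
    """
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    # P(no false positives) = (1 - alpha) ** num_tests, so take the complement.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        print(f"{m:>3} comparisons: {familywise_error_rate(m):.1%} chance of a spurious hit")
```

With twenty comparisons at the usual 0.05 threshold, the chance of at least one spurious “finding” is already about 64%, which is how a study with more comparisons than data points ends up with a headline.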

Part of the problem might come from the fact that we gravitate toward the wrong tools for the job based on an amorphous concept of what we’re “supposed” to be using. Momin Malik presented a chart at the summit entitled A Hierarchy of Limitations in Computer Science. It looks like this:

Source: Momin Malik, A Hierarchy of Limitations in Computer Science

I appreciate the attempt to identify four decision points that researchers can use to determine what sort of study design would make the most sense for a given question. I won’t wax too poetic on this decision tree because I intend to come back to this—as well as to Hofman’s work—in a future blog post.

I would not call this list of questions exhaustive.

I can already think of other questions I would want to ask in the evaluation of a study’s design and analysis—for example, around statistical power and number of comparisons.

But I wanted to focus on the particular concerns brought up at the Princeton Reproducibility in ML Summit and attempt to distill them into questions to contribute to an organized analytical guide for a practitioner. I’ve built rubrics for inclusive company culture and maintainable code bases before, and I wonder what such a rubric might look like for machine learning study design, absent perverse publication incentives.

I’ve seen a lot of lists kinda like that before, but they feel scattershot. They’re either only applicable in certain contexts and devoid of a taxonomy that situates those contexts within the broader field, or they’re a bunch of random disparate recommendations rolling around in one paper like a collection of multicolored DnD dice in a repurposed peanut butter jar. I don’t purport to be the person who will singlehandedly fix this. But in working toward an answer, I figure I’ll at least get better at my job.


  1. Throughout this blog post, you’ll find that many of my links to presenters’ contributions navigate to the same page with an index of the summit presentations. I tried to link specifically to each presenter in several places (a Google Doc program we were provided, a list of bios, and then the researchers’ academic websites), but I kept getting 403s, 404s, and 503s galore. I finally gave up and started using the one link I could find that seems to work pretty consistently. I do not have time to hunt down the changing URLs of indices to academics’ stuff on the internet.

