A programmer on your team makes a technical decision. Maybe it’s in a ticket they’re working on, or maybe they make a change without buy-in. Now we’re in retro, a pairing session, or pull request review. Someone asks why they did it. They answer: “this is best practice.”
It’s easy to see the motivation to say that: tech culture thinks in absolutes. We say things like “You always want to DRY up duplicate code” and “It’s more important for code to be legible than performant.” We ask each other things like “What’s better, functional or object-oriented programming?” As if those things apply equally to all contexts. As if the decision-making in this field is, just, that easy. We teach programmers to think this way. So it’s not surprising, when a programmer sees a system running afoul of one of these universal tenets, that they want to change it.
As a result, programmers often use the term “best practice” to defend a choice without realizing that they don’t understand the ‘why’ behind it.
So I’d like to explain how something becomes a “best practice” and why that doesn’t give it carte blanche to be applied in every individual case.
Let’s start with an analogous example.
Do you know why drivers can legally turn right at red lights in the United States?
Short version: a motor fuel shortage (purportedly) and extrapolation error.
Long version: In 1973, the U.S. faced a motor fuel shortage. Cars use less fuel when they can keep rolling, rather than stop/start. So in 1975 the federal government made allowing the turn a requirement for states to receive funding for mandated conservation programs.
The rule was, technically, that the turn had to be legal anywhere that engineering guidance indicated it was safe. Hang onto that; we’ll come back to it. Let’s talk about the other factor: extrapolation error. Extrapolation error rears its head when data scientists over-index on their models.
See, the world is complicated. For most phenomena that we’d like to quantify, the dataset just doesn’t include information on all the causative factors, some of which we may not even know. Example: we can tell you what lifestyle practices raise your RISK of developing certain lifestyle diseases, but we can’t guarantee that you WILL or WON’T develop them, because there are other causative factors we don’t understand.
So, to get just about anything done, we have to simplify things by establishing some up-front assumptions that aren’t always correct, but are nevertheless useful. A great example of this is the p-value in data analysis.
The p-value: What it is, what it isn’t, why it matters
The p-value represents the probability of seeing differences at least as large as the ones exhibited between a control and an experimental group if random chance alone were at work, that is, if there were actually no difference between the groups. Whether it represents that very well is a contested topic, and full disclosure, I fall on the skeptical end of the opinion spectrum. Nevertheless, articles in scientific publications frequently use the p-value to determine the “statistical significance” of a result.
This means that if a control and experimental group exhibit a difference such that the p-value falls beneath a certain threshold, the study is considered to support the existence of a “statistically significant” difference between the groups. The overwhelming majority of studies that use a p-value auto-default that threshold to 0.05, meaning “There is no greater than a 5% chance of getting results at least as extreme as the ones we got if there were actually no difference between the control and experimental groups.”
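If the definition feels slippery, it can help to compute one by hand. Here is a minimal sketch of a permutation test, which is one of several ways to get a p-value and not necessarily what any given study used. The function name, the shuffle count, and the sample measurements are all made up for illustration.

```python
import random
from statistics import mean


def permutation_p_value(control, treatment, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in group means.

    Shuffles the group labels many times and counts how often a random
    split produces a difference at least as large as the observed one.
    """
    rng = random.Random(seed)  # seeded so the estimate is repeatable
    observed = abs(mean(treatment) - mean(control))
    pooled = list(control) + list(treatment)
    n_control = len(control)

    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled_diff = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if shuffled_diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


if __name__ == "__main__":
    # Hypothetical measurements, invented for this example.
    control = [12, 15, 11, 14, 13, 12, 16, 14]
    treatment = [14, 17, 13, 15, 18, 14, 16, 17]
    p = permutation_p_value(control, treatment)
    print(f"estimated p-value: {p:.3f}")
    print("below the 0.05 convention" if p < 0.05 else "not below the 0.05 convention")
```

Notice what the function answers: "how often would shuffled labels look at least this different?" It does not answer "how likely is it that the difference is real?"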
Now, gentle reader, how much that value means depends on things like the statistical power of the study and the number of comparisons we’re making. But even when we do use it, it is critical for us to remember what it is: an arbitrary number we picked to imperfectly represent the truth in the aggregate. It is not, itself, the truth. As such, it certainly should not be used as the only factor in prescribing a course of action to an individual. But people, policymakers, and even scientists do this constantly.
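To make the "number of comparisons" point concrete, here is a small sketch of how quickly a 0.05 threshold stops meaning "5%" once we run more than one test. It assumes the comparisons are independent, which real analyses often are not, and the function name is mine.

```python
def chance_of_any_false_positive(alpha, n_comparisons):
    """Probability that at least one of n independent tests dips below
    alpha purely by chance, when there is no real effect anywhere."""
    return 1 - (1 - alpha) ** n_comparisons


if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        rate = chance_of_any_false_positive(0.05, n)
        print(f"{n:>3} comparisons: {rate:.1%} chance of a spurious 'significant' result")
```

With twenty independent comparisons at 0.05, the chance of at least one spurious hit is roughly 64%. Same threshold, very different meaning.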
Before right on red was made federally legal in 1980, policymakers reviewed a study on its safety. The study’s conclusion: more people do get killed in car crashes when right on red is legal, but it’s not a large number relative to the total population of car crash victims. It was not “statistically significant,” and therefore, as far as the policymakers understood, nonexistent.
Even though, like, come on. Of course it exists. You and I have seen it with our own two eyeballs! A driver approaches a red light, only looks left (for traffic), turns right, and nails (or, in most cases we’ve witnessed, thankfully just almost nails) a pedestrian.
I have watched this happen at intersections as a pedestrian. When I was 17 I witnessed my father do this to a pedestrian while I was in the car. When I was in my early twenties, my bike ended up under a Miami mom’s minivan over this (I jumped off). When I was in my late twenties, I got in a screaming match with a driver in the middle of Chicago over this. He yelled at me to wait my turn, so I sat on his hood and pointed at the bright white “walk” symbol marking my path across the street. (In my defense, I’d had a really bad day already.) All of these things absolutely happened, regardless of whether someone’s interpretation of some numbers says they happened or not. It is patently absurd to assert that a thing that happened to me at least three times before I turned thirty “does not, statistically speaking, happen.”
I confess that I, as a data scientist, think about this kind of thing every time a doctor is like “you, personally, definitely don’t need to do XYZ, because studies were done and the effects were not statistically significant.” That is not how that works. That is absolutely not how that works. Call out your doctor on this. Tell them I sent you.
Because a somewhat more accurate interpretation of “not statistically significant” than “does not exist” would be something like “We decided that the number of cases that this causal factor does not explain falls under a percentage of the total that we arbitrarily deemed acceptable.” Sounds a lot less sciencey and definite, huh? It is less definite. But that’s because the answer was never definite in the first place.
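Here is a sketch of how a perfectly real effect can routinely come back "not statistically significant." It uses the normal approximation for a two-sided, two-proportion z-test. The rates and sample sizes are hypothetical numbers I picked for illustration; they are not taken from the right-on-red study or any other study.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(rate_a, rate_b, n_per_group, alpha=0.05):
    """Approximate chance that a two-sided, two-proportion z-test reports
    p < alpha when the true rates really are rate_a and rate_b."""
    std_normal = NormalDist()
    standard_error = sqrt(
        rate_a * (1 - rate_a) / n_per_group + rate_b * (1 - rate_b) / n_per_group
    )
    shift = abs(rate_b - rate_a) / standard_error  # true effect, in standard errors
    critical = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96 when alpha is 0.05
    # Add up both tails of the rejection region.
    return std_normal.cdf(shift - critical) + std_normal.cdf(-shift - critical)


if __name__ == "__main__":
    # Hypothetical: a real 20% relative increase in a rare event (1.0% -> 1.2%).
    for n in (2_000, 20_000, 200_000):
        power = approximate_power(0.010, 0.012, n)
        print(f"n = {n:>7} per group: about {power:.0%} chance of reaching p < 0.05")
```

Under these made-up numbers, the study with 2,000 per group misses the effect the large majority of the time. The effect did not go anywhere. The study just wasn't big enough to see it.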
For AGGREGATE decisions, statistics are useful.
But shaky statistics like a p-value do not, under any circumstances, justify recommendations in an individual case all on their own.
Regardless of what the numbers say, there is a very clear story of how a pedestrian gets hit by a car. That still HAPPENS, regardless of the p-value threshold that analysts picked in MODELING whether that happens.
Similarly, a best practice can be useful in most cases, and it is STILL important for developers to understand what those cases are and whether THIS case is one of them.
Because when the system goes down at 3 AM, customers and stakeholders will not be impressed by “The Design Patterns book said to do it this way.”
We’re still responsible for the choices we make.
A rigorous engineer can justify a choice with “Our use case has X needs and Y vulnerabilities. This approach has A benefits and B risks. X lined up with A and Y didn’t line up with B, and that’s why I chose this.”
That is, engineers need to know the CONTEXTS in which a “best practice” is in fact the right practice…or at least, they need to be able to tell whether THEIR context is one of them.
I give a talk about the idea of “assumed context” and, broadly, how to detect the assumptions hiding in the decisions we make as engineers. I hid the idea in a title that sounds like a tech conference cocktail hour topic, but that’s just to nerd snipe programmers into listening to me. I promise, if this post spoke to you, you’ll find something useful in my talk about What Counts as a Programming Language.
Other stuff I wrote that you might like:
Quantitative Programming Knife Skills Part 1, about the shortcomings of things like p-values and the importance of understanding them
Quantitative Programming Knife Skills Part 2, about the methodological traps to watch out for in analyzing data
Why compilers don’t correct “obvious” parse errors, which is ultimately about how extrapolating from an individual case to the aggregate, much like interpreting aggregate results to command individual behavior, is not a brilliant idea