The other day, someone on Mastodon asked about my thoughts on code coverage tools. This isn’t a topic I have discussed directly on the blog before, so I thought I’d share a post elaborating on my response to that person.
Before we begin: I do not provide what I think this person was angling for, which is a ranking of specific code coverage tools. You generally won’t see that from me, and here’s why: I’m very vain about my blog posts. I don’t like it when my blog posts are wrong. A ranking of code coverage tools, in addition to being language-specific (not my jam) and requiring me to assume other people’s priorities in a tool (also not my jam), changes as the libraries evolve in every single language community. Any post on the subject would be out of date within months, and I want my blog posts to more or less stay right over long periods of time.
So instead what I’m gonna do in this post is offer you tools to evaluate your own code coverage tool options for your own language community and, more importantly in my opinion, make informed decisions about when and how to use a code coverage tool in the first place.
Evaluating a Code Coverage Tool
Code coverage tools, canonically defined, provide developers with a “score” indicating what proportion of the lines in their implementation code are exercised by the test suite.
Here’s how a code coverage tool typically operates:
1. It runs your test suite.
2. Every time a line of code in your implementation is run while running your test suite, it marks that line as “run.”
3. It offers a “score” at the end in which the numerator is the number of lines in your implementation code that got marked “run” and the denominator is the number of lines it thinks are in your implementation code.
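To make those three steps concrete, here is a minimal sketch in Python. It is a toy, not how any particular coverage tool is implemented: it watches a single function using the standard library's `sys.settrace` and `dis.findlinestarts`, and the names `measure_line_coverage`, `classify`, and `tiny_test_suite` are made up for illustration. It also assumes the function body does not share a line with its `def`.

```python
import dis
import sys


def measure_line_coverage(func, run_tests):
    """Return (score, missed_lines) for `func` after calling `run_tests()`."""
    code = func.__code__

    # Denominator: lines the interpreter maps bytecode to, minus the `def` line.
    executable = {
        line
        for _, line in dis.findlinestarts(code)
        if line is not None and line != code.co_firstlineno
    }

    run = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore frames that are not the function under test
        if event == "line":
            run.add(frame.f_lineno)  # step 2: mark this line as "run"
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        run_tests()  # step 1: run the test suite
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active

    run &= executable
    score = len(run) / len(executable)  # step 3: lines run / lines counted
    return score, sorted(executable - run)


def classify(n):
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"


def tiny_test_suite():
    # Deliberately never passes a negative number.
    assert classify(5) == "positive"
    assert classify(0) == "zero"


score, missed = measure_line_coverage(classify, tiny_test_suite)
print(f"{score:.0%} of lines run; missed line numbers: {missed}")
```

The one line the toy suite never reaches is the `return "negative"` branch, so four of the five body lines get marked. Notice how much of the result depends on what lands in the denominator: this sketch already had to decide that the `def` line doesn't count.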
Quality metrics for a tool that does this would focus on its accuracy and its usability. Here are five example metrics: three about the tool’s implementation and two about its API, one of which touches on its performance:
1. Does it find your test suite correctly? Does it catch the unit, the integration, and any other big-loop tests? Is that what you want?
2. Does it mark lines as “run” correctly? (Most of the well-used ones…uh, should)
3. Does it count “total lines” correctly? Does it leave out config, imports, stuff you can’t test?
4. Can it be configured to pass 1, 2, and 3?
5. Is it convenient enough to run that people will actually run it?
The Risk with Code Coverage Tools
When I was just a baby programmer, Coraline Ada Ehmke paired with me on my very first test-driven piece of functionality. In that session, we discussed the inherent risk in a code coverage tool—any code coverage tool—which is not a technical or implementation-related risk, but rather a psychological one.
Code coverage tools provide a numerical score.
This is dangerous.
Numerical scores are like catnip to programmers who tend to treat such simplifications as optimizing metrics, increasing (or decreasing) them ad infinitum in pursuit of their theoretical platonic ideal. Usually, that theoretical ideal extends far beyond the point of practical utility: at best, it wastes a bunch of developer time. At worst, it breaks things or makes them harder to maintain. For example, I have read many a block of code made inscrutable, brittle, or even buggy in pursuit of making it faster, because all code takes some amount of time to run, which means there’s always a number available to try to shrink in pursuit of faster. That’s true forever, long after there is absolutely zero added benefit for the client to this code being faster.
With code coverage tools, programmers can completely lose sight of the actual goals of the test suite, like change resilience or documentation, because OOOHH SHINY NUMBER MUST EMBIGGEN
The solution that Coraline and I discussed at the time was to explicitly expect and shoot for a number more like 90% than 100% coverage. Now, 90% is sort of a made up number; it doesn’t account for differences in a team’s stack, subject matter, or situation.
But here’s what that number (or any number) accomplishes: it establishes code coverage as a satisficing metric (one to be made “good enough” and then left alone in pursuit of other goals) rather than an optimizing metric to be extremified at all costs to the code base’s other qualities. The number 90 has just, sort of, empirically demonstrated itself to work as a starting point. In programming you can usually read “empirical” as “in seasoned practitioners’ practical experience.”
Refining from the Number 90
And I do think that reframing the code coverage number as a satisficing metric rather than an optimizing metric helps to combat some of its pitfalls, but I think that establishing a number and moving on is still missing the whole story. If my code coverage tool is, in fact, running on a test suite for a class with no dependencies, integrations, or non-deterministic calculations, honestly? My code coverage SHOULD be 100%. If my thing is purely piping between two other APIs, or it’s purely an automated predictor that gets updated in prod off of past values, then 90% is likely to be too high for practical use.
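Here is one way that idea could look in code: a rough Python sketch of a coverage check that answers “good enough?” with a yes or no instead of inviting anyone to push the number higher. The component kinds, the `TARGETS` table, and the function name are all hypothetical, and the 0.70 figure for glue code is a placeholder I made up for illustration, not a recommendation.

```python
# Hypothetical per-component targets. The numbers are illustrative only;
# a team would pick its own based on its stack and risk tolerance.
TARGETS = {
    "pure_logic": 1.00,  # no dependencies or non-determinism: 100% is fair
    "default": 0.90,     # the general starting point
    "api_glue": 0.70,    # mostly piping between other APIs (assumed value)
}


def coverage_is_good_enough(component_kind: str, measured: float) -> bool:
    """Satisficing check: True once the target is met, with no credit for more."""
    target = TARGETS.get(component_kind, TARGETS["default"])
    return measured >= target


print(coverage_is_good_enough("pure_logic", 0.97))  # below its 100% target
print(coverage_is_good_enough("api_glue", 0.75))    # above its assumed target
```

The function returns a boolean on purpose. Once a component clears its target, the check has nothing more to say, which leaves the team free to spend its attention on the test suite's other goals.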
So now what I would say is, a code coverage tool can be a valuable part of a holistic approach to test effectiveness, which starts with a meaningful set of heuristics for evaluating that effectiveness.
Now, the truth is, Rebecca Wirfs-Brock (yes, that Rebecca Wirfs-Brock) is, like, the person to ask about delineating those heuristics. However, if my teammate came to me and asked me to do this and I didn’t, tragically, have the opportunity to ask Rebecca [1], my first pass would have the following heuristics in it:
2. If my suite PASSES, what’s the risk that we’re still shipping a broken thing?
Now, “risk” is complicated. I usually assess risk on three metrics: “how bad is it broken,” “how likely is it that it’s broken”, and “how likely is it that we don’t catch the brokenness.” I wrote about that in this piece over here (skip down to Step 3 in that blog post).
Are those three risk metrics inside the team’s risk tolerance? That’s a useful testing heuristic.
I’m’a stop here and say that it is largely in service of this heuristic that code coverage tools come in handy. A lower code coverage number can be a sentinel metric for heuristic #2 falling outside the team’s risk tolerance.
3. When my test suite FAILS, how often does that actually indicate a problem?
The ideal answer here is “always,” because the more often a failure turns out to be a false alarm, the more it impacts heuristic 1, “people stop using the tests.” Depending on the team’s circumstances, some incidence of false alarms might be acceptable.
4. When a developer is asked to maintain, fix, or modify something without context on it, how useful is the test suite to help them gather the context they need to do that job well?
This one is critical to me on my teams. Forensic software analysis is expensive, slow, discouraging for most, very rarely taught, and something we all end up having to do because we all suck at context transfer. My tests better lighten that load.
For more on forensic software analysis, see this piece right here.
I’d start with those 4 heuristics and then maybe add more as we live in our code base and find more, the same way bars tend to open with a “3 RULES” sign and twenty years later they have 11 rules and the last 8 are things like “Don’t bring an alligator in this bar because one time Futzo did that and it was a disaster.”
So where does that leave us? Code coverage tools: Evaluable on their effectiveness with about 5 metrics, and useful as a sentinel score within a broader framework of heuristics to judge the effectiveness of a test suite. But they’re not, by themselves, a serviceable proxy for test or code quality. That requires a more robust framework with evaluation heuristics that reflect the team’s goals.
[1] Luckily for you, I can ask Rebecca. So I chatted with her about testing heuristics (notes here) ;).
Other stuff I wrote that you might like:
Quantitative Programming Knife Skills Part 1, about the shortcomings of things like p-values and the importance of understanding them
Quantitative Programming Knife Skills Part 2, about the methodological traps to watch out for in analyzing data
Why compilers don’t correct “obvious” parse errors, which is ultimately about how extrapolating from an individual case to the aggregate, much like interpreting aggregate results to command individual behavior, is not a brilliant idea