A few weeks ago, someone on Mastodon asked a question about code coverage tools. I answered in a tootstream (gosh, we sound ridiculous when we talk about what we did on the internet, don’t we?) and later expanded on my answer in a blog post. On the topic of identifying heuristics, I mention the legend herself, Rebecca Wirfs-Brock. She and I got the chance to actually have a conversation about this, and I thought I’d share some notes.
When we began, she asked me this question:
“What do you know about testing that you wish other people knew?”
“Because obviously there’s something.”
I confess I do sometimes wish more folks thought about testing the way I do (though I suspect that’s true of anyone with an opinion). I won’t purport to be the only voice on any perspective about it, but one thing that does disappoint me about testing discourse is the way testing gets blithely treated as a panacea.
Don’t get me wrong: tests can be a wonderful, valuable vehicle for information. But somehow a lot of industry educational materials have come to the conclusion that testing is the cure for all that ails us. Brittle features? A few tests’ll make ’em resilient. Poorly documented? Throw in some tests. I have even heard the assertion that we can define legacy code by its tested-ness: legacy code is untested code. I’m not sure whether that aphorism’s purveyor meant to imply that the converse is true and tested code can’t be legacy code, but that’s absolutely the vibe I get from a lot of education around testing and I don’t agree with it.
The issue with that is that eventually programmers run into code issues that tests can’t fix—or worse, code issues that tests exacerbate. And that’s not at all what they were sold. By selling tests as a cure-all, we set people up to become cynical about them. We fail to elucidate the contexts in which they help the most and then, when they don’t help, programmers start to question their efficacy all the time.
Before I move on, a brief disclaimer:
This is not an introductory software testing post.
The goal of most such introductions is to get people who are not writing tests at all to write some tests. They do so by drastically simplifying the context surrounding writing tests, which is an appropriate and reasonable pedagogical technique.
What happens after that is that the simple, tested system becomes a more complex system, or the programmers move on to a more complex system, and the simplified context from the intro book no longer applies absolutely everywhere. The special contexts that now become relevant were not covered in the intro to testing books because they would be overwhelming there, but they become more and more prominent as systems become complex.
Applying testing techniques where they do not fit produces substandard outcomes that make programmers jaded about testing as a whole. As they become jaded, their patience for dealing with tests starts to shrink. Sometimes they identify other means of verifying their functionality that work better in their use case than an automated test, and they use that instead of a test. This is fine. They then fail to communicate what they did or how it worked to the rest of the team. Not fine. Or they determine that the risk of the thing breaking is not great enough to justify mucking with tests based on their cost-benefit analysis. And that’s how we still have so many untested new systems, despite every techie in leadership who doesn’t touch the code that much anymore yelling in public about how they’d sacrifice their strongest ram to the gods of testing or whatever.
This post talks about some of those mismatches. It’s not an indictment of testing as a practice. If you’re new to testing, please do not defenestrate all your unit tests over reading this post.
Do you need tests that you keep having to change?
Much of the introductory testing literature indicates that, if you keep having to change a test, then the code is written incorrectly or that the test is written at “too low a level.”
Level height is a term that I think we overload in software engineering: here, we’re discussing scope. At the lowest level, a unit test covers an individual function or exposed method on an individual class. At the highest level, a system test or system simulation checks that all the components are working together in concert.[1]
Introductions to testing generally focus pretty heavily on the smallest scope test: the unit test. The thing is, the unit test is precisely the type of test that is at highest risk of needing to change as the system changes. Why? Because it’s the closest to the system implementation. We say that tests should test behavior and not implementation, and what we mean by that is that tests should make sure the system does what it’s supposed to without worrying about how the system does it. When the whole Intro To Testing literature uses unit tests, the example here looks like “Watch this test stay the same while we change how the example function adds 2 + 2.” The signature of the function is treated as the behavior and its body is treated as the implementation. At a unit level, that’s true.
At a system level, it’s not. Individual functions—units—are the “how” of systems, and their signatures change as we change how a system accomplishes its goals. We move behaviors around between functions. We split them up and combine them together. The system retains its original behavior, but the implementation details move. If the unit tests are exercising the specifics of which parts of the implementation live behind each signature, then the unit tests covering the functions with those signatures have to change.
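Here is a minimal sketch of that difference. Every name in it (`normalize`, `tokenize`, `clean_tokens`, `search_terms`) is invented for illustration, and the "system" is deliberately tiny. The point is only that a check on the entry point survives a refactor that moves work between units, while a check pinned to one unit's signature does not.

```python
# Hypothetical example: all names here are invented for illustration.

# Version 1: two units behind one system-level entry point.
def normalize(text):
    return text.strip().lower()

def tokenize(text):
    return text.split()

def search_terms(raw_query):
    return tokenize(normalize(raw_query))

# Version 2: same system behavior, but the work has moved between units.
# normalize no longer exists; its job was folded into a new helper.
def clean_tokens(text):
    return [token.lower() for token in text.split()]

def search_terms_v2(raw_query):
    return clean_tokens(raw_query)

# A system-level check only cares about the outcome, so it holds for both.
def check_system_behavior(entry_point):
    assert entry_point("  Red SHOES ") == ["red", "shoes"]

check_system_behavior(search_terms)
check_system_behavior(search_terms_v2)

# A unit-level check on version 1 is pinned to a signature that version 2
# does not have, so it would need rewriting after the refactor.
assert normalize("  Red SHOES ") == "red shoes"
```

Whether the unit-level check is worth keeping depends on whether `normalize` matters in isolation, which is exactly the judgment call the next paragraph describes.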
Maybe there’s some reason that it’s important to test the behavior behind each of these individual signatures in isolation. But when there isn’t, teams end up avoiding useful changes because all the unit tests are going to fail.
Michael Feathers talks about this in Working Effectively with Legacy Code. He recommends drawing effect sketches of untested systems and identifying pinch points where several critical components zip together in the outcome of one function. He recommends testing that function to quickly get several components under test. When those tests record what the code currently does rather than what it ought to do, they are what he calls characterization tests.
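As a rough sketch of testing at a pinch point: the invoice code below is hypothetical, made up to stand in for legacy code, and is not taken from the book. Parsing, arithmetic, and formatting all meet in `format_invoice`, so one test there exercises all three, and the assertion records what the code does today.

```python
# Hypothetical legacy code: names and behavior are invented for illustration.
def parse_line(line):
    name, quantity, price = line.split(",")
    return name.strip(), int(quantity), float(price)

def line_total(quantity, price):
    return quantity * price

def format_invoice(lines):
    # The pinch point: parsing, arithmetic, and formatting all meet here.
    rows = [parse_line(line) for line in lines]
    total = sum(line_total(quantity, price) for _, quantity, price in rows)
    names = ", ".join(name for name, _, _ in rows)
    return f"{names}: {total:.2f}"

def test_format_invoice_characterization():
    # Record what the code does today, not what a spec says it should do.
    observed = format_invoice(["widget, 2, 3.50", "gadget, 1, 10.00"])
    assert observed == "widget, gadget: 17.00"

test_format_invoice_characterization()
```

If someone later moves the rounding into `line_total` or merges `parse_line` into the caller, this test keeps passing as long as the invoice text stays the same.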
Working Effectively with Legacy Code is not an “Intro to Testing” book. It’s a “So You’ve Inherited an Untested Behemoth of a Code Base” book. But the thing is, part of the way we get an untested behemoth of a code base is by presenting people with a testing strategy that fails to account for the ways that unit testing loses universal utility as collections of individual functions coalesce into larger and larger systems.
“The decision not to test something in isolation is placing a bet.”
Though I’ve said something similar in the past—specifically about a popular refactoring technique—this isn’t me. Rebecca explains her approach to testing with the idea of bets. Her testing strategies sometimes include using tests to get functionality working, and then, once that functionality is working, tossing the tests. In the particular circumstance where this code isn’t going to change, tests create a larger surface area of code to maintain in exchange for little information, since they’re asserting the absence of regressions from code that stays static. The bet here is that maintenance on these tests would cost more than it’s worth in verification. We tend to elide that analysis in testing by insisting that we always need tests, then leaving developers to wallow in their inner turmoil when they encounter situations where they’re experiencing more cost than benefit.
Another circumstance that foregrounds the tradeoff of test development? When the thing that makes the code easy to break is what’s making the code hard to test. When something is hard to test—say it’s a piece of code that integrates with a flaky third party API—that tells us something about how it might break. And the way to guard against that might not lie in the unit test. Maybe we’re running periodic integration tests whose whole job is to make sure somebody else’s code works before we count on it. Or maybe we’re introducing our own fallbacks such that if this third party API never worked, our system would have a way around it. Heck, maybe the solution here involves moving away from the third party altogether! And what about the converse—when the thing that makes the code hard to break is what’s making it easy to test? Maybe it’s a low-level operation that relies exclusively on a simple combination of already-well-verified components.
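To make the fallback idea concrete, here is one possible shape for it. Everything here is an assumption for the sake of the sketch: `ThirdPartyUnavailable`, `always_down`, `cached_rate`, and the exchange-rate value are all made up, and a real integration would have its own error types and retry policy.

```python
# Hypothetical sketch: always_down stands in for a flaky third-party call.
class ThirdPartyUnavailable(Exception):
    pass

def with_fallback(primary, fallback, attempts=2):
    """Wrap primary so it is tried up to `attempts` times before falling back."""
    def call(*args, **kwargs):
        for _ in range(attempts):
            try:
                return primary(*args, **kwargs)
            except ThirdPartyUnavailable:
                continue
        return fallback(*args, **kwargs)
    return call

def always_down(currency):
    # Simulates the third party never working at all.
    raise ThirdPartyUnavailable("simulated outage")

def cached_rate(currency):
    # Invented value; a real fallback might read the last known good result.
    return {"EUR": 1.1}.get(currency)

get_rate = with_fallback(always_down, cached_rate)
assert get_rate("EUR") == 1.1
```

The part that is easy to unit test here is the wrapper. The part that is hard to test, the third party itself, is handled by design rather than by a test.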
How do we test this? Maybe we don’t. Rebecca describes a minimalist style of programming as one where you don’t do something if it isn’t buying you information. For example, suppose our test suite takes eight minutes to run, with a bunch of verifications in there for things that have always worked. Why are we always running tests that always pass? Maybe we can save that for a pre-release step, and keep our development suite a fast feedback loop of the kind that prompted us to introduce tests in the first place. Taken further, maybe we can focus our use of each of our tests on the circumstance in which it shines. What we need a test to do differs depending on whether we are relying on this particular test chiefly for guidance in writing/designing the code, checking it for regressions, or documenting code for other developers.
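One way to split a suite along those lines, sketched with the standard library's `unittest`. The `PRE_RELEASE` environment variable and the `pre_release_only` decorator are names I am making up here, not an existing convention, and the two tests are placeholders.

```python
import os
import unittest

# Assumption: a PRE_RELEASE environment variable (hypothetical name) marks
# full runs. It is read once, when the test class is defined.
def pre_release_only(test_item):
    return unittest.skipUnless(
        os.environ.get("PRE_RELEASE") == "1",
        "stable check, saved for the pre-release run",
    )(test_item)

class SearchTests(unittest.TestCase):
    def test_recent_change(self):
        # Runs on every development loop.
        self.assertEqual("Shoes".lower(), "shoes")

    @pre_release_only
    def test_long_stable_behavior(self):
        # Skipped during development, run before a release.
        self.assertEqual(sorted([3, 1, 2]), [1, 2, 3])

def run_suite():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SearchTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)

if __name__ == "__main__":
    run_suite()
```

During development the stable check is reported as skipped rather than silently dropped, so the team can still see that it exists and when it last ran.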
And then there’s the non-deterministic code.
As we turn further toward data automation in our implementations, these cases will appear more often. Here, we’re testing code whose outcome might differ across runs with identical inputs. I spend a lot of time, as a data engineer working on the sanitation of large volumes of search data, thinking about the gold standard for verifying that data.
The goal here is not necessarily to test the input-to-output path. It is instead to test the assumptions that drove our original design. Suppose that our search sanitation strategy is predicated on search terms being largely in English or maybe Spanish. That works great as long as the storage of sanitized data lives in a beta test group populated from United States search customers. What happens when that sanitation algorithm performs well, and the product team decides to open it up to global use? One of the assumptions that made this algorithm effective has now become obsolete, and my verification system better catch it. But I’m not gonna be able to catch this by individually verifying a scattershot sample of search terms that a global audience might possibly submit. I need something more holistic. More on that in a future post.
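As a small taste of what checking an assumption (rather than an output) can look like, here is a sketch. It is not the holistic approach promised above. The Latin-script check is a crude stand-in for real language detection, the 0.9 threshold is an invented number, and the sample batches are made up.

```python
import unicodedata

# Hypothetical sketch: "mostly Latin script" is a rough proxy for
# "mostly English or Spanish", not real language detection.
def is_latin_script(term):
    letters = [ch for ch in term if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("LATIN") for ch in letters
    )

def latin_share(terms):
    if not terms:
        return 1.0  # an empty batch gives no evidence against the assumption
    return sum(is_latin_script(term) for term in terms) / len(terms)

def assumption_holds(terms, minimum_share=0.9):
    # minimum_share is an invented threshold, to be tuned per system.
    return latin_share(terms) >= minimum_share

beta_batch = ["red shoes", "zapatos de niño", "rain jacket", "cafetería"]
global_batch = ["red shoes", "красные туфли", "赤い靴", "أحذية حمراء"]

assert assumption_holds(beta_batch)
assert not assumption_holds(global_batch)
```

The check says nothing about whether any single term was sanitized correctly. It only reports when the population of inputs has drifted away from what the design took for granted.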
In the meantime, we need a mechanism to evaluate test utility in a given circumstance.
That is, we need a series of heuristics to use. Rebecca introduces three kinds of heuristics:
- Values. These are context-dependent. A value might be minimalism, for example. We can turn back to these when we’re making decisions to help us choose the path whose tradeoffs best match the things we say we care about the most.
- Triggers to action with no guarantee. A commitment not to DRY up code until we see it in three places, for example. This might compete with context: maybe there are cases where we don’t want to use this rule. It helps to know what those are.
- Heuristics that drive other heuristics. I have a risk analysis procedure for development teams where I ask folks to rate the risks in their system on a binary classification system. They choose whether a risk is catastrophic or not, whether it’s likely or not, and whether it’s insidious or not. And then they roughly prioritize risks based on how many of these binary labels each risk has. The binary classification system informs the prioritization system, and each is a rough proxy for reality designed to give otherwise unmoored developers a direction.
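The risk analysis in that last item can be sketched in a few lines. The `Risk` class, the example risks, and the tie-breaking choice (ties keep the order they were listed in) are my own illustration here, not a fixed part of the procedure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Risk:
    description: str
    catastrophic: bool
    likely: bool
    insidious: bool

    @property
    def score(self):
        # Count how many of the three binary labels apply (0 to 3).
        return sum([self.catastrophic, self.likely, self.insidious])

def prioritize(risks):
    # sorted is stable, so risks with equal scores keep their listed order.
    return sorted(risks, key=lambda risk: risk.score, reverse=True)

# Invented example risks.
risks = [
    Risk("typo in help text", False, True, False),
    Risk("silent data corruption in sync job", True, False, True),
    Risk("checkout double-charges customers", True, True, True),
]

for risk in prioritize(risks):
    print(risk.score, risk.description)
```

The scores are deliberately coarse. They are meant to give a team a rough ordering to argue about, not a precise measurement.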
Much of the utility of heuristics comes from providing direction in slippery situations where it’s easy to become unmoored.
The heuristics we choose for a given testing strategy depend heavily on both the values we bring to the work and the context in which that work is happening. They’re likely to differ from project to project. But “always test everything, preferably before it’s written” is not, empirically, working for us. So there’s something to be gained in finding a set of heuristics that better matches the needs of the system before its developers lose faith in the testing protocols and operate on an ambiguous alternative to our stated testing goals.
[1] There are also special varieties of test. A smoke test specifically checks for the critical paths necessary to christen a release for a broader phase of testing. A performance test typically checks on an operation’s resource requirements (like time, space, or memory) and fails if the operation requires more of that resource than some acceptable threshold. An integration test specifically checks an interaction between a component and something else, typically a third party API. People sometimes also use the term “integration test” in place of what I have called a “system test” above, which sometimes generates confusion if the interlocutors are using two different definitions.
If you liked this piece, you might also like:
This piece I wrote about abstraction gradients and their utility in software design
This piece about isolating the riskiest areas of a code base and testing there
This talk I gave about assumed context—complete with a provocative title
When I was running Test Clinics for the Salt Project, I used the idea of “meaningful tests” – which sometimes meant unit tests and other times meant integration tests.
I also found it incredibly valuable to define what unit/integration/etc tests meant because if you ask 3 different programmers to define a unit test you’ll get 4 different definitions (: