We tend to divide automated tests into two categories: unit tests and integration tests.
We maximize unit test coverage during the development cycle and then, once the code is written, we fit a few integration tests to make sure everything stays together.
But sometimes we can get a better return on our time and effort by identifying the riskiest parts of our code and prioritizing tests to fit that risk.
*Erratum: the drawings in this post show some code that says ActiveRecord::Receipt.all. At that point we are using the ActiveResource library, not ActiveRecord, so the code should say ActiveResource::Receipt.all.*
Let’s look at an example. Suppose we work for WalletPal, a website that helps people manage their receipts. The original build is a monolithic Rails app that stores people’s receipt data in a relational database. WalletPal would like to migrate the receipt data into a document database inside a new app, so the frontend on the original app will fetch the migrated data via an HTTP API. We’re in charge of rewriting that data layer—and the frontend itself should not change.
We extract the ActiveRecord calls to the relational database out of the controllers and into a data source class, which we namespace with the term Repositories. We write another class with the same interface, and we namespace that one Services. We switch out the repository for the service class when it’s time to get the data from the API app instead of from the local database. Once we have switched all our data dependencies to the new API, the receipts table in the relational database will be deleted.
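The post’s original diagrams aren’t reproduced here, but the shape of that swap can be sketched in a few lines. Everything below is illustrative (the stand-in model classes, the keyword-argument injection, the exact class names); it is not WalletPal’s actual code:

```ruby
# Stand-ins for the real models so the sketch runs on its own.
class Receipt
  def self.all
    [:local_receipt]       # an ActiveRecord query in the real app
  end
end

module Api
  class Receipt
    def self.all
      [:api_receipt]       # an ActiveResource call in the real app
    end
  end
end

module Repositories
  class ReceiptsDataSource
    def receipts
      Receipt.all
    end
  end
end

module Services
  class ReceiptsDataSource
    def receipts
      Api::Receipt.all
    end
  end
end

class ReceiptsController
  # Injecting the data source makes the switchover a one-line change.
  def initialize(data_source: Repositories::ReceiptsDataSource.new)
    @data_source = data_source
  end

  def index
    @data_source.receipts
  end
end
```

Because the two data sources share an interface, the frontend never needs to know which one is behind it.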
How do we test through this fairly large refactor?
A Popular Option: Unit TDD
In my experience, the most common testing approach is to maximize coverage with unit tests as we’re writing our code.
Here is our repository unit test.
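The spec itself isn’t reproduced here. A minimal, self-contained sketch of what it might look like, with Minitest standing in for the app’s real test setup and a stub Receipt class standing in for the ActiveRecord model:

```ruby
require "minitest/autorun"

# Stub standing in for the ActiveRecord model so the sketch runs on its own.
class Receipt
  def self.all
    [:receipt_one, :receipt_two]
  end
end

module Repositories
  class ReceiptsDataSource
    def receipts
      Receipt.all
    end
  end
end

class RepositoriesReceiptsDataSourceTest < Minitest::Test
  # Asserts that the repository returns everything in the receipts table.
  def test_receipts_returns_all_receipts_from_the_database
    assert_equal Receipt.all, Repositories::ReceiptsDataSource.new.receipts
  end
end
```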
Here’s the service version. This test ensures only that the API call is made: it doesn’t check on parsing the response. It could, but the point of this example is not the unit tests, so we’re moving along.
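Again, the original spec isn’t reproduced here. A self-contained sketch of the idea, with a recording stub standing in for the ActiveResource-backed model (the real class would subclass ActiveResource::Base with its site pointed at the API app):

```ruby
require "minitest/autorun"

# Recording stub standing in for the ActiveResource model.
class ApiReceipt
  class << self
    attr_reader :last_path

    def all
      @last_path = "/receipts.json"   # ActiveResource appends .json to collection calls
      [:receipt_one, :receipt_two]
    end
  end
end

module Services
  class ReceiptsDataSource
    def receipts
      ApiReceipt.all
    end
  end
end

class ServicesReceiptsDataSourceTest < Minitest::Test
  # Only checks that the .json endpoint gets called; response parsing is untested.
  def test_receipts_calls_the_json_endpoint
    Services::ReceiptsDataSource.new.receipts
    assert_equal "/receipts.json", ApiReceipt.last_path
  end
end
```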
These tests demonstrate two pitfalls of unit tests.
1. Unit tests tend to test the framework as much as (sometimes more than) they test our code.
The Repositories::ReceiptsDataSource spec tests a method that calls Receipt.all, a classic Rails ActiveRecord call. This is basically testing the Rails framework; the test’s passing has little to do with our code at all. It does test that we call that line: if someone deleted the body of our method, this test would fail. But what is the risk of that? Probably low.
We see similar framework testing in our Services::ReceiptsDataSource spec. The spec ensures that we call an endpoint that ends in .json. Does that seem weird? It is. The API works fine when the call to the endpoint does not have .json on it. It’s there because of an idiosyncrasy in ActiveResource, a Rails library that generates RESTful API calls from an ActiveRecord-like interface. What we’re testing here is our call to ActiveResource more than any code of our own.
2. Unit tests micro-manage.
Because unit tests happen at a low level, they sometimes get sucked into testing our implementations rather than the outcomes that matter.
We see this, too, in our Services::ReceiptsDataSource spec. This test is, for the most part, testing the fact that we’re using ActiveResource. If we chose to make this network call in a different but equally valid way (for example, by using a more bare-bones network library), the method would still have exactly the same result, but this test would fail.
This kind of coupling makes it harder to refactor.
There’s another pitfall to unit tests: they’re relatively slow to drive out.
Red, add method signature, green.
Red, add functionality, green.
Red, add conditional clause, green.
Red, add another condition, green.
Refactor to ternary.
Yes, it’s very meditative. But it’s also slow.
I’m not saying that unit tests aren’t worth the time. If the options are to write no tests or to write unit tests, write the unit tests. That said, I want to identify tests that communicate more about my app while taking less time to build and maintain. The unit test specifically is not always the most appropriate type of test for us to reach for.
Another Option: Integration Tests
We could also write an end-to-end test that visits a URL on the WalletPal app and asserts that the appropriate receipts show up. This is a useful approach that circumvents some of the pitfalls of unit tests: we can write the implementation however we want and, as long as the page still works, such a test would pass.
Integration tests, for their own part, have two pitfalls:
1. The feedback loop is long.
Integration tests don’t give us any feedback about whether our system works until it’s completely finished. That makes it hard to move incrementally in the direction we want to go, because the tests continue to fail throughout the build, leaving us to re-run the app by hand to check our progress.
2. Integration tests flake.
Integration tests require a functioning configuration of all the moving parts, so they fail in environments without that setup. For example, maybe they fail locally because the local API app isn’t stood up; that wouldn’t affect prod, because the prod version is deployed and running. Maybe they fail because they’re hardcoded to run against prod. Maybe they fail because the web driver goes too slowly on the CI machine and the test times out (I have seen this in iOS apps). When there is a confluence of reasons a test could fail that have nothing to do with the code being wrong, our development team starts to re-run failing tests to check for determinism (which takes twice as long) or to ignore the outcomes of those tests (which invalidates the tests’ raison d’être entirely).
What if we could find an approach that fell somewhere in the middle—that is, an approach that used automated tests to define and catch the riskiest cases without micro-managing all the cases and without requiring a full buildout to show progress?
Step 1: Make a risk profile.
Here’s a skeletal representation of our original system:
Now let’s draw a picture of our system with the changes we’d like to make:
This diagram can help us visualize the risks associated with our refactor. Let’s go through our diagram and ask the question: what could go wrong here?
In pink, I have listed one or two things that could go wrong at each of several collaboration points within our diagram. Now, for each of these things that could go wrong, I want to answer three questions:
1. How bad would the worst case outcome be if this went wrong?
2. How likely is this to go wrong?
3. If this goes wrong, how likely is it to sneak through QA and deployment?
In the diagram, I have labeled each thing that could go wrong with a 1 if the outcome could be catastrophic, a 2 if it’s relatively likely to happen, and a 3 if it’s likely to sneak through QA and deployment.
Step 2: Plan automated tests for the riskiest items.
I’m focused on placing automated tests around problems that are somewhat likely to happen, somewhat likely to go uncaught, and somewhat catastrophic if allowed to stay wrong.
So, for example, “the API server went down” is somewhat catastrophic, but it’s not at all likely to go uncaught. It makes itself known the moment the engineer, the designer, or QA goes to prod to eyeball a new feature when the page shows no data. So an automated test that requires the collaborating server to be up is not that useful from a risk profile perspective. I tend to de-prioritize tests that focus on things like this.
Instead, I’m looking for the crocodile peeking out of the water: something with lasting consequences that could slip through the QA process.
Imagine, for example, if that server serves data that looks kind of valid but is, in fact, inaccurate: say, a weird character in the text of the receipt messes up the JSON body in the trip over to the new server, so the new server’s version of this receipt only shows the portion of the items that appeared on the receipt before the weird character. It still looks like a receipt, but it’s missing items. That could go uncaught unless QA is looking very closely. That kind of data equivalence is an excellent candidate for automated testing.
What are our riskiest items in this refactor? To me, here are the top two in order of risk:
1. Something gets messed up in the data import, so the document database version looks OK at first glance but is not, in fact, equivalent to the original relational data.
This is the highest risk because it is the least likely to get caught, and it can have lasting implications if customers use our inaccurate data to try to get tasks done. It also could mean permanent damage if, say, we don’t catch it before we delete the receipts table. At that point, our source of truth on the “correct” data is gone. We’ll write some tests to make sure that our data in the document database contains all the information we want from the relational database.
2. When we switch from the repository data source to the service data source, either the service data source does not have all the methods that the repository data source did and the page crashes, or the service data source’s methods serve the data in a different format than the repository data source did, and it looks different, doesn’t show up, or crashes on page load.
This is a lower risk because, upon visiting the page, development/QA is more likely to notice something like this than an insidious data issue. That having been said, it could get a long way through the QA pipeline and then be a pain to fix. Picture it: we load the page with the repository data source extracted from the controller calls to ActiveRecord. The page looks nice. Then we switch and inject the service data source. We expected it to work because all our unit tests passed. Suddenly, either on first page load or on loading a page with data that looks a little different from our happy-path test case, we realize that we extracted a method into the repository to get our local build to display the page, and we forgot to create an exact equivalent in the service. What if adding that method to the service necessitates a new endpoint in the service app? Now we’re talking about a big change that we didn’t discover until we tried to do the switchover.
So we’ll add tests for data equivalence and for API consistency between the Services::ReceiptsDataSource and the Repositories::ReceiptsDataSource.
Data Equivalence Tests
Let’s check our data in production to make sure that we have equivalent receipts in both places. To do this, we will use an attribute called number on the receipt. We could do this same thing with any and all receipt attributes, like text or total. We could even do it with the receipts’ foreign keys to other tables, like payers, payees, or payment_processors.
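The original test isn’t reproduced here. A self-contained sketch of the idea, under stated assumptions: Minitest in place of the app’s real test framework, a stubbed Receipt standing in for the ActiveRecord model, and a stubbed ApiReceipt standing in for the ActiveResource model (whose find would make one HTTP call per receipt in the real test):

```ruby
require "minitest/autorun"

# Stub standing in for the ActiveRecord model.
Receipt = Struct.new(:id, :number) do
  def self.find_each(&block)
    [new(1, "R-001"), new(2, "R-002")].each(&block)
  end
end

# Stub standing in for the ActiveResource model backed by the new API app.
class ApiReceipt
  MIGRATED = { 1 => "R-001", 2 => "R-002" }.freeze

  def self.find(id)
    # One HTTP call per receipt in the real test.
    Struct.new(:number).new(MIGRATED.fetch(id))
  end
end

class DataEquivalenceTest < Minitest::Test
  # For every receipt in the relational database, the migrated copy must
  # report the same number. The same pattern works for text, total, or
  # foreign keys like payers and payees.
  def test_every_receipt_number_survived_the_migration
    Receipt.find_each do |receipt|
      assert_equal receipt.number, ApiReceipt.find(receipt.id).number,
                   "Mismatch for receipt #{receipt.id}"
    end
  end
end
```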
This test might look a bit odd. It’s an integration test of sorts: It’s connecting to my API app, making requests, and comparing the results to the local database. Such a test will require the collaborating app to be up, but this test is not a full integration test in the sense that it does not go to a URL in my local app, pull up the page, look for things on the page, or press buttons in the UI. A test like this is not one that I would expect the team to run locally, or even in staging. This is a test that I would want to see run in prod before I would switch my app data source from the local database to the API.
This test will tell me about the equivalence of my data, and it’s a great case for automation because it can whip through the intricacies of every single one of my data rows much faster than QA could. Will this test take a while to run? Yes, it will: it makes n + 1 HTTP calls, with n being the number of receipts in my database. Keep in mind we’re not running this all the time: we’re running it in the particular case that we prepare to do something risky, like switch the data source that the app is using or delete the receipts table in the relational database. In that case, compared to no test, it is indeed slow. Compared to having QA manually check every record to get this same level of confidence? It’s blazing fast.
Data Source Interface Tests
A) The Method Signature Tests
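The original spec isn’t reproduced here. The idea is to assert that the service exposes every public method the repository does, so the swap can’t crash the page with a NoMethodError. A self-contained sketch (Minitest, with minimal stand-in classes):

```ruby
require "minitest/autorun"

# Minimal stand-ins; the real classes wrap ActiveRecord and ActiveResource.
module Repositories
  class ReceiptsDataSource
    def receipts; []; end
  end
end

module Services
  class ReceiptsDataSource
    def receipts; []; end
  end
end

class DataSourceSignatureTest < Minitest::Test
  # Any public method on the repository that the service lacks would crash
  # the frontend at the moment of the switchover.
  def test_service_exposes_every_method_the_repository_does
    missing = Repositories::ReceiptsDataSource.public_instance_methods(false) -
              Services::ReceiptsDataSource.public_instance_methods(false)
    assert_empty missing, "Services::ReceiptsDataSource is missing: #{missing}"
  end
end
```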
Yes, this approach feels fragile. In statically typed languages, we wouldn’t assert this with a test. Instead, we would have both data sources adhere to an interface, and then the code wouldn’t compile if the things that these tests are testing weren’t true.
That said, I still think the example is valuable because it exemplifies working inside of our constraints. We work for WalletPal, and WalletPal’s app is in Ruby. We don’t get to go to the CTO and complain that we could write prettier code for our feature if only the team had chosen Java or Swift. They didn’t, and here we are. So consider this an example of including an affordance to get the behavior assurance we want in a duck-typed language.
B) The Method Return Value Tests
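The original spec isn’t reproduced here either. The idea: pull the return value from each data source and assert that both behave the way the view needs, without requiring the data itself to match. A self-contained sketch under the same assumptions as above (Minitest, stand-in classes):

```ruby
require "minitest/autorun"

# Stand-ins for the two data sources; the real ones return an ActiveRecord
# relation and a parsed API response respectively.
module Repositories
  class ReceiptsDataSource
    def receipts
      [{ number: "R-001" }]
    end
  end
end

module Services
  class ReceiptsDataSource
    def receipts
      [{ number: "R-001" }, { number: "R-002" }]
    end
  end
end

class DataSourceReturnValueTest < Minitest::Test
  # The two results need not be equal (the data may differ), but both must
  # respond to .each, the method the view calls on the return value.
  def test_both_return_values_behave_like_collections
    [Repositories::ReceiptsDataSource.new, Services::ReceiptsDataSource.new].each do |source|
      assert_respond_to source.receipts, :each
    end
  end
end
```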
There are execution weaknesses in this test: if the per-result assertions fail, the failure message does not tell me for which of my results they failed. So if another developer has never seen this code before, the failure message does more to confuse than it does to elucidate. It’s worth noting that a couple of folks recommended expecting the first result to equal the second result. The thing is, I do not expect those results to be equal; the test should pass even if the data from the two sources were different. I want to assert that, regardless of the data itself, both results respond to certain requests, namely .each, the method we use on this return value in the view.
I favor an assertion that addresses specifically how we want the two return values to be equivalent, which wouldn’t necessarily be made clear in something like a Java interface. Swift protocols allow us to specify behaviors in return values rather than types, which is one of my reasons for thinking about Swift as a language to watch as it matures. But I’ll spare you my Swift evangelism in this post.
Unit tests help us drive out intra-class functionality and build confidence in our incremental changes. Integration tests help us ensure that our inter-class and inter-app configuration works as a system. But neither of these test types presents a panacea for helping us save time and avoid worry: unit tests are relatively slow to drive out, and they tend to end up testing our framework and micro-managing our implementation choices. Integration tests give us a long feedback loop, and they produce so many false positives that developers start ignoring their results.
Instead, we can look for opportunities to test the interaction between just a few collaborators—not one alone like the unit test, and not all together like the integration test.
Which parts of the system should have tests like that?
Here’s how we find out: we draw a diagram of our system, identify the things that could go wrong, and prioritize those problems that are riskiest based on how catastrophic they are, how likely they are, and how easily they could bypass a manual test. I would focus my effort on multi-class tests around the riskiest joints in the system. In so doing, I give myself the option to forego some unit tests in the individual classes covered by my risk tests.
This approach saves us some time, but more importantly, it gives us an opportunity to consider the risks present in our system as a whole. Then it allows us to write a test harness that mitigates the largest risks and communicates those risks to the rest of the team. This risk-focused perspective, over time, makes it easier for us and our teammates to spot and preempt the kinds of bugs that could become headaches later.