A Framework for Debugging

In the last debugging post, we repurposed a framework from Philosophy of Software Design to specifically address debugging. We also talked about some debugging tactics.

This time, we’ll float up a level and talk about broader debugging strategies. Before we begin, let’s review the critical cognitive error that we often make when we think about bugs, to our great detriment:

Key point: we need to acknowledge that, [when we are debugging], we do not already understand the behavior of our code. This sounds like an obvious detail, but we often get it wrong.

Failing to acknowledge that, most of the time, we do not understand the behavior of our code is precisely how we, as an industry:

  • do not have a unified praxis or pedagogy for debugging
  • force everybody to learn debugging skills by themselves from individual experience, as if we were all in solitary witch training like in Kiki’s Delivery Service
  • fail to build on one another’s work or research to get better at it
  • somehow don’t see a massive problem with that approach
  • while simultaneously wondering why all our software has so many bugs
  • and acting like it’s an endearing quality of software that it usually doesn’t quite work the way it should.

Here’s how we make debugging so hard for ourselves: when we start with the inaccurate assumption that we understand our code, we’re drawn to practices that align with that assumption but end up adding frustration to the debugging process. As we gain experience, we learn to ignore these temptations and do something else, even if we don’t fully understand why.

If we instead approach debugging from the starting perspective that we do not understand our code, we’re drawn instead to practices that work for debugging, rather than forcing ourselves to learn from experience to do the opposite of what we want to do based on an inaccurate assumption.

We’re going to see an example of this in just a second. Let’s go over three strategies for developing an understanding of our code’s behavior.

Strategy 1: The Standard Strategy (Change stuff in the spot where we think the bug is happening)

The most common debugging strategy I see looks something like this:

Debugging 1: Prioritizing changing code at the place we think the bug is most likely happening.

If we understand the behavior of our code, then this is often the quickest way to diagnose the bug. So it’s a useful strategy.

The problem arises when we don’t understand the behavior of our code and we keep repeating this strategy as if we do. We hurt our own cause by operating as if we understand the code when we don’t.

The less we understand the behavior of our code, the lower the correlation between the things we think are causing the bug and the thing that’s really causing the bug, and the weaker this strategy becomes. So we get this:

Debugging, Retrying things that don't work

Once we have established that we do not understand the behavior of our code, we need to switch to a different strategy.

Strategy 2: Binary Search Strategy

Debugging binary search strategy

In this strategy, we assume that the code path follows a single-threaded, linear flow from the beginning of execution (where we run the code) to the end of execution, or to the point where the bug happens.

We choose a spot more or less in the middle of that and run tests on the pieces that would contribute to the bug with any of the tactics we discussed in the previous piece.

Here’s where assumption detection comes into play. We’re likely to thoughtlessly assume that we know things at this point: that variable x should be this, that that class should be instantiated, et cetera. This is where insidious bugs hide: in the stuff we’re not checking.

We agreed back in What Causes Insidious Bugs? that assumption detection is difficult:

It’s hard for us to detect when our assumptions about a system are wrong because it’s hard for us to detect when we’re making assumptions at all. Assumptions, by definition, describe things we’re taking for granted. They include all the details into which we are not putting thought. I talked more about assumption detection in this piece on refactoring.

I believe that improving our ability to detect and question our assumptions plays a critical role in solving existing insidious bugs and preventing future ones.

So let’s whip out our insidious bug hunting journal and try an exercise that will help us learn to detect and question our assumptions.

Exercise: Assumptions and Checks

At each step represented by a rounded box in the flow chart above, we write down what step of the process we are checking, and then we make a list for assumptions and a list for checks.

Example:

Step 1: Checking the controller method

Given:

  • The controller exists
  • The route is routing to this controller method

Checking:

  • Is the resource present in the database?
  • Does the controller method pull the correct resource from the database?
  • Does the controller method add the necessary attributes to the resource?

The “Given” section attempts to explicitly state our assumptions: the things we are not checking. The “Checking” section lists the things we are checking—and we can mark each one with a checkmark or an X depending on whether they produce what we expect.

This exercise seems tedious, right up until we’ve checked every possible place in the code and all of our checks are working, but the bug still happens. At that point, it’s time to go back and assess our “Given”s, one by one. I recommend keeping these notes. How often do bugs that thwart us for long periods of time end up hiding in our assumptions? What can we learn from this about spotting our assumptions, and which of our assumptions run the highest risk of being incorrect?
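If a paper journal isn't your thing, the same notes can live in code. Here is a minimal sketch of one way to record them. The names `JournalEntry` and `givens_to_revisit` are made up for this illustration, and the check results in the example are hypothetical: they stand for the situation where every check passed but the bug still happens.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class JournalEntry:
    """One step of the process: what we took for granted and what we checked."""

    step: str
    givens: List[str] = field(default_factory=list)
    # Each check maps to True (checkmark), False (X), or None (not run yet).
    checks: Dict[str, Optional[bool]] = field(default_factory=dict)

    def all_checks_passed(self) -> bool:
        return bool(self.checks) and all(
            result is True for result in self.checks.values()
        )


def givens_to_revisit(journal: List[JournalEntry]) -> List[str]:
    """List every given, in order, once all checks have passed.

    Returns an empty list while any check is failing or not yet run,
    because then we still have a more direct lead to follow.
    """
    if not all(entry.all_checks_passed() for entry in journal):
        return []
    return [f"{entry.step}: {given}" for entry in journal for given in entry.givens]


# Hypothetical results for the example step from this post.
entry = JournalEntry(
    step="Checking the controller method",
    givens=[
        "The controller exists",
        "The route is routing to this controller method",
    ],
    checks={
        "Is the resource present in the database?": True,
        "Does the controller method pull the correct resource from the database?": True,
        "Does the controller method add the necessary attributes to the resource?": True,
    },
)

for given in givens_to_revisit([entry]):
    print("Revisit:", given)
```

Keeping the givens as plain strings is deliberate: the point is to get them written down, not to automate them.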

With a binary search, each check should cut the problem space roughly in half, whether or not we find something amiss. A code path with 64 steps, for example, needs about six checks. Hopefully, this way, we can find the cause of our insidious bug in relatively few steps.
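To make the halving concrete, here is a minimal sketch. It assumes we can model the process as an ordered list of steps that each transform some state, that the state looks right before any step runs and wrong after the last one, and that once the state goes wrong it stays wrong. The step functions and the `looks_right` check are hypothetical stand-ins for whatever we would really inspect with a debugger or a print statement.

```python
from typing import Any, Callable, List, Tuple

Step = Callable[[Any], Any]


def state_after(steps: List[Step], initial: Any, count: int) -> Any:
    """Run only the first `count` steps and return the resulting state."""
    state = initial
    for step in steps[:count]:
        state = step(state)
    return state


def first_bad_step(
    steps: List[Step], initial: Any, looks_right: Callable[[Any], bool]
) -> Tuple[int, int]:
    """Return (index of the first step that breaks the state, checks used)."""
    good, bad = 0, len(steps)  # right after `good` steps, wrong after `bad` steps
    checks = 0
    while bad - good > 1:
        middle = (good + bad) // 2
        checks += 1
        if looks_right(state_after(steps, initial, middle)):
            good = middle  # the bug is somewhere after the midpoint
        else:
            bad = middle  # the bug is at or before the midpoint
    return bad - 1, checks


def add_one(x: int) -> int:
    return x + 1


def buggy(x: int) -> int:
    return -100  # hypothetical bug: the state should never go negative


# Eight steps, with the bug planted at index 5.
steps = [add_one] * 5 + [buggy] + [add_one] * 2
print(first_bad_step(steps, 0, lambda state: state >= 0))  # prints (5, 3)
```

Three checks for eight steps. Note the assumptions doing the heavy lifting here: if the state could look right again after going wrong, the search would happily point us at the wrong step.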

There are cases where binary search won’t work: namely, cases where the code path does not follow a single-threaded, linear flow from the beginning of execution to the end.

In this case, we may need to trace the entire code path from beginning to end ourselves.

Strategy 3: Follow the Process from Beginning to End

Debugging follow the process

Even though I am listing this strategy after the binary search one, for someone who is just learning about debugging strategies, I teach this one first. It follows the logical line of inquiry and results in fewer misinterpretations. Depending on the length and complexity of the process, this strategy might take longer than the binary search strategy, but I don’t worry about that too much.
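As a rough sketch of what following the process can look like, the snippet below walks a toy process one step at a time, records the state after each step, and stops at the first state that isn't what we expected. Everything in it is hypothetical: `trace_process`, the three toy steps, and the planted bug are made up for illustration.

```python
from typing import Any, Callable, List, Tuple

Step = Callable[[Any], Any]
Check = Callable[[Any], bool]


def trace_process(
    steps: List[Tuple[Step, Check]], initial: Any
) -> List[Tuple[str, Any, bool]]:
    """Run each step in order, recording (step name, state, check passed)."""
    state = initial
    trail = []
    for step, check in steps:
        state = step(state)
        passed = check(state)
        trail.append((step.__name__, state, passed))
        if not passed:
            break  # stop at the first state that surprises us
    return trail


def parse(text: str) -> List[int]:
    return [int(part) for part in text.split(",")]


def drop_negatives(numbers: List[int]) -> List[int]:
    # Hypothetical bug: `> 0` also drops zero; the intent was `>= 0`.
    return [n for n in numbers if n > 0]


def count(numbers: List[int]) -> int:
    return len(numbers)


trail = trace_process(
    [
        (parse, lambda state: state == [3, 0, -2]),
        (drop_negatives, lambda state: state == [3, 0]),
        (count, lambda state: state == 2),
    ],
    "3,0,-2",
)

for name, state, passed in trail:
    print(name, state, "ok" if passed else "NOT what we expected")
```

Writing the expectation next to each step forces us to say what we think the state should be before we look at it, which is the whole exercise.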

In this strategy, too, it can be helpful to explicitly list our assumptions and checks. This practice has little to do with remembering what we did (though it does help with that). Rather, we are training our brains to spot our own assumptions. You’ll know it’s working if your “Given” lists start getting longer. We especially remember to include givens that weren’t what we thought when we hunted down previous bugs. It is specifically this intuition that we are building when we get better at debugging through practice. However, because we do not deliberately practice it, nor generalize the skill to other languages and frameworks, our disorganized approach to learning debugging from experience tends to limit our skills to the stacks we have written.

By identifying common patterns in the assumptions we tend to make that end up being wrong (and causing bugs), we can improve our language-agnostic debugging intuition.

On Bugs that Only Happen Sometimes

If you’ve chased down a bug like this, my sincere condolences for your frustration. These ones suck, and they usually happen for one of two reasons:

1. We are defining our process too narrowly. Our idea of where our code begins and ends, and everything in between, is missing some variable. Maybe it’s an environment variable. Maybe it’s a race condition. Maybe it’s an edge case that we haven’t yet identified. To find it, we have to broaden our scope. What are we placing outside the set of things we’re looking at that actually belongs inside it?

2. One of our givens is only true sometimes. This kind of bug tries my patience more than any other. These bugs are difficult to track down because, even when we check our givens, those givens might turn out to be accurate…this time. We have to check them multiple times, and hope that on one of those re-checks a given does the thing it only does sometimes.
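For the second case, re-checking a given by hand gets old fast. Here is a minimal sketch of automating the re-check, assuming the given can be expressed as a function that takes no arguments and returns True when the given holds. `recheck_given` and `FlakyGiven` are hypothetical names, and the flaky given is made deterministic (false on every fourth call) purely so the example is repeatable; real intermittent failures won't be this polite.

```python
from typing import Callable, List


def recheck_given(check: Callable[[], bool], attempts: int) -> List[int]:
    """Run the check `attempts` times; return the attempt numbers that failed."""
    failures = []
    for attempt in range(1, attempts + 1):
        if not check():
            failures.append(attempt)
    return failures


class FlakyGiven:
    """Hypothetical given that is false on every fourth call."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self) -> bool:
        self.calls += 1
        return self.calls % 4 != 0


print(recheck_given(FlakyGiven(), 10))  # prints [4, 8]
```

An empty list doesn't prove the given always holds. It only tells us it held for those attempts, so the number of attempts is itself an assumption worth writing down.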

Conclusion

We’re talking about a broader praxis for debugging. We discussed three strategies for developing our understanding of our code when bugs happen, and we talked about the value of explicitly stating our assumptions as an investment in our future debugging skills. We also talked about common causes of bugs that only happen sometimes.

 

One comment

  1. Very interesting, but I think another important strategy is missing.
    It’s called the “fail fast principle”.
    https://www.martinfowler.com/ieeeSoftware/failFast.pdf
    The fail fast principle is unknown to most developers (even if they’re 20+ years in business);
    they do quite the opposite:
    If something’s wrong or missing, they try to prevent any error and carry on in the code flow.
    The code tries to be fail-safe but in fact the code hides bugs and makes it difficult to find these bugs.
    Applying the fail fast principle means:
    Test input arguments and throw an exception if something is wrong.
    Test the internal state of your current object and throw an exception if there is something invalid.
    Every switch-statement should have a default or else-clause.
    Throw an exception if this default or else branch is unexpected in the code flow.

    So if you’re hunting a bug and ran out of ideas, stop digging deeper
    and apply the fail fast principle until the code fails with an exception.

    If you get an exception, you have a stack trace attached to it.
    Now it’s time to start your debugger and find out why this exception was thrown.
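
For readers who haven't seen it, the checks described in this comment might look something like the sketch below. The function, its parameters, and the prices are invented for illustration; only the pattern (validate the inputs, and raise in the branch that should never be reached) comes from the comment.

```python
def shipping_cost(weight_kg: float, method: str) -> float:
    """Hypothetical example: fail fast instead of guessing at bad input."""
    # Test input arguments and raise if something is wrong.
    if weight_kg <= 0:
        raise ValueError(f"weight_kg must be positive, got {weight_kg!r}")

    if method == "standard":
        return 5.0 + 1.0 * weight_kg
    elif method == "express":
        return 10.0 + 2.0 * weight_kg
    else:
        # The branch we don't expect to reach: raise rather than fall
        # back to a default that would hide the bug.
        raise ValueError(f"unexpected shipping method: {method!r}")


print(shipping_cost(2, "standard"))  # prints 7.0
```

Calling `shipping_cost(2, "drone")` raises a `ValueError` with a stack trace pointing at the caller, instead of quietly returning a default price.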
