Lessons from Space: Avoiding Catastrophe

Reading Time: 12 minutes

I’m visiting Cape Canaveral as a NASA Social appointee to cover the launch of the CRS-20 cargo mission to the International Space Station.

You can follow me on Instagram or on Twitter to see posts about it, and I’m also writing this blog series about the process of a space launch. Today we’ll talk about some space launch failures: what these cautionary tales can teach us about building better software, and how we can apply what we already know to avoid the worst.

You can check out all the posts in the space series (as well as some other posts about space-related code) right here.

The mission patches for Apollo 1, Challenger, and Columbia. Image courtesy of Ars Technica.

NASA has had its share of accidents. Sociologist Diane Vaughan has investigated two of the most devastating ones: the loss of the crewed mission Challenger in 1986 and the subsequent loss of the crewed mission Columbia in 2003.

Brief Background: What happened to the Challenger and Columbia?

On January 28, 1986, the Challenger exploded 73 seconds after launch. An O-ring meant to seal a joint in one of the solid rocket boosters became stiff in the unusually cold weather at Cape Canaveral. When it failed during launch, blow-by of fuel and debris caused the secondary O-ring to fail, too; the escaping exhaust pushed against the solid rocket booster until it partially detached and collided with the external fuel tank. As the rocket disintegrated, the crew capsule broke free and fell 65,000 feet, hitting the water below at 200 miles per hour. All seven crew members perished.

Then on February 1, 2003, the Columbia disintegrated upon reentry into the Earth’s atmosphere. During its launch, the launch monitor cameras had captured images of a piece of foam breaking off the external fuel tank and striking the shuttle’s wing. Investigations after the accident concluded that a perforation of the heat shield, and possibly of the wing itself, allowed superheated atmospheric gases to penetrate the shuttle during reentry and tear it apart.

In the aftermath of both accidents, manager incompetence, funding failures, and poor prioritization of safety surfaced among the explanations for the disasters.

But Dr. Vaughan presents another conclusion.

Her book, The Challenger Launch Decision, shares the details of her investigation (with a second edition that also draws the comparison between Challenger and Columbia).

Rather than attributing the accidents to deadlines or individual incompetence, Dr. Vaughan finds a pattern that she calls the normalization of deviance. Over time, preexisting risks become normalized, and each progressively bigger risk seems like only a small difference from the cumulative risk level the team has already accepted.
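That ratchet effect is easy to see with a toy calculation (Python, with made-up numbers): suppose each decision accepts a tolerance only 10% looser than the last accepted one. Every step looks like a small deviation from the current baseline, but after ten such steps the team is operating at more than two and a half times the original spec.

```python
# Toy illustration of normalization of deviance (all numbers made up).
ideal_tolerance = 1.0      # the original engineering spec, in arbitrary units
marginal_loosening = 0.10  # each decision: "only 10% past what we already accepted"

accepted = ideal_tolerance
for launch in range(10):
    accepted *= 1 + marginal_loosening

# Each step was judged against the previous baseline, not the ideal.
# Cumulatively, the accepted tolerance is now ~2.6x the original spec.
print(round(accepted / ideal_tolerance, 2))  # 2.59
```

Comparing each decision to the last accepted one hides the drift; only a comparison back to the original ideal exposes it.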

For the Challenger, we see this in the interpretation of data about O-ring failures.

Several engineers at Thiokol (the company that made the O-rings) protested launching the Challenger due to the hazards posed by the cold temperatures. Managers there expressed reservations, too. But the marginal risk looked low to managers at Kennedy.

And no protocol demanded the alternative comparison: cumulative risk measured against the ideal case.

So they saw the overall risk as low, citing the number “1 in 100,000” for expected space shuttle failures. In reality, the chances of failure in this case were about 1 in 16.

For the Columbia, we see this in coordinators’ perspective on the risks of the foam.

Columbia’s final mission was not the first time foam had fallen, hit the shuttle, and damaged the tiles. Launch managers had seen that happen before without incident; they had become habituated to a risk that they knew about. So they thought of it as a maintenance risk, but not a safety risk.

Engineers within NASA pushed to get pictures of the breached wing in space, and the Department of Defense offered to take the pictures with its orbital spy cameras. But NASA officials declined the offers. Of the known risk, one official said after the fact: “We were comfortable with it.”

In both these cases, familiarity created a false sense of safety.

And we, you and I, make the same mistake.

Have you washed your hands, carefully, with soap, in the past 48 hours? Would you say you wash your hands more often or more carefully right now than you have, on average, in your life?

If so, are you doing that because you have learned that you reduce your risk of infection from COVID-19 (coronavirus) by washing your hands?


Washing your hands also reduces your risk of infection from cold and flu viruses, which have their season every year.

I am not disputing the severity of COVID relative to illness from the flu: our early data indicates that COVID is both more infectious and more lethal in humans than most flu viruses. The point remains that the flu is, at best, no fun, and at worst deadly. Like coronavirus, it spreads much less effectively through the population when people wash their hands.

So where was all this enthusiasm for hand washing every other year? Were flu deaths just acceptable to us if it meant we could wizz and run? Recently I received the appalling news that it is unusual for all the sinks in a men’s bathroom to be occupied. Apparently, men weren’t washing their hands upon leaving restrooms. I’m never shaking a hand again.

But back to the point: we readily accept the risks to which we have become habituated. We won’t get our flu shot and we won’t wash our hands, but when a novel virus comes along, we change our behavior. We eat unhealthy food, lead sedentary lives, and ride in cars without seat belts, but we worry about solar flares, shark attacks, and getting struck by lightning.

Dr. Vaughan points out the organizational impact of risk habituation.

When you work in spaceflight, spaceflight risks become “everyday” risks to you, in the same way that car crash risks are everyday risks to a person who drives to work. In the banal decision-making routine of organizational life, a cumulative risk perspective falls away.

A slide from NASA’s senior manager teleconference meeting on November 3, 2014 summarizes that lesson.

So, I have to rant for like 30 seconds, and then we’ll move on.

I don’t love Dr. Vaughan’s thesis, not because the data isn’t accurate (as far as I can tell it is), nor because the conclusions don’t follow from the data (they do), but because the report absolves the NASA managers of blame in a manner that sounds Nuremberg Defensey. “They were just doing their job” doesn’t cut it. For the record, I’m the person whose pinned tweet is this, and I’m also not the only one who sees individual responsibility dangerously absolved here.

Dr. Vaughan makes a good point: it is a problem if systemic pressures push individuals toward dangerous decisions. She argues that, if an organization responds to a crisis by firing the Person In Charge and sticking someone else in there while changing nothing else, then the dysfunctions of the organization will cause the exact same thing to happen again regardless of the new chosen figurehead. She’s 100% right. We should evaluate our organizational cultures and hold teams accountable for systemic retrospective thinking when things go wrong. Over time, organizations will see fewer incidents if they don’t force employees to risk their social and political capital to make sure the team does the right thing.

And also, if managers hear protests from 20 engineers that the rocket isn’t safe, they should absolutely risk their social and political capital to make sure the team does the right thing. That’s what they’re paid the big bucks to do: to have the uncomfortable conversation, to make the hard decision, to demonstrate leadership. Managers in the Challenger and Columbia decisions should have addressed engineers’ concerns, and at the very least, should have conveyed those concerns to their leadership and to launch personnel. They didn’t do it, and I’m not buying that they’re blameless.

I have left organizations over manager opacity like this. The consequences of that opacity have ranged from this guy is throwing his reports under the bus so he can keep the job he isn’t doing all the way up to if somebody in here doesn’t grow a spine right now, people are going to die. I would put the Challenger and Columbia situations pretty far toward the latter of those two scenarios.

Okay, end rant.

Anyway, what do space disasters and coronavirus have to do with slinging code?

We talk right here about how, when integration tests are flaky, programmers start ignoring them. We’re not listening to our tests: that’s normalization of risk. But we’re also not always stopping to fix those tests. “We time-boxed our effort on that test—we just can’t do it again. We sunk so much time. It’s just flaky.” Okay. That sounds like we need to replace our unreliable risk alert system with something we can trust. Are we doing that?
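One way to keep the flakiness visible instead of normalized is to retry flaky checks while recording every underlying failure, so the flake rate stays on the dashboard even when the suite looks green. Here is a minimal sketch of that idea in Python; the `FlakeTracker` class and the 50%-flaky check are hypothetical, not a real testing library’s API:

```python
import random

class FlakeTracker:
    """Retries a flaky check, but records every failure so the
    flake rate stays visible instead of being silently normalized."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.attempts = 0
        self.failures = 0

    def run(self, check):
        # Retry up to max_retries times, counting every attempt and failure.
        for _ in range(self.max_retries):
            self.attempts += 1
            if check():
                return True
            self.failures += 1
        return False

    @property
    def flake_rate(self):
        return self.failures / self.attempts if self.attempts else 0.0

# Hypothetical flaky check: passes about half the time.
random.seed(42)
tracker = FlakeTracker(max_retries=3)
results = [tracker.run(lambda: random.random() < 0.5) for _ in range(100)]

# A retried suite can look mostly "green" even while the underlying
# flake rate is high -- the number the team should actually watch.
print(sum(results), round(tracker.flake_rate, 2))
```

The point of the sketch: retries alone hide risk; retries plus a recorded flake rate turn the flaky test back into a signal the team can act on.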

We usually aren’t. But there’s something else we do from which we can draw lessons about this: we manage scope creep. That’s what happens when a project with predefined requirements, called scope, starts to encompass more and more work (a larger scope) within the original allotment of time and effort. We learn to record the original scope and notice when new requests seem to change it. We stand up and point it out. We learn to evaluate client needs and come up with scaled-down solutions that provide for those needs, even if some “wants” must be stripped away. What if we did the same thing with risk creep?

We could maintain a running context, in commit messages or elsewhere, of the decisions we have made and the risks we have undertaken. We could identify evaluation criteria and ideal metrics in areas like data security, user safety, and accessibility, and include them in either our automated test suite or a manual checklist to be executed before each deployment. We could define maximum tolerances from that ideal that allow us to make a go/no-go decision with a long-range perspective.
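That kind of tolerance check is small enough to automate. Here is one possible sketch in Python; the metric names, ideal values, and tolerances are invented for illustration, but the key design choice is real: each measurement is compared against the *original* ideal, not against whatever the last release happened to ship.

```python
from dataclasses import dataclass

@dataclass
class RiskMetric:
    """One tracked metric: its ideal value and the maximum
    tolerated deviation from that ideal (not from the last release)."""
    name: str
    ideal: float
    max_deviation: float  # absolute tolerance, measured from the ideal

    def within_tolerance(self, measured: float) -> bool:
        return abs(measured - self.ideal) <= self.max_deviation

def go_no_go(metrics, measurements):
    """Return ('GO' or 'NO-GO', list of violated metric names).
    'GO' only if every measurement is within tolerance of its ideal."""
    violations = [m.name for m in metrics
                  if not m.within_tolerance(measurements[m.name])]
    return ("GO" if not violations else "NO-GO"), violations

# Hypothetical baseline, recorded when the project started.
metrics = [
    RiskMetric("p95_latency_ms", ideal=200.0, max_deviation=50.0),
    RiskMetric("error_rate_pct", ideal=0.1, max_deviation=0.4),
]

# Today's measurements: each release drifted "only a little" from the
# last one, but the error rate is now far from the original ideal.
decision, violations = go_no_go(metrics, {"p95_latency_ms": 230.0,
                                          "error_rate_pct": 0.9})
print(decision, violations)  # NO-GO ['error_rate_pct']
```

Because the baseline is written down and the comparison anchors to it, the check catches cumulative drift that would look acceptable release-over-release.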

The Nov 3 meeting included recommendations for combating organizational risk creep.

You’ll see some parallels between these recommendations and the practices we learned about in the previous post on pre-launch testing.

And here’s why tech teams really need risk creep protocols:

Tech teams face a higher probability of risk creep than NASA does.

Here’s why: back in 1986, and even in 2003, and always at NASA, it was normal to be a “lifer”: to spend one’s whole career at one organization—and even on one string of missions. So, at the very least, if the risk tolerance drifts, there’s a chance that somebody on the team remembers where the tolerance drifted from.

Tech employees in 2020 spend a median of 2.2 years at a given company. That doesn’t account for employees switching between teams within tech companies. Critical context about what the ideal metrics used to be can easily leave the team with the next developer who decides they need a change of pace. When that happens, the exiting teammate needs to transfer their context to the others. But, as we have established in these two other pieces, context transfer constitutes an advanced, nuanced skill. When we rely on everyone on the team having an advanced skill that companies and developer education tools don’t prioritize, we introduce risk.

Suppose that we managed to educate, empower, and incentivize teams to share context and combat risk creep. Next, our organizations need to equip and permit managers to advocate for their findings to the higher-ups. And it has to be more than words: it has to come baked into the way we operate. I proposed a management strategy here that might allow us to do that by explicitly designating managers and directors to represent separate sets of concerns while making decisions.

I think it’s also worth mentioning a final pair of slides from the Nov 3 meeting: the slides that list the signs of groupthink. Like risk creep, groupthink has less of a chance of affecting our decisions if we know how to spot it.

When the consequences of a botched release are high, protecting our original risk tolerances and promoting collective decision-making from disparate perspectives become critically important.

If you liked this piece, you might also like:

What software engineering teams can learn from improv comedy (I live in Chicago, which I think makes me legally obligated to write this)

Combatting shitty humor in the workplace (this isn’t the title of the piece because I wrote it when I had less experience and more fucks to give, but the content stands)

Advanced professionalism for the heavily tattooed or otherwise counterculture (again not the title of the piece, see comments on the previous piece)
