I’m writing this blog series about the process of launching into space and the selfish reasons why we should care about it (in addition to our general enthusiasm for science). You can check out all the posts in the space series (as well as some other space-related code posts) right here.
Today we’ll talk about why the launch got delayed and what we can learn from that.
Originally set for March 1, CRS-20 got delayed to March 6.
Lots of factors affect a launch, like wind, weather, and fuel efficiency (rockets use a lot of fuel). If conditions don’t suit, then coordinators will move the launch. Occasionally, the rocket is already on the launchpad when a stray boat lounging in the danger zone scrubs the whole plan.
But a delay this big, announced this far out, often has to do with a technical failure.
The NASA blog confirms the cause:
During standard preflight inspections, SpaceX identified a valve motor on the second stage engine behaving not as expected and determined the safest and most expedient path to launch is to utilize the next second stage in line that was already at the Cape and ready for flight. The new second stage has already completed the same preflight inspections with all hardware behaving as expected. The updated target launch date provides the time required to complete preflight integration and final checkouts.
In a future post we’ll talk about the actors at play for this launch: NASA, SpaceX, et cetera. We’ll even talk about what the “second stage” is and why we’re putting this thing in space anyway. First, I want to tease out a few things from this delay:
On the Role of Deadlines
Product delivery teams and corporate clients, in my experience, tend to view deadlines as set in stone. I have worked on many an app for which the new release was set for August X, and could not move past August X, because that was the approved date. Why? Arbitrary. No special event on August X + 1 required a new release. But engineering teams cut corners, ship bugs, and scrap features to meet deadlines.
We need a risk calculation that compares the expected outcome of a late release to the expected outcome of an on-time release. We get each expected outcome from a list of possible outcomes, their costs, and the probability that each one will occur: multiply each cost by its probability, then add up the results.
Let’s use this launch as an example.*
*These examples contain wildly inaccurate numbers. I don’t know space numbers. The point is how to calculate expected cost. Also, if YOU know space numbers, I’d be THRILLED to get an email from you about how I can make these outcome tables the optimal combination of accurate and simple. Email me: chelsea at this site’s URL.
Possible Outcomes of Launching on March 1:
| Outcome | Cost | Probability | Cost * Probability |
|---|---|---|---|
| Launch Goes Perfect | $0 | 50% | $0 |
| Minor Issue – have to MacGyver a solution | $30,000 | 42% | $12,600 |
| Minor Trajectory Issue – more MacGyvered solutions, some cargo loss | $1 million | 5% | $50,000 |
| Major Trajectory Issue – ISS cannot retrieve capsule | $150 million | 2% | $3 million |
| Launch explodes, additional damage to launch pad and surrounding area | $300 million | 1% | $3 million |
If we add up all these expected costs, we’re at $6,062,600. In this example, the vast majority of the cost comes from the two catastrophic outcomes. These are far from the likeliest cases, but they are the most extreme ones.
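For readers who think better in code, here is a minimal Python sketch of the same arithmetic. The function name `expected_cost` and the tuple layout are illustrative choices, and the numbers are the made-up ones from the table above, not real launch data.

```python
# Outcome table for the March 1 launch, copied from the post.
# These are the post's illustrative figures, not real launch data.
OUTCOMES_MARCH_1 = [
    # (outcome, cost in dollars, probability)
    ("Launch goes perfect", 0, 0.50),
    ("Minor issue", 30_000, 0.42),
    ("Minor trajectory issue", 1_000_000, 0.05),
    ("Major trajectory issue", 150_000_000, 0.02),
    ("Launch explodes", 300_000_000, 0.01),
]


def expected_cost(outcomes):
    """Sum cost * probability over every outcome in the table."""
    total_probability = sum(p for _, _, p in outcomes)
    # The outcomes should cover every possibility, so probabilities sum to 1.
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError(f"probabilities sum to {total_probability}, not 1")
    return sum(cost * p for _, cost, p in outcomes)


print(f"${expected_cost(OUTCOMES_MARCH_1):,.2f}")  # $6,062,600.00
```

The check on the probabilities is there because a table that forgets an outcome (or double counts one) will quietly produce a misleading total.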
Now suppose launching on March 6 allows us to halve the probability of all the negative outcomes:
| Outcome | Cost | Probability | Cost * Probability |
|---|---|---|---|
| Launch Goes Perfect | $0 | 75% | $0 |
| Minor Issue – have to MacGyver a solution | $30,000 | 21% | $6,300 |
| Minor Trajectory Issue – more MacGyvered solutions, some cargo loss | $1 million | 2.5% | $25,000 |
| Major Trajectory Issue – ISS cannot retrieve capsule | $150 million | 1% | $1.5 million |
| Launch explodes, additional damage to launch pad and surrounding area | $300 million | 0.5% | $1.5 million |
The expected costs are now halved, too. So if delaying the launch costs less than $3,031,300, it’s worth it.
We’re using dollars as our unit here, but that’s just for illustration; any unit, or even a combination of units, might be used. For example, this flight is just cargo and experiments. What if there were people aboard? What is it worth to reduce the probability of killing someone, or of reducing their quality of life? Keeping deadlines flexible starts to look really attractive, doesn’t it? This is particularly true for NASA, according to this account of the December 2019 Boeing Starliner test flight:
The losses of the space shuttles Challenger in 1986 and Columbia in 2003 were both blamed in part on NASA officials pushing too hard to meet schedule deadlines.
Admittedly, when it comes to space, we’re talking about high potential costs; space is an expensive and tricky endeavor. Most products don’t exactly share the cost curve of a space launch. Nevertheless, we have plenty of costly outcomes to worry about: what if someone can steal sensitive customer data, like they did from the Equifax database? What if our product puts people in danger like the Boeing 737 MAX flight control software? What if our app fails its most vulnerable users like Siri used to do when folks told her they were contemplating suicide? What if inaccessible buttons cause our app to call 911 while telling low-vision users that they’re importing their contacts?
Much like a space launch, some of the costliest potential outcomes for a software release aren’t about how the code functions in most cases. They’re about how the code functions in a few extreme cases. I wrote about what it means to build robust software over here.
If slipping a release date means reducing our expected costs by more than the cost of the delay, it’s worth it. This is where the second lesson comes into play: keeping the costs of delay as low as possible.
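The break-even rule from this section fits in a few lines of Python. This is only a sketch: `worth_delaying` is a hypothetical helper, the two expected costs are the totals from the outcome tables above, and the delay costs are invented for the sake of the example, since the post does not estimate one.

```python
def worth_delaying(expected_cost_on_time, expected_cost_delayed, cost_of_delay):
    """Return True when the delay saves more expected cost than it costs."""
    savings = expected_cost_on_time - expected_cost_delayed
    return savings > cost_of_delay


# Expected costs from the two outcome tables in this post (illustrative numbers).
ON_TIME = 6_062_600
DELAYED = 3_031_300

# The cost_of_delay values are hypothetical; the post does not estimate one.
print(worth_delaying(ON_TIME, DELAYED, cost_of_delay=1_000_000))  # True
print(worth_delaying(ON_TIME, DELAYED, cost_of_delay=5_000_000))  # False
```

The lower we can push `cost_of_delay`, the more often the answer comes out in favor of the safer option.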
On the Value of Redundancy
Redundancy: when we’ve prepared for things multiple times over. Redundancy is our insurance policy in case the first plan messes up.
We can spot redundancies in what we know about this launch so far. For example, we know that SpaceX had another flight-ready second stage available to replace the first one when a valve motor on the first one did not behave as expected. They didn’t have to ship a new one from California or get another one ready. That’s why this launch is delayed 5 days and not longer.
We can also infer a second redundancy. The CRS-20 mission carries supplies to the International Space Station. If the launch can be slipped a week, that tells us that the International Space Station already possesses enough supplies to get by for that period (and probably longer than that).
Like a health insurance policy or a car insurance policy, redundancies cost something. It’s easy to resent having to pay right up until we need the redundancies.
Product delivery teams and corporate clients often don’t budget for redundancies, or don’t budget for enough redundancies. Let’s look at one of our previous examples: in 2017, Equifax leaked the personal information of about 140 million people who were required to register with the credit reporting agency. The culprit? A vulnerability in an outdated version of an open source dependency called Apache Struts. Apache had released a patch (think of this as a reinforcement) for the vulnerability, but applying the patch required an investment of time and effort from programmers, plus extensive testing. Equifax leadership had not found time to let their tech team do the work.
When organizations focus on saving money in the short term, they don’t necessarily take the time to consider and prepare for disappointing outcomes. Emergency managers and Chief Security Officers regularly wring their hands about how people avoid investing in disaster preparedness (and in planning for less-than-ideal outcomes in general). When managers and stakeholders have confidence that they’ll succeed, they want to hang onto as many of the gains from that success as they can. Contingency plans for less-than-success don’t happen as often as they should.
We can look for gaps like this with our expected outcome chart as a map. This time, instead of calculating cost, we’ll list some possible redundancies: items that might appear in our contingency plan for each outcome.
| Outcome | Cost | Probability | Example Items in Contingency Plan |
|---|---|---|---|
| Launch Goes Perfect | $0 | 75% | None |
| Minor Issue – have to MacGyver a solution | $30,000 | 21% | Bring MacGyvering toolkit to the launch site |
| Minor Trajectory Issue – more MacGyvered solutions, some cargo loss | $1 million | 2.5% | Ensure a spare supply of the most vulnerable cargo on the ISS and on the ground. Install emergency boosters to change craft trajectory if needed. |
| Major Trajectory Issue – ISS cannot retrieve capsule | $150 million | 1% | Plan a trajectory with failure cases that don’t land the capsule in a populated area |
| Launch explodes, additional damage to launch pad and surrounding area | $300 million | 0.5% | Check at least three times that there are no boats bopping around the danger zone before launch |
These kinds of measures help us reduce the probability or the cost (or both) of each of the possible disappointing outcomes.
It’s tricky, though, to predict all the possible outcomes.
We can plan for potential failures that we foresee. But we don’t foresee everything. Even space launches (or, in the case of the 1999 Mars Climate Orbiter, orbit insertions) suffer from this:
During the design phase, the propulsion engineers at Lockheed Martin in Colorado expressed force in pounds. However, it was standard practice to convert to metric units for space missions. Engineers at NASA’s Jet Propulsion Lab assumed the conversion had been made.
This navigation mishap pushed the spacecraft dangerously close to the planet’s atmosphere where it presumably burned and broke into pieces, killing the mission on a day when engineers had expected to celebrate the craft’s entry into Mars’ orbit.
Programmers see mistakes like this so often that we have jokes about it.
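One common defense in software is to stop passing bare numbers around and make the unit part of the type. The Python sketch below shows the idea under some loud assumptions: `Force` and `total_force` are hypothetical names invented for this post, and nothing here reflects how the actual mission software was written.

```python
from dataclasses import dataclass

# Exact by definition: 1 pound-force = 4.4482216152605 newtons.
NEWTONS_PER_POUND_FORCE = 4.4482216152605


@dataclass(frozen=True)
class Force:
    """A force stored internally in newtons (hypothetical illustrative type)."""
    newtons: float

    @classmethod
    def from_newtons(cls, value):
        return cls(value)

    @classmethod
    def from_pound_force(cls, value):
        # Convert at the boundary so nothing downstream has to guess the unit.
        return cls(value * NEWTONS_PER_POUND_FORCE)


def total_force(forces):
    """Sum Force values; bare numbers are rejected instead of assumed metric."""
    total = 0.0
    for force in forces:
        if not isinstance(force, Force):
            raise TypeError(f"expected Force, got {type(force).__name__}")
        total += force.newtons
    return Force(total)


thrust = Force.from_pound_force(100.0)
print(round(thrust.newtons, 1))  # 444.8
```

The point is not the conversion factor. It is that the caller has to say which unit they mean, so nobody downstream gets to assume the conversion had already been made.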
What can we do to feel confident that we have accounted for every possibility?
We’re human. We make mistakes. And when we try new things, we run into brand new problems. But we can do some things to mitigate this risk:
- We can study precedent to find out what problems we might encounter
- We can try to build broad and flexible contingency plans
- We can prioritize communication between experts in different areas to draw on as much knowledge as possible for predicting (and responding to) problems.
The Equifax leak isn’t a good example of an unforeseeable problem because the problem was foreseen; a patch had been released. It just wasn’t applied.
But how do space launch teams do these three things? And how can software and product teams do them, too?
In future posts in this series, we’ll explore some options.
If you liked this piece, you might also like:
The debugging posts (a toolkit to help you respond to problems in software)
The Listening Series (Prepare to question much of what you know about how to be good at your job.)
Skills for working on distributed teams (including communication skills that will make your job easier)