I’m writing this blog series about what software engineers can learn from spaceflight. You can check out all the posts in the space series (as well as some other space-related code posts) right here.
Today we’ll look at how launch teams test a rocket prior to launch. We’ll also talk about when and how to test ahead of a release, and how that testing can go wrong. Expect more nuance than “Always Test, The End”; in fact, we’ve discussed decisions about whether to test code in more detail right here.
Pre-Launch Rocket Tests
On Monday, March 2, we learned that SpaceX’s Falcon 9 rocket successfully completed its static fire test for the launch that is scheduled to happen on Friday.
When rockets go into space in the U.S., their manufacturers first book time on one of NASA’s launch pads to test the hardware and software. Manufacturers can (and SpaceX usually does) first perform a wet dress rehearsal, called “wet” because the rocket is loaded with its liquid propellants for the test (though the engines are not ignited).
After that comes the required test, the static fire test, which includes a wet dress rehearsal. In the static fire test, the manufacturer turns on the rocket engines for somewhere between 3 and 5 seconds while the rocket is secured to the launch mount (with some exceptions—SpaceX has run a Falcon Heavy static fire test for 12 seconds).
During this time, analysts and testers collect data about the function of the components. This data runs through a decision tree that evaluates how well it conforms to the engineers’ expectations for how things should run. That decision tree ends with the Go-No-Go Decision, which is (you guessed it) whether or not the rocket is deemed ready to go. If you’re interested, you can read transcripts and listen to audio recordings of Go-No-Go procedures from some previous launches.
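To make the decision tree idea concrete, here is a minimal sketch in Python. It is not how any launch provider's software actually works: the node names, the telemetry keys (thrust_ratio, chamber_pressure_bar, vibration_g), and every limit are hypothetical values invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Node:
    """One question in the tree, plus where to go for each answer."""
    name: str
    test: Callable[[dict], bool]
    if_true: Union["Node", str]   # another question, or a final verdict
    if_false: Union["Node", str]


def decide(node, reading):
    """Walk from the root to a verdict, recording each answer on the way."""
    path = []
    while isinstance(node, Node):
        answer = node.test(reading)
        path.append((node.name, answer))
        node = node.if_true if answer else node.if_false
    return node, path


# Hypothetical limits, invented for illustration only.
low_vibration = Node(
    "low_vibration", lambda r: r["vibration_g"] <= 2.0, "GO", "NO-GO")
pressure_marginal = Node(
    "pressure_marginal", lambda r: r["chamber_pressure_bar"] >= 85.0,
    low_vibration, "NO-GO")
pressure_nominal = Node(
    "pressure_nominal", lambda r: r["chamber_pressure_bar"] >= 90.0,
    "GO", pressure_marginal)
root = Node(
    "thrust_in_range", lambda r: 0.95 <= r["thrust_ratio"] <= 1.05,
    pressure_nominal, "NO-GO")

reading = {"thrust_ratio": 1.01, "chamber_pressure_bar": 87.0, "vibration_g": 1.4}
verdict, path = decide(root, reading)
print(verdict)  # GO
print(path)
```

Recording the path alongside the verdict means anyone reviewing the decision afterward can see which questions were asked and how each was answered.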
Do go-no-go decisions change if we’re on a deadline?
They shouldn’t…in theory. But suppose we need to ship a new release and something weird happens during the final tests. Programmers often find themselves making the decision at that point whether the weird thing is acceptable to push to production. Someone on the product side says yes. Someone on the engineering side says no. What do we do?
Ideally, we avoid that situation. One strategy that NASA test engineers use to avoid it is to set all the go-no-go criteria ahead of the test. In fact, those criteria are loaded into decision-tree software before the static fire test happens. That tree can be modified depending on the particular type of rocket that’s launching, but the conversations about how much deviation from the expected values engineers will accept happen before anybody starts those engines.
Sarah Daugherty, one of the Flight Facility Test Directors at NASA Wallops, talks about this some more in an interview:
BLAIR (the interviewer): You mentioned one like a radar failure. In a case like that will you ever have to make a decision to go ahead and launch even if something like that doesn’t work?
SARAH: Yeah, it depends. Before launch, we come up with a set of criteria that we make our “go,” “no go” decisions on. Those are mission specific always. It depends if only one radar fails and we have our backup radars up. We still may be operating within our criteria and we may be okay. If that was the only radar we had per [se], we probably would not launch.
BLAIR: Got you.
SARAH: We just follow that set of criteria ahead of time that helps to minimize the stress and surprises on launch day when we have that plan in place. We can just follow that.
BLAIR: Can I possibly sponsor a green card from NASA EDGE? Like come up with an anomaly that we could throw into the mix on one of these dress rehearsals.
SARAH: Sure. They would be completely unbiased. Those are the best kinds.
By deciding ahead of time what the launch criteria will be and considering the matter settled at testing time, the team heads off ex-post-facto decisions about risk tolerance and equipment readiness.
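Here is one way that idea might look in code: a small sketch, under my own assumptions, of criteria that get fixed before the test and cannot be quietly loosened afterward. The LaunchCriteria class, its field names, and the numbers are all hypothetical; the radar rule loosely mirrors the example from the interview.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class LaunchCriteria:
    """Agreed before the test; frozen so nobody loosens it mid-countdown."""
    min_working_radars: int
    max_deviation_pct: float


def go_no_go(criteria, working_radars, expected, measured):
    """Return (verdict, reasons) using only the pre-agreed criteria."""
    reasons = []
    if working_radars < criteria.min_working_radars:
        reasons.append("not enough working radars")
    deviation_pct = abs(measured - expected) / expected * 100
    if deviation_pct > criteria.max_deviation_pct:
        reasons.append(f"reading deviates {deviation_pct:.1f}% from expected")
    return ("NO-GO" if reasons else "GO"), reasons


# Hypothetical numbers, chosen for illustration.
CRITERIA = LaunchCriteria(min_working_radars=1, max_deviation_pct=5.0)

# One radar down, a backup still up, reading 3% off: still within criteria.
print(go_no_go(CRITERIA, working_radars=1, expected=100.0, measured=103.0))
# The only radar is down: no launch, however good everything else looks.
print(go_no_go(CRITERIA, working_radars=0, expected=100.0, measured=100.0))

try:
    CRITERIA.max_deviation_pct = 20.0  # the launch-day temptation
except FrozenInstanceError:
    print("criteria are locked")
```

The same shape could hold release criteria for software (an error-rate ceiling, a required set of passing test suites) as long as the numbers are agreed on before the release candidate is built.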
How do we design the right tests?
This is a tricky question, particularly when our software’s malfunctions could have serious consequences. Research consultant Heidy Khlaaf works specifically on designing standards and formal verification methods for safety-critical systems: software that can cause catastrophes—destroying stuff, killing people, ruining the environment.
This talk of hers provides an excellent primer for the uninitiated (though it skips some material that, I gather from context clues, was covered by a previous talk at the same conference. I’ll get you a link to that talk later):
The first part of her talk focuses on smart sensors: small, low-level computers that run a block of code at regular interrupts to check on something at their installation site—say, a temperature or a pressure. Static fire tests rely on sensors like these to deliver the data that informs the go-no-go decision.
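As a rough illustration of that pattern (not code from the talk), here is a sketch of a sensor whose handler runs once per tick, takes a sample, and flags anything out of range. The SmartSensor class, the on_tick name, and the temperature limits are assumptions of mine, and the timer interrupt is simulated with a plain loop over canned readings.

```python
from collections import deque


class SmartSensor:
    """Samples one measurement per tick and flags out-of-range values."""

    def __init__(self, read, low, high, history=5):
        self.read = read                      # callable returning the current measurement
        self.low, self.high = low, high       # acceptable range (hypothetical limits)
        self.samples = deque(maxlen=history)  # keep only the most recent samples

    def on_tick(self):
        """The block of code that would run on each timer interrupt."""
        value = self.read()
        self.samples.append(value)
        return {"value": value, "in_range": self.low <= value <= self.high}


# Simulate five timer ticks with canned temperature readings (degrees C).
readings = iter([20.5, 21.0, 35.2, 21.1, 20.9])
sensor = SmartSensor(read=lambda: next(readings), low=15.0, high=30.0)

reports = [sensor.on_tick() for _ in range(5)]
out_of_range = [r["value"] for r in reports if not r["in_range"]]
print(out_of_range)  # [35.2]
```

On real hardware the handler would be wired to a timer interrupt and would read from a device register rather than a Python iterator; the loop here only stands in for that schedule.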
Dr. Khlaaf mentions some other broadly applicable concepts for software releases.
For example, I learned from this talk about industry standards dictating how software should be developed and maintained. Dr. Khlaaf mentions DO-331, the standard for aviation, and IEC 61508—dubbed “The Golden Boy” standard for its universal applicability in safety-critical systems. These standards might serve as sources of inspiration for assessing, preventing, and mitigating risk in any system. (Maybe I’ll do a blog series on this, once I finish the eight other series I’ve gotten into over here 😂).
Dr. Khlaaf also lists some questions we must consider when reasoning about failures and deficiencies flagged by automated tests:
We have talked in the past about how to approach these decisions. In this talk I give you a general intuition for how to judge code changes in the present based on how they will affect us in the future. Then in this screencast I show you how to build a risk profile for your application, so you can focus your testing with an understanding of your biggest risks. Finally, in this piece (which is also part of the space series), we talk about the role of deadline delays in risk reduction, and how to make a call to release or wait.
Toward the end of the talk, Dr. Khlaaf stresses the importance of the qualities of a piece of software beyond its functionality (more detail here); these matter especially for safety-critical systems.
Finally, Dr. Khlaaf mentions in the talk that verifying a safety-critical system requires (or should require) a third-party assessment. Often, after an accident, we learn that an organization approved its own release. You may have heard stories of software releases like this, perhaps approved under time pressure. Sometimes, too, the release team isn’t responding to immediate pressure; rather, they have habituated to latent risks, and those risks finally come to pass.
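A release process can enforce the "don't approve your own release" rule mechanically. The sketch below is one possible shape for such a gate, not a description of any real tool: the Approval record, the independently_approved function, and the organization names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    reviewer: str
    organization: str


def independently_approved(owning_org, approvals):
    """True only if at least one sign-off comes from outside the owning organization."""
    return any(a.organization != owning_org for a in approvals)


# Hypothetical names, for illustration only.
internal_only = [Approval("dana", "rocketco"), Approval("lee", "rocketco")]
with_third_party = internal_only + [Approval("sam", "independent-assessor")]

print(independently_approved("rocketco", internal_only))     # False
print(independently_approved("rocketco", with_third_party))  # True
```

A check like this does nothing about habituation to latent risk, but it does make self-approval a visible failure rather than a silent default.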
In the next post in this series, we’ll talk more about some failures of safety-critical systems: what happened, why it happened, and what we can learn from it.
If you liked this piece, you might also like:
This piece about rethinking leadership responsibilities to make “managing up” more feasible for managers.
This talk about the technology and psychology of refactoring, in which you’ll hear me explain some of the tradeoffs we’ve discussed here.
This three-part series about time and space efficiency, in which I approach the topic of performance from the perspective of a code sample. Why make things fast? Why make them take up less space? And how do we evaluate the tradeoffs?