This past weekend I spoke at the inaugural instance of Blue Team Con, a technical security conference based here in Chicago.
I only managed to catch a few talks on Saturday. That said, I thought the organizers did an excellent job—particularly for a first-time conference, and particularly given that they planned on starting in 2020 and then had to postpone. The organizers had set up one-on-one resume help and a lightning talk room in addition to two tracks. They also set up a large vendor village, took great care of speakers, and arranged several social activities for the evenings surrounding the conference.
For my talk, I prepared a compressed, 20-minute version of some of the material I present in my workshop “Parrot Emergency!” about application risk analysis.
Without further ado, I present to you: my slides and approximate transcript from the talk.
Full Transcript

Hi! I’m Chelsea Troy. And today we’re going to talk about analyzing application risk. Which means, we’ll be talking about a system to arrive at an answer to the question:

What could go wrong in an application?
And how can we tell if it’s wrong?
And how do we know how bad it is?
And what can we do to fix it?
A number of the systems I’ve seen recommended for use with engineering teams are complicated enough that the engineering teams end up not using them. Instead, the engineers go with their guts. Hopefully what I describe today will be accurate enough to be useful on your team, but also, simple enough to be used.

Does anyone know what this is?
It’s the Mars Climate Orbiter, launched in December of 1998. In September of 1999, we lost it, trying to put it into orbit around Mars.
The craft was designed and operated by Lockheed Martin out of Denver, and it provided thruster firing data in English units. The navigation system was designed by NASA JPL out of California, and it used metric units. As a result, these two systems interoperated incorrectly and flew the craft far too close to the Martian surface, where it likely ricocheted off the atmosphere or broke up during atmospheric entry.
The folks who designed this system were literal rocket scientists. How could something like this happen?
This is the curse of leverage. By placing ourselves in positions to decide how software works, we place ourselves in positions to lose a $125 million orbiter, or in positions of responsibility, possibly over safety and lives.

BUT WAIT!
What’s this lady doing onstage at a cybersecurity conference talking about orbiters? How is that example related to security at all?
Well, I left a piece out of the story a minute ago, and I think it’s illustrative. NASA doesn’t hold Lockheed at fault for the issue that caused the loss of the orbiter, even though it was Lockheed that, at odds with the spec, failed to follow the International System of Units. Instead, Arthur Stephenson, who chaired NASA’s investigation board, called out a rush for delivery on the small forces model that prompted NASA to abbreviate its testing of the integration between the two systems. Had that testing been performed, he believes the issue would have been caught.
This wasn’t a hacker. It wasn’t a finding from a government audit. It wasn’t deliberate social engineering. It was an integration issue with a trusted collaborator that got overlooked because it happened in a place where the engineering team ASSUMED the risk was low.
And as such it’s an excellent demonstration of the primacy of taking a holistic view of the risks that affect our system—not just the security ones, or the ones most closely related to our own area of expertise.
So we’re going to walk through, now, a series of steps to do that. I’ll try to keep it brief to leave time for questions, but detailed enough that the fastidious note takers among you will be able to replicate this system for your own projects after we go through it.
To do this, I need an example project to show you. But I want to avoid potentially triggering examples at our mid-pandemic conference. So we’ll adopt a more lighthearted example for this talk.

Does anyone know who this is?
How about what this is?
This is Serrano, a scarlet macaw. She is one of two greater macaws cared for at the Shedd Aquarium in Chicago. She has a roommate named Pablo, a blue and gold macaw.
Now if something were to happen to Pablo or Serrano, they’d need help from a vet. And for the purposes of this talk, we’ll imagine that they’d be brought to a special veterinary clinic just for birds.
Just like a human hospital, the Parrot Emergency Room has triage, bird beds, et cetera.
And let’s imagine, for a moment, that we are all on the engineering team that will be building the software used at this parrot emergency room.

The first question to answer is: what kinds of risks might we encounter? Let’s go through several categories.
The first is the most obvious: functionality. Something could fail to work, or it could fail to work well. This is generally what we think of when we imagine “what could go wrong in an application.”

The second is scaling—something could fail when a lot of people use it at once, or more generally, when demand outstrips resource availability.
One clear proxy for how much development support has gone into this is “whose mobile app completely tanks in remote locations, and whose has mechanisms by which it can at least do something?” That’s an example of demonstrating scaling limits by constricting resources. Things can go the other way too: a system might reach scaling limits due to an explosion of demand. When either of those situations results in software failing to function, that’s a problem.

Third, robustness—something could fail when given invalid inputs by accident. This is super common when folks are filling out forms, or really anywhere that humans interact directly with software.
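To make that concrete with our parrot example: suppose the intake form asks for a patient’s weight in grams. Here’s a minimal sketch of what guarding that one field might look like—the function name and the plausible-weight bounds are my own inventions for illustration, not part of the actual system.

```python
# A minimal sketch of guarding one form field against accidental bad input.
# The function name and the weight bounds are invented for this example.

def parse_weight_grams(raw: str) -> float:
    """Parse a weight typed into the intake form, rejecting accidental bad input."""
    try:
        weight = float(raw.strip())
    except ValueError:
        raise ValueError(f"Weight must be a number; got {raw!r}")
    # Even the largest macaws weigh well under 2 kg, so values outside this
    # range are almost certainly typos (wrong units, stray digits, etc.).
    if not 10 <= weight <= 2000:
        raise ValueError(f"{weight} g is outside the plausible range for a parrot")
    return weight


print(parse_weight_grams("950"))   # 950.0
# parse_weight_grams("95o") or parse_weight_grams("95000") would raise ValueError
```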

Fourth, security—just as something could fail when given invalid inputs by accident, something could fail to guard against a deliberate attempt to compromise the system. I suspect a lot of the crowd here today has some experience with this one!

Fifth, accessibility—something could fail for folks who use accessibility aids. This sounds like a marginal case, but it becomes catastrophic when, say, an app designed to help folks dial 911 for a medical emergency fails to present the call button in a way that a screen reader can identify.

Finally, inclusion—something could fail to consider the needs of people it’s supposed to serve, but who aren’t the decision makers. Suppose an app released a feature that allowed users to directly message one another, but had no permissions settings or blocking mechanisms. That makes the new feature a near-perfect harassment vector. That’s a problem, and rolling back all that feature work is expensive!

Now that we’ve discussed several categories of risk, it’s time for us to identify risks in our application. One tool that I find especially helpful for this step is a class model diagram like this one, which tracks the flow of data through the system. This system aggregates information from a server that stores breed information, a tablet where veterinarians enter patient conditions, and a collection of harnesses that monitor patients’ vital signs. It uses that information to assign a condition risk to each patient, and it surfaces that information to veterinarians through a list interface and mobile notifications.
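If it helps to see the same flow as code rather than boxes and arrows, here is a rough sketch of the pieces of data moving through the system. The class and field names are my own shorthand, not labels from the actual diagram.

```python
from dataclasses import dataclass

@dataclass
class BreedInfo:            # fetched from the breed information server
    breed: str
    typical_weight_grams: float

@dataclass
class ConditionReport:      # entered by a veterinarian on the tablet
    patient_id: str
    symptoms: list[str]

@dataclass
class VitalSigns:           # streamed from a patient's monitoring harness
    patient_id: str
    heart_rate_bpm: int
    temperature_celsius: float

@dataclass
class PatientPriority:      # what the system surfaces in the list and notifications
    patient_id: str
    condition_risk: str     # e.g. "critical" or "stable"
```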

Now it’s time to use the risk categories we discussed. We can annotate this diagram with any risks that we see in each area, like so:

This is an annotation that one of my Computer Science master’s program classes created. I know it’s small for you to see, so I’ll zoom in on a portion:

Here we see some of the risks that students identified. They include data tampering, a delayed response from the API, and even a mixed-breed bird that the breed information API may not have data about.

Once we have identified risks, we need to determine how “bad” they are. This can be complicated, and what we need here is a helpful heuristic rather than a perfect model. I like to ask three questions to determine which risk amplifiers each risk has.
First, is this catastrophic if it goes wrong? Meaning, can the system no longer perform its core function if this goes wrong? For example, the server going down might prevent the system from prioritizing birds at all.
Second, is it likely to go wrong? For example, it’s common for humans to enter wrong information in a form—much more common than for, say, a multi-region deployment solution to come offline because all of the regions go down at once.
Third, is it likely to go uncaught? To me, this is the most insidious and underestimated of these risks, and many security risks fall into this category. Hackers can keep access to sensitive data and systems if they can gain that access without anyone finding out. The longer a defect can run, the more damage it can cause.

Here, I like to ask my students to assign risk amplifiers to each of the risks they identified on the class model diagram. I have provided an emoji of a fire for catastrophic risks, a die for likely risks, and a detective for risks that might go uncaught. I then ask the students to copy and paste those emojis and move them next to the risks to which they apply.

Here, you can see the complete picture of my students’ risk amplifier assignments. Let’s zoom in again:

You can see some of their answers here. Data tampering is labeled catastrophic. Mixed breeds are labeled likely. Inaccurate info from bird owners is labeled likely to go uncaught.
There are only a few in this example, but it is also possible for a single risk to have multiple risk amplifiers. So don’t be afraid to give more than one amplifier to a risk when it applies.
What I DON’T do, is worry about the relative size of these risks. For example, I’m not focused on comparing two likely risks to determine which is MORE likely. Treat these risk amplifiers as binary classifications.

Now it’s time to prioritize which risks to address. For this, one relatively speedy and efficacious strategy is to start with the ones that carry all three risk amplifiers, then focus on the ones that only have two, then focus on the ones that only have one.
Eventually, this will mean prioritizing, for example, a likely and catastrophic risk over a catastrophic, potentially uncaught risk. At that point, the team needs to make a judgment call on what’s worse for them—something likely or something uncaught. But this amount of judgment needed in the prioritization process has not proven to be a big slowdown in my experience, and creates an opportunity for contextual knowledge about the problems to enter the picture without halting the risk prioritization process.
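For the note-takers, here is one minimal way to do that bookkeeping in code: each risk carries the three amplifiers as binary flags, and the list is sorted by how many flags each risk has. The risks echo the student examples above, but the specific flag assignments and the third risk are my own illustrative picks, not the students’ exact answers or a prescribed tool.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    description: str
    catastrophic: bool = False   # 🔥 the system can't perform its core function
    likely: bool = False         # 🎲 plausibly happens in normal operation
    uncaught: bool = False       # 🕵️ could go unnoticed for a long time

    @property
    def amplifier_count(self) -> int:
        return sum([self.catastrophic, self.likely, self.uncaught])

risks = [
    Risk("Breed API has no data for a mixed-breed bird", likely=True),
    Risk("Tampering with vital-sign data from the harnesses",
         catastrophic=True, uncaught=True),
    Risk("Prioritization server goes down", catastrophic=True),
]

# Address risks with three amplifiers first, then two, then one.
# Ties within a count are the judgment call described above.
for risk in sorted(risks, key=lambda r: r.amplifier_count, reverse=True):
    print(risk.amplifier_count, risk.description)
```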

Once we know what order to address risks in, we have to determine how we will address them. Unfortunately, I cannot talk fast enough to explain how to address every possible risk in a 30 minute talk. But I can offer a framework for how to think about this as well as some example solutions.

If the things that amplify our risks are their consequences, their likelihood, and their potential to go uncaught, then it makes sense for us to address those amplifiers directly by identifying ways to reduce consequences, reduce likelihood, and increase catchability.

Here are some examples for reducing consequences. We might introduce fallbacks, like sensible defaults that we use when a piece of information doesn’t come back from a server. Or increasing tolerances, like shutting off parrot reprioritization when a vet’s phone enters low power mode and serving only critical condition notifications. Or redundancy: backup servers and cell network to prevent some causes of lost connectivity.
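As a sketch of the fallback idea, here is roughly what a sensible default might look like when the breed information server doesn’t answer. The client object, its get_breed method, and the default values are all hypothetical stand-ins, not the real system’s API.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical conservative default used when the breed server is unreachable.
DEFAULT_BREED_INFO = {"breed": "unknown", "typical_weight_grams": 1000.0}

def fetch_breed_info(client, breed_name: str) -> dict:
    """Fetch breed data, falling back to a sensible default on failure."""
    try:
        return client.get_breed(breed_name, timeout=2.0)
    except Exception:
        # The consequence shrinks: prioritization keeps running with less
        # precise information, and the warning makes the failure catchable.
        logger.warning("Breed server unavailable; using defaults for %s", breed_name)
        return DEFAULT_BREED_INFO
```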

Here are some examples for reducing a risk’s likelihood: automated testing! This is an especially helpful mechanism for ensuring that a known issue that got pushed to production never happens again. Quality assurance involves having a team specifically focused on inputting anything that might make an app crash and ensuring that this one doesn’t. In the absence of the resources for either of those, a regression checklist can smooth over major gaps: write down every feature that absolutely has to work at all times, and have a couple of people run through the list before every deployment. It sounds rudimentary, but it’ll catch stuff. Commercial aviation and spaceflight teams both use checklists like these.
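Here is a tiny pytest-style sketch of what pinning down a known issue looks like. Both triage_order and the bug it once had are invented for the example; the point is that the test now stands guard over the behavior.

```python
# A regression test written so a once-shipped bug can never quietly return.
# triage_order is a stand-in for the real prioritization function.

def triage_order(patients: list[dict]) -> list[dict]:
    # Critical patients sort first; imagine an earlier bug reversed this order.
    return sorted(patients, key=lambda p: p["condition_risk"] != "critical")

def test_critical_patients_come_first():
    patients = [
        {"name": "Pablo", "condition_risk": "stable"},
        {"name": "Serrano", "condition_risk": "critical"},
    ]
    assert triage_order(patients)[0]["name"] == "Serrano"
```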

And here are some examples for increasing catchability. Frequent and descriptive logging makes it easy, and in some cases even possible, to track down issues in production after they occur. Penetration testing, like QA, involves focusing a team of people specifically on deliberately attempting to hack the system, to make sure those attempts don’t succeed. Auditing happens more infrequently, and tends to operate analogously to the regression checklists we discussed. It can be stressful and unpopular, but you’d usually rather an audit (especially an internal one) catch a problem than your customers see it.
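And a minimal sketch of “frequent and descriptive” logging, using Python’s standard logging module. The logger name, function, and message format are illustrative assumptions, not the real system’s conventions.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("parrot_er.prioritizer")

def record_reprioritization(patient_id: str, old_risk: str, new_risk: str, reason: str) -> None:
    # Capture what changed, for whom, and why, so an unexpected change in a
    # patient's priority can be traced in production after the fact.
    logger.info("patient=%s risk %s -> %s (%s)", patient_id, old_risk, new_risk, reason)

record_reprioritization("serrano-001", "stable", "critical", "new vitals from harness")
```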

Identifying risks and undertaking prevention and mitigation strategies is key to engineering. The skill remains important on any stack, in any language, in any year, for any hardware or software. I hope that some of what we talked about today gives you some tools to bring back to your team for addressing risks more effectively and with less stress.
Thank you!

If you liked this transcript, you might also like:
The talks category on my blog, where you’ll find more transcripts like this one
This piece about the pitfalls of data-driven innovation – it’s not super related to this talk, but the audience that likes this talk will get some food for thought out of it, I promise
The Siren Song of the User Model – again, no direct relation to this talk, but it’s just good and I’m proud of it and this is my internet house with my name on it and I get to recommend whatever I want, dingdang it.