The Case of the Failing Upload, Pt. 1

We’ve talked about where bugs come from. We’ve discussed tactics for tracking them down. Then we zoomed out and discussed strategies to think about debugging.

It’s time to try out our techniques in the moment. Such a moment always lurks, waiting for me. In this case, it lurked in the goddess of sight herself.

Some background:

Theia (THEE-ah), Titan Goddess of Sight and of Blue Skies, Mother of the Sun and Moon

Much of the case work that I share with you comes from a volunteer-driven science platform called the Zooniverse. Our software makes it easy for researchers to ask for help from folks like you to discover a new kind of galaxy or spot animals on the Serengeti.

It’s less common for a Zooniverse project to combine elements from the sky and the Earth. But the Floating Forests project does that; it organizes images of Earth’s surface taken by NASA Landsat satellites. Citizen scientists help spot trends in kelp growth along our coastlines, which tells us about climate change, biome changes, and more.

Does that seem like a cool project? It is. We’d love to facilitate more Earth science projects like this one. The problem: getting the images ready is a lot of work. Researchers fetch the Landsat data, pluck specific channels of that data (say, red light or blue light) from the available options, filter the data for minimal cloud coverage and maximum coastline, resize the resulting images, and add them to the project online to be kelp-searched.

We hope for Theia to make this process easy. Researchers should be able to choose which satellites they want to get images from, as well as the locations for which they want those images. They should be able to mix and match the things they need done to the image, like resizing, choosing channels, or filtering for cloud cover. Finally, they should be able to upload their newly readied data to the Zooniverse for citizens to science.

The Current State of Affairs

I started working on Theia in November of 2019 as the previous engineer, who had built it alone, moved on to another position.

I understood that some version of Theia’s pipeline should already work. So I assembled a small pipeline of tasks that would request a few Landsat images, remap them, resize them, and then upload them to a test project on the Zooniverse. If I could get this pipeline to run from start to finish, I could confirm that we had the foundation in place for researchers to build functioning imagery pipelines.

Here’s the JSON version of the pipeline I put together to run this test:

{
  "data": [
    {
      "type": "PipelineStage",
      "id": "2",
      "attributes": {
        "sort_order": 1,
        "output_format": null,
        "operation": "image_operations.remap_image",
        "select_images": [
          "green",
          "blue"
        ],
        "config": {}
      }
    },
    {
      "type": "PipelineStage",
      "id": "3",
      "attributes": {
        "sort_order": 2,
        "output_format": null,
        "operation": "image_operations.resize_image",
        "select_images": [
          "green",
          "blue"
        ],
        "config": {
          "width": 200,
          "height": 200
        }
      }
    },
    {
      "type": "PipelineStage",
      "id": "4",
      "attributes": {
        "sort_order": 3,
        "output_format": null,
        "operation": "panoptes_operations.upload_subject",
        "select_images": [
          "green",
          "blue"
        ],
        "config": {}
      }
    }
  ]
}
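
As a reading aid, here is a small sketch of how the stages in a document like this could be put in running order. It assumes that sort_order decides the order in which stages run, which the field name suggests but which I have not confirmed against Theia's code. The helper name operations_in_order is made up for this example, and the JSON is a trimmed copy of the pipeline above with the stages deliberately shuffled.

```python
import json

# Trimmed, shuffled copy of the pipeline above (illustration only).
PIPELINE_JSON = """
{
  "data": [
    {"type": "PipelineStage", "id": "4",
     "attributes": {"sort_order": 3, "operation": "panoptes_operations.upload_subject"}},
    {"type": "PipelineStage", "id": "2",
     "attributes": {"sort_order": 1, "operation": "image_operations.remap_image"}},
    {"type": "PipelineStage", "id": "3",
     "attributes": {"sort_order": 2, "operation": "image_operations.resize_image"}}
  ]
}
"""


def operations_in_order(document):
    """Return the operation names sorted by each stage's sort_order.

    Hypothetical helper: assumes sort_order controls execution order.
    """
    stages = sorted(
        document["data"],
        key=lambda stage: stage["attributes"]["sort_order"],
    )
    return [stage["attributes"]["operation"] for stage in stages]


ordered_operations = operations_in_order(json.loads(PIPELINE_JSON))
print(ordered_operations)
```

Read this way, the test pipeline is remap, then resize, then upload, which matches the order I described above.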

Theia runs on a set of workers. An entire pipeline can take a while to run because it starts with ordering images from the NASA Landsat service. The service places these orders on a queue to be filled like orders at your coffee shop. The app polls the service (repeatedly asks if the orders are finished) until they are ready to download. My starter pipeline would run for 15 to 25 minutes.
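
If polling is new to you, here is a minimal sketch of a poll-until-ready loop. This is not Theia's code: poll_until_ready, check_status, and the "complete" status string are hypothetical names chosen for illustration, and the fake order at the bottom stands in for the real order queue.

```python
import time


def poll_until_ready(check_status, interval_seconds=30.0, max_attempts=60, sleep=time.sleep):
    """Ask check_status() repeatedly until it reports "complete".

    Returns the number of checks it took. Raises TimeoutError if the
    order is still not ready after max_attempts checks.
    """
    for attempt in range(1, max_attempts + 1):
        if check_status() == "complete":
            return attempt
        sleep(interval_seconds)  # wait before asking again
    raise TimeoutError(f"order not ready after {max_attempts} checks")


# A fake order that becomes ready on the third check (illustration only).
fake_statuses = iter(["queued", "processing", "complete"])
attempts_needed = poll_until_ready(lambda: next(fake_statuses), interval_seconds=0)
print(attempts_needed)
```

The waiting between checks is where most of those 15 to 25 minutes go, which is why a faster feedback loop matters so much once the debugging starts.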

Then…it would fail. Observe the worker logs at the point of failure: awash in red, like a crime scene from a slasher film.

And so begins our mystery: The Case of the Failing Upload.

How do we know it’s a case of a failing upload? The logs provide us with important clues: the failure happens on line 54 in a file called upload_subject.py. This is the file where we keep the code designed to upload the modified images to the Zooniverse.

Our code fails with a PanoptesApiException: Could not find the project with id='10633'. Panoptes is the name of the API where we modify Zooniverse data, and our app depends on a client library (code written in its own separate code base and exported into our app) specifically for connecting to that API. The library includes an object also called Panoptes.

Here’s what upload_subject.py looks like:

from os import getenv
from ..abstract_operation import AbstractOperation
from panoptes_client import Panoptes, Project, Subject, SubjectSet
from theia.utils import PanoptesUtils


class UploadSubject(AbstractOperation):
    def apply(self, filenames):
        # The subject set id is stored on the bundle or on the pipeline.
        if self.pipeline.multiple_subject_sets:
            scope = self.bundle
        else:
            scope = self.pipeline

        self._connect()
        target_set = self._get_subject_set(scope, self.project.id, scope.name_subject_set())

        # Create one subject per image file and add it to the target set.
        for filename in filenames:
            new_subject = self._create_subject(self.project.id, filename)
            target_set.add(new_subject)

    def _connect(self):
        Panoptes.connect(
            endpoint=PanoptesUtils.base_url(),
            client_id=PanoptesUtils.client_id(),
            client_secret=PanoptesUtils.client_secret()
        )

    def _get_subject_set(self, scope, project_id, set_name):
        subject_set = None
        if not scope.subject_set_id:
            # No subject set yet: create one and remember its id on the scope.
            subject_set = self._create_subject_set(project_id, set_name)
            scope.subject_set_id = subject_set.id
            scope.save()
        else:
            subject_set = SubjectSet.find(scope.subject_set_id)

        return subject_set

    def _create_subject(self, project_id, filename, metadata=None):
        subject = Subject()
        subject.links.project = Project.find(project_id)
        subject.add_location(filename)
        if metadata:
            subject.metadata.update(metadata)

        subject.save()
        return subject

    def _create_subject_set(self, project_id, subject_set_name):
        project = Project.find(project_id)

        subject_set = SubjectSet()
        subject_set.display_name = subject_set_name
        subject_set.links.project = project
        subject_set.save()
        return subject_set

So the line project = Project.find(project_id) throws the exception.

At this point, we might start to consider some hypotheses for what could be going wrong here. Maybe the id—10633—is the wrong id for the project. Or maybe the exception we see is a catchall error for the find API call, and we’re not connected to Panoptes at all.

We could make a guess at which of these hypotheses is correct and jump right into fixing that. I call this The Standard Strategy for debugging, and it looks like this:

Debugging 1: Prioritizing changing code at the place we think the bug is most likely happening.

If we understand the behavior of our code, then The Standard Strategy is often the quickest way to resolve the bug.

The problem arises when we don’t understand the behavior of our code and we keep repeating this strategy as if we do. The less we understand the behavior of our code, the lower the correlation between the things we think are causing the bug and the thing that’s really causing the bug, and the weaker this strategy becomes, until we’re here:

Debugging, Retrying things that don't work

This is exactly how developers lose a ton of time on insidious code issues. It’s the detective equivalent of starting the case by choosing a suspect and focusing on capturing that suspect rather than collecting evidence. It’s an expedient approach if that suspect is, in fact, guilty. But if they’re not, we’re back where we started, except more frustrated and with less time.

It’s not a smart approach when we aren’t quite sure what’s going on yet. Instead, we need to start our case by gathering more clues.

I’d like to gather clues expediently. I can’t do that if I have to wait 15-25 minutes for this whole pipeline to run in between each thing I try. So before I do anything else, my first debugging step is to find a way to reproduce the bug in a feedback loop that takes less than ten seconds.

Here’s how I try to do that: open a console in the app directory…

import Project to my interactive session…

And run the line of code that threw the exception all by itself.

And that’s the exact same exception we got on our worker thread. We can reproduce our issue quickly. Score! Now we can test several hypotheses in rapid succession.
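
In text form, the console steps above amount to importing Project and calling find with the id from the log. The sketch below wraps that call in a small helper so the outcome gets reported instead of blowing up the session. try_call and fake_find are hypothetical names made up for this example; only the body of reproduce_in_console touches the real library, and it assumes panoptes_client is installed and the network is reachable.

```python
def try_call(function, *args):
    """Run function(*args) and report the outcome instead of raising.

    Returns ("ok", result) on success or ("error", exception) on failure.
    """
    try:
        return "ok", function(*args)
    except Exception as error:  # broad on purpose: we only want to observe
        return "error", error


def reproduce_in_console():
    """The real check; needs panoptes_client installed and network access."""
    from panoptes_client import Project

    return try_call(Project.find, 10633)


# Offline stand-in for the failing lookup, so the helper can be tried anywhere.
def fake_find(project_id):
    raise LookupError(f"fake lookup failed for id {project_id}")


status, outcome = try_call(fake_find, 10633)
print(status, outcome)
```

Each run of a check like this takes seconds rather than the 15 to 25 minutes the full pipeline needs.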

I think this issue might be happening because we aren’t properly logged in. My Zooniverse project, called ‘Example Image Project,’ isn’t public yet: I have left it unpublished for now while I use it to test Theia. Since it’s unpublished, the project probably requires some permissions to see it. Panoptes isn’t going to just let every Tom, Dick, and Harry ogle every project in progress.

If my hypothesis is correct, then the Project.find command should work for a project that is public. How will we find a public project? We could look on the site itself.

But we don’t have a good way to find the id of any of these public projects there. The URL uses a slug rather than an id.

But maybe we can issue another call to Project to get some more information about the projects.

What can we call on the Project object? We can find out by typing dir(Project) into our console.

dir is one of Python’s built-in functions for introspection. Under the hood, it asks the object for its attribute names by calling a method named __dir__, which is part of Python’s data model. You might know such methods by their colloquial name: “dunder methods” (for the double underscore at the beginning and end of the method name).

We use introspection tools like this to get more information about an object, rather than to tell it to do something. Many object-oriented languages have an equivalent, such as a .class method for finding out an object’s class. In Python, built-ins like dir and len aren’t called on the object in question. Instead, we pass the object in as an argument, and the built-in delegates to the matching dunder method defined for that object’s type (dir uses __dir__, and len uses __len__).
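
Here is a small, self-contained sketch of the dir trick on a toy class. Kettle and public_names are made up for this example; the only real API in play is Python's built-in dir.

```python
class Kettle:
    """A tiny stand-in class; dir() works on any object."""

    def __init__(self):
        self.temperature = 20

    def boil(self):
        self.temperature = 100


def public_names(obj):
    """List attribute names from dir(obj), skipping underscore-prefixed ones."""
    return [name for name in dir(obj) if not name.startswith("_")]


print(public_names(Kettle))    # names defined on the class
print(public_names(Kettle()))  # an instance also lists attributes set in __init__
```

Filtering out the underscore-prefixed names is a convenience I find helpful when the output is long: what remains is usually the part of the object meant for callers.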

We have one more advantage in figuring out what Project can do: the Python Panoptes client library is open source, and we can look at the source code. Moreover, the methods have decent inline documentation, including the find method that throws the exception.

We need to get the ids of some public projects, so we’re looking for a method that will allow us to ask for those. How about the where method?

Let’s try it out in our console:

We get back a thing called a ResultPaginator. No idea what that is. Let’s assign it to a variable (results) and then repeat our dir trick from before to find out what we can ask it to tell us:

There’s an attribute here called object_list. That sounds like it might be a list of the objects returned from the Project.where call, which just might be the project objects we want. Let’s call that on our results object and see what we get.

That very first project, called WildCam Gorongosa, has an id of 593. So if we try to fetch that project instead of my unpublished Example Image Project, the call should work.

And it does!
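
For anyone following along without the console in front of them, here is a sketch of a slightly different route to the same information. Instead of reading object_list, it iterates over the paginator that Project.where returns, which hands back project objects one at a time. Two assumptions to flag: I call Project.where with no filters (the exact arguments from my session aren't reproduced here), and first_ids_and_names is a hypothetical helper, not part of panoptes_client. The fake projects at the bottom exist only so the helper can run offline.

```python
from itertools import islice
from types import SimpleNamespace


def first_ids_and_names(projects, limit=5):
    """Return (id, display_name) for the first `limit` projects in an iterable."""
    return [(project.id, project.display_name) for project in islice(projects, limit)]


def list_some_public_projects(limit=5):
    """The real lookup; needs panoptes_client installed and network access."""
    from panoptes_client import Project

    # Assumption: no filters. Iterating the paginator yields Project objects.
    return first_ids_and_names(Project.where(), limit)


# Offline stand-ins shaped like project objects (illustration only).
fake_projects = [
    SimpleNamespace(id="593", display_name="WildCam Gorongosa"),
    SimpleNamespace(id="1", display_name="Made-Up Project"),
]
print(first_ids_and_names(fake_projects, limit=1))
```

Limiting the output with islice keeps the check quick: one public id is all the hypothesis needs.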

Now, if I connect a Panoptes client with the username and password that I used to create the Example Image Project, can I fetch 10633 the same way I fetched 593 when I wasn’t logged in?

It looks like I can.
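
Here is a sketch of that logged-in check that keeps the credentials out of the code by reading them from environment variables. The variable names THEIA_DEBUG_USERNAME and THEIA_DEBUG_PASSWORD and the helper credentials_from are hypothetical, invented for this example. Only the body of find_while_logged_in touches the real library, and it assumes panoptes_client is installed and the network is reachable.

```python
def credentials_from(environ):
    """Pull a login out of a mapping such as os.environ.

    The variable names are hypothetical, chosen for this sketch.
    Raises KeyError if either one is missing.
    """
    return {
        "username": environ["THEIA_DEBUG_USERNAME"],
        "password": environ["THEIA_DEBUG_PASSWORD"],
    }


def find_while_logged_in(project_id, environ):
    """The real check; needs panoptes_client installed and network access."""
    from panoptes_client import Panoptes, Project

    Panoptes.connect(**credentials_from(environ))
    return Project.find(project_id)


# Offline demonstration with made-up values.
example_login = credentials_from({
    "THEIA_DEBUG_USERNAME": "example-user",
    "THEIA_DEBUG_PASSWORD": "not-a-real-password",
})
print(sorted(example_login))
```

In a real session, the call would be find_while_logged_in(10633, os.environ) after importing os.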

(By the way, the line where I type in my username and password is one of the reasons this is a walkthrough and not a live stream. I love you all, but no, you cannot have my professional API credentials.)

This is fantastic news!

There’s a problem, though. This username and password combination is not, in fact, how we authenticate ourselves with Panoptes in client applications. Instead, we do it with something called social auth. Since this walkthrough is getting long, we’ll jump into social auth in part 2.

But here’s the critical takeaway for our debugging approach:

We programmers tend to think in building mode. But while we’re debugging, we often get more mileage for our time spent by switching to investigating mode.

To that end, throughout our debugging session, our question has not been “How would I solve this if I’m right about what the problem is?”

But rather “How little work can I do to confirm that I’m right about what the problem is?”

The difference here saves us time every time we’re wrong—which is a lot more of the time than we realize.

In the next post, we dive deeper into the code and determine how to get upload working.

If you liked this piece, you might also like:

The other posts about Zooniverse projects (most of them include live coding!)

The series on reducing job interview anxiety—no relation to this post, but people seem to like this series.

This talk about the technology and psychology of refactoring—in which you’ll hear me explain some of what you see me do when I code.
