Adventures in Process Design: A Case Study on Data Import

Reading Time: 12 minutes

This is the first case study in a series called Adventures In Process Design. We started this series with this discussion of Avdi Grimm’s Southeast Ruby keynote. Our goal here is to identify transactions in existing software and redesign them to better model the processes they enable.

Decades of transaction-focused software have trained users not to expect such sophisticated designs. So by building our skills at looking for processes, we’re preparing ourselves to write software that centers our clients even more than clients center themselves!

Are you new to this series? Welcome! Check out all the series posts right here!

Case Study: Data Import

Suppose we’re working on Wodify, an application that allows athletes to track their workouts (and a real app that I featured in this UX design post three years ago). Let’s say we’re refactoring an import feature that lets athletes import a spreadsheet of their workouts.

I ran into this exact situation at work on an app for interpreting legal text. I have changed the topic to workouts for this blog post so I wouldn’t distract from the code with unrelated legal jargon. The software in this example is structurally the same as the work project.

Holley Mangold, U.S. Olympian Weightlifter
Holley Mangold, U.S. Olympian Weightlifter

Initial State: Highly Transactional

When an athlete visits the import screen and uploads their workout CSV to import, we run the following code to insert their workouts into the system:

We iterate through the CSV and attempt to create a Workout with the information inside each row. If we cannot create a workout with a row for some reason, we insert any errors into a collection of errors, which we write to a file at the end. The athlete can look at that file to determine which rows could not be imported and why.

From here, if the athlete still wants to import those rows, they have a couple of choices:

  1. Copy all of the rows that encountered an error into a new CSV, fix them all in accordance with the error messages, and run import on the new CSV. First of all, this is highly annoying for the athlete. It takes time and concentration. Secondly, it is is risky for data integrity because it requires a lot of manual editing. A human has to copy and paste rows from an index. This is a great way to introduce errors in the data. What if they read the ordinal index as a cardinal index (or just make a mistake) and copy in the wrong row? What if they miss some text when they’re highlighting the row to copy? Risky.
  2. Delete all the workouts that did get imported from this CSV, fix all the errors in accordance with the messages, and re-run import on the original CSV. This is still annoying, though less error-prone. We can attach an import_id to each row and roll back an entire import with the click of a button (which we did in the legal text software). However, it means that the athlete has to start over every time even one row in the CSV requires a change. In the legal text software, we were importing CSVs with thousands of rows, many of which had a lot of data. A single CSV could take hours to import. So every re-import meant hours of waiting.

We’re treating import as a transaction: You give me a file, I give you workouts. But if any part of that transaction doesn’t work, then we have to start again with either a new CSV or an updated CSV.

This will happen, of course, if a row is invalid. But it can also happen if the CSV is valid but the data doesn’t pass validations. Check out our workout model:

See how we validate the presence of a date? If that date is not there, then Workout cannot be created, and we’ll get a failure on the row. We add the error message to our list and move on without creating this workout. Again, if just one workout is missing a date, we have to import another CSV.

Let’s see if we can address this in our refactor.

Step 1. Finding the Process

To begin finding the process embedded in this import transaction, we need to ask ourselves a couple of questions.

1. Who is our client? In this case, our client is an athlete. Our client might be an end user (this case), another developer (if we’re writing a framework), or even another application (if we’re writing a CI tool or something).

2. What does our client want/need? Our client wants to use our app’s features to help them grow as athletes.

3. What do we need from our client to give them what they need? We need rows that represent valid workouts.

4. Why do we need that? This question is critical. Why do we need a date on every Workout row? Maybe it’s because our software tracks the athlete’s performance over time, and we need to know the order of the workouts to see if performance has improved from earlier workouts to later workouts. OK, seems like a decent reason.

But suppose we had a validation on the presence of a field we didn’t use. If we’re not using it, we don’t need it, we shouldn’t validate on it, and our user shouldn’t have to provide it (and maybe even shouldn’t be asked for it).

The corollary to this question: is there any way we can move forward without the thing from our client?  If there is, we should find a way to continue the relationship without the thing, for now, until we really need it. To use an example from signing up for a paid SaaS product: I’d prefer if I only absolutely had to provide credit card info after the free trial, not before.

Returning to our workout app example: is there anything we can do with this workout if it doesn’t have a date? Hmmm. What if we could create a workout without a date but keep it in some kind of interim state, not using it for performance calculations, until it is complete?

Then the athlete would only have to import the CSV one time and all the information would be inside the app. We could make a screen especially to update all of the workouts that need to be fixed before we can use them to calculate performance trajectory. The athlete could visit this screen to fix the workouts and include them in the calculation. Or the athlete could decide not to visit this screen, and those workouts will stay in the interim state while only the complete workouts go into the calculation. But the athlete can keep moving toward their goal of using our features, even if some workout is missing a date.

Step 2: Modeling the Process

So we want to create workouts regardless of their validity and store invalid workouts in an interim state. Rails validations, by default, don’t work like this. They’re there precisely to ensure that only valid data enters the database. This is valuable in cases where we cannot move forward without every piece of data adhering to a rule.

It’s my humble opinion that we as programmers treat those cases as far more common than they are.

So let’s try out an alternative pattern here with a few steps for reads and a few steps for writes:

Reads:
1. Identify a scope that we will use to return only complete, valid records.
2. Use that scope anywhere that absolutely requires complete, valid records.

Writes:
1. Identify a scope that we will use to return only incomplete or invalid records.
2. Try to create/update the record with validations enabled and assign these records to the scope for complete, valid records.
3. If that fails, create/update the record with validations disabled and assign the record to the scope for incomplete or invalid records.

Let’s walk through what this might look like for our workout app. First, let’s create our scopes inside our workout model:

We have added an attribute to Workout called :status. We will use this attribute to give some of our records an interim state. Complete, valid records get status: complete, which is assigned by default to any record that gets successfully created. Invalid records get status: needs_revision, and we assign that manually whenever we need to bypass validations.

We’ll limit our reads to the complete/valid scope whenever we only want to use records that pass validations. Here are some small examples in the controller itself:

We have three methods here:

  • An index method for a page that shows all the valid workouts
  • fix_incompletes method for a page that lets the athlete see and update invalid workouts. Athletes can visit this page if they want and fix their workouts. If they don’t, those workouts won’t show up in the .complete scope, but the app won’t break.
  • A performance method for a page that shows pretty graphs and charts of the athlete’s performance over time. We only generate these charts with data from workouts that have a date: the workouts that still need revision do not get included.

Another note on our implementation: we could alternatively set a default scope in the Workout model to only return complete and valid records, then call Workout.all and only get complete and valid records back. That approach carries some risks: this StackOverflow post articulates two of the big ones. There’s a third risk not mentioned there that we should also consider: developers expect Workout.all to return all of the records, so they could get stuck on a problem if a method is silently doing something they do not expect.

OK, so we have executed on both steps for Reads and the first step for WritesNow we need to handle the creation of both valid and invalid records in our workout import. Here’s how we might do that:

Note that .save(validate: false) bypasses validations. So we try to save normally, and if it fails we save again without validations, then immediately update the status column to :needs_revision instead of its default of :complete. Another note: .update_column() also bypasses validations.

According to the Ruby on Rails Guides.save(validate: false) is not recommended practice:

2.3 Skipping Validations

[Some] methods skip validations, and will save the object to the database regardless of its validity… save has the ability to skip validations if passed false as argument. This technique should be used with caution.

  • save(false)

Why use it with caution? Because importing invalid data left and right is a great way to muck up your database if you’re not anticipating the inclusion of data that fails to meet certain requirements. In carefully considering our scopes, though, we have anticipated the inclusion of data that fails to meet certain requirements. By understanding the reason behind the recommendation, we have managed to apply the reason in situations where the recommendation doesn’t suit our needs.

In fact, it would be theoretically possible for us to make our scopes even more granular. Suppose we had several validations on a model, and each validation protected a different feature of the app. For example, maybe Workout needs a :date for performance tracking, an :altitude to calculate oxygen stress, and a :coach to send the correct progress reports to the correct coaches. We can create scopes of complete as well as with_datewith_altitude, and with_coach to use where we are managing performance tracking, oxygen stress, and progress reports. (Alternatively, we could make a habit of doing Workout.where(x)  anywhere that we access the workouts. Both implementations share a theme of thinking critically about exactly which workouts we need anytime that we ask for them.

Conclusion

This refactor encompassed a smallish code change, but it gave us an opportunity to consider the difference between how we model a software solution with a transaction versus a process. 

We tested out a methodology for identifying processes in our code by asking questions about our client’s experience:

  • Who is our client?
  • What do they need/want?
  • What do we need from them to give them what they want?
  • Why do we need that? Is there any way we can move forward without it?

Then, we found a way to codify our understanding of the import process. It required us to reconsider our use of validations: instead of letting validations dictate whether data enters our database, we now use them to determine what state to store with each record. Then, when we use the records, we scope them to the states that we need for the operation we are running. We considered the why behind validations and adjusted our use case to model the process that validations broke while preserving the safety and specificity that validations are meant to provide.

These methodologies and practices might not apply to all cases of refactoring to processes. As we do more case studies, maybe we’ll alight on some principles that do generalize. The only way to find out is to keep reading, and let’s see what we learn in the next case study!

If you liked this post and you like thinking about serving your clients through a programming lens, you might also like:

How and why pair programming might improve your code—complete with cute cartoons drawn by yours truly!

UI Design for Programmers — I wrote this post years ago about a book that came out even more years ago. And I still think about some of the stuff in this post every single time I use or build a frontend.

High-Level Reflection — about the process of improving our work lives with one of the oldest applications on the face of the planet: a notebook!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.