RubyConf 2021: What did participants say in the Tackling Technical Debt workshop?


Two days ago, I facilitated my first in-person workshop since The Before Times.

First thing I noticed: what this lighting does for my arms. My Twitter bio says “femme jock” for a reason 😉

I have placed this post in the “talks” category alongside a litany of talk transcripts interspersed with full slide decks. You can look at the slides here if you like, but this post contains no transcript: instead, RubyConf has produced a video of the workshop, complete with captions.

By the way, for the curious: before you see me do a workshop onstage, I have playtested that workshop at least twice: once where I bribed some colleagues I respect to rip it apart, and once where I gave it for free to a larger audience sourced from Twitter. For example, here’s the second playtest I did for the workshop you see above. If you’d like to be part of my free playtests, follow me on Twitter and watch out for playtest announcements.

Also by the way, this workshop is still under heavy development. I currently give two tech debt workshops: the two-hour one I gave at this conference, which focuses on individual contributors, and the three-hour version I give through O’Reilly, which digs deeper into tactical details for tech leads and directors of engineering. Email me (chelsea at chelseatroy dot com) if you’d like to bring this workshop to your org!

There’s also a longer course in development, in which I hope to entertain the shitstorms (read: explore the tangents) that I curtail in the workshops to beat my time cap. Throw your email into this form if you want to hear about that course when it comes out. Optionally, you can throw “I wasn’t at RubyConf but I read the blog post” into one of the other fields.

If you like the title but you’re not ready for a whole workshop yet…

The workshop is based on this blog post series. There are three parts in the series. The workshop, clearly, digs into some of the more tactical ideas and allows you to participate more with exercises and discussion, but if you’re like “just give me the main points Chelsea,” this blog post series covers a lot of the big ones.

I also wrote this follow-up piece about adjusting engineer evaluations to better reward the behaviors that contribute to keeping the code base maintainable.

My favorite part of workshops is the audience participation.

And because I love that part so much, I make an effort to hang onto audience contributions for later. I did that with the RubyConf run of the workshop, and I’m pleased to share those results with you below.

This workshop grew out of this blog post series, and it focuses on:

  1. A specific definition of technical debt
  2. How maintenance load accumulates
  3. How to prevent or slow the growth of maintenance load
  4. How to address existing maintenance load to get a team out from underwater fast.

I ask participants “What’s your worst maintenance load horror story?”

At RubyConf, from an audience of ~150, here’s what we got in reply to the question:

  • This Google Doc
  • An endpoint for a tv to play advertising videos, which queried for the next 3 videos to play (why not just the next one?) and then proceeded to download those three videos. Every time. It also would randomly bug out and re-download the video player itself about every 20 plays.
  • Using an outdated PayPal API that went down when the API shipped a breaking change and stayed broken for a long time; when fixed, it was estimated to bring ~600K in annual revenue to the company.
  • A primary developer of service A left the company so leadership figured that it would be better to delete it and rebuild the service (serving 1M customers) from scratch.
  • We have code on a legacy server that is written in a language that no one currently knows <- RT this x2
  • 14-month refactoring project that was abandoned because the team couldn’t reason about how to actually roll out the “new system”
  • 5+ years ago, we wrote a bunch of custom wrappers to make our excel exports work and people are now so terrified of messing things up that our excel reports are grossly out of sync with the rest of our app’s features.
  • Very few tests in the codebase, and the tests that were there flaked constantly, >70% of the time.
  • Maintaining two different client side apps for all new features as we migrate off of angularJS
  • A third-party decided an SDK version that was maybe a year old would suddenly, and without announcement, no longer be supported
  • Our test suite broke during a leap day deploy
  • Working at a retail vendor that had critical ColdFusion services that no one knew how to properly configure and restart in a crisis situation. This was in 2019
  • Looked into a report of a few users who hadn’t received an alert – this was about 4 months after going live. Upon investigation I found that the alert had never fired for anyone and was, in fact, completely broken.
  • Nobody understands the ecommerce code
  • Used to work for the biggest financial networking company in AMRS/EMEA. Our code for pulling packets off the NIC and trafficking them was maintained by a Russian contractor who coded in C using Russian names for everything. No one in the entire company spoke Russian, and he left.
  • Completely untested codebase with no sample data. No one knew how it worked.
  • We were on Rails 3.2 until we were on Rails 4. This was a few months ago
  • Multiple developers left the team within a month of each other creating a vacuum of knowledge of an offering we had just shipped in the previous quarter.
  • A developer wrote a C extension to parse XML “Faster” than original libraries, and required a different interface that forced the entire rest of the application to conform to the new required information sets
  • Core service implemented by contractors; nobody wants to touch it, and attempts to refactor it have ended with it being even more spaghetti.
  • Having a core service written in COBOL
  • A team attempted to refactor a system by creating a better approach and then using a dual-write approach to verify information. But it was never finished; we have always dual-written data, and now the old system depends on the new system, so we cannot roll back anymore, and the person who started the process and knew the state of it has long since left.
  • A developer on our team left, who was the primary author of one of our main features. It took us months to understand how the feature worked and to update/scale it to our new needs.
  • A senior dev was out sick for a week at the same time we were launching a new production site, and our prod setup includes using a gem he wrote that had zero documentation. Something went wrong and we had no idea where to even start.
  • Rewriting presentation logic to fit an outdated business class that most of our cloud devices were inheriting from. 
  • Broke an existing feature because no one realized it was connected to the change we were making.
  • Using Rails 3 in 2021 
  • We started creating teams and more layers for categorizing/prioritizing bugs
  • We have a passing Rails 6 build (coming from 5.2) but we’re afraid to try and deploy it
  • I’m trying to finish a refactor that was started three years ago but nobody knows why we didn’t finish it.
  • I once worked somewhere where the team would approve 50+ database migrations a day to fix bad data. Rather than fix the features, the organization decided to allow the support team to write migrations to fix customer data.
  • A previous employer had a CI pipeline with specs that had been failing for years. The company leadership didn’t want to “invest unnecessary effort” in restoring the CI pipeline, so we maintained the codebase by modifying the post-compile code in production.
  • Trying to debug cryptic error messages received from ADP’s system and third party subscription service.  I couldn’t understand why our code, written by a dev who had left 5 years ago, seemed to be so convoluted.  The ADP docs would note that they could not guarantee that their online documentation was up-to-date, and it took an extensive discussion in their developer support Slack with an engineer to understand how their integration process was actually supposed to work, which was not what our code was doing and also not fully covered in their own documentation.  A one hour session to document the problems and recommend a fix yielded a list of what would probably need to be an epic with at least 10 tickets.  Meanwhile, this feature is completely broken and there is no manual workaround support staff can use.

I also ask participants “How about the opposite?” to gather their experiences with systems that they found easy to maintain. Here’s that (much shorter) list from the same audience:

  • Upgrading to Rails 6 was a one-line change
  • I maintained a totally unfunded Rails/Backbone app for 4+ years, nothing ever broke, just occasional dependency upgrades.
  • An employee started on an existing project for the first time and was immediately effective, thanks to multiple folders of detailed documentation. 
  • Self-healing Kubernetes clusters, discrete per environment, with safe pod rollback, deep introspection, liveness, and health check capabilities – it was simply the easiest CI/CD setup I’ve ever built, worked with, or maintained
  • We wrote a green field application with great test coverage, which allows us to make changes without fear.
  • Was able to comfortably make changes in a new app with only two weeks of onboarding; I feel comfortable making big changes because the test suite has got my back
  • Upgrading an open source project from 5.0 to 5.1 was painless, with only one failing spec that also prompted a discussion to re-think how a minor functionality should work.

Now, for those of you who read the series, it will come as no surprise that many of the examples of low maintenance load are new code bases: maintenance load takes time to accumulate, because context takes time to get lost. If that last sentence makes no sense to you: it’s OK! But if you read the series it’ll make sense.

After we do this exercise, I ask participants to think about what team practices distinguish these two lists. The participants write down those practices in their own private lists (don’t worry; you’ll see them in a second).

How do we keep maintenance load low as a system gets older? We need a set of skills that I call code stewardship, which we talk about in the workshop. I give five examples on this slide.

I ask participants to identify whether each of their practices falls…

…under one of the five examples above. Or maybe, just maybe, they have items that provide us with new examples of code stewardship! The participants at RubyConf filled out this EasyRetro with their replies. I think it’s beautiful; the images below omit some replies for “Writing discoverable code” and “Transferring context” because those columns were looooong. The other columns are represented here in their entirety.

Categorizing the team practices that separate our maintenance load horror stories from our happy stories

At this point in the workshop, we stop and read through some of the examples in the orange column to get more ideas about how code stewardship can impact maintenance loads on our team. If you’d like to learn more about code stewardship, you might like this piece on where to find it or this piece on how to evaluate it.
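If “writing discoverable code” feels abstract, here’s a minimal Ruby sketch of my own to make it concrete. It’s not from the workshop materials or the EasyRetro board, and the reservation rules in it are invented, but it shows the same check written twice: once in a way that forces the reader to reverse-engineer the intent, and once in a way that names it.

```ruby
# A hypothetical before-and-after, borrowing the restaurant-reservation
# domain from the workshop's simulation. Names and rules are invented
# for illustration only.

# Hard to discover: cryptic names and magic numbers make the reader
# reverse-engineer the intent.
def chk(t, p)
  t.hour >= 17 && t.hour < 22 && p <= 8
end

# Easier to discover: intention-revealing names keep the context in the
# code instead of in a departed developer's head.
DINNER_SERVICE_HOURS = (17...22)
MAX_WALK_IN_PARTY_SIZE = 8

# Can a walk-in party be seated during dinner service?
def seatable_for_dinner?(requested_time, party_size)
  DINNER_SERVICE_HOURS.cover?(requested_time.hour) &&
    party_size <= MAX_WALK_IN_PARTY_SIZE
end
```

Both methods answer the same question; the second one is the kind of thing I mean by code stewardship, because the context survives in the code even after its original author moves on.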

Once we’ve done that, participants do a simulation.

They pretend that they are the lead developer at a company whose product allows people to make restaurant reservations. I give them the following six tickets to prioritize.

I then give them ten minutes to read the tickets and enter their priority order into a form. They usually spend minutes 1-4 reading the tickets, minutes 4-5 entering their priorities, minutes 5-7 sitting in awkward silence, and minutes 7-10 leaning over to their neighbors. “What did you get?” “What did you think of this one?” When I hear the awkward silence turn into hushed murmurs around minute 7, I turn on the mic to explicitly acknowledge and encourage the chatter, at which point the murmurs become considerably louder. I consider murmuring to be a positive feedback signal in this activity.

Then, we go through the tickets in rough order of collective priority and discuss how to address them. These charts demonstrate how 118 RubyConf participants prioritized the six tickets:

For each ticket in turn, a few brave audience members raise their hands to share why they prioritized the ticket the way they did and how they’d like to address it. To these participants: thank you. Your contributions prompt discussion and your insights make the workshop richer. I could not do the simulation without you, and I appreciate your engagement!

I do not have a record here of those individual contributions; however, during the workshop, my lovely co-facilitator Leah Miller (featured here on the Greater Than Code podcast) made sure that each respondent had a microphone, and as a result the RubyConf recording should include all of those contributions.

Hopping onto the stage for this workshop felt nostalgic—it’s been a long time since I’ve facilitated for a live audience, and I don’t think I’ve ever facilitated a live audience this large. But I’d absolutely do it again.

If you liked this piece, you might also like:

The rest of the talks category (if you’re into seeing what else I’ve said for audiences)

The technical debt series (for digging deeper into this subject matter)

The teaching category (for behind-the-scenes info about how I design lessons. Most of that translates directly to how I design workshops!)
