After the bug, the RCA

Wednesday, July 29, 2020 by Louie Bacaj

Bugs, software defects, happen, and when they do they can destroy trust with your business partners and break promises to your customers.

I won't sugarcoat it the way most other articles on the topic seem to, with a shrug of "bugs happen, oh well." Bugs can be destructive enough to put entire companies out of business or even put lives at risk. A laissez-faire attitude toward bugs is never a good one, regardless of what we are building. This is exactly why so many great engineers and leaders in our industry have spent so much time thinking about how to reduce the number of defects in the software engineering process while balancing that against delivery.

I want to discuss how we can build a better engineering culture that produces far fewer bugs, but before I do that, a story of a bug I inadvertently introduced.

The bug, a story

In 2015, while I was working at Jet.com, an eCommerce startup hoping to sell everything online and compete with Amazon, we introduced a series of features to help improve conversion on our site. One of the most promising was promo codes, which, of course, exist in a lot of online stores: you enter a promo code and get a few dollars off.

That in itself was not revolutionary or innovative, but the business wanted to make the promos much more sophisticated. They wanted to use promos to motivate customer behavior, in effect getting customers to buy repeatedly. That meant introducing duration promos, with duration based on time or on the number of orders, among a slew of other features.

The complexity skyrocketed. Compared to the simplicity of the MVP, the number of edge cases to test went through the roof, and while everything had great test coverage, what happened on one particular Thursday afternoon shocked me.

On that particular Thursday, around 4 pm, I was alerted to a significant spike in sales: by an automated alert, by a live dashboard the whole company had access to, and by the fraud team that was checking live orders. They mentioned that sales had been skyrocketing in the last few minutes, and it was all due to one special promotion we were running. That promotion allowed customers to get 20% off orders of 50 dollars or more. It was meant to be capped at a maximum of 3 orders in a month, but after digging into the databases, to my shock and horror, I saw people who had placed 20 or more orders; the promotion was not capped in any way.

It took an hour from start to finish to push a fix, and within that hour of the feature being live, over $150,000 worth of general merchandise was sold that was attributed to this bug and to the abuse of the promotion.

A quick Google search showed that the promotion was on Reddit and other online forums, where people were teaching each other how to exploit the bug to get 20% off an unlimited number of times. This was a valuable lesson for me in the power of the internet: it will punish you, it will do it fast, and it is not forgiving.

While we caught this quickly and could have canceled the abusers' orders under the terms of the promotion, the business decided to let the orders go through and not punish our customers for our bug. However, if we hadn't had useful alerts and a fraud team that was always investigating, this could have gone on for a long time and done far more damage to our company.

How do we build trust again?

But what can we do? As you can see, bugs do happen; even engineers far better than you and me can't produce defect-free code, even after adding every kind of test we can around the software. The truth is we are building software on top of other software, a shaky foundation, which of course we have to do because, again, we've got to balance rigor with delivery. We can never guarantee that the software stack beneath our application has been built with the rigor we hope it has; in life, there are no guarantees.

A strong engineering culture has processes and rigor in place that produce software with far fewer defects; again, it will never be zero defects. Taking these things deadly seriously is an incredible step forward. I say deadly seriously because they could kill your company or your customer.

I have written before about processes such as Design Reviews and Production Readiness Reviews that take place before systems go into production, and I have written other articles about building a strong engineering culture, but this one is about what to do after the bug happens.

The best thing to do is to have a blameless engineering culture focused on the incident and the bug rather than on blaming individuals for their mistakes. It is never, ever productive to blame people unless the act was intentional, and even in that case, it is best to fire the individual and move on rather than blame.

What is incredibly productive is writing up a very detailed Root Cause Analysis document, also known as an RCA, and sharing it with everyone; basically, everyone who will read it. An RCA will give people the confidence that although bugs happen, you have an excellent process for making sure the same bugs don't happen again.

How to write a good RCA

When writing a good RCA, we need to keep in mind the top reasons why we are doing it:

To rebuild confidence with our business partners and customers.
To create rigor in our engineering teams and produce fewer bugs.
To create a blameless culture that is focused on better process.
To help everyone understand what went wrong; reading RCAs is a great source of knowledge for more junior folks.
To make sure that particular bug, and ideally that whole class of bugs, never happens again.

Below are the minimum sections each RCA should have:

Background or context

A good RCA should include some background, the context. While it may seem redundant to summarize system interactions, we have to assume some stakeholders reading the RCA will have no knowledge of those interactions but will still be deeply interested in understanding what went wrong.

Event Description

After the reader has the context of the systems, they can begin to understand how the bug happened. In this section, we state just the facts as they happened: what the bug was and how it occurred. We never blame people; we keep it to the facts.

It is also in this section that we can include the impact of the bug: the monetary impact to the company and the impact to our customers. It is possible to break this part out into its own section; I have seen that done, and it works as well. As part of this impact, one could include a severity level if that is part of the team's process.

Chronology of the Events

It is incredibly helpful to have a subsection of the event description, or perhaps a whole separate section, detailing the chronology of the events: the timeline. This will include each event that took place, by the minute: what exactly took place as the bug was happening and what the teams did to remedy the situation. This is important because it can be used to improve your process later, and you want to capture all of this while it's fresh and preserve it.

Investigative Teams and Methods

It is important to list the teams that found the bug, not to blame anyone, but simply to capture who was involved. The methods used to investigate the bug are also helpful; even if this is somewhat redundant with your event description, it is good to state it factually. Keep in mind that an RCA is a factual, blameless document that gives people confidence that you know how to deal with bugs when they happen.

The Root Cause

This is a short, concise description of the root cause: what caused the bug. In my view, this needs to be short and to the point because you have the other supporting factual information. Again, no blame, just exactly what caused the issue. Even if a particular person pushed a button to break the system, do not list the person; list the feature that allows that sort of bug to happen.

This is perhaps the most important part of the document, but it is much stronger when supported by the rest of the information in the RCA. A short, concise description shows that you and your team truly understand what went wrong.

A quick technique for getting to the root cause is the 5 Whys.

Corrective Action

The corrective action can be one thing or multiple things, and ideally it addresses the root cause in full. Fixing this particular bug should be a given, but the corrective action should also try to solve the whole class of these types of bugs if it can.

There may be steps the team took that day to fix the issue temporarily, and steps that have to be taken long term to address it permanently.

The corrective action can be a simple bug fix tracked via Jira or whatever system the team uses for tickets. It could be a process the team puts in place, such as requiring approvals before someone pushes a button. My favorite kind of corrective action usually takes the form of new tools that prevent the whole class of this type of edge case from occurring again.
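
To illustrate the tool-style corrective action, here is a minimal Python sketch built on the promo story from earlier. It is not Jet.com's code. The names (`PromoRule`, `RedemptionLedger`, `apply_promo`), the promo code, and the in-memory storage are all assumptions made for illustration. The idea is that every promotion passes through one guard that enforces its cap, so an uncapped promotion is ruled out in one place rather than promotion by promotion.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PromoRule:
    """Hypothetical promo definition; field names are illustrative."""
    code: str
    percent_off: Decimal       # e.g. Decimal("20") for 20% off
    min_order_total: Decimal   # e.g. Decimal("50")
    max_uses_per_month: int    # e.g. 3


class RedemptionLedger:
    """In-memory stand-in for wherever redemptions would really be stored."""

    def __init__(self) -> None:
        self._uses = defaultdict(int)

    def uses(self, customer_id: str, code: str, month: str) -> int:
        return self._uses[(customer_id, code, month)]

    def record(self, customer_id: str, code: str, month: str) -> None:
        self._uses[(customer_id, code, month)] += 1


def apply_promo(rule: PromoRule, ledger: RedemptionLedger,
                customer_id: str, order_total: Decimal, month: str) -> Decimal:
    """Return the discount for this order, or zero if the promo does not apply."""
    if order_total < rule.min_order_total:
        return Decimal("0")
    # The cap is checked here, for every promo, before any discount is granted.
    if ledger.uses(customer_id, rule.code, month) >= rule.max_uses_per_month:
        return Decimal("0")
    ledger.record(customer_id, rule.code, month)
    discount = order_total * rule.percent_off / Decimal("100")
    return discount.quantize(Decimal("0.01"))


# Usage with made-up values: five 60-dollar orders in the same month.
rule = PromoRule("SAVE20", Decimal("20"), Decimal("50"), 3)
ledger = RedemptionLedger()
discounts = [
    apply_promo(rule, ledger, "customer-1", Decimal("60"), "2015-01")
    for _ in range(5)
]
# Only the first three orders receive a discount; the last two get zero.
```

A real system would also need to handle concurrent orders and persistence, which this sketch deliberately leaves out.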

Reference Material

Finally, in this section, we include any helpful data, logs, screenshots, and other information that may bolster the confidence that stakeholders have in our ability to deal with bugs.
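
The sections above can be captured as a small template so that every RCA starts from the same skeleton. The following Python sketch is only one possible shape. The class and field names (`RootCauseAnalysis`, `TimelineEntry`, `missing_sections`) are assumptions for illustration, not a standard.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime


@dataclass
class TimelineEntry:
    at: datetime
    what_happened: str  # facts only, no names and no blame


@dataclass
class RootCauseAnalysis:
    """One field per minimum section described in this article."""
    background: str = ""
    event_description: str = ""  # may include impact and severity
    chronology: list[TimelineEntry] = field(default_factory=list)
    investigative_teams_and_methods: str = ""
    root_cause: str = ""  # short and concise
    corrective_actions: list[str] = field(default_factory=list)
    reference_material: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of the sections that are still empty, in document order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def sorted_chronology(self) -> list[TimelineEntry]:
        """Timeline entries ordered by time, regardless of insertion order."""
        return sorted(self.chronology, key=lambda entry: entry.at)


# Usage: start a draft and see which sections still need to be written.
draft = RootCauseAnalysis(background="How the affected systems interact.")
todo = draft.missing_sections()
```

A check like `missing_sections` could run before an RCA is shared, so that no document goes out without a root cause or corrective action.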