On failing safely

11/28/2022

Seemingly every engineering team must eventually have the conversation around what to test and how to test it. For every team I've managed, the framework has been this:

Are there parts of the app that cannot be allowed to fail? Enumerate them.
For bugs we have that are not about this part of the app, what's typically been the root cause?

All applications have a definable critical user path, which can benefit from various types of automated testing or unit testing. We're left with a wide swath of lines of code that have nothing to do with this path and then we are tasked with determining what should be done for all those features waiting to break.

It's important to be honest about the root cause of bugs. It's extremely easy to slip into thinking that writing automated tests for everything must be good because it's better than having no automated tests. I fundamentally disagree with this approach. Often the reason behind a bug is a change to something reused in a place you didn't expect, a misunderstanding of requirements, or non-exhaustive testing in the initial implementation. Every one of these has an obvious solution that doesn't involve writing code around it.

A unit test will assert various things we can think of to assert. We can merely verify that the things we can imagine will break will not have broken. Deep down we know this, but again "better than nothing" brain will strike and convince us to exert effort here.

Imagine you and your team are climbing a mountain. "Better than nothing" brain convinces you that you should climb up tethered together, because if one of you slips the rest of the team can catch you. Did you think deeply about the fact that now, if you fail to catch the slip, you risk bringing down the entire team into your fall? Sometimes optimizing for one failure increases the chances of another, more serious kind.

Adding more code to feel safe from bugs feels similar to me. It has not yet been demonstrated that unit testing reduces bugs enough to justify the effort that goes into writing the tests. We should measure what we aim to improve, and be clear about whether the payoff has exceeded the cost, including the opportunity cost.

Often after a bug makes it into even the dev environment, we are ready to course correct. Yet the dev environment is exactly the place to catch bugs before production. It's the rarer bug that actually makes it to prod, and even when we acknowledge it is rare we turn to testing guardrails as a way to correct it – all the while acknowledging these won't catch everything either. At what cost?

The solution is to rid ourselves of the shackles of preventing all mistakes; it's not going to happen. Instead we should aim to be antifragile – we want to be resilient to failure. That means that instead of spending energy protecting ourselves from a failure scenario, we ensure we can bounce back quickly. In feature development, this is about ensuring rollouts are safe and rollbacks are easy.
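
To make "safe rollouts, easy rollbacks" concrete, here is a minimal sketch of one common way to get there: a percentage-based feature flag with a kill switch. Everything in it is hypothetical and for illustration only (the FeatureFlag class, the "new-dashboard" flag name, the in-memory storage); a real team would more likely lean on a config store or a flag service.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FeatureFlag:
    """Hypothetical in-memory flag; real systems usually load this from config or a flag service."""
    name: str
    rollout_percent: int = 0  # 0 = off for everyone, 100 = fully rolled out
    killed: bool = False      # rollback switch: forces the old path without a deploy

    def is_enabled(self, user_id: str) -> bool:
        if self.killed:
            return False
        # Hash flag name + user id so each user lands in a stable bucket from 0 to 99.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent


def render_dashboard(user_id: str, flag: FeatureFlag) -> str:
    """Route a request to the new or the old code path based on the flag."""
    if flag.is_enabled(user_id):
        return "new dashboard"
    return "old dashboard"


# Safe rollout: expose the new path to roughly 10% of users first.
flag = FeatureFlag(name="new-dashboard", rollout_percent=10)
users = [f"user-{i}" for i in range(1000)]
exposed = sum(flag.is_enabled(u) for u in users)
print(f"{exposed} of {len(users)} users see the new path")

# Easy rollback: one state change sends everyone back to the old path.
flag.killed = True
print(render_dashboard("user-1", flag))
```

The point of the sketch is that both the rollout and the rollback are a change of state, not a change of code. If the new path misbehaves, only a slice of users ever saw it, and recovering is a matter of flipping the switch.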

This is true for businesses that don't suffer from competition being a click away and especially true for startups in general. There's little risk from bugs that are transient. There are edges to this of course; big launches should aim to provide clean experiences within their main user flows. But in general, it's a much bigger risk to move too slowly than to move too quickly. If I were betting on two teams, one doing TDD and one antifragile team with effective safety nets, I would bet my net worth on the second team to win a new market.

Follow for more

My writing is also posted on Medium, where you can follow to receive notifications for new posts, comment, and more: https://michael-flores.medium.com/
