Yes, you should test on production…

16 min read · Aug 11, 2023

It might seem like a clickbaity title but it’s not. Let me explain.
I’ve been lucky enough to work on small and large scale projects, some more and some less critical. This is a brain dump of my thoughts on processes, constraints, excessive red tape, consequent complexity and frustration and whether (and how) there’s a way to make things better. The TL;DR is that some (“best”) practices are contextual and understanding when to use them is ultimately what gives us the title of “engineers”.
Also the charts below are purely indicative, take them with a grain of salt.

Shipping confidence

We can define “shipping confidence” as the feeling a mentally sane developer has when they know their code is about to be deployed to production (whether it can be updated over the air or not). Like with any feeling, shipping confidence is a spectrum.

A company can be placed on such a spectrum, as shown in this example:

Shipping confidence spectrum. Companies like SpaceX are forced to test the crap out of their software, companies like GitHub allow anyone to deploy straight to prod, see what happens (sometimes employing canary deployments) and roll back if required. People like levels.io deploy to prod and use their customers to test. These are all good strategies. Also, the positioning is very approximate, don’t sue me bro™️.

If Boeing makes a mistake in their flight control software, people will most likely die.

If Tesla makes a mistake in their autopilot software, people might die.

If JP Morgan makes a mistake in their funding software they will lose $ and some people might see funky numbers in their accounts, leading them to change banks.

If Netflix makes a mistake in their streaming software millions of people’s dates and lunch breaks will be ruined and they might not renew. They would also lose credibility with their partners.

If GitHub makes a mistake it can affect thousands of businesses but they’ll likely shrug and their DevOps team will just post “GitHub is down, nothing we can do” on some Slack channel. They might offer some credits you can use within 30 days.

If Pieter Levels (👑) makes a mistake some customer might request a refund. He’ll fix it, then go back to shitposting about threadbois.

It’s clear that shipping confidence is tightly coupled with the loss of money (even the loss of life can be quantified in $, sad truth but truth nonetheless), whether direct or as a consequence of reputational damage. This is something that is normally explained to sophomore CS/CE students in some “software engineering” or “software architecture” kind of course.

The reason it’s usually taught to sophomores is that freshmen are too busy to understand what it means (they are only, rightfully, worried about making something work no matter what), while junior/senior students have likely already experienced it through an internship or a job.

Intuitively there’s also a direct relationship between shipping confidence and developer experience. And since the latter has a direct impact on throughput, correctness and, in general, team morale, it is important to understand how they behave together.

Cannon and flies

We have seen that the two main dimensions of shipping software are confidence and financial impact. Sometimes it’s hard to estimate the latter so we’ll just name it “criticality”.

Changing the name from “financial impact” to “criticality” allows us to turn the question “how much $ would we lose if this went wrong?” into a more generic, yet easier to estimate, “what’s the worst that can happen?”.

Some examples:

What’s the financial impact of a SpaceX rocket exploding?

Too hard to estimate the financial impact. How many man hours were spent on building it? What about the materials? Was anyone injured? Does it expose the company to a lawsuit?

Instead we can ask:

What’s the worst that can happen if a SpaceX rocket explodes?

Well, we can absolutely limit the criticality of such an event. Stay as far as possible from any inhabited area, make sure everything can be remotely controlled, avoid a human payload for the first N tests etc.

Obviously SpaceX already does all that but it helps to outline some of the steps because they tell us something very important: it is fundamental for SpaceX to decrease criticality as much as possible, so time spent doing so is time well spent.

But not all companies deal with this kind of criticality. In fact the four quadrants below help us determine what degree of preparation is necessary once we can answer the question “what’s the worst that can happen?”.

The ideal quadrant is obviously “Safe and fast” but not everyone is lucky enough to be able to stay there. A small project that starts there might move to “Safe but slow” (e.g. a company that scales up and can’t afford downtime) or to “Move fast and break things™️ zone” (e.g. a weekend project that gets funding and becomes a growth oriented startup).

There is absolutely no reason why any software project should stay in the bottom left quadrant. The “opportunity cost zone” should be avoided like you would with nuclear waste.

Yet so many projects end up right there. Why is that?

Premature optimization

I’ve embedded an xkcd (1691) to improve the credibility of this article

I hate the terms “junior” and “senior” so I’ll replace them with “smart” and “grug brained”, respectively.

Most smart engineers are primarily moved by one thing: wanting to prove themselves. So they dive into best practices, elegant patterns, flamboyant versioning strategies and they often ask “what’s the best way to do X?”. How do I know? Because I used to be one. And I was lucky enough to meet grug brained peers who were willing to question my choices.

Grug brained engineers start with a much simpler question “what’s the point of doing X?”. For a grug brained dev it’s truly soul crushing to work on X when the impact of X isn’t clearly defined, or at the very least estimated.

A great example of this is, lo and behold:

microservices.

Microservices are a great pattern when you have multiple teams and you want each team to have a high degree of autonomy in things like the choice of stack, deployment processes etc. As long as the edges of these services are clearly specced and documented (whether this is true is purely a matter of diligence and competence) things will be fine.

Microservices are utter crap when your team is made up of 5–10 people, everyone kinda works on everything, you’re not even sure of what you’re building yet and maybe you even rely on a monorepo because you read that Google does it so why shouldn’t you? Increased complexity and friction for no benefit whatsoever.

A smart developer might read somewhere that microservices are a great solution to complexity. What he fails to understand is that there’s no such thing as a free lunch. Employing a pattern like microservices (or whatever else) will decrease overall complexity by N% but add a fixed amount of inherent complexity due to the added constraints.
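To put rough numbers on the “no free lunch” point, here’s a minimal sketch. It assumes complexity can be squeezed into a single number (it can’t, really), and the names `baseline`, `reduction_pct` and `fixed_overhead` are made up for illustration.

```python
def net_complexity(baseline: float, reduction_pct: float, fixed_overhead: float) -> float:
    """Complexity left after adopting a pattern.

    The pattern removes reduction_pct percent of the baseline complexity
    but adds a fixed overhead, no matter how small the project is.
    """
    return baseline * (1 - reduction_pct / 100) + fixed_overhead


def pattern_pays_off(baseline: float, reduction_pct: float, fixed_overhead: float) -> bool:
    """True only when the pattern leaves us with less complexity than before."""
    return net_complexity(baseline, reduction_pct, fixed_overhead) < baseline


# Hypothetical numbers: the pattern removes 30% of complexity and costs 20 "units".
print(pattern_pays_off(baseline=100, reduction_pct=30, fixed_overhead=20))  # large system
print(pattern_pays_off(baseline=40, reduction_pct=30, fixed_overhead=20))   # small system
```

With these made-up numbers the pattern only wins for the larger system: the percentage saved has to outweigh the fixed overhead, which a small codebase rarely manages.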

This translates into the single worst thing you can do to (figuratively) kill your engineering department by a thousand (churn) cuts: poor developer experience.

Choosing the right pattern (or not choosing one at all!) can have a huge impact on complexity.

Developer experience

Developer experience is how much your (current and potential) devs want to work on a specific project. Poor developer experience is a byproduct of dissonance between perceived complexity of a project and tangible complexity of the systems and processes related to it.

Imagine taking a big, satisfying dump and then having to solve a Rubik’s cube to get some toilet paper. The perceived complexity of taking some paper to wipe your ass is (I hope we agree) low. When you’re presented with a Rubik’s cube your only thought is that you’ll never use that toilet ever again.

In the best case you end up spending minutes or hours (opportunity cost!) solving the cube and leave in frustration (but with a clean butt). In the worst case you throw away your underwear.

Now, I’m sure whoever designed the toilet had a good reason to put that Rubik’s cube there. Maybe someone kept stealing the toilet paper. Maybe it’s a test to determine whether you’re worthy of having a clean butt. Who knows, but most of all, who cares!

The only people who’ll keep enduring that kind of pain are desperate people. People who need that paycheck, people who can’t be bothered to look for something different. Unfortunately there is no easy way to distinguish people who are good and need a paycheck from people who just need a paycheck. But you sure as hell don’t want the latter on your team.

Detecting this kind of dissonance also becomes harder the larger the company grows. People will be less motivated to report, and contribute to the resolution of, these issues. Those who try will be dismissed with a vague “yeah, we’re working on it” and find ways to work around it in the hope that one day they’ll be in a position where they are taken more seriously.

A concrete example of this is adding logic to work around issues with internal infrastructure or APIs. Imagine you knew the guy who put the Rubik’s cube there and, because he’s not exactly easy to work with, instead of asking him “dude wtf is this?” you realized you can break the cube and reassemble it solved to get your toilet paper. Most people, believe it or not, will do the latter.

I’ve witnessed this negligence on multiple occasions: a team starts working on a new project that needs support from another team, they start using their service only to realize there’s undocumented behavior or nasty bugs, so they try to report it once, twice, three times and those tickets sit in “Backlog” for weeks, sometimes months.

At some point frustration takes over and the first team decides they can just implement some additional validation, retries etc. to work around all that. That’s a terrible mistake and in the long run will be the cause of cost overruns, unmet deadlines, increased churn and overall bad vibes. And nobody wants bad vibes.

Easy math: imagine this is a service used by N teams and it has a bug that only comes out under specific circumstances so that it goes unnoticed until the first production deployment is completed. Fixing this bug might take 2 man-days. Let’s say some super senior guy who makes $300k/year has to work on this, which prices the bugfix at around $2400 (considering ~250 working days in a year).

Imagine now that it takes 3 days to find such a bug, diagnose it and deploy a workaround. And these teams don’t really talk to each other (Big Corp style) so it’s not like they can write once, run anywhere™️ (nor should they, as shown below). Let’s assume the dev on each of these teams who has to write the workaround is paid $150k/year. N = 2 is enough to make fixing the bug the most rational, economically sound choice (each workaround costs about $1800, so two of them already exceed the $2400 fix).
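The easy math above, as a quick sketch. The salaries, durations and the ~250 working days come from the example; the function names are just for illustration.

```python
WORKING_DAYS_PER_YEAR = 250  # same approximation used in the text


def daily_rate(yearly_salary: float) -> float:
    """What one working day of a dev costs."""
    return yearly_salary / WORKING_DAYS_PER_YEAR


def bugfix_cost(yearly_salary: float = 300_000, days: float = 2) -> float:
    """Cost of fixing the bug once, at the source."""
    return daily_rate(yearly_salary) * days


def workaround_cost(teams: int, yearly_salary: float = 150_000, days: float = 3) -> float:
    """Cost of every affected team finding, diagnosing and working around the bug."""
    return teams * daily_rate(yearly_salary) * days


fix = bugfix_cost()
for n in (1, 2, 3):
    workarounds = workaround_cost(n)
    cheaper = "fix the bug" if fix < workarounds else "work around it"
    print(f"N={n}: fix ${fix:,.0f} vs workarounds ${workarounds:,.0f} -> {cheaper}")
```

This only counts salaries. It ignores the cost of maintaining N slightly different workarounds forever, which only makes the case for fixing the bug stronger.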

Even worse: imagine the same devs realize that they share the same pain and come together in a sort of think tank and start contributing to a single client to access this service. This way the workaround will only have to be implemented once and everyone will benefit from it, right? RIGHT?

Your job as an engineer is to do what’s best for the whole organization, not just for your team. Learning to (blamelessly) hold other teams accountable is an art, more than a science, but it’s what eventually helps the whole organization move forward.

So what can we do to avoid this dissonance in the first place? Easy: question everything and ask the right questions.

There is a fundamental difference between complaining and questioning. A complaint is an expression of frustration that carries no expectation that the source of the frustration will be resolved and no suggestions on how to do it. Questioning is an expression of doubt, accompanied usually by data (to prove the frustration is real and what impact it has) and, when possible, a suggestion on how to resolve it.

And what’s the best question we can ask when trying to determine whether the complexity of a system is justified in comparison with the perceived complexity of the task at hand?

“What’s the worst that can happen?”

And we’re back at criticality!

What’s the worst that can happen if we scp our PHP files straight onto a Linode VPS?

In a team of 1 with a product that has 10 users? Absolutely nothing. (Ironically we’ve seen this is true with thousands of customers too, at least in the B2C world)
In a team of 10 with a product that hasn’t launched yet? A “wtf” from a colleague on Slack.
In a team of 100 with a product that has 1M customers? Absolute madness.

What’s the worst that can happen if we go down for 1 hour?
(Forget that BS scene from The Social Network please)

For a product that turns your selfies into cat photos? Nothing. Maybe a refund.
For a product that handles automation for an HR department of a 150 person company (à la Workday)? A few complaints and a post mortem.
For a beta product that handles customer funds but serves like 20 transactions a day? Some frustration (assuming no loss of funds).

What’s the worst that can happen if there’s data loss?

Can your data be rebuilt from other data sources over a brief enough period of time (e.g. consolidated accounting from multiple cost centers in a day and the report deadline is a week away)? Nothing.
Is the data from your customers but ephemeral (think WeTransfer)? Maybe a few bug reports, refund requests.
Is your data customers’ emails and you can’t recover a backup with an acceptable delta (hours, not days)? You’re screwed. (Video from DHH (👑) that inspired this example)

You get the point.

Disclaimer: it’s one thing to go down 1 hour every 3 months, on average. A completely different thing to go down 1 hour every couple of days. Maybe I’ll add time/SLAs as a 3rd dimension to the confidence/criticality chart.

“Guards!”

Once you’ve assessed what the worst case is the follow up should be “is it worth spending time adding guards to avoid it?”.

In most cases the answer is a resounding “no”.

“Is it worth hardening my code to make sure there is never an uncaught exception, while processing a selfie that turns into a cat, to keep my logs clean?” — No.
“Is it worth adding a catch clause for a specific kind of error that I know how to recover from, to prevent a known issue from blocking a series of sequential steps?” — Hell yeah!
“Is it worth adding retry logic for a function that is being executed every few minutes and that can fail up to 10% of the total times it’s called with no tangible issues?” — Nope. But you should monitor that % and act accordingly.
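A rough sketch of the last two answers, assuming a hypothetical `KnownRecoverableError` and a made-up `FailureRateMonitor` (neither comes from a real library): catch only the error you know how to recover from so the remaining steps still run, and track the failure rate instead of adding retries.

```python
class KnownRecoverableError(Exception):
    """Hypothetical error we know how to recover from."""


def run_steps(steps, fallback=None):
    """Run steps in order. A known, recoverable failure yields `fallback`
    instead of blocking the remaining steps; anything else propagates."""
    results = []
    for step in steps:
        try:
            results.append(step())
        except KnownRecoverableError:
            results.append(fallback)
    return results


class FailureRateMonitor:
    """Counts calls and failures so we can act when the rate drifts too high."""

    def __init__(self, threshold: float = 0.10):
        self.threshold = threshold
        self.calls = 0
        self.failures = 0

    def record(self, succeeded: bool) -> None:
        self.calls += 1
        if not succeeded:
            self.failures += 1

    @property
    def rate(self) -> float:
        return self.failures / self.calls if self.calls else 0.0

    def needs_attention(self) -> bool:
        # Only worth a look once failures exceed the tolerated share of calls.
        return self.rate > self.threshold
```

Note there is no bare `except` here: unknown errors are left to blow up, which is exactly the “not worth hardening” case from the first answer.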

If the answer is yes and the solution is semi-trivial then go ahead!

If the answer is maybe but you’re unsure about what it would entail you unfortunately need to estimate the opportunity cost of not avoiding it. If it exceeds the estimated cost for implementing the solution (by a fair margin) then green light. Otherwise, why bother?

Keep in mind that part of the cost comes from something that modern development tools prioritize a lot, and rightfully so…

Feedback loop

If your guard increases the feedback loop’s completion time, that’s something you won’t get rid of over time, quite the opposite! These changes tend to quickly compound and getting rid of them in the future becomes exponentially harder the longer they are there. (Ok maybe not “exponentially” but you get what I mean)

If there are 5 devs on your team and they each complete, say, 100 feedback loops (e.g. committing to a remote environment) per day, an increase of 10 seconds per loop leads to an increase of almost 1.5 hours in total time spent every day. At $100/hour (which is fair for experienced devs) that puts our increase in costs at about $3000/month. (1.5h/day * $100 * 20 days)
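The same back-of-the-envelope estimate as a small sketch, using the numbers above as defaults (the function and parameter names are mine). Without rounding up to 1.5 hours the result is a bit lower, roughly $2,800/month, which is the same ballpark.

```python
def monthly_delay_cost(
    devs: int = 5,
    loops_per_dev_per_day: int = 100,
    delay_seconds: float = 10,
    hourly_rate: float = 100,
    working_days_per_month: int = 20,
) -> float:
    """Monthly cost of adding `delay_seconds` to every feedback loop."""
    hours_lost_per_day = devs * loops_per_dev_per_day * delay_seconds / 3600
    return hours_lost_per_day * hourly_rate * working_days_per_month


print(f"${monthly_delay_cost():,.0f} per month")
# The cost scales linearly: a 30 second delay triples it.
print(f"${monthly_delay_cost(delay_seconds=30):,.0f} per month")
```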

What can possibly cause a 10 second delay on each feedback loop, you ask? Easy: a pre commit hook that runs some integration tests. Or misconfigured caching in the CI pipeline that forces Docker to pull all layers every single time.

And that’s not considering the increase in frustration (and related decrease in shipping confidence) for the team. It might be tiny, but it’s there.

At that point, estimating how effective these changes are in relation to the consequences they guard against is not just advisable: it would be outright wrong not to do it.

It’s the little things that, put together, make our life miserable.
(Read it somewhere a long time ago, can’t find the source but it’s so true)

Let’s take an example: a service’s SLA forces the company to pay out about $1000/hour for downtime.

An obvious guard would be to have a dev environment that mimics production as much as possible (this isn’t always possible of course). Thankfully there’s a ton of leverage here: the reproducible envs can be automated, together with some sanitization (e.g. you don’t want to have real email addresses in your dev database duh) and tests.
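As an illustration of the sanitization step, here’s a minimal sketch that swaps real email addresses for stable fake ones. It assumes rows are plain dicts with an `email` key; the function names and the `user_<hash>` format are made up. The `.invalid` top-level domain is reserved, so nothing sent to these addresses can ever reach a real inbox.

```python
import hashlib


def sanitize_email(email: str) -> str:
    """Replace a real address with a stable, non-deliverable placeholder.

    The same input always maps to the same output, so uniqueness
    constraints and joins on the email column keep working in dev.
    """
    normalized = email.strip().lower().encode("utf-8")
    digest = hashlib.sha256(normalized).hexdigest()[:12]
    return f"user_{digest}@example.invalid"


def sanitize_rows(rows: list[dict]) -> list[dict]:
    """Return copies of the rows with the email column sanitized."""
    return [{**row, "email": sanitize_email(row["email"])} for row in rows]


prod_rows = [{"id": 1, "email": "Jane.Doe@corp.com"}, {"id": 2, "email": "bob@corp.com"}]
for row in sanitize_rows(prod_rows):
    print(row)
```

A plain hash of a guessable address can be brute-forced, so for anything sensitive a secret salt or a random mapping would be the safer choice.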

From my experience this is pretty straightforward up to a few hundred thousand users and a DB under a terabyte, assuming reliance on 3rd parties (e.g. payments) is minimal (or mockable).

It becomes pretty much impossible at larger scale. One might be tempted to think “well, if I can’t rely on this when I have 1M users why would I do it at 100k?” and that’s equivalent to asking “well, if I can’t rely on both my legs when I grow so old I can’t use them, why would I do it now?”.

Someone who is overly cautious might think of hiring a QA engineer to run manual tests on some staging environment before shipping. The question becomes: how much are you willing to pay for them and what’s their effectiveness rate?

Assuming that now you go down 1 hour a month and you’d have to pay them $5000/month, the additional resource is clearly uneconomical. Things change if the downtime is closer to 3 hours/month and you can outsource the role to someone who’s happy with a $2000/month salary. As long as their effectiveness is ~70% (i.e. they catch 70% of bugs that would lead to a 1h downtime) your monthly savings are $3000 * 70% - $2000 = $100.
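The QA trade-off boils down to one line of arithmetic. Here’s a sketch with the numbers from the example; the function name is mine, and the 100% effectiveness in the first scenario is an assumption meant to show the best possible case.

```python
def qa_monthly_savings(
    downtime_hours: float,
    sla_penalty_per_hour: float,
    effectiveness: float,
    qa_monthly_salary: float,
) -> float:
    """Penalties avoided thanks to QA minus what QA costs. Negative means a loss."""
    avoided_penalties = downtime_hours * sla_penalty_per_hour * effectiveness
    return avoided_penalties - qa_monthly_salary


# 1h/month of downtime, $5000/month QA: a loss even if QA catches everything.
print(round(qa_monthly_savings(1, 1000, 1.0, 5000)))
# 3h/month of downtime, $2000/month QA at 70% effectiveness: barely positive.
print(round(qa_monthly_savings(3, 1000, 0.7, 2000)))
```

At $2000/month the role breaks even at roughly 67% effectiveness, so the ~70% in the example leaves very little margin.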

Now an overzealous manager might come along and say: “last time we went down the CTO got so angry he almost forced me to switch to a light theme! We cannot go down, no matter what! From now on every deployment will require 2 approvals from selected people”.

Now each dev who wants to add a feature or fix a bug has to:

Clone the sanitized prod environment. (easy, automated, minimal friction)
Get the green light of QA on staging. (takes some time but might be worth it)
Require approval from 2 people, who might not even be familiar with his changes, who might ask him to “jump on a call” to explain what exactly is being deployed and who might eventually tell him they don’t have time to review the changes properly, so better wait until tomorrow. (brutal, brings down morale, dev wonders why 2 people who don’t even know what he’s working on have to review his changes when he’s been writing 80% of the code anyway)

Let’s do some math: let’s say this change is a blocker for such dev. He can’t (or doesn’t want to) work on something else because this is way too important, or because the context switch would be too tough or because there isn’t really anything else worth working on (surprisingly common).

Let’s also assume that it takes 30 mins for these 2 people to review the changes together with the dev. If the changes get reviewed right away, and we pay these people $100/hour, we are wasting $150. That’s 15% of our hourly SLA loss. Seven deployments and we’re back to square one.

In the worst case: time to review (blocking) + requested changes + rollback due to an unexpected issue can exceed, say, 8 hours. That’s $100 * 8 (dev) + $150 * 2 (reviews) = $1100. This policy has just become uneconomical with a single deployment.
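The approval math as a sketch, using the figures above as defaults (function names are made up). The review involves 3 people for half an hour: the 2 approvers plus the dev.

```python
import math

HOURLY_RATE = 100  # $/hour, same rate used in the text


def review_cost(people: int = 3, hours: float = 0.5, rate: float = HOURLY_RATE) -> float:
    """One review session: 2 approvers plus the dev, 30 minutes each."""
    return people * hours * rate


def deployments_to_offset_sla(sla_penalty_per_hour: float = 1000) -> int:
    """How many reviewed deployments cost as much as one hour of downtime."""
    return math.ceil(sla_penalty_per_hour / review_cost())


def worst_case_cost(blocked_dev_hours: float = 8, reviews: int = 2) -> float:
    """Dev blocked for a working day plus two rounds of review."""
    return blocked_dev_hours * HOURLY_RATE + reviews * review_cost()


print(review_cost())                # cost of a single review
print(deployments_to_offset_sla())  # deployments until we're back to square one
print(worst_case_cost())            # a single bad deployment
```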

“What if removing this approval layer leads to downtime increasing?”

Ask yourself a question: do you have any reason to think that your engineers will not do a good job? If the answer is yes: why are they still there? If the answer is no: let them do their damn job. If things go south it’s most likely due to someone being undertrained (easily fixable) or incompetent (get rid of them). Making life harder for everyone due to the shortcomings of one person or team is a non refundable one way ticket to high churn.

Testing on production

Finally the part you were waiting for.

Testing on production is obviously not the right thing to do if your project sits in the top left quadrant of the criticality/confidence chart. But it can be an effective strategy in the top or bottom right one. It’s laughable that some teams, no matter how small or what stage their project is at, will never even consider something like that, out of pure irrationality.

That’s why determining which quadrant your project sits in and acting accordingly can make or break your engineering team. Processes, practices, patterns and workarounds weren’t born because they are the right thing to do. They were born to address specific concerns in specific teams under specific circumstances.

Employing them in a thoughtful manner is the engineering part of, well, software engineering. The absolute worst approach a team can have to “best” practices, patterns and processes is to employ them just because someone wants to or because a team fucked up once and some preventative policy became standard across the whole company.

Developer experience is as important as, if not more important than, customer experience. Because if customers aren’t happy, they won’t buy your product.

If developers aren’t happy, that product may never get built.

Notes

Some readers have been kind enough to make some really good points, which I’ll collect below:

Criticality is multidimensional so we should be careful about not oversimplifying for a whole product or organization
(NewEntryHN, source)
“Everybody has a testing environment. Some people are lucky enough to have a totally separate environment to run production in”
(trollied, source) — (not the original author of the quote AFAIK)
Other executives/middle managers might be extremely scared of something happening, while the core shareholders (often times the CEO) don’t really care. Aligning priorities across teams is fundamental.
(Scubabear68, source)