Schrödinger's DR Site: Testing Would Kill It


I design cloud systems that survive failure. Focused on resilience, real-world recoverability, and the hidden cost of technical debt — turning disaster recovery into a competitive advantage, not a checkbox.

You have a disaster recovery site. It cost a lot to build. It runs 24/7 (at least the critical parts, data backups). Your audit reports say: DR: Compliant. But you’ve never tested it. Because testing would break it.

The dream that wakes you up at 4:40

CFO: So our DR site works, right?

You: Absolutely. It’s ready.

CFO: When did we last test it?

You: Well… we can’t really test it.

CFO: Why not?

You: Because testing would break it. Then we wouldn’t have DR.

CFO: …

You: …

Why testing breaks DR

Let’s say you have this setup:

Production SQL → (log shipping, continuous) → DR Site SQL (warm standby)

Every transaction in production gets shipped to DR. Minutes of lag. Beautiful.

Now you want to test DR:

  1. You activate the DR site

  2. DR SQL becomes primary

  3. You run tests: Great! It works!

  4. You shut down the test

  5. Now what?

You’ve got three options, and they all suck.

Keep DR as the new primary. Yay, you just did an unplanned failover. Hope your users enjoy the latency. Production is now your DR site. That’s not what anyone meant by “test”.

Fail back to production. But DR was active during the test, so the two databases have diverged and log shipping can’t just pick up where it left off. Which database has the correct state? Data written during the test: lost or duplicated? Nobody knows until someone checks. Manually.

Restore from backup. Rebuild DR to pre-test state. Re-establish log shipping. All transactions during the test need to resync. If you have a 2TB database: see you in two weeks.

Before you test: Map your dependencies

Before you even think about testing, you need to know what you have. Not what’s in the diagram from 2019, what’s actually running.

Start with infrastructure: what depends on what? Your app needs SQL. SQL needs Key Vault for secrets. Key Vault needs networking. Networking needs… you get it. If you don’t know this chain, your DR test will be chaos.

Then services: your API talks to SQL, Key Vault, and that external payment provider nobody documented. Your frontend needs the API, CDN, and three storage accounts. Your background jobs need queues and an email service that might or might not still exist.

Finally: what order do you start things? Networking first. Then storage. Then secrets. Then compute. Then load balancers. Then DNS. Get this wrong and you’ll spend four hours debugging why the app can’t connect to a database that isn’t up yet.

If you don’t know this, your first DR test will be chaos.
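A startup order like this can be derived from the dependency map instead of remembered. The sketch below is one way to do it in Python with the standard library's `graphlib`. The component names and the edges between them are assumptions for illustration only; the real map has to come from what is actually running.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the components in its set.
# Names and edges are illustrative; replace them with your real inventory.
DEPENDENCIES = {
    "networking": set(),
    "storage": {"networking"},
    "key_vault": {"networking"},
    "sql": {"storage", "key_vault"},
    "api": {"sql", "key_vault"},
    "background_jobs": {"sql", "storage"},
    "load_balancer": {"api"},
    "dns": {"load_balancer"},
}


def startup_order(dependencies):
    """Return components in an order where every dependency comes first."""
    # static_order() raises graphlib.CycleError if the map contains a cycle.
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, component in enumerate(startup_order(DEPENDENCIES), start=1):
        print(f"{step}. {component}")
```

A cycle in the map (A needs B, B needs A) raises an error here, which is a cheaper place to discover it than four hours into a failover.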

How to actually test without killing your DR

So how do you test without breaking everything?

The cleanest option: don’t test your real DR at all. Build a throwaway environment. Export your infrastructure as code, spin it up in a test subscription, run your tests, burn it down. Cost: maybe €50–100 per hour and a couple of days of work. The hardest part isn’t technical, it’s convincing finance that you need to run the test. Good luck with the leprechauns.

If you can’t do that, at least verify your DR is alive. Monthly checks: Is log shipping working? Is the lag under 15 minutes? Do all the resources actually exist? Do configs match production? This won’t tell you if services actually start, but it tells you the foundation isn’t rotten.

It’s a health check. Better than nothing.
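The monthly checks can be written down as code so they produce the same answer no matter who runs them. This is a minimal sketch, assuming you already have some way to collect the facts (last restored log time, resource lists, config values); the `DrSnapshot` type, its field names, and `check_dr` are hypothetical, and only the 15-minute threshold comes from the checklist above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_LAG = timedelta(minutes=15)  # threshold from the monthly checklist


@dataclass
class DrSnapshot:
    """Facts gathered about the DR site. Collecting them is out of scope here."""
    last_log_restored_at: datetime  # when DR last applied a shipped log
    expected_resources: set[str]    # what should exist (hypothetical names)
    actual_resources: set[str]      # what actually exists
    prod_config: dict[str, str]
    dr_config: dict[str, str]


def check_dr(snapshot, now):
    """Return a list of problems; an empty list means the foundation looks OK."""
    problems = []
    lag = now - snapshot.last_log_restored_at
    if lag > MAX_LAG:
        problems.append(f"log shipping lag is {lag}, limit is {MAX_LAG}")
    missing = snapshot.expected_resources - snapshot.actual_resources
    if missing:
        problems.append(f"missing resources: {sorted(missing)}")
    all_keys = snapshot.prod_config.keys() | snapshot.dr_config.keys()
    drift = sorted(
        key for key in all_keys
        if snapshot.prod_config.get(key) != snapshot.dr_config.get(key)
    )
    if drift:
        problems.append(f"config differs from production: {drift}")
    return problems


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    snapshot = DrSnapshot(
        last_log_restored_at=now - timedelta(minutes=20),
        expected_resources={"sql", "vnet", "key_vault"},
        actual_resources={"sql", "vnet"},
        prod_config={"sql_tier": "S3"},
        dr_config={"sql_tier": "S1"},
    )
    for problem in check_dr(snapshot, now):
        print("FAIL:", problem)
```

Like the manual version, this only verifies the foundation. It says nothing about whether services would actually start.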

And if full testing feels too risky, test in pieces.

Week one: networking. Can DR reach what it needs?

Week two: database. Activate in isolation, verify data, roll back.

Week three: application. Deploy to test subscription, point at test database.

Week four: put it all together.

You build confidence gradually. You find problems in small doses. You don’t bet everything on one big test that might ruin your month.

RTO math

Your DR plan says: RTO 2 hours.

Reality check:

  • Disaster strikes: T+0

  • Alert triggers: T+15 min

  • Assess situation: T+45 min (is this real? should we fail over?)

  • Decision to activate DR: T+1–3 hours

  • Start DR activation: T+1 hour

  • Update DNS: 15 min

  • Start services: 30 min

  • Discover config issue: 1 hour

  • Fix config issue: 2 hours

  • Discover dependency issue: 1 hour

  • Fix dependency: 3 hours

  • Services operational: T+8 hours 45 min

Your 2-hour RTO is actually closer to 9 hours, or nearly 11 if the failover decision takes the full three hours. And that’s only if everything else is roughly the way it should be in your code.
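The arithmetic behind that timeline is easy to check. The sketch below adds up the step durations from the list above, once with activation starting at T+1 hour and once with the decision taking the full three hours. The function names are just for illustration; the durations are the ones from the timeline.

```python
from datetime import timedelta

# Activation start and step durations, taken from the timeline above.
ACTIVATION_START = timedelta(hours=1)
STEPS = [
    ("Update DNS", timedelta(minutes=15)),
    ("Start services", timedelta(minutes=30)),
    ("Discover config issue", timedelta(hours=1)),
    ("Fix config issue", timedelta(hours=2)),
    ("Discover dependency issue", timedelta(hours=1)),
    ("Fix dependency", timedelta(hours=3)),
]


def timeline(start, steps):
    """Yield (step name, time since the disaster when that step finishes)."""
    elapsed = start
    for name, duration in steps:
        elapsed += duration
        yield name, elapsed


def actual_rto(start, steps):
    """Time from disaster to services operational."""
    return start + sum((duration for _, duration in steps), timedelta())


if __name__ == "__main__":
    for name, elapsed in timeline(ACTIVATION_START, STEPS):
        print(f"T+{elapsed}  {name}")
    print("Best case: ", actual_rto(ACTIVATION_START, STEPS))    # 8:45:00
    print("Slow call: ", actual_rto(timedelta(hours=3), STEPS))  # 10:45:00
```

Swap in the durations you measure during a real test and the same few lines give you a number you can defend.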

What you should actually do

Map your dependencies. Build a test environment separate from DR. Document everything. Test. And measure how long it actually takes: that’s your real RTO, not what the PowerPoint says.

P.S. If you’re reading this and thinking “we should really test our DR site”… but then deciding “maybe next quarter”…

That’s exactly why it’s called Schrödinger’s DR site.

It both works and doesn’t work. Until you test it. And testing might kill it.

So you don’t test.

And the cat stays in the box.

Until a real disaster opens it.

BCDR

Part 2 of 3

In this series, I will break down key principles of Business Continuity and Disaster Recovery, covering strategies, tools, and real-world examples to help organizations stay resilient.
