Schrödinger's DR Site: Testing Would Kill It

I design cloud systems that survive failure. Focused on resilience, real-world recoverability, and the hidden cost of technical debt — turning disaster recovery into a competitive advantage, not a checkbox.
You have a disaster recovery site. It cost a lot to build. It runs 24/7 (at least the critical parts and the data backups). Your audit reports say "DR: Compliant". But you’ve never tested it. Because testing would break it.
The dream that wakes you up at 4:40
CFO: So our DR site works, right?
You: Absolutely. It’s ready.
CFO: When did we last test it?
You: Well… we can’t really test it.
CFO: Why not?
You: Because testing would break it. Then we wouldn’t have DR.
CFO: …
You: …
Why testing breaks DR
Let’s say you have this setup:
Production SQL → (log shipping, continuous) → DR Site SQL (warm standby)
Every transaction in production gets shipped to DR. Minutes of lag. Beautiful.
Now you want to test DR:
You activate the DR site
DR SQL becomes primary
You run tests: Great! It works!
You shut down the test
Now what?
You’ve got three options, and they all suck.
Keep DR as the new primary. Yay, you just did an unplanned failover. Hope your users enjoy the latency. Production is now your DR site. That’s not what anyone meant by "test".
Fail back to production. But DR was active during the test, so it has diverged from production and log shipping can’t simply pick up where it left off. Which database has the correct state? Was the data written during the test lost or duplicated? Nobody knows until someone checks. Manually.
Restore from backup. Rebuild DR to pre-test state. Re-establish log shipping. All transactions during the test need to resync. If you have a 2TB database: see you in two weeks.
Before you test: Map your dependencies
Before you even think about testing, you need to know what you have. Not what’s in the diagram from 2019, what’s actually running.
Start with infrastructure: what depends on what? Your app needs SQL. SQL needs Key Vault for secrets. Key Vault needs networking. Networking needs… you get it.
Then services: your API talks to SQL, Key Vault, and that external payment provider nobody documented. Your frontend needs the API, CDN, and three storage accounts. Your background jobs need queues and an email service that might or might not still exist.
Finally: what order do you start things? Networking first. Then storage. Then secrets. Then compute. Then load balancers. Then DNS. Get this wrong and you’ll spend four hours debugging why the app can’t connect to a database that isn’t up yet.
If you don’t know this, your first DR test will be chaos.
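One way to turn that dependency chain into a start order is a topological sort. The sketch below is only an illustration: the component names and the DEPENDS_ON map are made up for the example, and your real map has to come from what is actually running.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency map: each component lists what must be up before it.
# Replace this with what is actually deployed, not the diagram from 2019.
DEPENDS_ON = {
    "networking": [],
    "storage": ["networking"],
    "key_vault": ["networking"],
    "sql": ["storage", "key_vault"],
    "api": ["sql", "key_vault"],
    "background_jobs": ["sql", "storage"],
    "load_balancer": ["api"],
    "dns": ["load_balancer"],
}


def startup_order(depends_on):
    """Return components in an order where every dependency comes first."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # A cycle means the map is wrong or the design has a
        # chicken-and-egg problem. Either way, better to learn it now.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


if __name__ == "__main__":
    for step, component in enumerate(startup_order(DEPENDS_ON), start=1):
        print(f"{step}. {component}")
```

If the sort raises, you have found a problem before the test instead of four hours into it.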
How to actually test without killing your DR
So how do you test without breaking everything?
The cleanest option: don’t test your real DR at all. Build a throwaway environment. Export your infrastructure as code, spin it up in a test subscription, run your tests, burn it down. Cost: maybe €50–100 per hour plus a couple of days of engineering work. The hardest part isn’t technical, it’s convincing finance that you need the test. Good luck with the leprechauns.
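The "burn it down" part matters as much as the spin-up, because a forgotten test environment keeps billing by the hour. A minimal sketch of the pattern, assuming you supply three hypothetical callables (deploy, run_tests, destroy) that wrap whatever infrastructure-as-code tooling you use:

```python
def run_throwaway_drill(deploy, run_tests, destroy):
    """Deploy a disposable copy, test it, and always tear it down.

    deploy, run_tests and destroy are hypothetical callables you provide;
    nothing here talks to a real cloud API.
    """
    env = deploy()
    try:
        return run_tests(env)
    finally:
        # Runs even if the tests raise, so the meter stops either way.
        destroy(env)
```

Note that if deploy itself fails halfway, nothing is cleaned up here. A real wrapper would need to handle partial deployments too.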
If you can’t do that, at least verify your DR is alive. Monthly checks: Is log shipping working? Is the lag under 15 minutes? Do all the resources actually exist? Do configs match production? This won’t tell you if services actually start, but it tells you the foundation isn’t rotten.
It’s a health check. Better than nothing.
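The monthly checks above can be written down as code so they are run the same way every time. This is a sketch under assumptions: DrSnapshot and check_dr are names invented for the example, and how you collect the facts (last restored log, resource list, config values) depends entirely on your platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_LAG = timedelta(minutes=15)  # threshold from the checklist above


@dataclass
class DrSnapshot:
    """Facts gathered about the DR site; how you gather them is up to you."""
    last_log_restored_at: datetime  # when the last shipped log was applied
    resources: set                  # names of resources that exist on DR
    config: dict                    # settings read from DR


def check_dr(snapshot, expected_resources, prod_config, now=None):
    """Return a list of problems; an empty list means the checks passed."""
    now = now or datetime.now(timezone.utc)
    problems = []

    lag = now - snapshot.last_log_restored_at
    if lag > MAX_LAG:
        problems.append(f"log shipping lag is {lag}, limit is {MAX_LAG}")

    for name in sorted(expected_resources - snapshot.resources):
        problems.append(f"missing resource: {name}")

    for key in sorted(prod_config):
        if snapshot.config.get(key) != prod_config[key]:
            problems.append(f"config drift on {key}")

    return problems
```

An empty list only means the foundation looks sound. It still says nothing about whether services actually start.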
And if full testing feels too risky, test in pieces.
Week one: networking. Can DR reach what it needs?
Week two: database. Activate in isolation, verify data, roll back.
Week three: application. Deploy to test subscription, point at test database.
Week four: put it all together.
You build confidence gradually. You find problems in small doses. You don’t bet everything on one big test that might ruin your month.
RTO math
Your DR plan says: RTO 2 hours.
Reality check:
Disaster strikes: T+0
Alert triggers: T+15 min
Assess situation: T+45 min (is this real? should we failover?)
Decision to activate DR: T+1 hour (on a good day; it can easily take up to 3)
Start DR activation: T+1 hour
Update DNS: 15 min
Start services: 30 min
Discover config issue: 1 hour
Fix config issue: 2 hours
Discover dependency issue: 1 hour
Fix dependency: 3 hours
Services operational: T+8 hours 45 min
Your 2-hour RTO is actually closer to 9 hours, and nearer 11 if the decision alone takes three. And that’s only if everything else is roughly as it should be.
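The arithmetic is worth writing down, because it is easy to wave away in a meeting. The sketch below just adds up the durations from the timeline above, taking the optimistic case where activation starts at T+1 hour. The step names and the real_rto helper are invented for the example.

```python
from datetime import timedelta

PLANNED_RTO = timedelta(hours=2)

# Durations read off the timeline above. Activation starts at T+1 hour,
# so everything before it is lumped into one "detect, assess, decide" step.
STEPS = [
    ("detect, assess, decide", timedelta(hours=1)),
    ("update DNS", timedelta(minutes=15)),
    ("start services", timedelta(minutes=30)),
    ("discover config issue", timedelta(hours=1)),
    ("fix config issue", timedelta(hours=2)),
    ("discover dependency issue", timedelta(hours=1)),
    ("fix dependency", timedelta(hours=3)),
]


def real_rto(steps):
    """Sum the step durations into one end-to-end recovery time."""
    return sum((duration for _, duration in steps), timedelta())


if __name__ == "__main__":
    actual = real_rto(STEPS)
    print(f"planned RTO: {PLANNED_RTO}")           # 2:00:00
    print(f"actual RTO:  {actual}")                # 8:45:00
    print(f"overshoot:   {actual - PLANNED_RTO}")  # 6:45:00
```

Change the first entry to three hours and the total becomes 10 hours 45 minutes, before anyone has touched a server.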
What you should actually do
Map your dependencies. Build a test environment separate from DR. Document everything. Test. And measure how long it actually takes. That’s your real RTO, not what the PowerPoint says.
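Measuring can be as simple as wrapping each runbook step in a timer during the drill. A minimal sketch, assuming a hypothetical DrillTimer helper (not part of any library) that records how long each step took, including steps that fail:

```python
import time
from contextlib import contextmanager


class DrillTimer:
    """Record how long each step of a DR drill takes."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the timer can be tested
        self.timings = []    # (step name, seconds) in execution order

    @contextmanager
    def step(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the time even when the step raises: failed steps
            # are part of your real RTO too.
            self.timings.append((name, self._clock() - start))

    def total_seconds(self):
        return sum(seconds for _, seconds in self.timings)
```

Usage during a drill looks like `with timer.step("update DNS"): ...` around each step, then compare `timer.total_seconds()` with the number in the plan.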
P.S. If you’re reading this and thinking “we should really test our DR site”… but then deciding “maybe next quarter”…
That’s exactly why it’s called Schrödinger’s DR site.
It both works and doesn’t work. Until you test it. And testing might kill it.
So you don’t test.
And the cat stays in the box.
Until a real disaster opens it.

