Your DR Site Has a Problem: It Trusts the Same People

I design cloud systems that survive failure. Focused on resilience, real-world recoverability, and the hidden cost of technical debt — turning disaster recovery into a competitive advantage, not a checkbox.

We spend zillions on disaster recovery sites.

Secondary regions. Backup systems. Failover procedures. RTO measured in minutes.

And then someone with a compromised Global Admin account deletes everything.

In both regions.

Why?

Because your DR site trusts the same identity provider.

The Scenario Nobody Plans For

Picture this:

It's Friday afternoon. Your monitoring goes crazy. Production is gone. Someone ran:

Get-AzResourceGroup | Remove-AzResourceGroup -Force

Panic. But wait—you have DR!

You fail over to the secondary region. Services come up. Crisis averted.

Monday morning: you realize the same admin account that nuked production also had Contributor rights to DR.

The attacker just waited.

Now both regions are gone.

The Uncomfortable Truth

We plan for datacenter fires that never happen.

We don't plan for the developer who got phished last Tuesday.

We plan for hardware failures.

We don't plan for the contractor whose last day is Friday and who still has Contributor rights.

We plan for natural disasters.

We don't plan for "I thought I was in DEV"—which happens weekly.

Why?

Because admitting "we need to protect against our own admins" feels wrong.

But your DR plan shouldn't be based on trust. It should be based on blast radius limitation.

The Problem With Shared Identity

Your production and DR share everything that matters:

  • Same identity provider

  • Same service principals

  • Same admin accounts

  • Same role-based access control

You built geographic redundancy on top of a single point of failure.

Two regions, one set of keys.

Real-World Attack Vectors

Vector 1: Compromised DevOps

Your CI/CD pipeline has a service principal with Contributor rights.

It deploys to production. Automatically. On every commit.

Now imagine: Someone compromises your DevOps account.

They have your service principal credentials.

They can deploy to production. And to DR.

One infrastructure destroy command later, you're explaining to the board why both regions are gone.

Vector 2: The Patient Ransomware

Modern ransomware doesn't encrypt immediately.

It waits. Learns your environment. Finds your backups. Finds your DR.

Then:

  • Day 1: Encrypts production

  • You: "No problem, we have DR!" (fails over)

  • Day 2: Ransomware encrypts DR

  • You: "How did they..."

  • Attacker: "Same credentials work everywhere. Thanks for the geo-redundancy though."

Vector 3: The Friday Deployment

Tired admin. 11 PM. Deployment script.

az group delete --name "test-rg" --yes --no-wait

Except the script ran that command in a loop over every subscription the account could see.

And "every subscription" includes your DR subscription.

Because the same account has access to both.

Weekend ruined.
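
One cheap guard against this kind of mistake is to make destructive scripts refuse to run unless every target subscription is on an explicit allowlist. A minimal sketch, assuming hypothetical subscription names and a hypothetical `check_delete_target` helper; nothing here is an Azure API, it only validates the plan before any real delete would run:

```python
# Hypothetical allowlist: the only subscriptions this cleanup script may touch.
DELETABLE_SUBSCRIPTIONS = {"sub-dev", "sub-test"}


class UnsafeTargetError(Exception):
    """Raised when a destructive operation targets a non-allowlisted subscription."""


def check_delete_target(subscription: str, resource_group: str) -> str:
    """Return a description of the planned delete, or raise if the target is off-limits."""
    if subscription not in DELETABLE_SUBSCRIPTIONS:
        raise UnsafeTargetError(
            f"refusing to delete {resource_group!r}: "
            f"subscription {subscription!r} is not on the allowlist"
        )
    return f"delete {resource_group} in {subscription}"


def plan_deletes(subscriptions: list[str], resource_group: str) -> list[str]:
    """Validate every target up front; one bad target aborts the whole plan."""
    return [check_delete_target(sub, resource_group) for sub in subscriptions]
```

Calling `plan_deletes(["sub-dev", "sub-dr"], "test-rg")` raises before anything is deleted, because `sub-dr` is not on the list. A guard like this helps with honest mistakes only; it does nothing against an attacker who can edit the script.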

The "Just Use PIM" Myth

"Use Privileged Identity Management! Just-in-Time access!"

Yes. Do that. But it's not enough.

PIM means:

  • No standing admin privileges

  • Time-limited access

  • Approval workflows

PIM doesn't mean:

  • Admin can't do damage during their justified access window

  • Credentials compromised during an active session are safe

  • DevOps service principals are protected

PIM reduces the attack window. It doesn't close it.

What Actually Works: Defense in Depth

Here's what actually works. Not a checklist—a philosophy: make it hard to lose everything at once.

Start with the pipeline. Your DevOps has credentials to production. Fine. But those credentials stop at production. DR subscription? Different keys. Compromised pipeline can't touch it.

Now the backups. Backup vault with Multi-User Authorization. Three people to approve critical operations. One compromised admin can't delete your backups. Ransomware hits a wall.
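
The principle behind multi-user authorization can be sketched as a quorum check: a destructive operation is authorized only once enough distinct people have approved it. This illustrates the idea under assumed names (`ApprovalRequest`, `REQUIRED_APPROVERS`) and an assumed rule that the requester cannot approve their own request; it is not how any backup product implements it:

```python
from dataclasses import dataclass, field

REQUIRED_APPROVERS = 3  # assumption: mirrors the "three people" rule


@dataclass
class ApprovalRequest:
    operation: str
    requester: str
    approvers: set[str] = field(default_factory=set)

    def approve(self, user: str) -> None:
        # Assumption: the requester may not approve their own request.
        if user == self.requester:
            raise PermissionError("requester cannot self-approve")
        self.approvers.add(user)  # a set ignores duplicate approvals

    def is_authorized(self) -> bool:
        # Authorized only when enough distinct approvers have signed off.
        return len(self.approvers) >= REQUIRED_APPROVERS
```

With this rule, one stolen account can file a request or add a single approval, but cannot reach the quorum on its own.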

DR itself stays minimal. Not a hot copy of production—just networking, one database receiving log shipping, enough compute to keep the data flowing. When disaster strikes, you pull infrastructure code from backup (remember, MUA protected) and deploy. Takes longer. But attackers need to compromise multiple systems to get here.

Different service principals everywhere. Prod service principals can't touch DR. DR service principals can't touch prod. Sounds obvious. Almost nobody does it.
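
A small check in the pipeline repository can keep this rule from eroding over time. The sketch below assumes a hypothetical mapping from environment to service principal name (`PIPELINE_CREDENTIALS`) and reports any principal that more than one environment uses:

```python
from collections import defaultdict

# Hypothetical pipeline config: environment -> service principal name.
PIPELINE_CREDENTIALS = {
    "prod": "sp-prod-deploy",
    "dr": "sp-dr-deploy",
    "dev": "sp-dev-deploy",
}


def shared_principals(credentials: dict[str, str]) -> dict[str, list[str]]:
    """Return each service principal that is used by more than one environment."""
    used_by = defaultdict(list)
    for environment, principal in credentials.items():
        used_by[principal].append(environment)
    return {sp: sorted(envs) for sp, envs in used_by.items() if len(envs) > 1}
```

An empty result means no credential is shared. This only inspects the pipeline's own configuration; what each principal is actually allowed to do still has to be verified against the role assignments themselves.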

Your DR database? Local authentication enabled. Credentials printed, in a safe. If your identity provider burns, the database doesn't care. It keeps accepting log shipping.

And finally: break-glass accounts. Three of them. Printed passwords. Hardware MFA tokens. Physical safe. Any login triggers a board-level alert. You use these when everything else has failed. Not for "I forgot my password." Not for Friday deployments.
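
The alerting rule for break-glass accounts is simple enough to sketch. The account names and the sign-in event shape (`user`, `ip`, `time`) below are hypothetical, and the sketch assumes that every sign-in attempt by such an account should alert, whether or not it succeeded:

```python
# Hypothetical break-glass account names.
BREAK_GLASS_ACCOUNTS = {
    "breakglass-1@example.com",
    "breakglass-2@example.com",
    "breakglass-3@example.com",
}


def break_glass_alerts(sign_in_events: list[dict]) -> list[str]:
    """Return one alert message per sign-in attempt by a break-glass account."""
    alerts = []
    for event in sign_in_events:
        user = event["user"].lower()  # account names compared case-insensitively
        if user in BREAK_GLASS_ACCOUNTS:
            alerts.append(
                f"BREAK-GLASS SIGN-IN: {user} from {event['ip']} at {event['time']}"
            )
    return alerts
```

In practice the events would come from the identity provider's sign-in logs, and the alert would page a human rather than return a string.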

What This Architecture Protects Against

  • Compromised DevOps: Can't access DR directly

  • Compromised Prod Admin: DR has different RBAC

  • Ransomware: MUA prevents backup deletion, DR isolated

  • "Oops" moment: Can't accidentally nuke DR

  • Patient attacker: Multiple systems need compromise

What This DOESN'T Protect Against

Full identity tenant takeover at Global Admin level.

If an attacker gets Global Admin on your identity tenant, they can technically access everything.

But:

  • MUA still requires 3 people

  • Break-glass alerts trigger

  • DB local auth still works

  • Audit logs show everything

A separate identity tenant for DR would solve this...

But let's be honest: How many organizations will actually do that?

  • Different tenant = different billing

  • Different admin portal

  • Different support contracts

  • Cross-tenant authentication nightmare

  • Nobody wants this complexity

This architecture is the realistic middle ground:

  • 80% of security benefit

  • 20% of the complexity

  • Actually implementable

Let's Talk Money

"But this is expensive!"

Is it though?

Your alternative hot DR setup:

  • Hot site running 24/7: €50k/month

  • Same security posture as prod: €0 (because it's the same)

  • RTO: 15 minutes

  • Risk: One compromised account = total loss

This architecture:

  • Minimal DR infrastructure: €5k/month

  • MUA, separate service principals, layered defense: €10k setup

  • RTO: 12-48 hours (honest number, not boardroom fantasy)

  • Risk: Requires compromise of multiple systems

Savings: €45k/month = €540k/year

And you're more secure.

Yes, recovery takes a day or two. That is acceptable because it buys something a 15-minute failover cannot: an attacker has to compromise several independent systems before they can destroy both sites.
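
The savings figure is easy to reproduce from the numbers above. The first-year net, which subtracts the one-time setup cost, is an extra line the comparison does not state:

```python
HOT_DR_MONTHLY = 50_000      # EUR, hot site running 24/7
MINIMAL_DR_MONTHLY = 5_000   # EUR, minimal DR infrastructure
SETUP_COST = 10_000          # EUR, one-time: MUA, separate service principals, layered defense

monthly_savings = HOT_DR_MONTHLY - MINIMAL_DR_MONTHLY
annual_savings = monthly_savings * 12
first_year_net = annual_savings - SETUP_COST  # setup cost is paid once

print(monthly_savings, annual_savings, first_year_net)  # 45000 540000 530000
```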

Before You Start: Just See What You Have

You're not rebuilding this tomorrow.

Start with visibility: who actually has admin access to both prod and DR?

That list will scare you. I promise.
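
Producing that list can be sketched as a set intersection over exported role assignments. The sketch assumes records shaped like the JSON that `az role assignment list` returns (`principalName`, `roleDefinitionName`, `scope`); check the field names against your own export. The subscription IDs and the choice of privileged roles are hypothetical, and assignments inherited from management groups are not covered:

```python
PRIVILEGED_ROLES = {"Owner", "Contributor", "User Access Administrator"}

# Hypothetical subscription scopes.
PROD_SUB = "/subscriptions/00000000-0000-0000-0000-000000000001"
DR_SUB = "/subscriptions/00000000-0000-0000-0000-000000000002"


def in_subscription(scope: str, subscription: str) -> bool:
    """True if the scope is the subscription itself or something inside it."""
    scope, subscription = scope.lower(), subscription.lower()
    return scope == subscription or scope.startswith(subscription + "/")


def principals_in_both(assignments: list[dict]) -> list[str]:
    """Names of principals holding a privileged role in both prod and DR."""
    prod, dr = set(), set()
    for a in assignments:
        if a["roleDefinitionName"] not in PRIVILEGED_ROLES:
            continue  # ignore read-only and other non-privileged roles
        if in_subscription(a["scope"], PROD_SUB):
            prod.add(a["principalName"])
        if in_subscription(a["scope"], DR_SUB):
            dr.add(a["principalName"])
    return sorted(prod & dr)
```

Every name this returns is an account that can damage both sites on its own, which makes it a candidate for splitting or removal.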

Then MUA on backup vaults—that's your cheapest win.

Then separate the service principals.

Then break-glass accounts.

Then actually test it.

Find what breaks. Fix it.

Repeat until it doesn't hurt anymore.

The Question You Need to Answer

Your board asks: "What's our disaster recovery plan?"

Version A: "We have a hot DR site. RTO is 15 minutes."

Sounds impressive. Until one compromised admin account takes down both regions.

Version B: "We have layered DR. Real RTO is 12-48 hours depending on complexity. It's the 20% cost, 80% benefit approach. But it requires an attacker to compromise DevOps, bypass Multi-User Authorization, get separate DR credentials, and penetrate multiple security layers. Our risk of total loss is significantly lower."

Sounds realistic. And honest.

Which one do you want to say?

The Uncomfortable Truth

Geographic redundancy is not security.

Running the same vulnerable architecture in two regions just means you can lose twice as fast.

Real disaster recovery in 2026 means:

  • Accept that trust is not a security model

  • Design for compromise

  • Protect against your own infrastructure

  • Make it hard to lose everything at once

So, what's your DR plan protecting against?

Datacenter fires?

Or the more likely scenario: someone with legitimate access making an illegitimate decision?


P.S. If your immediate reaction is "this is too complex"—compare it to explaining to your CEO why both regions got deleted by the same attacker.

P.P.S. If you're thinking "but separate identity tenants would be better"—you're right. But will you actually do it? Or will you keep talking about it in meetings for the next 3 years?

Start with what's realistic. This is realistic.