Julio Rodriguez

Architecting Multi-Region Failover on AWS

Context

At Ubiquo, our primary infrastructure in US-East-1 serves production traffic for clients across several Central American countries. A regional AWS outage would mean total service disruption with no recovery path — an unacceptable risk for a platform handling critical communications.

I was tasked with designing and implementing a multi-region failover architecture that could take over critical services quickly, without duplicating the full cost of the primary region.

The Problem I Solved

  • Single point of failure: All production services ran in one AWS region — any regional outage meant total downtime
  • No disaster recovery plan: There was no tested strategy for recovering services if US-East-1 went down
  • Cost constraints: A full active-active setup would double infrastructure costs — the business needed a cost-optimized approach
  • Data integrity risk: Failing over without proper coordination could cause message and data duplication across services

My Approach

Multi-Region Architecture

[Architecture diagram] Route53 weighted routing (active: weight 255, standby: weight 0) in front of two regions. US-East-1 (primary): ALB → multi-AZ web servers, NLB → EKS cluster across 3 AZs, MongoDB with 2 replicas, MariaDB. US-East-2 (secondary): ALB → web servers, NLB → pilot-light EKS, MongoDB with 1 replica receiving continuous replication, MariaDB fed by daily backups.

Primary region components:

  • API cluster across 3 Availability Zones with dynamic scaling
  • MongoDB ReplicaSet: 2 replicas in US-East-1 + 1 replica in US-East-2 (continuous cross-region replication)
  • MariaDB: Single-zone with daily backups replicated to US-East-2
  • VPC Peering between regions for MongoDB sync and private routing
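The cross-region MongoDB topology above can be sketched as the member configuration you would pass to `rs.initiate()`. This is a minimal illustration, not the production config — the hostnames and priority values are hypothetical:

```python
# Illustrative MongoDB ReplicaSet config: 2 voting members in US-East-1
# and 1 lower-priority member in US-East-2 that replicates continuously
# but cannot be elected primary during normal operation.
def build_replica_set_config(rs_name="rs0"):
    return {
        "_id": rs_name,
        "members": [
            # Primary-region replicas (hypothetical hostnames)
            {"_id": 0, "host": "mongo-a.use1.internal:27017", "priority": 2},
            {"_id": 1, "host": "mongo-b.use1.internal:27017", "priority": 1},
            # Secondary-region warm standby: priority 0 means it stays a
            # follower until it is reconfigured during failover
            {"_id": 2, "host": "mongo-c.use2.internal:27017", "priority": 0},
        ],
    }
```

During failover, raising the secondary-region member's priority lets an election promote it to primary.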

Hybrid DR Strategy — My Key Design Decision

Instead of picking one DR strategy for everything, I designed a hybrid approach that matches each component to the right strategy based on its criticality and cost profile:

| Strategy | Applied To | Cost Impact |
| --- | --- | --- |
| Backup & Restore | MariaDB — daily backups replicated and restored during failover | Low |
| Pilot Light | Compute cluster and queue servers — inactive until activation | Low |
| Warm Standby | MongoDB — active replica at reduced capacity, scalable during failover | Medium |
| Active-Active | VPC networking, DNS records, ALB/NLB — always running and ready | Medium |

This was the most impactful design decision: it kept the secondary region’s monthly cost significantly lower than a full active-active setup while maintaining fast recovery.

DNS-Based Traffic Switching

I implemented failover using Route53 weighted routing — no changes needed on external provider DNS records:

Normal operation:

  • US-East-1: Weight 255 (receives 100% traffic)
  • US-East-2: Weight 0 (receives 0% traffic)

During failover:

  • Flip the weights — US-East-2 takes 100% traffic

Why I chose this approach:

  • No external DNS changes required — provider CNAMEs always point to the same internal domain
  • Fast switchover — only weight values change, no record creation/deletion
  • Simple rollback — revert weights to restore primary routing
  • TTLs aligned with RTO objectives for fast propagation
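The weight flip described above can be expressed as a single Route53 change batch. This is a sketch built as pure data — the record name, targets, and set identifiers are hypothetical — which in practice would be submitted via boto3's `route53.change_resource_record_sets`:

```python
def build_failover_change_batch(record_name, alb_primary, alb_secondary,
                                promote_secondary=True):
    """Build a Route53 ChangeBatch that flips weights between two
    weighted records sharing the same name."""
    primary_weight = 0 if promote_secondary else 255
    secondary_weight = 255 if promote_secondary else 0

    def upsert(set_id, target, weight):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": 60,  # short TTL keeps propagation within the RTO
                "SetIdentifier": set_id,
                "Weight": weight,
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {
        "Comment": "Regional failover weight flip",
        "Changes": [
            upsert("us-east-1", alb_primary, primary_weight),
            upsert("us-east-2", alb_secondary, secondary_weight),
        ],
    }
```

Rollback is the same call with `promote_secondary=False` — no records are created or deleted in either direction.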

Failover Activation Process

I designed a strict activation sequence focused on preventing data duplication — the most dangerous risk in a multi-region cutover:

[Sequence diagram: Operations Team coordinating US-East-1, Route53, and US-East-2]

  1. Confirm regional failure
  2. Stop critical processes in US-East-1
  3. Scale up instances and pods in US-East-2
  4. Switch Route53 weights to the secondary region
  5. Restore the MariaDB backup
  6. Run the verification checklist
  7. Declare the environment stable
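To guard against executing steps out of order — for example, switching DNS before primary-region producers are stopped — the sequence can be modeled as a tiny state machine. A hedged sketch, with step names of my own choosing:

```python
# Ordered failover steps; each may only run once all previous steps are
# done, which blocks the dangerous case of routing traffic to the
# secondary region while primary-region consumers are still processing.
FAILOVER_STEPS = [
    "confirm_regional_failure",
    "stop_critical_processes",
    "scale_up_secondary",
    "switch_dns_weights",
    "restore_mariadb_backup",
    "run_verification_checklist",
    "declare_stable",
]

class FailoverRunbook:
    def __init__(self):
        self.completed = []

    def run(self, step):
        expected = FAILOVER_STEPS[len(self.completed)]
        if step != expected:
            raise RuntimeError(
                f"Out of order: expected {expected!r}, got {step!r}")
        self.completed.append(step)
        return step
```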

Verification Checklist

I built a comprehensive post-failover checklist covering:

  • Network: DNS resolution points to secondary ALBs; all targets healthy
  • Applications: Pods in Ready state, WebSocket sessions stable, file upload/download working
  • Messaging: SQS queues consuming without backlog; Lambdas executing correctly
  • Data: MariaDB restoration complete with read/write verified; MongoDB ReplicaSet synced
  • Integrity: No message or conversation duplication detected
  • Observability: Logs clean, CPU/memory within thresholds, alerts firing correctly
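A checklist like the one above lends itself to automation: a set of named probes whose results aggregate into a single go/no-go verdict. A sketch with hypothetical probe names stubbed out as booleans — in production each would query DNS, kubectl, SQS metrics, or database health:

```python
def run_checklist(checks):
    """Run named check callables; return (all_passed, list of failures)."""
    failures = [name for name, check in checks.items() if not check()]
    return (len(failures) == 0, failures)

# Hypothetical probes, stubbed for illustration
checks = {
    "dns_points_to_secondary_alb": lambda: True,
    "pods_ready": lambda: True,
    "sqs_backlog_clear": lambda: True,
    "mariadb_read_write_ok": lambda: True,
    "no_message_duplication": lambda: True,
}
ok, failures = run_checklist(checks)
```

Any single failing probe blocks the "declare stable" step, so a partial failover can never be mistaken for a complete one.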

Keeping Failover in Sync

A failover environment is useless if it’s outdated. I established a synchronization workflow:

  • Code changes → No failover deployment needed — Docker images publish via CI and are pulled during activation
  • Config/Infrastructure changes → Explicit sync required — Kubernetes cluster versions, add-ons, ConfigMaps, Terraform recipes, and AWS Backup plans
  • Mandatory deployment checklist ensures no sync step is missed and obsolete resources are cleaned up
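The sync rule above — code ships via CI images, config/infra needs explicit propagation — can be captured as a small classifier used in the deployment checklist. The change categories here are illustrative:

```python
# Which change types require an explicit sync step in the secondary
# region. Code changes do not: Docker images are pulled from the
# registry at activation time, so the pilot-light cluster is never
# behind on code.
SYNC_REQUIRED = {
    "application_code": False,
    "docker_image": False,
    "k8s_version": True,
    "k8s_addon": True,
    "configmap": True,
    "terraform_recipe": True,
    "aws_backup_plan": True,
}

def changes_needing_sync(changes):
    # Unknown change types default to requiring sync, erring on safety
    return [c for c in changes if SYNC_REQUIRED.get(c, True)]
```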

Results

  • Disaster recovery capability from zero — the organization went from no DR plan to a tested, documented multi-region failover
  • Recovery time under target — the full failover process (from trigger to stable) was validated within the defined RTO window
  • Cost-optimized secondary region — the hybrid strategy kept standby costs at a fraction of the primary region’s cost
  • Zero data duplication — the planned interruption strategy proved effective across all failover tests
  • Simple rollback — return-to-primary process validated with clean DNS weight reversion during low-traffic windows
  • Cross-team operational readiness — documented runbooks and verification checklists enabled multiple teams to execute failover confidently

Key Takeaways

  1. Combine DR strategies — match strategy to component criticality and cost; don’t force everything into one model
  2. Planned interruption prevents data duplication — stopping processes before cutover is essential for integrity
  3. Weighted DNS routing enables fast switching — no external DNS changes, just weight adjustments
  4. Keep failover in sync, not identical — code deploys via CI images; only config/infra needs explicit sync
  5. Test the failover — a DR plan that hasn’t been drilled is just a document

Tools & Technologies

  • AWS Route53 — Weighted routing for DNS-based failover
  • AWS ALB / NLB — Multi-region load balancing
  • AWS EKS — Kubernetes clusters in both regions
  • MongoDB ReplicaSet — Cross-region continuous replication
  • MariaDB — Daily backup and restore strategy
  • Terraform — Infrastructure provisioning for both regions
  • Karpenter / KEDA / Nginx Ingress — Kubernetes scaling and routing
  • AWS SQS / Lambda — Asynchronous processing in both regions
  • VPC Peering — Cross-region private connectivity
  • GitLab CI — Deployment pipelines targeting both environments