Architecting Multi-Region Failover on AWS
Context
At Ubiquo, our primary infrastructure in US-East-1 serves production traffic for clients across several Central American countries. A regional AWS outage would mean total service disruption with no recovery path — an unacceptable risk for a platform handling critical communications.
I was tasked with designing and implementing a multi-region failover architecture that could take over critical services quickly, without duplicating the full cost of the primary region.
The Problem I Solved
- Single point of failure: All production services ran in one AWS region — any regional outage meant total downtime
- No disaster recovery plan: There was no tested strategy for recovering services if US-East-1 went down
- Cost constraints: A full active-active setup would double infrastructure costs — the business needed a cost-optimized approach
- Data integrity risk: Failing over without proper coordination could cause message and data duplication across services
My Approach
Multi-Region Architecture
Primary region components:
- API cluster across 3 Availability Zones with dynamic scaling
- MongoDB ReplicaSet: 2 replicas in US-East-1 + 1 replica in US-East-2 (continuous cross-region replication)
- MariaDB: Single-zone with daily backups replicated to US-East-2
- VPC Peering between regions for MongoDB sync and private routing
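One consequence of the 2 + 1 member split is worth spelling out with a little arithmetic. A MongoDB replica set elects a primary only when a majority of its voting members can reach each other. The sketch below assumes all three members vote and uses hypothetical member names. It shows that the single US-East-2 member cannot become primary on its own if US-East-1 is lost, so promoting it has to be a deliberate step in the failover rather than something that happens automatically.

```python
# Hypothetical member names; only the 2 + 1 regional split comes from the text.
MEMBERS = {
    "mongo-use1-a": "us-east-1",
    "mongo-use1-b": "us-east-1",
    "mongo-use2-a": "us-east-2",
}


def majority(voting_members: int) -> int:
    """Votes needed to elect a primary: strictly more than half."""
    return voting_members // 2 + 1


def can_elect_primary(members: dict[str, str], lost_region: str) -> bool:
    """True if the members outside lost_region still form a majority."""
    survivors = sum(1 for region in members.values() if region != lost_region)
    return survivors >= majority(len(members))


if __name__ == "__main__":
    for region in ("us-east-1", "us-east-2"):
        print(region, "lost -> automatic election possible:",
              can_elect_primary(MEMBERS, region))
```

Losing US-East-2 leaves two of three members, which is still a majority. Losing US-East-1 leaves one of three, which is not.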
Hybrid DR Strategy — My Key Design Decision
Instead of picking one DR strategy for everything, I designed a hybrid approach that matches each component to the right strategy based on its criticality and cost profile:
| Strategy | Applied To | Cost Impact |
|---|---|---|
| Backup & Restore | MariaDB — daily backups replicated and restored during failover | Low |
| Pilot Light | Compute cluster and queue servers — inactive until activation | Low |
| Warm Standby | MongoDB — active replica at reduced capacity, scalable during failover | Medium |
| Active-Active | VPC networking, DNS records, ALB/NLB — always running and ready | Medium |
This was the most impactful design decision: it kept the secondary region’s monthly cost significantly lower than a full active-active setup while maintaining fast recovery.
DNS-Based Traffic Switching
I implemented failover using Route53 weighted routing — no changes needed on external provider DNS records:
Normal operation:
- US-East-1: Weight 255 (receives 100% traffic)
- US-East-2: Weight 0 (receives 0% traffic)
During failover:
- US-East-1: Weight 0 (receives 0% traffic)
- US-East-2: Weight 255 (receives 100% traffic)
Why I chose this approach:
- No external DNS changes required — provider CNAMEs always point to the same internal domain
- Fast switchover — only weight values change, no record creation/deletion
- Simple rollback — revert weights to restore primary routing
- TTLs aligned with RTO objectives, so resolvers pick up the new weights within the recovery window
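The weight flip can be sketched as a small helper that builds the change batch for both weighted records at once. This is an illustration under stated assumptions, not the production tooling: the record name, targets, set identifiers, the CNAME record type, and the 60-second TTL are all hypothetical. The function only builds the payload, so it can be reviewed or tested without touching AWS.

```python
def weighted_record(name: str, set_id: str, target: str, weight: int, ttl: int) -> dict:
    """Build one UPSERT change for a weighted CNAME record."""
    if not 0 <= weight <= 255:
        raise ValueError("Route 53 weights must be between 0 and 255")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",            # assumption: plain CNAMEs, not alias records
            "SetIdentifier": set_id,    # distinguishes the two weighted records
            "Weight": weight,
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
        },
    }


def build_failover_batch(name: str, primary_target: str, secondary_target: str,
                         to_secondary: bool, ttl: int = 60) -> dict:
    """Return a change batch that sends all traffic to one region.

    to_secondary=True is the failover; to_secondary=False is the rollback.
    """
    primary_weight, secondary_weight = (0, 255) if to_secondary else (255, 0)
    return {
        "Comment": "failover to secondary" if to_secondary else "rollback to primary",
        "Changes": [
            weighted_record(name, "us-east-1", primary_target, primary_weight, ttl),
            weighted_record(name, "us-east-2", secondary_target, secondary_weight, ttl),
        ],
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    batch = build_failover_batch(
        "api.internal.example.com.",
        "alb-use1.example.com",
        "alb-use2.example.com",
        to_secondary=True,
    )
    for change in batch["Changes"]:
        record = change["ResourceRecordSet"]
        print(record["SetIdentifier"], record["Weight"])
```

With boto3, the resulting dictionary would be passed as the `ChangeBatch` argument of `change_resource_record_sets` on a Route 53 client, together with the `HostedZoneId`. Sending both records in one batch means the two weights change together, and rollback is the same call with `to_secondary=False`.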
Failover Activation Process
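I designed a strict activation sequence focused on preventing data duplication, the most dangerous risk in a multi-region cutover. The core of it is a planned interruption: processing in the primary region is stopped before the secondary region is activated, so the two regions never handle the same messages at the same time.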
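A strict sequence of this kind can be enforced with a small runner that executes steps in order and refuses to continue after a failure. The sketch below is an illustration only. The step names and their order are assumptions drawn from the DR strategies described earlier, and the placeholder actions stand in for real calls to cluster, database, and DNS tooling.

```python
from typing import Callable


class StepFailed(Exception):
    """Raised when a step reports failure; later steps are never run."""

    def __init__(self, name: str, completed: list[str]):
        super().__init__(f"step {name!r} failed after {completed}")
        self.name = name
        self.completed = completed


def run_sequence(steps: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run steps strictly in order and abort on the first failure."""
    completed: list[str] = []
    for name, action in steps:
        if not action():
            raise StepFailed(name, completed)
        completed.append(name)
    return completed


# Hypothetical steps with placeholder actions that always succeed.
ACTIVATION_STEPS = [
    ("stop consumers and schedulers in the primary region", lambda: True),
    ("confirm nothing is still processing in the primary region", lambda: True),
    ("restore MariaDB from the latest replicated backup", lambda: True),
    ("scale up compute and queue servers in the secondary region", lambda: True),
    ("switch DNS weights to the secondary region", lambda: True),
]

if __name__ == "__main__":
    for done in run_sequence(ACTIVATION_STEPS):
        print("done:", done)
```

Putting the DNS switch last means traffic only moves once everything before it has succeeded, and a failed step leaves a clear record of how far the activation got.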
Verification Checklist
I built a comprehensive post-failover checklist covering:
- Network: DNS resolution points to secondary ALBs; all targets healthy
- Applications: Pods in Ready state, WebSocket sessions stable, file upload/download working
- Messaging: SQS queues consuming without backlog; Lambdas executing correctly
- Data: MariaDB restoration complete with read/write verified; MongoDB ReplicaSet synced
- Integrity: No message or conversation duplication detected
- Observability: Logs clean, CPU/memory within thresholds, alerts firing correctly
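The integrity item is the easiest one to automate. As a minimal sketch, assuming every message carries a unique ID and that the IDs processed in both regions around the cutover can be exported into one list, duplicates can be found with a simple count. How the real check was implemented is not described here, and the function name and sample IDs are hypothetical.

```python
from collections import Counter
from typing import Iterable


def find_duplicates(message_ids: Iterable[str]) -> dict[str, int]:
    """Return each ID that appears more than once, with its count."""
    counts = Counter(message_ids)
    return {message_id: n for message_id, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical IDs gathered from both regions for the cutover window.
    processed = ["msg-001", "msg-002", "msg-003", "msg-002"]
    duplicates = find_duplicates(processed)
    if duplicates:
        print("duplicates found:", duplicates)
    else:
        print("no duplicates")
```

An empty result is the pass condition for the checklist item; any entry points at a message that was handled in both regions.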
Keeping Failover in Sync
A failover environment is useless if it’s outdated. I established a synchronization workflow:
- Code changes → No failover deployment needed: Docker images are published by CI and pulled during activation
- Config/Infrastructure changes → Explicit sync required: Kubernetes cluster versions, add-ons, ConfigMaps, Terraform configurations, and AWS Backup plans
- A mandatory deployment checklist ensures no sync step is missed and obsolete resources are cleaned up
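Part of such a checklist can be backed by a drift check that compares what each region reports. The sketch below assumes the relevant versions and config hashes have already been collected into one flat dictionary per region. The keys and values are hypothetical, and how the inventories are gathered is out of scope.

```python
from typing import Optional


def find_drift(
    primary: dict[str, str], secondary: dict[str, str]
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return every key whose value differs between the two regions.

    Each entry maps to (primary_value, secondary_value); None means the
    item does not exist in that region.
    """
    drift = {}
    for key in sorted(primary.keys() | secondary.keys()):
        primary_value = primary.get(key)
        secondary_value = secondary.get(key)
        if primary_value != secondary_value:
            drift[key] = (primary_value, secondary_value)
    return drift


if __name__ == "__main__":
    # Hypothetical inventories for illustration.
    primary = {"eks_version": "1.29", "configmap/api": "sha-aaa"}
    secondary = {
        "eks_version": "1.28",
        "configmap/api": "sha-aaa",
        "configmap/legacy": "sha-old",
    }
    for key, (p, s) in find_drift(primary, secondary).items():
        print(f"{key}: primary={p} secondary={s}")
```

An entry with `None` on the primary side is a resource that exists only in the failover region, which is the kind of obsolete resource the checklist asks to clean up.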
Results
- Disaster recovery capability from zero — the organization went from no DR plan to a tested, documented multi-region failover
- Recovery time under target — the full failover process (from trigger to stable) was validated within the defined RTO window
- Cost-optimized secondary region — the hybrid strategy kept standby costs at a fraction of the primary region’s cost
- Zero data duplication — the planned interruption strategy proved effective across all failover tests
- Simple rollback — return-to-primary process validated with clean DNS weight reversion during low-traffic windows
- Cross-team operational readiness — documented runbooks and verification checklists enabled multiple teams to execute failover confidently
Key Takeaways
- Combine DR strategies — match strategy to component criticality and cost; don’t force everything into one model
- Planned interruption prevents data duplication — stopping processes before cutover is essential for integrity
- Weighted DNS routing enables fast switching — no external DNS changes, just weight adjustments
- Keep failover in sync, not identical — code deploys via CI images; only config/infra needs explicit sync
- Test the failover — a DR plan that hasn’t been drilled is just a document
Tools & Technologies
- AWS Route53 — Weighted routing for DNS-based failover
- AWS ALB / NLB — Multi-region load balancing
- AWS EKS — Kubernetes clusters in both regions
- MongoDB ReplicaSet — Cross-region continuous replication
- MariaDB — Daily backup and restore strategy
- Terraform — Infrastructure provisioning for both regions
- Karpenter / KEDA / Nginx Ingress — Kubernetes scaling and routing
- AWS SQS / Lambda — Asynchronous processing in both regions
- VPC Peering — Cross-region private connectivity
- GitLab CI — Deployment pipelines targeting both environments