At Ubiquo, our primary infrastructure in US-East-1 serves production traffic for clients across several Central American countries. A regional AWS outage would mean total service disruption with no recovery path — an unacceptable risk for a platform handling critical communications.
I was tasked with designing and implementing a multi-region failover architecture that could take over critical services quickly, without duplicating the full cost of the primary region.
The Problem I Solved
Single point of failure: All production services ran in one AWS region — any regional outage meant total downtime
No disaster recovery plan: There was no tested strategy for recovering services if US-East-1 went down
Cost constraints: A full active-active setup would double infrastructure costs — the business needed a cost-optimized approach
Data integrity risk: Failing over without proper coordination could cause message and data duplication across services
My Approach
Multi-Region Architecture
Primary region components:
API cluster across 3 Availability Zones with dynamic scaling
MongoDB ReplicaSet: 2 replicas in US-East-1 + 1 replica in US-East-2 (continuous cross-region replication)
MariaDB: Single-zone with daily backups replicated to US-East-2
VPC Peering between regions for MongoDB sync and private routing
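The cross-region MongoDB layout above can be sketched as a replica set configuration. This is a minimal illustration, not the production config: the hostnames and priority values are assumptions, chosen so the US-East-2 member replicates continuously but is not preferred as primary during normal operation.

```python
def build_replset_config(name="rs0"):
    """Sketch of a 3-member replica set: 2 members in us-east-1,
    1 lower-priority member in us-east-2 (warm standby)."""
    return {
        "_id": name,
        "members": [
            {"_id": 0, "host": "mongo-a.use1.internal:27017", "priority": 2},
            {"_id": 1, "host": "mongo-b.use1.internal:27017", "priority": 2},
            # Cross-region member: stays in sync over VPC Peering but
            # only becomes primary if both us-east-1 members are lost.
            {"_id": 2, "host": "mongo-c.use2.internal:27017", "priority": 1},
        ],
    }
```

During failover the secondary-region member can be scaled up and its priority raised, which is what makes the warm-standby strategy for MongoDB work.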
Hybrid DR Strategy — My Key Design Decision
Instead of picking one DR strategy for everything, I designed a hybrid approach that matches each component to the right strategy based on its criticality and cost profile:
| Strategy | Applied To | Cost Impact |
| --- | --- | --- |
| Backup & Restore | MariaDB — daily backups replicated and restored during failover | Low |
| Pilot Light | Compute cluster and queue servers — inactive until activation | Low |
| Warm Standby | MongoDB — active replica at reduced capacity, scalable during failover | Medium |
| Active-Active | VPC networking, DNS records, ALB/NLB — always running and ready | Medium |
This was the most impactful design decision: it kept the secondary region’s monthly cost significantly lower than a full active-active setup while maintaining fast recovery.
DNS-Based Traffic Switching
I implemented failover using Route53 weighted routing — no changes needed on external provider DNS records:
Normal operation:
US-East-1: Weight 255 (receives 100% traffic)
US-East-2: Weight 0 (receives 0% traffic)
During failover:
Flip the weights — US-East-2 takes 100% traffic
Why I chose this approach:
No external DNS changes required — provider CNAMEs always point to the same internal domain
Fast switchover — only weight values change, no record creation/deletion
Simple rollback — revert weights to restore primary routing
TTLs aligned with RTO objectives for fast propagation
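The weight flip described above reduces to a single Route53 `ChangeBatch` that upserts both weighted records. A minimal sketch, assuming CNAME records and illustrative ALB hostnames; in practice the batch would be applied with boto3's `route53.change_resource_record_sets`:

```python
def weight_change_batch(record_name, targets, failover=False):
    """Build a Route53 ChangeBatch flipping traffic between regions.

    targets: {"us-east-1": alb_dns, "us-east-2": alb_dns} (placeholders).
    Normal operation sends 100% to us-east-1; failover inverts the weights.
    """
    weights = {
        "us-east-1": 0 if failover else 255,
        "us-east-2": 255 if failover else 0,
    }
    return {
        "Changes": [
            {
                "Action": "UPSERT",  # only weights change; no create/delete
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "SetIdentifier": region,
                    "Weight": weights[region],
                    "TTL": 60,  # keep TTL aligned with the RTO target
                    "ResourceRecords": [{"Value": dns}],
                },
            }
            for region, dns in targets.items()
        ]
    }
```

Rollback is the same call with `failover=False`, which is what keeps the switchover and its reversal symmetric.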
Failover Activation Process
I designed a strict activation sequence focused on preventing data duplication — the most dangerous risk in a multi-region cutover:
Sequence (Operations Team coordinating US-East-1, Route53, and US-East-2):
1. Confirm regional failure
2. Stop critical processes in US-East-1
3. Scale up instances and pods in US-East-2
4. Switch Route53 weights to the secondary region
5. Restore the MariaDB backup
6. Run the verification checklist
7. Declare the environment stable
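The ordering constraint is the whole point of this runbook: traffic must never switch before producers in the failed region are stopped. A small sketch of an orchestrator that enforces strict ordering and halts on the first failure (step names mirror the sequence; the handlers here are placeholders):

```python
def run_activation(steps):
    """Execute failover steps strictly in order.

    steps: list of (name, action) where action() returns True on success.
    Stops at the first failure so later steps (e.g. the DNS switch)
    never run against a half-prepared secondary region.
    """
    completed = []
    for name, action in steps:
        if not action():
            return completed, name  # halt: report the failing step
        completed.append(name)
    return completed, None

# Illustrative runbook; real handlers would call AWS/Kubernetes APIs.
RUNBOOK = [
    ("confirm_regional_failure", lambda: True),
    ("stop_critical_processes", lambda: True),
    ("scale_up_secondary", lambda: True),
    ("switch_route53_weights", lambda: True),
    ("restore_mariadb_backup", lambda: True),
    ("run_verification_checklist", lambda: True),
    ("declare_stable", lambda: True),
]
```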
Verification Checklist
I built a comprehensive post-failover checklist covering:
Network: DNS resolution points to secondary ALBs; all targets healthy
Applications: Pods in Ready state, WebSocket sessions stable, file upload/download working
Messaging: SQS queues consuming without backlog; Lambdas executing correctly
Data: MariaDB restoration complete with read/write verified; MongoDB ReplicaSet synced
Integrity: No message or conversation duplication detected
Observability: Logs clean, CPU/memory within thresholds, alerts firing correctly
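A checklist like this is easiest to enforce when it is data, not prose. A minimal sketch, with check names invented for illustration: each category maps to named checks, and failover is declared stable only when no check is failing.

```python
# Hypothetical check names grouped by the categories above.
CHECKS = {
    "network": ["dns_points_to_secondary", "targets_healthy"],
    "applications": ["pods_ready", "websockets_stable", "file_transfer_ok"],
    "messaging": ["sqs_no_backlog", "lambdas_ok"],
    "data": ["mariadb_rw_verified", "mongodb_synced"],
    "integrity": ["no_duplicate_messages"],
    "observability": ["logs_clean", "resources_within_thresholds", "alerts_firing"],
}

def failing_checks(results):
    """results: {check_name: bool}. A missing check counts as failing,
    so the checklist cannot be passed by omission."""
    return [check
            for checks in CHECKS.values()
            for check in checks
            if not results.get(check, False)]
```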
Keeping Failover in Sync
A failover environment is useless if it’s outdated. I established a synchronization workflow:
Code changes → No failover deployment needed — Docker images are published via CI and pulled during activation