Julio Rodriguez

Architecting Multi-Region Failover on AWS

Context

At Ubiquo, our primary infrastructure in US-East-1 serves production traffic for clients across several Central American countries. A regional AWS outage would mean total service disruption with no recovery path — an unacceptable risk for a platform handling critical communications.

I was tasked with designing and implementing a multi-region failover architecture that could take over critical services quickly, without duplicating the full cost of the primary region.

The Problem I Solved

  • Single point of failure: All production services ran in one AWS region — any regional outage meant total downtime
  • No disaster recovery plan: There was no tested strategy for recovering services if US-East-1 went down
  • Cost constraints: A full active-active setup would double infrastructure costs — the business needed a cost-optimized approach
  • Data integrity risk: Failing over without proper coordination could cause message and data duplication across services

My Approach

Multi-Region Architecture

[Architecture diagram] Route53 weighted routing (active: weight 255, standby: weight 0) in front of two regions. US-East-1 (primary): ALB → multi-AZ web servers, NLB → EKS cluster across 3 AZs, MongoDB with 2 replicas, MariaDB. US-East-2 (secondary): ALB → web servers, NLB → pilot-light EKS, MongoDB with 1 replica receiving continuous replication, MariaDB fed by daily backups.

Primary region components:

  • API cluster across 3 Availability Zones with dynamic scaling
  • MongoDB ReplicaSet: 2 replicas in US-East-1 + 1 replica in US-East-2 (continuous cross-region replication)
  • MariaDB: Single-zone with daily backups replicated to US-East-2
  • VPC Peering between regions for MongoDB sync and private routing
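The cross-region MongoDB topology above can be sketched as the member configuration you would pass to `rs.initiate()`. This is a minimal illustration, not the production config — the hostnames and priority values are hypothetical:

```python
# Illustrative MongoDB ReplicaSet config: 2 voting members in US-East-1
# and 1 lower-priority member in US-East-2 that replicates continuously
# but cannot be elected primary during normal operation.
def build_replica_set_config(rs_name="rs0"):
    return {
        "_id": rs_name,
        "members": [
            # Primary-region replicas (hypothetical hostnames)
            {"_id": 0, "host": "mongo-a.use1.internal:27017", "priority": 2},
            {"_id": 1, "host": "mongo-b.use1.internal:27017", "priority": 1},
            # Secondary-region warm standby: priority 0 means it stays a
            # follower until it is reconfigured during failover
            {"_id": 2, "host": "mongo-c.use2.internal:27017", "priority": 0},
        ],
    }
```

During failover, raising the secondary-region member's priority lets an election promote it to primary.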

Hybrid DR Strategy — My Key Design Decision

Instead of picking one DR strategy for everything, I designed a hybrid approach that matches each component to the right strategy based on its criticality and cost profile:

| Strategy | Applied To | Cost Impact |
| --- | --- | --- |
| Backup & Restore | MariaDB — daily backups replicated and restored during failover | Low |
| Pilot Light | Compute cluster and queue servers — inactive until activation | Low |
| Warm Standby | MongoDB — active replica at reduced capacity, scalable during failover | Medium |
| Active-Active | VPC networking, DNS records, ALB/NLB — always running and ready | Medium |

This was the most impactful design decision: it kept the secondary region’s monthly cost significantly lower than a full active-active setup while maintaining fast recovery.

DNS-Based Traffic Switching

I implemented failover using Route53 weighted routing — no changes needed on external provider DNS records:

Normal operation:

  • US-East-1: Weight 255 (receives 100% traffic)
  • US-East-2: Weight 0 (receives 0% traffic)

During failover:

  • Flip the weights — US-East-2 takes 100% traffic

Why I chose this approach:

  • No external DNS changes required — provider CNAMEs always point to the same internal domain
  • Fast switchover — only weight values change, no record creation/deletion
  • Simple rollback — revert weights to restore primary routing
  • TTLs aligned with RTO objectives for fast propagation
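The weight flip described above can be expressed as a single Route53 change batch. This is a sketch built as pure data — the record name, targets, and set identifiers are hypothetical — which in practice would be submitted via boto3's `route53.change_resource_record_sets`:

```python
def build_failover_change_batch(record_name, alb_primary, alb_secondary,
                                promote_secondary=True):
    """Build a Route53 ChangeBatch that flips weights between two
    weighted records sharing the same name."""
    primary_weight = 0 if promote_secondary else 255
    secondary_weight = 255 if promote_secondary else 0

    def upsert(set_id, target, weight):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": 60,  # short TTL keeps propagation within the RTO
                "SetIdentifier": set_id,
                "Weight": weight,
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {
        "Comment": "Regional failover weight flip",
        "Changes": [
            upsert("us-east-1", alb_primary, primary_weight),
            upsert("us-east-2", alb_secondary, secondary_weight),
        ],
    }
```

Rollback is the same call with `promote_secondary=False` — no records are created or deleted in either direction.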

Failover Activation Process

I designed a strict activation sequence focused on preventing data duplication — the most dangerous risk in a multi-region cutover:

[Sequence diagram: Operations Team coordinating US-East-1, Route53, and US-East-2]

  1. Confirm regional failure
  2. Stop critical processes in US-East-1
  3. Scale up instances and pods in US-East-2
  4. Switch Route53 weights to the secondary region
  5. Restore the MariaDB backup
  6. Run the verification checklist
  7. Declare the environment stable
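To guard against executing steps out of order — for example, switching DNS before primary-region producers are stopped — the sequence can be modeled as a tiny state machine. A hedged sketch, with step names of my own choosing:

```python
# Ordered failover steps; each may only run once all previous steps are
# done, which blocks the dangerous case of routing traffic to the
# secondary region while primary-region consumers are still processing.
FAILOVER_STEPS = [
    "confirm_regional_failure",
    "stop_critical_processes",
    "scale_up_secondary",
    "switch_dns_weights",
    "restore_mariadb_backup",
    "run_verification_checklist",
    "declare_stable",
]

class FailoverRunbook:
    def __init__(self):
        self.completed = []

    def run(self, step):
        expected = FAILOVER_STEPS[len(self.completed)]
        if step != expected:
            raise RuntimeError(
                f"Out of order: expected {expected!r}, got {step!r}")
        self.completed.append(step)
        return step
```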

Verification Checklist

I built a comprehensive post-failover checklist covering:

  • Network: DNS resolution points to secondary ALBs; all targets healthy
  • Applications: Pods in Ready state, WebSocket sessions stable, file upload/download working
  • Messaging: SQS queues consuming without backlog; Lambdas executing correctly
  • Data: MariaDB restoration complete with read/write verified; MongoDB ReplicaSet synced
  • Integrity: No message or conversation duplication detected
  • Observability: Logs clean, CPU/memory within thresholds, alerts firing correctly
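A checklist like the one above lends itself to automation: a set of named probes whose results aggregate into a single go/no-go verdict. A sketch with hypothetical probe names stubbed out as booleans — in production each would query DNS, kubectl, SQS metrics, or database health:

```python
def run_checklist(checks):
    """Run named check callables; return (all_passed, list of failures)."""
    failures = [name for name, check in checks.items() if not check()]
    return (len(failures) == 0, failures)

# Hypothetical probes, stubbed for illustration
checks = {
    "dns_points_to_secondary_alb": lambda: True,
    "pods_ready": lambda: True,
    "sqs_backlog_clear": lambda: True,
    "mariadb_read_write_ok": lambda: True,
    "no_message_duplication": lambda: True,
}
ok, failures = run_checklist(checks)
```

Any single failing probe blocks the "declare stable" step, so a partial failover can never be mistaken for a complete one.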

Keeping Failover in Sync

A failover environment is useless if it’s outdated. I established a synchronization workflow:

  • Code changes → No failover deployment needed — Docker images publish via CI and are pulled during activation
  • Config/Infrastructure changes → Explicit sync required — Kubernetes cluster versions, add-ons, ConfigMaps, Terraform recipes, and AWS Backup plans
  • Mandatory deployment checklist ensures no sync step is missed and obsolete resources are cleaned up
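The sync rule above — code ships via CI images, config/infra needs explicit propagation — can be captured as a small classifier used in the deployment checklist. The change categories here are illustrative:

```python
# Which change types require an explicit sync step in the secondary
# region. Code changes do not: Docker images are pulled from the
# registry at activation time, so the pilot-light cluster is never
# behind on code.
SYNC_REQUIRED = {
    "application_code": False,
    "docker_image": False,
    "k8s_version": True,
    "k8s_addon": True,
    "configmap": True,
    "terraform_recipe": True,
    "aws_backup_plan": True,
}

def changes_needing_sync(changes):
    # Unknown change types default to requiring sync, erring on safety
    return [c for c in changes if SYNC_REQUIRED.get(c, True)]
```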

Results

  • Disaster recovery capability from zero — the organization went from no DR plan to a tested, documented multi-region failover
  • Recovery time under target — the full failover process (from trigger to stable) was validated within the defined RTO window
  • Cost-optimized secondary region — the hybrid strategy kept standby costs at a fraction of the primary region’s cost
  • Zero data duplication — the planned interruption strategy proved effective across all failover tests
  • Simple rollback — return-to-primary process validated with clean DNS weight reversion during low-traffic windows
  • Cross-team operational readiness — documented runbooks and verification checklists enabled multiple teams to execute failover confidently

Key Takeaways

  1. Combine DR strategies — match strategy to component criticality and cost; don’t force everything into one model
  2. Planned interruption prevents data duplication — stopping processes before cutover is essential for integrity
  3. Weighted DNS routing enables fast switching — no external DNS changes, just weight adjustments
  4. Keep failover in sync, not identical — code deploys via CI images; only config/infra needs explicit sync
  5. Test the failover — a DR plan that hasn’t been drilled is just a document

Tools & Technologies

  • AWS Route53 — Weighted routing for DNS-based failover
  • AWS ALB / NLB — Multi-region load balancing
  • AWS EKS — Kubernetes clusters in both regions
  • MongoDB ReplicaSet — Cross-region continuous replication
  • MariaDB — Daily backup and restore strategy
  • Terraform — Infrastructure provisioning for both regions
  • Karpenter / KEDA / Nginx Ingress — Kubernetes scaling and routing
  • AWS SQS / Lambda — Asynchronous processing in both regions
  • VPC Peering — Cross-region private connectivity
  • GitLab CI — Deployment pipelines targeting both environments