Migrating Production Workloads to Kubernetes on AWS EKS
Context
At Ubiquo, our production services ran on static EC2 instances with fixed capacity, serving clients across multiple Central American countries. Traffic spikes meant degraded performance or manual intervention — and we were paying for peak-capacity instances that sat idle most of the day.
I led the migration of these workloads to AWS EKS, designing the cluster architecture, autoscaling strategy, networking layer, and deployment pipelines from scratch.
The Problem I Solved
The existing infrastructure had critical limitations:
No horizontal scaling: Fixed EC2 instances couldn’t respond to traffic spikes — peak hours caused degraded response times
No self-healing: Crashed processes required manual SSH and restart, often during off-hours
Resource waste: Instances provisioned for peak capacity ran at ~15-25% utilization during off-peak
Slow deployments: Releasing new versions required SSH access and manual restarts across multiple servers
No isolation: Multiple services sharing instances caused noisy-neighbor issues
My Approach
Cluster Architecture
I designed the EKS cluster across 3 Availability Zones with:
Managed node groups with instance diversity for cost optimization
Karpenter for intelligent node provisioning — selecting the right instance type based on pending pod requirements instead of fixed node group sizes
Namespace isolation per environment and product, with resource quotas to prevent runaway workloads
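As a sketch, a Karpenter NodePool along these lines expresses the instance-diversity and consolidation behavior described above. The instance categories, CPU limit, and names are illustrative, not our exact values:

```yaml
# Illustrative Karpenter NodePool (v1 API); values are placeholders.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]   # instance diversity for cost optimization
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "200"                      # cap total provisioned CPU cluster-wide
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```

The consolidation policy is what terminates underutilized nodes: Karpenter repacks pods onto fewer, better-fitting instances rather than leaving half-empty nodes running.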
Networking Layer
I implemented a two-layer networking stack for production-grade traffic management: an AWS Network Load Balancer at Layer 4 in front of an Nginx Ingress Controller at Layer 7.
Why this architecture:
AWS NLB at Layer 4 provides high throughput, low latency, and static IPs for firewall requirements
Nginx Ingress Controller handles all Layer 7 routing (host-based, path-based), TLS, and rate limiting
This eliminates the need for one ALB per service, centralizing routing configuration in Kubernetes
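A minimal sketch of the two layers, assuming the standard ingress-nginx controller: the controller's Service is exposed through an NLB, and each application gets an Ingress object for Layer 7 routing. Hostnames, service names, and the rate limit below are placeholders:

```yaml
# Layer 4: the ingress-nginx controller Service, exposed via an AWS NLB.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: https
      port: 443
      targetPort: https
---
# Layer 7: routing lives in Ingress objects, not in one ALB per service.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "50"   # per-client rate limiting
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com        # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
```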
Dynamic Autoscaling — The Biggest Win
I implemented a three-layer autoscaling strategy that replaced all static capacity:
The flow: SQS queue depth drives KEDA, which scales queue processors from 0 to 30 pods; CPU/memory metrics drive the HPA, which scales API services from 2 to 20 pods; and Karpenter provisions right-sized EC2 nodes to fit the resulting pod demand.
1. HPA for API services — scales pods based on CPU/memory thresholds.
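A representative HPA manifest under these assumptions; the 70% CPU target and the 2-to-20 replica bounds are illustrative:

```yaml
# Example HorizontalPodAutoscaler (autoscaling/v2); values are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 2                   # always keep baseline capacity
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU exceeds 70%
```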
2. KEDA for queue processors — scales SQS consumers based on queue depth. KEDA’s scale-to-zero capability was a game changer: queue processors with no messages consume zero resources, compared to the always-on EC2 instances we had before.
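A sketch of the queue-driven scaler, assuming KEDA's standard aws-sqs-queue trigger; the queue URL, thresholds, and auth reference are placeholders:

```yaml
# Illustrative KEDA ScaledObject for an SQS consumer; values are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-processor
spec:
  scaleTargetRef:
    name: queue-processor          # Deployment to scale
  minReplicaCount: 0               # scale to zero when the queue is empty
  maxReplicaCount: 30
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs
        queueLength: "5"           # target messages per replica
        awsRegion: us-east-1
      authenticationRef:
        name: keda-aws-auth        # TriggerAuthentication (e.g. via IRSA)
```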
3. Karpenter for nodes — automatically provisions optimal instance types when pods need capacity, and consolidates workloads to terminate underutilized nodes.
Zero-Downtime Deployments
All workloads use rolling updates with strict safety guarantees, combined with readiness/liveness probes and PodDisruptionBudgets for critical services.
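As a sketch, those guarantees translate into settings like these; the names, image, and health endpoint are illustrative:

```yaml
# Illustrative rollout safety settings; names and endpoints are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  selector:
    matchLabels: { app: api-service }
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0            # never drop below desired capacity
      maxSurge: 1                  # bring a new pod up before removing an old one
  template:
    metadata:
      labels: { app: api-service }
    spec:
      containers:
        - name: api
          image: api-service:latest
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
          livenessProbe:
            httpGet: { path: /healthz, port: 8080 }
---
# Prevent voluntary disruptions (e.g. node drains) from emptying the service.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service
spec:
  minAvailable: 1
  selector:
    matchLabels: { app: api-service }
```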
Observability
I deployed Prometheus + Grafana for full cluster visibility:
Pod resource utilization, HPA/KEDA scaling events, ingress metrics
Alerting on pod restart loops, OOMKills, HPA at max replicas, and queue SLA breaches
Centralized logging with correlation IDs for distributed tracing
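Two of the alert conditions above, sketched as Prometheus rules over kube-state-metrics series; the thresholds and durations are assumptions:

```yaml
# Example Prometheus alerting rules; thresholds are illustrative.
groups:
  - name: workload-health
    rules:
      - alert: PodRestartLoop
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Pod {{ $labels.pod }} is restarting repeatedly"
      - alert: HPAAtMaxReplicas
        expr: >
          kube_horizontalpodautoscaler_status_current_replicas
          >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} is pinned at max"
```

An HPA sitting at max replicas for ten minutes is the signal that the configured ceiling, not demand, is limiting the service.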
Migration Strategy
I executed the migration in 4 phases to minimize risk:
1. Internal tools — validated pipelines, autoscaling, and monitoring with non-critical services
2. Queue processors — moved SQS consumers to KEDA, immediately seeing cost reduction from scale-to-zero
3. API services — migrated customer-facing APIs with parallel traffic validation before cutover
4. Critical processes — migrated the core platform with dedicated resource quotas and PodDisruptionBudgets
Results
Zero downtime during the entire migration — all phases completed with rolling deployments
Eliminated idle compute costs — KEDA’s scale-to-zero for async processors removed always-on instances that ran at under 20% utilization
Auto-scaling from 2 to 20+ replicas — services now respond to demand in seconds, handling traffic spikes without degradation
Deployment time reduced from ~30min (SSH + manual) to ~3min (automated rolling updates via CI pipeline)
Self-healing infrastructure — automatic restarts, rescheduling on node failures, and multi-AZ distribution eliminated manual incident response for common failures