Julio Rodriguez

Securing MongoDB ReplicaSets at Scale

Context

At Ubiquo, our data layer consisted of 9 MongoDB ReplicaSets (3 members each) running on EC2, consumed by 80+ services across different tech stacks (Node.js, Bun, Java, Lambdas, Python). These clusters had no authentication enabled — connections relied entirely on network-level restrictions (Security Groups, NACLs).

I was responsible for designing and executing the full security hardening: authentication, role-based access control, secrets management, and a migration strategy that couldn’t take down any service.

The Problem I Solved

  • No database authentication: Any service or user within the VPC could connect to any database with full admin access
  • No access control: There was no distinction between a read-only auditor and a production admin — everyone had the same unlimited access
  • No secrets management: Connection strings were hardcoded in configs with no encryption or rotation capability
  • Zero tolerance for downtime: 80+ services across multiple countries couldn’t afford any interruption during the migration

My Approach

Security Architecture

I implemented a Zero Trust internal model with two security layers:

(Diagram: before vs. after. Before (open access): any service in the VPC connects to MongoDB with no auth. After (Zero Trust): app services, DBAs, and the monitoring agent authenticate to the primary via SCRAM-SHA-256 under scoped roles (admin, prod, monitoringAgent), while the primary and both secondaries authenticate to each other via keyFile.)
  • Inter-node security: KeyFile authentication between ReplicaSet members, ensuring only authorized nodes participate in replication
  • Client-to-DB security: SCRAM-SHA-256 authentication for all application connections
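Both layers map onto mongod's `security` block. A minimal sketch of what each member's config looked like after hardening (paths and the ReplicaSet name are illustrative):

```yaml
# /etc/mongod.conf (fragment)
security:
  authorization: enabled          # client-to-DB layer: enforce SCRAM / RBAC
  keyFile: /etc/mongodb/keyfile   # inter-node layer: shared secret for replication
replication:
  replSetName: rs-prod            # hypothetical ReplicaSet name
```

Setting `keyFile` also implies access control, so the two layers activate together on restart.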

Role-Based Access Control (RBAC)

Instead of giving every service admin access, I designed a standardized matrix of 8 roles applied consistently across all 9 ReplicaSets.

This follows the principle of least privilege — each role gets exactly the permissions it needs, nothing more.
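One entry in that matrix can be sketched as a MongoDB custom role (the role name and database here are hypothetical, not the actual matrix entries):

```javascript
// mongosh, run against the admin database of each ReplicaSet
db.getSiblingDB("admin").createRole({
  role: "appReadWrite",  // hypothetical role from the matrix
  privileges: [
    {
      resource: { db: "orders", collection: "" },  // all collections in "orders"
      actions: ["find", "insert", "update", "remove"]
    }
  ],
  roles: []  // no inherited roles: nothing beyond the listed actions
});
```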

Secrets Management

I centralized all credentials in AWS Secrets Manager with KMS encryption:

{
  "username": "app_user_<env>",
  "hosts": ["mongo01:27017", "mongo02:27017", "mongo03:27017"],
  "replicaSet": "rs-<env>",
  "authSource": "admin",
  "retryWrites": true,
  "w": "majority"
}
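A small helper can turn that payload into a driver-ready connection string. This is a sketch, not the exact code used in the migration; it assumes the password is stored alongside the secret and injected separately:

```python
from urllib.parse import quote_plus

def build_mongo_uri(secret: dict, password: str) -> str:
    """Assemble a MongoDB URI from a Secrets Manager payload.

    `secret` follows the JSON shape shown above; `password` comes from the
    same secret (hypothetical helper for illustration).
    """
    hosts = ",".join(secret["hosts"])
    user = quote_plus(secret["username"])    # escape reserved URI characters
    pwd = quote_plus(password)
    options = (
        f"replicaSet={secret['replicaSet']}"
        f"&authSource={secret['authSource']}"
        f"&retryWrites={str(secret['retryWrites']).lower()}"
        f"&w={secret['w']}"
    )
    return f"mongodb://{user}:{pwd}@{hosts}/?{options}"
```

Centralizing URI assembly like this keeps every one of the 80+ services consuming the same secret shape, regardless of driver or language.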

KeyFiles distributed via SSM/Ansible automation — never stored in repositories.

Zero-Downtime Migration — The Critical Challenge

With 80+ services consuming these databases, enabling auth the traditional way would have required synchronizing the database change with 80+ application deployments at the same moment. That was not an option.

I designed a hybrid pre-auth deployment strategy:

(Diagram: Phase 1, pre-auth deployment: create users and roles on the ReplicaSets, deploy the 80+ services with credentials, MongoDB accepts connections without enforcing auth. Phase 2, enable authentication: rolling restart with auth enabled, applications auto-reconnect, full RBAC enforcement active.)

Phase 1 — Pre-Auth State:

  1. Create all users and roles on the ReplicaSets (auth not yet enforced)
  2. Deploy all 80+ services with updated connection strings that include credentials
  3. MongoDB accepts all connections: the new credentials authenticate against the users created in step 1, but authorization is not yet enforced
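Step 1 can be sketched in mongosh; because authorization is not enforced yet, this runs without credentials (user name and role assignment are illustrative):

```javascript
// Pre-auth: the user exists but is not yet enforced
db.getSiblingDB("admin").createUser({
  user: "app_user_prod",
  pwd: passwordPrompt(),  // prompt instead of leaving the password in shell history
  roles: [{ role: "readWrite", db: "orders" }]  // hypothetical assignment
});
```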

Phase 2 — Post-Auth State:

  1. Enable authentication on each ReplicaSet using a rolling restart
  2. Applications automatically reconnect with valid credentials after a brief failover
  3. Zero coordination needed on activation day

This eliminated the biggest risk: the need to synchronize database changes with 80 application deployments.

Rolling Restart Process

For each ReplicaSet, I followed this sequence:

  1. Distribute keyFile to secondary nodes
  2. Restart secondaries with security.authorization: enabled and security.keyFile
  3. Secondaries rejoin the ReplicaSet with internal auth
  4. Trigger rs.stepDown() on the primary
  5. Apply configuration to the former primary
  6. Create initial superAdmin user via localhost exception
  7. Create all standardized roles and application users

This maintained quorum throughout: no ReplicaSet lost availability at any point.
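Steps 4 and 6 above can be sketched in mongosh (the step-down timeout and names are illustrative):

```javascript
// Step 4: on the current primary, hand off leadership so it can be restarted
rs.stepDown(60);  // refuse re-election for 60 seconds

// Step 6: after restarting the former primary with auth enabled, connect
// from the node itself (localhost exception) and create the first admin:
db.getSiblingDB("admin").createUser({
  user: "superAdmin",
  pwd: passwordPrompt(),
  roles: [{ role: "root", db: "admin" }]
});
```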

Risk Mitigation

I addressed every identified risk before starting:

  • Auth failure blocking APIs: Rollback scripts ready to disable auth in mongod.conf within seconds
  • Lambda cold starts: Validated that MongoDB clients are initialized outside the handler, so warm invocations reuse existing connections instead of re-authenticating
  • Handshake latency: Configured connection pools to avoid per-request authentication overhead
  • KeyFile loss: Encrypted backup in Secrets Manager with restricted IAM access
  • Driver incompatibility: Pre-migration audit of all driver versions across 80 projects for SCRAM-SHA-256 support
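The Lambda and pooling items above reduce to one pattern: create the client once per container, outside the handler, so only cold starts pay the connection and auth handshake cost. A minimal sketch, with an injected factory standing in for the real pymongo client so the pattern can be shown without a live database:

```python
# Hypothetical Lambda module; `factory` stands in for e.g.
# `lambda: MongoClient(uri)` in the real services.
_client = None  # lives for the lifetime of the Lambda container

def get_client(factory):
    """Create the client on first use, reuse it on every warm invocation."""
    global _client
    if _client is None:
        _client = factory()  # paid only on cold start
    return _client

def handler(event, context, factory=lambda: object()):
    client = get_client(factory)  # no new connection on warm starts
    return {"client_id": id(client)}
```

The same reasoning applies to the long-running services: a pooled client authenticates once per pooled connection, not once per request.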

Phased Rollout

I executed the deployment progressively:

  1. MVP: 1 Development ReplicaSet with 5 services + 1 Lambda — validated the full flow
  2. Expansion: Remaining development environments
  3. Production canary: First production ReplicaSet with intensive monitoring
  4. Full rollout: Remaining production ReplicaSets in batches

Each phase included validation of authentication, permission enforcement, failover behavior, and automatic reconnection.
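A quick spot check used for the enforcement validation: after auth activation, an unauthenticated session must be rejected (host and database names are illustrative):

```javascript
// mongosh "mongodb://mongo01:27017/?replicaSet=rs-prod"  -- no credentials
db.getSiblingDB("orders").runCommand({ find: "orders" });
// expect an Unauthorized "requires authentication" error instead of results
```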

Results

  • Zero downtime — all 9 ReplicaSets hardened without a single service interruption
  • 80+ services migrated across Node.js, Bun, Java, Lambda, and Python — all reconnected automatically after auth activation
  • Least-privilege access enforced — moved from open access to 8 standardized roles, eliminating unauthorized administrative access
  • Centralized secrets management — all credentials encrypted in AWS Secrets Manager with KMS, replacing hardcoded connection strings
  • Audit-ready security posture — RBAC matrix and secrets rotation capability meet compliance requirements
  • Reusable playbook — the pre-auth deployment strategy and rolling restart process became the standard for future database security initiatives

Key Takeaways

  1. Pre-auth deployment eliminates coordination nightmares — deploy credentials before enforcing auth to avoid synchronized cutover across dozens of services
  2. Standardized roles reduce operational burden — one role matrix across all environments simplifies onboarding, auditing, and incident response
  3. Rolling restarts preserve availability — never take down more than one ReplicaSet member at a time
  4. Secrets management is non-negotiable — AWS Secrets Manager + KMS provides encryption, rotation, and audit trails
  5. Progressive rollout catches issues early — starting with dev, then canary production, then full rollout prevented potential incidents

Tools & Technologies

  • MongoDB 7.0 — SCRAM-SHA-256, RBAC, keyFile internal auth
  • AWS EC2 — ReplicaSet hosting
  • AWS Secrets Manager + KMS — Credential storage and encryption
  • AWS SSM / Ansible — Automated keyFile distribution
  • CheckMK + CloudWatch — Monitoring and alerting
  • GitLab CI / GitHub Actions — Deployment pipelines for the 80+ services