Julio Rodriguez

Securing MongoDB ReplicaSets at Scale

Context

At Ubiquo, our data layer consisted of 9 MongoDB ReplicaSets (3 members each) running on EC2, consumed by 80+ services across different tech stacks (Node.js, Bun, Java, Lambdas, Python). These clusters had no authentication enabled — connections relied entirely on network-level restrictions (Security Groups, NACLs).

I was responsible for designing and executing the full security hardening: authentication, role-based access control, secrets management, and a migration strategy that couldn’t take down any service.

The Problem I Solved

  • No database authentication: Any service or user within the VPC could connect to any database with full admin access
  • No access control: There was no distinction between a read-only auditor and a production admin — everyone had the same unlimited access
  • No secrets management: Connection strings were hardcoded in configs with no encryption or rotation capability
  • Zero tolerance for downtime: 80+ services across multiple countries couldn’t afford any interruption during the migration

My Approach

Security Architecture

I implemented a Zero Trust internal model with two security layers:

(Diagram: before vs. after. Before (open access): any service in the VPC connects to MongoDB with no auth. After (Zero Trust): app services, DBAs, and the monitoring agent authenticate to the primary via SCRAM-SHA-256 under scoped roles (admin, prod, monitoringAgent), while the primary and both secondaries authenticate to each other via keyFile.)
  • Inter-node security: KeyFile authentication between ReplicaSet members, ensuring only authorized nodes participate in replication
  • Client-to-DB security: SCRAM-SHA-256 authentication for all application connections
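Both layers map onto mongod's `security` block. A minimal sketch of what each member's config looked like after hardening (paths and the ReplicaSet name are illustrative):

```yaml
# /etc/mongod.conf (fragment)
security:
  authorization: enabled          # client-to-DB layer: enforce SCRAM / RBAC
  keyFile: /etc/mongodb/keyfile   # inter-node layer: shared secret for replication
replication:
  replSetName: rs-prod            # hypothetical ReplicaSet name
```

Setting `keyFile` also implies access control, so the two layers activate together on restart.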

Role-Based Access Control (RBAC)

Instead of giving every service admin access, I designed a standardized matrix of 8 roles applied consistently across all 9 ReplicaSets.

This follows the principle of least privilege — each role gets exactly the permissions it needs, nothing more.
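One entry in that matrix can be sketched as a MongoDB custom role (the role name and database here are hypothetical, not the actual matrix entries):

```javascript
// mongosh, run against the admin database of each ReplicaSet
db.getSiblingDB("admin").createRole({
  role: "appReadWrite",  // hypothetical role from the matrix
  privileges: [
    {
      resource: { db: "orders", collection: "" },  // all collections in "orders"
      actions: ["find", "insert", "update", "remove"]
    }
  ],
  roles: []  // no inherited roles: nothing beyond the listed actions
});
```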

Secrets Management

I centralized all credentials in AWS Secrets Manager with KMS encryption:

{
  "username": "app_user_<env>",
  "hosts": ["mongo01:27017", "mongo02:27017", "mongo03:27017"],
  "replicaSet": "rs-<env>",
  "authSource": "admin",
  "retryWrites": true,
  "w": "majority"
}
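A small helper can turn that payload into a driver-ready connection string. This is a sketch, not the exact code used in the migration; it assumes the password is stored alongside the secret and injected separately:

```python
from urllib.parse import quote_plus

def build_mongo_uri(secret: dict, password: str) -> str:
    """Assemble a MongoDB URI from a Secrets Manager payload.

    `secret` follows the JSON shape shown above; `password` comes from the
    same secret (hypothetical helper for illustration).
    """
    hosts = ",".join(secret["hosts"])
    user = quote_plus(secret["username"])    # escape reserved URI characters
    pwd = quote_plus(password)
    options = (
        f"replicaSet={secret['replicaSet']}"
        f"&authSource={secret['authSource']}"
        f"&retryWrites={str(secret['retryWrites']).lower()}"
        f"&w={secret['w']}"
    )
    return f"mongodb://{user}:{pwd}@{hosts}/?{options}"
```

Centralizing URI assembly like this keeps every one of the 80+ services consuming the same secret shape, regardless of driver or language.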

KeyFiles distributed via SSM/Ansible automation — never stored in repositories.

Zero-Downtime Migration — The Critical Challenge

With 80+ services consuming these databases, enabling auth the traditional way would have required synchronizing the database change with 80+ application deployments at the same moment. That was not an option.

I designed a hybrid pre-auth deployment strategy:

(Diagram: Phase 1, pre-auth deployment: create users and roles on the ReplicaSets, deploy the 80+ services with credentials, MongoDB accepts connections without enforcing auth. Phase 2, enable authentication: rolling restart with auth enabled, applications auto-reconnect, full RBAC enforcement active.)

Phase 1 — Pre-Auth State:

  1. Create all users and roles on the ReplicaSets (auth not yet enforced)
  2. Deploy all 80+ services with updated connection strings that include credentials
  3. MongoDB accepts all connections: the new credentials authenticate against the users created in step 1, but authorization is not yet enforced
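Step 1 can be sketched in mongosh; because authorization is not enforced yet, this runs without credentials (user name and role assignment are illustrative):

```javascript
// Pre-auth: the user exists but is not yet enforced
db.getSiblingDB("admin").createUser({
  user: "app_user_prod",
  pwd: passwordPrompt(),  // prompt instead of leaving the password in shell history
  roles: [{ role: "readWrite", db: "orders" }]  // hypothetical assignment
});
```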

Phase 2 — Post-Auth State:

  1. Enable authentication on each ReplicaSet using a rolling restart
  2. Applications automatically reconnect with valid credentials after a brief failover
  3. Zero coordination needed on activation day

This eliminated the biggest risk: the need to synchronize database changes with 80 application deployments.

Rolling Restart Process

For each ReplicaSet, I followed this sequence:

  1. Distribute keyFile to secondary nodes
  2. Restart secondaries with security.authorization: enabled and security.keyFile
  3. Secondaries rejoin the ReplicaSet with internal auth
  4. Trigger rs.stepDown() on the primary
  5. Apply configuration to the former primary
  6. Create initial superAdmin user via localhost exception
  7. Create all standardized roles and application users

This maintained quorum throughout: no ReplicaSet lost availability at any point.
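Steps 4 and 6 above can be sketched in mongosh (the step-down timeout and names are illustrative):

```javascript
// Step 4: on the current primary, hand off leadership so it can be restarted
rs.stepDown(60);  // refuse re-election for 60 seconds

// Step 6: after restarting the former primary with auth enabled, connect
// from the node itself (localhost exception) and create the first admin:
db.getSiblingDB("admin").createUser({
  user: "superAdmin",
  pwd: passwordPrompt(),
  roles: [{ role: "root", db: "admin" }]
});
```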

Risk Mitigation

I addressed every identified risk before starting:

  • Auth failure blocking APIs: Rollback scripts ready to disable auth in mongod.conf within seconds
  • Lambda cold starts: Validated that MongoDB clients are initialized outside the handler, so warm invocations reuse existing connections instead of re-authenticating
  • Handshake latency: Configured connection pools to avoid per-request authentication overhead
  • KeyFile loss: Encrypted backup in Secrets Manager with restricted IAM access
  • Driver incompatibility: Pre-migration audit of all driver versions across 80 projects for SCRAM-SHA-256 support
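The Lambda and pooling items above reduce to one pattern: create the client once per container, outside the handler, so only cold starts pay the connection and auth handshake cost. A minimal sketch, with an injected factory standing in for the real pymongo client so the pattern can be shown without a live database:

```python
# Hypothetical Lambda module; `factory` stands in for e.g.
# `lambda: MongoClient(uri)` in the real services.
_client = None  # lives for the lifetime of the Lambda container

def get_client(factory):
    """Create the client on first use, reuse it on every warm invocation."""
    global _client
    if _client is None:
        _client = factory()  # paid only on cold start
    return _client

def handler(event, context, factory=lambda: object()):
    client = get_client(factory)  # no new connection on warm starts
    return {"client_id": id(client)}
```

The same reasoning applies to the long-running services: a pooled client authenticates once per pooled connection, not once per request.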

Phased Rollout

I executed the deployment progressively:

  1. MVP: 1 Development ReplicaSet with 5 services + 1 Lambda — validated the full flow
  2. Expansion: Remaining development environments
  3. Production canary: First production ReplicaSet with intensive monitoring
  4. Full rollout: Remaining production ReplicaSets in batches

Each phase included validation of authentication, permission enforcement, failover behavior, and automatic reconnection.
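A quick spot check used for the enforcement validation: after auth activation, an unauthenticated session must be rejected (host and database names are illustrative):

```javascript
// mongosh "mongodb://mongo01:27017/?replicaSet=rs-prod"  -- no credentials
db.getSiblingDB("orders").runCommand({ find: "orders" });
// expect an Unauthorized "requires authentication" error instead of results
```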

Results

  • Zero downtime — all 9 ReplicaSets hardened without a single service interruption
  • 80+ services migrated across Node.js, Bun, Java, Lambda, and Python — all reconnected automatically after auth activation
  • Least-privilege access enforced — moved from open access to 8 standardized roles, eliminating unauthorized administrative access
  • Centralized secrets management — all credentials encrypted in AWS Secrets Manager with KMS, replacing hardcoded connection strings
  • Audit-ready security posture — RBAC matrix and secrets rotation capability meet compliance requirements
  • Reusable playbook — the pre-auth deployment strategy and rolling restart process became the standard for future database security initiatives

Key Takeaways

  1. Pre-auth deployment eliminates coordination nightmares — deploy credentials before enforcing auth to avoid synchronized cutover across dozens of services
  2. Standardized roles reduce operational burden — one role matrix across all environments simplifies onboarding, auditing, and incident response
  3. Rolling restarts preserve availability — never take down more than one ReplicaSet member at a time
  4. Secrets management is non-negotiable — AWS Secrets Manager + KMS provides encryption, rotation, and audit trails
  5. Progressive rollout catches issues early — starting with dev, then canary production, then full rollout prevented potential incidents

Tools & Technologies

  • MongoDB 7.0 — SCRAM-SHA-256, RBAC, keyFile internal auth
  • AWS EC2 — ReplicaSet hosting
  • AWS Secrets Manager + KMS — Credential storage and encryption
  • AWS SSM / Ansible — Automated keyFile distribution
  • CheckMK + CloudWatch — Monitoring and alerting
  • GitLab CI / GitHub Actions — Deployment pipelines for the 80+ services