Securing MongoDB ReplicaSets at Scale
Context
At Ubiquo, our data layer consisted of 9 MongoDB ReplicaSets (3 members each) running on EC2, consumed by 80+ services across different tech stacks (Node.js, Bun, Java, Lambdas, Python). These clusters had no authentication enabled — connections relied entirely on network-level restrictions (Security Groups, NACLs).
I was responsible for designing and executing the full security hardening: authentication, role-based access control, secrets management, and a migration strategy that couldn’t take down any service.
The Problem I Solved
- No database authentication: Any service or user within the VPC could connect to any database with full admin access
- No access control: There was no distinction between a read-only auditor and a production admin — everyone had the same unlimited access
- No secrets management: Connection strings were hardcoded in configs with no encryption or rotation capability
- Zero tolerance for downtime: 80+ services across multiple countries couldn’t afford any interruption during the migration
My Approach
Security Architecture
I implemented a Zero Trust internal model with two security layers:
- Inter-node security: KeyFile authentication between ReplicaSet members, ensuring only authorized nodes participate in replication
- Client-to-DB security: SCRAM-SHA-256 authentication for all application connections
Role-Based Access Control (RBAC)
Instead of giving every service admin access, I designed a standardized role matrix applied consistently across all 9 ReplicaSets.
This follows the principle of least privilege — each role gets exactly the permissions it needs, nothing more.
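As an illustration of how a role matrix like this can be expressed in code, the sketch below maps role names to allowed actions and turns each entry into a document for MongoDB's createRole command. The role names, the "orders" database, and the action lists are hypothetical placeholders, not the roles actually used in this project.

```python
from typing import Dict, List

# Hypothetical role matrix: role name -> actions allowed on one database.
# The names and action lists are illustrative, not the project's real roles.
ROLE_MATRIX: Dict[str, List[str]] = {
    "app_read_only": ["find"],
    "app_read_write": ["find", "insert", "update", "remove"],
}


def create_role_command(role: str, database: str) -> dict:
    """Build the document for MongoDB's createRole command."""
    return {
        "createRole": role,
        "privileges": [
            {
                # An empty collection name targets every collection in the db.
                "resource": {"db": database, "collection": ""},
                "actions": ROLE_MATRIX[role],
            }
        ],
        "roles": [],  # no inherited roles: each role lists exactly what it grants
    }


commands = [create_role_command(name, "orders") for name in ROLE_MATRIX]
```

Keeping the matrix as data means the same definitions can be replayed against every ReplicaSet, for example by passing each document to a driver's generic command call.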
Secrets Management
I centralized all credentials in AWS Secrets Manager with KMS encryption:
{
"username": "app_user_<env>",
"hosts": ["mongo01:27017", "mongo02:27017", "mongo03:27017"],
"replicaSet": "rs-<env>",
"authSource": "admin",
"retryWrites": true,
"w": "majority"
}
KeyFiles were distributed through SSM/Ansible automation and never stored in repositories.
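A service can assemble its connection string from a secret payload shaped like the one above. The following is a minimal sketch using only the Python standard library; the "password" field and the sample values are assumptions added for illustration, since the payload above does not show a password.

```python
import json
from urllib.parse import quote_plus, urlencode


def build_mongo_uri(secret_json: str) -> str:
    """Turn a secret payload shaped like the one above into a MongoDB URI."""
    secret = json.loads(secret_json)
    # Percent-encode credentials so characters such as '@' or ':' are safe.
    user = quote_plus(secret["username"])
    password = quote_plus(secret["password"])  # assumed field, not shown above
    hosts = ",".join(secret["hosts"])
    options = urlencode(
        {
            "replicaSet": secret["replicaSet"],
            "authSource": secret["authSource"],
            "retryWrites": str(secret["retryWrites"]).lower(),
            "w": secret["w"],
        }
    )
    return f"mongodb://{user}:{password}@{hosts}/?{options}"


example = json.dumps(
    {
        "username": "app_user_dev",
        "password": "p@ss:word",  # placeholder value for illustration
        "hosts": ["mongo01:27017", "mongo02:27017", "mongo03:27017"],
        "replicaSet": "rs-dev",
        "authSource": "admin",
        "retryWrites": True,
        "w": "majority",
    }
)
uri = build_mongo_uri(example)
```

In a deployed service the JSON string would come from Secrets Manager rather than a literal, for instance from the SecretString field returned by boto3's get_secret_value call.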
Zero-Downtime Migration — The Critical Challenge
With 80+ services consuming these databases, enabling auth in the traditional way would require exact coordination between database changes and 80 application deployments simultaneously. That was not an option.
I designed a hybrid pre-auth deployment strategy:
Phase 1 — Pre-Auth State:
- Create all users and roles on the ReplicaSets (auth not yet enforced)
- Deploy all 80+ services with updated connection strings that include credentials
- MongoDB keeps accepting connections: credentials sent by clients are checked against the users already created, but clients are not yet required to authenticate
Phase 2 — Post-Auth State:
- Enable authentication on each ReplicaSet using a rolling restart
- Applications automatically reconnect with valid credentials after brief failover
- Zero coordination needed on activation day
This eliminated the biggest risk: the need to synchronize database changes with 80 application deployments.
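One way to decide when Phase 1 is complete is a simple readiness gate over the service inventory. The sketch below is hypothetical: the ServiceStatus fields and the service names are invented for illustration and do not describe the tooling actually used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceStatus:
    name: str
    deployed_with_credentials: bool  # Phase 1 deploy finished
    connection_verified: bool        # service confirmed healthy after that deploy


def blockers_for_activation(services: List[ServiceStatus]) -> List[str]:
    """Return names of services that are not ready for auth enforcement."""
    return [
        s.name
        for s in services
        if not (s.deployed_with_credentials and s.connection_verified)
    ]


# Hypothetical inventory used only to exercise the function.
inventory = [
    ServiceStatus("billing-api", True, True),
    ServiceStatus("report-lambda", True, False),
    ServiceStatus("legacy-worker", False, False),
]
pending = blockers_for_activation(inventory)
```

Authentication would be enabled on a ReplicaSet only once this list is empty for every service that connects to it.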
Rolling Restart Process
For each ReplicaSet, I followed this sequence:
- Distribute keyFile to secondary nodes
- Restart secondaries with security.authorization: enabled and security.keyFile
- Secondaries rejoin the ReplicaSet with internal auth
- Trigger rs.stepDown() on the primary
- Apply configuration to the former primary
- Create initial superAdmin user via localhost exception
- Create all standardized roles and application users
This maintains quorum throughout — no ReplicaSet lost availability at any point.
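The ordering in the sequence above can be sketched as a small planning function: secondaries are restarted one at a time, and the primary is touched last, after it has stepped down. The host names and step wording are placeholders, and the function only produces a plan; it does not talk to MongoDB.

```python
from typing import Dict, List


def rolling_restart_plan(members: Dict[str, str]) -> List[str]:
    """Order the steps so only one member is down at a time.

    `members` maps a host name to its role: "PRIMARY" or "SECONDARY".
    """
    primaries = [h for h, role in members.items() if role == "PRIMARY"]
    if len(primaries) != 1:
        raise ValueError("expected exactly one primary")
    steps: List[str] = []
    # Secondaries first, one by one, so a majority always stays up.
    for host in sorted(h for h, r in members.items() if r == "SECONDARY"):
        steps.append(f"restart {host} with auth config")
        steps.append(f"wait for {host} to report SECONDARY")
    # Hand over the primary role before touching the last node.
    steps.append(f"rs.stepDown() on {primaries[0]}")
    steps.append(f"restart {primaries[0]} with auth config")
    return steps


plan = rolling_restart_plan(
    {"mongo01": "PRIMARY", "mongo02": "SECONDARY", "mongo03": "SECONDARY"}
)
```

With three members, waiting for each restarted node to rejoin before moving on keeps two of three voting members available at every step.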
Risk Mitigation
I addressed every identified risk before starting:
- Auth failure blocking APIs: Rollback scripts ready to disable auth in mongod.conf within seconds
- Lambda cold starts: Validated connection pooling outside the handler to survive cold starts
- Handshake latency: Configured connection pools to avoid per-request authentication overhead
- KeyFile loss: Encrypted backup in Secrets Manager with restricted IAM access
- Driver incompatibility: Pre-migration audit of all driver versions across 80 projects for SCRAM-SHA-256 support
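The Lambda mitigation above relies on creating the database client outside the handler so that warm invocations reuse it. A minimal sketch of that pattern follows; the factory argument and the fake client are assumptions made so the example runs without a driver installed, and a real handler would take only (event, context).

```python
from typing import Any, Callable, List, Optional

# Module-level cache: survives between invocations while the Lambda
# execution environment stays warm.
_client: Optional[Any] = None


def get_client(factory: Callable[[], Any]) -> Any:
    """Create the database client once and reuse it on later invocations."""
    global _client
    if _client is None:
        _client = factory()  # e.g. a function that builds a driver client
    return _client


def handler(event: dict, context: Any, factory: Callable[[], Any]) -> Any:
    # `factory` is passed in only to keep the sketch testable.
    return get_client(factory)


# Stand-in factory that records how many times a client is built.
created: List[int] = []


def fake_factory() -> object:
    created.append(1)
    return object()


first = handler({}, None, fake_factory)
second = handler({}, None, fake_factory)
```

Because the authenticated client is built once per execution environment, the SCRAM handshake cost is paid on a cold start rather than on every request.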
Phased Rollout
I executed the deployment progressively:
- MVP: 1 Development ReplicaSet with 5 services + 1 Lambda — validated the full flow
- Expansion: Remaining development environments
- Production canary: First production ReplicaSet with intensive monitoring
- Full rollout: Remaining production ReplicaSets in batches
Each phase included validation of authentication, permission enforcement, failover behavior, and automatic reconnection.
Results
- Zero downtime — all 9 ReplicaSets hardened without a single service interruption
- 80+ services migrated across Node.js, Bun, Java, Lambda, and Python — all reconnected automatically after auth activation
- Least-privilege access enforced — moved from open access to 8 standardized roles, eliminating unauthorized administrative access
- Centralized secrets management — all credentials encrypted in AWS Secrets Manager with KMS, replacing hardcoded connection strings
- Audit-ready security posture — RBAC matrix and secrets rotation capability meet compliance requirements
- Reusable playbook — the pre-auth deployment strategy and rolling restart process became the standard for future database security initiatives
Key Takeaways
- Pre-auth deployment eliminates coordination nightmares — deploy credentials before enforcing auth to avoid synchronized cutover across dozens of services
- Standardized roles reduce operational burden — one role matrix across all environments simplifies onboarding, auditing, and incident response
- Rolling restarts preserve availability — never take down more than one ReplicaSet member at a time
- Secrets management is non-negotiable — AWS Secrets Manager + KMS provides encryption, rotation, and audit trails
- Progressive rollout catches issues early — starting with dev, then canary production, then full rollout prevented potential incidents
Tools & Technologies
- MongoDB 7.0 — SCRAM-SHA-256, RBAC, keyFile internal auth
- AWS EC2 — ReplicaSet hosting
- AWS Secrets Manager + KMS — Credential storage and encryption
- AWS SSM / Ansible — Automated keyFile distribution
- CheckMK + CloudWatch — Monitoring and alerting
- GitLab CI / GitHub Actions — Deployment pipelines for the 80+ services