Context
A single EBS volume failure froze an internal dashboard used by 200+ support agents, and no alerts fired. I inherited the mess and was told, 'Make sure this never happens again.'
Threats
- Tight coupling between compute, storage, and cron jobs.
- No metrics or alerts meant outages surfaced via Slack complaints.
- Manual backup/restore procedures that required human intervention at 2 a.m.
- Shared IAM roles with god-mode permissions.
Approach
- Split the monolith into loosely coupled services behind ALBs, each running in its own Auto Scaling Group spanning two AZs.
- Migrated persistence to Amazon RDS Multi-AZ with automatic failover and PITR; nightly snapshots replicate to another region.
- Stored assets + backups in versioned S3 buckets with lifecycle policies and cross-region replication.
- Codified everything in Terraform, including IAM least-privilege roles, so rebuilds are deterministic; representative sketches of the key resources follow this list.
- Implemented CloudWatch metrics + alarms feeding PagerDuty, plus distributed tracing via AWS X-Ray/Grafana.
- Ran monthly game days that simulate EBS loss, AZ failure, and credential compromise to keep the playbooks sharp.
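A minimal Terraform sketch of one of the split-out services: an Auto Scaling Group spread across private subnets in two AZs, registered with an ALB target group so instances the load balancer marks unhealthy are replaced automatically. Resource names, ports, sizes, and variable values are illustrative placeholders, not the production configuration.

```hcl
variable "vpc_id" {}
variable "ami_id" {}
variable "public_subnet_ids" {
  type = list(string) # one public subnet per AZ, for the ALB
}
variable "private_subnet_ids" {
  type = list(string) # one private subnet per AZ, for the instances
}

resource "aws_lb" "dashboard" {
  name               = "dashboard-alb"
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids
}

resource "aws_lb_target_group" "dashboard" {
  name     = "dashboard-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path = "/healthz"
  }
}

resource "aws_launch_template" "dashboard" {
  name_prefix   = "dashboard-"
  image_id      = var.ami_id
  instance_type = "t3.medium"
}

resource "aws_autoscaling_group" "dashboard" {
  min_size            = 2
  max_size            = 6
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids # subnets in two AZs
  target_group_arns   = [aws_lb_target_group.dashboard.arn]
  health_check_type   = "ELB" # replace instances the ALB reports as unhealthy

  launch_template {
    id      = aws_launch_template.dashboard.id
    version = "$Latest"
  }
}
```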
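The database layer under the same assumptions (placeholder identifier, engine, and sizing). multi_az provisions a synchronous standby in a second AZ with automatic failover, and the backup retention window is what enables point-in-time recovery; the nightly cross-region snapshot copy mentioned above is scheduled separately and not shown here.

```hcl
variable "db_username" {}
variable "db_password" {
  sensitive = true
}

resource "aws_db_instance" "dashboard" {
  identifier                = "dashboard-db"
  engine                    = "postgres"
  instance_class            = "db.m6g.large"
  allocated_storage         = 100
  username                  = var.db_username
  password                  = var.db_password
  multi_az                  = true # synchronous standby in a second AZ, automatic failover
  backup_retention_period   = 14   # automated backups enable point-in-time recovery
  deletion_protection       = true
  skip_final_snapshot       = false
  final_snapshot_identifier = "dashboard-db-final"
}
```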
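The versioned asset and backup bucket with a lifecycle rule and cross-region replication, again with placeholder names; the replication role and the destination bucket in the DR region are assumed to exist and are passed in as variables.

```hcl
variable "replication_role_arn" {} # IAM role S3 assumes to replicate objects
variable "replica_bucket_arn" {}   # versioned bucket in the DR region

resource "aws_s3_bucket" "assets" {
  bucket = "example-dashboard-assets" # placeholder; bucket names are globally unique
}

resource "aws_s3_bucket_versioning" "assets" {
  bucket = aws_s3_bucket.assets.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "assets" {
  bucket = aws_s3_bucket.assets.id

  rule {
    id     = "expire-old-versions"
    status = "Enabled"

    filter {} # apply to every object

    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}

resource "aws_s3_bucket_replication_configuration" "assets" {
  bucket = aws_s3_bucket.assets.id
  role   = var.replication_role_arn

  rule {
    id     = "crr-to-dr-region"
    status = "Enabled"

    destination {
      bucket = var.replica_bucket_arn
    }
  }

  # Replication requires versioning on the source bucket first.
  depends_on = [aws_s3_bucket_versioning.assets]
}
```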
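One of the least-privilege roles as an illustration: the service can read only its own S3 prefix and publish metrics, in place of the shared god-mode role it replaced. The role name and prefix are hypothetical.

```hcl
variable "assets_bucket_arn" {} # ARN of the asset bucket above

data "aws_iam_policy_document" "dashboard_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "dashboard" {
  name               = "dashboard-service"
  assume_role_policy = data.aws_iam_policy_document.dashboard_assume.json
}

data "aws_iam_policy_document" "dashboard" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["${var.assets_bucket_arn}/dashboard/*"] # this service's prefix only
  }

  statement {
    actions   = ["cloudwatch:PutMetricData"]
    resources = ["*"] # PutMetricData does not support resource-level scoping
  }
}

resource "aws_iam_role_policy" "dashboard" {
  name   = "dashboard-service"
  role   = aws_iam_role.dashboard.id
  policy = data.aws_iam_policy_document.dashboard.json
}
```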
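A representative alarm: sustained 5xx responses from the ALB in the first sketch publish to an SNS topic that PagerDuty subscribes to (the PagerDuty subscription endpoint is configured out of band). The metric, threshold, and evaluation periods are illustrative, not the tuned production values.

```hcl
resource "aws_sns_topic" "pagerduty" {
  name = "dashboard-critical"
}

resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
  alarm_name          = "dashboard-alb-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 3
  threshold           = 10
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"

  dimensions = {
    LoadBalancer = aws_lb.dashboard.arn_suffix # the ALB from the first sketch
  }

  alarm_actions = [aws_sns_topic.pagerduty.arn]
  ok_actions    = [aws_sns_topic.pagerduty.arn]
}
```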
Outcome
During the next incident, an intentional AZ failure, the system recovered in under 90 seconds with zero pager noise. MTTR dropped from hours to minutes, and leadership finally trusted the platform enough to onboard another business unit.
Lessons Learned
Reliability is an attitude, not a feature. By assuming every dependency fails, we made resilience boring — which is exactly what mission-critical dashboards need.