- Problem: Rebuilt a brittle monolith into multi-AZ microservices with observability and game-day drills.
- Stack: AWS • DevOps • Reliability • Monitoring
- Focus: AWS • DevOps • Reliability
- Results: The next incident (intentional AZ failure) healed in under 90 seconds with zero pager noise.
Problem
Rebuilt a brittle monolith into multi-AZ microservices with observability and game-day drills.
Context
A single EBS volume failure froze an internal dashboard used by 200+ support agents, and no alerts fired. I inherited the mess and was told, 'Make sure this never happens again.'
Self-healing AWS architecture with multi-AZ resilience
Services run in multi-AZ Auto Scaling Groups with RDS Multi-AZ.
Backups and replication make recovery deterministic.
Observability and game-day drills for reliability
CloudWatch, X-Ray, and Grafana surface issues quickly.
Monthly game days keep incident response sharp.
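As a concrete illustration of the alerting path, here is a minimal Terraform sketch of a CloudWatch alarm publishing to an SNS topic that PagerDuty subscribes to. The metric, threshold, topic name, and the `pagerduty_integration_url` and `alb_arn_suffix` variables are illustrative assumptions, not the production configuration.

```hcl
variable "pagerduty_integration_url" { type = string } # HTTPS endpoint from PagerDuty's CloudWatch integration
variable "alb_arn_suffix" { type = string }            # e.g. "app/dashboard-alb/abc123"

# SNS topic that carries alarm notifications into PagerDuty.
resource "aws_sns_topic" "alerts" {
  name = "dashboard-alerts"
}

resource "aws_sns_topic_subscription" "pagerduty" {
  topic_arn              = aws_sns_topic.alerts.arn
  protocol               = "https"
  endpoint               = var.pagerduty_integration_url
  endpoint_auto_confirms = true # PagerDuty confirms the subscription automatically
}

# Page when the ALB sees a sustained burst of 5xx responses from the service.
resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
  alarm_name          = "dashboard-alb-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 3
  threshold           = 10
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
```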
Architecture
- Split the monolith into loosely coupled services behind ALBs, each running in its own Auto Scaling Group spanning two AZs (first Terraform sketch after this list).
- Migrated persistence to Amazon RDS Multi-AZ with automatic failover and point-in-time recovery (PITR); nightly snapshots replicate to another region (second sketch below).
- Stored assets and backups in versioned S3 buckets with lifecycle policies and cross-region replication (third sketch below).
- Codified everything in Terraform, including least-privilege IAM roles, so rebuilds are deterministic.
- Implemented CloudWatch metrics + alarms feeding PagerDuty, plus distributed tracing via AWS X-Ray/Grafana.
- Ran monthly game days that simulate EBS loss, AZ failure, and credential compromise to keep the playbooks sharp.
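A minimal Terraform sketch of the compute-tier pattern: one service behind an ALB, with an Auto Scaling Group spread across private subnets in two AZs and ELB health checks so unhealthy instances are replaced automatically. Resource names, sizing, ports, and the variables are placeholders, not the actual module.

```hcl
variable "vpc_id" { type = string }
variable "public_subnet_ids" { type = list(string) }  # one public subnet per AZ
variable "private_subnet_ids" { type = list(string) } # one private subnet per AZ
variable "ami_id" { type = string }

resource "aws_lb" "dashboard" {
  name               = "dashboard-alb"
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids
}

resource "aws_lb_target_group" "dashboard" {
  name     = "dashboard-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/healthz"
    interval            = 15
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.dashboard.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.dashboard.arn
  }
}

resource "aws_launch_template" "dashboard" {
  name_prefix   = "dashboard-"
  image_id      = var.ami_id
  instance_type = "t3.medium"
}

# Two-AZ Auto Scaling Group; ELB health checks let the ASG replace
# instances the load balancer marks unhealthy.
resource "aws_autoscaling_group" "dashboard" {
  name                      = "dashboard-asg"
  min_size                  = 2
  max_size                  = 6
  desired_capacity          = 2
  vpc_zone_identifier       = var.private_subnet_ids
  target_group_arns         = [aws_lb_target_group.dashboard.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 120

  launch_template {
    id      = aws_launch_template.dashboard.id
    version = "$Latest"
  }
}
```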
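A second sketch covers the persistence tier: an RDS instance with `multi_az = true` for automatic failover and a backup retention window that enables PITR. Engine, sizing, and credential handling are assumptions; cross-region snapshot copies are configured separately and omitted here.

```hcl
variable "private_subnet_ids" { type = list(string) } # subnets in at least two AZs
variable "db_username" { type = string }
variable "db_password" {
  type      = string
  sensitive = true # in practice sourced from Secrets Manager or SSM, not plain variables
}

resource "aws_db_subnet_group" "dashboard" {
  name       = "dashboard-db"
  subnet_ids = var.private_subnet_ids
}

resource "aws_db_instance" "dashboard" {
  identifier        = "dashboard-db"
  engine            = "postgres"
  instance_class    = "db.m6g.large"
  allocated_storage = 100

  multi_az             = true # synchronous standby in a second AZ, automatic failover
  db_subnet_group_name = aws_db_subnet_group.dashboard.name

  backup_retention_period = 14            # automated backups enable PITR within this window
  backup_window           = "03:00-04:00" # off-peak; cross-region snapshot copy runs separately

  deletion_protection       = true
  skip_final_snapshot       = false
  final_snapshot_identifier = "dashboard-db-final"

  username = var.db_username
  password = var.db_password
}
```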
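A third sketch shows the versioned asset/backup bucket with a lifecycle policy. The bucket name and transition days are placeholders, and cross-region replication (a destination bucket, a replication role, and `aws_s3_bucket_replication_configuration`) is omitted for brevity.

```hcl
resource "aws_s3_bucket" "assets" {
  bucket = "dashboard-assets-and-backups" # placeholder; bucket names are globally unique
}

# Versioning keeps prior object versions recoverable after overwrites or deletes.
resource "aws_s3_bucket_versioning" "assets" {
  bucket = aws_s3_bucket.assets.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lifecycle rules age noncurrent versions into cheaper storage and eventually expire them.
resource "aws_s3_bucket_lifecycle_configuration" "assets" {
  bucket = aws_s3_bucket.assets.id

  rule {
    id     = "age-out-noncurrent-versions"
    status = "Enabled"

    filter {} # apply to every object in the bucket

    noncurrent_version_transition {
      noncurrent_days = 30
      storage_class   = "STANDARD_IA"
    }

    noncurrent_version_expiration {
      noncurrent_days = 180
    }
  }
}
```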
Security / Threat Model
The inherited setup carried several compounding risks:
- Tight coupling between compute, storage, and cron jobs.
- No metrics or alerts meant outages surfaced via Slack complaints.
- Manual backup/restore procedures that required human intervention at 2 a.m.
- Shared IAM roles with god-mode permissions.
Tradeoffs & Lessons
Reliability is an attitude, not a feature. By assuming every dependency fails, we made resilience boring — which is exactly what mission-critical dashboards need.
Results
The next incident (intentional AZ failure) healed in under 90 seconds with zero pager noise. MTTR dropped from hours to minutes, and leadership finally trusted the platform enough to onboard another business unit.
Stack
AWS • DevOps • Reliability • Monitoring
FAQ
What caused the original outage?
A single EBS failure with no alerts or automated recovery.
How is resilience achieved now?
Multi-AZ design, Terraform IaC, and automated backups.
What changed in outcomes?
MTTR dropped from hours to minutes and incidents self-heal.