Cloud Backup & Disaster Recovery
debt(d5/e7/b7/t7)
Closest to 'specialist tool catches' (d5), Prowler/Checkov/AWS Config can detect missing cross-region replication, short retention periods, and unreplicated S3 buckets, but cannot verify whether restores actually work — that gap remains invisible.
Closest to 'cross-cutting refactor across the codebase' (e7), the quick_fix spans IaC changes for replication, written RPO/RTO documentation, runbook authoring, and recurring restore drills — touching infra, app config, secrets, and process across the org.
Closest to 'strong gravitational pull' (b7), DR posture applies_to every runtime context (web/cli/queue/api/cron) and shapes architecture decisions like multi-region design, secret distribution, and deployment automation throughout the system's life.
Closest to 'serious trap' (t7), the misconception that 'automated snapshots = DR' is the canonical wrong belief — teams confidently believe they're protected when single-region untested backups will fail at the worst moment, contradicting the intuition that 'backup enabled' means 'recoverable'.
Also Known As
TL;DR
Explanation
Backup and disaster recovery (DR) in the cloud means more than enabling automated snapshots: it means defining recovery objectives and testing that you can actually meet them. RPO (Recovery Point Objective) is the maximum acceptable data loss, measured in time; RTO (Recovery Time Objective) is the maximum acceptable time to restore service. A 5-minute RPO requires continuous replication or transaction-log backups shipped every few minutes, because daily snapshots alone can lose up to a day of data; a 1-hour RTO typically requires warm standby infrastructure, not cold backups.
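The RPO arithmetic can be sketched in a few lines of shell. This is a minimal illustration, not a monitoring tool: the helper name `meets_rpo` is hypothetical, and it assumes the worst-case data loss is roughly the gap between two consecutive backups (or the replication lag).

```shell
#!/bin/sh
# Hypothetical helper: does a backup cadence satisfy an RPO target?
# Assumption: a failure just before the next backup loses everything
# written since the previous one, so worst-case loss ~= backup interval.
meets_rpo() {
  interval_s=$1   # seconds between backups, or replication lag
  rpo_s=$2        # RPO target in seconds
  if [ "$interval_s" -le "$rpo_s" ]; then
    echo "ok: worst-case loss ${interval_s}s <= RPO ${rpo_s}s"
  else
    echo "violation: worst-case loss ${interval_s}s > RPO ${rpo_s}s"
  fi
}

meets_rpo 86400 300   # daily snapshots against a 5-minute RPO
meets_rpo 60 300      # about 1 minute of replication lag, same RPO
```

The first call reports a violation and the second reports ok, which is the point of the paragraph above: the backup cadence, not the mere existence of backups, decides whether the RPO holds.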
DR strategies range in cost and complexity: Backup-and-Restore (cheapest, RTO of hours to days), Pilot Light (data replicated and core services kept running in the DR region, with the rest of the stack started on failover), Warm Standby (a scaled-down but complete stack, scaled up on failover), and Multi-Site Active-Active (full capacity in both regions, near-instant failover). Choose based on the business cost of downtime, not engineering preference.
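As a rough illustration of how an RTO target narrows the choice, the sketch below maps an RTO in minutes to one of the four strategies. The function name and the numeric thresholds (240, 60, and 5 minutes) are assumptions made for this example, not figures from any provider; real boundaries depend on the workload and on the cost of downtime.

```shell
#!/bin/sh
# Illustrative only: the thresholds are assumptions, not vendor guidance.
suggest_strategy() {
  rto_min=$1   # RTO target in minutes
  if [ "$rto_min" -ge 240 ]; then
    echo "backup-and-restore"
  elif [ "$rto_min" -gt 60 ]; then
    echo "pilot-light"
  elif [ "$rto_min" -gt 5 ]; then
    echo "warm-standby"
  else
    echo "multi-site-active-active"
  fi
}

for rto in 1440 120 60 1; do
  echo "RTO ${rto} min -> $(suggest_strategy "$rto")"
done
```

A lookup like this is only a starting point for the conversation: the tighter the RTO, the more infrastructure has to be running before the disaster, and the business has to agree to pay for it.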
For PHP applications on AWS, the building blocks are RDS automated backups with point-in-time recovery, cross-region read replicas, S3 versioning with cross-region replication, and Infrastructure-as-Code (Terraform/CloudFormation) to recreate environments. Database snapshots alone are insufficient: you also need application code (in version control), container images (in ECR with cross-region replication), secrets (in Secrets Manager with replication), and DNS failover (Route 53 health checks).
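Two of the non-database pieces can be sketched with the AWS CLI. The secret name `prod/app/db-credentials` and the function name are hypothetical, while the bucket and DR region reuse the names from the Code Examples section. The script defaults to a dry run that only prints the commands, so it can be read and tried without touching an account.

```shell
#!/bin/sh
# Sketch: protect state that lives outside the database.
# DRY_RUN=1 (the default here) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
BUCKET="prod-assets"                  # bucket name from the Code Examples section
SECRET_ID="prod/app/db-credentials"   # hypothetical secret name
DR_REGION="us-west-2"                 # DR region from the Code Examples section

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

protect_non_db_state() {
  # S3 replication requires versioning, so enable it first.
  run aws s3api put-bucket-versioning --bucket "$BUCKET" \
    --versioning-configuration Status=Enabled
  # Keep a replica of the application secret in the DR region.
  run aws secretsmanager replicate-secret-to-regions \
    --secret-id "$SECRET_ID" \
    --add-replica-regions "Region=$DR_REGION"
}

protect_non_db_state
```

Setting `DRY_RUN=0` would execute the commands for real; container image replication and Route 53 health checks are left out of this sketch.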
The most common failure: backups that have never been restored. A backup you have not tested is a hope, not a recovery plan. Schedule quarterly DR drills where you actually restore to a separate environment and validate application functionality. Document the runbook step-by-step so any on-call engineer can execute it at 3am under pressure.
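A restore drill along these lines might be scripted as follows. This is a sketch under assumptions: the throwaway instance name `dr-drill` and the `./smoke-test.sh` script are hypothetical, the snapshot name and region come from the Code Examples section, and the restore happens in the same account (a cross-account restore would also need the snapshot shared with that account). It defaults to a dry run that only prints each step.

```shell
#!/bin/sh
# Restore-drill sketch. DRY_RUN=1 (the default here) only prints commands.
DRY_RUN="${DRY_RUN:-1}"
DR_REGION="us-west-2"            # DR region from the Code Examples section
SNAPSHOT_ID="prod-dr-snapshot"   # snapshot name from the Code Examples section
DRILL_DB="dr-drill"              # hypothetical throwaway instance name

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

drill() {
  start=$(date +%s)
  # Restore, wait until the instance is reachable, then validate the app.
  # ./smoke-test.sh is a hypothetical script standing in for real checks.
  run aws rds restore-db-instance-from-db-snapshot \
      --db-instance-identifier "$DRILL_DB" \
      --db-snapshot-identifier "$SNAPSHOT_ID" \
      --region "$DR_REGION" &&
    run aws rds wait db-instance-available \
      --db-instance-identifier "$DRILL_DB" --region "$DR_REGION" &&
    run ./smoke-test.sh "$DRILL_DB"
  status=$?
  # Always clean up the throwaway instance, even if a step failed.
  run aws rds delete-db-instance --db-instance-identifier "$DRILL_DB" \
      --skip-final-snapshot --region "$DR_REGION"
  end=$(date +%s)
  echo "drill exit=$status elapsed=$((end - start))s"
  return "$status"
}

drill
```

The elapsed time printed at the end is the number to compare against the RTO: a drill that passes its smoke test but takes three hours does not meet a 1-hour objective.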
Common Misconception
Why It Matters
Common Mistakes
- Backups stored only in the same region as production
- Never testing restore procedures until a real disaster strikes
- Backing up the database but not the application config, secrets, or container images
- Setting unrealistic RTO/RPO targets without budgeting for warm standby infrastructure
- No documented runbook so only one senior engineer knows how to recover
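The first mistake in the list is easy to check mechanically. The sketch below reads snapshot ARNs from standard input and succeeds only if at least one lives outside the production region. The function name and the example ARNs (with the placeholder account ID 123456789012) are hypothetical; in practice the ARNs could come from `aws rds describe-db-snapshots` run against each region.

```shell
#!/bin/sh
# Sketch: detect "all backups in the production region".
# The region is the 4th colon-separated field of an ARN.
PROD_REGION="us-east-1"

has_offsite_copy() {
  # Reads one ARN per line on stdin; returns 0 if any is outside PROD_REGION.
  while IFS= read -r arn; do
    region=$(printf '%s\n' "$arn" | cut -d: -f4)
    if [ -n "$region" ] && [ "$region" != "$PROD_REGION" ]; then
      return 0
    fi
  done
  return 1
}

# Hypothetical inventory containing a single same-region snapshot.
if printf '%s\n' "arn:aws:rds:us-east-1:123456789012:snapshot:prod-nightly" \
    | has_offsite_copy; then
  echo "off-site copy found"
else
  echo "WARNING: all backups are in $PROD_REGION"
fi
```

A check like this catches missing replication, but it says nothing about whether a restore works; that still needs a drill.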
Code Examples
# Anti-pattern: single region, never tested
# RDS in us-east-1 only
# Automated backups enabled, retention 7 days
# No cross-region copy, no restore testing
# Application secrets only in us-east-1 Secrets Manager
# Runbook: "Ask Dave, he set it up"

# Improved setup: extend automated backup retention to 30 days
# (on its own this is still single-region)
aws rds modify-db-instance --db-instance-identifier prod \
  --backup-retention-period 30 \
  --apply-immediately
# Copy a snapshot to the DR region; typically triggered on a schedule or by an EventBridge rule.
# Encrypted snapshots also need --kms-key-id for a key in the destination region.
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:us-east-1:... \
  --target-db-snapshot-identifier prod-dr-snapshot \
  --source-region us-east-1 --region us-west-2
# S3 cross-region replication (versioning must be enabled on both buckets)
aws s3api put-bucket-replication --bucket prod-assets \
  --replication-configuration file://replication.json
# Quarterly: restore to staging account, run smoke tests
# Documented runbook in runbooks/dr-failover.md
References
https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html
https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/
https://cloud.google.com/architecture/dr-scenarios-planning-guide