Disaster Recovery
- backup strategies need to be part of the disaster recovery planning
- How to decide RTO / RPO:
- Impact on business if app outage is extended?
- Cost of loss
- Dependencies
- Consumers
- Regulatory requirements
RTO
- Maximum amount of time a service is allowed to be unavailable.
RPO
- Maximum amount of time for which data could be lost for a service.
Disaster Recovery Strategies
Strategies |
RPO |
RTO |
Backup and Restore |
hours |
24 hours or less |
Pilot Light |
mins |
hours |
Warm Standby |
seconds |
mins |
Multi-site |
close to zero |
close to zero |
Backup & Restore
- RTO is usually 24 hours and RPO is hours.
Point in time recovery options:
Database |
Storage |
RDS Snapshot |
EBS Snapshot |
Aurora Snapshot |
EFS Snapshot |
DynamoBD Snapshot |
|
Redshift Snapshot |
|
DocumentDB Snapshot |
|
Neptune Snapshot |
|
Pilot Light
- Introduction of data replication between DR Region
- Core infrastructure is in place in DR Region
- Ability to scale out faster
- RPO reduced as asynchronous data replication across regions.
- Servers are switched off and not running
- Cross Region data replications
- S3 cross-region replication - Automatically replicates objects from source to DR region
- RDS cross-region replicas - Replica DB in a separate region created from snapshots
- Aurora Global Database - low latency reads in multiple regions
- DyanmoDB global tables - multi-region deployment of table without need to crete replica
- DocumentDB global Clusters - single primary region and 5 replicas across regions
- Global Datastore for Amazon ElasticCache for Redis - cross-regional replica cluster and low latency reads
Warm Standby
- Scaled down always running resources
- Can start serving request as soon as failure occurs
- Increased costs as opposed to Pilot Light
Multi-site Active/Active
- Most complex
- Most Costly
- Lowest RTO/RPO.