Disaster Recovery

  • Backup strategies need to be part of disaster recovery planning
  • Factors to consider when deciding RTO / RPO:
    • Business impact if an application outage is extended
    • Cost of loss
    • Dependencies
    • Consumers
    • Regulatory requirements

RTO

  • Maximum amount of time a service is allowed to be unavailable.

RPO

  • Maximum window of time for which a service's data could be lost, measured back from the failure to the last recovery point.
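
As a worked example of the two definitions above, the sketch below measures the RTO and RPO actually achieved in an incident and compares them with targets. The timestamps, target values, and function names are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Data-loss window: everything written after the last recovery point is lost."""
    return failure_time - last_recovery_point


def achieved_rto(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Downtime: how long the service was unavailable."""
    return service_restored - failure_time


# Hypothetical incident timeline
last_backup = datetime(2024, 1, 1, 2, 0)   # last successful backup
failure = datetime(2024, 1, 1, 5, 30)      # outage begins
restored = datetime(2024, 1, 1, 9, 30)     # service is back

data_loss = achieved_rpo(last_backup, failure)   # 3 hours 30 minutes
downtime = achieved_rto(failure, restored)       # 4 hours

# Hypothetical targets agreed with the business
rpo_target = timedelta(hours=4)
rto_target = timedelta(hours=2)

meets_rpo = data_loss <= rpo_target   # True: 3h30m of lost data is within 4h
meets_rto = downtime <= rto_target    # False: 4h of downtime exceeds 2h
```

In this made-up timeline the backup schedule is good enough for the RPO target, but the restore process is too slow for the RTO target, which would point toward a faster strategy.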

Disaster Recovery Strategies

Strategy              RPO              RTO
Backup and Restore    hours            24 hours or less
Pilot Light           minutes          hours
Warm Standby          seconds          minutes
Multi-site            close to zero    close to zero
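
One way to read the table is as a lookup: pick the cheapest strategy whose typical RPO and RTO still fit the targets. The sketch below assumes the strategies are ordered from cheapest to most expensive, and it assigns numeric bounds to the qualitative values ("hours", "minutes", "seconds"). Those numbers are illustrative assumptions, not figures from AWS.

```python
from datetime import timedelta
from typing import Optional

# (name, assumed worst-case RPO, assumed worst-case RTO), cheapest first.
# The numeric bounds are illustrative assumptions standing in for the
# qualitative values in the table.
STRATEGIES = [
    ("Backup and Restore", timedelta(hours=12), timedelta(hours=24)),
    ("Pilot Light", timedelta(minutes=30), timedelta(hours=4)),
    ("Warm Standby", timedelta(seconds=30), timedelta(minutes=15)),
    ("Multi-site", timedelta(seconds=1), timedelta(seconds=1)),
]


def pick_strategy(rpo_target: timedelta, rto_target: timedelta) -> Optional[str]:
    """Return the cheapest strategy that satisfies both targets, or None."""
    for name, rpo, rto in STRATEGIES:
        if rpo <= rpo_target and rto <= rto_target:
            return name
    return None


relaxed = pick_strategy(timedelta(days=1), timedelta(days=2))
moderate = pick_strategy(timedelta(hours=1), timedelta(hours=6))
strict = pick_strategy(timedelta(seconds=5), timedelta(minutes=1))
```

With these assumed bounds, `relaxed` resolves to Backup and Restore, `moderate` to Pilot Light, and `strict` to Multi-site. Real thresholds depend on the workload and should come from testing.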

Backup & Restore

  • RTO is typically 24 hours or less and RPO is measured in hours.

Point in time recovery options:

Database                Storage
RDS Snapshot            EBS Snapshot
Aurora Snapshot         EFS Snapshot
DynamoDB Snapshot
Redshift Snapshot
DocumentDB Snapshot
Neptune Snapshot

Pilot Light

  • Introduces data replication from the primary Region to the DR Region
  • Core infrastructure is in place in the DR Region
  • Ability to scale out faster
  • RPO is reduced because data is replicated asynchronously across Regions.
  • Servers in the DR Region are switched off until they are needed
  • Cross-Region data replication options:
    • S3 cross-region replication - Automatically replicates objects from source to DR region
    • RDS cross-region replicas - Replica DB in a separate region created from snapshots
    • Aurora Global Database - low latency reads in multiple regions
    • DynamoDB global tables - multi-region deployment of a table without the need to create replicas manually
    • DocumentDB global clusters - a single primary region and up to 5 read-only secondary regions
    • Global Datastore for Amazon ElastiCache for Redis - cross-region replica cluster and low latency reads
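
To make the first option in the list above concrete, here is a minimal sketch of an S3 cross-region replication rule. The bucket names, account ID, role ARN, and the helper `build_replication_config` are hypothetical. The sketch only builds the configuration dictionary, and the commented lines show how it might be applied with boto3. It assumes versioning is already enabled on both buckets and that the IAM role allows S3 to replicate objects.

```python
def build_replication_config(role_arn: str, dest_bucket_arn: str,
                             rule_id: str = "dr-replication") -> dict:
    """Build a replication configuration that copies every object to the DR bucket."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": dest_bucket_arn},
            }
        ],
    }


# Hypothetical identifiers, for illustration only
config = build_replication_config(
    role_arn="arn:aws:iam::123456789012:role/example-replication-role",
    dest_bucket_arn="arn:aws:s3:::example-dr-bucket",
)

# With boto3 installed and credentials configured, the dict could be applied as:
#   import boto3
#   boto3.client("s3").put_bucket_replication(
#       Bucket="example-source-bucket",
#       ReplicationConfiguration=config,
#   )
```

Replication is asynchronous, so objects written just before a failure may not yet exist in the DR region. This is why the RPO for this approach is small but not zero.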

Warm Standby

  • Scaled-down, always-running resources
  • Can start serving requests as soon as a failure occurs
  • Higher cost than Pilot Light

Multi-site Active/Active

  • Most complex
  • Most costly
  • Lowest RTO / RPO