Amazon Aurora MySQL-Compatible Edition is a fully managed relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. It is purpose-built for the cloud, using a new architectural model and distributed systems techniques to provide far higher performance, availability, and durability than was previously possible with conventional monolithic database architectures. Amazon Aurora packs a number of innovations into its engine and storage layers. In this session, we take a deep dive into some of the key innovations behind Amazon Aurora MySQL-Compatible Edition, explore recent improvements to the service, and discuss best practices and optimal configurations.
4. Amazon Aurora: a relational database reimagined for the cloud
Speed and availability of high-end commercial databases
Simplicity and cost-effectiveness of open source databases
Drop-in compatibility with MySQL and PostgreSQL
Simple pay-as-you-go pricing
Delivered as a managed service
5. Relational databases were not designed for the cloud
Monolithic architecture
Large failure blast radius
(Diagram: monolithic stack of SQL, transactions, caching, and logging)
6. Scale-out, distributed architecture
(Diagram: master and replicas in Availability Zones 1–3, each retaining SQL, transactions, and caching, on top of a shared storage volume built from storage nodes with SSDs)
Logging pushed down to a purpose-built log-structured distributed storage system
Storage volume is striped across hundreds of storage nodes distributed across 3 availability zones (AZ)
Six copies of data, two copies in each AZ
7. Why are 6 copies necessary?
In a large fleet, there are always some failures
AZ failures have "shared fate"
Need to tolerate AZ+1 failures and still be able to repair
For 3 AZs, this requires 6 copies
With one copy per AZ and a 2/3 read, 2/3 write quorum: the quorum breaks on an AZ failure
With two copies per AZ and a 3/6 read, 4/6 write quorum: the quorum survives an AZ failure
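To make the quorum arithmetic concrete, below is a minimal sketch (not Aurora's implementation); the two-copies-per-AZ layout and the 4/6 write and 3/6 read quorums come from the slide, while the node names and the check itself are illustrative.

```python
# Illustrative layout: copies of one protection group spread across 3 AZs.
SIX_COPIES = {"az1-a", "az1-b", "az2-a", "az2-b", "az3-a", "az3-b"}
THREE_COPIES = {"az1-a", "az2-a", "az3-a"}

def az(copy):
    return copy.split("-")[0]

def survives_az_plus_one(copies, write_quorum, read_quorum):
    """After losing any whole AZ, writes must still reach write_quorum;
    after losing that AZ plus any one more copy, a read quorum must
    still exist so the lost copies can be repaired."""
    for lost_az in {az(c) for c in copies}:
        after_az = {c for c in copies if az(c) != lost_az}
        if len(after_az) < write_quorum:        # AZ failure alone breaks writes
            return False
        if len(after_az) - 1 < read_quorum:     # AZ failure + one more node
            return False
    return True

print(survives_az_plus_one(SIX_COPIES, write_quorum=4, read_quorum=3))    # True
print(survives_az_plus_one(THREE_COPIES, write_quorum=2, read_quorum=2))  # False
```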
8. Minimal time to repair
Small segments shorten repair
Aurora uses 10GB segments
Can use fault-tolerance to:
• patch without impact
• balance hot and cold nodes
(Chart: repair time vs. segment size)
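A back-of-the-envelope illustration of why small segments shorten repair: the per-node replication bandwidth of 10 Gbps below is an assumption for the sake of the arithmetic, not a figure from the talk.

```python
# Rough repair-time estimate: time to re-replicate a lost copy at an
# assumed replication bandwidth (illustrative only).
def repair_seconds(size_gb, bandwidth_gbps=10.0):
    size_gbits = size_gb * 8
    return size_gbits / bandwidth_gbps

print(f"10 GB segment: ~{repair_seconds(10):.0f} s")           # ~8 s
print(f"1 TB volume:   ~{repair_seconds(1024) / 60:.0f} min")  # ~14 min
```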
9. Membership changes without stalls
Most systems use consensus for membership changes, which causes jitter and stalls
Aurora uses quorum sets and epochs
No stalls, AZ+1 fault-tolerance, can aggressively detect failure
Epoch 1: all nodes healthy; the quorum set is {A, B, C, D, E, F}
Epoch 2: node F is in a suspect state; a second quorum group {A, B, C, D, E, G} is formed with node G; both quorums are active
Epoch 3: node F is confirmed unhealthy; the new quorum group {A, B, C, D, E, G} is active
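A toy sketch of the quorum-set-and-epoch idea (illustrative data structures, not Aurora's actual protocol): while F is merely suspect, writes must reach quorum in both the old and the new set, so membership can change without a stall.

```python
from dataclasses import dataclass

@dataclass
class Membership:
    epoch: int
    quorum_sets: list          # each active set must independently reach quorum

# Epoch 1: single quorum set, all nodes healthy.
m1 = Membership(epoch=1, quorum_sets=[{"A", "B", "C", "D", "E", "F"}])

# Epoch 2: F is suspect; add a second set containing G. Writes must reach
# quorum in *both* sets, so there is no stall while F's fate is unknown.
m2 = Membership(epoch=2, quorum_sets=[{"A", "B", "C", "D", "E", "F"},
                                      {"A", "B", "C", "D", "E", "G"}])

# Epoch 3: F confirmed unhealthy; retire the old set.
m3 = Membership(epoch=3, quorum_sets=[{"A", "B", "C", "D", "E", "G"}])

def write_acked(membership, acked_nodes, write_quorum=4):
    """A write commits only if it reaches quorum in every active set."""
    return all(len(acked_nodes & qs) >= write_quorum
               for qs in membership.quorum_sets)

# During epoch 2, acks from A, B, C, D satisfy both sets even with F down.
print(write_acked(m2, {"A", "B", "C", "D"}))   # True
```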
10. Avoiding quorum reads
Reads are expensive in most quorum-based systems
Aurora knows which nodes are up to date and the latency to each node
A read quorum is only needed for repairs or crash recovery
(Diagram: a quorum group of six storage nodes with their latest LSNs – most at LSN N, a couple lagging at LSN N-1)
For each data block, at least 4 nodes in the quorum group will have the most recent data
A read from any one of these four nodes returns the most recent data
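A minimal sketch of the read-routing bookkeeping implied above (invented structures): the database tracks the highest acknowledged LSN and a latency estimate per storage node, so a read can go to a single up-to-date node instead of a read quorum.

```python
# Per-node state the database instance already learns during writes:
# the highest LSN each node has acknowledged and a recent latency estimate.
node_state = {
    "node-1": {"lsn": 100, "latency_ms": 1.2},
    "node-2": {"lsn": 100, "latency_ms": 0.8},
    "node-3": {"lsn":  99, "latency_ms": 0.7},   # lagging by one record
    "node-4": {"lsn": 100, "latency_ms": 2.5},
    "node-5": {"lsn":  99, "latency_ms": 1.1},
    "node-6": {"lsn": 100, "latency_ms": 1.0},
}

def pick_read_node(state, required_lsn):
    """Route the read to the lowest-latency node known to be at or beyond
    the LSN the reader needs; no read quorum is involved."""
    candidates = [(s["latency_ms"], n) for n, s in state.items()
                  if s["lsn"] >= required_lsn]
    if not candidates:
        raise RuntimeError("fall back to a quorum read (repair/recovery path)")
    return min(candidates)[1]

print(pick_read_node(node_state, required_lsn=100))   # node-2
```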
12. How did we achieve this?
DO LESS WORK: do fewer IOs; minimize network packets; cache prior results; offload the database engine
BE MORE EFFICIENT: process asynchronously; reduce latency path; use lock-free data structures; batch operations together
DATABASES ARE ALL ABOUT I/O
NETWORK-ATTACHED STORAGE IS ALL ABOUT PACKETS/SECOND
HIGH-THROUGHPUT PROCESSING IS ALL ABOUT CONTEXT SWITCHES
13. IO traffic in MySQL
TYPE OF WRITE: binlog, data, double-write, log, FRM files
(Diagram: MySQL with a replica – primary instance in AZ 1 and replica instance in AZ 2, each writing to Amazon Elastic Block Store (EBS) and an EBS mirror, with backups to Amazon S3)
IO FLOW
1. Issue write to EBS – EBS issues to mirror, ack when both done
2. Stage write to standby instance
3. Issue write to EBS on standby instance
OBSERVATIONS
Steps 1, 3, 4 are sequential and synchronous
This amplifies both latency and jitter
Many types of writes for each user operation
Have to write data blocks twice to avoid torn writes
PERFORMANCE
780K transactions
7,388K I/Os per million txns (excludes mirroring, standby)
Average 7.4 I/Os per transaction
30-minute SysBench write-only workload, 100 GB dataset, RDS Multi-AZ, 30K PIOPS
14. IO traffic in Aurora
TYPE OF WRITE: binlog, data, double-write, log, FRM files
(Diagram: Amazon Aurora primary instance in AZ 1 with replica instances in the other AZs, issuing asynchronous 4/6 quorum distributed writes to the shared storage volume, with backups to Amazon S3)
IO FLOW
Boxcar redo log records – fully ordered by LSN
Shuffle to appropriate segments – partially ordered
Boxcar to storage nodes and issue writes
OBSERVATIONS
Only write redo log records; all steps asynchronous
No data block writes (checkpoint, cache replacement)
6X more log writes, but 9X less network traffic
Tolerant of network and storage outlier latency
PERFORMANCE
27,378K transactions (35X more)
950K I/Os per 1M txns (6X amplification, 7.7X less)
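A simplified sketch of this write path with invented names: redo records are boxcarred in LSN order, shuffled to the segments they touch, and each segment write is considered durable once 4 of its 6 copies acknowledge; real Aurora does all of this asynchronously inside the engine.

```python
from collections import defaultdict

WRITE_QUORUM = 4   # 4 of 6 copies per protection group

def shuffle_to_segments(redo_records):
    """Group LSN-ordered redo records by the segment they modify
    (records stay ordered within each segment)."""
    per_segment = defaultdict(list)
    for rec in redo_records:                      # rec: (lsn, segment_id, payload)
        per_segment[rec[1]].append(rec)
    return per_segment

def quorum_write(segment_id, batch, send_to_copy):
    """Issue the batch to all 6 copies; consider it durable at 4 acks."""
    acks = sum(1 for copy in range(6) if send_to_copy(segment_id, copy, batch))
    return acks >= WRITE_QUORUM

# Illustrative usage: a boxcar of three records touching two segments,
# written over a network where one copy never answers.
boxcar = [(101, "seg-A", b"..."), (102, "seg-B", b"..."), (103, "seg-A", b"...")]
flaky_network = lambda seg, copy, batch: copy != 5
for seg, batch in shuffle_to_segments(boxcar).items():
    print(seg, "durable:", quorum_write(seg, batch, flaky_network))
```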
15. IO traffic in Aurora (Storage Node)
(Diagram: a storage node receives log records from the primary instance into an incoming queue; it maintains an update queue, hot log, and data blocks, stages point-in-time snapshots to S3 backup, sorts/groups and coalesces log records, garbage collects and scrubs blocks, and exchanges peer-to-peer gossip with peer storage nodes)
OBSERVATIONS
All steps are asynchronous
Only steps 1 and 2 are in foreground latency path
Input queue is 46X less than MySQL (unamplified, per node)
Favor latency-sensitive operations
Use disk space to buffer against spikes in activity
IO FLOW
① Receive record and add to in-memory queue
② Persist record and ACK
③ Organize records and identify gaps in log
④ Gossip with peers to fill in holes
⑤ Coalesce log records into new data block versions
⑥ Periodically stage log and new block versions to S3
⑦ Periodically garbage collect old versions
⑧ Periodically validate CRC codes on blocks
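A condensed sketch of the storage-node flow, with only illustrative structures: receiving and persisting a record (steps 1–2) are the latency path, while sorting, gossip to fill gaps, and coalescing into block versions happen in the background.

```python
# In-memory stand-ins for the storage node's queues and log.
hot_log = []                                   # persisted records, may have gaps

def receive(record):
    """Steps 1-2 (foreground latency path): persist the record and ACK."""
    hot_log.append(record)                     # stand-in for a durable append
    return "ACK"

def background_pass(gossiped_records):
    """Steps 3-5 (background): sort by LSN, fill gaps learned via
    peer-to-peer gossip, then coalesce records into new block versions."""
    known = {r["lsn"] for r in hot_log}
    hot_log.extend(r for r in gossiped_records if r["lsn"] not in known)
    hot_log.sort(key=lambda r: r["lsn"])
    blocks = {}
    for r in hot_log:                          # newest record per block wins
        blocks[r["block"]] = r["payload"]
    return blocks

receive({"lsn": 1, "block": "B1", "payload": "v1"})
receive({"lsn": 3, "block": "B1", "payload": "v3"})
print(background_pass([{"lsn": 2, "block": "B2", "payload": "v2"}]))
# {'B1': 'v3', 'B2': 'v2'}
```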
16. Crash recovery
TRADITIONAL DATABASE
Have to replay logs since the last checkpoint
Typically 5 minutes between checkpoints
Single-threaded in MySQL; requires a large number of disk accesses
A crash at T0 requires re-application of the SQL in the redo log since the last checkpoint
AMAZON AURORA
Underlying storage replays redo records on demand as part of a disk read
Parallel, distributed, asynchronous
No replay needed for startup
A crash at T0 results in redo logs being applied to each segment on demand, in parallel, asynchronously
Instant crash recovery
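A small sketch of the replay-on-read idea (invented data structures): each block keeps its last materialized version plus any newer redo records, and a read applies only what that block still needs, so startup requires no recovery replay.

```python
def read_block(block_id, materialized, redo_log):
    """Return the latest version of a block by applying, at read time,
    any redo records newer than the materialized version."""
    version, value = materialized[block_id]          # (lsn, state)
    for rec in redo_log:                             # redo_log is LSN-ordered
        if rec["block"] == block_id and rec["lsn"] > version:
            value = rec["apply"](value)              # apply one redo record
            version = rec["lsn"]
    return version, value

# Illustrative usage: block B1 materialized at LSN 10, two newer records.
materialized = {"B1": (10, 100)}
redo_log = [
    {"lsn": 11, "block": "B1", "apply": lambda v: v + 5},
    {"lsn": 12, "block": "B1", "apply": lambda v: v * 2},
]
print(read_block("B1", materialized, redo_log))      # (12, 210)
```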
19. Fast database cloning *August 2017*
Clone a database without copying data
Creation of a clone is nearly instantaneous
Data copy happens only on write – when original and cloned volume data differ
Example use cases:
Clone a production DB to run tests
Reorganize a database
Save a point-in-time snapshot for analysis without impacting the production system
(Diagram: a production database serving production applications, with clones used for dev/test applications and benchmarks)
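A toy copy-on-write sketch of the cloning behaviour described above (illustrative only): the clone initially shares every page with the source volume, and a page is copied only when one side writes to it.

```python
class Volume:
    def __init__(self, pages=None, parent=None):
        self.pages = pages or {}         # page_id -> data written to THIS volume
        self.parent = parent             # shared, read-only view of the source

    def read(self, page_id):
        if page_id in self.pages:
            return self.pages[page_id]
        return self.parent.read(page_id) if self.parent else None

    def write(self, page_id, data):
        # Copy-on-write: only now does this volume get its own copy of the page.
        self.pages[page_id] = data

    def clone(self):
        # Nearly instantaneous: no page data is copied at clone time.
        return Volume(parent=self)

prod = Volume(pages={1: "orders-v1", 2: "users-v1"})
test = prod.clone()
print(test.read(1))                 # "orders-v1" - shared with production
test.write(1, "orders-v2")          # only the written page diverges
print(prod.read(1), test.read(1))   # "orders-v1" "orders-v2"
```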
20. Backup and restore in Aurora
Take periodic snapshot of each segment in parallel; stream the redo logs to Amazon S3
Backup happens continuously without performance or availability impact
At restore, retrieve the appropriate segment snapshots and log streams to storage nodes
Apply log streams to segment snapshots in parallel and asynchronously
(Diagram: per-segment snapshots plus continuous log records over time for segments 1–3, up to the recovery point)
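A sketch of the restore path under the description above (hypothetical helper names): for each segment, take the newest snapshot at or before the recovery point and apply that segment's log records up to the recovery point; segments are independent, so this can run in parallel.

```python
def restore_segment(snapshots, log_records, recovery_point):
    """snapshots: [(lsn, state)], log_records: [(lsn, apply_fn)], both LSN-ordered.
    Rebuild one segment's state as of recovery_point."""
    base_lsn, state = max((s for s in snapshots if s[0] <= recovery_point),
                          key=lambda s: s[0])
    for lsn, apply_fn in log_records:
        if base_lsn < lsn <= recovery_point:
            state = apply_fn(state)
    return state

# Illustrative usage for one segment (other segments restore the same way, in parallel).
snaps = [(100, {"rows": 10}), (200, {"rows": 25})]
logs = [(150, lambda s: {**s, "rows": s["rows"] + 1}),
        (210, lambda s: {**s, "rows": s["rows"] + 2}),
        (260, lambda s: {**s, "rows": 0})]
print(restore_segment(snaps, logs, recovery_point=250))   # {'rows': 27}
```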
22. Read replica end-point auto-scaling *Nov. 2017*
Up to 15 promotable read replicas across multiple availability zones
Redo-log-based replication leads to low replica lag – typically < 10 ms
Reader end-point with load balancing and auto-scaling * NEW *
(Diagram: master and read replicas on the shared distributed storage volume, with read traffic load-balanced through the reader end-point)
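A short usage sketch of splitting traffic between the cluster (writer) endpoint and the reader end-point; the host names, credentials, and table are placeholders, and pymysql is just one common client for a MySQL-compatible endpoint.

```python
import pymysql

# Placeholder endpoints - substitute your cluster's actual values.
WRITER_ENDPOINT = "mycluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com"
READER_ENDPOINT = "mycluster.cluster-ro-xxxxxxxx.us-east-1.rds.amazonaws.com"

def connect(host):
    return pymysql.connect(host=host, user="app", password="...",
                           database="appdb", connect_timeout=5)

# Writes go to the writer endpoint; reads go to the reader endpoint,
# which load-balances across the available read replicas.
conn = connect(READER_ENDPOINT)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM orders")
        print(cur.fetchone())
finally:
    conn.close()
```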
23. Online DDL: Aurora vs. MySQL
MySQL
Full table copy in the background
Rebuilds all indexes – can take hours or days
DDL operation impacts DML throughput
Table lock applied to apply DML changes
(Diagram: index root and leaf blocks)
Amazon Aurora
Use schema versioning to decode the block:
table name | operation | column name | time stamp
Table 1    | add-col   | column-abc  | t1
Table 2    | add-col   | column-qpr  | t2
Table 3    | add-col   | column-xyz  | t3
Modify-on-write primitive to upgrade to the latest schema
Currently supports adding a NULLable column at the end of the table
Add column anywhere and with a default value coming soon
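A toy sketch of the schema-versioning idea (invented structures): each block records the schema version it was written under, the version history says which columns were added when, and a row is upgraded to the latest schema only when it is next modified.

```python
# Version history, as in the table above: each entry adds a nullable column.
schema_history = [
    {"version": 1, "columns": ["id", "name"]},
    {"version": 2, "columns": ["id", "name", "column-abc"]},   # add-col at t1
    {"version": 3, "columns": ["id", "name", "column-abc", "column-xyz"]},
]

def decode_row(row_values, block_schema_version):
    """Decode a row using the schema the block was written under,
    filling in columns added later with NULL."""
    old_cols = schema_history[block_schema_version - 1]["columns"]
    latest_cols = schema_history[-1]["columns"]
    row = dict(zip(old_cols, row_values))
    return {col: row.get(col) for col in latest_cols}

def modify_on_write(row_values, block_schema_version, updates):
    """Upgrade the row to the latest schema only when it is written."""
    row = decode_row(row_values, block_schema_version)
    row.update(updates)
    return row, len(schema_history)            # new block carries latest version

print(decode_row([7, "alice"], block_schema_version=1))
# {'id': 7, 'name': 'alice', 'column-abc': None, 'column-xyz': None}
print(modify_on_write([7, "alice"], 1, {"column-abc": 42}))
```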
38. Multi-region Multi-Master
Writes accepted locally
Optimistic concurrency control – no distributed lock manager, no chatty lock management protocol
(Diagram: head nodes in Region 1 and Region 2, each with a multi-AZ storage volume holding local and remote partitions)
Conflicts handled hierarchically – at head nodes, at storage nodes, and at AZ- and region-level arbitrators
Near-linear performance scaling when there are no or low levels of conflicts
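A minimal optimistic-concurrency sketch of the conflict handling described above (not Aurora's protocol): each master applies writes locally, the storage layer accepts a page write only if it was based on the version it currently holds, and the losing writer re-reads and retries.

```python
class PageStore:
    """Stand-in for the shared storage's per-page conflict check:
    a write is accepted only if it is based on the current version."""
    def __init__(self):
        self.version = 0
        self.value = 0

    def conditional_write(self, based_on_version, new_value):
        if based_on_version != self.version:   # another master won the race
            return False
        self.version += 1
        self.value = new_value
        return True

store = PageStore()

# The master in Region 1 reads the page...
seen_version, seen_value = store.version, store.value
# ...meanwhile the master in Region 2 commits its write first.
store.conditional_write(store.version, 10)
# Region 1's attempt now conflicts and is rejected; it re-reads and retries.
print(store.conditional_write(seen_version, seen_value + 5))                 # False
print(store.conditional_write(store.version, store.value + 5), store.value)  # True 15
```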