2. ØYSTEIN GRØVLEN
Sr. Staff Engineer @ Alibaba Cloud
Bio:
Before joining Alibaba, Øystein worked for 10 years in the
MySQL optimizer team at Sun/Oracle. At Sun Microsystems,
he was also a contributor on the Apache Derby project and
Sun's Architectural Lead on Java DB. Prior to that, he worked
for 10 years on development of Clustra, a highly available
DBMS.
3. POLARDB: a Cloud Native Database
Emerging Hardware
• NVM
• RDMA
• FPGA
Serverless
• Auto Scaling
• Paid by Usage
• Zero Downtime
Security
• Encryption
• Audit
• Access Control
Intelligence
• Self-configuration
• Self-optimization
• Self-diagnosis
• Self-healing
CLOUD NATIVE
User Oriented
4. Database Architecture Revolution: Separation of Storage and Computation
Transaction
Architecture: Separation of Storage and Computation
Database Storage Engine
Computation Offloading
Storage
Compatibility
Security
HTAP
Multi-Model
Usability
Self-Driving
Manageability
5. Cloud Native Architecture
• Scale compute and storage independently
• Shared storage
• Across AZ fail-over without data loss
• Optimize division of functionality between
storage and compute
• Tight integration with other cloud components
like metering, monitoring, and the control plane
• Optimize for hardware in the data centers
• Compatible with MySQL, PostgreSQL, etc.
• Security
PolarProxy: intelligent proxy
POLARDB: 100% compatible
PolarStore: storage optimized for database
PolarFS
7. PolarStore: Design for Emerging Hardware
Network over RDMA
- No context switch
- OS-bypass & zero-copy
WAL in 3D XPoint (Optane)
- Parallel random I/O absorbed by Optane
- Excellent performance with fewer long-tail latency issues
- No need for over-provisioning
[Figure: POLARDB with libpfs writes to shared memory (shm) and reaches Chunkservers 1-3 through its RDMA-NIC over the RDMA network; each chunkserver has an RDMA-NIC, memory, Optane, and NVMe SSDs]
8. PolarFS: POSIX Distributed File System Closely Integrated with the DB
Pure User Space for Extra-low Latency
- No system calls
- No context switches
- Zero data copy
POSIX Semantics
- Easy porting
Low Latency Oriented
[Figure: Nodes 1-3, each running POLARDB on libpfs, share a journal file (head, pending tail, tail) and a Paxos file]
[Figure: POLARDB cluster file system metadata cache: directory tree, file mapping table (FileBlk to VolBlk), block mapping table (VolBlk to FileID and FileBlk), and the database volume divided into chunks]
PolarFS: An Ultra-low Latency and Failure Resilient Distributed File System for Shared
Storage Cloud Database (VLDB 2018)
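The metadata cache on this slide can be read as two lookup tables: a per-file table from file block to volume block, and a reverse table from volume block back to the owning file. The following is a minimal Python sketch of that reading, not PolarFS code. The names `MetadataCache`, `allocate`, and `resolve` are hypothetical, and `BLOCKS_PER_CHUNK` is an arbitrary illustrative value rather than a PolarFS constant.

```python
from dataclasses import dataclass, field

BLOCKS_PER_CHUNK = 100  # illustrative only; not a PolarFS constant


@dataclass
class MetadataCache:
    # File mapping table: file_id -> list where index = FileBlk, value = VolBlk
    file_map: dict = field(default_factory=dict)
    # Block mapping table: VolBlk -> (file_id, FileBlk)
    block_map: dict = field(default_factory=dict)
    # Next unallocated volume block (a real allocator would track free space)
    next_free: int = 0

    def allocate(self, file_id: int) -> int:
        """Append one volume block to a file and record it in both tables."""
        vol_blk = self.next_free
        self.next_free += 1
        blocks = self.file_map.setdefault(file_id, [])
        self.block_map[vol_blk] = (file_id, len(blocks))
        blocks.append(vol_blk)
        return vol_blk

    def resolve(self, file_id: int, file_blk: int) -> tuple:
        """Translate (file, file block) into (chunk index, block within chunk)."""
        vol_blk = self.file_map[file_id][file_blk]
        return divmod(vol_blk, BLOCKS_PER_CHUNK)
```

Keeping both directions in memory lets a read resolve its target chunk without any metadata I/O, which fits the low-latency goal stated on the slide.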
9. Dynamic Scaling
Fast Scaling
[Figure: in MySQL, the master and each replica keep their own local storage; in POLARDB, the master and replicas sit on one shared storage layer]
Upgrade from 2 vCPUs to 32 vCPUs in only 5 minutes.
Add more replicas in only 5 minutes.
Lower Cost: 30%-50% off
Total cost of 4 vCPU, 32 GB memory, 500 GB storage with different replica numbers

Replicas   RDS MySQL   POLARDB
1          6,940       4,949
2          10,230      6,549
3          13,521      8,149
4          16,811      9,749
5          20,102      11,349
10         39,844      20,949
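The "30%~50% OFF" claim can be checked against the chart totals with a few lines of arithmetic. This sketch assumes the lower of the two series belongs to POLARDB, which is the only assignment consistent with the caption. The function name `savings_percent` is introduced here for illustration.

```python
# Totals read off the chart. Assigning the lower series to POLARDB is an
# assumption based on the "30%~50% OFF" caption.
replicas = [1, 2, 3, 4, 5, 10]
rds_mysql = [6940, 10230, 13521, 16811, 20102, 39844]
polardb = [4949, 6549, 8149, 9749, 11349, 20949]


def savings_percent(baseline: float, cost: float) -> float:
    """Relative saving of `cost` against `baseline`, in percent."""
    return (1 - cost / baseline) * 100


savings = [savings_percent(r, p) for r, p in zip(rds_mysql, polardb)]

for n, s in zip(replicas, savings):
    print(f"{n:>2} replica(s): {s:.1f}% lower")
```

Under that assumption the saving grows with the replica count, from a little under 30% with one replica to about 47% with ten. This matches the shared-storage design, where an extra replica adds compute but no extra copy of the data.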
10. Shared Nothing Logical Replication vs. Shared Storage Physical Replication
[Figure: shared-nothing logical replication (master and slave on local storage, each with its own data, redo log, and binlog) versus shared-storage physical replication (master and slave on POLARSTORE)]
Physical Replication is much more reliable than Logical Replication
11. Shared Nothing Logical Replication vs. Shared Storage Physical Replication
Non-blocking low-latency DDL synchronization
MySQL
- Master: Add Column, running 1 hour
- Slave: Add Column, blocked 1 hour (updates data files)
- Applying DDL blocks the following events
POLARDB (shared storage)
- Slave: updates metadata only; no need to modify data files
12. Physical Replication by Redo Log
[Figure: the primary commits transactions T1-T5, writes them in memory, and asynchronously flushes data files and redo log to shared storage; the RO node parses the redo log into a hash table, applies it to its buffer pool, and serves queries from a snapshot of T4]
Shared Storage, Continuous Recovery, Consistent Snapshot Read
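The RO-node side of the picture (log parse, hash table, buffer pool) can be sketched as follows. This is one plausible reading of the slide, not POLARDB source code: redo records are bucketed by page in a hash table and replayed onto a page when it is read. `RedoRecord`, `ReadOnlyNode`, and the dictionary standing in for shared storage are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RedoRecord:
    lsn: int
    page_id: int
    key: str
    value: str


class ReadOnlyNode:
    def __init__(self, shared_pages: dict):
        self.shared_pages = shared_pages   # stand-in for shared storage
        self.buffer_pool = {}              # page_id -> (page_lsn, rows)
        self.pending = defaultdict(list)   # hash table: page_id -> redo records
        self.parsed_lsn = 0                # highest LSN parsed so far

    def parse_log(self, records):
        """Log parse: bucket new redo records by page in the hash table."""
        for rec in sorted(records, key=lambda r: r.lsn):
            self.pending[rec.page_id].append(rec)
            self.parsed_lsn = max(self.parsed_lsn, rec.lsn)

    def read_page(self, page_id: int) -> dict:
        """Load a page if needed, then replay its pending redo in LSN order."""
        if page_id not in self.buffer_pool:
            lsn, rows = self.shared_pages[page_id]
            self.buffer_pool[page_id] = (lsn, dict(rows))  # private copy
        page_lsn, rows = self.buffer_pool[page_id]
        for rec in self.pending.pop(page_id, []):
            if rec.lsn > page_lsn:         # skip changes already on the page
                rows[rec.key] = rec.value
                page_lsn = rec.lsn
        self.buffer_pool[page_id] = (page_lsn, rows)
        return rows
```

Replay happens only in the RO node's own buffer pool; the copy in shared storage is never written by the reader.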
13. Physical Replication - Page from Past
Avoid Data Gap
Control purge
Oldest read view
Checkpoint LSN (T1)
[Figure: primary and RO node over shared storage (data, redo log, checkpoint); the RO node parses redo into a hash table and serves a snapshot of T4 from its buffer pool; transactions T1-T4 are marked purgeable or unpurgeable]
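"Control purge" by the oldest read view can be illustrated with a small sketch. It assumes each RO node reports its oldest open read view to the primary, and that undo belonging to transactions older than every reported view is safe to purge. Both functions and the use of plain integers as transaction ids are illustrative assumptions, not POLARDB APIs.

```python
from typing import Iterable, List, Tuple


def purge_limit(primary_oldest_view: int, ro_oldest_views: Iterable[int]) -> int:
    """Oldest read view across the whole cluster (ids grow over time)."""
    return min([primary_oldest_view, *ro_oldest_views])


def split_undo(committed_trx_ids: Iterable[int], limit: int) -> Tuple[List[int], List[int]]:
    """Undo older than every open read view is purgeable; the rest must stay."""
    purgeable, unpurgeable = [], []
    for trx_id in sorted(committed_trx_ids):
        (purgeable if trx_id < limit else unpurgeable).append(trx_id)
    return purgeable, unpurgeable
```

For example, with the primary's oldest view at 5 and RO nodes at 4 and 6, `purge_limit(5, [4, 6])` is 4, so only the undo of transactions 1 to 3 may be purged. A slow reader on any RO node therefore holds back purge on the primary.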
14. Physical Replication - Page from Future
Avoid Data Overstep
Control data file flushing
[Figure: primary and RO node over shared storage (data, redo log); the RO node's snapshot version is T4, the LSN of its latest applied redo log; on the primary, pages up to T4 are flushable and T5 is unflushable]
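The flush rule suggested by this slide can be written as a simple comparison: the primary holds back a dirty page whose newest change is beyond what some RO node has applied, because that node would otherwise read a page from its own future. This is a minimal sketch under that assumption; `flush_limit` and `split_dirty_pages` are hypothetical names.

```python
from typing import Dict, Iterable, List, Tuple


def flush_limit(primary_lsn: int, ro_applied_lsns: Iterable[int]) -> int:
    """Highest page LSN that every RO node has already applied."""
    return min([primary_lsn, *ro_applied_lsns])


def split_dirty_pages(newest_lsn_by_page: Dict[int, int], limit: int) -> Tuple[List[int], List[int]]:
    """Partition dirty pages into (flushable, unflushable) page ids."""
    flushable, unflushable = [], []
    for page_id, newest_lsn in sorted(newest_lsn_by_page.items()):
        (flushable if newest_lsn <= limit else unflushable).append(page_id)
    return flushable, unflushable
```

With RO nodes at applied LSNs 40 and 45, the limit is 40: pages last modified at LSN 30 and 40 can be flushed, while a page modified at LSN 50 has to wait until the slowest RO node catches up.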
15. Single Master
Single Endpoint
Transparent Failover
Attack Protection
Causally Consistent Read
[Figure: the application connects through the proxy cluster to one master and three replicas, all on shared storage]
Read/Write Split
High Availability
Load Balance
Security
16. Read and Write Separation - Session Consistency
Problem: can't read the latest data. Solved!
connection.query
{
UPDATE user SET name='Jimmy' WHERE id=1;
COMMIT;
SELECT name FROM user WHERE id=1; // name is Jimmy
}
SELECT can always get the latest data.
[Figure: POLARDB cluster with master M and replicas R1 and R2 at different LSNs (30 and 35)]
LSN: Log Sequence Number
Problem:
1. UPDATE
2. SELECT
Solved:
1. UPDATE
2. Return LSN = 35
3. SELECT (require LSN >= 35)
Application
Smart Proxy
Read & Write Separation
Load Balance Module
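The three steps on this slide (UPDATE, return LSN, SELECT requiring at least that LSN) amount to a routing rule in the proxy. Here is a minimal sketch of that rule. The class and method names are hypothetical, the replica LSNs are taken from the slide as an example, and picking the first eligible replica stands in for real load balancing.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    applied_lsn: int


class SmartProxy:
    def __init__(self, master: Node, replicas: list):
        self.master = master
        self.replicas = replicas
        self.session_lsn = {}  # session id -> LSN returned by its last write

    def on_write(self, session: str, commit_lsn: int) -> Node:
        """Writes always go to the master; remember the LSN it returned."""
        self.master.applied_lsn = max(self.master.applied_lsn, commit_lsn)
        self.session_lsn[session] = commit_lsn
        return self.master

    def route_read(self, session: str) -> Node:
        """Pick a replica that has applied the session's last write."""
        need = self.session_lsn.get(session, 0)
        for replica in self.replicas:
            if replica.applied_lsn >= need:
                return replica
        return self.master  # no replica is fresh enough yet
```

With replicas at LSN 30 and 35, a session whose UPDATE returned LSN 35 is routed to the second replica, while a session that has written nothing can read from either. Sessions only pay for consistency with their own writes, so most reads still land on replicas.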