This slide deck, from Siddon Tang, Chief Engineer at PingCAP, accompanies Siddon's talk at Percona Live 2018 on how to scale TiKV, an open-source transactional key-value store, to 100+ nodes.
1. Building a Transactional Key-Value Store That Scales to 100+ Nodes
Siddon Tang at PingCAP
(Twitter: @siddontang; @pingcap)
2. About Me
● Chief Engineer at PingCAP
● Leader of TiKV project
● My other open-source projects:
○ go-mysql
○ go-mysql-elasticsearch
○ LedisDB
○ raft-rs
○ etc.
3. Agenda
● Why did we build TiKV?
● How do we build TiKV?
● Going beyond TiKV
9. What we need to build...
1. A high-performance key-value engine to store data
2. A consensus model to keep data consistent across machines
3. A transaction model to provide ACID guarantees across machines
4. A network framework for communication
5. A scheduler to manage the whole cluster
13. Rust - Cons (2 years ago):
● Makes you think differently
● Long compile time
● Lack of libraries and tools
● Few Rust programmers
● Uncertain future
(Chart: Rust's learning curve over time)
14. Rust - Pros:
● Blazing Fast
● Memory safety
● Thread safety
● No GC
● Fast FFI
● Vibrant package ecosystem
17. Why RocksDB?
● High Write/Read Performance
● Stability
● Easy to embed in Rust
● Rich functionality
● Continuous development
● Active community
21. Raft - Election
(State diagram: a node starts as Follower; on election timeout it becomes a Candidate and starts a new election; a Candidate that receives a majority of votes becomes Leader, re-campaigns if its election times out, and steps back to Follower if it finds a leader or receives a higher-term message; a Leader steps down to Follower on receiving a higher-term message.)
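The election transitions above can be sketched as a small transition function. The names (`Role`, `Event`, `step`) are illustrative, not TiKV's actual raft-rs API:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Role {
    Follower,
    Candidate,
    Leader,
}

#[derive(Debug, Clone, Copy)]
enum Event {
    ElectionTimeout, // no leader heard from, or a candidate's election timed out
    MajorityVotes,   // candidate received votes from a majority
    HigherTermMsg,   // saw a message carrying a higher term
    FoundLeader,     // candidate discovered an established leader
}

fn step(role: Role, event: Event) -> Role {
    match (role, event) {
        // Follower starts a new election when its election timer fires.
        (Role::Follower, Event::ElectionTimeout) => Role::Candidate,
        // Candidate wins with a majority, or re-campaigns on timeout.
        (Role::Candidate, Event::MajorityVotes) => Role::Leader,
        (Role::Candidate, Event::ElectionTimeout) => Role::Candidate,
        // Finding a leader demotes a candidate back to follower.
        (Role::Candidate, Event::FoundLeader) => Role::Follower,
        // A higher-term message demotes any node to follower.
        (_, Event::HigherTermMsg) => Role::Follower,
        // Everything else leaves the role unchanged.
        (role, _) => role,
    }
}

fn main() {
    let mut role = Role::Follower;
    role = step(role, Event::ElectionTimeout); // Follower -> Candidate
    role = step(role, Event::FoundLeader);     // a leader already exists
    assert_eq!(role, Role::Follower);
    role = step(role, Event::ElectionTimeout); // campaign again
    role = step(role, Event::MajorityVotes);   // Candidate -> Leader
    assert_eq!(role, Role::Leader);
    role = step(role, Event::HigherTermMsg);   // Leader -> Follower
    assert_eq!(role, Role::Follower);
    println!("final role: {:?}", role);
}
```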
22. Raft - Log Replicated State Machine
(Diagram: a client sends writes through the leader's Raft module; each of the three replicas appends the same entries to its log (1: a <- 1, 2: b <- 2) and applies them in order to its state machine.)
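The replicated-state-machine idea fits in a few lines: every replica applies the same log in the same order, so all state machines converge to the same state. This is an illustrative sketch, not TiKV's actual apply path:

```rust
use std::collections::HashMap;

// A log entry; the slide's log contains two Puts: a <- 1, b <- 2.
enum Entry {
    Put { key: String, value: i64 },
}

// Deterministically apply a log to an empty state machine.
fn apply(log: &[Entry]) -> HashMap<String, i64> {
    let mut state = HashMap::new();
    for entry in log {
        match entry {
            Entry::Put { key, value } => {
                state.insert(key.clone(), *value);
            }
        }
    }
    state
}

fn main() {
    let log = vec![
        Entry::Put { key: "a".into(), value: 1 },
        Entry::Put { key: "b".into(), value: 2 },
    ];
    // Three replicas applying the same log reach the same state.
    let (r1, r2, r3) = (apply(&log), apply(&log), apply(&log));
    assert_eq!(r1, r2);
    assert_eq!(r2, r3);
    assert_eq!(r1.get("a"), Some(&1));
    println!("replicas agree");
}
```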
23. Raft - Optimization
● Leader appends logs and sends msgs in parallel
● Prevote
● Pipeline
● Batch
● Learner
● Lease-based Read
● Follower Read
24. A single Raft group can't manage a huge dataset.
So we need Multi-Raft!
25. Multi-Raft: Data Sharding
(Diagram: two ways to shard a dataset. Range sharding, used by TiKV, splits the key space into contiguous ranges: (-∞, a), [a, b), [b, +∞). Hash sharding maps each key's hash to one of Chunk 1, Chunk 2, Chunk 3.)
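Range sharding boils down to a sorted map from region start keys to regions: the region owning a key is the one with the greatest start key at or below it. A hypothetical routing table, not TiKV's real placement logic (which lives in PD and its clients):

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Maps each region's start key to a region id. An empty start key
/// stands for -infinity. The region owning `key` is the entry with
/// the greatest start key <= `key`.
fn region_for(regions: &BTreeMap<Vec<u8>, u64>, key: &[u8]) -> Option<u64> {
    regions
        .range::<[u8], _>((Bound::Unbounded, Bound::Included(key)))
        .next_back()
        .map(|(_, &id)| id)
}

fn main() {
    let mut regions = BTreeMap::new();
    regions.insert(b"".to_vec(), 1);  // (-inf, "a") -> region 1
    regions.insert(b"a".to_vec(), 2); // ["a", "b") -> region 2
    regions.insert(b"b".to_vec(), 3); // ["b", +inf) -> region 3

    assert_eq!(region_for(&regions, b"0"), Some(1));
    assert_eq!(region_for(&regions, b"apple"), Some(2));
    assert_eq!(region_for(&regions, b"zebra"), Some(3));
    println!("routing works");
}
```

The `BTreeMap` keeps start keys sorted, so a lookup is a logarithmic-time predecessor search rather than a scan over all regions.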
26. Multi-Raft in TiKV
(Diagram: with range sharding, the ranges A-B, B-C, and C-D become Region 1, Region 2, and Region 3; each region forms its own Raft group with replicas on Node 1, Node 2, and Node 3.)
27. Multi-Raft: Split and Merge
(Diagram: a Split divides Region A into Region A and a new Region B on the same nodes; a Merge combines Region A and Region B back into one region. Shown across Node 1 and Node 2.)
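At its core, a Split cuts a region's key range at a split key, producing two adjacent regions. The `Region` struct and `split` helper here are illustrative, not TiKV's real types:

```rust
#[derive(Debug, Clone, PartialEq)]
struct Region {
    id: u64,
    start_key: Vec<u8>, // inclusive
    end_key: Vec<u8>,   // exclusive; empty means +infinity
}

/// Splits `region` at `split_key` into a left half [start, split) and a
/// right half [split, end). Returns None if the key is not strictly
/// inside the region (splitting at a boundary would create an empty half).
fn split(region: &Region, split_key: &[u8], new_id: u64) -> Option<(Region, Region)> {
    let in_range = split_key > region.start_key.as_slice()
        && (region.end_key.is_empty() || split_key < region.end_key.as_slice());
    if !in_range {
        return None;
    }
    let left = Region {
        id: region.id,
        start_key: region.start_key.clone(),
        end_key: split_key.to_vec(),
    };
    let right = Region {
        id: new_id,
        start_key: split_key.to_vec(),
        end_key: region.end_key.clone(),
    };
    Some((left, right))
}

fn main() {
    let a = Region { id: 1, start_key: b"a".to_vec(), end_key: b"z".to_vec() };
    let (left, right) = split(&a, b"m", 2).unwrap();
    assert_eq!(left.end_key, b"m".to_vec());
    assert_eq!(right.start_key, b"m".to_vec());
    assert!(split(&a, b"a", 3).is_none()); // boundary split is rejected
    println!("split ok: {:?} | {:?}", left, right);
}
```

A Merge is the inverse: two adjacent regions whose ranges meet at the same key are combined back into one.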
33. Distributed Transaction
(Diagram: a transaction (Begin; Set a = 1; Set b = 2; Commit) touches keys in both Region 1 and Region 2, each a separate Raft group replicated across three nodes.)
34. Transaction in TiKV
● Optimized two-phase commit, inspired by Google Percolator
● Multi-version concurrency control
● Optimistic Commit
● Snapshot Isolation
● Use a Timestamp Oracle to allocate unique timestamps for transactions
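The MVCC plus snapshot-isolation combination above comes down to this: each key keeps multiple versions indexed by commit timestamp, and a read at `start_ts` sees the newest version committed at or before that timestamp. A toy in-memory store for illustration, not TiKV's Percolator-style lock/write/data column families:

```rust
use std::collections::BTreeMap;

struct MvccStore {
    // key -> (commit_ts -> value)
    versions: BTreeMap<Vec<u8>, BTreeMap<u64, Vec<u8>>>,
}

impl MvccStore {
    fn new() -> Self {
        MvccStore { versions: BTreeMap::new() }
    }

    // Record a committed version of `key` at `commit_ts`.
    fn commit(&mut self, key: &[u8], commit_ts: u64, value: &[u8]) {
        self.versions
            .entry(key.to_vec())
            .or_insert_with(BTreeMap::new)
            .insert(commit_ts, value.to_vec());
    }

    /// Snapshot read: the newest version with commit_ts <= start_ts.
    fn get(&self, key: &[u8], start_ts: u64) -> Option<&[u8]> {
        self.versions
            .get(key)
            .and_then(|vs| vs.range(..=start_ts).next_back())
            .map(|(_, v)| v.as_slice())
    }
}

fn main() {
    let mut store = MvccStore::new();
    store.commit(b"a", 10, b"1");
    store.commit(b"a", 20, b"2");

    // A transaction that started at ts 15 still sees the ts-10 version,
    // even though a newer version was committed at ts 20.
    assert_eq!(store.get(b"a", 15), Some(&b"1"[..]));
    assert_eq!(store.get(b"a", 25), Some(&b"2"[..]));
    assert_eq!(store.get(b"a", 5), None);
    println!("snapshot reads ok");
}
```

Because old versions are never overwritten, readers never block writers; the Timestamp Oracle's globally unique, monotonically increasing timestamps are what make "at or before `start_ts`" a consistent snapshot.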
35. Percolator Optimization
● Use a latch on TiDB to support pessimistic commit
● Concurrent Prewrite
○ We are formally proving it with TLA+
47. Scheduler - Region Size Balance
Some regions are very hot for Read/Write.
(Diagram: regions R1-R6 across the cluster, labeled Hot, Cold, or Normal.)
48. Scheduler - Hot Balance
TiKV reports region Read/Write traffic to PD.
(Diagram: PD redistributes regions R1-R6 across stores so that hot regions are spread evenly.)
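The hot-balance idea can be sketched as: given the per-region traffic each store reports, move the hottest region off the busiest store onto the least-loaded one. This is a deliberately simplified stand-in for PD's hot-region scheduler, which weighs many more factors:

```rust
use std::collections::HashMap;

/// store id -> list of (region id, reported traffic in bytes/s)
type Stats = HashMap<u64, Vec<(u64, u64)>>;

// Total reported traffic on one store.
fn load(regions: &[(u64, u64)]) -> u64 {
    regions.iter().map(|&(_, t)| t).sum()
}

/// Plans one balancing move: (region, from_store, to_store), if any.
fn plan_move(stats: &Stats) -> Option<(u64, u64, u64)> {
    let (&from, from_regions) = stats.iter().max_by_key(|&(_, rs)| load(rs))?;
    let (&to, _) = stats.iter().min_by_key(|&(_, rs)| load(rs))?;
    if from == to {
        return None; // already balanced (or a single-store cluster)
    }
    // Pick the hottest region on the overloaded store.
    let &(region, _) = from_regions.iter().max_by_key(|&&(_, t)| t)?;
    Some((region, from, to))
}

fn main() {
    let mut stats: Stats = HashMap::new();
    stats.insert(1, vec![(101, 900), (102, 100)]); // store 1 is hot
    stats.insert(2, vec![(103, 50)]);
    stats.insert(3, vec![(104, 60)]);

    let (region, from, to) = plan_move(&stats).unwrap();
    assert_eq!((region, from, to), (101, 1, 2));
    println!("move region {} from store {} to store {}", region, from, to);
}
```

A real scheduler would also rate-limit such moves (cf. OpInfluence on the next slide) so a burst of traffic does not trigger a storm of migrations.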
49. Scheduler - More
● Weight Balance - a high-weight TiKV node stores more data
● Evict Leader Balance - some TiKV nodes must not hold any Raft leader
● OpInfluence - avoids overly frequent rebalancing
51. Scheduler - Cross DC
(Diagram: replicas of R1 and R2 are placed on racks across multiple data centers, so that no single DC or rack holds all replicas of a region.)
52. Scheduler - Three DCs in Two Cities
(Diagram: racks in DC Seattle 1, DC Seattle 2, and DC Santa Clara each hold replicas of R1 and R2, with primed replicas R1' and R2' marking placements changed by the scheduler.)
54. Test
● Unit Test
● Integration Test
● Performance Test
● Linearizability Test
● Jepsen Test
● Chaos Test
○ Published on The New Stack: https://thenewstack.io/chaos-tools-and-techniques-for-testing-the-tidb-distributed-newsql-database
59. To sum up, TiKV is ...
● An open-source, unifying distributed storage layer that supports:
○ Strong consistency
○ ACID compliance
○ Horizontal scalability
○ Cloud-native architecture
● A building block that simplifies building other systems
○ So far: TiDB (MySQL), TiSpark (SparkSQL), Toutiao.com (metadata service for their own S3), Ele.me (Redis protocol layer)
○ Sky is the limit!