Building a Transactional Key-Value Store
That Scales to 100+ Nodes
Siddon Tang at PingCAP
(Twitter: @siddontang; @pingcap)
1
About Me
● Chief Engineer at PingCAP
● Leader of TiKV project
● My other open-source projects:
○ go-mysql
○ go-mysql-elasticsearch
○ LedisDB
○ raft-rs
○ etc.
2
Agenda
● Why did we build TiKV?
● How do we build TiKV?
● Going beyond TiKV
3
Why?
Is it worthwhile to build another Key-Value store?
4
We want to build a
distributed relational database
to solve the scaling problem of MySQL!!!
5
Inspired by Google F1 + Spanner
● Google: Client → F1 → Spanner
● Ours: MySQL Client → TiDB → TiKV
6
How?
7
A High Building,
A Low Foundation
8
What we need to build...
1. A high-performance Key-Value engine to store data
2. A consensus model to ensure data consistency across machines
3. A transaction model to meet ACID compliance across machines
4. A network framework for communication
5. A scheduler to manage the whole cluster
9
Choose a Language!
10
Hello Rust
11
Rust...?
12
Rust - Cons (2 years ago):
● Makes you think differently
● Long compile time
● Lack of libraries and tools
● Few Rust programmers
● Uncertain future
[Chart: Rust learning curve over time]
13
Rust - Pros:
● Blazing Fast
● Memory safety
● Thread safety
● No GC
● Fast FFI
● Vibrant package ecosystem
14
Let’s start from the beginning!
15
Key-Value engine
16
Why RocksDB?
● High Write/Read Performance
● Stability
● Easy to embed in Rust
● Rich functionality
● Continuous development
● Active community
17
RocksDB: the data lives on a single machine.
We need fault tolerance.
18
Consensus Algorithm
19
Raft - Roles
● Leader
● Follower
● Candidate
20
Raft - Election
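● Start → Follower
● Follower → Candidate: election timeout, start new election
● Candidate → Candidate: election timeout, re-campaign
● Candidate → Leader: receives majority vote
● Candidate → Follower: finds leader or receives higher-term msg
● Leader → Follower: receives higher-term msg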
21
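The role transitions on the election slide can be sketched as a small state machine. This is only an illustrative sketch, not TiKV's or raft-rs's implementation: the `Node` class and its method names are hypothetical, and logs, timers, and networking are left out.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


class Node:
    """Tracks only the role and term of one Raft peer (hypothetical helper)."""

    def __init__(self, cluster_size: int):
        self.cluster_size = cluster_size
        self.role = Role.FOLLOWER  # every node starts as a follower
        self.term = 0
        self.votes = 0

    def on_election_timeout(self) -> None:
        # Follower -> Candidate, or Candidate -> Candidate (re-campaign).
        if self.role is Role.LEADER:
            return
        self.role = Role.CANDIDATE
        self.term += 1
        self.votes = 1  # a candidate votes for itself

    def on_vote_granted(self) -> None:
        # Candidate -> Leader once a strict majority has voted for it.
        if self.role is not Role.CANDIDATE:
            return
        self.votes += 1
        if self.votes > self.cluster_size // 2:
            self.role = Role.LEADER

    def on_message(self, term: int, from_leader: bool = False) -> None:
        # Any role -> Follower on a higher term; a candidate also steps
        # down when it hears from a leader of its own term.
        if term > self.term:
            self.term = term
            self.role = Role.FOLLOWER
        elif term == self.term and from_leader and self.role is Role.CANDIDATE:
            self.role = Role.FOLLOWER
```

In a three-node cluster, one granted vote plus the candidate's own vote is already a majority, so a single `on_vote_granted()` call after a timeout makes the node leader.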
Raft - Log Replicated State Machine
[Figure: a Client sends commands to the Raft Module of one replica. Each of the three replicas holds the same Log (a <- 1, b <- 2) and applies it to its own State Machine (a = 1, b = 2).]
22
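A minimal sketch of the replicated-log idea follows, assuming an in-process model with no terms, elections, or network. The `Replica` class and `replicate` function are hypothetical names for illustration: a command is applied to the state machines only after a majority of replicas hold it in their logs.

```python
class Replica:
    def __init__(self):
        self.log = []      # ordered (key, value) commands
        self.applied = 0   # number of log entries already applied
        self.state = {}    # the key-value state machine

    def apply_up_to(self, commit_index):
        # Apply committed entries in log order, exactly once each.
        while self.applied < commit_index:
            key, value = self.log[self.applied]
            self.state[key] = value
            self.applied += 1


def replicate(leader, followers, entry, reachable):
    """Append on the leader, copy to reachable followers, commit on majority.

    `reachable` is a set of follower indexes that respond in this round.
    """
    leader.log.append(entry)
    acks = 1  # the leader's own copy
    for i, follower in enumerate(followers):
        if i in reachable:
            # Catch the follower up on every entry it is missing.
            follower.log.extend(leader.log[len(follower.log):])
            acks += 1
    if acks * 2 <= len(followers) + 1:
        return False  # no majority: the entry stays uncommitted
    commit_index = len(leader.log)
    leader.apply_up_to(commit_index)
    for i, follower in enumerate(followers):
        if i in reachable:
            follower.apply_up_to(commit_index)
    return True
```

With three replicas, the write still commits when one follower is unreachable; the lagging follower catches up the next time it responds.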
Raft - Optimization
● Leader appends logs and sends msgs in parallel
● Prevote
● Pipeline
● Batch
● Learner
● Lease-based Read
● Follower Read
23
A single Raft group can’t manage a huge dataset.
So we need Multi-Raft!!!
24
Multi-Raft: Data sharding
● Range Sharding (TiKV): Dataset → key ranges (-∞, a), [a, b), [b, +∞)
● Hash Sharding: Dataset → Key Hash → Chunk 1, Chunk 2, Chunk 3
25
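The two sharding schemes can be compared with a short sketch. The split keys "a" and "b" come from the slide; the function names and the choice of SHA-256 as the hash are assumptions made for illustration.

```python
import bisect
import hashlib

# Split keys "a" and "b" give three ranges: (-inf, a), [a, b), [b, +inf).
SPLIT_KEYS = ["a", "b"]


def range_shard(key: str) -> int:
    """Index of the range that contains key; ranges are left-closed."""
    return bisect.bisect_right(SPLIT_KEYS, key)


def hash_shard(key: str, chunks: int = 3) -> int:
    """Chunk chosen from a stable hash of the key."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % chunks
```

Range sharding keeps adjacent keys in the same shard, so a scan over a key range touches few shards; hash sharding spreads keys evenly but scatters neighbouring keys.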
Multi-Raft in TiKV
[Figure: Range Sharding. Node 1, Node 2 and Node 3 each hold one replica of Region 1 (A - B), Region 2 (B - C) and Region 3 (C - D); the three replicas of each Region form one Raft Group.]
26
Multi-Raft: Split and Merge
[Figure: on Node 1 and Node 2, Split turns Region A into Region A and Region B; Merge turns Region A and Region B back into Region A.]
27
Multi-Raft: Scalability
How to move Region A?
● Node 1: Region A’, Region B’
● Node 2: empty
28
Multi-Raft: Scalability
How to move Region A? Step 1: Add Replica
● Node 1: Region A’, Region B’
● Node 2: Region A
29
Multi-Raft: Scalability
How to move Region A? Step 2: Transfer Leader
● Node 1: Region A, Region B’
● Node 2: Region A’
30
Multi-Raft: Scalability
How to move Region A? Step 3: Remove Replica
● Node 1: Region B’
● Node 2: Region A’
31
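The three steps (Add Replica, Transfer Leader, Remove Replica) can be sketched as operations on a Region's placement metadata. The `Region` class and `move_region` helper below are hypothetical and only model the bookkeeping, not the Raft membership changes or data copying behind each step.

```python
class Region:
    """Replica placement of one Region: which nodes hold it, who leads."""

    def __init__(self, name, replicas, leader):
        self.name = name
        self.replicas = set(replicas)
        self.leader = leader

    def add_replica(self, node):
        self.replicas.add(node)

    def transfer_leader(self, node):
        if node not in self.replicas:
            raise ValueError("leader must be an existing replica")
        self.leader = node

    def remove_replica(self, node):
        if node == self.leader:
            raise ValueError("transfer leadership before removing the leader")
        self.replicas.discard(node)


def move_region(region, source, target):
    # Step 1: the target gets a copy while the source keeps serving.
    region.add_replica(target)
    # Step 2: leadership leaves the source, if the source holds it.
    if region.leader == source:
        region.transfer_leader(target)
    # Step 3: the old copy is dropped.
    region.remove_replica(source)
```

Ordering the steps this way means the Region never drops below its original replica count and never loses its leader during the move.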
How to ensure cross-region data consistency?
32
Distributed Transaction
[Figure: one transaction (Begin; Set a = 1; Set b = 2; Commit) spans Region 1 and Region 2; each Region has three replicas that form a Raft Group.]
33
Transaction in TiKV
● Optimized two-phase commit, inspired by Google Percolator
● Multi-version concurrency control (MVCC)
● Optimistic commit
● Snapshot Isolation
● Uses a Timestamp Oracle to allocate unique timestamps for transactions
34
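A heavily simplified, single-process sketch of how these pieces fit together is shown below. It is not TiKV's implementation: the `Store`, `Txn`, and `Conflict` names are hypothetical, the Timestamp Oracle is a local counter, and Percolator's primary lock, lock checks on reads, and crash recovery are all omitted.

```python
import itertools


class Conflict(Exception):
    pass


class Store:
    def __init__(self):
        self._tso = itertools.count(1)  # stand-in for the Timestamp Oracle
        self.versions = {}              # key -> [(commit_ts, value), ...]
        self.locks = {}                 # key -> start_ts of the locking txn

    def next_ts(self):
        return next(self._tso)

    def read(self, key, start_ts):
        """Snapshot read: newest version committed at or before start_ts."""
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= start_ts:
                return value
        return None


class Txn:
    def __init__(self, store):
        self.store = store
        self.start_ts = store.next_ts()
        self.writes = {}  # buffered locally until commit (optimistic)

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        return self.store.read(key, self.start_ts)

    def set(self, key, value):
        self.writes[key] = value

    def commit(self):
        s = self.store
        # Phase 1 (prewrite): lock every key, failing on any conflict.
        locked = []
        try:
            for key in self.writes:
                newest = s.versions.get(key, [(0, None)])[-1][0]
                if key in s.locks or newest > self.start_ts:
                    raise Conflict(key)
                s.locks[key] = self.start_ts
                locked.append(key)
        except Conflict:
            for key in locked:
                del s.locks[key]
            raise
        # Phase 2 (commit): publish all writes at one commit timestamp.
        commit_ts = s.next_ts()
        for key, value in self.writes.items():
            s.versions.setdefault(key, []).append((commit_ts, value))
            del s.locks[key]
        return commit_ts
```

A transaction that started before another one committed keeps reading its own snapshot, and its commit fails if it tries to overwrite a key that was committed after its start timestamp.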
Percolator Optimization
● Use a latch on TiDB to support pessimistic commit
● Concurrent Prewrite
○ We are formally proving it with TLA+
35
How to communicate with each other?
RPC Framework!
36
Hello gRPC
37
Why gRPC?
● Widely used
● Supported by many languages
● Works with Protocol Buffers and FlatBuffers
● Rich interface
● Benefits from HTTP/2
38
TiKV Stack
[Figure: the Client talks gRPC to three TiKV nodes. Each node stacks Txn KV API, Transaction, Raft, and RocksDB from top to bottom; the nodes talk gRPC to each other, and their Raft layers form a Raft Group.]
39
How to manage 100+ nodes?
40
Scheduler - Goal
● Balance load and data size across nodes
● Avoid hotspot performance issues
41
Scheduler in TiKV
[Figure: a cluster of three PD nodes (Placement Drivers) watches over eight TiKV nodes: "We are Gods!!!"]
42
Scheduler - How
● TiKV → PD: Store Heartbeat, Region Heartbeat
● PD → TiKV: Schedule Operator (Add Replica, Remove Replica, Transfer Leader, ...)
● PD cluster: PD’, PD, PD
43
Scheduler - Region Count Balance
Assume the Regions have the same size
[Figure: Regions R1 to R6 before and after balancing by Region count]
44
Scheduler - Region Count Balance
Regions’ sizes are not the same
R1 - 0 MB
R2 - 0 MB
R3 - 0 MB
R4 - 64 MB
R5 - 64 MB
R6 - 96 MB
45
Scheduler - Region Size Balance
Use size for calculation
● Before: R1 - 0 MB, R2 - 0 MB, R3 - 0 MB, R4 - 64 MB, R5 - 64 MB, R6 - 96 MB
● After: R1 - 0 MB, R5 - 64 MB, R3 - 0 MB, R4 - 64 MB, R2 - 0 MB, R6 - 96 MB
46
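One way to balance by size rather than by count is a greedy step that moves a single Region from the fullest store to the emptiest one. This is a sketch of the idea, not PD's actual algorithm; `balance_step` and the store layout are hypothetical, and the grouping of Regions into two stores in the usage below is an assumption because the slide's figure does not survive in the transcript.

```python
def balance_step(stores):
    """Move one Region from the fullest store to the emptiest one.

    `stores` maps store name -> {region name: size in MB}.
    Returns (region, source, target), or None if no move helps.
    """
    load = {name: sum(regions.values()) for name, regions in stores.items()}
    src = max(load, key=load.get)
    dst = min(load, key=load.get)
    gap = load[src] - load[dst]
    # Only a Region smaller than the gap narrows it when moved.
    candidates = {r: size for r, size in stores[src].items() if 0 < size < gap}
    if not candidates:
        return None
    # A Region whose size is closest to half the gap balances the pair best.
    region = min(candidates, key=lambda r: abs(candidates[r] - gap / 2))
    stores[dst][region] = stores[src].pop(region)
    return region, src, dst
```

With the sizes from the slide and an assumed layout of {R1, R2, R3} on one store and {R4, R5, R6} on another, the first step moves R6 (96 MB) and leaves loads of 96 MB and 128 MB; a second step finds no Region small enough to narrow the remaining 32 MB gap.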
Scheduler - Region Size Balance
Some regions are very hot for Read/Write
[Figure: Regions R1 to R6 marked Hot, Normal, or Cold]
47
Scheduler - Hot Balance
TiKV reports region Read/Write traffic to PD
[Figure: Regions R1 to R6 before and after hot balancing; R2 and R3 swap places]
48
Scheduler - More
● More…
○ Weight Balance - A high-weight TiKV node stores more data
○ Evict Leader Balance - Some TiKV nodes cannot hold any Raft leader
● OpInfluence - Avoid overly frequent balancing
49
Geo-Replication
50
Scheduler - Cross DC
● Before: one DC holds R1 and R1, one DC holds R2 and R2, one DC holds R1 and R2 (one replica per Rack)
● After: every DC holds one replica of R1 and one of R2 (one replica per Rack)
51
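Spreading replicas across DCs and racks can be sketched as picking the candidate store that is most isolated from the replicas that already exist. The scoring weights and the names `isolation_score` and `pick_store` are assumptions for illustration; the slides do not say how PD scores placements.

```python
def isolation_score(candidate, existing):
    """Higher is better: 2 points per existing replica in another DC,
    1 point per existing replica in the same DC but another rack."""
    score = 0
    for dc, rack in existing:
        if dc != candidate[0]:
            score += 2
        elif rack != candidate[1]:
            score += 1
    return score


def pick_store(stores, existing):
    """Choose the free (dc, rack) location most isolated from `existing`."""
    free = [s for s in stores if s not in existing]
    return max(free, key=lambda s: isolation_score(s, existing))
```

Starting from one replica in the first DC, repeated calls place the second and third replicas in the other two DCs before any DC receives a second copy.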
Scheduler - three DCs in two cities
● Before: DC - Seattle 1 (R1, R2), DC - Seattle 2 (R1, R2), DC - Santa Clara (R1’, R2’)
● After: DC - Seattle 1 (R1’, R2), DC - Seattle 2 (R1, R2’), DC - Santa Clara (R1, R2)
● Each replica sits on its own Rack
52
How to ensure data safety?
53
Test
● Unit Test
● Integration Test
● Performance Test
● Linearizability Test
● Jepsen Test
● Chaos Test
○ Published on The New Stack https://thenewstack.io/chaos-tools-and-techniques-for-testing-the-tidb-distributed-newsql-database
54
Going beyond TiKV
55
TiDB HTAP Solution
[Figure: TiDB HTAP architecture. Components shown: Application, Syncer, TiDB Cluster (TiDB nodes), Spark Cluster (Spark Driver and Workers running TiSpark, queried through SparkSQL), TiKV Cluster (Storage), and PD Cluster. Labels on the connections: DistSQL API, KV API, Job, Metadata, Data location, TSO/Data location.]
Cloud-Native
Who’s Using TiKV Now?
To sum up, TiKV is ...
● An open-source, unifying distributed storage layer that supports:
○ Strong consistency
○ ACID compliance
○ Horizontal scalability
○ Cloud-native architecture
● Building block to simplify building other systems
○ So far: TiDB (MySQL), TiSpark (SparkSQL), Toutiao.com (metadata service for their own S3), Ele.me (Redis Protocol Layer)
○ Sky is the limit!
59
Thank you!
TiKV: https://github.com/pingcap/tikv
Email: tl@pingcap.com
Github: siddontang
Twitter: @siddontang; @pingcap
60


Editor's notes

  1. Beijing Bank to support their core banking system