AF Ceph: Ceph Performance Analysis & Improvement on Flash
Byung-Su Park
SDS Tech. Lab, Corporate R&D Center
SK Telecom
Why we care about All-Flash Storage …
Flash device: High Performance, Low Latency, SLA
Demand drivers: 5G, UHD, 4K
Transforming to 5G Network
New ICT infrastructure should be Programmable, Scalable, Flexible, and Cost Effective:
Software-Defined Technologies based on Open Software & Open Hardware.
5G means Massive Connectivity, 10x lower latency, 100x-1000x higher speed, Efficiency & Reliability, and Virtualization.
Open HW & SW Projects @ SKT
Open Software: OpenStack, ONOS, Ceph, Cloud Foundry, Hadoop …
Open Hardware: Open Compute Project (OCP), Telecom Infra Project (TIP); All-Flash Storage, Server Switch, Telco-Specific H/W …
Software-Defined Technologies
Why we care about All-Flash Ceph …
Ceph is Scalable, Available, and Reliable, with a Unified Interface on an Open Platform; flash brings High Performance and Low Latency.
Put them together: All-Flash Ceph!
Agenda
 All-flash Ceph Storage Cluster Environment
 Performance Issues on All-flash Ceph (Ver. Hammer)
 OSD Write Operation Latency Analysis
 Optimizing Ceph OSD details & Results
 Ceph deployment in SKT Private Cloud
 Operations & Maintenance Tool
 The Future of All-flash Ceph
All-flash Ceph Storage Cluster Environment
Ceph Clients reach the 4-node Ceph cluster through a 10GbE Service Network switch; a second 10GbE switch carries the Storage Network.

Ceph Clients:
CPU: 2x E5-2660v3
DRAM: 256GB, Network: Intel 10GbE NIC
Linux: CentOS 7.0 (w/ KRBD), Kernel: 3.16.4 or 4.1.6

Ceph Nodes (4):
CPU: 2x E5-2690v3
DRAM: 128GB, Network: Intel 2x 10GbE NIC
Linux: CentOS 7.0, Kernel: 3.10.0-123.el7.x86_64
Ceph Version: Hammer based
Per-node storage: NVRAM for the Journal, 10x SATA SSD for the Data Store
Performance Issues on All-Flash Ceph (Ver. Hammer)
 Issue: Low Throughput & High Latency
• SSD Spec.
 4KB Random Read: up to 95K IOPS
 4KB Random Write: up to 85K IOPS (Sustained: 14K IOPS)
 < 1 ms Latency
• Theoretical Max IOPS
 4KB Random Read: 95K x 10 SSDs x 4 Nodes = 3,800K IOPS
 4KB Random Write: 14K x 10 SSDs x 4 Nodes / 2 (Replication) = 280K IOPS
※ RBD image filled 100%; clients access Ceph via krbd
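The write ceiling is just the sustained per-SSD write rate aggregated across the cluster and divided by the replication factor, since every client write is stored twice:

$$
\text{Max 4KB Random Write} \;=\; \frac{14\text{K IOPS} \times 10\ \text{SSDs} \times 4\ \text{nodes}}{2\ \text{replicas}} \;=\; 280\text{K IOPS}
$$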
Measured results vs. theoretical maximum:

Workload         | Result    | Theoretical Maximum
4KB Random Read  | 154K IOPS | 3,800K IOPS
4KB Random Write | 42K IOPS  | 280K IOPS

Random workload (IOPS and latency):

Workload | KIOPS | Latency (ms)
4KB RW   | 42    | 7.7
32KB RW  | 27    | 12.1
4KB RR   | 154   | 2.1
32KB RR  | 109   | 2.9

Sequential workload (bandwidth and latency):

Workload | GB/s | Latency (ms)
512KB SW | 1.6  | 97.7
4MB SW   | 2.1  | 600.6
512KB SR | 4.1  | 38.5
4MB SR   | 4.0  | 315.0
Ceph Write IO Flow: Receiving Request
(Primary and secondary OSDs have the same structure: Messenger, Operation WQ, Operation Threads, ReplicatedBackend, Journal.)
1. The client sends the write request over the Public Network.
2. The primary OSD's Messenger receives the write request and data.
3. The op is queued to the Operation WQ.
4. An Operation Thread dequeues and executes the op (PG Lock held, PG Unlock after).
5. The ReplicatedBackend sends rep operations to the 2nd OSD over the Cluster Network, where they are queued to that OSD's Operation WQ.
6. The transaction is enqueued to the Journal write queue; this section also runs under the PG Lock.
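Steps 3-6 execute with a single per-PG mutex held across the whole handler. A minimal sketch of that Hammer-era pattern; the types below are hypothetical stand-ins for Ceph's PG and OpRequest classes, not actual Ceph source:

```cpp
#include <mutex>

// Hypothetical stand-ins for Ceph's OpRequest and PG classes.
struct OpRequest { /* decoded client write */ };

struct PG {
  std::mutex pg_lock;              // the coarse-grained "PG Lock"
  void do_request(OpRequest& op) {
    // Prepare the object transaction, queue RepOps to the secondary
    // OSD, and enqueue the journal transaction: all of it runs while
    // pg_lock is held, so every other op on this PG has to wait.
    (void)op;
  }
};

// One dequeued op is executed end to end under its PG's mutex.
void dequeue_op(PG& pg, OpRequest& op) {
  std::lock_guard<std::mutex> guard(pg.pg_lock);
  pg.do_request(op);               // steps 3-6 serialized here
}
```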
Ceph Write IO Flow: Transaction Execution
1. An Operation Thread queues the transaction to the Journal writeq (under PG Lock).
2. The journal Writer Thread processes queued journal transactions.
3. The Writer Thread issues an AIO write to the Journal Disk.
4. The AIO write completes ("Committed").
5. The op is queued to the FileStore Operation WQ.
6. The journal completion is queued to a Finisher Thread, which checks journal-or-data completion and sends the RepOp reply to the primary if this is a secondary OSD.
7. A FileStore Operation Thread issues a buffered write to the Data Disk ("Applied"; under PG Lock).
8. The data completion is queued to the FileStore Finisher Thread.
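Each hop in this pipeline is a queue drained by a dedicated thread (writeq by the Writer Thread, FinisherQ by a Finisher Thread). A self-contained sketch of one such stage; this is illustrative code, not Ceph's actual Finisher implementation:

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// One pipeline stage: producers enqueue completion callbacks,
// a dedicated thread drains and runs them.
class Finisher {
  std::mutex m;
  std::condition_variable cv;
  std::queue<std::function<void()>> q;
  std::atomic<bool> stop{false};
  std::thread t;

  void run() {
    std::unique_lock<std::mutex> l(m);
    while (!stop || !q.empty()) {
      cv.wait(l, [&] { return stop || !q.empty(); });
      while (!q.empty()) {
        auto cb = std::move(q.front());
        q.pop();
        l.unlock();
        cb();          // e.g. the "journal committed" callback (step 6)
        l.lock();
      }
    }
  }

public:
  Finisher() : t([this] { run(); }) {}
  ~Finisher() {
    { std::lock_guard<std::mutex> l(m); stop = true; }
    cv.notify_one();
    t.join();
  }
  void queue(std::function<void()> cb) {
    { std::lock_guard<std::mutex> l(m); q.push(std::move(cb)); }
    cv.notify_one();
  }
};
```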
Ceph Write IO Flow: Send ACK to Client
Steps 1-6 are as before: the client sends the write request, the primary OSD receives and queues it, an Operation Thread executes it, the ReplicatedBackend sends rep operations to the 2nd OSD, and the transaction is enqueued to the FileStore. Then:
A. The secondary OSD sends its RepOp reply over the Cluster Network.
B. The primary's Messenger receives the RepOp reply.
C. The RepOp reply is queued to the Operation WQ.
D. An Operation Thread processes it (again under PG Lock / PG Unlock).
E. When all journal-or-data completions have arrived, the primary prepares the ACK for the client.
F. The Messenger sends the ACK over the Public Network.
G. The client receives the ACK.
OSD Write Operation Latency Analysis
Measured timeline of one write op through the primary OSD (layers: Messenger, OSD Operation WQ, PG Backend, FileStore, Journal; commits arrive from peer nodes Peer1 and Peer2):

Event                                         | Time
Message header received (Messenger)           | 0 ms
Enqueued to Operation WQ                      | 0.262 ms
Dequeued from Operation WQ                    | 1.029 ms
Op submitted to PG Backend                    | 4.048 ms
Transaction sent to FileStore                 | 6.663 ms
Enqueued to Journal                           | 7.379 ms
Dequeued from JournalQ                        | 7.674 ms
Journal write complete, enqueued to FinisherQ | 8.228 ms
Local commit sent to PG Backend               | 9.349 ms
First SubOp commit received                   | 9.819 ms
First SubOp commit sent to PG Backend         | 11.015 ms
Second SubOp commit received                  | 15.605 ms
Second SubOp commit sent to PG Backend        | 16.747 ms

※ 4K Random Write, QEMU 3 Clients, 4 x 12 OSDs (3 Replica), 16 Jobs x 16 IO Depth FIO Test, 600s Ramp Time, 600s Runtime

1. There are many areas where PG locking occurs during one write operation:
 Write OP execution section in the Operation Thread
 Journal commit section (FileStore submit section)
 Reply-handling section for the secondary OSDs' replies in the Operation Thread
2. PG locks are coarse-grained during one write operation:
 about 10 ms of the total 17 ms is spent under PG locks
Optimizing Ceph OSD A. PG Lock related Issues
 Too many heavy locks → gather the redundant processing code together and hand it to a dedicated thread
 Delayed client ACKs → return the ACK as soon as possible
 Ops waiting behind another PG's ops in the op queue → make only ops of the same PG wait (see the sketch below)
(The latency timeline from the previous slide is repeated here with the PG-lock sections highlighted.)
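A sketch of the third idea, per-PG waiting: keep one FIFO per PG so an op only queues behind ops of its own PG, never behind another PG's. Types and names here are illustrative, not the actual SKT patch:

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <unordered_map>

using Op = std::function<void()>;

class PerPGQueue {
  std::mutex m;
  std::unordered_map<int, std::deque<Op>> pg_fifo;  // pg_id -> pending ops

public:
  // Returns true if the op can run immediately (no same-PG op ahead of it).
  bool enqueue(int pg_id, Op op) {
    std::lock_guard<std::mutex> l(m);
    auto& fifo = pg_fifo[pg_id];
    fifo.push_back(std::move(op));
    return fifo.size() == 1;
  }

  // Called when an op of pg_id finishes; returns the next same-PG op,
  // or an empty Op if the PG's queue has drained.
  Op complete(int pg_id) {
    std::lock_guard<std::mutex> l(m);
    auto& fifo = pg_fifo[pg_id];
    fifo.pop_front();
    return fifo.empty() ? Op{} : fifo.front();
  }
};
```

A worker pool would call enqueue() on arrival, run the op right away if true is returned, and on completion run whatever complete() hands back; ops of different PGs never block one another.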
Evaluation Results A. PG Lock related Issues

Random workload (KIOPS):

Workload | Hammer | Opt. A
4KB RW   | 42     | 55
32KB RW  | 27     | 29
4KB RR   | 154    | 268
32KB RR  | 109    | 108

Sequential workload (GB/s):

Workload | Hammer | Opt. A
512KB SW | 1.6    | 1.7
4MB SW   | 2.1    | 2.3
512KB SR | 4.1    | 4.0
4MB SR   | 4.0    | 4.0

 Performance Improvement
• Random 4KB Write: 42K → 55K IOPS (13K ↑)
• Random 4KB Read: 154K → 268K IOPS (114K ↑)
Optimizing Ceph OSD B. Async Logging & System Tuning
 Long logging time → split logging off into a separate thread and do it asynchronously
 HDD-based throttling configuration → change the configuration from HDD-based to SSD-based
 tcmalloc is too CPU intensive → use jemalloc
 Batching rule in the TCP/IP stack → turn TCP/IP batching off to reduce latency (see the sketch below)
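On the last point: for a latency-sensitive messenger socket, "turning off the TCP/IP batching rule" conventionally means disabling Nagle's algorithm, so small replication messages go out immediately instead of being coalesced. A minimal sketch under that assumption:

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Disable Nagle's algorithm on a connected TCP socket so small
// writes are sent immediately rather than batched.
// Returns 0 on success, -1 on error (see errno).
int disable_tcp_batching(int sock_fd) {
  int one = 1;
  return setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```

The jemalloc swap needs no code change: the allocator can be linked in at build time or preloaded when the OSD process starts.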
Evaluation Results B. Async Logging & System Tuning

 Performance Improvement
• Random 4KB Write: 42K → 61K IOPS (19K ↑)
• Random 4KB Read: 154K → 285K IOPS (131K ↑)

Random workload (KIOPS):

Workload | Hammer | Opt. A | Opt. B
4KB RW   | 42     | 55     | 61
32KB RW  | 27     | 29     | 29
4KB RR   | 154    | 268    | 285
32KB RR  | 109    | 108    | 103

Sequential workload (GB/s):

Workload | Hammer | Opt. A | Opt. B
512KB SW | 1.6    | 1.7    | 1.6
4MB SW   | 2.1    | 2.3    | 2.1
512KB SR | 4.1    | 4.0    | 4.1
4MB SR   | 4.0    | 4.0    | 4.1
Optimizing Ceph OSD C. Lightweight Transaction
A write request travels from the Operation WQ thread through the Journal and FileStore to the SSD as a heavyweight transaction carrying several sub-operations: OMAP write, file-system data write, and xattr write.
 Transaction writing overhead → merge transaction sub-operations and reduce the weight of the transaction (see the sketch below)
 Transaction lock contention → increase the related cache size to prevent lock contention
 Useless system calls when small workloads execute → remove the useless system calls
 HDD-based DB configuration → change the configuration from HDD-based to SSD-based
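A sketch of the merging idea: instead of encoding and journaling the OMAP, data, and xattr sub-operations separately, collect them per request and encode once. This illustrates the concept only; it is not the actual SKT patch:

```cpp
#include <string>
#include <vector>

// One write request carries several sub-operations; a lightweight
// transaction coalesces them into a single journal record.
struct SubOp {
  std::string kind;   // "write", "xattr", or "omap"
  std::string key;
  std::string value;
};

struct LightTxn {
  std::vector<SubOp> ops;
  void write(std::string obj, std::string data) { ops.push_back({"write", obj, data}); }
  void setxattr(std::string obj, std::string v) { ops.push_back({"xattr", obj, v}); }
  void omap_set(std::string obj, std::string v) { ops.push_back({"omap", obj, v}); }

  // One encode pass and one journal append for the whole request,
  // rather than one per sub-operation.
  std::string encode() const {
    std::string buf;
    for (const SubOp& op : ops)
      buf += op.kind + '\0' + op.key + '\0' + op.value + '\0';
    return buf;
  }
};
```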
Evaluation Results C. Lightweight Transaction

 Performance Improvement
• Random 4KB Write: 42K → 89K IOPS (47K ↑)
• Random 4KB Read: 154K → 321K IOPS (167K ↑)

Random workload (KIOPS):

Workload | Hammer | Opt. A | Opt. B | Opt. C
4KB RW   | 42     | 55     | 61     | 89
32KB RW  | 27     | 29     | 29     | 29
4KB RR   | 154    | 268    | 285    | 321
32KB RR  | 109    | 108    | 103    | 103

Sequential workload (GB/s):

Workload | Hammer | Opt. A | Opt. B | Opt. C
512KB SW | 1.6    | 1.7    | 1.6    | 1.6
4MB SW   | 2.1    | 2.3    | 2.1    | 2.0
512KB SR | 4.1    | 4.0    | 4.1    | 3.9
4MB SR   | 4.0    | 4.0    | 4.1    | 4.1
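Taken together, the three stages roughly double small random IO on identical hardware:

$$
\frac{89\text{K}}{42\text{K}} \approx 2.1\times \ \text{(4KB random write)}, \qquad
\frac{321\text{K}}{154\text{K}} \approx 2.1\times \ \text{(4KB random read)}
$$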
Ceph deployment in SKT Private Cloud
• Deployed for high-performance block storage in the private cloud: OpenStack Cinder provisions volumes on general servers with SSD arrays, and OSDs scale out for capacity & performance.

4KB Random Write, per-VM cap = 1,000 IOPS:

Number of VMs | IOPS per VM | Latency (ms)
10            | 1,000       | 1.3
20            | 1,000       | 1.2
40            | 1,000       | 1.3
60            | 898         | 2.2
80            | 685         | 2.9
Operations & Maintenance
Cluster Monitoring: real-time monitoring, multi-dashboard, rule-based alarms, drag & drop admin, dashboard configuration, REST API, graph merge, drag & zooming
RBD Management
Object Storage Management
The Future of All-flash Ceph
 Data reduction techniques for flash devices
 Quality of Service (QoS) in a distributed environment
 Fully exploiting NVRAM/SSD for performance
All-Flash JBOF with NVMe SSDs:
• High Performance (PCIe 3.0)
• High Density (24x 2.5" NVMe SSD: up to 96TB)
• Expected 4Q 2016
Thank you
Contact Info.
Byung-Su Park, bspark8@sk.com