AF Ceph: Ceph Performance Analysis and Improvement on Flash
Slide 1: AF Ceph: Ceph Performance Analysis & Improvement on Flash
Byung-Su Park
SDS Tech. Lab, Corporate R&D Center, SK Telecom
Slide 2: Why we care about All-Flash Storage…
5G and UHD/4K services drive demand for flash devices: high performance, low latency, and SLA guarantees.
Slide 3: Transforming to the 5G Network
The new ICT infrastructure should be programmable, scalable, flexible, and cost-effective: software-defined technologies based on open software & open hardware.
5G targets: massive connectivity, 10x lower latency, 100x-1000x higher speed, efficiency & reliability, virtualization.
Slide 4: Open HW & SW Projects @ SKT
Software-defined technologies built on:
- Open software: OpenStack, ONOS, Ceph, Cloud Foundry, Hadoop …
- Open hardware: Open Compute Project (OCP), Telecom Infra Project (TIP); All-Flash Storage, Server Switch, telco-specific H/W …
Slide 5: Why we care about All-Flash Ceph…
Ceph is scalable, available, and reliable, with a unified interface on an open platform; flash brings high performance and low latency. Put together: All-Flash Ceph!
Slide 10: Ceph Write IO Flow: Transaction Execution
[Diagram: FileStore with a write-ahead Journal Disk and a Data Disk.]
1. Queue the transaction to the journal writeq (PG Lock held, then released).
2. The journal Writer Thread operates on the queued journal transactions.
3. AIO write to the Journal Disk.
4. AIO write complete (Write Finisher Thread).
5. Queue the Op to the Operation WQ.
6. Queue to the Finisher Thread; depending on which completion this is (journal or data), the write becomes "Committed", and under PG Lock/Unlock the Finisher sends a RepOp reply to the primary if this is a secondary OSD.
7. Operation Threads perform a buffered write to the Data Disk.
8. Queue to the Finisher Thread; the write becomes "Applied".
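Hammer's FileStore implements this as write-ahead journaling. Below is a minimal C++ sketch of that pattern, not the actual FileStore code: Transaction, MiniFileStore, and drain() are invented names, and the real implementation spreads steps 2-8 across the Writer, Write Finisher, Operation, and Finisher threads shown above.

```cpp
#include <cstdio>
#include <functional>
#include <list>
#include <mutex>

// Simplified model of the flow above: a transaction is journaled first
// (write-ahead); a "commit" callback fires when the journal AIO completes
// (step 6), and an "apply" callback fires after the buffered write to the
// data disk (step 8).
struct Transaction {
  std::function<void()> on_commit;   // journal write is durable
  std::function<void()> on_apply;    // data disk write is applied
};

class MiniFileStore {
  std::mutex pg_lock_;               // stands in for the PG lock
  std::list<Transaction> writeq_;    // step 1: journal write queue

public:
  void queue_transaction(Transaction t) {
    std::lock_guard<std::mutex> lock(pg_lock_);  // lock held while queuing
    writeq_.push_back(std::move(t));
  }

  // The Writer/Finisher threads would drive this; serialized for clarity.
  void drain() {
    std::lock_guard<std::mutex> lock(pg_lock_);
    for (auto &t : writeq_) {
      // steps 2-4: journal AIO write, then its completion
      t.on_commit();   // "Committed": safe to send RepOp reply / ACK
      // step 7: buffered write to the data disk
      t.on_apply();    // "Applied"
    }
    writeq_.clear();
  }
};

int main() {
  MiniFileStore fs;
  fs.queue_transaction({[] { std::puts("committed"); },
                        [] { std::puts("applied"); }});
  fs.drain();
}
```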
Slide 11: Ceph Write IO Flow: Send ACK to Client
[Diagram: Client → Primary OSD over the Public Network; Primary → Secondary OSDs over the Cluster Network; layers are Messenger, OSD Operation WQ, ReplicatedBackend, FileStore.]
1. The client sends the write request over the Public Network.
2. The primary OSD receives the write request.
3. Queue the Op in the Operation WQ.
4. Operation Threads do the operation (PG Lock).
5. Send replicated operations (RepOps) to the secondary OSDs over the Cluster Network.
6. Enqueue the transaction to FileStore (PG Unlock).
Reply path:
A. Each secondary OSD sends a RepOp reply.
B. The primary receives the RepOp reply.
C. Queue the RepOp reply.
D. Do the operation.
E/F. When every journal-or-data completion has arrived ("All Journal or Data Completion?"), send the ACK to the client.
G. The client receives the ACK.
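A schematic C++ sketch of steps B-E, not Ceph's internal code (Ceph's primary OSD keeps a structure with a similar role; the class name, OSD ids, and callback here are illustrative): the primary tracks which commits are still outstanding and ACKs the client only when the set empties.

```cpp
#include <cstdio>
#include <functional>
#include <mutex>
#include <set>

// The primary waits for its own journal commit plus one RepOp reply per
// replica; the last arrival triggers the client ACK.
class RepGather {
  std::mutex m_;
  std::set<int> waiting_on_;           // OSD ids we still expect replies from
  std::function<void()> ack_client_;   // steps E/F: send ACK to the client

public:
  RepGather(std::set<int> replicas, std::function<void()> ack)
      : waiting_on_(std::move(replicas)), ack_client_(std::move(ack)) {}

  // Called once for the local journal commit and once per RepOp reply.
  void commit_received(int osd_id) {
    std::lock_guard<std::mutex> lock(m_);
    if (waiting_on_.erase(osd_id) && waiting_on_.empty())
      ack_client_();                   // last outstanding commit arrived
  }
};

int main() {
  RepGather g({0, 1, 2}, [] { std::puts("ACK sent to client"); });
  g.commit_received(0);   // local journal commit
  g.commit_received(2);   // first SubOp reply
  g.commit_received(1);   // second SubOp reply -> ACK fires
}
```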
Slide 12: OSD Write Operation Latency Analysis
[Timeline of one write through the OSD layers (Messenger → OSD → PG Backend → FileStore → Journal), with SubOp replies from peer nodes Peer1 and Peer2. Measured milestones:]

0 ms        Message header received (Messenger)
0.262 ms    Enqueued to the Operation WQ (OSD)
1.029 ms    Dequeued from the Operation WQ
4.048 ms    Op submitted to the PG Backend
6.663 ms    Transaction sent to FileStore
7.379 ms    Enqueued to the Journal
7.674 ms    Dequeued from the JournalQ
8.228 ms    Journal write complete, enqueued to the FinisherQ
9.349 ms    Local commit sent to the PG Backend
9.819 ms    First SubOp commit received (Peer1), sent to the PG Backend at 11.015 ms
15.605 ms   Second SubOp commit received (Peer2), sent to the PG Backend at 16.747 ms

※ 4K Random Write, QEMU 3 clients, 4 x 12 OSDs (3 replicas), FIO with 16 jobs x 16 IO depth, 600 s ramp time, 600 s runtime
1. PG locking occurs in many places during a single write operation:
 - the write Op execution section in the Operation Thread
 - the journal commit section (the FileStore submit section)
 - the section in the Operation Thread that handles replies from secondary OSDs
2. The PG lock is coarse-grained during one write operation:
 - it is held for about 10 msec of the total 17 msec (see the sketch below)
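To make the finding concrete, here is a schematic C++ sketch, not Ceph code (the three stage functions are placeholders), contrasting the coarse-grained locking measured above with a finer-grained variant:

```cpp
#include <mutex>

// Stub stages of a write Op; only the PG-state mutations actually need
// the PG lock.
void prepare_transaction() {}   // mutates PG state: needs the lock
void submit_to_journal() {}     // journal I/O: does not need the lock
void handle_replica_reply() {}  // reply bookkeeping: briefly needs the lock

std::mutex pg_lock;  // one lock per placement group

// Coarse-grained pattern measured above: the lock is held across the
// whole Op (~10 msec of a ~17 msec write), so every other Op on this PG
// stalls behind it.
void write_op_coarse() {
  std::lock_guard<std::mutex> lock(pg_lock);
  prepare_transaction();
  submit_to_journal();
  handle_replica_reply();
}

// Finer-grained alternative: hold the lock only around the sections that
// really touch PG state.
void write_op_fine() {
  { std::lock_guard<std::mutex> lock(pg_lock); prepare_transaction(); }
  submit_to_journal();
  { std::lock_guard<std::mutex> lock(pg_lock); handle_replica_reply(); }
}

int main() { write_op_coarse(); write_op_fine(); }
```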
Slide 13: Optimizing Ceph OSD (A. PG Lock-related Issues)
- Too many heavy locks → gather the redundant processing code together and deliver it to a dedicated thread
- Delayed client ACKs → return ACKs as soon as possible
- Ops waiting behind Ops of another PG in the Ops queue → make only Ops in the same PG wait (see the per-PG queue sketch below)
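A minimal C++ sketch of the third fix. PerPGQueue and its methods are hypothetical names; the point is only that each PG gets its own FIFO, so an Op never waits behind another PG's backlog.

```cpp
#include <deque>
#include <map>
#include <mutex>

struct Op { int pg_id = 0; /* payload omitted */ };

// One pending-Op FIFO per PG instead of one shared Ops queue.
class PerPGQueue {
  std::mutex m_;
  std::map<int, std::deque<Op>> queues_;   // pg_id -> pending Ops

public:
  void enqueue(Op op) {
    std::lock_guard<std::mutex> lock(m_);
    queues_[op.pg_id].push_back(std::move(op));
  }

  // A worker drains one PG at a time; Ops for other PGs are never
  // blocked behind this PG's backlog.
  bool dequeue_for_pg(int pg_id, Op &out) {
    std::lock_guard<std::mutex> lock(m_);
    auto it = queues_.find(pg_id);
    if (it == queues_.end() || it->second.empty()) return false;
    out = std::move(it->second.front());
    it->second.pop_front();
    return true;
  }
};

int main() {
  PerPGQueue q;
  q.enqueue({1});
  q.enqueue({2});
  Op op;
  // Draining PG 2 succeeds even though PG 1 still has a backlog.
  return q.dequeue_for_pg(2, op) ? 0 : 1;
}
```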
Slide 14: Evaluation Results (A. PG Lock-related Issues)

Random workload (KIOPS):
          4KB RW   32KB RW   4KB RR   32KB RR
Hammer        42        27      154       109
Opt. A        55        29      268       108

Sequential workload (GB/s):
          512KB SW   4MB SW   512KB SR   4MB SR
Hammer         1.6      2.1        4.1      4.0
Opt. A         1.7      2.3        4.0      4.0

Performance improvement:
• Random 4K Write: 42K → 55K IOPS (13K ↑)
• Random 4K Read: 154K → 268K IOPS (114K ↑)
Slide 15: Optimizing Ceph OSD (B. Async Logging & System Tuning)
- Long logging time → split logging off into another thread and do it later (see the sketch below)
- HDD-based throttling configuration → change the throttling configuration from HDD-based to SSD-based values
- tcmalloc is too CPU-intensive → use jemalloc instead
- Batching rules in the TCP/IP stack → turn off TCP/IP batching to reduce latency
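A minimal C++ sketch of the async-logging idea (invented names; this is not Ceph's logging infrastructure): the IO path only enqueues a record, and a dedicated thread does the slow formatting and writing later.

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Producer threads enqueue log records and return immediately; a
// background thread writes them later, off the IO path.
class AsyncLogger {
  std::queue<std::string> q_;
  std::mutex m_;
  std::condition_variable cv_;
  bool stop_ = false;
  std::thread worker_;

public:
  AsyncLogger() : worker_([this] { run(); }) {}
  ~AsyncLogger() {
    { std::lock_guard<std::mutex> lock(m_); stop_ = true; }
    cv_.notify_one();
    worker_.join();                      // flushes remaining records
  }

  void log(std::string msg) {            // cheap: just an enqueue
    { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(msg)); }
    cv_.notify_one();
  }

private:
  void run() {                           // slow part happens here, later
    std::unique_lock<std::mutex> lock(m_);
    while (!stop_ || !q_.empty()) {
      cv_.wait(lock, [this] { return stop_ || !q_.empty(); });
      while (!q_.empty()) {
        std::string msg = std::move(q_.front());
        q_.pop();
        lock.unlock();
        std::cerr << msg << '\n';        // actual write, off the hot path
        lock.lock();
      }
    }
  }
};

int main() {
  AsyncLogger logger;
  logger.log("write complete");  // returns immediately; worker flushes it
}
```

For the TCP/IP item, the standard mechanism is disabling Nagle's algorithm with the TCP_NODELAY socket option; Ceph's messenger exposes an ms_tcp_nodelay option for this.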
Slide 16: Evaluation Results (B. Async Logging & System Tuning)

Random workload (KIOPS):
          4KB RW   32KB RW   4KB RR   32KB RR
Hammer        42        27      154       109
Opt. A        55        29      268       108
Opt. B        61        29      285       103

Sequential workload (GB/s):
          512KB SW   4MB SW   512KB SR   4MB SR
Hammer         1.6      2.1        4.1      4.0
Opt. A         1.7      2.3        4.0      4.0
Opt. B         1.6      2.1        4.1      4.1

Performance improvement:
• Random 4K Write: 42K → 61K IOPS (19K ↑)
• Random 4K Read: 154K → 285K IOPS (131K ↑)
Slide 17: Optimizing Ceph OSD (C. Lightweight Transaction)
[Diagram: a write request passes from the Operation WQ thread into FileStore, where it becomes a heavyweight transaction touching the journal, OMAP, file-system xattrs, and data on the SSD.]
- Transaction writing overhead → merge transaction sub-operations and reduce the weight of each transaction (see the sketch below)
- Transaction lock contention → increase the sizes of the caches whose locks are contended, to prevent lock contention
- Useless system calls when small workloads are executed → remove the useless system calls
- HDD-based DB configuration → change the configuration from HDD-based to SSD-based
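The first item, merging sub-operations, can be pictured with a small C++ sketch (hypothetical names; the actual change lives inside FileStore's transaction handling): consecutive contiguous writes to the same object collapse into one sub-op before the transaction is submitted, so the journal sees one lighter transaction.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct SubOp {
  std::string object;
  uint64_t off;
  std::vector<uint8_t> data;
};

struct Transaction {
  std::vector<SubOp> ops;

  void write(std::string object, uint64_t off, std::vector<uint8_t> data) {
    // Merge with the previous sub-op when this is a contiguous write to
    // the same object; otherwise append a new sub-op.
    if (!ops.empty() && ops.back().object == object &&
        ops.back().off + ops.back().data.size() == off) {
      auto &prev = ops.back().data;
      prev.insert(prev.end(), data.begin(), data.end());
    } else {
      ops.push_back({std::move(object), off, std::move(data)});
    }
  }
};

int main() {
  Transaction t;
  t.write("obj", 0, {1, 2});
  t.write("obj", 2, {3, 4});   // contiguous: merged into one sub-op
  return t.ops.size() == 1 ? 0 : 1;
}
```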
Slide 18: Evaluation Results (C. Lightweight Transaction)

Random workload (KIOPS):
          4KB RW   32KB RW   4KB RR   32KB RR
Hammer        42        27      154       109
Opt. A        55        29      268       108
Opt. B        61        29      285       103
Opt. C        89        29      321       103

Sequential workload (GB/s):
          512KB SW   4MB SW   512KB SR   4MB SR
Hammer         1.6      2.1        4.1      4.0
Opt. A         1.7      2.3        4.0      4.0
Opt. B         1.6      2.1        4.1      4.1
Opt. C         1.6      2.0        3.9      4.1

Performance improvement:
• Random 4K Write: 42K → 89K IOPS (47K ↑)
• Random 4K Read: 154K → 321K IOPS (167K ↑)
Slide 19: Ceph Deployment in the SKT Private Cloud
• Deployed as high-performance block storage in the private cloud: OpenStack Cinder in front of Ceph OSDs running on general servers with SSD arrays, scaled out for capacity & performance.

4KB Random Write, capped at 1,000 IOPS per VM:
Number of VMs      10      20      40      60      80
IOPS per VM     1,000   1,000   1,000     898     685
Latency (msec)    1.3     1.2     1.3     2.2     2.9
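The per-VM cap is the kind of limit OpenStack can enforce through Cinder QoS specs; a hedged usage sketch (the spec name vm-iops-cap is illustrative, and the deck does not say this exact mechanism was used):

```
# Cap volumes of a given type at 1,000 IOPS, enforced at the hypervisor,
# then attach the spec to the volume type these VMs use.
cinder qos-create vm-iops-cap consumer=front-end total_iops_sec=1000
cinder qos-associate <qos_spec_id> <volume_type_id>
```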
Slide 20: Operations & Maintenance
- Cluster monitoring: real-time monitoring, multi-dashboard, rule-based alarms
- Admin dashboard configuration: drag & drop, REST API, graph merge, drag & zooming
- RBD management and object storage management
Slide 21: The Future of All-Flash Ceph
- Data reduction techniques for flash devices
- Quality of Service (QoS) in a distributed environment
- Fully exploiting NVRAM/SSD for performance
All-Flash JBOF with NVMe SSDs:
• High performance (PCIe 3.0)
• High density (24 x 2.5" NVMe SSDs: up to 96TB)
• Expected '16 4Q
…