AF Ceph: Ceph Performance Analysis and Improvement on Flash
Slide 1: AF Ceph: Ceph Performance Analysis & Improvement on Flash
Byung-Su Park
SDS Tech. Lab, Corporate R&D Center, SK Telecom
Slide 2: Why we care about All-Flash Storage…
5G and UHD/4K services drive demand for flash devices: high performance, low latency, and SLA guarantees.
Slide 3: Transforming to the 5G Network
The new ICT infrastructure should be programmable, scalable, flexible, and cost-effective: software-defined technologies based on open software & open hardware.
5G targets: massive connectivity, 10x lower latency, 100x-1000x higher speed, efficiency & reliability, virtualization.
Slide 4: Open HW & SW Projects @ SKT
Software-defined technologies built on:
- Open software: OpenStack, ONOS, Ceph, Cloud Foundry, Hadoop …
- Open hardware: Open Compute Project (OCP), Telecom Infra Project (TIP); All-Flash Storage, Server Switch, telco-specific H/W …
Slide 5: Why we care about All-Flash Ceph…
Ceph is scalable, available, and reliable, with a unified interface on an open platform; flash brings high performance and low latency. Put together: All-Flash Ceph!
Slide 10: Ceph Write IO Flow: Transaction Execution
[Diagram: FileStore with a write-ahead Journal Disk and a Data Disk.]
1. Queue the transaction to the journal writeq (PG Lock held, then released).
2. The journal Writer Thread operates on the queued journal transactions.
3. AIO write to the Journal Disk.
4. AIO write complete (Write Finisher Thread).
5. Queue the Op to the Operation WQ.
6. Queue to the Finisher Thread; depending on which completion this is (journal or data), the write becomes "Committed", and under PG Lock/Unlock the Finisher sends a RepOp reply to the primary if this is a secondary OSD.
7. Operation Threads perform a buffered write to the Data Disk.
8. Queue to the Finisher Thread; the write becomes "Applied".
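Hammer's FileStore implements this as write-ahead journaling. Below is a minimal C++ sketch of that pattern, not the actual FileStore code: Transaction, MiniFileStore, and drain() are invented names, and the real implementation spreads steps 2-8 across the Writer, Write Finisher, Operation, and Finisher threads shown above.

```cpp
#include <cstdio>
#include <functional>
#include <list>
#include <mutex>

// Simplified model of the flow above: a transaction is journaled first
// (write-ahead); a "commit" callback fires when the journal AIO completes
// (step 6), and an "apply" callback fires after the buffered write to the
// data disk (step 8).
struct Transaction {
  std::function<void()> on_commit;   // journal write is durable
  std::function<void()> on_apply;    // data disk write is applied
};

class MiniFileStore {
  std::mutex pg_lock_;               // stands in for the PG lock
  std::list<Transaction> writeq_;    // step 1: journal write queue

public:
  void queue_transaction(Transaction t) {
    std::lock_guard<std::mutex> lock(pg_lock_);  // lock held while queuing
    writeq_.push_back(std::move(t));
  }

  // The Writer/Finisher threads would drive this; serialized for clarity.
  void drain() {
    std::lock_guard<std::mutex> lock(pg_lock_);
    for (auto &t : writeq_) {
      // steps 2-4: journal AIO write, then its completion
      t.on_commit();   // "Committed": safe to send RepOp reply / ACK
      // step 7: buffered write to the data disk
      t.on_apply();    // "Applied"
    }
    writeq_.clear();
  }
};

int main() {
  MiniFileStore fs;
  fs.queue_transaction({[] { std::puts("committed"); },
                        [] { std::puts("applied"); }});
  fs.drain();
}
```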
Slide 11: Ceph Write IO Flow: Send ACK to Client
[Diagram: Client → Primary OSD over the Public Network; Primary → Secondary OSDs over the Cluster Network; layers are Messenger, OSD Operation WQ, ReplicatedBackend, FileStore.]
1. The client sends the write request over the Public Network.
2. The primary OSD receives the write request.
3. Queue the Op in the Operation WQ.
4. Operation Threads do the operation (PG Lock).
5. Send replicated operations (RepOps) to the secondary OSDs over the Cluster Network.
6. Enqueue the transaction to FileStore (PG Unlock).
Reply path:
A. Each secondary OSD sends a RepOp reply.
B. The primary receives the RepOp reply.
C. Queue the RepOp reply.
D. Do the operation.
E/F. When every journal-or-data completion has arrived ("All Journal or Data Completion?"), send the ACK to the client.
G. The client receives the ACK.
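A schematic C++ sketch of steps B-E, not Ceph's internal code (Ceph's primary OSD keeps a structure with a similar role; the class name, OSD ids, and callback here are illustrative): the primary tracks which commits are still outstanding and ACKs the client only when the set empties.

```cpp
#include <cstdio>
#include <functional>
#include <mutex>
#include <set>

// The primary waits for its own journal commit plus one RepOp reply per
// replica; the last arrival triggers the client ACK.
class RepGather {
  std::mutex m_;
  std::set<int> waiting_on_;           // OSD ids we still expect replies from
  std::function<void()> ack_client_;   // steps E/F: send ACK to the client

public:
  RepGather(std::set<int> replicas, std::function<void()> ack)
      : waiting_on_(std::move(replicas)), ack_client_(std::move(ack)) {}

  // Called once for the local journal commit and once per RepOp reply.
  void commit_received(int osd_id) {
    std::lock_guard<std::mutex> lock(m_);
    if (waiting_on_.erase(osd_id) && waiting_on_.empty())
      ack_client_();                   // last outstanding commit arrived
  }
};

int main() {
  RepGather g({0, 1, 2}, [] { std::puts("ACK sent to client"); });
  g.commit_received(0);   // local journal commit
  g.commit_received(2);   // first SubOp reply
  g.commit_received(1);   // second SubOp reply -> ACK fires
}
```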
Slide 12: OSD Write Operation Latency Analysis
[Timeline of one write through the OSD layers (Messenger → OSD → PG Backend → FileStore → Journal), with SubOp replies from peer nodes Peer1 and Peer2. Measured milestones:]

0 ms        Message header received (Messenger)
0.262 ms    Enqueued to the Operation WQ (OSD)
1.029 ms    Dequeued from the Operation WQ
4.048 ms    Op submitted to the PG Backend
6.663 ms    Transaction sent to FileStore
7.379 ms    Enqueued to the Journal
7.674 ms    Dequeued from the JournalQ
8.228 ms    Journal write complete, enqueued to the FinisherQ
9.349 ms    Local commit sent to the PG Backend
9.819 ms    First SubOp commit received (Peer1), sent to the PG Backend at 11.015 ms
15.605 ms   Second SubOp commit received (Peer2), sent to the PG Backend at 16.747 ms

※ 4K Random Write, QEMU 3 clients, 4 x 12 OSDs (3 replicas), FIO with 16 jobs x 16 IO depth, 600 s ramp time, 600 s runtime
1. PG locking occurs in many places during a single write operation:
 - the write Op execution section in the Operation Thread
 - the journal commit section (the FileStore submit section)
 - the section in the Operation Thread that handles replies from secondary OSDs
2. The PG lock is coarse-grained during one write operation:
 - it is held for about 10 msec of the total 17 msec (see the sketch below)
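To make the finding concrete, here is a schematic C++ sketch, not Ceph code (the three stage functions are placeholders), contrasting the coarse-grained locking measured above with a finer-grained variant:

```cpp
#include <mutex>

// Stub stages of a write Op; only the PG-state mutations actually need
// the PG lock.
void prepare_transaction() {}   // mutates PG state: needs the lock
void submit_to_journal() {}     // journal I/O: does not need the lock
void handle_replica_reply() {}  // reply bookkeeping: briefly needs the lock

std::mutex pg_lock;  // one lock per placement group

// Coarse-grained pattern measured above: the lock is held across the
// whole Op (~10 msec of a ~17 msec write), so every other Op on this PG
// stalls behind it.
void write_op_coarse() {
  std::lock_guard<std::mutex> lock(pg_lock);
  prepare_transaction();
  submit_to_journal();
  handle_replica_reply();
}

// Finer-grained alternative: hold the lock only around the sections that
// really touch PG state.
void write_op_fine() {
  { std::lock_guard<std::mutex> lock(pg_lock); prepare_transaction(); }
  submit_to_journal();
  { std::lock_guard<std::mutex> lock(pg_lock); handle_replica_reply(); }
}

int main() { write_op_coarse(); write_op_fine(); }
```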
Slide 13: Optimizing Ceph OSD (A. PG Lock-related Issues)
- Too many heavy locks → gather the redundant processing code together and deliver it to a dedicated thread
- Delayed client ACKs → return ACKs as soon as possible
- Ops waiting behind Ops of another PG in the Ops queue → make only Ops in the same PG wait (see the per-PG queue sketch below)
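A minimal C++ sketch of the third fix. PerPGQueue and its methods are hypothetical names; the point is only that each PG gets its own FIFO, so an Op never waits behind another PG's backlog.

```cpp
#include <deque>
#include <map>
#include <mutex>

struct Op { int pg_id = 0; /* payload omitted */ };

// One pending-Op FIFO per PG instead of one shared Ops queue.
class PerPGQueue {
  std::mutex m_;
  std::map<int, std::deque<Op>> queues_;   // pg_id -> pending Ops

public:
  void enqueue(Op op) {
    std::lock_guard<std::mutex> lock(m_);
    queues_[op.pg_id].push_back(std::move(op));
  }

  // A worker drains one PG at a time; Ops for other PGs are never
  // blocked behind this PG's backlog.
  bool dequeue_for_pg(int pg_id, Op &out) {
    std::lock_guard<std::mutex> lock(m_);
    auto it = queues_.find(pg_id);
    if (it == queues_.end() || it->second.empty()) return false;
    out = std::move(it->second.front());
    it->second.pop_front();
    return true;
  }
};

int main() {
  PerPGQueue q;
  q.enqueue({1});
  q.enqueue({2});
  Op op;
  // Draining PG 2 succeeds even though PG 1 still has a backlog.
  return q.dequeue_for_pg(2, op) ? 0 : 1;
}
```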
Slide 14: Evaluation Results (A. PG Lock-related Issues)

Random workload (KIOPS):
          4KB RW   32KB RW   4KB RR   32KB RR
Hammer        42        27      154       109
Opt. A        55        29      268       108

Sequential workload (GB/s):
          512KB SW   4MB SW   512KB SR   4MB SR
Hammer         1.6      2.1        4.1      4.0
Opt. A         1.7      2.3        4.0      4.0

Performance improvement:
• Random 4K Write: 42K → 55K IOPS (13K ↑)
• Random 4K Read: 154K → 268K IOPS (114K ↑)
Slide 15: Optimizing Ceph OSD (B. Async Logging & System Tuning)
- Long logging time → split logging off into another thread and do it later (see the sketch below)
- HDD-based throttling configuration → change the throttling configuration from HDD-based to SSD-based values
- tcmalloc is too CPU-intensive → use jemalloc instead
- Batching rules in the TCP/IP stack → turn off TCP/IP batching to reduce latency
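A minimal C++ sketch of the async-logging idea (invented names; this is not Ceph's logging infrastructure): the IO path only enqueues a record, and a dedicated thread does the slow formatting and writing later.

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Producer threads enqueue log records and return immediately; a
// background thread writes them later, off the IO path.
class AsyncLogger {
  std::queue<std::string> q_;
  std::mutex m_;
  std::condition_variable cv_;
  bool stop_ = false;
  std::thread worker_;

public:
  AsyncLogger() : worker_([this] { run(); }) {}
  ~AsyncLogger() {
    { std::lock_guard<std::mutex> lock(m_); stop_ = true; }
    cv_.notify_one();
    worker_.join();                      // flushes remaining records
  }

  void log(std::string msg) {            // cheap: just an enqueue
    { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(msg)); }
    cv_.notify_one();
  }

private:
  void run() {                           // slow part happens here, later
    std::unique_lock<std::mutex> lock(m_);
    while (!stop_ || !q_.empty()) {
      cv_.wait(lock, [this] { return stop_ || !q_.empty(); });
      while (!q_.empty()) {
        std::string msg = std::move(q_.front());
        q_.pop();
        lock.unlock();
        std::cerr << msg << '\n';        // actual write, off the hot path
        lock.lock();
      }
    }
  }
};

int main() {
  AsyncLogger logger;
  logger.log("write complete");  // returns immediately; worker flushes it
}
```

For the TCP/IP item, the standard mechanism is disabling Nagle's algorithm with the TCP_NODELAY socket option; Ceph's messenger exposes an ms_tcp_nodelay option for this.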
Slide 16: Evaluation Results (B. Async Logging & System Tuning)

Random workload (KIOPS):
          4KB RW   32KB RW   4KB RR   32KB RR
Hammer        42        27      154       109
Opt. A        55        29      268       108
Opt. B        61        29      285       103

Sequential workload (GB/s):
          512KB SW   4MB SW   512KB SR   4MB SR
Hammer         1.6      2.1        4.1      4.0
Opt. A         1.7      2.3        4.0      4.0
Opt. B         1.6      2.1        4.1      4.1

Performance improvement:
• Random 4K Write: 42K → 61K IOPS (19K ↑)
• Random 4K Read: 154K → 285K IOPS (131K ↑)
Slide 17: Optimizing Ceph OSD (C. Lightweight Transaction)
[Diagram: a write request passes from the Operation WQ thread into FileStore, where it becomes a heavyweight transaction touching the journal, OMAP, file-system xattrs, and data on the SSD.]
- Transaction writing overhead → merge transaction sub-operations and reduce the weight of each transaction (see the sketch below)
- Transaction lock contention → increase the sizes of the caches whose locks are contended, to prevent lock contention
- Useless system calls when small workloads are executed → remove the useless system calls
- HDD-based DB configuration → change the configuration from HDD-based to SSD-based
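The first item, merging sub-operations, can be pictured with a small C++ sketch (hypothetical names; the actual change lives inside FileStore's transaction handling): consecutive contiguous writes to the same object collapse into one sub-op before the transaction is submitted, so the journal sees one lighter transaction.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct SubOp {
  std::string object;
  uint64_t off;
  std::vector<uint8_t> data;
};

struct Transaction {
  std::vector<SubOp> ops;

  void write(std::string object, uint64_t off, std::vector<uint8_t> data) {
    // Merge with the previous sub-op when this is a contiguous write to
    // the same object; otherwise append a new sub-op.
    if (!ops.empty() && ops.back().object == object &&
        ops.back().off + ops.back().data.size() == off) {
      auto &prev = ops.back().data;
      prev.insert(prev.end(), data.begin(), data.end());
    } else {
      ops.push_back({std::move(object), off, std::move(data)});
    }
  }
};

int main() {
  Transaction t;
  t.write("obj", 0, {1, 2});
  t.write("obj", 2, {3, 4});   // contiguous: merged into one sub-op
  return t.ops.size() == 1 ? 0 : 1;
}
```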
Slide 18: Evaluation Results (C. Lightweight Transaction)

Random workload (KIOPS):
          4KB RW   32KB RW   4KB RR   32KB RR
Hammer        42        27      154       109
Opt. A        55        29      268       108
Opt. B        61        29      285       103
Opt. C        89        29      321       103

Sequential workload (GB/s):
          512KB SW   4MB SW   512KB SR   4MB SR
Hammer         1.6      2.1        4.1      4.0
Opt. A         1.7      2.3        4.0      4.0
Opt. B         1.6      2.1        4.1      4.1
Opt. C         1.6      2.0        3.9      4.1

Performance improvement:
• Random 4K Write: 42K → 89K IOPS (47K ↑)
• Random 4K Read: 154K → 321K IOPS (167K ↑)
Slide 19: Ceph Deployment in the SKT Private Cloud
• Deployed as high-performance block storage in the private cloud: OpenStack Cinder in front of Ceph OSDs running on general servers with SSD arrays, scaled out for capacity & performance.

4KB Random Write, capped at 1,000 IOPS per VM:
Number of VMs      10      20      40      60      80
IOPS per VM     1,000   1,000   1,000     898     685
Latency (msec)    1.3     1.2     1.3     2.2     2.9
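The per-VM cap is the kind of limit OpenStack can enforce through Cinder QoS specs; a hedged usage sketch (the spec name vm-iops-cap is illustrative, and the deck does not say this exact mechanism was used):

```
# Cap volumes of a given type at 1,000 IOPS, enforced at the hypervisor,
# then attach the spec to the volume type these VMs use.
cinder qos-create vm-iops-cap consumer=front-end total_iops_sec=1000
cinder qos-associate <qos_spec_id> <volume_type_id>
```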
Slide 20: Operations & Maintenance
- Cluster monitoring: real-time monitoring, multi-dashboard, rule-based alarms
- Admin dashboard configuration: drag & drop, REST API, graph merge, drag & zooming
- RBD management and object storage management
Slide 21: The Future of All-Flash Ceph
- Data reduction techniques for flash devices
- Quality of Service (QoS) in a distributed environment
- Fully exploiting NVRAM/SSD for performance
All-Flash JBOF with NVMe SSDs:
• High performance (PCIe 3.0)
• High density (24 x 2.5" NVMe SSDs: up to 96TB)
• Expected '16 4Q
…