Ceph Goes Online at Qihoo 360
Xu Xuehan
xuxuehan@360.cn
Outline
• Motivation
• Ceph RBD
• CephFS
Motivation
Products at Qihoo 360
Motivation
• Virtualization
• Benefits
– Avoidance of hardware resource waste
– Ease of product deployment
• Not all problems solved
– Long VM failover interval
– Long VM creation time
Need for a separate VM image storage backend
Motivation
• Need for a separate VM storage backend
– Ceph RBD
• Separation of Computation and Storage.
• Scalable storage
• Open source & active community support
Motivation
• Need for a shared file system
[Diagram: a CMS platform writes data into a shared file system, which is read by many web servers]
Motivation
• Need for a shared file system
[Diagram: a scheduler dispatches tasks to multiple task executors, which read and write their data through a shared file system]
• Need for a shared file system
– CephFS
• POSIX compliance
• Read-after-write consistency
• Scalability
Outline
• Motivation
• Ceph RBD
• CephFS
Ceph RBD
Production Deployment
– 500+ Nodes
– 30+ Clusters
– Largest Cluster: 135 nodes, 1000+ OSDs
– Hammer 0.94.5, Jewel 10.2.5
Ceph RBD
Ceph RBD
• Online Clusters
– Cost VS Performance
• Full SSD cluster, for users sensitive to I/O latency (game servers, etc.)
• SSD + HDD hybrid cluster, for other users
OSD Nodes
CPU Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
RAM 128GB
NIC 10GbE
Hard Drives 8*SSD(SDLF1DAM-800G-1HA1)
OSD Nodes
CPU Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz
RAM 64GB
NIC 10GbE
Hard Drives 2*SSD(INTEL SSDSC2BB300G4) + 9*HDD(WDC WD4000FYYZ-03UL1B2)
Ceph RBD
Online Clusters
– Cost VS Performance
Full SSD: each OSD keeps its data partition (ext4/xfs) and its journal on the same SSD.
SSD + HDD hybrid: each OSD keeps its data partition (ext4/xfs) on a SATA HDD, while the journals of several OSDs are placed as partitions (sdk1, sdk2, …) on a shared SSD.
(A provisioning sketch follows this slide.)
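As a rough illustration of the two layouts above, the sketch below shells out to the era-appropriate ceph-disk tool (Hammer/Jewel). The device paths, and the choice of ceph-disk itself, are assumptions for illustration, not details taken from the slides.

```python
# Hypothetical provisioning sketch for the two OSD layouts (Hammer/Jewel era,
# FileStore + ceph-disk). Device names are placeholders.
import subprocess

def prepare_full_ssd_osd(ssd_dev):
    # Full SSD layout: data partition and journal share the same SSD;
    # ceph-disk carves a journal partition out of the data device by default.
    subprocess.check_call(["ceph-disk", "prepare", "--fs-type", "xfs", ssd_dev])

def prepare_hybrid_osd(hdd_dev, journal_ssd_dev):
    # Hybrid layout: data on a SATA HDD, journal as a partition on a shared SSD.
    subprocess.check_call(
        ["ceph-disk", "prepare", "--fs-type", "xfs", hdd_dev, journal_ssd_dev])

if __name__ == "__main__":
    prepare_full_ssd_osd("/dev/sdb")            # e.g. one of the 8 SSDs
    prepare_hybrid_osd("/dev/sdc", "/dev/sdk")  # HDD + shared journal SSD
```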
Ceph RBD
Online Clusters
– One Big Pool or Multiple Small Pools?
• PipeMessenger (Default in Hammer/Jewel): two threads per connection
Huge pool vs. multiple small pools: too many OSDs in one pool could lead to too many threads on one machine!
Max_threads_per_osd = 2 * (clients + 1 + 6 * pg_num_per_osd)
Any other problem with a huge pool? WE DON'T KNOW YET
(A worked thread-count example follows.)
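To make the thread math concrete, here is a small sketch applying the formula above. The client count, PG count per OSD, and OSDs per machine are made-up example values, not figures from the deck.

```python
# Worked example of the slide's formula:
#   max_threads_per_osd = 2 * (clients + 1 + 6 * pg_num_per_osd)
# (two threads per PipeMessenger connection in Hammer/Jewel).

def max_threads_per_osd(clients, pg_num_per_osd):
    return 2 * (clients + 1 + 6 * pg_num_per_osd)

# Example values (assumptions for illustration only).
clients = 500             # librbd clients talking to each OSD
pg_num_per_osd = 100      # PGs hosted per OSD

per_osd = max_threads_per_osd(clients, pg_num_per_osd)
per_machine = per_osd * 8   # e.g. 8 OSDs per machine (full SSD nodes)
print(per_osd, per_machine) # 2202 threads per OSD, 17616 per machine
```

Splitting the same OSDs into several small pools limits how many clients and peer PGs each OSD talks to, which keeps the per-machine thread count bounded.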
Ceph RBD
Online Clusters
– One Big Pool or Multiple Small Pools?
• Typical deployment pattern
[Diagram: a single RADOS cluster hosting several pools (Pool1 SSD, Pool2 SSD, Cloud1 SATA, RGW SATA); VM disks attach through librbd, and an application reaches the RGW pool through radosgw]
Ceph RBD
Online Clusters
• One Big Pool or Multiple Small Pools? (OpenStack modifications)
– OpenStack support for multi-pool (supported upstream now) and multi-cluster
– VM creation based on rbd image clone (supported upstream now)
– Flatten after VM image creation (supported upstream now); a clone-and-flatten sketch follows
Online Clusters
– Final deployment pattern
[Diagram: OpenStack services (KEYSTONE, SWIFT, CINDER, GLANCE, NOVA, MANILA) perform cluster selection and then pool selection across Cluster 1/2/3, each with Pool A and Pool B, and access the chosen pool through librbd]
Ceph RBD
Online Clusters
• Stability
– No single point of failure
[Diagram: physical deployment: HOST1-HOST5 on rack1 behind SwitchA, HOST6-HOST10 on rack2 behind SwitchB, both under a core switch; monitors on HOST1, HOST6 and HOST7. Logical rack setting: CRUSH racks defined so that monitors and replicas are spread across failure domains. A CRUSH-setup sketch follows.]
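The logical racks in the diagram map to CRUSH buckets. Below is a minimal sketch of how such a layout could be expressed with standard ceph CLI calls; the rack and host names are placeholders and the exact grouping is an assumption.

```python
# Sketch: define CRUSH racks and move hosts into them so that replicas
# land in separate failure domains.
import subprocess

def ceph(*args):
    subprocess.check_call(["ceph"] + list(args))

racks = {
    "rack1": ["HOST1", "HOST2", "HOST3"],
    "rack2": ["HOST6", "HOST7", "HOST8"],
}

for rack, hosts in racks.items():
    ceph("osd", "crush", "add-bucket", rack, "rack")      # create the rack bucket
    ceph("osd", "crush", "move", rack, "root=default")    # hang it under the root
    for host in hosts:
        ceph("osd", "crush", "move", host, "rack=" + rack)  # re-parent the host
```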
Ceph RBD
Online Clusters
• QoS
– Done in QEMU (a throttle-configuration sketch follows this list)
– Full SSD clusters:
• IOPS: 10 iops/GB
• Throughput: 260 MB/s
– SSD + HDD Hybrid clusters:
• READ IOPS: 1400
• WRITE IOPS: 300~600
• Throughput: 70 MB/s
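The limits above are enforced on the QEMU side. The sketch below derives per-volume limits from the full-SSD policy and emits a libvirt iotune snippet; the helper function and the exact element set used here are illustrative assumptions, not the deck's actual tooling.

```python
# Sketch: turn the full-SSD QoS policy (10 IOPS per GB, 260 MB/s cap)
# into libvirt <iotune> settings for one volume.
IOPS_PER_GB = 10
MAX_BYTES_PER_SEC = 260 * 1024 * 1024   # 260 MB/s

def iotune_xml(volume_size_gb):
    total_iops = IOPS_PER_GB * volume_size_gb
    return (
        "<iotune>\n"
        "  <total_iops_sec>%d</total_iops_sec>\n"
        "  <total_bytes_sec>%d</total_bytes_sec>\n"
        "</iotune>" % (total_iops, MAX_BYTES_PER_SEC)
    )

print(iotune_xml(100))   # a 100 GB volume gets a 1000 IOPS cap
```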
Ceph RBD
Online Clusters
• Capacity
– Thin provisioning
– Capacity requirement prediction (worked example below):
C_total = (N_VM * C_capacity_per_VM) / (T_thin_provision_ratio * 0.7)
– Create new pool instead of pool expansion
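A worked example of the prediction formula, with made-up numbers to show the arithmetic; dividing by 0.7 leaves headroom so the pool stays at most roughly 70% full.

```python
# Worked example of:
#   C_total = (N_VM * C_capacity_per_VM) / (T_thin_provision_ratio * 0.7)
n_vm = 1000                    # planned number of VMs (example value)
capacity_per_vm_gb = 100       # provisioned size per VM disk (example value)
thin_provision_ratio = 3.0     # provisioned : actually-written ratio (example)

c_total_gb = (n_vm * capacity_per_vm_gb) / (thin_provision_ratio * 0.7)
print(round(c_total_gb))       # ~47619 GB of usable capacity needed
```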
Ceph RBD
Online Experience
• High I/O util on full SSD cluster
– I/O util: 10%+ (full SSD Ceph) vs. <1% (local disk)
– Users may complain, but NOT a problem
[Chart: full SSD Ceph, latency vs. IOPS: roughly 3 ms at 50,000 IOPS, 3.62 ms at 99,415, 5.63 ms at 156,327 and 7.02 ms at 186,462]
Ceph RBD
Online Experience
• Burst image deletion
– Users remove massive numbers of images all at once
– Cluster becomes almost unavailable
– Solution:
• Modify OpenStack to remove images asynchronously, with concurrency control (a sketch follows this list)
• Scrub
– Could severely impact I/O performance
– Only between 2:00 and 6:00 AM
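A minimal sketch of the concurrency-limited asynchronous deletion idea, using the rbd Python bindings and a bounded worker pool; the pool name, image names and concurrency limit are placeholders, not the actual OpenStack patch.

```python
# Sketch: delete RBD images in the background with bounded concurrency,
# so a burst of deletions cannot overwhelm the cluster.
from concurrent.futures import ThreadPoolExecutor
import rados
import rbd

MAX_CONCURRENT_DELETES = 4          # placeholder limit

def remove_image(ioctx, name):
    try:
        rbd.RBD().remove(ioctx, name)   # synchronous per-image removal
    except rbd.ImageNotFound:
        pass                            # already gone; nothing to do

def remove_images_async(pool, names):
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx(pool)
    try:
        with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_DELETES) as workers:
            list(workers.map(lambda n: remove_image(ioctx, n), names))
    finally:
        ioctx.close()
        cluster.shutdown()

remove_images_async("vms", ["vm-disk-1", "vm-disk-2"])
```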
Ceph RBD
Online Experience
• Full SSD Ceph (Hammer): really CPU-consuming
– ~10^4 IOPS per CPU (CPU: Xeon E5-2630 v3)
– The more SSDs per CPU, the fewer IOPS per CPU
[Chart: single-CPU utilization, CPU %idle]
Online Experience
• One OSD full == Cluster full (Hammer, Jewel)
• Daily Inspection: an intuitive way to observe cluster topology may be needed
– For now, we use a script to draw a topology graph (a minimal sketch follows)
Ceph RBD
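For reference, a minimal sketch of such a topology script: it parses `ceph osd tree -f json` and prints the CRUSH hierarchy as an indented tree (printing instead of drawing, for brevity).

```python
# Sketch: print the CRUSH topology as an indented tree from `ceph osd tree -f json`.
import json
import subprocess

def print_crush_tree():
    out = subprocess.check_output(["ceph", "osd", "tree", "-f", "json"])
    nodes = {n["id"]: n for n in json.loads(out)["nodes"]}
    roots = [n for n in nodes.values() if n["type"] == "root"]

    def walk(node, depth=0):
        print("  " * depth + "%s (%s)" % (node["name"], node["type"]))
        for child_id in node.get("children", []):
            walk(nodes[child_id], depth + 1)

    for root in roots:
        walk(root)

if __name__ == "__main__":
    print_crush_tree()
```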
Ceph RBD
Online Experience
• Tracing
– Hard to reproduce some online problems
– Can't turn on high-priority logging online
• Alerting
– Integrate with other alerting services like Nagios?
– A new alerting module? Or at least an alerting interface for users to capture exceptional events?
Ceph RBD
Online Experience
• RBD image backup
[Diagram: backup architecture: Ark-conductor, Ark-resume, Ark-fullmd5, Ark-check and a DB coordinate multiple Ark-workers, which copy VM disks (Vm1_disk, Vm2_disk, …) from the main cluster to the backup cluster; operations include export-diff/import-diff transfer, MD5 check, snapshot reclaim, snapshot rollback and recovery. A transfer sketch follows.]
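The incremental transfer between the two clusters can be sketched with the standard rbd export-diff / import-diff commands, piped process-to-process. The cluster names, pool, image and snapshot names below are placeholders, not the Ark implementation.

```python
# Sketch: incrementally replicate one image from the main cluster to the
# backup cluster by piping `rbd export-diff` into `rbd import-diff`.
import subprocess

def replicate_diff(image, from_snap, to_snap):
    export = subprocess.Popen(
        ["rbd", "--cluster", "main", "export-diff",
         "--from-snap", from_snap, "vms/%s@%s" % (image, to_snap), "-"],
        stdout=subprocess.PIPE)
    import_ = subprocess.Popen(
        ["rbd", "--cluster", "backup", "import-diff", "-", "vms/%s" % image],
        stdin=export.stdout)
    export.stdout.close()          # let import_ see EOF when export finishes
    if import_.wait() != 0 or export.wait() != 0:
        raise RuntimeError("diff replication failed for %s" % image)

replicate_diff("vm1_disk", "backup-2018-04-01", "backup-2018-04-02")
```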
Outline
• Motivation
• Ceph RBD
• CephFS
CephFS
• MDS Performance Evaluation (mdtest)
– MDS machine
– 1 active MDS, 1 standby-replay MDS
– 27 OSDs in data pool, 6 SSD OSDs in metadata pool
– 7*10^6 files/directories, 70 clients (an example invocation follows the table)
MDS nodes
CPU Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
RAM 192GB
NIC 10GbE
OS CentOS 7.1.1503
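For context, an mdtest run of roughly this shape could be launched as below. The MPI rank count matches the 70 clients, but the exact flags, paths and per-client file counts are assumptions for illustration, not the original test command.

```python
# Sketch: launch mdtest over MPI against a CephFS mount.
# Flags: -n items per process, -d working directory,
#        -u unique working directory per task (the "job separated" case).
import subprocess

CLIENTS = 70
ITEMS_PER_CLIENT = 100000        # 70 * 100000 = 7 * 10^6 files/directories

subprocess.check_call([
    "mpirun", "-np", str(CLIENTS),
    "mdtest",
    "-n", str(ITEMS_PER_CLIENT),
    "-d", "/mnt/cephfs/mdtest",
    "-u",                        # drop this flag for the shared-directory case
])
```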
• MDS Performance Evaluation
CephFS
Metadata Operation: Result (ops/sec)
File Creation (shared directory): 2624.09
File Creation (job-separated directories): 4311.339
Stat: 11000
File Removal (shared directory): 788.960
File Removal (job-separated directories): 2531.538
Directory Creation (shared directory): 794.030
Directory Creation (job-separated directories): 3497.949
Directory Removal (shared directory): 697.333
Directory Removal (job-separated directories): 2848.889
File Open: 6757.269269
File Rename (shared directory): 485.083123
File Rename (job-separated directories): 3073.370671
Utime: 2947.364765
Readdir: 243844.3312
CephFS
• MDS Performance Evaluation
– Slow metadata modification writeback
• Caused by O(n) list::size() in gcc earlier than 5.0
• https://github.com/ceph/ceph/commit/7e0a27a5c8b7d12d378de4d700ed7a95af7860c3
– Single-threaded MDS, low CPU utilization
CephFS
• Considerations about putting CephFS online
– Access control
• Namespace?
• Kernel limitation → one pool for each user (a per-user pool sketch follows this list)
– Active-Active MDS or Active-Standby MDS?
– QoS
• No available solution yet
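One way the "one pool per user" workaround could look, sketched with standard ceph and setfattr calls. The pool names, filesystem name, mount point and cap strings are placeholders, and the exact cap syntax depends on the release; this is not the deck's actual tooling.

```python
# Sketch: give each tenant a dedicated CephFS data pool, pin the tenant's
# directory to it via the file layout, and restrict the client key to that pool.
import subprocess

def run(*cmd):
    subprocess.check_call(list(cmd))

def add_user_pool(user, fs_name="cephfs", pg_num="128"):
    pool = "cephfs_data_%s" % user
    run("ceph", "osd", "pool", "create", pool, pg_num)
    run("ceph", "fs", "add_data_pool", fs_name, pool)
    # Pin the user's directory to the new pool via the directory layout
    # (assumes the CephFS is mounted at /mnt/cephfs and the directory exists).
    run("setfattr", "-n", "ceph.dir.layout.pool", "-v", pool,
        "/mnt/cephfs/%s" % user)
    # Client key limited to that pool; mds still needs at least "allow r".
    run("ceph", "auth", "get-or-create", "client.%s" % user,
        "mon", "allow r",
        "mds", "allow r, allow rw path=/%s" % user,
        "osd", "allow rw pool=%s" % pool)

add_user_pool("alice")
```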
CephFS
• Online Clusters
– 3 small clusters
• 1 active MDS, 1 standby-replay MDS
• 3 OSD/MON machines, 27 OSDs in data pool, 6 SSD OSDs in metadata pool
[Diagram: data pool OSDs keep their data partitions (ext4/xfs) on SATA HDDs with journals on SSD partitions (sdk1, sdk2); metadata pool OSDs live on SSD partitions (e.g. sdk3)]
CephFS
Online Experience
• mds "r" cap must be given to every user
– Users may see each other's directory subtree structures
• Kernel limitations
– Most users use CentOS 7.4, kernel 3.10.0-693; many patches are NOT backported
– Some users run kernel 2.6.32…
– Could NFS or Samba be a solution?
CephFS
Online Experience
– Slow "getattr" when lots of clients are issuing reads/writes
• http://tracker.ceph.com/issues/22925
• getattrs are blocked in filelock's LOCK_SYNC_MIX state
• When filelock gets out of the LOCK_SYNC_MIX state, the MDS has to reprocess all blocked getattrs one by one
• Almost every getattr request makes filelock go into the LOCK_SYNC_MIX state again
Thanks
Q&A
