SF BAY AREA CEPH
USERS GROUP

INAUGURAL MEETUP

Thursday, January 16, 14
AGENDA
Intro to Ceph
Ceph Networking
Public Topologies
Cluster Topologies
Network Hardware

THE FORECAST
By 2020, over 39 ZB of data will be stored.
1.5 ZB are stored today.

THE PROBLEM
 Existing systems don’t scale
 Increasing cost and complexity
 Need to invest in new platforms ahead of time
[Chart: growth of data vs. IT storage budget, 2010 to 2020]

THE SOLUTION

PAST: SCALE UP

FUTURE: SCALE OUT

CEPH
INTRO TO CEPH
 Distributed storage system
 Horizontally scalable
 No single point of failure
 Self healing and self managing
 Runs on commodity hardware
 GPLv2 License

ARCHITECTURE

SERVICE COMPONENTS, PART 1
MONITOR
 Paxos for consensus
 Maintain cluster state
 Typically 3-5 nodes
 NOT in write path

OSD
 Object storage interface
 Gossips with peers
 Data lives here

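Note on monitor counts: the monitors form a quorum by majority, which is why 3 or 5 nodes are typical. The sketch below is plain arithmetic rather than anything taken from Ceph itself; the function names are made up for illustration.

```python
def quorum_size(monitors: int) -> int:
    """Smallest majority of a monitor set (illustrative helper, not a Ceph API)."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can fail while a majority is still reachable."""
    return monitors - quorum_size(monitors)


for n in (3, 4, 5):
    print(f"{n} monitors: quorum={quorum_size(n)}, tolerates={tolerated_failures(n)}")
```

A fourth monitor raises the quorum size without tolerating any extra failure, which is the usual argument for odd monitor counts.
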
SERVICE COMPONENTS, PART 2
RADOS GATEWAY
 Provides S3/Swift compatibility
 Scale out

METADATA
 Object storage interface
 Gossips with peers
 Dynamic subtree partitioning

CRUSH
 Ceph uses CRUSH for data placement
 Aware of cluster topology
 Statistically even distribution across pool
 Supports asymmetric nodes and devices
 Hierarchical weighting

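Note on weighted placement: the idea of computing placement from a hash and per-device weights, instead of looking it up in a table, can be illustrated with weighted rendezvous hashing. This is a stand-in technique, not the CRUSH algorithm, and it ignores the hierarchy (hosts, racks) that CRUSH understands. The OSD names and weights are invented for the example.

```python
import hashlib
import math


def _unit_hash(obj_name: str, osd: str) -> float:
    """Deterministically map (object, OSD) to a float strictly inside (0, 1)."""
    digest = hashlib.sha256(f"{obj_name}/{osd}".encode()).digest()
    value = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits
    return (value + 0.5) / 2**52


def place(obj_name: str, weights: dict[str, float], replicas: int) -> list[str]:
    """Pick `replicas` distinct OSDs; heavier OSDs win proportionally more often.

    Weighted rendezvous hashing: score = -weight / ln(u), highest scores win.
    """
    scores = {
        osd: -weight / math.log(_unit_hash(obj_name, osd))
        for osd, weight in weights.items()
        if weight > 0
    }
    return sorted(scores, key=scores.get, reverse=True)[:replicas]


# Hypothetical cluster: osd.2 has twice the capacity of the other two.
weights = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0}
print(place("rbd_data.1234", weights, replicas=2))
```

Any client holding the same weights computes the same answer, so no central lookup is needed; that property, not the exact formula, is what the sketch shares with CRUSH.
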
DATA PLACEMENT

POOLS
 Groupings of OSDs
 Both physical and logical
 Volumes / Images
 Hot SSD pool
 Cold SATA pool
 DMCrypt pool

REPLICATION
 Original data durability mechanism
 Ceph creates N replicas of each RADOS object
 Uses CRUSH to determine replica placement
 Required for mutable objects (RBD, CephFS)
 More reasonable for smaller installations

ERASURE CODING (Firefly release)
 (8:4) MDS code in example
 1.5x overhead
 8 units of client data to write
 4 parity units generated using FEC
 All 12 units placed with CRUSH
 Any 8 of the 12 units satisfy a read

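Note on the 1.5x figure: the overhead of a k data + m parity code is (k + m) / k, so the (8:4) example stores 12 units for every 8 units of client data. The sketch below compares that with replication; the choice of 3 replicas and the 100 TB figure are assumptions for illustration, since the slides only say "N replicas".

```python
from fractions import Fraction


def erasure_overhead(k: int, m: int) -> Fraction:
    """Raw bytes stored per byte of client data for k data + m parity units."""
    return Fraction(k + m, k)


def replication_overhead(copies: int) -> Fraction:
    """Raw bytes stored per byte of client data with full copies."""
    return Fraction(copies)


usable_tb = 100  # hypothetical amount of client data
ec = erasure_overhead(8, 4)
rep = replication_overhead(3)  # assumed replica count
print(f"EC 8+4:  {float(ec)}x raw, {float(usable_tb * ec)} TB, survives 4 lost units")
print(f"3 copies: {float(rep)}x raw, {float(usable_tb * rep)} TB, survives 2 lost copies")
```
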
CLIENT COMPONENTS
Native API
 Mutable object store
 Many language bindings
 Object classes

CephFS
 Linux Kernel CephFS client since 2.6.34
 FUSE client
 Hadoop JNI bindings

CLIENT COMPONENTS
Block Storage
 Linux Kernel RBD client since 2.6.37
 KVM/QEMU integration
 Xen integration

S3/Swift
 RESTful interfaces (HTTP)
 CRUD operations
 Usage accounting for billing

Ceph Networking
INFINIBAND
 Currently only supported via IPoIB
 Accelio (libxio) integration in Ceph is in early stages
 Accelio supports multiple transports: RDMA, TCP, and shared memory
 Accelio supports multiple RDMA transports (IB, RoCE, iWARP)

ETHERNET
 Tried and true
 Proven at scale
 Economical
 Many suitable vendors

10GbE or 1GbE
 Cost of 10GbE trending downward
 White box switches turning up heat on vendors
 Twinax relatively inexpensive and low power
 SFP+ is versatile wrt distance
 Single 10GbE for object
 Dual 10GbE for block storage (public/cluster)
 Bonding many 1GbE links adds lots of complexity

IPv4 or IPv6 Native
 It’s 2014, is this really a question?
 Ceph fully supports both modes of operation
 Hierarchical allocation models allow “roll up” of routes
 Optimal efficiency in RIB
 Some tools believe the earth is flat

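Note on route "roll up": when rack prefixes are allocated hierarchically, contiguous prefixes collapse into one summary route, which keeps the RIB small. A minimal sketch with Python's standard `ipaddress` module follows; the prefixes come from the 2001:db8::/32 documentation range and the per-rack /64 layout is an assumption.

```python
import ipaddress

# Four hypothetical rack subnets allocated back to back.
rack_prefixes = [ipaddress.ip_network(f"2001:db8:0:{i:x}::/64") for i in range(4)]

# Contiguous, aligned prefixes roll up into a single summary route.
summary = list(ipaddress.collapse_addresses(rack_prefixes))
print(summary)  # [IPv6Network('2001:db8::/62')]

# A prefix allocated outside the hierarchy cannot be rolled up.
stray = ipaddress.ip_network("2001:db8:0:8::/64")
print(list(ipaddress.collapse_addresses(rack_prefixes + [stray])))
```
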
LAYER 2
 Spanning tree
 Switch table size
 Broadcast domains (ARP)
 MAC frame checksum
 Storage protocols (FCoE, ATAoE)
 TRILL, MLAG
 Layer 2 DCI is crazy pants
 Layer 2 tunneled over internet is super crazy pants

LAYER 3
 Address and subnet planning
 Proven scale at big web shops
 Error detection only on the IP header (IPv4 header checksum), not the payload
 Equal cost multi-path (ECMP)
 Reasonable for inter-site connectivity

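Note on ECMP: equal cost multi-path routing typically hashes a flow's 5-tuple to pick one of several equal-cost next hops, so packets of one flow stay in order while different flows spread across paths. The sketch below only illustrates that idea; real switches use their own hardware hash functions, and the addresses and spine names are invented.

```python
import hashlib


def ecmp_next_hop(flow: tuple, next_hops: list[str]) -> str:
    """Pick a next hop from the flow 5-tuple (illustrative hash, not a vendor's)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]


spines = ["spine1", "spine2", "spine3", "spine4"]  # hypothetical equal-cost paths

# (src ip, dst ip, protocol, src port, dst port); 6800 is an example OSD port.
flow = ("10.0.1.15", "10.0.9.22", "tcp", 51512, 6800)
print(ecmp_next_hop(flow, spines))

# Different source ports yield different flows, which spread over the spines.
used = {ecmp_next_hop(("10.0.1.15", "10.0.9.22", "tcp", port, 6800), spines)
        for port in range(50000, 50200)}
print(sorted(used))
```
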
Public Topologies
CLIENT TOPOLOGIES
 Path diversity for resiliency
 Minimize network diameter
 Consistent hop count to minimize net long tail latency
 Ease of scaling
 Tolerate adversarial traffic patterns (fan-in/fan-out)

FOLDED CLOS
 Sometimes called Fat Tree or Spine and Leaf
 Minimum 4 fixed switches, grows to 10k+ node fabrics
 Rack or cluster oversubscription possible
 Non-blocking also possible
 Path diversity
[Diagram: folded Clos fabric of switches (S) with attached nodes 1, 2, ..., N]

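Note on sizing a folded Clos: oversubscription at a leaf is the ratio of host-facing to spine-facing bandwidth, and the host count of a non-blocking fabric built from identical k-port switches follows from simple port counting. The sketch below is back-of-the-envelope arithmetic; the port counts and speeds are example values, not figures from the talk.

```python
def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    """Host-facing bandwidth divided by spine-facing bandwidth at one leaf."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)


def two_tier_max_hosts(ports: int) -> int:
    """Non-blocking spine/leaf: each leaf splits its ports half down, half up,
    and each spine port connects one leaf, so there are at most `ports` leaves."""
    return ports * (ports // 2)


def three_tier_max_hosts(ports: int) -> int:
    """Non-blocking three-tier fat tree of k-port switches: k^3 / 4 hosts."""
    return ports ** 3 // 4


# Example leaf: 48 x 10GbE to hosts, 4 x 40GbE to spines.
print(oversubscription(48, 10, 4, 40))   # 3.0, i.e. 3:1
print(two_tier_max_hosts(64))            # 2048
print(three_tier_max_hosts(64))          # 65536
```
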
Cluster Topologies
REPLICA TOPOLOGIES
 Replica and erasure fan-out
 Recovery and remap impact on cluster bandwidth
 OSD peering
 Backfill served from primary
 Tune backfills to avoid large fan-in

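Note on fan-out: for each client write, the primary OSD forwards data to the other OSDs in the set over the cluster network. The rough estimate below assumes the primary keeps one copy (or one chunk) locally and ignores protocol overhead and stripe padding; it is a sizing aid, not a description of Ceph's wire behavior.

```python
def replicated_fanout_bytes(object_bytes: int, replicas: int) -> int:
    """Bytes the primary sends to the other replicas for one write."""
    return object_bytes * (replicas - 1)


def erasure_fanout_bytes(object_bytes: int, k: int, m: int) -> int:
    """Bytes the primary sends when it keeps one of the k + m chunks itself."""
    chunk = -(-object_bytes // k)  # ceiling division: size of one chunk
    return chunk * (k + m - 1)


MiB = 1024 * 1024
print(replicated_fanout_bytes(4 * MiB, replicas=3) // MiB)  # 8
print(erasure_fanout_bytes(8 * MiB, k=8, m=4) // MiB)       # 11
```

Erasure coding sends less data in total than 3-way replication for the same object, but spreads it over more peers, which is why fan-out matters when choosing a cluster topology.
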
FOLDED CLOS
 Sometimes called Fat Tree or Spine and Leaf
 Minimum 4, grows to 10k+ node fabrics
 Rack or cluster oversubscription possible
 Non-blocking also possible
 Path diversity
[Diagram: folded Clos fabric of switches (S) with attached nodes 1, 2, ..., N]

N-WAY PARTIAL MESH

EVALUATE
 Replication
 Erasure coding
 Special purpose vs general purpose
 Extra port cost

Network Hardware
Features
 Buffer sizes
 Cut through vs store and forward
 Oversubscribed vs non-blocking
 Automation and monitoring

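Note on cut through vs store and forward: a store-and-forward switch must receive the whole frame before sending it on, so each hop adds one full serialization delay; a cut-through switch starts forwarding once it has read enough of the header. The sketch below assumes 64 bytes are read before forwarding begins and ignores queuing and propagation delay; both are simplifying assumptions.

```python
def serialization_delay_us(frame_bytes: int, link_gbps: float) -> float:
    """Time to clock a frame onto the wire, in microseconds."""
    return frame_bytes * 8 / (link_gbps * 1_000)


def added_latency_us(frame_bytes: int, link_gbps: float, hops: int,
                     cut_through: bool, header_bytes: int = 64) -> float:
    """Per-path switching delay; header_bytes is an assumed cut-through threshold."""
    per_hop = header_bytes if cut_through else frame_bytes
    return hops * serialization_delay_us(per_hop, link_gbps)


# 1500-byte frame crossing leaf, spine, leaf at 10GbE.
print(added_latency_us(1500, 10, hops=3, cut_through=False))
print(added_latency_us(1500, 10, hops=3, cut_through=True))
```
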
FIXED
 Fixed switches can easily build large clusters
 Easier to source
 Smaller failure domains
 Fixed designs have many control planes
 Virtual chassis.. L3 split brain hilarity?

FEWER SKUs
 Utilize as few vendor SKUs as possible
 If permitted, use same fixed switch for spine and leaf
 More affordable to have spares on site or more spares
 Quicker MTTR when gear is ready to go

Thanks to our host!

Kyle Bader
Sr. Solutions Architect

kyle@inktank.com

