Challenges in Using Persistent Memory in Gluster
Gluster Summit 2016
Dan Lambright
Storage System Software Developer
Adjunct Professor University of Massachusetts Lowell
Aug. 23, 2016
Overview
● Technologies
○ Persistent memory, aka storage class memory (SCM)
○ Distributed storage
● Challenges
○ Network latency (studied Gluster)
○ Accelerating parts of the system with SCM
○ CPU latency (studied Ceph)
Storage Class Memory
What do we know / expect?
● Near DRAM speeds
● Wearability better than SSDs (Intel's claim for 3D XPoint)
● API available
○ Crash-proof transactions
○ Byte or block addressable
● Likely to be at least as expensive as SSDs
● Fast random access
● Has support in Linux
● NVDIMMs are available today
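The "API available" bullet refers to persistent-memory programming libraries such as PMDK's libpmem. The sketch below is a minimal illustration (not taken from the talk) of byte-addressable persistence: map a file, store to it directly, and make the store durable. The path and size are hypothetical; on real hardware the file would sit on a DAX-mounted filesystem backed by an NVDIMM.

    #include <libpmem.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        size_t mapped_len;
        int is_pmem;

        /* Map (creating if needed) a 4 KiB file; the path is hypothetical. */
        char *addr = pmem_map_file("/mnt/pmem0/example", 4096,
                                   PMEM_FILE_CREATE, 0666,
                                   &mapped_len, &is_pmem);
        if (addr == NULL) {
            perror("pmem_map_file");
            return 1;
        }

        /* Byte-addressable store: an ordinary memcpy/strcpy into the mapping. */
        strcpy(addr, "hello, persistent memory");

        /* Flush CPU caches so the data is durable; fall back to msync
         * when the mapping is not real persistent memory. */
        if (is_pmem)
            pmem_persist(addr, mapped_len);
        else
            pmem_msync(addr, mapped_len);

        pmem_unmap(addr, mapped_len);
        return 0;
    }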
The problem
Must lower latencies throughout the system: storage, network, CPU

    Media                  Latency
    HDD                    10 ms
    SSD                    1 ms
    SCM                    < 1 us
    Network (RDMA)         ~10-50 us
    CPU (Ceph, approx.)    ~1000 us
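To make the point concrete, a back-of-the-envelope sum of the slide's approximate figures shows that once SCM is the medium, the software stack and the network, not the storage, dominate the latency of an operation. The numbers below are only the rough values from the table above.

    #include <stdio.h>

    int main(void)
    {
        /* Rough per-operation latencies from the table (microseconds). */
        double scm_us  = 1.0;     /* < 1 us media latency            */
        double rdma_us = 30.0;    /* ~10-50 us network round trip    */
        double cpu_us  = 1000.0;  /* ~1000 us software stack (Ceph)  */

        double total = scm_us + rdma_us + cpu_us;
        printf("media %.1f%%, network %.1f%%, cpu %.1f%% of ~%.0f us total\n",
               100 * scm_us / total, 100 * rdma_us / total,
               100 * cpu_us / total, total);
        return 0;
    }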
Server Side Replication
Latency cost to replicate across nodes
● "Primary copy": update replicas in parallel
○ The primary processes reads and writes
○ Ceph's choice, also JBR
● Other design options
○ Read at the "tail", where the data is always committed
[Diagram: client sends writes to the primary server, which forwards them to Replica 1 and Replica 2]
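A minimal sketch (not from the talk) of the primary-copy idea: in a typical scheme the primary fans the update out to both replicas in parallel and acknowledges the client only after both have committed. Network I/O is replaced by a stub so the example stays self-contained.

    #include <pthread.h>
    #include <stdio.h>

    /* Stub standing in for "send update to a replica and wait for commit". */
    static void *send_to_replica(void *arg)
    {
        const char *replica = arg;
        printf("replicating update to %s\n", replica);  /* real code: RPC/RDMA */
        return NULL;
    }

    /* Primary-copy write path: fan out in parallel, then ack the client. */
    static void primary_copy_write(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, send_to_replica, "replica-1");
        pthread_create(&t2, NULL, send_to_replica, "replica-2");

        /* The client is acknowledged only after both replicas commit. */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("ack client\n");
    }

    int main(void)
    {
        primary_copy_write();
        return 0;
    }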
Client Side Replication
Latency price is lower than server-side software replication
● Uses more client-side bandwidth
● The client likely has a slower network than the servers
[Diagram: client writes directly to each replica]
Improving Network Latency
RDMA
● Avoid OS data copies; free the CPU from the transfer
● The application must manage buffers
○ Reserve memory up front, sized properly
○ Both Ceph and Gluster have good RDMA interfaces
● Extend the protocol to further improve latency? (see the sketch below)
○ Proposed protocol extensions could shrink latency to ~3 us
○ RDMA write completion does not indicate the data was persisted ("ship and pray")
○ An ACK in a higher-level protocol adds overhead
○ Add a "commit bit", perhaps combined with the last write?
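A hedged sketch of the problem the "commit bit" discussion addresses. With plain libibverbs, a completed RDMA WRITE only means the local buffer can be reused, not that the remote side has persisted the data, so some follow-up message is needed; here a zero-byte SEND with immediate data stands in for a hypothetical "commit" request. The queue pair, memory region, and remote address are assumed to be set up elsewhere, and the commit semantics are the slide's proposal, not an existing verb.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /*
     * Write 'len' bytes at 'laddr' to the peer's 'raddr', then ask the peer
     * to persist them.  The WRITE completion alone does not guarantee
     * durability on the remote node ("ship and pray"); the chained SEND
     * carries a hypothetical commit flag in its immediate data.
     */
    int rdma_write_then_commit(struct ibv_qp *qp, struct ibv_mr *mr,
                               void *laddr, uint32_t len,
                               uint64_t raddr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)laddr,
            .length = len,
            .lkey   = mr->lkey,
        };

        struct ibv_send_wr commit = {
            .wr_id      = 2,
            .opcode     = IBV_WR_SEND_WITH_IMM,
            .send_flags = IBV_SEND_SIGNALED,
            .imm_data   = 1,            /* hypothetical "please persist" flag */
        };

        struct ibv_send_wr write = {
            .wr_id      = 1,
            .next       = &commit,      /* post both work requests in one call */
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            .wr.rdma    = { .remote_addr = raddr, .rkey = rkey },
        };

        struct ibv_send_wr *bad = NULL;
        return ibv_post_send(qp, &write, &bad);
        /* Durability is only known once the peer replies (e.g. with its own
         * SEND) after flushing the region to persistent memory. */
    }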
IOPS, RDMA vs. 10 Gbps: GlusterFS 2x replication, 2 clients
Biggest gain with reads, little gain for small I/O.
[Charts: sequential and random I/O, 1024-byte transfers]
Improving Network Latency
Other techniques
● Reduce protocol traffic (discussed more in the next section)
● Coalesce protocol operations (sketched below)
○ With this, a 10% gain was observed in small-file creates on Gluster
● Pipelining
○ In Ceph, for two updates to the same object, start replicating the second before the first completes
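A generic illustration of coalescing (not Gluster's actual wire format; the types and limits are made up): queue small operations and ship them as one message, trading one round trip per operation for one per batch.

    #include <stdio.h>
    #include <string.h>

    #define BATCH_MAX 32

    struct small_op { char path[64]; };

    struct batch {
        struct small_op ops[BATCH_MAX];
        int count;
    };

    static void batch_flush(struct batch *b)
    {
        if (b->count == 0)
            return;
        /* Real code would serialize all queued ops into a single RPC here. */
        printf("sending 1 message carrying %d operations\n", b->count);
        b->count = 0;
    }

    static void batch_add(struct batch *b, const char *path)
    {
        snprintf(b->ops[b->count].path, sizeof(b->ops[b->count].path), "%s", path);
        if (++b->count == BATCH_MAX)
            batch_flush(b);           /* one round trip for BATCH_MAX creates */
    }

    int main(void)
    {
        struct batch b = { .count = 0 };
        char name[64];

        for (int i = 0; i < 100; i++) {
            snprintf(name, sizeof(name), "file_%03d", i);
            batch_add(&b, name);
        }
        batch_flush(&b);              /* flush the partial final batch */
        return 0;
    }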
Adding SCM to Parts of the System
Kernel- and application-level tiering
[Diagram: dm-cache and Ceph tiering]
Gluster Tiering
Illustration of the network problem
● Heterogeneous storage in a single volume
○ Fast/expensive storage acts as a cache for slower storage
○ Fast "hot tier" (e.g. SSD, SCM)
○ Slow "cold tier" (e.g. erasure coded)
● A database tracks files
○ Pro: easy metadata manipulation
○ Con: very slow O(n) enumeration of objects for promotion/demotion scans (see the sketch below)
● Policies:
○ Data is placed on the hot tier until it is "full"
○ Once "full", data is promoted/demoted based on access frequency
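The sketch below (with a made-up table name and schema, not necessarily Gluster's actual database layout) shows why promotion/demotion scans are O(n): each cycle walks every row looking for files whose access count crosses a threshold.

    #include <sqlite3.h>
    #include <stdio.h>

    int main(void)
    {
        sqlite3 *db;
        sqlite3_stmt *stmt;

        if (sqlite3_open("tier.db", &db) != SQLITE_OK)
            return 1;

        /* Hypothetical schema: files(path TEXT, heat INTEGER).
         * Every promotion cycle re-reads the whole table. */
        const char *sql = "SELECT path FROM files WHERE heat > ?";
        if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK) {
            sqlite3_close(db);
            return 1;
        }
        sqlite3_bind_int(stmt, 1, 10);          /* threshold: 10 accesses */

        while (sqlite3_step(stmt) == SQLITE_ROW)
            printf("promote %s\n", (const char *)sqlite3_column_text(stmt, 0));

        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return 0;
    }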
Gluster Tiering
[Performance charts]
Gluster's "Small File" Tiering Problem
Analysis
● Tiering helped large I/Os, not small ones
● The pattern is seen elsewhere:
○ RDMA tests
○ Customer feedback, overall GlusterFS reputation
● Many LOOKUP operations were observed over the network
● Hypothesis: metadata transfers dominate data transfers for small files
○ Speeding up small-file data transfers fails to help overall I/O latency
Understanding LOOKUPs in Gluster
Problem: Path Traversal
● Each directory in the path is tested on an open() by the client's VFS layer (sketched below)
○ Full path traversal, e.g. d1/d2/d3/f1
○ Existence
○ Permission
[Diagram: traversal of d1 → d2 → d3 → f1]
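A minimal sketch (not Gluster code) of why a single open() fans out into one LOOKUP per path component: the client must resolve and validate each directory in turn before it reaches the file.

    #include <stdio.h>
    #include <string.h>

    /* Stand-in for a network LOOKUP on one path component. */
    static void lookup(const char *parent, const char *name)
    {
        printf("LOOKUP %s under %s\n", name, parent);
    }

    int main(void)
    {
        char path[] = "d1/d2/d3/f1";
        char parent[256] = "/";

        /* Resolve the path one component at a time, as a VFS layer would. */
        for (char *comp = strtok(path, "/"); comp; comp = strtok(NULL, "/")) {
            lookup(parent, comp);                 /* existence + permission */
            strncat(parent, comp, sizeof(parent) - strlen(parent) - 2);
            strncat(parent, "/", sizeof(parent) - strlen(parent) - 1);
        }
        return 0;
    }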
Understanding LOOKUPs in Gluster
Problem: Coalescing Distributed Hash Ranges
● The distributed hash space is split into parts
○ A unique space for each directory
○ Each node owns a piece of this "layout"
○ Stored in extended attributes (see the example below)
○ When new nodes are added to the cluster, the layouts are rebuilt
● When a file is opened, the entire layout is rechecked for each directory
○ Each node receives a LOOKUP to retrieve its part of the layout
● Work is underway to improve this
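For illustration, the layout can be inspected as an extended attribute on a brick directory. The sketch below reads it with getxattr(); the brick path is hypothetical, and trusted.glusterfs.dht is the attribute name Gluster conventionally uses for DHT layouts (verify against your version).

    #include <sys/xattr.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned char buf[64];

        /* Hypothetical brick directory; the xattr holds this brick's hash
         * range for the directory. */
        ssize_t n = getxattr("/bricks/brick1/d1", "trusted.glusterfs.dht",
                             buf, sizeof(buf));
        if (n < 0) {
            perror("getxattr");
            return 1;
        }

        printf("layout xattr is %zd bytes:", n);
        for (ssize_t i = 0; i < n; i++)
            printf(" %02x", buf[i]);
        printf("\n");
        return 0;
    }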
LOOKUP Amplification
● Opening d1/d2/d3/f1 issues four LOOKUPs, one per path component
● With four servers (S1-S4), each LOOKUP may go to every server: 16 LOOKUPs total in the worst case
[Diagram: client VFS layer → Gluster client → Gluster servers S1-S4]
Path used in Ben England's "smallfile" utility:
/mnt/p6.1/file_dstdir/192.168.1.2/thrd_00/d_000
Client Metadata Cache: Tiering
Gluster's "md-cache" translator
● Cache file metadata at the client long term
○ WIP, under development
● Invalidate a cache entry on another client's change
○ Invalidate intelligently, not spuriously
○ Some attributes may change frequently (ctime, ...)
[Chart: tiering results; red bars are the cached case]
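A toy sketch of the caching idea, not the md-cache translator itself (md-cache roughly keys by inode and is driven by upcall notifications): the client keeps a small table of per-file attributes and drops only the affected entry when an invalidation arrives.

    #include <sys/types.h>
    #include <stdio.h>
    #include <string.h>

    #define CACHE_SLOTS 8

    struct md_entry {
        char   path[128];
        mode_t mode;
        int    valid;
    };

    static struct md_entry cache[CACHE_SLOTS];

    static unsigned slot(const char *path)
    {
        unsigned h = 5381;
        while (*path)
            h = h * 33 + (unsigned char)*path++;
        return h % CACHE_SLOTS;
    }

    static void md_store(const char *path, mode_t mode)
    {
        struct md_entry *e = &cache[slot(path)];
        snprintf(e->path, sizeof(e->path), "%s", path);
        e->mode = mode;
        e->valid = 1;
    }

    /* Returns 1 on a hit, avoiding a network LOOKUP. */
    static int md_lookup(const char *path, mode_t *mode)
    {
        struct md_entry *e = &cache[slot(path)];
        if (e->valid && strcmp(e->path, path) == 0) {
            *mode = e->mode;
            return 1;
        }
        return 0;
    }

    /* Another client changed the file: invalidate only that entry. */
    static void md_invalidate(const char *path)
    {
        struct md_entry *e = &cache[slot(path)];
        if (strcmp(e->path, path) == 0)
            e->valid = 0;
    }

    int main(void)
    {
        mode_t m;
        md_store("/d1/d2/d3/f1", 0644);
        printf("hit: %d\n", md_lookup("/d1/d2/d3/f1", &m));
        md_invalidate("/d1/d2/d3/f1");
        printf("hit after invalidation: %d\n", md_lookup("/d1/d2/d3/f1", &m));
        return 0;
    }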
Client Metadata Cache: RDMA
One client, 16K 64-byte files, 8 threads, 14 Gbps Mellanox IB
[Chart: red bars are the cached case]
Tiering
Development notes
● Near term (2016)
○ Add/remove brick, partially in review
○ md-cache integration, working with the SMB team
○ Better performance counters
● Longer term (2017)
○ Policy-based tiering
○ More than two tiers?
○ Replace the database with "hitsets" (Ceph-like bloom filters)? (sketched below)
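For context, a hitset in Ceph is a compact record of which objects were touched during an interval, often a bloom filter. The sketch below is a generic bloom filter in C (sizes and hash choices are arbitrary), showing how recent accesses could be tracked without a per-file database row: false positives are possible, false negatives are not.

    #include <stdint.h>
    #include <stdio.h>

    #define BLOOM_BITS 1024

    static uint8_t bloom[BLOOM_BITS / 8];

    static uint32_t hash(const char *s, uint32_t seed)
    {
        uint32_t h = seed;
        while (*s)
            h = h * 31 + (unsigned char)*s++;
        return h % BLOOM_BITS;
    }

    /* Record that an object was accessed in the current interval. */
    static void bloom_add(const char *name)
    {
        uint32_t a = hash(name, 17), b = hash(name, 131);
        bloom[a / 8] |= 1u << (a % 8);
        bloom[b / 8] |= 1u << (b % 8);
    }

    /* "Was it accessed?"  May return 1 spuriously, never 0 spuriously. */
    static int bloom_maybe_contains(const char *name)
    {
        uint32_t a = hash(name, 17), b = hash(name, 131);
        return ((bloom[a / 8] >> (a % 8)) & 1) && ((bloom[b / 8] >> (b % 8)) & 1);
    }

    int main(void)
    {
        bloom_add("d1/d2/d3/f1");                                 /* an access */
        printf("hot? %d\n", bloom_maybe_contains("d1/d2/d3/f1")); /* likely 1  */
        printf("hot? %d\n", bloom_maybe_contains("d1/d2/d3/f2")); /* likely 0  */
        return 0;
    }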
Summary and Discussion
Distributed storage and SCM pose unique latency problems.
● Network latency reductions
○ Use RDMA
○ Reduce round trips by streamlining the protocol, coalescing, etc.
○ Cache at the client
● CPU latency reductions
○ Aggressively optimize / shrink the stack
○ Remove or replace large components
● Consider
○ SCM as a tier/cache
○ Two-way replication
THANK YOU
plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews