Lincoln Bryant • University of Chicago
Ceph Day Chicago
August 18th, 2015
Using Ceph for Large
Hadron Collider Data
about us
LHC
ATLAS site
5.4 miles across (17 mi circumference)
⇐ Standard Model of Particle Physics
Higgs boson: final piece, discovered in 2012 ⇒ Nobel Prize
2015 → 2018: cool new physics searches underway at 13 TeV
← credit: Katherine Leney (UCL, March 2015)
ATLAS detector
● Run2 center of mass energy = 13 TeV (Run1: 8 TeV)
● 40 MHz proton bunch crossing rate
○ 20-50 collisions/bunch crossing (“pileup”)
● Trigger (filters) reduces raw rate to ~1 kHz
● Events are written to disk at ~1.5 GB/s
LHC
ATLAS detector
100M active sensors
toroid magnets
inner tracking
person (scale)
Not shown:
Liquid argon calorimeter (electrons, photons)
Tile calorimeters (hadrons)
Muon chambers
Forward detectors
ATLAS data & analysis
Primary data from CERN, globally processed (event reconstruction and analysis)
Role for Ceph: analysis datasets & object store for single events
3x100 Gbps
Ceph technologies used
● Currently:
○ RBD
○ CephFS
● Future:
○ librados
○ RadosGW
Our setup
● Ceph v0.94.2 on Scientific Linux 6.6
● 14 storage servers
● 12 x 6 TB disks, no dedicated journal devices
○ Could buy PCI-E SSD(s) if the performance is needed
● Each connected at 10 Gbps
● Mons and MDS virtualized
● CephFS pools using erasure coding + cache tiering (rough setup sketch below)
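As a rough sketch of how an erasure-coded data pool plus cache tier can be put behind CephFS with the stock ceph CLI (Hammer-era syntax), driven from Python for convenience. The pool names, PG counts, and k/m values are illustrative placeholders, not our production settings.

```python
# Sketch only: erasure-coded CephFS data pool fronted by a replicated
# writeback cache tier. Pool names, PG counts, and k/m values are
# placeholders; try this on a test cluster, not production.
import subprocess

def ceph(*args):
    """Run a `ceph` CLI command and raise if it fails."""
    subprocess.check_call(["ceph"] + list(args))

# Erasure-code profile and the EC pool that will hold the bulk data
ceph("osd", "erasure-code-profile", "set", "ecprofile", "k=4", "m=2")
ceph("osd", "pool", "create", "cephfs_data_ec", "1024", "1024", "erasure", "ecprofile")

# Replicated pools: cache tier in front of the EC pool, plus CephFS metadata
ceph("osd", "pool", "create", "cephfs_cache", "512")
ceph("osd", "pool", "create", "cephfs_metadata", "256")

# Wire up the cache tier in writeback mode and overlay it on the EC pool
ceph("osd", "tier", "add", "cephfs_data_ec", "cephfs_cache")
ceph("osd", "tier", "cache-mode", "cephfs_cache", "writeback")
ceph("osd", "tier", "set-overlay", "cephfs_data_ec", "cephfs_cache")

# Finally, create the filesystem on top of the metadata + tiered data pools
ceph("fs", "new", "cephfs", "cephfs_metadata", "cephfs_data_ec")
```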
Ceph Storage Element
● ATLAS uses the Open Science Grid middleware in the US
○ among other things: facilitates data management and transfer between sites
● Typical sites will use Lustre, dCache, etc. as the “storage element” (SE)
● Goal: Build and productionize a storage element based on Ceph
XRootD
● Primary protocol for accessing files within ATLAS
● Developed at the Stanford Linear Accelerator Center (SLAC)
● Built to support standard high-energy physics analysis tools (e.g., ROOT)
○ Supports remote reads, caching, etc.
● Federated over the WAN via a hierarchical system of ‘redirectors’
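To make the access model concrete, here is a minimal remote-read sketch using the XRootD Python bindings; the server URL and file path are invented, and the exact module and class names are assumed from the pyxrootd bindings rather than taken from the slides.

```python
# Minimal remote-read sketch with the XRootD Python bindings (pyxrootd).
# Server, port, and file path below are placeholders; redirector lookups
# and caching are handled by the client library.
from XRootD import client
from XRootD.client.flags import OpenFlags

f = client.File()
status, _ = f.open(
    "root://xrootd.example.org:1094//atlas/some/dataset/file.root",
    OpenFlags.READ,
)
if not status.ok:
    raise RuntimeError(status.message)

status, first_kb = f.read(offset=0, size=1024)  # partial read over the WAN
print("read", len(first_kb), "bytes")
f.close()
```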
Ceph and XRootD
● How to pair our favorite access protocol with our favorite storage platform?
● Original approach: RBD + XRootD
○ Performance was acceptable
○ Problem: RBD only mounted on 1 machine
■ Can only run one XRootD server
○ Could create new RBDs and add to XRootD cluster to scale out
■ Problem: NFS exports for interactive users become a lot trickier
Ceph and XRootD
● Current approach: CephFS + XRootD
○ All XRootD servers mount CephFS via the kernel client
■ Scale-out is a breeze
○ Fully POSIX filesystem, integrates simply with existing infrastructure
● Problem: users want to read/write the filesystem directly via CephFS, but XRootD needs to own the files it serves
○ Permissions issues galore
Squashing with Ganesha NFS
● XRootD does not run in a privileged mode
○ Cannot modify/delete files written by users
○ Users can’t modify/delete files owned by XRootD
● How to allow users to read/write via an FS mount?
● Using Ganesha to export CephFS as NFS and squash all users to the XRootD user
○ Doesn’t prevent users from stomping on each other’s files, but works well enough in practice
Transfers from CERN to Chicago
● Using Ceph as the backend store for data from the LHC
● Analysis input datasets for regional physics analysis
● Easily obtain 200 MB/s from Geneva to our Ceph storage system in Chicago
How does it look in practice?
● Pretty good!
Potential evaluations
● XRootD with librados plugin
○ Skip the filesystem, write directly to the object store
○ XRootD handles POSIX filesystem semantics as a pseudo-MDS
○ Three ways of accessing:
■ Directly access files via XRootD clients
■ Mount XRootD via the FUSE client
■ LD_PRELOAD hook to intercept system calls to /xrootd
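For reference, this is roughly what “skip the filesystem” looks like from the client side with the librados Python binding; the pool and object names are invented for the example and have nothing to do with the XRootD plugin itself.

```python
# Minimal librados sketch: store and fetch an object directly in a RADOS
# pool, with no filesystem in between. Pool and object names are placeholders.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("atlas-events")          # hypothetical pool
    ioctx.write_full("event-0001", b"<event payload>")  # write one object
    payload = ioctx.read("event-0001")                  # read it back
    print(len(payload), "bytes read back")
    ioctx.close()
finally:
    cluster.shutdown()
```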
Cycle scavenging Ceph servers
Ceph and the batch system
● Goal: Run Ceph and user analysis jobs on the
same machines
● Problem: Poorly defined jobs can wreak havoc
on the Ceph cluster
○ e.g., machine starts heavily swapping, OOM killer starts
killing random processes including OSDs, load spikes to
hundreds, etc..
Ceph and the batch system
● Solution: control groups (cgroups)
● Configured the batch system (HTCondor) to use cgroups to limit the amount of CPU/RAM used on a per-job basis
● We let HTCondor scavenge about 80% of the cycles
○ May need to be tweaked as our Ceph usage increases.
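Purely to illustrate the mechanism (HTCondor manages this itself; the group name, limits, and job command below are arbitrary placeholders), a minimal Python sketch of what a per-job cgroup limit looks like at the cgroup-v1 level found on SL6-era kernels:

```python
# Illustrative only: cap a single job's memory and relative CPU weight with
# cgroup v1 by writing to /sys/fs/cgroup directly (requires root). HTCondor
# does the equivalent internally; names and values here are placeholders.
import os
import subprocess

CGROOT = "/sys/fs/cgroup"
JOB = "demo_job"

def write(path, value):
    with open(path, "w") as fh:
        fh.write(str(value))

# Create per-job groups under the memory and cpu controllers
os.makedirs(os.path.join(CGROOT, "memory", JOB), exist_ok=True)
os.makedirs(os.path.join(CGROOT, "cpu", JOB), exist_ok=True)

# 2 GB hard memory cap; cpu.shares is a relative weight (default is 1024)
write(os.path.join(CGROOT, "memory", JOB, "memory.limit_in_bytes"), 2 * 1024**3)
write(os.path.join(CGROOT, "cpu", JOB, "cpu.shares"), 256)

# Start the job and move it into both cgroups via their tasks files
proc = subprocess.Popen(["python", "-c", "while True: pass"])  # placeholder job
write(os.path.join(CGROOT, "memory", JOB, "tasks"), proc.pid)
write(os.path.join(CGROOT, "cpu", JOB, "tasks"), proc.pid)
```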
Ceph and the batch system
● Working well thus far:
Ceph and the batch system
● Further work in this area:
○ Need to configure the batch system to immediately kill jobs when Ceph-related load goes up
■ e.g., disk failure
○ Re-nice OSDs to maximum priority
○ May require investigation into limiting network saturation
ATLAS Event Service and RadosGW
Higgs boson detection
ATLAS Event Service
● Deliver single ATLAS events for processing
○ Rather than a complete dataset - “fine grained”
● Able to efficiently fill opportunistic resources like AWS instances (spot pricing), semi-idle HPC clusters, BOINC
● Can be evicted from resources immediately with negligible loss of work
● Output data is streamed to remote object storage
ATLAS Event Service
● Rather than pay for S3, RadosGW fits this use case perfectly
● Colleagues at Brookhaven National Lab have deployed a test instance already
○ interested in providing this service as well
○ could potentially federate gateways
● Still in the pre-planning stage at our site
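For a feel of the client side, a short sketch of pushing Event Service output to a RadosGW endpoint over its S3-compatible API with boto; the endpoint hostname, credentials, and bucket/key names are placeholders.

```python
# Sketch: S3-style access to a RadosGW endpoint using boto (pre-boto3).
# Host, credentials, and bucket/key names below are placeholders.
import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    host="radosgw.example.org",
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

# Upload one job's streamed output and list what is in the bucket
bucket = conn.create_bucket("eventservice-output")
key = bucket.new_key("job-1234/output.root")
key.set_contents_from_filename("output.root")
print([k.name for k in bucket.list()])
```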
Final thoughts
June 2015 event: 17 p-p collisions in one event
Final thoughts
● Overall, quite happy with Ceph
○ Storage endpoint should be in production soon
○ More nodes on the way: plan to expand to 2 PB
● Looking forward to new CephFS features like quotas, offline fsck, etc.
● Will be experimenting with Ceph pools shared between data centers with low RTT in the near future
● Expect Ceph to play an important role in ATLAS data processing ⇒ new discoveries
Questions?
cleaning up inside ATLAS :)