Lincoln Bryant • University of Chicago
Ceph Day Chicago
August 18th, 2015
Using Ceph for Large
Hadron Collider Data
about us
● ATLAS at the Large Hadron Collider at CERN
○ Tier2 center at UChicago, Indiana, UIUC/NCSA:
■ 12,000 processing slots, 4 PB of storage
○ 140 PB on disk in 120 data centers worldwide + CERN
● Open Science Grid: high throughput computing
○ Supporting “large science” and small research labs on
campuses nationwide
○ >100k cores, >800M CPU-hours/year, ~PB transfer/day
● Users of Ceph for 2 years
○ started with v0.67, using v0.94 now
○ 1 PB, more to come
LHC
ATLAS site marked; ring is 5.4 miles across (17 mi circumference)
⇐ Standard Model of Particle Physics
Higgs boson: final piece, discovered in 2012 ⇒ Nobel Prize
2015 → 2018: cool new physics searches underway at 13 TeV
← credit: Katherine Leney (UCL, March 2015)
ATLAS detector
● Run 2 center-of-mass energy = 13 TeV (Run 1: 8 TeV)
● 40 MHz proton bunch crossing rate
○ 20-50 collisions/bunch crossing (“pileup”)
● Trigger (filters) reduces raw rate to ~1 kHz
● Events are written to disk at ~1.5 GB/s
LHC
ATLAS detector
100M active sensors
toroid magnets
inner tracking
person (scale)
Not shown:
Liquid argon calorimeter (electrons, photons)
Tile calorimeters (hadrons)
Muon chambers
Forward detectors
ATLAS data & analysis
Primary data from CERN, globally processed (event reconstruction and analysis)
Role for Ceph: analysis datasets & object store for single events
3x100 Gbps
Ceph technologies used
● Currently:
○ RBD
○ CephFS
● Future:
○ librados
○ RadosGW
Our setup
● Ceph v0.94.2 on Scientific Linux 6.6
● 14 storage servers
● 12 x 6 TB disks per server, no dedicated journal devices
○ Could buy PCI-E SSDs if more performance is needed
● Each connected at 10 Gbps
● Mons and MDS virtualized
● CephFS pools using erasure coding + cache
tiering
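
For reference, the erasure-coded data pool with a writeback cache tier can be set up roughly as follows on hammer. This is a minimal sketch, not our production configuration; the profile, pool names, PG counts, and cache size are placeholders.

```
# illustrative sketch (hammer-era syntax); names and sizes are placeholders
ceph osd erasure-code-profile set ecprofile k=4 m=2 ruleset-failure-domain=host
ceph osd pool create cephfs-data-ec 1024 1024 erasure ecprofile
ceph osd pool create cephfs-cache 1024 1024 replicated

# put a replicated writeback cache tier in front of the EC pool
ceph osd tier add cephfs-data-ec cephfs-cache
ceph osd tier cache-mode cephfs-cache writeback
ceph osd tier set-overlay cephfs-data-ec cephfs-cache

# bound the cache (value here is ~10 TB, purely illustrative)
ceph osd pool set cephfs-cache target_max_bytes 10995116277760
```

The cache tier is what lets CephFS sit on top of an erasure-coded pool in this release.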
Ceph Storage Element
● ATLAS uses the Open Science Grid
middleware in the US
○ among other things: facilitates data management and
transfer between sites
● Typical sites use Lustre, dCache, etc. as the
“storage element” (SE)
● Goal: Build and productionize a storage
element based on Ceph
XRootD
● Primary file access protocol for accessing files
within ATLAS
● Developed at the Stanford Linear Accelerator Center (SLAC)
● Built to support standard high-energy physics
analysis tools (e.g., ROOT)
○ Supports remote reads, caching, etc
● Federated over WAN via hierarchical system of
‘redirectors’
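
As a rough illustration of the redirector hierarchy (not our actual configuration), a minimal XRootD cluster config for a data server looks something like the sketch below; the hostname, port, and exported path are placeholders.

```
# illustrative config for a data server joining a federation;
# hostname, port, and exported path are placeholders
all.export /atlas
all.role server
# the redirector host itself runs "all.role manager" instead
all.manager redirector.example.org:1213
```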
Ceph and XRootD
● How to pair our favorite access protocol with
our favorite storage platform?
Ceph and XRootD
● How to pair our favorite access protocol with
our favorite storage platform?
● Original approach: RBD + XRootD
○ Performance was acceptable
○ Problem: RBD only mounted on 1 machine
■ Can only run one XRootD server
○ Could create new RBDs and add to XRootD cluster to
scale out
■ Problem: NFS exports for interactive users become
a lot trickier
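
For context, the original approach boiled down to something like the sketch below (pool, image name, and size are placeholders). Because the local filesystem on the image is not cluster-aware, only one host can safely mount it read-write, which is what limited us to a single XRootD server per RBD.

```
# illustrative only -- pool, image name, and size are placeholders
rbd create xrootd-data --pool rbd --size 10485760   # 10 TB (rbd sizes are in MB)
rbd map rbd/xrootd-data                             # appears as /dev/rbd0
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /export/xrootd                      # safe on one host only
```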
Ceph and XRootD
● Current approach: CephFS + XRootD
○ All XRootD servers mount CephFS via the kernel client (see the mount example below)
■ Scale out is a breeze
○ Fully POSIX filesystem, integrates simply with existing
infrastructure
● Problem: Users want to r/w to the filesystem
directly via CephFS, but XRootD needs to own
the files it serves
○ Permissions issues galore
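
The kernel-client mount mentioned above is a one-liner per server; the monitor addresses, cephx user name, and secret file below are placeholders, not our actual values.

```
# illustrative kernel-client mount; monitors and credentials are placeholders
mount -t ceph mon1.example.org:6789,mon2.example.org:6789:/ /cephfs \
      -o name=xrootd,secretfile=/etc/ceph/xrootd.secret
```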
Squashing with Ganesha NFS
● XRootD does not run in a privileged mode
○ Cannot modify/delete files written by users
○ Users can’t modify/delete files owned by XRootD
● How to allow users to read/write via FS mount?
● Using Ganesha to export CephFS as NFS and
squash all users to the XRootD user
○ Doesn’t prevent users from stomping on each other’s
files, but works well enough in practice
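
A sketch of what such an export looks like in ganesha.conf; the pseudo path and the uid/gid of the xrootd account are assumptions, not our actual values.

```
# ganesha.conf (illustrative): export CephFS over NFS, squashing everyone
# to the xrootd account
EXPORT {
    Export_Id = 1;
    Path = "/";                # CephFS root
    Pseudo = "/cephfs";
    Access_Type = RW;
    Squash = all_squash;       # map every client uid/gid to the anonymous ids
    Anonymous_Uid = 10940;     # hypothetical uid of the xrootd user
    Anonymous_Gid = 10940;
    FSAL {
        Name = CEPH;
    }
}
```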
Transfers from CERN to Chicago
● Using Ceph as the backend store for data
from the LHC
● Analysis input data sets for regional physics
analysis
● Easily obtain 200 MB/s from Geneva to our
Ceph storage system in Chicago
How does it look in practice?
● Pretty good!
Potential evaluations
● XRootD with librados plugin
○ Skip the filesystem, write directly to object store
○ XRootD handles POSIX filesystem semantics as a
pseudo-MDS
○ Three ways of accessing:
■ Directly access files via XRootD clients
■ Mount XRootD via FUSE client
■ LD_PRELOAD hook to intercept system calls to
/xrootd
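
To make "skip the filesystem" concrete, this is the librados layer such a plugin would build on, shown here through the Python bindings as a minimal sketch; the pool and object names are made up.

```python
import rados

# connect with the usual cluster config and keyring
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# 'xrootd-objects' is a hypothetical pool name
ioctx = cluster.open_ioctx('xrootd-objects')
ioctx.write_full('event-0001', b'...event payload...')   # store one object
data = ioctx.read('event-0001')                          # read it back

ioctx.close()
cluster.shutdown()
```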
Cycle scavenging Ceph servers
Ceph and the batch system
● Goal: Run Ceph and user analysis jobs on the
same machines
● Problem: Poorly defined jobs can wreak havoc
on the Ceph cluster
○ e.g., a machine starts heavily swapping, the OOM killer starts killing random processes (including OSDs), load spikes to hundreds, etc.
Ceph and the batch system
● Solution: control groups (cgroups)
● Configured batch system (HTCondor) to use
cgroups to limit the amount of CPU/RAM used
on a per-job basis
● We let HTCondor scavenge about 80% of the
cycles
○ May need to be tweaked as our Ceph usage increases.
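
The relevant HTCondor knobs look roughly like the snippet below; the values are illustrative for a hypothetical 32-core, 128 GB node, not our production settings.

```
# condor_config snippet (illustrative values for a 32-core, 128 GB node)

# run each job inside its own cgroup and enforce its memory request
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = hard

# leave roughly 20% of the node for the Ceph OSDs:
# advertise 26 of 32 cores and hold back 24 GB of RAM
NUM_CPUS = 26
RESERVED_MEMORY = 24576
```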
Ceph and the batch system
● Working well thus far:
Ceph and the batch system
● Further work in this area:
○ Need to configure the batch system to immediately kill
jobs when Ceph-related load goes up
■ e.g., disk failure
○ Re-nice OSDs to maximum priority (see the sketch below)
○ May require investigation into limiting network saturation
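
The re-nice item could be as simple as the following, run on each storage node; a sketch, with -20 being the maximum scheduling priority.

```
# raise CPU priority of all ceph-osd daemons (illustrative)
for pid in $(pgrep ceph-osd); do renice -n -20 -p "$pid"; done
# optionally favor them for disk I/O as well
for pid in $(pgrep ceph-osd); do ionice -c2 -n0 -p "$pid"; done
```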
ATLAS Event Service and RadosGW
Higgs boson detection
ATLAS Event Service
● Deliver single ATLAS events for processing
○ Rather than a complete dataset - “fine grained”
● Able to efficiently fill opportunistic resources
like AWS instances (spot pricing), semi-idle
HPC clusters, BOINC
● Can be evicted from resources immediately
with negligible loss of work
● Output data is streamed to remote object
storage
ATLAS Event Service
● Rather than pay for S3, RadosGW fits this use
case perfectly
● Colleagues at Brookhaven National Lab have
deployed a test instance already
○ interested in providing this service as well
○ could potentially federate gateways
● Still in the pre-planning stage at our site
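
Because RadosGW speaks the S3 protocol, the Event Service client side is just ordinary S3 code pointed at the gateway. A minimal boto sketch follows; the endpoint, credentials, and bucket/object names are placeholders.

```python
import boto
import boto.s3.connection

# hypothetical RadosGW endpoint and credentials
conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='rgw.example.org', port=7480, is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

# write one event's output as an object
bucket = conn.create_bucket('event-service-output')
key = bucket.new_key('run00042/event-0001')
key.set_contents_from_string('...event output...')
```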
Final thoughts
June 2015 event
17 p-p collisions in one event
Final thoughts
● Overall, quite happy with Ceph
○ Storage endpoint should be in production soon
○ More nodes on the way: plan to expand to 2 PB
● Looking forward to new CephFS features like quotas,
offline fsck, etc
● Will be experimenting with Ceph pools shared between
data centers with low RTT in the near future
● Expect Ceph to play important role in ATLAS data
processing ⇒ new discoveries
Questions?
cleaning up inside ATLAS :)