Architecting a 35 PB distributed
parallel file system for science
(formerly) Storage and I/O Software Engineer at NERSC, Berkeley Lab, US
(currently) HPC DevOps Engineer at Seqera Labs, Barcelona
Speck&Tech #53
Trento - May 29, 2023
Alberto Chiusole
- (2014-2017) BSc in Information and Business Organization Eng. - U. of Trento
- 5 months exchange student at Technical University of Denmark, Copenhagen
- (2017-2019) MSc in Data Science and Scientific Computing - U. of Trieste
- (2017-2020) HPC Sysadmin and Scientific software developer - eXact Lab, Trieste
- 3 months at CERN, Geneva, to work on Master’s thesis
- Comparison between CephFS at CERN and Lustre FS at eXact lab
- Presented at ISC High Performance in Frankfurt, July 2019
- (2020-2022) Storage and I/O Software Engineer - NERSC, Berkeley Lab, Cal., US
- Worked on Perlmutter and its Lustre FS, first full-flash only 35 PB parallel FS
- (2023 - now) HPC DevOps Engineer - Seqera Labs (remote)
https://www.linkedin.com/in/albertochiusole/
https://bit.ly/Alberto-Chiusole-Scholar
2
How I ended up working on Supercomputers
High Performance Computing (HPC) empowers breakthroughs
- Supercomputers run parallel applications to solve complex problems
- Applications come from all kinds of science
- astrophysics, nuclear physics, molecular design, computational fluid dynamics, nuclear
warheads status simulation, climate and weather forecasts, COVID vaccines (!), to name a few
- Different from grid computing (nodes in HPC are more tightly coupled)
- At massive scale several complex problems appear
- Extremely expensive setups
- Certain labs are a matter of national security (think Men in Black)
Let’s step back: why would anyone need such a FS?
3
So... how do we get there? The hardware
HPC is a combination of advanced hardware and specialized software
4
A Namesake for Remarkable Contributions
Perlmutter is the newest supercomputer at NERSC (Berkeley
Lab, California, US)
Named after Saul Perlmutter, Nobel Prize in Physics (2011)
for his 1998 discovery that the expansion of the universe is
accelerating.
He confirmed his observations by running thousands of
simulations at NERSC, and his research team is believed to
have been the first to use supercomputers to analyze and
validate observational data in cosmology.
5
6
The hardware (Perlmutter)
- Hardware is made of several racks of “blades”
- CPU, GPU and now FPGA-enhanced nodes
- Fast network interconnection
- On PM: Cray (HPE) Slingshot 11
- Single-digit µs latency (~1-2 µs, <10 µs under heavy load)
- Optimized for HPC: offload into silicon
- Mix of Ethernet and InfiniBand protocols over fiber
- InfiniBand cheaper for same performance
- Liquid cooled units (note the colored pipes)
- Requires maintenance (& downtime) to change liquid
- Fast and large file systems
- Different tiers, for different time-scales
7
Special tiles!
The software landscape
- Linux-only world (mainly Red Hat, some SUSE, few Ubuntu, some custom)
- https://top500.org/statistics/list/
- Parallel programming
- OpenMP for intra-node comm., Message Passing Interface (MPI) for inter-node comm.
- Fortran kingdom!
- And C; rarely C++. Python is gaining traction for data analysis and ML/AI steps
- Job schedulers to allocate resources to users
- Slurm (most popular), PBS, Torque, LSF, Moab, Grid Engine, etc
- User requests a certain “portion” of the cluster for their jobs
- Jobs are placed in a queue and wait for enough resources to start
- The scheduler prepares the environment, collects logs, wraps up when jobs are done
8
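The queue-then-start behaviour described above can be sketched as a toy discrete-time FIFO scheduler. This is an illustrative simplification of my own, not Slurm's actual logic (which adds priorities, backfill, fairshare, preemption, and much more):

```python
from collections import deque

def run_fifo(total_nodes, jobs):
    """Toy FIFO scheduler: each job is (name, nodes_needed, runtime).

    Returns a list of (name, start_time) from a discrete-time simulation.
    """
    assert all(nodes <= total_nodes for _, nodes, _ in jobs)
    queue = deque(jobs)
    running = []            # list of (end_time, nodes_held)
    free = total_nodes
    t = 0
    starts = []
    while queue or running:
        # release the nodes of jobs that finished by time t
        for end, nodes in [r for r in running if r[0] <= t]:
            running.remove((end, nodes))
            free += nodes
        # start jobs at the head of the queue while resources allow (strict FIFO)
        while queue and queue[0][1] <= free:
            name, nodes, runtime = queue.popleft()
            free -= nodes
            running.append((t + runtime, nodes))
            starts.append((name, t))
        t += 1
    return starts

# "b" needs the whole machine, so it waits for "a"; "c" waits behind "b" (FIFO)
print(run_fifo(4, [("a", 2, 3), ("b", 4, 2), ("c", 1, 1)]))
# → [('a', 0), ('b', 3), ('c', 5)]
```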
- 💿 Storage usage, I/O and data transfer
- Write the least to disk; write smartly (more on this shortly); avoid I/O bottlenecks
- 📐 Data locality
- Keep data as much as possible inside the node/rack
- ⚡ Power usage
- Servers use a lot of energy resources
- Perlmutter (US): 2.5 MW at full power – Fugaku (JP): 29.9 MW
- Equivalent to ~830 households at full power (3 kW limit in Italy)
- 🥶 Cooling
- Location of data center is important
- Berkeley Lab benefits from the always-cool temperature of the Bay Area (max ~19 °C year-round)
- Water is needed: can’t place DCs in deserts
Some of the challenges
9
What is I/O?
- Input/Output: everything that works with data and its storage
- At large scale you need multiple disks/drives and servers to store data
- Synchronization and consistency issues
- Two processes writing to a single file (strong or eventual consistency?)
- A process reading a file just written by another process (cache invalidation)
- A process writing to/reading from a file on a disk that crashed (fault tolerance)
- Duplicating files to increase aggregate read bandwidth
- Data locality: a temporary file may be written to a local FS rather than parallel FS
- Optimizing I/O is crucial
- CPUs work at the order of the ns (10⁻⁹ s); network/NVMe work at best at the µs (10⁻⁶ s)
- Reducing network phase improves overall compute walltime considerably
10
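One concrete way to reduce the network phase is to aggregate many small writes into one large one: on a parallel file system every `write()` pays network latency, so a single big call beats thousands of tiny ones. A toy illustration (my own sketch, using unbuffered local files to make each syscall visible):

```python
import tempfile

def write_small(path, records):
    """One write() syscall per record -- each would pay network latency on a parallel FS."""
    with open(path, "wb", buffering=0) as f:   # buffering=0: every write hits the OS
        for r in records:
            f.write(r)

def write_aggregated(path, records):
    """Aggregate the records in memory, then issue a single large write()."""
    with open(path, "wb", buffering=0) as f:
        f.write(b"".join(records))

# 1,000 records of 100 bytes: 1,000 syscalls vs. a single 100 KB one
records = [bytes([i % 256]) * 100 for i in range(1000)]
small = tempfile.NamedTemporaryFile(delete=False).name
agg = tempfile.NamedTemporaryFile(delete=False).name
write_small(small, records)
write_aggregated(agg, records)
assert open(small, "rb").read() == open(agg, "rb").read()   # same bytes on disk
```

The resulting files are byte-identical; only the number of round trips differs.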
Different file system scopes
The slower the drive, the higher the capacity
- Memory/NVMe drives are blazingly fast,
but they are expensive
- Scratch file systems should only be used
for temporary storage (they are purged often)
11
- Data used in the same month should be moved to HDD
- Archive data should be moved to tape (it’s like VHS!)
- Movement of data may be enforced or automatic (like
S3 → Glacier)
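An enforced tiering policy like the one above boils down to a rule keyed on data age. A minimal sketch, with hypothetical thresholds and tier names of my own (not NERSC's actual policy):

```python
def pick_tier(days_since_access):
    """Toy age-based tiering policy: hotter data lives on faster, smaller tiers."""
    if days_since_access <= 30:
        return "scratch (NVMe)"      # hot data, purged often
    if days_since_access <= 365:
        return "capacity (HDD)"      # warm data, used within the year
    return "archive (tape)"          # cold data

print(pick_tier(5))     # → scratch (NVMe)
print(pick_tier(1000))  # → archive (tape)
```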
- PM ships with the first all-flash file system in HPC
- 3,480 Samsung PM1733 PCIe NVMe drives (15.36 TB each)
- 3.5 GB/s seq. read, 3.2 GB/s seq. write speed by specifications
- 35 PB of usable POSIX storage (as in 'df -h')
- Directly integrated in the Slingshot compute network
- No need for LNet routers
Perlmutter scratch file system
12
- Enough to back up The Lord of the Rings trilogy 2.7M times
- Or 152k times for the extended cut in 4K Ultra HD
Perlmutter scratch file system
13
Metadata servers (MDS)
- Store the directory structure, file names,
inode locations inside OSS, etc
- Decide the file layout on OSSs (striping, etc)
- “Metadata” I/O, not bandwidth I/O
Object storage servers (OSS)
- Store chunks of data as binary
- Write 1 MiB stripes across OSSs (like a RAID-0)
On PM: 16 MDS, 274 OSS
Parallel and distributed FS: Lustre
14
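The round-robin striping above can be sketched as a pure-Python offset calculation: which OST holds a given file offset, and where inside that OST's object it lands. A simplification of my own (real Lustre layouts also support composite/PFL layouts, overstriping, etc.):

```python
STRIPE_SIZE = 1 << 20   # 1 MiB stripes, as mentioned above

def stripe_location(offset, stripe_count, stripe_size=STRIPE_SIZE):
    """Map a file offset to (OST index, offset within that OST's object).

    Stripes are laid out round-robin over stripe_count OSTs, like a RAID-0
    across servers.
    """
    stripe_no = offset // stripe_size            # which 1 MiB stripe of the file
    ost = stripe_no % stripe_count               # round-robin OST choice
    obj_offset = (stripe_no // stripe_count) * stripe_size + offset % stripe_size
    return ost, obj_offset

# With 4 OSTs: byte 0 is on OST 0; the second MiB starts on OST 1;
# byte 4 MiB + 5 wraps back to OST 0, one stripe deep into its object.
print(stripe_location(0, 4))                 # → (0, 0)
print(stripe_location(1 << 20, 4))           # → (1, 0)
print(stripe_location(4 * (1 << 20) + 5, 4)) # → (0, 1048581)
```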
Inside ClusterStor E1000
MDS/OSS unit in the rack: twin servers
- Single-socket AMD Rome (128x PCIe Gen4 lanes)
- Allows switchless design
- 48 lanes for 24x NVMes, 32 lanes for 2x NICs
- Each server responsible for 12 NVMe drives, can
take over the other half if needed
- GridRAID (HPE) + ldiskfs to maximize
performance
- OSS = 8+2+1 RAID6 (GridRAID)
- MDS = 11-way RAID10 (mdraid)
15
Common HPC software used
16
Several tools are available to ease coding for HPC; they are often intertwined
MPI is the bread and butter for multi-node communication
MPI-IO is its I/O layer, which helps manage files and transfer data
- File preallocation, offset management, etc
HDF5 uses MPI/MPI-IO to perform parallel I/O
NetCDF uses HDF5 as its storage format
IOR: benchmarking tool capable of generating synthetic I/O like HPC applications
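In the same spirit as IOR, here is a single-process toy analogue of a sequential-write run (my own sketch, no MPI; real IOR coordinates many clients and exposes far more knobs, such as transfer size, fsync behaviour, and file-per-process vs. shared-file modes):

```python
import os
import tempfile
import time

def toy_seq_write_bench(path, block_size=1 << 20, blocks=16):
    """Write `blocks` sequential blocks of `block_size` bytes and measure throughput.

    Returns (bytes_written, GB/s). fsync ensures the bytes actually reach
    the device before the clock stops, as IOR can be told to do.
    """
    data = os.urandom(block_size)
    t0 = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - t0
    nbytes = block_size * blocks
    return nbytes, nbytes / elapsed / 1e9

path = tempfile.NamedTemporaryFile(delete=False).name
nbytes, gbps = toy_seq_write_bench(path)
print(f"wrote {nbytes} bytes at {gbps:.2f} GB/s")
```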
Perlmutter: excellent performance end-to-end
17
[Diagram: bandwidth and IOPS measured at increasing "software distance" from the drives]
- 41 GB/s read / 27 GB/s write; 1,400 kIOPS read / 29 kIOPS write
- 48 GB/s read / 42 GB/s write
- 43 GB/s read / 31 GB/s write
- 42 GB/s read / 38 GB/s write
- 9,600 kIOPS read / 1,600 kIOPS write
IOR achieves:
- 88.4% (write) / 97.2% (read) of NVMe block bandwidth (8+2 RAID on writes)
- 5.33% (write) / 15.1% (read) of NVMe block IOPS (RAID-6 read-modify-write penalty)
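The read-modify-write penalty behind the low write-IOPS figure can be quantified with a back-of-the-envelope model (an approximation of my own; actual GridRAID behaviour is more nuanced): updating a chunk smaller than a full stripe on RAID-6 means reading the old data plus both parities, then writing the new data plus both recomputed parities.

```python
def rmw_ops_per_small_write(parity_drives=2):
    """Drive operations for one sub-stripe write on RAID-6 (P+Q parity):
    read old data + old parities, then write new data + new parities."""
    reads = 1 + parity_drives
    writes = 1 + parity_drives
    return reads + writes

print(rmw_ops_per_small_write())  # → 6 drive ops for a single logical write
```

That ~6x I/O amplification (avoided entirely on full-stripe sequential writes) is why small-write IOPS sit so far below the raw NVMe numbers while bandwidth stays near 90%.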
Metadata performance of Perlmutter
Using IOR in a “production” run
- 230 clients x 6 procs/client = 1380 procs
- 1.6 M files/s created
In a “full-scale” run
- 1382 clients x 2 procs/client = 2764 procs
- 1.3 M files/s deleted
A much smoother user experience than the previous HDD-based Cori file system
18
Some surprises found during performance evaluation
SSDs slow down with age
- Like “HDD fragmentation”
- ~10% lower write bandwidth after 5x capacity
written to an OST
- An fstrim is enough to fix it
- 5x OST size: 665 TB
- 2.2-2.9 PB daily expected writes
- 5x writes ≈ 60-80 days
- The longer you wait, the longer fstrim takes
- Performed nightly to keep performance up
19
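The 60-80 day window above falls out of simple arithmetic, assuming the 5x figure is taken relative to the 35 PB usable capacity and the expected 2.2-2.9 PB/day write rate:

```python
# Rough arithmetic behind the "5x writes ≈ 60-80 days" estimate
CAPACITY_PB = 35
DAILY_WRITES_PB = (2.2, 2.9)   # expected range of daily writes

days = [5 * CAPACITY_PB / d for d in DAILY_WRITES_PB]
print(f"5x capacity written in {min(days):.0f}-{max(days):.0f} days")
# → 5x capacity written in 60-80 days
```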
Thanks! Questions?
By the way, I use arch
20
PS: Seqera is hiring! seqera.io/careers
This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contract
DE-AC02-05CH11231. This research used resources and data generated from resources of the National Energy Research Scientific
Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under
Contract No. DE-AC02-05CH11231.

 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 

Architecting a 35 PB distributed parallel file system for science

  • 1. Architecting a 35 PB distributed parallel file system for science (formerly) Storage and I/O Software Engineer at NERSC, Berkeley Lab, US (currently) HPC DevOps Engineer at Seqera Labs, Barcelona Speck&Tech #53 Trento - May 29, 2023 Alberto Chiusole
  • 2. - (2014-2017) BSc in Information and Business Organization Eng. - U. of Trento - 5 months exchange student at Technical University of Denmark, Copenhagen - (2017-2019) MSc in Data Science and Scientific Computing - U. of Trieste - (2017-2020) HPC Sysadmin and Scientific software developer - eXact Lab, Trieste - 3 months at CERN, Geneva, to work on Master’s thesis - Comparison between CephFS at CERN and Lustre FS at eXact lab - Presented at ISC High Performance in Frankfurt, July 2019 - (2020-2022) Storage and I/O Software Engineer - NERSC, Berkeley Lab, Cal., US - Worked on Perlmutter and its Lustre FS, the first all-flash 35 PB parallel FS - (2023 - now) HPC DevOps Engineer - Seqera Labs (remote) https://www.linkedin.com/in/albertochiusole/ https://bit.ly/Alberto-Chiusole-Scholar 2 How I ended up working on Supercomputers
  • 3. High Performance Computing (HPC) empowers breakthroughs - Supercomputers run parallel applications to solve complex problems - Applications come from all kinds of science - astrophysics, nuclear physics, molecular design, computational fluid dynamics, nuclear warhead status simulation, climate and weather forecasts, COVID vaccines (!), to name a few - Different from grid computing (nodes in HPC are more tightly coupled) - At massive scale, several complex problems appear - Extremely expensive setups - Certain labs are a matter of national security (think Men in Black) Let’s step back: why would anyone need such a FS? 3
  • 4. So… how do we get there? The hardware HPC is a combination of advanced hardware and specialized software 4
  • 5. A Namesake for Remarkable Contributions Perlmutter is the newest supercomputer at NERSC (Berkeley Lab, California, US) Named after Saul Perlmutter, Nobel prize in Physics (2011) for his 1998 discovery that the expansion of the universe is accelerating. He confirmed his observations by running thousands of simulations at NERSC, and his research team is believed to have been the first to use supercomputers to analyze and validate observational data in cosmology. 5
  • 7. The hardware (Perlmutter) - Hardware is made of several racks of “blades” - CPU, GPU and now FPGA-enhanced nodes - Fast network interconnection - On PM: Cray (HPE) Slingshot 11 - Single-digit µs latency (~1-2 µs, <10 µs under heavy load) - Optimized for HPC: offload into silicon - Mix of Ethernet and InfiniBand protocols over fiber - InfiniBand cheaper for the same performance - Liquid-cooled units (note the colored pipes) - Requires maintenance (& downtime) to change the liquid - Fast and large file systems - Different tiers, for different time-scales 7 Special tiles!
  • 8. The software landscape - Linux-only world (mainly Red Hat, some SUSE, few Ubuntu, some custom) - https://top500.org/statistics/list/ - Parallel programming - OpenMP for intra-node comm., Message Passing Interface (MPI) for extra-node comm. - Fortran kingdom! - And C.. rarely C++. Python is gaining traction for data analysis and ML/AI steps - Job schedulers to allocate resources to users - Slurm (most popular), PBS, Torque, LSF, Moab, Grid Engine, etc - User requests a certain “portion” of the cluster for their jobs - Jobs are placed in a queue and wait for enough resources to start - The scheduler prepares the environment, collects logs, wraps up when jobs are done 8
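As an illustration of the scheduler workflow above, a minimal Slurm batch script might look like the sketch below (node counts, QOS name, and program name are made-up placeholders, not from the slides):

```shell
#!/bin/bash
#SBATCH --job-name=demo          # job name shown in the queue
#SBATCH --nodes=4                # request a "portion" of the cluster: 4 nodes
#SBATCH --ntasks-per-node=32     # 32 MPI ranks per node
#SBATCH --time=00:30:00          # wall-time limit; job is killed past this
#SBATCH --qos=regular            # queue/QOS name is site-specific

# the scheduler prepares the environment, then srun launches
# the MPI program across all allocated ranks
srun ./my_simulation input.nml   # hypothetical program and input file
```

The job waits in the queue until enough resources free up; stdout/stderr are collected to a log file by default.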
  • 9. - 💿 Storage usage, I/O and data transfer - Write the least to disk; write smartly (we will see how soon); avoid I/O bottlenecks - 📐 Data locality - Keep data as much as possible inside the node/rack - ⚡ Power usage - Servers use a lot of energy - Perlmutter (US): 2.5 MW at full power – Fugaku (JP): 29.9 MW - Equivalent to ~830 households at max power (3 kW in Italy) - 🥶 Cooling - Location of the data center is important - Berkeley Lab benefits from the always-cool temperature of the Bay Area (~19 °C max, year-round) - Water is needed: can’t place DCs in deserts Some of the challenges 9
  • 10. What is I/O? - Input/Output: everything that works with data and its storage - At large scale you need multiple disks/drives and servers to store data - Synchronization and consistency issues - Two processes writing to a single file (strong or eventual consistency?) - A process reading a file just written by another process (cache invalidation) - A process writing to/reading from a file on a disk that crashed (fault tolerance) - Duplicating files to increase aggregate read bandwidth - Data locality: a temporary file may be written to a local FS rather than the parallel FS - Optimizing I/O is crucial - CPUs work on the order of ns (10⁻⁹ s); network/NVMe work at best at µs (10⁻⁶ s) - Reducing the network phase improves overall compute walltime considerably 10
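A toy sketch of the “write smartly” idea: batching many tiny writes into one buffered write drastically reduces syscall overhead (file names and sizes here are illustrative):

```python
import io
import os
import tempfile

def write_unbuffered(path, records):
    # one os.write() syscall per tiny record: slow at scale
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        for r in records:
            os.write(fd, r)
    finally:
        os.close(fd)

def write_buffered(path, records):
    # accumulate in memory, flush once: far fewer syscalls
    buf = io.BytesIO()
    for r in records:
        buf.write(r)
    with open(path, "wb") as f:
        f.write(buf.getvalue())

records = [b"x" * 64 for _ in range(1000)]  # 1000 records of 64 bytes
a = os.path.join(tempfile.mkdtemp(), "a.bin")
b = os.path.join(tempfile.mkdtemp(), "b.bin")
write_unbuffered(a, records)
write_buffered(b, records)
# both produce identical 64 kB files; the buffered path issues ~1 write syscall
assert open(a, "rb").read() == open(b, "rb").read()
```

On a parallel FS every syscall may turn into network round-trips, so the gap between the two approaches grows even larger than on a local disk.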
  • 11. Different file system scopes The slower the drive, the higher the capacity - Memory/NVMe drives are blazingly fast, but they are expensive - Scratch file systems should only be used for temporary storage (they are purged often) - Data used within the same month should be moved to HDD - Archive data should be moved to tape (it’s like VHS!) - Movement of data may be enforced or automatic (like S3 → Glacier) 11
  • 13. - PM ships with the first all-flash file system in HPC - 3,480 Samsung PM1733 PCIe NVMe drives (15.36 TB each) - 3.5 GB/s seq. read, 3.2 GB/s seq. write, per drive specifications - 35 PB of usable POSIX storage (as in 'df -h') - Directly integrated in the Slingshot compute network - No need for LNet routers - Enough to back up The Lord of The Rings trilogy 2.7M times - Or 152k times for the extended cut in 4K Ultra HD Perlmutter scratch file system 13
  • 14. Metadata servers (MDS) - Store the directory structure, file names, locations of file objects on the OSSs, etc. - Decide the file layout on the OSSs (striping, etc.) - “Metadata” I/O, not bandwidth I/O Object storage servers (OSS) - Store chunks of data as binary objects - Write 1 MiB stripes across OSSs (like a RAID-0) On PM: 16 MDS, 274 OSS Parallel and distributed FS: Lustre 14
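The RAID-0-like striping can be sketched as modular arithmetic: with round-robin striping, the server holding a given byte is determined by its offset alone (the stripe count below is illustrative, not PM's default layout):

```python
STRIPE_SIZE = 1 << 20   # 1 MiB stripes, as on the slide
STRIPE_COUNT = 4        # illustrative stripe count

def target_for_offset(offset, stripe_size=STRIPE_SIZE, stripe_count=STRIPE_COUNT):
    """Return which object storage target (0..stripe_count-1) holds the byte
    at `offset` under a simple round-robin (RAID-0-like) striping layout."""
    return (offset // stripe_size) % stripe_count

# the first 4 MiB land on targets 0,1,2,3; the 5th MiB wraps back to target 0
assert [target_for_offset(i << 20) for i in range(5)] == [0, 1, 2, 3, 0]
```

This is why wide striping multiplies bandwidth: consecutive 1 MiB chunks stream from different servers in parallel.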
  • 15. Inside ClusterStor E1000 MDS/OSS unit in the rack: twin servers - Single-socket AMD Rome (128x PCIe Gen4 lanes) - Allows switchless design - 48 lanes for 24x NVMes, 32 lanes for 2x NICs - Each server responsible for 12 NVMe drives, can take over the other half if needed - GridRAID (HPE) + ldiskfs to maximize performance - OSS = 8+2+1 RAID6 (GridRAID) - MDS = 11-way RAID10 (mdraid) 15
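As a back-of-the-envelope sanity check (my arithmetic, not an official figure), the server and drive counts above reproduce the headline numbers: 290 servers × 12 drives = 3,480 drives, and keeping 8 data drives out of each 8+2+1 OST group yields roughly 37 PB usable from ~50 PB raw, close to the 35 PB that 'df -h' reports after filesystem overhead:

```python
DRIVE_TB = 15.36                 # Samsung PM1733 capacity, from the slides
OSS, MDS = 274, 16               # server counts from the slides
DRIVES_PER_SERVER = 12           # each server is responsible for 12 NVMe drives

total_drives = (OSS + MDS) * DRIVES_PER_SERVER
assert total_drives == 3480      # matches the slide's drive count exactly

raw_ost_pb = OSS * DRIVES_PER_SERVER * DRIVE_TB / 1000
usable_pb = raw_ost_pb * 8 / 11  # 8 data drives out of 8 + 2 parity + 1 spare
print(f"raw OST capacity: {raw_ost_pb:.1f} PB, usable ≈ {usable_pb:.1f} PB")
```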
  • 16. Common HPC software used 16 Several tools are available to ease coding for HPC; they are often intertwined MPI is the bread and butter of multi-node communication MPI-IO is its I/O layer, which helps manage files and transfer data - File preallocation, offset management, etc. HDF5 uses MPI/MPI-IO to perform parallel I/O NetCDF uses HDF5 as its storage format IOR: a benchmarking tool capable of generating synthetic I/O like HPC applications do
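Conceptually, MPI-IO's offset management lets each rank write its block at rank × block_size into one shared file with no coordination; a single-process Python sketch of that arithmetic (not real MPI — with mpi4py the write itself would be along the lines of MPI.File.Write_at):

```python
import os
import tempfile

BLOCK = 16     # bytes per "rank" (tiny, for illustration)
NRANKS = 4     # pretend we have 4 MPI ranks

path = os.path.join(tempfile.mkdtemp(), "shared.dat")
fd = os.open(path, os.O_RDWR | os.O_CREAT)
for rank in range(NRANKS):
    payload = bytes([ord("A") + rank]) * BLOCK
    # each rank writes at its own offset: no locking or coordination needed,
    # which is exactly what makes shared-file parallel I/O scale
    os.pwrite(fd, payload, rank * BLOCK)
os.close(fd)

data = open(path, "rb").read()
assert data == b"A" * 16 + b"B" * 16 + b"C" * 16 + b"D" * 16
```

In real MPI-IO the library also coordinates collective buffering, so many small per-rank writes can be aggregated into the large aligned stripes the parallel FS prefers.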
  • 17. Perlmutter: excellent performance end-to-end [Figure: bandwidth and IOPS measured at increasing “software distance” from the drives — 9,600 kIOPS read / 1,600 kIOPS write; 48 GB/s read / 42 GB/s write; 43 GB/s read / 31 GB/s write; 42 GB/s read / 38 GB/s write; 41 GB/s read / 27 GB/s write; 1,400 kIOPS read / 29 kIOPS write] IOR reaches 88.4% (w) / 97.2% (r) of NVMe block bandwidth (8+2 RAID on writes) and 5.33% (w) / 15.1% (r) of NVMe block IOPS (read-modify-write penalty of RAID6) 17
  • 18. Metadata performance of Perlmutter Using IOR in a “production” run - 230 clients x 6 procs/client = 1,380 procs - 1.6 M files/s created In a “full-scale” run - 1,382 clients x 2 procs/client = 2,764 procs - 1.3 M files/s deleted A much smoother user experience than the previous HDD-based Cori file system 18
  • 19. Some surprises found during performance evaluation SSDs slow down with age - Like “HDD fragmentation” - -10% write bandwidth after 5x capacity written to an OST - An fstrim is enough to fix it - 5x OST size: 665 TB - 2.2-2.9 PB daily expected writes - 5x writes ≈ 60-80 days - The longer you wait, the longer fstrim takes - Performed nightly to keep performance up 19
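The 60-80-day window follows directly from the slide's numbers (simple arithmetic over the full 35 PB of capacity):

```python
CAPACITY_PB = 35                 # usable capacity from the slides
DAILY_WRITES_PB = (2.2, 2.9)     # expected daily write range from the slides

# days until 5x the capacity has been written, i.e. when the
# "fragmentation-like" write-bandwidth drop would appear without fstrim
days = [5 * CAPACITY_PB / rate for rate in DAILY_WRITES_PB]
# → roughly 80 days at the low rate, 60 days at the high rate
assert round(days[0]) == 80 and round(days[1]) == 60
```

A nightly fstrim keeps each pass cheap, since the cost of trimming grows with the amount of stale flash blocks accumulated.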
  • 20. Thanks! Questions? By the way, I use arch 20 PS: Seqera is hiring! seqera.io/careers This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-05CH11231. This research used resources and data generated from resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.