Running a 51k GPU burst for
Multi-Messenger Astrophysics
with IceCube across all available
GPUs in the Cloud
Frank Würthwein - OSG Executive Director
Igor Sfiligoi - Lead Scientific Researcher
UCSD/SDSC
Jensen Huang keynote @ SC19
The Largest Cloud Simulation in History
50k NVIDIA GPUs in the Cloud
350 Petaflops for 2 hours
Distributed across US, Europe & Asia
Saturday morning before SC19 we bought all GPU capacity that was for sale in
Amazon Web Services, Microsoft Azure, and Google Cloud Platform worldwide
How did we get here?
Annual IceCube GPU use via OSG
Peaked at ~3,000 GPUs for a day (last 12 months).
OSG supports global operations of IceCube.
IceCube made a long-term investment into
dHTC as their computing paradigm.
We produced ~3% of the annual photon
propagation simulations in a ~2h cloud burst.
The long-term partnership between
IceCube, OSG, HTCondor, … led to this cloud burst.
The Science Case
IceCube
A cubic kilometer of ice at the
South Pole is instrumented
with 5,160 optical sensors.
Astrophysics:
• Discovery of astrophysical neutrinos
• First evidence of neutrino point source (TXS)
• Cosmic rays with surface detector
Particle Physics:
• Atmospheric neutrino oscillation
• Neutrino cross sections at TeV scale
• New physics searches at highest energies
Earth Science:
• Glaciology
• Earth tomography
A facility with very
diverse science goals.
This talk is restricted to
high-energy astrophysics.
High Energy Astrophysics
Science case for IceCube
The universe is opaque to light
at the highest energies and
distances.
Only gravitational waves
and neutrinos can pinpoint
the most violent events in
the universe.
Fortunately, highest energy
neutrinos are of cosmic origin.
Effectively “background free” as long
as energy is measured correctly.
High energy neutrinos from
outside the solar system
First 28 very high energy neutrinos from outside the solar system
Red curve is the photon flux
spectrum measured with the
Fermi satellite.
Black points show the
corresponding high energy
neutrino flux spectrum
measured by IceCube.
This demonstrates both the opaqueness of the universe to high-energy
photons, and the ability of IceCube to detect neutrinos at energies above
those at which light can still reach us through this opaqueness.
Science 342 (2013). DOI:
10.1126/science.1242856
Understanding the Origin
We now know high energy events happen in the universe. What are they?
p + γ → Δ⁺ → p + π⁰, with π⁰ → γ γ
p + γ → Δ⁺ → n + π⁺, with π⁺ → μ⁺ + ν_μ
Credit: Aya Ishihara
The hypothesis:
The same cosmic events produce
neutrinos and photons
We detect the electrons or muons from neutrinos that interact in the ice.
Neutrinos interact very weakly => we need a very large instrumented array of ice
to maximize the chance that a cosmic neutrino interacts inside the detector.
We need pointing accuracy to point back to the origin of the neutrino.
Telescopes the world over then try to identify the source in the direction
IceCube points to for the neutrino.
Multi-messenger Astrophysics
The ν detection challenge
[Slide excerpt from Aya Ishihara on the optical properties of the ice; text garbled in extraction. Gist: ice models combine all available information and are continually refined, but never reach perfect agreement with nature.]
Ice properties change with
depth and wavelength
Observed pointing resolution at high
energies is systematics limited.
Central value moves
for different ice models
Improved e and τ reconstruction
⇒ increased neutrino flux detection
⇒ more observations
Photon propagation through
ice runs efficiently on single-precision
GPUs (toy sketch below).
Detailed simulation campaigns
to improve pointing resolution
by improving ice model.
Improvement in reconstruction with
better ice model near the detectors
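To make concrete why this workload suits single-precision GPUs, here is a toy sketch in Python/NumPy: every photon is independent, so millions can be stepped in parallel, and fp32 accuracy suffices. The real IceCube propagators (ppc, clsim) model the ice in far more detail; the scattering length and step count below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000                                   # photons, all independent

# Isotropic initial directions, single precision throughout.
pos = np.zeros((n, 3), dtype=np.float32)
dirs = rng.normal(size=(n, 3)).astype(np.float32)
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

scatter_len = np.float32(25.0)                  # metres; illustrative only

for _ in range(100):                            # propagation steps
    # Distance to the next scatter, exponentially distributed.
    step = rng.exponential(scatter_len, n).astype(np.float32)
    pos += dirs * step[:, None]
    # Toy isotropic re-scatter (real ice scattering is strongly forward-peaked).
    dirs = rng.normal(size=(n, 3)).astype(np.float32)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
```

Because each photon evolves independently, the same loop maps directly onto one GPU thread per photon.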
First evidence of an origin
First location of a source of very high energy neutrinos.
A neutrino produced a high-energy muon
near IceCube. The muon produced light as it
traversed the IceCube volume. The light was
detected by IceCube's array of phototubes.
IceCube alerted the astronomy community to the
observation of a single high-energy neutrino on
September 22, 2017.
A blazar designated by astronomers as TXS
0506+056 was subsequently identified as the most likely
source in the direction IceCube was pointing. Multiple
telescopes saw light from TXS at the same time
IceCube saw the neutrino.
Science 361, 147-151
(2018). DOI:10.1126/science.aat2890
IceCube’s Future Plans
(Slide credit: IceCube Upgrade and Gen2, Summer Blot, TeVPA 2018)
The IceCube-Gen2 Facility: MeV- to EeV-scale physics.
[Timeline graphic (preliminary): IC86 running today; the IceCube Upgrade moving through R&D, design & approval, construction, and deployment; Gen2 components (surface air-shower array, high-energy array, radio array, PINGU) on a timeline spanning 2016 through ~2032.]
Near term:
add more phototubes to the deep core to increase the granularity of measurements.
Longer term:
• Extend the instrumented volume at smaller granularity.
• Extend the even-smaller-granularity deep-core volume.
• Add a surface array.
Improve the detector for both low- and high-energy neutrinos.
Details on the Cloud Burst
The Idea
• Integrate all GPUs available for sale
worldwide into a single HTCondor pool.
- use 28 regions across AWS, Azure, and Google
Cloud for a burst of a couple hours, or so.
• IceCube submits their photon propagation
workflow to this HTCondor pool (a minimal submission sketch follows below).
- we handle the input, the jobs on the GPUs, and
the output as a single globally distributed system.
Run a GPU burst relevant in scale
for future Exascale HPC systems.
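As a flavor of what such a submission looks like, here is a minimal, hedged sketch using the HTCondor Python bindings; the executable name, resource requests, and input URLs are hypothetical, not IceCube's actual workflow.

```python
import htcondor

# Hypothetical job description: one GPU per photon-propagation job.
sub = htcondor.Submit({
    "executable": "run_photon_prop.sh",      # hypothetical wrapper script
    "arguments": "$(input_url)",
    "request_gpus": "1",
    "request_cpus": "1",
    "request_memory": "4GB",
    "output": "job_$(Cluster)_$(Process).out",
    "error":  "job_$(Cluster)_$(Process).err",
    "log":    "burst.log",
})

# One job per pre-staged input file (URLs are placeholders).
inputs = [{"input_url": f"s3://bucket/file{i:05d}.i3.zst"} for i in range(1000)]
schedd = htcondor.Schedd()
result = schedd.submit(sub, itemdata=iter(inputs))
print("submitted cluster", result.cluster())
```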
A global HTCondor pool
• IceCube, like all OSG user communities, relies on
HTCondor for resource orchestration
- This demo used the standard tools
• Dedicated HW setup
- Avoid disruption of OSG production system
- Optimize HTCondor setup for the spiky nature of the demo
§ multiple schedds for IceCube to submit to
§ collecting resources in each cloud region, then collecting from all
regions into global pool
HTCondor Distributed CI
[Diagram: one global resource pool. IceCube submits through 10 schedds; a collector in each cloud region gathers that region's VMs, and the regional collectors report to a central collector with a single negotiator. A monitoring sketch follows below.]
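A hedged monitoring sketch of the two-level collector idea: query each regional collector for advertised GPUs and sum them into a global view. The hostnames are placeholders, and the "TotalGpus" attribute name depends on pool configuration.

```python
import htcondor

# Placeholder hostnames: one collector per cloud region.
REGION_COLLECTORS = [
    "collector.us-east.example.org",
    "collector.eu-west.example.org",
]

def region_gpus(host: str) -> int:
    """Count GPUs advertised by startds reporting to one regional collector."""
    coll = htcondor.Collector(host)
    ads = coll.query(
        htcondor.AdTypes.Startd,
        constraint="TotalGpus > 0",          # attribute name is config-dependent
        projection=["Machine", "TotalGpus"],
    )
    return sum(int(ad.get("TotalGpus", 0)) for ad in ads)

print("GPUs visible:", sum(region_gpus(h) for h in REGION_COLLECTORS))
```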
Using native Cloud storage
• Input data pre-staged into native Cloud storage
- Each file in one-to-few Cloud regions
§ some replication to deal with limited predictability of resources per region
- Local to Compute for large regions for maximum throughput
- Reading from “close” region for smaller ones to minimize ops
• Output staged back to region-local Cloud storage
• Deployed simple wrappers around Cloud-native file
transfer tools (a minimal sketch follows below)
- IceCube jobs do not need to be customized for different Clouds
- They just need to know where input data is available
(pretty standard OSG operation mode)
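The slides do not show the wrappers themselves; here is a minimal sketch of the idea, assuming tool choice by URL scheme. The vendor CLIs (aws, gsutil, azcopy) are real; everything else, including the "az" scheme shorthand, is illustrative.

```python
import subprocess
from urllib.parse import urlparse

# Map URL scheme -> cloud-native transfer command.
TOOLS = {
    "s3": lambda src, dst: ["aws", "s3", "cp", src, dst],
    "gs": lambda src, dst: ["gsutil", "cp", src, dst],
    "az": lambda src, dst: ["azcopy", "copy", src, dst],  # placeholder scheme for Azure blobs
}

def transfer(src: str, dst: str) -> None:
    """Copy a file to/from cloud storage with the matching vendor tool."""
    scheme = urlparse(src).scheme or urlparse(dst).scheme
    if scheme not in TOOLS:
        raise ValueError(f"no transfer tool for scheme {scheme!r}")
    subprocess.run(TOOLS[scheme](src, dst), check=True)

# transfer("s3://inputs/file00001.i3.zst", "/scratch/in.i3.zst")    # fetch input
# transfer("/scratch/out.i3.zst", "s3://outputs/file00001.i3.zst")  # stage back
```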
The Testing Ahead of Time
In-Cloud storage showed great scalability.
Exceeded 1 Tbps.
Networking inside a Cloud region not a concern.
[Chart: aggregate storage throughput (Gbps, scale to 2,000) vs. number of compute instances (up to ~1,400), with separate curves for AWS, Azure, and GCP.]
Probably can scale much higher.
The Testing Ahead of Time
Cloud to on-prem networking in US good but not great.
At least compared to the in-Cloud networking.
[Map: measured throughput between on-prem sites (blue dots) and cloud regions (edges): 60 Gbps to US East; 15-35 Gbps to US West; several links at 20 Gbps.]
One of the reasons for pre-staging the inputs.
The Testing Ahead of Time
Cloud to on-prem international networking so-so.
[Map: measured throughput between on-prem sites (blue dots) in Europe, Korea, and Australia and cloud regions (edges): ranges of 14-24, 7-32, 6-44, and 1-18 Gbps, plus an 11 Gbps link.]
Not knowing how much we would get from where,
we decided to play it safe.
The Testing Ahead of Time
~250,000 single threaded jobs
run across 28 cloud regions
during 80 minutes.
Peaked at 90,000
jobs running;
up to 60k jobs started in ~10 min.
Regions across US, EU, and
Asia were used in this test.
Demonstrated burst capability
of our infrastructure on CPUs.
Want scale of GPU burst to be limited
only by # of GPUs available for sale.
Science with 51,000 GPUs
achieved at peak performance
Time in Minutes
Each color is a different
cloud region in US, EU, or Asia.
Total of 28 Regions in use.
Peaked at 51,500 GPUs
~380 Petaflops of fp32
8 generations of NVIDIA GPUs used.
Summary of stats at peak
A Heterogeneous Resource Pool
28 cloud regions across 4 world regions
provided us with 8 GPU generations.
No one region or GPU type dominates!
Science Produced
Distributed High-Throughput
Computing (dHTC) paradigm
implemented via HTCondor provides
global resource aggregation.
The largest cloud region provided 10.8% of the total.
dHTC paradigm can aggregate
on-prem anywhere
HPC at any scale
and multiple clouds
Performance and Cost Analysis
Performance vs GPU type
42% of the science was done on V100 in 19% of the wall time.
IceCube Performance/$$$
GPU                           V100     P100     T4       M60      RTX 2080 Ti  GTX 1080
Relative TFLOP32              100%     67%      57%      34%*     82%          62%
Relative IceCube performance  100%     56%      48%      30%*     70%          48%
Relative Science/$**          1.1-1.4  1.1-1.3  3.0-4.5  1.0-1.3  N/A          N/A

*per CUDA device   **spot market prices; ranges indicate cost differences between cloud vendors
IceCube performance scales better than TFLOP32 for high-end GPUs.
Science/$$$ is ~3x better for the T4 than for the V100 (see the sketch below).
Price differential between vendors: ~10-30% on most GPUs.
Price differential on-demand vs. spot: ~3x.
Aside: Science/$$$ for an on-prem 2080 Ti is ~3x better than for a V100.
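The Science/$ rows divide relative IceCube performance by price. A hedged re-derivation sketch, using the relative performance numbers from the table and *placeholder* spot prices (actual spot prices vary by vendor and over time, hence the ranges in the table):

```python
# Relative IceCube performance from the table above (V100 = 1.0).
relative_perf = {"V100": 1.00, "P100": 0.56, "T4": 0.48, "M60": 0.30}

# Placeholder spot prices in $/hour; NOT the prices behind the table.
spot_usd_hr = {"V100": 0.90, "P100": 0.45, "T4": 0.12, "M60": 0.30}

v100 = relative_perf["V100"] / spot_usd_hr["V100"]
for gpu, perf in relative_perf.items():
    rel = (perf / spot_usd_hr[gpu]) / v100
    print(f"{gpu}: Science/$ relative to V100 = {rel:.1f}")
```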
IceCube and dHTC
dHTC = distributed High Throughput Computing
IceCube Input Segmentable
IceCube prepared two types of input files that differed
by a factor of 10 in the number of input events per file.
Small files were processed by K80 and K520 GPUs, large files by all other GPU types.
[Histograms: per-job runtimes in seconds for the two file types.]
A total of 10.2 billion events were processed across ~175,000 GPU jobs.
Each job fetched a file from cloud storage to local storage, processed that file, and wrote
the output to cloud storage. For ¼ of the regions, cloud storage was not local to the
region. => Given the excellent networking between regions for each provider, we could
probably have avoided replicating data across regions. A sketch of the file-sizing logic follows below.
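A hedged sketch of that file-sizing logic: pick events per file so one job lands in the 0.5-to-few-hour sweet spot for a given GPU. The per-event timings are hypothetical; only the 10x ratio between file types comes from the slide.

```python
def events_per_file(sec_per_event: float, target_hours: float = 1.0) -> int:
    """Events per input file so one job runs ~target_hours on this GPU."""
    return max(1, round(target_hours * 3600 / sec_per_event))

# Hypothetical timings: an old K80 ~10x slower per event than a modern GPU.
print(events_per_file(0.025))   # fast GPU, large files:  ~144,000 events
print(events_per_file(0.25))    # slow GPU, small files:   ~14,400 events (10x fewer)
```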
Applicability beyond IceCube
• All the large instruments we know of
- LHC, LIGO, DUNE, LSST, …
• Any midscale instrument we can think of
- XENON, GlueX, Clas12, Nova, DES, Cryo-EM, …
• A large fraction of Deep Learning
- But not all of it …
• Basically, anything that has bundles of
independently schedulable jobs that can be
partitioned to adjust workloads to have 0.5- to
few-hour runtimes on modern GPUs.
Cost to support cloud as a
“24x7” capability
• Today, roughly $15k per 300 PFLOP32-hour (worked example below)
• This burst was executed by 2 people
- Igor Sfiligoi (SDSC) to support the infrastructure.
- David Schultz (UW Madison) to create and submit the
IceCube workflows.
§ “David” type person is needed also for on-prem science workflows.
• To make this a routine operations capability for any
open science that is dHTC capable would require
another 50% FTE “Cloud Budget Manager”.
- There is substantial effort involved in just dealing with cost &
budgets for a large community of scientists.
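Applying the $15k-per-300-PFLOP32-hour figure to this burst gives a rough estimate (rough because the pool was only at its ~380 PFLOP32 peak briefly):

```python
usd_per_pflop32_hour = 15_000 / 300        # = $50 per PFLOP32-hour
peak_pflop32 = 380                         # fp32 capacity at peak
burst_hours = 2
cost = usd_per_pflop32_hour * peak_pflop32 * burst_hours
print(f"~${cost:,.0f}")                    # ~= $38,000, same order as the "~$50k" quoted later
```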
IceCube is ready for Exascale
• Humanity has built extraordinary instruments by pooling
human and financial resources globally.
• The computing for these large collaborations fits perfectly into
the cloud, or into scheduling holes in Exascale HPC systems, due
to its "ingeniously parallel" nature. => dHTC
• The dHTC computing paradigm applies to a wide range of
problems across all of open science.
- We are happy to repeat this with anybody willing to spend $50k in the
clouds.
Contact us at: help@opensciencegrid.org
Or me personally at: fkw@ucsd.edu
Demonstrated elastic burst at 51,500 GPUs
IceCube is ready for Exascale
Acknowledgements
• This work was partially supported by the
NSF grants OAC-1941481, MPS-1148698,
OAC-1841530, OAC-1904444, and OAC-1826967.