Data Automation at Light Sources
Ian Foster
Argonne National Laboratory & The University of Chicago
Advanced Photon Source
Argonne Leadership
Computing Facility
1 km
5 μsec
von Laszewski et al., Real-time analysis, visualization, and steering of microtomography experiments at photon sources, SIAM Parallel Processing, 1999
I have been working with light sources for some time!
“the data rates and compute power required ... are prodigious, easily reaching one gigabit per second and a teraflop per second [respectively]”
Ptychography: Use GPU cluster for 360x speedup, from 7 hours to 72 s
[Deng, Vine, Chen, Nashed, Philips, Jin, Peterka, Ross, Jacobsen]
→ Enable online analysis and use of fly scans
Microtomography: Use 32K Mira BG/Q nodes to reduce reconstruction time from days to 2 mins
[Bicer, Gursoy, Kettimuthu, De Carlo, Agrawal]
→ Identify and correct experimental misconfiguration
High-energy diffraction microscopy: 10K BG/Q nodes to reconstruct in 10 minutes
[Sharma, Almer, Wozniak, Wilde, Foster]
→ Zoom in on crack locations (switch far field → near field)
Coherence
Brightness
High Energy
Micrometer porosity structure of shale samples
Microstructure of a copper wire, 0.2mm diameter
Work on high-speed analysis continues
We face a data crisis (and opportunity)
New instrumentation means that data rates are growing much faster than Moore’s Law
→ Neither humans nor computers can cope by using current methods
We need new methods for designing experiments, managing data, analyzing data, and creating and delivering software
→ “A knowledge-based society, connected by the Internet and powered by AI …”
— Chen Chien-jen
https://bit.ly/2l4gfgu
How industry deals with scale
Automate and outsource:
(1) Data distribution
Needs: Usable, efficient, reliable,
secure, sustainable
Outsource:
(a) Petrel data store to hold data
prior to/during distribution
Petrel online store
petrel.alcf.anl.gov
94 Gbit/s Petrel—Blue Waters
2 petabytes
100 Gbps
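For a sense of scale, here is a back-of-the-envelope sketch of how long it would take to move the full 2 petabytes at the quoted 94 Gbit/s. It assumes decimal units (1 petabyte = 10^15 bytes) and a rate sustained for the whole transfer, neither of which the slide states; the function name is illustrative.

```python
def transfer_time_hours(size_bytes: float, rate_bits_per_s: float) -> float:
    """Return the hours needed to move size_bytes at a sustained bit rate."""
    seconds = size_bytes * 8 / rate_bits_per_s
    return seconds / 3600


PETABYTE = 10**15   # assumption: decimal petabyte
RATE = 94e9         # 94 Gbit/s, the quoted Petrel to Blue Waters rate

hours = transfer_time_hours(2 * PETABYTE, RATE)
print(f"{hours:.1f} hours")  # about 47.3 hours under these assumptions
```

Under these assumptions, emptying or filling the store takes roughly two days, which is why a staging service of this kind needs a fast network path.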
(b) Globus service for data
transfer and sharing
Globus APIs
globus.org
Automate:
(c) DMagic script uses Globus
APIs to transfer data and
configure permissions
http://dmagic.readthedocs.io
Francesco de Carlo
Given an experiment date:
• Retrieve user info from APS scheduler
• Create Globus “shared endpoint” and configure permissions
• Monitor directory at beamline and use Globus to copy new files to endpoint
• Email link to shared endpoint for data retrieval
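The directory-monitoring step can be sketched as a polling pass that hands each newly seen file to a copy hook. This is a minimal illustration, not the DMagic implementation: `find_new_files`, `sync_once`, and the `copy_file` hook are hypothetical names, and a real script would submit a Globus transfer inside the hook rather than call a local function.

```python
from pathlib import Path
from typing import Callable, List, Set


def find_new_files(directory: Path, seen: Set[Path]) -> List[Path]:
    """Return files in directory that are not yet in seen, and mark them seen."""
    new_files = sorted(
        p for p in directory.iterdir() if p.is_file() and p not in seen
    )
    seen.update(new_files)
    return new_files


def sync_once(
    directory: Path, seen: Set[Path], copy_file: Callable[[Path], None]
) -> int:
    """Run one polling pass: copy each new file once and return the count."""
    new_files = find_new_files(directory, seen)
    for path in new_files:
        # Hypothetical hook: stands in for a transfer to the shared endpoint.
        copy_file(path)
    return len(new_files)
```

Calling `sync_once` on a timer gives the "copy new files" behaviour. A production version would also need to wait until a file is fully written before copying it, which this sketch does not attempt.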
Automate and outsource:
(2) Publication and discovery
Move to permanent location
(or publish in place)
Compute and record checksums
Obtain and record metadata
Assign persistent identifier
Index for discovery
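Two of the steps above, computing checksums and recording metadata, can be sketched with the standard library alone. The record layout and the function names are assumptions made for illustration; they are not the schema used by any particular publication service.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_record(path: Path, metadata: dict) -> dict:
    """Bundle checksum, size, and user metadata into one record (illustrative layout)."""
    return {
        "filename": path.name,
        "size_bytes": path.stat().st_size,
        "sha256": sha256_of(path),
        "metadata": metadata,
    }
```

Reading in chunks keeps memory use flat for large detector files; the resulting record is what a later step would attach to a persistent identifier and send to an index.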
Data Publication
Indexing
materialsdatafacility.org
Programmatic access (REST, Python, Jupyter)
Web browse and search
For each dataset, we must apply quality control, assign identifiers, move to compute, extract features, eventually publish to a public repository, …
Building a different custom pipeline for every situation is impractical
Automate and outsource:
(3) End-to-end data pipelines
Automate: Trigger-action programming (“if this happens, then do that”)
Outsource: Cloud-based trigger-action service for reliability, scalability, ease of use, security, sustainability
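Trigger-action programming can be sketched as a list of condition/action pairs evaluated against each incoming event. This is a minimal in-process illustration under assumed names (`Rule`, `dispatch`, and the event fields `type` and `path` are all hypothetical); a cloud-hosted service would add persistence, retries, and authentication.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass
class Rule:
    name: str
    condition: Callable[[Event], bool]  # the "if this happens" part
    action: Callable[[Event], None]     # the "then do that" part


def dispatch(event: Event, rules: List[Rule]) -> List[str]:
    """Run the action of every rule whose condition matches; return fired rule names."""
    fired = []
    for rule in rules:
        if rule.condition(event):
            rule.action(event)
            fired.append(rule.name)
    return fired


# Illustrative rules; the actions only append to a log instead of doing real work.
log: List[tuple] = []
rules = [
    Rule("quality-control",
         lambda e: e["type"] == "new_file",
         lambda e: log.append(("qc", e["path"]))),
    Rule("transfer",
         lambda e: e["type"] == "quality_good",
         lambda e: log.append(("transfer", e["path"]))),
]
fired = dispatch({"type": "new_file", "path": "scan_001.h5"}, rules)
```

Keeping conditions and actions as plain callables means a new pipeline is a new rule list, not a new program, which is the point of the approach.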
National
Facility
Local Storage and Compute
• Quality Control
• Assign Handle
Beamline
Instrument
Globus Transfer
Central Storage and Compute (CSC)
• Feature extraction
• Aggregate and convert format
Archive
Automate and outsource:
(3) End-to-end pipelines with trigger-action programming
National Facility
Local Storage and Compute
• Quality Control
• Assign Handle
Beamline Instrument
• Email / SMS notification
Globus Transfer
Central Storage and Compute (CSC)
• Feature extraction
• Aggregate and convert format
Globus Transfer
Archive
• Set sharing ACLs
• Set timer for publication to Materials Data Facility
Data publication
Rules (1)
• IF new files THEN run quality control scripts
• IF quality is good THEN send email and transfer data to CSC
Rules (2)
• IF new files THEN run feature extraction
• IF feature detected THEN transfer data to archival storage
• IF time since ingest > 6 months THEN publish dataset to Materials Data Facility
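The time-based rule differs from the others in that it fires on elapsed time, not on a file event. A minimal sketch of that check follows, assuming "6 months" is approximated as 182 days and that ingest times are kept in a simple mapping; both choices and all names are illustrative.

```python
from datetime import datetime, timedelta
from typing import Dict, List

EMBARGO = timedelta(days=182)  # assumption: "6 months" approximated as 182 days


def due_for_publication(
    ingested_at: datetime, now: datetime, embargo: timedelta = EMBARGO
) -> bool:
    """Return True once more than the embargo period has passed since ingest."""
    return now - ingested_at > embargo


def select_due(ingest_times: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the names of datasets whose embargo has expired, sorted by name."""
    return sorted(
        name for name, t in ingest_times.items() if due_for_publication(t, now)
    )
```

A scheduler would call `select_due` periodically and pass each returned dataset to the publication step.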
Data Source
Collector Storage and Compute
• Capture dataset creation
• Review center position
APS beamline 32-ID
ALCF Cooley Cluster
• Generate preview and center images
• Reconstruct image
• Extract metadata
Ingest in Globus Search
Set sharing ACLs
Data publication
Rules
• IF new HDF5 files THEN transfer to Cooley
• IF new center_pos THEN initiate reconstruction
• IF transfer complete THEN execute preview and center finding
• IF results THEN return data to APS
• IF reconstruction THEN transfer data to Petrel AND publish dataset
ALCF Petrel
Archive
Visualize with Neuroglancer
Another example: Mosaic tomography for neurocartography
(N. Kasthuri, R. Chard, et al.)
globus.org
Automate and outsource:
(4) Data transformation and analysis
“beam misaligned”
“…”
Say you want to use a deep neural network for online identification
of problems when running diffraction experiments
https://doi.org/10.1109/NYSDS.2017.8085045
▪ Where are the model and trained weights?
▪ How do I run the model on my data?
▪ Should I run the model on my data?
▪ How can I retrain the model on new data?
DLHub
[“beam off image”, …]
model/xray/batch_predict
Data and Learning Hub (DLHub): Overview
• Collect, publish, categorize models/code/weights/data from many sources
• Serve models via API to foster sharing, consumption, and access to data,
training sets, and models
• Automate training of models (using HPC as needed) as new data are available
• Enable new science through reuse and synthesis of existing models
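The collect-and-serve idea can be illustrated with a small in-memory stand-in: register a model under a name, receive an identifier, then run it on a batch of inputs. This is not the DLHub API. `ModelRegistry`, the model name, the UUID used in place of a DOI, and the toy classifier are all hypothetical.

```python
import uuid
from typing import Any, Callable, Dict, List

Predictor = Callable[[List[Any]], List[Any]]


class ModelRegistry:
    """In-memory stand-in for a model-serving hub (illustrative only)."""

    def __init__(self) -> None:
        self._models: Dict[str, Predictor] = {}
        self._ids: Dict[str, str] = {}

    def register(self, name: str, predict: Predictor) -> str:
        """Store a model and return a placeholder identifier (stands in for a DOI)."""
        self._models[name] = predict
        self._ids[name] = str(uuid.uuid4())
        return self._ids[name]

    def run(self, name: str, inputs: List[Any]) -> List[Any]:
        """Invoke a registered model on a batch of inputs."""
        return self._models[name](inputs)


def toy_beam_classifier(images: List[List[float]]) -> List[str]:
    # Toy rule: call an image "beam off" when its mean intensity is below 1.0.
    return ["beam off" if sum(img) / len(img) < 1.0 else "ok" for img in images]


registry = ModelRegistry()
model_id = registry.register("xray/beam_check", toy_beam_classifier)
labels = registry.run("xray/beam_check", [[0.0, 0.1, 0.2], [5.0, 7.0, 6.0]])
```

The caller only needs the model name and its inputs; where the weights live and how the model executes stay behind the service boundary, which is what the questions on the earlier slide are asking for.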
Train, Collect, Serve
DLHub: Collect, serve, train community models
1) Register a model
• Collect data
• Train model
• Register model
• Send to DLHub (model / transform containers)
• Receive DOI
2) Run a model
• Find model
• Call DLHub
• Send compositions
• Receive predicted properties
Invoke model on data
Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Rachana Ananthakrishnan, Ryan Chard, Mike Papka, Rick Wagner
I reported on the work of many talented people
Thanks also to:
• Jon Almer, Francesco de Carlo, Hemant Sharma, Brian Toby, Stefan Vogt, Stephen Streiffer,
Nicholas Schwarz, Doga Gursoy, and others, Advanced Photon Source
• Tekin Bicer, Jonathan Gaff, Raj Kettimuthu, Justin Wozniak, and others, Argonne Computing
We are grateful to our sponsors
DLHub Globus
IMaD
Petrel
Argonne Leadership
Computing Facility
In summary
More data demands new methods for designing experiments,
managing data, analyzing data, and creating and delivering software
We must automate and outsource to manage data, run pipelines,
and train and run (machine learning) models
I presented examples that illustrate what can be done:
• High-speed storage services for data staging and distribution: Petrel
• Cloud-based services for data transfer and sharing: Globus Transfer
• Data publication and discovery services: Materials Data Facility
• Cloud-based automation services: Globus Automate
• Model and transformation services to encapsulate software: DLHub
There are many opportunities, and great need, for collaboration
To follow up: foster@anl.gov

Contenu connexe

Tendances

NIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasNIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasIan Foster
 
Science cloud foster june 2013
Science cloud foster june 2013Science cloud foster june 2013
Science cloud foster june 2013Kirill Osipov
 
Science as a Service: How On-Demand Computing can Accelerate Discovery
Science as a Service: How On-Demand Computing can Accelerate DiscoveryScience as a Service: How On-Demand Computing can Accelerate Discovery
Science as a Service: How On-Demand Computing can Accelerate DiscoveryIan Foster
 
Cloud com foster december 2010
Cloud com foster december 2010Cloud com foster december 2010
Cloud com foster december 2010Ian Foster
 
Accelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneAccelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneIan Foster
 
GlobusWorld 2015
GlobusWorld 2015GlobusWorld 2015
GlobusWorld 2015Tanu Malik
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Robert Grossman
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoSpark Summit
 
Benchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging ServicesBenchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging ServicesTanu Malik
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduriRavi Madduri
 
Big data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at KitwareBig data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at Kitwarebigdataviz_bay
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirSpark Summit
 
Big Data Visualization
Big Data VisualizationBig Data Visualization
Big Data Visualizationbigdataviz_bay
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer OverlordsIan Foster
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)Robert Grossman
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009Ian Foster
 
OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3Robert Grossman
 
Genomic Scale Big Data Pipelines
Genomic Scale Big Data PipelinesGenomic Scale Big Data Pipelines
Genomic Scale Big Data PipelinesLynn Langit
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataRobert Grossman
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Robert Grossman
 

Tendances (20)

NIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasNIH Data Commons Architecture Ideas
NIH Data Commons Architecture Ideas
 
Science cloud foster june 2013
Science cloud foster june 2013Science cloud foster june 2013
Science cloud foster june 2013
 
Science as a Service: How On-Demand Computing can Accelerate Discovery
Science as a Service: How On-Demand Computing can Accelerate DiscoveryScience as a Service: How On-Demand Computing can Accelerate Discovery
Science as a Service: How On-Demand Computing can Accelerate Discovery
 
Cloud com foster december 2010
Cloud com foster december 2010Cloud com foster december 2010
Cloud com foster december 2010
 
Accelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneAccelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundane
 
GlobusWorld 2015
GlobusWorld 2015GlobusWorld 2015
GlobusWorld 2015
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
Benchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging ServicesBenchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging Services
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduri
 
Big data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at KitwareBig data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at Kitware
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
 
Big Data Visualization
Big Data VisualizationBig Data Visualization
Big Data Visualization
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 
OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3
 
Genomic Scale Big Data Pipelines
Genomic Scale Big Data PipelinesGenomic Scale Big Data Pipelines
Genomic Scale Big Data Pipelines
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11
 

Similaire à Data Automation at Light Sources

Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...Globus
 
Taming Big Data!
Taming Big Data!Taming Big Data!
Taming Big Data!Ian Foster
 
Working with Instrument Data (GlobusWorld Tour - UMich)
Working with Instrument Data (GlobusWorld Tour - UMich)Working with Instrument Data (GlobusWorld Tour - UMich)
Working with Instrument Data (GlobusWorld Tour - UMich)Globus
 
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Databricks
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilitiesIan Foster
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Databricks
 
Linking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationLinking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationIan Foster
 
Big Process for Big Data @ NASA
Big Process for Big Data @ NASABig Process for Big Data @ NASA
Big Process for Big Data @ NASAIan Foster
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Ian Foster
 
Globus Integrations (CHPC 2019 - South Africa)
Globus Integrations (CHPC 2019 - South Africa)Globus Integrations (CHPC 2019 - South Africa)
Globus Integrations (CHPC 2019 - South Africa)Globus
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011Ian Foster
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceinside-BigData.com
 
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Ian Foster
 
Grid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the CloudGrid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the CloudAdianto Wibisono
 
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...Amazon Web Services
 
GlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobus
 
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...Ian Foster
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudAmazon Web Services
 
Accelerating Time to Science: Transforming Research in the Cloud
Accelerating Time to Science: Transforming Research in the CloudAccelerating Time to Science: Transforming Research in the Cloud
Accelerating Time to Science: Transforming Research in the CloudJamie Kinney
 
CPAC Connectome Analysis in the Cloud
CPAC Connectome Analysis in the CloudCPAC Connectome Analysis in the Cloud
CPAC Connectome Analysis in the CloudCameron Craddock
 

Similaire à Data Automation at Light Sources (20)

Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
 
Taming Big Data!
Taming Big Data!Taming Big Data!
Taming Big Data!
 
Working with Instrument Data (GlobusWorld Tour - UMich)
Working with Instrument Data (GlobusWorld Tour - UMich)Working with Instrument Data (GlobusWorld Tour - UMich)
Working with Instrument Data (GlobusWorld Tour - UMich)
 
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilities
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
 
Linking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationLinking Scientific Instruments and Computation
Linking Scientific Instruments and Computation
 
Big Process for Big Data @ NASA
Big Process for Big Data @ NASABig Process for Big Data @ NASA
Big Process for Big Data @ NASA
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013
 
Globus Integrations (CHPC 2019 - South Africa)
Globus Integrations (CHPC 2019 - South Africa)Globus Integrations (CHPC 2019 - South Africa)
Globus Integrations (CHPC 2019 - South Africa)
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
 
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
 
Grid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the CloudGrid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the Cloud
 
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
 
GlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobusWorld 2020 Keynote
GlobusWorld 2020 Keynote
 
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the Cloud
 
Accelerating Time to Science: Transforming Research in the Cloud
Accelerating Time to Science: Transforming Research in the CloudAccelerating Time to Science: Transforming Research in the Cloud
Accelerating Time to Science: Transforming Research in the Cloud
 
CPAC Connectome Analysis in the Cloud
CPAC Connectome Analysis in the CloudCPAC Connectome Analysis in the Cloud
CPAC Connectome Analysis in the Cloud
 

Plus de Ian Foster

Global Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxGlobal Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxIan Foster
 
The Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionThe Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionIan Foster
 
Better Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumBetter Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumIan Foster
 
ESnet6 and Smart Instruments
ESnet6 and Smart InstrumentsESnet6 and Smart Instruments
ESnet6 and Smart InstrumentsIan Foster
 
A Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific DiscoveryA Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific DiscoveryIan Foster
 
Foster CRA March 2022.pptx
Foster CRA March 2022.pptxFoster CRA March 2022.pptx
Foster CRA March 2022.pptxIan Foster
 
Big Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceBig Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceIan Foster
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryIan Foster
 
Research Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryResearch Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryIan Foster
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterIan Foster
 
Team Argon Summary
Team Argon SummaryTeam Argon Summary
Team Argon SummaryIan Foster
 
Thoughts on interoperability
Thoughts on interoperabilityThoughts on interoperability
Thoughts on interoperabilityIan Foster
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFIan Foster
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...Ian Foster
 
Software Infrastructure for a National Research Platform
Software Infrastructure for a National Research PlatformSoftware Infrastructure for a National Research Platform
Software Infrastructure for a National Research PlatformIan Foster
 
Globus Auth: A Research Identity and Access Management Platform
Globus Auth: A Research Identity and Access Management PlatformGlobus Auth: A Research Identity and Access Management Platform
Globus Auth: A Research Identity and Access Management PlatformIan Foster
 
Streamlined data sharing and analysis to accelerate cancer research
Streamlined data sharing and analysis to accelerate cancer researchStreamlined data sharing and analysis to accelerate cancer research
Streamlined data sharing and analysis to accelerate cancer researchIan Foster
 
Accelerating Discovery via Science Services
Accelerating Discovery via Science ServicesAccelerating Discovery via Science Services
Accelerating Discovery via Science ServicesIan Foster
 
Accelerating Data-driven Discovery in Energy Science
Accelerating Data-driven Discovery in Energy ScienceAccelerating Data-driven Discovery in Energy Science
Accelerating Data-driven Discovery in Energy ScienceIan Foster
 
building global software/earthcube->sciencecloud
building global software/earthcube->sciencecloudbuilding global software/earthcube->sciencecloud
building global software/earthcube->sciencecloudIan Foster
 

Data Automation at Light Sources

  • 1. Data Automation at Light Sources. Ian Foster, Argonne National Laboratory & The University of Chicago
  • 2. [Figure: Advanced Photon Source and Argonne Leadership Computing Facility, labeled "1 km" and "5 μsec"]
  • 3. I have been working with light sources for some time! von Laszewski et al., Real-time analysis, visualization, and steering of microtomography experiments at photon sources, SIAM Parallel Processing, 1999: “the data rates and compute power required ... are prodigious, easily reaching one gigabit per second and a teraflop per second [respectively]”
  • 4. Work on high-speed analysis continues. Ptychography: Use GPU cluster for 360x speedup, from 7 hours to 72 s [Deng, Vine, Chen, Nashed, Philips, Jin, Peterka, Ross, Jacobsen] → Enable online analysis and use of fly scans. Microtomography: Use 32K Mira BG/Q nodes to reduce reconstruction time from days to 2 mins [Bicer, Gursoy, Kettimuthu, De Carlo, Agrawal] → Identify and correct experimental misconfiguration. High-energy diffraction microscopy: 10K BG/Q nodes to reconstruct in 10 minutes [Sharma, Almer, Wozniak, Wilde, Foster] → Zoom in on crack locations (switch between far field and near field). Figure labels: Coherence; Brightness; High Energy. Figure captions: Micrometer porosity structure of shale samples; Microstructure of a copper wire, 0.2 mm diameter.
  • 5. We face a data crisis (and opportunity). New instrumentation means that data rates are growing much faster than Moore’s Law → Neither humans nor computers can cope by using current methods. We need new methods for designing experiments, managing data, analyzing data, and creating and delivering software → “A knowledge-based society, connected by the Internet and powered by AI …” (Chen Chien-jen)
  • 8. Automate and outsource: (1) Data distribution. Needs: usable, efficient, reliable, secure, sustainable.
  • 9. Automate and outsource: (1) Data distribution. Needs: usable, efficient, reliable, secure, sustainable. Outsource: (a) Petrel data store to hold data prior to/during distribution. Petrel online store (petrel.alcf.anl.gov): 2 petabytes, 100 Gbps; 94 Gbit/s measured between Petrel and Blue Waters.
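To put the Petrel figures above in perspective, here is a small back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and a sustained rate equal to the quoted 94 Gbit/s; the function name `transfer_seconds` is only illustrative and real transfers would see protocol and storage overheads.

```python
def transfer_seconds(num_bytes: float, gbit_per_s: float) -> float:
    """Idealized transfer time: bits to move divided by bits per second."""
    return num_bytes * 8 / (gbit_per_s * 1e9)


if __name__ == "__main__":
    one_tb = transfer_seconds(1e12, 94)   # roughly 85 seconds
    two_pb = transfer_seconds(2e15, 94)   # roughly 47 hours
    print(f"1 TB at 94 Gbit/s: {one_tb:.1f} s")
    print(f"2 PB at 94 Gbit/s: {two_pb / 3600:.1f} h")
```

Under these assumptions, a terabyte-scale dataset moves in about a minute and a half, while draining the full 2 PB store would take about two days.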
  • 10. Automate and outsource: (1) Data distribution. Needs: usable, efficient, reliable, secure, sustainable. Outsource: (a) Petrel data store (2 petabytes, 100 Gbps) to hold data prior to/during distribution; (b) Globus service for data transfer and sharing (Globus APIs).
  • 12. Automate and outsource: (1) Data distribution. Needs: usable, efficient, reliable, secure, sustainable. Outsource: (a) Petrel data store to hold data prior to/during distribution; (b) Globus service for data transfer and sharing. Automate: (c) DMagic script (http://dmagic.readthedocs.io, Francesco de Carlo) uses Globus APIs to transfer data and configure permissions. Given an experiment date: retrieve user info from the APS scheduler; create a Globus “shared endpoint” and configure permissions; monitor a directory at the beamline and use Globus to copy new files to the endpoint; email a link to the shared endpoint for data retrieval.
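The "monitor a directory and copy new files" step above can be sketched as a simple polling pass. This is not DMagic's code and it makes no Globus calls: `find_new_files`, `sync_once`, and the `copy_file` hook are hypothetical names, and in a real deployment the hook would submit a transfer to the shared endpoint.

```python
import os
import tempfile
from typing import Callable, List, Set


def find_new_files(directory: str, seen: Set[str]) -> List[str]:
    """Return names of regular files not seen before, and mark them as seen."""
    current = sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    new_files = [name for name in current if name not in seen]
    seen.update(new_files)
    return new_files


def sync_once(directory: str, seen: Set[str],
              copy_file: Callable[[str], None]) -> int:
    """One polling pass: hand every new file to the (hypothetical) transfer hook."""
    new_files = find_new_files(directory, seen)
    for name in new_files:
        copy_file(os.path.join(directory, name))
    return len(new_files)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as beamline_dir:
        seen: Set[str] = set()
        open(os.path.join(beamline_dir, "scan_001.h5"), "w").close()
        # print stands in for a real transfer submission
        sync_once(beamline_dir, seen, copy_file=print)
```

Calling `sync_once` on a timer gives the monitoring loop; keeping the `seen` set outside the function means each file is handed to the transfer hook only once.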
  • 13. Automate and outsource: (2) Publication and discovery. Steps: move to permanent location (or publish in place); compute and record checksums; obtain and record metadata; assign persistent identifier; index for discovery. Infrastructure: Petrel (2 petabytes, 100 Gbps), Globus APIs.
  • 14. Automate and outsource: (2) Publication and discovery. Steps: move to permanent location (or publish in place); compute and record checksums; obtain and record metadata; assign persistent identifier; index for discovery. Services: data publication and indexing (materialsdatafacility.org). Infrastructure: Petrel (2 petabytes, 100 Gbps), Globus APIs.
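As a rough illustration of the "compute and record checksums" and "obtain and record metadata" steps, the sketch below builds a small manifest with Python's standard library. The manifest layout and the function names are assumptions made for this example, not the Materials Data Facility's format.

```python
import hashlib
import json
import os
from typing import Dict, List


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths: List[str]) -> List[Dict[str, object]]:
    """Record name, size, and checksum per file (hypothetical manifest layout)."""
    return [
        {
            "name": os.path.basename(path),
            "bytes": os.path.getsize(path),
            "sha256": sha256_of_file(path),
        }
        for path in paths
    ]


def manifest_as_json(paths: List[str]) -> str:
    """Serialize the manifest so it can be stored next to the published data."""
    return json.dumps(build_manifest(paths), indent=2)
```

Storing the manifest alongside the dataset lets a later consumer recompute the digests and confirm that the published files arrived intact.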
  • 15. Automate and outsource: (2) Publication and discovery. Access: programmatic (REST, Python, Jupyter); web browse and search. Services: data publication and indexing (materialsdatafacility.org). Infrastructure: Petrel (2 petabytes, 100 Gbps), Globus APIs.
  • 16. Automate and outsource: (3) End-to-end data pipelines. For each dataset, we must apply quality control, assign identifiers, move to compute, extract features, eventually publish to a public repository, … Building a different custom pipeline for every situation is impractical.
  • 17. Automate and outsource: (3) End-to-end data pipelines. For each dataset, we must apply quality control, assign identifiers, move to compute, extract features, and eventually publish to a public repository. Building a different custom pipeline for every situation is impractical. Automate: trigger-action programming (“if this happens, then do that”). Outsource: cloud-based trigger-action service for reliability, scalability, ease of use, security, sustainability.
  • 18. Automate and outsource: (3) End-to-end pipelines with trigger-action programming. Pipeline: Beamline Instrument at National Facility → Local Storage and Compute (quality control, assign handle) → Globus Transfer → Central Storage and Compute (CSC) (feature extraction, aggregate and convert format) → Archive.
  • 19. Automate and outsource: (3) End-to-end pipelines with trigger-action programming. Pipeline: Beamline Instrument at National Facility → Local Storage and Compute (quality control, assign handle, email / SMS notification) → Globus Transfer → Central Storage and Compute (CSC) (feature extraction, aggregate and convert format) → Archive. Rules (1): IF new files THEN run quality control scripts; IF quality is good THEN send email and transfer data to CSC.
  • 20. Automate and outsource: (3) End-to-end pipelines with trigger-action programming. Pipeline: Beamline Instrument at National Facility → Local Storage and Compute (quality control, assign handle, email / SMS notification) → Globus Transfer → Central Storage and Compute (CSC) (feature extraction, aggregate and convert format) → Globus Transfer → Archive (set sharing ACLs, set timer for publication to Materials Data Facility) → Data publication. Rules (1): IF new files THEN run quality control scripts; IF quality is good THEN send email and transfer data to CSC. Rules (2): IF new files THEN run feature extraction; IF feature detected THEN transfer data to archival storage; IF time since ingest > 6 months THEN publish dataset to Materials Data Facility.
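The IF/THEN rules above can be expressed as a tiny in-memory trigger-action loop. This is a minimal sketch under stated assumptions, not the cloud service described in the talk: the `Rule` class, the event fields (`type`, `site`, `bytes`, `has_feature`, `days_since_ingest`), the placeholder quality check, and the 183-day stand-in for "6 months" are all hypothetical, and email sending and actual transfers are omitted.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass
class Rule:
    name: str
    condition: Callable[[Event], bool]      # IF part
    action: Callable[[Event], List[Event]]  # THEN part; returns follow-up events


def run_rules(rules: List[Rule], initial: Event) -> List[str]:
    """Process events first-in first-out; return names of rules that fired."""
    fired: List[str] = []
    queue: List[Event] = [initial]
    while queue:
        event = queue.pop(0)
        for rule in rules:
            if rule.condition(event):
                fired.append(rule.name)
                queue.extend(rule.action(event))
    return fired


def build_rules() -> List[Rule]:
    return [
        Rule("quality_control",
             lambda e: e["type"] == "new_files" and e.get("site") == "facility",
             # Placeholder check: a non-empty file counts as good quality.
             lambda e: [{**e, "type": "quality", "good": e.get("bytes", 0) > 0}]),
        Rule("notify_and_transfer",
             lambda e: e["type"] == "quality" and e["good"],
             # Email is omitted; the transfer shows up as new files at CSC.
             lambda e: [{**e, "type": "new_files", "site": "csc"}]),
        Rule("feature_extraction",
             lambda e: e["type"] == "new_files" and e.get("site") == "csc",
             lambda e: [{**e, "type": "features",
                         "detected": bool(e.get("has_feature"))}]),
        Rule("archive",
             lambda e: e["type"] == "features" and e["detected"],
             lambda e: []),
        Rule("publish_to_mdf",
             lambda e: e["type"] == "timer" and e["days_since_ingest"] > 183,
             lambda e: []),
    ]


if __name__ == "__main__":
    start = {"type": "new_files", "site": "facility",
             "path": "scan.h5", "bytes": 10, "has_feature": True}
    print(run_rules(build_rules(), start))
```

Because each action only emits new events, stages can be added or reordered by editing the rule list rather than rewriting a monolithic pipeline script.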
  • 21. Another example: Mosaic tomography for neurocartography (N. Kasthuri, R. Chard, et al.). Diagram stages: Data Source; Collector; Storage and Compute; Data publication. APS beamline 32-ID: capture dataset creation, review center position. ALCF Cooley Cluster: generate preview and center images, reconstruct image, extract metadata. ALCF Petrel archive: ingest in Globus Search, set sharing ACLs, visualize with Neuroglancer. Rules: IF new HDF5 files THEN transfer to Cooley; IF new center_pos THEN initiate reconstruction; IF transfer complete THEN execute preview and center finding; IF results THEN return data to APS; IF reconstruction THEN transfer data to Petrel AND publish dataset.
  • 23. Automate and outsource: (4) Data transformation and analysis. Say you want to use a deep neural network for online identification of problems when running diffraction experiments (example labels: “beam misaligned”, “…”).
  • 24. Automate and outsource: (4) Data transformation and analysis https://doi.org/10.1109/NYSDS.2017.8085045
  • 25. Automate and outsource: (4) Data transformation and analysis ▪ Where are the model and trained weights? ▪ How do I run the model on my data? ▪ Should I run the model on my data? ▪ How can I retrain the model on new data? https://doi.org/10.1109/NYSDS.2017.8085045
  • 26. Automate and outsource: (4) Data transformation and analysis ▪ Where are the model and trained weights? ▪ How do I run the model on my data? ▪ Should I run the model on my data? ▪ How can I retrain the model on new data? DLHub: model/xray/batch_predict → [“beam off image”, …] https://doi.org/10.1109/NYSDS.2017.8085045
  • 28. Data and Learning Hub (DLHub): Overview (Collect, Serve, Train) • Collect, publish, categorize models/code/weights/data from many sources • Serve models via API to foster sharing, consumption, and access to data, training sets, and models • Automate training of models (using HPC as needed) as new data are available • Enable new science through reuse and synthesis of existing models
  • 29. DLHub: Collect, serve, train community models. 1) Register a model: collect data; train model; register model; send to DLHub; receive DOI. DLHub holds model / transform containers.
  • 30. DLHub: Collect, serve, train community models. 1) Register a model: collect data; train model; register model; send to DLHub; receive DOI. 2) Run a model: collect data; find model; call DLHub; send compositions; receive predicted properties. DLHub holds model / transform containers.
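The register-then-run workflow above can be mimicked with a small in-memory registry. This is purely an illustrative sketch: `ModelRegistry`, its methods, the `local-id-N` identifiers (standing in for a DOI), and the toy classifier are hypothetical and are not the DLHub API.

```python
from typing import Callable, Dict, List, Sequence

Predictor = Callable[[Sequence[Sequence[float]]], List[str]]


class ModelRegistry:
    """Toy stand-in for a model-serving hub: register, find, run."""

    def __init__(self) -> None:
        self._models: Dict[str, Predictor] = {}
        self._counter = 0

    def register(self, name: str, predict: Predictor) -> str:
        """Store the model and return a placeholder identifier (not a real DOI)."""
        self._counter += 1
        self._models[name] = predict
        return f"local-id-{self._counter}"

    def find(self, keyword: str) -> List[str]:
        """Return registered model names containing the keyword."""
        return sorted(name for name in self._models if keyword in name)

    def run(self, name: str, inputs: Sequence[Sequence[float]]) -> List[str]:
        """Invoke the named model on a batch of inputs."""
        return self._models[name](inputs)


def toy_beam_classifier(images: Sequence[Sequence[float]]) -> List[str]:
    """Hypothetical model: call an image 'beam off' if its mean intensity is tiny."""
    labels = []
    for image in images:
        mean = sum(image) / len(image)
        labels.append("beam off image" if mean < 0.1 else "ok")
    return labels


if __name__ == "__main__":
    hub = ModelRegistry()
    print(hub.register("xray/batch_predict", toy_beam_classifier))
    print(hub.find("xray"))
    print(hub.run("xray/batch_predict", [[0.0, 0.0], [0.5, 0.7]]))
```

The point of the pattern is that a consumer needs only a model name and inputs; where the weights live and how the model executes stay behind the service boundary.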
  • 34. I reported on the work of many talented people: Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Rachana Ananthakrishnan, Ryan Chard, Mike Papka, Rick Wagner. Thanks also to: • Jon Almer, Francesco de Carlo, Hemant Sharma, Brian Toby, Stefan Vogt, Stephen Streiffer, Nicholas Schwarz, Doga Gursoy, and others, Advanced Photon Source • Tekin Bicer, Jonathan Gaff, Raj Kettimuthu, Justin Wozniak, and others, Argonne Computing. We are grateful to our sponsors. Logos shown: DLHub, Globus, IMaD, Petrel, Argonne Leadership Computing Facility.
  • 35. In summary: More data demands new methods for designing experiments, managing data, analyzing data, and creating and delivering software. We must automate and outsource to manage data, run pipelines, and train and run (machine learning) models. I presented examples that illustrate what can be done: • High-speed storage services for data staging and distribution: Petrel • Cloud-based services for data transfer and sharing: Globus Transfer • Data publication and discovery services: Materials Data Facility • Cloud-based automation services: Globus Automate • Model and transformation services to encapsulate software: DLHub. There are many opportunities, and great need, for collaboration. To follow up: foster@anl.gov