Computing Just What You Need:
Online Data Analysis and Reduction
at Extreme Scales
Ian Foster, Argonne & U.Chicago
August 31, 2017
EuroPar, Santiago de Compostela
https://www.researchgate.net/publication/317703782
What I won’t talk about: globus.org
• 5 major services
• 13 national labs use Globus
• 300 PB transferred
• 10,000 active endpoints
• 50 Bn files processed
• 70,000 registered users
• 99.5% uptime
• 65+ institutional subscribers
• 1 PB largest single transfer to date
• 3 months longest continuously managed transfer
• 300+ federated campus identities
• 12,000 active users/year
Shameless plugs
Three messages
Dramatic changes in HPC system geography …
… are driving new application structures …
… resulting in exciting new computer science challenges
Geography: (Part of) what determines how long it takes to get from A to B
The memory hierarchy plays a big role in computing geography
• Computing geography is changing rapidly
• Despite continued exponential growth in many technologies
• Different rates mean that resources are getting farther away
[Figure: relative performance improvement, ~1980-2000 (Patterson, CACM, 2004): CPU improvement rates high, disk rates low]
[Figures: Titan supercomputer (A. C. Bauer et al., EuroVis 2016); system resources growing at unequal rates, e.g., ×18 for one resource (10 → 180) vs. ×3 for another (0.3 → 1)]
Exascale climate goal: Ensembles of 1 km models at 15 simulated years/24 hours
Full state once per model day → 260 TB every 16 seconds → 1.4 EB/day
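The arrow chain above can be verified with a little arithmetic; a quick sketch (the 15 simulated years per wall-clock day and the 260 TB snapshot size are taken from the slide):

```python
# Sanity-check the climate output rates quoted above.
SECONDS_PER_DAY = 86_400
SIM_YEARS_PER_WALL_DAY = 15
SNAPSHOT_TB = 260  # full model state, written once per simulated day

# How often one simulated day (one snapshot) completes in wall-clock time:
model_days_per_wall_day = SIM_YEARS_PER_WALL_DAY * 365
seconds_per_snapshot = SECONDS_PER_DAY / model_days_per_wall_day
print(f"one snapshot every {seconds_per_snapshot:.1f} s")  # ~15.8 s

# Aggregate output rate:
eb_per_day = SNAPSHOT_TB * model_days_per_wall_day / 1e6  # TB -> EB
print(f"{eb_per_day:.2f} EB/day")  # ~1.42 EB/day
```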
Model selection in deep learning
Evaluate 1M alternative models, each with 100M parameters → 10^14 parameter values
https://de.mathworks.com/company/newsletters/articles/cancer-diagnostics-with-deep-learning-and-photonic-time-stretch.html
Real-time analysis and experimental steering
• Current experimental protocols typically process and validate data only after an experiment has completed, which can lead to undetected errors and prevents online steering
• We built an autonomous stream processing system that allows data streamed from beamline computers to be processed in real time on a remote supercomputer, with a control feedback loop used to make decisions during experimentation
• The system has been tested in a real-world setting at the TXM beamline (32-ID@APS) during a cement wetting experiment (2 experiments, each with 8 hours of data acquisition time)
[Figures: sustained projections/second vs. circular buffer size and reconstruction frequency; image quality (similarity score) vs. number of streamed projections reconstructed; reconstructed image sequence]
Tekin Bicer et al., eScience 2017
Other examples
• Materials science
  • Billion-atom atomistic simulations with femtosecond time steps
  • Simulations may run for simulated seconds
  • Want to study vibrational responses at 10s of femtoseconds
• Fusion science
  • Full-device simulations may generate 100 PB
  • Need to reduce 1000:1 for effective output
  • Eventual goal is real-time response during fusion experiments
HPC applications: Synopsis
Applications vary along two axes (single vs. multiple programs; offline vs. online analysis), spanning simulation + analysis, multiple simulations, multiple simulations + analyses, and many-task configurations:
• Reliable or unreliable
• Loosely or tightly coupled
• Static or dynamic
New challenges: Efficient logistics!
• “Amateurs talk strategy while professionals study logistics” – Robert Barrow
• “The line between disorder and order lies in logistics...” – Sun Tzu
The need for online data analysis and reduction
Traditional approach: Simulate, output, analyze
• Write simulation output to secondary storage; read back for analysis
• Decimate in time when the simulation output rate exceeds the output rate of the computer
• Online: y = F(x)
• Offline: a = A(y), b = B(y), …
New approach: Online data analysis & reduction
• Co-optimize simulation, analysis, and reduction for performance and information output
• Substitute CPU cycles for I/O, via data (de)compression and/or online data analysis
• (a) Online: a = A(F(x)), b = B(F(x)), …
• (b) Online: r = R(F(x))
      Offline: a = A’(r), b = B’(r), or a = A(U(r)), b = B(U(r))  [R = reduce, U = un-reduce]
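To make the contrast concrete, here is a toy sketch of patterns (a) and (b); F, A, B, R, and U are the slide's symbols, while the concrete functions (decimation as the reduction R, interpolation as U) are illustrative assumptions, not any real simulation:

```python
import numpy as np

# Illustrative stand-ins: F = simulation step, A/B = analyses, R = reduce, U = un-reduce.
def F(x):    return np.sin(x) + 0.1 * x               # "simulation" output y
def A(y):    return float(y.mean())                    # one analysis
def B(y):    return float(y.max())                     # another analysis
def R(y):    return y[::10]                            # reduce: keep every 10th sample
def U(r, n): return np.interp(np.arange(n), np.arange(0, n, 10), r)  # un-reduce

x = np.linspace(0, 10, 1000)
y = F(x)

# (a) Online analysis: compute a, b while y is still in memory; store only results.
a_online, b_online = A(y), B(y)

# (b) Online reduction: store r (10x smaller); analyses run offline on U(r).
r = R(y)
a_offline, b_offline = A(U(r, len(y))), B(U(r, len(y)))

print(f"a: online {a_online:.4f} vs offline {a_offline:.4f}")
print(f"storage: {y.nbytes} B -> {r.nbytes} B")
```

The trade shown is the slide's point: pattern (b) pays a small accuracy cost on the offline analyses in exchange for a 10x smaller output stream.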
But reduction comes with challenges
• Handling high entropy
• Performance – no benefit otherwise
• Not only errors in the variable itself: E ≡ f − f̂
• Must also consider impact on derived quantities: E ≡ g(f(x, t)) − g(f̂(x, t))
S. Klasky
Data reduction challenges
Key research challenge: How to manage the impact of errors on derived quantities?
[Figure: a feature present in the original data is missing after reduction – “Where did it go???”]
S. Klasky
CODAR: Center for Online Data Analysis and Reduction
A U.S. Department of Energy Exascale Computing Project codesign center
[Diagram: CODAR at the intersection of applications, data services, and exascale platforms]
Infrastructure – Matthew Wolf (Lead)
• Cheetah: Bryce Allen, Kshitij Mehta, Tahsin Kurc, Li Tang
• Savannah: Justin Wozniak, Manish Parashar, Philip Davis
• Chimbuko: Abid Malik, Line Pouchard
Data Reduction – Franck Cappello (Lead)
• Multilevel: Mark Ainsworth, Ozan Tugluk, Jong Choi
• Z-checker: Julie Bessac, Sheng Di
Data Analysis – Shinjae Yoo (Lead)
• Blobs: Tom Peterka, Hanqi Guo
• Hierarchical: Stefan Wild, Wendy Di
• Functional: George Ostrouchov
• Visual Analytics: Klaus Mueller, Wei Xu
Management – Ian Foster (Lead)
• Scott Klasky
• Kerstin Kleese van Dam
• Todd Munson (Project Management)
Cross-cutting research questions
What are the best data analysis and reduction algorithms for different
application classes, in terms of speed, accuracy, and resource
requirements? How can we implement those algorithms to achieve
scalability and performance portability?
What are the tradeoffs in data analysis accuracy, resource needs, and
overall application performance between using various data reduction
methods to reduce file size prior to offline data reconstruction and
analysis vs. performing more online data analysis? How do these
tradeoffs vary with hardware and software choices?
How do we effectively orchestrate online data analysis and reduction to
reduce associated overheads? How can hardware and software help with
orchestration?
Prototypical CODAR data analysis and reduction pipeline
• A running simulation feeds the CODAR runtime through the CODAR data API
• CODAR data analysis: multivariate statistics, feature analysis, outlier detection
• CODAR data reduction: application-aware transforms and encodings
• CODAR data monitoring: error calculation, refinement hints
• Reduced output and reconstruction info flow through the I/O system to offline data analysis, again via the CODAR data API
Simulation knowledge (application, models, numerics, performance optimization, …) informs every stage
Overarching data reduction challenges
• Understanding the science requires massive data reduction
• How do we reduce
  • The time spent in reducing the data to knowledge?
  • The amount of data moved on the HPC platform?
  • The amount of data read from the storage system?
  • The amount of data stored in memory, on the storage system, or moved over the WAN?
• … without removing the knowledge?
• Requires deep dives into application post-processing routines and simulations
• Goal is to create both (a) co-design infrastructure and (b) reduction and analysis routines
  • General: e.g., reduce N bytes to M bytes, M << N
  • Motif-specific: e.g., finite difference mesh vs. particles vs. finite elements
  • Application-specific: e.g., reduced physics allows us to understand deltas
HPC floating point compression
• Current interest is in lossy algorithms; some use preprocessing
• Lossless may achieve up to ~3x reduction
Compress each variable separately:
• ISABELA
• SZ
• ZFP
• Linear auditing
• SVD
• Adaptive gradient methods
Several variables simultaneously:
• PCA
• Tensor decomposition
• …
Lossy compression with SZ
No existing compressor can reduce hard-to-compress datasets by more than a factor of 2.
• Objective 1: Reduce hard-to-compress datasets by one order of magnitude
• Objective 2: Add user-required error controls (error bound, shape of error distribution, spectral behavior of error function, etc.)
Example targets: NCAR atmosphere simulation output (1.5 TB); WRF hurricane simulation output; Advanced Photon Source mouse brain data. What we need to compress (bit map of 128 floating point numbers) looks like random noise.
Franck Cappello
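SZ's actual pipeline (prediction plus quantization and entropy coding) is more elaborate than can be shown here; the following is a minimal sketch of the core contract of error-bounded lossy compression only: quantize each value to a bin of width twice the bound, so the reconstruction error never exceeds the bound:

```python
import numpy as np

def compress(data: np.ndarray, bound: float) -> np.ndarray:
    """Quantize each value to an integer bin of width 2*bound."""
    return np.round(data / (2 * bound)).astype(np.int32)

def decompress(bins: np.ndarray, bound: float) -> np.ndarray:
    """Reconstruct the bin centers; error is at most `bound`."""
    return bins * (2 * bound)

rng = np.random.default_rng(0)
data = np.cumsum(rng.normal(size=10_000))  # smooth-ish test signal
bound = 1e-2

bins = compress(data, bound)
recon = decompress(bins, bound)

# The user-specified error bound is respected point-wise.
assert np.max(np.abs(data - recon)) <= bound + 1e-12
# The integer bins are far more compressible (e.g., by an entropy coder)
# than the raw 8-byte doubles.
```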
Lossy compression: Atmospheric simulation
[Figure: compression results for the latest SZ on atmospheric simulation data]
Franck Cappello
Characterizing compression error
[Figures: error spectrum (amplitude vs. frequency); maximum and average compression error per variable for SZ and ZFP]
Error characteristics to assess:
• Error distribution
• Spectral behavior
• Laplacian (derivatives)
• Autocorrelation of errors
• Respect of error bounds
• Error propagation
Franck Cappello
Z-checker: Analysis of data reduction error
• Community tool to enable comprehensive assessment of lossy data reduction error
• Collection of data quality criteria from applications
• Community repository for datasets, reduction quality requirements, and compression performance
• Modular design enables contributed analysis modules (C and R) and format readers (ADIOS, HDF5, etc.)
• Offline/online parallel statistical, spectral, and point-wise distortion analysis with static & dynamic visualization
Franck Cappello, Julie Bessac, Sheng Di
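Z-checker's module interface is not reproduced here; this is a minimal sketch of the kinds of point-wise and statistical distortion metrics such a tool reports (maximum error, PSNR, lag-1 error autocorrelation):

```python
import numpy as np

def distortion_metrics(orig: np.ndarray, recon: np.ndarray) -> dict:
    """A few of the distortion statistics a reduction-error checker reports."""
    err = orig - recon
    value_range = orig.max() - orig.min()
    mse = float(np.mean(err ** 2))
    # Lag-1 autocorrelation of the error signal: structured (correlated)
    # error is more likely to distort derived quantities than white error.
    e = err - err.mean()
    autocorr = float(np.sum(e[:-1] * e[1:]) / np.sum(e * e))
    return {
        "max_abs_error": float(np.max(np.abs(err))),
        "psnr_db": float(10 * np.log10(value_range ** 2 / mse)),
        "lag1_autocorr": autocorr,
    }

rng = np.random.default_rng(1)
orig = np.sin(np.linspace(0, 20, 5000))
recon = orig + rng.normal(scale=1e-3, size=orig.size)  # stand-in for a lossy reconstruction
m = distortion_metrics(orig, recon)
print(m)
```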
Science-driven decompositions
• Information-theoretically derived methods like SZ, ISABELA, and ZFP make for good generic capabilities
• If scientists can provide additional details on how to determine features of interest, we can use those to drive further optimizations, e.g., if they can select:
  • Regions of high gradient
  • Regions near turbulent flow
  • Particles with velocities > two standard deviations
• How can scientists help define features?
Multilevel compression techniques
A hierarchical reduction scheme produces multiple levels of partial decompression of the data, so that users can work with reduced representations that require minimal storage whilst achieving the user-specified tolerance.
[Figure: compression vs. user-specified tolerance. Results for a turbulence dataset: extremely large, inherently non-smooth, resistant to compression]
Mark Ainsworth
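Ainsworth's scheme is far more sophisticated than can be shown here; this is a minimal sketch of the hierarchical idea only: repeatedly coarsen, store per-level detail corrections, and drop corrections below a tolerance, so coarser levels remain usable on their own:

```python
import numpy as np

def multilevel_reduce(x: np.ndarray, tol: float):
    """Haar-like hierarchy: repeatedly halve resolution, keeping the detail
    (difference) coefficients only where they exceed tol. Sketch only."""
    levels = []
    while x.size > 1:
        coarse = 0.5 * (x[0::2] + x[1::2])   # average adjacent pairs
        detail = x[0::2] - coarse            # what the averaging lost
        detail[np.abs(detail) < tol] = 0.0   # truncate small details
        levels.append(detail)
        x = coarse
    return x, levels                         # coarsest value + per-level details

def multilevel_expand(coarsest: np.ndarray, levels) -> np.ndarray:
    x = coarsest
    for detail in reversed(levels):
        even = x + detail
        odd = x - detail                     # since coarse = (even + odd) / 2
        x = np.empty(even.size * 2)
        x[0::2], x[1::2] = even, odd
    return x

sig = np.sin(np.linspace(0, 4 * np.pi, 1024))
coarsest, levels = multilevel_reduce(sig.copy(), tol=1e-3)
recon = multilevel_expand(coarsest, levels)
kept = sum(int(np.count_nonzero(d)) for d in levels)
print(f"nonzero details kept: {kept}/{sig.size - 1}")
```

Reading back only the first few (coarsest) levels yields a low-resolution but immediately usable representation, which is the "partial decompression" property the slide describes.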
Manifold learning for change detection and adaptive sampling
• A single molecular dynamics trajectory can generate 32 PB
• Use online data analysis to detect relevant or significant events
• Project MD trajectories over time into a two-dimensional manifold space (dimensionality reduction)
• Change detection in manifold space is more robust than in the original full coordinate space, as it removes local vibrational noise
• Apply an adaptive sampling strategy based on accumulated changes of trajectories
[Figure: low-dimensional manifold projection of different states of MD trajectories]
Shinjae Yoo
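The projection used in the real pipeline is a manifold-learning method; as an illustrative stand-in, the sketch below uses a plain PCA projection to 2-D on synthetic "trajectory frames" and flags the frame where the projected trajectory jumps:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic trajectory: 300 frames x 90 coordinates, with a state change at frame 150.
frames = rng.normal(scale=0.05, size=(300, 90))
frames[150:] += 1.0                                   # conformational shift
frames += rng.normal(scale=0.05, size=frames.shape)   # local "vibrational" noise

# PCA via SVD of the centered data; keep the top 2 components.
centered = frames - frames.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vt[:2].T                            # (300, 2) manifold coordinates

# Change detection: distance between consecutive projected frames.
jumps = np.linalg.norm(np.diff(proj, axis=0), axis=1)
change_frame = int(np.argmax(jumps)) + 1
print(f"detected change at frame {change_frame}")     # expect 150
```

In the 2-D projection the per-coordinate noise largely averages out while the coherent shift survives, which is the robustness argument made on the slide.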
Tracking blobs in XGC fusion simulations
A method to extract, track, and visualize blobs in large-scale 5D gyrokinetic Tokamak simulations. Blobs, regions of high turbulence that can damage the Tokamak, can run along the edge wall down toward the divertor and damage it. Blob extraction and tracking enables the exploration and analysis of high-energy blobs across timesteps. Our new visualizations will help scientists understand the behavior of blob dynamics in greater detail than previously possible.
Research details:
• Access data at high performance with ADIOS I/O
• Precondition the input data with robust PCA
• Detect blobs as local extrema with topology analysis
• Track blobs over time with the combinatorial feature flow field method
[Figures: data preconditioning with robust PCA; critical points extracted with topology analysis; tracking graph visualizing the dynamics of blobs (birth, merge, split, and death) over time]
Hanqi Guo, Tom Peterka
Reduction for visualization
“An extreme scale simulation … calculates temperature and density over 1000 time steps. For both variables, a scientist would like to visualize 10 isosurface values and X, Y, and Z cut planes for 10 locations in each dimension. One hundred different camera positions are also selected, in a hemisphere above the dataset pointing towards the data set. We will run the in situ image acquisition for every time step. These parameters will produce: 2 variables x 1000 time steps x (10 isosurface values + 3 x 10 cut planes) x 100 camera positions x 3 images (depth, float, lighting) = 2.4 x 10^7 images.”
J. Ahrens et al., SC’14
Raw state: 10^3 time steps x 10^15 B per time step = 10^18 B. Images: 2.4 x 10^7 images x 1 MB/image (megapixel, 4 B) ≈ 2.4 x 10^13 B.
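The image-count arithmetic quoted above checks out directly (assuming, as the excerpt does, roughly 1 MB per stored image):

```python
# Check the image-count and data-volume arithmetic from Ahrens et al.
variables = 2
time_steps = 1000
isosurfaces = 10
cut_planes = 3 * 10          # X, Y, Z planes, 10 locations each
cameras = 100
image_types = 3              # depth, float, lighting

images = variables * time_steps * (isosurfaces + cut_planes) * cameras * image_types
print(f"{images:.1e} images")            # 2.4e+07

raw_bytes = time_steps * 10**15          # full state, 10^15 B per step
image_bytes = images * 10**6             # ~1 MB per image
print(f"reduction factor ~{raw_bytes / image_bytes:.0f}x")
```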
Fusion whole device model
[Diagram: XGC and GENE codes coupled via an interpolator]
• XGC output: 100+ PB; PB/day on Titan today, 10+ PB/day in the future
• GENE output: 10 TB/day on Titan today, 100+ TB/day in the future
• Each analysis reads 10-100 PB
http://bit.ly/2fcyznK
Fusion whole device model
[Diagram: XGC and GENE coupled via an interpolator, each followed by a reduction stage; reduced XGC and GENE outputs feed per-code and comparative visualizations, and land on a storage hierarchy of NVRAM, parallel file system (PFS), and tape]
http://bit.ly/2fcyznK
Fusion whole device model
Integrates multiple technologies:
• ADIOS staging (DataSpaces) for coupling
• Sirius (ADIOS + Ceph) for storage
• ZFP, SZ, Dogstar for reduction
• VTK-M services for visualization
• TAU for instrumenting the code
• Cheetah + Savanna to test the different configurations (same node, different node, hybrid combination) to determine where to place the different services
• Flexpath for staged write from XGC to storage
• Ceph + ADIOS to manage the storage hierarchy
• Swift for workflow automation
[Diagram: the coupled XGC-GENE pipeline as before, now instrumented with TAU and feeding a performance visualization; Cheetah + Savanna drive the codesign experiments]
Savannah: Swift workflows coupled with ADIOS
[Diagram: co-design experiment architecture. A science app, reduce, analysis, and Z-Check components run as a multi-node workflow, communicating application data over ADIOS; the app's ADIOS output is duplicated to feed the downstream components.]
• Cheetah: experiment configuration and dispatch; job launch; user monitoring and control of multiple pipeline instances; stores experiment metadata
• Chimbuko: captures co-design performance data
• Other co-design output (e.g., Z-Checker) is collected as co-design data under the CODAR campaign definition
Transformation layer
• Designed for data conversions, compression, and transformation
  • zlib, bzip2, szip, ISOBAR, ALACRITY, FastBit
• Can transform local data on each processor
• Transparent to users
  • User code reads/writes the original untransformed data
• Applications
  • Compressed output
  • Automatically indexed data
  • Local data reorganization
  • Data reduction
• Released in ADIOS 1.6 in 2013 with compression transformations
[Diagram: a user application writes variables through ADIOS; the data transform layer applies a plugin on write, the I/O transport layer stores regular and transformed variables in a BP file or staging area, and the read plugin reverses the transform on access]
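The ADIOS plugin API itself is not shown on the slide; the following toy store illustrates only the transparency property described above: user code reads and writes original data while compression happens underneath (zlib is one of the transforms the slide lists):

```python
import zlib

# A toy "transform layer": user code writes/reads plain bytes, while the
# store transparently compresses on write and decompresses on read.
# This mimics the concept only; it is not the ADIOS plugin API.
class TransformingStore:
    def __init__(self):
        self._data: dict[str, bytes] = {}

    def write(self, name: str, payload: bytes) -> None:
        self._data[name] = zlib.compress(payload)      # write-side transform

    def read(self, name: str) -> bytes:
        return zlib.decompress(self._data[name])       # read-side transform

    def stored_size(self, name: str) -> int:
        return len(self._data[name])

store = TransformingStore()
original = b"0.125 " * 10_000                           # highly redundant "variable"
store.write("Variable A", original)

# Transparency: the user sees exactly the data they wrote.
assert store.read("Variable A") == original
print(f"{len(original)} B written, {store.stored_size('Variable A')} B stored")
```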
Codesign questions to be addressed
• How can we couple multiple codes? Files, staging on the same node, different nodes, synchronous, asynchronous?
• How can we test different placement strategies for memory and performance optimizations?
• What are the best reduction technologies to allow us to capture all relevant information during a simulation, e.g., performance vs. accuracy?
• How can we create visualization services that work on the different architectures and use the data models in the codes?
• How do we manage data across storage hierarchies?
CODAR summary
• Infrastructure development and deployment
• Enable rapid composition of application and “data services” (data
reduction methods, data analysis methods, etc.)
• Support CODAR-developed and other data services
• Method development: new reduction & analysis routines
• Motif-specific: e.g., finite difference mesh vs. particles vs. finite elements
• Application-specific: e.g., reduced physics to understand deltas
• Application engagement
• Understand data analysis and reduction requirements
• Integrate, deploy, evaluate impact
https://codarcode.github.io codar-info@cels.anl.gov
Dramatic changes in HPC system geography …
… are driving new application structures …
… resulting in exciting new computer science challenges
Thanks to US Department of Energy and CODAR team
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
 
RAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme Scales
 
HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores 
 
Cognitive Engine: Boosting Scientific Discovery
Cognitive Engine:  Boosting Scientific DiscoveryCognitive Engine:  Boosting Scientific Discovery
Cognitive Engine: Boosting Scientific Discovery
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013
 

Plus de Ian Foster

Global Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxGlobal Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxIan Foster
 
The Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionThe Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionIan Foster
 
Better Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumBetter Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumIan Foster
 
ESnet6 and Smart Instruments
ESnet6 and Smart InstrumentsESnet6 and Smart Instruments
ESnet6 and Smart InstrumentsIan Foster
 
Linking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationLinking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationIan Foster
 
A Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific DiscoveryA Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific DiscoveryIan Foster
 
Foster CRA March 2022.pptx
Foster CRA March 2022.pptxFoster CRA March 2022.pptx
Foster CRA March 2022.pptxIan Foster
 
Big Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceBig Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceIan Foster
 
Data Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationData Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationIan Foster
 
Research Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryResearch Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryIan Foster
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterIan Foster
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light SourcesIan Foster
 
Team Argon Summary
Team Argon SummaryTeam Argon Summary
Team Argon SummaryIan Foster
 
Thoughts on interoperability
Thoughts on interoperabilityThoughts on interoperability
Thoughts on interoperabilityIan Foster
 
NIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasNIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasIan Foster
 
Software Infrastructure for a National Research Platform
Software Infrastructure for a National Research PlatformSoftware Infrastructure for a National Research Platform
Software Infrastructure for a National Research PlatformIan Foster
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...Ian Foster
 
Globus Auth: A Research Identity and Access Management Platform
Globus Auth: A Research Identity and Access Management PlatformGlobus Auth: A Research Identity and Access Management Platform
Globus Auth: A Research Identity and Access Management PlatformIan Foster
 
Streamlined data sharing and analysis to accelerate cancer research
Streamlined data sharing and analysis to accelerate cancer researchStreamlined data sharing and analysis to accelerate cancer research
Streamlined data sharing and analysis to accelerate cancer researchIan Foster
 

Plus de Ian Foster (20)

Global Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxGlobal Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptx
 
The Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionThe Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, Evolution
 
Better Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumBetter Information Faster: Programming the Continuum
Better Information Faster: Programming the Continuum
 
ESnet6 and Smart Instruments
ESnet6 and Smart InstrumentsESnet6 and Smart Instruments
ESnet6 and Smart Instruments
 
Linking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationLinking Scientific Instruments and Computation
Linking Scientific Instruments and Computation
 
A Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific DiscoveryA Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific Discovery
 
Foster CRA March 2022.pptx
Foster CRA March 2022.pptxFoster CRA March 2022.pptx
Foster CRA March 2022.pptx
 
Big Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceBig Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental Science
 
Data Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationData Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud Automation
 
Research Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryResearch Automation for Data-Driven Discovery
Research Automation for Data-Driven Discovery
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and Jupyter
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light Sources
 
Team Argon Summary
Team Argon SummaryTeam Argon Summary
Team Argon Summary
 
Thoughts on interoperability
Thoughts on interoperabilityThoughts on interoperability
Thoughts on interoperability
 
NIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasNIH Data Commons Architecture Ideas
NIH Data Commons Architecture Ideas
 
Software Infrastructure for a National Research Platform
Software Infrastructure for a National Research PlatformSoftware Infrastructure for a National Research Platform
Software Infrastructure for a National Research Platform
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
 
Globus Auth: A Research Identity and Access Management Platform
Globus Auth: A Research Identity and Access Management PlatformGlobus Auth: A Research Identity and Access Management Platform
Globus Auth: A Research Identity and Access Management Platform
 
Streamlined data sharing and analysis to accelerate cancer research
Streamlined data sharing and analysis to accelerate cancer researchStreamlined data sharing and analysis to accelerate cancer research
Streamlined data sharing and analysis to accelerate cancer research
 

Dernier

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 

Dernier (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

Online Data Analysis and Reduction at Extreme Scales

  • 1. 1 Computing Just What You Need: Online Data Analysis and Reduction at Extreme Scales. Ian Foster, Argonne & U. Chicago. August 31, 2017. EuroPar, Santiago de Compostela. https://www.researchgate.net/publication/317703782
  • 2. 2 What I won’t talk about: globus.org. 5 major services; 13 national labs use Globus; 300 PB transferred; 10,000 active endpoints; 50 Bn files processed; 70,000 registered users; 99.5% uptime; 65+ institutional subscribers; 1 PB largest single transfer to date; 3 months longest continuously managed transfer; 300+ federated campus identities; 12,000 active users/year
  • 4. 4 Three messages Dramatic changes in HPC system geography … … are driving new application structures … … resulting in exciting new computer science challenges
  • 5. 5 Geography: (Part of) what determines how long it takes to get from A to B
  • 6. 6 Geography: (Part of) what determines how long it takes to get from A to B. The memory hierarchy plays a big role in computing geography
  • 7. 7 Geography: (Part of) what determines how long it takes to get from A to B • Computing geography is changing rapidly • Despite continued exponential growth in many technologies • Different rates mean that resources are getting farther away. (Figure: relative improvement trends, ~1980–2000; Patterson, CACM, 2004: CPU high, disk low)
  • 8. 8 A. C. Bauer et al., EuroVis 2016. (Figure: Titan supercomputer)
  • 13. 13 Exascale climate goal: Ensembles of 1 km models at 15 simulated years/24 hours. Full state once per model day → 260 TB every 16 seconds → 1.4 EB/day
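The slide's data-rate figures can be checked with a few lines of arithmetic: one 260 TB snapshot every 16 seconds, sustained for a day.

```python
# Back-of-envelope check of the exascale climate I/O figures:
# a full 260 TB model state written every 16 wall-clock seconds.
SECONDS_PER_DAY = 24 * 60 * 60

state_bytes = 260e12           # 260 TB per snapshot
interval_s = 16                # one snapshot every 16 seconds

snapshots_per_day = SECONDS_PER_DAY / interval_s
bytes_per_day = state_bytes * snapshots_per_day

print(f"{snapshots_per_day:.0f} snapshots/day")
print(f"{bytes_per_day / 1e18:.2f} EB/day")   # ~1.40 EB/day, matching the slide
```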
  • 14. 14 Model selection in deep learning. Evaluate 1M alternative models, each with 100M parameters → 10^14 parameter values. https://de.mathworks.com/company/newsletters/articles/cancer-diagnostics-with-deep-learning-and-photonic-time-stretch.html
  • 15. 15 Real-time analysis and experimental steering • Current experimental protocols typically process and validate data only after an experiment has completed, which can lead to undetected errors and prevents online steering • We built an autonomous stream processing system that allows data streamed from beamline computers to be processed in real time on a remote supercomputer, with a control feedback loop used to make decisions during experimentation • The system has been tested in a real-world setting at the TXM beamline (32-ID@APS) while performing a cement wetting experiment (2 experiments, each with 8 hours of data acquisition time). (Figures: sustained projections/second vs. circular buffer size and reconstruction frequency; image quality (similarity score) vs. number of streamed projections; reconstructed image sequence.) Tekin Bicer et al., eScience 2017
  • 16. 16 Other examples • Materials science • Billion-atom atomistic simulations with femtosecond time steps • Simulations may run for simulated seconds • Want to study vibrational responses at 10s of femtoseconds • Fusion science • Full-device simulations may generate 100 PBs • Need to reduce 1000:1 for effective output • Eventual goal is real-time response during fusion experiments
  • 17. 17 HPC applications: Synopsis. Dimensions: single program vs. multiple programs; offline vs. online analysis; many tasks (reliable or unreliable, loosely or tightly coupled, static or dynamic); simulation + analysis; multiple simulations; multiple simulations + analyses. New challenge: efficient logistics! • “Amateurs talk strategy while professionals study logistics” – Robert Barrow • “The line between disorder and order lies in logistics...” – Sun Tzu
  • 18. 18 The need for online data analysis and reduction. Traditional approach: simulate, output, analyze. Write simulation output to secondary storage; read back for analysis. Decimate in time when the simulation output rate exceeds the output rate of the computer. Online: y = F(x). Offline: a = A(y), b = B(y), …
  • 19. 19 The need for online data analysis and reduction (continued). New approach: online data analysis & reduction. Co-optimize simulation, analysis, and reduction for performance and information output. Substitute CPU cycles for I/O via data (de)compression and/or online data analysis. (a) Online: a = A(F(x)), b = B(F(x)), … (b) Online: r = R(F(x)); Offline: a = A′(r), b = B′(r), or a = A(U(r)), b = B(U(r)) [R = reduce, U = un-reduce]
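The online/offline patterns above can be sketched with toy stand-ins. Here F, R, U, and A are illustrative placeholders (F = simulation, R = reduce by subsampling, U = un-reduce by interpolation, A = analysis), not any specific CODAR API.

```python
# Toy sketch of "Online: r = R(F(x)); Offline: a = A(U(r))" vs. "a = A(F(x))".
import math

def F(n):                      # "simulation": n samples of a smooth field
    return [math.sin(0.01 * i) for i in range(n)]

def R(y, k=8):                 # reduce: keep every k-th sample
    return y[::k], k, len(y)

def U(r):                      # un-reduce: linear interpolation to full size
    kept, k, n = r
    out = []
    for i in range(n):
        j, t = divmod(i, k)
        lo = kept[min(j, len(kept) - 1)]
        hi = kept[min(j + 1, len(kept) - 1)]
        out.append(lo + (t / k) * (hi - lo))
    return out

def A(y):                      # analysis: mean of the field
    return sum(y) / len(y)

y = F(10_000)
a_offline = A(y)               # traditional: analyze the full output
r = R(y)                       # new: store only the reduced form...
a_from_reduced = A(U(r))       # ...and analyze the reconstruction offline
print(a_offline, a_from_reduced)
```

The point of the pattern is that the stored `r` is 8x smaller than `y` while the analysis result is nearly unchanged; real reduction operators trade accuracy against size far more carefully.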
  • 20. 20 But reduction comes with challenges • Handling high entropy • Performance – no benefit otherwise • Not only errors in the variable itself: E ≡ ‖f − f̂‖ • Must also consider the impact on derived quantities: E_g ≡ ‖g(f(x, t)) − g(f̂(x, t))‖ S. Klasky
  • 21. 21 Data reduction challenges Key research challenge: How to manage the impact of errors on derived quantities? Where did it go??? S. Klasky
  • 22. 22 CODAR: Center for Online Data Analysis and Reduction A U.S. Department of Energy Exascale Computing Program Codesign Center CODAR Data services Exascale platforms Applications
  • 23. 23 Infrastructure – Matthew Wolf (Lead) • Cheetah: Bryce Allen, Kshitij Mehta, Tahsin Kurc, Li Tang • Savannah: Justin Wozniak, Manish Parashar, Philip Davis • Chimbuko: Abid Malik, Line Pouchard Data Reduction – Franck Cappello (Lead) • Multilevel: Mark Ainsworth, Ozan Tugluk, Jong Choi • Z-checker: Julie Bessac, Sheng Di Data Analysis – Shinjae Yoo (Lead) • Blobs: Tom Peterka, Hanqi Guo • Hierarchical: Stefan Wild, Wendy Di • Functional: George Ostrouchov • Visual Analytics: Klaus Mueller, Wei Xu Management – Ian Foster (Lead) • Scott Klasky • Kerstin Kleese van Dam • Todd Munson (Project Management)
  • 24. 24 Cross-cutting research questions • What are the best data analysis and reduction algorithms for different application classes, in terms of speed, accuracy, and resource requirements? • How can we implement those algorithms to achieve scalability and performance portability? • What are the tradeoffs in data analysis accuracy, resource needs, and overall application performance between using various data reduction methods to reduce file size prior to offline data reconstruction and analysis vs. performing more online data analysis? How do these tradeoffs vary with hardware and software choices? • How do we effectively orchestrate online data analysis and reduction to reduce associated overheads? How can hardware and software help with orchestration?
  • 25. 25 Prototypical CODAR data analysis and reduction pipeline. Components: running simulation; CODAR data API; CODAR runtime; I/O system; reduced output and reconstruction info; offline data analysis. CODAR data analysis: multivariate statistics, feature analysis, outlier detection. CODAR data reduction: application-aware transforms, encodings. CODAR data monitoring: error calculation, refinement hints. Simulation knowledge: application, models, numerics, performance optimization, …
  • 26. 26 Overarching data reduction challenges • Understanding the science requires massive data reduction • How do we reduce • The time spent in reducing the data to knowledge? • The amount of data moved on the HPC platform? • The amount of data read from the storage system? • The amount of data stored in memory, on the storage system, and moved over the WAN? • Without removing the knowledge • Requires deep dives into application post-processing routines and simulations • Goal is to create both (a) co-design infrastructure and (b) reduction and analysis routines • General: e.g., reduce N bytes to M bytes, M << N • Motif-specific: e.g., finite difference mesh vs. particles vs. finite elements • Application-specific: e.g., reduced physics allows us to understand deltas
  • 27. 27 HPC floating point compression • Current interest is in lossy algorithms, some using preprocessing • Lossless may achieve up to ~3x reduction • Compress each variable separately: ISABELA, SZ, ZFP, linear auditing, SVD, adaptive gradient methods • Several variables simultaneously: PCA, tensor decomposition, …
  • 28. 28 Lossy compression with SZ. No existing compressor can reduce hard-to-compress datasets by more than a factor of 2. Objective 1: Reduce hard-to-compress datasets by one order of magnitude. Objective 2: Add user-required error controls (error bound, shape of error distribution, spectral behavior of the error function, etc.). (Figures: NCAR atmosphere simulation output (1.5 TB); WRF hurricane simulation output; Advanced Photon Source mouse brain data; bit map of 128 floating point numbers to be compressed, resembling random noise.) Franck Cappello
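The basic shape of an error-bounded lossy compressor can be sketched in a few lines: quantize each value within a user-set absolute error bound, then hand the result to a generic lossless coder. This is only in the spirit of SZ/ZFP; the real algorithms use prediction and far more sophisticated encoding.

```python
# Illustrative error-bounded lossy compression: quantize to within eps,
# delta-encode, then apply zlib. A sketch, not the actual SZ algorithm.
import math, struct, zlib

eps = 1e-4                                   # user-required absolute error bound
data = [math.sin(0.001 * i) + 0.1 * math.sin(0.05 * i) for i in range(100_000)]

raw = struct.pack(f"{len(data)}d", *data)    # 8 bytes per double

# quantize to integer multiples of 2*eps; delta-encode to expose smoothness
q = [round(v / (2 * eps)) for v in data]
deltas = [q[0]] + [q[i] - q[i - 1] for i in range(1, len(q))]
packed = struct.pack(f"{len(deltas)}i", *deltas)
compressed = zlib.compress(packed, 9)

# decompression: prefix-sum the deltas and rescale
decoded = []
acc = 0
for d in deltas:
    acc += d
    decoded.append(acc * 2 * eps)

max_err = max(abs(a - b) for a, b in zip(data, decoded))
print(f"ratio {len(raw) / len(compressed):.1f}x, max error {max_err:.1e} <= {eps:.0e}")
```

On smooth data like this, the quantized deltas are small and compress well; on noise-like data they do not, which is exactly the "hard to compress" problem the slide describes.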
  • 29. 29 Lossy compression: atmospheric simulation. (Figure: compression results with the latest SZ.) Franck Cappello
  • 30. 30 Characterizing compression error: error distribution; spectral behavior; Laplacian (derivatives); autocorrelation of errors; respect of error bounds; error propagation. (Figures: error amplitude vs. frequency; maximum and average compression error per variable for SZ and ZFP.) Franck Cappello
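A few of these diagnostics are easy to compute. The sketch below checks the error bound, the error distribution moments, and the lag-1 autocorrelation of the point-wise errors for a toy quantizer; note that uniform quantization of a smooth, slowly varying signal leaves strongly correlated (non-noise-like) errors, which is one reason tools like Z-checker inspect this.

```python
# Error diagnostics for a toy error-bounded quantizer: bound check,
# distribution moments, lag-1 autocorrelation of the error signal.
import math

eps = 1e-3
f = [math.sin(0.01 * i) for i in range(10_000)]
f_hat = [round(v / (2 * eps)) * (2 * eps) for v in f]
e = [a - b for a, b in zip(f, f_hat)]

n = len(e)
mean = sum(e) / n
var = sum((v - mean) ** 2 for v in e) / n
lag1 = sum((e[i] - mean) * (e[i + 1] - mean) for i in range(n - 1)) / (n * var)

print(f"max |error|    : {max(abs(v) for v in e):.2e} (bound {eps:.0e})")
print(f"mean / std     : {mean:.2e} / {math.sqrt(var):.2e}")
print(f"lag-1 autocorr : {lag1:.2f}")
```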
  • 31. 31 Z-checker: Analysis of data reduction error • Community tool to enable comprehensive assessment of lossy data reduction error: • Collection of data quality criteria from applications • Community repository for datasets, reduction quality requirements, compression performance • Modular design enables contributed analysis modules (C and R) and format readers (ADIOS, HDF5, etc.) • Off-line/on-line parallel statistical, spectral, point-wise distortion analysis with static & dynamic visualization Franck Cappello, Julie Bessac, Sheng Di
  • 32. 32 Science-driven decompositions • Information-theoretically derived methods like SZ, ISABELA, and ZFP make for good generic capabilities • If scientists can provide additional details on how to determine features of interest, we can use those to drive further optimizations. E.g., if they can select: • Regions of high gradient • Regions near turbulent flow • Particles with velocities > two standard deviations • How can scientists help define features?
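Two of the selections listed above can be sketched directly: keep only particles whose speed exceeds two standard deviations, and keep only grid cells in the high-gradient tail. The thresholds, field shape, and particle distribution are illustrative, not drawn from any real application.

```python
# Science-driven selection sketches: outlier particles and high-gradient cells.
import math, random

random.seed(42)

# particles: keep speeds > mean + 2*std
speeds = [abs(random.gauss(1.0, 0.3)) for _ in range(100_000)]
mu = sum(speeds) / len(speeds)
sd = math.sqrt(sum((s - mu) ** 2 for s in speeds) / len(speeds))
fast = [s for s in speeds if s > mu + 2 * sd]
print(f"kept {len(fast)} of {len(speeds)} particles "
      f"({100 * len(fast) / len(speeds):.1f}%)")

# 1-D field: keep cells whose |gradient| is in the top decile
field = [math.tanh((i - 500) / 20) for i in range(1000)]   # a sharp front
grad = [abs(field[i + 1] - field[i - 1]) / 2 for i in range(1, len(field) - 1)]
cut = sorted(grad)[int(0.9 * len(grad))]
keep = [i for i, g in enumerate(grad, start=1) if g >= cut]
print(f"kept {len(keep)} of {len(field)} cells near the front")
```

Either selection discards ~90–98% of the data while retaining exactly the features the scientist asked for, which is the point of pushing such predicates into the reduction step.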
  • 33. 33 Multilevel compression techniques. A hierarchical reduction scheme produces multiple levels of partial decompression of the data, so that users can work with reduced representations that require minimal storage whilst achieving the user-specified tolerance. (Figures: compression vs. user-specified tolerance; results for a turbulence dataset: extremely large, inherently non-smooth, resistant to compression.) Mark Ainsworth
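The idea of a level hierarchy can be sketched with the simplest possible construction: repeatedly average pairs to build coarser levels, and let a reader reconstruct from the coarsest level that meets its tolerance. This toy is piecewise-constant and illustrative only; the actual multilevel schemes are mathematically far more sophisticated.

```python
# Minimal multilevel sketch: a mean pyramid with per-level reconstruction error.
import math

def build_levels(data, n_levels):
    levels = [data]
    for _ in range(n_levels):
        d = levels[-1]
        levels.append([(d[2 * i] + d[2 * i + 1]) / 2 for i in range(len(d) // 2)])
    return levels              # levels[0] = finest ... levels[-1] = coarsest

def reconstruct(level, k):     # piecewise-constant upsampling by 2**k
    return [v for v in level for _ in range(2 ** k)]

data = [math.sin(0.002 * i) for i in range(4096)]
levels = build_levels(data, 6)

errs = []
for k, lev in enumerate(levels):
    approx = reconstruct(lev, k)
    errs.append(max(abs(a - b) for a, b in zip(data, approx)))
    print(f"level {k}: {len(lev):5d} values, max error {errs[-1]:.2e}")
```

A user with a loose tolerance reads only the small coarse level; one with a tight tolerance reads further down the hierarchy, which is the access pattern the slide describes.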
  • 34. 34 Manifold learning for change detection and adaptive sampling. Low-dimensional manifold projection of different states of MD trajectories • A single molecular dynamics trajectory can generate 32 PB • Use online data analysis to detect relevant or significant events • Project MD trajectories to manifold space (dimensionality reduction) across time into a two-dimensional space • Change detection in manifold space is more robust than in the original full coordinate space, as it removes local vibrational noise • Apply an adaptive sampling strategy based on accumulated changes of trajectories Shinjae Yoo
  • 35. 35 Critical points extracted with topology analysis: tracking blobs in XGC fusion simulations
    A method to extract, track, and visualize blobs in large-scale 5D gyrokinetic Tokamak simulations. Blobs, regions of high turbulence that can damage the Tokamak, can run along the edge wall down toward the divertor and damage it. Blob extraction and tracking enables the exploration and analysis of high-energy blobs across timesteps. Our new visualizations will help scientists understand the behavior of blob dynamics in greater detail than previously possible.
    Research details:
    • Access data with high-performance ADIOS I/O
    • Precondition the input data with robust PCA
    • Detect blobs as local extrema with topology analysis
    • Track blobs over time with the combinatorial feature flow field method
    Hanqi Guo, Tom Peterka
    [Figures: tracking graph visualizing the dynamics of blobs (birth, merge, split, and death) over time; data preconditioning with robust PCA]
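As a stand-in for the topology-based critical-point extraction (the real method operates on 5D gyrokinetic data), the "detect blobs as local extrema" step can be sketched on a 2-D slice as finding strict local maxima above a threshold; all names here are illustrative:

```python
import numpy as np

def detect_blobs(field, threshold):
    """Find strict local maxima above a threshold in a 2-D field."""
    blobs = []
    for i in range(1, field.shape[0] - 1):
        for j in range(1, field.shape[1] - 1):
            nbhd = field[i - 1:i + 2, j - 1:j + 2]   # 3x3 neighborhood
            if (field[i, j] >= threshold
                    and field[i, j] == nbhd.max()
                    and (nbhd == field[i, j]).sum() == 1):  # strict maximum
                blobs.append((i, j))
    return blobs
```

Tracking would then associate extrema across timesteps, which is where the combinatorial feature flow field method comes in.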
  • 36. 36 Reduction for visualization
    "an extreme scale simulation … calculates temperature and density over 1000 time steps. For both variables, a scientist would like to visualize 10 isosurface values and X, Y, and Z cut planes for 10 locations in each dimension. One hundred different camera positions are also selected, in a hemisphere above the dataset pointing towards the data set. We will run the in situ image acquisition for every time step. These parameters will produce: 2 variables x 1000 time steps x (10 isosurface values + 3 x 10 cut planes) x 100 camera positions x 3 images (depth, float, lighting) = 2.4 x 10^7 images." J. Ahrens et al., SC'14
    10^3 time steps x 10^15 B state per time step = 10^18 B
    2.4 x 10^7 images x 1 MB/image (megapixel, 4 B/pixel) = 2.4 x 10^13 B
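The arithmetic quoted from Ahrens et al. checks out, and is worth spelling out because it is the whole point of the slide: rendering images in situ shrinks the exascale output by roughly five orders of magnitude versus saving full state.

```python
# Sanity check of the image-count arithmetic from Ahrens et al., SC'14
variables = 2
time_steps = 1000
isosurface_values = 10
cut_planes = 3 * 10            # X, Y, Z cut planes, 10 locations each
camera_positions = 100
images_per_view = 3            # depth, float, lighting

images = (variables * time_steps * (isosurface_values + cut_planes)
          * camera_positions * images_per_view)

bytes_images = images * 1_000_000        # ~1 MB per megapixel image
bytes_state = 1000 * 10**15              # full state once per time step
```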
  • 37. 37 Fusion whole device model
    [Diagram: XGC coupled to GENE through an Interpolator; outputs feed multiple Analysis tasks and 100+ PB of storage]
    • PB/day on Titan today; 10+ PB/day in the future
    • 10 TB/day on Titan today; 100+ TB/day in the future
    • Each analysis reads 10-100 PB
    http://bit.ly/2fcyznK
  • 39. 39 Fusion whole device model: integrates multiple technologies
    • ADIOS staging (DataSpaces) for coupling
    • Sirius (ADIOS + Ceph) for storage
    • ZFP, SZ, Dogstar for reduction
    • VTK-M services for visualization
    • TAU for instrumenting the code
    • Cheetah + Savanna to test the different configurations (same node, different node, hybrid combination) to determine where to place the different services
    • Flexpath for staged writes from XGC to storage
    • Ceph + ADIOS to manage the storage hierarchy
    • Swift for workflow automation
    [Diagram: XGC and GENE coupled via an Interpolator, each with Reduction, TAU instrumentation, and per-code visualization of its output; comparative and performance visualization downstream; data placed across NVRAM, PFS, and tape; Cheetah + Savanna drive codesign experiments]
  • 40. 40 Co-design experiment architecture
    Savanna: Swift workflows coupled with ADIOS; multi-node workflow components communicate over ADIOS.
    [Diagram: a CODAR campaign definition drives Cheetah, which handles experiment configuration, dispatch, and job launch, plus user monitoring and control of multiple pipeline instances; application data flows Science App → Reduce → Analysis over ADIOS; Chimbuko captures co-design performance data; experiment metadata and other co-design output (e.g., Z-Checker) are stored]
  • 41. 41 Transformation layer
    • Designed for data conversions, compression, and transformation: zlib, bzip2, szip, ISOBAR, ALACRITY, FastBit
    • Can transform local data on each processor
    • Transparent for users: user code reads/writes the original, untransformed data
    • Applications: compressed output, automatically indexed data, local data reorganization, data reduction
    • Released in ADIOS 1.6 in 2013 with compression transformations
    [Diagram: a user application writes ADIOS variables; the data transform layer applies read/write transform plugins before the I/O transport layer stores regular or transformed variables in a BP file, staging area, etc.]
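The "transparent for users" design can be sketched in a few lines: the user writes and reads plain arrays, while a plugin (here zlib) transforms the bytes on the way to storage. ADIOS itself is a C/C++ library; this Python class and its names are purely illustrative of the pattern:

```python
import zlib
import numpy as np

class TransformLayer:
    """Illustrative transparent transform layer: users see original data,
    storage holds the transformed (compressed) representation."""

    def __init__(self):
        self._store = {}   # stands in for a BP file / staging area

    def write(self, name, array):
        """User hands over the original array; the plugin compresses it."""
        self._store[name] = (zlib.compress(array.tobytes()),
                             array.dtype, array.shape)

    def read(self, name):
        """User reads back the original, untransformed data."""
        blob, dtype, shape = self._store[name]
        return np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)
```

Because the transform is applied per variable behind the write/read calls, swapping zlib for another plugin requires no change to user code, which is the point of the layered design.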
  • 42. 42 Codesign questions to be addressed
    • How can we couple multiple codes? Files, staging on the same node, different nodes, synchronous, asynchronous?
    • How can we test different placement strategies for memory and performance optimization?
    • What are the best reduction technologies to allow us to capture all relevant information during a simulation? E.g., performance vs. accuracy.
    • How can we create visualization services that work on the different architectures and use the data models in the codes?
    • How do we manage data across storage hierarchies?
  • 43. 43 CODAR summary
    • Infrastructure development and deployment
      • Enable rapid composition of applications and "data services" (data reduction methods, data analysis methods, etc.)
      • Support CODAR-developed and other data services
    • Method development: new reduction & analysis routines
      • Motif-specific: e.g., finite difference mesh vs. particles vs. finite elements
      • Application-specific: e.g., reduced physics to understand deltas
    • Application engagement
      • Understand data analysis and reduction requirements
      • Integrate, deploy, evaluate impact
    https://codarcode.github.io
    codar-info@cels.anl.gov
  • 44. 44 Dramatic changes in HPC system geography …
    … are driving new application structures …
    … resulting in exciting new computer science challenges
    Thanks to the US Department of Energy and the CODAR team