Summary of 3DPAS
1. Review of 3DPAS Theme
Daniel S. Katz, University of Chicago & Argonne National Laboratory
Shantenu Jha, Rutgers University
Neil Chue Hong, University of Edinburgh
Simon Dobson, University of St. Andrews
Andre Luckow, Louisiana State University
Omer Rana, University of Cardiff
Yogesh Simmhan, University of Southern California
www.ci.anl.gov
www.ci.uchicago.edu
3. e-Science Institute (e-SI)
• A 10-year project (Aug 2001 – July 2011), located in Edinburgh
• Aimed at, but not limited to, the UK
• http://www.esi.ac.uk/
• Tagline – time & space to think
• Mission: to stimulate the creation of new insights in e-Science and computing
science by bringing together international experts and enabling them to
successfully address significant and diverse challenges
• Research themes formed the core of eSI’s activity
– Theme: connected programme of visitors, workshops and events
– Conceived and driven by Theme Leader
– Focusing on a specific issue in e-Science that crosses boundaries and raises new
research questions
– Goals:
o Identify research issues
o Rally a community of researchers
o Map a path of future research that will make best progress towards new e-Science methods
and capabilities.
3DPAS review for D3Science – d.katz@ieee.org
4. Context – Data and Science
• Data has always been important to science
• Some use the concept of paradigms
– First (thousand years ago) – empirical –
describe natural phenomena
– Second (few hundred years ago) –
theoretical – use models and generalizations
– Third (few decades ago) – computational –
solve complex problems
– Fourth (few years ago) – data exploration –
gain knowledge directly from data from experiment, theory,
simulation
• Problem – we cannot keep declaring new paradigms at an
exponentially increasing rate
• But it’s true that there is an emerging science of
“listening to data”, as defined by Jim Gray, Google, etc.
5. Distributed Programming Abstractions
• DPA theme at eSI
– http://wiki.esi.ac.uk/Distributed_Programming_Abstractions
• Series of workshops
• Led to book in progress: Shantenu Jha, Daniel S. Katz, Manish Parashar, Omer
Rana, and Jon Weissman, “Abstractions for Distributed Applications and
Systems,” to be published by Wiley in 2012
• And multiple papers, including: S. Jha, D. S. Katz, M. Parashar, O. Rana, and J.
Weissman, "Critical Perspectives on Large-Scale Distributed Applications and
Production Grids," (Best Paper Award Winner), Proceedings of the 10th
IEEE/ACM International Conference on Grid Computing (Grid 2009), 2009.
• Idea – start with distributed science and engineering applications – analyze
them (determine “vectors”); examine interaction with infrastructures and
tools; find abstractions
– Tech report on infrastructures (much of Chapter 3) available now:
http://www.ci.uchicago.edu/research/papers/CI-TR-7-0811
– Vectors: Execution Unit, Coordination, Communication, Execution Environment
• In the process, we realized that data intensive applications had some unique
challenges and issues
6. Dynamic Distributed Data-intensive Programming
Systems and Applications (3DPAS)
• This led to 3DPAS theme at eSI
– http://wiki.esi.ac.uk/3DPAS
• Similar idea to DPA
– Start with science and engineering applications
– See if the DPA vectors suffice or if new vectors are
needed
– Examine what is different with respect to
infrastructures and programming systems
• Initially done through workshops at eSI
• Continuing through weekly teleconferences
• Driving towards a report/paper
7. D3 (data intensive, distributed, dynamic)
• Data intensive: characterized by the orders of magnitude of the data and of the computing, e.g.,
– Exascale data and petascale computing
– Petascale data and exascale computing
– Exascale data and exascale computing
• Distributed: number, dispersion, and replication of distributed data or
computation resources
– Low in a cloud or cluster that resides in a single building
– High in a grid that spans multiple geographically-separated administrative
domains, or multiple data centers
• Dynamic: perhaps both data and computation
– Data may emerge at runtime
– Mechanisms to handle data during application execution, e.g., data
transfer, scheduling
– Application components may be launched at runtime in response to
data, application, or environment dynamics
• All may vary in different stages of an application
• Most applications have data collection, storage, analysis stages
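As a rough illustration of the definitions above, the three D3 axes can be captured per application stage; this sketch is our own (the record and its field values are not from the theme), assuming the collection/storage/analysis stages from the last bullet:

```python
from dataclasses import dataclass

# Hypothetical sketch: describe each D3 axis per application stage.
# The axis names come from the slide; the encoding is ours.

@dataclass(frozen=True)
class D3Profile:
    data_intensive: str   # e.g., "petascale data", "exascale data"
    distributed: str      # "low" (single cluster) .. "high" (multi-domain grid)
    dynamic: bool         # data or components may appear at runtime

# An application may have a different profile in each stage.
stages = {
    "collection": D3Profile("exascale data", "high", True),
    "storage":    D3Profile("exascale data", "high", False),
    "analysis":   D3Profile("petascale data", "low", True),
}

# Which stages must a programming system treat as dynamic?
dynamic_stages = [name for name, p in stages.items() if p.dynamic]
```

This makes the slide's point concrete: the same application can sit at different points along each axis in different stages.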
8. Value/Impact
• Not all data-intensive applications have
dynamic and distributed elements today
• However, as scales increase, applications will
have to be distributed and dynamic
– And these issues will be increasingly correlated
• Analyzing current D3 applications should impact
many future applications
– And lead to lessons about and requirements on
future infrastructures and programming systems
9. Applications Process
• Asked questions about possible applications
1. What is the purpose of the application?
2. How is the application used to do this?
3. What infrastructure is used? (including compute, data, network,
instruments, etc.)
4. What dynamic data is used in the application?
a. What are the types of data?
b. What is the size of the data set(s)?
5. How does the application get the data?
6. What are the time (or quality) constraints on the application?
7. How much diverse data integration is involved?
8. How diverse is the data?
9. Please feel free to also talk about the current state of the
application, if it exists today, and any specific gaps that you know
need to be overcome
10. Applications Process (2)
• In workshops, discussed current applications, and
considered whether a new application “felt” the same as a
previous application in terms of the answers to the
questions
• Came to 14 applications
• Noted they fall into different categories
– Traditional applications, single program that is run by a user
– Archetypical applications: a group of applications,
independent programs, written by different authors, may be
competing, usually not intended to run together
– Infrastructural applications: set of applications (or
archetypical applications) that need to be run in series
(perhaps in different phases), may be run by different groups
that do not frequently interact
11. Applications
Application | Area | Type | Lead Person/Site
Metagenomics | Biosciences | Archetypical | Amsterdam Medical Centre, Netherlands
ATLAS experiment (WLCG) | Particle Physics | Infrastructural | CERN & Daresbury Lab + RAL, UK
Large Synoptic Sky Survey (LSST) | Astrophysics | Infrastructural | University of Edinburgh – Institute of Astronomy, UK
Virtual Astronomy | Astrophysics | Archetypical | University of Edinburgh – Institute of Astronomy, UK
Cosmic Microwave Background | Astrophysics | Traditional | Lawrence Berkeley National Laboratory, USA
Marine (Sea Mammal) Sensors | Biosciences | Infrastructural | University of St. Andrews, UK
Climate | Earth Science | Infrastructural | National Center for Atmospheric Research, USA
12. Applications (2)
Application | Area | Type | Lead Person/Site
Interactive Exploration of Environmental Data | Earth Science | Archetypical | University of Reading, UK
Power Grids Informatics | Energy | Infrastructural | University of Southern California, USA
Fusion (International Thermonuclear Experimental Reactor) | Chemistry/Physics | Traditional | Oak Ridge National Laboratory & Rutgers University, USA
Industrial Incident Notification and Response | Emergency Response | Infrastructural | THALES, The Netherlands
MODIS Data Processing | Earth Science | Traditional | Lawrence Berkeley National Laboratory, USA
Floating Sensors | Earth Science | Infrastructural | Lawrence Berkeley National Laboratory, USA
Distributed Network Intrusion Detection | Security | Infrastructural | University of Minnesota, USA
13. Climate (infrastructural)
• CMIP/IPCC process runs and analyses climate
models in 3 stages
• Data are generated by distributed HPC centers
• Data are stored by distributed ESGF gateways
and data nodes
• Data are analyzed by distributed
researchers, who search for particular
data, gather them to a site, process them
• Resources for analysis can be dynamic, as can
data stored in data nodes
Thanks: Don Middleton
14. Fusion (traditional)
• ITER needs a variety of codes
• Codes run on distributed set of leadership-class
facilities, using advance reservations to co-schedule
the simulations
• Codes read and write data files, using ADIOS and
HDF5
• Files output by each code are transformed and
transferred to be used as inputs by other
codes, linking the codes into a single coupled
simulation
• Data generated are too large to be written to disk
for post-run analysis; in-situ analysis and
visualization tools are being developed
Thanks: Scott Klasky
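The file-based coupling pattern described above can be sketched as follows; the function names and the toy "transform" are ours (a real ITER workflow uses ADIOS/HDF5 and wide-area transfer, not this stand-in):

```python
import os
import tempfile

# Sketch of file-based code coupling (illustrative, not the actual ITER
# tooling): each output file from code A is transformed and staged as an
# input file for code B, linking the two codes into one coupled run.

def transform(src_path, dst_path):
    # Stand-in for a real format/unit conversion (e.g., ADIOS -> HDF5).
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.write(src.read().upper())

def couple(outbox_a, inbox_b):
    """Transform and stage every file produced by code A for code B."""
    staged = []
    for name in sorted(os.listdir(outbox_a)):
        transform(os.path.join(outbox_a, name),
                  os.path.join(inbox_b, name))
        staged.append(name)
    return staged

# Demonstrate with temporary directories standing in for two sites.
outbox = tempfile.mkdtemp()
inbox = tempfile.mkdtemp()
with open(os.path.join(outbox, "step0.dat"), "w") as f:
    f.write("field data")
moved = couple(outbox, inbox)
```

In practice the coupling step also carries the transfer across facilities; here both "sites" are local directories to keep the sketch self-contained.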
15. Metagenomics (archetypical)
• Analysis of genome sequence data being
produced by next gen devices
• Sequencers are producing data at a rate
increasing faster than computing capability
• Sequencers are distributed; data produced
cannot all be co-located
• Multiple analyses (using different software) by
multiple users need to make best use of
available computing resources, understanding
location and access issues with respect to datasets
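One way to reason about those location and access issues is data-movement-aware site selection; this is a minimal sketch with invented dataset names, sizes, and sites, not a real metagenomics scheduler:

```python
# Illustrative sketch (our own scoring, not a real scheduler): choose the
# compute site that minimizes the data that must be moved before an
# analysis can run, given where each dataset replica already resides.

def bytes_to_move(site, datasets):
    """Total GB that must be transferred to `site` before running."""
    return sum(size for size, locations in datasets.values()
               if site not in locations)

def best_site(sites, datasets):
    """Pick the site with the most data already local."""
    return min(sites, key=lambda s: bytes_to_move(s, datasets))

# dataset -> (size in GB, replica locations); all names are hypothetical
datasets = {
    "reads_batch_1": (500, {"site_A"}),
    "reads_batch_2": (200, {"site_A", "site_B"}),
    "reference_db":  (50,  {"site_B", "site_C"}),
}
choice = best_site(["site_A", "site_B", "site_C"], datasets)
```

Here site_A wins because only the 50 GB reference database would need to move, versus 500 GB or more for the other sites.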
16. CMB (traditional)
• Cosmic Microwave Background (CMB) performs data simulation and analysis to
understand the Universe 400,000 years after the Big Bang
– Detectors take O(10^12 – 10^15) time-ordered sequences
– Observations reduced to map of O(10^6 – 10^8) sky pixels
– Pixels reduced to O(10^3 – 10^4) angular power spectrum coefficients
– Coefficients reduced to O(10) cosmological parameters
• Computationally most expensive step is from map to angular power spectrum
– Exact solution is O(pixels^3) – prohibitive
– Approximate solution: sets of O(10^4) Monte Carlo realizations of observed sky to
remove biases and quantify uncertainties, each of which involves simulating and
mapping the time-ordered data
– Map-making is applied to both real and simulated data, but O(10^4) more times to
simulated data (uses on-the-fly simulation module – simulations performed when
requested)
• Currently uses single HPC system, but would be faster with distributed systems
• Central system that builds the map would launch data simulations on available
remote resources; output data from the simulations would be asynchronously
delivered back to that central system as files and incorporated in the map as they
are produced
Thanks: Julian Borrill
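The asynchronous incorporation pattern just described can be sketched with a thread pool standing in for remote resources; the "simulation" and the tiny map are mocked, not real CMB code:

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

# Sketch (mocked locally, not real CMB code): a central system launches
# Monte Carlo sky simulations on available workers and folds each
# resulting map into a running accumulation as soon as it arrives,
# rather than waiting for the full set to finish.

def simulate_and_map(realization_id):
    # Stand-in for simulating time-ordered data and map-making.
    random.seed(realization_id)
    return [random.gauss(0, 1) for _ in range(4)]  # tiny "map"

accumulated = [0.0, 0.0, 0.0, 0.0]
completed = 0
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(simulate_and_map, i) for i in range(8)]
    for fut in as_completed(futures):        # asynchronous delivery
        for i, value in enumerate(fut.result()):
            accumulated[i] += value          # incorporate as produced
        completed += 1
```

`as_completed` is what makes the incorporation order-independent: maps are consumed in whatever order the (here simulated) remote resources deliver them.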
17. Some Additional Applications
• ATLAS/WLCG (Infrastructural)
– Hierarchy of systems; data centrally stored, and locally cached (and copied to
where they likely will be used), perhaps at various levels of the hierarchy
– Processing is done by applications that are independent of each other
– Processing of one data file is independent of processing of another file, but
groups of processing results are collected to obtain statistical outputs about the
data
• LSST (Infrastructural)
– Data taken by a telescope
– Quick analysis is done at the telescope site for interesting (urgent) events (which
may involve comparing new data with previous data)
– System can get more data from other observatories if needed; request other
observatories to take more data; or call a human
– Data are then transferred to an archive site (which may be at the observatory), where
they are analyzed, reduced, and classified; some of this work may be farmed out to grid
resources
– Detailed analysis of new data vs. archived data is performed
– Reanalysis of all data is done periodically
– Data are stored in files and databases
18. Some More Additional Applications
• Virtual Astronomy (Archetypical)
– Services are orchestrated through a pipeline, including a data retrieval
service that is used to share data across VO sites
– Data are moved through the pipeline, and intermediate and final
products can be stored in Grid storage service
• Marine (Sea Mammal) Sensors (Infrastructural)
– Data are brought to a central site when sensors periodically transmit
– Stored data are analyzed using statistical techniques, then visualized with
tools such as Google Earth
• Power Grids (Infrastructural)
– Diverse streams arrive at a central utility private cloud at dynamic rates
controlled by the application
– Real-time event detection pipeline can trigger load curtailment
operations
– Data mining is performed on current and historical data for forecasting
– Partial application execution on remote micro-grid sites is possible.
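The power-grid detection pipeline above might look like the following sliding-window sketch; the window size, threshold, and readings are invented, and a real system would trigger curtailment rather than just record the event:

```python
from collections import deque

# Hedged sketch of real-time event detection (window and limit are
# invented): a sliding window over a stream of load measurements flags
# each reading that pushes the recent average over a limit.

def detect_events(stream, window=3, limit=100.0):
    """Yield the index of each reading whose windowed mean exceeds the limit."""
    recent = deque(maxlen=window)
    for i, load in enumerate(stream):
        recent.append(load)
        if len(recent) == window and sum(recent) / window > limit:
            yield i  # in a real system: trigger load curtailment here

readings = [90, 95, 98, 120, 130, 85, 80, 70]
events = list(detect_events(readings))
```

Because `detect_events` is a generator, it consumes the stream one reading at a time, matching the streaming character of the application: no history beyond the window is kept.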
19. Even More Additional Applications
• Industrial Incident Notification and Response (Infrastructural)
– Data are streamed from diverse sources, and sometimes manually
entered into the system
– Disaster detection causes additional information sources to be
requested from that region and applications to be composed based
on available data
– Some applications run on remote sites for data privacy
– Escalation can cause more humans in the loop and additional
operations
• MODIS Data Processing (Traditional)
– Data brought into system from various FTP servers
– Pipeline of initial standardized processing steps on data is done on
clouds or HPC resources
– Scientists can then submit executables that do further custom
processing on subsets of the data, which likely include some
summarization processing (building graphs)
20. 3DPAS Vectors
• DPA vectors
– Execution Unit
– Communication
– Coordination
– Execution Environment
• What changes for D3 applications?
– DPA already assumed distributed applications; data-intensive is somewhat
orthogonal to the vectors; the remaining D is dynamic
• So, what can be dynamic?
– Data (in value or type)
– Application (for archetypical and infrastructural applications)
– Execution Environment
• And how can the application respond?
– All 3 vectors can change (under user control, or autonomically)
21. Infrastructure
• Software infrastructure to support D3 applications and users exists at three
levels:
– System-level software capabilities (e.g., notifications, file system consistency)
– Middleware (e.g., databases, metadata servers)
– Programming systems, services and tools (e.g., data-centric workflows)
• Strong connection between software infrastructure and execution units
– Infrastructure supports the communication between and coordination of
execution units, e.g., to allow co-scheduling
• What changes for D3 applications?
– Boundary between infrastructure and application often blurred
o e.g., a catalog may be provided by underlying infrastructure or implemented in application
– Sometimes infrastructure requires knowledge of data models
o e.g., to support semantic information integration, triggers, optimized data transport
• General need for infrastructure components to support
– Data management: sources, storage, access, movement, discovery, notification,
provenance
– Data analysis: conversion, enrichment, analysis, workflow, calibration, integration
22. Programming Systems
• Pipelines/workflows a key concept
• Loosely, 3 stages for many applications – data collection, data
storage, data analysis
– But the order varies: Sometimes analysis is done during collection to
reduce storage
• Some stages are built from legacy (heritage) applications
• Some applications don’t include all stages (some stages happen
elsewhere; data is just “there”)
• Stream processing is also important to some applications (or some
stages) – the complete data can never be stored, and each datum can
only be accessed once, as it passes
• Issues that programming systems should address
– Programming provisioning of resources
– Use of existing services, or building of new services
– How to adapt to changes? Autonomics?
– Recording provenance
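The single-pass constraint (analysis during collection, so the full stream is never stored) can be sketched with generators; the stage names and summary statistics are our own choices:

```python
# Sketch of single-pass stream processing (stage names are ours): the
# analysis stage runs during collection, so only a reduced summary is
# ever stored, never the full stream.

def collect():
    """Stand-in for an instrument: yields readings one at a time."""
    for reading in [3, 1, 4, 1, 5, 9, 2, 6]:
        yield reading

def analyze(stream):
    """Running summary computed in a single pass over the stream."""
    count, total, peak = 0, 0, float("-inf")
    for x in stream:
        count += 1
        total += x
        peak = max(peak, x)
    return {"count": count, "mean": total / count, "max": peak}

summary = analyze(collect())  # the raw stream is never materialized
```

Nothing here holds more than one reading at a time, which is exactly what "can only be accessed once" implies for the programming system.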
23. Programming Systems (2)
• Possible change: replace ad hoc and scripted approaches by more formal
workflow tools
– Potential benefits: efficiency, productivity, reproducibility, increased software
reuse, ability to add provenance tracking
– Potential issues: can application-specific knowledge be used by generic tools?
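Provenance tracking, one of the potential benefits listed above, can be as simple as recording each task's name, an input hash, and its output; this minimal runner is our own sketch, not any particular workflow tool:

```python
import hashlib
import json

# Minimal workflow-runner sketch (our own design, not a real workflow
# tool): each task records provenance -- its name, a short hash of its
# inputs, and its output -- so any result can be traced back to what
# produced it.

provenance = []

def run_task(name, func, *inputs):
    digest = hashlib.sha256(json.dumps(inputs).encode()).hexdigest()[:8]
    output = func(*inputs)
    provenance.append({"task": name, "inputs": digest, "output": output})
    return output

# A two-step pipeline with invented stage names: calibrate, then integrate.
calibrated = run_task("calibrate", lambda xs: [x * 2 for x in xs], [1, 2, 3])
total = run_task("integrate", sum, calibrated)
```

Even this toy version shows the trade the slide raises: the generic runner records provenance for free, but it knows nothing application-specific about what "calibrate" means.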
24. Conclusions
• D3 applications exist, and their number is increasing
• There are some similarities across some applications
– Stages, streaming, dynamism and adaptivity
– Probably means there are generic abstractions that could be used
• Programming systems are somewhat ad hoc
• We want generic tools that
– Allow applications to adapt to dynamism in various elements
o E.g., developers can find and use available systems at
runtime, applications can run in the best location with respect to data
sources
– Provide good performance
• Further research needed
– How do we abstract the set of distributed systems to allow this?
– What middleware and tools are needed?