3. Research data management and analysis challenges
• Data acquired at various locations/times
• Analyses executed on distributed resources with different capabilities
– Processing time decreases with distance
• Dynamic collaborations around data and analysis
[Diagram: raw data flows through a catalog across distributed resources – DOE lab, campus, community archive, FPGA, cloud]
4. Exacerbated by large-scale science
• Best practices overlooked, useful data forgotten, errors propagate
• Researchers allocated short periods of instrument and compute time
• Inefficiencies → less science
• Errors → long delays, missed opportunity …forever!
5. Making research data reliably, rapidly, and securely accessible, discoverable, and usable
• Automation: encode research pipelines composed of triggers and actions
• funcX: scalable function as a service for science
• Parsl: intuitive parallel programming in Python
• PolyNER: extracting scientific facts from published literature
• DLHub: model publication and inference
• MDF: publication and scraping of materials datasets
• XtractHub: deriving metadata from scientific files
• Cost-aware computing: application profiling, resource prediction, automated provisioning
• Cloud classification: identifying different types of (real) clouds in climate data
6. Ripple: A Trigger-Action platform for data
• Monitors events on various file system types
• Includes a set of triggers and actions with which to create rules
• Ripple processes data triggers and reliably executes actions
• Usable by non-experts
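The trigger-action model above can be sketched in a few lines of Python. The `Event`/`Rule` classes and the archiving rule are invented for illustration; this is not Ripple's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Event:
    kind: str      # e.g. "file_created" (illustrative event type)
    path: str

@dataclass
class Rule:
    trigger: Callable[[Event], bool]   # predicate over events
    action: Callable[[Event], None]    # side effect to run on a match

class RuleEngine:
    def __init__(self):
        self.rules = []

    def register(self, rule):
        self.rules.append(rule)

    def dispatch(self, event):
        """Run every action whose trigger matches the event."""
        for rule in self.rules:
            if rule.trigger(event):
                rule.action(event)

# Example rule: archive whenever a new .h5 file appears.
archived = []
engine = RuleEngine()
engine.register(Rule(
    trigger=lambda e: e.kind == "file_created" and e.path.endswith(".h5"),
    action=lambda e: archived.append(e.path),
))
engine.dispatch(Event("file_created", "/data/scan01.h5"))
engine.dispatch(Event("file_created", "/data/notes.txt"))
print(archived)  # ['/data/scan01.h5']
```

A real system like Ripple would feed `dispatch` from file system monitoring rather than manual calls, and execute actions reliably on remote resources.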
7. Automating the research lifecycle
• Simple state machine model
– JSON-based language
– Conditions, loops, fault tolerance, etc.
– Propagates state through the flow
• Standardized API for integrating custom event and action services
– Actions: synchronous or asynchronous
– Custom Web forms prompt for user input
• Actions secured with Globus Auth
[Diagram: platform services – Auth, Search, Manage, Execute]
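A JSON-based state machine flow of the kind described above might look like the following. The state names, action URLs, and parameters are hypothetical, loosely modeled on state-machine flow languages, not the exact production schema.

```python
import json

# Hypothetical two-step flow: transfer raw data, then run an analysis.
# Every identifier and URL here is illustrative.
flow = {
    "StartAt": "Transfer",
    "States": {
        "Transfer": {
            "Type": "Action",
            "ActionUrl": "https://example.org/transfer",  # hypothetical action service
            "Parameters": {"source": "/raw", "destination": "/archive"},
            "Next": "Analyze",       # state propagates forward through the flow
        },
        "Analyze": {
            "Type": "Action",
            "ActionUrl": "https://example.org/analyze",
            "Parameters": {"input": "/archive"},
            "End": True,
        },
    },
}

doc = json.dumps(flow, indent=2)
print(doc)
```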
8. Remote execution of scientific workloads
• Compute wherever it makes the most sense:
– Hardware or software availability, data location, analysis time, wait time, etc.
• Remote computing has always been complex and expensive
– Now we have high-speed networks, universal trust fabrics (Globus Auth), and containers
• Many scientific workloads are composed of collections of short-duration functions
– E.g., machine learning inference, real-time analyses, metadata extraction, image reconstruction, sensor stream analysis
9. funcX: High Performance Function as a Service for Science
• Endpoints deployed at resources
– Manage provisioning and scheduling of resources and data
– Scale out based on resource needs
• Cloud service routes requests to endpoints
• Singularity containers run functions securely
• Globus Auth secures communication
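The register-then-invoke pattern can be illustrated with a minimal local sketch: register a function once, then call it by ID, as the cloud service would route a request to an endpoint. The `register_function`/`run` names are invented for this toy; it is not the funcX SDK, and a real deployment adds authentication, containers, and remote endpoints.

```python
import pickle
import uuid

registry = {}  # function_id -> serialized function (toy stand-in for a service)

def register_function(func):
    """Serialize a function and file it under a fresh ID."""
    function_id = str(uuid.uuid4())
    registry[function_id] = pickle.dumps(func)
    return function_id

def run(function_id, *args, **kwargs):
    """The 'endpoint' side: deserialize the function and execute it."""
    func = pickle.loads(registry[function_id])
    return func(*args, **kwargs)

def double(x):
    return 2 * x

fid = register_function(double)
print(run(fid, 21))  # 42
```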
10. Composition and parallelism in Python
• Software is increasingly assembled rather than written
– High-level language (e.g., Python) used to integrate and wrap components from many sources
• Parallel and distributed computing is pervasive
– Increasing data sizes combined with plateauing sequential processing power
– Parallel hardware (e.g., accelerators) and distributed computing systems
parsl-project.org
11. Parsl: Pervasive Parallel Programming in Python
• Apps define opportunities for parallelism
– Python apps call Python functions
– Bash apps call external applications
• Apps return “futures”: a proxy for a result that might not yet be available
• Apps run concurrently, respecting data dependencies – natural parallel programming!
• Parsl scripts are independent of where they run – write once, run anywhere!
pip install parsl
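The apps-and-futures model can be sketched with the standard library's `concurrent.futures`; Parsl's `@python_app` decorator generalizes this pattern by tracking data dependencies between apps and by targeting remote resources.

```python
from concurrent.futures import ThreadPoolExecutor

# Stdlib sketch of the futures model described above: submit work,
# get back proxies, and block on results only when they are needed.
def square(x):
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(square, i) for i in range(5)]  # run concurrently
    results = [f.result() for f in futures]               # resolve the proxies

print(results)  # [0, 1, 4, 9, 16]
```

In Parsl, decorating `square` with `@python_app` yields the same call-and-future shape while the runtime handles scheduling across the configured execution resources.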
12. Parsl executors scale to 2M tasks / 256K workers (weak scaling)
• Weak scaling: 10 tasks (0–1 s) per worker
• HTEX and EXEX outperform other Python-based approaches and scale to millions of tasks
• HTEX and EXEX scale to 2K* and 8K* nodes, respectively, with >1K tasks/s
14. PolyNER: Generalizable Scientific Named Entity Recognition
[Diagram: pipeline from word embedding through labelling to a trained classifier, refined by active learning]
• Scientific NER challenges:
– NLP approaches are not yet suitable for application to scientific information extraction
– There is a lack of training data for applying ML
• PolyNER automates the creation of training data using minimal human guidance
– Word embedding models to generate entity-rich corpora
– Context- and content-based classifiers
– Active learning to prioritize expert effort
• Better performance than leading chemical entity extractors at a fraction of the cost
– 1,000 labels, 5 hours of expert time
• Training data for lexicon-infused Bi-LSTM
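The active-learning step above can be sketched as uncertainty sampling: spend expert labeling effort only where the classifier is least confident. The scoring function and candidate probabilities below are invented stand-ins for a trained classifier's outputs, not PolyNER's implementation.

```python
# Pick the candidates whose predicted probability of being a scientific
# entity is closest to 0.5 -- i.e. where the classifier is most uncertain.
def pick_for_labeling(candidates, probability, budget):
    return sorted(candidates, key=lambda c: abs(probability(c) - 0.5))[:budget]

# Toy probabilities for four candidate entity mentions (illustrative values).
probs = {"polyethylene": 0.95, "solvent": 0.55, "PS-b-PMMA": 0.48, "the": 0.02}
queue = pick_for_labeling(list(probs), probs.get, budget=2)
print(queue)  # ['PS-b-PMMA', 'solvent']
```

Confident cases ("polyethylene", "the") are skipped, so the expert's limited time goes to the examples that most improve the classifier.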