This presentation was given at the GlobusWorld 2020 Virtual Conference, by Ian Foster, Rachana Ananthakrishnan, and Vas Vasiliadis from the University of Chicago.
16. Globus Labs Mission
To make research data reliably, rapidly, and securely accessible, discoverable, and usable…
…by developing an automated and scalable platform for reproducible research that can exploit heterogeneous resources spanning the computing continuum.
[Diagram: platform components: funcX (function fabric), Globus SCRIMP (data/trust fabric), DLHub (model registry), Xtract (metadata extraction), cost map, and Flows (write programs, automate)]
17. funcX: distributed function as a service
• Portable code: Python; Docker, Shifter, Singularity containers
• Any computer: clusters, clouds, HPC, accelerators
• Any access: cloud API, cluster or HPC scheduler
18. funcX: Transform clouds, clusters, and supercomputers into high-performance function serving systems
Simply deploy a funcX endpoint to transform a computer into a function serving system.
[Diagram: functions f(x), g(x), h(x), k(x) plus dependencies are registered (via repo2docker) and served from funcX endpoints EP(x)]
19. funcX: Transform clouds, clusters, and supercomputers into high-performance function serving systems (continued)
• Registration: functions f(x), g(x), … plus their dependencies are registered (via repo2docker) in the funcX registry.
• Execution: invoking f(x) on inputs [1,2,3 … n] dispatches the call to a funcX endpoint EP(x).
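The register-then-invoke pattern can be sketched in plain Python (an in-process stand-in with invented names; the real funcX SDK registers and runs functions on remote endpoints):

```python
import uuid

# Minimal in-process sketch of funcX's register-then-invoke
# pattern. Class and method names are invented for illustration;
# the real service executes functions remotely on endpoints.

class FunctionRegistry:
    """Stores registered functions under opaque IDs."""
    def __init__(self):
        self._functions = {}

    def register(self, func):
        func_id = str(uuid.uuid4())
        self._functions[func_id] = func
        return func_id

    def get(self, func_id):
        return self._functions[func_id]

class Endpoint:
    """Executes registered functions against supplied inputs."""
    def __init__(self, registry):
        self.registry = registry

    def run(self, func_id, *args):
        return self.registry.get(func_id)(*args)

registry = FunctionRegistry()
endpoint = Endpoint(registry)

double_id = registry.register(lambda x: 2 * x)
results = [endpoint.run(double_id, i) for i in [1, 2, 3]]
print(results)  # [2, 4, 6]
```

The key design point the sketch preserves is the separation of concerns: registration happens once, centrally, while execution can be dispatched to any endpoint that can resolve the function ID.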
20. Parsl: parallel programming in Python
arxiv.org/pdf/1905.02158 | parsl-project.org
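Parsl turns Python functions into apps that return futures and execute in parallel. A stdlib analogue of that futures-based style (not Parsl's actual API) is:

```python
from concurrent.futures import ThreadPoolExecutor

# Stdlib analogue of Parsl's dataflow style: submitting a task
# returns a future immediately, and results are gathered later.
# Parsl expresses this with decorated apps; this sketch uses
# concurrent.futures to show only the pattern.

def simulate(x):
    return x * x  # stand-in for an expensive task

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(simulate, i) for i in range(5)]
    results = [f.result() for f in futures]

print(results)  # [0, 1, 4, 9, 16]
```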
21. Cost-aware computing with heterogeneous platforms
Incremental construction of a personalized cost map:
• Build black-box performance models from observed execution times for different codes on different platforms
• Transfer learning across codes, problem sizes, and hardware platforms
• Experiment design to choose experiments that maximize reduction in uncertainty
• Evolve models over time as codes and platforms change
• Use models for instance selection and scheduling
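The experiment-design bullet can be illustrated with a toy loop that always measures the (code, platform) cell with the fewest observations, a crude stand-in for "highest uncertainty" (the codes, platforms, and runtimes below are invented):

```python
# Toy sketch of uncertainty-driven experiment design for a cost
# map. Sample count is used as a crude proxy for uncertainty;
# real cost maps use learned performance models and transfer
# learning. All names and numbers here are illustrative.

codes = ["align", "index"]
platforms = ["c5.xlarge", "r5.xlarge"]
observations = {(c, p): [] for c in codes for p in platforms}

def next_experiment():
    # Pick the least-measured (code, platform) cell.
    return min(observations, key=lambda cell: len(observations[cell]))

def record(cell, runtime_s):
    observations[cell].append(runtime_s)

def predict(cell):
    times = observations[cell]
    return sum(times) / len(times) if times else None

record(next_experiment(), 12.0)           # fills one empty cell
record(("align", "c5.xlarge"), 11.0)      # repeat measurement
print(next_experiment())                  # a still-unmeasured cell
```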
22. Example: A cost map for bioinformatics applications on different AWS instance types
[Chart: cost map over virtual CPUs and RAM (GB) for each instance type]
IndexBam performs better on compute-optimized instances. Poorly chosen experiments mislead the model.
On average, the model is within 30% of final error after 4 experiments and within 2.3% after 6.
23. Metadata extraction at the edge
• Dynamic extraction pipelines composed of many independent extractors
  – Metadata and content (images, text, tables, maps, …)
• Centralized vs. edge extractor execution to weigh tradeoffs between compute and transfer costs
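The compute-versus-transfer tradeoff can be sketched as a simple rule: extract at the edge when shipping the raw file would take longer than running the extractor locally (all rates and times below are illustrative assumptions):

```python
# Sketch of the centralized-vs-edge decision: run an extractor at
# the edge when moving the raw file costs more time than edge
# compute. Bandwidth and extractor timings are invented.

def extract_location(file_size_mb, bandwidth_mbps=100,
                     edge_seconds=2.0, central_seconds=0.5):
    """Pick where to run extraction by comparing elapsed time."""
    transfer_seconds = file_size_mb * 8 / bandwidth_mbps
    central_total = transfer_seconds + central_seconds
    return "edge" if edge_seconds < central_total else "central"

print(extract_location(file_size_mb=500))  # large file: extract at edge
print(extract_location(file_size_mb=5))    # small file: ship to center
```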
24. DLHub: model publication and serving
dlhub.org | arxiv.org/abs/1811.11213
25. NIH Common Fund Data Ecosystem
• Assets: RNAseq, variants, patient phenotypes, expression profiles to small molecules
• At multiple sites: managed/hosted by specialists
• Goals: increase discoverability; combine, reuse, and share assets; increase analysis, enabling clinical research
[Diagram: data automation pipeline: data ingest, index, search, analyze]
27. Simplifying the Globus Connect Personal Experience
• Option to log in from the application during installation
• Setup key method available for automation use cases
• Available next week
30. The new Globus Connect v5 architecture provides numerous new features for users and administrators, and serves as a platform for richer data management capabilities.
31. For users and developers
• Web addressable storage system in addition to bulk data access
• Credential management for cloud storage systems
• No re-authentication needed for the duration of tasks
• Eliminate user certificates and move to OAuth tokens
• …
32. For administrators
• A single DTN pool connects multiple storage systems
• Eliminates the need for a shared file system across DTNs
• Complete backup and recovery solution
• Configuration management API
• …
33. Next point release: GCSv5.4
• Targeted for May 2020
• Deployments with multiple DTNs
• Support for both standard data access and high assurance access
• Custom mapping from user identity (user@domain.edu) to local account
• Role-based management for GCS
• Guest collection root selection via browse
• Connectors supported:
  – POSIX, Google Drive, Google Cloud, Box, Ceph, AWS S3, Spectra Logic BlackPearl
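The custom identity-mapping bullet can be illustrated with a small rule table (the domains and rules here are invented; GCS expresses mappings in its own configuration format):

```python
import re

# Illustrative sketch of mapping a federated identity such as
# user@domain.edu to a local account. The rules and domains
# below are made up; GCSv5.4 has its own mapping configuration.

MAPPING_RULES = [
    (re.compile(r"^(?P<user>[^@]+)@domain\.edu$"), r"\g<user>"),
    (re.compile(r"^(?P<user>[^@]+)@partner\.org$"), r"ext_\g<user>"),
]

def map_identity(identity):
    """Return the local account for an identity, or None."""
    for pattern, template in MAPPING_RULES:
        match = pattern.match(identity)
        if match:
            return match.expand(template)
    return None

print(map_identity("alice@domain.edu"))  # alice
print(map_identity("bob@partner.org"))   # ext_bob
```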
43. Other product updates
• For users: several new features in the web app
  – Consolidated view options, HTTPS upload/download via browser, custom message on access, accessibility improvements, …
• For admins: Transfer updates for checksum handling
  – Support for additional algorithms (SHA1, SHA256, SHA512), custom checksum value to verify file integrity
• For developers: Globus Groups platform service
  – First release with a minimal feature set to get group membership information
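Checksum verification with the algorithms listed above can be sketched with the standard library (the payload and digests here are illustrative):

```python
import hashlib

# Stdlib sketch of checksum-based file-integrity verification
# using the algorithms named above (SHA1, SHA256, SHA512).
# The payload is illustrative.

def verify_checksum(data, algorithm, expected_hex):
    """Return True if data hashes to the expected digest."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == expected_hex

payload = b"instrument output"
expected = hashlib.sha256(payload).hexdigest()
print(verify_checksum(payload, "sha256", expected))      # True
print(verify_checksum(b"corrupted", "sha256", expected)) # False
```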
46. DataCite switches to Globus Auth for authentication
• Globus Auth secures the DataCite Profiles service
• Brings federated login to DataCite users
• Ongoing collaboration to use Globus Auth for securing other APIs
• Globus to use DataCite for persistent identifiers
blog.datacite.org/globus-authentication
47. Cancer Registry Records for Research (CR3)
• Vision: enable broad, controlled access to cancer patient data
• Solution: build a network of federated cancer registries
  – Self-service data exploration across registries
  – Secure, auditable access controls for data sharing
• Federation via Globus: network scale, local control
  – Owners input/export data, apply QC, set access policies
  – Registry data remain at the generating institution
  – Identities provided/authenticated by the institution
49. Programmatic adoption of Globus
"…over 60 research groups …moving over 2PB of data off aging near-line storage…"
"Globus sharing and group functionality have also eased the thorny issue of sharing access with remote collaborators in a more controlled manner."
www.technology.pitt.edu/blog/globus
50. Instrument data delivery at scale
Use Globus to deliver 100s of TB of genomic data to researchers.
Credits: Joe George, University of Michigan
51. Simplified data sharing for ALCF users
Argonne Leadership Computing Facility (ALCF) "Eagle" provides a 50 PB community file system to make data sharing easier than ever among ALCF users, their collaborators, and third parties.
[Diagram: Eagle Community File System with Globus sharing]
53. Current service enhancements
• MFA policy for data access
• IPv6 support
• Conditional fault handling
• Enhancements for storage with staging requirements
• Enhancements to application registration and management
• Groups service
  – Membership API
  – Management API
54. Platform Challenge
Transform how research applications and services are… created, used, and delivered; orchestrated to achieve automation; sustained.
Enable an interoperable ecosystem of research applications and services.
55. Globus platform services
• Identity and Access Management (IAM)
– Auth
– Groups
• Data Services
– Connect
– Transfer
– Manifest
• Search
• Identifiers (collaboration with DataCite)
• Flows
57. Automation Action Providers
[Diagram: Globus action providers (Transfer, Delete, ACLs, Search Ingest, Describe, Identifier, Web Form, Notification, Expression Evaluation) and custom action providers (DLHub, funcX, Xtract, User Form)]
58. Enabling serial crystallography at scale
• Serially image chips with thousands of embedded crystals
• Quality control the first 1,000 images to report failures
• Analyze batches of images as they are collected
• Report statistics and images during the experiment
• Return the crystal structure to the scientist
Darren Sherrell, Gyorgy Babnigg, Andrzej Joachimiak
60. PaaS: develop custom action providers
• Directly use the platform to build and run extensible flows
• Develop action providers…
  – Fit for purpose
  – Developed and deployed by the project
  – Plugged into their flows
• Action Provider Development toolkit
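The action-provider contract can be sketched as a minimal run/status object (class and method names invented; the real Action Provider Development toolkit defines its own interface):

```python
import uuid

# Hypothetical sketch of the action-provider pattern: a provider
# accepts a run request, tracks the action's state, and reports
# status on demand. All names here are invented for illustration.

class EchoActionProvider:
    """Toy provider whose action just echoes its input."""
    def __init__(self):
        self._actions = {}

    def run(self, body):
        action_id = str(uuid.uuid4())
        self._actions[action_id] = {
            "status": "SUCCEEDED",   # toy action completes instantly
            "details": {"echo": body},
        }
        return action_id

    def status(self, action_id):
        return self._actions[action_id]

provider = EchoActionProvider()
action_id = provider.run({"message": "hello"})
print(provider.status(action_id)["status"])  # SUCCEEDED
```

A real provider would typically start the action asynchronously and report an active state until it completes; the instant-success toy keeps the interface visible without that machinery.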
61. XPCS: X-ray Photon Correlation Spectroscopy
[Diagram: (1) imaging, (2) acquisition, (3) XPCS-Eigen analysis, (4) plot results, (5) publication, (6) science, spanning the APS, a lab server, Argonne JLSE, the Argonne Leadership Computing Facility, and the ALCF Data Portal]
• Automated flows stage data to ALCF for on-demand analysis and publication
• Metadata and plots are dynamically extracted and published into a search catalog
• Scientists can select datasets and initiate flows to perform batch analysis tasks
Suresh Narayanan, Nicholas Schwarz
63. SaaS: instrument data management
• Templated solution
• Configurable…
  – Set transfer triggers
  – Select destination(s)
  – Define metadata
• Extensible…
  – Add/remove actions
  – Change action providers
• No development required
[Diagram: instruments (Cryo EM, lightsheet, sequencer, …) feed automated egress from the device (Transfer), image reconstruction, analysis, and visualization (funcX), and indexing for search (Xtract), with output organized into /cohort045, /cohort096, /cohort127]
64. Materials Data Facility
> 40 TB of data | > 320 published authors | > 400 datasets
• Accept data from many locations with flexible interfaces
• Index dataset contents in science-aware ways
• Dispatch data to the community
• Using Automate to simplify building composable flows of services
65. MDF Data Publication Automation
[Flow diagram, orchestrated by Automate; steps include: Auth (get credentials), Web form (metadata), Transfer (transfer dataset), XTract (extract metadata), Share (set permissions), Transfer (move metadata), Ingest (bulk ingest), Identifier (mint DOI), Notify (notify curator), Web form (curation), Notify (notify user)]
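A publication flow like this one can be sketched as an ordered list of (action, parameters) steps plus a tiny runner (the structure is invented for illustration, not the Globus Flows definition format):

```python
# Toy runner for a publication-style flow: each step names an
# action and its parameters, and steps run in sequence. The step
# names echo the MDF diagram; the structure and handlers are
# invented, not the Globus Flows definition format.

FLOW = [
    ("transfer", {"what": "dataset"}),
    ("extract_metadata", {}),
    ("set_permissions", {"who": "curators"}),
    ("mint_doi", {}),
    ("notify", {"who": "user"}),
]

def run_flow(flow, handlers):
    log = []
    for action, params in flow:
        result = handlers[action](**params)
        log.append((action, result))
    return log

handlers = {
    "transfer": lambda what: f"transferred {what}",
    "extract_metadata": lambda: "metadata extracted",
    "set_permissions": lambda who: f"granted {who}",
    "mint_doi": lambda: "doi minted",
    "notify": lambda who: f"notified {who}",
}

for action, result in run_flow(FLOW, handlers):
    print(action, "->", result)
```

Separating the flow description from the handlers is the point of the pattern: the same flow can be re-run with different action providers plugged in.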
66. SaaS: Data Management Plans
• "Turnkey" DMP enablement
• Select dataset (collection)…
• …add metadata for indexing
• …generate persistent ID (DOI, ARK, etc.)
"Point & click" to findable and accessible data
[Diagram: Transfer, Identifier, Ingest]
70. To go (way) beyond file transfer…
• Remove friction for external collaborators
• Automate/scale research data flows
• Diversify research storage options—with a unified interface
• Gain visibility into research storage utilization
• Integrate robust data management into research apps
• Optimize data transfer performance
• Access expert support resources
71. To help our community share the load…
[Chart: "Active Endpoints by Month", 2015/04 through 2019/12, split into subscribed vs. free, y-axis 0 to 6,000]