We present our work on creating sustainable science services using Globus, Amazon Web Services, and the Galaxy framework. We focus on Globus Genomics as a successful use case.
AWS Public Sector Summit 2014 Talk - Science as a Service using AWS
1. AWS Government, Education, and Nonprofits Symposium
Washington, DC | June 24, 2014 - June 26, 2014
madduri@anl.gov
Science as a Service on AWS
Ravi K Madduri
Outline
• CI Mission and Introduction of Science as a Service
• Motivation
– Why is this important?
• Separation of concerns – Going far together
• Examples of Science as a Service
• Focus on Globus Genomics as a Success story
– Announcing Globus Genomics AWS Test Drive
Our Vision for a 21st Century Discovery Infrastructure
Provide more capability for people at lower cost by delivering Science as a service
www.globus.org
Two Broader Themes
• Productivity of Researchers
– Time spent performing administrative tasks vs. time spent doing science
– Reproducibility
• Sustainability of scientific software
– Reduction in funding for science
Time-consuming tasks in science
• Run experiments
• Collect data
• Manage data
• Move data
• Acquire computers
• Analyze data
• Run simulations
• Compare experiment with simulation
• Search the literature
• Communicate with colleagues
• Publish papers
• Find, configure, install relevant software
• Find, access, analyze relevant data
• Order supplies
• Write proposals
• Write reports
Presenting
21st Century Discovery Infrastructure
Going Far Together
Separation of Concerns
Time-consuming tasks in science
• Communicate with colleagues
• Publish papers
• Find, configure, install relevant software
• Find, access, analyze relevant data
• Order supplies
• Write proposals
• Write reports
• Run experiments
• Collect data
• Manage data
• Move data
• Acquire computers
• Analyze data
• Run simulations
• Compare experiment with simulation
• Search the literature
Globus: Fast, reliable data transfer
Data Source → Data Destination
1. User initiates transfer request
2. Globus moves and syncs files
3. Globus notifies user
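The three steps above (request, move and sync, notify) can be illustrated with a small Python sketch. It is only an illustration of the pattern, using the standard library on local directories with checksum-based sync and retry on failure; it is not the Globus API. The names `sync_transfer`, `file_digest`, and `notify` are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def sync_transfer(source: Path, destination: Path, notify, max_retries: int = 3) -> list:
    """Copy files that are missing or different at the destination.

    Hypothetical helper: mimics "move and sync" on local directories only.
    """
    copied = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        dst = destination / src.relative_to(source)
        # Sync: skip files whose contents already match at the destination.
        if dst.exists() and file_digest(dst) == file_digest(src):
            continue
        for attempt in range(1, max_retries + 1):
            try:
                dst.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dst)
                # Verify the copy before counting it as done.
                if file_digest(dst) != file_digest(src):
                    raise OSError("checksum mismatch")
                copied.append(str(src.relative_to(source)))
                break
            except OSError:
                # Retry on failure; give up after the last attempt.
                if attempt == max_retries:
                    raise
    # Final step: tell the user what happened.
    notify(f"transfer complete: {len(copied)} file(s) copied")
    return copied


def demo():
    """Run two transfers; the second copies nothing because files are in sync."""
    messages = []
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp) / "source", Path(tmp) / "destination"
        src.mkdir()
        (src / "reads.fastq").write_text("ACGT\n")
        first = sync_transfer(src, dst, messages.append)
        second = sync_transfer(src, dst, messages.append)
    return first, second, messages
```

Calling `demo()` runs the same request twice; the second run finds the destination already in sync, which is the behavior a "fire-and-forget" user would rely on.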
Amazon S3 Endpoints
Globus: Sharing off existing systems
Data Source
1. User A selects file(s) to share, selects user or group, and sets permissions
2. Globus tracks shared files; no need to move files to cloud storage!
3. User B logs into Globus and accesses shared file
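A minimal sketch of the sharing model above, assuming an in-memory permission table: the files stay where they are and only access rules are recorded. `ShareRegistry`, `share`, and `can_access` are hypothetical names for illustration, not Globus calls.

```python
from dataclasses import dataclass, field


@dataclass
class ShareRegistry:
    """Tracks which users or groups may read or write a shared path.

    Hypothetical illustration: files are never copied, only rules are stored.
    """
    rules: dict = field(default_factory=dict)   # path -> {principal: set of permissions}
    groups: dict = field(default_factory=dict)  # group name -> set of user names

    def share(self, path: str, principal: str, permissions: set) -> None:
        """The owner grants permissions on a path to a user or group."""
        self.rules.setdefault(path, {})[principal] = set(permissions)

    def can_access(self, user: str, path: str, permission: str = "read") -> bool:
        """Check a request against rules on the path or any parent folder."""
        principals = {user} | {g for g, members in self.groups.items() if user in members}
        for shared_path, grants in self.rules.items():
            prefix = shared_path.rstrip("/") + "/"
            if path == shared_path or path.startswith(prefix):
                if any(permission in grants.get(p, set()) for p in principals):
                    return True
        return False


# Example with placeholder names: user_b belongs to a group with read access.
registry = ShareRegistry(groups={"lab-group": {"user_b"}})
registry.share("/data/run1", "lab-group", {"read"})
```

With this setup, `registry.can_access("user_b", "/data/run1/sample.fastq")` is allowed, while a write request or a request from a user outside the group is refused.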
Globus: Federated identity
>25,000 registered users; >150 daily
50 PB moved; >1B files
10x (or better) performance vs. scp
99.9% availability
Entirely hosted on Amazon
Globus Transfer
Globus: Data publication service
Community
• Collection policies: Metadata, Access Control, License, Storage, Curation Workflow
• Each Dataset: Data and Metadata
Globus Science Stack in Action
Data Management
• Data endpoints: Sequencing Centers, Public Data, Storage, Local Cluster/Cloud, Seq Center, Research Lab
• Globus provides a high-performance, fault-tolerant, secure file transfer service between all data endpoints
Data Analysis
• Galaxy Data Libraries
• Workflow elements: Fastq, Ref Genome, Alignment, Picard, GATK, Variant Calling
• Globus integrated with Galaxy
• Web-based UI
• Drag-and-drop workflow creation
• Easily modify workflows with new tools
Globus Genomics on Amazon EC2
• Analytical tools are automatically run on the scalable compute resources when possible
Layers: Globus SaaS; Galaxy-Based Workflow Management System
Flexible, scalable, affordable genomics analysis for all biologists
Globus Genomics
• Analysis tools profiled for optimal performance
• Workload management for parallel execution
• Resources provisioned on demand
• High performance, reliable data movement
• Seamless access using institution’s credentials
• Best practice + extensible, customizable pipelines
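As a rough illustration of "workload management for parallel execution", the sketch below runs a two-step pipeline for several samples at once using Python's standard library. The step functions are placeholders that only build output file names; they do not invoke any real alignment or variant-calling tool, and all names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def align(sample: str) -> str:
    """Placeholder for an alignment step; returns the name of its output."""
    return f"{sample}.bam"


def call_variants(alignment: str) -> str:
    """Placeholder for a variant-calling step; returns the name of its output."""
    return alignment.replace(".bam", ".vcf")


def run_pipeline(sample: str) -> str:
    """Run the steps for one sample in order."""
    return call_variants(align(sample))


def run_all(samples: list, max_workers: int = 4) -> dict:
    """Process samples in parallel; each sample's steps still run in sequence."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(run_pipeline, samples)  # map preserves input order
        return dict(zip(samples, results))
```

In a real deployment the worker pool would be backed by compute nodes provisioned on demand rather than local threads; the sketch only shows the scheduling shape.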
Globus Climate
Globus Materials
Cardiovascular Research
Proton Cancer Treatment
No. Histories   Execution Time (s)   No. Per Hour   On-demand Cost ($2.10)   Spot Cost ($0.50)
1.5B            570                  6              $35                      $9
1B              445                  8              $27                      $7
0.5B            283                  12             $18                      $5
0.25B           170                  21             $10                      $2
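The table's derived column can be checked with a few lines of Python. This sketch assumes that "No. Per Hour" is the whole number of runs that fit in 3,600 seconds, and it computes the spot saving from the quoted dollar figures only. How the dollar costs themselves were derived is not stated on the slide, so the code does not try to reproduce them.

```python
# Rows copied from the slide:
# (histories, execution time in seconds, on-demand cost in $, spot cost in $)
ROWS = [
    ("1.5B", 570, 35, 9),
    ("1B", 445, 27, 7),
    ("0.5B", 283, 18, 5),
    ("0.25B", 170, 10, 2),
]


def runs_per_hour(execution_seconds: int) -> int:
    """Whole runs that fit in one hour (assumed meaning of 'No. Per Hour')."""
    return 3600 // execution_seconds


def spot_saving_percent(on_demand: float, spot: float) -> float:
    """Percentage saved by paying the spot figure instead of the on-demand one."""
    return round(100 * (on_demand - spot) / on_demand, 1)


def summarize() -> list:
    """Return (histories, runs per hour, spot saving %) for each table row."""
    return [
        (label, runs_per_hour(seconds), spot_saving_percent(on_demand, spot))
        for label, seconds, on_demand, spot in ROWS
    ]


if __name__ == "__main__":
    for label, per_hour, saving in summarize():
        print(f"{label}: {per_hour} runs/hour, spot saves {saving}%")
```

Under that assumption the computed runs per hour match the slide's column, and the quoted spot figures are roughly 72% to 80% lower than the on-demand ones.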
Usage has been promising
[Chart: Instance Hours and Cost ($) by Date, January through June]
Exome: 3 – 12hrs ~1hr
Whole Genome: ~22hrs ~10hrs
RNA-Seq: 1 – 12hrs ~minutes
Diversity of collaborations
Dobyns Lab
Cox Lab
Volchenboum Lab
Olopade Lab
Nagarajan Lab
Common misconceptions
• Cloud is expensive
• Cloud is insecure
• It takes a long time to move data and it's hard
• Cloud is about VMs and we've got VMs
• My codes won’t run on the cloud
• Cloud is not HPC-enough
• Amazon will be acquired or will file for bankruptcy – What happens to my data?
Possible Solutions
• Outreach
• Case studies with TCO for various domains and problem types
• Compliance
• Transparency in Billing
Our Vision for a 21st Century Discovery Infrastructure
To make advanced computational capabilities available to all researchers at substantially lower cost
We’re “all in” on cloud
• Identify time-consuming activities amenable to automation and outsourcing, and deliver them as high-quality, low-touch SaaS
• Extract common elements as a research data management automation PaaS
• Leverage IaaS for reliability and economies of scale
Thank you to our sponsors!
Editor's notes
Carnegie Library – Represents how science was done
About 5 years ago we started building a solution to this problem: Globus
Reliable, secure, high-performance file transfer and synchronization
Hosted service…
“Fire-and-forget” transfers
Automatic fault recovery
Seamless security integration
You retain full control of your storage system
Local security mechanisms are respected
Access limited by the policies you have set on the file system
You offload the burden of managing shared data to the users themselves
In both moving and sharing data, we have to deal with the thorny issue of security
In moving the data you’re typically crossing multiple security domains
And when sharing, your collaborators are typically in different institutions with different identity providers
Ideally the researcher wants to use a single username and password, most likely their campus identity
Some services such as InCommon provide the foundation for this at an institutional level
But they don’t address how a researcher accesses resources at the institution
How easy is it for one of your colleagues at another organization to just access your HPC cluster and run an analysis or pull down some of the data from your experiment?
Extending Globus to deliver publishing and discovery capabilities as a hosted service.
Metadata is stored in the Cloud
Published data is stored on campus, institutional, group resources that are managed and operated by external administrators
To associate storage with a collection, administrators must configure Globus Connect Server with sharing on their resources and then associate the endpoint with the collection through Globus.
Published datasets are organized by “communities” and their member “collections”
E.g., Argonne National Laboratory community has several member collections (APS, CNM, CELS)
Often collections will map to a department or group within an institution, but they don’t have to.
Globus users can create and manage their own communities and collections through the service
A Collection enables the submission of datasets with policies regarding access
A Dataset is data and metadata
Policies can be set on communities or collections:
• Metadata (schema, requirements)
• Access control (user and group based)
• Curation workflow
• Submission and distribution license
• Storage
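The community, collection, and dataset hierarchy described in these notes can be sketched as a small data model. The class and field names below are hypothetical and chosen for illustration; they are not the publication service's actual schema. The sketch also assumes that a collection falls back to its community for any policy it does not set itself, which the notes do not state.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Policies:
    """Policy areas named in the notes; None means 'not set at this level'."""
    metadata_schema: Optional[str] = None
    access_groups: Optional[list] = None
    curation_workflow: Optional[str] = None
    license: Optional[str] = None
    storage_endpoint: Optional[str] = None


@dataclass
class Dataset:
    """A dataset is data plus metadata."""
    data_paths: list
    metadata: dict


@dataclass
class Community:
    name: str
    policies: Policies = field(default_factory=Policies)


@dataclass
class Collection:
    name: str
    community: Community
    policies: Policies = field(default_factory=Policies)
    datasets: list = field(default_factory=list)

    def effective_policy(self, area: str):
        """Collection setting wins; otherwise use the community's (assumption)."""
        own = getattr(self.policies, area)
        return own if own is not None else getattr(self.community.policies, area)

    def submit(self, dataset: Dataset) -> None:
        """Accept a dataset into this collection."""
        self.datasets.append(dataset)


# Example using the community and collection names from the notes;
# the policy values and file name are placeholders.
community = Community("Argonne National Laboratory", Policies(license="example-license"))
aps = Collection("APS", community, Policies(storage_endpoint="example-endpoint"))
aps.submit(Dataset(["scan_001.h5"], {"title": "example dataset"}))
```

Here `aps.effective_policy("license")` resolves to the community's placeholder license, while the storage endpoint comes from the collection itself.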
Our goal is to operationalize key capabilities so researchers can depend on them. Think of Gmail for science.
We are on the verge of even more substantial improvements
Optimizations: using Intel AVX extensions for certain tools, e.g. seeing 720x speedup in GATK