The document discusses the rapidly growing volumes of data being generated across many scientific domains such as biology, astronomy, climate science, and others. It notes that while "big science" projects have been able to develop robust cyberinfrastructure to manage and analyze large datasets, most individual researchers and smaller research groups lack adequate computing resources and software tools to effectively handle the data. The author argues that providing research cyberinfrastructure as a cloud-based service could help address this problem by reducing costs and barriers to entry for researchers. Specific services like Globus Online for data transfer and potential future services for storage, collaboration, and integration with other tools are presented as examples of this approach.
3. The data deluge in biology
x10 in 6 years
x10^5 in 6 years
www.ci.anl.gov
3
www.ci.uchicago.edu
4. Number of sequencing machines
http://omicsmaps.com/
5. Moore’s Law for X-ray sources
18 orders of magnitude in 5 decades!
12 orders of magnitude in 6 decades
Credit: Linda Young
7. Exploding data volumes in climate science
2004: 36 TB
2012: 2,300 TB
Climate model intercomparison project (CMIP) of the IPCC
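The two CMIP data points imply a compound growth rate. A minimal sketch of that arithmetic follows, assuming steady exponential growth between 2004 and 2012 (an assumption; the slide gives only the two endpoints). The function name is illustrative.

```python
def annual_growth_factor(start_tb, end_tb, years):
    """Constant yearly multiplier that turns start_tb into end_tb over `years`."""
    return (end_tb / start_tb) ** (1.0 / years)

# Endpoints taken from the slide: 36 TB (2004) and 2,300 TB (2012).
factor = annual_growth_factor(36, 2300, 2012 - 2004)
overall = 2300 / 36

print(f"overall growth: x{overall:.0f}")
print(f"implied annual growth: x{factor:.2f}")
```

Under that assumption the archive grows by roughly two thirds every year, which is the kind of rate that outpaces a flat budget for storage and staff.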
8. Big science has been successful
OSG: 1.4M CPU-hours/day, >90 sites, >3000 users, >260 pubs in 2010
LIGO: 1 PB data in last science run, distributed worldwide
ESG: 1.2 PB climate data delivered to 23,000 users; 600+ pubs
Robust production solutions
Substantial teams and expense
Sustained, multi-year effort
Application-specific solutions, built on common technology
All build on NSF OCI (& DOE)-supported Globus Toolkit software
9. Small science is struggling
More data, more complex data
Ad-hoc solutions
Inadequate software, hardware
Data plan mandates
10. Dark data in the long tail of science
[Chart: awarded amount in 2007 ($0 to $7,000,000) for each award, x-axis ticks 1 to 8776]
NSF grant awards, 2007 (Bryan Heidorn)
11. The challenge of staying competitive
"Well, in our country," said Alice …
"you'd generally get to somewhere
else — if you run very fast for a
long time, as we've been doing."
"A slow sort of country!" said the
Queen. "Now, here, you see, it
takes all the running you can do, to
keep in the same place. If you want
to get somewhere else, you must run
at least twice as fast as that!"
12. A crisis that demands new approaches
• We have exceptional infrastructure for the 1%
(e.g., supercomputers, Large Hadron Collider, …)
• But not for the 99% (e.g., the vast majority of
the 1.8M publicly funded researchers in the EU)
We need new approaches to providing
research cyberinfrastructure, that:
— Reduce barriers to entry
— Are cheaper
— Are sustainable
13. You can run a company from a coffee shop
14. Because businesses outsource their IT
Software as a Service (SaaS):
Web presence
Email (hosted Exchange)
Calendar
Telephony (hosted VOIP)
Human resources and payroll
Accounting
Customer relationship mgmt
15. And often their large-scale computing too
Software as a Service (SaaS):
Web presence
Email (hosted Exchange)
Calendar
Telephony (hosted VOIP)
Human resources and payroll
Accounting
Customer relationship mgmt
Infrastructure as a Service (IaaS):
Data analytics
Content distribution
16. Let’s rethink how we provide research IT
Accelerate discovery and innovation worldwide
by providing research IT as a service
Leverage the cloud to
• provide millions of researchers with
unprecedented access to powerful tools;
• enable a massive shortening of cycle times in
time-consuming research processes; and
• reduce research IT costs dramatically via
economies of scale
18. Cloud layers
Software as a Service: SaaS
Platform as a Service: PaaS
Infrastructure as a Service: IaaS
19. Common research data management steps
• Dark Energy Survey
• Galaxy genomics
• LIGO observatory
• SBGrid structural biology consortium
• NCAR climate data applications
• Land use change; economics
21. Scientific data delivery, 2012 1980
• “[A] majority of users at BES facilities … physically transport data
to a home institution using portable media … data volumes are
going to increase significantly in the next few years (to 70 TB/day
or more) – data must be transferred over the network”
• “the effectiveness of data transfer middleware [is] not just on the
transfer speed, but also the time and interruption to other work
required to supervise and check on the success of large data
transfers”
• “It took two weeks and email traffic between network specialists
at NERSC and ORNL, sys-admins at NERSC, … and combustion staff
at ORNL and SNL to move 10 TB from NERSC to ORNL”
Major usability, productivity, performance problems
[ESNet Network Requirements Workshops, 2007-2010]
22. The challenge: Moving big data easily
What should be trivial …
“I need my data over there – at my _____” (supercomputing center, campus server, etc.)
[Diagram: Data Source, Data Destination]
… can be painfully tedious and time-consuming
“GAAAH!%&@#&!”
! Config issues
! Firewall issues
! Unexpected failure = manual retry
[Diagram: Data Source, Data Destination]
24. Globus Online: Data transfer as SaaS
• Reliable file transfer.
– Easy “fire-and-forget” transfers
– Automatic fault recovery
– High performance
– Across multiple security domains
• No IT required.
– Software as a Service (SaaS)
• No client software installation
• New features automatically available
– Consolidated support & troubleshooting
– Works with existing GridFTP servers
– Globus Connect solves “last mile problem”
• >4000 registered users, >3 Petabytes moved
Recommended by XSEDE, NERSC, Blue Waters, and many campuses
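The "fire-and-forget" and "automatic fault recovery" bullets describe work that a researcher would otherwise script by hand. The sketch below illustrates that retry pattern in general terms; it is not Globus Online's implementation. `transfer_fn`, `TransientTransferError`, and the backoff parameters are hypothetical names chosen for illustration.

```python
import time

class TransientTransferError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. a dropped connection)."""

def transfer_with_retry(transfer_fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call transfer_fn until it succeeds, waiting longer after each failure.

    Returns the number of attempts used; re-raises after max_attempts failures.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            transfer_fn()
            return attempt
        except TransientTransferError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))

# Usage: a stand-in transfer that fails twice, then succeeds.
failures_left = [2]

def flaky_transfer():
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise TransientTransferError("connection reset")

# The sleep is replaced with a no-op so the example runs instantly.
attempts = transfer_with_retry(flaky_transfer, sleep=lambda seconds: None)
print(attempts)
```

A hosted service can run this loop on the user's behalf, which is what removes the "unexpected failure = manual retry" step from the researcher's day.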
25. Dark Energy Survey use of Globus Online
• Dark Energy Survey receives 100,000 files each night in Illinois
• They transmit files to Texas for analysis … then move results back to Illinois
• Process must be reliable, routine, and efficient
• They outsource this task to Globus Online
[Image: Blanco 4m on Cerro Tololo. Image credit: Roger Smith/NOAO/AURA/NSF]
28. Integration with Earth System Grid
High-speed transfers
Automated retries
Works behind firewalls
Credential management
Transfer monitoring
29. Globus Online under the covers
User Hub manages user identities and profiles
Group Hub manages groups and policies
Resource Hub for resource definitions
30. Globus Online under the covers
Monitoring and control
Auto-tuning of transfer parameters
Detection & attempted correction of errors
Manual intervention when required
User Hub manages user identities and profiles
Group Hub manages groups and policies
Resource Hub for resource definitions
31. Globus Online under the covers
Monitoring and control
Auto-tuning of transfer parameters
Detection & attempted correction of errors
Manual intervention when required
User Hub manages user identities and profiles
Group Hub manages groups and policies
Resource Hub for resource definitions
Reliable cloud-based infrastructure
EC2 for transfer management
S3 for system state
SimpleDB for lock management
Replication across availability zones
33. Towards “research IT as a service”
• Dark Energy Survey
• Galaxy genomics
• LIGO observatory
• SBGrid structural biology consortium
• NCAR climate data applications
• Land use change; economics
34. Towards “research IT as a service”
Research data management as a service
SaaS: Globus Transfer, Globus Storage, Globus Collaborate, Globus Catalog, ...
PaaS: Globus Integrate platform
35. Globus Storage: For when you want to …
• Place your data where you want
• Access it from anywhere via different protocols (GridFTP, HTTP, WebDAV)
• Update it, version it, and take snapshots
• Share versions with who you want
• Synchronize among locations
[Diagram: a Globus Storage volume spanning a commercial storage service provider, a campus research computing center, and a national center]
36. Globus Collaborate: For when you want to
Join with a few or
many people to:
• Share documents
• Track tasks
• Send email
• Share data
• Do whatever
With:
• Common groups
• Delegated mgmt
37. Globus Integrate: For when you want to
Write programs that access/manage user
identities, profiles, groups, resources—and data …
… via REST APIs and command line programs
Globus Transfer
• In production use
• Service and Web UI enhancements continue
Globus Storage
• Early release available in March
• Generally available in Q3
Globus Collaborate
• Initial projects starting in March
• Early release sometime in Q3
Globus Integrate
• Transfer API available
• User profile, group APIs in alpha
• APIs for Storage, Collaborate planned after app release
Globus Connect Multi User
Globus Connect
42. Realizing the benefits of cloud services
• Understand what services researchers really
need
• Acquire and sustain the expertise required to
create and operate useful services
• Incentivize those who produce services that are
widely adopted
• Provide excellent network connectivity
43. On the importance of networks
“80 percent of success is showing up”
44. Time required to move 10 Terabytes
[Chart: hours to transfer 10 Terabytes (log scale, 0.01 to 10,000) versus network speed in Megabits/sec (log scale, 1.E+01 to 1.E+06)]
45. Time required to move 10 Terabytes
[Chart: hours to transfer 10 Terabytes (log scale, 0.01 to 10,000) versus network speed in Megabits/sec (log scale, 1.E+01 to 1.E+06)]
Annotation: 2 hours, US R1 Universities
46. Time required to move 10 Terabytes
[Chart: hours to transfer 10 Terabytes (log scale, 0.01 to 10,000) versus network speed in Megabits/sec (log scale, 1.E+01 to 1.E+06)]
Annotation: 2 hours, US R1 Universities
Annotation: 10 mins, Upgrade
47. Time required to move 10 Terabytes
[Chart: hours to transfer 10 Terabytes (log scale, 0.01 to 10,000) versus network speed in Megabits/sec (log scale, 1.E+01 to 1.E+06)]
Annotation: 1 month, Cinvestav Langebio
Annotation: 2 hours, US R1 Universities
Annotation: 10 mins, Upgrade
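The relationship plotted on this chart is simple arithmetic: transfer time is data volume divided by link speed. The sketch below computes it for an ideal link with no protocol overhead. The three link speeds (30 Mbit/s, 10 Gbit/s, 100 Gbit/s) are assumptions chosen because they roughly reproduce the chart's annotations of 1 month, 2 hours, and 10 minutes; the slides do not state the speeds themselves.

```python
def hours_to_transfer(terabytes, megabits_per_sec):
    """Ideal transfer time in hours, ignoring protocol overhead and contention."""
    bits = terabytes * 1e12 * 8                 # decimal terabytes -> bits
    seconds = bits / (megabits_per_sec * 1e6)   # Mbit/s -> bit/s
    return seconds / 3600

# Assumed link speeds (not given on the slides): 30 Mbit/s, 10 Gbit/s, 100 Gbit/s.
for mbps in (30, 10_000, 100_000):
    print(f"{mbps:>7} Mbit/s: {hours_to_transfer(10, mbps):8.2f} hours")
```

Real transfers are slower than this bound, which is why tuning and automated recovery matter as much as raw link speed.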
48. A 21st C research cyberinfrastructure
• To provide more capability for more people at less cost …
• Create cloud-based services
– Robust and universal
– Economies of scale
– Positive returns to scale
• Via the creative use of
– Aggregation (“cloud”)
– Federation (“grid”)
• Powered by networks
[Diagram: small and medium laboratories and projects (boxes labeled L and P) above three service layers: Research data management; Collaboration, computation; Research administration]
49. Questions for you
• How much “dark data” exists in your field? How
important is that data?
• Can you quantify the scale, in your field, of
– Wasted resources due to duplicated effort
– Delays in research progress due to inadequate
infrastructure?
• If you could do one thing to accelerate adoption
of advanced computing within your field, what
would it be?
50. Acknowledgments
Colleagues at UChicago and Argonne
Steve Tuecke, Ravi Madduri, Kyle Chard, Tanu
Malik, Rachana Ananthakrishnan, Raj Kettimuthu,
and others listed at
www.globusonline.org/about/goteam/
NSF Office of Cyberinfrastructure
DOE Office of Advanced Scientific Computing Res.
National Institutes of Health
51. For more information
Attend GlobusWorld in Chicago, April 10-12, 2012
• www.globusonline.org
• Twitter: @globusonline, Globus Online on Facebook
• Foster, I. Globus Online: Accelerating and
democratizing science through cloud-based services.
IEEE Internet Computing(May/June):70-73, 2011.
• Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G.,
Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and
Tuecke, S. Software as a Service for Data Scientists.
Communications of the ACM, Feb, 2012.
Cyberinfrastructure: The distributed computer, information, and communication technologies [that] empower the modern scientific research endeavor [Atkins report]
Gap of >1000 – AND many more systems as people jump on bandwagon
Meanwhile, other resources [money, people] stay flat
Crisis
10^5 in 6 years
10 in 6 years
http://omicsmaps.com/
PI and a handful of students and staff
80% of awards and 50% of grant $$ are < $350K
Lewis Carroll
End-to-end crisis
The answer cannot simply be more money
We lack both $$ and the people to spend $$ on
Not (particularly) computing as a service
But the IT functions that researchers need to function
Include collaboration as a service
Infrastructure will be provided by many – competitive – race to the bottom
Interesting questions are: What is the platform? And what is the software?
Sequencing: at center X, move data to Y, analyze, load into Short Read Archive (?), share, …
But when we get to work, we go back in time 20 years
User Hub: Profiles, Identities
Group Hub: Definitions, Policies
Resource Hub: Definitions, History
With a high-speed network, one can show up. Not just in person, but also computationally.