A talk at NASA Goddard, February 27, 2013
Large and diverse data result in challenging data management problems that researchers and facilities are often ill-equipped to handle. I propose a new approach to these problems based on outsourcing research data management tasks to software-as-a-service providers. I argue that this approach can both achieve significant economies of scale and accelerate discovery by allowing researchers to focus on research rather than mundane information technology tasks. I present early results with this approach in the context of Globus Online.
10. Need: A new way to deliver
research cyberinfrastructure
Frictionless
Affordable
Sustainable
computationinstitute.org
11. We asked ourselves:
What if the research workflow
could be managed as easily as…
…our pictures
…our e-mail
…home entertainment
12. What makes these services great?
Great User Experience
+
High performance
(but invisible) infrastructure
13. We aspire (initially) to create a
great user experience for
research data management
What would a “dropbox for
science” look like?
15. A common workflow…
[Diagram: stores in a typical research data workflow: Staging store → Ingest store → Community store → Analysis store, with a Registry, plus Archive and Mirror stores.]
16. … with common challenges
Data movement, sync, and sharing:
• Between facilities, archives, researchers
• Many files, large data volumes
• With security, reliability, performance
[Same workflow diagram: Staging, Ingest, Community, and Analysis stores, Registry, Archive, and Mirror.]
18. [Diagram: Globus Online file transfer]
1. User initiates transfer request
2. Globus Online moves/syncs files from Data Source to Destination
3. Globus Online notifies user
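The three steps above can be sketched as a minimal model of fire-and-forget transfer orchestration: the user submits a request, the service moves the data between two endpoints, and the user is notified on completion. All names here are illustrative, not the actual Globus Online API:

```python
# Illustrative model of a third-party transfer: submit, move, notify.
# (Hypothetical names and structure; not the Globus Online API.)
from dataclasses import dataclass, field

@dataclass
class TransferService:
    notifications: list = field(default_factory=list)

    def submit(self, source, destination, files):
        # Step 1: user initiates the transfer request.
        moved = self._move(source, destination, files)          # Step 2
        self._notify(f"{moved} file(s): {source} -> {destination}")  # Step 3
        return moved

    def _move(self, source, destination, files):
        # Stand-in for the actual data movement between endpoints.
        return len(files)

    def _notify(self, message):
        self.notifications.append(message)

svc = TransferService()
svc.submit("Data Source", "Destination", ["a.dat", "b.dat"])
print(svc.notifications[0])  # 2 file(s): Data Source -> Destination
```

The point of the model is that the user is out of the loop between steps 1 and 3: the service, not the user's laptop, drives the movement.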
19. [Diagram: Globus Online file sharing]
1. User A selects file(s) to share; selects user/group, sets share permissions
2. Globus Online tracks shared files; no need to move files to cloud storage!
3. User B logs in to Globus Online and accesses shared file
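The key idea above, that the service records only access-control entries referencing files in place, so no data is copied to cloud storage, can be sketched in a few lines (hypothetical structure, not the Globus Online API):

```python
# Minimal sketch of in-place sharing: the service records who may
# access which file on which endpoint; the data itself never moves.
# (Hypothetical names; not the actual Globus Online API.)

class SharingService:
    def __init__(self):
        self.acl = {}  # (endpoint, path) -> set of users/groups

    def share(self, owner, endpoint, path, grantee):
        # User A shares a file in place with a user or group.
        self.acl.setdefault((endpoint, path), set()).add(grantee)

    def can_access(self, user, endpoint, path):
        # User B's access check when logging in to the service.
        return user in self.acl.get((endpoint, path), set())

svc = SharingService()
svc.share("userA", "campus-cluster", "/data/results.h5", "userB")
print(svc.can_access("userB", "campus-cluster", "/data/results.h5"))   # True
print(svc.can_access("mallory", "campus-cluster", "/data/results.h5")) # False
```

Because the catalog holds only (endpoint, path, grantee) entries, sharing a petabyte costs the same as sharing a kilobyte.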
20. Extreme ease of use
• InCommon, OAuth, OpenID, X.509, …
• Credential management
• Group definition and management
• Transfer management and optimization
• Reliability via transfer retries
• Web interface, REST API, command line
• One-click “Globus Connect” install
• 5-minute Globus Connect Multi User install
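"Reliability via transfer retries" can be illustrated with a generic retry-with-backoff loop of the kind any transfer service uses internally. This is a sketch of the idea, not Globus Online's actual implementation:

```python
import time

def transfer_with_retries(do_transfer, max_retries=5, base_delay=1.0):
    """Retry a flaky transfer with exponential backoff.
    `do_transfer` is any callable that raises on failure."""
    for attempt in range(max_retries + 1):
        try:
            return do_transfer()
        except IOError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Example: a transfer that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("connection dropped")
    return "done"

print(transfer_with_retries(flaky, base_delay=0.001))  # done
```

The service's value is that this loop runs on its side, so the user never has to babysit a multi-hour transfer.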
25. What is a Science DMZ?
Three key components, all required:
• “Friction free” network path
– Highly capable network devices (wire-speed, deep queues)
– Virtual circuit connectivity option
– Security policy and enforcement specific to science workflows
– Located at or near site perimeter if possible
• Dedicated, high-performance Data Transfer Nodes (DTNs)
– Hardware, operating system, libraries optimized for transfer
– Optimized data transfer tools: Globus Online, GridFTP
• Performance measurement/test node
– perfSONAR
Details at http://fasterdata.es.net/science-dmz/
26. Globus GridFTP architecture
[Diagram: GridFTP built on the layered Globus XIO stack, with pluggable transports: parallel TCP, UDP, or RDMA, over dedicated or shared links.]
Internal layered XIO architecture allows alternative network and filesystem interfaces to be plugged into the stack
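The layered, pluggable design described above can be mimicked with composable wrappers, where each "driver" transforms data and delegates to the one below it. This is an analogy to Globus XIO's driver stack, not its actual API (which is in C):

```python
# Toy analogue of a layered I/O driver stack: each driver wraps the
# next, so transports and transforms can be swapped independently.
# (Illustrative only; not the real Globus XIO interface.)

class Driver:
    def __init__(self, below=None):
        self.below = below
    def write(self, data):
        return self.below.write(data) if self.below else data

class ChecksumDriver(Driver):
    def write(self, data):
        # Prepend a simple length "checksum" before passing down.
        return super().write(f"len={len(data)}|{data}")

class TcpDriver(Driver):
    def write(self, data):
        return f"tcp:{data}"  # stand-in for a network transport

class UdpDriver(Driver):
    def write(self, data):
        return f"udp:{data}"  # an alternative transport, plugged in freely

# Stacks are assembled bottom-up; swapping the transport requires no
# change to the layers above it.
tcp_stack = ChecksumDriver(TcpDriver())
udp_stack = ChecksumDriver(UdpDriver())
print(tcp_stack.write("hello"))  # tcp:len=5|hello
print(udp_stack.write("hello"))  # udp:len=5|hello
```

This is the property the slide is after: alternative network (or filesystem) interfaces drop into the stack without disturbing the protocol logic above them.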
27. GridFTP performance options
• TCP configuration
• Concurrency: Multiple flows per node
• Parallelism: Multiple nodes
• Pipelining of requests to support small files
• Multiple cores for integrity, encryption
• Alternative protocol selection*
• Use of circuits and multiple paths*
Globus Online can configure these options
based on what it knows about a transfer
* Experimental
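Two of the options above, concurrency (multiple flows) and pipelining (batching small-file requests so per-file round trips are amortized), can be sketched generically. This is not GridFTP code, just an illustration of why the options help:

```python
# Illustration of concurrency and pipelining for many small files:
# several worker flows each handle a batch of requests, amortizing
# per-file round-trip overhead. (Generic sketch, not GridFTP.)
from concurrent.futures import ThreadPoolExecutor

def send_batch(batch):
    # Stand-in for one flow pipelining a batch of small-file requests.
    return [f"sent:{name}" for name in batch]

def transfer(files, concurrency=4):
    # Split the file list into one batch per concurrent flow.
    batches = [files[i::concurrency] for i in range(concurrency)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = pool.map(send_batch, batches)
    return [r for batch in results for r in batch]

files = [f"file{i:03d}.dat" for i in range(10)]
out = transfer(files, concurrency=4)
print(len(out))  # 10
```

As the slide notes, a service can choose these knobs automatically from what it knows about a transfer (file count, sizes, endpoints) rather than asking the user to tune them.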
28. Exploiting multiple paths
• Take advantage of multiple interfaces in multi-homed data
transfer nodes
• Use circuit as well as production IP link
• Data will flow even while the circuit is being set up
• Once circuit is set up, use both paths to improve throughput
Raj Kettimuthu, Ezra Kissel, Martin Swany, Jason Zurawski, Dan Gunter
29. Exploiting multiple paths
[Charts: transfer between NERSC and ANL, and between UMich and Caltech; the multipath configuration outperforms the single path in both.]
Default, commodity IP routes
+ Dedicated circuits
= Significant performance gains
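Why splitting a transfer across both paths helps can be shown with simple arithmetic: if data is divided in proportion to each path's rate, all pieces finish together and the effective rate is the sum. The numbers below are illustrative, not measurements from these experiments:

```python
def completion_time_gb(size_gb, rates_gbps):
    """Seconds to move `size_gb` gigabytes when data is split across
    paths in proportion to each path's rate (so all finish together)."""
    total_rate = sum(rates_gbps)        # aggregate throughput, Gb/s
    return size_gb * 8 / total_rate     # gigabytes -> gigabits, then divide

# Illustrative numbers: a 1 Gb/s commodity IP path plus a 2 Gb/s circuit.
single = completion_time_gb(100, [1.0])        # IP path alone
multi = completion_time_gb(100, [1.0, 2.0])    # both paths together
print(round(single), round(multi))  # 800 267
```

In practice the gain is smaller than this ideal, since circuit setup time and uneven path quality eat into the aggregate, which is why the slide's approach starts data flowing on the IP path while the circuit is still being provisioned.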
30. Duration of runs, in seconds, over time.
Red: >10 TB transfer; green: >1 TB transfer.
[Scatter plot: run duration on a log scale from under 1 second to over 1 week, for transfers during 2011–2012.]
38. Catalog as a service
Approach:
• Hosted user-defined catalogs
• Based on tag model <subject, name, value>
• Optional schema constraints
• Integrated with other Globus services
Three REST APIs:
• /query/ – Retrieve subjects
• /tags/ – Create, delete, retrieve tags
• /tagdef/ – Create, delete, retrieve tag definitions
Builds on USC Tagfiler project (C. Kesselman et al.)
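The tag model behind the three APIs can be sketched as an in-memory catalog: tag definitions (/tagdef/) constrain tags, tags (/tags/) annotate subjects with <subject, name, value> triples, and queries (/query/) retrieve subjects by tag. This is a toy sketch, not the Tagfiler or Globus REST API:

```python
# Toy tag catalog mirroring the three operations: define tags
# (/tagdef/), attach <subject, name, value> tags (/tags/), and
# retrieve subjects by tag (/query/). Not the actual REST API.

class TagCatalog:
    def __init__(self):
        self.tagdefs = {}   # tag name -> expected value type
        self.tags = []      # list of (subject, name, value) triples

    def define_tag(self, name, value_type=str):
        self.tagdefs[name] = value_type

    def add_tag(self, subject, name, value):
        # Optional schema constraint: value must match the definition.
        if name not in self.tagdefs:
            raise KeyError(f"undefined tag: {name}")
        if not isinstance(value, self.tagdefs[name]):
            raise TypeError(f"bad value type for {name}")
        self.tags.append((subject, name, value))

    def query(self, name, value):
        # Retrieve subjects carrying a given <name, value> tag.
        return [s for (s, n, v) in self.tags if n == name and v == value]

cat = TagCatalog()
cat.define_tag("instrument")
cat.add_tag("run-001.h5", "instrument", "APS")
cat.add_tag("run-002.h5", "instrument", "ALS")
print(cat.query("instrument", "APS"))  # ['run-001.h5']
```

Hosting this catalog as a service means collaborators share one annotation space without standing up their own metadata database.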
44. Our vision for a 21st century
cyberinfrastructure
To provide more capability for
more people at substantially
lower cost by creatively
aggregating (“cloud”) and
federating (“grid”) resources
“Science as a service”
45. Thank you to our sponsors!
Speaker notes
The Computation Institute (or CI): a joint initiative between UChicago and Argonne National Lab. A place where researchers from multiple disciplines come together and engage in research that is fundamentally enabled by computation. More recently we've been talking about it as the home of the research cloud, and I'll describe what we mean by that throughout this talk.
Here are some of the areas where we have active projects. Focus on areas of particular interest to I2/ESnet, namely HEP, climate change, and genomics (up and coming).
And the reason is pretty obvious… This chart and others like it are becoming a cliché in next-gen sequencing and big data presentations, but the point is that while Moore's law translates to roughly a 10x increase in processor power, data volumes are growing many orders of magnitude faster. And meanwhile, other necessary resources [money, people] are staying pretty flat. So we have a crisis, and we hear that the magic bullet of "the cloud" is going to solve it. Well, as far as cost goes, clouds are helping, but many issues remain.
Another example is the Earth System Grid, which provides data and tools to over 20,000 climate scientists around the world. So what's notable about these examples? It's the combination of the amount of data being managed and the number of people that need access to that data. We heard Martin Leach tell us that the Broad Institute hit 10 PB of spinning disk last year, and that it's not a big deal. To a select few, these numbers are routine. And for the projects I just talked about, the IT infrastructure is in place. They have robust production solutions, built by substantial teams at great expense: sustained, multi-year efforts, and application-specific solutions, built mostly on common/homogeneous technology platforms.
The point is, the 1% of projects are in good shape
But what about the 99% set? There are hundreds of thousands of small and medium labs around the world that are faced with similar data management challenges. They don't have the resources to deal with these challenges, so their research suffers, and over time many may become irrelevant. So at the CI we asked ourselves a question (many questions, actually) about how we can help avert this crisis. And one question that kind of sums up a lot of our thinking is…
Lewis Carroll. End-to-end crisis.
Can't just expect to throw more people and $$$ at the problem; we're already seeing the limits.
Many in this room are probably users of Dropbox or similar services for keeping their files synced across multiple machines. Well, the scientific research equivalent is a little different.
We figured it needs to allow a group of collaborating researchers to do many or all of these things with their data, and not just the 2 GB of PowerPoints or the 100 GB of family photos and videos, but the petabytes and exabytes of data that will soon be the norm for many.
So how would such a dropbox for science be used? Let's look at a very typical scientific data workflow. Data is generated by some instrument (a sequencer at JGI, or a light source like APS/ALS). Since these instruments are in high demand, users have to get their data off the instrument to make way for the next user, so the data is typically moved from a staging area to some type of ingest store. Etcetera for analysis, sharing of results with collaborators, annotation with metadata for future search, backup/sync/archival, …
Started with the seemingly simple/mundane task of transferring files … etc.
Extensible Session Protocol. A session provides context for a data transfer (OSI stack layer 5): connections, forwarding, application context, etc. XSP provides mechanisms to configure dynamic network circuits. Ezra Kissel and Martin Swany have developed a Globus XIO driver for XSP.
Preliminary GridFTP test results have demonstrated that using the default, commodity IP routes in conjunction with dedicated circuits provides significant performance gains. In each case, our reservable circuit capacity was limited to 2 Gb/s because of capacity caps, although we note that due to bandwidth "scavenging" enabled in the circuit service, we frequently see average rates above the defined bandwidth limit.
XIO-XSP is a Globus XIO driver. It provides an integrated XSP client for GridFTP, and includes path provisioning and instrumentation for transfers over XSP sessions. XSPd (a daemon) implements the protocol frontend: it accepts on-demand reservation requests from clients, signals OSCARS, and monitors circuit status. OSCARS circuits are provisioned to end-hosts, with either bandwidth or circuits on demand.
And when we spoke with IT folks at various research communities, they insisted that some things were not up for negotiation.