Simplifying Data Management Tasks with Globus
Tanu Malik, Ian Foster, Kyle Chard, Roselyne Tchoua,
Joseph Baker, Mike Gurnis, Jonathan Goodall, Scott Peckham
GeoDataspace
Share and Reproduce
Alice wants to share her models and
simulation output with Bob, and Bob wants
to re-execute Alice’s application to validate
her inputs and outputs.
Alice’s Options
1. Create a tar and gzip archive
2. Build a website with model code,
parameters, and data
3. Create a virtual machine
Bob’s Frustration
1. I cannot find the lib.so required to build the model.
2. How do I?
Lack of easy and efficient methods for sharing and reproducibility
[Figure axes: amount of pain Bob suffers; amount of pain Alice suffers]
GeoDataspace
• Goal: Sharing and reproducibility hand-in-hand
• Target users: Computational geoscientists
• Data and model integration
• Research Output is More Than "Just" a Research Paper
GeoDataspace CI Components
• The geounits
  – Units of scientific activity/research output
  – How to capture and track this activity
• Globus Catalog
  – A scalable, flexible catalog for annotations conforming to the open-world assumption
• Globus Publish and reproduce geounits
  – Share/Publish geounits for others
  – Replay geounits for analysis
geounits:
package data, source code, and environment
geounit Client:
Provenance is key
1. audit <program name>
2. PROV-compliant database
3. exec <program name>
[activity]
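The audit step can be pictured as running a program while recording what it used and what it produced. The sketch below is not the geounit Client, which audits through ptrace; it is a minimal stand-in that wraps a command with `subprocess` and returns a small PROV-style record (activity, used, generated). The function name `audit`, the record layout, and the caller-declared input and output lists are all assumptions made for illustration.

```python
import hashlib
import subprocess
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def audit(command, inputs, outputs):
    """Run `command` and return a PROV-style record of the run.

    Hypothetical helper: inputs and outputs are declared by the caller,
    whereas a ptrace-based auditor would discover them automatically.
    """
    used = [{"path": str(p), "sha256": sha256_of(p)} for p in inputs]
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(command, check=True)
    ended = datetime.now(timezone.utc).isoformat()
    generated = [{"path": str(p), "sha256": sha256_of(p)} for p in outputs]
    return {
        "activity": {"command": list(command), "started": started,
                     "ended": ended, "returncode": result.returncode},
        "used": used,
        "generated": generated,
    }


def demo():
    """Audit a tiny program that upper-cases a file."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "input.txt"
        dst = Path(tmp) / "output.txt"
        src.write_text("plate motion\n")
        script = (
            "import sys; "
            "open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
        )
        command = [sys.executable, "-c", script, str(src), str(dst)]
        record = audit(command, inputs=[src], outputs=[dst])
        return record, dst.read_text()
```

Calling `demo()` returns the record together with the output text. Storing such records in a PROV-compliant database, and replaying the recorded command for the exec step, are left out of this sketch.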
geounit Client:
Features
• Based on the ptrace and okapi functionality of CDE (Code, Data, Environment)
• Data/code can be local or distributed
• Data/code files are not materialized into the package until it is ready to share; until then the package holds only descriptions
• Specify granularity of auditing
• Partial replay
• Unpack into Docker or Vagrant
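The idea of keeping only descriptions in the package until sharing time can be sketched as follows. This is an illustration under stated assumptions, not the geounit Client's package format: the class names `FileDescription` and `Package`, the `files/` subdirectory, and the choice of size plus SHA-256 as the description are hypothetical, and only local files are handled.

```python
import hashlib
import shutil
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class FileDescription:
    """Description of a file that stays where it is until sharing."""
    source: str   # local path in this sketch
    size: int
    sha256: str


@dataclass
class Package:
    """Hypothetical package that holds descriptions, not file contents."""
    root: Path
    descriptions: list = field(default_factory=list)

    def describe(self, path):
        """Record a description of `path` without copying it."""
        data = Path(path).read_bytes()
        desc = FileDescription(str(path), len(data),
                               hashlib.sha256(data).hexdigest())
        self.descriptions.append(desc)
        return desc

    def materialize(self):
        """Copy every described file into the package; call when sharing."""
        files_dir = self.root / "files"
        files_dir.mkdir(parents=True, exist_ok=True)
        copied = []
        for desc in self.descriptions:
            target = files_dir / Path(desc.source).name
            shutil.copy2(desc.source, target)
            copied.append(target)
        return copied


def demo():
    """Show that files enter the package only when it is materialized."""
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "model.cfg"
        data_file.write_text("resolution=1deg\n")
        package = Package(root=Path(tmp) / "geounit")
        package.describe(data_file)
        before = (package.root / "files").exists()
        copied = package.materialize()
        after = [p.name for p in copied if p.exists()]
        return before, after
```

`demo()` returns `(False, ["model.cfg"])`: nothing is copied when the file is described, and the file appears in the package only after `materialize()`.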
Globus Catalog:
hosts geounits
• Dataset Management Model
  – Catalog: a hosted resource that enables the grouping of related datasets
  – Dataset: a virtual collection of (schemaless) metadata and distributed data elements, namely files and provenance
  – Annotation: a piece of metadata that exists within the context of a dataset or data member
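The three terms of the dataset management model map naturally onto a few small types. The sketch below is one possible in-memory reading of those definitions, not the Globus Catalog schema; the class and method names, and the sample dataset name and file name, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A piece of metadata attached to a dataset or a data member."""
    name: str
    value: str


@dataclass
class Member:
    """A data element of a dataset, such as a file or a provenance record."""
    reference: str
    annotations: list = field(default_factory=list)


@dataclass
class Dataset:
    """A virtual collection of schemaless metadata and data members."""
    name: str
    annotations: list = field(default_factory=list)
    members: list = field(default_factory=list)

    def annotate(self, name, value):
        """Attach a name/value annotation to the dataset itself."""
        self.annotations.append(Annotation(name, value))


@dataclass
class Catalog:
    """A hosted resource that groups related datasets."""
    name: str
    datasets: dict = field(default_factory=dict)

    def create_dataset(self, name):
        """Create an empty dataset and register it in the catalog."""
        dataset = Dataset(name)
        self.datasets[name] = dataset
        return dataset


# Illustrative usage with made-up names.
catalog = Catalog("geodataspace")
geounit = catalog.create_dataset("gplates-run-1")
geounit.annotate("domain", "solid earth")
geounit.annotate("software", "GPlates")
geounit.members.append(
    Member("rotations.gpml", [Annotation("format", "GPML")])
)
```

Because annotations are free name/value pairs rather than fixed columns, new kinds of metadata can be added to a dataset or member without changing any schema.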
Globus Catalog
• Dataset Service
  – Virtual views of data based on user-defined and/or automatically extracted metadata (annotations)
  – Implemented as a service with web and REST interfaces
  – Relies on Globus Nexus for user authentication and group management
• Client-side Tooling
  – Dataset ingest: automatic creation of datasets and extraction of metadata from various common data formats and directory structures
  – Globus endpoints: associate data (in files and directories) with one or more datasets
  – Python client library
• Integration with external services
  – Transfer: moving datasets from their storage endpoint(s) to a selected destination
• Faceted Browser Search
  – Search based on provenance entities and activities
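Faceted search over annotations can be illustrated with a few lines of plain Python. This sketch does not use the Globus Catalog service or its Python client; the dataset names, annotation names, and the helpers `search` and `facet_counts` are assumptions chosen to show the idea of filtering by annotation and then counting values of a provenance facet.

```python
from collections import Counter

# Hypothetical datasets, each described by schemaless name/value annotations.
DATASETS = {
    "run-1": {"domain": "hydrology", "activity": "regrid",
              "entity": "forcing.nc"},
    "run-2": {"domain": "hydrology", "activity": "simulate",
              "entity": "forcing.nc"},
    "run-3": {"domain": "solid earth", "activity": "simulate"},
}


def search(datasets, **criteria):
    """Return names of datasets whose annotations match every criterion."""
    return sorted(
        name for name, annotations in datasets.items()
        if all(annotations.get(k) == v for k, v in criteria.items())
    )


def facet_counts(datasets, facet, names=None):
    """Count the values of one annotation name across selected datasets."""
    selected = datasets if names is None else {n: datasets[n] for n in names}
    return Counter(a[facet] for a in selected.values() if facet in a)


hits = search(DATASETS, domain="hydrology")
counts = facet_counts(DATASETS, "activity", hits)
```

Here `hits` is `["run-1", "run-2"]`, and `counts` reports one `regrid` and one `simulate` activity among them, which is the kind of summary a faceted browser shows next to the result list.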
Globus Catalog:
REST interface
Approach
• Hosted user-defined catalogs
• Based on annotation model <dataset/member, name, value>
• Association of data members
• Fine-grained access control
• Flexible query language
  – Name:value, free text, facets,…
• Integrated with other services
/geodataspace
/geodataspace/annotation
/geodataspace/geounit
/geodataspace/geounit/annotation
/geodataspace/geounit/acl
/geodataspace/geounit/members
/geodataspace/geounit/members/annotation
/geodataspace/geounit/provenance
/geodataspace/geounit/version
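The resource paths above can be addressed from a client by joining them to a service root. The sketch below only builds URLs and an annotation triple; it makes no network calls. The host name, the dictionary keys, and the `name`/`value` query parameters are placeholders and assumptions, not the documented Globus Catalog API.

```python
from urllib.parse import quote, urlencode

# Resource paths as listed above, relative to a catalog service root.
RESOURCES = {
    "catalog": "/geodataspace",
    "catalog_annotation": "/geodataspace/annotation",
    "geounit": "/geodataspace/geounit",
    "annotation": "/geodataspace/geounit/annotation",
    "acl": "/geodataspace/geounit/acl",
    "members": "/geodataspace/geounit/members",
    "member_annotation": "/geodataspace/geounit/members/annotation",
    "provenance": "/geodataspace/geounit/provenance",
    "version": "/geodataspace/geounit/version",
}

# Placeholder host, not a real service address.
BASE_URL = "https://catalog.example.org"


def resource_url(resource, **query):
    """Build a URL for one of the listed resources with optional query terms."""
    url = BASE_URL + RESOURCES[resource]
    if query:
        url += "?" + urlencode(query, quote_via=quote)
    return url


def annotation_triple(subject, name, value):
    """Represent one annotation as a <dataset/member, name, value> triple."""
    return (subject, name, value)


url = resource_url("annotation", name="domain", value="solid earth")
triple = annotation_triple("geounit", "domain", "solid earth")
```

With these placeholders, `url` is `https://catalog.example.org/geodataspace/geounit/annotation?name=domain&value=solid%20earth`.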
Publish and Re-execute geounits
• Still in the works
• Each geounit can be published through Globus Publish and re-executed through an analysis platform
Science Drivers
Solid Earth
Space Science
Hydrology
CSDMS
Solid Earth
• Allow reproducible, replayable geounits of GPlates
• GPlates
  – Software package has several dependencies
• Create geounits of the Kinematic Representation of the Surface of the Earth (3D and 4D models)
  – GPlates software
  – GPML files (XML for plate tectonics) used in the model
  – Output: GPML files in a simple X/Y format, possibly also visualization files, a global set of visualization output, and images
• Integrating geounits in Python workflows
  – Incorporate metadata from workflows and use geounit metadata to inform workflows
Hydrology
• Data processing steps for the VIC model
[Figure labels: geounit 1, geounit 2, geounit 3, geounit 4]
Objective: Monitor changes in the data processing steps and compare them across the various runs
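Comparing processing steps across runs amounts to a structured diff over the metadata recorded for each geounit. The sketch below assumes each run is a mapping from a geounit label to the parameters captured for that step; the step descriptions and parameter values are invented for illustration and do not describe the actual VIC workflow.

```python
def changed_steps(run_a, run_b):
    """Compare two runs step by step and report what differs.

    Each run maps a step name to the parameters recorded for that step.
    Steps that are identical in both runs are omitted from the report.
    """
    report = {}
    for step in sorted(set(run_a) | set(run_b)):
        if step not in run_a:
            report[step] = "added"
        elif step not in run_b:
            report[step] = "removed"
        elif run_a[step] != run_b[step]:
            report[step] = "changed"
    return report


# Hypothetical step names and parameters, for illustration only.
run_1 = {
    "geounit 1": {"step": "download forcing", "years": "1990-2000"},
    "geounit 2": {"step": "regrid", "resolution": "0.5deg"},
    "geounit 3": {"step": "run model", "timestep": "daily"},
}
run_2 = {
    "geounit 1": {"step": "download forcing", "years": "1990-2000"},
    "geounit 2": {"step": "regrid", "resolution": "0.25deg"},
    "geounit 3": {"step": "run model", "timestep": "daily"},
    "geounit 4": {"step": "plot results"},
}

report = changed_steps(run_1, run_2)
```

For these two runs, `report` is `{"geounit 2": "changed", "geounit 4": "added"}`, pointing directly at the steps that need a closer look.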
Space Science
• Create geounits of SuperDARN data and its plotting products
• Publish them for validation
CSDMS
• How geounits should be coupled
• Metadata alignment issues
• If we create geounits of CSDMS models, how do we enable suitable search interfaces with the provenance metadata and CSDMS metadata?
Current Work
• Working with use cases to bootstrap geounits
• Populating geounits based on Python workflows and incorporating geounits in workflows
• Interfacing geounit Client with Globus Catalog
• Improving distributed search functionality
Track it!
• http://workspace.earthcube.org/geodataspace
• Software, source code, science use cases, reports, presentations, news
Acknowledgements
• National Science Foundation
• EarthCube Community
• Globus team
• CI team