We describe the automated data ingest scenario, referencing current and past research teams and their challenges. We demonstrate a web application that uses Globus to perform automated data ingest and present a faceted search interface that can be used by science gateways to simplify data discovery. We also walk through the application's GitHub repository and highlight relevant components.
Gateways 2020 Tutorial - Automated Data Ingest and Search with Globus
1. Simplifying Science Gateway Data Management with Globus
Part IV – Automated Data Ingest
October 2020, Gateways 2020
2. Phase 1 - Gather data
Gathering datasets from research partners
• Your project is gathering datasets from partners. Each dataset is several TBs and takes ~a day to transfer over the network.
• For the data to be useful, it needs descriptive metadata.
• Ultimately, the team needs to find datasets that match specific criteria.
3. What are the dataset ingest challenges?
• Getting very large datasets transferred from gateway users’ systems to the central repository
– (This is Scenario I - large-scale data transfer.)
• Generating persistent identifiers for the data in the central repository so we can link metadata to data
• Storing the metadata
• Indexing the metadata to enable searching
5. What needs to be in place for it to work?
• Data storage
– Globus Connect Server on Petrel
• Persistent identifiers
– FAIR Research Identifier Service
– Hosted at https://fair-research.org/
• Metadata storage, indexing, search
– Globus Search API
– Hosted by Globus
6. Globus Connect Server on Petrel
• Configured for self-service projects
– Researchers do not receive local (Linux) accounts!
– Uses Globus for authorization & management
• Guest collections and groups
– Project PIs request access by applying to join the “Petrel Project Owners” group (using the Globus web app)
– Admin creates Globus group, makes PI a group manager
– Admin creates guest collection, makes PI an access manager
– Admin sets a quota of 100TB for the guest collection
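The admin steps above can be scripted. As a sketch (not the Petrel admins' actual tooling), here is how granting a project group read/write access on a new guest collection might look with the Globus Python SDK's Transfer ACL interface; the collection and group IDs are hypothetical placeholders, and the live calls are left as comments since they require an authenticated client:

```python
"""Sketch: grant a project group rw access on a guest collection.
IDs and token names are hypothetical placeholders."""

def make_access_rule(group_id: str, path: str = "/") -> dict:
    """Build a Transfer 'access' document granting a group rw on a path."""
    return {
        "DATA_TYPE": "access",
        "principal_type": "group",  # grant to the whole project group
        "principal": group_id,
        "path": path,
        "permissions": "rw",
    }

# Live usage (requires globus-sdk and a token with manage-permissions rights):
#   import globus_sdk
#   tc = globus_sdk.TransferClient(
#       authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN))
#   tc.add_endpoint_acl_rule(GUEST_COLLECTION_ID,
#                            make_access_rule(PROJECT_GROUP_ID))
```

Because the rule names a group rather than an individual identity, the PI can later admit or remove collaborators purely by managing group membership.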
7. FAIR Research Identifiers
• RESTful web service, written in Python, that stores identifier metadata
• Mints (creates) identifiers from external service providers using a unified service provider interface (SPI)
• Different identifiers supported through namespaces
• Client requests served as HTML landing pages or other machine-readable formats (e.g., JSON, JSON-LD)
[Architecture diagram: web browser and API clients reach a REST API web server (Apache, Flask, Python); authN/authZ via Globus Auth and Globus Groups; an RDBMS ORM (SQLAlchemy) backed by Postgres on AWS-RDS, hosted on AWS-EC2; a registration SPI (Python) mints identifiers through DataCite (DOI), EZID (ARK), and Minid (Handle); responses render as HTML or JSON, JSON-LD, and other extensible formats.]
https://minid.readthedocs.io/en/develop/
8. FAIR Research Identifiers
• REST API provides a simple CRUD interface
• Has other capabilities, like finding identifiers by checksum
• JSON is used for request and response
• Namespaces may also have their own handlers, landing pages, and other customizations.
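Checksum lookup works because the identifier service stores a content hash for each registered file. A minimal sketch of the client side: compute a streaming sha256 checksum (suitable for multi-TB datasets) and hand it to the identifier service. The minid calls are shown only as comments, with method names as described in the minid documentation linked above; verify them against the client version you install:

```python
"""Sketch: the checksum a client would register with, or use to look up,
a FAIR Research Identifier."""
import hashlib

def file_checksum(path: str, algorithm: str = "sha256") -> str:
    """Stream a file through a hash so large datasets never load into RAM."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

# With the minid Python client (https://minid.readthedocs.io/), roughly:
#   from minid import MinidClient
#   client = MinidClient()
#   client.check(file_checksum("dataset.tar"))  # find identifiers by checksum
```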
9. Globus Search API
• RESTful API for indexing & search
– Hosted by Globus (including the metadata & index storage!)
– Each project gets an “index” object (private tenancy)
– REST API, Python client package, Python CLI
• https://docs.globus.org/api/search/
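Indexing an entry means posting a GMetaEntry ingest document to the index: a `subject` (here, the dataset's persistent identifier), a `visible_to` access list, and a free-form `content` block. A sketch, with a hypothetical identifier and metadata schema, and the authenticated SDK call left as a comment:

```python
"""Sketch: build a Globus Search GMetaEntry ingest document.
The subject identifier and metadata fields are hypothetical."""

def make_gmeta_entry(subject: str, content: dict,
                     visible_to=("public",)) -> dict:
    return {
        "ingest_type": "GMetaEntry",
        "ingest_data": {
            "subject": subject,              # the dataset's persistent identifier
            "visible_to": list(visible_to),  # Search enforces this per result
            "content": content,              # schema-agnostic metadata
        },
    }

doc = make_gmeta_entry(
    "hdl:EXAMPLE/dataset-001",                  # hypothetical handle
    {"title": "Sample dataset", "size_tb": 2},  # any schema you choose
)

# Submitting it with globus-sdk, given an authenticated SearchClient:
#   import globus_sdk
#   sc = globus_sdk.SearchClient(authorizer=...)
#   sc.ingest(SEARCH_INDEX_UUID, doc)
```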
10. Globus Search API features
• Scalable: to billions of entries
• Schema agnostic: can use standard (e.g., DataCite) or custom metadata
• Fine-grain access control: only returns results that are visible to the user
• Plain text search: ranked results
• Faceted search: for data discovery
• Rich query language: ranges, expressions, regex, fuzzy, stemming, etc.
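Faceted search is driven by the structured query document: alongside the plain-text `q`, you request `terms` facets over metadata fields, and the response buckets matching entries by each field's distinct values. A sketch with a hypothetical field name; the live call is commented out:

```python
"""Sketch: a Globus Search structured query with one terms facet.
The facet field name is a hypothetical example."""

def make_faceted_query(text: str, facet_field: str, size: int = 10) -> dict:
    """Build a query doc combining ranked text search with a terms facet."""
    return {
        "q": text,          # plain-text, ranked search
        "limit": 20,
        "facets": [{
            "name": facet_field,
            "type": "terms",            # bucket results by distinct values
            "field_name": facet_field,  # dotted path into entry content
            "size": size,
        }],
    }

query = make_faceted_query("climate", "instrument.name")

# Executing it with globus-sdk:
#   sc.post_search(SEARCH_INDEX_UUID, query)
```

A gateway UI can render the returned facet buckets as clickable filters, which is how the faceted discovery interface demonstrated in this tutorial narrows results.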
11. Key ingredients
1. UUID and base path for the guest collection where data is gathered
2. Minid Python client
3. UUID for Globus Search index
4. Your choice of appropriate metadata schema for your project’s datasets
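Those ingredients compose into a small ingest pipeline: once a dataset lands in the guest collection, mint an identifier for it, then index its metadata under that identifier as the subject. The sketch below stubs the Minid and Search clients as injected callables so the control flow is clear; names and IDs are illustrative, not the tutorial repository's actual code:

```python
"""Sketch of the ingest flow from the key ingredients above. The two
callables stand in for the Minid client and Globus Search client."""

def ingest_dataset(dataset_path: str, metadata: dict,
                   mint_identifier, search_ingest) -> str:
    """Mint a persistent identifier for a dataset already transferred to
    the guest collection, then index its metadata under that subject."""
    identifier = mint_identifier(dataset_path)  # e.g. via the Minid client
    search_ingest({                             # e.g. SearchClient ingest doc
        "ingest_type": "GMetaEntry",
        "ingest_data": {
            "subject": identifier,
            "visible_to": ["public"],
            "content": metadata,
        },
    })
    return identifier
```

In a real deployment this function would be triggered automatically after each transfer into the guest collection completes, which is the "automated" part of the ingest scenario.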
You’re working on a project with partners at other institutions, each of whom is analyzing unique samples and generating big datasets from them. You need to gather 100s of TBs of data on your campus’s HPC storage system. How can you make it easy for your partners to get the data from their labs to your server? And once it’s there, how are the partners going to understand each other’s datasets? First, they need to be able to see, in general, what’s been uploaded. Then, they need to find datasets that have specific features.
NOTE: We’re presenting this as a single project, but at Globus, we see this happening for dozens-to-hundreds of research projects on a continuous basis. Our end goal is to enable research teams to do this routinely, without special planning or extraordinary measures by individual projects.
Petrel Data: https://petreldata.net/
Data storage is provided by the Argonne Leadership Computing Facility (ALCF) at Argonne National Laboratory.
Petrel offers 100TB allocations to approved projects, with a total of 3PB of storage.
The goal is to enable projects to manage themselves, including ingest, metadata management, index & search, and sharing permissions.
PIs request access by applying to join a Globus group
Petrel admin creates a project group for the PI and makes the PI a group manager
Petrel admin creates a Globus guest collection with access managed by the PI
Petrel admin also sets a quota of 100TB for the guest collection’s directory.