Policy-based Data Management


Integrated Rule Oriented Data Grid (iRODS)

      Reagan W. Moore (DICE-UNC)
      Arcot Rajasekar (DICE-UNC)

      http://irods.diceresearch.org
What is the Opportunity Play for iRODS


 • At a high level …
   – The management of Big Data is the #1 concern for IT
      •   Life Cycle Management
      •   Useful (actionable) and searchable metadata
      •   Integrity
      •   Collaboration (Federation of Immutable data)
   – iRODS provides policy-based data management:
      • Next-generation data management cyber-infrastructure
      • System that enables a flexible, adaptive, customizable
        data management architecture
      • Tool for large collections (petabytes, hundreds of millions of files)
We will touch on …
   – Properties of policy-based data management systems
   – Management of the data life cycle (project collection, digital library, persistent
     archive, processing pipeline)
   – Applications of iRODS
      • LifeTime™ Library (digital library for students)
      • Genomics data grid
      • Carolina Digital Repository (institutional repository)
      • French National Library (IT automation)
      • DataNet Federation Consortium (data and workflow sharing for collaborative
        research)
     1. What iRODS is and what problems it is solving today, and tomorrow.
     2. Speak to different use cases (there will be many companies attending
        representing many departments with different opportunities/problems)
         a. Digitization of University Assets- Library archive
         b. Genomic pipeline automation
         c. IT service automation
Topics

• Principles behind policy-based data management
  – Enable collaborative research
  – Enable reproducible science
  – Enable creation of reference collections


• Integrated Rule-Oriented Data System (iRODS)
  – Enforce management policies
  – Automate administrative functions
  – Validate assessment criteria
Shared Collections – Data Grid

 Client                       50 clients: web browser, unix shell command, …
 Data Grid                    Data grid middleware provides global name, single sign-on,
                              policy enforcement, metadata, replication
 File System / Tape Archive   Multiple types of systems can be used to store data
Policy-based Data Management

 [Diagram: a client accesses a logical collection (data grid) that spans two iRODS servers]
 • Client
 • Logical Collection (data grid)
 • iRODS server (two shown), each with: Rule engine, Rule base, Workflows, Storage

 Consensus on Policies and Procedures controls the Data Collection
Policy-Based Data Environments
 • Purpose
    – Reason a collection is assembled
 • Properties
    – Attributes needed to ensure the purpose
 • Policies
    – Controls for enforcing desired properties,
         • mapped to computer actionable rules
 • Procedures
    – Functions that implement the policies
         • mapped to computer actionable workflows
 • Persistent state information
    – Results of applying the procedures
         • mapped to system metadata
 • Property verification
    – Validation that state information conforms to the desired purpose
         • mapped to periodically executed policies
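
The chain above (policy, procedure, persistent state, verification) can be sketched in a few lines of Python. This is only an illustration of the idea, not iRODS code: the `Policy` class, the in-memory `catalog` dictionary, and the function names are all hypothetical stand-ins for rules, workflows, and system metadata.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List

Catalog = Dict[str, dict]  # hypothetical: logical name -> {"data": bytes, ...state}


@dataclass
class Policy:
    """One control for one desired property (hypothetical model)."""
    name: str
    procedure: Callable[[Catalog], None]    # implements the policy, writes state
    verify: Callable[[Catalog], List[str]]  # returns names that violate the property


def record_checksums(catalog: Catalog) -> None:
    # Procedure: store a checksum as persistent state for every object.
    for meta in catalog.values():
        meta["checksum"] = hashlib.sha256(meta["data"]).hexdigest()


def find_corrupt(catalog: Catalog) -> List[str]:
    # Verification: recompute each checksum and compare it with the stored state.
    return sorted(
        name for name, meta in catalog.items()
        if hashlib.sha256(meta["data"]).hexdigest() != meta.get("checksum")
    )


# Property "integrity" -> policy -> procedure + periodic verification.
integrity_policy = Policy("integrity", record_checksums, find_corrupt)
```

Running `integrity_policy.procedure(catalog)` once and `integrity_policy.verify(catalog)` on a schedule mirrors the split between applying a procedure and periodically validating the resulting state.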



Community-based Collection Life Cycle

 • The driving purpose changes at each stage of the data life cycle

 Stage                       Data         Policy
 Project Collection          Private      Local Policy
 Data Grid                   Shared       Distribution Policy
 Data Processing Pipeline    Analyzed     Service Policy
 Digital Library             Published    Description Policy
 Reference Collection        Preserved    Representation Policy
 Federation                  Sustained    Re-purposing Policy

 Stages correspond to addition of new policies for a broader community
 Virtualize the stages of the collection life cycle through policy evolution
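
A minimal sketch of "stages correspond to addition of new policies", assuming that each stage keeps the policies of the earlier stages and adds one of its own. The slide does not state that the sets are strictly cumulative, so treat that as an assumption; the function name is hypothetical.

```python
# (stage, data state, policy added), taken from the life-cycle slide.
STAGES = [
    ("Project Collection", "Private", "Local"),
    ("Data Grid", "Shared", "Distribution"),
    ("Data Processing Pipeline", "Analyzed", "Service"),
    ("Digital Library", "Published", "Description"),
    ("Reference Collection", "Preserved", "Representation"),
    ("Federation", "Sustained", "Re-purposing"),
]


def policies_in_force(stage_name):
    """Return the policy types accumulated up to and including stage_name."""
    active = []
    for name, _state, policy in STAGES:
        active.append(policy)
        if name == stage_name:
            return active
    raise KeyError(f"unknown stage: {stage_name}")
```

Under this assumption, moving a collection to the next stage means extending its rule base rather than migrating the data.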
Applications

 • Data Grids                                   (data sharing)
      – Ocean Observatories Initiative
      – The iPlant Collaborative
      – National Optical Astronomy Observatory
      – BaBar High Energy Physics
      – Broad Institute genomics data grid
      – Wellcome Trust Sanger Institute genomics data grid
 • Digital Libraries                            (data publication)
      – Texas Digital Library
      – French National Library
      – UNC-CH SILS LifeTime Library
 • Repositories / Archives                      (data preservation)
      – NASA Center for Climate Simulation
      – Carolina Digital Repository
Sequencing Work – an Infrastructure View

 Managing several hundred TBs of genomic data

 [Diagram: data flow across the sequencing infrastructure]
 • Data Production: UNC HTSF, Third Party Vendors
 • RENCI Infrastructure (Test/Development): Pipelines, Genome Databases
 • RC, RENCI, LCCC, HTSF Infrastructure (Production; hardware: ITS; software: LCCC,
   UNC High Throughput Sequencing facility): Archive, Pipelines, Genome Databases, VarDB, Hadoop
 • Processing: Distributed ad-hoc processing, iRODS data-grid managed processing
 • Genome Annotations: RefSeq, dbSNP, HGMD, 1000 Genomes
 • National Resources: RENCI Science Portal, Open Science Grid, TeraGrid, UNC BASS, …
 • Data Sharing: Local (TUCASI), NIH, Other Institutions
 • Clinical Data Systems: NCGenes, Secure Medical Workspace
Managing Data on the Research Side

 [Diagram: research storage and compute resources, moving from "Wild West" to "Managed"]
 • UNC: Storage (Tape, Drives), UNC HPC
 • RENCI: Storage (Tape, Drives), RENCI HPC, RENCI Hadoop
 • Genomics: Genomics Storage, Genomics HPC, Genomics Hadoop
 • Lab Machines, IT Machines
 • External Compute: Open Science Grid, Clemson, Clouds
 • External Partners: NIH
 • Data Providers: Researchers, Students, IT Staff, External Collaborators

 iRODS gracefully allows for introducing control:
 • Data movement and replication
 • Metadata standards
 • Archival, deletion, and retention
 • Integration with workflows, Hadoop, databases
 • Hiding complexities
 • Automation
 • …, all policy driven
 • …, without breaking the in-place systems
SILS LifeTime Library
 • Student digital libraries
    – Enable students to build collections of
        • Photographs
        • MP3 audio files
        • Class documents
        • Video
        • Web site archive
 • Resources provided by School of Information and
   Library Science at UNC-CH
    – Student collections range from 2 GB to 150 GB
    – Number of files from 2,000 to 12,000
SILS LifeTime Library Policies

 • Library management
     – Replication
     – Checksums
     – Versioning
     – Strict access controls
     – Quotas
     – Metadata catalog replication
     – Installation environment archiving
 • Ingestion
     – Automated synchronization of student directory
       with LifeTime Library
     – Automated loading of MP3 metadata
Carolina Digital Repository




    Policy-Driven Repository Infrastructure project
funded by the Institute of Museum and Library Services
Carolina Digital Repository
Ingest Workflow
iRODS Data Grid

 • More than 50 different clients have been used to
   interact with the data grid
      – Web browsers                 (iDrop-web, Rich Web client)
      – Web services                 (VOSpace)
      – Load libraries               (Python, Java)
      – I/O libraries                (C, C++, Fortran)
      – File systems                 (FUSE, WebDAV, Parrot)
      – Synchronization interfaces   (iDrop)
      – Unix tools / Grid tools      (icommands, SAGA, SRM, GriPhyN)
      – Workflows                    (Kepler, Taverna)
      – Digital Libraries            (Fedora, DSpace)
      – Portals                      (EnginFrame)
Managing Information & Knowledge

Concepts
 • Data            objects
 • Information     names
 • Knowledge       relationships between names
 • Wisdom          relationships between relationships

Implementation
 • Data            bytes                        Storage system
 • Information     metadata                     Relational database
 • Knowledge       policies / procedures        Rule base / Rule engine
 • Wisdom          policy enforcement point     Data Grid
Data Virtualization

 Layers, from client to storage:
     Access Interface
     Policy Enforcement Points
     Standard Micro-services
     Standard I/O Operations
     Storage Protocol
     Storage System

 • Map from the actions requested by the client to multiple policy enforcement points.
 • Map from policy to standard micro-services.
 • Map from micro-services to standard POSIX I/O operations.
 • Map standard I/O operations to the protocol supported by the storage system.
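
The four mappings can be pictured as a chain of lookup tables. The sketch below is a toy model only: the action, enforcement-point, micro-service, and driver names are invented for illustration and are not iRODS identifiers.

```python
# Hypothetical tables, one per mapping described on the slide.
ACTION_TO_PEPS = {"put": ["pre_put", "do_put", "post_put"]}
PEP_TO_MICROSERVICES = {
    "pre_put": ["choose_resource"],
    "do_put": ["write_object"],
    "post_put": ["record_checksum"],
}
MICROSERVICE_TO_IO = {
    "choose_resource": ["stat"],
    "write_object": ["open", "write", "close"],
    "record_checksum": ["open", "read", "close"],
}
# Each storage driver translates a standard I/O operation to its own protocol.
DRIVERS = {
    "unix_fs": lambda op: f"posix:{op}",
    "tape": lambda op: f"tape:{op}",
}


def resolve(action, storage):
    """Expand one client action into storage-protocol calls, layer by layer."""
    calls = []
    for pep in ACTION_TO_PEPS[action]:
        for microservice in PEP_TO_MICROSERVICES[pep]:
            for io_op in MICROSERVICE_TO_IO[microservice]:
                calls.append(DRIVERS[storage](io_op))
    return calls
```

Only the last table depends on the storage system, which is the point of the layering: swapping `unix_fs` for `tape` changes the final translation and nothing above it.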
System and User-driven Rules


 • The data grid automatically applies rules
   defined in the rule base, core.re
 • You can define rules that are applied
   interactively, or that are deferred for later
   execution
     irule -F rule-file.r
Example Rule


 • Write “Hello World”
 • Create a rule file called ruleHello.r:
     myTestRule {
       writeLine("stdout", "Hello World");
     }
     INPUT null
     OUTPUT ruleExecOut

 • Run the rule:
     irule -F ruleHello.r
Production Integrity Rule
 • Verify all input parameters for consistency.
 • Query the iRODS metadata catalog to retrieve status information.
 • Verify the integrity of each file in a collection.
 • Update all replicas to the most recent version.
 • Minimize the load on production services through a deadline scheduler.
 • Differentiate between the logical name for a file and the physical replica locations.
 • Identify all missing replicas and document their lack.
 • Create new replicas to replace missing replicas.
 • Implement load leveling to distribute the new replicas across the storage systems.
 • Create a log file that records all repair operations performed upon the collection.
 • Track progress of the policy execution.
 • Initialize the rule for the first execution.
 • Enable restart of the process from the last set of checked files in case of a system halt.
 • Manipulate files in batches of 256 files at a time to handle arbitrarily large collections.
 • Minimize the number of sleep periods used by the deadline scheduler.
 • Include the checking of new files that have been added during the execution of the policy.
 • Write out statistics about the effective execution rate, and the number of files checked.
Workflow Management & Registration

 eCWkflow.mss
     Workflow file
 /earthCube/eCWkflow
     Directory holding all input and output files associated with the workflow file
     (mounted collection that is linked to the workflow file)
     eCWkflow.run, eCWkflow2.run
         Automatically generated run file for executing each input file
     eCWkflow.mpf, eCWkflow2.mpf
         Input parameter file; lists parameters and input and output file names
 /earthCube/eCWkflow/eCWkflow.runDir0
     Directory holding all output files generated for invocation of eCWkflow.run;
     the version number is incremented
     Outfile
 /earthCube/eCWkflow/eCWkflow2.runDir0
     Newfile
         Output file created for eCWkflow.mpf
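
The run-directory naming can be sketched as follows, assuming the version number starts at 0 and increases by one per invocation of the same run file. The slide only says the number "is incremented", so the exact scheme is an assumption, and `next_run_dir` is a hypothetical helper, not an iRODS function.

```python
def next_run_dir(workflow_dir, run_file, existing_dirs):
    """Return the output directory name for the next invocation of run_file.

    Assumes directories are named '<run_file>Dir<N>', e.g. 'eCWkflow.runDir0'.
    """
    prefix = f"{workflow_dir}/{run_file}Dir"
    used = [
        int(path[len(prefix):])
        for path in existing_dirs
        if path.startswith(prefix) and path[len(prefix):].isdigit()
    ]
    version = max(used) + 1 if used else 0
    return f"{prefix}{version}"
```

Keeping one directory per invocation is what lets earlier outputs survive a re-execution of the same workflow.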
Publications

 • Rajasekar, A., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L.
   Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B.
   Zhu, “iRODS Primer: Integrated Rule-Oriented Data System”,
   Morgan & Claypool, 2010.

 • Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T.
   Russell, H. Xu, R. Moore, “The integrated Rule-Oriented Data
   System (iRODS 3.0) Micro-service Workbook”, DICE
   Foundation, November 2011, ISBN: 9781466469129,
   Amazon.com
iRODS - Open Source Software



                       Reagan W. Moore
                     rwmoore@renci.org
                 http://irods.diceresearch.org



NSF OCI-0940841 “DataNet Federation Consortium”
NSF OCI-1032732 “Improvement of iRODS for Multi-Disciplinary Applications”
NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype”
NSF SDCI-0721400 “Data Grids for Community Driven Applications”
iRODS Distributed Data Management
Initializing Workflow Parameters
*Val = "0";
# count existing TEST_DATA_ID attributes on the collection
msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
foreach (*GenQOut2) {
  msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
}
# first execution: create the attribute with value 0
if (int(*Val) == 0) {
  *Str1 = "TEST_DATA_ID=0";
  msiString2KeyValPair(*Str1, *kvp);
  msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
  writeLine("*Lfile", "added TEST_DATA_ID attribute to collection *Coll");
}
# on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
msiExecGenQuery(*GenQInp2, *GenQOut2);
foreach (*GenQOut2) {
  msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
}
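
The same initialize-or-resume logic, reduced to a Python sketch for readers unfamiliar with the rule language. The collection metadata is modeled as a plain dictionary, and `init_or_resume` is a hypothetical name; only the attribute name `TEST_DATA_ID` comes from the rule above.

```python
def init_or_resume(coll_meta, log):
    """Create TEST_DATA_ID on first execution, then return its current value."""
    if "TEST_DATA_ID" not in coll_meta:
        # First execution: no checkpoint yet, so start from 0.
        coll_meta["TEST_DATA_ID"] = "0"
        log.append("added TEST_DATA_ID attribute to collection")
    # On a restart the stored value is greater than 0.
    return int(coll_meta["TEST_DATA_ID"])
```

Because the checkpoint lives in the collection's metadata rather than in the running rule, a restarted execution picks up where the previous one stopped.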
Workflow Operations Used
 • Arithmetic (+, -, *, /)
 • Boolean tests (==, !=, &&, ||, >, <, >=)
 • Conditional statements
    – if / then / else
 • Control
    – break / fail
 • Loops
    – for / foreach / while
 • List manipulation
    – initialization / list addition (cons) / extracting an element from a
      list (elem) / updating an element in a list (setelem)
 • Variable manipulation
    – initialization / type conversion (int, double, str)
Micro-services Used
 • Metadata catalog manipulation
    – msiGetValByKey                   get metadata from structure
    – msiExecStrCondQuery              execute string conditional query
    – msiString2KeyValPair             convert string to key-value pair
    – msiAssociateKeyValuePairsToObj   add metadata
    – msiMakeGenQuery                  create a query
    – msiExecGenQuery                  execute a query
    – msiCloseGenQuery                 release query buffers
    – msiGetContInxFromGenQueryOut     check for more rows
    – msiRemoveKeyValuePairsFromObj    remove metadata
    – msiGetMoreRows                   get more rows from query
Micro-services Used

 • Data and directory manipulation
    – msiIsColl           check whether name is a collection
    – msiCollCreate       create a collection
    – msiDataObjCreate    create a file
    – msiDataObjRepl      replicate a file
    – msiDataObjChksum    checksum a file
    – msiDataObjUnlink    delete a file

 • System functions
    – msiGetSystemTime    get the system time
    – writeLine           write a line to a file or standard out
    – msiSleep            sleep
Performance at RENCI

• Execute call to rule engine   18 msecs
• Execute metadata query        714 msecs

• Disk seek latency             5 msecs
• Disk rotational latency       11 msecs

• Production loop logic         6.3 msecs
• Checksum verification         21 msecs
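
A rough way to combine these timings, offered as a back-of-the-envelope sketch only. It assumes one rule-engine call and one metadata query per batch of 256 files, and that loop logic and checksum verification are paid once per file; the slides do not say how the costs combine, so the result is illustrative rather than a measured rate.

```python
# Timings from the slide, in milliseconds.
RULE_ENGINE_CALL_MS = 18.0
METADATA_QUERY_MS = 714.0
LOOP_LOGIC_MS = 6.3
CHECKSUM_MS = 21.0
BATCH_SIZE = 256  # batch size from the production integrity rule


def batch_time_ms(batch_size=BATCH_SIZE):
    """Estimated time for one batch under the stated assumptions."""
    per_file = LOOP_LOGIC_MS + CHECKSUM_MS
    return RULE_ENGINE_CALL_MS + METADATA_QUERY_MS + batch_size * per_file


def files_per_second(batch_size=BATCH_SIZE):
    """Estimated throughput implied by batch_time_ms."""
    return batch_size / (batch_time_ms(batch_size) / 1000.0)
```

Under these assumptions the fixed query cost is amortized over the batch, which is one reason to process files in batches rather than one query per file.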
Data Analysis Use Cases
• Demonstrate reproducible science. A use case could include the
  registration, storage, sharing, and re-execution of a workflow. The hypoxia
  use case from the Cross-Domain and Brokering Concept groups could be
  used as an example.
• Automate data retrieval. A use case could demonstrate remote access to a
  data collection, retrieval of desired data sets, transformation, and use in
  an analysis workflow. An eco-hydrology example that automates access
  to digital elevation maps and land use coverage is being built.
• Integrate community resources with collaboration environments. An
  example would be use of the DAB protocol to identify and cache local
  copies of relevant data sets for local analysis.
• Integrate multiple community resources. A use case could be
  demonstration of invocation of multiple workflow systems within the
  same analysis. An example is the integration of Cyber-integrator
  workflow with collaboration environments to support drought prediction.
Eco-Hydrology

 RHESSys workflow to develop a nested watershed parameter file (worldfile)
 containing a nested ecogeomorphic object framework, and full, initial system state.

 [Diagram: RHESSys workflow]
 • Main sequence: Choose gauge or outlet (HIS); Extract drainage area (NHDPlus);
   Digital Elevation Model (DEM); Nested watershed structure (Strata, Patch,
   Hillslope, Basin); Stream network; Worldfile; RHESSys
 • Terrain layers: Slope, Aspect
 • Other elements: Streams (NHD), Roads (DOT), Flowtable
 • Soil and vegetation parameter files: Land Use (NLCD, EPA), Leaf Area Index
   (Landsat TM), Phenology (MODIS), Soil Data (USDA)
iRODS Rule for RHESSys

 • Modular workflow composed by chaining basic transformations
    – Define input variables
    – Call functions to apply each transformation step
    – Store results in shared collection

main {
 getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
 convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
 extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "n"));
 importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath,
*newLocObjPath);
 delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
}
extractTileFromNHD_DEM(*extentCoords) {
# Split path to object into collection and name
  msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
  writeLine("serverLog", *nhdDEMObjColl);
  writeLine("serverLog", *nhdDEMObjName);
# Build query to discover physical path
  msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
  msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
  msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
  msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
# Run query
  msiExecGenQuery(*genQInp, *genQOut);
# Extract path from query result
  foreach (*genQOut) {msiGetValByKey(*genQOut, "DATA_PATH", *filePath); }
  writeLine("serverLog", *filePath);
# Determine physical path of input directory
  msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
# Generate physical path of output file
  msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
  *tileFileName = "SUBSET-"++*rasterDatasetName++".img";
  *tileFilePath = *inFileParentDir++"/"++*tileFileName;
# Generate iRODS path of output
  msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
  *tileObjPath = *nhdDEMObjCollParent++"/"++*tileFileName;
  *args = "-of HFA -projwin "++*extentCoords++" "++"'*inFileDir'"++" "++"'*tileFilePath'";
  writeLine("serverLog", *args);
  msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
  writeLine("serverLog", *cmd_out);
# Register tile file with iRODS
  msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
}
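
The path handling in this rule can be hard to follow, so here is the same derivation as a Python sketch. It only computes the names and the `gdal_translate` argument list; it does not query a catalog, run GDAL, or register anything. The function name and the example paths are hypothetical.

```python
import posixpath


def plan_tile_extraction(file_path, obj_coll, extent_coords):
    """Derive output names and the gdal_translate arguments for one tile.

    file_path:     physical path of a file inside the raster dataset directory
    obj_coll:      logical (iRODS) collection of the DEM object
    extent_coords: projwin extent as a space-separated string
    """
    # Physical path of the input directory (drop the file name).
    in_file_dir, _header_file = posixpath.split(file_path)
    # Parent directory and dataset name give the output file location.
    in_parent_dir, raster_name = posixpath.split(in_file_dir)
    tile_file_name = "SUBSET-" + raster_name + ".img"
    tile_file_path = posixpath.join(in_parent_dir, tile_file_name)
    # Logical path of the output, next to the input collection.
    obj_coll_parent, _ = posixpath.split(obj_coll)
    tile_obj_path = posixpath.join(obj_coll_parent, tile_file_name)
    # Same options as the rule: HFA output format, window given by the extent.
    argv = ["gdal_translate", "-of", "HFA", "-projwin",
            *extent_coords.split(), in_file_dir, tile_file_path]
    return {
        "tile_file_path": tile_file_path,
        "tile_obj_path": tile_obj_path,
        "argv": argv,
    }
```

Passing the arguments as a list avoids the manual quoting that the rule needs when it builds a single argument string.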
Summary

 • iRODS is a powerful policy-based engine for managing
   next-generation Big Data cyber-infrastructures
 • Enables a flexible, adaptive, and customizable
   data management architecture
 • “Canned” scripts (policies) can be created to
   standardize and automate user processes
    – Simple menu-driven interface
    – No CS degree needed
 • iRODS is the middleware for
        Distributed Data Management
Thank you




Questions?

Developments in datamanagementSURFnet
 
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...Larry Smarr
 
CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...
CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...
CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...SEAD
 
Centralizing sequence analysis
Centralizing sequence analysisCentralizing sequence analysis
Centralizing sequence analysisDenis C. Bauer
 
ASIH Fishnet2 Presentation
ASIH Fishnet2 PresentationASIH Fishnet2 Presentation
ASIH Fishnet2 PresentationDave Vieglais
 
Functional and Architectural Requirements for Metadata: Supporting Discovery...
Functional and Architectural Requirements for Metadata: Supporting Discovery...Functional and Architectural Requirements for Metadata: Supporting Discovery...
Functional and Architectural Requirements for Metadata: Supporting Discovery...Jian Qin
 
Cloud Technical Challenges
Cloud Technical ChallengesCloud Technical Challenges
Cloud Technical ChallengesGuy Coates
 
Stephen Friend CRUK-MD Anderson Cancer Workshop 2012-02-28
Stephen Friend CRUK-MD Anderson Cancer Workshop 2012-02-28Stephen Friend CRUK-MD Anderson Cancer Workshop 2012-02-28
Stephen Friend CRUK-MD Anderson Cancer Workshop 2012-02-28Sage Base
 
ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides DuraSpace
 
Managing the research life cycle
Managing the research life cycleManaging the research life cycle
Managing the research life cycleSherry Lake
 

Similaire à iRODS (20)

Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
 
Building a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability ScienceBuilding a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability Science
 
Data repositories -- Xiamen University 2012 06-08
Data repositories -- Xiamen University 2012 06-08Data repositories -- Xiamen University 2012 06-08
Data repositories -- Xiamen University 2012 06-08
 
Tim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasetsTim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasets
 
Internet data mining 2006
Internet data mining   2006Internet data mining   2006
Internet data mining 2006
 
Big Data, Big Content, and Aligning Your Storage Strategy
Big Data, Big Content, and Aligning Your Storage StrategyBig Data, Big Content, and Aligning Your Storage Strategy
Big Data, Big Content, and Aligning Your Storage Strategy
 
Appistry WGDAS Presentation
Appistry WGDAS PresentationAppistry WGDAS Presentation
Appistry WGDAS Presentation
 
110823 data fed_solta11
110823 data fed_solta11110823 data fed_solta11
110823 data fed_solta11
 
Developments in datamanagement
Developments in datamanagementDevelopments in datamanagement
Developments in datamanagement
 
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
 
CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...
CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...
CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. ...
 
Centralizing sequence analysis
Centralizing sequence analysisCentralizing sequence analysis
Centralizing sequence analysis
 
ASIH Fishnet2 Presentation
ASIH Fishnet2 PresentationASIH Fishnet2 Presentation
ASIH Fishnet2 Presentation
 
Shaman Project Hemmje
Shaman Project  HemmjeShaman Project  Hemmje
Shaman Project Hemmje
 
Functional and Architectural Requirements for Metadata: Supporting Discovery...
Functional and Architectural Requirements for Metadata: Supporting Discovery...Functional and Architectural Requirements for Metadata: Supporting Discovery...
Functional and Architectural Requirements for Metadata: Supporting Discovery...
 
Cloud Technical Challenges
Cloud Technical ChallengesCloud Technical Challenges
Cloud Technical Challenges
 
Stephen Friend CRUK-MD Anderson Cancer Workshop 2012-02-28
Stephen Friend CRUK-MD Anderson Cancer Workshop 2012-02-28Stephen Friend CRUK-MD Anderson Cancer Workshop 2012-02-28
Stephen Friend CRUK-MD Anderson Cancer Workshop 2012-02-28
 
CSHALS 2013
CSHALS 2013CSHALS 2013
CSHALS 2013
 
ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides
 
Managing the research life cycle
Managing the research life cycleManaging the research life cycle
Managing the research life cycle
 

Dernier

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 

Dernier (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

iRODS

  • 1. Policy-based Data Management Integrated Rule Oriented Data Grid (iRODS) Reagan W. Moore (DICE-UNC) Arcot Rajasekar (DICE-UNC) http://irods.diceresearch.org
  • 2. What is the Opportunity Play for iRODS  At a high level …  The Management of Big Data is the #1 concern for IT • Life Cycle Management; • Useful (actionable) and searchable metadata • Integrity • Collaboration (Federation of Immutable data)  iRODS Provides Policy-Based data management: • Next Generation data management cyber-infrastructure • System that enables a flexible, adaptive, customizable data management architecture • Tool for large collections (Petabytes, hundreds of millions of files)
  • 3. We will touch on …  Properties of policy-based data management systems  Management of the data life cycle (project collection, digital library, persistent archive, processing pipeline)  Applications of iRODS  LifeTime™ Library (digital library for students)  Genomics data grid  Carolina Digital Repository (institution repository)  French National Library (IT automation)  DataNet Federation Consortium (data and workflow sharing for collaborative research) 1. What iRODS is and what problems it is solving today, and tomorrow. 2. Speak to different use cases (there will be many companies attending representing many departments with different opportunities/problems) a. Digitization of University Assets- Library archive b. Genomic pipeline automation c. IT service automation
  • 4. Topics • Principles behind policy-based data management – Enable collaborative research – Enable reproducible science – Enable creation of reference collections • Integrated Rule-Oriented Data System (iRODS) – Enforce management policies – Automate administrative functions – Validate assessment criteria
  • 5. Shared Collections – Data Grid. Clients (50 clients: web browser, unix shell command, …) → data grid middleware → storage (file system, tape archive). The data grid provides global name, single sign-on, policy enforcement, metadata, replication. Multiple types of systems can be used to store data.
  • 6. Policy-based Data Management. [Diagram: a client connects to iRODS servers; each iRODS server has a rule engine, a rule base, workflows, and storage; together they present one logical collection (data grid).] Consensus on policies and procedures controls the data collection.
  • 7. Policy-Based Data Environments
      Purpose: reason a collection is assembled
      Properties: attributes needed to ensure the purpose
      Policies: controls for enforcing desired properties, mapped to computer-actionable rules
      Procedures: functions that implement the policies, mapped to computer-actionable workflows
      Persistent state information: results of applying the procedures, mapped to system metadata
      Property verification: validation that state information conforms to the desired purpose, mapped to periodically executed policies
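The chain on this slide (purpose, properties, policies, procedures, persistent state, verification) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Collection` class, the `MIN_REPLICAS` property, and the `replicate` and `verify` functions are hypothetical names and are not part of iRODS.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical property: every file should have at least this many replicas.
MIN_REPLICAS = 2


@dataclass
class Collection:
    """A collection with a purpose and persistent state (system metadata)."""
    purpose: str
    state: Dict[str, Dict[str, int]] = field(default_factory=dict)


def replicate(coll: Collection, name: str) -> None:
    """Procedure: implements the replication policy for one file."""
    meta = coll.state.setdefault(name, {"replicas": 0})
    while meta["replicas"] < MIN_REPLICAS:
        meta["replicas"] += 1  # stand-in for creating a real replica


def verify(coll: Collection) -> List[str]:
    """Property verification: list files whose state violates the policy."""
    return sorted(n for n, m in coll.state.items()
                  if m["replicas"] < MIN_REPLICAS)


if __name__ == "__main__":
    coll = Collection(purpose="preservation")
    coll.state["a.dat"] = {"replicas": 1}
    coll.state["b.dat"] = {"replicas": 2}
    print(verify(coll))       # files that need repair
    replicate(coll, "a.dat")
    print(verify(coll))       # empty once the policy holds
```

In a real data grid the verification step would run periodically, as the slide says, rather than on demand.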
  • 8. Community-based Collection Life Cycle. The driving purpose changes at each stage of the data life cycle:
      Project Collection: Private, Local Policy
      Data Grid: Shared, Distribution Policy
      Data Processing Pipeline: Analyzed, Service Policy
      Digital Library: Published, Description Policy
      Reference Collection: Preserved, Representation Policy
      Federation: Sustained, Re-purposing Policy
    Stages correspond to addition of new policies for a broader community. Virtualize the stages of the collection life cycle through policy evolution.
  • 9. Applications  Data Grids (data sharing)  Ocean Observatories Initiative  The iPlant Collaborative  National Optical Astronomy Observatory  BaBar High Energy Physics  Broad Institute genomics data grid  Wellcome Trust Sanger Institute genomics data grid  Digital Libraries (data publication)  Texas Digital Library  French National Library  UNC-CH SILS LifeTime Library  Repositories / Archives (data preservation)  NASA Center for Climate Simulation  Carolina Digital Repository
  • 10. Sequencing Work – an Infrastructure View. Managing several hundred TBs of genomic data on RC, RENCI, LCCC, and HTSF infrastructure (hardware: ITS; software: LCCC, UNC High Throughput Sequencing Facility). [Diagram labels: production, test/development, and archive infrastructure at RENCI; genome databases, VarDB, Hadoop pipelines; data production at UNC HTSF and third-party vendors; distributed ad-hoc processing; iRODS data-grid managed genome processing; annotations from national resources (RefSeq, dbSNP, 1000 Genomes, HGMD); data sharing with Open Science Grid, RENCI, TeraGrid, NIH, the local science portal, clinical data systems, NCGenes, TUCASI, the Secure Medical Workspace, UNC BASS, and other institutions.]
  • 11. Managing Data on the Research Side. [Diagram labels: UNC and RENCI storage (tape, drives), external storage; compute on genomics lab machines, UNC HPC, RENCI HPC, RENCI Hadoop, genomics clouds, Open Science Grid, Clemson, NIH, and external partners; users are students, researchers, IT staff, data providers, and external collaborators; the environment moves from "Wild West" to "Managed".]
    iRODS gracefully allows for introducing control:
      • Data movement and replication
      • Metadata standards
      • Archival, deletion, and retention
      • Integration with workflows, Hadoop, databases
      • Hiding complexities
      • Automation
      • …, all policy driven
      • …, without breaking the in-place systems
  • 12. SILS LifeTime Library  Student digital libraries  Enable students to build collections of  Photographs  MP3 audio files  Class documents  Video  Web site archive  Resources provided by School of Information and Library Science at UNC-CH  Student collections range from 2 GB to 150 GB  Number of files ranges from 2,000 to 12,000
  • 13. SILS LifeTime Library Policies  Library management  Replication  Checksums  Versioning  Strict access controls  Quotas  Metadata catalog replication  Installation environment archiving  Ingestion  Automated synchronization of student directory with LifeTime Library  Automated loading of MP3 metadata
  • 14. Carolina Digital Repository Policy-Driven Repository Infrastructure project funded by the Institute for Museum and Library Services
  • 16. iRODS Data Grid  More than 50 different clients have been used to interact with the data grid  Web browsers (iDrop-web, Rich Web client)  Web services (VOSpace)  Load libraries (Python, Java)  I/O libraries (C, C++, Fortran)  File systems (FUSE, WebDAV, Parrot)  Synchronization interfaces (iDrop)  Unix tools / Grid tools (icommands, SAGA, SRM, GriPhyN)  Workflows (Kepler, Taverna)  Digital Libraries (Fedora, DSpace)  Portals (EnginFrame)
  • 17. Managing Information & Knowledge
      Concepts
        Data: objects
        Information: names
        Knowledge: relationships between names
        Wisdom: relationships between relationships
      Implementation
        Data: bytes (storage system)
        Information: metadata (relational database)
        Knowledge: policies / procedures (rule base / rule engine)
        Wisdom: policy enforcement point (data grid)
  • 18. Data Virtualization. Layers: Access Interface → Policy Enforcement Points → Standard Micro-services → Standard I/O Operations → Storage Protocol → Storage System
      • Map from the actions requested by the client to multiple policy enforcement points.
      • Map from policy to standard micro-services.
      • Map from micro-services to standard POSIX I/O operations.
      • Map standard I/O operations to the protocol supported by the storage system.
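The layered mapping on this slide can be illustrated with a toy Python model. It is a sketch under assumptions, not the iRODS implementation: `MemoryStorage`, `StandardIO`, `ms_store`, `ms_checksum`, and `POLICY_POINTS` are all hypothetical names, and an in-memory dictionary stands in for a real storage system.

```python
import hashlib
from typing import Callable, Dict, List


class MemoryStorage:
    """Stand-in for a storage system that speaks its own protocol."""
    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}


class StandardIO:
    """Standard I/O operations mapped onto the storage protocol."""
    def __init__(self, storage: MemoryStorage) -> None:
        self.storage = storage

    def write(self, path: str, data: bytes) -> None:
        self.storage.blobs[path] = data

    def read(self, path: str) -> bytes:
        return self.storage.blobs[path]


# Micro-services are written only against the standard I/O layer.
def ms_store(io: StandardIO, path: str, data: bytes, log: List[str]) -> None:
    io.write(path, data)
    log.append(f"stored {path}")


def ms_checksum(io: StandardIO, path: str, data: bytes, log: List[str]) -> None:
    digest = hashlib.sha256(io.read(path)).hexdigest()
    log.append(f"checksum {path} {digest[:8]}")


# Policy enforcement: one client action maps to a chain of micro-services.
MicroService = Callable[[StandardIO, str, bytes, List[str]], None]
POLICY_POINTS: Dict[str, List[MicroService]] = {"put": [ms_store, ms_checksum]}


def client_action(action: str, io: StandardIO, path: str, data: bytes) -> List[str]:
    """Access interface: run every micro-service bound to the action."""
    log: List[str] = []
    for micro_service in POLICY_POINTS[action]:
        micro_service(io, path, data, log)
    return log


if __name__ == "__main__":
    io = StandardIO(MemoryStorage())
    for line in client_action("put", io, "/zone/a.txt", b"hello"):
        print(line)
```

Because the micro-services only see `StandardIO`, swapping the storage back end would not require changing the policies, which is the point of the virtualization layers.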
  • 19. System and User-driven Rules  The data grid automatically applies rules defined in the rule base, core.re  You can define rules that are applied interactively, or that are deferred for later execution: irule -F "rule-file.r"
  • 20. Example Rule  Write "Hello World"  Create a rule file called ruleHello.r:
      myTestRule {
        writeLine("stdout", "Hello World");
      }
      INPUT null
      OUTPUT ruleExecOut
    Run it with: irule -F ruleHello.r
  • 21. Production Integrity Rule  Verify all input parameters for consistency.  Query the iRODS metadata catalog to retrieve status information  Verify the integrity of each file in a collection  Update all replicas to the most recent version.  Minimize the load on production services through a deadline scheduler  Differentiate between the logical name for a file and the physical replica locations.  Identify all missing replicas and document their lack.  Create new replicas to replace missing replicas.  Implement load leveling to distribute the new replicas across the storage systems  Create a log file that records all repair operations performed upon the collection.  Track progress of the policy execution.  Initialize the rule for the first execution.  Enable restart of the process from the last set of checked files in case of a system halt.  Manipulate files in batches of 256 files at a time to handle arbitrarily large collections.  Minimize the number of sleep periods used by the deadline scheduler.  Include the checking of new files that have been added during the execution of the policy  Write out statistics about the effective execution rate, and the number of files checked.
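Two of the requirements above, processing files in batches of 256 and restarting from the last checked set after a halt, can be sketched in Python. This is a simplified illustration rather than the production rule: the `verify_collection` function and the in-memory `checkpoint` dictionary are hypothetical, file contents are held in a dictionary, and SHA-256 is used only as an example checksum.

```python
import hashlib
from typing import Dict, List, Optional

BATCH_SIZE = 256  # batch size named on the slide


def verify_collection(files: Dict[str, bytes],
                      expected: Dict[str, str],
                      checkpoint: Dict[str, int],
                      max_batches: Optional[int] = None) -> List[str]:
    """Check files in batches, resuming from checkpoint['next'].

    Returns the names whose checksum does not match the expected value.
    max_batches simulates a halt partway through the collection.
    """
    names = sorted(files)
    damaged: List[str] = []
    batches_done = 0
    while checkpoint["next"] < len(names):
        if max_batches is not None and batches_done >= max_batches:
            break  # simulated system halt; the checkpoint is kept
        start = checkpoint["next"]
        for name in names[start:start + BATCH_SIZE]:
            if hashlib.sha256(files[name]).hexdigest() != expected[name]:
                damaged.append(name)
        # Record progress only after the whole batch has been checked.
        checkpoint["next"] = min(start + BATCH_SIZE, len(names))
        batches_done += 1
    return damaged


if __name__ == "__main__":
    files = {f"f{i:04d}": bytes([i % 256]) for i in range(600)}
    expected = {n: hashlib.sha256(d).hexdigest() for n, d in files.items()}
    expected["f0300"] = "corrupted"
    checkpoint = {"next": 0}
    print(verify_collection(files, expected, checkpoint, max_batches=1))
    print(verify_collection(files, expected, checkpoint))  # resumes at file 256
```

The actual rule keeps its checkpoint in collection metadata and also handles replicas, load leveling, and the deadline scheduler, none of which this sketch attempts.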
  • 22. Workflow Management & Registration
      eCWkflow.mss: workflow file
      /earthCube/eCWkflow: directory holding all input and output files associated with the workflow file (mounted collection that is linked to the workflow file)
      eCWkflow.mpf, eCWkflow2.mpf: input parameter file, lists parameters and input and output file names
      eCWkflow.run, eCWkflow2.run: automatically generated run file for executing each input file
      /earthCube/eCWkflow/eCWkflow.runDir0: directory holding all output files (Outfile) generated for an invocation of eCWkflow.run; the version number is incremented
      /earthCube/eCWkflow/eCWkflow2.runDir0
      Newfile: output file created for eCWKflow.mpf
  • 23. Publications  Rajasekar, A., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, “iRODS Primer: Integrated Rule-Oriented Data System”, Morgan & Claypool, 2010.  Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, “The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook”, DICE Foundation, November 2011, ISBN: 9781466469129, Amazon.com
  • 24. iRODS - Open Source Software Reagan W. Moore rwmoore@renci.org http://irods.diceresearch.org NSF OCI-0940841 “DataNet Federation Consortium” NSF OCI-1032732 “Improvement of iRODS for Multi-Disciplinary Applications” NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype” NSF SDCI-0721400 “Data Grids for Community Driven Applications”
  • 26. Initializing Workflow Parameters
      *Val = "0";
      msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
      foreach (*GenQOut2) {
        msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
      }
      if (int(*Val) == 0) {
        *Str1 = "TEST_DATA_ID=0";
        msiString2KeyValPair(*Str1, *kvp);
        msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
        writeLine("*Lfile", "added TEST_DATA_ID attribute to collection *Coll");
      }
      # on a restart TEST_DATA_ID will be greater than 0
      msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
      msiExecGenQuery(*GenQInp2, *GenQOut2);
      foreach (*GenQOut2) {
        msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
      }
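The pattern in this rule is "create the progress attribute if it is missing, then read it back", so a first run starts at 0 and a restart picks up the stored value. A minimal Python sketch of the same idea follows; the `init_or_resume` function is hypothetical, and a plain dictionary of strings stands in for the collection's metadata in the catalog.

```python
from typing import Dict


def init_or_resume(coll_meta: Dict[str, str], attr: str = "TEST_DATA_ID") -> int:
    """Return the stored progress marker, creating it as "0" on first use."""
    if attr not in coll_meta:
        coll_meta[attr] = "0"  # first execution: initialize the attribute
    return int(coll_meta[attr])  # on a restart this is greater than 0


if __name__ == "__main__":
    meta: Dict[str, str] = {}
    print(init_or_resume(meta))   # first run
    meta["TEST_DATA_ID"] = "512"  # progress recorded by an earlier run
    print(init_or_resume(meta))   # restart resumes from the stored value
```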
  • 27. Workflow Operations Used  Arithmetic (+, -, *, /)  Boolean tests (==, !=, &&, ||, >, <, >=)  Conditional statements  if / then / else  Control  break / fail  Loops  for / foreach / while  List manipulation  initialization / list addition (cons) / extracting an element from a list (elem) / updating an element in a list (setelem)  Variable manipulation  initialization / type conversion (int, double, str)
  • 28. Micro-services Used  Metadata catalog manipulation  msiGetValByKey get metadata from structure  msiExecStrCondQuery execute string conditional query  msiString2KeyValPair convert string to key-value pair  msiAssociateKeyValuePairsToObj add metadata  msiMakeGenQuery create a query  msiExecGenQuery execute a query  msiCloseGenQuery release query buffers  msiGetContInxFromGenQueryOut check for more rows  msiRemoveKeyValuePairsFromObj remove metadata  msiGetMoreRows get more rows from query
  • 29. Micro-services Used  Data and directory manipulation  msiIsColl check whether name is a collection  msiCollCreate create a collection  msiDataObjCreate create a file  msiDataObjRepl replicate a file  msiDataObjChksum checksum a file  msiDataObjUnlink delete a file  System functions  msiGetSystemTime get the system time  writeLine write a line to a file or standard out  msiSleep sleep
  • 30. Performance at RENCI • Execute call to rule engine 18 msecs • Execute metadata query 714 msecs • Disk seek latency 5 msecs • Disk rotational latency 11 msecs • Production loop logic 6.3 msecs • Checksum verification 21 msecs
  • 31. Data Analysis Use Cases • Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example. • Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built. • Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis. • Integrate multiple community resources. A use case could be demonstration of invocation of multiple workflow systems within the same analysis. An example is the integration of Cyber-integrator workflow with collaboration environments to support drought prediction.
  • 32. Eco-Hydrology. RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework, and full, initial system state.
      Steps: choose gauge or outlet (HIS); extract drainage area (NHDPlus)
      Inputs: Digital Elevation Model (DEM), slope, aspect; streams (NHD); roads (DOT); land use, NLCD (EPA); leaf area index, Landsat TM; phenology, MODIS; soil data, USDA
      Products: nested watershed structure (strata, patch, hillslope, basin); soil and vegetation parameter files; stream network; worldfile; flowtable; RHESSys
  • 33. iRODS Rule for RHESSys. Modular workflow composed by chaining basic transformations: define input variables, call functions to apply each transformation step, store results in shared collection.
      main {
        getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
        convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
        extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "n"));
        importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
        delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
      }
  • 34. extractTileFromNHD_DEM(*extentCoords) {
        # Split path to object into collection and name
        msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName);
        writeLine("serverLog", *nhdDEMObjColl);
        writeLine("serverLog", *nhdDEMObjName);
        # Build query to discover physical path
        msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp);
        msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp);
        msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp);
        msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);
        # Run query
        msiExecGenQuery(*genQInp, *genQOut);
        # Extract path from query result
        foreach (*genQOut) {
          msiGetValByKey(*genQOut, "DATA_PATH", *filePath);
        }
        writeLine("serverLog", *filePath);
        # Determine physical path of input directory
        msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);
        # Generate physical path of output file
        msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName);
        *tileFileName = "SUBSET-"++*rasterDatasetName++".img";
        *tileFilePath = *inFileParentDir++"/"++*tileFileName;
        # Generate iRODS path of output
        msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk);
        *tileObjPath = *nhdDEMObjCollParent++"/"++*tileFileName;
        *args = "-of HFA -projwin "++*extentCoords++" "++"'*inFileDir'"++" "++"'*tileFilePath'";
        writeLine("serverLog", *args);
        msiExecCmd("gdal_translate", *args, "iren.renci.org", "null", "null", *cmd_out);
        writeLine("serverLog", *cmd_out);
        # Register tile file with iRODS
        msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);
      }
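The path arithmetic in this rule is easy to lose among the micro-service calls, so here is the same derivation as a small Python sketch. It only builds the strings and runs nothing. The `split_path` and `build_tile_command` helpers and the example paths are hypothetical, and `posixpath.split` is assumed to behave like the parent/child split the rule relies on.

```python
import posixpath
from typing import Tuple


def split_path(path: str) -> Tuple[str, str]:
    """Split a path into (parent, last component)."""
    return posixpath.split(path.rstrip("/"))


def build_tile_command(file_path: str, dem_obj_coll: str,
                       extent_coords: str) -> Tuple[str, str, str]:
    """Derive the tile's physical path, logical path, and gdal_translate args."""
    # Physical directory of the input raster (the header file name is ignored).
    in_file_dir, _header = split_path(file_path)
    # The raster dataset name is the last component of that directory.
    in_parent_dir, raster_name = split_path(in_file_dir)
    tile_file_name = "SUBSET-" + raster_name + ".img"
    tile_file_path = in_parent_dir + "/" + tile_file_name
    # The logical (catalog) path sits next to the source collection.
    obj_coll_parent, _unused = split_path(dem_obj_coll)
    tile_obj_path = obj_coll_parent + "/" + tile_file_name
    args = ("-of HFA -projwin " + extent_coords
            + " '" + in_file_dir + "' '" + tile_file_path + "'")
    return tile_file_path, tile_obj_path, args


if __name__ == "__main__":
    # Example paths are made up for illustration.
    phys, logical, args = build_tile_command(
        "/vault/nhd/dem01/hdr.adf", "/zone/home/nhd/dem01", "1 2 3 4")
    print(phys)
    print(logical)
    print("gdal_translate " + args)
```

Keeping the physical path and the logical path separate mirrors the rule, which writes the tile on disk first and then registers it in the catalog.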
  • 35. Summary  iRODS is a powerful policy-based engine for managing next-generation Big Data cyber-infrastructures  Enables a flexible, adaptive, and customizable data management architecture  “Canned” scripts (policies) can be created to standardize and automate user processes  Simple menu-driven interface  No CS degree needed  iRODS is the middleware for distributed data management