2. What is the Opportunity Play for iRODS
At a high level …
The Management of Big Data is the #1 concern for IT
• Life Cycle Management
• Useful (actionable) and searchable metadata
• Integrity
• Collaboration (Federation of Immutable data)
iRODS provides policy-based data management:
• Next Generation data management cyber-infrastructure
• System that enables a flexible, adaptive, customizable
data management architecture
• Tool for large collections (Petabytes, hundreds of millions of files)
3. We will touch on …
Properties of policy-based data management systems
Management of the data life cycle (project collection, digital library, persistent
archive, processing pipeline)
Applications of iRODS
LifeTime™ Library (digital library for students)
Genomics data grid
Carolina Digital Repository (institution repository)
French National Library (IT automation)
DataNet Federation Consortium (data and workflow sharing for collaborative
research)
1. What iRODS is and what problems it is solving today, and tomorrow.
2. Speak to different use cases (there will be many companies attending
representing many departments with different opportunities/problems)
a. Digitization of university assets: library archive
b. Genomic pipeline automation
c. IT service automation
4. Topics
• Principles behind policy-based data management
– Enable collaborative research
– Enable reproducible science
– Enable creation of reference collections
• Integrated Rule-Oriented Data System (iRODS)
– Enforce management policies
– Automate administrative functions
– Validate assessment criteria
5. Shared Collections – Data Grid
Client
• 50 clients: web browser, unix shell command, …
Data grid middleware
• Data Grid provides global name, single sign-on, policy enforcement, metadata, replication
Storage
• File system, tape archive
• Multiple types of systems can be used to store data
6. Policy-based Data Management
Client
iRODS servers (two shown), each with:
• Rule engine
• Rule base
• Workflows
• Storage
Logical collection (data grid) spanning the servers
Consensus on policies and procedures controls the data collection
7. Policy-Based Data Environments
Purpose
Reason a collection is assembled
Properties
Attributes needed to ensure the purpose
Policies
Controls for enforcing desired properties,
• mapped to computer actionable rules
Procedures
Functions that implement the policies
• Mapped to computer actionable workflows
Persistent state information
Results of applying the procedures
• mapped to system metadata
Property verification
Validation that state information conforms to the desired purpose
• mapped to periodically executed policies
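The six elements above can be sketched in a few lines of Python. This is only an illustrative model, not iRODS code: the `Collection` class, the checksum-based integrity property, and every function name are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Collection:
    purpose: str  # reason the collection is assembled
    files: Dict[str, bytes] = field(default_factory=dict)
    # Persistent state information: results of applying the procedures.
    state: Dict[str, Dict[str, str]] = field(default_factory=dict)


def record_checksum(coll: Collection, name: str) -> None:
    """Procedure: the function that implements the integrity policy."""
    digest = hashlib.sha256(coll.files[name]).hexdigest()
    coll.state.setdefault(name, {})["checksum"] = digest


def integrity_policy(coll: Collection, name: str) -> None:
    """Policy: enforce the property 'every file has a recorded checksum'."""
    record_checksum(coll, name)


def ingest(coll: Collection, name: str, data: bytes,
           policies: List[Callable[[Collection, str], None]]) -> None:
    """Store a file, then apply each policy to it."""
    coll.files[name] = data
    for policy in policies:
        policy(coll, name)


def verify_integrity(coll: Collection) -> List[str]:
    """Property verification: list files whose data no longer matches the state."""
    return [name for name, data in coll.files.items()
            if coll.state.get(name, {}).get("checksum")
            != hashlib.sha256(data).hexdigest()]
```

In this sketch, `verify_integrity` plays the role of the periodically executed policy: it returns an empty list while the stored state still matches the data.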
8. Community-based Collection Life Cycle
The driving purpose changes at each stage of the data life cycle
Project Collection: Private, Local Policy
Data Grid: Shared, Distribution Policy
Data Processing Pipeline: Analyzed, Service Policy
Digital Library: Published, Description Policy
Reference Collection: Preserved, Representation Policy
Federation: Sustained, Re-purposing Policy
Stages correspond to addition of new policies for a broader community
Virtualize the stages of the collection life cycle through policy evolution
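One way to picture policy evolution is as a cumulative list: each stage keeps the policies of the earlier stages and adds its own. The Python sketch below assumes that reading of the slide; the `STAGES` table and the function name are illustrative, not part of iRODS.

```python
# (stage, community status, policy added at this stage), in life-cycle order.
STAGES = [
    ("Project Collection", "Private", "Local policy"),
    ("Data Grid", "Shared", "Distribution policy"),
    ("Data Processing Pipeline", "Analyzed", "Service policy"),
    ("Digital Library", "Published", "Description policy"),
    ("Reference Collection", "Preserved", "Representation policy"),
    ("Federation", "Sustained", "Re-purposing policy"),
]


def policies_in_force(stage_name):
    """Return the cumulative policies for a collection that has reached stage_name."""
    policies = []
    for name, _status, policy in STAGES:
        policies.append(policy)
        if name == stage_name:
            return policies
    raise ValueError(f"unknown stage: {stage_name}")
```

Under this assumption a collection never changes systems as it matures; it only gains policies.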
9. Applications
Data Grids (data sharing)
Ocean Observatories Initiative
The iPlant Collaborative
National Optical Astronomy Observatory
BaBar High Energy Physics
Broad Institute genomics data grid
Wellcome Trust Sanger Institute genomics data grid
Digital Libraries (data publication)
Texas Digital Library
French National Library
UNC-CH SILS LifeTime Library
Repositories / Archives (data preservation)
NASA Center for Climate Simulation
Carolina Digital Repository
10. Sequencing Work – an Infrastructure View
RC, RENCI, LCCC, HTSF Infrastructure
Managing several hundred TBs of genomic data (hardware: ITS; software: LCCC, UNC High Throughput Sequencing Facility)
[Diagram: sequencing infrastructure. Labels recovered from the figure:]
• RENCI infrastructure: Production, Test/Development, Archive
• Components: genome databases, pipelines, VarDB, Hadoop
• Data production: UNC HTSF, third-party vendors
• Genome processing: distributed ad-hoc processing, iRODS data-grid managed
• Annotations from national resources: RefSeq, dbSNP, 1000 Genomes, HGMD
• Data sharing: Open Science Grid, RENCI Science Portal, local, TeraGrid, NIH, (TUCASI), other institutions, …
• Clinical data systems: NCGenes, Secure Medical Workspace, UNC BASS
11. Managing Data on the Research Side
[Diagram: research storage and compute resources, moving from "Wild West" to "Managed". Labels recovered from the figure:]
• Storage: UNC Genomics Lab storage machines (tape, drives), RENCI storage (tape, drives), external partners
• Compute: Genomics HPC, UNC HPC, RENCI HPC, IT machines, Genomics clouds, Hadoop, RENCI Hadoop
• External compute: Open Science Grid, Clemson, NIH
• People: data providers, students, IT staff, researchers, external collaborators
iRODS gracefully allows for introducing control:
• Data movement and replication
• Metadata standards
• Archival, deletion, and retention
• Integration with workflows, Hadoop, databases
• Hiding complexities
• Automation
• …, all policy driven
• …, without breaking the in-place systems
12. SILS LifeTime Library
Student digital libraries
Enable students to build collections of
Photographs
MP3 audio files
Class documents
Video
Web site archive
Resources provided by School of Information and
Library Science at UNC-CH
Student collections range from 2 GB to 150 GB
Number of files ranges from 2,000 to 12,000
16. iRODS Data Grid
More than 50 different clients have been used to
interact with the data grid
Web browsers (iDrop-web, Rich Web client)
Web services (VOSpace)
Load libraries (Python, Java)
I/O libraries (C, C++, Fortran)
File systems (FUSE, WebDAV, Parrot)
Synchronization interfaces (iDrop)
Unix tools / Grid tools (iCommands, SAGA, SRM, GriPhyN)
Workflows (Kepler, Taverna)
Digital Libraries (Fedora, DSpace)
Portals (EnginFrame)
17. Managing Information & Knowledge
Concepts
Data          objects
Information   names
Knowledge     relationships between names
Wisdom        relationships between relationships
Implementation
Level         Represented as              Implemented in
Data          bytes                       Storage system
Information   metadata                    Relational database
Knowledge     policies / procedures       Rule base / Rule engine
Wisdom        policy enforcement point    Data Grid
18. Data Virtualization
Layers (top to bottom):
Access Interface
Policy Enforcement Points
Standard Micro-services
Standard I/O Operations
Storage Protocol
Storage System
Mappings between layers:
• Map from the actions requested by the client to multiple policy enforcement points.
• Map from policy to standard micro-services.
• Map from micro-services to standard POSIX I/O operations.
• Map standard I/O operations to the protocol supported by the storage system.
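The chain of mappings can be illustrated with a small Python model. Everything in it is hypothetical (the `MemoryStore` driver, the `put_object` and `get_object` micro-service stand-ins, and the two enforcement points); it only shows how a client action passes through enforcement points and micro-services before reaching a storage-specific protocol.

```python
# Storage protocol layer: a hypothetical driver exposing POSIX-like read/write.
class MemoryStore:
    def __init__(self):
        self.blobs = {}

    def write(self, path, data):
        self.blobs[path] = bytes(data)

    def read(self, path):
        return self.blobs[path]


# Micro-service layer: written only against the standard read/write operations,
# so any driver with the same two methods can be swapped in.
def put_object(store, path, data):
    store.write(path, data)


def get_object(store, path):
    return store.read(path)


# Policy enforcement points: hooks that run before and after the action.
audit_log = []


def pep_pre_put(path):
    # Assumed example policy: only paths under /zone/ may be written.
    if not path.startswith("/zone/"):
        raise PermissionError(f"writes outside /zone/ are not allowed: {path}")


def pep_post_put(path):
    audit_log.append(("put", path))


# Access interface: one client action maps to several enforcement points.
def client_put(store, path, data):
    pep_pre_put(path)
    put_object(store, path, data)
    pep_post_put(path)
```

Because the micro-service stand-ins never touch the driver's internals, replacing `MemoryStore` with a driver for another storage system would not change the policy or client layers.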
19. System and User-driven Rules
The data grid automatically applies rules
defined in the rule base, core.re
You can define rules that are applied
interactively, or that are deferred for later
execution
irule -F rule-file.r
21. Production Integrity Rule
Verify all input parameters for consistency.
Query the iRODS metadata catalog to retrieve status information
Verify the integrity of each file in a collection
Update all replicas to the most recent version.
Minimize the load on production services through a deadline scheduler
Differentiate between the logical name for a file and the physical replica locations.
Identify all missing replicas and document their lack.
Create new replicas to replace missing replicas.
Implement load leveling to distribute the new replicas across the storage systems
Create a log file that records all repair operations performed upon the collection.
Track progress of the policy execution.
Initialize the rule for the first execution.
Enable restart of the process from the last set of checked files in case of a system halt.
Manipulate files in batches of 256 files at a time to handle arbitrarily large collections.
Minimize the number of sleep periods used by the deadline scheduler.
Include the checking of new files that have been added during the execution of the policy
Write out statistics about the effective execution rate, and the number of files checked.
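A few of these requirements (batches of 256, replica repair, a repair log, and restart from the last checked batch) can be sketched in Python. This is a simplified model under stated assumptions, not the production rule: the catalog and replicas are plain dictionaries, SHA-256 stands in for the checksum, and the deadline scheduler, load leveling, and pickup of newly added files are left out.

```python
import hashlib

BATCH_SIZE = 256  # batch size named on the slide


def sha(data):
    return hashlib.sha256(data).hexdigest()


def check_collection(catalog, replicas, state, log, batch_size=BATCH_SIZE):
    """Verify and repair replicas batch by batch, resuming from state['next_index'].

    catalog:  logical name -> expected checksum (stand-in for the metadata catalog)
    replicas: logical name -> list of replica contents (None marks a missing replica)
    state:    restart marker, assumed to be persisted by the caller
    log:      list that receives one line per repair operation
    """
    names = sorted(catalog)
    for begin in range(state.get("next_index", 0), len(names), batch_size):
        for name in names[begin:begin + batch_size]:
            copies = replicas[name]
            # Find one replica that still matches the recorded checksum.
            good = next((c for c in copies
                         if c is not None and sha(c) == catalog[name]), None)
            if good is None:
                log.append(f"{name}: no valid replica, cannot repair")
                continue
            for i, copy in enumerate(copies):
                if copy is None or sha(copy) != catalog[name]:
                    copies[i] = good
                    log.append(f"{name}: replaced replica {i}")
        # Record progress after each batch so a halted run can restart here.
        state["next_index"] = min(begin + batch_size, len(names))
    return state
```

Keeping the restart marker at batch granularity means that at most one batch is rechecked after a system halt.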
22. Workflow Management & Registration
eCWkflow.mss: workflow file
/earthCube/eCWkflow: directory holding all input and output files associated with the workflow file (mounted collection that is linked to the workflow file)
• eCWkflow.run, eCWkflow2.run: automatically generated run file for executing each input file
• eCWkflow.mpf, eCWkflow2.mpf: input parameter file, lists parameters and input and output file names
/earthCube/eCWkflow/eCWkflow.runDir0: directory holding all output files generated for an invocation of eCWkflow.run; the version number is incremented
• Outfile
/earthCube/eCWkflow/eCWkflow2.runDir0
• Newfile: output file created for eCWkflow.mpf
23. Publications
Rajasekar, A., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, “iRODS Primer: Integrated Rule-Oriented Data System”, Morgan & Claypool, 2010.
Ward, J., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, “The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook”, DICE Foundation, November 2011, ISBN: 9781466469129, Amazon.com
24. iRODS - Open Source Software
Reagan W. Moore
rwmoore@renci.org
http://irods.diceresearch.org
NSF OCI-0940841 “DataNet Federation Consortium”
NSF OCI-1032732 “Improvement of iRODS for Multi-Disciplinary Applications”
NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype”
NSF SDCI-0721400 “Data Grids for Community Driven Applications”
26. Initializing Workflow Parameters
# Count the TEST_DATA_ID attributes already attached to the collection
*Val = "0";
msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQOut2);
foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val);
}
# First execution: the attribute does not exist yet, so create it with value 0
if (int(*Val) == 0) {
    *Str1 = "TEST_DATA_ID=0";
    msiString2KeyValPair(*Str1, *kvp);
    msiAssociateKeyValuePairsToObj(*kvp, *Coll, "-C");
    writeLine("*Lfile", "added TEST_DATA_ID attribute to collection *Coll");
}
# on a restart TEST_DATA_ID will be greater than 0
msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = '*Coll' and META_COLL_ATTR_NAME = 'TEST_DATA_ID'", *GenQInp2);
msiExecGenQuery(*GenQInp2, *GenQOut2);
foreach (*GenQOut2) {
    msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID);
}
27. Workflow Operations Used
Arithmetic (+, -, *, /)
Boolean tests (==, !=, &&, ||, >, <, >=)
Conditional statements
if / then / else
Control
break / fail
Loops
for / foreach / while
List manipulation
initialization / list addition (cons) / extracting an element from a
list (elem) / updating an element in a list (setelem)
Variable manipulation
initialization / type conversion (int, double, str)
28. Micro-services Used
Metadata catalog manipulation
msiGetValByKey                   get metadata from structure
msiExecStrCondQuery              execute string conditional query
msiString2KeyValPair             convert string to key-value pair
msiAssociateKeyValuePairsToObj   add metadata
msiMakeGenQuery                  create a query
msiExecGenQuery                  execute a query
msiCloseGenQuery                 release query buffers
msiGetContInxFromGenQueryOut     check for more rows
msiRemoveKeyValuePairsFromObj    remove metadata
msiGetMoreRows                   get more rows from query
29. Micro-services Used
Data and directory manipulation
msiIsColl            check whether name is a collection
msiCollCreate        create a collection
msiDataObjCreate     create a file
msiDataObjRepl       replicate a file
msiDataObjChksum     checksum a file
msiDataObjUnlink     delete a file
System functions
msiGetSystemTime     get the system time
writeLine            write a line to a file or standard out
msiSleep             sleep
30. Performance at RENCI
• Execute call to rule engine     18 msecs
• Execute metadata query          714 msecs
• Disk seek latency               5 msecs
• Disk rotational latency         11 msecs
• Production loop logic           6.3 msecs
• Checksum verification           21 msecs
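These figures suggest why batching matters: the metadata query dominates unless its cost is spread over many files. The Python sketch below works through that arithmetic under an assumption the slide does not state, namely that the rule-engine call and the metadata query happen once per batch while the other four costs are paid once per file.

```python
# Timings from the slide, in milliseconds.
TIMINGS_MS = {
    "rule_engine_call": 18.0,
    "metadata_query": 714.0,
    "disk_seek": 5.0,
    "disk_rotation": 11.0,
    "loop_logic": 6.3,
    "checksum": 21.0,
}


def per_file_ms(batch_size, t=TIMINGS_MS):
    """Estimated cost per file when per-batch costs are amortized (assumed model)."""
    per_batch = t["rule_engine_call"] + t["metadata_query"]
    per_file = (t["disk_seek"] + t["disk_rotation"]
                + t["loop_logic"] + t["checksum"])
    return per_file + per_batch / batch_size


if __name__ == "__main__":
    for size in (1, 16, 256):
        print(f"batch of {size:3d}: {per_file_ms(size):.1f} ms per file")
```

Under this model the per-batch overhead of 732 msecs shrinks to under 3 msecs per file at a batch size of 256, leaving the per-file disk and checksum work as the main cost.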
31. Data Analysis Use Cases
• Demonstrate reproducible science. A use case could include the
registration, storage, sharing, and re-execution of a workflow. The hypoxia
use case from the Cross-Domain and Brokering Concept groups could be
used as an example.
• Automate data retrieval. A use case could demonstrate remote access to a
data collection, retrieval of desired data sets, transformation, and use in
an analysis workflow. An eco-hydrology example that automates access
to digital elevation maps and land use coverage is being built.
• Integrate community resources with collaboration environments. An
example would be use of the DAB protocol to identify and cache local
copies of relevant data sets for local analysis.
• Integrate multiple community resources. A use case could be
demonstration of invocation of multiple workflow systems within the
same analysis. An example is the integration of Cyber-integrator
workflow with collaboration environments to support drought prediction.
32. Eco-Hydrology
RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework, and full, initial system state.
[Diagram: RHESSys workflow. Labels recovered from the figure:]
• Choose gauge or outlet (HIS)
• Extract drainage area (NHDPlus)
• Inputs: Digital Elevation Model (DEM) with slope and aspect, streams (NHD), roads (DOT), land use NLCD (EPA), leaf area index (Landsat TM), phenology (MODIS), soil data (USDA)
• Nested watershed structure: strata, patch, hillslope, basin
• Soil and vegetation parameter files
• Stream network
• Outputs: worldfile, flowtable
• RHESSys
33. iRODS Rule for RHESSys
Modular workflow composed by chaining basic transformations
Define input variables
Call functions to apply each transformation step
Store results in shared collection
main {
    getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords);
    convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords);
    extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "n"));
    importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath, *newLocObjPath);
    delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);
}
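The chaining pattern in this rule can be shown outside iRODS with a short Python sketch. The step functions below are hypothetical placeholders that return dummy strings rather than real geospatial results; the point is only that each step reads values produced by earlier steps and adds its own, and that the final results are stored in a shared collection (here, a dictionary).

```python
def get_extent(ctx):
    # Placeholder for looking up the extent of a gauge reach code.
    return {"extent_vect": f"extent({ctx['gage_reachcode']})"}


def convert_extent(ctx):
    # Placeholder for converting vector coordinates to DEM coordinates.
    return {"extent_dem": f"dem({ctx['extent_vect']})"}


def extract_tile(ctx):
    # Placeholder for cutting the DEM tile that covers the extent.
    return {"tile": f"tile({ctx['extent_dem']})"}


def run_pipeline(steps, inputs, shared_collection):
    """Apply each step in order, then store all results in the shared collection."""
    ctx = dict(inputs)
    for step in steps:
        ctx.update(step(ctx))  # each step adds outputs for the later steps
    shared_collection.update(ctx)
    return ctx


if __name__ == "__main__":
    shared = {}
    run_pipeline([get_extent, convert_extent, extract_tile],
                 {"gage_reachcode": "example-reach"}, shared)
    print(shared["tile"])
```

Because each step only depends on named values in the context, steps can be reordered, replaced, or reused in other workflows without changing the driver.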
35. Summary
iRODS is a powerful policy-based engine for managing next-generation Big Data cyber-infrastructures
Enables a flexible, adaptive, and customizable data management architecture
“Canned” scripts (policies) can be created to standardize and automate user processes
Simple menu-driven interface
No CS degree needed
iRODS is the middleware for distributed data management