This presentation discusses the implementation of a digital library management system using grid computing technologies. It proposes mapping digital library functions onto grid services to enable distributed storage, parallel processing, and user management. The authors describe an ontology-based model and a pilot digital library application. Experimental results show that distributing keyword-search tasks across multiple grid nodes reduces execution time compared to a single node, although for more than five executors the scheduling and communication time becomes comparable with the execution time. The presentation concludes that digital libraries must include powerful search tools and that grid infrastructures can support efficient parallel processing functions.
1. Cluj Napoca, 28 August 2008
2008 IEEE International Conference on Intelligent Computer Communication and Processing
Digital Libraries Workshop
Towards a GRID-Based Digital Library Management System
Gheorghe Sebestyén-Pál (1), Doina Banciu (2), Tünde Bálint (1), Bogdan Moscaiuc (1), and Ágnes Sebestyén-Pál (1)
(1) Technical University of Cluj-Napoca
(2) ICI Bucharest
2. Debrecen, 3-5 September 2008, DAPSYS’08
7th INTERNATIONAL CONFERENCE ON DISTRIBUTED AND PARALLEL SYSTEMS
Content
Classical vs. Digital Libraries
Recent research on Digital Libraries (DL)
Main issues and requirements for DLs
An ontology-based DL model
Grid-enabled DL
Implementation considerations of a pilot DL
Experiments
Conclusions
3. Debrecen, 3-5 September 2008, DAPSYS’08
Classical vs. Digital Libraries
Classical library
A repository of knowledge organized mainly on paper
Digital library
Not only a digitized version of a classical library
A new set of functionalities and services is added (e.g. access control, resource management and allocation, complex search and processing services)
A data exchange and cooperation environment
DLs are becoming digital content management systems
Incorporates a wide variety of formats and data types (text, audio, video, multi-document complex digital objects)
Uses a variety of communication and data-exchange protocols and standards
4. Debrecen, 3-5 September 2008, DAPSYS’08
IT and communication technologies involved in the implementation of digital libraries
http://mapageweb.umontreal.ca/turner/meta/english/metamap.html
5. Debrecen, 3-5 September 2008, DAPSYS’08
Goals for modern DLs
DELOS project’s vision:
“to enable any person to access all human knowledge anytime and anywhere, in a friendly, multi-modal, efficient, and effective way, by overcoming barriers of distance, language, and culture and by using multiple Internet-connected devices”
DL: a knowledge repository and an information exchange infrastructure that allows:
data generation,
processing, and
seamless access to relevant information, regardless of the geographic distribution of hardware resources, databases or persons.
6. Debrecen, 3-5 September 2008, DAPSYS’08
Research in digital libraries
Delos Network of Excellence
Goals: to define and implement digital libraries on new computing and communication technologies
Achievements: definition of functional and architectural requirements for DL implementation
BRICKS project
Goals: to design a user- and service-oriented space for sharing knowledge and resources in a multi-cultural heritage
Achievements: definition of a digital library architecture for a very broad and heterogeneous user community; automatic indexing and annotation functionalities
OpenDlib project
Goal: development of a software toolkit for generating dedicated DLs
Achievements: tools for content harvesting from existing resources
Fedora, DSpace: open source software for DLs
Lucene: open source search engine
7. Debrecen, 3-5 September 2008, DAPSYS’08
Research in digital libraries (cont.)
Diligent project (part of the EGEE project)
Goal: the use of GRID infrastructure for DL implementation
Achievements: a new vision of the DL concept:
DL = a dynamic digital content repository and management system dedicated to a purpose (e.g. a project, an art collection, an academic course)
Definition of generic DL services mapped on GRID services
DLs dedicated to different domains, with powerful processing capabilities
SINRED project (national excellence project)
Goal: development of a national framework for DLs specialized in technical sciences and research
Achievements: evaluation of requirements, evaluation of existing software, infrastructure development, DL model definition, implementation of a pilot DL
SIPADOC project (national research program)
Goal: reevaluation of the national patrimony through DLs
Achievements: evaluation of digitizing tools
8. Debrecen, 3-5 September 2008, DAPSYS’08
Key issues in DL implementation
Architectural issues:
Distributed nature of storage, processing and access resources
Scalability, flexibility, interoperability
Functional requirements:
Core functions: storage, indexing and annotation, data search, content retrieval, user management
Content organization should reflect semantic connections
Processing facilities
Data processing services – specialized for different fields
Pattern search and recognition
QoS issues
Restricted time to obtain relevant information
Reasonable time for complex data processing
User and access control management
Virtual organizations
Role-based access
9. Debrecen, 3-5 September 2008, DAPSYS’08
DL = Essence & Metadata Management
[Diagram: digital content (text, audio, video) generation and harvesting; management of essence; automatic feature (metadata) extraction; metadata management; cataloging, indexing, annotation; access and visualization; cataloging information system]
10. Debrecen, 3-5 September 2008, DAPSYS’08
An ontology-based Digital Library approach
Ontology: concepts and relations together with a reasoning engine
Ontology for technical and scientific domains
Main concepts:
Digital objects: association of content, metadata and procedures
Examples: articles, technical reports, prospects, PhD theses, patents
Digital collections: set of digital objects structured for a given goal/purpose or based on a given criterion
Examples: articles of an author, documents of a domain
Events: conferences, workshops, seminars
Processes: projects, courses
Virtual organizations: roles, users
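The concepts listed on this slide can be illustrated with a minimal Python sketch. It is not the authors' model: the class names, the relation names (`presented_at`, `produced_by`), the sample objects and the `creator` metadata key are all assumptions made for illustration, and the reasoning engine is reduced to a plain lookup over stored relations.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Content plus metadata; 'procedures' are left out of this sketch."""
    identifier: str
    kind: str  # e.g. "article", "technical report", "patent"
    metadata: dict = field(default_factory=dict)


@dataclass
class Collection:
    """A set of digital objects grouped for a purpose or by a criterion."""
    name: str
    objects: list = field(default_factory=list)


class Ontology:
    """Stores (subject, relation, object) triples and answers simple lookups."""

    def __init__(self):
        self.triples = set()

    def relate(self, subject, relation, obj):
        self.triples.add((subject, relation, obj))

    def related(self, subject, relation):
        """Return, sorted, everything linked to `subject` by `relation`."""
        return sorted(o for s, r, o in self.triples if s == subject and r == relation)


def collect_by_author(objects, author):
    """Build an 'articles of an author' collection from object metadata."""
    hits = [o for o in objects if o.metadata.get("creator") == author]
    return Collection(name=f"articles of {author}", objects=hits)


# Hypothetical sample data, not taken from the pilot DL.
paper = DigitalObject("obj-1", "article", {"creator": "A. Author"})
report = DigitalObject("obj-2", "technical report", {"creator": "B. Writer"})

onto = Ontology()
onto.relate("obj-1", "presented_at", "DAPSYS'08")        # digital object -> event
onto.relate("obj-1", "produced_by", "pilot DL project")  # digital object -> process

by_author = collect_by_author([paper, report], "A. Author")
```

Here a collection is derived from a criterion (the author), which matches the "based on a given criterion" case; a purpose-driven collection would simply be filled by hand.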
11. Debrecen, 3-5 September 2008, DAPSYS’08
Grid-enabled digital library services
Why DLs on GRID infrastructure?
Huge volume of documents/digital objects
Concurrent access and multiple search engines (see Google)
Multimedia streaming
Automatic indexing and annotation
Complex processing requires prohibitive time
User management through virtual organizations
Job distribution facilities offered by GRID
12. Debrecen, 3-5 September 2008, DAPSYS’08
DL functions mapped on GRID services
Digital Library functions: collections management; catalog and metadata management; digital objects management; users’ management; data visualization
GRID services: virtual organizations management; resource management; task distribution; processing; data distribution and replication; data processing
Computing, storage and communication resources
13. Debrecen, 3-5 September 2008, DAPSYS’08
Experiments
Two approaches:
DL implementation on Alchemi GRID (Microsoft .NET)
Job distribution at thread level
Explicit GRID programming
Experiments with multimedia streaming (multimedia content distribution)
DL implementation on Condor GRID (open source)
Job distribution at task level
Job and data distribution is transparent to the DL application (distribution is done through separate scripts)
Experiments with keyword search over the whole DL content
The execution time decreased as the number of executor computers increased
For more than 5 executors, the scheduling and communication time is comparable with the execution time
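The task-level distribution used for the keyword search experiment can be sketched in Python as follows. This is only an illustration of the partition, search and merge pattern: local threads stand in for grid executor nodes, and the function names and the sample `library` are assumptions. The actual pilot distributes jobs through separate Condor scripts, which are not shown in the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(documents, n_parts):
    """Split the document list into n_parts near-equal slices (round-robin)."""
    return [documents[i::n_parts] for i in range(n_parts)]


def search_slice(docs, keyword):
    """Work done by one executor: ids of the documents containing the keyword."""
    needle = keyword.lower()
    return [doc_id for doc_id, text in docs if needle in text.lower()]


def distributed_search(documents, keyword, n_executors):
    """Fan the slices out to the executors and merge their partial results."""
    slices = partition(documents, n_executors)
    with ThreadPoolExecutor(max_workers=n_executors) as pool:
        partial = pool.map(search_slice, slices, [keyword] * n_executors)
        merged = [doc_id for hits in partial for doc_id in hits]
    return sorted(merged)


# Hypothetical content; each entry is (document id, full text).
library = [
    (1, "Grid computing for digital libraries"),
    (2, "Role-based access control"),
    (3, "Keyword search on a GRID infrastructure"),
    (4, "Multimedia streaming"),
]
result = distributed_search(library, "grid", n_executors=2)
```

Each slice is searched independently, so the search time can shrink as executors are added, while the cost of handing out slices and merging results does not. That fixed cost is the kind of overhead the slide reports as becoming significant for more than 5 executors.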
14. Debrecen, 3-5 September 2008, DAPSYS’08
A pilot implementation of a Digital Library framework developed with GRID support
Goal: implementation of a digital content storage and retrieval system dedicated to educational and scientific activities (courses, projects, etc.)
Main requirements:
A DL adaptable for a given purpose/goal
Access controlled and restricted with virtual organizations
Ontology-based approach (concepts, relations, semantic search)
Advanced search procedures
GRID-enabled full-text search services, for better reaction time
Access through Internet browsers
The result:
A distributed digital library application, which allows:
Management of digital objects (upload, storage, indexing, metadata creation)
Management of collections
Management of users and virtual organizations
15. Debrecen, 3-5 September 2008, DAPSYS’08
Pilot DL details (www.bib-dig.utcluj.ro):
Management of digital objects
Upload of digital documents
Annotation, metadata generation according to Dublin Core
Distributed storage of data
Management of collections
Define a new collection
Attach new documents to an existing collection
Associate access rights with a collection
Management of users and virtual organizations
Define new users and new virtual organizations
Define roles
Associate roles with users and collections
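The way users, roles, virtual organizations and collection access rights fit together can be sketched as below. The slides do not describe the pilot's data model, so everything here is an assumption: the class names, the `is_allowed` helper, the action names (`read`, `upload`) and the sample users are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualOrganization:
    name: str
    # user name -> set of role names held inside this organization
    members: dict = field(default_factory=dict)

    def assign_role(self, user, role):
        self.members.setdefault(user, set()).add(role)


@dataclass
class Collection:
    name: str
    # role name -> set of permitted actions, e.g. {"read", "upload"}
    rights: dict = field(default_factory=dict)

    def grant(self, role, *actions):
        self.rights.setdefault(role, set()).update(actions)


def is_allowed(vo, user, collection, action):
    """A user may act if any of their roles carries that right on the collection."""
    roles = vo.members.get(user, set())
    return any(action in collection.rights.get(role, set()) for role in roles)


# Hypothetical setup: one course, one collection, two users.
course = VirtualOrganization("example course")
course.assign_role("alice", "teacher")
course.assign_role("bob", "student")

notes = Collection("lecture notes")
notes.grant("teacher", "read", "upload")
notes.grant("student", "read")
```

Rights are attached to roles rather than to individual users, so adding a user to a virtual organization only requires assigning a role.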
16. Debrecen, 3-5 September 2008, DAPSYS’08
Snapshots of the DL application’s interface
bib-dig.utcluj.ro
17. Debrecen, 3-5 September 2008, DAPSYS’08
Snapshots of the DL application’s interface
18. Debrecen, 3-5 September 2008, DAPSYS’08
Search techniques in DLs
Through keyword or index search:
Database techniques
Through semantic Information Retrieval:
Semantic graph with documents and concepts
Through non-semantic Information Retrieval:
Naive Bayes Algorithm
Probabilistic approach
Based on probabilistic similarity between documents
Topic-Based Vector Space Model Algorithm
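As an illustration of the probabilistic approach named above, here is a small multinomial Naive Bayes text classifier in Python. It is a generic textbook formulation with add-one smoothing, not the implementation used in the pilot DL. The training sentences and the two class labels are invented for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)  # class -> word frequencies
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_score(self, text, label):
        """log P(label) plus the sum of log P(word | label) over known words."""
        counts = self.word_counts[label]
        total = sum(counts.values()) + len(self.vocab)
        score = math.log(self.class_docs[label] / sum(self.class_docs.values()))
        for word in text.lower().split():
            if word in self.vocab:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / total)
        return score

    def predict(self, text):
        return max(self.class_docs, key=lambda label: self.log_score(text, label))


# Invented training data with two classes.
model = NaiveBayes().fit(
    [
        "grid scheduling of parallel jobs",
        "parallel jobs on grid nodes",
        "metadata cataloging and indexing",
        "indexing library metadata records",
    ],
    ["grid", "grid", "library", "library"],
)
```

Calling `model.predict("parallel grid jobs")` scores the text against each class and returns the most probable one. The Topic-Based Vector Space Model is a different technique and is not covered by this sketch.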
19. Debrecen, 3-5 September 2008, DAPSYS’08
Experimental results
[Chart: execution time vs. number of executor nodes. X axis: Nodes (1 to 5). Y axis: Time (s), 0 to 8000. Series: search execution time; scheduling and communication time (case 1 and case 2); total time (case 1 and case 2)]
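The relation between the plotted series can be written as a simple model. This is an assumption based on the series names, not a formula given in the slides: the total time is taken to be the sum of the search execution time and the scheduling and communication time, and the search is idealized as perfectly divisible across $n$ executors.

```latex
\begin{align}
T_{\text{total}}(n) &= T_{\text{search}}(n) + T_{\text{sched}}(n), \\
T_{\text{search}}(n) &\approx \frac{T_{\text{search}}(1)}{n}
  \quad \text{(idealized, assumes evenly split work)}, \\
S(n) &= \frac{T_{\text{total}}(1)}{T_{\text{total}}(n)} .
\end{align}
```

Under these assumptions, adding executors stops paying off once $T_{\text{sched}}(n)$ is of the same order as $T_{\text{search}}(1)/n$. This is consistent with the observation that scheduling and communication time becomes comparable with the execution time for more than 5 executors.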
20. Debrecen, 3-5 September 2008, DAPSYS’08
Experiments
21. Debrecen, 3-5 September 2008, DAPSYS’08
Conclusions
DLs are complex content management systems that extend the functionalities of classical libraries:
Semantic organization of a wide variety of information formats
Multiple search and data retrieval techniques (including full-text and semantic search):
Keyword full-text search
Semantic search
Statistical and probabilistic retrieval and classification
Access control to distributed and remote data
DLs are data exchange and cooperation environments
Useful for remote and cooperative work
DLs must include powerful search and data retrieval engines
GRID infrastructures may be a feasible support for the implementation of DLs
For more efficient parallel search, classification or automatic annotation
22. Cluj Napoca, 28 August 2008
Thank you for your attention
Questions?