(Final) bilingual equivalence mapping methods and issues
Union Catalog and Knowledge Engineering for TELDAP
1. Union Catalog and Knowledge
Engineering for TELDAP
Keh-Jiann Chen
Principal Investigator
Core Platforms for Digital Contents Project, TELDAP
Research Fellow
Research Center for Information Technology Innovation &
Institute of Information Science, Academia Sinica
2. Outline
Introduction
Union catalog
Databases and metadata for
digital contents and websites
Knowledge engineering
Future perspective
3. Introduction
The integration and management of digital
content has become an important issue as
the amount of digital content produced by
different projects and institutions increases
rapidly.
The goal of our project is to achieve
optimized preservation, retrieval, and
presentation of digital collections.
5. What is the union catalog?
• It is a catalog and portal for all digital collections of
TELDAP.
• It is an integrated platform for browsing and searching
the entire digital content of TELDAP.
• Metadata provides core descriptions and licensing
information of each digital collection.
8. Metadata models for different
types of objects
Archived digital items
• Union catalog metadata model: Dublin Core+
Web sites
• DCCAP (Dublin Core Collections Application Profile)
• Fields for internal use only
– Unique Identifier, Format, Evaluation, Cataloging History
Documents
• Document metadata: Dublin Core
9. Metadata for digital items
Over 3 million digital items and still increasing

Element               Definition
Title                 A name given to the resource
Creator               An entity primarily responsible for making the content of the resource
Subject and Keywords  The topic of the content of the resource
Description           An account of the content of the resource
Publisher             An entity responsible for making the resource available
Contributor           An entity responsible for making contributions to the content of the resource
Date                  A date associated with an event in the life cycle of the resource
Resource Type         The nature or genre of the content of the resource
Format                The physical or digital manifestation of the resource
Resource Identifier   An unambiguous reference to the resource within a given context
Source                A reference to a resource from which the present resource is derived
Language              A language of the intellectual content of the resource
Relation              A reference to a related resource
Coverage              The extent or scope of the content of the resource
Rights Management     Information about rights held in and over the resource
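
As an illustration of how one item described with these elements might be serialized, here is a minimal Python sketch. It assumes the standard Dublin Core element namespace; the wrapper element name `record`, the function name `to_dc_xml`, and the sample values are hypothetical and are not taken from the TELDAP catalog.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"


def to_dc_xml(record):
    """Serialize a mapping of Dublin Core element names to value lists as XML."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")  # wrapper element name is an assumption
    for element, values in record.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical item; the values are illustrative only.
sample = {
    "title": ["Example landscape painting"],
    "creator": ["Unknown"],
    "subject": ["painting", "landscape"],
    "type": ["Image"],
    "language": ["zh"],
}
xml_text = to_dc_xml(sample)
print(xml_text)
```

Repeatable elements such as Subject are stored as lists so that one item can carry several values.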
11. Metadata for websites
Over 500 websites and still increasing
Metadata
• DCCAP (Dublin Core Collections Application
Profile)
• Total of 19 data fields
12. Metadata for websites
• The website homepage picture
• URL, Project Information
• Type, Name, Author, Subject, Description, Language, Item Type, Target
• Archived Information: URL, time, authorization
• Copyright, Purpose, Other Information
Figure: http://digitalarchives.tw
13. Dynamic categorization
• User-oriented categorization
– General, elementary school students, high school
students, researchers, etc.
• Topic-based categorization
– Archaeology, painting, animal, plant, document, etc.
• Function-based categorization
– Research, education, business, technology, …
• Categorization based on institutions
– Academia Sinica, Taiwan U., Palace Museum, …
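
The four kinds of categorization above can be thought of as grouping the same records by different metadata fields. The following Python sketch shows that idea only. The record layout, the field names (`target`, `subject`, `purpose`, `institution`), and the site names are assumptions made for illustration, not the actual DCCAP fields or portal data.

```python
from collections import defaultdict

# Hypothetical website records; field names and values are illustrative.
sites = [
    {"name": "Site A", "target": ["researchers"], "subject": ["archaeology"],
     "purpose": ["research"], "institution": "Academia Sinica"},
    {"name": "Site B", "target": ["elementary school students"],
     "subject": ["animal", "plant"], "purpose": ["education"],
     "institution": "Taiwan U."},
    {"name": "Site C", "target": ["general"], "subject": ["painting"],
     "purpose": ["education", "business"], "institution": "Palace Museum"},
]


def categorize(records, facet):
    """Group record names under each value of the chosen facet."""
    groups = defaultdict(list)
    for record in records:
        values = record[facet]
        if isinstance(values, str):  # single-valued facet such as institution
            values = [values]
        for value in values:
            groups[value].append(record["name"])
    return dict(groups)


by_purpose = categorize(sites, "purpose")
by_institution = categorize(sites, "institution")
print(by_purpose)
```

Because the grouping is computed from metadata at query time, a new view (for example by `target`) needs no change to the stored records.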
14. Digitalarchives.tw
Purpose: Education
Target: Elementary school student, junior high school student, teacher, …
Purpose: Creative applications
Purpose: Academic research
Subject: Animal, Archaeology, Anthropology, …
Figure: http://digitalarchives.tw
15. Metadata for project documents
Over 14,000 documents and still increasing
Metadata: Dublin Core
Construct Teldapwiki, a Wikipedia for
TELDAP: http://wiki.teldap.tw/
17. Plans for building knowledge
structures for TELDAP
• Construct metadata models for different objects.
• Establish hyperlinks between contents and
objects.
– Develop keyword extraction tools.
– Design automatic hyperlink tagging tools.
• Construct TELDAP ontology and thesaurus.
– Art & Architecture Thesaurus by Getty
– Chinese WordNet
18. (1) Metadata models for different objects
• Digital collections
– Union catalog metadata model: Dublin Core+
• Web sites
– DCCAP (Dublin Core Collections Application Profile)
– Public fields
– Private fields: Unique Identifier, Format, Evaluation, Cataloging History
• Documents
– Document metadata: Dublin Core
19. (2) Establish hyperlinks between contents
and objects
• Identify keywords in contents.
• Tag keywords with related object hyperlinks.
20. Develop hyperlink tagging tools
• Word segmentation tools
– Resolve word segmentation ambiguities and identify
keywords.
– CKIP word segmentation system:
http://ckipsvr.iis.sinica.edu.tw/
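
To make the segmentation step concrete, here is a minimal Python sketch of greedy forward maximum matching against a word list. This is a simple stand-in chosen for illustration and is not the method used by the CKIP system; the function name `segment`, the `max_len` parameter, and the toy dictionary are assumptions.

```python
def segment(text, dictionary, max_len=6):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there;
    fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # single-character fallback
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Toy dictionary (illustrative only).
words = {"故宮", "博物院", "故宮博物院", "山水畫"}
tokens = segment("故宮博物院的山水畫", words)
keywords = [t for t in tokens if t in words]
print(tokens)
print(keywords)
```

Preferring the longest match resolves the ambiguity between "故宮" + "博物院" and "故宮博物院" in favor of the longer keyword. A production segmenter would also need to handle unknown words and true ambiguities that greedy matching gets wrong.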
21. Develop hyperlink tagging tools
• TELDAP keyword dictionary
– Extract keywords from metadata and establish
object-keyword relations.
Extract text from the XML data for each object.
The text is classified by topic, title,
description, author, location, era, etc.
From each class of text file, extract keywords by
automatic word segmentation, keyword
extraction, and manual post-editing.
– The current dictionary contains more than 50,000
keywords.
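
The object-keyword relation described above can be pictured as an inverted index from keyword to object identifiers. The Python sketch below is a simplification under stated assumptions: it takes the values of a `subject` element directly as keywords, which replaces the segmentation, keyword extraction, and manual post-editing steps; the XML layout, element names, and object IDs are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical per-object metadata; element names and IDs are illustrative.
OBJECT_XML = {
    "obj-001": "<record><title>Bronze ding</title>"
               "<subject>bronze</subject><subject>ritual vessel</subject></record>",
    "obj-002": "<record><title>Landscape scroll</title>"
               "<subject>painting</subject><subject>landscape</subject></record>",
    "obj-003": "<record><title>Bronze mirror</title>"
               "<subject>bronze</subject></record>",
}


def extract_fields(xml_text):
    """Collect element text grouped by element name (the class of text)."""
    fields = defaultdict(list)
    for element in ET.fromstring(xml_text):
        if element.text and element.text.strip():
            fields[element.tag].append(element.text.strip())
    return fields


def build_keyword_index(objects, keyword_fields=("subject",)):
    """Map each keyword to the set of object IDs whose metadata contains it."""
    index = defaultdict(set)
    for object_id, xml_text in objects.items():
        fields = extract_fields(xml_text)
        for field in keyword_fields:
            for value in fields.get(field, []):
                index[value.lower()].add(object_id)
    return index


keyword_index = build_keyword_index(OBJECT_XML)
print(sorted(keyword_index["bronze"]))
```

Keeping the fields separated by class makes it possible to treat, say, author names and topics differently when keywords are later reviewed by hand.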
22. Prototype system for hyperlink tagger
• Identify and select keywords from the input text
24. Prototype system for hyperlink tagger
• Hyperlinks point to the related digital collections
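
A hyperlink tagger of this kind can be sketched in a few lines of Python. This is an illustrative sketch, not the prototype itself: the keyword table, the placeholder URLs under example.org, and the function name `tag_hyperlinks` are assumptions, and the table keys are assumed to be lowercase.

```python
import html
import re

# Hypothetical keyword-to-URL table; the URLs are placeholders,
# not real TELDAP addresses. Keys are assumed to be lowercase.
KEYWORD_LINKS = {
    "bronze": "https://example.org/collections/bronze",
    "landscape painting": "https://example.org/collections/landscape-painting",
    "painting": "https://example.org/collections/painting",
}


def tag_hyperlinks(text, links):
    """Wrap each known keyword in an HTML anchor, preferring longer keywords."""
    # Longest-first alternation so "landscape painting" wins over "painting".
    keywords = sorted(links, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)

    def replace(match):
        keyword = match.group(0)
        url = links[keyword.lower()]
        return f'<a href="{html.escape(url)}">{html.escape(keyword)}</a>'

    return pattern.sub(replace, text)


tagged = tag_hyperlinks("A landscape painting and a bronze vessel.", KEYWORD_LINKS)
print(tagged)
```

The pattern deliberately uses no word boundaries, since Chinese text has none; for English input this means a keyword can also match inside a longer word. Text outside the matched keywords is passed through unescaped in this sketch.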
25. (3) Construct TELDAP ontology and
thesaurus
Establish association links between
Chinese keywords and Getty AAT.
Merge TELDAP keywords with Chinese
AAT.
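
One simple way to picture the linking and merging steps is exact label matching, sketched below in Python. The slides do not say how the association links are actually established, so this is only an assumption; the identifiers are placeholders and are not real AAT subject IDs, and the function names are hypothetical.

```python
# Hypothetical Chinese AAT labels; the IDs are placeholders,
# not real AAT subject IDs.
CHINESE_AAT = {
    "青銅器": "AAT-PLACEHOLDER-1",
    "山水畫": "AAT-PLACEHOLDER-2",
}


def link_keywords(teldap_keywords, aat_labels):
    """Split keywords into exact AAT label matches and those left for review."""
    linked, unlinked = {}, []
    for keyword in teldap_keywords:
        if keyword in aat_labels:
            linked[keyword] = aat_labels[keyword]
        else:
            unlinked.append(keyword)
    return linked, unlinked


def merge_vocabularies(teldap_keywords, aat_labels):
    """Merged term list: every AAT label plus TELDAP-only keywords with no ID yet."""
    merged = dict(aat_labels)
    for keyword in teldap_keywords:
        merged.setdefault(keyword, None)
    return merged


linked, unlinked = link_keywords(["山水畫", "故宮博物院"], CHINESE_AAT)
merged = merge_vocabularies(["山水畫", "故宮博物院"], CHINESE_AAT)
print(linked, unlinked)
```

Keywords that find no exact match are kept with an empty identifier so that they can be reviewed manually rather than dropped.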
27. Future Perspective
• Technology development
– Construct multilingual thesauri by extending Getty AAT.
– Maintain the TELDAP keyword-and-object relation
database.
– Construct name authority files, gazetteers, and
universal calendars.
– Design hyperlink taggers and keyword extension tools.
– Design an authoring tool that automatically provides
hyperlinks to keyword-related digital contents.
– Design a knowledge-based content retrieval system.
28. Future Perspective
• Content enrichment
– Within TELDAP:
Standardize the object metadata model and data format.
Provide object metadata in a controlled vocabulary.
Write scripts and stories for different topics with a Wiki-like
knowledge structure.
Enrich the digital collections.
Establish hyperlinks between textbooks and TELDAP
collections.
– Extend the knowledge sources, e.g. Wikipedia