The document summarizes the RIKS Demonstrator project. The project involves several partners including UNU-CRIS, K.U. Leuven, and i.Know. K.U. Leuven focuses on automated content extraction from multilingual web pages, text categorization using machine learning, and providing a search engine and indexing infrastructure. i.Know focuses on information forensics including smart indexing that distinguishes concepts and relations. The demonstrator aims to automate retrieval, processing, and presentation of news, documentation, and treaty texts for an e-learning environment.
1. Automated Information Retrieval and Text Categorization: The RIKS Demonstrator
Acknowledge final event
November 25, 2008
Marie-Francine Moens, Erik Boiy, Javier Arias (HMDB-LIIR)
Saskia Debergh (i.Know)
Philippe De Lombaerde, Birger Fühne (UNU-CRIS)
Overview
• UNU-CRIS: The RIKS Demonstrator
• K.U.Leuven:
– Content extraction from multilingual Web pages
– Text categorization: machine learning approach
– Search engine and indexing infrastructure
– Interfacing the Acknowledge platform
• i.Know:
– Information forensics
Acknowledge 25-11-2008
2. The RIKS Demonstrator
• United Nations University – Comparative Regional
Integration Studies (UNU-CRIS)
• Issues addressed in research and capacity building:
– (i) emergence of regional (= supra-national)
governance level
– (ii) linkages with other governance levels (national,
global/UN)
– (iii) building of regional institutions
– (iv) growing regional interdependence, etc.
• RIKS = Regional Integration Knowledge System
(UNU-CRIS and GARNET NoE)
3. The RIKS Demonstrator
Issues addressed in the demonstrator:
How to automate retrieval and processing
(cleaning, search, categorization,
presentation) of particular types of relevant
information in an e-learning environment?
– ‘News’: short texts, various formats, dynamic
collection, short life cycle, role of news in
e-learning application
– ‘Documentation’: heterogeneous texts: scientific
articles, theses, essays, ... , rather static collection
– Treaty texts: long and complex texts, static
collection, issue of accessibility
RIKS
example output
4. Demo
K.U.Leuven: Content extraction
from multilingual Web pages
• = Extracting the main content from a Web page and
removing extraneous data (navigation menus,
advertisements, etc.)
• Requirements of the tool:
– Accurate
– Generic
– Multilingual
– Fast
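The extraction task described above can be illustrated with a minimal sketch. It is a generic link-density heuristic and not the method of [Arias et al. submitted], which the slides do not describe. The names `BlockCollector` and `extract_main_content` and the thresholds `min_words` and `max_link_density` are assumptions made for illustration. Word counts use whitespace splitting, so the sketch would need adapting for languages written without spaces.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an illustrative choice, not exhaustive).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "ul", "ol", "table",
              "h1", "h2", "h3", "article", "section", "body"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts link characters per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars)
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block, normalizing whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_content(html, min_words=8, max_link_density=0.3):
    """Keep blocks that are long enough and are not mostly link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, link_chars in parser.blocks:
        link_density = link_chars / len(text)
        if len(text.split()) >= min_words and link_density <= max_link_density:
            kept.append(text)
    return "\n".join(kept)
```

Short blocks and blocks dominated by anchor text (typical of navigation menus and advertisements) are dropped; the heuristic uses no language-specific resources and makes a single pass over the page.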
6. [Arias et al. submitted]
[5] = [Gottron 2008]
K.U.Leuven: Text categorization
• Heterogeneous documentation and Google News
classified into 27 categories (e.g., trade, poverty, ...)
• Supervised classifier: Multinomial Naïve Bayes, Support
Vector Machine, ...
• Features:
– different features: unigrams, bigrams, feature item
sets, ...
• Additional feature selection:
– Chi Square, Information Gain, Linear Classifier
Weights, Orthogonal Centroid Feature Selection
• Different test setups
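The pipeline outlined above (unigram and bigram features, chi-square feature selection, a Multinomial Naïve Bayes classifier) can be sketched in a few lines of standard-library Python. This is a toy illustration under stated assumptions, not the project's implementation: the function and class names, the Laplace smoothing, and the six training sentences are invented for the example, and only two of the 27 categories appear.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Unigrams plus bigrams of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return tokens + [a + "_" + b for a, b in zip(tokens, tokens[1:])]


def chi_square_select(docs, labels, k):
    """Keep the k features with the highest chi-square score for any category."""
    n = len(docs)
    doc_sets = [set(features(d)) for d in docs]
    class_count = Counter(labels)
    df = Counter(f for s in doc_sets for f in s)      # document frequency
    df_in_class = defaultdict(Counter)
    for s, y in zip(doc_sets, labels):
        df_in_class[y].update(s)
    scores = {}
    for f, n_f in df.items():
        best = 0.0
        for c, n_c in class_count.items():
            a = df_in_class[c][f]        # in category, has feature
            b = n_f - a                  # other categories, has feature
            c_no = n_c - a               # in category, lacks feature
            d = n - n_c - b              # other categories, lacks feature
            denom = (a + c_no) * (b + d) * (a + b) * (c_no + d)
            if denom:
                best = max(best, n * (a * d - b * c_no) ** 2 / denom)
        scores[f] = best
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return set(ranked[:k])


class MultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels, vocab):
        self.vocab = vocab
        self.classes = sorted(set(labels))
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(f for d in class_docs
                             for f in features(d) if f in vocab)
            total = sum(counts.values()) + len(vocab)
            self.log_like[c] = {f: math.log((counts[f] + 1) / total)
                                for f in vocab}
        return self

    def predict(self, doc):
        feats = [f for f in features(doc) if f in self.vocab]
        return max(self.classes, key=lambda c: self.log_prior[c]
                   + sum(self.log_like[c][f] for f in feats))


# Invented toy training data for two of the categories named on the slide.
train_docs = [
    "tariff cuts boost regional trade",
    "new trade agreement lowers tariff barriers",
    "export growth follows trade agreement",
    "rural poverty remains high",
    "aid programme targets extreme poverty",
    "poverty reduction needs rural aid",
]
train_labels = ["trade"] * 3 + ["poverty"] * 3

vocab = chi_square_select(train_docs, train_labels, k=10)
model = MultinomialNB().fit(train_docs, train_labels, vocab)
```

Calling `model.predict(text)` returns the category with the highest posterior score. Swapping in a different scoring function inside `chi_square_select` (for example, Information Gain) would leave the classifier unchanged, which is one reason feature selection is usually kept as a separate step.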