Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment The Riks Demonstrator

Automated Information Retrieval
and Text Categorization:
The RIKS Demonstrator

Acknowledge final event
November 25, 2008
Marie-Francine Moens, Erik Boiy, Javier Arias (HMDB-LIIR)
Saskia Debergh (i.Know)
Philippe De Lombaerde, Birger Fühne (UNU-CRIS)

Overview
• UNU-CRIS: The RIKS Demonstrator
• K.U.Leuven:
– Content extraction from multilingual Web pages
– Text categorization: machine learning approach
– Search engine and indexing infrastructure
– Interfacing the Acknowledge platform
• i.Know:
– Information forensics

Acknowledge 25-11-2008


• United Nations University – Comparative Regional
Integration Studies (UNU-CRIS)
• Issues addressed in research and capacity building:
– (i) emergence of regional (= supra-national)
governance level
– (ii) linkages with other governance levels (national,
global/UN)
– (iii) building of regional institutions
– (iv) growing regional interdependence, etc.
• RIKS = Regional Integration Knowledge System
(UNU-CRIS and GARNET NoE)



Issues addressed in the demonstrator:
How to automate retrieval and processing
(cleaning, search, categorization,
presentation) of particular types of relevant
information in an e-learning environment?
– ‘News’: short texts, various formats, dynamic
collection, short life cycle, role of news in
e-learning application
– ‘Documentation’: heterogeneous texts (scientific
articles, theses, essays, ...), rather static collection
– Treaty texts: long and complex texts, static
collection, issue of accessibility

RIKS
example output



Demo


K.U.Leuven: Content extraction
from multilingual Web pages

• Goal: extracting the main content from a Web page and
removing extraneous data (navigation menus,
advertisements, etc.)
• Requirements of the tool:
– Accurate
– Generic
– Multilingual
– Fast
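
The extraction goal above can be illustrated with a minimal sketch. It is not the method of [Arias et al. submitted], which these slides do not describe. It assumes a simple heuristic: drop text inside tags that usually hold navigation or boilerplate, and keep only text blocks with at least a minimum number of words. The names `SKIP_TAGS`, `BLOCK_TAGS`, `MainContentExtractor`, `extract_main_content`, and the `min_words` threshold are all hypothetical.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is treated as extraneous.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Assumption: these tags delimit text blocks.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "article", "section", "br"}


class MainContentExtractor(HTMLParser):
    """Collects text blocks outside SKIP_TAGS that have enough words."""

    def __init__(self, min_words=8):
        super().__init__()
        self.min_words = min_words
        self.skip_depth = 0   # > 0 while inside a skipped tag
        self.buffer = []      # text pieces of the current block
        self.blocks = []      # accepted blocks

    def flush(self):
        # Normalise whitespace, then keep the block only if it is long enough.
        text = " ".join("".join(self.buffer).split())
        self.buffer = []
        if len(text.split()) >= self.min_words:
            self.blocks.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buffer.append(data)


def extract_main_content(html, min_words=8):
    parser = MainContentExtractor(min_words)
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing block
    return "\n".join(parser.blocks)
```

The heuristic only uses tag names and block length, so it is generic and fast, but the word-count threshold relies on whitespace between words and would need to be replaced (for example by a character count) for languages written without spaces.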




[Arias et al. submitted]


[Arias et al. submitted]
[5] =[Gottron 2008]

K.U.Leuven: Text categorization
• Heterogeneous documentation and Google News
classified into 27 categories (e.g., trade, poverty, ...)
• Supervised classifiers: Multinomial Naïve Bayes, Support
Vector Machine, ...
• Features:
– unigrams, bigrams, feature item sets, ...
• Additional feature selection:
– Chi Square, Information Gain, Linear Classifier
Weights, Orthogonal Centroid Feature Selection
• Different test setups
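
As an illustration of one classifier and feature set named above, here is a minimal pure-Python sketch of a Multinomial Naïve Bayes classifier over unigram and bigram features with Laplace smoothing. It is not the RIKS implementation. The class name, the `features` helper, and the smoothing parameter `alpha` are assumptions. Feature selection (Chi Square, Information Gain, ...) is left out.

```python
import math
from collections import Counter, defaultdict


def features(text, use_bigrams=True):
    """Lowercased unigrams, optionally followed by adjacent-word bigrams."""
    tokens = text.lower().split()
    feats = list(tokens)
    if use_bigrams:
        feats += [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return feats


class MultinomialNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)        # documents per class
        self.feat_counts = defaultdict(Counter)  # feature counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = features(text)
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        self.n_docs = len(labels)
        return self

    def log_scores(self, text):
        """Log prior plus smoothed log likelihood for each class."""
        feats = [f for f in features(text) if f in self.vocab]  # skip unseen
        v = len(self.vocab)
        scores = {}
        for label, n in self.class_docs.items():
            total = sum(self.feat_counts[label].values())
            score = math.log(n / self.n_docs)
            for f in feats:
                count = self.feat_counts[label][f]
                score += math.log((count + self.alpha) / (total + self.alpha * v))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)
```

A usage example with two of the categories mentioned on the slide (the training sentences are made up for illustration):

```python
clf = MultinomialNB().fit(
    ["tariff reduction boosts regional trade",
     "free trade agreement lowers import tariff",
     "poverty reduction programme targets rural households",
     "income inequality and poverty in rural areas"],
    ["trade", "trade", "poverty", "poverty"],
)
print(clf.predict("new tariff on import goods"))
```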


K.U.Leuven: Text categorization


RIKS
K.U.Leuven: search engine




Demo



Knowing that you don't know what you should know

1. Information Forensics – Smart Indexing
more than just an index
distinguishes between concepts and relations
starts from unstructured text (bottom-up instead of top-down)
recognises word groups as meaningful units

Top-down:    Bottom-up:
knowledge    knowledge
keywords     concepts and relations
text         text
© i.Know NV ‐ All rights reserved.


1. Information Forensics – Smart Indexing
Example sentence: De Fortis Bank werd overgenomen door BNP Paribas.
(English: Fortis Bank was taken over by BNP Paribas.)

Traditional indexing (keywords):
Processing steps: stopwords, stemming, calculation, correlation

Keyword Index
Keyword      Weight
Fortis       0.23
Bank         0.38
werd         0.08
overgenomen  0.21
door         0.12
BNP          0.34
Paribas      0.27
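
A weighted keyword index of this kind can be sketched as follows. The slide does not say how its weights were calculated, so the sketch assumes a common tf-idf scheme and will not reproduce the numbers shown. The stopword list, the function names, and the second example sentence are hypothetical, and stemming is omitted.

```python
import math
from collections import Counter

# Tiny illustrative Dutch stopword list (assumption).
STOPWORDS = {"de", "het", "een"}


def tokenize(text):
    """Lowercase, strip surrounding punctuation, drop stopwords."""
    tokens = (t.strip(".,;:!?").lower() for t in text.split())
    return [t for t in tokens if t and t not in STOPWORDS]


def keyword_index(docs):
    """Return one {keyword: tf-idf weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter()  # number of documents containing each term
    for tokens in tokenized:
        df.update(set(tokens))
    n_docs = len(docs)
    index = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty docs
        index.append({
            term: (count / total) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in tf.items()
        })
    return index


index = keyword_index([
    "De Fortis Bank werd overgenomen door BNP Paribas.",
    "De bank werd gered door de overheid.",
])
```

Note that every word becomes its own entry: "Fortis" and "Bank", or "BNP" and "Paribas", are indexed separately, which is the limitation the slide contrasts with Smart Indexing.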




1. Information Forensics – Smart Indexing

Smart Indexing (concepts and relations):
Processing steps: concept detection, relation detection

Smart Index
Concept   Fortis Bank
Relation  werd overgenomen door
Concept   BNP Paribas

Input: De Fortis Bank werd overgenomen door BNP Paribas
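
The split into concepts and relations can be mimicked with a toy sketch. i.Know's actual detection method is not described in these slides. The sketch simply assumes a small hand-made lexicon of relation words and groups adjacent words of the same kind, so `RELATION_WORDS`, `DETERMINERS`, and `smart_index` are hypothetical.

```python
# Hypothetical lexicons, just large enough for the example sentence.
RELATION_WORDS = {"werd", "overgenomen", "door"}
DETERMINERS = {"de", "het", "een"}


def smart_index(sentence):
    """Group adjacent words into (kind, word group) entries.

    Words in RELATION_WORDS form relations; all other words, except
    determiners, form concepts. Adjacent words of the same kind are
    merged into one word group.
    """
    entries = []
    for token in sentence.rstrip(".").split():
        word = token.lower()
        if word in DETERMINERS:
            continue
        kind = "relation" if word in RELATION_WORDS else "concept"
        if entries and entries[-1][0] == kind:
            entries[-1] = (kind, entries[-1][1] + " " + token)
        else:
            entries.append((kind, token))
    return entries


result = smart_index("De Fortis Bank werd overgenomen door BNP Paribas.")
# result == [("concept", "Fortis Bank"),
#            ("relation", "werd overgenomen door"),
#            ("concept", "BNP Paribas")]
```

Unlike a keyword index, the word groups "Fortis Bank" and "BNP Paribas" stay together as single units.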



2. Categorisation based on Smart Indexing
Preconditions:
Pre-defined taxonomy/ontology
Top-down processing

Advantages of Smart Indexing:
Smart Indexing results can be used to fill and enrich the taxonomy, thus ensuring
the entries are
relevant
precise
complete




2. Categorisation

Input:
The Agreement will be applied with the European Union and with the EFTA states.

Smart Indexing (concepts and relations):
The Agreement will be applied with the European Union and with the EFTA states.

Categorisation:
EU, EFTA
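
The categorisation step in this example can be sketched as a lookup of detected concepts in a pre-defined taxonomy. The taxonomy entries and the `categorise` function below are hypothetical, and the concept list is written by hand rather than produced by a real Smart Indexing run.

```python
# Hypothetical taxonomy: concept (lowercased) -> category label.
TAXONOMY = {
    "european union": "EU",
    "efta states": "EFTA",
}


def categorise(concepts):
    """Return the categories of all concepts found in the taxonomy,
    in order of first appearance and without duplicates."""
    categories = []
    for concept in concepts:
        category = TAXONOMY.get(concept.lower())
        if category is not None and category not in categories:
            categories.append(category)
    return categories


# Concepts as Smart Indexing might return them for the input sentence.
concepts = ["Agreement", "European Union", "EFTA states"]
categories = categorise(concepts)  # ["EU", "EFTA"]
```

Because the lookup works on whole word groups, "European Union" matches as one unit, which a keyword index that splits it into "European" and "Union" could not guarantee.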


RIKS
i.Know: news categorization



RIKS
i.Know: news categorization





Demo



Thank you


