Semantic Transforms Using Collaborative Knowledge Bases

Yegin Genc, Winter Mason, Jeffrey V. Nickerson

Stevens Institute of Technology

Overview

• Automatically understand online information

• Use network artifacts, such as Wikipedia, to help

Topic Models
Algorithms to understand and organize documents by uncovering the semantic structure of a document collection

• Discover hidden themes – patterns of word use
• Connect documents that exhibit similar patterns

Latent Dirichlet Allocation (LDA)

“In the computer science field of artificial intelligence, a genetic algorithm (GA) is a
search heuristic that mimics the process of natural evolution. This heuristic is
routinely used to generate useful solutions to optimization and search problems.
Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which
generate solutions to optimization problems using techniques inspired by natural
evolution, such as inheritance, mutation, selection, and crossover.” 1

Algorithms – 0.28      Genetic – 0.18
Optimization – 0.28    Natural – 0.18
Algorithm – 0.14       Evolution – 0.18
Computer – 0.14        Evolutionary – 0.09
Techniques – 0.14      …
…

1 http://en.wikipedia.org/wiki/Genetic_algorithm
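
Word probabilities like the ones above come from the per-topic word distributions that LDA infers. As a reminder, the standard LDA generative process can be written as below. The symbols Z, W, θ and β follow the notation used later in these slides; the hyperparameter names α and η are the conventional ones and are an assumption here, since the slides do not name them.

```latex
\begin{align*}
\beta_k &\sim \mathrm{Dirichlet}(\eta)
  && \text{word distribution of topic } k \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{topic proportions of document } d \\
Z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d)
  && \text{topic assignment of word } n \text{ in document } d \\
W_{d,n} \mid Z_{d,n}, \beta &\sim \mathrm{Multinomial}(\beta_{Z_{d,n}})
  && \text{observed word}
\end{align*}
```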

Topics from LDA
computer    chemistry   cortex       orbit     infection
methods     synthesis   stimulus     dust      immune
number      oxidation   fig          jupiter   aids
two         reaction    vision       line      infected
principle   product     neuron       system    viral
design      organic     recordings   solar     cells
Five topics from a 50-topic LDA model fit to Science from 1980 to 2002 (Blei and Lafferty, 2009)

methods k of the for the the operations the
the the objects of the o and the of
a of to a linear we of functional a
of algorithm and to problem and to requires is
problems for the we problems a that and in
Ten randomly chosen topics from a 50-topic LDA model fit to abstracts from the Journal of
the ACM (JACM) from the years 1987 to 2004 (Blei et al., 2010).

The interpretation problem
1. Labeling the topics is difficult (J. Chang et al., 2009)
2. The relationships between topics are not identified
3. The information in the topics is based solely on the input corpus
4. The external validity of the topics may be limited

Collaborative Knowledge Bases
1. Labeled topics
2. Connected to each other in a meaningful way
3. Contain rich, focused information on particular topics
4. Contain fresh, up-to-date information about practically everything

Wikipedia Pages as Topics
LDA topic: orbit, dust, jupiter, line, system, solar, gas, atmospheric, mars, field

Wikipedia Page: Solar System
“The Solar System consists of the Sun and the astronomical objects gravitationally bound in orbit around it, all of which formed from the collapse of a giant molecular cloud approximately 4.6 billion years ago…”
(http://en.wikipedia.org/wiki/Solar_System)

Wikipedia Pages as Topics
Topics are characterized as distributions over observed words in
Wikipedia pages

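βk : Per-topic word distribution

$$\beta_k = p(W_i \mid k) = \frac{\{W_i \in k\}}{\sum_{i}^{N} \{W_i \in k\}}$$

Here {W_i ∈ k} is the number of times word W_i occurs in Wikipedia page k.

Word          Wikipedia count   Freq.
orbit         34                0.12
dust          7                 0.02
jupiter       36                0.12
line          0                 0.00
system        76                0.26
solar         110               0.38
gas           11                0.04
atmospheric   1                 0.00
mars          8                 0.03
field         8                 0.03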
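
The frequencies in the table above can be reproduced by dividing each word count by the total count. The short sketch below does this for the ten words shown; the function name `word_distribution` is illustrative and not taken from the slides.

```python
# Word counts for the "Solar System" page, copied from the table above.
counts = {
    "orbit": 34, "dust": 7, "jupiter": 36, "line": 0, "system": 76,
    "solar": 110, "gas": 11, "atmospheric": 1, "mars": 8, "field": 8,
}


def word_distribution(counts):
    """Normalize raw word counts into a probability distribution."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("page contains none of the listed words")
    return {word: n / total for word, n in counts.items()}


beta_k = word_distribution(counts)
for word, p in beta_k.items():
    # Two decimals, matching the precision used in the table.
    print(f"{word:<12} {p:.2f}")
```

Note that a word missing from the page (here "line") gets probability zero under this plain normalization.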

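[Figure: LDA compared with the Wikipedia-based (WIKI) method.
Matrices: Θ, DOCUMENT – TOPIC (D x K); W, DOCUMENT – WORD (D x W); β, TOPIC – WORD (K x W).
LDA panel: graphical model with topic assignments Z d,n and observed words W d,n.
WIKI panel: document-topic weights obtained as a product (*) with the Wiki (W x K) matrix.]

D: Documents K: Topics W: Words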
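
The matrix dimensions suggest one way to read the WIKI side of the figure: multiplying the document-word matrix (D x W) by the Wiki word-topic matrix (W x K) gives a document-topic matrix (D x K). The sketch below illustrates that reading on toy data. The row normalization, the helper names and all numbers are assumptions made for illustration, not details taken from the slides.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def normalize_rows(m):
    """Scale each row to sum to 1; all-zero rows are returned unchanged."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total for v in row] if total else list(row))
    return out


# Hypothetical toy data: 2 documents, 3 vocabulary words, 2 Wikipedia pages.
vocab = ["orbit", "solar", "neuron"]
W = [[2, 1, 0],       # word counts of document 0
     [0, 0, 3]]       # word counts of document 1
wiki = [[0.3, 0.0],   # p(orbit | page 0), p(orbit | page 1)
        [0.7, 0.1],   # p(solar | page 0), p(solar | page 1)
        [0.0, 0.9]]   # p(neuron | page 0), p(neuron | page 1)

# Document-topic weights (D x K), one row per document.
theta = normalize_rows(matmul(W, wiki))
for d, row in enumerate(theta):
    print(d, [round(v, 3) for v in row])
```

In this toy example, document 0 (about orbits and the sun) is weighted toward page 0, and document 1 toward page 1.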

Experiment
Data
617 abstracts from Journal of the ACM
Classified into 80 categories by their authors
53 categories have corresponding Wikipedia Pages

Abstracts
{Article Name: On the (Im)possibility of Obfuscating Programs,
Category: D.4. Operating Systems
Add. Category: F.1 Computation by Abstract Devices
…
}

Category Mappings
Category                              Wikipedia Page
D.4 Operating Systems                 Operating System
F.1 Computation by Abstract Devices   Abstract Machine

Three variations of our method

- Inbound links are Wikipedia pages that link to the topic page
- Outbound links are Wikipedia pages linked to by the topic page
- Text-based method only uses word distributions in topic pages
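
The two link definitions above can be made concrete with a small link graph. The sketch below only shows how the inbound and outbound page sets of a topic page are found; the graph is hypothetical, and how the link sets then enter the topic model is not specified on the slide.

```python
# Hypothetical link graph: each page maps to the set of pages it links to.
links = {
    "Operating system": {"Kernel (operating system)", "Computer hardware"},
    "Kernel (operating system)": {"Operating system"},
    "Device driver": {"Operating system", "Computer hardware"},
    "Computer hardware": set(),
}


def outbound(page, links):
    """Pages that the topic page links to."""
    return set(links.get(page, set()))


def inbound(page, links):
    """Pages that link to the topic page."""
    return {src for src, targets in links.items() if page in targets}


topic_page = "Operating system"
print("inbound: ", sorted(inbound(topic_page, links)))
print("outbound:", sorted(outbound(topic_page, links)))
```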

Results
Method           Primary        Primary or Additional

Text             182 (29.5%)    314 (50.8%)

Inbound links    131 (21.2%)    249 (40.0%)

Outbound links   79 (12.8%)     166 (26.9%)

The number (and percentage) of authors’ primary ACM topic labels, or authors’
primary + additional ACM topics successfully identified by each method.
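
The two result columns correspond to two hit criteria. The sketch below counts both on toy data, assuming each method returns a single predicted category per abstract; that assumption, the function name `score` and the example labels are illustrative and not stated on the slide.

```python
def score(predictions, labels):
    """Count hits against the primary label and against primary or additional.

    predictions: {abstract_id: predicted_category}
    labels: {abstract_id: (primary_category, [additional_categories])}
    Returns (primary_hits, primary_or_additional_hits, n_abstracts).
    """
    primary_hits = 0
    any_hits = 0
    for doc_id, predicted in predictions.items():
        primary, additional = labels[doc_id]
        if predicted == primary:
            primary_hits += 1
        if predicted == primary or predicted in additional:
            any_hits += 1
    return primary_hits, any_hits, len(predictions)


# Hypothetical toy labels and predictions for three abstracts.
labels = {
    "a1": ("D.4", ["F.1"]),
    "a2": ("F.1", []),
    "a3": ("D.4", ["F.1"]),
}
predictions = {"a1": "D.4", "a2": "D.4", "a3": "F.1"}

primary_hits, any_hits, n = score(predictions, labels)
print(f"primary: {primary_hits}/{n}, primary or additional: {any_hits}/{n}")
```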

LDA cannot be compared without an additional step mapping word distributions to
ACM topics.

Concluding Remarks
The Wiki categories often match the categories that were chosen by the authors. When they don’t match, they generally appear plausible.

Among the variations of our method, the text-based approach performed better than the link-based approaches.

Among the link-based approaches, inbound links performed better than outbound links.

Next Steps

Dependent topic structures

Combine heuristics with generative models:
Wikipedia as a prior for the topic distribution
Learn from the documents observed.
