3. Topic Models
Algorithms to understand and
organize documents by
uncovering the semantic structure
of a document collection
• Discover hidden themes –
patterns of word use
• Connect documents that
exhibit similar patterns
4. Latent Dirichlet Allocation (LDA)
“In the computer science field of artificial intelligence, a genetic algorithm (GA) is a
search heuristic that mimics the process of natural evolution. This heuristic is
routinely used to generate useful solutions to optimization and search problems.
Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which
generate solutions to optimization problems using techniques inspired by natural
evolution, such as inheritance, mutation, selection, and crossover.” 1
Algorithms – 0.28 Genetic – 0.18
Optimization – 0.28 Natural – 0.18
Algorithm – 0.14 Evolution – 0.18
Computer – 0.14 Evolutionary – 0.09
Techniques – 0.14 …
….
1 http://en.wikipedia.org/wiki/Genetic_algorithm
5. Topics from LDA
computer chemistry cortex orbit infection
methods synthesis stimulus dust immune
number oxidation fig jupiter aids
two reaction vision line infected
principle product neuron system viral
design organic recordings solar cells
Five topics from a 50-topic LDA model fit to Science from 1980 to 2002 (Blei and Lafferty, 2009)
methods k of the for the the operations the
the the objects of the o and the of
a of to a linear we of functional a
of algorithm and to problem and to requires is
problems for the we problems a that and in
Ten randomly chosen topics from a 50-topic LDA model fit to abstracts from the Journal of
the ACM (JACM) from the years 1987 to 2004 (Blei et al., 2010).
6. The interpretation problem
1. Labeling the topics is difficult (J. Chang et al.,
2009)
2. The relationships between topics are not
identified
3. The information in the topics is based solely
on the input corpus
4. The external validity of the topics may be
limited
7. Collaborative Knowledge Bases
1. Labeled topics
2. Connected to each other in a meaningful way
3. Contain rich, focused information on
particular topics
4. Contain fresh, up-to-date information about
practically everything
8. Wikipedia Pages as Topics
LDA topic: orbit, dust, jupiter, line, system, solar, gas, atmospheric, mars, field
Wikipedia Page: Solar System
“The Solar System consists of the Sun and the astronomical objects gravitationally bound in orbit around it, all of which formed from the collapse of a giant molecular cloud approximately 4.6 billion years ago…”
(http://en.wikipedia.org/wiki/Solar_System)
9. Wikipedia Pages as Topics
Topics are characterized as distributions over observed words in
Wikipedia pages
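Wikipedia word frequencies
Word          Count   Freq.
orbit            34    0.12
dust              7    0.02
jupiter          36    0.12
line              0    0.00
system           76    0.26
solar           110    0.38
gas              11    0.04
atmospheric       1    0.00
mars              8    0.03
field             8    0.03
βk : Per-topic word distribution
\beta_k = p(W_i \mid k) = \frac{\{W_i \in k\}}{\sum_{i}^{N} \{W_i \in k\}}
Here {W_i ∈ k} counts the occurrences of word W_i in Wikipedia page k, so each frequency in the table is the word's count divided by the column total (for example, 34 / 291 ≈ 0.12 for orbit).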
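The per-topic word distribution above is a plain normalization of word counts. A minimal Python sketch, assuming the counts are restricted to the ten words shown on the slide; the function and variable names are illustrative and do not come from the original work:

```python
# Word counts for the "Solar System" Wikipedia page, copied from the slide.
counts = {
    "orbit": 34, "dust": 7, "jupiter": 36, "line": 0, "system": 76,
    "solar": 110, "gas": 11, "atmospheric": 1, "mars": 8, "field": 8,
}


def topic_word_distribution(word_counts):
    """Normalize raw counts into p(word | topic). Hypothetical helper."""
    total = sum(word_counts.values())
    if total == 0:
        raise ValueError("topic page contains none of the words")
    return {word: n / total for word, n in word_counts.items()}


beta_k = topic_word_distribution(counts)
for word, p in beta_k.items():
    print(f"{word:<12} {counts[word]:>4} {p:.2f}")
```

Rounded to two decimals, the printed values reproduce the Freq. column of the slide.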
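10. Document-Topic, Document-Word, and Topic-Word Matrices
[Figure: the LDA and Wiki models side by side. Document-topic matrix Θ (D x K); document-word matrix W (D x W); topic-word matrix β (K x W). The LDA panel shows Z d,n and W d,n; the Wiki panel shows a Wiki (W x K) matrix.]
D: Documents K: Topics W: Words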
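One possible reading of this slide: where LDA infers Θ and β jointly from W, the Wiki variant takes the word-topic matrix from Wikipedia pages and scores each document against it. The sketch below assumes the score is a document's word-count row multiplied by the Wiki (W x K) matrix and then normalized per document. That reading, the toy vocabulary, and every number are assumptions for illustration only.

```python
# Toy vocabulary (W = 4) and two Wikipedia topics (K = 2); all values are made up.
vocab = ["orbit", "solar", "kernel", "process"]
topics = ["Solar System", "Operating System"]

# Wiki matrix (W x K): row = word, column = p(word | topic page).
wiki = [
    [0.40, 0.00],  # orbit
    [0.60, 0.00],  # solar
    [0.00, 0.50],  # kernel
    [0.00, 0.50],  # process
]

# Document-word counts (D x W), one row per document.
doc_words = [
    [3, 5, 0, 0],
    [0, 1, 4, 2],
]


def document_topic_scores(doc_words, wiki):
    """Return a D x K matrix; each row is doc_row * wiki, normalized to sum to 1."""
    n_topics = len(wiki[0])
    theta = []
    for row in doc_words:
        scores = [
            sum(row[w] * wiki[w][k] for w in range(len(row)))
            for k in range(n_topics)
        ]
        total = sum(scores)
        theta.append([s / total if total else 0.0 for s in scores])
    return theta


theta = document_topic_scores(doc_words, wiki)
for d, row in enumerate(theta):
    best = max(range(len(row)), key=row.__getitem__)
    print(f"document {d}: {topics[best]} {row}")
```

A document that shares no words with any topic page gets an all-zero row rather than a division error.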
11. Experiment
Data
617 abstracts from Journal of the ACM
Classified into 80 categories by their authors
53 categories have corresponding Wikipedia Pages
Abstracts
{Article Name: On the (Im)possibility of Obfuscating Programs,
Category: D.4 Operating Systems
Add. Category: F.1 Computation by Abstract Devices
…
}
Category Mappings
Category                              Wikipedia Page
D.4 Operating Systems                 Operating System
F.1 Computation by Abstract Devices   Abstract Machine
12. Three variations of our method
- Inbound links are Wikipedia pages that link to the topic page
- Outbound links are Wikipedia pages linked to by the topic
page
- Text-based method only uses word distributions in topic pages
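The two link-based variants differ only in the direction of the links they collect. A minimal sketch, assuming a hypothetical link graph stored as a mapping from each page to the pages it links to; the page names and links are made up for illustration:

```python
# Hypothetical link graph: page -> set of pages it links to.
links = {
    "Operating System": {"Kernel", "Process"},
    "Kernel": {"Operating System"},
    "Scheduler": {"Operating System", "Process"},
    "Process": set(),
}


def outbound_links(page, links):
    """Pages that the topic page links to."""
    return set(links.get(page, set()))


def inbound_links(page, links):
    """Pages that link to the topic page."""
    return {source for source, targets in links.items() if page in targets}


print(sorted(outbound_links("Operating System", links)))
print(sorted(inbound_links("Operating System", links)))
```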
13. Results
Method           Primary        Primary or Additional
Text             182 (29.5%)    314 (50.8%)
Inbound links    131 (21.2%)    249 (40.0%)
Outbound links    79 (12.8%)    166 (26.9%)
The number (and percentage) of authors’ primary ACM topic labels, or authors’
primary + additional ACM topics successfully identified by each method.
LDA cannot be compared without an additional step mapping word distributions to
ACM topics.
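The two columns correspond to two ways of counting a hit. A minimal sketch of that tally, assuming each abstract carries one primary category, a list of additional categories, and a single predicted category. The record layout and the toy data are assumptions, not the authors' evaluation code.

```python
def hit_counts(records):
    """Count predictions matching the primary label, and primary-or-additional."""
    primary = 0
    primary_or_additional = 0
    for rec in records:
        if rec["predicted"] == rec["primary"]:
            primary += 1
            primary_or_additional += 1
        elif rec["predicted"] in rec["additional"]:
            primary_or_additional += 1
    return primary, primary_or_additional


# Toy records using the category codes from the Experiment slide.
records = [
    {"primary": "D.4", "additional": ["F.1"], "predicted": "D.4"},
    {"primary": "D.4", "additional": ["F.1"], "predicted": "F.1"},
    {"primary": "F.1", "additional": [], "predicted": "D.4"},
]

n = len(records)
primary, either = hit_counts(records)
print(f"Primary: {primary} ({100 * primary / n:.1f}%)")
print(f"Primary or additional: {either} ({100 * either / n:.1f}%)")
```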
15. Concluding Remarks
The Wiki categories often match the categories that
were chosen by the authors. When they don’t
match, they generally appear plausible.
Among the variations of our method, the text-based
approach performed better than the link-based
approaches.
Among the link-based approaches, inbound links
performed better than outbound links.
16. Next Steps
Dependent topic structures
Combine heuristics with generative models:
Wikipedia as a prior for the topic distribution
Learn from the documents observed.
Editor's notes
Blei: “Much of my research is in topic models, which are a suite of algorithms to uncover the hidden thematic structure of a collection of documents. These algorithms help us develop new ways to search, browse and summarize large archives of texts.”
Here is an example of a paragraph.
We assume that some number of topics exist in a document set.
Each document is a mixture of these corpus-wide topics.
Each topic is a distribution over words.
Each word is drawn from one of those topics.
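The assumptions in these notes describe a generative process. A minimal sketch using only the Python standard library, with a made-up vocabulary, two hand-written topics, and an arbitrarily chosen symmetric Dirichlet parameter `alpha`. It illustrates the sampling steps only and does not perform inference.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, n_words, alpha, rng):
    """Generate one document: mix the topics, then draw each word from one topic."""
    theta = sample_dirichlet(alpha, len(topics), rng)  # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # topic for this word
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])  # word drawn from topic z
    return theta, words


# Each topic is a distribution over words (illustrative values).
topics = [
    {"genetic": 0.4, "evolution": 0.4, "natural": 0.2},
    {"algorithm": 0.5, "optimization": 0.3, "computer": 0.2},
]

rng = random.Random(0)
theta, words = generate_document(topics, n_words=8, alpha=0.5, rng=rng)
print(theta)
print(words)
```

Fitting LDA runs this process in reverse: given only the words, it estimates the topic mixtures and the topics themselves.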