Graphical models for automatic information extraction
1. Graphical Models for the Internet
Alexander Smola & Amr Ahmed
Yahoo! Research & Australian National University
Santa Clara, CA
alex@smola.org blog.smola.org
2. Outline
• Part 1 - Motivation
• Automatic information extraction
• Application areas
• Part 2 - Basic Tools
• Density estimation / conjugate distributions
• Directed Graphical models and inference
• Part 3 - Topic Models (our workhorse)
• Statistical model
• Large scale inference (parallelization, particle filters)
• Part 4 - Advanced Modeling
• Temporal dependence
• Mixing clustering and topic models
• Social Networks
• Language models
12. Language model
automatically synthesized from the Penn Treebank
Mochihashi, Yamada, Ueda, ACL 2009
13. User model over time
[Figure: two users' topic proportions over 40 days; one user shifts between Baseball and Dating, the other moves among Finance, Jobs, Celebrity, Health, and Dating]
Top words per topic:
• Dating: women, men, dating, singles, personals, seeking, match
• Baseball: League, baseball, basketball, doublehead, Bergesen, Griffey, bullpen, Greinke
• Celebrity: Snooki, Tom Cruise, Katie Holmes, Pinkett, Kudrow, Hollywood
• Health: skin, body, fingers, cells, toes, wrinkle, layers
• Jobs: job, career, business, assistant, hiring, part-time, receptionist
• Finance: financial, Thomson, chart, real, Stock, Trading, currency
Ahmed et al., KDD 2011
19. Ontologies
• continuous maintenance
• no guarantee of coverage
• difficult categories
• expensive, small
20. Face Classification
• 100-1000 people
• 10k faces
• curated (not realistic)
• expensive to generate
21. Topic Detection & Tracking
• editorially curated training data
• expensive to generate
• subjective in selection of threads
• language specific
22. Advertising Targeting
• Needs training data in every language
• Is it really relevant for better ads?
• Does it cover relevant areas?
23. Challenges
• Scale
• Millions to billions of instances
(documents, clicks, users, messages, ads)
• Rich structure of data (ontology, categories, tags)
• Model description typically larger than memory of single workstation
• Modeling
• Usually clustering or topic models do not solve the problem
• Temporal structure of data
• Side information for variables
• Solve the problem. Don't simply apply a model!
• Inference
• 10k-100k clusters for hierarchical model
• 1M-100M words
• Communication is an issue for large state space
24. Summary - Part 1
• Essentially infinite amount of data
• Labeling is prohibitively expensive
• Not scalable for i18n
• Even for supervised problems unlabeled data abounds. Use it.
• User-understandable structure for representation purposes
• Solutions are often customized to the problem
We can only cover building blocks in this tutorial.
27. Probability
• Space of events X
• server status (working, slow, broken)
• income of the user (e.g. $95,000)
• search queries (e.g. “graphical models”)
• Probability axioms (Kolmogorov)
Pr(X) ∈ [0, 1], Pr of the whole event space is 1
Pr(∪_i X_i) = Σ_i Pr(X_i) if X_i ∩ X_j = ∅ for all i ≠ j
• Example queries
• P(server working) = 0.999
• P(90,000 ≤ income ≤ 100,000) = 0.1
28. (In)dependence
• Independence Pr(x, y) = Pr(x) · Pr(y)
• Login behavior of two users (approximately)
• Disk crash in different colos (approximately)
• Dependent events: Pr(x, y) ≠ Pr(x) · Pr(y)
• Emails
• Queries
• News stream / Buzz / Tweets
• IM communication
• Russian Roulette
Dependence is everywhere!
33. AIDS test (Bayes rule)
• Data
• Approximately 0.1% are infected
• Test detects all infections
• Test reports positive for 1% healthy people
• Probability of having AIDS if test is positive
Pr(a = 1|t) = Pr(t|a = 1) · Pr(a = 1) / Pr(t)
            = Pr(t|a = 1) · Pr(a = 1) / [Pr(t|a = 1) · Pr(a = 1) + Pr(t|a = 0) · Pr(a = 0)]
            = 1 · 0.001 / (1 · 0.001 + 0.01 · 0.999) = 0.091
34. Improving the diagnosis
• Use a follow-up test
• Test 2 reports positive for 90% infections
• Test 2 reports positive for 5% healthy people
Pr(a = 1|t1 = 1, t2 = 1) = 1 · 0.9 · 0.001 / (1 · 0.9 · 0.001 + 0.01 · 0.05 · 0.999) = 0.643
(equivalently, Pr(a = 0|t1 = 1, t2 = 1) = 0.01 · 0.05 · 0.999 / (1 · 0.9 · 0.001 + 0.01 · 0.05 · 0.999) = 0.357)
• Why can't we use Test 1 twice?
Outcomes are not independent, but tests 1 and 2 are conditionally independent:
p(t1, t2|a) = p(t1|a) · p(t2|a)
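To check these numbers, a short computation (a sketch, not from the deck; the rates are the ones quoted above):

```python
# Verifying the posteriors on these slides via Bayes rule.
prior = 0.001           # Pr(a = 1): roughly 0.1% infected
sens1, fp1 = 1.0, 0.01  # test 1: detects all infections, 1% false positives
sens2, fp2 = 0.9, 0.05  # test 2: 90% sensitivity, 5% false positives

# Posterior after test 1 is positive.
post1 = sens1 * prior / (sens1 * prior + fp1 * (1 - prior))
print(f"Pr(a=1 | t1=1)       = {post1:.3f}")   # 0.091

# Posterior after both tests are positive, assuming conditional
# independence: p(t1, t2 | a) = p(t1 | a) p(t2 | a).
num = sens1 * sens2 * prior
den = num + fp1 * fp2 * (1 - prior)
print(f"Pr(a=1 | t1=1, t2=1) = {num / den:.3f}")  # 0.643
```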
36. Naive Bayes Spam Filter
• Key assumption
Words occur independently of each other
given the label of the document
p(w_1, ..., w_n|spam) = ∏_{i=1}^{n} p(w_i|spam)
• Spam classification via Bayes rule
p(spam|w_1, ..., w_n) ∝ p(spam) ∏_{i=1}^{n} p(w_i|spam)
• Parameter estimation
Compute the spam probability and the word distributions for spam and ham
37. A Graphical Model
[Figure: naive Bayes as a directed graphical model, drawn unrolled (spam → w_1, w_2, ..., w_n) and in plate notation (spam → w_i, plate i = 1..n)]
How to estimate p(w|spam)?
p(w_1, ..., w_n|spam) = ∏_{i=1}^{n} p(w_i|spam)
38. Naive Naive Bayes Classifier
• Two classes (spam/ham)
• Binary features (e.g. presence of $$$, viagra)
• Simplistic Algorithm
• Count occurrences of feature for spam/ham
• Count number of spam/ham mails
spam probability: p(y) = n(y) / n
feature probability: p(x_i = TRUE|y) = n(i, y) / n(y)
p(y|x) ∝ (n(y)/n) · ∏_{i: x_i = TRUE} n(i, y)/n(y) · ∏_{i: x_i = FALSE} (n(y) - n(i, y))/n(y)
39. Naive Naive Bayes Classifier
what if n(i, y) = n(y)?
what if n(i, y) = 0?
p(y|x) ∝ (n(y)/n) · ∏_{i: x_i = TRUE} n(i, y)/n(y) · ∏_{i: x_i = FALSE} (n(y) - n(i, y))/n(y)
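The counting scheme above fits in a few lines; here is a sketch on a made-up toy corpus, with additive smoothing so the two questions above have an answer (setting smooth = 0 reproduces the brittle estimator):

```python
from collections import Counter

# toy corpus: (set of binary features present, label) -- illustrative only
docs = [({"$$$", "viagra"}, "spam"), ({"viagra"}, "spam"),
        ({"meeting"}, "ham"), ({"meeting", "budget"}, "ham")]
features = {"$$$", "viagra", "meeting", "budget"}

n = len(docs)
n_y = Counter(y for _, y in docs)                    # n(y)
n_iy = Counter((f, y) for x, y in docs for f in x)   # n(i, y)

def posterior(x, smooth=1.0):
    """p(y|x) up to normalization, with additive (Laplace) smoothing."""
    scores = {}
    for y in n_y:
        p = n_y[y] / n
        for f in features:
            p_true = (n_iy[(f, y)] + smooth) / (n_y[y] + 2 * smooth)
            p *= p_true if f in x else (1 - p_true)
        scores[y] = p
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

print(posterior({"viagra"}))  # heavily favors spam
```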
41. Two outcomes (binomial)
• Example: probability of ‘viagra’ in spam/ham
• Data likelihood
p(X; π) = π^{n_1} (1 - π)^{n_0}
• Maximum Likelihood Estimation
• Constraint π ∈ [0, 1]
• Taking derivatives yields
π = n_1 / (n_0 + n_1)
42. n outcomes (multinomial)
• Example: USA, Canada, India, UK, NZ
• Data likelihood
p(X; π) = ∏_i π_i^{n_i}
• Maximum Likelihood Estimation
• Constrained optimization problem: Σ_i π_i = 1
• Using the log-transform yields
π_i = n_i / Σ_j n_j
44. Conjugate Priors
• Unless we have lots of data, estimates are weak
• Usually we have an idea of what to expect
p(θ|X) ∝ p(X|θ) · p(θ)
we might even have 'seen' such data before
• Solution: add 'fake' observations
p(θ) ∝ p(X_fake|θ), hence p(θ|X) ∝ p(X|θ) p(X_fake|θ) = p(X ∪ X_fake|θ)
• Inference (generalized Laplace smoothing)
(1/n) Σ_{i=1}^{n} φ(x_i) → (1/(n+m)) Σ_{i=1}^{n} φ(x_i) + (m/(n+m)) μ_0
(m: fake count, μ_0: fake mean)
45. Conjugate Prior in action
• Discrete distribution, with m_i = m · [μ_0]_i
p(x = i) = n_i/n → p(x = i) = (n_i + m_i)/(n + m)
• Tossing a die
Outcome:        1    2    3    4    5    6
Counts:         3    6    2    1    4    4
MLE:            0.15 0.30 0.10 0.05 0.20 0.20
MAP (m0 = 6):   0.15 0.27 0.12 0.08 0.19 0.19
MAP (m0 = 100): 0.16 0.19 0.16 0.15 0.17 0.17
• Rule of thumb
need 10 data points (or prior) per parameter
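The table can be reproduced directly; a minimal sketch, assuming a uniform prior mean μ_0 = 1/6:

```python
# MLE vs. MAP for the die-toss counts above; m = 0 gives the MLE,
# larger m pulls all estimates toward the prior mean 1/6.
counts = [3, 6, 2, 1, 4, 4]

def map_estimate(counts, m, mu0=1/6):
    # p(x = i) = (n_i + m * mu0) / (n + m)
    n = sum(counts)
    return [(c + m * mu0) / (n + m) for c in counts]

for m in (0, 6, 100):
    est = map_estimate(counts, m)
    print(f"m = {m:3d}:", " ".join(f"{p:.2f}" for p in est))
```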
49. Exponential Families
• Density function
p(x; θ) = exp(⟨φ(x), θ⟩ - g(θ))
where g(θ) = log Σ_{x'} exp(⟨φ(x'), θ⟩)
• The log-partition function generates cumulants
∂_θ g(θ) = E[φ(x)]
∂_θ² g(θ) = Var[φ(x)]
• g is convex (second derivative is p.s.d.)
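As a sanity check of the cumulant property, here is a small numerical sketch (not from the deck) for the Bernoulli case, where φ(x) = x and g(θ) = log(1 + e^θ):

```python
import math

def g(theta):
    # log-partition function of a Bernoulli in natural parameters
    return math.log(1.0 + math.exp(theta))

theta, eps = 0.7, 1e-4
mean = (g(theta + eps) - g(theta - eps)) / (2 * eps)             # dg/dtheta
var = (g(theta + eps) - 2 * g(theta) + g(theta - eps)) / eps**2  # d^2g/dtheta^2

p = 1.0 / (1.0 + math.exp(-theta))  # success probability E[x]
print(mean, p)           # both ~0.668
print(var, p * (1 - p))  # both ~0.222
```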
50. Examples
• Binomial distribution: φ(x) = x
• Discrete distribution: φ(x) = e_x (e_x is the unit vector for x)
• Gaussian: φ(x) = (x, ½ x x^T)
• Poisson (counting measure 1/x!): φ(x) = x
• Dirichlet, Beta, Gamma, Wishart, ...
55. Maximum Likelihood
• Negative log-likelihood
- log p(X; θ) = m g(θ) - Σ_{i=1}^{m} ⟨φ(x_i), θ⟩
• Taking derivatives
-∂_θ log p(X; θ) = m (E[φ(x)] - (1/m) Σ_{i=1}^{m} φ(x_i))
(mean under the model vs. empirical average)
We pick the parameter such that the distribution matches the empirical average.
56. Example: Gaussian Estimation
• Sufficient statistics: x, x²
• Mean and variance given by
μ = E[x] and σ² = E[x²] - (E[x])²
• Maximum Likelihood Estimate
μ̂ = (1/n) Σ_{i=1}^{n} x_i and σ̂² = (1/n) Σ_{i=1}^{n} x_i² - μ̂²
• Maximum a Posteriori Estimate (smoother)
μ̂ = (1/(n + n_0)) Σ_{i=1}^{n} x_i and σ̂² = (1/(n + n_0)) (Σ_{i=1}^{n} x_i² + n_0 · 1) - μ̂²
(the n_0 fake observations pull the mean toward 0 and the second moment toward 1)
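A sketch of the smoothed estimator (the fake observations with zero mean and unit second moment follow the formula above; the data are made up):

```python
# MAP Gaussian estimate with n0 fake observations; n0 = 0 is the plain MLE.
def gaussian_map(xs, n0=10.0):
    n = len(xs)
    mu = sum(xs) / (n + n0)
    var = (sum(x * x for x in xs) + n0) / (n + n0) - mu ** 2
    return mu, var

xs = [2.1, 1.9, 2.4, 2.0]
print(gaussian_map(xs, n0=0))   # plain MLE
print(gaussian_map(xs, n0=10))  # pulled toward mean 0, variance 1
```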
57. Collapsing
• Conjugate priors: p(θ) ∝ p(X_fake|θ)
Hence we know how to compute the normalization
• Prediction
p(x|X) = ∫ p(x|θ) p(θ|X) dθ ∝ ∫ p(x|θ) p(X|θ) p(X_fake|θ) dθ = ∫ p({x} ∪ X ∪ X_fake|θ) dθ
• Conjugate pairs: (Beta, binomial), (Dirichlet, multinomial), (Gamma, Poisson), (Wishart, Gauss)
Look up closed-form expansions:
http://en.wikipedia.org/wiki/Exponential_family
59. ... some Web 2.0 service
[Figure: MySQL → Website ← Apache]
• Joint distribution (assume a and m are independent)
p(m, a, w) = p(w|m, a)p(m)p(a)
• Explaining away
p(m, a|w) = p(w|m, a) p(m) p(a) / Σ_{m', a'} p(w|m', a') p(m') p(a')
a and m are dependent conditioned on w
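To make the effect concrete, a small enumeration sketch (the failure probabilities and the deterministic OR are assumptions for illustration, not from the slides):

```python
import itertools

p_m, p_a = 0.02, 0.03  # Pr(mysql broken), Pr(apache broken) -- assumed

def p_w_broken(m, a):
    # the website is down iff at least one service is broken (simplification)
    return 1.0 if (m or a) else 0.0

# p(m, a | w broken) by enumeration (Bayes rule)
joint = {}
for m, a in itertools.product([0, 1], repeat=2):
    pm = p_m if m else 1 - p_m
    pa = p_a if a else 1 - p_a
    joint[(m, a)] = p_w_broken(m, a) * pm * pa
z = sum(joint.values())
post = {k: v / z for k, v in joint.items()}

# Learning that MySQL is broken "explains away" the outage and
# lowers Pr(apache broken | w) back to its prior.
p_a_given_w = post[(0, 1)] + post[(1, 1)]
p_a_given_w_m = post[(1, 1)] / (post[(1, 0)] + post[(1, 1)])
print(p_a_given_w, p_a_given_w_m)  # ~0.61 vs. 0.03
```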
60. ... some Web 2.0 service
[Figure: MySQL → Website ← Apache; the website is broken]
If the website is broken, at least one of the two services is broken. Learning that MySQL is working makes "Apache is working" less likely: given w, m and a are not independent.
61. Directed graphical model
[Figure: three candidate DAGs over m, a, w; downstream, a user action u depends on w]
• Easier estimation
• 15 parameters for the full joint distribution
• 1+1+3+1 for the factorizing distribution
• Causal relations
• Inference for unobserved variables
63. Directed Graphical Model
• Joint probability distribution
p(x) = ∏_i p(x_i | x_parents(i))
• Parameter estimation
• If x is fully observed the likelihood breaks up
log p(x|θ) = Σ_i log p(x_i | x_parents(i), θ)
• If x is partially observed things get interesting
maximization, EM, variational, sampling ...
64. Clustering
Density estimation
p(x, θ) = p(θ) ∏_{i=1}^{n} p(x_i|θ)
Clustering
p(x, y, θ) = p(π) ∏_{k=1}^{K} p(θ_k) ∏_{i=1}^{n} p(y_i|π) p(x_i|θ, y_i)
[Figure: plate diagrams for both models: θ → x, and π → y → x ← θ]
65. Chains
[Figure: Markov chain (past → present → future) and its plate representation; hidden Markov chain with a latent user mindset and observed user actions]
A user model for traversal through search results.
66. Chains
Markov chain
p(x; θ) = p(x_0; θ) ∏_{i=1}^{n-1} p(x_{i+1}|x_i; θ)
Hidden Markov chain (latent user mindset, observed user actions)
p(x, y; θ) = p(x_0; θ) ∏_{i=1}^{n-1} p(x_{i+1}|x_i; θ) ∏_{i=1}^{n} p(y_i|x_i)
A user model for traversal through search results.
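As a concrete illustration, a minimal sketch evaluating the Markov-chain factorization above (states and transition probabilities are made up):

```python
import math

def chain_logprob(xs, p0, trans):
    """log p(x) = log p(x0) + sum_i log p(x_{i+1} | x_i)"""
    lp = math.log(p0[xs[0]])
    for prev, cur in zip(xs, xs[1:]):
        lp += math.log(trans[prev][cur])
    return lp

p0 = {"browse": 0.8, "buy": 0.2}
trans = {"browse": {"browse": 0.7, "buy": 0.3},
         "buy":    {"browse": 0.9, "buy": 0.1}}
print(chain_logprob(["browse", "browse", "buy"], p0, trans))
```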
67. Factor Graphs
[Figure: factor graph linking latent factors to observed effects]
• Observed effects
Click behavior, queries, watched news, emails
• Latent factors
User profile, news content, hot keywords, social
connectivity graph, events
68. Recommender Systems
[Figure: plate diagram with users u and movies m in intersecting plates and ratings r in the intersection, like nested for loops; shown for Yahoo properties such as news, SearchMonkey, answers, social, ranking, OMG, personals]
• Users u
• Movies m
• Ratings r (but only for a subset of users)
69. Challenges
• How to design models (domain expert)
• Common (engineering) sense
• Computational tractability
• Inference (statistics)
• Easy for fully observed situations
• Many algorithms if not fully observed
• Dynamic programming / message passing
70. Summary - Part 2
• Probability theory to estimate events
• Conjugate priors and Laplace smoothing
• Conjugate = fantasy data
• Collapsing
• Laplace smoothing
• Directed graphical models
73. Clustering
Density estimation (log-concave in θ: find θ)
p(x, θ) = p(θ) ∏_{i=1}^{n} p(x_i|θ)
Clustering (general nonlinear)
p(x, y, θ) = p(π) ∏_{k=1}^{K} p(θ_k) ∏_{i=1}^{n} p(y_i|π) p(x_i|θ, y_i)
74. Clustering
• Optimization problem
maximize_θ Σ_y p(x, y, θ), i.e.
maximize_θ [ log p(π) + Σ_{k=1}^{K} log p(θ_k) + Σ_{i=1}^{n} log Σ_{y_i ∈ Y} p(y_i|π) p(x_i|θ, y_i) ]
• Options
• Direct nonconvex optimization (e.g. BFGS)
• Sampling (draw from the joint distribution)
• Variational approximation
(concave lower bounds aka EM algorithm)
75. Clustering
• Integrate out y
• Nonconvex optimization problem in θ
• EM algorithm
• Integrate out θ
• Y is coupled
• Sampling, with the collapsed conditional
p(y|x) ∝ p({x} | {x_i : y_i = y} ∪ X_fake) · p(y|Y ∪ Y_fake)
76. Gibbs sampling
• Sampling:
Draw an instance x from distribution p(x)
• Gibbs sampling:
• In most cases direct sampling not possible
• Draw one set of variables at a time
Example: joint over two binary variables, p(g, g) = p(b, b) = 0.45, p(b, g) = p(g, b) = 0.05. Resample one coordinate at a time:
(b, g) - draw p(·, g) → (g, g) - draw p(g, ·) → (g, g) - draw p(·, g) → (b, g) - draw p(b, ·) → (b, b) → ...
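A quick sketch of this toy example in Python (the joint table is the one above; everything else is illustrative):

```python
import random

p = {("b", "b"): 0.45, ("g", "g"): 0.45, ("b", "g"): 0.05, ("g", "b"): 0.05}

def resample(i, state):
    """Draw coordinate i from p(x_i | x_other)."""
    other = state[1 - i]
    weights = {v: (p[(v, other)] if i == 0 else p[(other, v)]) for v in "bg"}
    z = sum(weights.values())
    r, acc = random.random() * z, 0.0
    for v, w in weights.items():
        acc += w
        if r <= acc:
            return v

state, counts = ["b", "g"], {}
for step in range(100000):
    i = step % 2                      # alternate coordinates
    state[i] = resample(i, state)
    counts[tuple(state)] = counts.get(tuple(state), 0) + 1

# empirical frequencies converge to the joint p
print({k: v / 100000 for k, v in counts.items()})
```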
90. Topic Models
[Figure: example documents mixing topics such as university, airline, and food content across USA, Australia, and Singapore]
91. Clustering Topic Models
Clustering: group objects by prototypes
Topics: decompose objects into prototypes
92. Clustering Topic Models
clustering: α (prior) → θ (cluster probability) → y (cluster label) → x (instance)
Latent Dirichlet Allocation: α (prior) → θ (topic probability) → y (topic label) → x (instance)
96. Joint Probability Distribution
p(θ, z, ψ, x|α, β) = ∏_{k=1}^{K} p(ψ_k|β) ∏_{i=1}^{m} p(θ_i|α) ∏_{i,j} p(z_ij|θ_i) p(x_ij|z_ij, ψ)
Sampling ψ, θ, and z independently is slow.
[Figure: LDA plate diagram: α (prior) → θ_i (topic probability) → z_ij (topic label) → x_ij (instance) ← ψ_k ← β (language prior)]
97. Collapsed Sampler
p(z, x|α, β) = ∏_{i=1}^{m} p(z_i|α) ∏_{k=1}^{K} p({x_ij : z_ij = k}|β)
Sampling z sequentially is fast.
[Figure: the same LDA diagram with θ_i and ψ_k collapsed out]
98. Collapsed Sampler
Griffiths & Steyvers, 2005
p(z, x|α, β) = ∏_{i=1}^{m} p(z_i|α) ∏_{k=1}^{K} p({x_ij : z_ij = k}|β)
Resampling one topic assignment at a time is fast:
p(z_ij = t | rest) ∝ (n^{-ij}(t, d) + α_t) / (n^{-i}(d) + Σ_t α_t) · (n^{-ij}(t, w) + β_w) / (n^{-i}(t) + Σ_w β_w)
[Figure: the same collapsed LDA diagram]
99. Sequential Algorithm
• Collapsed Gibbs Sampler
• For 1000 iterations do
• For each document do
• For each word in the document do
• Resample topic for the word
• Update local (document, topic) table
• Update global (word, topic) table
this kills parallelism
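For concreteness, a toy single-machine version of this loop, following the collapsed update above (a sketch, not the production code discussed next):

```python
import random

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200):
    """Collapsed Gibbs sampling for LDA on toy data (docs = lists of word ids)."""
    z = [[random.randrange(K) for _ in d] for d in docs]
    ndt = [[0] * K for _ in docs]       # n(t, d): per-document topic counts
    nwt = [[0] * K for _ in range(V)]   # n(t, w): per-word topic counts
    nt = [0] * K                        # n(t): per-topic totals
    for d, doc in enumerate(docs):
        for j, w in enumerate(doc):
            t = z[d][j]
            ndt[d][t] += 1; nwt[w][t] += 1; nt[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                t = z[d][j]              # remove the current assignment
                ndt[d][t] -= 1; nwt[w][t] -= 1; nt[t] -= 1
                # collapsed conditional (document-length term is constant in t)
                weights = [(ndt[d][k] + alpha) * (nwt[w][k] + beta)
                           / (nt[k] + V * beta) for k in range(K)]
                r = random.random() * sum(weights)
                t, acc = 0, weights[0]
                while r > acc:
                    t += 1; acc += weights[t]
                z[d][j] = t              # add the new assignment
                ndt[d][t] += 1; nwt[w][t] += 1; nt[t] += 1
    return z

docs = [[0, 1, 0, 2], [3, 4, 3], [0, 2, 1]]  # word ids, vocabulary size 5
print(gibbs_lda(docs, K=2, V=5))
```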
100. State of the art
UMass Mallet, UC Irvine, Google
• For 1000 iterations do
• For each document do
• For each word in the document do
• Resample topic for the word
• Update local (document, topic) table (memory inefficient)
• Update CPU-local (word, topic) table (blocking)
• Update global (word, topic) table (network bound, table out of sync)
The sampling distribution decomposes into a slow, a moderately fast, and a rapidly changing part:
p(t|w_ij) ∝ α_t β_w / (n(t) + β̄) + n(t, d = i) β_w / (n(t) + β̄) + n(t, w = w_ij) (n(t, d = i) + α_t) / (n(t) + β̄)
(slow | moderately fast | changes rapidly)
101. Our Approach
• For 1000 iterations do (independently per computer)
• For each thread/core do
• For each document do
• For each word in the document do
• Resample topic for the word
• Update local (document, topic) table
• Generate computer local (word, topic) message
• In parallel update local (word, topic) table
• In parallel update global (word, topic) table
Instead of being network bound, memory inefficient, blocking, and out of sync: concurrent use of cpu, hdd, and net; minimal view; continuous sync; barrier free.
103. Multicore Architecture
Intel Threading Building Blocks
[Figure: a file streams tokens to parallel samplers; a combiner and updater maintain a joint state table, with diagnostics, hyperparameter optimization, and topic counts written out to file]
• Decouple multithreaded sampling and updating
(almost) avoids stalling for locks in the sampler
• Joint state table
• much less memory required
• samplers synchronized (10 docs vs. millions delay)
• Hyperparameter update via stochastic gradient descent
• No need to keep documents in memory (streaming)
104. Cluster Architecture
[Figure: four samplers backed by a distributed (key,value) store]
• Distributed (key,value) storage via memcached
• Background asynchronous synchronization
• single word at a time to avoid deadlocks
• no need to have joint dictionary
• uses disk, network, cpu simultaneously
105. Cluster Architecture
[Figure: four samplers, each with a local ICE client]
• Distributed (key,value) storage via ICE
• Background asynchronous synchronization
• single word at a time to avoid deadlocks
• no need to have joint dictionary
• uses disk, network, cpu simultaneously
106. Making it work
• Startup
• Randomly initialize topics on each node
(read from disk if already assigned - hotstart)
• Sequential Monte Carlo for startup much faster
• Aggregate changes on the fly
• Failover
• State constantly being written to disk
(worst case we lose 1 iteration out of 1000)
• Restart via standard startup routine
• Achilles heel: need to restart from checkpoint if even
a single machine dies.
107. Easily extensible
• Better language model (topical n-grams)
can process millions of users (vs 1000s)
• Conditioning on side information (upstream)
estimate topic based on authorship, source,
joint user model ...
• Conditioning on dictionaries (downstream)
integrate topics between different languages
• Time dependent sampler for user model
approximate inference per episode
108. Google
Mallet Irvine’08 Irvine’09 Yahoo LDA
LDA
Multicore no yes yes yes yes
Cluster MPI no MPI point 2 point memcached
dictionary separate joint
State table separate separate
split sparse sparse
asynchronous
synchronous synchronous synchronous asynchronous
Schedule approximate
exact exact exact exact
messages
109. Speed
• 1M documents per day on 1 computer
(1000 topics per doc, 1000 words per doc)
• 350k documents per day per node
(context switches, memcached, stray reducers)
• 8 Million docs (Pubmed)
(sampler does not burn in well - documents too short)
• Irvine: 128 machines, 10 hours
• Yahoo: 1 machine, 11 days
• Yahoo: 20 machines, 9 hours
• 20 Million docs (Yahoo! News Articles)
• Yahoo: 100 machines, 12 hours
110. Scalability
[Figure: runtime in hours (0-40) vs. number of CPUs (1, 10, 20, 50, 100) at 200k documents per computer; a second curve shows initial topics per word (x10)]
Likelihood even improves with parallelism!
-3.295 (1 node) -3.288 (10 nodes) -3.287 (20 nodes)
111. The Competition
[Figure: bar charts comparing Google, Irvine, and Yahoo on dataset size (millions of documents, up to 20M), cluster size (up to 130 machines), and throughput per hour (150 and 6.4k vs. 50k documents/h for Yahoo)]
113. Variable Replication
• Global shared variable
[Figure: global variables x, y, z; a second computer keeps a local copy y' of y and synchronizes it with the global copy]
• Make local copy
• Distributed (key,value) storage table for global copy
• Do all bookkeeping locally (store old versions)
• Sync local copies asynchronously using message passing
(no global locks are needed)
• This is an approximation!
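A minimal sketch of this bookkeeping (all class and key names are mine; the memcached/ICE store is replaced by a plain dict):

```python
# Each worker tracks the delta since its last sync and ships only that,
# so no global lock is ever taken; between syncs the local copy lags
# the global state (the approximation mentioned above).
class ReplicatedCounter:
    def __init__(self, store, key):
        self.store, self.key = store, key   # store: shared (key,value) table
        self.local = store.get(key, 0)      # current local copy
        self.synced = self.local            # old copy at last sync

    def add(self, delta):                   # local bookkeeping only
        self.local += delta

    def sync(self):
        # push our delta, pull everyone else's updates
        delta = self.local - self.synced
        global_val = self.store.get(self.key, 0) + delta
        self.store[self.key] = global_val
        self.local = self.synced = global_val

store = {}  # stand-in for memcached / ICE
a, b = ReplicatedCounter(store, "n(t,w)"), ReplicatedCounter(store, "n(t,w)")
a.add(3); b.add(2)
a.sync(); b.sync()
print(store["n(t,w)"])  # 5: both deltas merged without locking
```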
114. Asymmetric Message Passing
• Large global shared state space
(essentially as large as the memory in computer)
• Distribute global copy over several machines
(distributed key,value storage)
[Figure: the global state is partitioned over several machines; each worker keeps a current copy and remembers the old copy from its last sync]
115. Out of core storage
• Very large state space
• Gibbs sampling requires us to traverse the data sequentially many times (think 1000x)
• Stream local data from disk and update the coupling variable each time local data is accessed
• This is exact
[Figure: the streaming pipeline from slide 103]
116. Summary - Part 3
• Inference in graphical models
• Clustering
• Topic models
• Sampling
• Implementation details
119. Problem
• How many clusters should we pick?
• How about a prior for infinitely many clusters?
• Finite model
p(y|Y, α) = (n(y) + α_y) / (n + Σ_y α_y)
• Infinite model
Assume that the total smoother weight is constant:
p(y|Y, α) = n(y) / (n + α) and p(new|Y, α) = α / (n + α)
120. Chinese Restaurant Metaphor
[Figure: tables in a Chinese restaurant with dish parameters φ1, φ2, φ3; the rich get richer]
Generative process: for data point x_i
• Choose an existing table j ∝ m_j and sample x_i ~ f(φ_j)
• Choose a new table K+1 ∝ α; sample φ_{K+1} ~ G_0 and sample x_i ~ f(φ_{K+1})
Pitman; Antoniak; Ishwaran; Jordan et al.; Teh et al.;
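The seating rule translates directly into code; a sketch (drawing φ ~ G_0 and x ~ f(φ) is left abstract):

```python
import random

def crp_assignments(n, alpha):
    """Seat n customers via the Chinese restaurant process."""
    tables = []   # m_j: number of customers at table j
    seating = []
    for i in range(n):
        # table j with prob m_j / (i + alpha), new table with alpha / (i + alpha)
        r = random.random() * (i + alpha)
        acc = 0.0
        for j, m in enumerate(tables):
            acc += m
            if r < acc:
                tables[j] += 1
                break
        else:
            j = len(tables)
            tables.append(1)  # open table K+1 (here one would draw phi ~ G0)
        seating.append(j)
    return seating

print(crp_assignments(20, alpha=1.0))  # the rich get richer
```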
121. Evolutionary Clustering
• Time series of objects, e.g. news stories
• Stories appear / disappear
• Want to keep track of clusters automatically
132. User modeling
Problem formulation
[Figure: a user's queries and tags as an unorganized cloud: movies, auto, car, theatre, price, deals, art, used, gallery, van, inspection, diet, hiring, job, calories, salary, recipe, chocolate, flight, London, hotel, weather, school, supplies, loan, college]
133. User modeling
Problem formulation
[Figure: the same queries grouped into intents: Cars (auto, car, price, deals, used, van, inspection), Art (movies, theatre, art, gallery), Jobs (hiring, job, salary), Diet (diet, calories, recipe, chocolate), Travel/finance (flight, London, hotel, weather, school, college, supplies, loan)]
134. User modeling
Problem formulation
Input
• Queries issued by the user or tags of watched content
• Snippet of the page examined by the user
• Time stamp of each action (day resolution)
Output
• Users' daily distribution over intents
• Dynamic intent representation
135. Time dependent models
• LDA for topical model of users where
• User interest distribution changes over time
• Topics change over time
• This is like a Kalman filter except that
• Don’t know what to track (a priori)
• Can’t afford a Rauch-Tung-Striebel smoother
• Much more messy than plain LDA
136. Graphical Model
[Figure: time-dependent LDA. Plain LDA: α → θ_i (user interest) → z_ij → w_ij (user actions) ← φ_k (actions per topic) ← β. The time-dependent model replicates this over slices t−1, t, t+1 with α^t, θ_i^t, and φ_k^t evolving over time]
137. Long- and short-term priors
[Figure: multi-resolution priors (all-time μ3, month μ2, week μ) combining long-term and short-term interests into the prior for user actions at time t; a short-term food/jobs distribution (food, recipe, chicken, pizza, part-time, job, opening, hiring, salary, Kelly, cuisine, mileage) drifts between t and t+1]
Example topics over time:
• Diet: Recipe, Chocolate, Pizza, Food, Chicken, Milk, Butter, Powder
• Cars: Car, Blue, Book, Kelley, Prices, Small, Speed, large
• Job: job, Career, Business, Assistant, Hiring, Part-time, Receptionist
• Finance: Bank, Online, Credit, Card, debt, portfolio, Finance, Chase
138. At time t / at time t+1
[Figure: short-term topic priors drift between time steps. At time t: Car (Altima, Accord, Blue, Book, Kelley, Prices, Small, Speed), Job (job, Career, Business, Assistant, Hiring, Part-time, Receptionist), Finance (Bank, Online, Credit, Card, debt, portfolio, Finance, Chase), Food (Recipe, Chocolate, Pizza, Food, Chicken, Milk, Butter, Powder). At time t+1 the Food prior shifts toward (Food, Chicken, Pizza, mileage) and Car toward (speed, offer, Camry, accord)]
Generative process:
• For each user interaction
• Choose an intent from the local distribution
• Sample a word from the topic's word distribution
• Or choose a new intent ∝ α
• Sample the new intent from the global distribution
• Sample a word from the new topic's word distribution
139. Global and user processes
[Figure: at times t, t+1, t+2, t+3 a global process (with counts m, m', n, n') is shared across time slices; the processes for user 1, user 2, and user 3 draw their intents from it]
140. Sample users
[Figure: two users' topic proportions over 40 days; one user shifts between Baseball and Dating, the other moves among Finance, Jobs, Celebrity, Health, and Dating]
Top words per topic:
• Dating: women, men, dating, singles, personals, seeking, match
• Baseball: League, baseball, basketball, doublehead, Bergesen, Griffey, bullpen, Greinke
• Celebrity: Snooki, Tom Cruise, Katie Holmes, Pinkett, Kudrow, Hollywood
• Health: skin, body, fingers, cells, toes, wrinkle, layers
• Jobs: job, career, business, assistant, hiring, part-time, receptionist
• Finance: financial, Thomson, chart, real, Stock, Trading, currency
144. LDA for user profiling
[Diagram: four parallel workers]
• Each worker: sample z for its users, then write counts to memcached
• Barrier
• One worker collects the counts and samples; the others do nothing
• Barrier
• All workers read the updated state from memcached
147. News Stream
• Over 1 high-quality news article per second
• Multiple sources (Reuters, AP, CNN, ...)
• Same story from multiple sources
• Stories are related
• Goals
• Aggregate articles into a storyline
• Analyze the storyline (topics, entities)
148. Clustering / RCRP
• Assume active story
distribution at time t
• Draw story indicator
• Draw words from story
distribution
• Down-weight story counts for
next day
Ahmed & Xing, 2008
149. Clustering / RCRP
• Pro
• Nonparametric model of story generation
(no need to model frequency of stories)
• No fixed number of stories
• Efficient inference via collapsed sampler
• Con
• We learn nothing!
• No content analysis
150. Latent Dirichlet Allocation
• Generate topic distribution
per article
• Draw topics per word from
topic distribution
• Draw words from topic specific
word distribution
Blei, Ng, Jordan, 2003
151. Latent Dirichlet Allocation
• Pro
• Topical analysis of stories
• Topical analysis of words (meaning, saliency)
• More documents improve estimates
• Con
• No clustering
152. More Issues
• Named entities are special, topics less
(e.g. Tiger Woods and his mistresses)
• Some stories are strange
(topical mixture is not enough - dirty models)
• Articles deviate from general story
(Hierarchical DP)
154. Storylines Model
• Topic model
• Topics per cluster
• RCRP for cluster
• Hierarchical DP for
article
• Separate model
for named entities
• Story specific
correction
156. The Graphical Model: Storylines
[Figure: the storylines model combines tightly-focused story distributions with high-level concept topics]
157. The Graphical Model: Storylines
Each story has:
• a distribution over words
• a distribution over topics
• a distribution over named entities
158. The Graphical Model: Storylines
• A document's topic mix is sampled from its story prior
• Words inside a document are either global or story specific
163. Estimation
• Sequential Monte Carlo (Particle Filter)
• For a new time period, draw stories s and topics z from
p(s_{t+1}, z_{t+1} | x_{1...t+1}, s_{1...t}, z_{1...t})
using Gibbs sampling for each particle
• Reweight each particle via
p(x_{t+1} | x_{1...t}, s_{1...t}, z_{1...t})
• Regenerate particles if the l2 norm of the weights grows too large
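Schematically, one SMC step might look as follows (a generic sketch; `extend`, `lik`, and the resampling threshold are placeholders, not the paper's exact choices):

```python
import random

def smc_step(particles, weights, extend, lik):
    # 1. extend each particle with new s_{t+1}, z_{t+1} (Gibbs moves inside)
    particles = [extend(p) for p in particles]
    # 2. reweight by the predictive likelihood of the new data
    weights = [w * lik(p) for w, p in zip(weights, particles)]
    z = sum(weights)
    weights = [w / z for w in weights]
    # 3. resample when the weights degenerate (the l2-norm test above;
    #    the 2/N threshold here is an assumption)
    if sum(w * w for w in weights) > 2.0 / len(weights):
        particles = random.choices(particles, weights, k=len(particles))
        weights = [1.0 / len(particles)] * len(particles)
    return particles, weights
```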
164. Numbers ...
• TDT5 (Topic Detection and Tracking)
macro-averaged minimum detection cost: 0.714
time 0.84 | entities 0.90 | topics 0.86 | story words 0.75
This is the best performance on TDT5!
• Yahoo News data
... beats all other clustering algorithms
178. Problem Statement: Ideologies
Build a model to describe both collections of data.
Visualization
• How does each ideology view mainstream events?
• On which topics do they differ?
• On which topics do they agree?
179. Problem Statement: Ideologies
Build a model to describe both collections of data.
Visualization, Classification
• Given a new news article or blog post, the system should infer
• from which side it was written
• and justify its answer on a topical level (view on abortion, taxes, health care)
180. Problem Statement: Ideologies
Build a model to describe both collections of data.
Visualization, Classification, Structured browsing
• Given a new news article or blog post, the user can ask for:
• examples of other articles from the same ideology about the same topic
• documents that could exemplify alternative views from other ideologies
181. Building a factored model
[Figure: factored model; ideology-specific views φ_{1,k} and φ_{2,k} (with ideology distributions Ω_1, Ω_2) are paired with shared topics β_1 ... β_k]
182. Building a factored model
[Figure: the same factored model with mixing weights; each word comes either from the shared topic β_k or from the ideology's view φ, with weights λ and 1−λ]
183. Datasets
Data
• Bitterlemons:
• Middle-east conflict; documents written by Israeli and Palestinian authors
• ~300 documents from each view, with average length 740
• Multi-author collection
• 80-20 split for test and train
• Political Blog-1:
• American political blogs (Democrat and Republican)
• 2040 posts with average post length = 100 words
• Follow the test and train split of (Yano et al., 2009)
• Political Blog-2 (tests generalization to a new writing style):
• Same as 1 but 6 blogs, 3 from each side
• ~14k posts with ~200 words per post
• 4 blogs for training and 2 blogs for test
184. Example: Bitterlemons corpus
[Figure: topics discovered in the Bitterlemons corpus, each split into a shared mainstream part and two ideology-specific views. A "US role" topic (powell, minister, colin, visit, arafat, state, leader, roadmap, bush, US, president, american, sharon, administration, prime, policy, washington, pressure, clinton, european); a "roadmap / peace process" topic (palestinian, settlement, process, force, terrorism, israeli, implementation, obligation, phase, security, peace, ceasefire, plan, political, authority, occupation, international, timetable); an "Arab involvement" topic (government, conflict, people, negotiation, track, official, strategic, plo, hizballah, leadership, islamic, syria, syrian, lebanon, withdrawal, iran, agreement, regional, intifada, initiative, jihad). Palestinian and Israeli views emphasize different words within each topic]
187. Getting the Alternative View
Finding alternate views
• Given a document written in one ideology, retrieve the equivalent document from the other ideology
• Baseline: SVM + cosine similarity
188. Can We Use Unlabeled Data?
• In theory this is simple
• Add a step that samples the document view (v)
• It doesn't mix in practice because of the tight coupling between v and (x1, x2, z)
• Solution
• Sample v and (x1, x2, z) as a block using a Metropolis-Hastings step
• This is a huge proposal!
189. Summary - Part 4
• Chinese Restaurant Process
• Recurrent CRP
• User modeling
• Storylines
• Ideology detection