Graphical models for automatic information extraction
1. Graphical Models for the Internet
Alexander Smola & Amr Ahmed
Yahoo! Research & Australian National University
Santa Clara, CA
alex@smola.org blog.smola.org
2. Outline
• Part 1 - Motivation
• Automatic information extraction
• Application areas
• Part 2 - Basic Tools
• Density estimation / conjugate distributions
• Directed Graphical models and inference
• Part 3 - Topic Models (our workhorse)
• Statistical model
• Large scale inference (parallelization, particle filters)
• Part 4 - Advanced Modeling
• Temporal dependence
• Mixing clustering and topic models
• Social Networks
• Language models
12. Language model
automatically synthesized from the Penn Treebank
Mochihashi, Yamada, Ueda, ACL 2009
13. User model over time
[Figure: two users' topic proportions over 40 days; one user shifts between Baseball and Dating, the other moves among Finance, Jobs, Celebrity, Health, and Dating]
Top words per topic:
• Dating: women, men, dating, singles, personals, seeking, match
• Baseball: League, baseball, basketball, doublehead, Bergesen, Griffey, bullpen, Greinke
• Celebrity: Snooki, Tom Cruise, Katie Holmes, Pinkett, Kudrow, Hollywood
• Health: skin, body, fingers, cells, toes, wrinkle, layers
• Jobs: job, career, business, assistant, hiring, part-time, receptionist
• Finance: financial, Thomson, chart, real, Stock, Trading, currency
Ahmed et al., KDD 2011
19. Ontologies
• continuous maintenance
• no guarantee of coverage
• difficult categories
• expensive, small
20. Face Classification
• 100-1000 people
• 10k faces
• curated (not realistic)
• expensive to generate
21. Topic Detection & Tracking
• editorially curated training data
• expensive to generate
• subjective in selection of threads
• language specific
22. Advertising Targeting
• Needs training data in every language
• Is it really relevant for better ads?
• Does it cover relevant areas?
23. Challenges
• Scale
• Millions to billions of instances
(documents, clicks, users, messages, ads)
• Rich structure of data (ontology, categories, tags)
• Model description typically larger than memory of single workstation
• Modeling
• Usually clustering or topic models do not solve the problem
• Temporal structure of data
• Side information for variables
• Solve the problem. Don't simply apply a model!
• Inference
• 10k-100k clusters for hierarchical model
• 1M-100M words
• Communication is an issue for large state space
24. Summary - Part 1
• Essentially infinite amount of data
• Labeling is prohibitively expensive
• Not scalable for i18n
• Even for supervised problems unlabeled data abounds. Use it.
• User-understandable structure for representation purposes
• Solutions are often customized to the problem
We can only cover building blocks in this tutorial.
27. Probability
• Space of events X
• server status (working, slow, broken)
• income of the user (e.g. $95,000)
• search queries (e.g. “graphical models”)
• Probability axioms (Kolmogorov)
Pr(X) ∈ [0, 1], Pr of the whole event space is 1
Pr(∪_i X_i) = Σ_i Pr(X_i) if X_i ∩ X_j = ∅ for all i ≠ j
• Example queries
• P(server working) = 0.999
• P(90,000 ≤ income ≤ 100,000) = 0.1
28. (In)dependence
• Independence Pr(x, y) = Pr(x) · Pr(y)
• Login behavior of two users (approximately)
• Disk crash in different colos (approximately)
• Dependent events: Pr(x, y) ≠ Pr(x) · Pr(y)
• Emails
• Queries
• News stream / Buzz / Tweets
• IM communication
• Russian Roulette
Dependence is everywhere!
33. AIDS test (Bayes rule)
• Data
• Approximately 0.1% are infected
• Test detects all infections
• Test reports positive for 1% healthy people
• Probability of having AIDS if test is positive
Pr(a = 1|t) = Pr(t|a = 1) · Pr(a = 1) / Pr(t)
            = Pr(t|a = 1) · Pr(a = 1) / [Pr(t|a = 1) · Pr(a = 1) + Pr(t|a = 0) · Pr(a = 0)]
            = 1 · 0.001 / (1 · 0.001 + 0.01 · 0.999) = 0.091
34. Improving the diagnosis
• Use a follow-up test
• Test 2 reports positive for 90% infections
• Test 2 reports positive for 5% healthy people
Pr(a = 1|t1 = 1, t2 = 1) = 1 · 0.9 · 0.001 / (1 · 0.9 · 0.001 + 0.01 · 0.05 · 0.999) = 0.643
(equivalently, Pr(a = 0|t1 = 1, t2 = 1) = 0.01 · 0.05 · 0.999 / (1 · 0.9 · 0.001 + 0.01 · 0.05 · 0.999) = 0.357)
• Why can't we use Test 1 twice?
Outcomes are not independent, but tests 1 and 2 are conditionally independent:
p(t1, t2|a) = p(t1|a) · p(t2|a)
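To check these numbers, a short computation (a sketch, not from the deck; the rates are the ones quoted above):

```python
# Verifying the posteriors on these slides via Bayes rule.
prior = 0.001           # Pr(a = 1): roughly 0.1% infected
sens1, fp1 = 1.0, 0.01  # test 1: detects all infections, 1% false positives
sens2, fp2 = 0.9, 0.05  # test 2: 90% sensitivity, 5% false positives

# Posterior after test 1 is positive.
post1 = sens1 * prior / (sens1 * prior + fp1 * (1 - prior))
print(f"Pr(a=1 | t1=1)       = {post1:.3f}")   # 0.091

# Posterior after both tests are positive, assuming conditional
# independence: p(t1, t2 | a) = p(t1 | a) p(t2 | a).
num = sens1 * sens2 * prior
den = num + fp1 * fp2 * (1 - prior)
print(f"Pr(a=1 | t1=1, t2=1) = {num / den:.3f}")  # 0.643
```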
36. Naive Bayes Spam Filter
• Key assumption
Words occur independently of each other
given the label of the document
p(w_1, ..., w_n|spam) = ∏_{i=1}^{n} p(w_i|spam)
• Spam classification via Bayes rule
p(spam|w_1, ..., w_n) ∝ p(spam) ∏_{i=1}^{n} p(w_i|spam)
• Parameter estimation
Compute the spam probability and the word distributions for spam and ham
37. A Graphical Model
[Figure: naive Bayes as a directed graphical model, drawn unrolled (spam → w_1, w_2, ..., w_n) and in plate notation (spam → w_i, plate i = 1..n)]
How to estimate p(w|spam)?
p(w_1, ..., w_n|spam) = ∏_{i=1}^{n} p(w_i|spam)
38. Naive Naive Bayes Classifier
• Two classes (spam/ham)
• Binary features (e.g. presence of $$$, viagra)
• Simplistic Algorithm
• Count occurrences of feature for spam/ham
• Count number of spam/ham mails
spam probability: p(y) = n(y) / n
feature probability: p(x_i = TRUE|y) = n(i, y) / n(y)
p(y|x) ∝ (n(y)/n) · ∏_{i: x_i = TRUE} n(i, y)/n(y) · ∏_{i: x_i = FALSE} (n(y) - n(i, y))/n(y)
39. Naive Naive Bayes Classifier
what if n(i, y) = n(y)?
what if n(i, y) = 0?
p(y|x) ∝ (n(y)/n) · ∏_{i: x_i = TRUE} n(i, y)/n(y) · ∏_{i: x_i = FALSE} (n(y) - n(i, y))/n(y)
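The counting scheme above fits in a few lines; here is a sketch on a made-up toy corpus, with additive smoothing so the two questions above have an answer (setting smooth = 0 reproduces the brittle estimator):

```python
from collections import Counter

# toy corpus: (set of binary features present, label) -- illustrative only
docs = [({"$$$", "viagra"}, "spam"), ({"viagra"}, "spam"),
        ({"meeting"}, "ham"), ({"meeting", "budget"}, "ham")]
features = {"$$$", "viagra", "meeting", "budget"}

n = len(docs)
n_y = Counter(y for _, y in docs)                    # n(y)
n_iy = Counter((f, y) for x, y in docs for f in x)   # n(i, y)

def posterior(x, smooth=1.0):
    """p(y|x) up to normalization, with additive (Laplace) smoothing."""
    scores = {}
    for y in n_y:
        p = n_y[y] / n
        for f in features:
            p_true = (n_iy[(f, y)] + smooth) / (n_y[y] + 2 * smooth)
            p *= p_true if f in x else (1 - p_true)
        scores[y] = p
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

print(posterior({"viagra"}))  # heavily favors spam
```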
41. Two outcomes (binomial)
• Example: probability of ‘viagra’ in spam/ham
• Data likelihood
p(X; π) = π^{n_1} (1 - π)^{n_0}
• Maximum Likelihood Estimation
• Constraint π ∈ [0, 1]
• Taking derivatives yields
π = n_1 / (n_0 + n_1)
42. n outcomes (multinomial)
• Example: USA, Canada, India, UK, NZ
• Data likelihood
p(X; π) = ∏_i π_i^{n_i}
• Maximum Likelihood Estimation
• Constrained optimization problem: Σ_i π_i = 1
• Using the log-transform yields
π_i = n_i / Σ_j n_j
44. Conjugate Priors
• Unless we have lots of data, estimates are weak
• Usually we have an idea of what to expect
p(θ|X) ∝ p(X|θ) · p(θ)
we might even have 'seen' such data before
• Solution: add 'fake' observations
p(θ) ∝ p(X_fake|θ), hence p(θ|X) ∝ p(X|θ) p(X_fake|θ) = p(X ∪ X_fake|θ)
• Inference (generalized Laplace smoothing)
(1/n) Σ_{i=1}^{n} φ(x_i) → (1/(n+m)) Σ_{i=1}^{n} φ(x_i) + (m/(n+m)) μ_0
(m: fake count, μ_0: fake mean)
45. Conjugate Prior in action
• Discrete distribution, with m_i = m · [μ_0]_i
p(x = i) = n_i/n → p(x = i) = (n_i + m_i)/(n + m)
• Tossing a die
Outcome:        1    2    3    4    5    6
Counts:         3    6    2    1    4    4
MLE:            0.15 0.30 0.10 0.05 0.20 0.20
MAP (m0 = 6):   0.15 0.27 0.12 0.08 0.19 0.19
MAP (m0 = 100): 0.16 0.19 0.16 0.15 0.17 0.17
• Rule of thumb
need 10 data points (or prior) per parameter
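The table can be reproduced directly; a minimal sketch, assuming a uniform prior mean μ_0 = 1/6:

```python
# MLE vs. MAP for the die-toss counts above; m = 0 gives the MLE,
# larger m pulls all estimates toward the prior mean 1/6.
counts = [3, 6, 2, 1, 4, 4]

def map_estimate(counts, m, mu0=1/6):
    # p(x = i) = (n_i + m * mu0) / (n + m)
    n = sum(counts)
    return [(c + m * mu0) / (n + m) for c in counts]

for m in (0, 6, 100):
    est = map_estimate(counts, m)
    print(f"m = {m:3d}:", " ".join(f"{p:.2f}" for p in est))
```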
49. Exponential Families
• Density function
p(x; θ) = exp(⟨φ(x), θ⟩ - g(θ))
where g(θ) = log Σ_{x'} exp(⟨φ(x'), θ⟩)
• The log-partition function generates cumulants
∂_θ g(θ) = E[φ(x)]
∂_θ² g(θ) = Var[φ(x)]
• g is convex (second derivative is p.s.d.)
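As a sanity check of the cumulant property, here is a small numerical sketch (not from the deck) for the Bernoulli case, where φ(x) = x and g(θ) = log(1 + e^θ):

```python
import math

def g(theta):
    # log-partition function of a Bernoulli in natural parameters
    return math.log(1.0 + math.exp(theta))

theta, eps = 0.7, 1e-4
mean = (g(theta + eps) - g(theta - eps)) / (2 * eps)             # dg/dtheta
var = (g(theta + eps) - 2 * g(theta) + g(theta - eps)) / eps**2  # d^2g/dtheta^2

p = 1.0 / (1.0 + math.exp(-theta))  # success probability E[x]
print(mean, p)           # both ~0.668
print(var, p * (1 - p))  # both ~0.222
```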
50. Examples
• Binomial distribution: φ(x) = x
• Discrete distribution: φ(x) = e_x (e_x is the unit vector for x)
• Gaussian: φ(x) = (x, ½ x x^T)
• Poisson (counting measure 1/x!): φ(x) = x
• Dirichlet, Beta, Gamma, Wishart, ...
55. Maximum Likelihood
• Negative log-likelihood
- log p(X; θ) = m g(θ) - Σ_{i=1}^{m} ⟨φ(x_i), θ⟩
• Taking derivatives
-∂_θ log p(X; θ) = m (E[φ(x)] - (1/m) Σ_{i=1}^{m} φ(x_i))
(mean under the model vs. empirical average)
We pick the parameter such that the distribution matches the empirical average.
56. Example: Gaussian Estimation
• Sufficient statistics: x, x²
• Mean and variance given by
μ = E[x] and σ² = E[x²] - (E[x])²
• Maximum Likelihood Estimate
μ̂ = (1/n) Σ_{i=1}^{n} x_i and σ̂² = (1/n) Σ_{i=1}^{n} x_i² - μ̂²
• Maximum a Posteriori Estimate (smoother)
μ̂ = (1/(n + n_0)) Σ_{i=1}^{n} x_i and σ̂² = (1/(n + n_0)) (Σ_{i=1}^{n} x_i² + n_0 · 1) - μ̂²
(the n_0 fake observations pull the mean toward 0 and the second moment toward 1)
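A sketch of the smoothed estimator (the fake observations with zero mean and unit second moment follow the formula above; the data are made up):

```python
# MAP Gaussian estimate with n0 fake observations; n0 = 0 is the plain MLE.
def gaussian_map(xs, n0=10.0):
    n = len(xs)
    mu = sum(xs) / (n + n0)
    var = (sum(x * x for x in xs) + n0) / (n + n0) - mu ** 2
    return mu, var

xs = [2.1, 1.9, 2.4, 2.0]
print(gaussian_map(xs, n0=0))   # plain MLE
print(gaussian_map(xs, n0=10))  # pulled toward mean 0, variance 1
```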
57. Collapsing
• Conjugate priors: p(θ) ∝ p(X_fake|θ)
Hence we know how to compute the normalization
• Prediction
p(x|X) = ∫ p(x|θ) p(θ|X) dθ ∝ ∫ p(x|θ) p(X|θ) p(X_fake|θ) dθ = ∫ p({x} ∪ X ∪ X_fake|θ) dθ
• Conjugate pairs: (Beta, binomial), (Dirichlet, multinomial), (Gamma, Poisson), (Wishart, Gauss)
Look up closed-form expansions:
http://en.wikipedia.org/wiki/Exponential_family
59. ... some Web 2.0 service
[Figure: MySQL → Website ← Apache]
• Joint distribution (assume a and m are independent)
p(m, a, w) = p(w|m, a)p(m)p(a)
• Explaining away
p(m, a|w) = p(w|m, a) p(m) p(a) / Σ_{m', a'} p(w|m', a') p(m') p(a')
a and m are dependent conditioned on w
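To make the effect concrete, a small enumeration sketch (the failure probabilities and the deterministic OR are assumptions for illustration, not from the slides):

```python
import itertools

p_m, p_a = 0.02, 0.03  # Pr(mysql broken), Pr(apache broken) -- assumed

def p_w_broken(m, a):
    # the website is down iff at least one service is broken (simplification)
    return 1.0 if (m or a) else 0.0

# p(m, a | w broken) by enumeration (Bayes rule)
joint = {}
for m, a in itertools.product([0, 1], repeat=2):
    pm = p_m if m else 1 - p_m
    pa = p_a if a else 1 - p_a
    joint[(m, a)] = p_w_broken(m, a) * pm * pa
z = sum(joint.values())
post = {k: v / z for k, v in joint.items()}

# Learning that MySQL is broken "explains away" the outage and
# lowers Pr(apache broken | w) back to its prior.
p_a_given_w = post[(0, 1)] + post[(1, 1)]
p_a_given_w_m = post[(1, 1)] / (post[(1, 0)] + post[(1, 1)])
print(p_a_given_w, p_a_given_w_m)  # ~0.61 vs. 0.03
```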
60. ... some Web 2.0 service
[Figure: MySQL → Website ← Apache; the website is broken]
If the website is broken, at least one of the two services is broken. Learning that MySQL is working makes "Apache is working" less likely: given w, m and a are not independent.
61. Directed graphical model
[Figure: three candidate DAGs over m, a, w; downstream, a user action u depends on w]
• Easier estimation
• 15 parameters for the full joint distribution
• 1+1+3+1 for the factorizing distribution
• Causal relations
• Inference for unobserved variables
63. Directed Graphical Model
• Joint probability distribution
p(x) = ∏_i p(x_i | x_parents(i))
• Parameter estimation
• If x is fully observed the likelihood breaks up
log p(x|θ) = Σ_i log p(x_i | x_parents(i), θ)
• If x is partially observed things get interesting
maximization, EM, variational, sampling ...
64. Clustering
Density estimation
p(x, θ) = p(θ) ∏_{i=1}^{n} p(x_i|θ)
Clustering
p(x, y, θ) = p(π) ∏_{k=1}^{K} p(θ_k) ∏_{i=1}^{n} p(y_i|π) p(x_i|θ, y_i)
[Figure: plate diagrams for both models: θ → x, and π → y → x ← θ]
65. Chains
[Figure: Markov chain (past → present → future) and its plate representation; hidden Markov chain with a latent user mindset and observed user actions]
A user model for traversal through search results.
66. Chains
Markov chain
p(x; θ) = p(x_0; θ) ∏_{i=1}^{n-1} p(x_{i+1}|x_i; θ)
Hidden Markov chain (latent user mindset, observed user actions)
p(x, y; θ) = p(x_0; θ) ∏_{i=1}^{n-1} p(x_{i+1}|x_i; θ) ∏_{i=1}^{n} p(y_i|x_i)
A user model for traversal through search results.
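As a concrete illustration, a minimal sketch evaluating the Markov-chain factorization above (states and transition probabilities are made up):

```python
import math

def chain_logprob(xs, p0, trans):
    """log p(x) = log p(x0) + sum_i log p(x_{i+1} | x_i)"""
    lp = math.log(p0[xs[0]])
    for prev, cur in zip(xs, xs[1:]):
        lp += math.log(trans[prev][cur])
    return lp

p0 = {"browse": 0.8, "buy": 0.2}
trans = {"browse": {"browse": 0.7, "buy": 0.3},
         "buy":    {"browse": 0.9, "buy": 0.1}}
print(chain_logprob(["browse", "browse", "buy"], p0, trans))
```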
67. Factor Graphs
[Figure: factor graph linking latent factors to observed effects]
• Observed effects
Click behavior, queries, watched news, emails
• Latent factors
User profile, news content, hot keywords, social
connectivity graph, events
68. Recommender Systems
[Figure: plate diagram with users u and movies m in intersecting plates and ratings r in the intersection, like nested for loops; shown for Yahoo properties such as news, SearchMonkey, answers, social, ranking, OMG, personals]
• Users u
• Movies m
• Ratings r (but only for a subset of users)
69. Challenges
• How to design models (domain expert)
• Common (engineering) sense
• Computational tractability
• Inference (statistics)
• Easy for fully observed situations
• Many algorithms if not fully observed
• Dynamic programming / message passing
70. Summary - Part 2
• Probability theory to estimate events
• Conjugate priors and Laplace smoothing
• Conjugate = fantasy data
• Collapsing
• Laplace smoothing
• Directed graphical models
73. Clustering
Density estimation (log-concave in θ: find θ)
p(x, θ) = p(θ) ∏_{i=1}^{n} p(x_i|θ)
Clustering (general nonlinear)
p(x, y, θ) = p(π) ∏_{k=1}^{K} p(θ_k) ∏_{i=1}^{n} p(y_i|π) p(x_i|θ, y_i)
74. Clustering
• Optimization problem
maximize_θ Σ_y p(x, y, θ), i.e.
maximize_θ [ log p(π) + Σ_{k=1}^{K} log p(θ_k) + Σ_{i=1}^{n} log Σ_{y_i ∈ Y} p(y_i|π) p(x_i|θ, y_i) ]
• Options
• Direct nonconvex optimization (e.g. BFGS)
• Sampling (draw from the joint distribution)
• Variational approximation
(concave lower bounds aka EM algorithm)
75. Clustering
• Integrate out y
• Nonconvex optimization problem in θ
• EM algorithm
• Integrate out θ
• Y is coupled
• Sampling, with the collapsed conditional
p(y|x) ∝ p({x} | {x_i : y_i = y} ∪ X_fake) · p(y|Y ∪ Y_fake)
76. Gibbs sampling
• Sampling:
Draw an instance x from distribution p(x)
• Gibbs sampling:
• In most cases direct sampling not possible
• Draw one set of variables at a time
Example: joint over two binary variables, p(g, g) = p(b, b) = 0.45, p(b, g) = p(g, b) = 0.05. Resample one coordinate at a time:
(b, g) - draw p(·, g) → (g, g) - draw p(g, ·) → (g, g) - draw p(·, g) → (b, g) - draw p(b, ·) → (b, b) → ...
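A quick sketch of this toy example in Python (the joint table is the one above; everything else is illustrative):

```python
import random

p = {("b", "b"): 0.45, ("g", "g"): 0.45, ("b", "g"): 0.05, ("g", "b"): 0.05}

def resample(i, state):
    """Draw coordinate i from p(x_i | x_other)."""
    other = state[1 - i]
    weights = {v: (p[(v, other)] if i == 0 else p[(other, v)]) for v in "bg"}
    z = sum(weights.values())
    r, acc = random.random() * z, 0.0
    for v, w in weights.items():
        acc += w
        if r <= acc:
            return v

state, counts = ["b", "g"], {}
for step in range(100000):
    i = step % 2                      # alternate coordinates
    state[i] = resample(i, state)
    counts[tuple(state)] = counts.get(tuple(state), 0) + 1

# empirical frequencies converge to the joint p
print({k: v / 100000 for k, v in counts.items()})
```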
90. Topic Models
[Figure: example documents mixing topics such as university, airline, and food content across USA, Australia, and Singapore]
91. Clustering Topic Models
Clustering: group objects by prototypes
Topics: decompose objects into prototypes
92. Clustering Topic Models
clustering: α (prior) → θ (cluster probability) → y (cluster label) → x (instance)
Latent Dirichlet Allocation: α (prior) → θ (topic probability) → y (topic label) → x (instance)
96. Joint Probability Distribution
p(θ, z, ψ, x|α, β) = ∏_{k=1}^{K} p(ψ_k|β) ∏_{i=1}^{m} p(θ_i|α) ∏_{i,j} p(z_ij|θ_i) p(x_ij|z_ij, ψ)
Sampling ψ, θ, and z independently is slow.
[Figure: LDA plate diagram: α (prior) → θ_i (topic probability) → z_ij (topic label) → x_ij (instance) ← ψ_k ← β (language prior)]
97. Collapsed Sampler
p(z, x|α, β) = ∏_{i=1}^{m} p(z_i|α) ∏_{k=1}^{K} p({x_ij : z_ij = k}|β)
Sampling z sequentially is fast.
[Figure: the same LDA diagram with θ_i and ψ_k collapsed out]
98. Collapsed Sampler
Griffiths & Steyvers, 2005
p(z, x|α, β) = ∏_{i=1}^{m} p(z_i|α) ∏_{k=1}^{K} p({x_ij : z_ij = k}|β)
Resampling one topic assignment at a time is fast:
p(z_ij = t | rest) ∝ (n^{-ij}(t, d) + α_t) / (n^{-i}(d) + Σ_t α_t) · (n^{-ij}(t, w) + β_w) / (n^{-i}(t) + Σ_w β_w)
[Figure: the same collapsed LDA diagram]
99. Sequential Algorithm
• Collapsed Gibbs Sampler
• For 1000 iterations do
• For each document do
• For each word in the document do
• Resample topic for the word
• Update local (document, topic) table
• Update global (word, topic) table
this kills parallelism
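For concreteness, a toy single-machine version of this loop, following the collapsed update above (a sketch, not the production code discussed next):

```python
import random

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200):
    """Collapsed Gibbs sampling for LDA on toy data (docs = lists of word ids)."""
    z = [[random.randrange(K) for _ in d] for d in docs]
    ndt = [[0] * K for _ in docs]       # n(t, d): per-document topic counts
    nwt = [[0] * K for _ in range(V)]   # n(t, w): per-word topic counts
    nt = [0] * K                        # n(t): per-topic totals
    for d, doc in enumerate(docs):
        for j, w in enumerate(doc):
            t = z[d][j]
            ndt[d][t] += 1; nwt[w][t] += 1; nt[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                t = z[d][j]              # remove the current assignment
                ndt[d][t] -= 1; nwt[w][t] -= 1; nt[t] -= 1
                # collapsed conditional (document-length term is constant in t)
                weights = [(ndt[d][k] + alpha) * (nwt[w][k] + beta)
                           / (nt[k] + V * beta) for k in range(K)]
                r = random.random() * sum(weights)
                t, acc = 0, weights[0]
                while r > acc:
                    t += 1; acc += weights[t]
                z[d][j] = t              # add the new assignment
                ndt[d][t] += 1; nwt[w][t] += 1; nt[t] += 1
    return z

docs = [[0, 1, 0, 2], [3, 4, 3], [0, 2, 1]]  # word ids, vocabulary size 5
print(gibbs_lda(docs, K=2, V=5))
```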
100. State of the art
UMass Mallet, UC Irvine, Google
• For 1000 iterations do
• For each document do
• For each word in the document do
• Resample topic for the word
• Update local (document, topic) table (memory inefficient)
• Update CPU-local (word, topic) table (blocking)
• Update global (word, topic) table (network bound, table out of sync)
The sampling distribution decomposes into a slow, a moderately fast, and a rapidly changing part:
p(t|w_ij) ∝ α_t β_w / (n(t) + β̄) + n(t, d = i) β_w / (n(t) + β̄) + n(t, w = w_ij) (n(t, d = i) + α_t) / (n(t) + β̄)
(slow | moderately fast | changes rapidly)
101. Our Approach
• For 1000 iterations do (independently per computer)
• For each thread/core do
• For each document do
• For each word in the document do
• Resample topic for the word
• Update local (document, topic) table
• Generate computer local (word, topic) message
• In parallel update local (word, topic) table
• In parallel update global (word, topic) table
Instead of being network bound, memory inefficient, blocking, and out of sync: concurrent use of cpu, hdd, and net; minimal view; continuous sync; barrier free.
103. Multicore Architecture
Intel Threading Building Blocks
[Figure: a file streams tokens to parallel samplers; a combiner and updater maintain a joint state table, with diagnostics, hyperparameter optimization, and topic counts written out to file]
• Decouple multithreaded sampling and updating
(almost) avoids stalling for locks in the sampler
• Joint state table
• much less memory required
• samplers synchronized (10 docs vs. millions delay)
• Hyperparameter update via stochastic gradient descent
• No need to keep documents in memory (streaming)
104. Cluster Architecture
[Figure: four samplers backed by a distributed (key,value) store]
• Distributed (key,value) storage via memcached
• Background asynchronous synchronization
• single word at a time to avoid deadlocks
• no need to have joint dictionary
• uses disk, network, cpu simultaneously
105. Cluster Architecture
[Figure: four samplers, each with a local ICE client]
• Distributed (key,value) storage via ICE
• Background asynchronous synchronization
• single word at a time to avoid deadlocks
• no need to have joint dictionary
• uses disk, network, cpu simultaneously
106. Making it work
• Startup
• Randomly initialize topics on each node
(read from disk if already assigned - hotstart)
• Sequential Monte Carlo for startup much faster
• Aggregate changes on the fly
• Failover
• State constantly being written to disk
(worst case we lose 1 iteration out of 1000)
• Restart via standard startup routine
• Achilles heel: need to restart from checkpoint if even
a single machine dies.
107. Easily extensible
• Better language model (topical n-grams)
can process millions of users (vs 1000s)
• Conditioning on side information (upstream)
estimate topic based on authorship, source,
joint user model ...
• Conditioning on dictionaries (downstream)
integrate topics between different languages
• Time dependent sampler for user model
approximate inference per episode
108. Google
Mallet Irvine’08 Irvine’09 Yahoo LDA
LDA
Multicore no yes yes yes yes
Cluster MPI no MPI point 2 point memcached
dictionary separate joint
State table separate separate
split sparse sparse
asynchronous
synchronous synchronous synchronous asynchronous
Schedule approximate
exact exact exact exact
messages
109. Speed
• 1M documents per day on 1 computer
(1000 topics per doc, 1000 words per doc)
• 350k documents per day per node
(context switches, memcached, stray reducers)
• 8 Million docs (Pubmed)
(sampler does not burn in well - documents too short)
• Irvine: 128 machines, 10 hours
• Yahoo: 1 machine, 11 days
• Yahoo: 20 machines, 9 hours
• 20 Million docs (Yahoo! News Articles)
• Yahoo: 100 machines, 12 hours
110. Scalability
[Figure: runtime in hours (0-40) vs. number of CPUs (1, 10, 20, 50, 100) at 200k documents per computer; a second curve shows initial topics per word (x10)]
Likelihood even improves with parallelism!
-3.295 (1 node) -3.288 (10 nodes) -3.287 (20 nodes)
111. The Competition
[Figure: bar charts comparing Google, Irvine, and Yahoo on dataset size (millions of documents, up to 20M), cluster size (up to 130 machines), and throughput per hour (150 and 6.4k vs. 50k documents/h for Yahoo)]
113. Variable Replication
• Global shared variable
[Figure: global variables x, y, z; a second computer keeps a local copy y' of y and synchronizes it with the global copy]
• Make local copy
• Distributed (key,value) storage table for global copy
• Do all bookkeeping locally (store old versions)
• Sync local copies asynchronously using message passing
(no global locks are needed)
• This is an approximation!
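A minimal sketch of this bookkeeping (all class and key names are mine; the memcached/ICE store is replaced by a plain dict):

```python
# Each worker tracks the delta since its last sync and ships only that,
# so no global lock is ever taken; between syncs the local copy lags
# the global state (the approximation mentioned above).
class ReplicatedCounter:
    def __init__(self, store, key):
        self.store, self.key = store, key   # store: shared (key,value) table
        self.local = store.get(key, 0)      # current local copy
        self.synced = self.local            # old copy at last sync

    def add(self, delta):                   # local bookkeeping only
        self.local += delta

    def sync(self):
        # push our delta, pull everyone else's updates
        delta = self.local - self.synced
        global_val = self.store.get(self.key, 0) + delta
        self.store[self.key] = global_val
        self.local = self.synced = global_val

store = {}  # stand-in for memcached / ICE
a, b = ReplicatedCounter(store, "n(t,w)"), ReplicatedCounter(store, "n(t,w)")
a.add(3); b.add(2)
a.sync(); b.sync()
print(store["n(t,w)"])  # 5: both deltas merged without locking
```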
114. Asymmetric Message Passing
• Large global shared state space
(essentially as large as the memory in computer)
• Distribute global copy over several machines
(distributed key,value storage)
[Figure: the global state is partitioned over several machines; each worker keeps a current copy and remembers the old copy from its last sync]
115. Out of core storage
• Very large state space
• Gibbs sampling requires us to traverse the data sequentially many times (think 1000x)
• Stream local data from disk and update the coupling variable each time local data is accessed
• This is exact
[Figure: the streaming pipeline from slide 103]
116. Summary - Part 3
• Inference in graphical models
• Clustering
• Topic models
• Sampling
• Implementation details
119. Problem
• How many clusters should we pick?
• How about a prior for infinitely many clusters?
• Finite model
p(y|Y, α) = (n(y) + α_y) / (n + Σ_y α_y)
• Infinite model
Assume that the total smoother weight is constant:
p(y|Y, α) = n(y) / (n + α) and p(new|Y, α) = α / (n + α)
120. Chinese Restaurant Metaphor
[Figure: tables in a Chinese restaurant with dish parameters φ1, φ2, φ3; the rich get richer]
Generative process: for data point x_i
• Choose an existing table j ∝ m_j and sample x_i ~ f(φ_j)
• Choose a new table K+1 ∝ α; sample φ_{K+1} ~ G_0 and sample x_i ~ f(φ_{K+1})
Pitman; Antoniak; Ishwaran; Jordan et al.; Teh et al.;
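The seating rule translates directly into code; a sketch (drawing φ ~ G_0 and x ~ f(φ) is left abstract):

```python
import random

def crp_assignments(n, alpha):
    """Seat n customers via the Chinese restaurant process."""
    tables = []   # m_j: number of customers at table j
    seating = []
    for i in range(n):
        # table j with prob m_j / (i + alpha), new table with alpha / (i + alpha)
        r = random.random() * (i + alpha)
        acc = 0.0
        for j, m in enumerate(tables):
            acc += m
            if r < acc:
                tables[j] += 1
                break
        else:
            j = len(tables)
            tables.append(1)  # open table K+1 (here one would draw phi ~ G0)
        seating.append(j)
    return seating

print(crp_assignments(20, alpha=1.0))  # the rich get richer
```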
121. Evolutionary Clustering
• Time series of objects, e.g. news stories
• Stories appear / disappear
• Want to keep track of clusters automatically
132. User modeling
Problem formulation
[Figure: a user's queries and tags as an unorganized cloud: movies, auto, car, theatre, price, deals, art, used, gallery, van, inspection, diet, hiring, job, calories, salary, recipe, chocolate, flight, London, hotel, weather, school, supplies, loan, college]
133. User modeling
Problem formulation
[Figure: the same queries grouped into intents: Cars (auto, car, price, deals, used, van, inspection), Art (movies, theatre, art, gallery), Jobs (hiring, job, salary), Diet (diet, calories, recipe, chocolate), Travel/finance (flight, London, hotel, weather, school, college, supplies, loan)]
134. User modeling
Problem formulation
Input
• Queries issued by the user or tags of watched content
• Snippet of the page examined by the user
• Time stamp of each action (day resolution)
Output
• Users' daily distribution over intents
• Dynamic intent representation
135. Time dependent models
• LDA for topical model of users where
• User interest distribution changes over time
• Topics change over time
• This is like a Kalman filter except that
• Don’t know what to track (a priori)
• Can’t afford a Rauch-Tung-Striebel smoother
• Much more messy than plain LDA
136. Graphical Model
[Figure: time-dependent LDA. Plain LDA: α → θ_i (user interest) → z_ij → w_ij (user actions) ← φ_k (actions per topic) ← β. The time-dependent model replicates this over slices t−1, t, t+1 with α^t, θ_i^t, and φ_k^t evolving over time]
137. Long- and short-term priors
[Figure: multi-resolution priors (all-time μ3, month μ2, week μ) combining long-term and short-term interests into the prior for user actions at time t; a short-term food/jobs distribution (food, recipe, chicken, pizza, part-time, job, opening, hiring, salary, Kelly, cuisine, mileage) drifts between t and t+1]
Example topics over time:
• Diet: Recipe, Chocolate, Pizza, Food, Chicken, Milk, Butter, Powder
• Cars: Car, Blue, Book, Kelley, Prices, Small, Speed, large
• Job: job, Career, Business, Assistant, Hiring, Part-time, Receptionist
• Finance: Bank, Online, Credit, Card, debt, portfolio, Finance, Chase
138. At time t / at time t+1
[Figure: short-term topic priors drift between time steps. At time t: Car (Altima, Accord, Blue, Book, Kelley, Prices, Small, Speed), Job (job, Career, Business, Assistant, Hiring, Part-time, Receptionist), Finance (Bank, Online, Credit, Card, debt, portfolio, Finance, Chase), Food (Recipe, Chocolate, Pizza, Food, Chicken, Milk, Butter, Powder). At time t+1 the Food prior shifts toward (Food, Chicken, Pizza, mileage) and Car toward (speed, offer, Camry, accord)]
Generative process:
• For each user interaction
• Choose an intent from the local distribution
• Sample a word from the topic's word distribution
• Or choose a new intent ∝ α
• Sample the new intent from the global distribution
• Sample a word from the new topic's word distribution
139. Global and user processes
[Figure: at times t, t+1, t+2, t+3 a global process (with counts m, m', n, n') is shared across time slices; the processes for user 1, user 2, and user 3 draw their intents from it]
140. Sample users
[Figure: two users' topic proportions over 40 days; one user shifts between Baseball and Dating, the other moves among Finance, Jobs, Celebrity, Health, and Dating]
Top words per topic:
• Dating: women, men, dating, singles, personals, seeking, match
• Baseball: League, baseball, basketball, doublehead, Bergesen, Griffey, bullpen, Greinke
• Celebrity: Snooki, Tom Cruise, Katie Holmes, Pinkett, Kudrow, Hollywood
• Health: skin, body, fingers, cells, toes, wrinkle, layers
• Jobs: job, career, business, assistant, hiring, part-time, receptionist
• Finance: financial, Thomson, chart, real, Stock, Trading, currency
144. LDA for user profiling
[Diagram: four parallel workers]
• Each worker: sample z for its users, then write counts to memcached
• Barrier
• One worker collects the counts and samples; the others do nothing
• Barrier
• All workers read the updated state from memcached
147. News Stream
• Over 1 high-quality news article per second
• Multiple sources (Reuters, AP, CNN, ...)
• Same story from multiple sources
• Stories are related
• Goals
• Aggregate articles into a storyline
• Analyze the storyline (topics, entities)
148. Clustering / RCRP
• Assume active story
distribution at time t
• Draw story indicator
• Draw words from story
distribution
• Down-weight story counts for
next day
Ahmed & Xing, 2008
149. Clustering / RCRP
• Pro
• Nonparametric model of story generation
(no need to model frequency of stories)
• No fixed number of stories
• Efficient inference via collapsed sampler
• Con
• We learn nothing!
• No content analysis
150. Latent Dirichlet Allocation
• Generate topic distribution
per article
• Draw topics per word from
topic distribution
• Draw words from topic specific
word distribution
Blei, Ng, Jordan, 2003
151. Latent Dirichlet Allocation
• Pro
• Topical analysis of stories
• Topical analysis of words (meaning, saliency)
• More documents improve estimates
• Con
• No clustering
152. More Issues
• Named entities are special, topics less
(e.g. Tiger Woods and his mistresses)
• Some stories are strange
(topical mixture is not enough - dirty models)
• Articles deviate from general story
(Hierarchical DP)
154. Storylines Model
• Topic model
• Topics per cluster
• RCRP for cluster
• Hierarchical DP for
article
• Separate model
for named entities
• Story specific
correction
156. The Graphical Model: Storylines
[Figure: the storylines model combines tightly-focused story distributions with high-level concept topics]
157. The Graphical Model: Storylines
Each story has:
• a distribution over words
• a distribution over topics
• a distribution over named entities
158. The Graphical Model: Storylines
• A document's topic mix is sampled from its story prior
• Words inside a document are either global or story specific
163. Estimation
• Sequential Monte Carlo (Particle Filter)
• For a new time period, draw stories s and topics z from
p(s_{t+1}, z_{t+1} | x_{1...t+1}, s_{1...t}, z_{1...t})
using Gibbs sampling for each particle
• Reweight each particle via
p(x_{t+1} | x_{1...t}, s_{1...t}, z_{1...t})
• Regenerate particles if the l2 norm of the weights grows too large
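Schematically, one SMC step might look as follows (a generic sketch; `extend`, `lik`, and the resampling threshold are placeholders, not the paper's exact choices):

```python
import random

def smc_step(particles, weights, extend, lik):
    # 1. extend each particle with new s_{t+1}, z_{t+1} (Gibbs moves inside)
    particles = [extend(p) for p in particles]
    # 2. reweight by the predictive likelihood of the new data
    weights = [w * lik(p) for w, p in zip(weights, particles)]
    z = sum(weights)
    weights = [w / z for w in weights]
    # 3. resample when the weights degenerate (the l2-norm test above;
    #    the 2/N threshold here is an assumption)
    if sum(w * w for w in weights) > 2.0 / len(weights):
        particles = random.choices(particles, weights, k=len(particles))
        weights = [1.0 / len(particles)] * len(particles)
    return particles, weights
```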
164. Numbers ...
• TDT5 (Topic Detection and Tracking)
macro-averaged minimum detection cost: 0.714
time 0.84 | entities 0.90 | topics 0.86 | story words 0.75
This is the best performance on TDT5!
• Yahoo News data
... beats all other clustering algorithms
178. Problem Statement: Ideologies
Build a model to describe both collections of data.
Visualization
• How does each ideology view mainstream events?
• On which topics do they differ?
• On which topics do they agree?
179. Problem Statement: Ideologies
Build a model to describe both collections of data.
Visualization, Classification
• Given a new news article or blog post, the system should infer
• from which side it was written
• and justify its answer on a topical level (view on abortion, taxes, health care)
180. Problem Statement: Ideologies
Build a model to describe both collections of data.
Visualization, Classification, Structured browsing
• Given a new news article or blog post, the user can ask for:
• examples of other articles from the same ideology about the same topic
• documents that could exemplify alternative views from other ideologies
181. Building a factored model
[Figure: factored model; ideology-specific views φ_{1,k} and φ_{2,k} (with ideology distributions Ω_1, Ω_2) are paired with shared topics β_1 ... β_k]
182. Building a factored model
[Figure: the same factored model with mixing weights; each word comes either from the shared topic β_k or from the ideology's view φ, with weights λ and 1−λ]
183. Datasets
Data
• Bitterlemons:
• Middle-east conflict; documents written by Israeli and Palestinian authors
• ~300 documents from each view, with average length 740
• Multi-author collection
• 80-20 split for test and train
• Political Blog-1:
• American political blogs (Democrat and Republican)
• 2040 posts with average post length = 100 words
• Follow the test and train split of (Yano et al., 2009)
• Political Blog-2 (tests generalization to a new writing style):
• Same as 1 but 6 blogs, 3 from each side
• ~14k posts with ~200 words per post
• 4 blogs for training and 2 blogs for test
184. Example: Bitterlemons corpus
[Figure: topics discovered in the Bitterlemons corpus, each split into a shared mainstream part and two ideology-specific views. A "US role" topic (powell, minister, colin, visit, arafat, state, leader, roadmap, bush, US, president, american, sharon, administration, prime, policy, washington, pressure, clinton, european); a "roadmap / peace process" topic (palestinian, settlement, process, force, terrorism, israeli, implementation, obligation, phase, security, peace, ceasefire, plan, political, authority, occupation, international, timetable); an "Arab involvement" topic (government, conflict, people, negotiation, track, official, strategic, plo, hizballah, leadership, islamic, syria, syrian, lebanon, withdrawal, iran, agreement, regional, intifada, initiative, jihad). Palestinian and Israeli views emphasize different words within each topic]
187. Getting the Alternative View
Finding alternate views
• Given a document written in one ideology, retrieve the equivalent document from the other ideology
• Baseline: SVM + cosine similarity
188. Can We Use Unlabeled Data?
• In theory this is simple
• Add a step that samples the document view (v)
• It doesn't mix in practice because of the tight coupling between v and (x1, x2, z)
• Solution
• Sample v and (x1, x2, z) as a block using a Metropolis-Hastings step
• This is a huge proposal!
189. Summary - Part 4
• Chinese Restaurant Process
• Recurrent CRP
• User modeling
• Storylines
• Ideology detection