Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives

Pradipto Das, Rohini Srihari and Yun Fu
SUNY Buffalo
CIKM 2011, Glasgow, Scotland

Ubiquitous Bi-Perspective Document Structure
Wikipedia
• Words indicative of important Wiki concepts
• Actual human generated Wiki category tags – words that summarize/categorize the document

StackOverflow
• Words indicative of questions
• Words indicative of answers
• Actual tags for the forum post – even frequencies are given!

Yahoo! Flickr
• Words indicative of document title
• Words indicative of image description
• Actual tags given by users

Understanding the Two Perspectives

What if the documents are plain text files?

News Article

• Imagine browsing over reports in a topic cluster

It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year. This investigation was launched by US President Bill Clinton and is in principle a far more simple or at least more single-minded pursuit than that of Ms. Holland. Dorothea Holland, until four months ago was the only prosecuting lawyer on the German case.
News Article

• What words can we remember after a first browse?

The “document level” perspective: German, US, investigations, GM, Dorothea Holland, Lopez, prosecute

News Article

• What helped us generate the document level perspective?

The “word level” perspective:
– Named Entities: LOCATION, MISC, ORGANIZATION, PERSON
– Important Verbs and Dependents: WHAT HAPPENED?

The “document level” perspective: German, US, investigations, GM, Dorothea Holland, Lopez, prosecute
News Article

What if we turn the document off?
• Summarization power of the perspectives

Sentence Boundaries
German, US, investigations, GM, Dorothea Holland, Lopez, prosecute

Hypothesis
• Documents are tagged from at least two different perspectives, either implicit or explicit, and one perspective affects the other
– Simplest example of implicit WL tagging: binned positions indicating sections
– Simplest example of implicit DL tagging: a tag cloud
Begin (0): It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year.
Middle (1): This investigation was launched by US President Bill Clinton and is in principle a far more simple or at least more single-minded pursuit than that of Ms. Holland.
End (2): Dorothea Holland, until four months ago was the only prosecuting lawyer on the German case.
[Tag cloud: tagcrowd.com]
The “word level” (WL) tags are usually some category descriptions
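
The two implicit taggings named above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `wl_position_tags` and `dl_tag_cloud`, the stopword list, and the choice of three bins are hypothetical and not taken from the slides.

```python
from collections import Counter

def wl_position_tags(tokens, n_bins=3):
    """Implicit WL tag for each token: the index of its position bin."""
    n = len(tokens)
    return [min(i * n_bins // n, n_bins - 1) for i in range(n)]

def dl_tag_cloud(tokens, stopwords, top_n=5):
    """Implicit DL tags: the most frequent non-stopword tokens."""
    counts = Counter(t.lower() for t in tokens if t.lower() not in stopwords)
    return [word for word, _ in counts.most_common(top_n)]

text = ("Dorothea Holland until four months ago was the only "
        "prosecuting lawyer on the German case")
tokens = text.split()
# Hypothetical stopword list, just large enough for this sentence.
stop = {"until", "four", "months", "ago", "was", "the", "only", "on"}

wl_tags = wl_position_tags(tokens)   # 0 = Begin, 1 = Middle, 2 = End
dl_tags = dl_tag_cloud(tokens, stop)
```

Every token gets exactly one WL tag, while the DL tags describe the document as a whole, which is the distinction the hypothesis relies on.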

How can bi-level perspective be exploited?
• Can we generate category labels for Wikipedia documents by looking at image captions?
• Can we use images to label latent topics?
• Can we build a topic model that incorporates both perspectives simultaneously?
– Choice of document level tags, impact on performance
• Can supervised and unsupervised generative models work together?

Example – A Wikipedia Article on “fog”
[Figure: the article with its sections binned into positions 0, 1 and 2]

Labels by human editors:
Categories: Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics

The Wikipedia Article on “fog”
• Take the first category label – “weather hazards to aircraft”
• “aircraft” doesn’t occur in the document body!
• “hazard” only appears in a section label read as “Visibility hazards”
• “Weather” appears only 6 out of 15 times in the main body
• However, if we look at the images, it seems that the concept of fog is related to concepts like fog over the Golden Gate bridge, fog in streets, poor visibility and quality of air

Labels by model from title and image captions:
Categories: fog, San Francisco, visible, high, temperature, streets, Bay, lake, California, bridge, air
Labels by human editors:
Categories: Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics

The Family of Tag-Topic Models
• TagLDA: An occurrence of a word depends on how much of it is explained by a topic K and a WL tag t
– Intuitively:
[Figure: LDA vs. TagLDA, each trained on a sample of balls labeled L and S]
– LDA’s learnt “purple” topic can generate all 4 large balls with high probability
– TagLDA learns the “purple” topic better based on a constraint: it will generate a mix of large and small balls with high probability
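
One way to read "explained by a topic and a WL tag" is a log-linear combination of a per-topic score and a per-tag score for each word. The slides do not give the formula, so the form below, the function name `tag_topic_word_probs`, and all numbers are assumptions made for illustration only.

```python
import math

def tag_topic_word_probs(topic_scores, tag_scores):
    """p(word | topic, WL tag) as a softmax over summed scores.

    Assumption: p(w | k, t) is proportional to
    exp(topic_scores[w] + tag_scores[w]).
    """
    logits = [a + b for a, b in zip(topic_scores, tag_scores)]
    m = max(logits)                      # subtract max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

vocab = ["large_ball", "small_ball"]
topic_purple = [2.0, 0.0]   # hypothetical topic scores favouring large balls
tag_neutral = [0.0, 0.0]    # a tag with no preference
tag_small = [0.0, 2.0]      # hypothetical tag scores favouring small balls

p_topic_only = tag_topic_word_probs(topic_purple, tag_neutral)
p_with_tag = tag_topic_word_probs(topic_purple, tag_small)
```

With the neutral tag the topic alone decides and large balls dominate. With the second tag the two scores cancel and the same topic yields an even mix, which mirrors the ball intuition on the slide.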

Faceted Bi-Perspective Document Organization

[Figure: faceted organization of a document collection]
• Topics conditioned on different section identifiers (WL tag categories)
• Topic marginals
• Correspondence of DL tag words with content words
• Topics over image captions
• Topic labeling

Models: MMLDA, METag2LDA, TagLDA, CorrMMLDA, CorrMETag2LDA
– METag2LDA combines TagLDA and MMLDA
– CorrMETag2LDA combines TagLDA and CorrMMLDA

MM = Multinomial + Multinomial; ME = Multinomial + Exponential

• METag2LDA: A topic generating all DL tags in a document
doesn’t necessarily mean that the same topic generates all
words in the document
• CorrMETag2LDA: A topic generating *all* DL tags in a
document does mean that the same topic generates all
words in the document - a considerable strongpoint
[Figure: graphical models of METag2LDA and CorrMETag2LDA. Legend: topic concentration parameter; document specific topic proportions; indicator variables; document content words; Document Level (DL) tags; Word Level (WL) tags; topic parameters; tag parameters]
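
The difference between the joint and the correspondence variants can be illustrated with a small sampling sketch. It follows the general idea of correspondence topic models, where an indicator variable ties each DL tag to one content word. The direction of the correspondence, the function name `sample_dl_tag_topics`, and the numbers are assumptions and are not read off the slides.

```python
import random

def sample_dl_tag_topics(word_topics, theta, n_tags, correspond, rng):
    """Sample one topic per DL tag.

    correspond=False: each DL tag draws its topic from the document's
    topic proportions theta, independently of the content words.
    correspond=True: each DL tag copies the topic of a uniformly chosen
    content word (the indicator variable picks the word).
    """
    topics = list(range(len(theta)))
    out = []
    for _ in range(n_tags):
        if correspond:
            y = rng.randrange(len(word_topics))   # indicator variable
            out.append(word_topics[y])
        else:
            out.append(rng.choices(topics, weights=theta, k=1)[0])
    return out

rng = random.Random(0)
theta = [0.4, 0.3, 0.3]            # hypothetical topic proportions, 3 topics
word_topics = [0, 0, 0, 0, 0, 0]   # every content word was assigned topic 0
joint = sample_dl_tag_topics(word_topics, theta, 200, False, rng)
corr = sample_dl_tag_topics(word_topics, theta, 200, True, rng)
```

In the correspondence case the DL tags can only use topics that the content words actually use, while in the joint case they may use any topic with weight in `theta`.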

Experiments
• Wikipedia articles with images and captions, manually collected along the concepts {food, animal, countries, sport, war, transportation, nature, weapon, universe and ethnic groups}
• Tags used:
– DL tags: image caption words and the article titles
– WL tags: positions of sections binned into 5 bins
• Objective: to generate category labels for test documents
• Evaluation
– Perplexity: to see performance among various TagLDA models
– WordNet based similarity evaluation between actual category labels and model output
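
For reference, held-out perplexity is the exponential of the negative average log-likelihood per held-out word. The sketch below assumes the per-word probabilities have already been computed by some model. The function name and the toy probabilities are illustrative.

```python
import math

def perplexity(word_probs):
    """Perplexity = exp(-(1/N) * sum(log p(w_n))) over N held-out words."""
    n = len(word_probs)
    log_lik = sum(math.log(p) for p in word_probs)
    return math.exp(-log_lik / n)

# A model that spreads mass uniformly over 4 words vs. a sharper model.
uniform = perplexity([0.25] * 8)
sharper = perplexity([0.5] * 8)
```

Lower values mean the model assigns higher probability to the held-out words.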

Evaluations – Held-out Perplexity
[Figure: held-out perplexity (in millions, axis from 0 to 0.8) at K=20, K=50, K=100 and K=200 for MMLDA, TagLDA, corrLDA, METag2LDA and corrMETag2LDA]

Selected Wikipedia Articles
• WL tag categories – section positions in the document
• DL tags – image caption words and article titles
• TagLDA perplexity is comparable to MM(METag2)LDA
• The (image caption words + article titles) and the content words are independently discriminative enough
• CorrMM(METag2)LDA performs best since almost all image caption words and the article title for a Wikipedia document are about a specific topic and the correspondence assumption is accepted by the model with much higher confidence

Evaluations – Application End-Goals
[Figure: scores on an axis from 0 to 2 at K=20, K=50, K=100 and K=200 for four series: METag2LDA-AverageDistance, corrMETag2LDA-AverageDistance, METag2LDA-BestDistance, corrMETag2LDA-BestDistance]

Inverse Hop distance in WordNet ontology
• Top 5 words from the caption vocabulary are chosen
• Max Weighted Average = 5, Max Best = 1
• METag2LDA almost always wins by narrow margins
• METag2LDA reweights the vocabulary of caption words and article titles that are about a topic and hence may miss specializations relevant to the document within the top 5
• In the WordNet ontology, specializations lead to more hop distance
• Ontology based scoring helps explain connections from caption words to ground truths, e.g. Skateboard → skate → glide → snowboard
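
The hop-based scoring can be sketched on a toy graph. The graph below is not WordNet. It only encodes the chain from the example above. The score 1 / (1 + hops) and the plain sum over predicted words are assumptions chosen so that a single word scores at most 1 and five words score at most 5, matching the stated maxima. The slides do not spell out the exact formula or weighting.

```python
from collections import deque

def hop_distance(graph, start, goal):
    """Breadth-first search: edges on a shortest path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def inverse_hop(graph, a, b):
    """Assumed score: 1 / (1 + hops), so identical words score 1."""
    hops = hop_distance(graph, a, b)
    return 0.0 if hops is None else 1.0 / (1 + hops)

def score_labels(graph, predicted, truths):
    """Return (summed score over predicted words, best single score)."""
    per_word = [max(inverse_hop(graph, p, t) for t in truths) for p in predicted]
    return sum(per_word), max(per_word)

# Toy undirected graph following the chain in the example (not real WordNet).
chain = ["skateboard", "skate", "glide", "snowboard"]
graph = {}
for a, b in zip(chain, chain[1:]):
    graph.setdefault(a, []).append(b)
    graph.setdefault(b, []).append(a)

total, best = score_labels(graph, ["skateboard", "snowboard"], ["snowboard"])
```

A predicted word that is only a few hops from a human label still earns partial credit, which is why ontology based scoring can connect caption words to ground truths that never match exactly.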

Evaluations – Held-out Perplexity
[Figure: two panels of held-out perplexity (in millions) at 40, 60, 80 and 100 topics. One panel, with an axis from 1.35 to 1.65, shows MMLDA, METag2LDA, corrLDA and corrMETag2LDA. The other, with an axis from 0 to 2, shows the same models plus TagLDA]

DUC05 Newswire Dataset (Recent Experiments with TagLDA Included)
• WL tag categories – Named Entities
• DL tags – abstract coherence markers like (“subj” → “obj”), e.g. “Mary/Subj taught the class. Everybody liked Mary/Obj.” [Ignored coref resolution]
• Abstract markers like (“subj” → “obj”) acting as the DL perspective are not document discriminative markers
• Rather, they indicate a semantic perspective of coherence which is intricately linked to words
• Topics are influenced both by non-sparse document level coherence indicators like (“subj” → “obj”, “subj” → “--”, etc.) AND also by document level co-occurrence
• Ignoring the DL perspective completely leads to a better fit by TagLDA, due to variations in word distributions only

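Evaluations – Application End-Goals
[Figure: METag2LDA vs. CorrMETag2LDA at 40, 60, 80 and 100 topics, axis from 0 to 4. Data labels in the extracted text, not attributable to individual bars: 3.66, 3.08, 1.88, 0.96, 0.98, 0.91, 0.63, 0.35]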

Person Named Entity coverage (DUC05 data)
• Two PERSON NEs in the same docset, i.e. manual topic set, are related (G in total)
• A_B, A, B are treated as separate PERSON NEs
• For each docset in DUC05 data
– Create a set of best topics for a docset and pull out top PER NE pairs from the PER NE facets
– Find how many matched over all documents in a docset (M in total)
• Win over baseline = M/G (averaged over all docsets)
• CorrMETag2LDA wins here because of the nature of the DL perspective (role transitions like “Subj→Obj” coherence markers)
• More topics are pulled out that group more PER NEs across documents (higher recall)
Model Usefulness and Applications
• Applications
– Document classification using reduced dimensions
– Find faceted topics automatically through word level tags
– Learn correspondences between perspectives
– Label topics through document level multimedia
– Create recommendations based on perspectives
– Video analysis: word prediction given video features
– Tying “multilingual comparable corpora” through topics
– Multi-document summarization using coherence
– E-Textbook aided discussion forum mining:
• Explore topics through the lens of students and teachers
• Label topics from posts through concepts in the e-textbook

Summary
• Flexible family of topic models that integrate a partitioned space of DL tags and words with WL tag categories
– Supervised models can collaborate with unsupervised generative models, i.e. supervised models can be improved independently
• Captioned multimedia objects like images, video, audio can provide intuitive latent space labeling – a picture is worth a thousand words
• Obtain “facets” in topics
• As always, held-out perplexity should not be the sole judge of end-task performance

Thanks!

Special thanks to Jordan Boyd-Graber for useful
discussions on TagLDA parameter regularizations
