Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives
Pradipto Das, Rohini Srihari and Yun Fu
SUNY Buffalo
CIKM 2011, Glasgow, Scotland
Ubiquitous Bi-Perspective Document Structure
 • Words indicative of important Wiki concepts
 • Actual human-generated Wiki category tags: words that summarize/categorize the document
Wikipedia
Ubiquitous Bi-Perspective Document Structure
 • Words indicative of questions
 • Words indicative of answers
 • Actual tags for the forum post – even frequencies are given!
StackOverflow
Ubiquitous Bi-Perspective Document Structure
 • Words indicative of document title
 • Words indicative of image description
 • Actual tags given by users
Yahoo! Flickr
Understanding the Two Perspectives
 • What if the documents are plain text files?
News Article
Understanding the Two Perspectives
 • Imagine browsing over reports in a topic cluster


                 It is believed US investigators have asked
               for, but have been so far refused access to,
               evidence accumulated by German
               prosecutors probing allegations that former
               GM director, Mr. Lopez, stole industrial
               secrets from the US group and took them
               with him when he joined VW last year.
                 This investigation was launched by US
               President Bill Clinton and is in principle a far
               more simple or at least more single-minded
               pursuit than that of Ms. Holland.
                 Dorothea Holland, until four months ago
               was the only prosecuting lawyer on the
               German case.
News Article
Understanding the Two Perspectives
 • What words can we remember after a first browse?
 • Words remembered from the news article excerpt above (the “document level” perspective): German, US, investigations, GM, Dorothea Holland, Lopez, prosecute
News Article
Understanding the Two Perspectives
 • What helped us generate the Document Level perspective?
 • The “word level” perspective, annotated on the words of the same news article excerpt:
    – Named Entities: LOCATION, MISC, ORGANIZATION, PERSON
    – Important Verbs and Dependents: WHAT HAPPENED?
 • The “document level” perspective: German, US, investigations, GM, Dorothea Holland, Lopez, prosecute
News Article
What if we turn the document off?
 • Summarization power of the perspectives
 • Document level words: German, US, investigations, GM, Dorothea Holland, Lopez, prosecute
 • Sentence Boundaries
Hypothesis
 • Documents are tagged from at least two different perspectives, either implicit or explicit, and one perspective affects the other
    – Simplest example of implicit WL tagging: binned positions indicating sections
    – Simplest example of implicit DL tagging: tag cloud (tagcrowd.com)
 • The “word level” (WL) tags are usually some category descriptions
 • Position bins on the news article excerpt:
    – Begin (0): It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year.
    – Middle (1): This investigation was launched by US President Bill Clinton and is in principle a far more simple or at least more single-minded pursuit than that of Ms. Holland.
    – End (2): Dorothea Holland, until four months ago was the only prosecuting lawyer on the German case.
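The implicit WL tagging by binned positions can be sketched in a few lines. This is a minimal illustration that assumes equal-width bins over token positions; the slides bin section positions, and the function name `bin_positions` and the sample sentence fragment are hypothetical choices made here, not the authors' code.

```python
def bin_positions(tokens, num_bins=3):
    """Pair each token with a WL tag equal to its position bin (0 .. num_bins - 1).

    Assumption: bins are equal-width over token positions in the document.
    """
    n = len(tokens)
    return [(tok, min(i * num_bins // n, num_bins - 1)) for i, tok in enumerate(tokens)]


# Hypothetical nine-token fragment: three tokens fall into each of the bins
# Begin (0), Middle (1) and End (2).
tokens = "It is believed US investigators have asked for evidence".split()
tagged = bin_positions(tokens)
```

With `num_bins=5` the same function would give the five section-position bins mentioned for the Wikipedia experiments.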
How can the bi-level perspective be exploited?
 • Can we generate category labels for Wikipedia documents by looking at image captions?
    – Can we use images to label latent topics?
 • Can we build a topic model that incorporates both perspectives simultaneously?
    – Choice of document level tags, impact on performance
    – Can supervised and unsupervised generative models work together?
Example – A Wikipedia Article on “fog”
[Figure: the Wikipedia article on “fog”, with positions 0, 1, 2 marked along the page]
Categories (labels by human editors): Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics
The Wikipedia Article on “fog”
 • Take the first category label, “weather hazards to aircraft”
    – “aircraft” doesn’t occur in the document body!
    – “hazard” only appears in a section label read as “Visibility hazards”
    – “Weather” appears only 6 out of 15 times in the main body
 • However, if we look at the images, it seems that the concept of fog is related to concepts like fog over the Golden Gate bridge, fog in streets, poor visibility and quality of air
Categories (labels by model from title and image captions): fog, San Francisco, visible, high, temperature, streets, Bay, lake, California, bridge, air
Categories (labels by human editors): Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics
The Family of Tag-Topic Models
 • TagLDA: An occurrence of a word depends on how much of it is explained by a topic K and a WL tag t
 • Intuitively (figure: LDA vs. TagLDA on a training sample of balls marked L and S, i.e. large and small):
    – LDA’s learnt “purple” topic can generate all 4 large balls with high probability
    – TagLDA learns the “purple” topic better based on a constraint: it will generate a mix of large and small balls with high probability
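One way to make the TagLDA statement concrete is a log-linear word distribution in which a topic weight and a WL-tag weight add up in log space before normalization. The slides do not give the formula, so the form below is an assumption, and the vocabulary, weight values and names (`word_distribution`, `tau_topic`, `pi_begin`, `pi_end`) are hypothetical.

```python
import math


def word_distribution(topic_weights, tag_weights):
    """Return p(w | topic, tag) proportional to exp(topic_weights[w] + tag_weights[w]).

    Assumption: topic and tag contributions are additive in log space.
    """
    scores = [a + b for a, b in zip(topic_weights, tag_weights)]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["fog", "bridge", "visibility"]   # hypothetical three-word vocabulary
tau_topic = [1.0, 0.5, 0.0]               # hypothetical weights of one topic
pi_begin = [0.0, 0.0, 0.0]                # hypothetical weights of the "Begin" tag
pi_end = [0.0, 0.0, 1.5]                  # hypothetical "End" tag favouring "visibility"

p_begin = word_distribution(tau_topic, pi_begin)
p_end = word_distribution(tau_topic, pi_end)
```

Under this sketch the same topic yields different word probabilities depending on the WL tag, which is the constraint the ball illustration points at.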
Faceted Bi-Perspective Document Organization
 • Topics conditioned on different section identifiers (WL tag categories)
 • Topic Marginals
 • Topics over image captions
 • Correspondence of DL tag words with content words
 • Topic Labeling
The Family of Tag-Topic Models
 • Models: MMLDA, METag2LDA, TagLDA, CorrMMLDA, CorrMETag2LDA
 • METag2LDA combines TagLDA and MMLDA
 • CorrMETag2LDA combines TagLDA and CorrMMLDA
 • MM = Multinomial + Multinomial; ME = Multinomial + Exponential
The Family of Tag-Topic Models
• METag2LDA: A topic generating all DL tags in a document
  doesn’t necessarily mean that the same topic generates all
  words in the document
• CorrMETag2LDA: A topic generating *all* DL tags in a
  document does mean that the same topic generates all
  words in the document - a considerable strong point
[Figure: model diagrams of METag2LDA and CorrMETag2LDA, with legend]
   – Topic concentration parameter
   – Document specific topic proportions
   – Indicator variables
   – Document content words
   – Document Level (DL) tags
   – Word Level (WL) tags
   – Topic Parameters
   – Tag Parameters
Experiments
 • Wikipedia articles with images and captions manually collected along {food, animal, countries, sport, war, transportation, nature, weapon, universe and ethnic groups} concepts
 • Tags used:
    – DL Tags: image caption words and the article titles
    – WL Tags: positions of sections binned into 5 bins
 • Objective: to generate category labels for test documents
 • Evaluation
    – Perplexity: to see performance among various TagLDA models
    – WordNet based similarity evaluation between actual category labels and model output
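For reference, held-out perplexity is commonly computed as the exponential of the negative average per-token log-likelihood over the test documents. The sketch below assumes that standard definition; the slides do not state the exact quantity plotted, and the function name and the toy numbers are hypothetical.

```python
import math


def held_out_perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(sum of per-document log-likelihoods) / (total number of held-out tokens))."""
    total_log_likelihood = sum(doc_log_likelihoods)
    total_tokens = sum(doc_lengths)
    return math.exp(-total_log_likelihood / total_tokens)


# Two hypothetical held-out documents in which a model assigns every token
# probability 0.01; the perplexity is then 1 / 0.01.
lengths = [10, 30]
log_liks = [n * math.log(0.01) for n in lengths]
ppl = held_out_perplexity(log_liks, lengths)
```

Lower values mean the model is less surprised by unseen documents, which is how the model comparisons on the following slides should be read.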
Evaluations – Held-out Perplexity
[Figure: held-out perplexity (in millions) on selected Wikipedia articles for K = 20, 50, 100, 200; models: MMLDA, TagLDA, corrLDA, METag2LDA, corrMETag2LDA]
 • WL tag categories: section positions in the document
 • DL tags: image caption words and article titles
 • TagLDA perplexity is comparable to MM(METag2)LDA
    – The (image caption words + article titles) and the content words are independently discriminative enough
 • CorrMM(METag2)LDA performs best since almost all image caption words and the article title for a Wikipedia document are about a specific topic and the correspondence assumption is accepted by the model with much higher confidence
Evaluations – Application End-Goals
[Figure: inverse hop distance in the WordNet ontology (y-axis 0 to 2) for K = 20, 50, 100, 200; series: METag2LDA-AverageDistance, corrMETag2LDA-AverageDistance, METag2LDA-BestDistance, corrMETag2LDA-BestDistance]
 • Top 5 words from the caption vocabulary are chosen
 • Max Weighted Average = 5, Max Best = 1
 • METag2LDA almost always wins by narrow margins
 • METag2LDA reweights the vocabulary of caption words and article titles that are about a topic and hence may miss specializations relevant to the document within the top (5) ones
    – In the WordNet ontology, specializations lead to more hop distance
 • Ontology based scoring helps explain connections from caption words to ground truths, e.g. Skateboard → skate → glide → snowboard
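The ontology-based scoring can be illustrated on a toy graph. The sketch below assumes a score of 1 / (1 + hops), so that an exact match scores 1 (consistent with "Max Best = 1") and five exact matches sum to 5; the slides do not give the exact formula or weighting. The graph is a hypothetical stand-in built from the chain on the slide, not the real WordNet ontology, and all function names are illustrative.

```python
from collections import deque


def hop_distance(graph, source, target):
    """Breadth-first search: number of edges on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def inverse_hop_score(graph, word, label):
    """Assumed scoring: 1 / (1 + hops); 0.0 when no path exists."""
    hops = hop_distance(graph, word, label)
    return 0.0 if hops is None else 1.0 / (1 + hops)


# Toy undirected graph from the slide's chain: skateboard - skate - glide - snowboard.
edges = [("skateboard", "skate"), ("skate", "glide"), ("glide", "snowboard")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

predicted = ["skateboard", "glide", "fog"]      # hypothetical model output
scores = [inverse_hop_score(graph, w, "snowboard") for w in predicted]
best_score = max(scores)                         # "Best" variant
summed_score = sum(scores)                       # unweighted stand-in for the average variant
```

A more specialized ground-truth label sits more hops away from a general caption word, which lowers its score under this scheme.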
Evaluations – Held-out Perplexity
[Figure: two panels of held-out perplexity (in millions) on the DUC05 Newswire dataset, x-axis values 40, 60, 80, 100; first panel: MMLDA, METag2LDA, corrLDA, corrMETag2LDA; second panel: the same models plus TagLDA (recent experiments with TagLDA included)]
 • WL tag categories: Named Entities
 • DL tags: abstract coherence markers like (“subj” → “obj”), e.g. “Mary/Subj taught the class. Everybody liked Mary/Obj.” [Ignored coref resolution]
 • Abstract markers like (“subj” → “obj”) acting as the DL perspective are not document discriminative markers
    – Rather they indicate a semantic perspective of coherence which is intricately linked to words
    – Topics are influenced both by non-sparse document level coherence indicators like (“subj” → “obj”, “subj” → “--”, etc.) AND also by document level co-occurrence
 • Ignoring the DL perspective completely leads to a better fit by TagLDA due to variations in word distributions only
Evaluations – Application End-Goals
[Figure: Person Named Entity coverage (DUC05 data) for METag2LDA and CorrMETag2LDA, x-axis values 40, 60, 80, 100; bar values: 0.96 and 0.35 at 40, 1.88 and 0.63 at 60, 3.08 and 0.98 at 80, 3.66 and 0.91 at 100]
 • Two PERSON NEs in the same docset, i.e. manual topic set, are related (G in total)
 • A_B, A, B are treated as separate PERSON NEs
 • For each docset in DUC05 data
    – Create a set of best topics for a docset and pull out top PER NE pairs from the PER NE facets
    – Find how many matched over all documents in a docset (M in total)
 • Win over baseline = M/G (averaged over all docsets)
 • CorrMETag2LDA wins here because of the nature of the DL perspective (role transitions like “Subj → Obj” coherence markers)
    – More topics are pulled out that group more PER NEs across documents (Recall ↑)
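The M/G measure can be sketched as a set intersection per docset followed by an average. This is an illustration under assumptions: a related pair is treated as unordered, G counts the gold pairs of a docset and M the model pairs that match them; the slides do not spell out these details or any scaling applied to the plotted values. The function names and the toy docsets are hypothetical.

```python
def coverage(gold_pairs, model_pairs):
    """M / G for one docset: the share of gold PERSON pairs also found by the model."""
    gold = {frozenset(pair) for pair in gold_pairs}      # unordered pairs (assumption)
    found = {frozenset(pair) for pair in model_pairs}
    if not gold:
        return 0.0
    return len(gold & found) / len(gold)


def average_coverage(docsets):
    """Average M / G over all docsets, each given as (gold_pairs, model_pairs)."""
    ratios = [coverage(gold, model) for gold, model in docsets]
    return sum(ratios) / len(ratios)


# Two hypothetical docsets: the first matches 1 of 2 gold pairs, the second 1 of 1.
docsets = [
    ([("Holland", "Lopez"), ("Lopez", "Clinton")], [("Lopez", "Holland")]),
    ([("A", "B")], [("A", "B"), ("A", "C")]),
]
score = average_coverage(docsets)
```

Because only matches against the gold pairs count, pulling out more topics that group related PERSON entities raises the ratio, in line with the recall remark above.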
Model Usefulness and Applications
• Applications
   –   Document classification using reduced dimensions
   –   Find faceted topics automatically through word level tags
   –   Learn correspondences between perspectives
   –   Label topics through document level multimedia
   –   Create recommendations based on perspectives
   –   Video analysis: word prediction given video features
   –   Tying “multilingual comparable corpora” through topics
   –   Multi-document summarization using coherence
   –   E-Textbook aided discussion forum mining:
        • Explore topics through the lens of students and teachers
        • Label topics from posts through concepts in the e-textbook
Summary
• Flexible family of topic models that integrate a
  partitioned space of DL tags and words with WL tag
  categories
   – Supervised models can collaborate with unsupervised
     generative models i.e. supervised models can be bettered
     independently
• Captioned multimedia objects like images, video, audio
  can provide intuitive latent space labeling – a picture is
  worth 1000 words
• Obtain “facets” in topics
• As always, held-out perplexity should not be the
  sole judge of end-task performance
Thanks!


 Special thanks to Jordan Boyd-Graber for useful
discussions on TagLDA parameter regularizations

Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives

  • 1. Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives Pradipto Das, Rohini Srihari and Yun Fu SUNY Buffalo CIKM 2011, Glasgow, Scotland
  • 2. Ubiquitous Bi-Perspective Document Structure Words indicative of important Wiki concepts Actual human generated Wiki category tags – words that summarize/ categorize the document Wikipedia
  • 3. Ubiquitous Bi-Perspective Document Structure Words Actual tags indicative for the of forum post questions – even frequencies are given! Words indicative of answers StackOverflow
  • 4. Ubiquitous Bi-Perspective Document Structure Words indicative of document title Words indicative of image Actual description tags given by users Yahoo! Flickr
  • 5. Understanding the Two Perspectives What if the documents are plain text files? News Article
  • 6. Understanding the Two Perspectives  Imagine browsing over reports in a topic cluster It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year. This investigation was launched by US President Bill Clinton and is in principle a far more simple or at least more single-minded pursuit than that of Ms. Holland. Dorothea Holland, until four months ago was the only prosecuting lawyer on the German case. News Article
  • 7. Understanding the Two Perspectives  What words can we remember after a first browse? It is believedUS investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. German, US, Lopez, stole industrial secrets from the US group investigations, and took them with him when he joined VW last year. GM, Dorothea Thisinvestigation was launched by US President Bill Clinton and is in principle a far more simple Holland, Lopez, or at least more single-minded pursuit than that of Ms. prosecute Holland. The “document level” Dorothea Holland, until four months ago perspective was the only prosecuting lawyer on the News Article German case.
  • 8. Understanding the Two Perspectives  What helped us generate the Document Level perspective? The “word level” perspective It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German Named Entities prosecutors probing allegations that former German, US, GM director, Mr. Lopez, stole industrial LOCATION secrets from the US group and took them investigations, MISC with him when he joined VW last year. GM, Dorothea ORGANIZATION This investigation was launched by US Holland, Lopez, PERSON President Bill Clinton and is in principle a far more simple or at least more single-minded prosecute Important Verbs pursuit than that of Ms. Holland. The “document level” and Dependents Dorothea Holland, until four months ago perspective WHAT was the only prosecuting lawyer on the HAPPENED? German case. News Article
  • 9. What if we turn the document off?  Summarization power of the perspectives It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former German, US, GM director, Mr. Lopez, stole industrial secrets from the US group and took them investigations, with him when he joined VW last year. GM, Dorothea This investigation was launched by US Holland, Lopez, President Bill Clinton and is in principle a far more simple or at least more single-minded prosecute pursuit than that of Ms. Holland Dorothea Holland, until four months ago was the only prosecuting lawyer on the German case. Sentence Boundaries
  • 10. Hypothesis • Documents are at least tagged from two different perspectives – either implicit or explicit and one perspective affects the other – Simplest example of implicit WL tagging – binned positions indicating sections – Simplest example of implicit DL tagging – tag cloud It is believed US investigators have asked for, but have been so far refused Begin (0) access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year. This investigation was launched by US President Bill Clinton and is in principle Midd le (1) a far more simple or at least more single-minded pursuit than that of Ms. Holland. Dorothea Holland, until four months ago was the only prosecuting lawyer on End tagcrowd.com (2) the German case. The “word level” (WL) tags are usually some category descriptions
  • 11. How can the bi-level perspective be exploited? • Can we generate category labels for Wikipedia documents by looking at image captions? • Can we use images to label latent topics? • Can we build a topic model that incorporates both perspectives simultaneously? – Choice of document-level tags and its impact on performance • Can supervised and unsupervised generative models work together?
  • 12. Example – A Wikipedia Article on “fog” • [Figure: the article page, with regions marked 0, 1, 2] • Categories (labels by human editors): Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics
  • 13. The Wikipedia Article on “fog” • Take the first category label, “weather hazards to aircraft” • “aircraft” doesn’t occur in the document body! • “hazard” only appears in a section label that reads “Visibility hazards” • “Weather” appears only 6 out of 15 times in the main body • However, if we look at the images, it seems that the concept of fog is related to concepts like fog over the Golden Gate bridge, fog in streets, poor visibility and quality of air • Categories (labels by model from title and image captions): fog, San Francisco, visible, high, temperature, streets, Bay, lake, California, bridge, air • Categories (labels by human editors): Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics
  • 14. The Family of Tag-Topic Models • TagLDA: an occurrence of a word depends on how much of it is explained by a topic K and a WL tag t • Intuitively: [Figure: LDA vs. TagLDA, with train and sample panels of large (L) and small (S) balls] – LDA’s learnt “purple” topic can generate all 4 large balls with high probability – TagLDA learns the “purple” topic better based on a constraint: it will generate a mix of large and small balls with high probability
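One way to read “how much of a word is explained by a topic and a WL tag” is as a normalized sum of a topic score and a tag score per word. The Python sketch below assumes that log-linear form; the slides do not give the formula, and `word_distribution`, `topic_scores` and `tag_scores` are hypothetical names used for illustration only.

```python
import math


def word_distribution(topic_scores, tag_scores):
    """Distribution over the vocabulary for one (topic, WL tag) pair.

    Assumption: p(word | topic, tag) is proportional to
    exp(topic_score[word] + tag_score[word]).
    """
    if len(topic_scores) != len(tag_scores):
        raise ValueError("score vectors must cover the same vocabulary")
    logits = [a + b for a, b in zip(topic_scores, tag_scores)]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

Under this reading, a word with a modest topic score can still be likely in a section whose tag score favors it, which is the kind of constraint the ball illustration describes.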
  • 15. Faceted Bi-Perspective Document Organization • Topics conditioned on different section identifiers (WL tag categories) • Correspondence of DL tag words with content words • Topics over image captions • Topic marginals • Topic labeling
  • 16. The Family of Tag-Topic Models • Models: TagLDA, MMLDA, CorrMMLDA, METag2LDA, CorrMETag2LDA • METag2LDA combines TagLDA and MMLDA • CorrMETag2LDA combines TagLDA and CorrMMLDA • MM = Multinomial + Multinomial; ME = Multinomial + Exponential
  • 17. The Family of Tag-Topic Models • METag2LDA: a topic generating all DL tags in a document doesn’t necessarily mean that the same topic generates all words in the document • CorrMETag2LDA: a topic generating *all* DL tags in a document does mean that the same topic generates all words in the document, a considerable strong point • [Figure: graphical models of METag2LDA and CorrMETag2LDA; legend: topic concentration parameter, document-specific topic proportions, indicator variables, document content words, Document Level (DL) tags, Word Level (WL) tags, topic parameters, tag parameters]
  • 18. Experiments • Wikipedia articles with images and captions, manually collected along the concepts {food, animal, countries, sport, war, transportation, nature, weapon, universe, ethnic groups} • Tags used: – DL tags: image caption words and the article titles – WL tags: positions of sections binned into 5 bins • Objective: to generate category labels for test documents • Evaluation: – Perplexity, to compare performance among the various TagLDA models – WordNet-based similarity evaluation between actual category labels and model output
  • 19. Evaluations – Held-out Perplexity • [Figure: held-out perplexity (in millions) on selected Wikipedia articles at K = 20, 50, 100, 200 for MMLDA, TagLDA, corrLDA, METag2LDA and corrMETag2LDA] • WL tag categories: section positions in the document • DL tags: image caption words and article titles • TagLDA perplexity is comparable to MM(METag2)LDA • The (image caption words + article titles) and the content words are independently discriminative enough • CorrMM(METag2)LDA performs best, since almost all image caption words and the article title for a Wikipedia document are about a specific topic, and the correspondence assumption is accepted by the model with much higher confidence
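For reference, held-out perplexity is conventionally the exponential of the negative average per-word log-likelihood on test documents. The Python sketch below uses that standard definition; the slides do not say which likelihood approximation each model uses, and the function and argument names are hypothetical.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """Held-out perplexity: exp(-total log-likelihood / total word count).

    `doc_log_likelihoods` holds one natural-log likelihood per test document,
    `doc_lengths` the matching word counts. Lower values mean a better fit.
    """
    if len(doc_log_likelihoods) != len(doc_lengths):
        raise ValueError("need one length per document")
    total_words = sum(doc_lengths)
    if total_words <= 0:
        raise ValueError("test set must contain at least one word")
    return math.exp(-sum(doc_log_likelihoods) / total_words)
```

A model that assigns every word a probability of 1/4 has a perplexity of 4 under this definition, whatever the document lengths.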
  • 20. Evaluations – Application End-Goals • [Figure: inverse hop distance in the WordNet ontology at K = 20, 50, 100, 200; series: METag2LDA average distance, corrMETag2LDA average distance, METag2LDA best distance, corrMETag2LDA best distance] • Top 5 words from the caption vocabulary are chosen • Max Weighted Average = 5, Max Best = 1 • METag2LDA almost always wins by narrow margins • METag2LDA reweights the vocabulary of caption words and article titles that are about a topic, and hence may miss specializations relevant to the document within the top 5 words • In the WordNet ontology, specializations lead to larger hop distances • Ontology-based scoring helps explain connections from caption words to ground truths, e.g. skateboard → skate → glide → snowboard
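The inverse hop distance idea can be illustrated on a toy graph in Python. Everything here is an assumption made for illustration: the scoring form 1 / (1 + hops), chosen so that an exact match scores 1, the treatment of links as undirected, and the tiny graph itself, which only encodes the skateboard chain from the slide rather than the real WordNet ontology.

```python
from collections import deque


def build_undirected(edges):
    """Build an adjacency map from (node, node) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def hop_distance(graph, start, goal):
    """Breadth-first search for the fewest hops, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def inverse_hop_score(graph, a, b):
    """Assumed score: 1 / (1 + hops); 0.0 when no path exists."""
    hops = hop_distance(graph, a, b)
    return 0.0 if hops is None else 1.0 / (1 + hops)


# Toy chain taken from the slide's example, not from WordNet itself.
TOY_GRAPH = build_undirected([
    ("skateboard", "skate"), ("skate", "glide"), ("glide", "snowboard"),
])
```

On this toy chain, “skateboard” and “snowboard” are three hops apart, so longer chains through specializations lower the score, which is the effect the slide attributes to specializations.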
  • 21. Evaluations – Held-out Perplexity • [Figure: held-out perplexity (in millions) on the DUC05 newswire dataset at K = 40, 60, 80, 100, in two panels; one compares MMLDA, METag2LDA, corrLDA and corrMETag2LDA, the other adds TagLDA (recent experiments with TagLDA included)] • WL tag categories: named entities • DL tags: abstract coherence markers like (“subj” → “obj”), e.g. “Mary/Subj taught the class. Everybody liked Mary/Obj.” [Ignored coref resolution] • Abstract markers like (“subj” → “obj”) acting as the DL perspective are not document-discriminative markers • Rather, they indicate a semantic perspective of coherence which is intricately linked to words • Topics are influenced both by non-sparse document-level coherence indicators like (“subj” → “obj”, “subj” → “--”, etc.) and by document-level co-occurrence • Ignoring the DL perspective completely leads to a better fit by TagLDA, due to variations in word distributions only
  • 22. Evaluations – Application End-Goals • [Figure: PERSON named-entity coverage on DUC05 data at K = 40, 60, 80, 100 for METag2LDA and CorrMETag2LDA; plotted values: 3.66, 3.08, 1.88, 0.96, 0.98, 0.91, 0.63, 0.35] • Two PERSON NEs in the same docset, i.e. manual topic set, are related (G in total) • A_B, A and B are treated as separate PERSON NEs • For each docset in the DUC05 data: – Create a set of best topics for the docset and pull out the top PER NE pairs from the PER NE facets – Find how many matched over all documents in the docset (M in total) • Win over baseline = M/G (averaged over all docsets) • CorrMETag2LDA wins here because of the nature of the DL perspective (role transitions like “Subj → Obj” coherence markers) • More topics are pulled out that group more PER NEs across documents (recall increases)
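The averaging step “M/G, averaged over all docsets” can be written out as a short Python sketch. The function name and the input layout, one (M, G) pair per docset, are assumptions; the slides do not describe how the pairs were counted beyond the bullets above, and docsets with G = 0 are skipped here as a further assumption.

```python
def win_over_baseline(docset_counts):
    """Average M/G over docsets.

    `docset_counts` is a list of (M, G) pairs: M matched PERSON NE pairs,
    G related PERSON NE pairs in the docset. Docsets with G == 0 are skipped
    (an assumption; the slides do not cover this case).
    """
    ratios = [m / g for m, g in docset_counts if g > 0]
    if not ratios:
        raise ValueError("no docset with a positive G")
    return sum(ratios) / len(ratios)
```

Because the ratio is averaged per docset rather than pooled, a small docset counts as much as a large one in the final figure.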
  • 23. Model Usefulness and Applications • Applications – Document classification using reduced dimensions – Find faceted topics automatically through word level tags – Learn correspondences between perspectives – Label topics through document level multimedia – Create recommendations based on perspectives – Video analysis: word prediction given video features – Tying “multilingual comparable corpora” through topics – Multi-document summarization using coherence – E-Textbook aided discussion forum mining: • Explore topics through the lens of students and teachers • Label topics from posts through concepts in the e-textbook
  • 24. Summary • Flexible family of topic models that integrate a partitioned space of DL tags and words with WL tag categories – Supervised models can collaborate with unsupervised generative models, i.e. supervised models can be improved independently • Captioned multimedia objects like images, video and audio can provide intuitive latent space labeling: a picture is worth a thousand words • Obtain “facets” in topics • As always, held-out perplexity should not be the sole judge of end-task performance
  • 25. Thanks! Special thanks to Jordan Boyd-Graber for useful discussions on TagLDA parameter regularizations

Editor's notes

  1. Hyperlinked text in the body represents word level tags. Categories represent document level tags.
  2. Word level tags: question/answer. Doc level tags: actual tags for the forum post.
  3. Word level tags: title, image description. Doc level tags: tags given by users.
  4. Document about investigations. We don’t have annotations, but let’s see how they can be built up!
  5. Words to the right are relevant to the topic of the document set – mostly by frequency
  6. Since documents are mostly about some events, certain words strike us: NEs mentioned frequently and across sentences, and dependencies between subjects and objects of the important verbs from the document set.
  7. The word and doc level tagged words alone are sufficient to summarize the document as bags of words
  8. I don’t think we need this slide. I should explain these points while showing the previous slide!
  9. Cons: Collocations need to be addressed. Chains don’t involve causality, e.g. (fogs & accidents, [hop length = 12]).
  10. Within the family of (corr)MM(E)(Tag2)LDAs modeling joint observations, corrMETag2LDA performs best