Language Model Information Retrieval with Document Expansion

A paper by Tao Tao, Xuanhui Wang, Qiaozhu Mei, ChengXiang Zhai
Presented by Kumar Ashish
INF384H/CS395T: Concepts of Information Retrieval (and Web Search), Fall 2011
• Zero-count problem: a term that is a possible word of the information need does not occur in the document.
• General problem of estimation: terms occurring once are overestimated, even though their occurrence was partly by chance.
• To solve the above problems, high-quality extra data is required to enlarge the sample of the document.
This gives the average logarithmic distance between the probabilities that a word would be observed at random from the unigram query language model and from the unigram document language model.
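The ranking formula itself was a slide image; the standard KL-divergence ranking function from the language-modeling literature, which this description matches, is:

$$ \mathrm{score}(q, d) = -D(\theta_q \,\|\, \theta_d) = -\sum_{w} p(w \mid \theta_q) \log \frac{p(w \mid \theta_q)}{p(w \mid \theta_d)} $$

Documents are ranked by how small the divergence of their language model is from the query language model.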
c(w, d) is the number of times word w occurs in document d, and |d| is the length of the document.

Problems:
• Assigns zero probability to any word not present in the document, which causes a problem when scoring the document with KL-divergence.
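The estimator the slide showed, reconstructed from the definitions above, is the maximum-likelihood unigram model:

$$ p_{ml}(w \mid \theta_d) = \frac{c(w, d)}{|d|} $$

Any word with c(w, d) = 0 receives probability zero, which is exactly the zero-count problem: the log term in the KL-divergence score becomes unbounded for that word.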
• Jelinek-Mercer (JM) smoothing
• Dirichlet smoothing
Jelinek-Mercer (JM) smoothing:
• Proposes a fixed parameter λ to control the interpolation between the document model and the collection model Θc; p(w|Θc) is the probability of word w given by the collection model.
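The interpolation formula itself was an image in the original deck; the standard JM form (as in Zhai and Lafferty, 2001) is:

$$ p_\lambda(w \mid d) = (1 - \lambda)\, p_{ml}(w \mid d) + \lambda\, p(w \mid \Theta_C) $$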
Dirichlet smoothing:
• Uses a document-dependent coefficient (parameterized by μ) to control the interpolation.
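Again the formula was an image; the standard Dirichlet-prior form is:

$$ p_\mu(w \mid d) = \frac{c(w, d) + \mu\, p(w \mid \Theta_C)}{|d| + \mu} $$

A minimal Python sketch of both smoothers, assuming whitespace-tokenized documents and a precomputed collection model p_coll (function names and default parameter values are illustrative, not from the paper):

```python
from collections import Counter

def jelinek_mercer(w, doc_tokens, p_coll, lam=0.5):
    """JM smoothing: (1 - lam) * p_ml(w|d) + lam * p(w|C)."""
    p_ml = Counter(doc_tokens)[w] / len(doc_tokens)
    return (1 - lam) * p_ml + lam * p_coll.get(w, 0.0)

def dirichlet(w, doc_tokens, p_coll, mu=2000):
    """Dirichlet smoothing: (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    return (Counter(doc_tokens)[w] + mu * p_coll.get(w, 0.0)) \
           / (len(doc_tokens) + mu)

# "dog" never occurs in docs[0], yet both smoothers give it nonzero
# probability, which resolves the zero-count problem.
docs = ["the cat sat".split(), "the dog ran".split()]
tokens = [t for d in docs for t in d]
p_coll = {w: c / len(tokens) for w, c in Counter(tokens).items()}
print(jelinek_mercer("dog", docs[0], p_coll))  # > 0
print(dirichlet("dog", docs[0], p_coll))       # > 0
```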
Cluster-based document model (CBDM):
• Uses clustering information to smooth a document.
• Divides all documents into K clusters.
• First smooths the cluster model with the collection model using Dirichlet smoothing.
• Then takes the smoothed cluster model as a new reference model to smooth the document using JM smoothing.
Θ_{L_d} stands for document d's cluster model, and λ and β are smoothing parameters.
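The CBDM formula did not survive extraction; a reconstruction from the two-step description above (Dirichlet smoothing of the cluster model, then JM interpolation with the document) would be:

$$ p(w \mid d) = (1 - \lambda)\, p_{ml}(w \mid d) + \lambda \cdot \frac{c(w, L_d) + \beta\, p(w \mid \Theta_C)}{|L_d| + \beta} $$

where c(w, L_d) and |L_d| are the term counts and length of document d's cluster; the paper's exact notation may differ.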
• Better than JM or Dirichlet smoothing: it expands a document with more data from its cluster instead of just using the same collection language model.
[Figure: a clustering in which cluster D is good for smoothing document a but not for document d.]
• Ideally, each document should have its own cluster centered around itself.
Document expansion language model (DELM):
• Expand each document using a probabilistic neighborhood to estimate a virtual document d'.
• Apply any interpolation-based method (e.g., JM or Dirichlet) to this virtual document, and treat the word counts given by the virtual document as if they were the original word counts.
• The cosine rule can be used to determine the documents in the neighborhood of the original document (see the sketch after this list).
• Problems:
◦ In a narrow sense the neighborhood would contain only a few documents, whereas in a wide sense the whole collection may be included.
◦ Neighbor documents cannot be assumed to be sampled the same way as the original document.
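A sketch of the cosine-based neighborhood selection, assuming raw term-count vectors (the slide does not pin the term weighting down, so plain counts are an assumption):

```python
import math
from collections import Counter

def cosine(a_tokens, b_tokens):
    """Cosine similarity between two documents' term-count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(c * c for c in a.values())) \
         * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def neighborhood(d, collection, m=100):
    """Top-M most similar documents to d, excluding d itself."""
    sims = [(cosine(d, b), b) for b in collection if b is not d]
    sims.sort(key=lambda pair: pair[0], reverse=True)
    return sims[:m]
```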
• Associates a confidence value with every document in the collection.
◦ This confidence value reflects the belief that the document is sampled from the same underlying model as the original one.
• A confidence value γ_b is associated with every document b to indicate how strongly b appears to be sampled from document d's model.
• The confidence value should follow a normal distribution:
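The distribution itself was a slide image. One Gaussian-shaped form consistent with the description, using the cosine distance between d and a neighbor b and a spread parameter σ, would be (an assumed reconstruction, not verified against the paper):

$$ \gamma_b \;\propto\; \exp\!\left( -\frac{(1 - \cos(d, b))^2}{2\sigma^2} \right) $$

so that near-identical neighbors get confidence close to 1 and confidence decays smoothly with distance.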
• A shorter document requires more help from its neighbors.
• A longer document relies more on itself.
• A parameter α is introduced to control this balance (see the formula below).
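Putting the pieces together, the virtual-document counts can be written as (again a reconstruction; the paper's exact normalization may differ):

$$ c(w, d') = \alpha\, c(w, d) + (1 - \alpha) \sum_{b \in N(d)} \frac{\gamma_b}{\sum_{b' \in N(d)} \gamma_{b'}}\, c(w, b) $$

With α = 1 the document relies entirely on itself; smaller α gives more weight to the neighborhood, which is what short documents need.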
For efficiency, the pseudo term counts can be calculated using only the top M closest neighbors, since the confidence value follows a decaying shape; a sketch follows below.
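A sketch of the top-M pseudo-count computation, reusing the neighborhood() helper above and using the normalized cosine similarity as the confidence γ_b (an assumption standing in for the paper's kernel):

```python
from collections import Counter

def pseudo_counts(d, collection, alpha=0.5, m=100):
    """Virtual-document counts c(w, d'): the document's own counts
    interpolated with confidence-weighted counts from its top-M
    neighbors."""
    pseudo = Counter({w: alpha * c for w, c in Counter(d).items()})
    neighbors = neighborhood(d, collection, m)  # (gamma, doc) pairs
    gamma_sum = sum(g for g, _ in neighbors) or 1.0
    for g, b in neighbors:
        for w, c in Counter(b).items():
            pseudo[w] += (1 - alpha) * (g / gamma_sum) * c
    return pseudo
```

Any interpolation-based smoother (JM or Dirichlet) can then be applied with |d'| = sum(pseudo.values()) in place of |d|.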
• For performance comparison, four TREC data sets are used:
◦ AP (Associated Press news, 1988-90)
◦ LA (LA Times)
◦ WSJ (Wall Street Journal, 1987-92)
◦ SJMN (San Jose Mercury News, 1991)
• For testing how the algorithm scales up:
◦ TREC8
• For testing the effect on short documents:
◦ DOE (Department of Energy)
Comparison of DELM + (Dirichlet/JM) with Dirichlet/JM alone:
λ for JM and μ for Dirichlet are set to their optimal values, and the same values of λ and μ are used for DELM without further tuning; M is 100 and α is 0.5 for DELM.
DELM outperforms JM and Dirichlet on every data set, with improvements as high as 15% in the case of Associated Press News (AP).
Compared precision values at different levels of recall for the AP data set: DELM + Dirichlet outperforms Dirichlet at every precision point.
[Figure: Precision-recall curve on AP data]
Compares the performance trend with respect to M (the top M closest neighbors for each document).
[Figure: Performance change with respect to M]
Conclusion:
• Neighborhood information improves retrieval accuracy.
• Performance becomes insensitive to M once M is sufficiently large.
Comparison of DELM + Dirichlet with CBDM:
DELM + Dirichlet outperforms CBDM in MAP values on all four data sets.
Documents in AP88-89 were shrunk to 30% of their original length in the first experiment, 50% in the second, and 70% in the third. The results show that DELM helps shorter documents more than longer ones (41% improvement on the 30%-length corpus versus 16% on the full-length corpus).
Performance change with respect to α:
The optimal points migrate as documents become shorter (the full-length corpus is optimal at α = 0.4, but the 30%-length corpus has to use α = 0.2).
Combination of DELM with pseudo feedback:
DELM is combined with the model-based feedback proposed in (Zhai and Lafferty, 2001a). The experiment:
• Retrieve documents by the DELM method.
• Choose the top five documents for model-based feedback.
• Use the expanded query model to retrieve documents again.
Result: DELM can be combined with pseudo feedback to improve performance.
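A sketch of that pipeline; retrieve() and model_based_feedback() are hypothetical placeholders standing in for DELM-smoothed KL-divergence retrieval and the Zhai and Lafferty (2001a) feedback method:

```python
def retrieve(query_model, collection):
    """Placeholder: rank the collection by KL-divergence using
    DELM-smoothed document models."""
    ...

def model_based_feedback(query_model, feedback_docs):
    """Placeholder: model-based feedback (Zhai and Lafferty, 2001a)."""
    ...

def delm_with_feedback(query_model, collection, k=5):
    ranked = retrieve(query_model, collection)        # first-round DELM retrieval
    expanded = model_based_feedback(query_model, ranked[:k])  # top-five feedback
    return retrieve(expanded, collection)             # second-round retrieval
```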
References:
◦ http://sifaka.cs.uiuc.edu/czhai/pub/hlt06-exp.pdf
◦ http://nlp.stanford.edu/IR-book/pdf/12lmodel.pdf
◦ http://krisztianbalog.com/files/sigir2008-csiro.pdf
