A paper by Tao Tao, Xuanhui Wang, Qiaozhu Mei, and ChengXiang Zhai

Presented by Kumar Ashish
INF384H/CS395T: Concepts of Information Retrieval (and Web Search), Fall 2011
   Zero count problem: a term that is a plausible word for the
    information need does not occur in the document.
   General estimation problem: terms that occur only once are
    overestimated, even though their occurrence was partly due
    to chance.
   To address both problems, high-quality extra data is needed
    to enlarge the sample for each document.
KL-divergence gives the average logarithmic
distance between the probabilities that a
word would be observed at random from the
unigram query language model and from the
unigram document language model.
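The scoring formula on this slide did not survive extraction. As a minimal sketch, assuming both models are plain dictionaries mapping words to probabilities (the function name and the toy numbers are illustrative, not taken from the paper), the KL-divergence between a query model and a document model could be computed like this:

```python
import math


def kl_divergence(query_model, doc_model):
    """D(query || doc) = sum over w of p(w|q) * log(p(w|q) / p(w|d))."""
    total = 0.0
    for word, p_q in query_model.items():
        if p_q == 0.0:
            continue  # a term with zero query probability contributes nothing
        p_d = doc_model.get(word, 0.0)
        if p_d == 0.0:
            return math.inf  # query word unseen in the document model
        total += p_q * math.log(p_q / p_d)
    return total


# Toy models (assumed values, for illustration only).
query = {"language": 0.5, "model": 0.5}
doc_a = {"language": 0.4, "model": 0.4, "retrieval": 0.2}
doc_b = {"language": 0.1, "model": 0.1, "retrieval": 0.8}

print(kl_divergence(query, doc_a) < kl_divergence(query, doc_b))
```

A smaller divergence means the document model is closer to the query model, so documents would be ranked by increasing divergence (or, equivalently, by its negative).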
C(w, d) is the number of times word w occurs in document d, and |d|
is the length of the document. The maximum likelihood estimate of the
document model is p(w | d) = C(w, d) / |d|.

Problems:
•Assigns zero probability to any word not present in the document,
which breaks the KL-divergence score for that document.
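The zero-probability problem can be seen in a small sketch of the maximum likelihood estimate. The tokenization (whitespace split) and the sample document are assumptions made for illustration:

```python
from collections import Counter


def mle_model(tokens):
    """Maximum likelihood unigram model: p(w|d) = C(w, d) / |d|."""
    counts = Counter(tokens)
    length = len(tokens)
    return {word: count / length for word, count in counts.items()}


doc = "language model retrieval language".split()
model = mle_model(doc)

print(model["language"])            # seen twice in four tokens
print(model.get("smoothing", 0.0))  # unseen word gets zero probability
```

Any query containing a word such as "smoothing" would make the divergence for this document unbounded, which is what the smoothing methods on the following slides are meant to prevent.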
   Jelinek-Mercer(JM) Smoothing

   Dirichlet Smoothing
   Jelinek-Mercer (JM) smoothing uses a fixed parameter λ to
    control the interpolation between the document model and the
    collection model Θc, which supplies the probability of word w
    in the collection.
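The interpolation formula itself is missing from the extracted slide. The sketch below assumes the common convention in which λ weights the collection model and (1 - λ) weights the maximum likelihood document estimate; the function name and toy data are illustrative:

```python
def jm_smooth(word, doc_counts, collection_model, lam=0.5):
    """Jelinek-Mercer: (1 - lam) * p_ml(w|d) + lam * p(w|collection).

    Assumption: lam is the weight on the collection model.
    """
    doc_len = sum(doc_counts.values())
    p_ml = doc_counts.get(word, 0) / doc_len if doc_len else 0.0
    return (1.0 - lam) * p_ml + lam * collection_model.get(word, 0.0)


# Toy data (assumed values).
doc_counts = {"language": 2, "model": 2}
collection = {"language": 0.25, "model": 0.25, "smoothing": 0.5}

print(jm_smooth("language", doc_counts, collection))
print(jm_smooth("smoothing", doc_counts, collection))  # no longer zero
```

Because λ is fixed, every document borrows the same proportion of probability mass from the collection, regardless of its length.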
   Dirichlet smoothing uses a document-dependent coefficient
    (parameterized by μ) to control the
    interpolation.
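As a hedged sketch of the standard Dirichlet prior formulation, (C(w, d) + μ p(w|Θc)) / (|d| + μ), the code below also exposes the implied weight on the collection model so the dependence on document length is visible. The helper name `collection_weight` and the toy values are assumptions for illustration:

```python
def dirichlet_smooth(word, doc_counts, collection_model, mu=2000.0):
    """Dirichlet prior smoothing: (C(w,d) + mu * p(w|C)) / (|d| + mu)."""
    doc_len = sum(doc_counts.values())
    prior = mu * collection_model.get(word, 0.0)
    return (doc_counts.get(word, 0) + prior) / (doc_len + mu)


def collection_weight(doc_len, mu=2000.0):
    """Implied interpolation weight on the collection model: mu / (|d| + mu)."""
    return mu / (doc_len + mu)


# Toy data (assumed values).
doc_counts = {"a": 2, "b": 2}
collection = {"a": 0.5, "b": 0.25, "c": 0.25}

print(dirichlet_smooth("c", doc_counts, collection, mu=4.0))
print(collection_weight(50), collection_weight(500))
```

Shorter documents get a larger collection weight, so they are smoothed more heavily than long ones.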
   The Cluster-Based Document Model (CBDM) uses clustering
    information to smooth a document.
   Divides all documents into K clusters.
   First smooths the cluster model with the collection
    model using Dirichlet smoothing.
   Then takes the smoothed cluster model as a new reference
    model to smooth the document using JM
    smoothing.
ΘLd stands for document d’s cluster model, and λ and
β are smoothing parameters.
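The cluster-based formula is missing from the extracted slide, so the following is only a sketch of the two-stage idea. It assumes β is a fixed weight on the collection model when smoothing the cluster, and λ is the weight on the smoothed cluster when smoothing the document. The exact parameterization in the original method may differ (a Dirichlet-style step would derive the cluster weight from the cluster length instead).

```python
def cluster_smooth(word, doc_counts, cluster_counts, collection_model,
                   lam=0.5, beta=0.5):
    """Two-stage smoothing: cluster with collection, then document with cluster.

    Assumptions: beta weights the collection inside the cluster model,
    lam weights the smoothed cluster inside the document model.
    """
    doc_len = sum(doc_counts.values())
    cluster_len = sum(cluster_counts.values())
    p_doc = doc_counts.get(word, 0) / doc_len if doc_len else 0.0
    p_cluster = cluster_counts.get(word, 0) / cluster_len if cluster_len else 0.0
    p_coll = collection_model.get(word, 0.0)
    smoothed_cluster = (1.0 - beta) * p_cluster + beta * p_coll
    return (1.0 - lam) * p_doc + lam * smoothed_cluster


# Toy data (assumed values).
doc_counts = {"a": 1, "b": 1}
cluster_counts = {"a": 1, "b": 1, "c": 2}
collection = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

for w in collection:
    print(w, cluster_smooth(w, doc_counts, cluster_counts, collection))
```

In this toy example, word "c" (seen in the cluster but not the document) receives more probability than word "d" (seen only in the collection), which is the intended effect of using the cluster as a closer reference model.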
   Better than JM or Dirichlet smoothing: it
    expands a document with more data from
    its cluster instead of relying only on the
    single collection language model.
Figure: Cluster D is good for smoothing document a but not for
document d.

Ideally, each document should have its own cluster centered
around itself.
   Expand each document using a probabilistic
    neighborhood to estimate a virtual
    document (d’).

   Apply any interpolation-based method (e.g. JM
    or Dirichlet) to this virtual document, treating
    its word counts as if they were the original
    word counts.
   Cosine similarity can be used to determine which documents
    lie in the neighborhood of the original document.

   Problems:
    ◦ A narrow neighborhood contains only a few documents,
      whereas a wide one may include the whole collection.
    ◦ Neighbor documents cannot be treated as if they were
      sampled from the same model as the original document.
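A minimal sketch of cosine similarity between two documents follows. It assumes documents are represented by raw term counts; a weighted representation such as TF-IDF could be substituted without changing the function:

```python
import math


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(c * counts_b.get(w, 0) for w, c in counts_a.items())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Toy documents (assumed values).
d = {"language": 1, "model": 1}
near = {"language": 1}
far = {"energy": 3}

print(cosine_similarity(d, near))
print(cosine_similarity(d, far))
```

A hard threshold on this score is what produces the narrow-versus-wide neighborhood problem listed above, which motivates the soft confidence values on the next slides.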
   Associates a Confidence Value with every
    document in the collection

    ◦ This Confidence Value reflects the belief that the
      document is sampled from the same underlying
      model as the original one.
   A confidence value γd is associated with every
    document to indicate how strongly it is believed
    to be sampled from d’s model.
   The confidence values should follow a normal
    distribution:
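The distribution formula did not survive extraction. As a hedged stand-in, the sketch below assumes the confidence of each neighbor is approximated by its similarity to d, normalized so that the values sum to one. This normalization and the names used are assumptions, not a transcription of the slide:

```python
def confidence_values(similarities):
    """Turn similarity scores to document d into normalized confidence values.

    `similarities` maps a neighbor id to its similarity with d
    (d itself is assumed to be excluded).
    """
    total = sum(similarities.values())
    if total == 0.0:
        return {doc_id: 0.0 for doc_id in similarities}
    return {doc_id: score / total for doc_id, score in similarities.items()}


# Toy similarity scores (assumed values).
gamma = confidence_values({"b1": 0.6, "b2": 0.3, "b3": 0.1})
print(gamma)
```

Documents closer to d receive larger confidence values, so they contribute more when the virtual document is built.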
   Shorter documents need more help from their
    neighbors.
   Longer documents rely more on themselves.

   A parameter α is introduced to control
    this balance.
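One way to sketch the balance is a pseudo count of the form α C(w, d) + (1 - α) Σ γd(b) C(w, b), summed over neighbors b. This form is assumed here because the slide's formula is missing; it is consistent with the later observation that shorter documents prefer a smaller α. The function name and toy data are illustrative:

```python
def expanded_counts(doc_counts, neighbors, alpha=0.5):
    """Pseudo term counts for the virtual document d'.

    `neighbors` is a list of (confidence, term_counts) pairs; confidences
    are assumed to be normalized already. alpha weights the original document.
    """
    expanded = {word: alpha * count for word, count in doc_counts.items()}
    for confidence, counts in neighbors:
        for word, count in counts.items():
            expanded[word] = expanded.get(word, 0.0) + (1.0 - alpha) * confidence * count
    return expanded


# Toy data (assumed values).
doc_counts = {"a": 2}
neighbors = [(0.5, {"a": 2, "b": 4}), (0.5, {"b": 2})]

print(expanded_counts(doc_counts, neighbors, alpha=0.5))
```

The resulting counts can then be passed to any interpolation-based smoothing method in place of the original counts. With α = 1 the virtual document reduces to the original document.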
For efficiency, the pseudo term counts can be
computed using only the top M closest neighbors, since
the confidence value decays with distance from the document.
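The truncation can be sketched with a heap-based selection. Renormalizing the confidence values over the retained neighbors is an assumption made here so that they still sum to one:

```python
import heapq


def top_m_neighbors(similarities, m):
    """Keep the m most similar neighbors and renormalize their confidences."""
    top = heapq.nlargest(m, similarities.items(), key=lambda item: item[1])
    total = sum(score for _, score in top)
    if total == 0.0:
        return {}
    return {doc_id: score / total for doc_id, score in top}


# Toy similarity scores (assumed values).
sims = {"b1": 0.6, "b2": 0.3, "b3": 0.1, "b4": 0.0}
print(top_m_neighbors(sims, m=2))
```

Only the retained neighbors need their term counts read when building the virtual document, which keeps the cost per document bounded by M.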
   For performance comparison:
    ◦ Four TREC data sets are used:
        AP (Associated Press news, 1988-90)
        LA (LA Times)
        WSJ (Wall Street Journal, 1987-92)
        SJMN (San Jose Mercury News, 1991)
   For testing how the algorithm scales up:
    ◦ TREC8
   For testing the effect on short documents:
    ◦ DOE (Department of Energy)
Comparison of DELM + (Dirichlet/JM) with Dirichlet/JM

λ for JM and μ for Dirichlet are set to their optimal values, and the
same values are reused for DELM without further tuning. M is 100 and
α is 0.5 for DELM.
DELM outperforms JM and Dirichlet on every data set, with
improvements of as much as 15% on Associated Press
news (AP).
Figure: Precision-recall curve on AP data. Precision values are
compared at different levels of recall for the AP data set.
DELM + Dirichlet outperforms Dirichlet at every precision point.
Figure: Performance change with respect to M, the number of closest
neighbors used for each document.
Conclusion:
 Neighborhood information improves retrieval accuracy.
 Performance becomes insensitive to M when M is sufficiently large.
Comparison of DELM + Dirichlet with CBDM

 DELM + Dirichlet outperforms CBDM in MAP values on all
four data sets.
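For readers unfamiliar with the metric, MAP (mean average precision) can be sketched as follows. The document ids and relevance judgments are made up for illustration and are not results from the paper:

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant doc appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """`runs` is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d2", "d1"], {"d1"}),
]
print(mean_average_precision(runs))
```

Relevant documents that are never retrieved still count in the denominator, so the metric rewards both ranking quality and coverage.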
Documents in AP88-89 were shrunk to 30%, 50%, and 70% of their
original length to form three corpora.
The results show that DELM helps shorter documents more than
longer ones (41% on the 30%-length corpus versus 16% on the full-length corpus).
Performance change with respect to α

The optimal point shifts as documents become
shorter: the full-length corpus is optimal at α = 0.4, whereas
the 30%-length corpus needs α = 0.2.
Combination of DELM with pseudo feedback

 DELM is combined with the model-based feedback proposed in (Zhai
and Lafferty, 2001a).
Experiment procedure:
   Retrieve documents with the DELM method.
   Choose the top five documents for model-based feedback.
   Use the expanded query model to retrieve documents again.
Result: DELM can be combined with pseudo feedback to
improve performance.
   References:
    ◦ http://sifaka.cs.uiuc.edu/czhai/pub/hlt06-exp.pdf
    ◦ http://nlp.stanford.edu/IR-book/pdf/12lmodel.pdf
    ◦ http://krisztianbalog.com/files/sigir2008-csiro.pdf

Language Model Information Retrieval with Document Expansion
