• Kendall's τ and AP correlation are successful at comparing two given rankings
• What about the correlation between the observed and the true ranking?
• Useful as a single, well-understood figure of the reliability of an experiment
• Contrary to sensitivity or statistical significance, it gives an idea of global
similarity with the truth, not just about individual pairs of systems (e.g. t-test)
or about a swap somewhere in the ranking (e.g. ANOVA)
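As an illustration of the two coefficients named above, here is a minimal sketch that computes Kendall's τ and AP correlation between an observed ranking and a reference ranking. It assumes both rankings contain the same systems with no ties; the function names and the toy rankings are hypothetical and are not taken from the authors' code.

```python
from itertools import combinations


def kendall_tau(observed, truth):
    """Kendall's tau between two rankings of the same items (best first, no ties)."""
    pos = {system: rank for rank, system in enumerate(truth)}
    n = len(observed)
    pairs = n * (n - 1) // 2
    # A pair is concordant when both rankings order it the same way.
    concordant = sum(1 for a, b in combinations(observed, 2) if pos[a] < pos[b])
    discordant = pairs - concordant
    return (concordant - discordant) / pairs


def tau_ap(observed, truth):
    """AP correlation of `observed` with respect to `truth` (best first, no ties)."""
    pos = {system: rank for rank, system in enumerate(truth)}
    n = len(observed)
    total = 0.0
    for i in range(1, n):
        item = observed[i]
        # Fraction of the systems ranked above `item` that truly belong above it.
        correct = sum(1 for above in observed[:i] if pos[above] < pos[item])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0


truth = ["S1", "S2", "S3", "S4"]
swap_top = ["S2", "S1", "S3", "S4"]     # one swap at the top of the ranking
swap_bottom = ["S1", "S2", "S4", "S3"]  # one swap at the bottom of the ranking

print(kendall_tau(swap_top, truth), kendall_tau(swap_bottom, truth))
print(tau_ap(swap_top, truth), tau_ap(swap_bottom, truth))
```

Both toy rankings contain a single swap, so τ is the same for both, while AP correlation penalizes the swap at the top more heavily than the swap at the bottom.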
Toward Estimating the Rank Correlation
between the Test Collection Results
and the True System Performance
Julián Urbano and Mónica Marrero
fully reproducible:
data and code
available online
SIGIR 2016
Pisa, July 19th
Evaluation
[Figure: two histograms of effectiveness (0.0 to 1.0). Population of Topics: effectiveness vs. density. Sample of Topics: effectiveness vs. frequency. Labels: Test Collection, Real World. Example system ranking: S5 > S12 > S6 > S2 > S1 > S4...]
Future Work
• Better estimators of discordance
• Interval estimators
• Fully Bayesian approach
• Consider other sources of
variability besides topics, such as
systems or documents
Results: Error of estimators
[Figure: error (0.02 to 0.10) vs. topic set size (10 to 100) for τ and τAP on adhoc6, adhoc7 and adhoc8. Estimators: ML, MSQD, RES, KD, SH(w/o), SH(w)]
• Split-half estimators perform very poorly
• About 0.035 error with 50 topics
• All proposed estimators perform about the same, but MSQD is better with small samples
Results: Bias of estimators
[Figure: bias (0.00 to 0.08) vs. topic set size (10 to 100) for τ and τAP on adhoc6, adhoc7 and adhoc8. Estimators: ML, MSQD, RES, KD, SH(w/o), SH(w)]
• Split-half estimators are clearly biased
• Correlations generally overestimated
• MSQD much better with small collections, KD slightly better otherwise
• We need to know the true scores in order to evaluate the estimators!
• Stochastic simulation from a previous collection Y: maintains distributions
and correlations, and fixes in advance the vector of true mean scores, E[X_s] = μ_s := Y_s
• From TREC 6, 7 & 8, simulate 3x1000 collections of n = 10, 20, ..., 100 topics
• Split-half baselines with and without replacement: y = a·e^(bx), 2000 replicates
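The split-half baseline can be sketched as follows. This is one plausible reading and not the authors' implementation. It assumes that, for each subset size, two topic subsets are drawn, the systems are ranked by mean score on each, and the mean Kendall's τ across replicates is fitted with y = a·e^(bx) and extrapolated to the full topic set size. The function names, the handling of "with replacement" (each half drawn independently with replacement) and the log-scale least-squares fit are all assumptions made for illustration.

```python
import math
import random
from itertools import combinations


def kendall_tau(r1, r2):
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos = {s: i for i, s in enumerate(r2)}
    pairs = len(r1) * (len(r1) - 1) // 2
    concordant = sum(1 for a, b in combinations(r1, 2) if pos[a] < pos[b])
    return (2 * concordant - pairs) / pairs


def mean_ranking(scores, topics):
    """Rank systems by their mean score over the given topic indices (best first)."""
    means = {s: sum(v[t] for t in topics) / len(topics) for s, v in scores.items()}
    return sorted(means, key=means.get, reverse=True)


def fit_exponential(xs, ys):
    """Least-squares fit of y = a * exp(b * x) on the log scale (needs all ys > 0)."""
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / len(xs)
    mean_l = sum(logs) / len(logs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    b = sxy / sxx
    a = math.exp(mean_l - b * mean_x)
    return a, b


def split_half_estimate(scores, sizes, replicates=200, replace=False, seed=0):
    """Extrapolate the mean split-half tau to the full number of topics."""
    rng = random.Random(seed)
    n_topics = len(next(iter(scores.values())))
    mean_taus = []
    for m in sizes:
        taus = []
        for _ in range(replicates):
            if replace:
                # Assumption: each half is drawn independently, with replacement.
                half1 = rng.choices(range(n_topics), k=m)
                half2 = rng.choices(range(n_topics), k=m)
            else:
                # Two disjoint halves, so m can be at most n_topics // 2.
                drawn = rng.sample(range(n_topics), 2 * m)
                half1, half2 = drawn[:m], drawn[m:]
            taus.append(kendall_tau(mean_ranking(scores, half1),
                                    mean_ranking(scores, half2)))
        mean_taus.append(sum(taus) / len(taus))
    a, b = fit_exponential(sizes, mean_taus)
    return a * math.exp(b * n_topics)


# Toy scores (hypothetical): three clearly separated systems over ten topics.
toy_scores = {"A": [0.9] * 10, "B": [0.5] * 10, "C": [0.1] * 10}
estimate = split_half_estimate(toy_scores, sizes=[2, 3, 4, 5], replicates=50)
print(estimate)
```

The log-scale fit requires every mean correlation to be positive, which is another simplification of this sketch.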
S1 > S2 > S3 > S4 > S5 > S6...
Expected Correlation with the True Ranking
[Formula annotations: bias correction; rank of X_i within the sample]
