SlideShare une entreprise Scribd logo
1  sur  1
Télécharger pour lire hors ligne
A Comparison of the Optimality 
of Statistical Significance Tests 
for Information Retrieval Evaluation 
Julián Urbano, Mónica Marrero and Diego Martín 
Department of Computer Science · University Carlos III of Madrid 
The problem: is system A more effective than system B? 
The drill: evaluate with a test collection and run a statistical significance test 
The dilemma: t-test, Wilcoxon, sign, bootstrap or permutation? 
The reason: test assumptions are violated, so which one is optimal in practice? 
Three criteria: power (maximize # of significants), safety (minimize # of errors), exact (keep errors at α) 
Data and Methods 
· TREC Robust 2004: 100 topics from Ad Hoc 7 and 8 
o 110 runs, 5995 pairs of systems 
· Randomly split topics in T1 and T2, as if two collections 
o Evaluate all runs and compute p-values 
o Compare p-values from T1 with p-values from T2 
o 1000 trials, 12M p-values per test, 60M in total 
· Interpret pairs of p-values for different α levels 
T2 
A ≻B A ≺B A ≻≻B A ≺≺B 
T1 
A ≻B Non-significance 
A≻≻B 
Lack of 
power 
Minor 
error 
Success 
Major 
error 
Non-significance rate 
t-test 
permutation 
bootstrap 
Wilcoxon 
sign 
.001 .005 .01 .05 .1 
Significance level a 
Non-significants / Total 
0.3 0.35 0.4 0.45 0.5 0.6 
Previous Work 
Zobel’98, Sanderson & Zobel’05, Cormack & Lynam’06 
· Wilcoxon more powerful than t-test, but more errors 
Smucker et al. ‘07, ‘09 
· bootstrap test overly powerful, though similar to t-test 
and permutation 
· Wilcoxon and sign unreliable, should use permutation 
· Power: bootstrap test gives more significant results 
· Safety: t-test gives fewer errors 
· Exactness: Wilcoxon test best tracks the nominal level 
· The permutation test is not optimal in practice 
· Error rates seem lower than expected; focus on power 
Success rate 
.001 .005 .01 .05 .1 
Significance level a 
Successes / Total significants 
0.76 0.78 0.80 0.82 0.84 0.86 
Take-Home Messages 
Lack of power rate 
.001 .005 .01 .05 .1 
Significance level a 
Lacks of power / Total significants 
0.12 0.14 0.16 0.18 0.20 
Minor error rate 
t-test 
permutation 
bootstrap 
Wilcoxon 
sign 
y=x 
.001 .005 .01 .05 .1 
Significance level a 
Minor errors / Total significants 
0.001 0.002 0.005 0.010 0.020 
Major error rate 
.001 .005 .01 .05 .1 
Significance level a 
Major errors / Total significants 
5e-07 5e-06 5e-05 5e-04 
Global error rate 
.0001 .0005.001 .005 .01 .05 .1 .5 
Significance level a 
Minor and Major errors / Total significants 
5e-04 2e-03 5e-03 2e-02 5e-02 
Dublin, Ireland · 30th July 2013 Supported by ACM SIGIR Student Travel Grant

Contenu connexe

Similaire à A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation

Cross validation
Cross validationCross validation
Cross validationRidhaAfrawe
 
multi criteria decision making
multi criteria decision makingmulti criteria decision making
multi criteria decision makingShankha Goswami
 
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Julián Urbano
 
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...TESCO - The Eastern Specialty Company
 
ML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptxML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptxbelay41
 
Process Capability: Steps 1 to 3
Process Capability: Steps 1 to 3Process Capability: Steps 1 to 3
Process Capability: Steps 1 to 3Matt Hansen
 
Statistical process control ppt @ bec doms
Statistical process control ppt @ bec domsStatistical process control ppt @ bec doms
Statistical process control ppt @ bec domsBabasab Patil
 
Hph7300week14winter2009narr
Hph7300week14winter2009narrHph7300week14winter2009narr
Hph7300week14winter2009narrSarah
 
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ ykblckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ ykSMayankSharma
 
Chap 9 A Process Capability & Spc Hk
Chap 9 A Process Capability & Spc HkChap 9 A Process Capability & Spc Hk
Chap 9 A Process Capability & Spc Hkajithsrc
 
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationThomas Ploetz
 
Fault simulation – application and methods
Fault simulation – application and methodsFault simulation – application and methods
Fault simulation – application and methodsSubash John
 
SE 09 (test design techs).pptx
SE 09 (test design techs).pptxSE 09 (test design techs).pptx
SE 09 (test design techs).pptxZohairMughal1
 
Stochastic Process
Stochastic ProcessStochastic Process
Stochastic Processknksmart
 

Similaire à A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation (20)

Cross validation
Cross validationCross validation
Cross validation
 
multi criteria decision making
multi criteria decision makingmulti criteria decision making
multi criteria decision making
 
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
 
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
 
Complete Site Testing
Complete Site TestingComplete Site Testing
Complete Site Testing
 
ML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptxML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptx
 
Burden Testing, Theory, and Practice
Burden Testing, Theory, and PracticeBurden Testing, Theory, and Practice
Burden Testing, Theory, and Practice
 
Process Capability: Steps 1 to 3
Process Capability: Steps 1 to 3Process Capability: Steps 1 to 3
Process Capability: Steps 1 to 3
 
Instrument Transformer Testing
Instrument Transformer TestingInstrument Transformer Testing
Instrument Transformer Testing
 
SE%200-Testing%20(2).pptx
SE%200-Testing%20(2).pptxSE%200-Testing%20(2).pptx
SE%200-Testing%20(2).pptx
 
Statistical process control ppt @ bec doms
Statistical process control ppt @ bec domsStatistical process control ppt @ bec doms
Statistical process control ppt @ bec doms
 
Hph7300week14winter2009narr
Hph7300week14winter2009narrHph7300week14winter2009narr
Hph7300week14winter2009narr
 
Facility Location
Facility Location Facility Location
Facility Location
 
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ ykblckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
 
Complete Site Testing
Complete Site TestingComplete Site Testing
Complete Site Testing
 
Chap 9 A Process Capability & Spc Hk
Chap 9 A Process Capability & Spc HkChap 9 A Process Capability & Spc Hk
Chap 9 A Process Capability & Spc Hk
 
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
 
Fault simulation – application and methods
Fault simulation – application and methodsFault simulation – application and methods
Fault simulation – application and methods
 
SE 09 (test design techs).pptx
SE 09 (test design techs).pptxSE 09 (test design techs).pptx
SE 09 (test design techs).pptx
 
Stochastic Process
Stochastic ProcessStochastic Process
Stochastic Process
 

Plus de Julián Urbano

Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowJulián Urbano
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationJulián Urbano
 
A Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationA Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationJulián Urbano
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured DocumentsJulián Urbano
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackJulián Urbano
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...Julián Urbano
 
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Julián Urbano
 
Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Julián Urbano
 
Evaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityEvaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityJulián Urbano
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalJulián Urbano
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityJulián Urbano
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...Julián Urbano
 
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Julián Urbano
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...Julián Urbano
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Julián Urbano
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityJulián Urbano
 
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Julián Urbano
 
Improving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered ListsImproving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered ListsJulián Urbano
 
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity TasksCrowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity TasksJulián Urbano
 

Plus de Julián Urbano (20)

Your PhD and You
Your PhD and YouYour PhD and You
Your PhD and You
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and How
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP Correlation
 
A Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationA Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR Evaluation
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured Documents
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
 
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
 
Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)
 
Evaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityEvaluation in Audio Music Similarity
Evaluation in Audio Music Similarity
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection Reliability
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...
 
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and Stability
 
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
 
Improving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered ListsImproving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered Lists
 
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity TasksCrowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
 

Dernier

Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 

Dernier (20)

Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 

A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation

  • 1. A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation Julián Urbano, Mónica Marrero and Diego Martín Department of Computer Science · University Carlos III of Madrid The problem: is system A more effective than system B? The drill: evaluate with a test collection and run a statistical significance test The dilemma: t-test, Wilcoxon, sign, bootstrap or permutation? The reason: test assumptions are violated, so which one is optimal in practice? Three criteria: power (maximize # of significants), safety (minimize # of errors), exact (keep errors at α) Data and Methods · TREC Robust 2004: 100 topics from Ad Hoc 7 and 8 o 110 runs, 5995 pairs of systems · Randomly split topics in T1 and T2, as if two collections o Evaluate all runs and compute p-values o Compare p-values from T1 with p-values from T2 o 1000 trials, 12M p-values per test, 60M in total · Interpret pairs of p-values for different α levels T2 A ≻B A ≺B A ≻≻B A ≺≺B T1 A ≻B Non-significance A≻≻B Lack of power Minor error Success Major error Non-significance rate t-test permutation bootstrap Wilcoxon sign .001 .005 .01 .05 .1 Significance level a Non-significants / Total 0.3 0.35 0.4 0.45 0.5 0.6 Previous Work Zobel’98, Sanderson & Zobel’05, Cormack & Lynam’06 · Wilcoxon more powerful than t-test, but more errors Smucker et al. ‘07, ‘09 · bootstrap test overly powerful, though similar to t-test and permutation · Wilcoxon and sign unreliable, should use permutation · Power: bootstrap test gives more significant results · Safety: t-test gives fewer errors · Exactness: Wilcoxon test best tracks the nominal level · The permutation test is not optimal in practice · Error rates seem lower than expected; focus on power Success rate .001 .005 .01 .05 .1 Significance level a Successes / Total significants 0.76 0.78 0.80 0.82 0.84 0.86 Take-Home Messages Lack of power rate .001 .005 .01 .05 .1 Significance level a Lacks of power / Total significants 0.12 0.14 0.16 0.18 0.20 Minor error rate t-test permutation bootstrap Wilcoxon sign y=x .001 .005 .01 .05 .1 Significance level a Minor errors / Total significants 0.001 0.002 0.005 0.010 0.020 Major error rate .001 .005 .01 .05 .1 Significance level a Major errors / Total significants 5e-07 5e-06 5e-05 5e-04 Global error rate .0001 .0005.001 .005 .01 .05 .1 .5 Significance level a Minor and Major errors / Total significants 5e-04 2e-03 5e-03 2e-02 5e-02 Dublin, Ireland · 30th July 2013 Supported by ACM SIGIR Student Travel Grant