Significance Tests
     in NLP
    Presented by Jinho D. Choi
 University of Colorado at Boulder
      September 15th, 2010
Data Type
•   Continuous data
    •   Outputs are from infinitely many possible values (regression).

    •   e.g., temperatures, document relevancies.

    •   Values are related to one another.

    •   One sample t-test, Paired two sample t-test.


•   Categorical data
    •   Outputs are from finitely defined categories (classification).

    •   e.g., POS tags, dependency labels.

    •   Values are not related to one another.

    •   Wilcoxon’s signed-rank test, Fisher’s exact test, Pearson’s chi-square
        test, McNemar’s test
One sample t-test
•   One sample t-test
    •   The true mean is known, and the normal distribution is assumed.

    •   Null hypothesis: difference between true mean and our mean is zero.

•   Example
    •   Average ITA score = 84.31% (true mean)
              be          say         get         know        see         our mean
            90.88%      89.75%      84.11%      87.57%      88.19%      90.25%

    •   Calculate t-score:


    •   Use the t-score to find p-value in the distribution table.
        •    Degrees of freedom: the number of values that are free to vary once the mean is fixed (n - 1 for n samples).

        •    p ≤ 0.01 → the difference is statistically significant with over 99% confidence.
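
The t-score step can be sketched as follows, assuming the usual one-sample statistic t = (sample mean - true mean) / (s / sqrt(n)), where s is the sample standard deviation. The function name `one_sample_t` is illustrative. The input is only the five per-verb scores listed in the table; their plain average (88.10%) differs from the "our mean" column (90.25%), so this sketch is not expected to reproduce the p-value quoted on the slide.

```python
import math

def one_sample_t(values, true_mean):
    """Return (t_score, degrees_of_freedom) for a one-sample t-test."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance with Bessel's correction (divide by n - 1).
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    standard_error = math.sqrt(variance / n)
    return (mean - true_mean) / standard_error, n - 1

# Per-verb scores copied from the example table (be, say, get, know, see).
scores = [90.88, 89.75, 84.11, 87.57, 88.19]
t_score, dof = one_sample_t(scores, true_mean=84.31)
print(f"t = {t_score:.4f}, degrees of freedom = {dof}")
```

The resulting t-score and degrees of freedom are what one would look up in a t-distribution table.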
Paired two sample t-test
•   Paired two sample t-test
    •    Each sample is tested by two systems ("players"), or by one system twice.

    •    Null hypothesis: mean difference between two normally distributed
         populations is zero.

•   Example
                   EBC        EBN       SIN        XIN       WEB            WSJ    Mean
        LTH       83.36      86.32     86.80      85.50      85.53         87.15   85.88
        Clear     84.06      86.77     86.55      85.41      85.70         87.58   86.09


    •    Calculate t-score:

    •    Find p-value.
        •    p = 0.1701 → the difference is not statistically significant.
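
A minimal sketch of the paired computation, assuming the standard statistic t = mean(d) / (s_d / sqrt(n)) over the per-corpus differences d = Clear - LTH. Instead of a table lookup, the two-sided p-value is approximated by integrating the t density numerically with Simpson's rule; that substitution and the helper names `t_pdf`, `two_sided_p`, and `paired_t` are assumptions made for illustration.

```python
import math

def t_pdf(x, dof):
    """Density of Student's t distribution with `dof` degrees of freedom."""
    coef = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return coef * (1 + x * x / dof) ** (-(dof + 1) / 2)

def two_sided_p(t, dof, steps=2000):
    """Two-sided p-value: 1 - 2 * (integral of the density from 0 to |t|)."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    return 1 - 2 * (total * h / 3)

def paired_t(first, second):
    """Return (t_score, degrees_of_freedom, p_value) for paired samples."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(variance / n)
    return t, n - 1, two_sided_p(t, n - 1)

# Scores per corpus (EBC, EBN, SIN, XIN, WEB, WSJ) from the example table.
lth = [83.36, 86.32, 86.80, 85.50, 85.53, 87.15]
clear = [84.06, 86.77, 86.55, 85.41, 85.70, 87.58]
t_score, dof, p_value = paired_t(lth, clear)
print(f"t = {t_score:.3f}, dof = {dof}, p = {p_value:.4f}")
```

On these six pairs the sketch gives a t-score of about 1.60 with 5 degrees of freedom and a two-sided p-value of about 0.17, in line with the value quoted above.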


            NLP data is often not normally distributed.
Wilcoxon signed-rank test
•   Wilcoxon signed-rank test
    •    Non-parametric test: no distribution is assumed.

    •    Null hypothesis: median difference between pairs of observations is zero

•   Example
                        EBC         EBN         SIN         XIN       WEB      WSJ
            LTH        83.36       86.32       86.80       85.50      85.53   87.15
           Clear       84.06       86.77       86.55       85.41      85.70   87.58
        Clear - LTH     0.7        0.45        -0.25       -0.09      0.17    0.43
        Signed rank      6           5           -3          -1         2       4

    •    W+ = 2 + 4 + 5 + 6 = 17, W- = |-1| + |-3| = 4

    •    Use the min(W+, W-) to find p-value.
        •   p = 0.2188 → the difference is not statistically significant.

        •   cf. paired two sample t-test: p = 0.1701.
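
The ranking and the p-value lookup can be sketched as below. This is an illustrative exact version for small samples: it enumerates all 2^n sign assignments rather than using a table, drops zero differences, and assumes no tied absolute differences. The function name `wilcoxon_signed_rank` is hypothetical.

```python
from itertools import combinations

def wilcoxon_signed_rank(first, second):
    """Return (w_plus, w_minus, exact two-sided p) for paired samples."""
    diffs = [b - a for a, b in zip(first, second) if b != a]
    n = len(diffs)
    # Rank the differences by absolute value (1 = smallest); ties are not handled.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    # Under the null hypothesis each rank is positive or negative with equal
    # probability, so count the rank subsets whose sum is at most w.
    count = sum(
        1
        for k in range(n + 1)
        for subset in combinations(range(1, n + 1), k)
        if sum(subset) <= w
    )
    return w_plus, w_minus, min(1.0, 2 * count / 2 ** n)

lth = [83.36, 86.32, 86.80, 85.50, 85.53, 87.15]
clear = [84.06, 86.77, 86.55, 85.41, 85.70, 87.58]
w_plus, w_minus, p_value = wilcoxon_signed_rank(lth, clear)
print(w_plus, w_minus, round(p_value, 4))
```

For the six pairs above, 7 of the 64 sign assignments have a rank sum of at most 4, so the two-sided p-value is 14/64 = 0.21875.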
Fisher's exact test
•   Fisher's exact test
    •   Comparing binary outputs produced by two methods.

    •   The significance of the deviation can be calculated exactly.

    •   Null hypothesis: output difference between two methods is zero.
                      Method 1 Method 2    Total
          Class 1        a        b         a+b
          Class 2        c        d        c+d
           Total        a+c      b+d         n
•   Example
                            Clear       LTH          Total
           Correct        142,731    142,375       285,106
          Incorrect        23,055     23,411        46,466
            Total         165,786    165,786       331,572
                                                                 Really?
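
The slide does not show the computation, so the following is only a sketch of how an exact p-value could be obtained for a 2×2 table with fixed margins. It assumes the hypergeometric formulation and one common two-sided convention (sum the probabilities of all tables that are no more likely than the observed one). The names `log_choose` and `fisher_exact_two_sided` are hypothetical, and log-gamma is used because the factorials for counts of this size overflow ordinary floats.

```python
import math

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def log_pmf(x):
        # Probability that the top-left cell equals x when all margins are fixed.
        return (log_choose(col1, x) + log_choose(n - col1, row1 - x)
                - log_choose(n, row1))

    lowest, highest = max(0, row1 + col1 - n), min(row1, col1)
    observed = log_pmf(a)
    tolerance = 1e-7  # absorbs rounding error when comparing log-probabilities
    total = sum(
        math.exp(lp)
        for lp in (log_pmf(x) for x in range(lowest, highest + 1))
        if lp <= observed + tolerance
    )
    return min(1.0, total)

# Counts from the example: columns Clear / LTH, rows Correct / Incorrect.
p_value = fisher_exact_two_sided(142731, 142375, 23055, 23411)
print(f"p = {p_value:.4f}")
```

The same function can be sanity-checked on a small table such as `fisher_exact_two_sided(3, 1, 1, 3)`, whose two-sided p-value is 34/70.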
Pearson's chi-square test
•   Pearson's chi-square test
    •   Observations are independent of one another.

    •   The chi-square distribution is assumed.

    •   Null hypothesis: difference between observed frequency distribution and
        true distribution is zero.
•   Example (observed = Clear, true = LTH)
                          Clear          LTH              X2
         Correct        142,731       142,375           0.89
        Incorrect        23,055        23,411           5.41
          Total         165,786       165,786            6.3

    •   Calculate X2-score:

    •   Use the X2-score to find p-value.

        •   p = 0.0121 → the difference is statistically significant with 98.79% confidence.
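
The numbers in the table are consistent with the usual statistic X2 = sum of (observed - true)^2 / true, taking the Clear counts as observed and the LTH counts as true. A minimal sketch under that assumption follows; the function name `chi_square_gof` is illustrative, and the p-value uses the closed form for one degree of freedom (two categories), P(X2 > x) = erfc(sqrt(x / 2)), in place of a table lookup.

```python
import math

def chi_square_gof(observed, expected):
    """Return (per-category terms, X2 score, p-value for 1 degree of freedom)."""
    terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    x2 = sum(terms)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(x2 / 2))
    return terms, x2, p

clear = [142731, 23055]  # observed: correct, incorrect
lth = [142375, 23411]    # true: correct, incorrect
terms, x2, p_value = chi_square_gof(clear, lth)
print([round(t, 2) for t in terms], round(x2, 2), round(p_value, 4))
```

This yields per-row terms of about 0.89 and 5.41, a total of about 6.30, and a p-value of about 0.012, matching the table.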
McNemar's test
•   McNemar's test
    •   Applied to 2×2 contingency tables with binary outputs.

    •   Non-parametric test: no distribution is assumed.

    •   Null hypothesis: p(b) = p(c)
                        Method 1: +      Method 1: -
        Method 2: +          a                b
        Method 2: -          c                d
•   Example
                        Clear 1: +       Clear 1: -        Total
        LTH 2: +            138,402            3,973         142,375
        LTH 2: -               4,329          19,082          23,411
          Total             142,731           23,055         165,786


    •   Calculate X2-score:

    •   Use the X2-score to find p-value.
        •   p < 0.0001 → the difference is statistically significant with 99.99% confidence.
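
A sketch of the computation on the discordant cells b and c, assuming the uncorrected statistic X2 = (b - c)^2 / (b + c) with one degree of freedom. The slide does not show whether a continuity correction was applied, so that choice, like the function name `mcnemar`, is an assumption.

```python
import math

def mcnemar(b, c):
    """Return (X2 score, p-value) from the two discordant cell counts."""
    x2 = (b - c) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(x2 / 2))
    return x2, p

# b: LTH correct while Clear is wrong; c: Clear correct while LTH is wrong.
x2, p_value = mcnemar(b=3973, c=4329)
print(f"X2 = {x2:.2f}, p = {p_value:.6f}")
```

Only the 8,302 tokens on which the two systems disagree enter the statistic; the 157,484 tokens on which they agree do not affect it.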
