1. Significance Tests in NLP
Presented by Jinho D. Choi
University of Colorado at Boulder
September 15th, 2010
2. Data Type
• Continuous data
• Outputs are from infinitely many possible values (regression).
• e.g., temperatures, document relevancies.
• Values are related to one another (they lie on an ordered scale).
• One sample t-test, Paired two sample t-test.
• Categorical data
• Outputs are from finitely defined categories (classification).
• e.g., POS tags, dependency labels.
• Values are not related to one another (the categories have no inherent order).
• Wilcoxon’s signed-rank test, Fisher’s exact test, Pearson’s chi-square test, McNemar’s test.
3. One sample t-test
• One sample t-test
• The true mean is known, and the normal distribution is assumed.
• Null hypothesis: difference between true mean and our mean is zero.
• Example
• Average ITA score = 84.31% (true mean)
be       say      get      know     see      our mean
90.88%   89.75%   84.11%   87.57%   88.19%   90.25%
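• Calculate t-score: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, where $\bar{x}$ is our mean, $\mu_0$ the true mean, $s$ the sample standard deviation, and $n$ the number of data points.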
• Use the t-score to find p-value in the distribution table.
• Degrees of freedom: the number of values that are free to vary once the mean is fixed (n − 1 for n data points).
• p ≤ 0.01 → the difference is statistically significant with over 99% confidence.
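The t-score step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `one_sample_t` is hypothetical, and because the slide's row does not make clear which percentages are individual scores and which is the mean, the call below simply treats all six as the sample.

```python
import math

def one_sample_t(scores, true_mean):
    """Return (t_score, degrees_of_freedom) for a one-sample t-test."""
    n = len(scores)
    mean = sum(scores) / n
    # Unbiased sample variance (divide by n - 1).
    variance = sum((x - mean) ** 2 for x in scores) / (n - 1)
    standard_error = math.sqrt(variance / n)
    return (mean - true_mean) / standard_error, n - 1

# Assumption: all six percentages on the slide are treated as the sample.
scores = [90.88, 89.75, 84.11, 87.57, 88.19, 90.25]
t_score, dof = one_sample_t(scores, true_mean=84.31)
print(f"t = {t_score:.3f}, degrees of freedom = {dof}")
```

The returned t-score and degrees of freedom are what one would then look up in a t-distribution table to obtain the p-value.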
4. Paired two sample t-test
• Paired two sample t-test
• Each sample is measured twice: by two different systems, or by the same system under two conditions.
• Null hypothesis: mean difference between two normally distributed
populations is zero.
• Example
        EBC     EBN     SIN     XIN     WEB     WSJ     Mean
LTH     83.36   86.32   86.80   85.50   85.53   87.15   85.88
Clear   84.06   86.77   86.55   85.41   85.70   87.58   86.09
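• Calculate t-score: $t = \frac{\bar{d}}{s_d / \sqrt{n}}$, where $\bar{d}$ and $s_d$ are the mean and standard deviation of the per-pair differences and $n$ is the number of pairs.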
• Find p-value.
• p = 0.1701 → the difference is not statistically significant.
• Caveat: NLP data is often not normally distributed, so the t-test's normality assumption may not hold.
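The paired test above can be reproduced with the standard library alone. The sketch below is an illustration under stated assumptions: the helper names (`t_pdf`, `two_sided_p`, `paired_t_test`) are hypothetical, the six per-genre scores are used as the paired samples, and the p-value is obtained by numerically integrating the t density with Simpson's rule rather than by a table lookup.

```python
import math

def t_pdf(x, dof):
    """Density of Student's t-distribution with `dof` degrees of freedom."""
    norm = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return norm * (1 + x * x / dof) ** (-(dof + 1) / 2)

def two_sided_p(t_score, dof, steps=2000):
    """Two-sided p-value: 1 minus the probability mass in [-|t|, |t|]."""
    upper = abs(t_score)
    h = upper / steps
    # Composite Simpson's rule on [0, |t|]; `steps` must be even.
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    return 1 - 2 * (total * h / 3)

def paired_t_test(first, second):
    """Return (t_score, two-sided p-value) for paired samples."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t_score = mean / math.sqrt(variance / n)
    return t_score, two_sided_p(t_score, n - 1)

lth = [83.36, 86.32, 86.80, 85.50, 85.53, 87.15]
clear = [84.06, 86.77, 86.55, 85.41, 85.70, 87.58]
t_score, p_value = paired_t_test(lth, clear)
print(f"t = {t_score:.3f}, p = {p_value:.4f}")
```

With these inputs the p-value should land close to the p = 0.1701 reported on the slide.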
5. Wilcoxon signed-rank test
• Wilcoxon signed-rank test
• Non-parametric test: no distribution is assumed.
• Null hypothesis: median difference between pairs of observations is zero
• Example
              EBC     EBN     SIN     XIN     WEB     WSJ
LTH           83.36   86.32   86.80   85.50   85.53   87.15
Clear         84.06   86.77   86.55   85.41   85.70   87.58
Clear - LTH   0.70    0.45    -0.25   -0.09   0.17    0.43
Signed rank   6       5       -3      -1      2       4
• W+ = 2 + 4 + 5 + 6 = 17, W- = |-1| + |-3| = 4
• Use min(W+, W-) = 4 to find the p-value.
• p = 0.2188 (two-sided) → the difference is not statistically significant.
• cf. paired two sample t-test: p = 0.1701.
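The ranking and the p-value lookup can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `wilcoxon_signed_rank` is hypothetical, zero differences are dropped, tied magnitudes share an average rank, and the two-sided p-value is computed exactly by enumerating all sign assignments, which is only practical for a handful of pairs.

```python
from itertools import product

def wilcoxon_signed_rank(first, second):
    """Return (W+, W-, exact two-sided p-value) for paired samples."""
    diffs = [b - a for a, b in zip(first, second) if b != a]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Extend j over a run of tied magnitudes and give them the average rank.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w, total = min(w_plus, w_minus), sum(ranks)
    # Under the null hypothesis every sign assignment is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        positive = sum(r for r, s in zip(ranks, signs) if s)
        if min(positive, total - positive) <= w + 1e-9:
            extreme += 1
    return w_plus, w_minus, extreme / 2 ** len(ranks)

lth = [83.36, 86.32, 86.80, 85.50, 85.53, 87.15]
clear = [84.06, 86.77, 86.55, 85.41, 85.70, 87.58]
w_plus, w_minus, p_value = wilcoxon_signed_rank(lth, clear)
print(f"W+ = {w_plus}, W- = {w_minus}, p = {p_value:.4f}")
```

For six pairs there are only 64 sign assignments, 14 of which give min(W+, W-) of 4 or less, hence 14/64 = 0.21875.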
6. Fisher's exact test
• Fisher's exact test
• Comparing binary outputs produced by two methods.
• The significance of the deviation can be calculated exactly.
• Null hypothesis: output difference between two methods is zero.
          Method 1   Method 2   Total
Class 1   a          b          a+b
Class 2   c          d          c+d
Total     a+c        b+d        n
• Example
            Clear     LTH       Total
Correct     142,731   142,375   285,106
Incorrect   23,055    23,411    46,466
Total       165,786   165,786   331,572
Really?
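The slide gives no formula or p-value for this example, so the following is only a sketch of how one might compute it. With the margins fixed, the probability of a single table is the hypergeometric term (a+b)! (c+d)! (a+c)! (b+d)! / (a! b! c! d! n!). The two-sided rule used here, summing every table that is no more probable than the observed one, is a common convention and an assumption about what the presenter intended; the function names are hypothetical.

```python
import math

def log_table_probability(a, b, c, d):
    """Log of the hypergeometric probability of one 2x2 table with fixed margins."""
    def log_factorial(k):
        return math.lgamma(k + 1)
    return (log_factorial(a + b) + log_factorial(c + d)
            + log_factorial(a + c) + log_factorial(b + d)
            - log_factorial(a) - log_factorial(b)
            - log_factorial(c) - log_factorial(d)
            - log_factorial(a + b + c + d))

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value: total probability of tables no more likely than the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d
    observed = log_table_probability(a, b, c, d)
    p_value = 0.0
    # Enumerate every table that shares the observed row and column totals.
    for x in range(max(0, row1 + col1 - n), min(row1, col1) + 1):
        log_p = log_table_probability(x, row1 - x, col1 - x, n - row1 - col1 + x)
        if log_p <= observed + 1e-7:  # small tolerance for floating-point noise
            p_value += math.exp(log_p)
    return min(p_value, 1.0)

# Cells from the slide's example: a, b = correct (Clear, LTH); c, d = incorrect.
p_value = fisher_exact_two_sided(142731, 142375, 23055, 23411)
print(f"p = {p_value:.4f}")
```

Working in log space via `math.lgamma` avoids forming factorials of six-digit counts directly.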
7. Pearson's chi-square test
• Pearson's chi-square test
• Each observation is independent from one another.
• The chi-square distribution is assumed.
• Null hypothesis: difference between observed frequency distribution and
true distribution is zero.
• Example
            Clear (observed)   LTH (true)   X2
Correct     142,731            142,375      0.89
Incorrect   23,055             23,411       5.41
Total       165,786            165,786      6.3
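• Calculate X2-score: $X^2 = \sum_i \frac{(\text{observed}_i - \text{true}_i)^2}{\text{true}_i}$, summed over the categories (here Correct and Incorrect).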
• Use the X2-score to find p-value.
• p = 0.0121 → the difference is statistically significant with 98.79% confidence.
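The X2 column and the p-value can be checked with a short sketch. It assumes, as the per-row values in the example imply, that Clear supplies the observed counts and LTH the true (expected) counts, and that there is one degree of freedom (two categories). The function name `pearson_chi_square` is hypothetical.

```python
import math

def pearson_chi_square(observed, expected):
    """Return (X2, p-value) for two categories, i.e. one degree of freedom."""
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For one degree of freedom the chi-square tail equals erfc(sqrt(X2 / 2)).
    p_value = math.erfc(math.sqrt(x2 / 2))
    return x2, p_value

clear = [142731, 23055]  # observed: correct, incorrect
lth = [142375, 23411]    # true (expected): correct, incorrect
x2, p_value = pearson_chi_square(clear, lth)
print(f"X2 = {x2:.2f}, p = {p_value:.4f}")
```

The `erfc` shortcut holds only for one degree of freedom; more categories would need the general chi-square tail.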
8. McNemar's test
• McNemar's test
• Applied to 2×2 contingency tables with binary outputs.
• Non-parametric test: no distribution is assumed.
• Null hypothesis: p(b) = p(c)
              Method 1:+   Method 1:-
Method 2:+    a            b
Method 2:-    c            d
• Example
              Clear 1:+   Clear 1:-   Total
LTH 2:+       138,402     3,973       142,375
LTH 2:-       4,329       19,082      23,411
Total         142,731     23,055      165,786
• Calculate X2-score: $X^2 = \frac{(b - c)^2}{b + c}$ (one degree of freedom).
• Use the X2-score to find p-value.
• p < 0.0001 → the difference is statistically significant with over 99.99% confidence.
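Only the two discordant cells b and c enter McNemar's statistic, which the sketch below makes explicit. It uses the standard form (b - c)^2 / (b + c) with one degree of freedom; whether the slide applied a continuity correction is not stated, so that is exposed as an optional flag. The function name `mcnemar` is hypothetical.

```python
import math

def mcnemar(b, c, continuity_correction=False):
    """Return (X2, p-value) from the discordant counts b and c."""
    diff = abs(b - c)
    if continuity_correction:
        # Optional correction: shrink the difference by 1 before squaring.
        diff = max(diff - 1, 0)
    x2 = diff ** 2 / (b + c)
    # One degree of freedom: the chi-square tail equals erfc(sqrt(X2 / 2)).
    return x2, math.erfc(math.sqrt(x2 / 2))

# b: LTH correct, Clear incorrect; c: LTH incorrect, Clear correct.
x2, p_value = mcnemar(3973, 4329)
print(f"X2 = {x2:.2f}, p = {p_value:.6f}")
```

Either variant keeps the p-value below 0.0001 for these counts, in line with the slide's conclusion.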