This document evaluates statistical significance tests for information retrieval evaluation, comparing their power, their safety, and how closely they maintain the nominal error rate. The analysis draws on over 60 million p-values computed when comparing 110 retrieval system runs on TREC topics. The t-test produced the fewest errors but had lower power, while the Wilcoxon test came closest to the nominal error rate but had lower power and made more errors than the other tests. The permutation test offered the best balance of power and safety for practical use in information retrieval evaluation.
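
To make the kind of test under comparison concrete, here is a minimal sketch of a paired permutation (randomization) test on per-topic scores. It is not the implementation used in the study. The function name `paired_permutation_test`, its parameters, the sign-flipping Monte Carlo approximation, and the example scores are all assumptions chosen for illustration.

```python
import random


def paired_permutation_test(scores_a, scores_b, trials=10000, seed=0):
    """Estimate a two-sided p-value for the mean per-topic score difference.

    Hypothetical helper: approximates the paired permutation test by
    randomly flipping the sign of each per-topic difference.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems need one score per topic")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)  # fixed seed so repeated calls agree
    extreme = 0
    for _ in range(trials):
        # Under the null hypothesis the two system labels are exchangeable
        # within a topic, so each difference may carry either sign.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Small tolerance guards against floating-point rounding in ties.
        if abs(total / n) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (trials + 1)


# Illustrative, made-up average precision scores for two systems on 8 topics.
system_a = [0.42, 0.31, 0.55, 0.28, 0.61, 0.47, 0.39, 0.52]
system_b = [0.38, 0.29, 0.49, 0.30, 0.54, 0.41, 0.35, 0.50]
p_value = paired_permutation_test(system_a, system_b)
print(f"estimated p-value: {p_value:.4f}")
```

The sketch samples sign assignments rather than enumerating all of them, so the result is an estimate whose precision depends on `trials`. A difference would be declared significant when the estimate falls below the chosen significance level.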