Pitfalls of statistical hypothesis testing: multiple testing
BMJ 2014; 349 doi: https://doi.org/10.1136/bmj.g5310 (Published 29 August 2014) Cite this as: BMJ 2014;349:g5310
All rapid responses
I thank Lytsy[1] and Chiolero[2] for their responses to my article on multiple statistical hypothesis testing.[3]
Lytsy suggests that statement a) may be only probably true rather than, as stated, unequivocally true, but provides no justification for this suggestion. In the example used, the researchers reported a statistically significant difference between treatment groups in mean BMI at age 2 years. If no difference in mean BMI at age 2 years had existed between the treatment groups in the population (statement a), then by definition a type I error would have occurred. Statement a is therefore unequivocally, and not merely probably, true.
Chiolero makes an interesting point about the importance of presenting the observed differences between treatment groups in the outcome measures, in addition to the results of statistical hypothesis testing. This would allow the treatment effects to be evaluated in terms of clinical significance and not hypothesis testing alone. Such a table was constructed, but it was decided not to include it in the answers because it contributed little to the topic of the endgame; the P-values for the individual statistical tests were not presented for the same reason. The article addressed the pitfalls of multiple statistical hypothesis testing, not the advantages and disadvantages of hypothesis testing per se, and Chiolero’s response may have missed this point. Statistical hypothesis testing is a requirement of all journals, and the aim of the article was to raise awareness about multiple testing. Previous endgame questions have addressed the advantages and disadvantages of hypothesis testing, in particular the comparison of statistical significance with clinical significance, the smallest effect of clinical interest, and the impact that overpowered studies have on statistical significance.[4][5][6] Readers can refer to these articles to explore the topic further.
1. Lytsy P. Pitfalls of statistical hypothesis testing: multiple testing. Rapid response, 3 September 2014.
2. Chiolero A. Multiple comparison: too much weight to the statistical significance? Rapid response, 5 September 2014.
3. Sedgwick P. Pitfalls of statistical hypothesis testing: multiple testing. BMJ 2014;349:g5310.
4. Sedgwick P. The importance of statistical power. BMJ 2013;347:f6282.
5. Sedgwick P. Clinical significance versus statistical significance. BMJ 2014;348:g2130.
6. Sedgwick P. Understanding why “absence of evidence is not evidence of absence”. BMJ 2014;349:g4751.
Competing interests: No competing interests
In the example used by the author to address the issue of multiple testing, the actual differences between the groups, that is, the effect size of the intervention, were not reported. Unfortunately, this gives too much weight to the statistical significance of the differences. It is, however, of major importance to evaluate the actual difference between the groups, especially when multiple comparisons are made. That would help gauge the potential clinical importance of the intervention evaluated.
Competing interests: No competing interests
Interesting as always, but strictly speaking: is statement a) really true? Or only probably true?
Competing interests: No competing interests
On some curious statements about Type I error rates
Sedgwick [1] computed the familywise Type I error rates for families of two, three, four, and five unadjusted orthogonal tests to be 0.098, 0.143, 0.186, and 0.226, respectively, assuming true null hypotheses and a comparisonwise alpha level of 0.05. Those estimates are correct (notwithstanding a trivial rounding mistake). However, a statement that follows shortly thereafter is patently false: “When outcomes are not independent, the probability of a significant result after multiple hypothesis testing increases further and will be greater than indicated above.” On the contrary, the opposite is generally true: under the null hypothesis, the familywise Type I error rate monotonically decreases (toward the comparisonwise alpha level) as positive dependence among the test statistics increases.[2, 3]
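As a quick arithmetic check (added here for illustration, not part of the original letter), the quoted familywise rates follow from P(at least one Type I error) = 1 − (1 − alpha)^k for k independent tests of true nulls at alpha = 0.05:

```python
# Familywise Type I error rate for k independent tests of true null hypotheses,
# each conducted at the comparisonwise level alpha.
alpha = 0.05
for k in (2, 3, 4, 5):
    fwer = 1 - (1 - alpha) ** k
    print(f"k={k}: {fwer:.4f}")   # 0.0975, 0.1426, 0.1855, 0.2262
```

Rounded to three decimal places these are 0.098, 0.143, 0.185, and 0.226; the value quoted for four tests (0.186 rather than 0.185) is presumably the trivial rounding mistake noted above.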
Negative dependence can increase the familywise Type I error rate relative to independence, but generally only to a negligible extent.[2] Note also that for two-sided tests, negative dependence among test statistics is typically not plausible. Moreover, it is clear from the context that Sedgwick was referring to positively dependent two-sided tests.
Sedgwick reasoned that for highly correlated outcomes, when one significant result occurs under the null hypothesis, “others will be likely to occur too.” That is true enough, but by the same token, when one non-significant result occurs in the given scenario, others will be more likely to be non-significant too. Hence, dependence does not ultimately affect the number of Type I errors expected to occur in the long term (for unadjusted or Bonferroni-adjusted tests).[3] In fact, positive dependence reduces the probability of Type I error in a given study, because errors will tend to occur simultaneously and hence be concentrated into a smaller proportion of studies--thereby making studies that contain “at least one” Type I error less frequent.
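The concentration argument can also be seen in a short simulation. The following is a minimal sketch (not from the cited references), assuming five two-sided z tests with equicorrelated standard normal test statistics and all null hypotheses true: the expected number of Type I errors stays at 5 × 0.05 = 0.25 whatever the correlation, while the probability of at least one error falls as the correlation rises.

```python
# Monte Carlo sketch: positive dependence leaves the expected number of Type I
# errors unchanged but lowers the probability of "at least one" Type I error.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, alpha, n_sims = 5, 0.05, 200_000
z_crit = stats.norm.ppf(1 - alpha / 2)           # two-sided critical value

for rho in (0.0, 0.5, 0.9):                      # equicorrelation of the test statistics
    cov = np.full((k, k), rho) + (1 - rho) * np.eye(k)
    z = rng.multivariate_normal(np.zeros(k), cov, size=n_sims)  # all nulls true
    rejections = np.abs(z) > z_crit
    print(f"rho={rho:.1f}  mean Type I errors={rejections.sum(axis=1).mean():.3f}  "
          f"P(at least one)={rejections.any(axis=1).mean():.3f}")
```

Under these assumptions the first column stays at about 0.25 for every value of rho, while the second falls from about 0.23 towards the comparisonwise level of 0.05 as the correlation approaches 1, consistent with the monotone decrease described above.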
There are additional curious statements in the article. For example, what does it mean to say the Bonferroni procedure’s disadvantage is that “it errs on the side on [sic] non-significance?” It was also suggested that if a large number of tests are conducted, “ultimately some of these will result in a type I error”--which is not necessarily the case. Perhaps Sedgwick meant that of all the experiments with large numbers of hypotheses, some are bound to produce Type I errors. That is true enough, but the same could be said of experiments with single hypotheses. Suffice it to say, the article would benefit from some corrections and clarifications.
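On the Bonferroni point, one possible reading of “errs on the side on [sic] non-significance” is simply that the procedure is conservative. A hypothetical illustration (again my own, under the same independence and true-null assumptions as above): testing each of k hypotheses at level alpha/k keeps the familywise rate at or below alpha, and strictly below it.

```python
# Familywise error rate of Bonferroni-adjusted independent tests of true nulls:
# each test is run at alpha/k, so P(at least one rejection) = 1 - (1 - alpha/k)^k <= alpha.
alpha = 0.05
for k in (2, 5, 10, 50):
    fwer = 1 - (1 - alpha / k) ** k
    print(f"k={k}: per-test level={alpha / k:.4f}, familywise rate={fwer:.4f}")
```

For k = 5, for example, the familywise rate is about 0.049 rather than the nominal 0.05, and it is lower still when the tests are positively dependent.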
References
1. Sedgwick P. Pitfalls of statistical hypothesis testing: multiple testing. BMJ 2014;349:g5310.
2. Dmitrienko A, Bretz F, Westfall PH, et al. Multiple testing methodology. In: Dmitrienko A, Tamhane AC, Bretz F, eds. Multiple Testing Problems in Pharmaceutical Statistics. Boca Raton, FL: Chapman & Hall; 2010:35-98.
3. Ryan TA. Multiple comparisons in psychological research. Psychol Bull 1959;56:26-47.
Competing interests: No competing interests