Multiple significance tests: the Bonferroni correction
BMJ 2012; 344 doi: http://dx.doi.org/10.1136/bmj.e509 (Published 25 January 2012) Cite this as: BMJ 2012;344:e509
- Philip Sedgwick, senior lecturer in medical statistics
- Centre for Medical and Healthcare Education, St George’s, University of London, Tooting, London, UK
Researchers assessed the effects of hormone replacement therapy, consisting of combined oestrogen and progestogen, on health related quality of life. A randomised, placebo controlled, double blind trial design was used. Women were recruited if they were postmenopausal, had a uterus, and were aged 50-69 at randomisation. Outcome measures included health related quality of life and psychological wellbeing. The study period was one year.1
The researchers investigated the effects of combined hormone replacement therapy compared with placebo at one year, using a 0.05 (5%) critical level of significance and adjusting this with the Bonferroni correction. The researchers concluded that combined hormone replacement therapy started many years after the menopause can improve health related quality of life.
Which of the following errors does the Bonferroni correction make less likely to occur?
a) Type I error
b) Type II error
The Bonferroni correction reduces the probability of making a type I error (answer a) but not a type II error (answer b).
Combined hormone replacement therapy was compared with placebo using statistical hypothesis testing, the purpose of which was to make inferences about the population on the basis of the sample. However, if the sample was not representative of the population then errors could have been committed in the hypothesis testing. Two types of error were possible, type I and II, described in a previous question.2 The purpose of the Bonferroni correction was to limit the probability of committing a type I error (answer a).
Type I and II errors would both result in an incorrect inference being made about the effectiveness of combined hormone replacement therapy. A type I error would occur if the null hypothesis was incorrectly rejected in favour of the alternative; that is, if there was a difference in outcome between combined hormone treatment and placebo in the sample but not in the population. A type I error would occur because of sampling error: only a proportion of the population was studied, possibly resulting in an unrepresentative sample. Sampling error can also result in a type II error, which occurs when the null hypothesis is not rejected in favour of the alternative when it should have been; that is, there is a difference in outcome in the population between combined hormone treatment and placebo but the difference was not seen in the sample. However, the Bonferroni correction does not limit the probability of a type II error occurring (answer b is false). Sampling error can be reduced by increasing the sample size, which gives a more representative sample and therefore increases the power of the statistical test.3
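The link between sample size, the significance level, and power can be illustrated with a rough calculation. The sketch below is not the analysis used in the trial: it assumes a two sided, two sample z test with equal group sizes and a known common standard deviation, and the function name `power_two_sample_z` and the standardised effect size of 0.3 are hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample_z(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two sided, two sample z test.

    effect_size is the true difference in means divided by the common
    standard deviation (an assumed, standardised quantity).
    """
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)  # two sided critical value
    shift = effect_size * sqrt(n_per_group / 2)  # mean of the test statistic
    # Probability that the test statistic falls in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical effect size, chosen only to show the trend.
for n in (50, 100, 200, 400):
    unadjusted = power_two_sample_z(0.3, n, alpha=0.05)
    corrected = power_two_sample_z(0.3, n, alpha=0.05 / 41)
    print(f"n per group = {n:>3}: power {unadjusted:.2f} at 0.05, {corrected:.2f} at 0.05/41")
```

Under these assumptions, power rises as the number of participants per group increases, and for any fixed sample size it is lower at the stricter threshold of 0.05 ÷ 41 than at 0.05. The probability of a type II error is one minus the power.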
For each hypothesis test in the study, the P value can be understood by hypothetically repeating the study an infinite number of times under the assumption that the null hypothesis is true. The P value is the proportion of these hypothetical studies that would have produced a test statistic at least as large in absolute value as the one calculated in the above study. The critical level of significance was set at 0.05 (5%). Therefore, for each hypothesis test the null hypothesis would be rejected in favour of the alternative for those 5% of the infinite number of studies with the largest test statistics; hence, for any single hypothesis test, the maximum probability of rejecting a true null hypothesis was 0.05. Since any hypothesis test could result in a type I error, the probability of it occurring for each test was at most 0.05. When multiple hypothesis tests are performed, the probability of at least one type I error occurring across the tests is greater than 0.05.4
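How quickly the probability of at least one type I error grows can be shown with a short calculation. The sketch below assumes that the tests are independent and that every null hypothesis is true, neither of which is stated in the study; the function name `familywise_error_rate` is a hypothetical label for illustration.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one type I error across n_tests tests.

    Assumes the tests are independent, each is performed at level alpha,
    and every null hypothesis is true.
    """
    # (1 - alpha) ** n_tests is the chance that no test rejects wrongly.
    return 1.0 - (1.0 - alpha) ** n_tests


for m in (1, 5, 10, 41):
    print(f"{m:>2} tests at 0.05: {familywise_error_rate(0.05, m):.3f}")
```

With a single test the probability equals the critical level of 0.05. Under these assumptions it rises to roughly 0.88 for 41 tests, so at least one falsely significant result would be more likely than not.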
Care must be taken when studies undertake a large number of statistical tests, because the more tests that are performed, the more likely it is that some will result in a type I error. However, we will not know which significant findings are type I errors. Various approaches have been suggested to reduce the number of type I errors when undertaking multiple testing, including the Bonferroni correction.
The Bonferroni correction involved adjusting the critical significance level of 0.05 by dividing it by the number of statistical tests performed. The researchers reported performing 41 statistical tests, so statistical significance was achieved if P was less than 0.05 ÷ 41, or approximately 0.001. The correction is conservative and is not recommended if a large number of tests are performed, since few if any tests will be significant after the correction has been applied.
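The arithmetic of the correction can be sketched in a few lines. The critical level of 0.05 and the 41 tests come from the study as described above; the function names and the example P values are hypothetical and invented purely for illustration.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def is_significant(p_value: float, alpha: float, n_tests: int) -> bool:
    """True if p_value falls below the Bonferroni corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


threshold = bonferroni_threshold(0.05, 41)
print(f"corrected threshold: {threshold:.5f}")

# Hypothetical P values, invented for illustration only.
example_p_values = [0.0004, 0.003, 0.02, 0.2]
for p in example_p_values:
    print(f"P = {p}: significant after correction? {is_significant(p, 0.05, 41)}")
```

Of the invented P values, only 0.0004 falls below the corrected threshold. The values 0.003 and 0.02 would have been significant at the unadjusted level of 0.05 but are not after correction, which illustrates why the method is described as conservative.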
Competing interests: None declared.