Thomas V Perneger
Institute of Social and Preventive Medicine, University of Geneva, CH-1211 Geneva 4, Switzerland
Correspondence to: Dr Perneger perneger{at}cmu.unige.ch
When more than one statistical test is performed in
analysing the data from a clinical study, some statisticians and
journal editors demand that a more stringent criterion be used for
"statistical significance" than the conventional
P<0.05.1 Many well meaning researchers, eager for
methodological rigour, comply without fully grasping what is at stake.
Recently, adjustments for multiple tests (or Bonferroni adjustments)
have found their way into introductory texts on medical statistics,
which has increased their apparent legitimacy.
This
paper advances the view, widely held by epidemiologists, that
Bonferroni adjustments are, at best, unnecessary and, at worst,
deleterious to sound statistical inference.
Summary points
Adjusting statistical significance for the number of tests that have been performed on study data (the Bonferroni method) creates more problems than it solves
The Bonferroni method is concerned with the general null hypothesis (that all null hypotheses are true simultaneously), which is rarely of interest or use to researchers
The main weakness is that the interpretation of a finding depends on the number of other tests performed
The likelihood of type II errors is also increased, so that truly important differences are deemed non-significant
Simply describing what tests of significance have been performed, and why, is generally the best way of dealing with multiple comparisons
Adjustment for multiple tests
Bonferroni adjustments are based on the following reasoning.1-3 If a null hypothesis is true (for instance, two treatment groups in a randomised trial do not differ in terms of cure rates), a significant difference (P<0.05) will be observed by chance once in 20 trials. This is the type I error, or $\alpha$. When 20 independent tests are performed (for example, study groups are compared with regard to 20 unrelated variables) and the null hypothesis holds for all 20 comparisons, the chance of at least one test being significant is no longer 0.05, but 0.64. The formula for the error rate across the study is $1 - (1 - \alpha)^n$, where $n$ is the number of tests performed. However, the Bonferroni adjustment deflates the $\alpha$ applied to each test, so the study-wide error rate remains at 0.05. The adjusted significance level is $1 - (1 - \alpha)^{1/n}$ (in this case 0.00256), often approximated by $\alpha/n$ (here 0.0025). What is wrong with this statistical approach?
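The figures quoted above can be checked with a few lines of Python. This is only a sketch: the function names are illustrative and do not come from the article, and the tests are assumed to be independent, as in the text. The exact adjusted level is commonly known as the Šidák correction, while the simpler division of the significance level by the number of tests is the Bonferroni bound proper.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive among n independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


def exact_adjusted_alpha(alpha: float, n_tests: int) -> float:
    """Per-test level that keeps the study-wide error rate at alpha."""
    return 1 - (1 - alpha) ** (1 / n_tests)


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """The usual approximation: divide alpha by the number of tests."""
    return alpha / n_tests


if __name__ == "__main__":
    print(round(familywise_error_rate(0.05, 20), 2))   # study-wide error rate
    print(round(exact_adjusted_alpha(0.05, 20), 5))    # exact adjusted level
    print(bonferroni_alpha(0.05, 20))                  # approximate adjusted level
```

With a significance level of 0.05 and 20 tests, the three calls reproduce the 0.64, 0.00256, and 0.0025 given in the text.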
Problems
Irrelevant null hypothesis
The first problem is that Bonferroni adjustments are concerned
with the wrong hypothesis.4-6 The study-wide error rate
applies only to the hypothesis that the two groups are identical on all 20 variables (the universal null hypothesis). If one or more of the 20 P values is less than 0.00256, the universal null hypothesis is
rejected. We can say that the two groups are not equal for all 20 variables, but we cannot say which, or even how many, variables differ.
Such information is usually of no interest to the researcher, who wants
to assess each variable in its own right. A clinical equivalent would
be the case of a doctor who orders 20 different laboratory tests for a
patient, only to be told that some are abnormal, without further
detail. Thus, Bonferroni adjustments provide a correct answer to a
largely irrelevant question.
Inference defies common sense
Bonferroni adjustments imply that a given comparison will be
interpreted differently according to how many other tests were performed. For example, the difference in remission rates between two
chemotherapeutic treatments could be interpreted as statistically significant or not depending on whether or not survival rates, quality
of life scores, and complication rates were also tested. In a clinical
setting, a patient's packed cell volume might be abnormally low,
except if the doctor also ordered a platelet count, in which case it
could be deemed normal. Surely this is absurd, at least within the
current scientific paradigm. Evidence in data is what the data
say; other considerations, such as how many other tests were performed,
are irrelevant.
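The dependence on the number of tests is easy to make concrete. The sketch below uses a hypothetical P value of 0.02 for the remission comparison; neither the number nor the function name comes from the article.

```python
def significant_after_bonferroni(p_value: float, n_tests: int,
                                 alpha: float = 0.05) -> bool:
    """Apply the Bonferroni threshold alpha / n_tests to a single P value."""
    return p_value < alpha / n_tests


# Hypothetical P value for the difference in remission rates.
p_remission = 0.02

# Reported on its own, the comparison counts as significant.
alone = significant_after_bonferroni(p_remission, n_tests=1)

# With survival, quality of life, and complications also tested (4 tests),
# the same data are held to a threshold of 0.0125 and no longer qualify.
with_three_others = significant_after_bonferroni(p_remission, n_tests=4)

if __name__ == "__main__":
    print(alone, with_three_others)
```

The data for the remission comparison are identical in both calls; only the count of other tests changes the verdict.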
Increase in type II errors
Type I errors cannot decrease (the whole point of Bonferroni
adjustments) without inflating type II errors (the probability of
accepting the null hypothesis when the alternative is
true).4 And type II errors are no less false than type I
errors. In clinical practice, if a high concentration of creatine
kinase were considered compatible with "no myocardial infarction"
by virtue of a Bonferroni adjustment, the patient would be denied
appropriate care. In research, an effective treatment may be deemed no
better than placebo. Thus, contrary to what some researchers believe,
Bonferroni adjustments do not guarantee a "prudent" interpretation
of results.
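The size of this trade-off can be illustrated with a power calculation for a two-sided z test. This is a sketch under stated assumptions: the true effect is taken to be 2.8 standard errors, a hypothetical value chosen because it gives roughly 80% power at the conventional level, and the function name is illustrative.

```python
from statistics import NormalDist


def two_sided_power(effect_in_se: float, alpha: float) -> float:
    """Power of a two-sided z test when the true effect equals
    effect_in_se standard errors."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic falls in either rejection region.
    return (std_normal.cdf(effect_in_se - critical)
            + std_normal.cdf(-effect_in_se - critical))


# Hypothetical true effect of 2.8 standard errors.
power_unadjusted = two_sided_power(2.8, alpha=0.05)
power_adjusted = two_sided_power(2.8, alpha=0.05 / 20)

if __name__ == "__main__":
    print(f"type II error at 0.05:   {1 - power_unadjusted:.2f}")
    print(f"type II error at 0.0025: {1 - power_adjusted:.2f}")
```

Under these assumptions, adjusting for 20 tests lowers the power from about 0.80 to about 0.41, so the type II error rate rises from about 0.20 to about 0.59.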
What tests should be included?
Most proponents of the Bonferroni method would count at least all
the statistical tests in a given report as a basis for adjusting P
values. But how about tests that were performed, but not published, or
tests published in other papers based on the same study? If several
papers are planned, should future ones be accounted for in the first
publication? Should we worry about error rates related to an
investigator (taking the number of tests he or she has done in their
lifetime into consideration6) or error rates related to
journals? Should confidence intervals, which are not statistical tests,
but are often interpreted as such (the confidence interval includes 0, hence the groups do not differ) be counted? No statistical theory
provides answers for these practical issues.

A futuristic scenario
What would happen to biomedical research if Bonferroni adjustments
became routine? Cynical researchers would slice their results like
salami, publishing one P value at a time to escape the wrath of the
statistical reviewer. Idealists would conduct studies to examine only
one association at a time, wasting time, energy, and public money.
Meta-analysts would go out of business, since a pooled analysis
would invalidate retrospectively all original findings by adding more
tests to be adjusted for. Journals would have to create a new
section entitled "P value updates," in which P values of previously
published papers would be corrected for newly published tests based on
the same study. And so on ....
Back to the Neyman-Pearson theory
These objections seem so compelling that the reader may wonder why adjustments for multiple tests were developed at all. The answer is that such adjustments are correct in the original framework of statistical test theory, proposed by Neyman and Pearson in the 1920s.7 This theory was intended to aid decisions in repetitive situations. Imagine that your factory produces light bulbs in lots of 1000, and that testing each bulb before shipment would be impractical. You can decide to test only a sample in each lot, and to reject (literally) any lots in which more than a predefined number (x) of bulbs in the sample are defective. Of course, your decision might be wrong for any particular lot, but the Neyman-Pearson theory provides a decision rule (the number x), so that over many trials your error rates (type I and type II) will be minimised. Now, if for some reason you took 20 samples out of a given lot instead of one, and decided that you would reject the lot if the number of defective bulbs exceeded x in only one sample, you would be much too likely to reject a good lot in error, and a Bonferroni adjustment would restore the original optimal error rates.
The catch is that Neyman and Pearson developed their statistical tests to aid decision making, not to assess evidence in data. The latter practice may be objected to for several reasons (this topic would deserve a discussion of its own), and alternative approaches to statistical inference, such as estimation procedures, use of likelihood ratios, and Bayesian methods, have been proposed.8-11 Bonferroni adjustments follow the original logic of statistical tests as supports of repeated decisions, but they are of little help in determining what the data say in one particular study.
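The light bulb example can be worked through numerically. The sketch below rests on assumptions that are not in the article: a sample of 50 bulbs, a defect rate of 2% in a good lot, and a binomial model that treats the bulbs in a sample as independent draws. The function names are illustrative.

```python
from math import comb


def prob_more_than(x: int, sample_size: int, defect_rate: float) -> float:
    """P(more than x defective bulbs in one sample), binomial model."""
    at_most_x = sum(
        comb(sample_size, k)
        * defect_rate ** k
        * (1 - defect_rate) ** (sample_size - k)
        for k in range(x + 1)
    )
    return 1 - at_most_x


def smallest_threshold(sample_size: int, defect_rate: float,
                       max_false_reject: float) -> int:
    """Smallest x for which a good lot is wrongly rejected
    with probability at most max_false_reject."""
    x = 0
    while prob_more_than(x, sample_size, defect_rate) > max_false_reject:
        x += 1
    return x


def reject_if_any(per_sample_rate: float, n_samples: int) -> float:
    """Chance that at least one of n independent samples exceeds x."""
    return 1 - (1 - per_sample_rate) ** n_samples


# Assumed values: 50 bulbs per sample, 2% defective in a good lot.
SAMPLE_SIZE, GOOD_RATE, ALPHA, N_SAMPLES = 50, 0.02, 0.05, 20

x_single = smallest_threshold(SAMPLE_SIZE, GOOD_RATE, ALPHA)
x_adjusted = smallest_threshold(SAMPLE_SIZE, GOOD_RATE, ALPHA / N_SAMPLES)

# Error rate when the single-sample rule is reused on 20 samples.
naive_rate = reject_if_any(
    prob_more_than(x_single, SAMPLE_SIZE, GOOD_RATE), N_SAMPLES)
# Error rate when the Bonferroni-adjusted rule is used on 20 samples.
adjusted_rate = reject_if_any(
    prob_more_than(x_adjusted, SAMPLE_SIZE, GOOD_RATE), N_SAMPLES)

if __name__ == "__main__":
    print(x_single, round(naive_rate, 2))
    print(x_adjusted, round(adjusted_rate, 3))
```

Under these assumptions, reusing the single-sample rule on 20 samples rejects a good lot roughly 30% of the time, while the adjusted rule brings the rate back under 0.05. This is the repeated-decision setting in which the adjustment does what it was designed to do.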
Should Bonferroni adjustments ever be used?
Statistical adjustments for multiple tests make sense in a few situations. Firstly, the universal null hypothesis is occasionally of interest. For instance, to verify that a disease is not associated with an HLA phenotype, we may compare available HLA antigens (perhaps 40) in a group of cases and controls. If no association existed, at least one test would be significant with a probability of 0.87, and Bonferroni adjustments would protect against making excessive claims. A clinical equivalent is the case of a healthy person undergoing several laboratory tests as part of a general health check. Secondly, adjustments are appropriate when the same test is repeated in many subsamples, such as when stratified analyses (by age group, sex, income status, etc) are conducted without an a priori hypothesis that the primary association should differ between these subgroups. Note that this is the scenario, reminiscent of repeated sampling of the same lot, that Tukey and Bland and Altman use in their justifications of multiple test adjustments. Sequential testing of trial results also falls in this category. A final situation in which Bonferroni adjustments may be acceptable is when searching for significant associations without pre-established hypotheses.
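The 0.87 figure for 40 HLA comparisons can also be checked by simulation. The sketch below assumes independent tests and relies on the fact that, when a null hypothesis is true, the P value of a continuous test statistic is uniformly distributed between 0 and 1. The function name, the number of simulated studies, and the seed are arbitrary choices.

```python
import random


def fraction_with_any_significant(n_tests: int, alpha: float,
                                  n_studies: int = 20000,
                                  seed: int = 0) -> float:
    """Fraction of simulated studies, all under the universal null,
    in which at least one of n_tests P values falls below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Each null P value is drawn uniformly from [0, 1).
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_studies


unadjusted = fraction_with_any_significant(40, alpha=0.05)
adjusted = fraction_with_any_significant(40, alpha=0.05 / 40)

if __name__ == "__main__":
    print(f"unadjusted: {unadjusted:.2f}")
    print(f"Bonferroni: {adjusted:.3f}")
```

The unadjusted fraction should land near 0.87, and the Bonferroni-adjusted fraction just under 0.05, which is the protection against excessive claims described above.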
The best approach
However, even in these situations, simply describing what was done and why, and discussing the possible interpretations of each result, should enable the reader to reach a reasonable conclusion without the help of Bonferroni adjustments. There is an important difference between what the data say and what the researcher (or the reader) believes to be true.8 The latter depends not only on the data at hand but also on considerations such as whether a finding is biologically plausible or whether the significant test was a serendipitous finding in a fishing expedition. The integration of prior beliefs with evidence is best achieved by Bayesian methods, not by Bonferroni adjustments. In summary, Bonferroni adjustments have, at best, limited applications in biomedical research, and should not be used when assessing evidence about specific hypotheses.
Acknowledgments
I thank Dr Richard M Royall, Department of Biostatistics, Johns Hopkins University, for helpful comments on the manuscript.
Funding: Swiss National Science Foundation (PROSPER 3233-32609.91).
Conflict of interest: None.
(Accepted 16 January 1998)