The tyranny of power: is there a better way to calculate sample size?
BMJ 2009; 339 doi: https://doi.org/10.1136/bmj.b3985 (Published 06 October 2009) Cite this as: BMJ 2009;339:b3985
All rapid responses
This was an excellent and important article relating to individual trials. It is also worth considering the use of power calculations in the context of systematic reviews and meta-analyses. In this context, studies with inadequate sample size may reasonably be included in reviews, though sensitivity analyses are required to examine the effect of including or excluding them. In this case the presence of a power calculation in the original trials is treated as a quality or risk of bias issue. It resurrects the question of whether one large RCT, adequately powered for all outcomes, provides better evidence than a systematic review of a large number of smaller RCTs.
Competing interests: None declared
Has anyone noticed that this article is exceptionally well written? Not only is it well organised, structured, and argued, with exactly the right examples in precisely the right places, but the prose is lucid, simple, and spare. It is in fact quite beautiful.
This may not make it right, but it does make it unusual!
Competing interests: Martin taught me statistics nearly thirty years ago
I am an experimental psychologist, so my background will be different from that of most of the BMJ's readership. Nonetheless, all researchers in the health and medical areas must know something about statistical power and its relationship with estimating sample size in the process of obtaining project approval.
Estimating sample size, no matter what techniques are deemed appropriate, is essentially a bootstrap exercise: intending researchers identify published research that is broadly similar to that which they propose to undertake and assume that their proposals will be sufficiently similar for the estimates to be meaningful. I have some questions:
(1) Is the exercise meaningful for unexplored research topics with a
meagre publication base?
(2) If the answer is no, has there been a bias on the part of assessors and researchers towards well-charted topics?
(3) Have young researchers been unduly channelled towards established
areas and perhaps into the large, unwieldy groups that characterise much
modern research?
Competing interests: None declared
In his article, Bland questions the way sample size is currently determined in clinical trials.
At present, clinical trials with a priori determination of an adequate sample size are designed to provide sufficient power to reach significant results. This way of driving clinical research may limit the interpretation of trials of potentially valuable treatments, which could be rejected because of lack of significance. This conservative attitude could lead to the continued use of less effective or potentially harmful treatments.
Moreover, in a recent study[1], we showed that sample size calculations in RCTs are inadequately reported, often erroneous, and based on assumptions that are frequently inaccurate. These results suggest that neither investigators nor reviewers take the reporting of power calculations very seriously. This situation challenges the scientific credibility of trials in which the sample size calculation is poorly reported and makes methodologists doubtful about sample size determination as currently performed.
We join Bland in his wish to change the way sample size is determined, and we agree that using confidence intervals could be one solution. This was already proposed by Goodman in 1994[2] but has never been adopted, probably because it raises a major issue: abandoning the a priori determination of a single primary endpoint could lead to a multiplication of endpoints, with selective reporting of the endpoints for which the most favorable results were obtained. We wonder what the consequences would be of commercializing new drugs on the basis of a posteriori selected endpoints.
In conclusion, we also call for abandoning power calculations, but we think that the price to pay for using confidence intervals is too high: if investigators stop reporting a single primary endpoint, the temptation of data dredging will be strong.
1. Charles P, Giraudeau B, Dechartres A, Baron G, Ravaud P. Reporting of sample size calculation in randomised controlled trials: review. BMJ 2009;338:b1732.
2. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med 1994;121(3):200-6.
Competing interests: None declared
Statistical power: still an essential element in sample size calculation.
Bland has proposed that study sample size should be calculated to achieve an acceptable degree of precision, measured as the width of the confidence interval for the treatment effect [1]. In fact, confidence interval width is already established as the basis of sample size calculations for equivalence studies. For example, assume two treatments are in truth equivalent, with p = 15% of patients in both arms developing the disease under study despite their allocated treatment. The researchers want to know the sample size that will give a 95% confidence interval for the risk difference, of width 2d, that excludes an absolute difference between the treatments of more than 5%. The sample size n for each treatment arm is calculated as:
n = (1/d)² × 2 × p × (100 − p) × (1.96 + 0.84)²
which gives a sample size of 800 per arm. (See page 129 of Pocock's text for further details of sample size calculations for equivalence studies [2].) Now, rearranging the formula for the 95% confidence interval width used by Bland [1], we get:
n = (1/d)² × 2 × p × (100 − p) × 1.96²
where 2d remains the confidence interval width and p is the average treatment response across the two treatment arms. This calculation gives a sample size of 392 per arm. The one clear difference between the two formulae is the additional 0.84 term in the first calculation; this increases the sample size to achieve 80% power. Intuitively, in the context of an equivalence study, high power and the corresponding increase in sample size allow for the chance observation of a treatment difference in the study sample, and improve the chances of demonstrating true equivalence despite that chance difference. Omitting the 0.84 term from the first calculation will give a sample size that achieves 50% statistical power, i.e. a 50:50 chance of demonstrating the true equivalence of the two treatments.
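The arithmetic in these two formulae can be checked directly. The short Python sketch below is not from the original response or from Bland's article; it simply evaluates both expressions with the worked values (p = 15, d = 5), and the function names are labels of my own choosing.

Z_ALPHA = 1.96  # two-sided 5% significance level
Z_BETA = 0.84   # corresponds to 80% power

def n_equivalence(p, d):
    # Equivalence-study sample size per arm, including the power term.
    return (1 / d) ** 2 * 2 * p * (100 - p) * (Z_ALPHA + Z_BETA) ** 2

def n_precision_only(p, d):
    # Sample size per arm aiming only for a 95% CI of half-width d.
    return (1 / d) ** 2 * 2 * p * (100 - p) * Z_ALPHA ** 2

print(round(n_equivalence(15, 5)))      # about 800 per arm
print(round(n_precision_only(15, 5)))   # about 392 per arm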
Now consider the standard sample size calculation for demonstrating a minimum important risk difference d with 80% power at the 5% statistical significance level:
n = (1/d)² × 2 × p × (100 − p) × (1.96 + 0.84)²
To highlight the similarity between the three formulae, p is again the average treatment response across the two treatment arms; in reality the treatment responses in the two arms would be kept distinct. This similarity suggests that the traditional sample size calculation can be considered as aiming for a 95% confidence interval width which will exclude a zero risk difference so long as the true treatment effect is at least that stated as the minimum important difference. The 0.84 term is again increasing the sample size to 800 in each study arm, so achieving 80% power and improving the chances of the 95% confidence interval excluding a zero risk difference when the true risk difference is underestimated in the study sample. By not incorporating power, Bland's approach suggests that much smaller sample sizes are required, with the risk of inconclusive results (i.e. a confidence interval that encompasses conflicting results) when the true treatment effect is underestimated in the study sample.
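The effect of the 0.84 term on power can also be illustrated numerically. The sketch below is again my own illustration rather than part of the original response; it uses the usual normal approximation to estimate the power achieved by each sample size when the true risk difference really is 5%.

import math

def approx_power(n, p, d):
    # Normal-approximation power for detecting a risk difference d (in %)
    # with n patients per arm, average event rate p (in %), two-sided 5% test.
    se = math.sqrt(2 * p * (100 - p) / n)
    z = d / se - 1.96
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF

print(round(approx_power(800, 15, 5), 2))  # about 0.80 for the powered design
print(round(approx_power(392, 15, 5), 2))  # about 0.50 for the precision-only design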
A further difference between the three calculations is the ease with which a value can be given to d. In the first calculation, for equivalence studies, a true treatment difference of zero is assumed and d is the maximum difference which would still allow the two treatments to be considered as clinically equivalent. In the third calculation, for demonstrating differences between treatments, d is the minimum difference between treatments that would be clinically important to detect. Hence, by aiming for 2d as the 95% confidence interval width, that confidence interval will exclude important differences (so demonstrating equivalence) in the first case and a zero difference (so demonstrating a difference) in the third. It is less clear what value of d is desirable when we follow Bland's suggestion [1]. At first sight a 95% confidence interval that excludes unimportant differences appears desirable, but this introduces the difficulty of assuming a true difference between the treatments being compared, with d then being the true difference minus the minimum important difference.
In conclusion, while Bland's proposal for informing sample size calculations with the width of the confidence interval is attractive [1], it is unclear what width of confidence interval we should be aiming for in practice. In any case power remains an essential element of sample size calculations, formally increasing the sample size to allow a treatment effect to be detected even when underestimated in sample data.
[1] Bland JM. The tyranny of power: is there a better way to calculate sample size? BMJ 2009; 339:b3985. doi:10.1136/bmj.b3985.
[2] Pocock SJ. Clinical trials: a practical approach. Chichester: Wiley, 1984.
Competing interests: None declared