Sample size calculations: should the emperor’s clothes be off the peg or made to measure?

BMJ 2012; 345 doi: (Published 23 August 2012)
Cite this as: BMJ 2012;345:e5278

Recent rapid responses

I congratulate Norman and colleagues [1] for making some insightful points in an area dominated by rigid dogma that has little supporting evidence or reasoning. They conclude by endorsing my position that determination of sample size “should explicitly deal with broader ethical issues underlying the choice.” I would like to clarify and emphasize what I advocate.

I’ve argued that a rational choice of sample size should consider the drawbacks of increasing sample size, including higher study costs and burdening more participants, rather than just the implications for power. Neglecting everything except power would only make sense if there were a sudden and substantial increase in a study’s projected scientific or clinical value when the conventional goal of 80% power is reached. Although sample size conventions and thinking have evolved as if this “threshold myth” [2] were true, the reality is very different: increasing sample size always produces diminishing marginal returns for projected study value [2, 3].

This reality is functionally the opposite of the threshold myth, permitting reasonable choices to be determined from costs alone without consideration of power [3]. One such choice is the sample size that minimizes the total study cost divided by sample size, and this choice is guaranteed to be more cost efficient than any larger sample size, i.e., to have a larger ratio of projected study value to total study cost. This will often correspond to the choice suggested by Muller and colleagues [4], because simply doing what is feasible often minimizes total cost divided by sample size. My experience is that this is usually how investigators actually choose a sample size, and it has a rational justification. In contrast, the requirement for 80% power is justified only by “tradition” [5] and implicit acceptance of the threshold myth. It is also poorly defined, because every sample size produces 80% power for some effect size, and how the effect size should be chosen is murky [2] (Dubben’s response [6] notwithstanding).
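
As a minimal sketch of the cost-efficiency choice described above, the Python below searches for the sample size that minimizes total study cost divided by sample size. The cost function and all of its figures (`fixed`, `per_participant`, `difficulty`) are hypothetical assumptions made up for illustration, not values from reference [3]; the quadratic term simply stands in for recruitment becoming harder as the study grows.

```python
def total_cost(n, fixed=50_000.0, per_participant=500.0, difficulty=2.0):
    """Hypothetical total study cost for n participants (illustrative figures only)."""
    # Fixed start-up cost, a constant cost per participant, and a quadratic
    # term standing in for recruitment getting harder as n grows.
    return fixed + per_participant * n + difficulty * n ** 2


def most_cost_efficient_n(cost, n_max):
    """Return the n in 1..n_max that minimizes cost(n) / n."""
    return min(range(1, n_max + 1), key=lambda n: cost(n) / n)


n_star = most_cost_efficient_n(total_cost, n_max=1000)
print(n_star, round(total_cost(n_star) / n_star, 2))
```

With a purely linear cost, cost per participant keeps falling as n grows and the search simply returns `n_max`, which is one way to see why doing what is feasible can coincide with this choice.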

The “expected utility” methods mentioned by Lilford and Chilton [7] are also known as “value of information” methods [8, 9] and have existed for many years [10]. These require quantifying projected study value and knowledge about the state of nature, and they optimize projected value minus study cost, thereby providing a rational balancing of the gains and drawbacks of increasing sample size. These are worthwhile methods when the effort and expertise to implement them are available.
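
A toy sketch of the idea of optimizing projected value minus study cost follows. It is not the Bayesian value-of-information calculation of references [8-10]: the value curve, its parameters (`max_value`, `half_point`), and the cost figures are all hypothetical, chosen only so that value shows diminishing marginal returns.

```python
def projected_value(n, max_value=1_000_000.0, half_point=100.0):
    """Hypothetical projected study value with diminishing marginal returns."""
    return max_value * n / (n + half_point)


def study_cost(n, fixed=20_000.0, per_participant=1_000.0):
    """Hypothetical study cost: start-up cost plus a constant cost per participant."""
    return fixed + per_participant * n


def best_n(value, cost, n_max):
    """Return the n in 1..n_max that maximizes value(n) - cost(n)."""
    return max(range(1, n_max + 1), key=lambda n: value(n) - cost(n))


n_opt = best_n(projected_value, study_cost, n_max=2000)
print(n_opt, round(projected_value(n_opt) - study_cost(n_opt)))
```

Raising `per_participant` in this sketch lowers the optimum, which mirrors the balancing of gains against drawbacks described above.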

Ethical considerations are difficult to factor into choice of sample size because the risks and burdens that participants will shoulder can be difficult to quantify. In principle, these could be added to study cost, and doing so should reduce sample size (see reference [3], Proposition 4). In general, ethical considerations can only exert downward pressure on sample size [11].

Because existing conventions for determining sample size neglect important considerations, their enforcement in the peer-review process generally does not produce genuine scientific or ethical improvement. I hope that this contribution from Norman and colleagues will help to reduce such useless criticism of sample size, at least when it falls within their proposed ranges.

[1] Norman G, Monteiro S, Salama S. Sample size calculations: should the emperor’s clothes be off the peg or made to measure? BMJ 2012; 345:e5278.

[2] Bacchetti P. Current sample size conventions: flaws, harms, and alternatives. BMC Med 2010; 8:17.

[3] Bacchetti P, McCulloch CE, Segal MR. Simple, defensible sample sizes based on cost efficiency. Biometrics 2008; 64:577-585.

[4] Muller S, Hider SL, Helliwell T, Mallen CD. Re: Sample size calculations: should the emperor’s clothes be off the peg or made to measure?

[5] Halpern SD, Karlawish JHT, Berlin JA. The continuing unethical conduct of underpowered clinical trials. JAMA 2002; 288:358-62.

[6] Dubben H-H. Re: Sample size calculations: should the emperor’s clothes be off the peg or made to measure?

[7] Lilford RJ, Chilton P. Re: Sample size calculations: should the emperor’s clothes be off the peg or made to measure?

[8] Willan AR, Pinto EM. The value of information and optimal clinical trial design. Stat Med 2005; 24:1791–1806.

[9] Willan AR. Optimal sample size determinations from an industry perspective based on the expected value of information. Clin Trials 2008; 5:587–594.

[10] Detsky AS. Using cost-effectiveness analysis to improve the efficiency of allocating funds to clinical trials. Stat Med 1990; 9:173–184.

[11] Bacchetti P, Wolf LE, Segal MR, McCulloch CE. Ethics and sample size. Am J Epidemiol 2005; 161:105-110.

Competing interests: None declared

Peter Bacchetti, Professor

University of California, San Francisco, Box 0560, San Francisco, CA 94143, USA


27 September 2012

Norman et al. present an interesting perspective on sample size calculations in clinical studies.

Unfortunately they erroneously describe sample size requirements in multivariate analysis. The correct rule of thumb is that models should be used with a minimum of 10 outcome events (rather than study participants) per included variable, including dummy variables used for categorical data.
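
The rule of thumb can be sketched in a few lines of Python. The helper names and the example study are hypothetical, and counting k - 1 dummy variables for a k-level categorical variable assumes reference-category coding.

```python
def max_model_terms(n_events, events_per_term=10):
    """Largest number of model terms supported by the events-per-variable rule."""
    return n_events // events_per_term


def count_terms(n_continuous, categorical_levels):
    """Count model terms: one per continuous variable, k - 1 dummies per k-level factor."""
    return n_continuous + sum(k - 1 for k in categorical_levels)


# Hypothetical study: 1000 participants but only 60 outcome events.
n_events = 60
terms = count_terms(n_continuous=3, categorical_levels=[2, 4])  # 3 + 1 + 3 = 7 terms
print(max_model_terms(n_events), terms, terms <= max_model_terms(n_events))
```

In this made-up example the limit is set by the 60 events, not by the 1000 participants, so a seven-term model would exceed the rule.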


Concato J, Peduzzi P, Holford TR, et al. Importance of events per independent variable in proportional hazards analysis. I. Background, goals, and general strategy. J Clin Epidemiol 1995;48:1495-501.

Peduzzi P, Concato J, Kemper E, et al. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996;49:1373-9.

Competing interests: None declared

Gordon W Fuller, Doctoral Research Fellow

Trauma Audit and Research Network, Royal Salford Hospital, Salford, UK


Dear Editor,
With respect to the article by Norman et al. [1] we make the following points:

1. We completely agree that a type 2 error of 20%, while the type 1 error is 5%, “is logically unsupportable” as pointed out by us [2] and others.[3]

2. A logical way to choose a sample size is to use expected utility theory to specify a trade-off between magnitude of effect and costs (broadly defined).[4–6]

3. In our opinion, it is not unethical to conduct an ‘underpowered’ study, firstly, because some unbiased evidence is better than none, and secondly, because results can be pooled with those from other studies in a meta-analysis, thereby contributing to narrower confidence limits overall.[7]

4. Choosing the mean effect observed as a basis for a sample size calculation would not be very helpful in medical topics where it may be close to zero.[8]

5. In many cases in medicine where regression is used, the outcome variable is binary (e.g. stroke rate), in which case much larger samples are required than in the example given.
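
Point 3 can be illustrated with a minimal sketch of inverse-variance pooling. A fixed-effect model is assumed here for simplicity (a random-effects model would generally give wider limits), and the three estimates and standard errors are hypothetical numbers, not data from any cited study.

```python
import math


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Hypothetical effect estimates and standard errors from three small studies.
estimates = [0.20, 0.35, 0.10]
std_errors = [0.50, 0.40, 0.30]

pooled, pooled_se = pool_fixed_effect(estimates, std_errors)
half_width = 1.96 * pooled_se  # approximate 95% confidence half-width
print(round(pooled, 3), round(pooled_se, 3), round(half_width, 3))
```

The pooled standard error is smaller than that of any single study, so each small study narrows the overall confidence limits.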

Yours faithfully,

R. Lilford, P. Chilton

[1] Norman G, Monteiro S, Salama S. Sample size calculations: should the emperor’s clothes be off the peg or made to measure? BMJ. 2012;345:e5278.
[2] Lilford RJ, Johnson N. The alpha and beta errors in randomized trials. NEJM. 1990;322(11):780-1.
[3] Schwartz D, Lellouch J. Explanatory and pragmatic attitudes in therapeutic trials. J Chron Dis. 1967;20:637-648.
[4] Girling AJ, Lilford RJ, Braunholtz DA, Gillett WR. Sample-size calculations for trials that inform individual treatment decisions: a 'true-choice' approach. Clin Trials. 2007;4(1):15-24.
[5] Girling AJ, Freeman G, Gordon JP, Poole-Wilson P, Scott DA, Lilford RJ. Modeling payback from research into the efficacy of left-ventricular assist devices as destination therapy. Int J Technol Assess Health Care. 2007;23(2):269-77.
[6] Claxton K. The irrelevance of inference: A decision-making approach to the stochastic evaluation of health care technologies. J Health Econ. 1999;18:341-364.
[7] Edwards SJ, Lilford RJ, Braunholtz D, Jackson J. Why "underpowered" trials are not necessarily unethical. Lancet. 1997;350(9080):804-7.
[8] Bowater RJ, Lilford RJ. Clinical effectiveness in cardiovascular trials in relation to the importance to the patient of the end-points measured. J Eval Clin Pract. 2011;17(4):547-53.

Competing interests: None declared

Richard J. Lilford, Professor of Clinical Epidemiology

Peter Chilton

University of Birmingham, Edgbaston, Birmingham, West Midlands B15 2TT UK


Wrong question. Wrong answer.

Norman and colleagues pose the wrong question: “How much do you think your treatment will affect systolic blood pressure?”

The crucial question is: Which minimum effect size is relevant?

And then: Which probabilities of type I and type II errors are tolerable? These are not necessarily off-the-peg 0.05 and 0.2! They should be sensibly chosen (made to measure!) like the sensitivity and specificity of many diagnostic tests. It is important to calculate custom-tailored sample sizes, even if it is sometimes difficult, to avoid unethical studies with inadequate power.
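
A made-to-measure calculation of this kind can be sketched as follows, assuming a two-sided comparison of two group means and the usual normal approximation (a t-based calculation would give slightly larger numbers). The function name and the example effect size of half a standard deviation are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(min_effect, sd, alpha=0.05, beta=0.20):
    """Approximate n per group for a two-sided, two-group comparison of means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_beta = z.inv_cdf(1 - beta)        # power = 1 - beta
    return math.ceil(2 * (sd * (z_alpha + z_beta) / min_effect) ** 2)


# Hypothetical minimum relevant effect of half a standard deviation.
print(n_per_group(0.5, 1.0))                         # off-the-peg 0.05 / 0.20
print(n_per_group(0.5, 1.0, alpha=0.05, beta=0.05))  # stricter type II error
```

Tightening either error probability, or lowering the minimum relevant effect, increases the required sample size.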

Competing interests: None declared

Hans-Hermann Dubben, Senior scientist

University Medical Centre Hamburg-Eppendorf, Martinistrasse 52, 20246 Hamburg, Germany


Norman and colleagues raise an important issue that we all know to be true – sample size calculations are at best an educated guess. Whilst their suggestion is potentially helpful for situations where an idea of sample size might be gleaned, we have an example of a recent study where we had very little information at all.

The study concerned was a primary care inception cohort of a relatively rare and little researched condition. Initially, we attempted to conduct a formal sample size calculation. We adopted standard values for alpha and beta. After that, however, we were truly making up the values to put into the formula. We had multiple research questions (this is a cohort, not a trial) and there is no agreed definition of outcome. This was a primary care study, and virtually all previous research in this condition has been in secondary care, so we had no best guess of potential rates of any outcome.

We did, however, have an idea of the sample size we might be able to recruit. Combining the estimated incidence rate from database studies with a feasible number of practices, the time allotted to us by the funders, and our experience of response rates, we estimated that we could have a reasonable number of people in which to attempt to answer the research questions of interest.
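
The arithmetic behind such a feasibility estimate might look like the sketch below. The function, its parameter names, and every input value are hypothetical placeholders; none of them are figures from the study described.

```python
def expected_recruits(incidence_per_1000_py, practices, patients_per_practice,
                      years, response_rate):
    """Expected number recruited = incident cases in the catchment x response rate."""
    person_years = practices * patients_per_practice * years
    incident_cases = incidence_per_1000_py / 1000.0 * person_years
    return incident_cases * response_rate


# All inputs are hypothetical placeholders, not figures from the study described.
print(expected_recruits(incidence_per_1000_py=1.0, practices=10,
                        patients_per_practice=10_000, years=2, response_rate=0.5))
```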

Considering all this, we decided to be honest with the ethics committee. We submitted our calculation of the number we might sensibly hope to recruit, along with an explanation of the lack of formal sample size calculation. At the committee meeting, we were questioned as to the lack of ‘statistics’, but on our reinforcing the lack of information with which to perform a calculation, the committee were happy to grant approval.

We appreciate that our approach is by no means perfect, and raises issues regarding the statistical power that we have to answer our research questions. Also, it is not a trial and so does not have the potential to expose patients to, or withhold, treatments unnecessarily. We hope, though, that our success in taking this approach might encourage others to consider something similar should the need arise.

Competing interests: None declared

Sara Muller, Research Associate

Samantha L Hider, Toby Helliwell, Christian D Mallen

Arthritis Research UK Primary Care Centre, Primary Care Sciences, Keele University, Keele, Staffordshire, ST5 5BG


Norman and colleagues criticise "conventional sample size calculations, based on guesses about statistical parameters" because they "are subject to large uncertainties". They then propose that normative ranges of sample sizes for common research designs would be more sensible.

To support this they provide a Table with sample sizes for various combinations of relative risk reduction and base rate. The sample sizes vary by a factor of one thousand. So, based on guesses about likely relative risk and base rate, the normative approach to sample size estimation is subject to large uncertainty.
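
How strongly the answer depends on the guessed inputs can be seen in a rough sketch using the standard normal-approximation formula for comparing two proportions. The grid of base rates and relative risk reductions is hypothetical and is not the table from the article; the conventional 0.05 and 0.20 error rates are assumed.

```python
import math
from statistics import NormalDist


def n_per_arm(base_rate, rrr, alpha=0.05, beta=0.20):
    """Approximate n per arm to detect a relative risk reduction in a binary outcome."""
    p1 = base_rate
    p2 = base_rate * (1 - rrr)
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(1 - beta)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(z_sum ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical grid of guesses; these are not the values in the article's table.
base_rates = [0.01, 0.05, 0.20, 0.50]
reductions = [0.10, 0.25, 0.50]

sizes = {(p, r): n_per_arm(p, r) for p in base_rates for r in reductions}
for (p, r), n in sizes.items():
    print(f"base rate {p:.2f}, RRR {r:.2f}: {n} per arm")
print("largest / smallest:", round(max(sizes.values()) / min(sizes.values())))
```

Across this made-up grid the largest required sample size exceeds the smallest by more than a factor of one thousand.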

Whether off-the-peg or made-to-measure, the emperor's new clothes seem not to be fit for purpose.

Competing interests: None declared

Paul D.P. Pharoah, Public health doctor

University of Cambridge, Strangeways Research Laboratory, Worts Causeway, Cambridge, CB1 8RN
