Analysis of continuous data from small samples
BMJ 2009; 338 doi: http://dx.doi.org/10.1136/bmj.a3166 (Published 06 April 2009) Cite this as: BMJ 2009;338:a3166
- J Martin Bland, professor of health statistics1,
- Douglas G Altman, professor of statistics in medicine2
- 1Department of Health Sciences, University of York, York YO10 5DD
- 2Centre for Statistics in Medicine, University of Oxford, Wolfson College Annexe, Oxford OX2 6UD
- Correspondence to: Professor Bland
Studies with small numbers of measurements are rare in the modern BMJ, but they used to be common and remain plentiful in specialist clinical journals. Their analysis is often more problematic than that for large samples.
Parametric methods, including t tests, correlation, and regression, require the assumption that the data follow a normal distribution and that variances are uniform between groups or across ranges.1 In small samples these assumptions are particularly important, so this setting seems ideal for rank (non-parametric) methods, which make no assumptions about the distribution of the data; they use the rank order of observations rather than the measurements themselves.1 Unfortunately, rank methods are least effective in small samples. Indeed, for very small samples, they cannot yield a significant result whatever the data. For example, when using the Mann-Whitney test for comparing two samples of fewer than four observations, a statistically significant difference is impossible: any data give P>0.05. Similarly, the Wilcoxon paired test, the sign test, and Spearman’s and Kendall’s rank correlation coefficients cannot produce P<0.05 for fewer than six observations. Methods based on the t distribution do not have this problem and can detect differences in samples as small as two for paired differences and three for two groups, or detect correlations in samples of three.
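The floor on P for rank tests can be checked by counting. The sketch below is illustrative rather than part of the original analysis: it assumes exact two-sided tests with no ties or zero differences, and the function names are hypothetical. The most extreme possible data (complete separation of the two groups, or all paired differences in the same direction) correspond to just two of the equally likely rank arrangements under the null hypothesis, so the smallest attainable P is two divided by the number of arrangements.

```python
from math import comb


def min_p_mann_whitney(n1, n2):
    """Smallest exact two-sided P for the Mann-Whitney test.

    Assumes no ties. Under the null hypothesis each of the C(n1+n2, n1)
    ways of allocating ranks to the first group is equally likely, and
    complete separation corresponds to 2 of them (one in each direction).
    """
    return min(1.0, 2 / comb(n1 + n2, n1))


def min_p_paired_rank(n):
    """Smallest exact two-sided P for the Wilcoxon paired test or sign test.

    Assumes n non-zero, untied differences. Each of the 2**n sign patterns
    is equally likely under the null, and "all the same sign" accounts
    for 2 of them.
    """
    return min(1.0, 2 / 2 ** n)


if __name__ == "__main__":
    for n in (3, 4):
        print(f"Mann-Whitney, {n} per group: min P = {min_p_mann_whitney(n, n):.3f}")
    for n in (4, 5, 6):
        print(f"Paired test, {n} differences: min P = {min_p_paired_rank(n):.3f}")
```

Under these assumptions, two groups of three give a minimum P of 0.1, and five paired differences give 0.0625, so neither can reach P<0.05; two groups of four (about 0.029) and six paired differences (about 0.031) can.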
For example, we were recently asked about the data in table 1, which shows before and after measurements of pudendal nerve terminal motor latency. Should we use the Wilcoxon or the sign test? MB replied that the Wilcoxon would be acceptable, giving P<0.05 (actually P=0.047), and so would the paired t test, which gave P=0.04. The questioner also asked whether the Wilcoxon test could be used for the second group of four observations alone, for patients who had received a slightly different intervention. Here all the differences are in the same direction, but the Wilcoxon test gives P=0.125. It is not possible for it to give a significant difference. The paired t test gives P=0.04, a significant difference.
On the other hand, using t methods when their assumptions are greatly violated can also be misleading. Table 2 shows concentration of antibody to type II group B Streptococcus in 20 volunteers before and after immunisation.2 3 The comparison of the antibody levels was summarised in the report of this study as “t=1.8; P>0.05”. The paired t test is not suitable for these data, because the differences clearly have a very skewed distribution. There are 8 zero differences, forming a clump at one end of the distribution, which would remain whatever transformation we used. We could consider the Wilcoxon paired sample test, but this method assumes that the differences have a symmetrical distribution, which they do not. The sign test is preferred here; it tests the null hypothesis that non-zero differences are equally likely to be positive or negative, using the binomial distribution. We have 1 negative and 11 positive differences, which gives P=0.006. Hence the original authors failed to detect a difference because they used an inappropriate analysis.
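The sign test result quoted above can be reproduced from the counts alone. The following is a minimal sketch, assuming an exact two-sided test that discards the zero differences and doubles the smaller binomial tail; the function name is hypothetical and the code is not taken from the original report.

```python
from math import comb


def sign_test_p(n_pos, n_neg):
    """Exact two-sided sign test P from counts of positive and negative differences.

    Zero differences are assumed to have been discarded already. Under the
    null hypothesis the number of negative differences follows a
    Binomial(n, 0.5) distribution, where n = n_pos + n_neg.
    """
    n = n_pos + n_neg
    k = min(n_pos, n_neg)
    # Probability of a result at least as extreme in one direction
    one_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double for a two-sided test, capping at 1
    return min(1.0, 2 * one_tail)


if __name__ == "__main__":
    p = sign_test_p(n_pos=11, n_neg=1)
    print(f"P = {p:.3f}")  # P = 0.006
```

With 12 non-zero differences there are 4096 equally likely sign patterns, of which 13 have at most one negative sign, so the two-sided P is 26/4096, which rounds to 0.006.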
We have often come across the idea that we should not use t distribution methods for small samples but should instead use rank based methods. The statement is sometimes that we should not use t methods at all for samples of fewer than six observations.4 But, as we noted, rank based methods cannot produce anything useful for such small samples.
The aversion to parametric methods for small samples may arise from the inability to assess the distribution shape when there are so few observations. How can we tell whether data follow a normal distribution if we have only a few observations? The answer is that we have not only the data to be analysed, but usually also experience of other sets of measurements of the same thing. In addition, general experience tells us that body size measurements are usually approximately normal, as are the logarithms of many blood concentrations and the square roots of counts.
We thank Jonathan Cowley for the data in table 1.
Competing interests: None declared.
Provenance and peer review: Commissioned, not externally peer reviewed.