Comparisons within randomised groups can be very misleading
BMJ 2011; 342 doi: https://doi.org/10.1136/bmj.d561 (Published 06 May 2011) Cite this as: BMJ 2011;342:d561
- J Martin Bland, professor of health statistics (1)
- Douglas G Altman, professor of statistics in medicine (2)
- (1) Department of Health Sciences, University of York, York YO10 5DD
- (2) Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD
- Correspondence to: Professor M Bland
When we randomise trial participants into two or more intervention groups, we do this to remove bias; the groups will, on average, be comparable in every respect except the treatment which they receive. Provided the trial is well conducted, without other sources of bias, any difference in the outcome of the groups can then reasonably be attributed to the different interventions received. In a previous note we discussed the analysis of those trials in which the primary outcome measure is also measured at baseline. We discussed several valid analyses, observing that “analysis of covariance” (a regression method) is the method of choice.1
Rather than comparing the randomised groups directly, however, researchers sometimes look at the change in the measurement between baseline and the end of the trial; they test whether there was a significant change from baseline, separately in each randomised group. They may then report that this difference is significant in one group but not in the other, and conclude that this is evidence that the groups, and hence the treatments, are different. One such example was a recent trial in which participants were randomised to receive either an “anti-ageing” cream or the vehicle as a placebo.2 A wrinkle score was recorded at baseline and after six months. The authors gave the results of significance tests comparing the score with baseline for each group separately, reporting the active treatment group to have a significant difference (P=0.013) and the vehicle group not (P=0.11). Their interpretation was that the cosmetic cream resulted in significant clinical improvement in facial wrinkles. But we cannot validly draw this conclusion, because the lack of a significant difference in the vehicle group does not provide good evidence that the anti-ageing product is superior.3
The essential feature of a randomised trial is the comparison between groups. Within group analyses do not address a meaningful question: the question is not whether there is a change from baseline, but whether any change is greater in one group than the other. It is not possible to draw valid inferences by comparing P values. In particular, there is an inflated risk of a false positive result, which we shall illustrate with a simulation.
The table shows simulated data for a randomised trial with two groups of 30 participants. Data were drawn from the same population, so there is no systematic difference between the two groups. The true baseline measurements had a mean of 10.0 with standard deviation (SD) 2.0, and the outcome measurement was equal to the baseline plus an increase of 0.5 and a random element with SD 1.0. The difference between mean outcomes is –0.22 (95% confidence interval –0.75 to 0.34, P=0.5), adjusting for the baseline by analysis of covariance.1 The difference is not statistically significant, which is not surprising because we know that the null hypothesis of no difference in the population is true. If we compare baseline with outcome for each group using a paired t test, however, the difference is statistically significant for group A (P=0.03) but not for group B (P=0.2). These results are quite similar to those of the anti-ageing cream trial.2
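The data-generating process and the between-group comparison described above can be sketched in Python. This is a minimal illustration, not the authors' code: it assumes normally distributed baselines and random elements (the text gives only means and SDs), uses an arbitrary seed, and the names `simulate_group` and `ancova_difference` are hypothetical. The adjusted difference uses the textbook pooled within-group slope form of analysis of covariance, so its output will not reproduce the figures in the table.

```python
import math
import random
import statistics

def simulate_group(rng, n=30):
    """One group under the null: baseline ~ N(10, 2), outcome = baseline + 0.5 + N(0, 1).
    Normality is an assumption; the article states only the means and SDs."""
    baseline = [rng.gauss(10.0, 2.0) for _ in range(n)]
    outcome = [b + 0.5 + rng.gauss(0.0, 1.0) for b in baseline]
    return baseline, outcome

def _centred_sums(x, y):
    """Means of x and y plus the centred sums of squares and cross-products."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    return mx, my, sxx, sxy, syy

def ancova_difference(xa, ya, xb, yb):
    """Baseline-adjusted difference in mean outcome (B minus A), its SE, and residual df."""
    mxa, mya, sxxa, sxya, syya = _centred_sums(xa, ya)
    mxb, myb, sxxb, sxyb, syyb = _centred_sums(xb, yb)
    sxx, sxy, syy = sxxa + sxxb, sxya + sxyb, syya + syyb
    slope = sxy / sxx                       # pooled within-group slope
    diff = (myb - mya) - slope * (mxb - mxa)
    df = len(xa) + len(xb) - 3              # two group means and one slope estimated
    resid_var = (syy - slope * sxy) / df
    se = math.sqrt(resid_var * (1 / len(xa) + 1 / len(xb) + (mxb - mxa) ** 2 / sxx))
    return diff, se, df

# Two-sided 5% critical value of t with 57 df, hard-coded because the
# standard library has no t distribution; valid only for 30 per group.
T_CRIT_57 = 2.0025

rng = random.Random(2011)                   # arbitrary seed, not from the article
xa, ya = simulate_group(rng)
xb, yb = simulate_group(rng)
diff, se, df = ancova_difference(xa, ya, xb, yb)
low, high = diff - T_CRIT_57 * se, diff + T_CRIT_57 * se
print(f"adjusted difference {diff:.2f} (95% CI {low:.2f} to {high:.2f}), df={df}")
```

Because both groups come from the same population, the interval should include zero in about 95% of such simulated trials.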
We would not wish to draw any conclusions from one simulation. In 1000 runs, the difference between groups had P<0.05 in the analysis of covariance 47 times, or for 4.7% of samples, very close to the 5% we expect. Of the 2000 comparisons between baseline and outcome, 1500 (75%) had P<0.05. In this simulation, where there is no difference whatsoever between the two “treatments,” the probability of a significant difference in one group but not the other was 38%, not 5%. Hence a significant difference in one group but not the other is not good evidence of a significant difference between the groups. Even when there is a clear benefit of one treatment over the other, separate P values are not the way to analyse such studies.4
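A repeated simulation of this kind can be sketched as follows. The code is an illustration under stated assumptions rather than the analysis actually run: values are taken to be normally distributed, the seed is arbitrary, the function names are hypothetical, and significance is judged against a hard-coded critical value instead of a computed P value. Exact counts will differ from those reported, but the proportions should be similar.

```python
import math
import random
import statistics

# Two-sided 5% critical value of t with 29 df (n = 30 pairs), hard-coded
# because the standard library has no t distribution.
T_CRIT_29 = 2.0452

def simulate_group(rng, n=30, shift=0.5):
    """Baseline ~ N(10, 2); outcome = baseline + shift + N(0, 1). Normality is assumed."""
    baseline = [rng.gauss(10.0, 2.0) for _ in range(n)]
    outcome = [b + shift + rng.gauss(0.0, 1.0) for b in baseline]
    return baseline, outcome

def paired_t_significant(baseline, outcome, t_crit=T_CRIT_29):
    """True if the within-group paired t statistic exceeds the critical value."""
    changes = [o - b for b, o in zip(baseline, outcome)]
    se = statistics.stdev(changes) / math.sqrt(len(changes))
    return abs(statistics.fmean(changes) / se) > t_crit

def run_simulation(runs=1000, seed=1):
    """Return (share of within-group tests that are significant,
    share of trials with exactly one significant group)."""
    rng = random.Random(seed)
    significant = 0
    discordant = 0
    for _ in range(runs):
        sig_a = paired_t_significant(*simulate_group(rng))
        sig_b = paired_t_significant(*simulate_group(rng))
        significant += sig_a + sig_b
        discordant += sig_a != sig_b
    return significant / (2 * runs), discordant / runs

share_significant, share_discordant = run_simulation()
print(f"within-group tests significant: {share_significant:.1%}")
print(f"trials with exactly one significant group: {share_discordant:.1%}")
```

Both groups receive the same `shift`, so every discordant pair counted here is a false indication of a treatment difference.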
How many pairs of tests will have one significant and one non-significant difference depends on the size of the change from baseline to final measurement. If the population difference from baseline is very large, nearly all the within group tests will be significant, and if the population difference is small, nearly all tests will be not significant, so there will be few samples with only one significant difference. If the difference is such that half the samples would show a significant change from baseline, as it would be in our simulation if the underlying difference were 0.37 rather than 0.5, we would expect 50% of samples to have just one significant difference.
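The arithmetic behind these figures can be checked with a short calculation. The sketch assumes that the two within-group tests are independent and equally likely to be significant, with probability p, so that exactly one is significant with probability 2p(1 − p). It also approximates the change giving p of about one half as the critical t value multiplied by the standard error of the mean change, treating the SD of the changes as known to be 1.0. The function name is hypothetical.

```python
import math

def prob_exactly_one_significant(p):
    """Probability that exactly one of two independent tests is significant,
    when each is significant with probability p."""
    return 2 * p * (1 - p)

# With 75% of within-group tests significant, as in the simulation:
print(prob_exactly_one_significant(0.75))   # 0.375, reported as 38%

# The maximum is at p = 0.5:
print(prob_exactly_one_significant(0.5))    # 0.5

# Approximate mean change at which about half of the samples reach P<0.05:
# critical t (29 df, two-sided 5%) times SD / sqrt(n), with SD = 1.0 and n = 30.
T_CRIT_29 = 2.0452
threshold_change = T_CRIT_29 * 1.0 / math.sqrt(30)
print(round(threshold_change, 2))           # 0.37
```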
Contributors: JMB and DGA jointly wrote and agreed the text; JMB did the statistical analysis.
Competing interests: All authors have completed the Unified Competing Interest form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous 3 years; no other relationships or activities that could appear to have influenced the submitted work.