CC BY-NC Open access

Credibility of claims of subgroup effects in randomised controlled trials: systematic review

BMJ 2012; 344 doi: (Published 15 March 2012) Cite this as: BMJ 2012;344:e1553

Re: Credibility of claims of subgroup effects in randomised controlled trials: systematic review

Sun and colleagues have investigated the credibility of reported claims of subgroup effects in published randomised controlled trials using specified criteria [1]. The authors considered the critical criteria for evaluating the credibility of an observed subgroup effect to be: use of subgroup variables measured at baseline; pre-specification of subgroup hypotheses; and statistical significance of an interaction test. The first of these is reasonable, as subgroups defined according to post-randomisation characteristics might be influenced by the tested interventions, and so the observed differences may simply be the result of bias. However, the second and third criteria lack a firm conceptual rationale to justify them as stated, and so the conclusions drawn from their analysis may be flawed.

In a framework of hypothesis testing the fundamental question of interest is “given the data, what is the probability that the null hypothesis (or the alternative hypothesis) is true?” It can be shown that, in the absence of bias, this probability depends on the prior probability of the alternative hypothesis being correct (the prior), on the statistical power of the study to detect an effect, and on the P-value [2]. It is this simple observation that should inform the interpretation of reported subgroup effects.

In a randomized trial the prior for the primary endpoint ought to be close to 50 percent – equipoise. Under this prior, a result declared significant at P<0.05 will have a six percent chance of being a false positive if the study power was 80 percent. However, the prior for a subgroup effect is often substantially smaller than 50 percent. Under a prior of 5 percent and the same power, a result declared significant at P<0.05 will have a 54 percent chance of being a false positive.
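The two figures quoted above can be reproduced with a short calculation. The sketch below assumes the false positive report probability formulation described in reference 2, in which the chance that a significant result is a false positive is alpha × (1 − prior) divided by alpha × (1 − prior) + power × prior. The function name and the use of a fixed significance threshold (alpha) in place of the exact P-value are illustrative assumptions, not part of the original letter.

```python
def false_positive_report_probability(prior, power, alpha=0.05):
    """Probability that a result declared significant at `alpha` is a false positive.

    prior: prior probability that the alternative hypothesis is true
    power: probability of a significant result when the alternative is true
    alpha: significance threshold (probability of a significant result under the null)
    """
    false_positives = alpha * (1.0 - prior)  # null true, yet declared significant
    true_positives = power * prior           # alternative true and declared significant
    return false_positives / (false_positives + true_positives)


# Priors taken from the text: equipoise (50 percent) and an unlikely subgroup effect (5 percent).
for prior in (0.50, 0.05):
    fprp = false_positive_report_probability(prior, power=0.80)
    print(f"prior = {prior:.2f}  chance of false positive = {fprp:.2f}")
```

With 80 percent power this prints 0.06 for a prior of 0.50 and 0.54 for a prior of 0.05, matching the six percent and 54 percent stated in the text.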

The prior does not depend directly on either the number of hypotheses tested or on whether or not the hypothesis was pre-specified, although these are related. If one pre-specifies a hypothesis it suggests that there was some rationale for that hypothesis, and therefore the prior for that hypothesis may be higher than that for a post hoc hypothesis. But this may not always apply, as one could pre-specify an extremely unlikely hypothesis, for example the hypothesis that being born under the star sign Libra affects response to a particular drug. On the other hand, evidence from an external source may arise during the conduct of a trial that results in a new hypothesis that is highly plausible, and so, despite being post hoc, the prior may be high. It is apparent that it is not whether or not the hypothesis was pre-specified that is important, but whether or not that hypothesis is likely to be correct.

The P-value, in itself, is an uninformative probability: it is the probability of obtaining data as or more extreme than those observed IF the null hypothesis is correct. The incorrect interpretation of the P-value as the probability of the observed data occurring by chance has been responsible for serious misinterpretation of many of the findings from observational and clinical epidemiology, including the interpretation of subgroup analyses in randomized controlled trials. A similar error arises from interpreting a 95 percent confidence interval as the range over which we are 95 percent certain that the true value will lie. As shown above, when the prior is low, even a result declared significant at P<0.05 may be more likely to be a false positive than a true positive. The P-value can only be interpreted rationally when the prior is taken into account.

Statistical power also needs to be taken into account. A statistically significant result arising from a small study with low power is more likely to be a false positive than a result with the same P-value from a large study. This is particularly relevant for subgroup analyses in which the sample sizes may be considerably smaller than the main study.
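The effect of power can be illustrated in the same way. The sketch below again assumes the false positive report probability formulation from reference 2, holds the prior at 5 percent and the significance threshold at 0.05, and varies the power. The power values of 50 and 20 percent are hypothetical figures chosen for illustration, standing in for an underpowered subgroup analysis; they do not come from the letter.

```python
def false_positive_report_probability(prior, power, alpha=0.05):
    """Probability that a result declared significant at `alpha` is a false positive."""
    false_positives = alpha * (1.0 - prior)  # null true, yet declared significant
    true_positives = power * prior           # alternative true and declared significant
    return false_positives / (false_positives + true_positives)


PRIOR = 0.05  # low prior, as might apply to a subgroup hypothesis

# 0.80 is the power used in the text; 0.50 and 0.20 are illustrative assumptions.
for power in (0.80, 0.50, 0.20):
    fprp = false_positive_report_probability(PRIOR, power)
    print(f"power = {power:.2f}  chance of false positive = {fprp:.2f}")
```

Under these assumptions the chance that a significant result is a false positive rises from about 54 percent at 80 percent power to about 66 percent at 50 percent power and about 83 percent at 20 percent power.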

If they are to be useful and not just a tick-box exercise, any criteria for evaluating reported subgroup effects should explicitly state how those criteria might affect the prior. Only then can the subgroup effect be interpreted.

1. Sun X, Briel M, Busse JW, You JJ, Akl EA, Mejza F, Bala MM, Bassler D, Mertz D, Diaz-Granados N, Vandvik PO, Malaga G, Srinathan SK, Dahm P, Johnston BC, Alonso-Coello P, Hassouneh B, Walter SD, Heels-Ansdell D, Bhatnagar N, Altman DG, Guyatt GH. Credibility of claims of subgroup effects in randomised controlled trials: systematic review. BMJ 2012;344:e1553.

2. Wacholder S, Chanock S, Garcia-Closas M, El Ghormli L, Rothman N. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J Natl Cancer Inst 2004;96(6):434-42.

Competing interests: No competing interests

19 March 2012
Paul D Pharoah
Reader in Cancer Epidemiology
University of Cambridge
Strangeways Research Laboratory, Worts Causeway, Cambridge, CB1 8RN