Education And Debate

Statistics Notes: Interaction 2: compare effect sizes not P values

BMJ 1996;313:808 (Published 28 September 1996)
  1. John N S Matthews, senior lecturer in medical statistics (a)
  2. Douglas G Altman, head (b)
  a. Department of Medical Statistics, University of Newcastle, Newcastle upon Tyne NE2 4HH
  b. ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, PO Box 777, Oxford OX3 7LF
  Correspondence to: Dr Matthews.

    As we have previously described,1 the statistical term interaction relates to the non-independence of the effects of two variables on the outcome of interest. For example, in a controlled trial comparing a new treatment with a standard treatment we may want to examine whether the observed benefit was the same for different subgroups of patients. A common approach to answering this question is to analyse the data separately in each subgroup. Here we illustrate this approach and explain why it is incorrect.

    One of several subgroup analyses in a trial of antenatal steroids for preventing neonatal respiratory distress syndrome2 was performed to see whether the effect of treatment was different in mothers who did or did not develop pre-eclampsia. Among mothers with pre-eclampsia 21.2% (7/33) of babies whose mothers were given dexamethasone developed neonatal respiratory distress syndrome compared with 27.3% (9/33) of babies whose mothers received placebo, giving P = 0.57. Among mothers who did not have pre-eclampsia 7.9% (21/267) of babies in the steroid group and 14.1% (37/262) of babies in the placebo group developed neonatal respiratory distress syndrome, giving P = 0.021.
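
    The two subgroup P values can be checked against the quoted counts. The sketch below assumes they came from a pooled two-proportion z test (equivalent to a Pearson chi-squared test without continuity correction); the Note does not state which test was used, and the function name is illustrative only.

```python
import math

def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided P value from a pooled two-proportion z test (normal approximation)."""
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (events_a / n_a - events_b / n_b) / se
    # erfc(|z| / sqrt(2)) is the two-sided tail area of the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

# Counts from the trial: (babies with RDS, babies) for dexamethasone, then placebo.
p_pre_eclampsia = two_proportion_p_value(7, 33, 9, 33)
p_no_pre_eclampsia = two_proportion_p_value(21, 267, 37, 262)

print(f"pre-eclampsia:    P = {p_pre_eclampsia:.2f}")
print(f"no pre-eclampsia: P = {p_no_pre_eclampsia:.3f}")
```

    Under that assumption the results agree with the quoted P = 0.57 and P = 0.021 to the precision reported.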

    There is a temptation to claim that the difference in P values establishes a difference between subgroups because “there is a treatment effect in mothers without pre-eclampsia but not in those with pre-eclampsia.” This argument is false: the key to realising this is to recall that a statement such as P = 0.57 does not mean there is no difference, merely that we have found no evidence that there is a difference. A P value is a composite which depends not only on the size of an effect but also on how precisely the effect has been estimated (its standard error). So differences in P values can arise because of differences in effect sizes or differences in standard errors or a combination of the two.
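
    The dependence of a P value on precision as well as effect size can be seen by holding the effect fixed and changing only the sample size. The sketch below uses hypothetical event rates (8% v 14%) and hypothetical group sizes, none of which come from the trial, together with a normal approximation for the difference between two proportions.

```python
import math

def two_sided_p(estimate, standard_error):
    """Two-sided P value for estimate / standard_error under a normal approximation."""
    z = estimate / standard_error
    return math.erfc(abs(z) / math.sqrt(2))

def risk_difference_se(p_treated, p_control, n_per_arm):
    """Unpooled standard error of a difference between two proportions."""
    return math.sqrt(p_treated * (1 - p_treated) / n_per_arm
                     + p_control * (1 - p_control) / n_per_arm)

# Hypothetical event rates: the effect size is identical in every scenario.
P_TREATED, P_CONTROL = 0.08, 0.14
effect = P_CONTROL - P_TREATED

p_values = {}
for n_per_arm in (30, 300, 3000):  # hypothetical group sizes
    se = risk_difference_se(P_TREATED, P_CONTROL, n_per_arm)
    p_values[n_per_arm] = two_sided_p(effect, se)
    print(f"n per arm = {n_per_arm:4d}  effect = {effect:.2f}  "
          f"SE = {se:.4f}  P = {p_values[n_per_arm]:.4f}")
```

    The effect is the same in all three rows; only the standard error, and therefore the P value, changes.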

    This is well illustrated by the present example. If we measure treatment effect by the difference in percentages developing neonatal respiratory distress syndrome in the placebo and steroid groups, then the treatment effect among mothers with pre-eclampsia, namely 27.3 - 21.2 = 6.1 percentage points, is very close to the effect among mothers without pre-eclampsia, which is 14.1 - 7.9 = 6.2 percentage points. The difference in P values has arisen because only a small proportion of mothers had pre-eclampsia (66 out of 595), so the former treatment effect is estimated much less precisely than the latter.
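
    The comparison of effect sizes and their precision can be sketched from the trial counts. The code below is one standard approach, not a calculation reported in this Note: it assumes the two subgroups are independent, uses unpooled standard errors with a normal approximation, and compares the two risk differences directly. The function and variable names are illustrative.

```python
import math

def risk_difference(events_treated, n_treated, events_control, n_control):
    """Return (control risk minus treated risk, unpooled standard error)."""
    p_t = events_treated / n_treated
    p_c = events_control / n_control
    se = math.sqrt(p_t * (1 - p_t) / n_treated + p_c * (1 - p_c) / n_control)
    return p_c - p_t, se

# Counts from the trial: dexamethasone first, then placebo.
rd_pre, se_pre = risk_difference(7, 33, 9, 33)        # mothers with pre-eclampsia
rd_no, se_no = risk_difference(21, 267, 37, 262)      # mothers without pre-eclampsia

# Difference between two independent estimates: the variances add.
interaction = rd_no - rd_pre
se_interaction = math.sqrt(se_pre ** 2 + se_no ** 2)
z = interaction / se_interaction
p_interaction = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation

print(f"pre-eclampsia:    effect = {rd_pre:.3f}, SE = {se_pre:.3f}")
print(f"no pre-eclampsia: effect = {rd_no:.3f}, SE = {se_no:.3f}")
print(f"difference in effects = {interaction:.3f}, SE = {se_interaction:.3f}, "
      f"P = {p_interaction:.2f}")
```

    The two effects are nearly equal, while the standard error in the pre-eclampsia subgroup is several times larger, so a direct comparison gives no suggestion of an interaction.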

    Another example can be found in a study of the effect of vitamin D supplementation for preventing neonatal hypocalcaemia: expectant mothers were given either supplements or placebo and the serum calcium concentration of the baby was measured at one week.3 The benefit of supplementation was investigated separately for breast and bottle fed infants, and t tests to compare the treatment groups gave P = 0.40 in the breast fed group and P = 0.0006 in the bottle fed group.

    As we have seen, it would be wrong to infer that vitamin D supplementation had a different effect on breast and bottle fed babies on the basis of these two P values: the correct way to proceed is to compare directly the sizes of the treatment effects. The effect of vitamin D supplementation can be measured by the difference in mean serum calcium concentrations between supplement and placebo groups and this gives effects of 0.04 mmol/l in the breast fed babies and 0.10 mmol/l in bottle fed babies. In order to interpret the difference in effect sizes, namely 0.06 mmol/l, we need to construct a confidence interval or perform a test of the null hypothesis that the true effect sizes are the same in each subgroup. A 95% confidence interval for the difference in effect sizes is -0.05 to 0.17 mmol/l and a test of the null hypothesis gives P = 0.28. There is thus no evidence that the effect of vitamin D supplementation differs between breast and bottle fed infants. Comparing P values alone can be misleading.
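
    The quoted confidence interval and P value are consistent with each other, which can be checked by working back from the interval to a standard error. The sketch below assumes a normal approximation (multiplier 1.96) and uses only the rounded published figures; the original analysis may well have used a t distribution and unrounded data, so agreement is expected only approximately.

```python
import math

Z_95 = 1.96  # assumed normal multiplier for a 95% interval

# Rounded figures quoted in the text (mmol/l).
difference = 0.10 - 0.04          # bottle fed effect minus breast fed effect
ci_lower, ci_upper = -0.05, 0.17  # 95% CI for the difference in effects

# A symmetric interval is estimate +/- Z_95 * SE, so its width is 2 * Z_95 * SE.
se = (ci_upper - ci_lower) / (2 * Z_95)
z = difference / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided P, normal approximation

print(f"difference = {difference:.2f}, SE = {se:.3f}, z = {z:.2f}, P = {p_value:.3f}")
```

    The result lies close to the published P = 0.28; the small discrepancy is what rounding of the inputs would be expected to produce.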

    Details of how to construct relevant confidence intervals and carry out associated tests are contained in a subsequent Statistics Note.

