BMJ 1996;313:808 (28 September)

Education and debate

Statistics Notes: Interaction 2: compare effect sizes not P values

John N S Matthews, senior lecturer in medical statistics (a); Douglas G Altman, head (b)

(a) Department of Medical Statistics, University of Newcastle, Newcastle upon Tyne NE2 4HH; (b) ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, PO Box 777, Oxford OX3 7LF

Correspondence to: Dr Matthews.

As we have previously described [1], the statistical term interaction relates to the non-independence of the effects of two variables on the outcome of interest. For example, in a controlled trial comparing a new treatment with a standard treatment we may want to examine whether the observed benefit was the same for different subgroups of patients. A common approach to answering this question is to analyse the data separately in each subgroup. Here we illustrate this approach and explain why it is incorrect.

One of several subgroup analyses in a trial of antenatal steroids for preventing neonatal respiratory distress syndrome [2] was performed to see whether the effect of treatment was different in mothers who did or did not develop pre-eclampsia. Among mothers with pre-eclampsia 21.2% (7/33) of babies whose mothers were given dexamethasone developed neonatal respiratory distress syndrome compared with 27.3% (9/33) of babies whose mothers received placebo, giving P = 0.57. Among mothers who did not have pre-eclampsia 7.9% (21/267) of babies in the steroid group and 14.1% (37/262) of babies in the placebo group developed neonatal respiratory distress syndrome, giving P = 0.021.

There is a temptation to claim that the difference in P values establishes a difference between subgroups because "there is a treatment effect in mothers without pre-eclampsia but not in those with pre-eclampsia." This argument is false: the key to realising this is to recall that a statement such as P = 0.57 does not mean there is no difference, merely that we have found no evidence that there is a difference. A P value is a composite which depends not only on the size of an effect but also on how precisely the effect has been estimated (its standard error). So differences in P values can arise because of differences in effect sizes or differences in standard errors or a combination of the two.
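The dependence of a P value on precision can be sketched with a small calculation. The trials below are hypothetical: every one has the same event rates (27% v 21%, so the same effect), and only the number of patients per arm changes. The function name `two_sided_p` and the use of a normal approximation are assumptions made for illustration.

```python
from math import sqrt, erfc

def two_sided_p(effect, se):
    """Two-sided P value for effect / se under a normal approximation."""
    return erfc(abs(effect / se) / sqrt(2))

# Hypothetical trials: identical event rates in every trial,
# only the number of patients per arm differs.
rate_placebo, rate_treated = 0.27, 0.21
effect = rate_placebo - rate_treated

results = []  # (patients per arm, standard error, P value)
for n_per_arm in (30, 100, 300, 1000):
    # Standard error of a difference between two independent proportions.
    se = sqrt(rate_placebo * (1 - rate_placebo) / n_per_arm
              + rate_treated * (1 - rate_treated) / n_per_arm)
    results.append((n_per_arm, se, two_sided_p(effect, se)))

for n_per_arm, se, p_value in results:
    print(f"n per arm = {n_per_arm:4d}: effect = {effect:.2f}, "
          f"SE = {se:.3f}, P = {p_value:.3f}")
```

The effect is identical in every row, yet the P value falls steadily as the standard error shrinks.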

This is well illustrated by the present example. If we measure treatment effect by the difference in percentages developing neonatal respiratory distress syndrome in the placebo and steroid groups, then the treatment effect among mothers with pre-eclampsia, namely 27.3% - 21.2% = 6.1 percentage points, is very close to the effect among mothers without pre-eclampsia, which is 14.1% - 7.9% = 6.2 percentage points. The difference in P values has arisen because only a small proportion of mothers had pre-eclampsia (66 out of 595), so the former treatment effect is estimated much less precisely than the latter.
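The figures quoted for the steroid trial can be checked with a short calculation. This is a minimal sketch, not necessarily the method used in the original trial report: it assumes an unpooled standard error for the difference between two proportions and a normal approximation for the P value, and the function name `risk_difference` is illustrative.

```python
from math import sqrt, erfc

def risk_difference(events_placebo, n_placebo, events_treated, n_treated):
    """Return (effect, standard error, two-sided P) for a difference in proportions."""
    p_placebo = events_placebo / n_placebo
    p_treated = events_treated / n_treated
    effect = p_placebo - p_treated
    # Unpooled standard error of the difference between two independent proportions.
    se = sqrt(p_placebo * (1 - p_placebo) / n_placebo
              + p_treated * (1 - p_treated) / n_treated)
    # Two-sided P value from the normal approximation.
    p_value = erfc(abs(effect / se) / sqrt(2))
    return effect, se, p_value

# Counts taken from the text: (events, babies) for placebo, then steroid.
subgroups = {
    "pre-eclampsia": risk_difference(9, 33, 7, 33),
    "no pre-eclampsia": risk_difference(37, 262, 21, 267),
}

for name, (effect, se, p_value) in subgroups.items():
    print(f"{name}: effect = {100 * effect:.1f} points, "
          f"SE = {100 * se:.1f} points, P = {p_value:.3f}")
```

Under these assumptions the two effects are almost equal, but the standard error in the pre-eclampsia subgroup is roughly four times that in the larger subgroup, and the resulting P values are close to the published 0.57 and 0.021.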

Another example can be found in a study of the effect of vitamin D supplementation for preventing neonatal hypocalcaemia: expectant mothers were given either supplements or placebo and the serum calcium concentration of the baby was measured at one week [3]. The benefit of supplementation was investigated separately for breast and bottle fed infants, and t tests to compare the treatment groups gave P = 0.40 in the breast fed group and P = 0.0006 in the bottle fed group.

As we have seen, it would be wrong to infer that vitamin D supplementation had a different effect on breast and bottle fed babies on the basis of these two P values: the correct way to proceed is to compare directly the sizes of the treatment effects. The effect of vitamin D supplementation can be measured by the difference in mean serum calcium concentrations between supplement and placebo groups and this gives effects of 0.04 mmol/l in the breast fed babies and 0.10 mmol/l in bottle fed babies. In order to interpret the difference in effect sizes, namely 0.06 mmol/l, we need to construct a confidence interval or perform a test of the null hypothesis that the true effect sizes are the same in each subgroup. A 95% confidence interval for the difference in effect sizes is -0.05 to 0.17 mmol/l and a test of the null hypothesis gives P = 0.28. There is thus no evidence that the effect of vitamin D supplementation differs between breast and bottle fed infants. Comparing P values alone can be misleading.
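A direct comparison of two subgroup effects can be sketched as follows. This is an illustration under stated assumptions rather than the authors' own calculation: the two estimates are treated as independent, so the variance of their difference is the sum of their variances, and a normal approximation is used throughout. The function names are hypothetical. For the vitamin D example the standard error is not given in the text, so it is backed out from the published 95% confidence interval, which is itself only an approximation.

```python
from math import sqrt, erfc

def risk_difference(events_placebo, n_placebo, events_treated, n_treated):
    """Return (effect, standard error) for a difference in proportions."""
    p_placebo = events_placebo / n_placebo
    p_treated = events_treated / n_treated
    se = sqrt(p_placebo * (1 - p_placebo) / n_placebo
              + p_treated * (1 - p_treated) / n_treated)
    return p_placebo - p_treated, se

def compare_effects(effect_1, se_1, effect_2, se_2):
    """Compare two independent effect estimates.

    Returns (difference, its standard error, 95% confidence interval, two-sided P).
    """
    diff = effect_2 - effect_1
    se_diff = sqrt(se_1 ** 2 + se_2 ** 2)  # variances of independent estimates add
    ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
    p_value = erfc(abs(diff / se_diff) / sqrt(2))  # normal approximation
    return diff, se_diff, ci, p_value

# Antenatal steroid trial: effect in each subgroup, then the comparison.
pre = risk_difference(9, 33, 7, 33)
no_pre = risk_difference(37, 262, 21, 267)
steroid_diff, steroid_se, steroid_ci, steroid_p = compare_effects(*pre, *no_pre)
print(f"steroids: difference in effects = {100 * steroid_diff:.1f} points, "
      f"95% CI {100 * steroid_ci[0]:.1f} to {100 * steroid_ci[1]:.1f}, "
      f"P = {steroid_p:.2f}")

# Vitamin D study: only summary figures are quoted, so recover the
# standard error from the published interval (-0.05 to 0.17 mmol/l).
vitd_diff = 0.10 - 0.04
vitd_se = (0.17 - (-0.05)) / (2 * 1.96)
vitd_p = erfc(abs(vitd_diff / vitd_se) / sqrt(2))
print(f"vitamin D: difference in effects = {vitd_diff:.2f} mmol/l, "
      f"SE = {vitd_se:.3f}, P = {vitd_p:.2f}")
```

For the steroid trial the difference between subgroup effects is a fraction of a percentage point, with a wide confidence interval that spans zero and a P value close to 1. For the vitamin D study the recovered P value is close to the published 0.28; exact agreement is not expected because the inputs are rounded and the original analysis was based on t tests.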

Details of how to construct relevant confidence intervals and carry out associated tests are contained in a subsequent Statistics Note.

  1. Altman DG, Matthews JNS. Interaction 1: heterogeneity of effects. BMJ 1996;313:486.
  2. Collaborative Group on Antenatal Steroid Therapy. Effect of antenatal dexamethasone administration on the prevention of respiratory distress syndrome. Am J Obstet Gynecol 1981;141:276-87.
  3. Cockburn F, Belton NR, Purvis RJ, Giles MM, Brown JK, Turner TL, et al. Maternal vitamin D intake and mineral metabolism in mothers and their newborn infants. BMJ 1980;281:11-4.

