Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls
BMJ 2009;338:b2393. doi: https://doi.org/10.1136/bmj.b2393 (Published 29 June 2009)
- Jonathan A C Sterne, professor of medical statistics and epidemiology1,
- Ian R White, senior scientist2,
- John B Carlin, director of clinical epidemiology and biostatistics unit3,
- Michael Spratt, research associate1,
- Patrick Royston, senior scientist4,
- Michael G Kenward, professor of biostatistics5,
- Angela M Wood, lecturer in biostatistics6,
- James R Carpenter, reader in medical and social statistics5
- 1Department of Social Medicine, University of Bristol, Bristol BS8 2PR
- 2MRC Biostatistics Unit, Institute of Public Health, Cambridge CB2 0SR
- 3Clinical Epidemiology and Biostatistics Unit, Murdoch Children’s Research Institute, and University of Melbourne, Parkville, Victoria 3052, Australia
- 4Cancer and Statistical Methodology Groups, MRC Clinical Trials Unit, London NW1 2DA
- 5Medical Statistics Unit, London School of Hygiene and Tropical Medicine, London WC1E 7HT
- 6Department of Public Health and Primary Care, Institute of Public Health, Cambridge
- Correspondence to: J A C Sterne
- Accepted 30 January 2009
Missing data are unavoidable in epidemiological and clinical research but their potential to undermine the validity of research results has often been overlooked in the medical literature.1 This is partly because statistical methods that can tackle problems arising from missing data have, until recently, not been readily accessible to medical researchers. However, multiple imputation—a relatively flexible, general purpose approach to dealing with missing data—is now available in standard statistical software,2 3 4 5 making it possible to handle missing data semiroutinely. Results based on this computationally intensive method are increasingly reported, but it needs to be applied carefully to avoid misleading conclusions.
In this article, we review the reasons why missing data may lead to bias and loss of information in epidemiological and clinical research. We discuss the circumstances in which multiple imputation may help by reducing bias or increasing precision, as well as describing potential pitfalls in its application. Finally, we describe the recent use and reporting of analyses using multiple imputation in general medical journals, and suggest guidelines for the conduct and reporting of such analyses.
Consequences of missing data
Researchers usually address missing data by including in the analysis only complete cases (those individuals who have no missing data in any of the variables required for that analysis). However, results of such analyses can be biased. Furthermore, the cumulative effect of missing data in several variables often leads to exclusion of a substantial proportion of the original sample, which in turn causes a substantial loss of precision and power.
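As a rough illustration of the cumulative effect described above, the following Python sketch computes the expected proportion of complete cases and applies a complete case filter to a handful of records. The function names, the 10% missingness rate, the toy records, and the assumption that missingness is independent across variables are all illustrative choices, not taken from the article.

```python
def complete_case_fraction(missing_rates):
    """Expected fraction of individuals with no missing values.

    Assumes (for illustration only) that missingness in each variable
    is independent of missingness in the others.
    """
    fraction = 1.0
    for rate in missing_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("each missing rate must lie in [0, 1]")
        fraction *= 1.0 - rate
    return fraction


def complete_cases(records, variables):
    """Keep only records with an observed (non-None) value for every variable."""
    return [r for r in records if all(r.get(v) is not None for v in variables)]


# Hypothetical scenario: five variables, each missing for 10% of individuals.
kept = complete_case_fraction([0.10] * 5)
print(f"Expected complete cases: {kept:.1%}")  # prints "Expected complete cases: 59.0%"

# Hypothetical records; None marks a missing value.
records = [
    {"age": 54, "smoker": True, "sbp": 132},
    {"age": 61, "smoker": None, "sbp": 141},
    {"age": None, "smoker": False, "sbp": 118},
    {"age": 47, "smoker": False, "sbp": None},
]
analysed = complete_cases(records, ["age", "smoker", "sbp"])
print(len(analysed), "of", len(records), "records retained")  # prints "1 of 4 records retained"
```

Under these assumptions, a modest 10% of missing values per variable removes about 41% of the sample from a complete case analysis. In the toy records, each variable is missing only once, yet three of the four individuals are excluded.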
The risk of bias due to missing data depends on the reasons why data are missing. Reasons for missing data are commonly classified as: missing …