
Research Methods & Reporting

Calculating the sample size required for developing a clinical prediction model

BMJ 2020; 368 doi: (Published 18 March 2020) Cite this as: BMJ 2020;368:m441
  1. Richard D Riley, professor of biostatistics (1),
  2. Joie Ensor, lecturer in biostatistics (1),
  3. Kym I E Snell, lecturer in biostatistics (1),
  4. Frank E Harrell Jr, professor of biostatistics (2),
  5. Glen P Martin, lecturer in health data sciences (3),
  6. Johannes B Reitsma, associate professor (4),
  7. Karel G M Moons, professor of clinical epidemiology (4),
  8. Gary Collins, professor of medical statistics (5),
  9. Maarten van Smeden, assistant professor (4, 5, 6)
  (1) Centre for Prognosis Research, School of Primary, Community and Social Care, Keele University, Staffordshire ST5 5BG, UK
  (2) Department of Biostatistics, Vanderbilt University School of Medicine, Nashville TN, USA
  (3) Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, UK
  (4) Julius Center for Health Sciences, University Medical Center Utrecht, Utrecht, Netherlands
  (5) Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK
  (6) Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, Netherlands
  Correspondence to: R D Riley r.riley{at} (or @richard_d_riley on Twitter)
  • Accepted 19 December 2019

Clinical prediction models aim to predict outcomes in individuals, to inform diagnosis or prognosis in healthcare. Hundreds of prediction models are published in the medical literature each year, yet many are developed using a dataset that is too small in terms of the total number of participants or outcome events. This leads to inaccurate predictions and consequently incorrect healthcare decisions for some individuals. In this article, the authors provide guidance on how to calculate the sample size required to develop a clinical prediction model.

Summary points

  • Patients and healthcare professionals require clinical prediction models to accurately guide healthcare decisions

  • Larger sample sizes lead to the development of more robust models

  • Data should be of sufficient quality and representative of the target population and settings of application

  • It is better to use all available data for model development (ie, avoid data splitting), with resampling methods (such as bootstrapping) used for internal validation

  • When developing prediction models for binary or time-to-event outcomes, a well known rule of thumb for the required sample size is to ensure at least 10 events for each predictor parameter

  • The actual required sample size is, however, context specific and depends not only on the number of events relative to the number of candidate predictor parameters but also on the total number of participants, the outcome proportion (incidence) in the study population, and the expected predictive performance of the model

  • We propose to use such information to tailor sample size requirements to the specific setting of interest, with the aim of minimising the potential for model overfitting while targeting precise estimates of key parameters

  • Our proposal can be implemented in a four step procedure and is applicable for continuous, binary, or time-to-event outcomes

  • The pmsampsize package in Stata or R allows researchers to implement the procedure
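
As a rough illustration of the rule of thumb mentioned in the summary points (not the four step procedure the authors propose, and not the pmsampsize package), the minimal Python sketch below converts 10 events per predictor parameter into a total number of participants for a given outcome proportion. The function name, its arguments, and the example values (20 candidate predictor parameters, outcome proportions of 0.1 and 0.5) are hypothetical choices made for illustration only.

```python
import math
from fractions import Fraction


def rule_of_thumb_sample_size(n_parameters, outcome_proportion, events_per_parameter=10):
    """Return (required_events, required_participants) under the
    events-per-predictor-parameter rule of thumb.

    Illustrative helper only: this is the simple rule of thumb, not the
    tailored four step procedure described by the authors.
    """
    if n_parameters <= 0 or events_per_parameter <= 0:
        raise ValueError("n_parameters and events_per_parameter must be positive")
    if not 0 < outcome_proportion <= 1:
        raise ValueError("outcome_proportion must be in (0, 1]")

    # Events needed: a fixed number of events for each predictor parameter.
    required_events = events_per_parameter * n_parameters

    # Participants needed to expect that many events at the given outcome
    # proportion. Fraction(str(...)) avoids binary floating point rounding
    # before taking the ceiling.
    proportion = Fraction(str(outcome_proportion))
    required_participants = math.ceil(required_events / proportion)

    return required_events, required_participants


if __name__ == "__main__":
    # Hypothetical example: 20 candidate predictor parameters.
    for proportion in (0.1, 0.5):
        events, participants = rule_of_thumb_sample_size(20, proportion)
        print(f"outcome proportion {proportion}: "
              f"{events} events, {participants} participants")
```

With 20 parameters the rule asks for 200 events in both cases, but the number of participants needed to expect those events differs: 2000 when the outcome proportion is 0.1 and 400 when it is 0.5. The rule itself takes no account of the model's expected predictive performance, which is one of the factors the summary points identify as determining the actual required sample size.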

Clinical prediction models are needed to …
