
CC BY Open access
Research Methods & Reporting

How to develop a more accurate risk prediction model when there are few events

Cite this as: BMJ 2015;351:h3868 (Published 11 August 2015)

This article has a correction.

  1. Menelaos Pavlou, research associate (1)
  2. Gareth Ambler, senior lecturer (1)
  3. Shaun R Seaman, senior statistician (2)
  4. Oliver Guttmann, cardiology registrar (3)
  5. Perry Elliott, professor (4)
  6. Michael King, professor (5)
  7. Rumana Z Omar, professor (1)

  1. Department of Statistical Science, University College London, WC1E 6BT London, UK
  2. Medical Research Council Biostatistics Unit, Cambridge
  3. School of Life and Medical Sciences, Institute of Cardiovascular Science, University College London
  4. Inherited Cardiac Disease Unit, the Heart Hospital, London
  5. Division of Psychiatry, University College London

Correspondence to: Menelaos Pavlou m.pavlou{at}
  • Accepted 21 June 2015

When the number of events is low relative to the number of predictors, standard regression can produce overfitted risk models that make inaccurate predictions. Penalised regression may improve the accuracy of risk prediction.

Summary points

  • Risk prediction models are used in clinical decision making and to help patients make an informed choice about their treatment

  • Model overfitting can arise when the number of events is small compared with the number of predictors in the risk model

  • In an overfitted model, the probability of an event tends to be underestimated in low risk patients and overestimated in high risk patients

  • In datasets with few events, penalised regression methods can provide better predictions than standard regression

Risk prediction models, which typically use a number of predictors based on patient characteristics to predict health outcomes, are a cornerstone of modern clinical medicine.1 Models developed using data with few events compared with the number of predictors often underperform when applied to new patient cohorts.2 A key statistical reason for this is “model overfitting.” Overfitted models tend to underestimate the probability of an event in low risk patients and overestimate it in high risk patients, which could affect clinical decision making. In this paper, we discuss the potential of penalised regression methods to alleviate this problem and thus to develop more accurate prediction models.
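To make the contrast between standard and penalised regression concrete, the following is a minimal sketch rather than the authors' method. It assumes a ridge (L2) penalty, which is one common form of penalised regression; the excerpt above does not say which penalties the paper uses. The cohort is simulated, and every name and value in it (`fit_logistic`, `lam`, the sample size, the true coefficients, the random seed) is a hypothetical choice made for illustration. The model is fitted by plain gradient ascent so that the example needs only the Python standard library.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(b0, beta, row):
    """Predicted event probability for one patient."""
    return sigmoid(b0 + sum(b * x for b, x in zip(beta, row)))


def fit_logistic(X, y, lam=0.0, lr=0.5, n_iter=1000):
    """Gradient ascent on log-likelihood - (lam / 2) * sum(beta_j ** 2).

    The intercept b0 is not penalised. With lam = 0 there is no penalty,
    so the fit approximates standard logistic regression.
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, g = 0.0, [0.0] * p
        for row, yi in zip(X, y):
            resid = yi - predict(b0, beta, row)
            g0 += resid
            for j in range(p):
                g[j] += resid * row[j]
        b0 += lr * g0 / n
        beta = [b + lr * (gj - lam * b) / n for b, gj in zip(beta, g)]
    return b0, beta


# Simulated cohort: few events, several predictors (illustrative assumptions).
random.seed(1)
n_patients, n_predictors = 100, 8
true_b0 = -2.5
true_beta = [0.6, 0.6] + [0.0] * (n_predictors - 2)  # six pure-noise predictors
X = [[random.gauss(0, 1) for _ in range(n_predictors)] for _ in range(n_patients)]
y = [1 if random.random() < predict(true_b0, true_beta, row) else 0 for row in X]

b0_std, beta_std = fit_logistic(X, y, lam=0.0)  # standard regression
b0_pen, beta_pen = fit_logistic(X, y, lam=5.0)  # ridge-penalised regression

risk_std = [predict(b0_std, beta_std, row) for row in X]
risk_pen = [predict(b0_pen, beta_pen, row) for row in X]

# Overall size of the fitted coefficients (Euclidean norm, intercept excluded).
size_std = math.sqrt(sum(b * b for b in beta_std))
size_pen = math.sqrt(sum(b * b for b in beta_pen))

print("events:", sum(y), "of", n_patients, "patients;", n_predictors, "predictors")
print("coefficient size, standard: ", round(size_std, 3))
print("coefficient size, penalised:", round(size_pen, 3))
print("predicted risk range, standard: ", round(min(risk_std), 3), "to", round(max(risk_std), 3))
print("predicted risk range, penalised:", round(min(risk_pen), 3), "to", round(max(risk_pen), 3))
```

The penalty pulls the coefficients towards zero, so the penalised model would be expected to give less extreme predicted risks than the standard fit, which is the direction of correction that overfitting calls for. The penalty strength is fixed at an arbitrary value here; in practice it is usually chosen by cross-validation.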

Statistical models are often used to predict the probability that an individual with a given set of risk factors will experience a health outcome, usually termed an “event.” These risk prediction models can help in clinical decision making and help patients make an informed choice regarding their treatment.3 4 5 6 Risk models are developed using several risk factors typically based on patient characteristics that are thought to be associated with the health event of interest (box …
