Comparing risk prediction models

BMJ 2012; 344 doi: https://doi.org/10.1136/bmj.e3186 (Published 24 May 2012) Cite this as: BMJ 2012;344:e3186
  1. Gary S Collins, senior medical statistician^1
  2. Karel G M Moons, professor of clinical epidemiology^2

  1. Centre for Statistics in Medicine, Wolfson College Annexe, University of Oxford, Oxford OX2 6UD, UK
  2. Julius Centre for Health Sciences and Primary Care, UMC Utrecht, 3508 GA Utrecht, Netherlands

  Correspondence to: gary.collins{at}csm.ox.ac.uk

Should be routine when deriving a new model for the same purpose

Risk prediction models have great potential to support clinical decision making and are increasingly incorporated into clinical guidelines.1 Many prediction models have been developed for cardiovascular disease (the Framingham risk score, SCORE, QRISK, and the Reynolds risk score, to mention just a few). With so many prediction models for similar outcomes or target populations, clinicians have to decide which model to use for their patients. To make this decision they need to know, as a minimum, how well a score predicts disease in people outside the populations used to develop it (its external validity) and which model performs best.2

In a linked research study (doi:10.1136/bmj.e3318), Siontis and colleagues examined the comparative performance of several prespecified cardiovascular risk prediction models for the general population.3 They identified 20 published studies that compared two or more models and they highlighted problems in design, analysis, and reporting. What can be inferred from the findings of this well conducted systematic review?

Firstly, direct comparisons are few. Calls for more direct (head to head) comparisons are increasingly heard in therapeutic intervention and diagnostic research, and the same call applies to prediction model validation studies. Many more prediction models have been developed than have been validated in independent datasets, and few models developed for similar outcomes and target populations have been directly validated and compared.2 The authors of the current study retrieved various validation studies, but only 20 evaluated more than one model, and most of those compared just two. Readers are therefore still left to judge from indirect comparisons which of the available models predicts best in which situation.

It would be much more informative if investigators with (large) datasets available validated and compared all existing models together, and better still if they first conducted and reported a systematic review of existing models before validating them in their dataset. Fair comparison requires that if an existing model seems to be miscalibrated for the data at hand, attempts are made to adjust or recalibrate it.4 5 A prediction model developed in one country or population, for example, does not necessarily provide accurate predictions elsewhere. Ideally, pre-existing models should be examined in the new target setting and, if necessary, recalibrated or further updated and their performance rechecked before yet another model is developed.4
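One common form of recalibration, mentioned above, refits the intercept and calibration slope of the existing model in the new setting, adjusting its predictions rather than discarding them. The Python sketch below is illustrative only (it is not from the editorial, and the function names and simulated data are hypothetical); it fits a logistic regression of the observed outcomes on the old model's linear predictor by Newton-Raphson.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def recalibrate(p_old, y, n_iter=25):
    """Logistic recalibration of an existing risk model.

    p_old: predicted risks from the existing model in the new setting
    y:     observed binary outcomes in the new setting
    Returns [intercept, slope]; the recalibrated risk is
    sigmoid(intercept + slope * logit(p_old)).
    """
    lp = logit(p_old)                       # linear predictor of the old model
    X = np.column_stack([np.ones_like(lp), lp])
    beta = np.zeros(2)
    for _ in range(n_iter):                 # Newton-Raphson for the logistic MLE
        p = sigmoid(X @ beta)
        w = p * (1 - p)                     # working weights
        grad = X.T @ (y - p)                # score vector
        hess = X.T @ (X * w[:, None])       # observed information
        beta += np.linalg.solve(hess, grad)
    return beta
```

An estimated intercept near 0 and slope near 1 would indicate that the existing model already transports well; a slope below 1 suggests the original predictions are too extreme for the new population.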

Secondly, as Siontis and colleagues concluded, studies suggesting that one model is better than another often have potential biases and methodological shortcomings. Authors who develop a new risk prediction model on their own data and then compare it with an existing model usually report better performance for the new model. This is unsurprising: a model tends to perform better on the dataset from which it was developed, simply because it is tuned to that dataset, which is why a model’s performance should be evaluated in other datasets, preferably by independent investigators. Reporting bias probably also plays a role,6 because a newly developed prediction model that performed worse than an existing one would probably not be submitted or published. Greater emphasis should therefore be placed on methodologically sound and appropriately detailed external validation studies, ideally of multiple models at once, to show which model is most useful.7

Thirdly, the Framingham risk score may often require recalibrating when used as a comparator. In many of the studies examined by Siontis and colleagues a new model was compared against the Framingham risk score. Although the Framingham risk score—developed in the United States during the 1970s—has stood the test of time, it has been shown to be miscalibrated in several other settings.8 It is not surprising that without recalibration comparisons against it will often favour the new model, especially if the validation dataset covers specific subpopulations that were not covered in the original Framingham study.

Fourthly, Siontis and colleagues’ review supports the findings of existing systematic reviews of prediction models.9 The conduct and reporting of prediction models has been criticised as poor, and key details needed to evaluate the model objectively are often omitted. In the absence of reporting guidelines for such studies, Siontis and colleagues have provided suggestions for conducting and reporting comparative studies, which if adhered to will make the task of appraising these studies easier. Guidelines for studies reporting the development and validation of prediction models are being developed.10

Finally, studies that compare prediction models lack consistency because they use different statistical measures to describe model performance. Discrimination and calibration are the most widely recommended measures to evaluate, yet calibration is rarely examined. Important as these statistical characteristics are, they do not ensure that a model is clinically useful. More emphasis should therefore be placed on demonstrating net benefit, for example,11 or, preferably, on conducting a randomised trial to evaluate the model’s ability to change clinicians’ decision making and patient outcomes.7 12
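To make the distinction between these measures concrete, the sketch below (illustrative Python, not from the editorial) computes the three quantities discussed: the C statistic for discrimination, calibration in the large as the simplest calibration check, and decision-curve net benefit at a chosen risk threshold as one measure of clinical usefulness. The function names are hypothetical.

```python
import numpy as np

def c_statistic(p, y):
    """Discrimination: probability that a randomly chosen case receives a
    higher predicted risk than a randomly chosen non-case (area under ROC)."""
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]           # all case/non-case pairs
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def calibration_in_the_large(p, y):
    """Simplest calibration check: observed minus mean predicted risk.
    Zero means predictions are right on average; it says nothing about
    calibration across the risk range."""
    return y.mean() - p.mean()

def net_benefit(p, y, threshold):
    """Decision-curve net benefit at a risk threshold: true positives per
    patient, penalising false positives by the odds of the threshold."""
    treat = p >= threshold
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    n = len(y)
    return tp / n - (fp / n) * threshold / (1 - threshold)
```

A model can discriminate well (high C statistic) while being badly miscalibrated, which is why reporting only one of these measures, as many of the reviewed studies did, gives an incomplete picture.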

Journal editors and peer reviewers should be more critical of methodological shortcomings in prediction model studies, and they should work towards improved reporting, calling for studies to describe a fair validation and to compare two or preferably more risk prediction models simultaneously.




  • Research, doi:10.1136/bmj.e3318
  • Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.

  • Provenance and peer review: Commissioned; not externally peer reviewed.

