- Karel G M Moons, professor of clinical epidemiology1,
- Douglas G Altman, professor of statistics in medicine2,
- Yvonne Vergouwe, assistant professor of clinical epidemiology1,
- Patrick Royston, senior statistician3
- 1Julius Centre for Health Sciences and Primary Care, University Medical Centre Utrecht, Utrecht, Netherlands
- 2Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD
- 3MRC Clinical Trials Unit, London NW1 2DA
- Correspondence to: K G M Moons
- Accepted 6 October 2008
Prognostic models are developed to be applied in new patients, who may come from different centres, countries, or times. Hence, new patients are commonly referred to as different from but similar to the patients used to develop the models.1 2 3 4 But what exactly does this mean? When can a new patient population be considered similar (enough) to the development population to justify validation and eventually application of a model? We have already considered the design, development, and validation of prognostic research and models.5 6 7 In the final article of our series, we discuss common limitations to the application and generalisation of prognostic models and what evidence beyond validation is needed before practitioners can confidently apply a model to their patients. These issues also apply to prediction models with a diagnostic outcome (presence of a disease).
Summary points
- Prognostic models generalise best to populations that have similar ranges of predictor values to those in the development population
- When a prognostic model performs less well in a new population, using the new data to modify the model should first be considered rather than directly developing a new model
- Application of prognostic models requires unambiguous definitions of predictors and outcomes and reproducible measurements using methods available in clinical practice
- Impact studies quantify the effect of using a prognostic model on physicians’ behaviour, patient outcome, or cost effectiveness of care compared with usual care without the model
- Impact studies require different design, outcome, analysis, and reporting from validation studies
Limitations to application
Extrapolation versus validation
Most prediction models are developed in secondary care, and it is common to want to apply them to primary care.1 8 9 10 The predictive performance of secondary care models is usually decreased when they are validated in a primary care setting.1 9 One example is the diagnostic model to predict deep vein thrombosis, which had a negative predictive value of 97% (95% confidence interval 95% to 99%) and sensitivity 90% (83% to 96%) in Canadian secondary care patients.11 When the model was validated in Dutch primary care patients, the negative predictive value was only 88% (85% to 91%) and sensitivity 79% (74% to 84%).12 The question arises whether primary and secondary care populations can indeed be considered to be different but similar.
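As a reminder of how the two measures quoted above are obtained, the following minimal sketch computes sensitivity and negative predictive value from a 2×2 table of model result against true disease status. The function name and the counts are hypothetical illustrations and are not taken from the cited studies.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV and NPV from 2x2 table counts.

    tp: diseased patients the model flags as high risk
    fn: diseased patients the model flags as low risk
    fp: non-diseased patients the model flags as high risk
    tn: non-diseased patients the model flags as low risk
    Assumes every row and column total is greater than zero.
    """
    return {
        "sensitivity": tp / (tp + fn),   # share of diseased patients detected
        "specificity": tn / (tn + fp),   # share of non-diseased patients cleared
        "ppv": tp / (tp + fp),           # disease probability after a high risk result
        "npv": tn / (tn + fn),           # disease-free probability after a low risk result
    }


# Hypothetical counts, for illustration only.
example = accuracy_measures(tp=90, fp=50, fn=10, tn=350)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Because the predictive values depend on how common the disease is among the patients tested, the same model can yield different values when the case mix changes, even before any change in sensitivity is considered.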
A change in setting clearly results in a different case mix, which commonly affects the generalisability of prognostic models.4 9 13 14 Case mix is here defined as the distribution of the outcome and predictive factors whether included in the model or not. Primary care doctors often selectively refer patients to specialists. Secondary care patients can thus largely be considered to be a subpopulation of primary care patients, commonly with a narrower range of patient characteristics, a larger fraction of patients in later disease stages, and worse outcomes.9 Consequently, application of a secondary care model in general practice requires extrapolation. This view suggests that applying a primary care model to secondary care would have a limited effect on predictive performance, although this requires further research.
Another common generalisation, or rather extrapolation, is from adults to children. Various prognostic models have been developed to predict the risk of postoperative nausea and vomiting in adults scheduled for surgery under general anaesthesia. When validated in children, the models’ predictive ability was substantially decreased.15 The researchers considered children as a different population and developed and validated a separate model for children that included other predictors.16 In contrast, the Intensive Care National Audit and Research Centre model to predict outcome in critical care was initially developed with data from adults but also has good accuracy in children.17
In general, models will be more generalisable when the ranges of predictor values in the new population are within the ranges seen in the development population. The above examples show that we cannot assume that prediction models can simply be generalised from one population or setting to another, although it may be possible. Therefore, accuracy of any prediction model should always be tested in a formal validation study (see third article in this series7).
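One simple way to see whether applying a model would involve extrapolation is to compare the predictor ranges of the two populations before validation. The sketch below is an illustrative check, not a method prescribed in this series; the function name, the predictor names, and all values are hypothetical.

```python
def fraction_outside_development_range(development, new):
    """For each predictor, return the share of new patients whose value
    lies outside the minimum-maximum range seen in the development data.

    Both arguments map a predictor name to a list of numeric values.
    """
    result = {}
    for name, dev_values in development.items():
        low, high = min(dev_values), max(dev_values)
        new_values = new[name]
        outside = sum(1 for v in new_values if v < low or v > high)
        result[name] = outside / len(new_values)
    return result


# Hypothetical example: a model developed in adults, considered for children.
development_data = {"age_years": [18, 35, 52, 67, 90], "weight_kg": [48, 60, 75, 90, 120]}
new_data = {"age_years": [4, 9, 13, 17, 19], "weight_kg": [16, 30, 45, 58, 62]}

for predictor, share in fraction_outside_development_range(development_data, new_data).items():
    print(f"{predictor}: {share:.0%} of new patients outside the development range")
```

A large share outside the range signals extrapolation, but a small share does not show that the model will perform well; that still requires a formal validation study.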
Adequate prediction versus application
Just because a model is widely used does not mean that it predicts adequately. For example, the Framingham risk model discriminates only reasonably in certain (sub)populations, with a receiver-operating characteristic (ROC) curve area of little over 0.70.18 The model is nevertheless widely used. The same applies to various intensive care prediction models, for example the APACHE scores and the simplified acute physiology scores (SAPS).19 20 A likely reason is the relevance of the outcomes that these rules predict: risk of cardiovascular disease (Framingham) and mortality in critical illness (APACHE, SAPS). Another reason for the wide use of such models is their face validity, such that doctors trust these models to guide their practice rather than their own experience.
Whether the predictive accuracy of a model in new patients is adequate is also a matter of judgment and depends on available alternatives.21 For instance, a prognostic model to predict the probability of spontaneous ongoing pregnancy in couples with unexplained subfertility has good calibration but rather low discriminative ability (ROC area even below 0.70), yet it remains the best model available.22 Hence, the model was used to identify couples with intermediate probability of spontaneous ongoing pregnancy for a clinical trial.23
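The ROC area can be read as the probability that a randomly chosen patient with the outcome receives a higher predicted risk than a randomly chosen patient without it, with ties counted as one half. The following sketch computes it directly from that definition; the function name and the example values are hypothetical.

```python
def roc_area(predicted_risks, outcomes):
    """ROC area (c statistic) by pairwise comparison.

    predicted_risks: predicted probabilities, one per patient
    outcomes: 1 if the patient had the outcome, 0 otherwise
    """
    with_outcome = [p for p, y in zip(predicted_risks, outcomes) if y == 1]
    without_outcome = [p for p, y in zip(predicted_risks, outcomes) if y == 0]
    if not with_outcome or not without_outcome:
        raise ValueError("need at least one patient with and one without the outcome")

    score = 0.0
    for p_event in with_outcome:
        for p_nonevent in without_outcome:
            if p_event > p_nonevent:
                score += 1.0      # concordant pair
            elif p_event == p_nonevent:
                score += 0.5      # tied pair counts as one half
    return score / (len(with_outcome) * len(without_outcome))


# Hypothetical predictions for four patients.
print(roc_area([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1]))
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination. The measure says nothing about calibration, which has to be assessed separately.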
Finally, the role of prognostic models and prognostic factors in clinical practice still depends on circumstances. A positive family history of subarachnoid haemorrhage increases the risk of subarachnoid haemorrhage 5.5 times, but only 10% of cases of subarachnoid haemorrhage occur in people with a family history. Thus screening for subarachnoid haemorrhage in people with a family history is not recommended, as it would identify too small a proportion of cases.24
Constraints on the usability of the prognostic model can also limit the application. Application of prognostic models requires unambiguous definitions of predictors and reproducible measurements using methods available in clinical practice. For example, one of the predictors in the deep vein thrombosis model described above is “alternative diagnosis just as likely as deep vein thrombosis.”11 General practitioners may be less experienced in properly coding this predictor for a patient, leading to misclassification that potentially compromises the rule’s predictive performance. Another example of an ambiguous predictor definition is “history of nausea and vomiting after previous anaesthesia” in the prognostic model for postoperative nausea and vomiting.25 A negative answer could mean that the patient has had anaesthesia before but not experienced symptoms or that the patient has never had anaesthesia. Also, children will have had previous anaesthesia less often than adults. As a consequence, this predictor may have a different effect in children.
Similarly, the definition of the outcome variable may vary across populations. Occurrence of neurological sequelae after childhood bacterial meningitis was defined in a development population as mild cases (for example, hearing loss), severe cases (for example, deafness), or death.26 The prognostic model was validated in a population that included children with mainly mild neurological sequelae. The model showed poor performance in the validation population, possibly because of the different distribution of outcomes.27 In addition, the follow-up time differed between the two populations (the maximum duration of follow-up was 3.3 years in the development population and 10 years in the validation population).
Changes over time
As we discussed in the first article in this series,5 changes in practice over time can limit the application of prognostic models. Improvements in diagnostic tests, biomarker measurement, or treatments may change the prognosis of patients. For example, spiral computed tomography can better visualise the pulmonary circulation than older computed tomography.28 As a consequence, a patient with pulmonary embolism detected by spiral computed tomography and treated accordingly may have a better prognosis on average than a patient with an embolism detected by conventional computed tomography.
Changes over time may even lead to the situation that prognostic models are no longer used to estimate outcome risks and to influence patient management. For example, the suggestion that everyone older than 55 is given a “polypill” to reduce the risk of cardiovascular diseases29 may make models to predict these diseases redundant.
Evidence beyond validation studies
Adjusting and updating prognostic models to improve performance
Newly collected data from prediction research are often used to develop a new prognostic model rather than to validate existing models.2 3 7 14 For example, there are over 60 models to predict outcome after breast cancer30 and about 25 models to predict long term outcome in patients with neurological trauma.31 If researchers do perform a formal validation study of a published model and find poor performance, they often then re-estimate the associations of the predictors with the outcome in their own data. Sometimes even the entire selection of important predictors is repeated. This is unfortunate, since predictive information captured in developing the original model is neglected. Furthermore, validation studies commonly include fewer patients than development studies, making the new model more subject to overfitting and thus even less generalisable than the original model.4 14
When a prognostic model performs less well in another population, adjusting the model using the new data should first be considered to determine whether it will improve the performance in that population.4 13 14 The adjusted model is then based on both the development and validation data, further improving its stability and generalisability. Such adjustment of prognostic models is called updating. Updating methods vary from simple recalibration to more extensive methods referred to as model revision.4 13 14 Recalibration includes adjustment of the intercept of the model and overall adjustment of the associations (relative weights) of the predictors with the outcome. Model revision includes adjustment of individual predictor-outcome associations and addition of new predictors. Interestingly, simple recalibration methods are often sufficient.4 14 The extent to which this process of model validation and adjustment has to be pursued before clinical application will depend on the context. General rules are as yet unavailable.
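For a logistic model, simple recalibration can be sketched as fitting only two parameters in the new data: a new intercept and a single slope applied to the original linear predictor (the weighted sum of predictors from the published model). The code below is a minimal illustration under that assumption, using a hand-written Newton-Raphson fit; the function names and the data are hypothetical, and in practice a standard statistical package would be used.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def recalibrate(linear_predictors, outcomes, max_iter=50, tol=1e-10):
    """Fit outcome ~ intercept + slope * original linear predictor
    by maximum likelihood. Returns (intercept, slope)."""
    a, b = 0.0, 1.0  # start from the original model left unchanged
    for _ in range(max_iter):
        g_a = g_b = 0.0               # gradient of the log likelihood
        h_aa = h_ab = h_bb = 0.0      # information matrix entries
        for lp, y in zip(linear_predictors, outcomes):
            p = _sigmoid(a + b * lp)
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * lp
            h_aa += w
            h_ab += w * lp
            h_bb += w * lp * lp
        det = h_aa * h_bb - h_ab * h_ab
        if det <= 0.0:
            raise ValueError("cannot fit: linear predictor has no variation")
        step_a = (h_bb * g_a - h_ab * g_b) / det
        step_b = (h_aa * g_b - h_ab * g_a) / det
        a += step_a
        b += step_b
        if abs(step_a) < tol and abs(step_b) < tol:
            break
    return a, b


def updated_risk(linear_predictor, intercept, slope):
    """Predicted probability from the recalibrated model."""
    return _sigmoid(intercept + slope * linear_predictor)


# Hypothetical validation data: original linear predictor and observed outcome.
lps = [-1.0] * 10 + [1.0] * 10
ys = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5
intercept, slope = recalibrate(lps, ys)
print(f"intercept {intercept:.3f}, slope {slope:.3f}")
```

A slope below 1 suggests that the original predictions were too extreme for the new population, and a non-zero intercept indicates that predicted risks were on average too high or too low. Model revision, which re-estimates individual predictor effects or adds predictors, is not covered by this sketch.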
Impact of prognostic models
Prognostic models are developed to provide objective estimates of outcome probabilities to complement clinical intuition and guidelines.5 8 10 21 The underlying assumption is that accurately estimated probabilities improve doctors’ decision making and consequently patient outcome. The effect of a previously developed, validated, and (if needed) updated prognostic model on behaviour and patient outcomes should be studied separately in so called impact studies (box).
Consecutive stages to produce a usable multivariable prognostic model
Development studies5 6—Development of a multivariable prognostic model, including identification of the important predictors, assigning the relative weights to each predictor, and estimating the model’s predictive performance (eg, calibration and discrimination) adjusted if necessary for overfitting
Validation studies7—Validating or testing the model’s predictive performance in new subjects, preferably from different centres, with a different case mix or using (slightly) different definitions and measurements of predictors and outcomes
Impact studies—Quantifying whether use of a prognostic model in daily practice improves decision making and, ultimately, patient outcome using a comparative design
Validation and impact studies differ in their design, study outcome, statistical analysis, and reporting (table). A validation study ideally uses a prospective cohort design and does not require a control group.7 For each patient, predictors and outcome are documented, and the rule’s predictive performance is quantified.
By contrast, impact studies quantify the effect of using a prognostic model on doctors’ behaviour, patient outcome, or cost effectiveness of care compared with not using such a model (table). They require a control group of healthcare professionals who provide usual care. The preferred design is a randomised trial.3 If change in professionals’ behaviour is the main outcome, a randomised study without follow-up of patients would suffice. Follow-up is required if patient outcome or cost effectiveness is assessed. However, since changes in outcome depend on changes in doctors’ behaviour, it may be sensible to start with a randomised study assessing the model’s impact on therapeutic decisions, especially when long follow-up times are needed to assess patient outcome. The same applies to diagnostic procedures32 and therapeutic interventions for which effects are realised by changing behaviour and decisions, for example exercise therapy to reduce body mass index.
Impact studies may use an assistive approach (simply providing the model’s predicted probabilities of an outcome between 0% and 100%) or a decisive approach that explicitly suggests decisions for each probability category.3 33 The assistive approach clearly leaves room for intuition and judgment, but a decisive approach may have greater effect.3 34 35 Introduction of computerised patient records that automatically give predictions for individual subjects enhances implementation and thus impact analysis of prognostic models in routine care.35 36
Randomising individual patients in an impact study may result in learning effects because the same doctor will alternately apply and not apply the model to subsequent patients, reducing the contrast between the two randomised groups. Randomisation of doctors (clusters) is preferable, although this requires more patients.37 Randomising centres is often the best method as it avoids exchange of experiences between doctors within a single centre.
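The extra patients needed under cluster randomisation are commonly approximated with the design effect, 1 + (m − 1) × ICC, where m is the number of patients per cluster and ICC is the intracluster correlation coefficient. The sketch below applies that standard approximation, assuming equal cluster sizes; the function names and all input values are hypothetical.

```python
import math


def design_effect(patients_per_cluster, icc):
    """Inflation factor for a cluster randomised design (equal cluster sizes)."""
    return 1.0 + (patients_per_cluster - 1) * icc


def cluster_trial_sample_size(n_individual, patients_per_cluster, icc):
    """Total patients needed when randomising clusters, given the number
    that would be needed if individual patients were randomised."""
    inflated = n_individual * design_effect(patients_per_cluster, icc)
    # Round before taking the ceiling to avoid floating point overshoot.
    return math.ceil(round(inflated, 9))


# Hypothetical inputs: 400 patients under individual randomisation,
# 20 patients per doctor, intracluster correlation of 0.05.
n_total = cluster_trial_sample_size(400, patients_per_cluster=20, icc=0.05)
print(n_total, "patients, or about", math.ceil(n_total / 20), "doctors")
```

The larger the clusters or the more alike patients of the same doctor or centre are, the greater the inflation, which is why randomising centres typically needs more patients than randomising doctors.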
An alternative design is a before and after study with the same doctors or centres, as was used to evaluate the effect of the Ottawa ankle rule on physicians’ behaviour and cost effectiveness of care.38 39 A disadvantage of this design is the sensitivity to temporal changes in therapeutic approaches. Although impact studies are scarce, a few good examples exist.40 41 42
When to apply a prognostic model?
Do all prognostic models require a three step assessment (box) before they are used in daily care? Does a model that has shown adequate prediction for its intended use in validation studies still require an impact analysis using a large, multicentre cluster randomised study? The answers depend on the rate of (acceptable) false positive and false negative predictions and their consequences for patient management and outcome. For models with (near) perfect discrimination and calibration in several validation studies the answer may be no, though such success is rare. An example is a model to predict the differential diagnosis of acute meningitis. It showed an area under the ROC curve of 0.97 in the development population43 and of 0.98 in two validation populations.44 45
For models with less perfect performance, only an impact analysis can determine whether use of the model is better than usual care. Impact studies also provide the opportunity to study factors that may affect implementation of a prognostic model in daily care, including the acceptability of the prognostic model to clinicians and ease of use.
An intermediate step using decision modelling techniques or Markov chain models can be helpful. These evaluate the potential consequences of using the prognostic model in terms of subsequent therapeutic decisions and patient outcome.46 If such analysis does not indicate any potential for improved patient outcome, a formal impact study would not be warranted.
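A very reduced form of such a decision analysis compares the expected event rate under three strategies: treat nobody, treat everybody, and treat only patients whose predicted risk exceeds a threshold. The sketch below is illustrative only. It assumes that the model is well calibrated, that treatment lowers each patient's risk by a constant relative amount, and it ignores treatment harms and costs; the function names, risks, threshold, and treatment effect are all hypothetical.

```python
def expected_event_rate(risks, treated, relative_risk_reduction):
    """Average expected event probability given who is treated.

    risks: predicted outcome probabilities without treatment
    treated: booleans, True if the patient receives treatment
    """
    total = 0.0
    for risk, is_treated in zip(risks, treated):
        total += risk * (1.0 - relative_risk_reduction) if is_treated else risk
    return total / len(risks)


def compare_strategies(risks, threshold, relative_risk_reduction):
    """Expected event rates for treat none, treat all, and model guided care."""
    n = len(risks)
    guided = [risk >= threshold for risk in risks]
    return {
        "treat_none": expected_event_rate(risks, [False] * n, relative_risk_reduction),
        "treat_all": expected_event_rate(risks, [True] * n, relative_risk_reduction),
        "model_guided": expected_event_rate(risks, guided, relative_risk_reduction),
        "fraction_treated_model_guided": sum(guided) / n,
    }


# Hypothetical cohort of four patients, threshold of 30%, treatment halves risk.
summary = compare_strategies([0.10, 0.20, 0.40, 0.50], threshold=0.30,
                             relative_risk_reduction=0.5)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

If model guided care gains little over treating everybody or nobody once harms and costs are added to such a comparison, a formal impact study is unlikely to be worthwhile.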
Many prognostic models are developed but few have their predictive performance validated in new patients, let alone their impact on decision making and patient outcome evaluated.3 47 48 Thus it seems right that few such models are actually used in practice. Recent methodological advances enable the adjustment of prognostic models to local circumstances to give improved generalisability. With these innovations, correctly developed and evaluated prediction models may become more common.
Many questions remain unresolved. How much validation, and perhaps adjustment, is needed before an impact study is justified? Is it feasible for a single model to apply to all patient subgroups, across all levels of care and countries? These issues require further research. Finally, we reiterate that unvalidated models should not be used in clinical practice, and more impact studies are needed to determine whether a prognostic or diagnostic model should be implemented in daily practice.
Cite this as: BMJ 2009;338:b606
This article is the last in a series of four aiming to provide an accessible overview of the principles and methods of prognostic research
KGMM and YV are supported by the Netherlands Organization for Scientific Research (ZON-MW 917.46.360). PR is supported by the UK Medical Research Council. DGA is supported by Cancer Research UK.
Contributors: This series was conceived and planned by DGA, KGMM, PR, and YV. KGMM wrote the first draft of this paper. All the authors contributed to subsequent revisions. KGMM is guarantor.
Competing interests: None declared.
Provenance and peer review: Not commissioned; externally peer reviewed.