A calibration hierarchy for risk models was defined: from utopia to empirical data
Introduction
There is increasing attention to the use of risk prediction models to support medical decision making. Discriminatory performance is commonly the main focus in the evaluation of performance, whereas calibration often receives less attention [1]. A prediction model is calibrated in a given population if the predicted risks are reliable, that is, if they correspond to the observed proportions of the event. Commonly, calibration is defined as follows: "for patients with a predicted risk of R%, on average R out of 100 should indeed suffer from the disease or event of interest." Calibration is a pivotal aspect of model performance [2], [3], [4]: "For informing patients and medical decision making, calibration is the primary requirement" [2]; "If the model is not […] well calibrated, it must be regarded as not having been validated […]. To evaluate classification performance […] is inappropriate" [4].
Recently, a stronger definition of calibration has been emphasized in contrast to the definition of calibration given previously [4], [5]. Models are considered strongly calibrated if predicted risks are accurate for each and every covariate pattern. In this paper, we aim to define different levels of calibration and describe implications for model development, external validation of predictions, and clinical decision making. We focus on predicting binary end points (event vs. no event) and assume that a logistic regression model is developed in a derivation sample with performance assessment in a validation sample. We expand on examples used in recent work by Vach [5].
Methods
We assume that the predicted risks are obtained from a previously developed prediction model for outcome Y (1 = event, 0 = nonevent), for example, based on logistic regression analysis. The model provides a constant (model intercept) and a set of effects (model coefficients). The linear combination of the coefficients with the covariate values in a validation set defines the linear predictor L: L = a + b1 × x1 + b2 × x2 + … + bi × xi, where a is the model intercept, b1 to bi are the regression coefficients, and x1 to xi are the covariate values.
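As a minimal sketch of the setup above, the following code computes a predicted risk from a linear predictor under a logistic model. The intercept, coefficients, and covariate values are hypothetical, chosen only for illustration:

```python
import numpy as np

def predicted_risk(intercept, coefs, x):
    """Predicted event probability from a logistic model:
    L = a + b1*x1 + ... + bi*xi, risk = 1 / (1 + exp(-L))."""
    L = intercept + np.dot(coefs, x)
    return 1.0 / (1.0 + np.exp(-L))

# Hypothetical model: a = -2.0, b = (0.8, 1.2); patient covariates x = (1, 0.5)
risk = predicted_risk(-2.0, np.array([0.8, 1.2]), np.array([1.0, 0.5]))
print(f"predicted risk: {risk:.3f}")  # L = -0.6, risk ≈ 0.354
```

The same function applies to any number of covariates, since `np.dot` takes the full linear combination at once.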
Calibration, decision making, and clinical utility
Strong calibration implies that an accurate risk prediction is obtained for every covariate pattern. Hence, a strongly calibrated model allows the communication of accurate risks to every individual patient. In contrast, a moderately calibrated model allows the communication of a reliable average risk for patients with the same predicted risk: among patients with a predicted risk of 70%, on average 70 of 100 have the event, although there may exist relevant subgroups with different covariate patterns whose true risks deviate from 70%.
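Moderate calibration, as defined above, can be inspected empirically by grouping patients on their predicted risk and comparing the mean predicted risk with the observed event proportion in each group. The sketch below simulates a validation set in which predictions are moderately calibrated by construction; the data and grouping into deciles are illustrative assumptions, not part of the article's analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated validation set: outcomes are generated from the predicted
# risks, so the model is moderately calibrated by construction.
p = rng.uniform(0.05, 0.95, size=20000)  # predicted risks
y = rng.binomial(1, p)                   # observed binary outcomes

# Group by deciles of predicted risk and compare means per group.
edges = np.quantile(p, np.linspace(0, 1, 11))
groups = np.clip(np.digitize(p, edges[1:-1]), 0, 9)
for g in range(10):
    m = groups == g
    print(f"decile {g}: mean predicted {p[m].mean():.3f}, "
          f"observed proportion {y[m].mean():.3f}")
```

In each decile the observed proportion tracks the mean predicted risk, which is exactly what moderate calibration requires; strong calibration would additionally demand agreement within every covariate pattern, which this grouping cannot verify.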
Strong calibration: realistic or utopic?
In line with Vach's work [5], we find that moderate calibration does not imply that the prediction model is "valid" in a strong sense. In principle, we should aim for strong calibration because this makes predictions accurate at the individual patient's level as well as at the group level, leading to better decisions on average. However, we consider four problems in empirical analyses. First, strong calibration requires that the model form (e.g., a generalized linear model such as logistic regression) is correctly specified.
Moderate calibration: a pragmatic guarantee for nonharmful decision making
Focusing on finding at least moderately calibrated models has several advantages. First, it is a realistic goal in epidemiologic research, where empirical data sets are often of relatively limited size and the signal-to-noise ratio is unfavorable [31]. Second, moderate calibration guarantees that decision making based on the model is not clinically harmful; conversely, calibration in only a weak sense may still result in harmful decision making [18]. Third,
A link with model updating
In model updating, we adapt a model that has poor performance at external validation [34]. Basic updating approaches include, in order of complexity, intercept adjustment, recalibration, and refitting [34], [35]. There are parallels between updating methods and levels of calibration. Intercept adjustment updates the linear predictor L to a + L; this addresses only calibration-in-the-large and does not guarantee weak calibration. A more complex updating method involves logistic recalibration, which updates L to a + b × L.
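Logistic recalibration amounts to fitting a logistic regression of the observed outcomes on the linear predictor L in the validation set, which yields the updated intercept a and slope b. The sketch below illustrates this on simulated data, using a hand-rolled Newton-Raphson fit rather than a statistical package; the miscalibration parameters (-0.5 and 0.7) are arbitrary assumptions:

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Fit logistic regression by Newton-Raphson; X includes a constant column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)                      # observation weights
        hess = (X * W[:, None]).T @ X        # Fisher information
        beta += np.linalg.solve(hess, X.T @ (y - p))
    return beta

rng = np.random.default_rng(1)
# Simulated validation set where the original model's linear predictor L
# is systematically miscalibrated: true log-odds = -0.5 + 0.7 * L.
L = rng.normal(0.0, 1.5, size=50000)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.5 + 0.7 * L))))

X = np.column_stack([np.ones_like(L), L])
a, b = fit_logistic(X, y)
print(f"recalibration intercept {a:.2f}, slope {b:.2f}")  # near -0.5 and 0.7
```

Intercept adjustment corresponds to the special case b = 1 (only a is re-estimated), while refitting re-estimates every coefficient of the original model separately.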
Statistical testing for calibration
We mainly focused on conceptual issues in assessing calibration of predictions from statistical models. We did not consider statistical testing in detail, and in this area, the assessment of statistical power needs further study. In previous simulations, the Hosmer-Lemeshow test showed such poor performance that it cannot be recommended for routine use [7], [37]. In practice, indications of uncertainty such as confidence intervals are far more important than a statistical test.
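One way to report uncertainty rather than a test statistic is a Wald confidence interval for the calibration slope, obtained from the inverse Fisher information of the recalibration fit. The sketch below assumes simulated data from a calibrated model (true slope 1) and uses a hand-rolled Newton-Raphson fit; it is illustrative only:

```python
import numpy as np

def fit_logistic_with_se(X, y, iters=25):
    """Newton-Raphson logistic fit returning coefficients and standard errors."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)
        hess = (X * W[:, None]).T @ X        # Fisher information
        beta += np.linalg.solve(hess, X.T @ (y - p))
    se = np.sqrt(np.diag(np.linalg.inv(hess)))
    return beta, se

rng = np.random.default_rng(2)
L = rng.normal(0.0, 1.5, size=5000)              # linear predictor in validation set
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-L)))    # outcomes from a calibrated model

X = np.column_stack([np.ones_like(L), L])
(a, b), (se_a, se_b) = fit_logistic_with_se(X, y)
print(f"calibration slope {b:.2f}, "
      f"95% CI [{b - 1.96 * se_b:.2f}, {b + 1.96 * se_b:.2f}]")
```

The width of the interval conveys how precisely the validation sample pins down the slope, which a p-value from a goodness-of-fit test does not.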
Conclusion and recommendations
We conclude that strong calibration, although desirable for individual risk communication, is unrealistic in empirical medical research. Focusing on obtaining prediction models that are calibrated in the moderate sense is a more attainable goal, in line with the most common definition of the notion of "calibration of predictions." In support of this view, we proved that moderate calibration guarantees that clinically nonharmful decisions are made based on the model. This guarantee cannot be given for models that are calibrated only in the weak sense.
Acknowledgments
The authors thank Laure Wynants for proofreading the article.
References (37)
- Calibration of clinical prediction rules does not just assess bias. J Clin Epidemiol (2013)
- Everything you always wanted to know about evaluating prediction models (but were too afraid to ask). Urology (2010)
- A spline-based tool to assess and visualize the calibration of multiclass risk predictions. J Biomed Inform (2015)
- The number of subjects per variable required in linear regression analyses. J Clin Epidemiol (2015)
- Simple dichotomous updating methods improved the validity of polytomous prediction models. J Clin Epidemiol (2013)
- Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. J Clin Epidemiol (2005)
- External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol (2014)
- Clinical prediction models. A practical approach to development, validation, and updating (2009)
- Probabilistic classifiers with high-dimensional data. Biostatistics (2011)
- Methods for evaluating prediction performance of biomarkers and tests
- Regression modeling strategies. With applications to linear models, logistic regression, and survival analysis
- A comparison of goodness-of-fit tests for the logistic regression model. Stat Med
- Evaluating the PCPT risk calculator in ten international biopsy cohorts: results from the Prostate Biopsy Collaborative Group. World J Urol
- Two further applications of a model for binary regression. Biometrika
- Validation of probabilistic predictions. Med Decis Making
- Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers. Stat Med
- Sample size considerations for the external validation of a multivariable prognostic model: a resampling study. Stat Med
- Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology
Funding: This study was supported in part by the Research Foundation Flanders (FWO) (grants G049312N and G0B4716N) and by Internal Funds KU Leuven (grant C24/15/037).
Conflict of interest: None.