A calibration hierarchy for risk models was defined: from utopia to empirical data
Introduction
There is increasing attention to the use of risk prediction models to support medical decision making. Discriminatory performance is commonly the main focus in the evaluation of performance, whereas calibration often receives less attention [1]. A prediction model is calibrated in a given population if the predicted risks are reliable, that is, if they correspond to the observed proportions of the event. Commonly, calibration is defined as follows: "for patients with a predicted risk of R%, on average R out of 100 should indeed suffer from the disease or event of interest." Calibration is a pivotal aspect of model performance [2], [3], [4]: "For informing patients and medical decision making, calibration is the primary requirement" [2]; "If the model is not […] well calibrated, it must be regarded as not having been validated […]. To evaluate classification performance […] is inappropriate" [4].
Recently, a stronger definition of calibration has been emphasized in contrast to the definition of calibration given previously [4], [5]. Models are considered strongly calibrated if predicted risks are accurate for each and every covariate pattern. In this paper, we aim to define different levels of calibration and describe implications for model development, external validation of predictions, and clinical decision making. We focus on predicting binary end points (event vs. no event) and assume that a logistic regression model is developed in a derivation sample with performance assessment in a validation sample. We expand on examples used in recent work by Vach [5].
Methods
We assume that the predicted risks are obtained from a previously developed prediction model for outcome Y (1 = event, 0 = nonevent), for example, based on logistic regression analysis. The model provides a constant (model intercept) and a set of effects (model coefficients). The linear combination of the coefficients with the covariate values in a validation set defines the linear predictor L: L = a + b1 × x1 + b2 × x2 + … + bi × xi, where a is the model intercept, b1 to bi are the regression coefficients, and x1 to xi are the covariate values.
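As a minimal sketch of the setup above, the following code computes a predicted risk from a linear predictor under a logistic model. The intercept, coefficients, and covariate values are hypothetical, chosen only for illustration:

```python
import numpy as np

def predicted_risk(intercept, coefs, x):
    """Predicted event probability from a logistic model:
    L = a + b1*x1 + ... + bi*xi, risk = 1 / (1 + exp(-L))."""
    L = intercept + np.dot(coefs, x)
    return 1.0 / (1.0 + np.exp(-L))

# Hypothetical model: a = -2.0, b = (0.8, 1.2); patient covariates x = (1, 0.5)
risk = predicted_risk(-2.0, np.array([0.8, 1.2]), np.array([1.0, 0.5]))
print(f"predicted risk: {risk:.3f}")  # L = -0.6, risk ≈ 0.354
```

The same function applies to any number of covariates, since `np.dot` takes the full linear combination at once.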
Calibration, decision making, and clinical utility
Strong calibration implies that an accurate risk prediction is obtained for every covariate pattern. Hence, a strongly calibrated model allows the communication of accurate risks to every individual patient. In contrast, a moderately calibrated model allows the communication of a reliable average risk for patients with the same predicted risk: among patients with a predicted risk of 70%, on average 70 of 100 have the event, although there may exist relevant subgroups with different covariate patterns whose true risks deviate from 70%.
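Moderate calibration, as defined above, can be inspected empirically by grouping patients on their predicted risk and comparing the mean predicted risk with the observed event proportion in each group. The sketch below simulates a validation set in which predictions are moderately calibrated by construction; the data and grouping into deciles are illustrative assumptions, not part of the article's analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated validation set: outcomes are generated from the predicted
# risks, so the model is moderately calibrated by construction.
p = rng.uniform(0.05, 0.95, size=20000)  # predicted risks
y = rng.binomial(1, p)                   # observed binary outcomes

# Group by deciles of predicted risk and compare means per group.
edges = np.quantile(p, np.linspace(0, 1, 11))
groups = np.clip(np.digitize(p, edges[1:-1]), 0, 9)
for g in range(10):
    m = groups == g
    print(f"decile {g}: mean predicted {p[m].mean():.3f}, "
          f"observed proportion {y[m].mean():.3f}")
```

In each decile the observed proportion tracks the mean predicted risk, which is exactly what moderate calibration requires; strong calibration would additionally demand agreement within every covariate pattern, which this grouping cannot verify.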
Strong calibration: realistic or utopic?
In line with Vach's work [5], we find that moderate calibration does not imply that the prediction model is "valid" in a strong sense. In principle, we should aim for strong calibration because this makes predictions accurate at the individual patient's level as well as at the group level, leading to better decisions on average. However, we consider four problems in empirical analyses. First, strong calibration requires that the model form (e.g., a generalized linear model such as logistic regression) is correctly specified.
Moderate calibration: a pragmatic guarantee for nonharmful decision making
Focusing on finding at least moderately calibrated models has several advantages. First, it is a realistic goal in epidemiologic research, where empirical data sets are often of relatively limited size and the signal-to-noise ratio is unfavorable [31]. Second, moderate calibration guarantees that decision making based on the model is not clinically harmful; conversely, calibration in only a weak sense may still result in harmful decision making [18]. Third,
A link with model updating
In model updating, we adapt a model that has poor performance at external validation [34]. Basic updating approaches include, in order of complexity, intercept adjustment, recalibration, and refitting [34], [35]. There are parallels between updating methods and levels of calibration. Intercept adjustment updates the linear predictor L to a + L; this addresses only calibration-in-the-large and does not guarantee weak calibration. A more complex updating method involves logistic recalibration, which updates L to a + b × L.
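Logistic recalibration amounts to fitting a logistic regression of the observed outcomes on the linear predictor L in the validation set, which yields the updated intercept a and slope b. The sketch below illustrates this on simulated data, using a hand-rolled Newton-Raphson fit rather than a statistical package; the miscalibration parameters (-0.5 and 0.7) are arbitrary assumptions:

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Fit logistic regression by Newton-Raphson; X includes a constant column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)                      # observation weights
        hess = (X * W[:, None]).T @ X        # Fisher information
        beta += np.linalg.solve(hess, X.T @ (y - p))
    return beta

rng = np.random.default_rng(1)
# Simulated validation set where the original model's linear predictor L
# is systematically miscalibrated: true log-odds = -0.5 + 0.7 * L.
L = rng.normal(0.0, 1.5, size=50000)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.5 + 0.7 * L))))

X = np.column_stack([np.ones_like(L), L])
a, b = fit_logistic(X, y)
print(f"recalibration intercept {a:.2f}, slope {b:.2f}")  # near -0.5 and 0.7
```

Intercept adjustment corresponds to the special case b = 1 (only a is re-estimated), while refitting re-estimates every coefficient of the original model separately.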
Statistical testing for calibration
We mainly focused on conceptual issues in assessing calibration of predictions from statistical models. We did not consider statistical testing in detail, and in this area, the assessment of statistical power needs further study. In previous simulations, the Hosmer-Lemeshow test showed such poor performance that it cannot be recommended for routine use [7], [37]. In practice, indications of uncertainty such as confidence intervals are far more important than a statistical test.
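One way to report uncertainty rather than a test statistic is a Wald confidence interval for the calibration slope, obtained from the inverse Fisher information of the recalibration fit. The sketch below assumes simulated data from a calibrated model (true slope 1) and uses a hand-rolled Newton-Raphson fit; it is illustrative only:

```python
import numpy as np

def fit_logistic_with_se(X, y, iters=25):
    """Newton-Raphson logistic fit returning coefficients and standard errors."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)
        hess = (X * W[:, None]).T @ X        # Fisher information
        beta += np.linalg.solve(hess, X.T @ (y - p))
    se = np.sqrt(np.diag(np.linalg.inv(hess)))
    return beta, se

rng = np.random.default_rng(2)
L = rng.normal(0.0, 1.5, size=5000)              # linear predictor in validation set
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-L)))    # outcomes from a calibrated model

X = np.column_stack([np.ones_like(L), L])
(a, b), (se_a, se_b) = fit_logistic_with_se(X, y)
print(f"calibration slope {b:.2f}, "
      f"95% CI [{b - 1.96 * se_b:.2f}, {b + 1.96 * se_b:.2f}]")
```

The width of the interval conveys how precisely the validation sample pins down the slope, which a p-value from a goodness-of-fit test does not.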
Conclusion and recommendations
We conclude that strong calibration, although desirable for individual risk communication, is unrealistic in empirical medical research. Focusing on obtaining prediction models that are calibrated in the moderate sense is a more attainable goal, in line with the most common definition of the notion of "calibration of predictions." In support of this view, we proved that moderate calibration guarantees that clinically nonharmful decisions are made based on the model. This guarantee cannot be given for models that are calibrated only in the weak sense.
Acknowledgments
The authors thank Laure Wynants for proofreading the article.
References (37)
- Calibration of clinical prediction rules does not just assess bias. J Clin Epidemiol (2013)
- Everything you always wanted to know about evaluating prediction models (but were too afraid to ask). Urology (2010)
- A spline-based tool to assess and visualize the calibration of multiclass risk predictions. J Biomed Inform (2015)
- The number of subjects per variable required in linear regression analyses. J Clin Epidemiol (2015)
- Simple dichotomous updating methods improved the validity of polytomous prediction models. J Clin Epidemiol (2013)
- Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. J Clin Epidemiol (2005)
- External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol (2014)
- Clinical prediction models. A practical approach to development, validation, and updating (2009)
- Probabilistic classifiers with high-dimensional data. Biostatistics (2011)
- Methods for evaluating prediction performance of biomarkers and tests
- Regression modeling strategies. With applications to linear models, logistic regression, and survival analysis
- A comparison of goodness-of-fit tests for the logistic regression model. Stat Med
- Evaluating the PCPT risk calculator in ten international biopsy cohorts: results from the Prostate Biopsy Collaborative Group. World J Urol
- Two further applications of a model for binary regression. Biometrika
- Validation of probabilistic predictions. Med Decis Making
- Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers. Stat Med
- Sample size considerations for the external validation of a multivariable prognostic model: a resampling study. Stat Med
- Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology
Funding: This study was supported in part by the Research Foundation Flanders (FWO) (grants G049312N and G0B4716N) and by Internal Funds KU Leuven (grant C24/15/037).
Conflict of interest: None.