A calibration hierarchy for risk models was defined: from utopia to empirical data

Ben Van Calster; Daan Nieboer; Yvonne Vergouwe; Bavo De Cock; Michael J Pencina; Ewout W Steyerberg

doi:10.1016/j.jclinepi.2015.12.005

J Clin Epidemiol. 2016 Jun:74:167-76. doi: 10.1016/j.jclinepi.2015.12.005. Epub 2016 Jan 6.

Authors

Ben Van Calster¹, Daan Nieboer², Yvonne Vergouwe², Bavo De Cock³, Michael J Pencina⁴, Ewout W Steyerberg²

Affiliations

¹ KU Leuven, Department of Development and Regeneration, Herestraat 49 Box 7003, 3000 Leuven, Belgium; Department of Public Health, Erasmus MC, 's-Gravendijkwal 230, 3015 CE Rotterdam, The Netherlands. Electronic address: ben.vancalster@med.kuleuven.be.
² Department of Public Health, Erasmus MC, 's-Gravendijkwal 230, 3015 CE Rotterdam, The Netherlands.
³ KU Leuven, Department of Development and Regeneration, Herestraat 49 Box 7003, 3000 Leuven, Belgium.
⁴ Duke Clinical Research Institute, Duke University, 2400 Pratt Street, Durham, NC 27705, USA; Department of Biostatistics and Bioinformatics, Duke University, 2424 Erwin Road, Durham, NC 27719, USA.

PMID: 26772608

Abstract

Objective: Calibrated risk models are vital for valid decision support. We define four levels of calibration and describe implications for model development and external validation of predictions.

Study design and setting: We present results based on simulated data sets.

Results: A common definition of calibration is "having an event rate of R% among patients with a predicted risk of R%," which we refer to as "moderate calibration." Weaker forms of calibration only require the average predicted risk (mean calibration) or the average prediction effects (weak calibration) to be correct. "Strong calibration" requires that the event rate equals the predicted risk for every covariate pattern. This implies that the model is fully correct for the validation setting. We argue that this is unrealistic: the model type may be incorrect, the linear predictor is only asymptotically unbiased, and all nonlinear and interaction effects would have to be correctly modeled. In addition, we prove that moderate calibration guarantees nonharmful decision making. Finally, results indicate that a flexible assessment of calibration in small validation data sets is problematic.
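
The four levels can be written compactly as conditions on the predicted risk. The block below is a sketch, not the authors' own notation: the symbols Y (binary outcome), x (covariate pattern), and \hat{p}(x) (predicted risk) are assumed here, and weak calibration is read as the usual calibration intercept and slope from a logistic recalibration model, which the abstract describes only as "average prediction effects."

```latex
% Assumed notation: Y = binary outcome, x = covariate pattern,
% \hat{p}(x) = predicted risk, a and b = calibration intercept and slope.
\begin{align*}
\text{Mean:}     \quad & \mathrm{E}[Y] = \mathrm{E}[\hat{p}(x)] \\
\text{Weak:}     \quad & \operatorname{logit} P(Y = 1 \mid \hat{p}) = a + b \operatorname{logit} \hat{p},
                         \quad a = 0,\; b = 1 \\
\text{Moderate:} \quad & P(Y = 1 \mid \hat{p}(x) = r) = r \quad \text{for all } r \\
\text{Strong:}   \quad & P(Y = 1 \mid x) = \hat{p}(x) \quad \text{for every } x
\end{align*}
```

Read top to bottom, each condition is stricter than the one before it, which is what makes the levels a hierarchy.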

Conclusion: Strong calibration is desirable for individualized decision support but is unrealistic and counterproductive because it stimulates the development of overly complex models. Model development and external validation should focus on moderate calibration.
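
As a rough illustration of what checking the lower levels of the hierarchy might look like at validation, the following Python sketch uses simulated data. Everything in it is an assumption for illustration: the function names, the simulated setting, and the use of quantile groups as a crude stand-in for a flexible (for example loess-based) calibration curve. It is not the authors' code.

```python
import numpy as np

def logit(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return np.log(p / (1.0 - p))

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_calibration(y, p):
    """Observed event rate minus average predicted risk (0 is ideal)."""
    return float(np.mean(y) - np.mean(p))

def weak_calibration(y, p, n_iter=25):
    """Fit logit P(y=1) = a + b * logit(p) by Newton-Raphson.

    Returns (a, b); a = 0 and b = 1 indicate weak calibration.
    """
    X = np.column_stack([np.ones_like(p), logit(p)])
    beta = np.array([0.0, 1.0])
    for _ in range(n_iter):
        mu = expit(X @ beta)
        w = mu * (1.0 - mu)
        grad = X.T @ (y - mu)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return float(beta[0]), float(beta[1])

def grouped_calibration(y, p, n_groups=10):
    """(mean predicted risk, observed event rate) per quantile group of p."""
    order = np.argsort(p)
    groups = np.array_split(order, n_groups)
    return [(float(p[g].mean()), float(y[g].mean())) for g in groups]

# Hypothetical validation setting: one predictor, known true risk.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
p_true = expit(-1.0 + x)
y = rng.binomial(1, p_true).astype(float)

# Two sets of predictions: the true risks, and predictions that are
# too extreme (predictor effect doubled, as an overfitted model might give).
p_correct = p_true
p_overfit = expit(-1.0 + 2.0 * x)

mean_ok = mean_calibration(y, p_correct)
a_ok, b_ok = weak_calibration(y, p_correct)
a_over, b_over = weak_calibration(y, p_overfit)

gap_ok = max(abs(pred - obs) for pred, obs in grouped_calibration(y, p_correct))
gap_over = max(abs(pred - obs) for pred, obs in grouped_calibration(y, p_overfit))

print("correct predictions: intercept, slope =", a_ok, b_ok)
print("too-extreme predictions: intercept, slope =", a_over, b_over)
print("largest grouped gap (correct, too extreme) =", gap_ok, gap_over)
```

In this simulated setting the too-extreme predictions should give a calibration slope well below 1 and visible gaps between predicted and observed risk in the outer groups, while the true risks should pass all three checks. With a small validation set the grouped (or any flexible) assessment becomes noisy, consistent with the caution in the Results.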

Keywords: Calibration; Decision curve analysis; External validation; Loess; Overfitting; Risk prediction models.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Bias
Calibration
Computer Simulation
Decision Support Techniques*
Humans
Models, Statistical*
Reproducibility of Results
Risk Assessment
Risk*