Validation of prediction models in the presence of competing risks: a guide through modern methods
BMJ 2022; 377 doi: https://doi.org/10.1136/bmj-2021-069249 (Published 24 May 2022) Cite this as: BMJ 2022;377:e069249
- Nan van Geloven, biostatistician1,
- Daniele Giardiello, biostatistician1 2,
- Edouard F Bonneville, biostatistician1,
- Lucy Teece, biostatistician3,
- Chava L Ramspek, medical doctor4,
- Maarten van Smeden, biostatistician5,
- Kym I E Snell, biostatistician3,
- Ben van Calster, biostatistician1 6,
- Maja Pohar-Perme, biostatistician7,
- Richard D Riley, biostatistician3,
- Hein Putter, biostatistician1,
- Ewout Steyerberg, biostatistician1 8
- on behalf of the STRATOS initiative
- 1Department of Biomedical Data Sciences, Leiden University Medical Centre, Leiden, Netherlands
- 2Netherlands Cancer Institute, Amsterdam, Netherlands
- 3Centre for Prognosis Research, Research Institute for Primary Care and Health Sciences, Keele University, Keele, UK
- 4Department of Clinical Epidemiology, Leiden University Medical Centre, Leiden, Netherlands
- 5Department of Epidemiology, University Medical Centre Utrecht, Utrecht, Netherlands
- 6Department of Development and Regeneration, KU Leuven, Leuven, Belgium
- 7Department of Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
- 8Department of Public Health, Erasmus MC-University Medical Centre, Rotterdam, Netherlands
- Correspondence to: E Steyerberg
- Accepted 8 April 2022
Prediction models are pivotal for counselling patients about their prognosis and for risk stratification.1 Interest often lies in predicting a non-fatal adverse event over a certain time period, for example, breast cancer recurrence within five years after diagnosis. As study populations of common diseases increasingly consist of elderly individuals with high degrees of multimorbidity, patients will experience other events that preclude the occurrence of the event of interest.2 For example, a patient with a previous breast cancer who dies from a cardiovascular cause can no longer experience breast cancer recurrence. In these settings, prediction models should target the cumulative incidence (or absolute risk3) of the adverse event, which is defined as the probability of the event of interest occurring by a particular time point with no other competing event occurring earlier. In the breast cancer example, the cumulative incidence of recurrence at five years is the risk of developing a recurrence within five years, taking into account that patients who die within five years cannot develop recurrence anymore. Failing to account for competing events during model development leads to overestimation of the cumulative incidence.4 The higher the risk of the competing event, the more pronounced the overestimation. Crucially, failure to account for competing events during validation leads to a distorted view on model performance, especially for calibration.
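The overestimation described above can be illustrated numerically. The following is a minimal sketch in Python (the article's own code is in R), using six hypothetical patients. It compares the Aalen-Johansen estimate of cumulative incidence with the naive estimate obtained when competing events are treated as censoring (1 minus Kaplan-Meier). The function names and the toy data are assumptions made for illustration only.

```python
import numpy as np

def aalen_johansen(time, status, horizon, cause=1):
    """Cumulative incidence of `cause` by `horizon`.
    status: 0 = censored, 1 = primary event, 2 = competing event."""
    time = np.asarray(time, dtype=float)
    status = np.asarray(status)
    surv = 1.0  # probability of being free of any event just before t
    cif = 0.0
    for t in np.unique(time[(status != 0) & (time <= horizon)]):
        at_risk = np.sum(time >= t)
        d_cause = np.sum((time == t) & (status == cause))
        d_any = np.sum((time == t) & (status != 0))
        cif += surv * d_cause / at_risk
        surv *= 1.0 - d_any / at_risk
    return cif

def one_minus_km(time, status, horizon, cause=1):
    """Naive estimate: competing events are handled as if they were censoring."""
    time = np.asarray(time, dtype=float)
    status = np.asarray(status)
    surv = 1.0
    for t in np.unique(time[(status == cause) & (time <= horizon)]):
        at_risk = np.sum(time >= t)
        d_cause = np.sum((time == t) & (status == cause))
        surv *= 1.0 - d_cause / at_risk
    return 1.0 - surv

# Hypothetical toy data (times in years)
time = [1, 2, 3, 4, 5, 6]
status = [2, 1, 2, 1, 0, 1]

aj = aalen_johansen(time, status, horizon=5)
naive = one_minus_km(time, status, horizon=5)
print(f"Aalen-Johansen: {aj:.3f}, naive 1-KM: {naive:.3f}")
```

In this toy example the naive estimate is the larger of the two, because patients who died without recurrence are treated as if they could still develop recurrence later.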
Such distortion was recently revealed for an internationally recommended prediction model of kidney failure, which systematically overestimated the absolute risk of kidney failure at five years in patients with advanced chronic kidney disease. The absolute overestimation by 10 percentage points on average and by 37 percentage points in the highest risk group could have resulted in overtreatment of patients, which therefore led to the conclusion that the model was unfit for use in this population. This overestimation was missed in previous validation efforts that ignored the competing event of death.56 We present model performance obtained when ignoring the competing risk and when accounting for it side by side in supplementary material 1.
For predicting binary and time-to-event outcomes, useful guidance on how to perform model validation exists.78910 For time-to-event outcomes with competing risks, validation guidance is currently spread out over many technical papers, which hampers the uptake of appropriate methods in medical research. We aim to provide an accessible overview of contemporary performance measures for time-to-event outcomes with competing risks. Our overview was made on behalf of the international STRengthening Analytical Thinking for Observational Studies (STRATOS) initiative (http://stratos-initiative.org), which aims to provide guidance documents for relevant topics in the design and analysis of observational studies for a non-specialist audience.11 We focus on how to calculate and interpret performance measures with illustration using a breast cancer prediction model, including accompanying R code. Box 1 provides a list of glossary terms used for the case study and throughout the article.
Patients: Can also refer to individuals or participants. We use the term “patients” to match our illustration using breast cancer data.
Competing risks: The competing risks setting has multiple event types that compete for first occurrence. In the case study, these events are breast cancer recurrence and mortality before recurrence.
Primary event: We assume one event type is the primary event of interest. In the case study, the primary event is breast cancer recurrence.
Prediction horizon: Specified duration of time for which predictions are made. In the case study, we focus on five year risks.
Cumulative incidence: Absolute risk of an individual experiencing the primary event during the prediction horizon, taking into account that a patient who experiences a competing event will never experience the primary event.
Primary event indicator: A patient’s primary event status by the end of the prediction horizon. If a patient experienced the primary event before or at that time point, the primary event indicator is 1. If the event indicator is 0, this value could mean that either the patient has not experienced any event by the end of the prediction horizon or the patient experienced a competing event by that time point.
Censoring: When the patient’s event status by the end of the prediction horizon is unknown (eg, owing to loss to follow-up at an earlier time point).
Observed outcome proportion: Observed proportion of patients with the primary event. In a setting without censoring, this proportion is the sum of the primary event indicators divided by the total number of patients. With censoring, the observed outcome proportions have to be estimated while accounting for the incomplete observations. The observed outcome proportion represents the actual underlying cumulative incidence.
Risk estimates (or estimated risks): Estimates of cumulative incidence from the developed prediction model. Typically, risks up to one or a few time points are of particular interest. The performance of these risk estimates needs to be evaluated for new patients.
Validation is a necessary step for prediction models before they are used in clinical practice
In the presence of competing risks, these other risks have to be accounted for at model validation
This article provides a comprehensive overview of performance measures for calibration, discrimination, overall prediction error, and decision curve analysis that account for competing events
Data and the R code used for illustration of the measures are available from https://github.com/survival-lumc/ValidationCompRisks
In this article, we assume that a prediction model has already been developed. The prediction model should have been reported such that it allows calculating the estimates of the cumulative incidence (or absolute risk of an event) at the time point(s) of interest for new patients (supplementary material 2). Our aim is to validate this model in an external dataset while accounting for competing events. Although our focus is on external validation, the same performance measures could also be used during internal validation when combined with techniques such as bootstrapping or cross validation.12 Typically, interest is in the evaluation of the prediction of the primary event occurring by one specific time point. If multiple time points are of interest clinically, we might assess performance at each of these time points or over a time range until the last time point of interest.
Breast cancer case study
For illustration, we considered a simple competing risks prediction model for the cumulative incidence of breast cancer recurrence within five years after diagnosis, developed on the FOCUS cohort, a Dutch cohort of consecutive patients with breast cancer aged 65 years and older. We used cause specific Cox proportional hazards regression modelling with the following four predictors: patient age at diagnosis, tumour size, nodal status, and hormone receptor status (supplementary material 2 and table 1).
We assessed the performance of this model in patient data from the Netherlands Cancer Registry, which is a different dataset to that used for model development. We selected patients aged 70 years or older who received a diagnosis of breast cancer between 2003 and 2009 in the Netherlands, received primary breast surgery, and received no previous neoadjuvant treatment. We used a random subset of 1000 patients from the registry because this selection allowed us to share the individual patient data as open access. Among these 1000 patients, 103 recurrences and 187 non-recurrence deaths occurred within five years (cumulative incidence curve in supplementary fig 1).
Performance measures for risk prediction models and accounting for competing risks
We discuss performance measures for the following four validation aspects: calibration, discrimination, overall prediction error, and decision curve analysis, and give the results of these performance measures in our breast cancer case study. Corresponding R functions are in table 2, and technical descriptions in supplementary material 4.
Calibration refers to the agreement between observed outcome proportions and risk estimates from the prediction model. For example, in the breast cancer cohort, the model predicted a 14% absolute risk of breast cancer recurrence by five years on average. This implies that if the model is well calibrated on average, we expect to observe a recurrence event in about 14% of the patients in the validation set within five years. Ideally, calibration is not only adequate on average (known as calibration in the large), but also across the entire range of predictions.
Calibration plots offer a detailed view on calibration by comparing observed and predicted outcomes among patients with the same estimated risk. The observed outcome proportions and estimated risks by a particular time point of interest are plotted against each other, with deviations from the diagonal signalling miscalibration. A common approach divides individuals into approximately equal groups based on their risk estimates—for example, in tenths of risk defined between deciles. Then, for each group, the observed outcome proportion is plotted against the estimated risk. The main challenge is how to incorporate censored data and competing events into the calculation of the observed outcome proportion. With the grouping approach, the observed outcome proportion can be estimated by use of the Aalen-Johansen estimator (supplementary material 4).131415 However, the grouping approach has been criticised for its arbitrary categorisation and potential loss of information, so we recommend the inclusion of a smoothed curve in the calibration plot.16
One approach to obtaining a smooth curve is to use pseudo-observations. These pseudo-observations replace the primary event indicators, which gives a proxy observed event indicator for all patients, even those who were censored (box 2).17 After this transformation into pseudo-observations, a smooth curve can be obtained using a non-parametric smoother of the observed outcome proportions (from the validation data) versus estimated risks (from the model).1819 An alternative approach was recently proposed where the smoothed curve is obtained as predictions from a flexible regression model (box 2).2021 For both the pseudo-observations approach and the flexible regression approach, the calibration curve will depend on the chosen strength of the smoothing: the span for the pseudo-observations approach and the degree of flexibility (eg, number of knots when using splines) in the flexible regression approach. Advice on these choices can be found elsewhere.1821 The smoothed curve should only be plotted over the range of observed risks and not extrapolated beyond.
The calibration plot for the breast cancer model shows that the predicted cumulative incidence of breast cancer recurrence at five years is too high at the lower range of the estimated risks in the validation cohort (fig 1, estimated using the pseudo-observations approach). The calibration curve using the flexible regression approach showed similar overestimation (available from https://github.com/survival-lumc/ValidationCompRisks).
Numerical summaries of calibration
A simple method to summarise overall calibration (or calibration in the large) by a particular time point is to use a ratio of observed and expected outcomes (O/E ratio). An O/E ratio of 1 indicates perfect calibration in the large, a ratio <1 indicates that on average the model predictions are too high, and a ratio >1 indicates that on average the model predictions are too low. In the presence of competing events, the O/E ratio can be calculated as the ratio of the observed outcome proportion by the prediction horizon (estimated by the Aalen-Johansen estimator13) and the average risk estimated by the prediction model under evaluation. Supplementary material 3 shows an overview of alternative ways to summarise overall calibration.
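As a rough illustration of the O/E ratio, the sketch below (in Python, whereas the article's code is in R) uses ten hypothetical patients with complete follow-up up to the prediction horizon. Without censoring, the observed outcome proportion is simply the mean of the primary event indicators; with censoring, the numerator would instead come from the Aalen-Johansen estimator. All data values are invented for illustration.

```python
import numpy as np

# Hypothetical validation data without censoring.
# Status by the prediction horizon: 0 = event free, 1 = primary event, 2 = competing event
status = np.array([1, 0, 2, 0, 1, 0, 0, 2, 0, 0])
est_risk = np.array([0.40, 0.10, 0.30, 0.20, 0.50, 0.10, 0.20, 0.30, 0.20, 0.20])

# Patients with a competing event count as "no primary event"
observed = np.mean(status == 1)
expected = np.mean(est_risk)
oe_ratio = observed / expected

print(f"observed {observed:.2f}, expected {expected:.2f}, O/E {oe_ratio:.2f}")
```

Here the ratio is below 1, which would indicate that the hypothetical model's predictions are too high on average.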
Another approach to numerically summarise the calibration plot of predictions by a particular time point is by calculating the calibration intercept and calibration slope. For competing risks data, these can be estimated using pseudo-observations, similar to those proposed for ordinary survival.19 Supplementary material 3 shows further details. If on average the risk estimates equal the observed outcome proportions, the calibration intercept will be zero. The calibration slope equals 1 if the strength of the predictors matches the observed strength in the validation set. The calibration intercept and slope can potentially be used for recalibration of existing models to fit better in new populations.2223
Returning to the breast cancer validation cohort where we focus on the cumulative incidence of recurrence up to five years, we observe a somewhat too high estimated risk on average with an O/E ratio of 0.81 (95% confidence interval 0.62 to 0.99; table 3). The calibration intercept was estimated at −0.15, confirming the overestimation. For example, for an estimated risk of 14%, the observed outcome proportion was 1−0.86^(exp(−0.15))=12%. The calibration slope was 1.22 (95% confidence interval 0.84 to 1.60), which would indicate predictions that are slightly too homogeneous but the wide confidence interval precludes any firm conclusions.
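The worked calculation above corresponds to shifting the estimated risk by the calibration intercept on the complementary log-log scale while keeping the slope at 1. A minimal Python sketch of that arithmetic follows; the function name is an assumption for illustration, and the article's own code is in R.

```python
import math

def observed_from_intercept(est_risk, intercept):
    """Observed proportion implied by a calibration intercept on the
    complementary log-log scale, with the calibration slope fixed at 1."""
    return 1.0 - (1.0 - est_risk) ** math.exp(intercept)

# Values taken from the text: estimated risk 14%, calibration intercept -0.15
obs = observed_from_intercept(0.14, -0.15)
print(f"{obs:.3f}")
```

Rounded to whole percentages, this reproduces the 12% quoted in the text.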
Discrimination: c index and area under the receiver operating characteristic curve
As well as being well calibrated, useful prediction models should have discriminative ability, that is, they should assign higher risk estimates to patients who will experience the primary event earlier than others. A commonly used performance measure for assessing discrimination over a certain time range is the c index, also known as the concordance index. The c index assesses the ordering of predictions for all patient pairs, where at least one patient has the event within the prediction horizon and the other is not censored earlier than that event.24 The c index is the proportion of these examinable pairs for which the patient with the higher estimated risk is observed to experience the event sooner than the other patient. Other versions of the c index have been proposed that depend less on the study specific censoring mechanism.2526 The c index typically ranges from 0.5 (no discriminating ability) to 1.0 (perfect ability to discriminate between patients with different outcomes).
In the competing risks setting, two definitions of comparison pairs have been considered (supplementary material 4).27 When the target is evaluating cumulative incidence, we propose to compare pairs where one individual has the primary event within the prediction horizon and the other either has the primary event later or experiences a competing event. Such a pair is considered concordant when the first individual has the higher estimated risk. In the presence of censoring, methods for inverse probability of censoring weighting can be applied to estimate the c index (box 2).2728
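The pair definition proposed above can be sketched as follows. This Python example (the article's code is in R) assumes no censoring, so no inverse probability of censoring weights are needed, and counts tied risk estimates as half concordant. The function name and the five hypothetical patients are assumptions for illustration.

```python
import numpy as np

def c_index_competing(time, status, est_risk, horizon):
    """status: 1 = primary event, 2 = competing event (no censoring assumed)."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        # Patient i must have the primary event within the prediction horizon
        if status[i] != 1 or time[i] > horizon:
            continue
        for j in range(n):
            if j == i:
                continue
            later = time[j] > time[i]  # j is still event free when i has the event
            competing_first = status[j] == 2 and time[j] <= time[i]
            if later or competing_first:
                comparable += 1
                if est_risk[i] > est_risk[j]:
                    concordant += 1.0
                elif est_risk[i] == est_risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical toy data (times in years)
time = np.array([2, 4, 1, 6, 3])
status = np.array([1, 1, 2, 1, 2])
est_risk = np.array([0.5, 0.3, 0.2, 0.1, 0.4])

c_index = c_index_competing(time, status, est_risk, horizon=5)
print(f"c index: {c_index:.3f}")
```

Patients with a competing event remain in the comparison set even after their event, which reflects that they can never experience the primary event.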
Techniques for estimating performance measures from competing risks data in the presence of censoring
A pseudo-observation is used as a proxy measure of the primary event indicator of each patient
The pseudo-observations are calculated as the weighted difference between the cumulative incidence estimate at the prediction horizon based on all patients and the same quantity estimated after leaving that patient out
The advantage of pseudo-observations is that censored patients (for whom the primary event indicator is unknown) will have a pseudo-observation and can contribute to the calculation of the observed outcome proportion in a straightforward way
Smoothing using a flexible regression model
The primary event is regressed on (a complementary log-log transformation of) the risk estimates, using restricted cubic splines to allow a non-linear relation. The shape and degree of smoothing is chosen by specifying the number and location of knots. Austin et al have proposed using a Fine and Gray model in this step2021
Observed outcome proportions are estimated using the flexible regression model for all patients, including patients with a censored event status
Inverse probability of censoring weighting (IPCW)
IPCW can create a hypothetical population that would have been observed had no censoring occurred
This hypothetical population can be achieved by up-weighting patients who are similar to censored patients but remain in the study longer—that is, observations that were not likely to remain in follow-up are up-weighted
The weights are estimated from a survival model with censoring as the outcome
Observations are then weighted inversely to their probability of not being censored
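The leave-one-out construction of pseudo-observations described in this box can be sketched as follows, in Python rather than the R used for the article. The weighted difference is taken as n times the full-sample Aalen-Johansen estimate minus (n−1) times the estimate with one patient left out. Function names and data are hypothetical.

```python
import numpy as np

def aalen_johansen(time, status, horizon, cause=1):
    """Cumulative incidence of `cause` by `horizon`.
    status: 0 = censored, 1 = primary event, 2 = competing event."""
    surv, cif = 1.0, 0.0
    for t in np.unique(time[(status != 0) & (time <= horizon)]):
        at_risk = np.sum(time >= t)
        d_cause = np.sum((time == t) & (status == cause))
        d_any = np.sum((time == t) & (status != 0))
        cif += surv * d_cause / at_risk
        surv *= 1.0 - d_any / at_risk
    return cif

def pseudo_observations(time, status, horizon):
    """One pseudo-observation per patient, including censored patients."""
    time = np.asarray(time, dtype=float)
    status = np.asarray(status)
    n = len(time)
    full = aalen_johansen(time, status, horizon)
    keep = ~np.eye(n, dtype=bool)  # row i selects everyone except patient i
    leave_one_out = np.array(
        [aalen_johansen(time[keep[i]], status[keep[i]], horizon) for i in range(n)]
    )
    return n * full - (n - 1) * leave_one_out

# Hypothetical toy data with censoring (status 0)
time = [1, 2, 3, 4, 5, 6, 7, 8]
status = [1, 0, 2, 1, 0, 1, 2, 0]
pseudo = pseudo_observations(time, status, horizon=5)
print(np.round(pseudo, 3))
```

When no patient is censored before the horizon, each pseudo-observation reduces to that patient's primary event indicator, which is why pseudo-observations can stand in for the indicators in later calculations.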
If interest is not in the full range of observed follow-up but only in the ability of a model to predict the event occurring by a single time point of interest (eg, the five year recurrence risk), the cumulative/dynamic area under the receiver operating characteristic curve (AUCt) can serve as a measure of discrimination.29 The calculation of AUCt is similar to the c index, except that patient pairs are only compared if one patient has a recurrence by five years and the other has a recurrence later than five years or experiences the competing event (non-recurrence mortality).303132 The ordering of two patients both having a recurrence within five years, for example, after two years and after three years, will not be included in this calculation. The AUCt can be calculated for multiple time points and shown in a curve.
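To make the difference from the c index concrete, the following Python sketch (the article's code is in R) computes AUCt for hypothetical data without censoring. Cases are patients with the primary event by the horizon; all other patients, including those with a competing event, serve as controls. The function name and data are assumptions for illustration.

```python
import numpy as np

def auc_t(time, status, est_risk, horizon):
    """Cumulative/dynamic AUC at `horizon`, assuming no censoring.
    status: 1 = primary event, 2 = competing event."""
    time, status, est_risk = map(np.asarray, (time, status, est_risk))
    case = (status == 1) & (time <= horizon)
    control = ~case  # later primary event, competing event, or event free
    diff = est_risk[case][:, None] - est_risk[control][None, :]
    # Ties in estimated risk count as half concordant
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Hypothetical toy data (times in years)
time = [2, 4, 1, 6, 3]
status = [1, 1, 2, 1, 2]
est_risk = [0.5, 0.3, 0.2, 0.1, 0.4]

auc = auc_t(time, status, est_risk, horizon=5)
print(f"AUCt at 5 years: {auc:.3f}")
```

The two patients with a primary event at years 2 and 4 are both cases, so they are never compared with each other here.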
In the breast cancer data, the c index calculated for the time range until five years of follow-up was 0.71 (95% confidence interval 0.67 to 0.76) and the AUCt at five years was 0.71 (0.66 to 0.77; table 3). The AUCt showed a slightly decreasing trend over time with wide confidence intervals (supplementary fig 2).
Overall prediction error
Overall model performance entails the overall ability of the model to predict whether a patient experiences the primary event by a particular time point, combining both the calibration and the discrimination of a model. The Brier score summarises the squared difference between the event indicators and risk estimates.333435 For the competing risks setting, the Brier score is the average squared difference between the primary event indicators at the end of the prediction horizon and the absolute risk estimates by that time point.1836 Weighting techniques or pseudo-observations can account for censoring (box 2).3637
The Brier score can range from 0, for a perfect model, to 0.25, for a non-informative model in a dataset with an overall 50% event occurrence. When the overall outcome risk is lower, the maximum score for a non-informative model is lower, which complicates interpretation. Therefore, a scaled version of the Brier score has been proposed: 1−(model Brier score÷null model Brier score).34383940 The null model (without covariates) is a model that estimates the risk equally for all individuals and can in the setting of competing events be estimated by the Aalen-Johansen estimator.13 The scaled Brier score can be interpreted as an R2 type of measure, representing the amount of prediction error in a null model that is explained by the prediction model. A scaled Brier score of 100% corresponds to a perfect model, 0% to an ineffective model, and <0% to a harmful model in the sense that the predictions are further away from the observed data than the null model estimating the average risk for each patient.
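A minimal sketch of the Brier score and its scaled version follows, in Python rather than the article's R. It assumes ten hypothetical patients without censoring, so neither weighting nor pseudo-observations are required, and the null model is simply the observed proportion of primary events. All values are invented for illustration.

```python
import numpy as np

# Hypothetical validation data without censoring.
# Status by the prediction horizon: 0 = event free, 1 = primary event, 2 = competing event
status = np.array([1, 0, 2, 0, 1, 0, 0, 2, 0, 0])
est_risk = np.array([0.40, 0.10, 0.30, 0.20, 0.50, 0.10, 0.20, 0.30, 0.20, 0.20])

# Primary event indicator: competing events count as 0
indicator = (status == 1).astype(float)

brier = np.mean((indicator - est_risk) ** 2)

# Null model: the same risk (the overall observed proportion) for every patient
null_risk = indicator.mean()
brier_null = np.mean((indicator - null_risk) ** 2)

scaled_brier = 1.0 - brier / brier_null
print(f"Brier {brier:.3f}, null {brier_null:.3f}, scaled {scaled_brier:.1%}")
```

With an observed proportion of 20%, the null model Brier score is 0.2×0.8=0.16, which is below the 0.25 maximum reached at 50% event occurrence.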
In the breast cancer validation cohort, the Brier score (where a lower score is better) was 0.09 (95% confidence interval 0.04 to 0.13; table 3). The scaled Brier score (where a higher percentage is better) showed that 5.7% (1.6% to 8.2%; table 3) of prediction error was explained, which we consider to be fairly low.
Decision curve analysis
Discrimination, calibration, and overall prediction error as described above are important when validating a prediction model, but do not tell us whether the model would do more good than harm if used in clinical practice (clinical usefulness).4142 To use a risk model for making decisions, we have to choose a risk threshold. Patients with a risk exceeding the threshold are selected for additional clinical interventions. Use of the risk model in this way leads to justified interventions (interventions in patients who would develop recurrence) and unnecessary interventions (interventions in patients who would not develop recurrence). The net benefit statistic is based on the proportion of justified interventions minus the proportion of unnecessary interventions (box 3). However, this statistic assigns a weight to the proportion of unnecessary interventions. This weight is related to the chosen threshold: the lower the threshold, the more we value justified interventions and the more we accept unnecessary interventions. The choice of the threshold depends on the (perceived) benefits and harms of the intervention. For example, a highly effective intervention with few side effects suggests the use of a low threshold. Different clinicians and patients might prefer different thresholds. Therefore, net benefit can be calculated for a range of reasonable thresholds, resulting in a decision curve.4143 The decision curve of a model is commonly compared to a reference scenario in which all patients receive the intervention (treat all; fig 2) and another scenario in which no intervention is given (treat none).
Net benefit for competing risks data
Suppose that a physician finds it reasonable that, to treat one patient who would otherwise develop a recurrence within five years (eg, with adjuvant systemic treatment), at most four patients are treated unnecessarily. This number means that at least 20% of treatments should be justified, implying a risk threshold of 20%.
The benefit of a prediction model is defined as the proportion of patients who are correctly classified as high risk. In the presence of competing events, this proportion can be calculated as the cumulative incidence of recurrence among patients with estimated risk ≥20%, multiplied by the proportion of all patients with risk ≥20%.
The harm from using the model is defined as the proportion of patients who are incorrectly classified as high risk. With competing events, this proportion is calculated as (1−cumulative incidence of recurrence among patients with estimated risk ≥20%) multiplied by the proportion of all patients with risk ≥20% (supplementary material 4).43
The net benefit is the benefit minus the harm, in which the harm is assigned a weight. This weight is determined by the risk threshold. Here, we find it reasonable that at least 20% (one in five) of treatments are justified, implying that the harm of an unnecessary treatment is considered four times smaller than the benefit of a justified treatment. The weight is therefore one quarter (1÷4).41 44 45
The decision curve in figure 2 shows the net benefit for predicting recurrence within five years, based on the validation data. With a risk threshold of 20% (box 3), the net benefit was 0.014 (table 3). This net result of 14 of 1000 patients is made up of 34 patients whom the prediction model points out correctly as they would develop recurrence if untreated (benefit) versus 81 patients whom the model points out incorrectly and are overtreated (harm). Given the weight of 1÷4 implied by the risk threshold (box 3), subtracting the weighted harm from the benefit leads to the net result of 34−(81÷4)=14 net more benefiting patients when applying the prediction model to 1000 patients.
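The arithmetic in this paragraph can be reproduced with a short Python sketch (the article's code is in R). The function name is hypothetical. The two inputs are backed out from the rounded counts of 34 and 81 per 1000 patients reported in the text, so they are approximations and not values taken from the underlying analysis.

```python
def net_benefit(cuminc_high, prop_high, threshold):
    """Net benefit at a given risk threshold.
    cuminc_high: cumulative incidence among patients classified as high risk
    prop_high:   proportion of all patients classified as high risk"""
    weight = threshold / (1.0 - threshold)       # 0.20 / 0.80 = 1/4
    benefit = cuminc_high * prop_high            # correctly classified as high risk
    harm = (1.0 - cuminc_high) * prop_high       # incorrectly classified as high risk
    return benefit - weight * harm

# Approximations derived from the counts in the text (per 1000 patients)
prop_high = (34 + 81) / 1000
cuminc_high = 34 / (34 + 81)

nb = net_benefit(cuminc_high, prop_high, threshold=0.20)
print(f"net benefit: {nb:.3f}")
```

Rounded to three decimals this matches the net benefit of 0.014 given in the text, or about 14 net benefiting patients per 1000.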
A net benefit greater than zero and exceeding that of the reference scenarios suggests that the prediction model can add value to clinical decision making. The decision of whether to implement a model in practice will be further based on practical considerations such as costs and ease with which the information needed in the model can be obtained. In our breast cancer illustration, all four variables are readily available; but in other cases, covariate information can be expensive or invasive to obtain. Preferably a formal impact trial should be performed to obtain definite evidence on the usefulness of a prediction model for clinical decision making.46
This article provides an overview of performance measures for a comprehensive assessment of the performance of a prediction model in the presence of competing risks. This assessment typically requires specialist techniques to process censored data, such as reweighting the observations or using pseudo-observations. Contemporary, free software facilitates all the described approaches. The methods can be used for validating any developed time-to-event prediction model, as long as the reporting enables calculation of absolute risk estimates for new patients at the time point(s) of interest.
We recognise that other performance measures are available that have not been described in this overview, which might be important under specific circumstances. For example, methods have been proposed for evaluating estimated absolute risks for several or all competing events at the same time.4748 Also, with the exception of the c index and AUCt curve, we limited our descriptions to evaluating absolute risk predictions by one specific time point, because it is relevant for most clinical prediction problems. Several of the performance measures that we described can be extended to evaluating predictions by multiple time points or over the entire range of follow-up. Furthermore, we note that large sample sizes are often required for a reliable assessment of performance.495051
The discussed performance measures relate to the full risk distribution (calibration, discrimination, overall performance) and to a decision analysis perspective (with the potential impact to obtain better patient outcomes). These measures are in line with the TRIPOD guidelines, which form a key framework for reporting of prediction models, including the increasingly common competing risks prediction models.52
Contributors: All authors provided a substantial contribution to the design and interpretation of the paper and revised drafts. ES initiated the project. NvG wrote the initial draft and is the guarantor for the study. DG analysed the breast cancer data. EFB drafted the technical descriptions in supplementary material 4. DG and EFB are the main authors of the GitHub page. The corresponding author had full access to all the data in the study and had final responsibility for the decision to submit for publication. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.
Funding: No specific funding was given to this study. The research of MPP is supported by the Slovenian Research Agency (grant P3-0154).
Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/disclosure-of-interest/ and declare: no support for the submitted work; ES and RDR report they receive royalties for their respective books on prediction models; all other authors declare no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.
Provenance and peer review: Not commissioned; externally peer reviewed.