Internal validation of predictive models: Efficiency of some procedures for logistic regression analysis

https://doi.org/10.1016/S0895-4356(01)00341-9

Abstract

The performance of a predictive model is overestimated when simply determined on the sample of subjects that was used to construct the model. Several internal validation methods are available that aim to provide a more accurate estimate of model performance in new subjects. We evaluated several variants of split-sample, cross-validation and bootstrapping methods with a logistic regression model that included eight predictors for 30-day mortality after an acute myocardial infarction. Random samples with a size between n = 572 and n = 9165 were drawn from a large data set (GUSTO-I; n = 40,830; 2851 deaths) to reflect modeling in data sets with between 5 and 80 events per variable. Independent performance was determined on the remaining subjects. Performance measures included discriminative ability, calibration and overall accuracy. We found that split-sample analyses gave overly pessimistic estimates of performance, with large variability. Cross-validation on 10% of the sample had low bias and low variability, but was not suitable for all performance measures. Internal validity could best be estimated with bootstrapping, which provided stable estimates with low bias. We conclude that split-sample validation is inefficient, and recommend bootstrapping for estimation of internal validity of a predictive logistic regression model.

Introduction

Predictive models are important tools to provide estimates of patient outcome [1]. A predictive model is commonly constructed with regression analysis in a data set containing information from a series of representative patients. The apparent performance of the model on this training set will be better than its performance in another data set, even if the latter test set consists of patients from the same population [1–6]. This ‘optimism’ is a well-known statistical phenomenon, and several approaches have been proposed to estimate the performance of the model in independent subjects more accurately than a naive evaluation on the training sample does [3,7–9].

A straightforward and fairly popular approach is to randomly split the training data into two parts: one to develop the model and one to measure its performance. With this split-sample approach, model performance is determined on similar, but independent, data [9].
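As a minimal sketch of the split-sample approach (our illustration, not the authors' exact protocol: the synthetic data, the scikit-learn model, the 50/50 split fraction, and the c-statistic, i.e., the area under the ROC curve, as performance measure are all assumptions):

```python
# Split-sample validation: develop the model on one random half of the
# data, estimate performance on the other half. Synthetic data stand in
# for a real patient data set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                    # 8 predictors, as in the GUSTO-I model
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # synthetic binary outcome

X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = LogisticRegression().fit(X_dev, y_dev)
apparent = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])
split = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"apparent c-statistic {apparent:.3f}, split-sample estimate {split:.3f}")
```

The apparent c-statistic is computed on the development half itself and will tend to exceed the split-sample estimate, illustrating the optimism discussed above.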

A more sophisticated approach is cross-validation, which can be seen as an extension of the split-sample method. With split-half cross-validation, the model is developed on one randomly drawn half of the data and tested on the other, and vice versa; the average of the two estimates is taken as the estimate of performance. Other fractions of subjects may be left out instead (e.g., 10% to test a model developed on the remaining 90% of the sample). This procedure is repeated 10 times, such that every subject serves exactly once to test the model. To improve the stability of the cross-validation, the whole procedure can be repeated several times with new random subsamples. The most extreme cross-validation procedure is to leave out one subject at a time, which is equivalent to the jack-knife technique [7].
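The 10% leave-out scheme can be sketched as 10-fold cross-validation, reusing the setup above (the stratified fold assignment is an implementation choice on our part):

```python
# 10-fold cross-validation: each subject serves exactly once as test data;
# the average out-of-fold performance is the cross-validated estimate.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cv_c_statistic(X, y, n_splits=10, seed=0):
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for dev_idx, test_idx in folds.split(X, y):
        model = LogisticRegression().fit(X[dev_idx], y[dev_idx])
        aucs.append(roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1]))
    return float(np.mean(aucs))

# Repeating with several seeds and averaging the results improves
# stability, as described in the text.
```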

The most efficient validation has been claimed to be achieved by computer-intensive resampling techniques such as the bootstrap [8]. Bootstrapping replicates the process of sample generation from an underlying population by drawing samples with replacement from the original data set, of the same size as the original data set [7]. Models may be developed in bootstrap samples and tested in the original sample or in those subjects not included in the bootstrap sample [3,8].
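One such bootstrap scheme is the optimism correction in the style of Harrell et al. [3], sketched below (the 200 replicates and the c-statistic are our illustrative choices, not the paper's exact specification): develop the model in each bootstrap sample, test it both there and in the original sample, and subtract the average performance drop from the apparent performance.

```python
# Bootstrap validation with optimism correction: refit the model in samples
# drawn with replacement; optimism = mean(performance in bootstrap sample
# minus performance of that same model in the original sample).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def bootstrap_corrected_c(X, y, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    apparent = roc_auc_score(y, LogisticRegression().fit(X, y).predict_proba(X)[:, 1])
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # draw n subjects with replacement
        if y[idx].min() == y[idx].max():      # skip resamples with only one outcome class
            continue
        m = LogisticRegression().fit(X[idx], y[idx])
        c_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
        c_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
        optimism.append(c_boot - c_orig)
    return apparent - float(np.mean(optimism))
```

Testing each refitted model only on the subjects not included in the bootstrap sample, the other variant mentioned in the text, would evaluate on the complement of `idx` instead of the full original sample.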

In this study we compare the efficiency of internal validation procedures for predictive logistic regression models. Internal validation refers to performance in patients from the same underlying population as the development sample; it contrasts with external validation, where various differences may exist between the populations used to develop and to test the model [10]. We vary the sample size from small to large. As an indicator of sample size we use the number of events per variable (EPV); low EPV values indicate that many parameters are estimated relative to the amount of information in the data [11,12]. We study a number of measures of predictive performance, and we show that bootstrapping is generally superior to the other approaches for estimating internal validity.
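To make the EPV indicator concrete, the sample sizes used in this study follow directly from the number of predictors and the event rate (our arithmetic; the figures match the Results section below):

```python
# Events per variable (EPV) = number of events / number of fitted predictors.
n_predictors = 8
event_rate = 2851 / 40830         # 7.0% 30-day mortality in GUSTO-I
for epv in (5, 10, 20, 40, 80):
    events = epv * n_predictors
    n = int(events / event_rate)  # truncation reproduces the reported sample sizes
    print(f"EPV {epv:2d}: {events:3d} events, n = {n}")
# EPV  5:  40 events, n = 572  ...  EPV 80: 640 events, n = 9165
```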

Section snippets

Patients

We analyzed 30-day mortality in a large data set of patients with acute myocardial infarction (GUSTO-I) [13,14]. This data set has been used before to study methodological aspects of regression modeling [15–18]. In brief, it consists of 40,830 patients, of whom 2851 (7.0%) had died by 30 days.

Simulation study

Random samples were drawn from the GUSTO-I data set, with sample size varied according to the number of events per variable (EPV). We studied the validity of EPV as an indicator of …

Optimism in apparent performance

In Fig. 1 we show the apparent and test performance of the logistic regression model with eight predictors in relation to sample size, as indicated by the number of events per variable (EPV). The apparent performance was determined on random samples from the GUSTO-I data set, with sample sizes (number of deaths) of n = 572 (40), n = 1145 (80), n = 2291 (160), n = 4582 (320), n = 9165 (640) for EPV 5, 10, 20, 40, and 80, respectively. For all performance measures, we note optimism in the …

Discussion

Accurate estimation of the internal validity of a predictive regression model is especially problematic when the sample size is small. The apparent performance as estimated in the sample is then a substantial overestimate of the true performance in similar subjects. In our study, split-sample approaches underestimated performance and showed high variability. In contrast, bootstrap resampling resulted in stable and nearly unbiased estimates of performance.

Methods to assess internal validity

Acknowledgements

We would like to thank Kerry L. Lee, Duke Clinical Research Institute, Duke University Medical Center, Durham, NC, and the GUSTO investigators for making the GUSTO-I data available for analysis. The research of Dr. Steyerberg has been made possible by a fellowship of the Royal Netherlands Academy of Arts and Sciences.

References (33)

  • P. Peduzzi et al. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol (1996)
  • E.W. Steyerberg et al. Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. J Clin Epidemiol (1999)
  • F.E. Harrell et al. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med (1996)
  • J.B. Copas. Regression, prediction and shrinkage. J R Stat Soc B (1983)
  • B. Efron. Estimating the error rate of a prediction rule: some improvements on cross-validation. JASA (1983)
  • D.J. Spiegelhalter. Probabilistic prediction in patient management and clinical trials. Stat Med (1986)
  • J.C. van Houwelingen et al. Predictive value of statistical models. Stat Med (1990)
  • C. Chatfield. Model uncertainty, data mining and statistical inference. J R Stat Soc A (1995)
  • B. Efron et al. An introduction to the bootstrap. Monographs on statistics and applied probability (1993)
  • B. Efron et al. Improvements on cross-validation: the .632+ bootstrap method. JASA (1997)
  • R.R. Picard et al. Data splitting. Am Statistician (1990)
  • A.C. Justice et al. Assessing the generalizability of prognostic information. Ann Intern Med (1999)
  • F.E. Harrell et al. Regression modelling strategies for improved prognostic prediction. Stat Med (1984)
  • The GUSTO Investigators. An international randomized trial comparing four thrombolytic strategies for acute myocardial infarction. N Engl J Med (1993)
  • K.L. Lee et al. Predictors of 30-day mortality in the era of reperfusion for acute myocardial infarction: results from an international trial of 41,021 patients. Circulation (1995)
  • M. Ennis et al. A comparison of statistical learning methods on the GUSTO database. Stat Med (1998)