Evidence for clinical prediction requires separating train and test data
Dear BMJ editors,
Please find below a comment expanding on Riley et al.’s important and insightful paper.
Gaël Varoquaux, Research Director, Inria, France – Visiting Professor, McGill University, Canada
Russell A. Poldrack, Professor Stanford University, California
Sylvain Arlot, Professor, University Paris Saclay, France
Yoshua Bengio, Professor, Mila Quebec Institute for Learning Algorithms, Canada
As a side point in their important paper on the sample sizes needed for prediction models, Riley et al. (2020) wrote that they “do not recommend data splitting (eg, into model training and testing samples), as this is inefficient and it is better to use all the data for model development, with resampling methods (such as bootstrapping) used for internal validation”. As this formulation could lead to confusion, we would like to clarify here that prediction models must be evaluated on data different from those used to fit the model, following classical guidelines on prediction-model validation (Poldrack, Huckins, and Varoquaux 2019; James et al. 2013, chap 5).
A clinical prediction model will be used via its predictions. Consequently, validating such a model and evaluating its performance must gauge the quality of these predictions, even to establish internal validity. This calls for measuring a generalization error, which differs from common goodness-of-fit measures because it characterizes the model on unseen individuals. The error of a prediction model evaluated on the data used to fit it (known as the “train error” or the “apparent error”) will underestimate the generalization error, because the model has been optimized on those specific data points. One extreme example is the one-nearest-neighbor classifier (Hastie, Tibshirani, and Friedman 2013, chap 2): such a predictive model works by storing all the data during the fit; to make a prediction on a new observation, it finds the most similar observation that it has already seen, and predicts the corresponding outcome. This predictive model will make no error on the data that it has already seen, though it will not in general lead to perfect predictions on new data. For these reasons, the standard practice in evaluating predictive models is to split the data into a “training set”, used to fit the model, and a “test set”, used to evaluate model performance. With limited data, a single split leads to a noisy evaluation, and it is preferable to split repeatedly, in so-called “cross-validation” (Moons et al. 2015; Arlot and Celisse 2010).
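The one-nearest-neighbor example can be checked numerically. The sketch below (Python with NumPy; the sample sizes and five-dimensional Gaussian features are purely illustrative) fits the classifier on completely random binary outcomes and compares the apparent error with the error on fresh data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random features and completely random binary outcomes: nothing to learn.
X_train = rng.normal(size=(200, 5))
y_train = rng.integers(0, 2, size=200)
X_test = rng.normal(size=(2000, 5))
y_test = rng.integers(0, 2, size=2000)

def one_nn_predict(X_fit, y_fit, X_new):
    # Predict the outcome of the single closest stored observation.
    d = ((X_new[:, None, :] - X_fit[None, :, :]) ** 2).sum(-1)
    return y_fit[d.argmin(axis=1)]

apparent_error = (one_nn_predict(X_train, y_train, X_train) != y_train).mean()
test_error = (one_nn_predict(X_train, y_train, X_test) != y_test).mean()
print(apparent_error)  # 0.0: each point is its own nearest neighbor
print(test_error)      # close to 0.5: there is no real signal to generalize
```

The apparent error is exactly zero while the test error sits near chance, illustrating why the train error cannot serve as a measure of generalization.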
Cross-validation is linked to the bootstrap, as both are resampling procedures. However, care must be taken when using the bootstrap to estimate generalization error, as it is still necessary to evaluate the model on unseen data to avoid bias. One approach is to test the model on the fraction of original data points left out of a bootstrap replicate (Breiman 1996). A disappointing aspect that this approach shares with cross-validation is that it must set aside some precious data that could otherwise be used to fit the model, potentially leading to worse models. For this reason, there has been active research on correcting the bias of the apparent error. Yet it is very hard to obtain estimates of the generalization error without known catastrophic failures. For instance, the refined bootstrap (Efron and Tibshirani 1994, chap 17), popular in clinical research (Moons et al. 2015), estimates the optimism of the apparent error by comparing it with the error, measured on the full data, of models fit on bootstrap replicates. Still, given that the full data shares ~63.2% of its observations with each bootstrap replicate, the resulting estimate of the generalization error is biased. For instance, on data with a completely random binary outcome, and for a nearest-neighbor classifier (with zero apparent error), the refined bootstrap will estimate an accuracy of .632 × 100% + .368 × 50% = 81.6%, i.e. an error of 18.4%, while the actual error rate is 50%, given that the data are fully random. The bootstrap .632 was introduced to correct this bias (Efron and Tibshirani 1994, chap 17), taking a weighted average of the apparent error and the error measured on points outside the bootstrap replicate. However, this also fails for a one-nearest-neighbor classifier on fully random outcomes, reporting an error of .368 × 0% + .632 × 50% = 31.6%, instead of 50%.
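These biases can be reproduced in a few lines. The sketch below (Python with NumPy; the one-nearest-neighbor classifier, sample size, and number of bootstrap replicates are illustrative choices) computes, on purely random outcomes, the apparent error, the out-of-bag error in the spirit of Breiman (1996), the error of bootstrap-fitted models on the full data, and the .632 weighted average:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
y = rng.integers(0, 2, size=n)  # purely random outcomes: true error rate is 50%

def one_nn_error(X_fit, y_fit, X_eval, y_eval):
    # Error rate of a one-nearest-neighbor classifier fit on (X_fit, y_fit).
    d = ((X_eval[:, None, :] - X_fit[None, :, :]) ** 2).sum(-1)
    return (y_fit[d.argmin(axis=1)] != y_eval).mean()

apparent = one_nn_error(X, y, X, y)  # 0.0: each point is its own neighbor

oob_errors, full_errors = [], []
for _ in range(50):
    idx = rng.integers(0, n, size=n)        # bootstrap replicate (with replacement)
    oob = np.setdiff1d(np.arange(n), idx)   # ~36.8% of points are left out
    oob_errors.append(one_nn_error(X[idx], y[idx], X[oob], y[oob]))
    full_errors.append(one_nn_error(X[idx], y[idx], X, y))

oob_error = float(np.mean(oob_errors))    # ~50%: evaluated on truly unseen points
full_error = float(np.mean(full_errors))  # ~18.4%: the full data overlaps the replicate
err_632 = 0.368 * apparent + 0.632 * oob_error  # ~31.6%: still optimistic
```

Only the out-of-bag error, computed on points the model has never seen, lands near the true 50%; both the full-data evaluation and the .632 average remain optimistic.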
For this reason, Efron and Tibshirani (1997) introduced the bootstrap .632+, which corrects the bootstrap .632 with the no-information error rate of the predictor, estimated by evaluating the prediction model on all possible combinations of covariates and outcomes. This last variant of the bootstrap has no obvious loophole, though its theoretical properties are not fully understood. This history of successive improvements to the bootstrap shows how difficult it is to avoid its biases when measuring generalization error. Even for estimating confidence intervals of model parameters with i.i.d. data, there are many known settings where the bootstrap fails (Beran 1997; Mammen 2012; Davison, Hinkley, and Young 2003, to list a few). For these reasons, modern practice in predictive modeling uses a variety of techniques, including the bootstrap, to estimate the variability of model parameters, but prefers the more reliable cross-validation approach to estimate the generalization error (James et al. 2013, chap 5). Efron, the inventor of the bootstrap, wrote “Cross-validation [...] is a method of such obvious virtue that criticism seems almost churlish”, concluding that “what is not available is theoretical reassurance that the numerical gains of methods like .632+ will hold up in general practice” (Efron 2003).
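As a sketch, the .632+ rule of Efron and Tibshirani (1997) can be written as follows (Python; the arguments `apparent`, `oob`, and `gamma` stand for the apparent error, the out-of-bag error, and the no-information error rate). On the fully random one-nearest-neighbor example, it recovers the correct 50%:

```python
def err_632_plus(apparent, oob, gamma):
    # Relative overfitting rate: 0 when there is no overfitting,
    # 1 when overfitting is as large as the no-information gap allows.
    oob = min(oob, gamma)  # the rule first caps the out-of-bag error
    if gamma > apparent:
        r = (oob - apparent) / (gamma - apparent)
    else:
        r = 0.0
    w = 0.632 / (1 - 0.368 * r)  # weight grows from .632 toward 1
    return (1 - w) * apparent + w * oob

# One-nearest-neighbor on random binary outcomes: apparent error 0,
# out-of-bag error 50%, no-information rate 50%.
print(err_632_plus(0.0, 0.5, 0.5))  # 0.5 -- the true error rate
```

When there is no overfitting (out-of-bag error equal to the apparent error), the weight stays at .632 and the rule reduces to the plain bootstrap .632.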
A good choice of cross-validation scheme leads to both small bias and small variance (Arlot and Celisse 2010). K-fold cross-validation trains on a fraction (1 - 1/K) of the whole dataset. When K is sufficiently large (e.g. 5), this subset represents the original data well, and the generalization error of the corresponding model is consequently close to that of a model trained on the full data. Importantly, using slightly less data leads to a conservative estimate of generalization, which is preferable to an optimistic one. To decrease the variance, the measure can be averaged across repetitions of cross-validation with different random splits.
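A minimal sketch of repeated K-fold cross-validation (Python with NumPy, reusing an illustrative one-nearest-neighbor classifier on purely random outcomes; the sample size, K = 5, and number of repetitions are arbitrary choices) shows that it yields an honest near-chance error estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
y = rng.integers(0, 2, size=n)  # purely random outcomes: true error rate is 50%

def one_nn_error(X_fit, y_fit, X_eval, y_eval):
    # Error rate of a one-nearest-neighbor classifier fit on (X_fit, y_fit).
    d = ((X_eval[:, None, :] - X_fit[None, :, :]) ** 2).sum(-1)
    return (y_fit[d.argmin(axis=1)] != y_eval).mean()

def repeated_kfold_error(X, y, k=5, n_repeats=10):
    # Average the test-fold error over k folds, repeated with different splits.
    errors = []
    for _ in range(n_repeats):
        perm = rng.permutation(len(y))
        for fold in np.array_split(perm, k):
            train = np.setdiff1d(np.arange(len(y)), fold)
            errors.append(one_nn_error(X[train], y[train], X[fold], y[fold]))
    return float(np.mean(errors))

cv_error = repeated_kfold_error(X, y)  # close to 0.5: no optimistic bias here
```

Because every fold is scored on data held out from fitting, the estimate sits near the true 50% rather than near the zero apparent error.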
The motivation for avoiding cross-validation is to find an estimate of generalization error with less variance, at the cost of some bias. This bias is small for typical clinical prediction models: with simple predictive models, such as linear models in low dimensions, and sufficiently large sample sizes, the apparent error is close enough to the generalization error and the use of the bootstrap is not dangerous (Steyerberg et al. 2001). However, modern AI techniques, such as deep neural networks, often achieve a very small apparent error on any data (Zhang et al. 2016), as with the one-nearest-neighbor classifier. These rich models are necessary for prediction with complex data, such as medical images, but knowing whether they generalize well requires cross-validation or, for very large datasets, held-out data.
Having clarified this potentially ambiguous aspect of Riley et al. (2020), we wish to support the important messages that they bring: predictive models need sufficient data, more than in typical statistical practice, and truly external validation data from a separate source or study.
Arlot, Sylvain, and Alain Celisse. 2010. “A Survey of Cross-Validation Procedures for Model Selection.” Statistics Surveys 4: 40–79.
Beran, Rudolf. 1997. “Diagnosing Bootstrap Success.” Annals of the Institute of Statistical Mathematics 49 (1): 1–24.
Breiman, L. 1996. “Out-of-Bag Estimation.” https://www.stat.berkeley.edu/pub/users/breiman/OOBestimation.pdf.
Davison, A. C., D. V. Hinkley, and G. A. Young. 2003. “Recent Developments in Bootstrap Methodology.” Statistical Science: A Review Journal of the Institute of Mathematical Statistics.
Efron, Bradley. 2003. “Second Thoughts on the Bootstrap.” Statistical Science: A Review Journal of the Institute of Mathematical Statistics 18 (2): 135–40.
Efron, Bradley, and R. J. Tibshirani. 1994. An Introduction to the Bootstrap. CRC Press.
Efron, Bradley, and Robert Tibshirani. 1997. “Improvements on Cross-Validation: The 632+ Bootstrap Method.” Journal of the American Statistical Association 92 (438): 548–60.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2013. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer.
Mammen, Enno. 2012. When Does Bootstrap Work?: Asymptotic Results and Simulations. Springer Science & Business Media.
Moons, Karel G. M., Douglas G. Altman, Johannes B. Reitsma, John P. A. Ioannidis, Petra Macaskill, Ewout W. Steyerberg, Andrew J. Vickers, David F. Ransohoff, and Gary S. Collins. 2015. “Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD): Explanation and Elaboration.” Annals of Internal Medicine 162 (1): W1–73.
Poldrack, Russell A., Grace Huckins, and Gael Varoquaux. 2019. “Establishment of Best Practices for Evidence for Prediction: A Review.” JAMA Psychiatry.
Riley, Richard D., Joie Ensor, Kym I. E. Snell, Frank E. Harrell Jr, Glen P. Martin, Johannes B. Reitsma, Karel G. M. Moons, Gary Collins, and Maarten van Smeden. 2020. “Calculating the Sample Size Required for Developing a Clinical Prediction Model.” BMJ 368: m441.
Steyerberg, E. W., F. E. Harrell Jr, G. J. Borsboom, M. J. Eijkemans, Y. Vergouwe, and J. D. Habbema. 2001. “Internal Validation of Predictive Models: Efficiency of Some Procedures for Logistic Regression Analysis.” Journal of Clinical Epidemiology 54 (8): 774–81.
Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2016. “Understanding Deep Learning Requires Rethinking Generalization.” ICLR.
Competing interests: No competing interests