# Calculating the sample size required for developing a clinical prediction model

BMJ 2020;368:m441. doi: https://doi.org/10.1136/bmj.m441 (Published 18 March 2020)

## All rapid responses

Dear BMJ editors,

Please find a comment to expand on Riley et al’s important and insightful paper.

Best regards,

Gaël Varoquaux, Research Director, Inria, France; Visiting Professor, McGill University, Canada

Russell A. Poldrack, Professor, Stanford University, California

Sylvain Arlot, Professor, University Paris Saclay, France

Yoshua Bengio, Professor, Mila Quebec Institute for Learning Algorithms, Canada

As a side point in their important paper on sample sizes needed for prediction models, Riley et al. (2020) wrote that they "do not recommend data splitting (eg, into model training and testing samples), as this is inefficient and it is better to use all the data for model development, with resampling methods (such as bootstrapping) used for internal validation". As this formulation could lead to confusion, we would like to clarify here that prediction models must be evaluated on data different from the data used to fit the model, following classical guidelines on prediction-model validation (Poldrack, Huckins, and Varoquaux 2019; James et al. 2013, chap. 5).

A clinical prediction model will be used through its predictions. Consequently, validating such a model and evaluating its performance must gauge the quality of these predictions, even to establish internal validity. This calls for measuring the generalization error, which differs from common goodness-of-fit measures because it characterizes the model on unseen individuals. The error of a prediction model evaluated on the data used to fit it (known as the "train error" or the "apparent error") will underestimate the generalization error, because the model has been optimized on those specific data points. One extreme example is the one-nearest-neighbor classifier (Hastie, Tibshirani, and Friedman 2013, chap. 2): such a predictive model works by storing all the data during the fit; to make a prediction on a new observation, it finds the most similar observation that it has already seen and predicts the corresponding outcome. This model makes no error on the data it has already seen, yet it will not in general give perfect predictions on new data. For these reasons, the standard practice in evaluating predictive models is to split the data into a "training set", used to fit the model, and a "test set", used to evaluate model performance. With limited data, a single split leads to a noisy evaluation, and it is preferable to split repeatedly, in so-called "cross-validation" (Moons et al. 2015; Arlot and Celisse 2010).
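To make this concrete, here is a minimal sketch of the one-nearest-neighbor example, assuming scikit-learn is available (the library, sample size, and number of covariates are our illustrative choices): the classifier attains zero apparent error on purely random binary outcomes, yet only chance-level accuracy on held-out data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)           # covariates
y = rng.randint(0, 2, size=1000)  # completely random binary outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Apparent ("train") error: 0%, as each point is its own nearest neighbor.
print("apparent error:", 1 - model.score(X_train, y_train))
# Error on unseen data: ~50%, since the outcome is pure noise.
print("test error:    ", 1 - model.score(X_test, y_test))
```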

Cross-validation is linked to the bootstrap, as both are resampling procedures. However, care must be taken when using the bootstrap to estimate generalization error: it is still necessary to evaluate the model on unseen data to avoid bias. One approach is to test the model on the fraction of original data points left out of a bootstrap replicate (Breiman 1996). A disappointing aspect that this approach shares with cross-validation is that it must discard some precious data that could otherwise be used to fit the model, leading to models that potentially perform worse. For this reason, there has been active research on correcting the bias of the apparent error. Yet it is very hard to obtain estimates of the generalization error without known catastrophic failures. For instance, the refined bootstrap (Efron and Tibshirani 1994, chap. 17), popular in clinical research (Moons et al. 2015), estimates the bias of the apparent error by comparing it with the performance, evaluated on the full data, of models fit on bootstrap replicates. Still, given that the full data shares ~63.2% of its observations with each bootstrap replicate, the resulting estimate of the generalization error is biased. For instance, on data with a completely random binary outcome, for a one-nearest-neighbor classifier (with zero apparent error), the refined bootstrap will measure an accuracy of .632 × 100% + .368 × 50% = 81.6%, while the true accuracy is 50%, given that the data are fully random. The bootstrap .632 was introduced to correct this bias (Efron and Tibshirani 1994, chap. 17), taking a weighted average of the apparent error and the error measured on points outside the bootstrap replicate. However, this also fails for one-nearest-neighbor, reporting an error of .368 × 0% + .632 × 50% = 31.6% instead of 50% on fully random outcomes. For this reason, Efron and Tibshirani (1997) introduced the bootstrap .632+, which corrects the bootstrap .632 with the no-information error rate of the predictor, estimated by evaluating the prediction model on all possible combinations of covariates and outcomes. This last variant of the bootstrap has no obvious loophole, though its theoretical properties are not fully understood.

This history of successive improvements to the bootstrap shows how difficult it is to avoid its biases when measuring generalization error. Even for estimating confidence intervals of model parameters with i.i.d. data, there are many known settings where the bootstrap fails (Beran 1997; Mammen 2012; Davison, Hinkley, and Young 2003, to list a few). For these reasons, modern practice in predictive modeling uses a variety of techniques, including the bootstrap, to estimate the variability of model parameters, but prefers the more reliable cross-validation approach to estimate the generalization error (James et al. 2013, chap. 5). Efron, the inventor of the bootstrap, wrote that "Cross-validation [...] is a method of such obvious virtue that criticism seems almost churlish", concluding that "what is not available is theoretical reassurance that the numerical gains of methods like .632+ will hold up in general practice" (Efron 2003).
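The bias of the bootstrap .632 estimator can be checked numerically. The sketch below, again assuming scikit-learn and with the number of replicates chosen arbitrarily, estimates the out-of-bag error of one-nearest-neighbor on fully random outcomes and combines it with the (zero) apparent error using the .632 weights, recovering the ~31.6% figure discussed above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
n = 500
X = rng.randn(n, 10)
y = rng.randint(0, 2, size=n)  # fully random binary outcome

oob_errors = []
for _ in range(100):                       # bootstrap replicates
    idx = rng.randint(0, n, size=n)        # sample n observations with replacement
    oob = np.setdiff1d(np.arange(n), idx)  # the ~36.8% of points left out
    model = KNeighborsClassifier(n_neighbors=1).fit(X[idx], y[idx])
    oob_errors.append(1 - model.score(X[oob], y[oob]))

apparent_error = 0.0             # 1-NN never errs on its own training data
oob_error = np.mean(oob_errors)  # ~50% here, since the labels are noise

err_632 = 0.368 * apparent_error + 0.632 * oob_error
print(f".632 estimate: {err_632:.3f} (true error: 0.5)")  # ~0.316: badly biased
```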

A good choice of cross-validation scheme leads to small bias and small variance (Arlot and Celisse 2010). K-fold cross-validation trains on a fraction (1 - 1/K) of the whole dataset. When K is sufficiently large (e.g., K = 5), this subset represents the original data well, so the generalization error of the corresponding model is close to that of a model trained on the full data. Importantly, using less data leads to a conservative estimate of generalization, which is preferable to an optimistic one. To decrease the variance, the measure can be averaged across repeated cross-validations with different splits.
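As an illustration of repeated K-fold cross-validation, the following sketch (scikit-learn assumed; the dataset, K = 5, 20 repeats, and AUC as the performance measure are our illustrative choices) averages performance over many different random splits to reduce the variance of the estimate.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation, repeated 20 times with different random splits
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"cross-validated AUC: {scores.mean():.3f} (SD {scores.std():.3f})")
```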

The motivation for avoiding cross-validation is to find an estimate of the generalization error with less variance, at the cost of some bias. This bias is small for typical clinical prediction models: with simple predictive models, such as linear models in low dimensions, and sufficiently large sample sizes, the apparent error is close enough to the generalization error and the use of the bootstrap is not dangerous (Steyerberg et al. 2001). However, modern AI techniques, such as deep neural networks, often achieve very small apparent error on any data (Zhang et al. 2016), as with the one-nearest-neighbor example above. These rich models are necessary for prediction from complex data, such as medical images, but knowing whether they generalize well requires cross-validation or, on very large datasets, held-out data.

Having clarified this potentially ambiguous aspect of Riley et al. (2020), we wish to support the important messages that they bring: predictive models need sufficient data, more than is typical in statistical practice, and truly external validation on data from a separate source or study.

References

Arlot, Sylvain, and Alain Celisse. 2010. “A Survey of Cross-Validation Procedures for Model Selection.” Statistics Surveys 4: 40–79.

Beran, Rudolf. 1997. “Diagnosing Bootstrap Success.” Annals of the Institute of Statistical Mathematics 49 (1): 1–24.

Breiman, Leo. 1996. "Out-of-Bag Estimation." https://www.stat.berkeley.edu/pub/users/breiman/OOBestimation.pdf.

Davison, A. C., D. V. Hinkley, and G. A. Young. 2003. “Recent Developments in Bootstrap Methodology.” Statistical Science: A Review Journal of the Institute of Mathematical Statistics.

Efron, Bradley. 2003. “Second Thoughts on the Bootstrap.” Statistical Science: A Review Journal of the Institute of Mathematical Statistics 18 (2): 135–40.

Efron, Bradley, and R. J. Tibshirani. 1994. An Introduction to the Bootstrap. CRC Press.

Efron, Bradley, and Robert Tibshirani. 1997. “Improvements on Cross-Validation: The 632+ Bootstrap Method.” Journal of the American Statistical Association 92 (438): 548–60.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2013. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer.

Mammen, Enno. 2012. When Does Bootstrap Work?: Asymptotic Results and Simulations. Springer Science & Business Media.

Moons, Karel G. M., Douglas G. Altman, Johannes B. Reitsma, John P. A. Ioannidis, Petra Macaskill, Ewout W. Steyerberg, Andrew J. Vickers, David F. Ransohoff, and Gary S. Collins. 2015. “Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD): Explanation and Elaboration.” Annals of Internal Medicine 162 (1): W1–73.

Poldrack, Russell A., Grace Huckins, and Gael Varoquaux. 2019. “Establishment of Best Practices for Evidence for Prediction: A Review.” JAMA Psychiatry.

Riley, Richard D., Joie Ensor, Kym I. E. Snell, Frank E. Harrell Jr, Glen P. Martin, Johannes B. Reitsma, Karel G. M. Moons, Gary Collins, and Maarten van Smeden. 2020. “Calculating the Sample Size Required for Developing a Clinical Prediction Model.” BMJ 368: m441.

Steyerberg, E. W., F. E. Harrell Jr, G. J. Borsboom, M. J. Eijkemans, Y. Vergouwe, and J. D. Habbema. 2001. “Internal Validation of Predictive Models: Efficiency of Some Procedures for Logistic Regression Analysis.” Journal of Clinical Epidemiology 54 (8): 774–81.

Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2016. “Understanding Deep Learning Requires Rethinking Generalization.” ICLR.

**Competing interests:** No competing interests

**20 August 2020**

## Response to comments from Gaël P Varoquaux and colleagues

We thank Gaël P Varoquaux and colleagues for their comments on our paper, and for supporting our argument for ensuring appropriate sample sizes for clinical prediction model research. Many published prediction models are based on inadequate sample sizes, and are thus likely to be unreliable when tested in new data. Our paper provides a framework to address this, as it calculates the sample size required to minimise overfitting and to estimate key parameters precisely. Our article primarily focuses on prediction models developed using regression, and on situations where N is much larger than P (i.e., the sample size greatly exceeds the number of predictor parameters), in order to minimise overfitting. In this context, we still stand by our statement that we "do not recommend data splitting (e.g., into model training and testing samples)", as this is inefficient and it is better to use all the data for model development, with resampling methods (such as bootstrapping or cross-validation) used for internal validation. However, we agree that this part of our text warrants further information, to limit potential for confusion.

To clarify, by data splitting we meant a single split of the data (for example, in the ratio 70:30), with the first part of the dataset used for model development and the second part used for testing (validation). This should be avoided in almost all situations. As correctly pointed out, k-fold cross-validation is a resampling approach for internal validation that also uses data splitting (e.g., split the data into tenths, use nine tenths for development and the remaining tenth for testing, and repeat this process k = 10 times). However, the final model is still developed using all of the data, and the aim of k-fold cross-validation is to estimate that model's expected performance in the underlying target population, obtained by averaging model performance across the k test datasets.

Bootstrapping is an alternative resampling approach for internal validation of a prediction model, which generally performs better than k-fold cross-validation in simulations [1]. For example, Steyerberg et al. [2] performed a simulation study of the performance of bootstrapping, k-fold cross-validation, and single data-splitting. They concluded that “split-sample analyses gave overly pessimistic estimates of performance, with large variability. Cross-validation on 10% of the sample had low bias and low variability, but was not suitable for all performance measures. Internal validity could best be estimated with bootstrapping, which provided stable estimates with low bias.” Hence, the model’s predictive performance in new data was best revealed by bootstrapping (or, using the words of Varoquaux and colleagues, the generalisability bias was lowest using bootstrapping). Furthermore, “the .632 and .632+ bootstrap variants were not shown to be superior to the regular bootstrap”. Further comparison of the original and 0.632 bootstrap approaches is given at https://hbiostat.org/doc/simval.html.

Other simulations and articles have come to the same conclusion to use bootstrapping [3, 4], although cross-validation can be improved by using a large number of repeats (e.g., 100). The simulations at https://hbiostat.org/doc/simval.html show that the bootstrap has lower mean squared error than cross-validation except in extreme (P > N) cases. Furthermore, when P > N, 100 repeats of 10-fold cross-validation are recommended.

We suspect that, when the minimum sample size requirements are met as described in our BMJ and related papers [5-7], the differences between bootstrapping and (repeated) k-fold cross-validation will be very small. The key point is not to use a single data split, and this was the intent of our aforementioned statement, which we firmly stand by. That is, researchers should use all their data for model development, and use either bootstrapping (our general preference) or (repeated) k-fold cross-validation to examine the likely performance of the model in the target population, as these give estimates of predictive performance closer to the truth (as reflected by smaller mean squared error; for example see https://www.fharrell.com/post/split-val/).

When applying bootstrapping or cross-validation, it is important to replay the same modelling steps that were used to develop the prediction model in the whole dataset. For example, steps such as variable selection, if used, need to be repeated in each bootstrap sample (or cross-validation fold). If some steps cannot be automated, this might give some credence (as a last resort) to using a single split-sample approach. Even then, the dataset would need to be very large (so that the model development and validation datasets are both fit for purpose), though there comes a point where there is little or no gain from splitting, as the apparent performance will be unbiased at very large sample sizes. Furthermore, the validation should be carried out truly independently (i.e., by a different analysis team, with the split performed before any analysis); otherwise there is a strong concern of researchers trying different splits until they find the one that shows the best performance.
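As a hedged illustration of this point, the sketch below implements an optimism-corrected bootstrap in the spirit described above, replaying the full modelling recipe (here: scaling, a variable-selection step, then logistic regression) inside every bootstrap sample. The dataset, the selection rule, the number of replicates, and AUC as the performance measure are all illustrative assumptions, not the exact procedure of any cited paper.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_with_all_steps(X, y):
    """Full modelling recipe: scaling, variable selection, then the model."""
    pipe = make_pipeline(StandardScaler(), SelectKBest(k=10),
                         LogisticRegression(max_iter=1000))
    return pipe.fit(X, y)

def auc(model, X, y):
    return roc_auc_score(y, model.predict_proba(X)[:, 1])

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.RandomState(0)
n = len(y)

# Apparent performance: the final model evaluated on its own development data.
apparent = auc(fit_with_all_steps(X, y), X, y)

optimism = []
for _ in range(200):                                 # bootstrap replicates
    idx = rng.randint(0, n, size=n)                  # bootstrap sample
    boot_model = fit_with_all_steps(X[idx], y[idx])  # replay *every* step
    # Optimism: bootstrap-sample performance minus original-data performance.
    optimism.append(auc(boot_model, X[idx], y[idx]) - auc(boot_model, X, y))

print(f"apparent AUC:           {apparent:.3f}")
print(f"optimism-corrected AUC: {apparent - np.mean(optimism):.3f}")
```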

Lastly, we read with interest the comment that “modern AI techniques, … often achieve very small apparent error on any data, as with the one nearest neighbor. These rich models are necessary for prediction using complex data, such as predictive models on medical images, but knowing whether they generalize well or not requires the use of cross-validation, or held out data on very large datasets.” We agree this is an important setting for prediction model research, and evaluations of cross-validation and bootstrapping are urgently needed in this area; therefore, going forward, we encourage researchers to carry out more simulation studies that compare bootstrapping and cross-validation in this setting.

Richard Riley, Maarten van Smeden, Gary Collins, Frank Harrell, Kym Snell, Joie Ensor, Karel Moons, Glen Martin

Reference List

1. Harrell FE, Jr. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis (Second Edition). New York: Springer 2015.

2. Steyerberg EW, Harrell FE Jr, Borsboom GJ, et al. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 2001;54:774-81.

3. Steyerberg EW, Harrell FE, Jr. Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol 2016;69:245-7.

4. Harrell FE, Jr., Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996;15(4):361-87.

5. Riley RD, Ensor J, Snell KIE, et al. Calculating the sample size required for developing a clinical prediction model. BMJ 2020;368:m441.

6. Riley RD, Snell KI, Ensor J, et al. Minimum sample size for developing a multivariable prediction model: Part II - binary and time-to-event outcomes. Stat Med 2019;38(7):1276-96.

7. Riley RD, Snell KIE, Ensor J, et al. Minimum sample size for developing a multivariable prediction model: Part I - Continuous outcomes. Stat Med 2019;38(7):1262-75.

**Competing interests:** We are the authors of the original article, and have published textbooks and articles that argue for the use of resampling methods (such as bootstrapping or cross-validation) for internal validation of clinical prediction models.

**15 September 2020**