
Open access (CC BY)
Research Methods & Reporting

Evaluation of clinical prediction models (part 1): from development to external validation

BMJ 2024; 384 doi: (Published 08 January 2024) Cite this as: BMJ 2024;384:e074819
  1. Gary S Collins, professor1,
  2. Paula Dhiman, senior researcher in medical statistics1,
  3. Jie Ma, medical statistician1,
  4. Michael M Schlussel, senior medical statistician1,
  5. Lucinda Archer, assistant professor2 3,
  6. Ben Van Calster, associate professor4 5 6,
  7. Frank E Harrell Jr, professor7,
  8. Glen P Martin, senior lecturer8,
  9. Karel G M Moons, professor9,
  10. Maarten van Smeden, associate professor9,
  11. Matthew Sperrin, senior lecturer8,
  12. Garrett S Bullock, assistant professor10 11,
  13. Richard D Riley, professor2 3
  1. Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford OX3 7LD, UK
  2. Institute of Applied Health Research, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
  3. National Institute for Health and Care Research (NIHR) Birmingham Biomedical Research Centre, UK
  4. KU Leuven, Department of Development and Regeneration, Leuven, Belgium
  5. Department of Biomedical Data Sciences, Leiden University Medical Centre, Leiden, Netherlands
  6. EPI-Centre, KU Leuven, Belgium
  7. Department of Biostatistics, Vanderbilt University, Nashville, TN, USA
  8. Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, UK
  9. Julius Centre for Health Sciences and Primary Care, University Medical Centre Utrecht, Utrecht University, Utrecht, Netherlands
  10. Department of Orthopaedic Surgery, Wake Forest School of Medicine, Winston-Salem, NC, USA
  11. Centre for Sport, Exercise and Osteoarthritis Research Versus Arthritis, University of Oxford, Oxford, UK
  1. Correspondence to: G S Collins gary.collins{at} (or @GSCollins on Twitter)
  • Accepted 4 September 2023

Evaluating the performance of a clinical prediction model is crucial to establish its predictive accuracy in the populations and settings intended for use. In this article, the first in a three part series, Collins and colleagues describe the importance of a meaningful evaluation using internal, internal-external, and external validation, as well as exploring heterogeneity, fairness, and generalisability in model performance.

Healthcare decisions for individuals are routinely made on the basis of risk or probability.1 Whether this probability is that a specific outcome or disease is present (diagnostic) or that a specific outcome will occur in the future (prognostic), it is important to know how these probabilities are estimated and whether they are accurate. Clinical prediction models estimate outcome risk for an individual conditional on their values for multiple predictors (eg, age, family history, symptoms, blood pressure). Examples include the ISARIC (International Severe Acute Respiratory and Emerging Infection Consortium) 4C model for estimating the risk of clinical deterioration in individuals with acute covid-19,2 or the PREDICT model for estimating the overall and breast cancer specific survival probability at five years for women with early breast cancer.3 Clinical decision making can also be informed by models that estimate continuous outcome values, such as fat mass in children and adolescents, although we focus on risk estimates in this article.4 With increasing availability of data, pressures to publish, and a surge in interest in approaches based on artificial intelligence and machine learning (such as deep learning and random forests5 6), prediction models are being developed at high volume. For example, diagnosis of chronic obstructive pulmonary disease has >400 models,7 cardiovascular disease prediction has >300 models,8 and covid-19 has >600 prognostic models.9

Despite the increasing number of models, very few are routinely used in clinical practice owing to issues including study design and analysis concerns (eg, small sample size, overfitting), incomplete reporting (leading to difficulty in fully appraising prediction model studies), and no clear link into clinical decision making. Fundamentally, there is often a failure to evaluate fairly and meaningfully the predictive performance of a model in representative target populations and clinical settings. Lack of transparent and meaningful evaluation obfuscates judgments about the potential usefulness of the model, and whether it is ready for the next stage of evaluation (eg, an intervention, or cost effectiveness study) or requires updating (eg, recalibration). To manage this deficit, this three part series outlines the importance of model evaluation and how to undertake it well, to help researchers provide a reliable and fair picture of a model’s predictive accuracy.

In this first article, we explain the rationale for model evaluation, and emphasise that it involves examining a model’s predictive performance at multiple stages, including at model development (internal validation) and in new data (external validation). Subsequent papers in this series consider the study design and performance measures used to evaluate the predictive accuracy of a model (part 210) and the sample size requirements for external validation (part 311). Box 1 provides a glossary of key terms.

Box 1

Glossary of terms


Calibration

Agreement between the observed outcomes and estimated risks from the model. Calibration should be assessed visually with a plot of the estimated risks on the x axis and the observed outcome on the y axis, with a smoothed flexible calibration curve in the individual data. Calibration can also be quantified numerically with the calibration slope (ideal value 1) and calibration-in-the-large (ideal value 0).


Calibration-in-the-large

Assesses mean (overall) calibration and quantifies any systematic overestimation or underestimation of risk, by comparing the mean number of predicted outcomes and the mean number of observed outcomes.

Calibration slope

Quantifies the spread of the estimated risks from the model relative to the observed outcomes. A slope <1 suggests that the spread of estimated risks is too extreme (ie, too high for individuals at high risk, and too low for those at low risk). A slope >1 suggests that the spread of estimated risks is too narrow.


Discrimination

Assesses how well the predictions from the model differentiate between those with and without the outcome. Discrimination is typically quantified by the c statistic (sometimes referred to as the AUC or AUROC) for binary outcomes, and the c index for time-to-event outcomes. A value of 0.5 indicates that the model is not better than a coin toss, and a value of 1 denotes perfect discrimination (ie, all individuals with the outcome have higher estimated risks than all individuals without the outcome). What defines a good c statistic value is context specific.


Overfitting

When the prediction model fits unimportant idiosyncrasies in the development data, to the point that the model performs poorly in new data, typically with miscalibration reflected by calibration slopes less than 1.

Parameter tuning

Finding the best settings for a particular model building strategy.


Shrinkage

Counteracting overfitting by deliberately inducing bias in the predictor effects by shrinking them towards zero.

  • AUC=area under the curve; AUROC=area under the receiver operating characteristic curve.
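The measures defined in box 1 can be computed directly from estimated risks and observed binary outcomes. The following is a minimal sketch in Python (standard library only), assuming outcomes coded 0/1 and estimated risks strictly between 0 and 1. The function names are illustrative and not taken from any package. Calibration-in-the-large is estimated here as the intercept of a logistic model with the logit of the estimated risk as an offset, and the calibration slope as the coefficient of that logit in a logistic recalibration model.

```python
import math
import random

def logit(p):
    return math.log(p / (1 - p))

def expit(z):
    return 1 / (1 + math.exp(-z))

def c_statistic(risks, outcomes):
    """Share of (event, non-event) pairs in which the event has the higher
    estimated risk; tied pairs count as one half."""
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    non_events = [r for r, y in zip(risks, outcomes) if y == 0]
    total = 0.0
    for e in events:
        for n in non_events:
            total += 1.0 if e > n else 0.5 if e == n else 0.0
    return total / (len(events) * len(non_events))

def calibration_in_the_large(risks, outcomes, iterations=25):
    """Intercept a in logit(P(y=1)) = a + logit(risk); ideal value 0."""
    lp = [logit(r) for r in risks]
    a = 0.0
    for _ in range(iterations):  # Newton-Raphson on a single parameter
        fitted = [expit(a + z) for z in lp]
        gradient = sum(y - f for y, f in zip(outcomes, fitted))
        information = sum(f * (1 - f) for f in fitted)
        a += gradient / information
    return a

def calibration_slope(risks, outcomes, iterations=25):
    """Slope b in logit(P(y=1)) = a + b * logit(risk); ideal value 1."""
    lp = [logit(r) for r in risks]
    a, b = 0.0, 1.0
    for _ in range(iterations):  # Newton-Raphson on (a, b)
        fitted = [expit(a + b * z) for z in lp]
        w = [f * (1 - f) for f in fitted]
        g0 = sum(y - f for y, f in zip(outcomes, fitted))
        g1 = sum((y - f) * z for y, f, z in zip(outcomes, fitted, lp))
        h00 = sum(w)
        h01 = sum(wi * z for wi, z in zip(w, lp))
        h11 = sum(wi * z * z for wi, z in zip(w, lp))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return b

# Simulated illustration (not real data): outcomes are drawn from the stated
# risks, so these "estimated risks" are well calibrated by construction.
rng = random.Random(2024)
risks = [rng.uniform(0.05, 0.95) for _ in range(2000)]
outcomes = [1 if rng.random() < r else 0 for r in risks]
print("c statistic:", round(c_statistic(risks, outcomes), 2))
print("calibration-in-the-large:", round(calibration_in_the_large(risks, outcomes), 2))
print("calibration slope:", round(calibration_slope(risks, outcomes), 2))
```

Because the simulated outcomes are generated from the risks themselves, calibration-in-the-large should be close to 0 and the slope close to 1, up to sampling variability. A visual calibration plot with a smoothed curve remains the primary assessment and is not covered by this sketch.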


Summary points

  • Clinical prediction models use a combination of variables to estimate outcome risk for individuals

  • Evaluating the performance of a prediction model is critically important and validation studies are essential, as a poorly developed model could be harmful or exacerbate disparities in either provision of health care or subsequent healthcare outcomes

  • Evaluating model performance should be carried out in datasets that are representative of the intended target populations for the model’s implementation

  • A model’s predictive performance will often appear to be excellent in the development dataset but be much lower when evaluated in a separate dataset, even from the same population

  • Splitting data at the moment of model development should generally be avoided as it discards data leading to a more unreliable model, whilst leaving too few data to reliably evaluate its performance

  • Concerted efforts should be made to exploit all available data to build the best possible model, with better use of resampling methods for internal validation, and internal-external validation to evaluate model performance and generalisability across clusters

Why do we need to evaluate prediction models?

During model development (or training), study design and data analysis aspects will have an impact on the predictive performance of the model in new data from some target population. A model’s predictive performance will often appear excellent in the development dataset but be much lower when evaluated in a separate dataset, even from the same population, often rendering the model much less accurate. The downstream effect is that the model will be less useful and even potentially harmful, including exacerbating inequalities in either provision of healthcare or subsequent healthcare outcomes. Therefore, once a prediction model has been developed, it is clearly important to carry out a meaningful evaluation of how well it performs.

Evaluating the performance of a prediction model is generally referred to as validation.12 However, the term validation is ill defined, used inconsistently,13 and evokes a sense of achieving some pre-defined level of statistical or clinical usefulness. A validated model might even (albeit wrongly) be considered a sign of approval for use in clinical practice. Many prediction models that have undergone some form of validation will still have poor performance, either a substantial decrease in model discrimination or, more likely, in calibration (see box 1 for definitions of these measures, and part 2 of our series for more detailed explanation10). Yet determining what level of predictive performance is inadequate (eg, how miscalibrated a model needs to be to conclude poor performance) is subjective. Many validation studies are also too small, a consideration that is frequently overlooked, leading to imprecise estimation of a model’s performance (see part 3 on guidance for sample size11). Therefore, referring to a model as having been “validated” or being “valid,” just because a study labelled as validation has been conducted, is unhelpful and arguably misleading. Indeed, variation in performance over different target populations,14 or different time periods and places (eg, different centres or countries), is to be expected15 and so a model can never be proven to be always valid (nor should we expect it to be16).

Figure 1 shows a summary of the different study designs and approaches involving prediction model development and validation. The decision of which validation to carry out depends on the research question that is being asked and the availability of existing data. Regardless of the development approach, the validation component is essential, because any study developing a new prediction model should, without exception, always evaluate the model’s predictive performance for the target population, setting and outcome of interest. We now outline the various options for model evaluation, moving from internal validation to external validation.

Fig 1

Different study designs and approaches to develop and evaluate the performance of a multivariable prediction model (D=development; V=validation (evaluation)). Adapted from Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015;350:g7594.17 *A study can include more than one analysis type

Evaluation at model development: internal validation approaches

At the stage of model development, depending on the availability, structure (eg, multiple datasets, multicentre) and size of the available data, investigators are faced with deciding how best to use the available data to both develop a clinical prediction model and evaluate its performance in an unbiased, fair, and informative manner. When the evaluation uses the same data (or data source) as used for model development, the process is referred to as internal validation. For example, the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) reporting guideline requires users to “specify type of model, all model-building procedures (including any predictor selection), and method for internal validation.”1718

Widely used approaches for internal validation are based on data splitting (using a subset of the data for development and the remainder for evaluation) or resampling (eg, k-fold cross validation or bootstrapping; table 1). For very large datasets, and computationally intensive model building procedures (eg, including parameter tuning; box 1), the decision on which approach is used for internal validation could be a pragmatic one. Nevertheless, some approaches are inefficient and uninformative, and, especially in small sample sizes, might even lead to biased, imprecise and optimistic results and ultimately misleading conclusions. Therefore, we now describe the advantages and disadvantages of several strategies in detail.

Table 1

Different approaches for evaluating model performance


Apparent performance

The simplest approach is to use all the available data to develop a prediction model and then directly evaluate its performance in exactly the same data (often referred to as apparent performance). Clearly, using this approach is problematic, particularly when model complexity and the number of predictors (model parameters to be estimated) are large relative to the number of events in the dataset (indicative of overfitting).20 The apparent performance of the model will therefore typically be optimistic; that is, when the model is subsequently evaluated in new data, even in the same population, the performance will usually be much lower. For small datasets, the optimism and uncertainty in the apparent performance can be substantial. As the sample size of the data used to develop the model increases, the optimism and uncertainty in apparent performance will decrease, but in most healthcare research datasets some (non-negligible) optimism will occur.20 21

To illustrate apparent performance, we consider a logistic regression model for predicting in-hospital mortality within 28 days of trauma injury using data from the CRASH-2 clinical trial (n=20 207, 3089 died within 28 days).22 The model uses 14 predictors: four clinical predictors (age, sex, systolic blood pressure, and Glasgow coma score) and 10 noise predictors (ie, truly unrelated to the outcome). Varying the sample size between 200 and 10 000, models are fit to 500 subsets of the dataset that are created by resampling (with replacement) from the entire CRASH-2 data, and each model’s apparent performance is calculated. For simplicity, we focus primarily on the c statistic, a measure of a prediction model’s discrimination (how well the model differentiates between those with and without the outcome, with a value of 0.5 denoting no discrimination and 1 denoting perfect discrimination; see box 1 and part 2 of the series10). Figure 2 shows the magnitude and variability of the difference in the c statistic for the apparent performance estimate compared with the large sample performance value of 0.815 (ie, a model developed on all the available data). For small sample sizes, there is a substantial difference (estimates are systematically much larger) and large variation, with the apparent c statistic ranging anywhere from 0.7 to just under 1. This variability in apparent performance decreases as the sample size increases, and for very large sample sizes, the optimism in apparent performance is negligible, so that apparent performance is a good estimate of the underlying performance in the full (CRASH-2) population.

Fig 2

Variability and overestimation of apparent performance compared to large sample performance, for a model to predict in-hospital mortality within 28 days of trauma injury with increasing sample size of the model development study. ĉ denotes the apparent performance estimate and c_large denotes the performance of the model in the entire CRASH-2 population (n=20 207).22 Red lines=mean ĉ−c_large for each sample size. Jitter has been added to aid display. ĉ−c_large=0 implies no systematic overestimation or underestimation of ĉ
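The pattern in figure 2 can be reproduced qualitatively in a small simulation. The sketch below does not use the CRASH-2 data: it assumes a simulated population with one informative predictor and 10 noise predictors, and, to stay free of dependencies, it swaps logistic regression for a simple difference-in-means linear score. All function names and settings are hypothetical. For each development sample size it averages the difference between the apparent c statistic and the c statistic of a model developed and evaluated in a large sample.

```python
import random

N_NOISE = 10  # noise predictors, unrelated to the outcome by construction

def simulate(n, rng, prevalence=0.3):
    """Simulated individuals as (predictors, outcome); only x[0] is informative."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < prevalence else 0
        x = [rng.gauss(0.8 * y, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(N_NOISE)]
        data.append((x, y))
    return data

def develop(data):
    """Stand-in for model fitting (an assumption, not logistic regression):
    one weight per predictor, equal to the difference in predictor means
    between events and non-events."""
    events = [x for x, y in data if y == 1]
    non_events = [x for x, y in data if y == 0]
    return [sum(x[j] for x in events) / len(events)
            - sum(x[j] for x in non_events) / len(non_events)
            for j in range(len(data[0][0]))]

def c_statistic(weights, data):
    """c statistic of the linear score; assumes continuous scores, so ties
    between an event and a non-event are not handled."""
    scored = sorted((sum(w * v for w, v in zip(weights, x)), y) for x, y in data)
    non_events_below = concordant = 0
    for _, y in scored:
        if y == 1:
            concordant += non_events_below
        else:
            non_events_below += 1
    n_events = sum(y for _, y in scored)
    return concordant / (n_events * (len(scored) - n_events))

def mean_overestimation(sizes=(50, 200, 1000), repetitions=50, seed=1):
    """Mean of (apparent c statistic - large sample c statistic) per sample size."""
    rng = random.Random(seed)
    large = simulate(5000, rng)
    c_large = c_statistic(develop(large), large)  # large sample performance
    result = {}
    for n in sizes:
        differences = []
        for _ in range(repetitions):
            development = simulate(n, rng)
            c_hat = c_statistic(develop(development), development)  # apparent
            differences.append(c_hat - c_large)
        result[n] = sum(differences) / repetitions
    return result

overestimation = mean_overestimation()
for n, difference in overestimation.items():
    print("n =", n, "mean overestimation =", round(difference, 3))
```

Under these assumptions the mean overestimation should be clearly positive for the smallest sample size and shrink towards zero as the sample size grows, mirroring the red lines in figure 2. The exact values depend on the simulated setting and carry no meaning for CRASH-2.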

Random split

Randomly splitting a dataset is often erroneously perceived as a methodological strength—it is not. Authors also often label the two datasets (created by splitting) as independent; despite no overlap in patients, the label “independent” is a misnomer, because they clearly both come from the same dataset (and data source).

Randomly splitting obviously creates two smaller datasets,23 and often the full dataset is not even large enough to begin with. Having a dataset that is too small to develop the model increases the likelihood of overfitting and producing an unreliable model,20 21 24 25 26 and a test set that is too small cannot reliably and precisely estimate model performance; this is a clear waste of precious information27 28 29 (see part 3 in this series11). Figure 3 illustrates the impact of sample size on performance (the c statistic) of a prediction model using a random split sample approach. Using the same approach as before, a logistic regression model for predicting 28 day mortality in patients with trauma injury was developed using 14 predictors (age, sex, systolic blood pressure, Glasgow coma score, and 10 noise predictors). The models are fit and evaluated in 500 split sample subsets of the CRASH-2 data, whereby 70% of observations are allocated to the development data and 30% to the test data (eg, for total sample size of n=200, 140 are used for development and 60 are used for evaluation). The results clearly show that for small datasets, using a split sample approach is inefficient and unhelpful. The apparent c statistic of the developed model is too large (ie, optimistic) compared with the large sample performance and noticeably variable, while the test set evaluation (validation) shows that the developed model’s c statistic is much lower and highly variable, and underestimated relative to the large sample performance of the model (again, indicative of overfitting during model development due to too few data). Also, when fewer participants (eg, 90:10 split) are assigned to the test set, even more variability is seen in the model’s observed test set performance (supplementary fig 1).

Fig 3

Variability and overestimation of the apparent and internal (split sample and bootstrap) validation performance compared with the large sample performance, for a model to predict in-hospital mortality within 28 days of trauma injury with increasing sample size of the model development study. ĉ denotes the apparent performance estimate and c_large denotes the performance of the model in the entire CRASH-2 population (n=20 207). The red lines denote the mean ĉ−c_large for each sample size and for each approach. Jitter has been added to aid display. Split sample (apparent, 70%)=70% of the available data were used to develop the model, and its (apparent) performance evaluated in this same data. Split sample (validation, 30%)=the performance of the model (developed in 70% of the available data) in the remaining 30% of the data. ĉ−c_large=0 implies no systematic overestimation or underestimation of ĉ

As sample size increases, the difference between the split sample apparent performance and the test set performance reduces. In very large sample sizes, the difference is negligible. Therefore, data splitting is unnecessary: it is not an improvement on using all the data for model development and reporting apparent performance when the sample size is large, or on using internal validation methods (eg, bootstrapping, see below) when the sample size is smaller. This observation is not new and was stated in the methodological literature more than 20 years ago,30 but the message has still not made it to the mainstream biomedical and machine learning literature.

For models with high complexity (eg, deep learners) that prohibit resampling of the full dataset (eg, using bootstrapping), a split sample approach might still be necessary. Similarly, sometimes two or more datasets could be available (eg, from two e-health databases) but not combinable, owing to local restrictions on data sharing, such that a split sample is enforced. In these situations, we strongly recommend having very large development and test datasets, as otherwise the developed model might be unstable and test performance unreliable, rendering the process futile. Concerns of small sample sizes can be revealed by instability plots and measures of uncertainty.31

In addition to the issues of inefficiency and increased variability (instability), randomly splitting the dataset also opens up the danger of multiple looks and spin. That is, if poor performance is observed when evaluating the model in the test portion of the randomly split dataset, researchers could be tempted to repeat the analysis, splitting the data again until the desired results are obtained, similar to P hacking, and thus misleading readers into believing the model has good performance.

Resampling approaches: bootstrapping and k-fold cross validation

Unlike the split sample approach, which evaluates a specific model, bootstrapping evaluates the model building process itself (eg, predictor selection, imputation, estimation of regression coefficients), and estimates the amount of optimism (due to model overfitting) expected when using that process with the sample size available.32 This estimate of optimism is then used to produce stable and approximately unbiased estimates of future model performance (eg, c statistic, calibration slope) in the population represented by the development dataset.30 The process starts by using the entire dataset to develop the prediction model and estimate its apparent performance. Bootstrapping is then used to estimate and adjust for optimism, in both the estimates of model performance and the regression coefficients (box 2).

Box 2

Using bootstrapping for internal validation

The steps to calculate optimism corrected performance using bootstrapping are:

  1. Develop the prediction model using the entire original data and calculate the apparent performance.

  2. Generate a bootstrap sample (of the same size as the original data), by sampling individuals with replacement from the original data.

  3. Develop a bootstrap model using the bootstrap sample (applying all the same modelling and predictor selection methods, as in step 1):

    1. Determine the apparent performance (eg, c statistic, calibration slope) of this model on the bootstrap sample (bootstrap performance).

    2. Determine the performance of the bootstrap model in the original data (test performance).

  4. Calculate the optimism as the difference between the bootstrap performance and the test performance.

  5. Repeat steps 2 to 4 many times (eg, 500 times).

  6. Average the estimates of optimism in step 5.

  7. Subtract the average optimism (from step 6) from the apparent performance obtained in step 1 to obtain an optimism corrected estimate of performance.

The variability in the optimism corrected estimates, across the bootstrap samples, can also be reported to demonstrate stability.33 The bootstrap models produced in step 3 will vary (and differ from the prediction model developed on the entire data), but these bootstrap models are only used in the evaluation of performance and not for individual risk prediction. Steyerberg and colleagues have shown that the expected optimism could precisely be estimated with as few as 200 bootstraps with minor sampling variability; with modern computational power, we generally recommend at least 500 bootstraps.34 An additional benefit of this bootstrap process is that the value of the optimism corrected calibration slope can be used to adjust the model for any overfitting by applying it as a shrinkage factor to the original regression coefficients (predictor effects).32 35 36
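The steps in box 2 can be sketched as follows. This is a minimal illustration in Python rather than the authors' code: the data are simulated, a difference-in-means linear score stands in for the full model building process, and the function names (`optimism_corrected`, `develop`, `c_statistic`, `simulate`) are hypothetical. In a real analysis, `develop` would need to repeat every modelling step (eg, predictor selection, imputation, parameter tuning) within each bootstrap sample.

```python
import random

def simulate(n, rng, prevalence=0.3, n_noise=10):
    """Simulated individuals as (predictors, outcome); only x[0] is informative."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < prevalence else 0
        x = [rng.gauss(0.8 * y, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        data.append((x, y))
    return data

def develop(data):
    """Stand-in model building process: difference in predictor means
    between events and non-events, used as weights of a linear score."""
    events = [x for x, y in data if y == 1]
    non_events = [x for x, y in data if y == 0]
    return [sum(x[j] for x in events) / len(events)
            - sum(x[j] for x in non_events) / len(non_events)
            for j in range(len(data[0][0]))]

def c_statistic(weights, data):
    """c statistic of the linear score (continuous scores assumed)."""
    scored = sorted((sum(w * v for w, v in zip(weights, x)), y) for x, y in data)
    non_events_below = concordant = 0
    for _, y in scored:
        if y == 1:
            concordant += non_events_below
        else:
            non_events_below += 1
    n_events = sum(y for _, y in scored)
    return concordant / (n_events * (len(scored) - n_events))

def optimism_corrected(data, develop, performance, n_boot=500, seed=1):
    """Returns (apparent, mean optimism, optimism corrected performance)."""
    rng = random.Random(seed)
    model = develop(data)                             # step 1
    apparent = performance(model, data)
    optimism = []
    for _ in range(n_boot):                           # step 5: repeat steps 2 to 4
        boot = [rng.choice(data) for _ in data]       # step 2: sample with replacement
        boot_model = develop(boot)                    # step 3: same modelling steps
        boot_perf = performance(boot_model, boot)     # step 3.1: bootstrap performance
        test_perf = performance(boot_model, data)     # step 3.2: test performance
        optimism.append(boot_perf - test_perf)        # step 4
    mean_optimism = sum(optimism) / n_boot            # step 6
    return apparent, mean_optimism, apparent - mean_optimism  # step 7

data = simulate(100, random.Random(11))
apparent, optimism, corrected = optimism_corrected(data, develop, c_statistic)
print("apparent:", round(apparent, 3))
print("mean optimism:", round(optimism, 3))
print("optimism corrected:", round(corrected, 3))
```

Passing `develop` and `performance` as arguments keeps the resampling logic separate from the model, so the same function could be reused with another performance measure such as the calibration slope.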


Figure 3 shows that using all the available data to develop a model and using bootstrapping to obtain an estimate of the model’s optimism corrected performance, is an efficient approach to internal validation, leading to estimates of model performance that are closest to the large sample performance (eg, compared to a split sample approach), as shown elsewhere30 (supplementary table 1). For very large datasets, the computational burden to carry out bootstrapping can prohibit its use; in these instances, however, little is achieved over using the entire dataset to both derive and evaluate a model, because the estimate of apparent performance should be a good approximation of the underlying large sample performance of the model.

Another resampling method, k-fold cross validation, will often perform comparably to bootstrapping.30 Like bootstrapping, all available data are used to develop the model, and all available data are used to evaluate model performance. k-fold cross validation can be seen as an extension of the split sample approach but with a reduction in the bias and variability in estimation of model performance (box 3).

Box 3

Use of k-fold cross validation for internal validation

The process of k-fold cross validation entails splitting the data into “k” equal sized groups. A model is developed in k-1 groups, and its performance (eg, c statistic) evaluated in the remaining group. This process is carried out k times, so that each time a different set of k-1 groups is used to develop the model and a different group is used to evaluate model performance (fig 4). The average performance over the k iterations is taken as an estimate of the model performance.

Fig 4

Graphical illustration of k-fold cross validation. Non-shaded parts used for model development; shaded part used for testing

In practice, the value of k is usually taken to be 5 or 10; cherry picking k should be avoided. Repeated k-fold cross validation (where k-fold validation is repeated multiple times and results averaged across them) will generally improve on k-fold cross validation.
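The procedure in box 3 can be sketched in a few lines. The example below is a hedged illustration in Python with simulated data and a hypothetical stand-in model (a difference-in-means linear score instead of a regression model); `k_fold_performance` and the other names are not from any library. The final model would still be developed on all the data; the cross validation only estimates its expected performance.

```python
import random

def simulate(n, rng, prevalence=0.3, n_noise=10):
    """Simulated individuals as (predictors, outcome); only x[0] is informative."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < prevalence else 0
        x = [rng.gauss(0.8 * y, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        data.append((x, y))
    return data

def develop(data):
    """Stand-in model: difference in predictor means, events minus non-events."""
    events = [x for x, y in data if y == 1]
    non_events = [x for x, y in data if y == 0]
    return [sum(x[j] for x in events) / len(events)
            - sum(x[j] for x in non_events) / len(non_events)
            for j in range(len(data[0][0]))]

def c_statistic(weights, data):
    """c statistic of the linear score (continuous scores assumed)."""
    scored = sorted((sum(w * v for w, v in zip(weights, x)), y) for x, y in data)
    non_events_below = concordant = 0
    for _, y in scored:
        if y == 1:
            concordant += non_events_below
        else:
            non_events_below += 1
    n_events = sum(y for _, y in scored)
    return concordant / (n_events * (len(scored) - n_events))

def k_fold_performance(data, develop, performance, k=5, seed=1):
    """Average performance over k held-out groups of (nearly) equal size."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    estimates = []
    for held_out in folds:
        held = set(held_out)
        training = [data[i] for i in indices if i not in held]  # k-1 groups
        testing = [data[i] for i in held_out]                   # remaining group
        estimates.append(performance(develop(training), testing))
    return sum(estimates) / k

data = simulate(300, random.Random(5))
cv_c = k_fold_performance(data, develop, c_statistic, k=5)
print("apparent c statistic:", round(c_statistic(develop(data), data), 3))
print("5-fold cross validated c statistic:", round(cv_c, 3))
```

Repeated k-fold cross validation could be obtained by calling `k_fold_performance` with several different seeds and averaging the results.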


Non-random split (at model development)

Alternative splitting approaches include splitting by time (referred to as temporal validation) or by location (referred to as geographical or spatial validation).37 However, they remove the opportunity to explore and capture time and location features during model development to help explain variability in outcomes.

In a temporal validation, data from one time period are used to develop the prediction model while data from a different (non-overlapping) time period are used to evaluate its performance. The concern, though, is selecting which time period should be used to develop the model, and which to use for evaluation. Using data from the older time period for model development might not reflect current patient characteristics (predictors and outcomes) or current care. Conversely, using the more contemporary time period to develop the model leaves the data from an older time period to evaluate the performance, and so only provides information on the predictive accuracy in a historical cohort of patients. Neither option is satisfactory, and this approach (at the moment of model development) is not recommended. For example, improvements over time in surgical techniques have led to a larger number of patients surviving surgery,38 and thus the occurrence of the outcome being predicted will decrease over time, which will have an impact on model calibration. Methods such as continual (model) updating or dynamic prediction models should therefore be considered to prevent calibration drift.39 Temporal recalibration is another option40 where the predictor effects are estimated in the whole dataset, but the baseline risk is estimated in the most recent time window.

In a geographical or spatial validation, data from one geographical location (or hospitals, centres) are used to develop the model, while data from a separate geographical location are used to evaluate the model. As with other data splitting approaches previously discussed, in most (if not all) instances there is little to be gained from splitting, which is instead a missed opportunity to use all available data to develop a model with wider generalisability. However, if data from many geographical regions (or centres) are available to develop a model, comprising a very large number of observations (and outcomes), and computational burden of model development prohibits k-fold cross validation or bootstrapping, leaving out one or more regions or centres to evaluate performance might not be too detrimental.41 As with the random split approach, researchers might be tempted to split the data (eg, into different time periods and lengths, different centres) repeatedly until satisfactory performance has been achieved; this approach should be avoided. If splitting is to be considered, the splits should be done only once (ie, no repeated splitting until good results are achieved), ensuring that the sample sizes for development and evaluation are of sufficient size.

Evaluation at model development: internal-external cross validation

Data from large electronic health record databases, multicentre studies, or individual participant data from multiple studies are increasingly being made available and used for prediction model purposes.15 42 Researchers might be tempted to perform some form of (geographical or spatial) splitting, whereby only a portion (eg, a group of centres, regions of a country, or a group of studies) is used to develop the model, and the remaining data are used to evaluate its performance. However, internal-external cross validation is a more efficient and informative approach43 44 45 46 that examines heterogeneity and generalisability in model performance (box 4).

Box 4

Internal-external cross validation

Internal-external validation exploits a common feature present in many datasets, namely that of clustering (eg, by centre, geographical region, or study). Instead of partitioning the data into development and validation cohorts, all the data are used to build the prediction model and iteratively evaluate its performance. The performance of this model (developed on all the data) is then examined using cross validation by cluster, where a cluster is held out (eg, a centre, geographical region, study) and the same model building steps (as used on the entire data) are applied to the remaining clusters. The model is then evaluated in the held-out cluster (ie, estimates of calibration and discrimination along with confidence intervals). These steps are repeated, each time taking out a different cluster44 thereby allowing the generalisability and heterogeneity of performance to be examined across clusters (using meta-analysis techniques; fig 5).

Fig 5

Graphical illustration of internal-external cross validation. Non-shaded parts used for model development; shaded part used for testing

The results can then be presented in a forest plot to aid interpretation, and a summary estimate calculated using (random effects) meta-analysis. TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis)-Cluster provides recommendations for reporting prediction model studies that have accounted for clustering during validation, including the approach of internal-external cross validation.4748
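The loop described in box 4 can be sketched as below. This is an illustration under stated assumptions, written in Python: the clusters are five simulated, hypothetical regions with different outcome prevalence, a difference-in-means linear score stands in for the model building steps, and `internal_external_cv` is an illustrative name. The sketch returns point estimates of the c statistic only; confidence intervals, calibration measures, and the meta-analysis across clusters are not shown.

```python
import random

def simulate(n, rng, prevalence, n_noise=10):
    """Simulated individuals as (predictors, outcome); only x[0] is informative."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < prevalence else 0
        x = [rng.gauss(0.8 * y, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        data.append((x, y))
    return data

def develop(data):
    """Stand-in model: difference in predictor means, events minus non-events."""
    events = [x for x, y in data if y == 1]
    non_events = [x for x, y in data if y == 0]
    return [sum(x[j] for x in events) / len(events)
            - sum(x[j] for x in non_events) / len(non_events)
            for j in range(len(data[0][0]))]

def c_statistic(weights, data):
    """c statistic of the linear score (continuous scores assumed)."""
    scored = sorted((sum(w * v for w, v in zip(weights, x)), y) for x, y in data)
    non_events_below = concordant = 0
    for _, y in scored:
        if y == 1:
            concordant += non_events_below
        else:
            non_events_below += 1
    n_events = sum(y for _, y in scored)
    return concordant / (n_events * (len(scored) - n_events))

def internal_external_cv(data_by_cluster, develop, performance):
    """Hold out each cluster in turn, rebuild the model in the remaining
    clusters, and evaluate it in the held-out cluster."""
    results = {}
    for held_out in data_by_cluster:
        training = [row for name, rows in data_by_cluster.items()
                    if name != held_out for row in rows]
        results[held_out] = performance(develop(training), data_by_cluster[held_out])
    return results

rng = random.Random(7)
regions = {f"region {i + 1}": simulate(300, rng, prevalence)
           for i, prevalence in enumerate((0.2, 0.3, 0.3, 0.4, 0.5))}

# The model intended for use is developed on all clusters combined.
final_model = develop([row for rows in regions.values() for row in rows])

for region, c in internal_external_cv(regions, develop, c_statistic).items():
    print(region, "held-out c statistic:", round(c, 3))
```

The cluster-specific estimates are what would be displayed in a forest plot and summarised with a random effects meta-analysis.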


For example, internal-external cross validation was used in the development of the ISARIC 4C model to identify individuals at increased risk of clinical deterioration in adults with acute covid-19.2 The authors used all their available data (n=74 944) from nine regions of the UK (each comprising between 3066 and 15 583 individuals) to develop the model but then, to examine generalisability and heterogeneity, performed an internal-external cross validation with eight regions in the model development and the ninth region held out for evaluation. The authors demonstrated that the model performed consistently across regions, with point estimates of the c statistic ranging from 0.75 to 0.77, and a pooled random effects meta-analysis estimate of 0.76 (95% confidence interval 0.75 to 0.77; fig 6).

Fig 6

Internal-external cross validation of the ISARIC (International Severe Acute Respiratory and Emerging Infection Consortium) 4C model. Adapted from Gupta et al.2 Estimates and confidence intervals taken from original paper where they were reported to two decimal places.
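Pooling cluster-specific estimates with a random effects meta-analysis can be sketched as follows. This is an illustration only: it uses the DerSimonian-Laird estimator, which is an assumption and not necessarily the method used for the ISARIC 4C analysis, and the c statistics and standard errors are hypothetical values rather than those reported by Gupta et al. Pooling of c statistics is often done on the logit scale; the raw scale is used here for brevity.

```python
import math


def dersimonian_laird(estimates, std_errors):
    """Random effects pooled estimate, 95% confidence interval, and tau squared."""
    variances = [se ** 2 for se in std_errors]
    weights = [1.0 / v for v in variances]
    fixed = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)

    # Cochran's Q and the method-of-moments estimate of between-cluster variance
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    k = len(estimates)
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random effects weights add tau squared to each within-cluster variance
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / sum(re_weights)
    se_pooled = math.sqrt(1.0 / sum(re_weights))
    ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
    return pooled, ci, tau2


# Hypothetical held-out-cluster c statistics and standard errors
c_stats = [0.72, 0.78, 0.75, 0.70, 0.80]
std_errors = [0.02, 0.03, 0.025, 0.04, 0.03]

pooled, ci, tau2 = dersimonian_laird(c_stats, std_errors)
print(f"pooled c statistic {pooled:.3f} (95% CI {ci[0]:.3f} to {ci[1]:.3f}), tau^2 {tau2:.5f}")
```

A tau squared above zero indicates that performance varies between clusters by more than sampling error alone would explain, which is the heterogeneity that internal-external cross validation is designed to reveal.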

Evaluation using new data: external validation

External validation is the process of evaluating the performance of an existing model in a new dataset that differs from the data (and the data source) used for model development. It is an important component in the pipeline of a prediction model because it aims to demonstrate the generalisability and transportability of the model beyond the data (and population) used to develop it (eg, in different hospitals, different countries).49 For example, Collins and Altman conducted an independent external validation of QRISK2 and the Framingham risk score (at the time recommended by National Institute for Health and Care Excellence in the UK), and demonstrated systematic miscalibration of Framingham, no net benefit at the treatment thresholds current at the time, and the need for different treatment thresholds.50
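Two of the quantities mentioned in this example, calibration and net benefit, can be computed directly once an existing model's predicted risks are available for the new dataset. The sketch below is illustrative only: the predicted risks, outcomes, and the 25% threshold are hypothetical, calibration is summarised by the observed to expected ratio alone (a calibration plot would normally accompany it), and net benefit uses the standard decision curve definition, true positives per person minus false positives per person weighted by the odds of the threshold.

```python
def observed_expected_ratio(risk, y):
    """Observed events divided by the sum of predicted risks (1 = good overall calibration)."""
    return sum(y) / sum(risk)


def net_benefit(risk, y, threshold):
    """Net benefit of treating everyone whose predicted risk is at or above the threshold."""
    n = len(y)
    true_pos = sum(1 for r, yi in zip(risk, y) if r >= threshold and yi == 1)
    false_pos = sum(1 for r, yi in zip(risk, y) if r >= threshold and yi == 0)
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y, threshold):
    """Reference strategy: treat everyone regardless of predicted risk."""
    prevalence = sum(y) / len(y)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical predicted risks from an existing model and outcomes in new data
risk = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90]
outcome = [0, 0, 0, 1, 0, 1, 1, 1]
threshold = 0.25  # assumed treatment threshold

oe = observed_expected_ratio(risk, outcome)
nb_model = net_benefit(risk, outcome, threshold)
nb_all = net_benefit_treat_all(outcome, threshold)
print(f"O:E {oe:.3f}, net benefit (model) {nb_model:.3f}, net benefit (treat all) {nb_all:.3f}")
```

An observed to expected ratio above 1 means the model underpredicts risk on average, and below 1 means it overpredicts. A model offers net benefit at a threshold only if its value exceeds both the treat all strategy and zero (treat none).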

Some journals refuse to publish model development studies without an external validation51; this stance is outdated and misinformed, and could encourage researchers to perform a meaningless and misleading external validation (eg, non-representative convenience sample, too small, even data splitting under the misnomer of external validation). Indeed, if the model development dataset is large and representative of the target population (including outcome and predictor measurement), and internal validation was done appropriately, then an immediate external validation might not even be needed.14 However, in many situations, the data used to develop a prediction model might not reflect the target population for whom the model is intended, and variation or lack of standardisation in measurements (including measurement error), poor statistical methods, inadequate sample size, handling of missing data (including missing important predictors), and changes in health care could all affect the model performance when applied to a target representative population.52 Supplementary figure 2 and supplementary table 2 demonstrate the impact of sample size in model development on performance at external validation. Thus, most prediction models need evaluation in new data to demonstrate where they should and should not be considered for deployment or further evaluation of clinical impact (eg, in a randomised clinical trial53).

External validations are needed because variations in healthcare provision, patient demographics, and local idiosyncrasies (eg, in outcome definitions) will naturally dictate the performance of a particular prediction model. Frameworks have been proposed to aid the interpretation of findings at external validation by examining the relatedness (eg, how similar in terms of case mix) of the external validation data to the development data, to explore (on a continuum) whether the validation assesses reproducibility (data are similar to the development data) or transportability (data are dissimilar to the development data).54 55 The data used in an external validation study could be from the same population as used for model development, but at a different (more contemporary) time period, obtained subsequent to the model development.56 Indeed, continual or periodic assessment in the same population is important to identify and deal with any model deterioration (eg, calibration drift57), which is expected owing to population or healthcare changes over time. However, researchers should also consider external validation in entirely different populations (eg, different centres or countries) or settings (eg, primary/secondary care or adults/children) where the model is sought to be deployed. External validation might even involve different definitions of predictors or outcome (eg, different prediction horizon) than used in the original development population.

External validation is sometimes included in studies developing a prediction model. However, as noted earlier, at the moment of model development, we generally recommend that all available data should be used to build the model, accompanied by a meaningful internal or internal-external cross validation. Using all the available data to develop a model implies that external validation studies should then (in most instances) be done subsequently and outside the model development study, each with a specific target population in mind (ie, each intended target population or setting for a given prediction model should have a corresponding validation exercise14). The more external validation studies showing good (or acceptable) performance, the more likely the model will also be useful in other untested settings—although clearly there is no guarantee.

Guidance on the design and analysis for external validation studies is provided in parts 2 and 3 of this series.10 11 Despite the importance of carrying out an external validation, such studies are relatively sparse,58 and publication bias is most certainly a concern, with (generally) only favourable external validation studies published. Despite the rhetoric chanting for replication and validation, journals seem to have little appetite for publishing external validation studies (presumably and cynically with citations having a role), with preference for model development studies. It is not inconceivable that researchers (who developed the model) will be less likely to publish external validation studies showing poor or weak performance. Incentives for independent researchers to carry out an external validation are also a contributing factor: what are the benefits for them, with seemingly low appetite by journals to publish them, particularly when the findings are not exciting? Failure of authors to report the prediction model or make it available, whether through poor reporting or for proprietary reasons,59 will also be a clear barrier to independent evaluation, potentially leading to only favourable findings (by the model developers) being reported.

Evaluation in subgroups: going beyond population performance to help examine fairness

Evaluating model performance typically focuses on measures of performance at the dataset level (eg, a single c statistic, or a single calibration plot or measure) as a proxy for the intended target population. While this performance is essential to quantify and report, concerted efforts should be made to explore potential heterogeneity and delve deeper into (generalisability of) model performance. Researchers should not only highlight where their model exhibits good performance, but also carry out and report findings from a deeper interrogation and identify instances, settings, and groups of people where the model has poorer predictive accuracy, because using such a model could have a downstream impact on decision making and patient care, and potentially harm patients. For example, in addition to exploring heterogeneity in performance across different centres or clusters (see above), researchers should be encouraged (indeed expected) to evaluate model performance in other key subgroups (such as sex/gender, race/ethnic group), as part of checking algorithmic fairness,60 especially when sample sizes are large enough, and when data have been collected in an appropriate way that represents the diverse range of people the model is intended to be used in.61 For example, in their external validation and comparison of QRISK2 and the Framingham risk score, Collins and Altman demonstrated miscalibration of the Framingham risk score, with systematic overprediction in men across all ages, and a small miscalibration of QRISK2 in those of older age.50
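A subgroup evaluation of this kind can be sketched by computing the same performance measure within each subgroup and comparing it with the dataset-level value. The example below is a minimal illustration under stated assumptions: the records and the subgroup labels `A` and `B` are hypothetical, and only the observed to expected ratio is shown. The values were chosen so that calibration looks perfect overall while the model overpredicts in one subgroup and underpredicts in the other.

```python
from collections import defaultdict


def observed_expected_by_subgroup(records):
    """Return {subgroup: observed events / sum of predicted risks}.

    Each record is a (subgroup, predicted_risk, outcome) tuple.
    """
    observed = defaultdict(float)
    expected = defaultdict(float)
    for subgroup, risk, outcome in records:
        observed[subgroup] += outcome
        expected[subgroup] += risk
    return {group: observed[group] / expected[group] for group in observed}


# Hypothetical validation records: (subgroup, predicted risk, observed outcome)
records = [
    ("A", 0.4, 0), ("A", 0.6, 1), ("A", 0.5, 0), ("A", 0.5, 0),
    ("B", 0.2, 1), ("B", 0.3, 0), ("B", 0.1, 0), ("B", 0.4, 1),
]

overall = sum(outcome for _, _, outcome in records) / sum(risk for _, risk, _ in records)
by_subgroup = observed_expected_by_subgroup(records)

print(f"overall O:E {overall:.2f}")
for group, ratio in sorted(by_subgroup.items()):
    print(f"subgroup {group}: O:E {ratio:.2f}")
```

In real data, subgroup estimates should be reported with confidence intervals, because small subgroups give imprecise estimates, which is why the text stresses adequate sample sizes.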

Introducing a new technology in clinical care, such as a prediction model, which is expected only to increase with the surge in interest and investment in artificial intelligence and machine learning, should ideally reduce but certainly not create or exacerbate any disparities in either provision of healthcare or indeed subsequent healthcare outcomes.62 63 64 Consideration of key subgroups is therefore important during the design (and data collection), analysis, reporting, and interpretation of findings.


Evaluating the performance of a prediction model is critically important and therefore validation studies are essential. Here, we have described how to make the most of the available data to develop and, crucially, evaluate a prediction model from development to external validation. Splitting data at the moment of model development should generally be avoided because it discards data, leading to a less reliable model. Rather, concerted efforts should be made to exploit all available data to build the best possible model, with better use of resampling methods for internal validation, and internal-external validation to evaluate model performance and generalisability across clusters. External validation studies should be considered in subsequent research, preferably by independent investigators, to evaluate model performance in datasets that are representative of the intended target populations for the model’s implementation. The next paper in this series, part 2, explains how to conduct such studies.10

Data availability statement

The CRASH-2 and CRASH-3 data used in this paper are freely available at The R code used to produce the figures and supplementary tables is available from


  • Contributors: GSC and RDR conceived the paper and produced the first draft. All authors provided comments and suggested changes, which were then resolved by GSC and RDR. GSC is the guarantor. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

  • Funding: This work was supported by Cancer Research UK (C49297/A27294, which supports GSC, JM, and MMS; and PRCPJT-Nov21\100021, which supports PD), the Medical Research Council Better Methods Better Research (grant MR/V038168/1, which supports GSC, LA, and RDR), the EPSRC (Engineering and Physical Sciences Research Council) grant for “Artificial intelligence innovation to accelerate health research” (EP/Y018516/1, which supports GSC, LA, PD, and RDR), the National Institute for Health and Care Research Birmingham Biomedical Research Centre at the University Hospitals Birmingham NHS Foundation Trust and the University of Birmingham (which supports RDR), the Research Foundation-Flanders (G097322N, which supports BVC), Internal Funds KU Leuven (C24M/20/064, which supports BVC), National Center for Advancing Translational Sciences (Clinical Translational Science Award 5UL1TR002243-03, which supports FEH), National Institutes of Health (NHLBI 1OT2HL156812-01, which supports FEH), and the ACTIV Integration of Host-targeting Therapies for COVID-19 Administrative Coordinating Center from the National Heart, Lung, and Blood Institute (which supports FEH). The funders had no role in considering the study design or in the collection, analysis, interpretation of data, writing of the report, or decision to submit the article for publication.

  • Competing interests: All authors have completed the ICMJE uniform disclosure form at declare: support from Cancer Research UK and the Medical Research Council for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work. GSC and RDR are statistical editors for The BMJ.

  • Patient and public involvement: Patients or the public were not involved in the design, or conduct, or reporting, or dissemination of our research.

  • Provenance and peer review: Not commissioned, externally peer reviewed.

This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited. See: