Prediction models for risk of developing type 2 diabetes: systematic literature search and independent external validation study
BMJ 2012; 345 doi: https://doi.org/10.1136/bmj.e5900 (Published 18 September 2012) Cite this as: BMJ 2012;345:e5900
- Ali Abbasi, PhD fellow 1 2 3,
- Linda M Peelen, assistant professor 3,
- Eva Corpeleijn, assistant professor 1,
- Yvonne T van der Schouw, professor of epidemiology of chronic diseases 3,
- Ronald P Stolk, professor of clinical epidemiology 1,
- Annemieke M W Spijkerman, research associate 4,
- Daphne L van der A, research associate 5,
- Karel G M Moons, professor of clinical epidemiology 3,
- Gerjan Navis, professor of nephrology, internist-nephrologist 2,
- Stephan J L Bakker, associate professor, internist-nephrologist/diabetologist 2,
- Joline W J Beulens, assistant professor 3
- 1 Department of Epidemiology, University of Groningen, University Medical Centre Groningen, Groningen, Netherlands
- 2 Department of Internal Medicine, University of Groningen, University Medical Centre Groningen, Groningen
- 3 Julius Centre for Health Sciences and Primary Care, University Medical Centre Utrecht, Utrecht, Netherlands
- 4 Centre for Prevention and Health Services Research, National Institute for Public Health and the Environment (RIVM), Bilthoven, Netherlands
- 5 Centre for Nutrition and Health, National Institute for Public Health and the Environment (RIVM), Bilthoven
- Correspondence to: A Abbasi, Department of Epidemiology, University Medical Centre Groningen, Hanzeplein 1, PO Box 30.001, 9700 RB Groningen, Netherlands
- Accepted 27 August 2012
Objective To identify existing prediction models for the risk of development of type 2 diabetes and to externally validate them in a large independent cohort.
Data sources Systematic search of English, German, and Dutch literature in PubMed until February 2011 to identify prediction models for diabetes.
Design Performance of the models was assessed in terms of discrimination (C statistic) and calibration (calibration plots and Hosmer-Lemeshow test). The validation study was a prospective cohort study, with a case cohort study in a random subcohort.
Setting Models were applied to the Dutch cohort of the European Prospective Investigation into Cancer and Nutrition cohort study (EPIC-NL).
Participants 38 379 people aged 20-70 with no diabetes at baseline, 2506 of whom made up the random subcohort.
Outcome measure Incident type 2 diabetes.
Results The review identified 16 studies containing 25 prediction models. We considered 12 models as basic because they were based on variables that can be assessed non-invasively and 13 models as extended because they additionally included conventional biomarkers such as glucose concentration. During a median follow-up of 10.2 years there were 924 cases in the full EPIC-NL cohort and 79 in the random subcohort. The C statistic for the basic models ranged from 0.74 (95% confidence interval 0.73 to 0.75) to 0.84 (0.82 to 0.85) for risk at 7.5 years. For prediction models including biomarkers the C statistic ranged from 0.81 (0.80 to 0.83) to 0.93 (0.92 to 0.94). Most prediction models overestimated the observed risk of diabetes, particularly at higher observed risks. After adjustment for differences in incidence of diabetes, calibration improved considerably.
Conclusions Most basic prediction models can identify people at high risk of developing diabetes in a time frame of five to 10 years. Models including biomarkers classified cases slightly better than basic ones. Most models overestimated the actual risk of diabetes. Existing prediction models therefore perform well to identify those at high risk, but cannot sufficiently quantify actual risk of future diabetes.
Type 2 diabetes is a large burden in healthcare worldwide.1 Studies on lifestyle modifications and drug intervention have convincingly shown that these measures can prevent diabetes.2 3 Early identification of populations at high risk of diabetes is therefore important for targeted prevention strategies: it allows preventive efforts to be directed at the large number of individuals at high risk while avoiding the burden of prevention and treatment, both for the individual and for society, in the even larger number of individuals at low risk. The professional practice committee of the American Diabetes Association recommends screening for all overweight or obese adults (body mass index (BMI) ≥25) of any age who have one or more additional risk factors for diabetes such as family history or hypertension.4 The European evidence based guidelines for the prevention of type 2 diabetes5 and the International Diabetes Federation6 recommend the use of a reliable, simple, and practical risk scoring system or questionnaire to identify people at high risk of future diabetes.
During the past two decades, many such prediction models have been developed.7 8 9 10 11 Three recent reviews on this topic described existing prediction models and the predictive value of specific risk factors (such as metabolic syndrome) over a wide range of populations.7 8 9 Surprisingly, however, the performance of fewer than a quarter of the prediction models was externally validated.9 10 11 Because the performance of a prediction model is generally overestimated in the population in which it was developed, external validation of such models in an independent population, ideally by researchers not involved in the development of the models, is essential to broadly evaluate the performance and thus the potential utility of such models in different populations and settings.12 13 14 15 Consequently, specific prediction models for identifying those at high risk of diabetes cannot be recommended while the external validity of the available models is unknown.12 16 Moreover, a direct comparison of the performance of the existing models in the same (external) validation cohort is essential to bridge the gap between the development of models and the conduct of studies for clinical utility.
The recent systematic reviews highlighted the need for an independent study to identify the existing prediction models and subsequently validate and compare their performance to support the current recommendations.7 8 9 Few studies have externally validated such models, commonly not more than two or three at once, and almost always in medium sized cohorts.10 11 14 17 We applied a more comprehensive approach as recently suggested.14 15 Firstly, we carried out a systematic review to identify the most relevant existing models for predicting the future risk of type 2 diabetes. Then we used various analytical measures for validating18 and comparing their predictive performance in a large independent general population based cohort—the Dutch cohort of the European Prospective Investigation into Cancer and Nutrition (EPIC-NL).19
Systematic literature search
We performed a systematic literature search according to the PRISMA guidelines,20 when applicable. We searched PubMed for all published cohort studies that reported prediction models for the risk of type 2 diabetes until February 2011 using the following search string: ((“diabetes” OR “diabetes mellitus” OR “type 2 diabetes”) AND (“risk score” OR “prediction model” OR “predictive model” OR “predicting” OR “prediction rule” OR “risk assessment” OR “algorithm”)) NOT review [pt] AND English [LA]. We repeated this search for publications in German and Dutch. Finally, we checked systematic reviews and validation studies of prediction models to identify other relevant articles for our validation study. Because we did not perform a formal meta-analysis, the PRISMA items related to “protocol and registration” and “synthesis of results” for meta-analyses are not applicable to our study.
Studies were included if they met the following criteria: the study presented at least one formal prediction model or an update on a previously developed model; the endpoint was incident type 2 diabetes in a longitudinal design; and the population had to be at least partly white because the EPIC-NL cohort to be used for validation consists predominantly of white adults. We excluded studies using data on individuals with impaired glucose tolerance or impaired fasting glucose. Furthermore, we excluded models that used the two hour oral glucose tolerance test as a predictor variable because this was not available in our validation dataset and there was no reliable proxy variable available that could be taken as a substitute.
After review of the retrieved titles, two authors (AA and JWJB) independently reviewed the abstracts to select the relevant papers for full text review and subsequently reviewed and assessed the full papers. Discrepancies between the two reviewers were solved by having a third author (EC) review to reach consensus. For included studies, we made a primary plan to extract necessary data from the original studies to validate the models or contact the authors to obtain this information.
Table 1 summarises characteristics of the included studies. The extracted data included the first author’s name, year of publication, country, name of study/score, number of cases and population, ascertainment of diabetes, duration of follow-up, statistical model, number of predictors, and reported performance of the model. The retrieved models were divided into models that contained only non-invasive predictors (“basic models”) and models that also included conventional biomarkers, such as glucose, HbA1c, lipids, uric acid, or γ-glutamyltransferase (“extended models”).
The EPIC-NL cohort (n=40 011) includes the Monitoring Project on Risk Factors for Chronic Diseases (MORGEN-EPIC) and Prospect-EPIC cohorts, initiated between 1993 and 1997. The Prospect-EPIC cohort comprises 17 357 women aged 49-70 who participated in a breast cancer screening programme. The MORGEN cohort comprises 22 654 men and women aged 20-64 who were recruited through random population sampling in three Dutch towns (Amsterdam, Maastricht, and Doetinchem). At baseline, all participants were sent a general questionnaire and a food frequency questionnaire; these were returned when they visited the study centre for a medical examination. Reporting of the study results conforms to the STROBE statement.21
We excluded 615 individuals with prevalent type 2 diabetes and 1017 with missing follow-up or who did not consent to linkage with disease registries. The 38 379 remaining participants were used to validate the basic models in a full cohort design. We applied similar exclusion criteria in a 6.5% baseline random sample (n=2604) in which measurements of conventional biomarkers were available,19 leaving 2506 individuals. We used this random sample and all incident cases of type 2 diabetes to validate the extended models in a case cohort design.22 Table A in appendix 1 provides baseline characteristics for the entire cohort, the random sample, and the people with incident type 2 diabetes.
Assessment of predictor variables
Variables in the prediction models included in this study were assessed with a baseline general questionnaire for disease history and lifestyle variables. A validated food frequency questionnaire filled in at baseline was used to assess nutritional variables.23 During the baseline visit, body weight, height, waist, and hip circumference, and blood pressure were measured and blood samples were drawn. Details of these procedures have been described elsewhere19 and are shown in appendix 2.
Assessment of type 2 diabetes
Occurrence of diabetes during follow-up was self reported via two follow-up questionnaires at three to five year intervals in the MORGEN and Prospect cohorts. In the Prospect cohort, incident cases of diabetes were also detected as glucosuria via a urinary glucose strip test, which was sent out with the first follow-up questionnaire. Diagnoses of diabetes were also obtained from the Dutch Center for Health Care Information, which holds a standardised computerised register of diagnoses at hospital discharge. Follow-up was complete up to 1 January 2006. Potential cases identified by these methods were verified against general practitioner (MORGEN and Prospect) or pharmacist records (Prospect only). Diabetes was defined as present when the diagnosis was confirmed by either of these methods. For 89% (n=1142) of participants with potential diabetes, verification information was available, and 72% (n=924) were verified as having type 2 diabetes and were included as cases of type 2 diabetes in this analysis.24
To evaluate the predictive performance of the retrieved prediction models, we used the original prediction models (regression coefficients with intercept or baseline hazard) as published. If the paper did not contain sufficient information, we asked the authors to provide us with the original model.25 26 Particularly, we obtained regression coefficients26 and the intercept of the model25 by asking for complementary information. Using these original (regression) model formulas, we calculated the probability of developing type 2 diabetes per model for each individual in our study sample. Two authors (AA and JWJB) first matched the predictors of the original models with the variables available in our data. A direct match was available in our data for most variables. If a direct match was not possible, we replaced the original predictor with a proxy variable to avoid having to drop the model from our validation study. For example, we used non-fasting glucose values because fasting glucose values were not available in our data. Also, nutritional variables were collected with our food frequency questionnaire as continuous variables (g/day) and were re-coded into corresponding categories used in the prediction models by using Dutch portion sizes. Table B in appendix 1 provides an overview of the variables used in each of the prediction models, and appendix 2 gives the exact details on the proxy variables that were used.
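As an illustration of how a published model formula is turned into an individual probability, the following minimal sketch applies the standard logistic and Cox expressions. The function names, the coefficient values, and the `mean_lp` centring argument are assumptions made for this example; they are not taken from any of the validated models.

```python
import math

def logistic_risk(intercept, coefs, x):
    """Predicted probability from a logistic model: 1 / (1 + exp(-lp))."""
    lp = intercept + sum(coefs[name] * x[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-lp))

def cox_risk(baseline_survival_t, coefs, x, mean_lp=0.0):
    """Predicted probability by time t from a Cox model: 1 - S0(t) ** exp(lp - mean_lp).

    mean_lp is only needed when the published baseline survival refers to a
    centred linear predictor (an assumption that differs between papers).
    """
    lp = sum(coefs[name] * x[name] for name in coefs)
    return 1.0 - baseline_survival_t ** math.exp(lp - mean_lp)

# Hypothetical coefficients and predictor values, for illustration only.
coefs = {"age_per_10y": 0.5, "bmi_ge_30": 0.9}
person = {"age_per_10y": 5.5, "bmi_ge_30": 1}
p_logit = logistic_risk(-6.0, coefs, person)
p_cox = cox_risk(0.98, coefs, person, mean_lp=2.5)
```

Proxy variables enter such a calculation simply as the value stored under the original predictor's name.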
We assessed performance of the models using measures of discrimination and calibration.13 Discrimination describes the ability of the model to distinguish those at high risk of developing diabetes from those at low risk. The discrimination was examined by calculating Harrell’s C (comparable with the area under the ROC curve), accounting for censored data.27 Calibration indicates the ability of the model to correctly estimate the absolute risks and was examined by calibration plots. In a calibration plot, the predicted risk is plotted against the observed incidence of the outcome. Ideally the predicted risk equals the observed incidence throughout the entire risk spectrum and the calibration plot follows the 45° line. The calibration plot was extended to a “validation” plot as a summary tool.18 27 Appendix 2 gives more details on information provided by this plot. Calibration was also tested with the Hosmer-Lemeshow goodness of fit statistic for time to event data.18 27 Follow-up of our cohort was almost complete until about eight years: 3% were censored at 5 years, 5% at 7.5 years, and 44.6% at 10 years. To account for censoring when obtaining the observed probabilities for assessing calibration over, say, 10 years of follow-up, we first calculated for each individual the linear predictor and subsequently 10 year predicted outcome probability by the original survival models. This predicted probability was then divided into tenths, and we performed a Kaplan-Meier analysis per tenth, which accounts for the observed censoring. Per tenth, we obtained at the 10 year time point the observed outcome percentages, which in turn were compared with the 10 mean predicted outcome probabilities to obtain the calibration plot and measure of goodness of fit. 
This was done for each model and for the other time points (5 and 7.5 years).28 Moreover, we reported the calibration slope for the logistic regression models18 and calculated observed over predicted (expected) outcomes (O/E ratio) with 95% confidence intervals.18 29 A ratio below 1.0 indicates overestimation of risk, and a ratio above 1.0 indicates underestimation of risk.
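The measures described above can be sketched in a few functions. This is a simplified illustration, not the code used in the study: the function names are hypothetical, Harrell's C is computed by brute force over pairs with tied event times ignored, and confidence intervals are omitted.

```python
def harrell_c(times, events, risks):
    """Harrell's C: share of usable pairs in which the subject with the
    earlier observed event has the higher predicted risk (ties count 0.5)."""
    concordant = usable = 0.0
    for i in range(len(times)):
        if not events[i]:
            continue  # a pair is usable only if the earlier time is an event
        for j in range(len(times)):
            if times[i] < times[j]:
                usable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / usable

def km_event_prob(times, events, horizon):
    """Kaplan-Meier estimate of the probability of an event by the horizon."""
    surv = 1.0
    for t in sorted({u for u, e in zip(times, events) if e and u <= horizon}):
        at_risk = sum(1 for u in times if u >= t)
        d = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1.0 - d / at_risk
    return 1.0 - surv

def calibration_by_tenths(times, events, predicted, horizon, groups=10):
    """Mean predicted risk and Kaplan-Meier observed risk per risk group.
    Assumes at least `groups` individuals."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    rows = []
    for g in range(groups):
        idx = order[g * len(order) // groups:(g + 1) * len(order) // groups]
        mean_pred = sum(predicted[i] for i in idx) / len(idx)
        observed = km_event_prob([times[i] for i in idx],
                                 [events[i] for i in idx], horizon)
        rows.append((mean_pred, observed))
    return rows

def observed_expected_ratio(times, events, predicted, horizon):
    """O/E ratio: below 1.0 means the model overestimates risk."""
    return km_event_prob(times, events, horizon) / (sum(predicted) / len(predicted))
```

Plotting the pairs returned by `calibration_by_tenths` against the 45° line gives the calibration plot described in the text.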
Differences in the incidence of diabetes in our cohort and in the development populations led to significant deviation between observed risk in our cohort and predicted risk estimated by the prediction model. To reduce this source of miscalibration, we “recalibrated” each prediction model by adjusting the intercept (for logistic regression models) or the baseline survival function (for survival regression models).28 30 31
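For a logistic model, adjusting the intercept amounts to finding a constant shift of the linear predictor that makes the mean predicted risk equal the incidence observed in the validation cohort. The sketch below solves for that shift by bisection; the function names and the example values are assumptions for illustration. For survival models the analogous step replaces the baseline survival function and is not shown here.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def intercept_shift(linear_predictors, observed_incidence,
                    lo=-20.0, hi=20.0, tol=1e-10):
    """Find delta such that mean(expit(lp + delta)) equals the observed incidence."""
    def mean_risk(delta):
        return sum(expit(lp + delta) for lp in linear_predictors) / len(linear_predictors)
    # mean_risk is increasing in delta, so bisection converges.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean_risk(mid) > observed_incidence:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

# Hypothetical linear predictors from an original model (intercept included).
lps = [-3.0, -2.0, -1.0, 0.0]
delta = intercept_shift(lps, observed_incidence=0.018)
recalibrated = [expit(lp + delta) for lp in lps]
```

A negative `delta` means the original model overestimated the average risk in the validation data; the ranking of individuals, and therefore the C statistic, is unchanged by this shift.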
The original models were developed for different time periods of risk prediction (different “prediction horizons”)—for instance, some models estimate 5 year risk and others 10 year risk. We therefore assessed the performance of each model for prediction of risk at 5, 7.5, and 10 years to account for the different time periods. For example, for 5 year risk, we considered individuals as incident cases if they had developed diabetes within the first five years of follow-up. Participants who developed diabetes after more than five years of follow-up were included in five year prediction as non-cases. A similar approach was followed for 7.5 and 10 year predictions. In addition, we performed a sensitivity analysis using the prediction horizon for which each model was developed in case this differed from 5, 7.5, or 10 years.
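The case definition per prediction horizon can be written as a one-line rule. The sketch below is a minimal illustration with a hypothetical function name: an individual counts as a case only if diabetes developed within the horizon.

```python
def status_at_horizon(event_time_years, had_event, horizon_years):
    """Return 1 if diabetes developed within the horizon, otherwise 0."""
    return 1 if had_event and event_time_years <= horizon_years else 0

# A participant diagnosed after 6.2 years is a non-case for 5 year prediction
# but a case for 7.5 and 10 year prediction.
statuses = [status_at_horizon(6.2, True, h) for h in (5, 7.5, 10)]
```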
For the basic prediction models, which included only data from non-invasive clinical variables, we quantified their performance in the full dataset. The extended models were validated in the case cohort data. To account for this design, we applied an extrapolation approach that extends the case cohort data to the size of the full cohort.22 This is achieved by extrapolating the non-cases of the random sample (that is, the total random sample of 2506 individuals minus 79 cases) to the number of non-cases in the full cohort (that is, the total sample of 38 379 individuals minus 924 cases). To do so, we substituted the non-cases in the full cohort (n=37 455) with a random multiplication of non-cases of the random sample (n=2427). On average, we multiplied the non-cases in the random sample by 15.4 (that is, 37 455 divided by 2427). Next, we merged the extrapolated data from non-cases with those from all the cases (total non-cases of 37 455 individuals plus 924 cases), recreating the size and composition of the full cohort. In sensitivity analyses, we estimated the performance of the basic models in the re-sampled data from the case cohort and compared these results with those obtained from the full cohort. This allowed us to confidently use the extrapolation approach for the extended models in the case cohort design.
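The extrapolation step can be sketched as follows. The text describes a "random multiplication" of the subcohort non-cases; this example assumes that means sampling with replacement, which may differ from the authors' exact procedure. The function name and the placeholder records are hypothetical.

```python
import random

def extrapolate_case_cohort(subcohort_noncases, all_cases, n_noncases_full, seed=0):
    """Resample subcohort non-cases up to the full cohort non-case count,
    then append every incident case."""
    rng = random.Random(seed)
    resampled = [rng.choice(subcohort_noncases) for _ in range(n_noncases_full)]
    return resampled + list(all_cases)

# Counts from the text: 2506 - 79 = 2427 subcohort non-cases,
# 38 379 - 924 = 37 455 full cohort non-cases.
noncases = [{"id": i, "case": 0} for i in range(2506 - 79)]
cases = [{"id": 100000 + i, "case": 1} for i in range(924)]
pseudo_cohort = extrapolate_case_cohort(noncases, cases, 38379 - 924)
average_multiplier = (38379 - 924) / len(noncases)
```

Each non-case in the random sample appears on average about 15.4 times in the reconstructed cohort, which matches the multiplication factor reported above.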
For most predictors, fewer than 1% of values were missing, although missing values occurred in 5% for family history of diabetes, about 15% for physical activity, and 20.5% for non-fasting glucose concentrations. Because an analysis of only the completely observed participants could lead to biased results,32 33 34 we imputed these missing values using single imputation and predictive mean matching. As the percentage of missing values for the non-fasting glucose concentration was high, we repeated our analyses using only data from the MORGEN cohort, in which less than 10% of values for non-fasting glucose concentration were missing, as a sensitivity analysis. Table C in appendix 1 shows the number of missing values for all variables incorporated in the original model.
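Predictive mean matching fills each missing value with an observed value borrowed from a donor whose predicted value is close. The sketch below is a deliberately simplified single-predictor version with a hypothetical function name and made-up data; standard implementations use several predictors and also draw the regression parameters at random, which is omitted here.

```python
import random

def pmm_impute(x, y, k=3, seed=0):
    """Single imputation of missing y (None) by predictive mean matching on x."""
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(y) if v is not None]
    # Ordinary least squares of y on x among the complete cases.
    mx = sum(x[i] for i in obs) / len(obs)
    my = sum(y[i] for i in obs) / len(obs)
    sxx = sum((x[i] - mx) ** 2 for i in obs)
    slope = sum((x[i] - mx) * (y[i] - my) for i in obs) / sxx
    intercept = my - slope * mx
    pred = [intercept + slope * xi for xi in x]
    out = list(y)
    for i, v in enumerate(y):
        if v is None:
            # The k complete cases with the closest predicted value are donors.
            donors = sorted(obs, key=lambda j: abs(pred[j] - pred[i]))[:k]
            out[i] = y[rng.choice(donors)]
    return out

# Made-up example: glucose missing for two individuals, BMI fully observed.
bmi = [22.0, 24.5, 26.0, 28.0, 31.0, 33.5]
glucose = [4.6, 4.9, None, 5.4, 5.9, None]
completed = pmm_impute(bmi, glucose)
```

Because imputed values are always copied from real observations, they stay within the observed range of the variable.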
We carried out a third sensitivity analysis to account for the use of non-fasting glucose values, as we had to approximate the fasting glucose values included in the models by the non-fasting glucose values in our data. In this analysis, we excluded individuals with a non-fasting glucose of ≥11.1 mmol/L (n=130), as this cut point is considered a high blood glucose concentration at which diabetes is suspected, especially if it is accompanied by the classic symptoms of hyperglycaemia.4 In another sensitivity analysis, we excluded 19 295 individuals (including 537 incident cases of diabetes) with a fasting period of less than two hours. In a fifth sensitivity analysis we excluded 255 individuals for whom we had no verification information of diabetes status.
All statistical analyses were conducted with SPSS version 18 (SPSS, Chicago, IL) and R version 2.11.0 (Vienna, Austria) for Windows (http://cran.r-project.org/).
Systematic literature search
We scanned 7756 titles and selected 134 abstracts for review. Figure 1 depicts the flow of the study selection process. We selected 46 articles for full text review and added six that were identified from other sources such as recent systematic reviews.7 8 9 After full review of these 52 articles, we excluded 36 as they did not meet all inclusion criteria (appendix 3). The main reasons for exclusion were no prediction of the future risk of diabetes (n=15); validation study (n=10); no formal prediction models provided (n=6); and incomparable derivation populations (n=2) or unavailable data of predictors (n=3). Of three studies that used data from two hour oral glucose tolerance tests, we excluded two because they were cross sectional and one because it did not provide any prediction model.
Table 1 summarises the characteristics of the 16 studies included in this validation study.25 26 35 36 37 38 39 40 41 42 43 44 45 46 47 48 Eleven studies described 34 basic models based on data that can be assessed non-invasively, including demographics, family history of diabetes, measures of obesity, diet, and lifestyle factors, blood pressure, and use of antihypertensive drugs. Of these 34 basic models, 12 models were presented as the final model.8
Nine studies described 42 extended models including data on one to three conventional biomarkers such as glucose, HbA1c, lipids, uric acid, or γ-glutamyltransferase. Of these 42 extended models, 13 models were presented as the final model. The C statistics in the development datasets ranged from 0.71 for the Atherosclerosis Risk in Communities (ARIC) model to 0.86 for the FINDRISC full model. Only half of the studies reported measures of calibration, and almost all showed good calibration in the development datasets. Table B in appendix 1 shows the variables that are part of the prediction models.
Validation of prediction models
Table A in appendix 1 summarises the baseline characteristics of participants in the EPIC-NL study (for the full cohort, random sample, and incident cases of type 2 diabetes). During a median follow-up of 10.2 years (over 387 000 person years), we observed 924 incident cases (rate of 2.2 per 1000 person years). The observed 5, 7.5, and 10 year risks of incident diabetes were 1.3%, 1.8%, and 2.3%, respectively.
Tables 2 and 3 show the performance of the basic models and the extended models, respectively. The basic models performed well in terms of discrimination, with C statistics ranging from 0.74 (95% confidence interval 0.73 to 0.75) to 0.84 (0.82 to 0.85) for the prediction of risk of diabetes at 7.5 years. Similar but slightly higher C statistics were found for the 5 year risk prediction and slightly lower for the 10 year risk prediction of incident diabetes.
For the extended models, the discrimination was higher, with C statistics ranging from 0.81 (0.80 to 0.83) to 0.93 (0.92 to 0.94) for the risk at 7.5 years. Similar, but again slightly higher, C statistics were found for the 5 year risk prediction and slightly lower for the 10 year risk prediction of incident diabetes.
Both basic and extended models showed a poor calibration based on the Hosmer-Lemeshow test (P<0.001). Except for the EPIC-Norfolk and PROCAM models, all models overestimated the 7.5 year risk of diabetes relative to the observed risk, by 38.9% to more than 100%. Similarly, all observed to expected ratios were different from 1.0 (tables 2 and 3). The EPIC-Norfolk model underestimated the 7.5 year risk of incident diabetes by 73.9%. Figure A in appendix 4 shows the calibration plots for the original models.
After adjustment for differences in the incidence of diabetes between our cohort and the development populations, all prediction models showed better calibration (figs 2 and 3). For some of the models (such as the ARIC basic model) the calibration plot stayed close to the ideal line throughout the risk spectrum, whereas others showed severe overestimation, especially at higher predicted risks (such as Framingham continuous, DESIR, and BRHS models). Compared with the original models, the models adjusted for differences in the incidence of diabetes between the development and validation cohort performed better, with lower Hosmer-Lemeshow statistics, but deviation of calibration from ideal was still significant for all models, except for the KORA basic model (Hosmer-Lemeshow test P=0.17). For the KORA basic model, AUSDRISK, and EPIC-Norfolk model, calibration slopes were close to 1.0, but those were smaller or larger than 1.0 for other logistic regression models (tables 2 and 3). Figure B in appendix 5 shows the calibration plots including calibration statistics for each recalibrated model separately.
To further investigate the different effect size for each predictor, we compared hazard ratios for predictors between the validation cohort and one development cohort49 as an example. We used data from the EPIC-Potsdam study40 because the model was developed in the German cohort of EPIC using Cox proportional-hazards regression. Table C in appendix 1 presents the hazard ratios of the diabetes predictors incorporated in this risk score compared with those obtained in our validation cohort. The hazard ratios for age, intake of red meat, physical activity, and current heavy smoking differed significantly (P<0.05) between both cohorts.
Tables 4 and 5 show the results of sensitivity analyses. Our results using the extrapolation approach for the case cohort design were similar when we looked at C statistics and Hosmer-Lemeshow statistics of 13 basic models obtained from the extrapolation approach compared with those from the full cohort design (for example, C statistics ranging from 0.74 (95% confidence interval 0.72 to 0.76) to 0.84 (0.82 to 0.86), and Hosmer-Lemeshow test P<0.001). Additionally, our results using data only from the MORGEN cohort with less than 10% missing values for non-fasting glucose were comparable with our results using both cohorts; C statistics ranged from 0.79 (0.76 to 0.81) to 0.92 (0.90 to 0.93) for 13 extended models. Exclusion of individuals with a non-fasting glucose of ≥11.1 mmol/L did not influence the results, both for the basic (C statistics ranged from 0.74 (0.72 to 0.75) to 0.83 (0.81 to 0.84)) and the extended models (C statistics ranged from 0.81 (0.80 to 0.83) to 0.93 (0.92 to 0.94)). Moreover, when we excluded the individuals with less than two hours’ fasting or those without verified diabetes status, the C statistics were similar to those of the full cohort analysis. Finally, use of the prediction horizon for which the original models were developed hardly affected the results.
An evaluation of the performance of 25 prediction models for type 2 diabetes in an independent Dutch cohort with over 10 years of follow-up showed that basic models perform similarly well in identifying individuals at high and low risk of developing diabetes. The performance was slightly better for extended models that included conventional biomarkers. With regard to the actual values of the predicted risks, all but two models overestimated the risk of developing diabetes, which improved slightly, but not sufficiently, after correction of the models for differences in incidence of diabetes between development and validation populations.
Strengths and limitations of study
All models were identified through a systematic literature search, and we included most existing prediction models in the validation study. Other strengths included the study’s large sample size, prospective design, verification of incident diabetes, and extensive information on individuals’ characteristics. Nevertheless, some limitations of our study need to be mentioned. Nearly all participants in the EPIC-NL cohort are white adults, and further studies are warranted to validate the models in other populations. In addition, the participation rate was about 40%.19 50 We previously showed that such a low response rate might affect prevalence estimates of baseline characteristics of participants but does not cause bias in the examined associations.50 We therefore consider that our cohort is appropriate for the purpose of our study. Although our data had certain limitations regarding availability of the variables, we made an effort to assign all variables and applied definitions as closely as possible. To handle missing variables, we performed single imputation and repeated the analysis in one of the two cohorts with lower missing values for glucose concentration, which gave similar results. It is therefore unlikely that these limitations influenced our results to a large extent. Next, we used data for non-fasting glucose concentration. We cannot rule out that this affected our results because glucose is an important predictor of diabetes. We therefore performed sensitivity analyses in which we excluded individuals with a non-fasting glucose of ≥11.1 mmol/L4 and those who fasted for less than two hours, which again yielded similar results. 
This is in line with previous studies showing that using non-fasting lipid concentrations does not influence prediction of, for example, cardiovascular events.51 52 Because we used data only from verified potential cases we could have missed false negative cases in the remainder of the cohort as type 2 diabetes can remain undiagnosed for several months to years. False negatives can lead to an underestimation of the C statistic as the linear predictor resulting from the predictor variables will be high, whereas their event status is that of a non-case. Given the large size of our cohort in combination with the low incidence of diabetes we do not expect this to largely change our findings. Similarly, false negative cases lead to underestimation of the observed risk in our cohort and this influences calibration. We adjusted for this effect, however, by correcting the intercept of the models to the incidence observed in our cohort. In addition, as the incidence is expected to be low,53 potential false negative cases cannot account for the large overestimations of risk in the models observed in our study. Moreover, certain development cohorts used similar methods for verification of diabetes.
External validation of prediction models
The retrieved prediction models differed considerably in terms of type and number of predictors, age ranges, type of model, duration of follow-up, and outcome measure. Three recent systematic reviews presented overviews of studies that developed these models or validated some selected models.7 8 9 These reviews, however, also indicated that most of these models were never validated in an external population. Our study has now evaluated performance of most developed prediction models for future diabetes in an external population and shows that most basic models perform well to identify those at high risk of diabetes and that extended models perform slightly better. Generally, the performance of a prediction model decreases when it is applied in a validation dataset. Despite this, our study showed that most of the basic models identified those at high absolute risk well, with C statistic over 0.80. This discrimination further improved for the extended models with C statistic of about 0.90. Surprisingly, the C statistics in our validation study were, in some cases, even higher than in their original development populations. This might be explained by differences in heterogeneity between the populations30: larger heterogeneity between individuals in a validation study can in some situations lead to a higher C statistic than in the development study. For example, variables like age, sex, and BMI might have larger heterogeneity in our study compared with the older population of the KORA study.35 Although it would be of interest to explore whether performance of diabetes risk scores differs by age or sex, larger studies are warranted for these subgroup analyses. 
Another aspect that could influence model performance is the type of regression analysis used to derive the prediction model.7 Most studies used logistic regression rather than survival models7 8 and therefore do not account for censoring.54 Similar to the results of the Framingham Offspring Study,42 however, our results showed that the survival models do not necessarily perform better than the logistic ones.
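For reference, the C statistic for a binary outcome can be computed as the proportion of case/non-case pairs in which the case received the higher predicted risk, with ties counted as one half. The sketch below assumes a binary outcome and ignores censoring, so it corresponds to the logistic rather than the survival setting. The function name and the example data are hypothetical.

```python
def c_statistic(risks, outcomes):
    """Concordance (C) statistic for predicted risks and 0/1 outcomes.

    Compares every case with every non-case. A pair is concordant when the
    case has the higher predicted risk; tied pairs count as one half.
    """
    case_risks = [r for r, y in zip(risks, outcomes) if y == 1]
    noncase_risks = [r for r, y in zip(risks, outcomes) if y == 0]
    if not case_risks or not noncase_risks:
        raise ValueError("need at least one case and one non-case")

    concordant = 0.0
    for r_case in case_risks:
        for r_non in noncase_risks:
            if r_case > r_non:
                concordant += 1.0
            elif r_case == r_non:
                concordant += 0.5
    return concordant / (len(case_risks) * len(noncase_risks))


# Hypothetical predicted risks and observed outcomes (1 = incident diabetes).
example_risks = [0.05, 0.10, 0.20, 0.40, 0.60, 0.80]
example_outcomes = [0, 0, 0, 1, 0, 1]
example_c = c_statistic(example_risks, example_outcomes)  # 7 of 8 pairs concordant
```

The pairwise loop is quadratic in cohort size and is meant only to make the definition explicit. For a large cohort a rank-based calculation would be the more practical choice.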
Quantification of actual risk of future diabetes
All except two prediction models overestimated the absolute risk of diabetes in our validation dataset, which can partly be explained by the difference in incidence of diabetes between development and validation populations. To account for this, we adjusted the models for difference in incidence, resulting in much better calibration. Significant deviations between the predicted and observed risks, however, remained for most models. There are various other explanations for the deviation in predicted versus observed risks. Firstly, in large cohorts the Hosmer-Lemeshow test is sensitive to small differences between the predicted and observed risks, so calibration can be indicated as significantly deviant by statistical tests even when the calibration plots indicate good calibration based on visual inspection and for practical purposes.55 So, in large cohorts significant deviations on the Hosmer-Lemeshow test should be interpreted cautiously. Secondly, “mis”calibration can be caused by differences in how certain predictors, the outcome variable, or baseline characteristics of the study populations are measured, which can lead to different predictive effects.13 30 For example, if the two hour oral glucose tolerance test is used to determine the presence of diabetes in a population, the incidence is likely to be higher, and among the cases there will be patients with a less severe form of the disease and different values for the potential predictors. This is also illustrated by comparing the effect sizes of the predictors of the German Diabetes Risk Score in our cohort, which showed significant differences for important predictors like age. It is important to note, however, that most prediction models showed overprediction, particularly at higher absolute risk. 
Some models might not have been well calibrated in the original populations.7 9 Furthermore, the overestimation of risk at the higher end could be caused by overestimation of certain predictors in development populations with high risk individuals. Although it is important to accurately estimate the risk for people at high risk, it might not directly influence the effects of screening and public health strategies: interventions are often initiated beyond a certain threshold of absolute risk and overprediction beyond this threshold might therefore not necessarily lead to different treatment decisions. Certain models in our study, however, also overestimated the absolute risk in the lower ranges around 10%. Although decision thresholds for type 2 diabetes have not been determined, this prediction could be in the range of a threshold for a clinical decision. To use such models in clinical practice, calibration needs to be further improved.
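The sensitivity of the Hosmer-Lemeshow test to cohort size can be made concrete with a small sketch. It computes the usual grouped statistic, the sum over risk groups of (observed − expected)² divided by n·p·(1 − p), and shows that the statistic grows in proportion to the number of participants when the relative difference between observed and predicted risk is held fixed. The grouped data are hypothetical and describe a model that overpredicts by 10% in every group. The choice of degrees of freedom for the reference distribution is deliberately left out.

```python
def hosmer_lemeshow_statistic(groups):
    """Hosmer-Lemeshow statistic from grouped data.

    groups: iterable of (n, observed_events, mean_predicted_risk) tuples,
    one per risk group (for example, tenths of predicted risk).
    """
    total = 0.0
    for n, observed, mean_pred in groups:
        expected = n * mean_pred
        variance = n * mean_pred * (1.0 - mean_pred)
        total += (observed - expected) ** 2 / variance
    return total


def scale_cohort(groups, factor):
    """Enlarge every group by the same factor, keeping observed risks fixed."""
    return [(n * factor, observed * factor, p) for n, observed, p in groups]


# Hypothetical cohort: observed events are 10% below expected in each group.
small_cohort = [
    (1000, 9, 0.01),
    (1000, 18, 0.02),
    (1000, 36, 0.04),
    (1000, 72, 0.08),
]
large_cohort = scale_cohort(small_cohort, 20)

hl_small = hosmer_lemeshow_statistic(small_cohort)
hl_large = hosmer_lemeshow_statistic(large_cohort)  # 20 times hl_small
```

The degree of miscalibration is identical in the two cohorts, yet the statistic for the larger cohort is 20 times as large. This is why a significant test result in a large cohort is best read alongside the calibration plot.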
Prior external validation of existing prediction models
Although the importance of external validation of prediction models is now widely acknowledged, only a quarter of existing prediction models have been externally validated, mostly in studies that included only a single model and did not report any measures of calibration.7 9 To date, four studies have been published that performed a comparative external validation of several different models.10 11 17 56 Two of these studies validated models for presence of diabetes rather than future risk of diabetes.11 56 One prospective validation of three extended models42 44 57 has been performed and showed C statistics ranging from 0.78 to 0.84 with underestimation or overestimation of the risk.10 Another prospective validation study showed C statistics ranging from 0.74 to 0.90, but did not report calibration or perform adjustments.17 These results are in line with the discrimination observed in our study. Altogether, the results from the previous reviews and our study suggest that most of the basic models performed similarly in terms of discrimination, whereas the Diabetes Population Risk Tool (DPoRT) showed slightly lower discrimination. The latter model was primarily developed to predict risk of diabetes at a population level, which could explain its slightly worse performance when it is applied on an individual level.37
Implications for use of prediction models in practice
Results from our study show that prediction models perform well to identify those at high risk of future diabetes, being a first prerequisite for use of such models in practice as currently recommended.5 6 As expected18 30 and observed in our results, however, the model should possibly be adapted to the local setting and purpose of the model and at least corrected for the incidence of diabetes of the population in which it is to be applied. The main relevance of prediction models is to correctly identify individuals at high risk, while avoiding the burden of treatment for individuals at low risk. This requires adequate discriminative power in the general population, as well as in populations characterised by a somewhat higher risk, such as those with excess weight. In public health practice, one would perhaps prefer to use a model including only a limited number of predictors based on non-invasive tests with the highest performance, which would favour use of a basic model. Noble et al8 suggested seven models as most promising for use in clinical or public health practice, of which three were extended models (ARIC enhanced, Framingham, and San Antonio)42 44 57 and four were basic models (AUSDRISK, QDScore, FINDRISC, and Cambridge Risk Score).36 38 43 58 59 According to the current validation, it seems that this judgment is likely to be correct in statistical terms. The basic DESIR model that we additionally evaluated consisted of four predictors,39 while most models—such as QDScore and AUSDRISK—consist of seven to 10 predictors. Interestingly, the models including only four to six predictors35 39 43 performed similarly to the more extensive models.36 38 We found that discrimination of two other basic models—KORA basic35 and DESIR clinical equation39—which were not included in the list of Noble and colleagues,8 approximated performance of the models incorporating more predictors. 
Moreover, the KORA basic model performed sufficiently well to quantify absolute risk after recalibration. This suggests that a basic model like the KORA, which uses a limited set of non-invasive predictors, already provides good discrimination and good calibration and could therefore be useful in practice after appropriate adaptation of the model to the setting. The extended models including biomarkers could then perhaps be used only for those at high risk based on a basic prediction model. Finally, a model developed in one setting (such as public health data) or in a particular country is not necessarily useful in another setting (such as secondary care) or country. As a next step, the utility of such models needs to be further investigated in clinical and public health practice.
Most of the basic prediction models including data on non-invasive variables performed well to identify those at high risk of developing type 2 diabetes in an independent population. The discriminative performance was slightly better for the extended models with additional data on conventional biomarkers. Most models, however, overestimated the actual risk of diabetes. Whether this influences treatment decisions needs to be further investigated. Hence, existing prediction models, even with only limited information, are valid tools to identify those at high risk but do not perform well enough to quantify the actual risk of future diabetes.
What is already known on this topic
There are many prediction models to estimate risk for future development of type 2 diabetes
An independent study to validate and compare the existing models is essential for assessing utility of prediction in practice, but has not yet been performed
What this study adds
Existing prediction models, even those that incorporate only four to six predictors, are valid tools to identify individuals at high risk for future development of type 2 diabetes
Actual risk for development of type 2 diabetes is generally overestimated, making it necessary to adapt models to local settings, and even then the accuracy of the estimated risk remains questionable
The impact of such prediction models on prevention or treatment decisions requires further investigation in clinical practice
We thank Statistics Netherlands and the PHARMO Institute for follow-up data on cancer, cardiovascular disease and vital status.
Contributors: AA, LMP, RPS, KGM, SJLB, and JWJB conceived and designed the study. AA, KGM, LMP, and JWJB analysed the data. AA, LMP, GN, and JWJB wrote the first draft of the manuscript. All authors contributed to the writing of the manuscript and agreed with manuscript results and conclusions. AA, LMP, and JWJB are guarantors.
Funding: This study was funded by the Netherlands Heart Foundation, the Dutch Diabetes Research Foundation and the Dutch Kidney Foundation, the Centre for Translational Molecular Medicine (project PREDICCt, grant 01C-104-07), the Europe against Cancer Programme of the European Commission (SANCO), the Dutch Ministry of Health, the Dutch Cancer Society, the Netherlands Organization for Health Research and Development (ZonMW), the World Cancer Research Fund (WCRF), and the Netherlands Organization for Scientific Research (projects 9120.8004 and 918.10.615). None of the study sponsors had a role in the study design, data collection, analysis and interpretation, report writing, or the decision to submit the report for publication.
Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.
Ethical approval: The EPIC-NL cohort complies with the Declaration of Helsinki and was approved by the relevant local medical ethics committees. All participants gave written informed consent before study inclusion.
Data sharing: No additional data available.
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/2.0/ and http://creativecommons.org/licenses/by-nc/2.0/legalcode.