Systematic literature search
We performed a systematic literature search according to the PRISMA guidelines,20 when applicable. We searched PubMed for all published cohort studies that reported prediction models for the risk of type 2 diabetes until February 2011 using the following search string: ((“diabetes” OR “diabetes mellitus” OR “type 2 diabetes”) AND (“risk score” OR “prediction model” OR “predictive model” OR “predicting” OR “prediction rule” OR “risk assessment” OR “algorithm”)) NOT review [pt] AND English [LA]. We repeated this search for publications in German and Dutch. Finally, we checked systematic reviews and validation studies of prediction models to identify other relevant articles for our validation study. Because we did not perform a formal meta-analysis, the PRISMA items related to “protocol and registration” and “synthesis of results” for meta-analyses are not applicable to our study.
Studies were included if they met the following criteria: the study presented at least one formal prediction model or an update on a previously developed model; the endpoint was incident type 2 diabetes in a longitudinal design; and the population had to be at least partly white because the EPIC-NL cohort to be used for validation consists predominantly of white adults. We excluded studies using data on individuals with impaired glucose tolerance or impaired fasting glucose. Furthermore, we excluded models that used the two hour oral glucose tolerance test as a predictor variable because this was not available in our validation dataset and there was no reliable proxy variable available that could be taken as a substitute.
After review of the retrieved titles, two authors (AA and JWJB) independently reviewed the abstracts to select the relevant papers for full text review and subsequently reviewed and assessed the full papers. Discrepancies between the two reviewers were resolved by a third author (EC), who reviewed the material to reach consensus. For included studies, our primary plan was to extract the data needed to validate the models from the original studies or, failing that, to contact the authors to obtain this information.
Table 1 summarises characteristics of the included studies. The extracted data included the first author’s name, year of publication, country, name of study/score, number of cases and population, ascertainment of diabetes, duration of follow-up, statistical model, number of predictors, and reported performance of the model. The retrieved models were divided into models that contained only non-invasive predictors (“basic models”) and models that also included conventional biomarkers, such as glucose, HbA1c, lipids, uric acid, or γ-glutamyltransferase (“extended models”).
Table 1 General characteristics of models to predict risk of incident type 2 diabetes included in study
Validation cohort
The EPIC-NL cohort (n=40 011) includes the Monitoring Project on Risk Factors for Chronic Diseases (MORGEN-EPIC) and Prospect-EPIC cohorts, initiated between 1993 and 1997. The Prospect-EPIC cohort comprises 17 357 women aged 49-70 who participated in a breast cancer screening programme. The MORGEN cohort comprises 22 654 men and women aged 20-64 who were recruited through random population sampling in three Dutch towns (Amsterdam, Maastricht, and Doetinchem). At baseline, all participants were sent a general questionnaire and a food frequency questionnaire; these were returned when they visited the study centre for a medical examination. The reporting of the study results conforms to the STROBE statement.21
We excluded 615 individuals with prevalent type 2 diabetes and 1017 with missing follow-up or who did not consent to linkage with disease registries. The 38 379 remaining participants were used to validate the basic models in a full cohort design. We applied similar exclusion criteria in a 6.5% baseline random sample (n=2604) in which measurements of conventional biomarkers were available,19 leaving 2506 individuals. We used this random sample and all incident cases of type 2 diabetes to validate the extended models in a case cohort design.22 Table A in appendix 1 provides baseline characteristics for the entire cohort, the random sample, and the people with incident type 2 diabetes.
Data analysis
To evaluate the predictive performance of the retrieved prediction models, we used the original prediction models (regression coefficients with intercept or baseline hazard) as published. If the paper did not contain sufficient information, we asked the authors to provide us with the original model.25 26 Particularly, we obtained regression coefficients26 and the intercept of the model25 by asking for complementary information. Using these original (regression) model formulas, we calculated the probability of developing type 2 diabetes per model for each individual in our study sample. Two authors (AA and JWJB) first matched the predictors of the original models with the variables available in our data. A direct match was available in our data for most variables. If a direct match was not possible, we replaced the original predictor with a proxy variable to avoid having to drop the model from our validation study. For example, we used non-fasting glucose values because fasting glucose values were not available in our data. Also, nutritional variables were collected with our food frequency questionnaire as continuous variables (g/day) and were re-coded into corresponding categories used in the prediction models by using Dutch portion sizes. Table B in appendix 1 provides an overview of the variables used in each of the prediction models, and appendix 2 gives the exact details on the proxy variables that were used.
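As a minimal sketch of how a published model formula translates into an individual predicted probability, the Python below applies a logistic model (intercept plus coefficients) and a survival model (baseline survival at the prediction horizon plus coefficients). The function names, predictor names, and coefficient values are hypothetical and are not taken from any of the validated models. Some published survival models centre the predictors at their means, which is not shown here.

```python
import math

def linear_predictor(coefs, values):
    # Sum of coefficient * predictor value over all predictors in the model.
    return sum(beta * values[name] for name, beta in coefs.items())

def logistic_risk(intercept, coefs, values):
    # Logistic model: risk = 1 / (1 + exp(-(intercept + linear predictor))).
    lp = intercept + linear_predictor(coefs, values)
    return 1.0 / (1.0 + math.exp(-lp))

def survival_risk(baseline_survival, coefs, values):
    # Proportional hazards model: risk = 1 - S0(t) ** exp(linear predictor),
    # where S0(t) is the baseline survival at the prediction horizon t.
    lp = linear_predictor(coefs, values)
    return 1.0 - baseline_survival ** math.exp(lp)

# Hypothetical coefficients and participant, for illustration only.
coefs = {"age_per_10y": 0.4, "bmi_over_30": 0.9}
person = {"age_per_10y": 5.5, "bmi_over_30": 1}

print(logistic_risk(-5.0, coefs, person))
print(survival_risk(0.98, coefs, person))
```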
We assessed performance of the models using measures of discrimination and calibration.13 Discrimination describes the ability of the model to distinguish those at high risk of developing diabetes from those at low risk. The discrimination was examined by calculating Harrell’s C (comparable with the area under the ROC curve), accounting for censored data.27 Calibration indicates the ability of the model to correctly estimate the absolute risks and was examined by calibration plots. In a calibration plot, the predicted risk is plotted against the observed incidence of the outcome. Ideally the predicted risk equals the observed incidence throughout the entire risk spectrum and the calibration plot follows the 45° line. The calibration plot was extended to a “validation” plot as a summary tool.18 27 Appendix 2 gives more details on information provided by this plot. Calibration was also tested with the Hosmer-Lemeshow goodness of fit statistic for time to event data.18 27 Follow-up of our cohort was almost complete until about eight years: 3% were censored at 5 years, 5% at 7.5 years, and 44.6% at 10 years. To account for censoring when obtaining the observed probabilities for assessing calibration over, say, 10 years of follow-up, we first calculated for each individual the linear predictor and subsequently 10 year predicted outcome probability by the original survival models. This predicted probability was then divided into tenths, and we performed a Kaplan-Meier analysis per tenth, which accounts for the observed censoring. Per tenth, we obtained at the 10 year time point the observed outcome percentages, which in turn were compared with the 10 mean predicted outcome probabilities to obtain the calibration plot and measure of goodness of fit. 
This was done for each model and for the other time points (5 and 7.5 years).28 Moreover, we reported the calibration slope for the logistic regression models18 and calculated observed over predicted (expected) outcomes (O/E ratio) with 95% confidence intervals.18 29 A ratio below 1.0 indicates overestimation of risk, and a ratio more than 1.0 indicates underestimation of risk.
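The performance measures described above can be sketched in a few lines of Python. This is an illustrative simplification, not the code used in the study: the function names are hypothetical, the Harrell's C version ignores pairs with tied follow-up times, and the O/E ratio is given as a point estimate without its confidence interval.

```python
def harrell_c(times, events, risks):
    # A pair (i, j) is comparable when i had the event before j's last
    # observed time. It is concordant when i also had the higher predicted
    # risk; tied risks count as half. Assumes at least one comparable pair.
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

def km_risk(times, events, horizon):
    # Kaplan-Meier estimate of the cumulative incidence up to the horizon,
    # which accounts for censoring before the horizon.
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        n_events = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - n_events / at_risk
    return 1.0 - survival

def calibration_table(predicted, times, events, horizon, n_groups=10):
    # Split participants into groups (tenths by default) of predicted risk
    # and pair the mean predicted risk with the Kaplan-Meier observed risk.
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    rows = []
    for g in range(n_groups):
        idx = order[g * len(order) // n_groups:(g + 1) * len(order) // n_groups]
        if not idx:
            continue
        mean_pred = sum(predicted[i] for i in idx) / len(idx)
        observed = km_risk([times[i] for i in idx], [events[i] for i in idx], horizon)
        rows.append((mean_pred, observed))
    return rows

def oe_ratio(predicted, times, events, horizon):
    # Observed (Kaplan-Meier) risk divided by the mean predicted risk.
    return km_risk(times, events, horizon) / (sum(predicted) / len(predicted))
```

Plotting the pairs returned by `calibration_table` against the 45° line gives the calibration plot; an `oe_ratio` below 1.0 corresponds to overestimation of risk.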
Differences in the incidence of diabetes in our cohort and in the development populations led to significant deviation between observed risk in our cohort and predicted risk estimated by the prediction model. To reduce this source of miscalibration, we “recalibrated” each prediction model by adjusting the intercept (for logistic regression models) or the baseline survival function (for survival regression models).28 30 31
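One common way to adjust the intercept of a logistic model is to keep the original linear predictor fixed and solve for the single shift that makes the mean predicted risk equal the observed risk, which is the maximum likelihood solution for an intercept-only update. The sketch below does this by bisection; it illustrates the idea and is not necessarily the exact procedure used in the study, and the function names are hypothetical. For survival models the analogous step replaces the baseline survival function, which is not shown.

```python
import math

def mean_risk(linear_predictors, shift):
    # Mean predicted risk after adding a constant shift to every
    # participant's original linear predictor (intercept included).
    total = sum(1.0 / (1.0 + math.exp(-(lp + shift))) for lp in linear_predictors)
    return total / len(linear_predictors)

def intercept_correction(linear_predictors, observed_risk, lo=-20.0, hi=20.0):
    # mean_risk increases monotonically with the shift, so bisection finds
    # the shift at which mean predicted risk equals the observed risk.
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if mean_risk(linear_predictors, mid) < observed_risk:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical linear predictors from an original model and a hypothetical
# observed incidence in the validation cohort.
lps = [-3.0, -2.0, -1.0]
shift = intercept_correction(lps, 0.05)
print(shift, mean_risk(lps, shift))
```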
The original models were developed for different time periods of risk prediction (different “prediction horizons”)—for instance, some models estimate 5 year risk and others 10 year risk. We therefore assessed the performance of each model for prediction of risk at 5, 7.5, and 10 years to account for the different time periods. For example, for 5 year risk, we considered individuals as incident cases if they had developed diabetes within the first five years of follow-up. Participants who developed diabetes after more than five years of follow-up were included in five year prediction as non-cases. A similar approach was followed for 7.5 and 10 year predictions. In addition, we performed a sensitivity analysis using the prediction horizon for which each model was developed in case this differed from 5, 7.5, or 10 years.
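The horizon-specific outcome definition can be written as a small helper. The sketch below assumes follow-up time in years and an event indicator per participant; the function name is hypothetical.

```python
def outcome_at_horizon(time, event, horizon):
    # Cases within the horizon keep their event time. Everyone else is a
    # non-case at this horizon, with follow-up truncated at the horizon
    # (participants censored earlier keep their own censoring time).
    if event and time <= horizon:
        return (time, 1)
    return (min(time, horizon), 0)

# Hypothetical participants as (years of follow-up, developed diabetes).
for years, diabetes in [(3.0, 1), (7.0, 1), (4.0, 0), (9.5, 0)]:
    print(outcome_at_horizon(years, diabetes, horizon=5))
```

A participant who develops diabetes after seven years is therefore a non-case for the 5 year prediction but a case for the 7.5 and 10 year predictions.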
For the basic prediction models, which included only data from non-invasive clinical variables, we quantified their performance in the full dataset. The extended models were validated in the case cohort data. To account for this design, we applied an extrapolation approach that extends the case cohort data to the size of the full cohort.22 This is achieved by extrapolating the non-cases of the random sample (that is, the total random sample of 2506 individuals minus 79 cases) to the number of non-cases in the full cohort (that is, the total sample of 38 379 individuals minus 924 cases). To do so, we substituted the non-cases in the full cohort (n=37 455) with a random multiplication of non-cases of the random sample (n=2427). On average, we multiplied the non-cases in the random sample by 15.4 (that is, 37 455 divided by 2427). Next, we merged the extrapolated data from non-cases with those from all the cases (total non-cases of 37 455 individuals plus 924 cases), recreating the size and composition of the full cohort. In sensitivity analyses, we estimated the performance of the basic models in the re-sampled data from the case cohort and compared these results with those obtained from the full cohort. This allowed us to confidently use the extrapolation approach for the extended models in the case cohort design.
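The arithmetic and the re-sampling step can be sketched as follows. The sketch assumes that the "random multiplication" is implemented as sampling with replacement from the non-cases of the random sample, which is an interpretation rather than a detail given in the text; the function name and the toy records are hypothetical.

```python
import random

# Counts reported in the text.
full_non_cases = 38379 - 924      # 37 455 non-cases in the full cohort
sample_non_cases = 2506 - 79      # 2427 non-cases in the random sample
print(full_non_cases, sample_non_cases, full_non_cases / sample_non_cases)

def extrapolate_case_cohort(sample_non_case_records, case_records, n_non_cases, seed=0):
    # Draw non-cases with replacement (an assumption) until their number
    # matches the full cohort, then append all incident cases once.
    rng = random.Random(seed)
    resampled = rng.choices(sample_non_case_records, k=n_non_cases)
    return resampled + list(case_records)

# Toy example with hypothetical record identifiers.
toy = extrapolate_case_cohort(["n1", "n2", "n3"], ["c1", "c2"], n_non_cases=10)
print(len(toy))
```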
For most predictors, fewer than 1% of values were missing, although 5% of values were missing for family history of diabetes, about 15% for physical activity, and 20.5% for non-fasting glucose concentrations. Because an analysis of only the completely observed participants could lead to biased results,32 33 34 we imputed these missing values using single imputation and predictive mean matching. As the percentage of missing values for the non-fasting glucose concentration was high, we repeated our analyses using only data from the MORGEN cohort, in which less than 10% of values for non-fasting glucose concentration were missing, as a sensitivity analysis. Table C in appendix 1 shows the number of missing values for all variables incorporated in the original model.
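To illustrate the idea behind predictive mean matching, the simplified sketch below imputes one variable from a single fully observed predictor: a missing value is replaced by the observed value of a randomly chosen "donor" whose regression prediction is close to that of the participant with the missing value. Full implementations typically use several predictors and also account for uncertainty in the regression coefficients, so this is not the procedure used in the study; the function name and data are hypothetical, and the predictor is assumed to vary between participants.

```python
import random

def pmm_impute(x, y, k=3, seed=0):
    # x: fully observed predictor; y: variable with None for missing values.
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(observed)
    mean_x = sum(xi for xi, _ in observed) / n
    mean_y = sum(yi for _, yi in observed) / n
    # Ordinary least squares fit of y on x among the observed pairs.
    sxx = sum((xi - mean_x) ** 2 for xi, _ in observed)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed) / sxx
    intercept = mean_y - slope * mean_x
    # Predicted value and actual observed value for every potential donor.
    donors = [(intercept + slope * xi, yi) for xi, yi in observed]
    completed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)
            continue
        target = intercept + slope * xi
        nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        # Impute a real observed value, never the regression prediction.
        completed.append(rng.choice(nearest)[1])
    return completed

# Hypothetical data: one missing value in y.
print(pmm_impute([1, 2, 3, 4, 5], [1.0, 2.1, None, 3.9, 5.2]))
```

Because imputed values are always drawn from values that were actually observed, they stay within a plausible range for the variable.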
We carried out a third sensitivity analysis to account for the use of non-fasting glucose values, as we had to approximate the fasting glucose values included in the models by the non-fasting glucose values in our data. In this analysis, we excluded individuals with a non-fasting glucose of ≥11.1 mmol/L (n=130), as this cut point is considered a high blood glucose concentration at which diabetes is suspected, especially if it is accompanied by the classic symptoms of hyperglycaemia.4 In a fourth sensitivity analysis, we excluded 19 295 individuals (including 537 incident cases of diabetes) with a fasting period of under two hours. In a fifth sensitivity analysis, we excluded 255 individuals for whom we had no verification information of diabetes status.
All statistical analyses were conducted with SPSS version 18 (SPSS, Chicago, IL) and R version 2.11.0 (Vienna, Austria) for Windows (http://cran.r-project.org/).