Prognostic models for outcome prediction in patients with chronic obstructive pulmonary disease: systematic review and critical appraisal
BMJ 2019; 367 doi: https://doi.org/10.1136/bmj.l5358 (Published 04 October 2019) Cite this as: BMJ 2019;367:l5358
- Vanesa Bellou, PhD student1 2,
- Lazaros Belbasis, PhD student1,
- Athanasios K Konstantinidis, assistant professor2,
- Ioanna Tzoulaki, reader1 3 4,
- Evangelos Evangelou, associate professor1 3
- 1Department of Hygiene and Epidemiology, University of Ioannina Medical School, Ioannina, Greece
- 2Department of Respiratory Medicine, University Hospital of Ioannina, University of Ioannina Medical School, Ioannina, Greece
- 3Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, London, UK
- 4MRC-PHE Center for Environment, School of Public Health, Imperial College London, London, UK
- Correspondence to: E Evangelou (or @eevangelou on Twitter)
- Accepted 12 August 2019
Objective To map and assess prognostic models for outcome prediction in patients with chronic obstructive pulmonary disease (COPD).
Design Systematic review.
Data sources PubMed until November 2018 and hand searched references from eligible articles.
Eligibility criteria for study selection Studies developing, validating, or updating a prediction model in COPD patients and focusing on any potential clinical outcome.
Results The systematic search yielded 228 eligible articles, describing the development of 408 prognostic models, the external validation of 38 models, and the validation of 20 prognostic models derived for diseases other than COPD. The 408 prognostic models were developed in three clinical settings: outpatients (n=239; 59%), patients admitted to hospital (n=155; 38%), and patients attending the emergency department (n=14; 3%). Among the 408 prognostic models, the most prevalent endpoints were mortality (n=209; 51%), risk for acute exacerbation of COPD (n=42; 10%), and risk for readmission after the index hospital admission (n=36; 9%). Overall, the most commonly used predictors were age (n=166; 41%), forced expiratory volume in one second (n=85; 21%), sex (n=74; 18%), body mass index (n=66; 16%), and smoking (n=65; 16%). Of the 408 prognostic models, 100 (25%) were internally validated and 91 (23%) examined the calibration of the developed model. For 286 (70%) models a model presentation was not available, and only 56 (14%) models were presented through the full equation. Model discrimination using the C statistic was available for 311 (76%) models. 38 models were externally validated, but in only 12 of these was the validation performed by a fully independent team. Only seven prognostic models with an overall low risk of bias according to PROBAST were identified. These models were ADO, B-AE-D, B-AE-D-C, extended ADO, updated ADO, updated BODE, and a model developed by Bertens et al. A meta-analysis of C statistics was performed for 12 prognostic models, and the summary estimates ranged from 0.611 to 0.769.
Conclusions This study constitutes a detailed mapping and assessment of the prognostic models for outcome prediction in COPD patients. The findings indicate several methodological pitfalls in their development and a low rate of external validation. Future research should focus on the improvement of existing models through update and external validation, as well as the assessment of the safety, clinical effectiveness, and cost effectiveness of the application of these prognostic models in clinical practice through impact studies.
Systematic review registration PROSPERO CRD42017069247
Chronic obstructive pulmonary disease (COPD) is a major public health problem. COPD accounts for at least 2.9 million deaths annually1; it is a leading cause of morbidity and mortality, and its prevalence is projected to increase over the coming years. Morbidity associated with the disease entails physician visits, emergency department visits, and hospital admissions,2 all of which lead to a substantial economic burden. The greatest proportion of the costs is attributed to exacerbations of COPD.2
COPD is a fairly heterogeneous disease, and stratifying cases according to prognosis would raise the possibility of a precision medicine approach. For many years now, forced expiratory volume in one second (FEV1) and age have been considered to be the most important prognostic indicators in COPD.3 More recently, a wide variety of individual clinical factors have been also linked to prognosis of COPD.
Prognostic models, in general, have two distinct uses: they classify patients in groups with different prognosis and estimate prognosis for individual patients. Although these are two different ways of looking at the same information, they differ fundamentally and the ultimate goal is to guide therapeutic and further diagnostic choices.4 Use of a composite index to assess prognosis in COPD patients may provide a more comprehensive method of evaluation, incorporating a cluster of systemic manifestations of the disease.5 Furthermore, in patients with COPD, multivariable prognostic models for various clinical outcomes could be used in clinical practice to assist decision making about hospital admission or admission to intensive care units and treatment strategy.6
Many prognostic models, combining multiple predictors for COPD related outcomes, have been developed. Global Initiative for Chronic Obstructive Lung Disease (GOLD) guidelines recommend the use of multivariable prediction models to assess the prognostic profile and facilitate follow-up of patients, instead of single predictors such as spirometry or history of exacerbations alone. Also, in the latest GOLD statement, the BODE index is proposed as a tool to determine who needs referral for consideration for lung transplantation.7
In this study, we aimed to systematically summarise the reported multivariable prognostic models developed for predicting subsequent outcomes in patients diagnosed as having COPD, to map their characteristics, and to examine whether they have undergone external validation. We used the Prediction model Risk Of Bias ASsessment Tool (PROBAST) to apply risk of bias assessment of the methodological features of the available studies developing or validating prognostic models. For prognostic models with multiple validation studies, we did a meta-analysis for performance and calibration of the models to obtain more accurate estimates.
We designed this systematic review according to the Checklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) and the recent guidance by Debray et al.8 9 A protocol for this study was published on PROSPERO (registration number CRD42017069247).
We systematically searched PubMed from inception to 11 November 2018 to capture all studies developing and/or validating a prognostic model for clinical outcomes in COPD patients. On the basis of previous research,10 11 we created the following search algorithm: (predict* OR progn* OR “risk prediction” OR “risk score” OR “risk calculation” OR “risk assessment” OR “c statistic” OR discrimination OR calibration OR AUC OR “area under the curve” OR “area under the receiver operator characteristic curve”) AND (“chronic obstructive pulmonary disease” OR emphysema OR “chronic bronchitis” OR COPD). Two researchers (VB, LB) did the literature search independently, and discrepancies were resolved by a third researcher (IT). We further hand searched the references of each eligible article for potential additional eligible studies.
We included all studies that reported the development or validation of at least one multivariable model for predicting the risk for any clinical outcome in COPD patients. Table 1 shows a detailed description of the PICOTS for this review.8 9 To consider a study as eligible, we followed the definition of prognostic model studies as proposed by the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement.12 Accordingly, an eligible study had to specifically report the development, the update, or the external validation of a prognostic model used for making individualised predictions in COPD patients, either in its objectives or its conclusions. A study was also eligible if the development or update of a prognostic model could be deduced by the available information through the full text (for example, model presentation, measures of predictive performance for a multivariable model). Eligible outcomes included any possible clinical endpoint of COPD patients, such as mortality, exacerbations, and hospital admissions.
The eligible studies could report the development of multivariable models, the external validation of an existing model, and/or the update of an existing model. Updating of models may range from simple adjustment of the baseline risk or hazard, through re-estimation of the predictors’ weights by using the same or different adjustment factors, to adding new predictors or removing existing predictors from the original model.13 External validation studies aim to assess the predictive performance of an existing model in an independent population.13 We included external validation studies that explicitly estimated and presented a measure of the model’s performance. We also considered studies validating prediction models originally developed for other diseases in a COPD population. Also, eligible articles should report original research, study humans, and be written in English.
We excluded studies developing or validating diagnostic models to detect or exclude presence of COPD in patients with suspected COPD, studies examining only independent prognostic factors, methodological studies, and COPD case finding or screening studies. We also excluded studies that developed search algorithms to identify existing cases of COPD on the basis of administrative data. Given that prognostic models estimate a probability of a certain outcome for an individual patient over a specified time horizon, we excluded cross sectional studies because in this study design predictors and the outcome are measured concurrently. However, cohort studies that did an external validation of a model derived from a cross sectional study were deemed eligible.
To facilitate the data extraction process, three researchers (VB, LB, IT) constructed a standardised form by following recommendations in the CHARMS checklist.8 Two researchers (VB, LB) independently extracted data. From all eligible articles, we extracted information on first author, year and journal of publication, and model name. From articles describing model development, we extracted the following information: study design, study population, geographical location, predicted outcome, definition of outcome, prediction horizon, definition of COPD, modelling method, method of internal validation, number of participants and number of events, number and type of predictors in final model, model presentation, and measures of predictive performance (discrimination, calibration, classification, overall performance). Potential measures of discrimination were C statistic and D statistic; potential measures of classification were sensitivity, specificity, positive and negative predictive value, and predictive accuracy; potential measures for overall performance were R2 and Brier score; and potential measures for assessment of calibration were calibration plot, calibration-in-the-large, calibration slope, Hosmer-Lemeshow test, Harrell’s E statistic, and calibration test.14 15 Harrell’s E statistic is defined as the absolute difference between smoothed observed outcomes and predicted probabilities.14 Furthermore, we evaluated whether the authors reported only the apparent performance of a prognostic model or examined overfitting by using internal validation. Additionally, we examined whether a shrinkage of regression coefficients towards zero was performed in eligible studies and which method was used. We considered that the authors adjusted for optimism sufficiently if they re-evaluated the performance of a model in internal validation and performed shrinkage of model coefficients as well.
We extracted information on whether the authors did decision curve analysis and net benefit analysis to evaluate the clinical usefulness of a model.15 16 Moreover, for each eligible study, we examined whether the authors reported the presence of missing data on examined outcomes and/or variables included in prediction models; if so, we recorded how missing data were treated. We also extracted information on how continuous variables were handled and whether non-linear trends for continuous predictors were assessed by applying polynomials, fractional polynomials, or cubic splines. If the handling of continuous predictors was not described explicitly, we scrutinised the full text and the tables of the respective papers to derive this information from the reported effect sizes. If this process was inconclusive, we described the handling of continuous predictors as unclear.
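As an illustration of two of the performance measures listed above, the following minimal Python sketch computes a C statistic and a Brier score for a binary outcome. The function names and the toy data are hypothetical, and the sketch does not handle censoring, so it would not apply as written to time-to-event models such as Cox regression.

```python
from itertools import product


def c_statistic(y_true, y_prob):
    """C statistic for a binary outcome.

    Share of (event, non-event) pairs in which the event received the
    higher predicted risk; tied predictions count as half concordant.
    """
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p_event, p_non_event in product(events, non_events):
        if p_event > p_non_event:
            concordant += 1.0
        elif p_event == p_non_event:
            concordant += 0.5
    return concordant / (len(events) * len(non_events))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy data (hypothetical): 1 = event occurred, 0 = no event.
outcomes = [1, 0, 1, 0]
predicted = [0.9, 0.2, 0.4, 0.6]
print(c_statistic(outcomes, predicted), brier_score(outcomes, predicted))
```

A C statistic of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination, whereas a lower Brier score indicates better overall performance.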
In articles examining the performance of the same prediction model on various outcomes or multiple timepoints, we retained the prediction model referring to the outcome or timepoint mentioned as the primary analysis of the study. In cases in which a primary timepoint was not specified, we considered the prediction with the longest horizon as the primary analysis of the study, because longer follow-up would lead to a larger number of events. Whenever a study described a model’s performance both in an overall sample and in specific subgroups of the population, we extracted the analysis on the total population.
From articles describing external validation of models, we extracted study population, geographical location, number of participants and events, the model’s performance, and calibration. If an article described multiple models, we extracted data separately for each model. For each model externally validated in multiple articles, we included in our analysis only external validation studies with non-overlapping populations. Furthermore, we examined whether the research team performing the external validation was independent of the research team developing the prediction model.
Risk of bias assessment
We appraised the presence of bias in the studies developing or externally validating prognostic models by using PROBAST, which is a risk of bias assessment tool designed for systematic reviews of diagnostic or prognostic prediction models.17 18 It contains a multitude of questions in four different domains: participants, predictors, outcome, and statistical analysis. Questions are answered with yes, probably yes, probably no, no, and no information, depending on the characteristics of the study. If a domain contains at least one question signalled as no or probably no, it is considered to be at high risk. To be considered at low risk, a domain should contain all questions answered with yes or probably yes. Overall risk of bias is graded as low risk when all domains are considered low risk, and overall risk of bias is considered high risk when at least one of the domains is considered high risk. Two researchers (VB, LB) independently assessed risk of bias.
PROBAST describes the assessment of both development studies and external validation studies. Often, articles describe the development of multiple prognostic models using different populations or different statistical approaches. Hence, differences in the risk of bias assessment are expected among different prognostic models developed in the same article. For this reason, we chose to report the risk of bias assessment per developed prognostic model and not per article. Furthermore, articles may describe the external validation of multiple prognostic models in the same population or in multiple different populations. For this reason, we refer to external validation efforts and we report the risk of bias assessment per external validation effort.
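The grading rules described above can be sketched in a few lines of Python. This is only an illustration of the logic as stated in the text: the function names and domain keys are hypothetical, and the "unclear" grade for domains with "no information" answers but no negative answers is an assumption, because the text does not describe that case.

```python
LOW, HIGH, UNCLEAR = "low", "high", "unclear"


def grade_domain(answers):
    """Grade one PROBAST domain from its signalling question answers."""
    if not answers:
        raise ValueError("a domain needs at least one answer")
    normalised = [a.strip().lower() for a in answers]
    # Any "no" or "probably no" puts the domain at high risk of bias.
    if any(a in ("no", "probably no") for a in normalised):
        return HIGH
    # Low risk requires every answer to be "yes" or "probably yes".
    if all(a in ("yes", "probably yes") for a in normalised):
        return LOW
    # Assumption: "no information" without any negative answer is unclear.
    return UNCLEAR


def grade_overall(domain_answers):
    """Return the overall grade and the per-domain grades."""
    grades = {name: grade_domain(ans) for name, ans in domain_answers.items()}
    if HIGH in grades.values():
        overall = HIGH
    elif all(g == LOW for g in grades.values()):
        overall = LOW
    else:
        overall = UNCLEAR
    return overall, grades


# Hypothetical model: a single negative answer in the analysis domain
# is enough to grade the whole model as high risk of bias.
overall, grades = grade_overall({
    "participants": ["yes", "probably yes"],
    "predictors": ["yes"],
    "outcome": ["probably yes"],
    "analysis": ["yes", "probably no"],
})
print(overall, grades)
```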
We calculated and reported descriptive statistics to summarise the characteristics of the models. We calculated the median and interquartile range for continuous variables and the respective percentages for binary variables.
For the prediction models that were examined in more than two independent datasets (excluding the model development dataset), we did a random effects meta-analysis to calculate a summary estimate for models’ performance and calibration. We also considered for the meta-analysis those prediction models that were internally validated through bootstrapping or cross validation and were externally validated in only two independent datasets. We followed a recently published framework for the meta-analysis of prediction models.9 19 If a measure of uncertainty (standard error or 95% confidence interval) was not available for mean C statistic, we used a formula to approximate the standard error of mean C statistic based on number of events and number of participants.9 19 20 We quantified between study heterogeneity by using the I2 and τ2 statistics.21 We used R version 3.5.2 for the statistical analysis. For the meta-analysis of prediction models, we used the R package “metamisc.”19
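To make the pooling step concrete, the following Python sketch runs a random effects meta-analysis of C statistics on the logit scale and reports I2 and τ2. It is an illustration under stated assumptions, not a reproduction of the authors' analysis, which was done in R with metamisc. The Hanley-McNeil approximation stands in for the unspecified standard error formula, the DerSimonian-Laird estimator stands in for whatever estimator the package applied, and all function names are hypothetical.

```python
import math


def hanley_mcneil_se(c, n_events, n_total):
    """Approximate SE of a C statistic from events and sample size
    (Hanley-McNeil formula; an assumed stand-in for the article's formula)."""
    n_non = n_total - n_events
    q1 = c / (2 - c)
    q2 = 2 * c ** 2 / (1 + c)
    var = (c * (1 - c) + (n_events - 1) * (q1 - c ** 2)
           + (n_non - 1) * (q2 - c ** 2)) / (n_events * n_non)
    return math.sqrt(var)


def pool_c_statistics(c_stats, ses):
    """DerSimonian-Laird random effects pooling on the logit scale."""
    if len(c_stats) != len(ses) or not c_stats:
        raise ValueError("need one standard error per C statistic")
    y = [math.log(c / (1 - c)) for c in c_stats]
    # Delta method: SE on the logit scale is SE / (c * (1 - c)).
    v = [(se / (c * (1 - c))) ** 2 for c, se in zip(c_stats, ses)]
    w = [1 / vi for vi in v]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    scale = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / scale) if scale > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_re = [1 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se_mu = math.sqrt(1 / sum(w_re))

    def expit(x):
        return 1 / (1 + math.exp(-x))

    return {
        "c": expit(mu),
        "ci": (expit(mu - 1.96 * se_mu), expit(mu + 1.96 * se_mu)),
        "tau2": tau2,  # between study variance on the logit scale
        "i2": i2,      # percentage of variability due to heterogeneity
    }


# Hypothetical validation studies: (C statistic, events, participants).
studies = [(0.68, 60, 400), (0.74, 120, 900), (0.63, 45, 350)]
ses = [hanley_mcneil_se(c, e, n) for c, e, n in studies]
print(pool_c_statistics([c for c, _, _ in studies], ses))
```

Pooling on the logit scale keeps the summary estimate and its confidence interval inside the 0 to 1 range, which is the usual reason for transforming C statistics before meta-analysis.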
Patient and public involvement
No patients or participants were involved in setting the research question or the outcome measures, nor were they involved in developing plans for design or implementation of the study. No patients were asked to advise on interpretation or writing up of results. There are no plans to disseminate the results of the research to study participants or the relevant patient community.
Of the 17 538 screened papers, 228 papers were eligible (fig 1). These articles described the development of 408 prognostic models in COPD patients, the external validation of 38 prognostic models, and the application of 20 prognostic models originally developed for health outcomes other than prognosis of COPD patients. One of the eligible papers was identified through the hand search of references from eligible articles.22 The prognostic models were mainly developed in the US (n=91; 22%), Spain (n=57; 14%), and the UK (n=34; 8%), whereas 80 (20%) models were developed in multicentre studies from multiple countries. For the derivation cohorts, the median sample size was 409 (interquartile range 163-1033) and the median number of events was 63 (36-188). For the internal validation cohorts, the median sample size was 831 (225-4192) and the median number of events was 77 (40-370).
The eligible prognostic models were developed in a variety of clinical settings; 239 (59%) models were developed in an outpatient setting, and 155 (38%) models were developed on a sample of patients admitted to hospital; 14 (3%) prognostic models were developed for COPD patients attending the emergency department. The developed models focused on a wide range of clinical outcomes. The most commonly used endpoints were mortality (n=209; 51%), exacerbation (n=42; 10%), and readmission after an index hospital admission (n=36; 9%). Supplementary table A shows a summary of the predicted outcomes per clinical setting. Twenty four prognostic models focused on a composite outcome. The most commonly used predictors were age (n=166; 41%), FEV1 (n=85; 21%), sex (n=74; 18%), body mass index (n=66; 16%), smoking (n=65; 16%), previous exacerbations (n=53; 13%), previous hospital admissions (n=50; 12%), BODE index (n=43; 11%), modified Medical Research Council (mMRC) dyspnoea scale (n=42; 10%), and Charlson comorbidity index (n=35; 9%). Supplementary table B shows the top predictors in the 408 prognostic models for COPD patients stratified by clinical setting. Figure 2 shows the predictors that were used in at least 20 models, and figure 3 shows the 10 most common predictors stratified by clinical setting. Below, we describe the methodological and clinical characteristics for a total of 408 prognostic models, based on clinical setting.
Prognostic models for outpatients
Most of the prognostic models (n=239; 59%) were developed on a sample of COPD patients examined in an outpatient facility (supplementary table C). For the derivation cohort, the median sample size was 431 (244-1000) and the median number of events was 63 (33-155). For the internal validation cohort, the median sample size was 249 (204-3468) and the median number of events was 150 (64-1642). The most common clinical endpoints examined by these models were mortality (n=124; 52%), exacerbation (n=40; 17%), spirometric indices (n=25; 10%), hospital admission (n=16; 7%), treatment failure during an acute exacerbation (n=8; 3%), and composite outcome (n=9; 4%). The most commonly used predictors in these models were age (n=105; 44%), FEV1 (n=69; 29%), smoking (n=54; 23%), body mass index (n=51; 21%), sex (n=43; 18%), previous exacerbations (n=43; 18%), BODE index (n=43; 18%), previous hospital admissions (n=28; 12%), and diabetes mellitus (n=24; 10%).
A C statistic was reported for most (n=198; 83%) of these models, and the remaining 41 (17%) did not have a discrimination metric reported. For 172 prognostic models, only the apparent performance was reported in the development study. One prognostic model had temporal validation, and the remaining models had cross validation (n=28; 12%), bootstrapping (n=24; 10%), random split (n=12; 5%), or a combination of methods (n=2). Most (n=193; 81%) prognostic models were not calibrated; calibration was assessed for 46 prognostic models, and the most frequent method used was the Hosmer-Lemeshow test (n=35; 15%). Various modelling methods were applied, of which the most frequent were Cox regression (n=90; 38%), logistic regression (n=79; 33%), negative binomial regression (n=21; 9%), and linear regression (n=16; 7%). For 12 prognostic models, shrinkage of regression coefficients was done to reduce overfitting. Application of a uniform shrinkage factor to all the regression coefficients was used for nine models, application of a penalised maximum likelihood method to estimate the regression coefficients was described for one prognostic model, and lasso regression was applied in two prognostic models to perform shrinkage for selection of predictors. For 17 prognostic models, a non-linear association between continuous predictors and predicted outcome was examined using the following methods: polynomials (n=7), restricted cubic splines (n=6), fractional polynomials (n=2), and Box-Tidwell transformation (n=2). A considerable number (n=178; 75%) of models did not have any type of model presentation, and only 24 (10%) reported the full regression equation. The most common type of presentation was sum score (n=30; 13%). Only one study performed decision analysis.23 In this study, net benefit and decision curves are available for the updated ADO index. Net benefit is a category of decision analysis, comparing benefits and harms directly after transforming them on the same scale. 
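The net benefit mentioned above is commonly computed, for a chosen risk threshold, as the proportion of true positives minus the proportion of false positives weighted by the odds of the threshold. The Python sketch below illustrates that standard formula; it is not taken from the cited study, and the function names and toy data are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on patients whose predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    flagged = [(y, p) for y, p in zip(y_true, y_prob) if p >= threshold]
    true_pos = sum(1 for y, _ in flagged if y == 1)
    false_pos = sum(1 for y, _ in flagged if y == 0)
    # The threshold odds put harms (false positives) on the benefit scale.
    return true_pos / n - false_pos / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: act on every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data (hypothetical): 1 = event occurred, 0 = no event.
outcomes = [1, 0, 1, 0]
predicted = [0.9, 0.2, 0.4, 0.6]
print(net_benefit(outcomes, predicted, 0.3))
print(net_benefit_treat_all(outcomes, 0.3))
```

A decision curve plots these quantities over a range of thresholds; a model is clinically useful at a given threshold when its net benefit exceeds both the treat all strategy and the treat none strategy, whose net benefit is zero.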
Table 2 gives a detailed description of all the methodological characteristics.
Prognostic models for patients admitted to hospital
One hundred and fifty five models were developed in patients admitted to medical wards, intensive care units, or rehabilitation centres (supplementary table D). The median sample size of the derivation cohort was 303 (102-920), and the median number of events was 67 (37-311). The median sample size of the internal validation cohort was 4131 (731-4840), and the median number of events was 333 (35-370). The most prevalent outcomes assessed were mortality (n=78; 50%), readmission after an index admission (n=36; 23%), failure of non-invasive ventilation (n=14; 9%), and composite outcomes (n=13; 8%). The predictors encountered in most of the prognostic models were age (n=56; 36%), sex (n=30; 19%), partial pressure of carbon dioxide (n=24; 15%), previous hospital admissions (n=20; 13%), length of hospital stay (n=20; 13%), Charlson comorbidity index (n=19; 12%), pH (n=18; 12%), heart failure (n=16; 10%), body mass index (n=15; 10%), and serum albumin (n=15; 10%).
Of the 155 prognostic models, 31 (20%) were developed for patients admitted to intensive care units to predict mortality (n=22), weaning success (n=2), need for mechanical ventilation (n=6), and duration of mechanical ventilation (n=1). The most commonly used predictors were age (n=12), Glasgow or Japan Coma Scale (n=9), APACHE II (n=8), sex (n=6), pH (n=6), haemoglobin (n=6), serum albumin (n=6), heart failure (n=6), and hypertension (n=6).
A C statistic was reported for only 102 (66%) prognostic models; discrimination was not assessed for 53 (34%) models. One hundred and thirty one (85%) prognostic models did not have internal validation, and for the few models for which this was done, bootstrapping (n=9; 6%), random split (n=7; 5%), cross validation (n=3; 2%), or a combination of the aforementioned methods (n=2; 1%) was used. Three (2%) prognostic models had temporal validation. Calibration was not assessed for 116 (75%) prognostic models; the Hosmer-Lemeshow test (n=34; 22%) was the most frequently used method of calibration. Most of the prognostic models did not have a model presentation (n=104; 67%). A regression formula was available for 27 (17%) prognostic models. The most frequently used modelling methods were logistic regression (n=111; 72%) and Cox regression (n=21; 14%). For four prognostic models, shrinkage was applied to reduce overfitting. Application of a uniform shrinkage factor to all the regression coefficients was performed for two models, the penalised maximum likelihood approach was used in one model, and lasso shrinkage was applied for one model. For three prognostic models, the non-linear association of predictors with the predicted outcome was considered using polynomials (n=1), fractional polynomials (n=1) and Box-Tidwell transformation (n=1). One study did a decision analysis after developing a prognostic model.24
Prognostic models for patients presenting to emergency department
Only 14 prognostic models were developed for patients who attend the emergency department (supplementary table E), with a median sample size of 1195 (871-1250) and a median number of events of 77 (40-137) in the derivation cohort. The median sample size of internal validation cohort was 1235 (266-1244), and the median number of events was 52 (29-66). The outcomes examined were mortality (n=7; 50%), change in physical activity (n=2), composite outcome (n=2), hospital admission (n=1), intensive care unit admission (n=1), and treatment failure after a visit to the emergency department for an acute exacerbation (n=1). Five of these models examined a long term prediction horizon (>1 month). The most prevalent variables included in these models were long term oxygen therapy or non-invasive ventilation at home (n=8; 57%), age (n=5; 36%), mMRC dyspnoea scale (n=5; 36%), Charlson comorbidity index (n=4; 29%), partial pressure of carbon dioxide (n=3; 21%), use of inspiratory accessory muscles and paradoxical breathing (n=3; 21%), and Glasgow or Japan Coma Scale (n=3; 21%).
An assessment of discrimination was not reported for three of these models, and a C statistic was reported for 11 models. Five models did not have any internal validation, and a random split of the dataset was used for eight models. Bootstrapping was used for internal validation of a single model. The most frequently used modelling method was logistic regression (n=10). A shrinkage procedure was not applied for any model.
External validation studies
Of 408 prognostic models, 38 (9%) were externally validated at least once. However, only 12 (3%) models were externally validated by a fully independent research team. The prognostic models that were externally validated more than five times were ADO (17 cohorts), BODE (13 cohorts), BODEx (8 cohorts) and CODEX (7 cohorts).
Four prognostic models (DOSE index, SAFE index, mBODE% index, and COPD Severity Score) were developed in cross sectional studies, and these models were not described in the aforementioned sections. We retained only their external validation in cohort studies, of which there were 12 for DOSE index and one each for COPD Severity Score, SAFE index, and mBODE% index. Supplementary table F shows all the external validation studies of the prognostic models for outcome prediction in COPD patients.23 25-81
Risk of bias assessment
We used PROBAST to assess the risk of bias of all studies developing or externally validating a prognostic model. In figure 4, we show a summary of the risk of bias assessment of developed models by domain. Seven prognostic models were assessed as being at low risk of bias, and all these models were developed for ambulatory COPD patients (ADO index, B-AE-D index, B-AE-D-C index, extended ADO index, updated BODE index, updated ADO index, and a model developed by Bertens et al). Table 3 shows the clinical setting, the predicted outcome and the time horizon, the events per variable number, the shrinkage method, and the optimism corrected C statistic for these seven prognostic models with low risk of bias. Table 4 shows the predictors included in these seven prognostic models. For one of these models (extended ADO index), a model presentation was not available. For the remaining six models, table 5 describes the model equations. Overall, 338 models were at low risk of bias for participants, 394 models were at low risk of bias for predictors, and 402 models were at low risk of bias for outcome, but only 10 models were at low risk of bias for statistical analysis.
We additionally assessed a total of 116 external validation efforts (fig 5). Of these efforts, only five were graded as being at low risk of bias according to PROBAST. These were one validation of the model developed by Bertens et al, one validation of DECAF, one validation of BAP-65, and two validations of PEARL. The remaining validation efforts were at high risk of bias.
Validation of prognostic models originally developed for other diseases
Twenty eight papers examined the predictive ability of 20 prediction models originally developed for diseases other than COPD (supplementary table G). Specifically, these models are APACHE II and III, CHA2DS2-VASC, Charlson comorbidity index, CURB-65, CRB-65, CREWS, Elixhauser comorbidity index, Framingham risk score, GRACE, HOSPITAL, LACE, MDA, MODS, NEWS, NRS 2002, PSI, Salford-NEWS, SAPS, and SOFA. Overall, the predictive ability of these models was examined for mortality, exacerbation, hospital admission, failure of non-invasive ventilation, or identification of high cost patients.
Meta-analysis of prognostic models
Overall, we did 19 meta-analyses of C statistics for 12 prognostic models (ADO index, APACHE II, BOD index, BODE index, BODEx index, CODEX index, COTE index, CURB-65, DOSE index, LACE index, updated ADO index, and updated BODE index). For ADO index, APACHE II, BODE index, BODEx index, and CODEX index, we did two different meta-analyses for two distinct outcomes, whereas for DOSE index we did three different meta-analyses for three distinct outcomes. Eleven meta-analyses examined the risk of mortality, two meta-analyses examined the risk of acute exacerbation of COPD, five meta-analyses examined the risk of readmission or mortality, and one meta-analysis was focused on failure of non-invasive ventilation. I2 estimates ranged from 0% to 96%, whereas τ2 estimates ranged between 0 and 0.2605. In 12 meta-analyses of C statistics, we observed large between study heterogeneity (I2>50%). Summary C statistic estimates ranged from 0.611 for DOSE index in prediction of a composite outcome to 0.769 for APACHE II in prediction of mortality. Figure 6 shows a forest plot of all the meta-analyses, and table 6 shows the results of the meta-analyses of C statistics. We could not do meta-analysis of calibration measures, because they were not adequately reported in the external validation studies.
Our systematic search yielded a detailed map of more than 400 prognostic models for the prediction of clinical outcomes in COPD patients. These models were developed in a wide range of clinical settings, including outpatient services, emergency departments, medical wards, intensive care units, and primary care structures. We identified seven prognostic models that were developed in studies at low risk of bias as assessed with PROBAST, and all these models were externally validated at least once. We complemented our systematic review and bias assessment with a meta-analysis of C statistics for 12 prognostic models.
Principal findings in context
Most of the prognostic models were developed in Western countries; more than half were developed in the US, Spain, and the UK. Although COPD is a prevalent chronic disease in low and middle income countries,82 only a very small number of prognostic models were developed or validated in Asia, Africa, or South America. In the developing world, the main risk factors for COPD are history of tuberculosis and exposure to indoor air pollution.83 Previous literature has shown a more favourable prognosis in COPD caused by exposure to biomass fuel than in smoking induced COPD.84 85 We found only one paper reporting an external validation of BODEx index and COTE index in patients with COPD associated with biomass fuels52; however, this study was conducted in Spain. Our literature search indicates that currently developed prognostic models cannot be generalised to developing countries, given that they have not been validated in these populations, with the exception of an external validation of BODE index in a Brazilian population.43
Our systematic review showed several methodological pitfalls in the development of the models, which is also reflected in the risk of bias assessment. Only a quarter of the models were internally validated, and a tenth of the models were externally validated. The performance of a prognostic model is overestimated when simply determined in the sample of patients that was used to construct the model. Internal validation provides a more accurate estimate of model performance in new patients when it is properly performed, that is, using bootstrapping or cross validation techniques.86 To ensure the generalisability of a prognostic model in populations with different characteristics, external validation studies are needed.13 However, independent populations with large sample sizes of COPD patients and available COPD specific information (used as predictors in the prognostic models) can be hard to obtain to measure external validity. This necessitates the use of suitable internal validation techniques to provide an optimism adjusted performance for the population in which the model was originally developed. Nevertheless, an evaluation of a model’s performance in a different sample is not sufficient to overcome overfitting, and studies developing prognostic models should also apply shrinkage, which is a method to reduce overfitting by re-adjusting the regression coefficients.87 88 Our systematic review showed that only a very small number of prognostic models performed shrinkage.
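As an illustration of what an optimism adjusted performance means, the sketch below applies a bootstrap optimism correction to the C statistic: the model is refitted in each bootstrap sample, and the average drop in performance when that refitted model is applied back to the original data is subtracted from the apparent C statistic. The data are simulated, and `fit_weights` is a deliberately simple stand-in for a regression fit (it is an assumption for illustration, not a method used by any of the reviewed studies). Any model fitting procedure could be substituted.

```python
import random

def c_statistic(scores, outcomes):
    """Share of event/non-event pairs ranked correctly (ties count as 0.5)."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))

def fit_weights(rows, outcomes):
    """Stand-in model: weight each predictor by its mean difference between events and non-events."""
    ev = [r for r, y in zip(rows, outcomes) if y == 1]
    ne = [r for r, y in zip(rows, outcomes) if y == 0]
    return [sum(r[j] for r in ev) / len(ev) - sum(r[j] for r in ne) / len(ne)
            for j in range(len(rows[0]))]

def predict(weights, rows):
    return [sum(w * x for w, x in zip(weights, r)) for r in rows]

def optimism_corrected_c(rows, outcomes, n_boot=200, seed=1):
    rng = random.Random(seed)
    n = len(rows)
    apparent = c_statistic(predict(fit_weights(rows, outcomes), rows), outcomes)
    optimism = []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        b_rows = [rows[i] for i in idx]
        b_y = [outcomes[i] for i in idx]
        if sum(b_y) in (0, n):
            continue  # both outcome classes are needed to fit and to score
        w = fit_weights(b_rows, b_y)
        c_boot = c_statistic(predict(w, b_rows), b_y)    # performance in the bootstrap sample
        c_orig = c_statistic(predict(w, rows), outcomes)  # same model applied to original data
        optimism.append(c_boot - c_orig)
    return apparent, apparent - sum(optimism) / n_boot

# Simulated cohort: five candidate predictors, only the first carries signal.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(80)]
outcomes = [1 if r[0] + rng.gauss(0, 1.5) > 0.5 else 0 for r in rows]
apparent, corrected = optimism_corrected_c(rows, outcomes)
print(round(apparent, 3), round(corrected, 3))
```

The correction only addresses the performance estimate. Shrinkage of the coefficients themselves is a separate step and is not shown here.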
An important finding of our systematic review was that only a quarter of the models assessed calibration, which is the accuracy of absolute risk estimates: it informs clinicians how similar the predicted absolute risk is to the true (observed) risk in groups of patients classified in different risk strata.89 In addition, most of the models either did not report any method of handling missing data or performed a complete case analysis. Missing data often lead to biased estimates if not imputed, because they can distort the performance of a prediction model if the missingness of values is related to other known characteristics.90 Additionally, in about half of the prognostic models, continuous predictors were dichotomised or categorised, and the non-linearity of continuous predictors was examined for only a small percentage of prognostic models. However, categorising continuous predictors into two or more categories has already been shown to lead to weaker prediction performance than analysing predictors on a continuous scale, owing to substantial loss of information.91 Additionally, non-linear associations can be efficiently modelled using restricted cubic splines or fractional polynomials.92
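The notion of calibration described above can be checked with very little code. The sketch below compares predicted and observed risk overall (the observed to expected ratio) and within groups ordered by predicted risk. The function name `calibration_summary`, the number of groups, and the example data are hypothetical choices for illustration.

```python
def calibration_summary(predicted, observed, n_groups=5):
    """Overall observed/expected ratio plus mean predicted vs observed risk per risk group."""
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    # O/E ratio: 1.0 means the total number of predicted events matches the observed total.
    overall_oe = sum(y for _, y in pairs) / sum(p for p, _ in pairs)
    groups = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        groups.append((mean_pred, obs_rate, len(chunk)))
    return overall_oe, groups

# Hypothetical predicted risks and observed outcomes (1 = event) for 20 patients.
predicted = [0.1] * 10 + [0.5] * 10
observed = [1] + [0] * 9 + [1] * 5 + [0] * 5
oe, groups = calibration_summary(predicted, observed, n_groups=2)
print(oe, groups)
```

A model can discriminate well and still be poorly calibrated, for example when every predicted risk is twice the observed risk, which is why the two properties need to be reported separately.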
Another key factor is that discrimination and classification statistics that are usually reported in studies of prognostic models do not inform us about the clinical value of a model. Decision analysis is needed to evaluate whether the implementation of a prognostic model in clinical practice would be beneficial, that is, do more good than harm.16 However, only two eligible studies did decision analysis.23 24 Moreover, the applicability of a prediction model in clinical practice depends on the model presentation. In clinical practice, decision trees, sum scores, nomograms, and risk charts are commonly used in decision making. Sum scores and decision trees are more suitable for acute care settings, whereas risk charts and nomograms allow for a more detailed risk assessment and are better suited to outpatient settings. However, more than two thirds of the developed models did not have any type of model presentation. A predictive tool that is not presented in a usable form cannot be applied in clinical practice. Additionally, lack of reporting of the regression formula in many of the prognostic models hinders future efforts for validation, update, and recalibration.93
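One widely used form of decision analysis for prediction models is the net benefit calculation that underlies decision curve analysis. Whether the two eligible studies used this particular approach is not stated here, so the sketch below is only an illustration of the idea: at a chosen risk threshold, true positives are credited and false positives are penalised by the odds of the threshold, and the result is compared with the strategy of treating everyone. Function names and data are hypothetical.

```python
def net_benefit(predicted, observed, threshold):
    """Net benefit of acting on patients whose predicted risk is at or above the threshold."""
    n = len(observed)
    tp = sum(1 for p, y in zip(predicted, observed) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(predicted, observed) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1.0 - threshold)

def net_benefit_treat_all(observed, threshold):
    """Net benefit of acting on every patient, regardless of predicted risk."""
    prevalence = sum(observed) / len(observed)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)

# Hypothetical predicted risks and outcomes for four patients.
predicted = [0.9, 0.8, 0.3, 0.1]
observed = [1, 0, 1, 0]
for t in (0.2, 0.5):
    print(t, net_benefit(predicted, observed, t), net_benefit_treat_all(observed, t))
```

A model is useful at a given threshold only if its net benefit exceeds both the treat-all value and zero (the value of treating no one).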
The variables most commonly used in the development of prognostic models were age, FEV1, sex, body mass index, smoking, previous exacerbations, previous hospital admissions, mMRC dyspnoea scale, BODE index, and Charlson comorbidity index. These variables are either anthropometric features, important factors in the natural progression of the disease, or markers of disease severity. They are easily measured, so they are available in settings where resources are limited (such as primary care) and in acute care facilities where prompt decisions need to be made (such as emergency departments). Another advantage of these predictors is their low risk for measurement bias, which leads to a smaller possibility of exposure misclassification. Finally, these variables have been identified as individual prognostic factors in COPD.3 94 95 96 97 98 However, we observed variability in the top predictors when the predictors were stratified by clinical setting. For example, in the prognostic models designed on the basis of COPD patients presenting at the emergency department, the most commonly used predictor was the use of long term oxygen therapy or non-invasive ventilation at home, which is uncommon in other settings. Also, smoking was a frequently used predictor only in models derived from outpatient settings, and it was only rarely used as a predictor in patients admitted to hospital. Furthermore, comorbid conditions, either in the form of multidimensional indices such as the Charlson comorbidity index, or as distinct conditions (for example, diabetes mellitus or cardiovascular disorders), were widely used and ranked among the most common predictors of clinical outcomes in all settings. Serum albumin and arterial blood gases were used almost exclusively as predictors in patients admitted to hospital and those visiting the emergency department.
The most extensively validated prognostic models were the BODE index and the ADO index.49 99 The BODE index is the most established prognostic model in COPD and was developed to predict mortality.99 In the GOLD statement, the BODE index is used in the prediction of mortality and in clinical decision making for lung transplantation and post-discharge follow-up of patients. The predictors included in the BODE index are body mass index, FEV1, dyspnoea, and exercise capacity. Despite the lack of calibration in the original study of the BODE model, it has been validated and updated extensively in medical literature. The updated BODE index, a recalibration of the BODE index, is among the models with a low risk of bias.
The ADO index was based on the predictors used to develop the BODE index. It uses FEV1, dyspnoea, and age.49 The elimination of the six minute walking distance that was used in the BODE index was based on the rationale of developing a more easily applicable model, even by primary care physicians in settings with limited resources, rather than respiratory professionals alone. Despite the good predictive performance that the ADO index achieved in its development study, it showed poor calibration. This led to a recalibration of the ADO index in an independent population resulting in an updated ADO index, as well as an extended version of the recalibrated model with the addition of two variables. The ADO index, updated ADO index, and extension of updated ADO index had a low risk of bias and have been externally validated.
Three additional prognostic models presented low risk of bias and were developed for the outpatient setting. The B-AE-D index, and its update, the B-AE-D-C index, were developed for stable COPD patients at GOLD stage II to IV to predict the risk of two year all cause mortality.27 The prognostic model developed by Bertens et al was the only prognostic model at low risk of bias that was developed to predict the risk of future exacerbations at two years in stable COPD patients.68
An essential step before the application of prediction models in clinical practice is their external validation in independent populations with different clinical characteristics and comparison of performance among different prediction models to identify the models with the best discrimination and calibration. A large scale effort to externally validate and compare multiple prognostic models for COPD patients was recently published.100 The researchers used network meta-analysis to compare the performance of eight multivariable prognostic models and two different GOLD classifications in 24 cohort studies. In this analysis, the updated ADO index had the best ability to predict three year mortality in patients with COPD, followed by the updated BODE index and e-BODE index. However, the researchers pointed out that the approach of network meta-analysis has not yet integrated the synthesis of calibration measures.100
Recommendations and policy implications
On the basis of the aforementioned methodological pitfalls, the following recommendations could be stated to improve the research on prognostic models for prediction of outcome in COPD patients. Firstly, model development studies should adjust for overfitting by doing internal validation (mainly through non-random split or resampling techniques such as bootstrapping) and using shrinkage techniques and should provide an optimism adjusted performance. Secondly, model calibration should be examined. If a prognostic model has poor calibration, efforts should be made to improve its calibration by updating it either through recalibration or through addition of new variables. Thirdly, researchers should apply imputation techniques when data are missing, and they should report the full equation of the prognostic model to allow its external validation and update by independent research teams. Fourthly, continuous predictors should not be dichotomised, and potential non-linear association with the outcome should be examined using fractional polynomials or restricted cubic splines.88
The vast majority of prognostic models predicted the risk for mortality. Other clinically important outcomes, such as risk for exacerbation, a very common outcome in randomised clinical trials for COPD treatment, attracted much less attention. Also, the predictive ability of existing models focused on European and North American populations and could not be easily generalised. Thus, external validation studies of existing models in other populations are needed.
External validation studies are not sufficient to guarantee the clinical utility of a prediction model. To select a prediction model for implementation in clinical practice, impact studies are needed.13 These are randomised clinical trials applying a prognostic model in a clinical setting and assessing its clinical utility for decision making. However, we found only one impact study in the literature.101 This study concluded that the DECAF score, a prognostic model that was initially developed for patients admitted to hospital with an exacerbation to predict in-hospital mortality,102 is safe, clinically effective, and cost effective in the selection of COPD patients with an exacerbation who could be treated at home.101 103
Comparison with other studies
A previously published systematic review identified 15 prognostic models (either original models or updates of existing models) for stable COPD patients that were published up to September 2010.6 This systematic review mainly focused on the description of clinical characteristics of prognostic models—that is, population characteristics and predictors. In contrast, our systematic review included a much broader spectrum of COPD patients by additionally detecting prognostic models for COPD patients admitted to hospital and for those visiting the emergency department. As a consequence, we captured a total of 408 prognostic models from various clinical settings. Furthermore, we reported a detailed presentation of methodological characteristics in multivariable prognostic models for outcome prediction in COPD patients. We additionally did a meta-analysis for prognostic models with multiple external validation studies, and we assessed the risk of bias by using PROBAST.
Strengths and limitations of study
The major strength of our study is that it provides an overall mapping of the available research on prognostic models for outcome prediction in COPD patients. We collected all published prognostic models used to forecast any clinical outcome that may occur in the course of COPD. We presented a detailed description of the characteristics of the developed models, as well as updates and validation studies of existing models. Another important aspect of our paper is the critical appraisal of prognostic models in COPD by using the PROBAST tool. We also did a meta-analysis of C statistics for prognostic models that were externally validated in multiple independent populations.
A limitation of our study is the inability to do meta-analysis of calibration measures for prognostic models, owing to poor reporting of calibration in the validation studies. Also, we observed large between study heterogeneity in the meta-analyses of C statistics. Potential sources of heterogeneity could be the differences in clinical setting, patients’ characteristics, and time horizons across the validation studies, but we could not do meta-regression analyses or sensitivity analyses owing to the small number of external validation studies per prognostic model.92
Our paper constitutes a map of the research on multivariable prognostic models for outcome prediction in COPD patients, aiming to summarise their methodological characteristics, their calibration, and their performance. An abundance of prognostic models is available for patients with COPD, so deciding on which one to use in a specific setting or population can be challenging for healthcare professionals. Future prognostic research should steer towards recalibration or update of existing prognostic models with the addition of new predictors to enhance their prognostic performance. Studies updating existing models should sufficiently estimate optimism adjusted performance and calibration measures by applying appropriate internal validation and should adjust for overfitting by applying shrinkage techniques. Future studies should also use multiple imputation to handle missing data as well as examine non-linearity of continuous predictors.
Moreover, to ensure the generalisability of prognostic models, validation studies in populations with different characteristics, with regard to setting and inclusion criteria, are needed. Prognostic tools with good calibration and external validity should inform clinical practice and be recommended by guidelines once they have undergone impact studies that examine the effect of using the model for a specific outcome in clinical practice.
What is already known on this topic
Historically, spirometry and age have been identified as the most important prognostic indicators in chronic obstructive pulmonary disease (COPD)
Global Initiative for Chronic Obstructive Lung Disease guidelines recommended use of multivariable prediction models to assess prognosis, instead of single predictors such as spirometry or history of exacerbations
No systematic overview has been published to summarise and critically appraise all multivariable prognostic models for outcome prediction in COPD patients
What this study adds
More than 400 prognostic models for outcome prediction in COPD patients exist, but only a minority have been externally validated and most were characterised by major drawbacks in the statistical analysis
Applying PROBAST showed that ADO, B-AE-D, B-AE-D-C, extended ADO, updated ADO, updated BODE, and a model developed by Bertens et al were derived in studies assessed as being at low risk of bias
We thank Karel Moons for his constructive comments on the study design and the first draft of the manuscript.
Contributors: VB, LB, IT, and EE designed the study. VB and LB did the literature search and the data extraction and wrote the first draft of the manuscript. All the authors wrote the final version of the manuscript. EE accepts full responsibility for the work and conduct of the study, had access to the data, and controlled the decision to publish. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted. VB and EE are the guarantors.
Funding: VB and LB are supported by PhD scholarships funded by the Greek State Scholarships Foundation. No funding body has influenced data collection, analysis, or interpretation.
Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.
Ethical approval: Not needed.
Data sharing: Additional data for the eligible studies are available on request from the corresponding author at firstname.lastname@example.org.
Transparency: The corresponding author affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.