CC BY Open access
Research

Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal

BMJ 2020; 369 doi: https://doi.org/10.1136/bmj.m1328 (Published 07 April 2020) Cite this as: BMJ 2020;369:m1328

Linked Editorial

Prediction models for diagnosis and prognosis in covid-19

  1. Laure Wynants, assistant professor1 2,
  2. Ben Van Calster, associate professor2 3,
  3. Gary S Collins, professor4 5,
  4. Richard D Riley, professor6,
  5. Georg Heinze, associate professor7,
  6. Ewoud Schuit, assistant professor8 9,
  7. Marc M J Bonten, professor8 10,
  8. Darren L Dahly, principal statistician11 12,
  9. Johanna A A Damen, assistant professor8 9,
  10. Thomas P A Debray, assistant professor8 9,
  11. Valentijn M T de Jong, assistant professor8 9,
  12. Maarten De Vos, associate professor2 13,
  13. Paula Dhiman, research fellow4 5,
  14. Maria C Haller, medical doctor7 14,
  15. Michael O Harhay, assistant professor15 16,
  16. Liesbet Henckaerts, assistant professor17 18,
  17. Pauline Heus, doctoral candidate8 9,
  18. Nina Kreuzberger, research associate19,
  19. Anna Lohmann, researcher in training20,
  20. Kim Luijken, doctoral candidate20,
  21. Jie Ma, medical statistician5,
  22. Glen P Martin, lecturer21,
  23. Constanza L Andaur Navarro, doctoral student8 9,
  24. Johannes B Reitsma, associate professor8 9,
  25. Jamie C Sergeant, senior lecturer22 23,
  26. Chunhu Shi, research associate24,
  27. Nicole Skoetz, medical doctor19,
  28. Luc J M Smits, professor1,
  29. Kym I E Snell, lecturer6,
  30. Matthew Sperrin, senior lecturer25,
  31. René Spijker, information specialist8 9 26,
  32. Ewout W Steyerberg, professor3,
  33. Toshihiko Takada, assistant professor8,
  34. Ioanna Tzoulaki, assistant professor27 28,
  35. Sander M J van Kuijk, research fellow29,
  36. Florien S van Royen, research fellow8,
  37. Jan Y Verbakel, assistant professor30 31,
  38. Christine Wallisch, research fellow7 32 33,
  39. Jack Wilkinson, research fellow22,
  40. Robert Wolff, medical doctor34,
  41. Lotty Hooft, associate professor8 9,
  42. Karel G M Moons, professor8 9,
  43. Maarten van Smeden, assistant professor8
  1. 1Department of Epidemiology, CAPHRI Care and Public Health Research Institute, Maastricht University, Peter Debyeplein 1, 6229 HA Maastricht, Netherlands
  2. 2Department of Development and Regeneration, KU Leuven, Leuven, Belgium
  3. 3Department of Biomedical Data Sciences, Leiden University Medical Centre, Leiden, Netherlands
  4. 4Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Musculoskeletal Sciences, University of Oxford, Oxford, UK
  5. 5NIHR Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford, UK
  6. 6Centre for Prognosis Research, School of Primary, Community and Social Care, Keele University, Keele, UK
  7. 7Section for Clinical Biometrics, Centre for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, Austria
  8. 8Julius Center for Health Sciences and Primary Care, University Medical Centre Utrecht, Utrecht University, Utrecht, Netherlands
  9. 9Cochrane Netherlands, University Medical Centre Utrecht, Utrecht University, Utrecht, Netherlands
  10. 10Department of Medical Microbiology, University Medical Centre Utrecht, Utrecht, Netherlands
  11. 11HRB Clinical Research Facility, Cork, Ireland
  12. 12School of Public Health, University College Cork, Cork, Ireland
  13. 13Department of Electrical Engineering, ESAT Stadius, KU Leuven, Leuven, Belgium
  14. 14Ordensklinikum Linz, Hospital Elisabethinen, Department of Nephrology, Linz, Austria
  15. 15Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  16. 16Palliative and Advanced Illness Research Center and Division of Pulmonary and Critical Care Medicine, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  17. 17Department of Microbiology, Immunology and Transplantation, KU Leuven-University of Leuven, Leuven, Belgium
  18. 18Department of General Internal Medicine, KU Leuven-University Hospitals Leuven, Leuven, Belgium
  19. 19Evidence-Based Oncology, Department I of Internal Medicine and Centre for Integrated Oncology Aachen Bonn Cologne Dusseldorf, Faculty of Medicine and University Hospital Cologne, University of Cologne, Cologne, Germany
  20. 20Department of Clinical Epidemiology, Leiden University Medical Centre, Leiden, Netherlands
  21. 21Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, Manchester Academic Heath Science Centre, University of Manchester, Manchester, UK
  22. 22Centre for Biostatistics, University of Manchester, Manchester Academic Health Science Centre, Manchester, UK
  23. 23Centre for Epidemiology Versus Arthritis, Centre for Musculoskeletal Research, University of Manchester, Manchester Academic Health Science Centre, Manchester, UK
  24. 24Division of Nursing, Midwifery and Social Work, School of Health Sciences, University of Manchester, Manchester, UK
  25. 25Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK
  26. 26Amsterdam UMC, University of Amsterdam, Amsterdam Public Health, Medical Library, Netherlands
  27. 27Department of Epidemiology and Biostatistics, Imperial College London School of Public Health, London, UK
  28. 28Department of Hygiene and Epidemiology, University of Ioannina Medical School, Ioannina, Greece
  29. 29Department of Clinical Epidemiology and Medical Technology Assessment, Maastricht University Medical Centre+, Maastricht, Netherlands
  30. 30EPI-Centre, Department of Public Health and Primary Care, KU Leuven, Leuven, Belgium
  31. 31Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, UK
  32. 32Charité Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
  33. 33Berlin Institute of Health, Berlin, Germany
  34. 34Kleijnen Systematic Reviews, York, UK
  1. Correspondence to: L Wynants laure.wynants@maastrichtuniversity.nl
  • Accepted 31 March 2020
  • Final version accepted 1 July 2020

Abstract

Objective To review and appraise the validity and usefulness of published and preprint reports of prediction models for diagnosing coronavirus disease 2019 (covid-19) in patients with suspected infection, for prognosis of patients with covid-19, and for detecting people in the general population at increased risk of becoming infected with covid-19 or being admitted to hospital with the disease.

Design Living systematic review and critical appraisal by the COVID-PRECISE (Precise Risk Estimation to optimise covid-19 Care for Infected or Suspected patients in diverse sEttings) group.

Data sources PubMed and Embase through Ovid, arXiv, medRxiv, and bioRxiv up to 5 May 2020.

Study selection Studies that developed or validated a multivariable covid-19 related prediction model.

Data extraction At least two authors independently extracted data using the CHARMS (critical appraisal and data extraction for systematic reviews of prediction modelling studies) checklist; risk of bias was assessed using PROBAST (prediction model risk of bias assessment tool).

Results 14 217 titles were screened, and 107 studies describing 145 prediction models were included. The review identified four models for identifying people at risk in the general population; 91 diagnostic models for detecting covid-19 (60 based on medical imaging and nine for diagnosing disease severity); and 50 prognostic models for predicting mortality risk, progression to severe disease, intensive care unit admission, ventilation, intubation, or length of hospital stay. The most frequently reported predictors of diagnosis and prognosis of covid-19 were age, body temperature, lymphocyte count, and lung imaging features. Flu-like symptoms and neutrophil count were frequently predictive in diagnostic models, while comorbidities, sex, C reactive protein, and creatinine were frequently reported prognostic factors. C index estimates ranged from 0.73 to 0.81 in prediction models for the general population, from 0.65 to more than 0.99 in diagnostic models, and from 0.68 to 0.99 in prognostic models. All models were rated at high risk of bias, mostly because of non-representative selection of control patients, exclusion of patients who had not experienced the event of interest by the end of the study, high risk of model overfitting, and vague reporting. Most reports did not include a description of the study population or the intended use of the model, and calibration of the model predictions was rarely assessed.

Conclusion Prediction models for covid-19 are quickly entering the academic literature to support medical decision making at a time when they are urgently needed. This review indicates that proposed models are poorly reported, at high risk of bias, and their reported performance is probably optimistic. Hence, we do not recommend any of these reported prediction models for use in current practice. Immediate sharing of well documented individual participant data from covid-19 studies and collaboration are urgently needed to develop more rigorous prediction models, and validate promising ones. The predictors identified in included models should be considered as candidate predictors for new models. Methodological guidance should be followed because unreliable predictions could cause more harm than benefit in guiding clinical decisions. Finally, studies should adhere to the TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) reporting guideline.

Systematic review registration Protocol https://osf.io/ehc47/, registration https://osf.io/wy245.

Readers’ note This article is a living systematic review that will be updated to reflect emerging evidence. Updates may occur for up to two years from the date of original publication. This version is update 2 of the original article published on 7 April 2020 (BMJ 2020;369:m1328), and previous updates can be found as data supplements (https://www.bmj.com/content/369/bmj.m1328/related#datasupp).

Introduction

The novel coronavirus disease 2019 (covid-19) presents an important and urgent threat to global health. Since the outbreak in early December 2019 in the Hubei province of the People’s Republic of China, the number of patients confirmed to have the disease has exceeded 8 963 350 in 188 countries, and the number of people infected is probably much higher. More than 468 330 people have died from covid-19 (up to 22 June 2020).1 Despite public health responses aimed at containing the disease and delaying its spread, several countries have been confronted with a critical care crisis, and more countries could follow.234 Outbreaks lead to substantial increases in the demand for hospital beds and shortages of medical equipment, while medical staff themselves can also become infected.

To mitigate the burden on the healthcare system, while also providing the best possible care for patients, efficient diagnosis and information on the prognosis of the disease are needed. Prediction models that combine several variables or features to estimate the risk of people being infected or experiencing a poor outcome from the infection could assist medical staff in triaging patients when allocating limited healthcare resources. Models ranging from rule based scoring systems to advanced machine learning models (deep learning) have been proposed and published in response to a call to share relevant covid-19 research findings rapidly and openly to inform the public health response and help save lives.5 Many of these prediction models are published in open access repositories, ahead of peer review.

We aimed to systematically review and critically appraise all currently available prediction models for covid-19, in particular models to predict the risk of developing covid-19 or being admitted to hospital with covid-19, models to predict the presence of covid-19 in patients with suspected infection, and models to predict the prognosis or course of infection in patients with covid-19. We included model development and external validation studies. This living systematic review, with periodic updates, is being conducted by the COVID-PRECISE (Precise Risk Estimation to optimise covid-19 Care for Infected or Suspected patients in diverse sEttings) group in collaboration with the Cochrane Prognosis Methods Group.

Methods

We searched PubMed and Embase through Ovid, bioRxiv, medRxiv, and arXiv for research on covid-19 published after 3 January 2020. We used the publicly available publication list of the covid-19 living systematic review.6 This list contains studies on covid-19 published on PubMed and Embase through Ovid, bioRxiv, and medRxiv, and is continuously updated. We validated whether the list is fit for purpose (online supplementary material) and further supplemented it with studies on covid-19 retrieved from arXiv. The online supplementary material presents the search strings. Additionally, we contacted authors for studies that were not publicly available at the time of the search,78 and included studies that were publicly available but not on the living systematic review6 list at the time of our search.9101112

We searched databases repeatedly up to 5 May 2020 (supplementary table 1). All studies were considered, regardless of language or publication status (preprint or peer reviewed articles; updates of preprints are only included and reassessed after publication in a peer reviewed journal). We included studies if they developed or validated a multivariable model or scoring system, based on individual participant level data, to predict any covid-19 related outcome. We considered three types of prediction models: diagnostic models for predicting the presence or severity of covid-19 in patients with suspected infection; prognostic models for predicting the course of infection in patients with covid-19; and prediction models to identify people at increased risk of covid-19 in the general population. No restrictions were made on the setting (eg, inpatients, outpatients, or general population), prediction horizon (how far ahead the model predicts), included predictors, or outcomes. Epidemiological studies that aimed to model disease transmission or fatality rates, diagnostic test accuracy, and predictor finding studies were excluded. Starting with the second update, retrieved records were first screened by a text analysis tool based on artificial intelligence, prioritising sensitivity (supplementary material). Titles, abstracts, and full texts were screened for eligibility in duplicate by independent reviewers (pairs from LW, BVC, MvS) using EPPI-Reviewer,13 and discrepancies were resolved through discussion.

Data extraction of included articles was done by two independent reviewers (from LW, BVC, GSC, TPAD, MCH, GH, KGMM, RDR, ES, LJMS, EWS, KIES, CW, AL, JM, TT, JAAD, KL, JBR, LH, CS, MS, MCH, NS, NK, SMJvK, JCS, PD, CLAN, RW, GPM, IT, JYV, DLD, JW, FSvR, PH, VMTdJ, and MvS). Reviewers used a standardised data extraction form based on the CHARMS (critical appraisal and data extraction for systematic reviews of prediction modelling studies) checklist14 and PROBAST (prediction model risk of bias assessment tool) for assessing the reported prediction models.15 We sought to extract each model’s predictive performance by using whatever measures were presented. These measures included any summaries of discrimination (the extent to which predicted risks discriminate between participants with and without the outcome), and calibration (the extent to which predicted risks correspond to observed risks) as recommended in the TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) statement.16 Discrimination is often quantified by the C index (C index=1 if the model discriminates perfectly; C index=0.5 if discrimination is no better than chance). Calibration is often quantified by the calibration intercept (which is zero when the risks are not systematically overestimated or underestimated) and calibration slope (which is one if the predicted risks are not too extreme or too moderate).17 We focused on performance statistics as estimated from the strongest available form of validation (in order of strength: external (evaluation in an independent database), internal (bootstrap validation, cross validation, random training test splits, temporal splits), apparent (evaluation by using exactly the same data used for development)). Any discrepancies in data extraction were discussed between reviewers, and remaining conflicts were resolved by LW and MvS. The online supplementary material provides details on data extraction. We considered aspects of PRISMA (preferred reporting items for systematic reviews and meta-analyses)18 and TRIPOD16 in reporting our article.
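
To make these measures concrete, the following minimal sketch (in Python, assuming the scikit-learn and statsmodels packages, with hypothetical arrays y of observed binary outcomes and p of predicted risks) shows one common way to estimate the C index and the calibration intercept and slope on validation data. It illustrates the measures described above and is not the procedure used by any of the included studies.

    import numpy as np
    import statsmodels.api as sm
    from sklearn.metrics import roc_auc_score

    def validation_metrics(y, p):
        # y: observed binary outcomes (0/1); p: predicted risks from the model being validated
        p = np.clip(p, 1e-8, 1 - 1e-8)
        lp = np.log(p / (1 - p))           # linear predictor (log odds of the predicted risk)
        c_index = roc_auc_score(y, p)      # 1 = perfect discrimination, 0.5 = no better than chance
        # calibration intercept: logistic regression with the linear predictor as offset (target value 0)
        intercept = sm.GLM(y, np.ones_like(lp), family=sm.families.Binomial(), offset=lp).fit().params[0]
        # calibration slope: coefficient of the linear predictor (target value 1)
        slope = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit().params[1]
        return c_index, intercept, slope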

Patient and public involvement

It was not possible to involve patients or the public in the design, conduct, or reporting of our research. The study protocol and preliminary results are publicly available on https://osf.io/ehc47/ and medRxiv.

Results

We retrieved 14 209 titles through our systematic search (of which 9306 were included in the present update; supplementary table 1, fig 1). Two additional unpublished studies were made available on request (after a call on social media). We included a further six studies that were publicly available but were not detected by our search. Of 14 217 titles, 275 studies were retained for abstract and full text screening (of which 76 in the present update). One hundred seven studies describing 145 prediction models met the inclusion criteria (of which 56 papers and 79 models added in the present update, supplementary table 1).789101112192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119 These studies were selected for data extraction and critical appraisal (table 1, table 2, table 3, and table 4).

Fig 1 PRISMA (preferred reporting items for systematic reviews and meta-analyses) flowchart of study inclusions and exclusions

Table 1 Overview of prediction models for use in the general population

Table 2 Overview of prediction models for diagnosis of covid-19

Table 3 Overview of prediction models for prognosis of covid-19

Table 4 Risk of bias assessment (using PROBAST) based on four domains across 107 studies that created prediction models for coronavirus disease 2019

Primary datasets

Forty five studies used data on patients with covid-19 from China (supplementary table 2), six from Italy,323972747679 three from Brazil,6981109 three from France,7177110 three from the United States,96108112 two from South Korea,6380 one from Belgium,82 one from the Netherlands,95 one from the United Kingdom,75 one from Israel,67 one from Mexico,70 and one from Singapore.40 Twenty two studies used international data (supplementary table 2) and two studies used simulated data.3541 Three studies used proxy data to estimate covid-19 related risks (eg, Medicare claims data from 2015 to 2016).890113 Twelve studies were not clear on the origin of covid-19 data (supplementary table 2).

Based on 59 studies that reported study dates, data were collected between 8 December 2019 and 21 April 2020. Four studies reported median follow-up time (4.5, 8.4, 15, and 18 days),203783108 while another study reported a follow-up of at least five days.42 Some centres provided data to multiple studies, and several studies used open GitHub120 or Kaggle121 data repositories (version or date of access often unspecified), so it was unclear how much these datasets overlapped across our identified studies (supplementary table 2). One study25 developed prediction models for use in paediatric patients. The median age in studies on adults varied from 34 to 68 years, and the proportion of men varied from 35% to 75%, although this information was often not reported (supplementary table 2).

Among the studies that developed prognostic models to predict mortality risk in people with confirmed or suspected infection, the percentage of deaths varied between 1% and 59% (table 3). This wide variation is partly because of substantial sampling bias caused by studies excluding participants who still had the disease at the end of the study period (that is, they had neither recovered nor died).7212223449698100 Additionally, length of follow-up could have varied between studies (but was rarely reported), and there might be local and temporal variation in how people were diagnosed as having covid-19 or were admitted to the hospital (and therefore recruited for the studies). Among the diagnostic model studies, only nine reported on the prevalence of covid-19 and used a cross sectional or cohort design; the prevalence varied between 17% and 79% (table 2). Because 58 diagnostic studies used either case-control sampling or an unclear method of data collection, the prevalence in these diagnostic studies might not have been representative of their target population.

Table 1, table 2, and table 3 give an overview of the 145 prediction models reported in the 107 identified studies. Supplementary table 2 provides modelling details and box 1 discusses the availability of models in a format for use in clinical practice.

Box 1

Availability of models in format for use in clinical practice

Several studies presented their models in a format for use in clinical practice. However, because all models were at high risk of bias, we do not recommend their routine use before they are properly externally validated.

Models to predict risk of developing coronavirus disease 2019 (covid-19) or of hospital admission for covid-19 in general population

The “COVID-19 Vulnerability Index” to detect hospital admission for covid-19 pneumonia from other respiratory infections (eg, pneumonia, influenza) is available as an online tool.8122

Diagnostic models

Several sum scores,3195110117 and model equations81102 are available to support the diagnosis. Graphical diagnostic aids include nomograms4378117 and a decision tree.74 The “COVID-19 diagnosis aid” app is available on iOS and Android devices to diagnose covid-19 in asymptomatic patients and those with suspected disease.12 Additionally, online tools are available.10457495123124125 Classification in terms of disease severity can be done using a published equation.114 A decision tree to detect severe disease in paediatric patients with confirmed covid-19 is also available in an article.25

Diagnostic models based on images

Five artificial intelligence models to assist with diagnosis based on medical images are available through web applications.2427307391126127128129130 One model is deployed in 16 hospitals, but the authors do not provide any usable tools in their study.33 Two papers include a severity scoring system to classify patients based on images.5472

Prognostic models

To assist in the prognosis of mortality, a nomogram,7 a decision tree,22 a scoring system,70 online tools,80849698131132133134 and a computed tomography based scoring rule are available in the articles.23 Other online tools predict in-hospital death and the need for prolonged mechanical ventilation,113135 or in-hospital death and a composite of poor outcomes.116136 Additionally, nomograms,88119 sum scores,8388 and a model equation60 are available to predict progression to severe covid-19.

Several studies made their code available on GitHub.8113435384755656667687073869298101104105109 Seventy four studies did not include any usable equation, format, code, or reference for use or validation of their prediction model.

Models to predict risks of covid-19 in the general population

We identified four models that predicted risk of covid-19 in the general population. Three models from one study used hospital admission for non-tuberculosis pneumonia, influenza, acute bronchitis, or upper respiratory tract infections as proxy outcomes in a dataset without any patients with covid-19.8 Among the predictors were age, sex, previous hospital admissions, comorbidity data, and social determinants of health. The study reported C indices of 0.73, 0.81, and 0.81. A fourth model used deep learning on thermal videos from the faces of people wearing facemasks to determine abnormal breathing (not covid related) with a reported sensitivity of 80%.90

Diagnostic models to detect covid-19 in patients with suspected infection

We identified 22 multivariable models to diagnose covid-19. Most models targeted patients with suspected covid-19. Reported C index values ranged between 0.65 and 0.99. A few models also evaluated calibration and reported good results.6978117 The most frequently used diagnostic predictors (at least 10 times) were flu-like signs and symptoms (eg, shiver, fatigue), imaging features (eg, pneumonia signs on computed tomography scan), age, body temperature, lymphocyte count, and neutrophil count (table 2).

Nine studies aimed to diagnose severe disease in patients with covid-19: eight in adults with covid-19, with reported C indices between 0.80 and 0.99, and one in paediatric patients with reported perfect performance.25 Predictors of severe covid-19 used more than once were comorbidities, liver enzymes, C reactive protein, imaging features, and neutrophil count.

Sixty prediction models were proposed to support the diagnosis of covid-19 or covid-19 pneumonia (and some also to monitor progression) based on images. Most studies used computed tomography images or chest radiographs. Others used spectrograms of cough sounds53 and lung ultrasound.73 The predictive performance varied widely, with estimated C index values ranging from 0.81 to more than 0.99.

Prognostic models for patients with diagnosis of covid-19

We identified 50 prognostic models (table 3) for patients with a diagnosis of covid-19. The intended use of these models (that is, when to use them, and for whom) was often not clearly described. Prediction horizons varied between one and 30 days, but were often unspecified.

Of these models, 23 estimated mortality risk and eight aimed to predict progression to a severe or critical state (table 3). The remaining studies used other outcomes (single or as part of a composite) including recovery, length of hospital stay, intensive care unit admission, intubation, (duration of) mechanical ventilation, and acute respiratory distress syndrome. One study used data from 2015 to 2019 to predict mortality and prolonged assisted mechanical ventilation (as a non-covid-19 proxy outcome).113

The most frequently used prognostic factors (for any outcome, included at least 10 times) included comorbidities, age, sex, lymphocyte count, C reactive protein, body temperature, creatinine, and imaging features (table 3).

Studies that predicted mortality reported C indices between 0.68 and 0.98. Some studies also evaluated calibration.767116 When applied to new patients, the model by Xie et al yielded probabilities of mortality that were too high for low risk patients and too low for high risk patients (calibration slope >1), despite excellent discrimination.7 The mortality model by Zhang et al also showed miscalibrated (overfitted and underestimated) risks at external validation,116 and the model by Barda et al showed underfitting.67

The studies that developed models to predict progression to a severe or critical state reported C indices between 0.73 and 0.99. Three of these studies also reported good calibration, but this was evaluated internally (eg, bootstrapped)88 or in an unclear way.83119

Reported C indices for other outcomes varied between 0.72 and 0.96. Singh et al and Zhang et al also evaluated calibration externally (in new patients). Singh et al showed that the Epic Deterioration Index overestimated the risk of a poor outcome, while the poor outcome model by Zhang et al underestimated the risk of a poor outcome.108116

Risk of bias

All studies were at high risk of bias according to assessment with PROBAST (table 1, table 2, and table 3), which suggests that their predictive performance when used in practice is probably lower than that reported. Therefore, we have cause for concern that the predictions of the proposed models are unreliable when used in other people. Box 2 gives details on common causes for risk of bias for each type of model.

Box 2

Common causes of risk of bias in the reported prediction models

Models to predict coronavirus disease 2019 (covid-19) risk in general population

These models were based on proxy outcomes to predict covid-19 related risks, such as presence of or hospital admission due to severe respiratory disease, in the absence of data from patients with covid-19.890

Diagnostic models

Controls are probably not representative of the target population for a diagnostic model (eg, controls for a screening model had viral pneumonia).12414578102 The test used to determine the outcome varied between participants,124195 or one of the predictors (eg, fever) was part of the outcome definition.10

Diagnostic models based on medical imaging

Generally, studies did not clearly report which patients had imaging during clinical routine, and it was unclear whether the selection of controls was made from the target population (that is, patients with suspected covid-19). Often studies did not clearly report how regions of interest were annotated. Images were sometimes annotated by only one scorer without quality control.2628475255919293 Careful description of model specification and subsequent estimation was lacking, challenging the transparency and reproducibility of the models. Studies used different deep learning architectures, some established and others designed specifically for the task, without benchmarking the chosen architecture against alternatives.

Prognostic models

Study participants were often excluded because they had not developed the outcome by the end of the study period but were still in follow-up (that is, they were in hospital but had not yet recovered or died), yielding a highly selected study sample.7212223449698100 Additionally, only six studies accounted for censoring by using Cox regression2042708388 or competing risk models.62 Some studies used the last available predictor measurement from electronic health records (rather than measuring the predictor value at the time when the model was intended for use).2267100

Fifty three of the 107 studies had a high risk of bias for the participants domain (table 4), which indicates that the participants enrolled in the studies might not be representative of the models’ targeted populations. Unclear reporting on the inclusion of participants prohibited a risk of bias assessment in 26 studies. Fifteen of the 107 studies had a high risk of bias for the predictor domain, which indicates that predictors were not available at the models’ intended time of use, not clearly defined, or influenced by the outcome measurement. One diagnostic imaging study used a simple scoring rule and was scored at low risk of bias for the predictor domain. The diagnostic model studies that used medical images as predictors in artificial intelligence were all scored as unclear on the predictor domain. The publications often lacked clear information on the preprocessing steps (eg, cropping of images). Moreover, complex machine learning algorithms transform images into predictors in a complex way, which makes it challenging to fully apply the PROBAST predictors section to such imaging studies. Most studies used outcomes that are easy to assess (eg, death, presence of covid-19 by laboratory confirmation). Nonetheless, there was cause for concern about bias induced by the outcome measurement in 19 studies, for example due to the use of subjective or proxy outcomes (eg, non-covid-19 severe respiratory infections).

All but one of these studies50 were at high risk of bias for the analysis domain (table 4). Many studies had small sample sizes (table 1, table 2, table 3), which led to an increased risk of overfitting, particularly if complex modelling strategies were used. Three studies did not report the predictive performance of the developed model, and four studies reported only the apparent performance (the performance with exactly the same data used to develop the model, without adjustment for optimism owing to potential overfitting). Only 13 studies assessed calibration,71222435067697883108116117119 but the method to check calibration was probably suboptimal in two studies.12119

Twenty five models were developed and externally validated in the same study (in an independent dataset, excluding random training test splits and temporal splits).71226424351525967778183849195100102110112113116119 However, in 11 of these models, the datasets used for the external validation were likely not representative of the target population,71226425991100102116 and in one study, data from before the covid-19 crisis were used.113 Consequently, predictive performance could differ if the models are applied in the targeted population. In one study, commonly used performance statistics for prognosis (discrimination, calibration) were not reported.42 Gozes,52 Fu,51 Chassagnon,77 Hu,84 Kurstjens,95 and Vaid112 had satisfactory predictive performance on an external validation set, but it is unclear how the data for the external validation were collected (eg, whether the patients were consecutive), and whether they are representative. Wang,43 Barda,67 Guo,83 Tordjman,110 and Gong119 obtained satisfactory discrimination on probably unbiased validation datasets, but each of these had fewer than the recommended number of events for external validation (100).137138 Diaz-Quijano externally validated a diagnostic model in a large registry with reasonable discrimination, but many patients had to be excluded because no polymerase chain reaction (PCR) testing was performed.81

One study presented a small external validation (27 participants) that reported satisfactory predictive performance of a model originally developed for avian influenza H7N9 pneumonia. However, patients who had not recovered at the end of the study period were excluded, which again led to selection bias.23 Another study was a small scale external validation study (78 participants) of an existing severity score for lung computed tomography images with satisfactory reported discrimination.54 Three studies validated existing early warning or severity scores to predict in-hospital mortality or deterioration.8596108 These had satisfactory discrimination, but had fewer than the recommended number of events for validation137138 or unclear sample sizes, excluded patients who remained in hospital at the end of the study period, or had an unclear study design.

Discussion

In this systematic review of prediction models related to the covid-19 pandemic, we identified and critically appraised 107 studies that described 145 models. These prediction models can be divided into three categories: models for the general population to predict the risk of having covid-19 or being admitted to hospital for covid-19; models to support the diagnosis of covid-19 in patients with suspected infection; and models to support the prognostication of patients with covid-19. All models reported moderate to excellent predictive performance, but all were appraised to have high risk of bias owing to a combination of poor reporting and poor methodological conduct for participant selection, predictor description, and statistical methods used. Models were developed on data from different countries, but the majority used data from China or public international data repositories. With few exceptions, the available sample sizes and number of events for the outcomes of interest were limited. This is a well known problem when building prediction models and increases the risk of overfitting the model.139 A high risk of bias implies that the performance of these models in new samples will probably be worse than that reported by the researchers. Therefore, the estimated C indices, often close to 1 and indicating near perfect discrimination, are probably optimistic. The majority of studies developed new models, only 27 carried out an external validation, and calibration was rarely assessed.

We reviewed 57 studies that used advanced machine learning methodology on medical images to diagnose covid-19, covid-19 related pneumonia, or to assist in segmentation of lung images. The predictive performance measures showed a high to almost perfect ability to identify covid-19, although these models and their evaluations also had a high risk of bias, notably because of poor reporting and an artificial mix of patients with and without covid-19. Therefore, we do not recommend any of the 145 identified prediction models to be used in practice.

Challenges and opportunities

The main aim of prediction models is to support medical decision making. Therefore, it is vital to identify a target population in which predictions serve a clinical need, and a representative dataset (preferably comprising consecutive patients) on which the prediction model can be developed and validated. This target population must also be carefully described so that the performance of the developed or validated model can be appraised in context, and users know which people the model applies to when making predictions. Unfortunately, the studies included in our systematic review often lacked an adequate description of the study population, which leaves users of these models in doubt about the models’ applicability. Although we recognise that all studies were done under severe time constraints, we recommend that any studies currently in preprint and all future studies should adhere to the TRIPOD reporting guideline16 to improve the description of their study population and their modelling choices. TRIPOD translations (eg, in Chinese and Japanese) are also available at https://www.tripod-statement.org.

A better description of the study population could also help us understand the observed variability in the reported outcomes across studies, such as covid-19 related mortality and covid-19 prevalence. The variability in prevalence could partly reflect different diagnostic standards across studies. Note that the majority of diagnostic models use viral nucleic acid test results as the gold standard, even though these tests may have unacceptable false negative rates.

Covid-19 prediction problems will often not present as a simple binary classification task. Complexities in the data should be handled appropriately. For example, a prediction horizon should be specified for prognostic outcomes (eg, 30 day mortality). If study participants have neither recovered nor died within that time period, their data should not be excluded from analysis, which most reviewed studies have done. Instead, an appropriate time to event analysis should be considered to allow for administrative censoring.17 Censoring for other reasons, for instance because of quick recovery and loss to follow-up of patients who are no longer at risk of death from covid-19, could necessitate analysis in a competing risk framework.140
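
To illustrate the recommended approach, the following minimal sketch (in Python, assuming the lifelines package and a small hypothetical dataset) fits a Cox model in which patients still in follow-up at the end of the study are censored rather than excluded. It is an illustration only, not an analysis of any of the reviewed datasets.

    import pandas as pd
    from lifelines import CoxPHFitter

    # Hypothetical data: patients still in hospital at the study end date are censored, not excluded
    df = pd.DataFrame({
        "time": [5, 12, 30, 8, 30, 21],   # days from admission to death or to administrative censoring
        "died": [1, 0, 0, 1, 0, 1],       # 1 = died, 0 = censored (still in follow-up or discharged alive)
        "age":  [72, 54, 63, 80, 45, 69],
        "crp":  [110, 35, 60, 150, 20, 95],
    })

    # Small ridge penalty only for numerical stability on these toy data
    cph = CoxPHFitter(penalizer=0.1).fit(df, duration_col="time", event_col="died")

    # Predicted 30 day mortality risk for the same (or new) patients, respecting the prediction horizon
    risk_30d = 1 - cph.predict_survival_function(df[["age", "crp"]], times=[30]).loc[30]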

A prediction model applied in a new healthcare setting or country often produces predictions that are miscalibrated141 and might need to be updated before it can safely be applied in that new setting.17 This requires data from patients with covid-19 to be available from that system. Rather than developing and updating prediction models in each local setting separately, sharing individual participant data from multiple countries and healthcare systems might allow better understanding of the generalisability and implementation of prediction models across different settings and populations. This approach could greatly improve the applicability and robustness of prediction models in routine care.142143144145146
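
As an illustration of such updating, the following minimal sketch (in Python, assuming the statsmodels package; p_original and y_local denote hypothetical arrays of the existing model's predicted risks and the observed outcomes in the new setting) performs a simple logistic recalibration, one common way to adjust a model before local use.

    import numpy as np
    import statsmodels.api as sm

    def recalibrate(p_original, y_local):
        # p_original: risks predicted by the existing model for patients in the new setting
        # y_local: observed binary outcomes (0/1) for those patients
        p = np.clip(p_original, 1e-8, 1 - 1e-8)
        lp = np.log(p / (1 - p))                          # linear predictor of the existing model
        # Logistic recalibration: re-estimate the intercept and slope on local data,
        # keeping the relative weights of the original predictors unchanged
        recal = sm.GLM(y_local, sm.add_constant(lp), family=sm.families.Binomial()).fit()
        return recal.predict(sm.add_constant(lp))         # recalibrated risks for the local setting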

The evidence base for the development and validation of prediction models related to covid-19 will quickly increase over the coming months. Together with the increasing evidence from predictor finding studies147148149150151152153 and open peer review initiatives for covid-19 related publications,154 data registries120121155156157 are being set up. To maximise the new opportunities and to facilitate individual participant data meta-analyses, the World Health Organization has released a new data platform to encourage sharing of anonymised covid-19 clinical data.158 To leverage the full potential of these evolutions, international and interdisciplinary collaboration in terms of data acquisition, model building and validation is crucial.

Study limitations

With new publications on covid-19 related prediction models rapidly entering the medical literature, this systematic review cannot be viewed as an up-to-date list of all currently available covid-19 related prediction models. Also, 87 of the studies we reviewed were only available as preprints. These studies might improve after peer review, when they enter the official medical literature; we will reassess these peer reviewed publications in future updates. We also found other prediction models that are currently being used in clinical practice without scientific publications,159 and web risk calculators launched for use while the scientific manuscript is still under review (and unavailable on request). These unpublished models naturally fall outside the scope of this review of the literature.160 As we have argued extensively elsewhere,161 transparent reporting that enables validation by independent researchers is key for predictive analytics, and clinical guidelines should only recommend publicly available and verifiable algorithms.

Implications for practice

All 145 reviewed prediction models were found to have a high risk of bias, and evidence from independent external validation of the newly developed models is currently lacking. However, the urgency of diagnostic and prognostic models to assist in quick and efficient triage of patients in the covid-19 pandemic might encourage clinicians and policymakers to prematurely implement prediction models without sufficient documentation and validation. Earlier studies have shown that models were of limited use in the context of a pandemic,162 and they could even cause more harm than good.163 Therefore, we cannot recommend any model for use in practice at this point.

The current oversupply of insufficiently validated models is not useful for clinical practice. Future studies should focus on validating, comparing, improving, and updating promising available prediction models, rather than developing new ones.17 For example, Diaz-Quijano developed and externally validated a diagnostic model using Brazilian surveillance data with reasonable discrimination, but many patients had to be excluded because no PCR testing was performed, hence this model needs further validation.17 Two other models to diagnose covid-19 also showed promising discrimination at external validation in small unselected cohorts.43110 An externally validated model that used computed tomography based total severity scores showed good discrimination between patients with mild, common, and severe-critical disease.54 Two models to predict progression to severe covid-19 within two weeks showed promising discrimination when validated externally on unselected cohorts.83119 Another model discriminated well between survivors and non-survivors among confirmed cases, but the prediction horizon was not specified, and the study had many missing values for key parameters.67 Because reporting in each of these studies was insufficiently detailed and the validation was in datasets with fewer than 100 events in the smallest outcome category, validation in larger, international datasets is needed. Such external validations should assess not only discrimination, but also calibration and clinical utility (net benefit).141146163 Owing to differences between healthcare systems (eg, Chinese and European) in when patients are admitted to and discharged from hospital, as well as the testing criteria for patients with suspected covid-19, we anticipate most existing models will be miscalibrated, but this can usually be solved by updating and adjustment to the local setting.
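
To illustrate the net benefit measure referred to above, the following minimal sketch (in Python; y and p denote hypothetical arrays of observed outcomes and predicted risks in a validation sample) computes net benefit at a single decision threshold, as used in decision curve analysis.

    import numpy as np

    def net_benefit(y, p, threshold):
        # Net benefit of treating (or triaging) patients whose predicted risk meets the threshold,
        # to be compared against treating no one (net benefit 0) and treating everyone
        n = len(y)
        treat = p >= threshold
        tp = np.sum(treat & (y == 1))   # true positives: flagged and experienced the outcome
        fp = np.sum(treat & (y == 0))   # false positives: flagged unnecessarily
        return tp / n - fp / n * threshold / (1 - threshold)

    def net_benefit_treat_all(y, threshold):
        # "Treat all" comparator at the same threshold
        prevalence = np.mean(y)
        return prevalence - (1 - prevalence) * threshold / (1 - threshold)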

When creating a new prediction model, we recommend building on previous literature and expert opinion to select predictors, rather than selecting predictors in a purely data driven way.17 This is especially important for datasets with limited sample size.164 Based on the predictors included in multiple models identified by our review, we encourage researchers to consider incorporating several candidate predictors. Common predictors include age, body temperature, lymphocyte count, and lung imaging features. Flu-like signs and symptoms and neutrophil count are frequently predictive in diagnostic models, while comorbidities, sex, C reactive protein, and creatinine are frequently reported prognostic factors. By pointing to the most important methodological challenges and issues in design and reporting of the currently available models, we hope to have provided a useful starting point for further studies aiming to develop new models, or to validate and update existing ones.

This living systematic review has been conducted in collaboration with the Cochrane Prognosis Methods Group. We will update this review and appraisal continuously to provide up-to-date information for healthcare decision makers and professionals as more international research emerges over time.

Conclusion

Several diagnostic and prognostic models for covid-19 are currently available and they all report moderate to excellent discrimination. However, these models are all at high risk of bias, mainly because of non-representative selection of control patients, exclusion of patients who had not experienced the event of interest by the end of the study, and model overfitting. Therefore, their performance estimates are probably optimistic and misleading. The COVID-PRECISE group does not recommend any of the current prediction models to be used in practice. Future studies aimed at developing and validating diagnostic or prognostic models for covid-19 should explicitly address the concerns raised. Sharing data and expertise for the validation and updating of covid-19 related prediction models is urgently needed.

What is already known on this topic

  • The sharp recent increase in coronavirus disease 2019 (covid-19) incidence has put a strain on healthcare systems worldwide; an urgent need exists for efficient early detection of covid-19 in the general population, for diagnosis of covid-19 in patients with suspected disease, and for prognosis of covid-19 in patients with confirmed disease

  • Viral nucleic acid testing and chest computed tomography imaging are standard methods for diagnosing covid-19, but are time consuming

  • Earlier reports suggest that elderly patients, patients with comorbidities (chronic obstructive pulmonary disease, cardiovascular disease, hypertension), and patients presenting with dyspnoea are vulnerable to more severe morbidity and mortality after infection

What this study adds

  • Four models identified patients at risk in the general population (using proxy outcomes for covid-19)

  • Ninety one diagnostic models were identified for detecting covid-19 (60 based on medical images; nine for severity classification), and 50 prognostic models for predicting, among other outcomes, mortality risk and progression to severe disease

  • Proposed models are poorly reported and at high risk of bias, raising concern that their predictions could be unreliable when applied in daily practice

Acknowledgments

We thank the authors who made their work available by posting it on public registries or sharing it confidentially. A preprint version of the study is publicly available on medRxiv.

Footnotes

  • Contributors: LW conceived the study. LW and MvS designed the study. LW, MvS, and BVC screened titles and abstracts for inclusion. LW, BVC, GSC, TPAD, MCH, GH, KGMM, RDR, ES, LJMS, EWS, KIES, CW, JAAD, PD, MCH, NK, AL, KL, JM, CLAN, JBR, JCS, CS, NS, MS, RS, TT, SMJvK, FSvR, LH, RW, GPM, IT, JYV, DLD, JW, FSvR, PH, VMTdJ, and MvS extracted and analysed data. MDV helped interpret the findings on deep learning studies and MMJB, LH, and MCH assisted in the interpretation from a clinical viewpoint. RS and FSvR offered technical and administrative support. LW and MvS wrote the first draft, which all authors revised for critical content. All authors approved the final manuscript. LW and MvS are the guarantors. The guarantors had full access to all the data in the study, take responsibility for the integrity of the data and the accuracy of the data analysis, and had final responsibility for the decision to submit for publication. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

  • Funding: LW, BVC, LH, and MDV acknowledge specific funding for this work from Internal Funds KU Leuven, KOOR, and the COVID-19 Fund. LW is a postdoctoral fellow of Research Foundation-Flanders (FWO). BVC received support from FWO (grant G0B4716N) and Internal Funds KU Leuven (grant C24/15/037). TPAD acknowledges financial support from the Netherlands Organisation for Health Research and Development (grant 91617050). VMTdJ was supported by the European Union Horizon 2020 Research and Innovation Programme under ReCoDID grant agreement 825746. KGMM and JAAD acknowledge financial support from Cochrane Collaboration (SMF 2018). KIES is funded by the National Institute for Health Research (NIHR) School for Primary Care Research. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR, or the Department of Health and Social Care. GSC was supported by the NIHR Biomedical Research Centre, Oxford, and Cancer Research UK (programme grant C49297/A27294). JM was supported by the Cancer Research UK (programme grant C49297/A27294). PD was supported by the NIHR Biomedical Research Centre, Oxford. MOH is supported by the National Heart, Lung, and Blood Institute of the United States National Institutes of Health (grant R00 HL141678). The funders played no role in study design, data collection, data analysis, data interpretation, or reporting.

  • Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: support from Internal Funds KU Leuven, KOOR, and the COVID-19 Fund for the submitted work; no competing interests with regards to the submitted work; LW discloses support from Research Foundation-Flanders; RDR reports personal fees as a statistics editor for The BMJ (since 2009), consultancy fees for Roche for giving meta-analysis teaching and advice in October 2018, and personal fees for delivering in-house training courses at Barts and the London School of Medicine and Dentistry, and the Universities of Aberdeen, Exeter, and Leeds, all outside the submitted work; MS coauthored the editorial on the original article.

  • Ethical approval: Not required.

  • Data sharing: The study protocol is available online at https://osf.io/ehc47/. Most included studies are publicly available. Additional data are available upon reasonable request.

  • The lead authors affirm that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained.

  • Dissemination to participants and related patient and public communities: The study protocol is available online at https://osf.io/ehc47/.

This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/4.0/.

References
