Living risk prediction algorithm (QCOVID) for risk of hospital admission and mortality from coronavirus 19 in adults: national derivation and validation cohort study

Abstract Objective To derive and validate a risk prediction algorithm to estimate hospital admission and mortality outcomes from coronavirus disease 2019 (covid-19) in adults. Design Population based cohort study. Setting and participants QResearch database, comprising 1205 general practices in England with linkage to covid-19 test results, Hospital Episode Statistics, and death registry data. 6.08 million adults aged 19-100 years were included in the derivation dataset and 2.17 million in the validation dataset. The derivation and first validation cohort period was 24 January 2020 to 30 April 2020. The second temporal validation cohort covered the period 1 May 2020 to 30 June 2020. Main outcome measures The primary outcome was time to death from covid-19, defined as death due to confirmed or suspected covid-19 as per the death certification or death occurring in a person with confirmed severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection in the period 24 January to 30 April 2020. The secondary outcome was time to hospital admission with confirmed SARS-CoV-2 infection. Models were fitted in the derivation cohort to derive risk equations using a range of predictor variables. Performance, including measures of discrimination and calibration, was evaluated in each validation time period. Results 4384 deaths from covid-19 occurred in the derivation cohort during follow-up and 1722 in the first validation cohort period and 621 in the second validation cohort period. The final risk algorithms included age, ethnicity, deprivation, body mass index, and a range of comorbidities. The algorithm had good calibration in the first validation cohort. For deaths from covid-19 in men, it explained 73.1% (95% confidence interval 71.9% to 74.3%) of the variation in time to death (R2); the D statistic was 3.37 (95% confidence interval 3.27 to 3.47), and Harrell’s C was 0.928 (0.919 to 0.938). 
Similar results were obtained for women, for both outcomes, and in both time periods. In the top 5% of patients with the highest predicted risks of death, the sensitivity for identifying deaths within 97 days was 75.7%. People in the top 20% of predicted risk of death accounted for 94% of all deaths from covid-19. Conclusion The QCOVID population based risk algorithm performed well, showing very high levels of discrimination for deaths and hospital admissions due to covid-19. The absolute risks presented, however, will change over time in line with the prevailing SARS-CoV-2 infection rate and the extent of social distancing measures in place, so they should be interpreted with caution. The model can be recalibrated for different time periods and has the potential to be dynamically updated as the pandemic evolves.



Introduction
The first cases of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection were reported in the UK on 24 January 2020, with the first death from coronavirus disease 2019 (covid-19) on 28 February 2020. As of 18 August 2020, more than 41 000 deaths from covid-19 had occurred in the UK and more than 773 000 deaths globally. 1 In the initial absence of any vaccination or prophylactic or curative treatments, the UK government implemented social distancing and shielding measures to suppress the rate of infection and protect vulnerable people, thereby trying to minimise the risk of serious adverse outcomes. 2 3
doi: 10.1136/bmj.m3731 | BMJ 2020;371:m3731 | the bmj
Emerging evidence throughout the course of the pandemic, initially from case series and then from cohorts of patients with confirmed SARS-CoV-2 infection, has shown associations of age, sex, certain comorbidities, ethnicity, and obesity with adverse covid-19 outcomes such as hospital admission or death. [4][5][6][7][8][9][10][11] The knowledge base regarding risk factors for severe covid-19 is growing. As many countries are cautiously attempting to ease "lockdown" measures or reintroduce measures if rates are rising, an opportunity exists to develop more nuanced guidance based on predictive algorithms to inform risk management decisions. 12 Better knowledge of individuals' risks could also help to guide decisions on mitigating occupational exposure and in targeting of vaccines to those most at risk. Although some prediction models have been developed, a recent systematic review found that they all have a high risk of bias and that their reported performance is optimistic. 13
The use of primary care datasets with linkage to registries such as death records, hospital admissions data, and covid-19 testing results represents a novel approach to clinical risk prediction modelling for covid-19.
It provides accurately coded, individual level data for very large numbers of people representative of the national population. This approach draws on the rich phenotyping of individuals with demographic, medical, and pharmacological predictors to allow robust statistical modelling and evaluation. Such linked datasets have an established track record for the development and evaluation of established clinical risk models, including those for cardiovascular disease, diabetes, and mortality. [14][15][16] We aimed to develop and validate population based prediction models to estimate the risks of becoming infected with and subsequently dying from covid-19 and of becoming infected and subsequently admitted to hospital with covid-19. The model we have developed is designed to be applied across the adult population so that it can be used to enable risk stratification for public health purposes in the event of a "second wave" of the pandemic, to support shared management of risk and occupational exposure, and in early targeting of vaccines to people most at risk. An ongoing companion study will externally validate the models, using datasets across all four nations of the UK, and will be reported separately.

Methods
This study was commissioned by the Chief Medical Officer for England on behalf of the UK Government, who asked the New and Emerging Respiratory Virus Threats Advisory Group (NERVTAG) to establish whether a clinical risk prediction model for covid-19 could be developed in line with the emerging evidence. The protocol has been published. 17 The study was conducted in adherence with TRIPOD 18 and RECORD 19 guidelines and with input from our patient advisory group.

Study design and data sources
We did a cohort study of primary care patients using the QResearch database (version 44). QResearch was established in 2002 and has been extensively used for the development of risk prediction algorithms across the National Health Service (NHS) and for epidemiological research. By April 2020, 1205 practices in England were contributing to QResearch, covering a population of 10.5 million patients. The database is linked at individual patient level, using a project specific pseudonymised NHS number, to hospital admissions data (including intensive care unit data), positive results from covid-19 real time reverse transcriptase polymerase chain reaction tests held by Public Health England, cancer registrations (including detailed radiotherapy and systemic chemotherapy records), the national covid-19 shielded patient list in England, and mortality records held by NHS Digital.
We identified a cohort of people aged 19-100 years registered with participating general practices in England on 24 January 2020. We excluded patients (approximately 0.1%) who did not have a valid NHS number. Patients entered the cohort on 24 January 2020 (date of first confirmed case of covid-19 in the UK) and were followed up until they had the outcome of interest or the end of the first study period (30 April 2020), which was the date up to which linked data were available at the time of the derivation of the model, or the second time period (1 May 2020 until 30 June 2020) for the temporal cohort validation.

Outcomes
The primary outcome was time to death from covid-19 (either in hospital or outside hospital), defined as confirmed or suspected death from covid-19 as per the death certification or death occurring in an individual with confirmed SARS-CoV-2 infection at any time in the period 24 January to 30 April 2020. The secondary outcome was time to hospital admission with covid-19, defined as an ICD-10 (International Classification of Diseases, 10th revision) code for either confirmed or suspected covid-19 or new hospital admission associated with a confirmed SARS-CoV-2 infection in the study period.

Predictor variables
We selected candidate predictor variables on the basis of existing clinical vulnerability group criteria (table 1), associations with outcomes in other respiratory diseases, or because they were hypothesised to be linked to adverse outcomes on grounds of clinical or biological plausibility and were likely to be available for implementation. They are summarised in box 1 and supplementary box A. We defined variables according to information recorded using Read Codes in general practices' electronic health records at the start of the study period. The exception to this was information on chemotherapy, radiotherapy, and transplants, which was based on linked hospital records.

QCOVID model development
We randomly allocated 75% of practices to the derivation dataset, which we used to develop the models. We evaluated the models' performance in the remaining 25% of practices (the validation set).
All models were fitted separately in men and women. The outcomes of interest are subject to competing risks. For the primary outcome of death from covid-19, the competing risk is death due to other causes. For the secondary outcome of hospital admission, the competing risk is death from any cause before admission. We fitted a sub-distribution hazard (Fine and Gray 21 ) model for each outcome to account for competing risks. Individuals who did not have the outcome of interest were censored at the study end date, including those who had a competing event.
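The censoring rule described above can be made concrete with a minimal Python sketch. When the only censoring is administrative, people who have a competing event stay in the risk set until the study end date rather than leaving it at the time of that event. The function name, the event coding (0 censored, 1 outcome of interest, 2 competing event), and the toy data are assumptions made for illustration and are not taken from the study code.

```python
from typing import List, Tuple


def subdistribution_data(
    times: List[float], events: List[int], study_end: float
) -> List[Tuple[float, int]]:
    """Recode follow-up for a sub-distribution hazard analysis.

    events uses a hypothetical coding: 0 = censored, 1 = outcome of
    interest, 2 = competing event. People with a competing event are
    kept in the risk set until study_end and censored there.
    """
    recoded = []
    for t, e in zip(times, events):
        if e == 1:
            recoded.append((t, 1))  # outcome of interest at time t
        elif e == 2:
            recoded.append((study_end, 0))  # competing event: censor at study end
        else:
            recoded.append((min(t, study_end), 0))  # ordinary censoring
    return recoded


# Toy data: days since cohort entry, with a 97 day study period
toy = subdistribution_data([10, 40, 97, 60], [1, 2, 0, 0], study_end=97)
```

The recoded pairs could then be passed to any routine that fits a proportional hazards model; the choice of fitting software is left open here.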
For all predictor variables, we used the most recently available value at the entry date (24 January 2020). We used second degree fractional polynomials to model non-linear relations for continuous variables (age, body mass index, and Townsend material deprivation score, an area level score based on postcode 20 ). Initially, we fitted a complete case analysis by using a model within the derivation data to derive the fractional polynomial terms. For indicators of comorbidities and medication use, we assumed the absence of recorded information to mean absence of the factor in question. Data were missing in four variables: ethnicity, Townsend score, body mass index, and smoking status. We used multiple imputation with chained equations under the missing at random assumption to replace missing values for these variables. For computational efficiency, we used a combined imputation model for both outcomes. The imputation model was fitted in the derivation data and included predictor variables, the Nelson-Aalen estimators of the baseline cumulative sub-distribution hazard, and the outcome indicators (death from covid-19 and hospital admission with covid-19). We carried out five imputations. Each analysis model was fitted in each of the five imputed datasets. We used Rubin's rules to combine the model parameter estimates and the baseline cumulative incidence estimates across the imputed datasets.
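As an illustration of the pooling step, the following Python sketch applies Rubin's rules to one coefficient estimated in several imputed datasets. The pooled estimate is the mean of the per-imputation estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the example numbers are hypothetical and are not results from the study.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(
    estimates: Sequence[float], std_errors: Sequence[float]
) -> Tuple[float, float]:
    """Combine one parameter across m imputed datasets with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # mean within-imputation variance
    between = variance(estimates)  # sample variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical log hazard ratios and standard errors from five imputations
pooled_estimate, pooled_se = pool_rubin(
    [0.50, 0.52, 0.48, 0.51, 0.49], [0.10, 0.10, 0.10, 0.10, 0.10]
)
```

The pooled standard error is slightly larger than any single imputation's standard error, which reflects the extra uncertainty arising from the missing values.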
We initially sought to fit models using all predictor variables. Owing to sparse cells, some conditions were combined if clinically similar in nature (such as rare neurological disorders). We examined interactions between body mass index and ethnicity and interactions between predictor variables and age, focusing on predictor variables that apply across the age range (asthma, epilepsy, diabetes, severe mental illness). We explored the use of penalised models (LASSO) to screen variables for inclusion, but this retained all the predictor variables and most interaction terms. 17 In line with the protocol, we subsequently removed a small number of variables with low numbers of events and adjusted (sub-distribution) hazard ratios close to 1 (as these will have minimal effect on predicted risks) or with uncertain clinical credibility, defined as counterintuitive results in light of the emerging literature. Lastly, we combined regression coefficients from the final models with estimates of the baseline cumulative incidence function evaluated at 97 days to derive risk equations for each outcome. We used all the available data in the database.
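The final step can be sketched as follows, assuming the usual proportional sub-distribution hazards form, in which the predicted risk at time t is 1 - (1 - CIF0(t))^exp(lp), where CIF0(t) is the baseline cumulative incidence and lp is the linear predictor. The coefficient, the baseline value, and the variable name below are invented for illustration, and centring of covariates and the fractional polynomial terms are omitted.

```python
import math
from typing import Dict


def predicted_risk(
    baseline_cif: float, coefs: Dict[str, float], x: Dict[str, float]
) -> float:
    """Predicted absolute risk from a baseline cumulative incidence.

    baseline_cif is the baseline cumulative incidence at the chosen
    horizon (97 days in the text); coefs are regression coefficients
    (log sub-distribution hazard ratios); x holds covariate values.
    """
    lp = sum(coefs[name] * x.get(name, 0.0) for name in coefs)  # linear predictor
    return 1.0 - (1.0 - baseline_cif) ** math.exp(lp)


# Hypothetical baseline risk of 0.05% and a hazard ratio of 2 for one factor
example_coefs = {"type2_diabetes": math.log(2.0)}
risk_without = predicted_risk(0.0005, example_coefs, {"type2_diabetes": 0.0})
risk_with = predicted_risk(0.0005, example_coefs, {"type2_diabetes": 1.0})
```

When the baseline risk is small, the predicted risk is close to the baseline risk multiplied by the hazard ratio, which is why hazard ratios near 1 have little effect on predicted risks.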

Model evaluation
We did all model evaluation using the validation data with two separate periods of follow-up. The first validation study period was the same as for the derivation cohort: 24 January to 30 April 2020. The second temporal validation covered the subsequent period of 1 May 2020 to 30 June 2020. This was carried out with the same validation cohort except for exclusion of patients who died during 24 January to 30 April 2020. In the validation cohort, we fitted an imputation model to replace missing values for ethnicity, body mass index, Townsend score, and smoking status. This excluded the outcome indicators and Nelson-Aalen terms, as the aim was to use covariate data to obtain a prediction as if the outcome had not been observed to reflect intended use.
We applied the final risk equations developed from the derivation dataset to men and women in the validation dataset and evaluated R2 values (the proportion of variation in time to the outcome explained by the model), Brier scores, and measures of discrimination and calibration for the two time periods. 22 For Brier scores, lower values indicate better accuracy. 25 D statistics (a discrimination measure that quantifies the separation in survival between patients with different levels of predicted risks) and Harrell's C statistics (a discrimination metric that quantifies the extent to which people with higher risk scores have earlier events) were evaluated at 97 days (the maximum follow-up period available at the time of the derivation of the model) and at 60 days for the second temporal validation, with corresponding 95% confidence intervals. 26
We assessed model calibration by comparing mean predicted risks with observed risks by twentieths of predicted risk for each of the validation cohorts. Observed risks were derived in each of the 20 groups by using non-parametric estimates of the cumulative incidences. Additionally, we did a recalibration for the mortality outcome, using the method proposed by Booth et al, by updating the baseline survivor function based on the temporal validation cohort with the prognostic index as an offset term. 27
We also applied the algorithms to the validation cohort for the first time period to define the centile thresholds based on absolute risk. We also defined centiles of relative risk (defined as the ratio of the individual's predicted absolute risk to the predicted absolute risk for a person of the same age and sex with a white ethnicity, body mass index of 25, and mean deprivation score with no other risk factors). We calculated the performance metrics in the whole validation cohort and in pre-specified subgroups. 17 We also evaluated performance by calculating Harrell's C statistics in individual general practices and combining the results using a random effects meta-analysis. 28

Patient and public involvement
Patients were involved in setting the research question and in developing plans for design and implementation of the study. Patients were asked to aid in interpreting and disseminating the results.

Results
Overall study population
Overall, 1205 practices in England met our inclusion criteria. Of these, 910 practices were randomly assigned to the derivation dataset and 295 to the validation cohort. The practices had 8 256 158 registered patients aged 19-100 years on 24 January 2020. We included 6 083 102 of these in the derivation cohort, and the validation dataset comprised 2 173 056 people. Table 2 shows the baseline characteristics of patients in the derivation cohort. Of these patients, 3 035 409 (49.9%) were men and 990 799 (16.3%) were of black, Asian, or other minority ethnic (BAME) background.

Baseline characteristics
In the derivation cohort, 10 776 (0.18%) patients had a covid-19 related hospital admission and 4384 (0.07%) had a covid-19 related death during the 97 days' follow-up, of which 4265 (97.3%) were recorded on the death certificate and 119 (2.71%) were based only on a positive test (and of these <15 were based on a test more than 28 days before death). Admissions and deaths due to covid-19 occurred across all regions, with the greatest numbers in London, which accounted for 3799 (35.3%) of admissions and 1287 (29.4%) of deaths. Of those who died, 2517 (57.4%) were male, 732 (16.7%) were BAME, 3616 (82.5%) were aged 70 and over, 1417 (32.3%) had type 2 diabetes, 1311 (29.9%) had dementia, and 1033 (23.6%) were identified as living in a care home.
The characteristics of the validation cohort were similar to those of the derivation cohort, as shown in supplementary tables A and B. In the first validation period (24 January to 30 April 2020), 1722 deaths and 3703 hospital admissions due to covid-19 occurred. In the second validation period (1 May to 30 June 2020), 621 deaths and 1002 admissions due to covid-19 occurred.

Predictor variables
The variables included in the final models were fractional polynomial terms for age and body mass index, Townsend score (linear), ethnic group, domicile (residential care, homeless, neither), and a range of conditions and treatments as shown in figure 1, figure 2, figure 3, and figure 4. These conditions and treatments were cardiovascular conditions (atrial fibrillation, heart failure, stroke, peripheral vascular disease, coronary heart disease, congenital heart disease), diabetes (type 1 and type 2 and interaction terms for type 2 diabetes with age), respiratory conditions (asthma, rare respiratory conditions (cystic fibrosis, bronchiectasis, or alveolitis), chronic obstructive pulmonary disease, pulmonary hypertension or pulmonary fibrosis), cancer (blood cancer, chemotherapy, lung or oral cancer, marrow transplant, radiotherapy), neurological conditions (cerebral palsy, Parkinson's disease, rare neurological conditions (motor neurone disease, multiple sclerosis, myasthenia, Huntington's chorea), epilepsy, dementia, learning disability, severe mental illness), other conditions (liver cirrhosis, osteoporotic fracture, rheumatoid arthritis or systemic lupus erythematosus, sickle cell disease, venous thromboembolism, solid organ transplant, renal failure (CKD3, CKD4, CKD5, with or without dialysis or transplant)), and medications (≥4 prescriptions from general practitioner in previous six months for oral steroids, long acting β agonists or leukotrienes, immunosuppressants). Figure 1 and figure 2 show the adjusted hazard ratios in the final models for covid-19 related death in the derivation cohort in women and men. Figure 3 and figure 4 show the adjusted hazard ratios for the final models for covid-19 related hospital admission in the derivation cohort.
Supplementary figures A and B show graphs of the adjusted hazard ratios for body mass index, age, and the interaction between age and type 2 diabetes for deaths and hospital admissions due to covid-19 (which showed higher risks associated with younger ages). Supplementary figures C and D show fully adjusted hazard ratios for variables for the full model, including variables that were not retained in the final model (for example, those with adjusted hazard ratios close to one or those which lacked clinical credibility). Other variables with too few events for inclusion were HIV, sphingolipidoses, short bowel syndrome, polymyositis, dermatomyositis, Ehlers-Danlos syndrome, biliary cirrhosis, hepatitis B and C, haemochromatosis, non-alcoholic fatty liver disease, chronic pancreatitis, drug misuse, asplenia, cholangitis, scleroderma, Sjogren's syndrome, and pregnancy. Supplementary figures E and F show fully adjusted hazard ratios for a combined outcome of either covid-19 related death or hospital admission. This gave very similar absolute risks to the hospital admission outcome.
Table 3 shows the performance of the risk equations in the validation cohort for women and men over 97 days for the main study period and for the temporal validation cohort evaluated from 1 May 2020 to 30 June 2020. Overall, the values for the R2, D, and C statistics were similar in women and men. Values for the mortality outcome tended to be higher than those for the hospital admission outcome. For example, in the first validation period, the equation explained 74% of the variation in time to death from covid-19 in women; the D statistic was 3.46, and Harrell's C statistic was 0.933. The corresponding values in men were 73.1%, 3.37, and 0.928.
The results for the second validation period were similar except for covid-19 related admissions in women, for which the explained variation and discrimination were lower than for the first period (explained variation 45.4%, D statistic 1.87, and Harrell's C statistic 0.776).
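The reported pairs of explained variation and D statistic are numerically consistent with the relation between Royston and Sauerbrei's D statistic and its associated R2 measure, R2 = (D²/κ²)/(π²/6 + D²/κ²) with κ² = 8/π. The Python sketch below assumes that this is the R2 reported here; the text does not state the formula, so this is offered as a plausibility check and not as a description of the authors' calculation.

```python
import math


def r2_from_d(d: float) -> float:
    """Explained variation implied by a D statistic (Royston and Sauerbrei).

    Assumes a proportional hazards type model, for which the residual
    variance on the log hazard scale is taken as pi^2 / 6.
    """
    kappa_sq = 8.0 / math.pi  # scaling constant of the D statistic
    sigma_sq = math.pi ** 2 / 6.0
    scaled = d ** 2 / kappa_sq
    return scaled / (sigma_sq + scaled)


r2_men_deaths = r2_from_d(3.37)  # D statistic reported for deaths in men
r2_women_deaths = r2_from_d(3.46)  # D statistic reported for deaths in women
```

With the reported D statistics of 3.37 and 3.46, the function returns values of roughly 0.73 and 0.74, close to the reported 73.1% and 74%.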

Model evaluation
Discrimination
Supplementary tables C-F show the corresponding results by region, age band, and fifth of deprivation and within each ethnic group in men and women in both validation periods. Performance was generally similar to the overall results except for age, for which the values were lower within individual age bands. Figure 5 shows funnel plots of Harrell's C statistic for each general practice in the validation cohort versus the number of deaths in each practice in men and women in the first validation period. The summary (average) C statistic for women was 0.916 (95% confidence interval 0.908 to 0.924) from a random effects meta-analysis. The corresponding summary C statistic for men was 0.919 (0.912 to 0.926).
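The pooling of practice level C statistics can be sketched in Python as below. The text specifies a random effects meta-analysis but not the estimator, so the DerSimonian and Laird method used here, and the pooling on the untransformed scale, are assumptions; the function name and the example values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def dersimonian_laird(
    estimates: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float, float]:
    """Random effects pooled estimate, its standard error, and tau squared."""
    w = [1.0 / v for v in variances]  # fixed effect (inverse variance) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))  # Cochran's Q
    k = len(estimates)
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau_sq = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0  # between practice variance
    w_star = [1.0 / (v + tau_sq) for v in variances]  # random effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau_sq


# Hypothetical C statistics and variances from two practices
pooled_c, pooled_c_se, between_var = dersimonian_laird([0.90, 0.94], [0.0001, 0.0001])
```

A 95% confidence interval then follows as the pooled estimate plus or minus 1.96 standard errors. Practices with few deaths have large variances and therefore contribute little weight, which matches the widening of the funnel plots at low numbers of deaths.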

Discussion
We have developed and evaluated a novel clinical risk prediction model (QCOVID) to estimate risks of hospital admission and mortality due to covid-19. We have used national linked datasets from general practice and national SARS-CoV-2 testing, death registry, and hospital episode data for a sample of more than 8 million adults representative of the population of England. The risk models have excellent discrimination (Harrell's C statistics >0.9 for the primary outcome). Although the calibration for the hospital admission outcome was good in both time periods, some under-prediction existed for the mortality outcome in the second validation cohort, which improved after recalibration. The recalibration method could be used to transport the risk models to other settings or time periods with different absolute risks of covid-19. QCOVID represents a new approach for risk stratification in the population. It could also be deployed in several health and care applications, either during the current phase of the pandemic or in subsequent "waves" of infection (with recalibration as needed). These could include supporting targeted recruitment for clinical trials, prioritisation for vaccination, and discussions between patients and clinicians on workplace or health risk mitigation (for example, through weight reduction, as obesity may be an important modifiable risk factor for serious complications of covid-19 if a causal association is established). 10 Although QCOVID has been specifically designed to inform UK health policy and interventions to manage covid-19 related risks, it also has international potential, subject to local validation. One of the variables in our model (the Townsend measure of deprivation) may need to be replaced with locally available equivalent measures, or some recalibration may be needed.
Previous risk prediction models based … 29 30

Comparison with other studies
Although similarities exist between our study and the recently reported analysis of risk factors from another English general practice database using a different clinical computer system, our project had a different aim, namely, to develop and evaluate a risk prediction model. We used a more comprehensive outcome (including deaths in patients with positive tests for SARS-CoV-2), a much wider range of predictors, and a more granular assessment of ethnicity and body mass index. Our C statistic for mortality (>0.92) is substantially higher than the previous study's reported value of 0.77. 31 Other prediction models have been reported, although these focus on other outcomes of covid-19, including risk of admission to intensive care or death following a positive test, or clinical decision tools that integrate biochemical and imaging parameters to aid diagnosis. 13 However, most such studies are at high risk of bias, as they have been developed in highly selected cohorts, have limited transparency, are likely to have optimistic reported performance, or did not use covid-19 specific data. 13 This study represents a substantial improvement on previously developed risk algorithms in terms of the size and representativeness of the study population, the richness of data linkages enabling accurate ascertainment of cases (including both in-hospital and out of hospital deaths) across the health network, and the breadth of candidate predictor variables considered. Importantly, it analyses risks at the population level, rather than risks in people with confirmed or suspected infection, and may have relevance for shielding or other policies that seek to mitigate risk of viral exposure.

Complexities of modelling
Several complexities of modelling adverse risks from covid-19 in the general population warrant discussion.
We used a general population approach which, although not able to incorporate all determinants of being infected, offers an overall estimate of risk of adverse outcomes from covid-19 that could be used in discussions between clinicians and patients about adjustment of lifestyle or occupational and behavioural factors that could limit viral exposure. Our model predicts risks of "catching covid-19 and then having a severe outcome," on the basis of data collected during the first peak of the pandemic. The endpoint in this study examines a risk trajectory that comprises two elements: becoming infected, which is predominantly a function of behavioural/environmental factors including occupation, local infection rate, and numbers of social interactions; and risk of hospital admission and death due to the infection, which is arguably primarily driven by "vulnerability" (that is, biological/physiological factors including age, sex, body mass index, comorbidities, and medications). Although producing a prediction model for risk of "death if infected" is feasible in principle, this is not yet possible owing to the way testing has been done in the UK and the as yet incompletely quantified degree of asymptomatic background transmission. Limited covid-19 testing data are available, but the difficulty is that no systematic community testing was done in the UK during the study period, so only patients unwell enough to attend hospital were tested. This means that a risk score developed in those who tested positive would overestimate risks of severe outcomes. As more widespread testing is done and those data become available, we will be able to update the model to take background infection rates into account and also model regional differences.
Although the absolute risk levels will of course change over time, depending on the incidence of the disease, our analysis over two validation time periods indicates that the relative risk measures and discrimination are likely to remain stable. Secondly, the model estimates the absolute risk for a non-infected individual in the general population of becoming infected and then dying (or needing to be admitted to hospital) from the virus over a 97 day period. Although many more than 40 000 people have died from covid-19 in the UK to date, when the denominator is a population of multi-millions, the absolute risk for most people may be low. Therefore, when conveying this type of risk score to an individual, due emphasis is needed on the different meanings of absolute and relative risk.
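The distinction between the two measures can be illustrated with a short Python sketch. Relative risk follows the definition given in the methods (the individual's predicted absolute risk divided by that of a reference person of the same age and sex with no other risk factors). The function name and the risk values are hypothetical and chosen only to show that a large relative risk can coexist with a small absolute risk.

```python
from typing import Dict


def summarise_risk(
    absolute_risk: float, reference_risk: float, per: int = 100_000
) -> Dict[str, float]:
    """Return absolute and relative views of the same predicted risk."""
    return {
        # expected number of events among `per` people with this predicted risk
        "expected_events": absolute_risk * per,
        # ratio to the predicted risk of the reference person
        "relative_risk": absolute_risk / reference_risk,
    }


# Hypothetical 97 day risks: 0.1% for the individual v 0.01% for the reference person
summary = summarise_risk(absolute_risk=0.001, reference_risk=0.0001)
```

In this invented example the person has 10 times the risk of the reference person, yet about 999 in 1000 people with the same predicted risk would not have the outcome over the period.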
Thirdly, the absolute risk of catching covid-19 depends not only on the incidence of the infection but also on the number of people one gets close to. For this reason, non-pharmacological interventions such as social distancing and shielding were introduced in the UK during the study period. We have included some measures of multi-occupancy, as we have factored care homes into the analysis. The data generated during the study period will therefore be affected by the uptake of interventions such as social distancing and shielding, intended to mitigate the risks of SARS-CoV-2 infection. This could result in underestimation of some model coefficients and hence underestimation of absolute risk in people who were shielded. Also, as this is a prediction model derived from an observational study, the associations estimated for individual predictor variables should not be interpreted as causal effects.

[Figure: Harrell's C statistic v No of covid-19 deaths in men]
[Figure: plots by twentieth of predicted risk at 97 days]
However, ethical questions must be considered regarding how the tools may be used. We have presented two ways of stratifying risk based on either absolute or relative risk measures with associated centile values, but the choice of whether to have a threshold (given that risk is a continuous measure), and if so what threshold, will depend on the purpose for which the risk assessment tool is to be used, the available resources, and the ethical framework for decision making. We have analysed this within the "four ethical principles" framework that is widely used in medical decision making. The four principles are autonomy, beneficence, justice, and non-maleficence. 32 The new risk equations, when implemented in clinical software, are designed to provide more accurate information for patients and clinicians on which to base decisions, thereby promoting shared decision making and patient autonomy. They are intended to result in clinical benefit by identifying where changes in management are likely to benefit patients, thereby promoting the principle of beneficence. Justice can be achieved by ensuring that the use of the risk equations results in fair and equitable access to health services that is commensurate with patients' level of risk. Lastly, the risk assessment must not be used in a way that causes harm either to the individual patient or to others (for example, by introducing or withdrawing treatments where this is not in the patient's best interest), thereby supporting the non-maleficence principle. How this applies in clinical practice will naturally depend on many factors, especially the patient's wishes, the evidence base for any interventions, the clinician's experience, national priorities, and the available resources. The risk assessment equations therefore supplement clinical decision making and do not replace it. 
With these caveats, the predicted risk estimates can be used to identify people at higher risk, to inform shared decision making between healthcare professionals and service users, or for population level stratification.
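As an illustration of population level stratification, the following minimal Python sketch assigns each person a centile and a twentieth of predicted absolute risk, and expresses relative risk as a simple ratio. It is not the published QCOVID implementation: the function names, the tie handling, and the caller-supplied reference risk are assumptions made only for this example, and the predicted risks are taken as given.

```python
from bisect import bisect_left, bisect_right


def centiles(risks):
    """Percentage of the cohort whose predicted risk is <= each person's risk."""
    ordered = sorted(risks)
    n = len(ordered)
    return [100.0 * bisect_right(ordered, r) / n for r in risks]


def twentieths(risks):
    """Group 1..20 for each person (20 = highest predicted risk).

    People with identical predicted risk always share a group, because the
    group depends only on how many people have a strictly lower risk.
    """
    ordered = sorted(risks)
    n = len(ordered)
    return [bisect_left(ordered, r) * 20 // n + 1 for r in risks]


def relative_risk(risk, reference_risk):
    """Ratio of a person's absolute risk to a reference risk.

    The choice of reference (hypothetical here) is left to the caller.
    """
    if reference_risk <= 0:
        raise ValueError("reference_risk must be positive")
    return risk / reference_risk


if __name__ == "__main__":
    # Made-up predicted risks for a toy cohort of 40 people.
    cohort = [i / 1000 for i in range(1, 41)]
    groups = twentieths(cohort)
    top = [r for r, g in zip(cohort, groups) if g == 20]
    print("people in the top twentieth:", len(top))
```

Any threshold applied to these groups remains a policy choice, as discussed above; the sketch only ranks people and does not decide who should be offered an intervention.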
strengths and limitations of study
Our study has some major strengths but also some important limitations, which include specific factors related to covid-19 along with others that are similar to those for a range of other widely used clinical risk prediction algorithms developed using the QResearch database. [14][15][16] Key strengths include the use of a very large validated data source that has been used to develop other risk prediction tools; the wealth of candidate risk predictors; the prospective recording of outcomes and their ascertainment using multiple national level database linkage; lack of selection, recall, and respondent biases; and robust statistical analysis. We have used non-linear terms for body mass index and age. We examined interaction terms, which show increased risks at younger ages for adults with type 2 diabetes. We also established a new linkage to the systemic anti-cancer therapy (SACT) database for chemotherapy prescribed and administered in secondary care (which may not be recorded well in general practice software) to circumvent possible missing data for this important variable. Specific limitations include the occurrence of shielding during the study period and that the study was conducted during the first phase of the UK epidemic. We have accounted for many risk factors for covid-19 mortality, but risks may be conferred by some rare medical conditions or other factors such as occupation that have not yet been observed or are poorly recorded in general practice or hospital data. In particular, the model does not include two important predictors, namely, prevailing infection rate and personal social distancing measures. A lack of comprehensive testing has led to some missing data on covid-19 admissions and/or deaths, which means that development of a valid model for predicting death in people infected with SARS-CoV-2 is not yet possible.
We acknowledge that absolute risks are changing during the course of the pandemic, so these should be interpreted with caution. However, we would expect predictors of risk, relative risk measures, and discrimination to be more stable over time, which is consistent with the results from our temporal validation. Although this tool was modelled on the best available data from the first wave of the pandemic, it will be updated as further testing and outcome data accrue, immunity levels change, and (potentially) a vaccine becomes available. Nevertheless, having a risk score available at this stage of the pandemic may be useful to identify people at high risk before a vaccine or treatment is available.
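To make concrete what "discrimination" refers to here, the sketch below computes Harrell's C for right censored data in plain Python: among pairs of people where the one with the shorter follow-up time had the event, it is the proportion in which that person also had the higher predicted risk. This is an illustrative, quadratic-time version written for this example, not the code used in the study; the function name and the decision to skip pairs with tied times are assumptions.

```python
def harrell_c(times, events, risks):
    """Harrell's concordance statistic for right censored outcomes.

    times  : follow-up time for each person
    events : True/1 if the outcome occurred at that time, False/0 if censored
    risks  : predicted risk (higher should mean earlier outcome)

    A pair (i, j) is comparable when person i had the outcome strictly before
    person j's follow-up ended. Pairs with tied times are skipped here, which
    is a simplification. Tied predicted risks count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored person cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data: the third person is censored at time 3.
    times = [1, 2, 3, 4]
    events = [1, 1, 0, 1]
    risks = [0.9, 0.7, 0.4, 0.1]
    print(harrell_c(times, events, risks))
```

Applying such a function separately to each validation period gives one value per period, and similar values across periods would be consistent with stable discrimination even when absolute risks shift.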
We have reported a validation in each of two time periods using practices from QResearch, but these practices were completely separate from those used to develop the model. We have used this approach previously to develop and validate other widely used prediction models. When these have been further externally validated on completely different clinical databases, by ourselves and others, the results have been very similar. [33][34][35] Work is already under way to evaluate the models in external datasets across all four nations of the UK and to integrate the algorithms within NHS clinical software systems.

policy implications and conclusions
This study presents robust risk prediction models that could be used to stratify risk in populations for public health purposes in the event of a "second wave" of the pandemic and support shared management of risk. We anticipate that the algorithms will be updated regularly as understanding of covid-19 increases, as more data become available, as behaviour in the population changes, or in response to new policy interventions. It is important for patients/carers and clinicians that a common, appropriately developed, evidence based model exists that is consistently implemented and is supported by the academic, clinical, and patient communities. This will then help to ensure consistent policy and clear national communication between policy makers, professionals, employers, and the public.