Integrating genome-wide polygenic risk scores and non-genetic risk to predict colorectal cancer diagnosis using UK Biobank data: population based cohort study

Abstract Objective To evaluate the benefit of combining polygenic risk scores with the QCancer-10 (colorectal cancer) prediction model for non-genetic risk to identify people at highest risk of colorectal cancer. Design Population based cohort study. Setting Data from the UK Biobank study, collected between March 2006 and July 2010. Participants 434 587 individuals with complete data for genetics and QCancer-10 predictions were included in the QCancer-10 plus polygenic risk score modelling and validation cohorts. Main outcome measures Prediction of colorectal cancer diagnosis by genetic, non-genetic, and combined risk models. Using data from UK Biobank, six different polygenic risk scores for colorectal cancer were developed using LDpred2 polygenic risk score software, clumping, and thresholding approaches, and a model based on genome-wide significant polymorphisms. The top performing genome-wide polygenic risk score and the score containing genome-wide significant polymorphisms were combined with QCancer-10 and performance was compared with QCancer-10 alone. Case-control (logistic regression) and time-to-event (Cox proportional hazards) analyses were used to evaluate risk model performance in men and women. Results Polygenic risk scores derived using the LDpred2 program performed best, with an odds ratio per standard deviation of 1.584 (95% confidence interval 1.536 to 1.633), and top age and sex adjusted C statistic of 0.733 (95% confidence interval 0.710 to 0.753) in logistic regression models in the validation cohort. Integrated QCancer-10 plus polygenic risk score models out-performed QCancer-10 alone. In men, the integrated LDpred2 model produced a C statistic of 0.730 (0.720 to 0.741) and explained variation of 28.2% (26.3 to 30.1), compared with 0.693 (0.682 to 0.704) and 21.0% (18.9 to 23.1) for QCancer-10 alone. In women, the C statistic for the integrated LDpred2 model was 0.687 (0.673 to 0.702) and explained variation was 21.0% (18.7 to 23.7), compared with 0.645 (0.631 to 0.659) and 12.4% (10.3 to 14.6) for QCancer-10 alone. In the top 20% of individuals at highest absolute risk, the sensitivity and specificity of the integrated LDpred2 models for predicting colorectal cancer diagnosis was 47.8% and 80.3% respectively in men, and 42.7% and 80.1% respectively in women, with increases in absolute risk in the top 5% of risk in men of 3.47-fold and in women of 2.77-fold compared with the median. Illustrative decision curve analysis indicated a small incremental improvement in net benefit with QCancer-10 plus polygenic risk score models compared with QCancer-10 alone. Conclusions Integrating polygenic risk scores with QCancer-10 modestly improves risk prediction over use of QCancer-10 alone. Given that QCancer-10 data can be obtained relatively easily from health records, use of polygenic risk score in risk stratified population screening for colorectal cancer currently has no clear justification. The added benefit, cost effectiveness, and acceptability of polygenic risk scores should be carefully evaluated in a real life screening setting before implementation in the general population.


Base Genome-wide Association Study Meta-analysis
The base dataset for polygenic risk score (PRS) development was obtained through meta-analysis of the datasets included in Law et al., 1 excluding the UK Biobank dataset. Summary data from the following genome-wide association study (GWAS) datasets was therefore included: NSCCG-OncoArray; SCOT; SOCCS/GS; SOCCS/LBC; CCFR1; CCFR2; COIN; CORSA; Croatia; DACHS; FIN; UK1; Scotland1; VQ58. The contributing datasets, genotyping and imputation information, quality control (QC) and study approvals are described in detail in Law et al. 1

Cancer Incidence Calculation
Whole UKB cohort CRC incidence rates were calculated based on linked registry cases, without removal of prevalent cases, to reflect registration as would occur in national data. In addition ASIRs were calculated in the Integrated Modelling Cohort, in which prevalent cases were removed and cases identified through cancer and death registry, and linked hospital inpatient data; follow-up duration was as defined in the main methods. This analysis used R packages 'survival' and 'epitools'. 2,3

PRS Sample QC and dataset definitions
We performed standard per-person QC on all individuals with imputed genetic data available, removing those with sex chromosome aneuploidy, sex-mismatch and an excess of relatives in the dataset. The Derivation Dataset (see Figure 1) included individuals identified by UKB as having white-British ancestry (on the basis of self-report and principal components analysis), and recruited through English and Welsh centres. We performed further QC on this cohort, 4 removing those who were not included in the PCA calculation (which removes related individuals at 3 degrees of relatedness or closer from the dataset), and restricting further to a genetically homogeneous subset (those within log-distance of 5 following computation of a robust Mahalanobis distance), resulting in a dataset of 310664 individuals.
The Geographic Validation Cohort comprised 34152 individuals recruited in Scotland and of European ancestry (UK Biobank self-reported ethnicities of "British", "Irish", "White", and "Any other white background") passing standard QC. Scotland was chosen for validation as this cohort contained more than the recommended number of cases for model validation (a minimum of 100, and ideally 200, cases), 5 and represents a population with different demographics to England and Wales, testing the models portability.
A Minority Ethnic Validation Cohort (n = 27503) comprised all UK Biobank participants passing standard QC with self-reported ethnicities not in the above categories (including individuals who responded "Do not know" and "Prefer not to answer", but not those with missing ethnicity data).
Thirty thousand randomly selected individuals from the white-British Derivation Dataset were used to derive a Training Cohort for PRS hyper-parameter selection. Ten thousand individuals from within this Training Cohort were used for linkage disequilibrium (LD) matrix construction (used for C+T, SCT and LDpred2 models). 4 The remaining 280664 individuals in the Derivation Dataset comprised the Test Cohort in which PRS performance was evaluated.
We used imputed SNP allele dosage data from UK Biobank, restricting variants to those included in HapMap3, and with matched SNPs in the base data. Of 12972739 SNPs present in the base GWAS summary statistics, 1798524 ambiguous SNPs were removed and 1117002 variants matched with UK Biobank data. QC was performed as recommended by Privé et al. 4 on the summary statistics, comparing standard deviations of genotypes in the summary statistics (SDss) and 10000 individuals from the LDpred2 Training Cohort (SDldtr), and removing variants where SDss < 0.5(SDldtr), SDss > 0.1 + SDldtr, SDss < 0.1, or SDldtr < 0.05 ( Figure S1), 4 leaving 1104409 SNPs included in analysis. Following QC, the minimum INFO score was 0.411.

GWAS significant PRS
We manually curated a list of SNPs derived from previously published GWAS in European populations including Law et al. and Huyghe et al. 1,6 and the references within these (Table S1). We excluded SNPs which did not reach genome-wide significance (p<5x10 -8 ) in our base meta-analysis, and used the effect sizes from our meta-analysis, adjusted for the winner's curse using the False Discovery Rate Inverse Quantile Transformation (FIQT) method. 7 Where SNPs were reported at the same loci in different studies and were correlated at ! >0.1 we retained the most significantly associated SNP. We confirmed that all of the included SNPs imputed well in the UKB data with INFO scores > 0.9. The PRS was calculated as the sum of allele dosages weighted by their effect sizes.

C+T and SCT PRS
Clumping and thresholding approaches to SNP selection generate PRS scores across a range of LD ! values (with a given window size for clumping selected) and association p value thresholds. We used R package bigsnpr by Privé et al. 8 to generate scores across a grid of ! , p-value threshold, and clumping window size values. Default parameters were used: clumping ! of 0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.95; 50 p-value thresholds spaced equally between 0.1 and the most significant p-value on the log scale; and a base clumping window size of 50, 100, 200 and 500 (where actual window size in kb is the base size divided by clumping ! ).
From this grid, a maximum score was selected based on AUC (the C+T score in this paper), and stacking used to learn the optimal linear combination of scores generated through efficient penalised regression (the SCT score). 8

LDpred2 PRS
LDpred2 4 uses a Bayesian approach to SNP selection and shrinkage for PRS, based on an LD matrix and GWAS summary statistics, implemented in the R package bigsnpr. This updated version of LDpred has been demonstrated to provide higher predictive performance, particularly with large GWAS sample size as in this study, 4 and also addresses previous instability issues. 9 The use of a larger window of 3cM (using genetic distance rather than number of bases) improves performance when causal variants are located in regions of longrange LD, such as HLA regions. Colorectal cancer-associated variants in these regions have recently been reported, 1,6 and this improvement may therefore be of benefit in CRC-prediction. LDpred2 also evaluates more hyper-parameters (a grid of 126 instead of 7 in LDpred).
There are multiple options for PRS construction within LDpred2. An infinitesimal model (LDpred2-inf), in which all makers are assumed to be causal; grid models (LDpred2-grid) in the hyper-parameters SNP heritability, ℎ ! , proportion of causal variants, p, and optionally sparsity, are tuned in a validation set; and an auto model (LDPred2-auto) in which sparsity and SNP heritability are estimated automatically, negating the need for a validation set. LDpred2 estimates heritability calculated from constrained LD score regression. The estimate for this dataset was 0.1602065.
We evaluated LDpred2-inf and LDpred2-grid models (sparse and non-sparse), running them genome-wide as recommended. LDpred2-grid outputs SNP effect sizes for each of the grid values; the optimally performing model was then selected based on best Z-score for the logistic regression slope ( Figure S2), in which we adjusted for array platform and first 4 principal components (PCs).

Evaluation of polygenic risk score performance
Each PRS was evaluated in logistic regression and Cox models, adjusting for age, sex, array and 4 principal components. Age, sex and PCs were all modelled as continuous variables, assuming a linear relationship. For Cox models we confirmed proportional hazards assumptions held through visual inspection of plots of Schoenfeld residuals. We evaluated potential interactions between PRS and age by examining the prognostic strength and significance of interaction terms based on Wald ! statistics, and plotting marginal effects of PRS with age. We compared model performance to a reference model, containing age, sex, array and 4 principal components, to assess the contribution of the PRS to model performance. Further models were also derived which did not adjust for age and sex, 10 to evaluate the contribution which these factors (known to be independent predictors for CRC risk) made to the performance of the full model.
In order to compare PRS distributions for each cohort, and effect sizes per SD of each PRS, we standardised the PRS to have a mean of 0 and standard deviation of 1 in the Test Cohort. We also used these standardised scores in plots of marginal effects of PRS in interaction with age. Remaining analyses used non-standardised scores.

Validation of QCancer-10
Validation of QCancer-10 in UKB permits evaluation of model performance in a population of approximately bowel screening age. Full QCancer-10 model specification is available at https://www.qcancer.org/15yr/colorectal/. 11 CRC outcomes were identified as described in the main paper. Of note, in QCancer-10 (colorectal cancer) development ICD-10 codes for anal cancer were included in case definition. We did not include these in this study, as anal cancers are of a different aetiology to CRC, and bowel cancer screening does not aim to detect these lesions. Previous medical history, alcohol and smoking status, and family history were all taken from selfreported data in baseline touch-screen and verbal UKB assessment centre interviews.
Mapping of ethnicity, smoking and alcohol intake is given in Table S2. Ethnicity was coded from self-reported ethnicity (UKB field 21000). Smoking history was compiled from the smoking summary field (field 20116), frequency of smoking (field 1239) and number of cigarettes smoked (field 3456). To calculate alcohol intake, reported alcohol intake frequency (field 1558) was combined with detailed drink-based intake reported in glasses/pints at touchscreen interview. Drinks intake was converted to units using NHS Choices Livewell alcohol units (as in Usher-Smith et al. 12 ), and average daily units calculated.
Previous medical history was taken from self-reported cancer and non-cancer illnesses (fields 20001 and 20002) at touch-screen interview (Table S3).
Family history in UKB is for first degree relatives, detailed for father, mother and siblings separately; we considered positive family history to be CRC in any of these relatives. In QCancer-10 development, absence of data carries the assumption that the individual does not have any family history; family history was therefore coded as missing only if the answer for all of these was either 'Do not know' or 'Prefer not to answer'.
Distributions of continuous predictors were evaluated. One implausible value for BMI was set to missing and otherwise all values were retained. Of note there are a very small number of UKB participants aged 38-39 and 71-73 years at baseline assessment, who were included in our modelling.  . Whilst we continued with model development, for the female integrated models the estimate of outcome risk may be less precise, and the model may be more subject to over-fitting. 14 External validation of the integrated model will be essential prior to implementation.
We constructed Cox models in the Integrated Modelling Cohort including two predictors: the risk score from QCancer-10 and a PRS. PRS were adjusted for genotyping array and the first four principal components from UK Biobank study. We developed male and female models separately, and compared the use of the topperforming genome-wide PRS, and the GWAS-sig PRS. We truncated the lower and upper 0.5% of the distributions of each predictor to the outer bounds. 15 Inspection of Schoenfeld residuals showed that the proportional hazard assumption held. We evaluated the use of multiple fractional polynomials to model the predictors. We assessed possible interactions between the predictors by visual inspection of plots of marginal effects of the QCancer-10 risk score across PRS values, and examining the prognostic strength and significance of interaction terms based on Wald ! statistics.

Decision Curve Analysis
We used decision curve analysis (DCA) to evaluate the potential impact of our models on clinical decision making. 16,17 We assumed the decision in question was whether an individual in the general population ought to undergo screening colonoscopy, based on their risk. The outcome of DCA, net benefit (NB), was calculated as the number of true positives minus the number of false positives (i.e. unnecessary procedures), with an "exchange rate" applied to false positives by weighting them by the odds at the given risk threshold. 18 As given in Vickers et al., 16 where " is the probability threshold and is the sample size. One can also evaluate NB in terms of true negatives rather than true positives, which equates to the number of unnecessary interventions avoided in using a risk model. We used the R function dca::stdca to calculate NB and unnecessary interventions avoided at 8 years of follow-up, and plotted these across the full range of thresholds in which the models provided benefit.
Overall, the model with the highest NB is considered the "best" strategy in DCA. However this evaluation does not incorporate the added implications of undertaking PRS. Whether the additional burden is worthwhile can be evaluated by calculating the test trade-off. We calculated the increase in net benefit (∆NB) at pre-specified thresholds (see below) afforded by adding the LDpred2 PRS to the QCancer-10 model, and used this to calculated the test trade-off, which is 1/∆NB. 18 This indicates the number of additional tests (here PRS) which would be needed in order to obtain one more true positive CRC diagnosis using the model. Future analyses to measure and evaluate financial costs of risk score tests, environmental impact, 20 and potential effects on screening participation would be required to investigate this issue in detail.
Vickers et al. note that the probability threshold may also be considered as the number needed to test to identify a cancer, 21 which here would be the number needed to screen. In order to identify relevant thresholds in which to evaluate NB and test trade-off, we searched the existing literature to identify potentially relevant thresholds which have been deemed acceptable in clinical practice. In randomised trials of colonoscopy based screening, the NNS for CRC was 1000 in two studies based in Spain and the USA, 22,23 182 in a Dutch colonoscopy trial, 24 and 202 in a multi-country European study, 25 equating to probability thresholds of 0.1-0.5%. Of note, however, these thresholds reflect immediate risk of CRC, rather than longer term risk (e.g. 8-year risk in our DCA).
The threshold probability considered is context/patient specific and is the level at which one feels equivocally about the benefits and harms of the intervention. 26 In a population-based screening programme, concern around cost of colonoscopy (in terms of financial and opportunity costs due to capacity) will be greater than in a randomised trial, resulting in a higher threshold probability. In informing individual patient choice, patient and clinician preference around risk may vary considerably depending on their level of concern around cancer and the colonoscopy procedure. We therefore evaluated thresholds of 0.5%, 1%, 1.5% and 2% in our calculations, to provide a range of potentially relevant values for policy makers, clinicians and screening participants.
We note also that a significant portion of the benefit of colonoscopy screening arises from preventing neoplasia through the removal of advanced adenomas (AA) at colonoscopy, which is not reflected in these numbers, and could not be readily evaluated in this study due to lack of sufficiently high resolution data on colorectal polyps in UKB. For all advanced neoplasia (i.e. AAs and CRC), NNS in the above trials ranged from 9-49, equating to probability thresholds of 2-10% and reflecting the higher prevalence of AA compared to CRC.

Figure S3. Age specific CRC rates in men and women in the UK Biobank cohort overall and Integrated
Modelling cohorts, compared to Office for National Statistics 2013 Cancer Registry data. 31 Cases for the whole UK Biobank cohort are from linked cancer registry data; cases for the Integrated Modelling Cohort are from linked cancer registry, death registry, and hospital data.

Figure S4. Distributions of standardised PRS for PRS Test Cohort and Validation Cohorts.
Case distribution is shown in green, controls in blue.

Interactions between PRS and Age
Evaluation of interaction terms (Table S6) indicated a significant interaction between age and PRS (at p < 0.01) for the LDpred-inf model only in logistic regression models, and for LDpred2-grid, LDpred2-grid-sp and C+T Cox models. Plots of marginal effects (shown for logistic regression models in Figure S5) indicated a reduction in effect of PRS with increasing age. Plots for Cox models were similar. Given the weakness of the interaction terms relative to the other predictors based on Wald ! , we elected not to include interaction terms in the models.        Table S10. Subgroup analysis of PRS Cox model performance by sex in the Geographic Validation Cohort. We did not assess performance specifically in those with a first degree family history, as there were too few incident cases in this group.

QCancer-10+PRS model specification
We confirmed QCancer-10 risk score and PRS fulfilled proportional hazards assumptions. Evaluation of multiple fractional polynomials (MFP) for modelling of these predictors resulted in use of MFP terms for the PRS in the female QCancer-10+LDP model (see Model specification below). Evaluation of interaction terms showed no significant interactions (Table S13). Plots of marginal effects ( Figure S18) indicated a reduction in effect of QCancer-10 with increasing PRS score. Given the weakness of the interaction terms relative to the other predictors based on Wald ! , we elected not to include interaction terms in the models.    Table S17. Sensitivity, specificity, detection rate, and false positive rate of QCancer-10 for CRC diagnosis across top 25 centiles of absolute risk in men and women. Calculated following recalibration of the QCancer-10 model. Table S18. Sensitivity, specificity, detection rate, and false positive rate of QCancer-10+LDP across top 25 centiles of relative risk. Risk is calculated relative to an individual of the same age and sex, of white-British ethnicity, with no CRC risk factors, BMI of 25, mean Townsend Deprivation Score, and mean PRS.  Table S20. Fold-increase in absolute risk between 95 th centile and median risk for QCancer-10+LDP, QCancer-10+GWS, and QCancer-10 models