Education and coronary heart disease: mendelian randomisation study

Objective To determine whether educational attainment is a causal risk factor in the development of coronary heart disease. Design Mendelian randomisation study, using genetic data as proxies for education to minimise confounding. Setting The main analysis used genetic data from two large consortia (CARDIoGRAMplusC4D and SSGAC), comprising 112 studies from predominantly high income countries. Findings from mendelian randomisation analyses were then compared against results from traditional observational studies (164 170 participants). Finally, genetic data from six additional consortia were analysed to investigate whether longer education can causally alter the common cardiovascular risk factors. Participants The main analysis was of 543 733 men and women (from CARDIoGRAMplusC4D and SSGAC), predominantly of European origin. Exposure A one standard deviation increase in the genetic predisposition towards higher education (3.6 years of additional schooling), measured by 162 genetic variants that have been previously associated with education. Main outcome measure Combined fatal and non-fatal coronary heart disease (63 746 events in CARDIoGRAMplusC4D). Results Genetic predisposition towards 3.6 years of additional education was associated with a one third lower risk of coronary heart disease (odds ratio 0.67, 95% confidence interval 0.59 to 0.77; P=3×10−8). This was comparable to findings from traditional observational studies (prevalence odds ratio 0.73, 0.68 to 0.78; incidence odds ratio 0.80, 0.76 to 0.83). Sensitivity analyses were consistent with a causal interpretation in which major bias from genetic pleiotropy was unlikely, although this remains an untestable possibility. Genetic predisposition towards longer education was additionally associated with less smoking, lower body mass index, and a favourable blood lipid profile. Conclusions This mendelian randomisation study found support for the hypothesis that low education is a causal risk factor in the development of coronary heart disease. Potential mechanisms could include smoking, body mass index, and blood lipids. In conjunction with the results from studies with other designs, these findings suggest that increasing education may result in substantial health benefits.


Traditional observational analyses
The association between education and self-reported CHD prevalence was derived from NHANES data using multivariable logistic regression adjusting for age and sex (Supplementary Table 1). (1) The association between education and the incidence of clinically-verified CHD was derived from two data sources. First, prospective data from the HAPIEE study was analysed with a Cox proportional hazard regression adjusting for age, sex, and country.(2) Second, we reanalysed published summary-level data from the MORGAM consortium, where the published hazard ratios (HR) were initially expressed in terms of country-specific top vs. bottom tertiles of education. (3) Here, we took the log of the hazard ratios corresponding to the lowest and highest educational tertiles, and divided it by the number of years of education between the lowest and highest tertiles from another publication,(4) and inverted the sign (to express the estimate per 1-year of additional education). Results were multiplied by 1-SD (3.6 years of education), and log HRs were back transformed into HR of CHD per 1-SD increase in education. Country and sex-specific MORGAM estimates were then pooled using fixedeffects inverse-variance meta-analysis. HAPIEE and MORGAM estimates were meta-analysed similarly, to derive a summary estimate of incidence.
As our MR analyses assumed a linear relationship between education and CHD, we used individual-level data from NHANES and HAPIEE to model the dose-response observational association between years of education (as an ordinal categorical variable) and risk of CHD ( Supplementary Figures 10-11).

Genetic correlation between education and CHD
To investigate the genetic correlation between education and CHD, we used the "Lookup Center" function of the LD Hub platform (http://ldsc.broadinstitute.org).(5) On 28th October 2016, we downloaded an XLS file of genetic correlations, based on latest data that has been previously uploaded onto the LD Hub platform. Analyses were done by Linkage Disequilibrium Score Regression.

Causal analyses 3.1. Conventional Mendelian randomization
To avoid biases due to overlapping datasets (i.e. where data from the gene-education association and gene-CHD association are derived from the same samples, and which can lead to bias in the direction of the observational association in the presence of a weak instrument),(6) we excluded data from the following studies from the SNPeducation data source: deCODE, WTCCC, KORA, THISEAS, and 23andMe. For all MR analyses, alleles from SSGAC and CARDIoGRAMplusC4D datasets were aligned to correspond to an increase in educational attainment. Conventional MR analysis was conducted using the inverse-variance weighted (IVW) approach, i.e. a linear regression of the SNPs-education estimates on SNPs-CHD estimates, weighted by the minor allele frequencies of each SNP and forced to pass through the origin. (9) We estimated the power for the conventional MR analysis to detect the same magnitude of association reported in the observational studies, using a two-sided alpha of 0.05. Power was ≥98% for both sets of genetic instruments (Supplementary Table 2).

Sensitivity Mendelian randomization analyses
Sensitivity analyses investigated the potential presence of unbalanced horizontal pleiotropy among the genetic variants under analysis.

Penalized weighted median Mendelian randomization
A penalized weighted median MR analysis was conducted (implemented in Stata using the mrrobust package; available at: https://github.com/remlapmot/mrrobust).(10) This gives more weight to genetic variants close to the median causal estimate. Weighted Median methods yield robust and precise results even when up to 50% of the weight in the analysis stems from invalid genetic variants.(10)

MR-Egger regression
MR-Egger regression was applied as described by Bowden et al. (9) Based on the same principles as the Egger test (which assesses small study bias in meta-analysis) the method is similar to conventional MR analyses. However, the regression is not constrained to pass through the origin. A significant departure of the y-intercept from the origin gives evidence for the presence of unbalanced pleiotropy. If the level of pleiotropy is independent of the strength of the association between SNPs and the exposure under analysis, the MR-Egger estimate thus represents the true causal effect, even if all the genetic variants present pleiotropic effects (as per the InSIDE rule). (9) The standard error (SE) of the causal estimate was corrected by dividing the reported SE of the estimate by the residual SE.
Additional sensitivity analysis (Supplement only)

MR-Egger +SIMEX
All MR approaches rely on the fact that the SNP-exposure association is true (NO Measurement Error [NOME] assumption), but whenever the SNP-exposure association estimates are spurious (violation of the NOME assumption), weak instrument bias can distort the causal effect estimate (specifically diluting it towards the null value).(11) Using the I2 statistic, we thus quantified the expected dilution in the MR-Egger causal effect estimates due to the variance of the estimates of the SNP-education association: I2 was only moderate for the set of 162 SNPs (I2=66%; potential dilution of 44%), whereas I2 for the set of 72 SNPs indicated a reduced risk of bias (I2=93%; potential dilution of 7%). As described by Bowden et al, we applied simulation extrapolation (SIMEX; implemented in R using the simex package) to adjust the MR-Egger causal estimates to account for violations to the NOME assumption (NO Measurement Error; results based on 10,000 simulations are presented in Supplementary Table 4).(11)

Mode Based Methods (assuming Zero Modal Pleiotropy)
It is also possible to assess the potential role of horizontal unbalanced pleiotropy through recently developed methods that relax the conventional MR assumptions, and instead form a less stringent assumption of Zero Modal Pleiotropy. This postulates that pleiotropic SNPs are unlikely to converge on the same modal (most common) estimate due to pleiotropic effects not being identical. In contrast, valid SNPs are more likely to converge on the same, common modal estimate. We performed three analyses to exploit this assumption: We explored a range of these, following which ߮=0.5 was chosen to best fit the data. The analysis makes the assumption that the most commonly observed causal effect estimate comes from valid genetic instruments, and it can provide a consistent causal effect estimate even if the majority of (non-modal) genetic instruments are invalid.
One advantage of the Mode-Based Estimate is that it is less influenced by outlying (and possibly pleiotropic) genetic instruments without formally removing them from the analyses, thus making full use of the data. However, the uncertainty around the point estimate can sometimes be prohibitively wide. For this reason, we exploited the Zero Modal Pleiotropy assumption using another strategy, which involves actually removing some genetic instruments from the analysis.

Largest Homogeneous Subset-MR
In our second analysis, SNPs were removed, one-by-one, until the final set of SNPs contained only sufficiently similar (according to some criteria) effect estimates. As such this final set of SNPs can be thought of as a relatively "homogeneous subset". The steps we took are: Our third analysis was similar to the second one described above. This time, instead of removing any SNP (either at the left-or right-hand tail of the causal effect distribution) we only removed those SNPs that provided the strongest causal estimates (on one side of the tail), until the heterogeneity P value was attained as above. This method is unlikely to provide a valid causal estimate, as its result will be biased towards the null. However, it is an extreme example of a very stringent sensitivity check which asks the question "What if the most outlying SNPs, all of which produce the strongest causal estimates, were deemed as invalid (due to having suspected horizontal, unbalanced pleiotropic effects on CHD)?" Results from all three Mode Based Methods are presented in Supplementary Table 4. To summarize these findings, the first Mode-Based Estimate yielded directionally concordant point estimates. However, this test was grossly underpowered to detect a causal effect. The Largest Heterogeneous Subset analyses, by contrast, were better powered. The vast majority of SNPs (i.e. 90%) were highly homogeneous in their causal effect estimates. Removing these 0-10% heterogeneous SNPs made little difference to the point estimates, and furthermore all overall MR estimates retained conventional levels of statistical significance, so were unlikely to have been observed by chance alone. Altogether, findings from the three mode based analyses were consistent with those from the main IVW analyses, and with the hypothesis of limited confounding from unbalanced horizontal pleiotropy.

Causal association between genetic liabilities for CHD and education
Since genetic liabilities for CHD may also influence educational attainment already at earlier ages, we tested whether the genetic risk of CHD was associated with educational attainment. We used data from the CARDIoGRAMplusC4D Consortium to extract SNP-CHD estimates for 53 independent SNPs (at r2<0.02) that were GWAS significant (P<5x10-8). (13)(14)(15)(16) We directly matched these with the corresponding SNPs from the SSGAC GWAS, involving 328,917 individuals (Supplementary Dataset 3). (7) The analyses were conducted similar to the analyses described above. This analysis was performed using data where the underlying participants overlapped slightly between the SNP-exposure and SNP-outcome estimates. However, as such overlap biases results away from the null,(17) and since our finding was quite definitively null, we did not purse seeking non-overlapping data in this instance, for this particular sensitivity analysis

Causal relationships from education to 10 cardiovascular risk factors, from 6 GWAS consortia
The conventional MR approach was used to identify associations between genetic predisposition towards higher educational attainment and cardiovascular risk factors that could be potential mediators. Outcome data were taken from various publicly available datasets: ever smokers vs. never smokers (from the Tobacco and Genetics Consortium); (18)  Statistical analyses were conducted using Stata v.13. Simulation extrapolation analyses were conducted using R v3.3.1.  Figure 2) Power calculation was based on the method developed by Brion et al. (25) Supplementary   , key findings (green), their interpretation (yellow), conclusion (orange) and final suggestions for discussion (red).

Supplementary
verview of the main steps in this study, showing existing datasets (grey), hypothesis formulation (blue), key findings (green), their interpretation (yellow), conclusion (orange) and final suggestions for education GWAS were mapped against SNPs Where necessary, proxies were retrieved using the SNP Annotation and Proxy Search online tool (SNAP, http://archive.broadinstitute.org/mpg/snap/ldsearch.php ; reference panel = 1000 Genomes; LD threshold r2>0.80).

Supplementary Figures 9
Supplementary Figure 9 Observational estimates from the MORGAM consortium results from meta-analysis.
Meta-analysis performed using inverse-variance weighted fixed CHD, coronary heart disease. SD, standard deviation. HR, hazard ratio. CI, confidence interval. Supplementary Figures 9-11, concerning OBSERVATIONAL RESULTS: bservational estimates from the MORGAM consortium, showing cohort variance weighted fixed-effect modelling. CHD, coronary heart disease. SD, standard deviation. HR, hazard ratio. CI, confidence interval. 18 11, concerning OBSERVATIONAL RESULTS: howing cohort-level estimates and the Supplementary Figure 10 Dose response relationship between education and CHD, using observational NHANES data Lowest education group represents "some high school" in USA system, i.e. typically 16 Logistic regression model was adjusted for age and sex. P for trend < 0.0001. CHD, coronary heart disease; NHANES, National Health and Nutrition Examination Survey; OR, odds ratio CI, confidence interval.
Dose response relationship between education and CHD, using observational NHANES data Lowest education group represents "some high school" in USA system, i.e. typically 16-year old pupil.
for age and sex.
CHD, coronary heart disease; NHANES, National Health and Nutrition Examination Survey; 19 Dose response relationship between education and CHD, using observational NHANES data year old pupil.
Supplementary Figure 11 Dose response relationship between education and CHD incidenc data Lowest education group represents "Primary education or lower", i.e. max. 4 years of education.
Cox proportional hazard regression model was adjusted for age, sex and country. P for trend<0.001. CHD, coronary heart disease; HAPIEE, The Health, Alcohol and Psychosocial factors In Eastern Europe Study; HR, hazard ratio CI, confidence interval.
Dose response relationship between education and CHD incidence, using observational HAPIEE Lowest education group represents "Primary education or lower", i.e. max. 4 years of education.
Cox proportional hazard regression model was adjusted for age, sex and country. Each dot represents one single nucleotide polymorphism (SNP). The red line represents the regression slope of the causal effect estimate of education on risk of CHD (where each SNP is weighted by its inverse allele frequency). CHD, coronary heart disease. OR, odds ratio. CI, confidence interval.
The instrument strength, representing the minor allele frequency corrected genetic association with risk of standard error of the SNP-outcome association for each SNP. The conventional MR (I SNP, single nucleotide polymorphism. CHD, coronary heart interval.
instrument strength against the causal estimates (genetic liability for CHD on The instrument strength, representing the minor allele frequency corrected genetic association with risk of CHD is calculated by dividing the SNP outcome association for each SNP. The conventional MR (IVW in red) and Egger MR (MR-Egger in blue) causal effect estimates are presented. heart disease. IVW, inverse variance weighted approach. MR, Mendelian randomization. OR, odds rat 29 on educational outcomes) is calculated by dividing the SNP-exposure association by the Egger in blue) causal effect estimates are presented. disease. IVW, inverse variance weighted approach. MR, Mendelian randomization. OR, odds ratio. CI, confidence