Cardiovascular screening to reduce the burden from cardiovascular disease: microsimulation study to quantify policy options

Objectives To estimate the potential impact of universal screening for primary prevention of cardiovascular disease (National Health Service Health Checks) on disease burden and socioeconomic inequalities in health in England, and to compare universal screening with alternative feasible strategies. Design Microsimulation study of a close-to-reality synthetic population. Five scenarios were considered: baseline scenario, assuming that current trends in risk factors will continue in the future; universal screening; screening concentrated only in the most deprived areas; structural population-wide intervention; and combination of population-wide intervention and concentrated screening. Setting Synthetic population with similar characteristics to the community dwelling population of England. Participants Synthetic people with traits informed by the health survey for England. Main outcome measure Cardiovascular disease cases and deaths prevented or postponed by 2030, stratified by fifths of socioeconomic status using the index of multiple deprivation. Results Compared with the baseline scenario, universal screening may prevent or postpone approximately 19 000 cases (interquartile range 11 000-28 000) and 3000 deaths (−1000-6000); concentrated screening 17 000 cases (9000-26 000) and 2000 deaths (−1000-5000); population-wide intervention 67 000 cases (57 000-77 000) and 8000 deaths (4000-11 000); and the combination of the population-wide intervention and concentrated screening 82 000 cases (73 000-93 000) and 9000 deaths (6000-13 000). The most equitable strategy would be the combination of the population-wide intervention and concentrated screening, followed by concentrated screening alone and the population-wide intervention. Universal screening had the least apparent impact on socioeconomic inequalities in health. 
Conclusions When primary prevention strategies for reducing cardiovascular disease burden and inequalities are compared, universal screening seems less effective than alternative strategies, which incorporate population-wide approaches. Further research is needed to identify the best mix of population-wide and risk targeted CVD strategies to maximise cost effectiveness and minimise inequalities.


CHAPTER 1. HIGH LEVEL DESCRIPTION OF IMPACTNCD
IMPACTNCD is a discrete time dynamic stochastic microsimulation model. 1,2 Within IMPACTNCD each unit is a synthetic individual and is represented by a record containing a unique identifier and a set of associated attributes.
For this study we considered age, sex, quintile groups of index of multiple deprivation (QIMD) * , body mass index (BMI), systolic blood pressure (SBP), total plasma cholesterol (TC), diabetes mellitus † (DM, binary variable), smoking status (current/ex/never smoker), environmental tobacco exposure (ETS, binary variable), fruit and vegetable (F&V) consumption and physical activity ‡ (PA) as the set of associated attributes. A set of stochastic rules is then applied to these individuals, such as the probability of developing coronary heart disease (CHD) or dying, as the simulation advances in discrete annual steps. The output is an estimate of the burden of CHD and stroke in the synthetic population including both total aggregate change and, more importantly, the distributional nature of the change.
This allows, among other things, an investigation of the impact of different scenarios on social equity.
IMPACTNCD is a complex model that simulates the life course of synthetic individuals and consists of two modules: The 'population' module and the 'disease' module. Figure S1 highlights the steps of the algorithm that generate the life course of each synthetic individual. We will fully describe IMPACTNCD by describing the processes in each of these steps in the following chapters. The description is from an epidemiological rather than technical perspective. The source code and all parameter input files are available in https://github.com/ChristK/IMPACTncd/tree/CVD-policy-options under the GNU GPLv3 licence. Tables S1 and S2 summarise the sources of the input parameters and the main assumptions and limitations, respectively.

Technical requirements
IMPACTNCD is being developed in R v3.2.0 4 and is currently deployed on an 80 logical core server with 2TB of RAM running Scientific Linux v6.2. IMPACTNCD is built around the R package 'data.table' 5 , which adds a heavily optimised data structure to R. Most functions that operate on a data table have been coded in C to improve performance. Each iteration for each scenario runs independently on one of the CPU cores, and the R package 'foreach' 6 is responsible for the distribution of the jobs and collection of the results. To ensure statistical independence of the pseudo-random number generators running in parallel, the R package 'doRNG' 7 was used to produce independent random streams of numbers, generated by L'Ecuyer's combined multiple-recursive generator. 8
* QIMD is a measure of relative area deprivation based on the 2010 version of the Index of Multiple Deprivation. 3
† We defined as diabetics those with self-reported medically diagnosed diabetes (excluding pregnancy-only diabetes) or glycated haemoglobin (HbA1c) ≥ 6.5.
‡ Measured as days per week with 30 or more minutes of moderate or vigorous activity.
Figure S1 Simplified IMPACTNCD algorithm for individuals. For each step, the algorithm uses information from all appropriate previous steps. CHD denotes coronary heart disease.

Population module
The 'population' module consists of steps 1 to 4 in Figure S1. Synthetic individuals enter the simulation in the initial year (2011 for this study). The number of synthetic individuals that enter the simulation is user defined and for this study was set to 400,000. The algorithm ensures that the age, sex and QIMD distribution of the sample is similar to that of the English population in mid-2011. This concludes step 1, which only happens at the beginning of each simulation. Steps 2-7 are calculated annually (in simulation time) for each synthetic individual until the simulation horizon is reached, or death occurs.

Estimating exposure to risk factors (steps 2-3)
In steps 2 and 3, IMPACTNCD estimates the exposure of the synthetic individual to the modelled risk factors. It is essential that the risk profile of each synthetic individual is similar to the risk profiles that can be observed in the real English population. For this, we first built a 'close to reality' synthetic population of England from which we sampled the synthetic individuals. Then, we used generalised linear models (GLM) for each modelled risk factor to simulate individualised risk factor trajectories for all synthetic individuals.

Generating the 'close to reality' synthetic population for IMPACTNCD
The 'close to reality' synthetic population ensures that the sample of synthetic individuals for the simulation is drawn from a synthetic population similar to the real one in terms of age, sex, socioeconomic circumstance, and the conditional distributions of risk factors. In our implementation we used the statistical framework originally developed by Alfons et al. 9 and adapted it to make it compatible with epidemiological principles and frameworks.
In general, this method uses a nationally representative survey of the real population to generate a 'close to reality' synthetic population. The method thus expands the often small sample of the survey into a significantly larger synthetic population, while preserving the statistical properties and important correlations of the original survey.
The main advantages over other approaches are: 1) it takes into account the hierarchical structure of the sample design of the original survey, and 2) it can generate trait combinations which were not present in the original survey but are likely to exist in the real population. The second is particularly important, because it avoids bias from excessive repetition of specific combinations of traits present in the original survey that results from multilevel stratification of a relatively small sample. individuals with a BMI between 35 and 40. This is possible because the synthetic population is produced by drawing from conditional distributions that were estimated from multinomial models fitted to the original survey data. The detailed statistical methodology and justification can be found elsewhere. 9
Our approach consists of four stages, of which the first is shared with the original method described by Alfons et al. 9 The following stages have been adapted in order to be compatible with the widely accepted 'wider determinants of health' framework. 10 The main notion of this framework is that upstream factors, such as socioeconomic conditions, influence individual behavioural risk factors (e.g. diet, smoking), which in turn influence individual downstream risk factors such as systolic blood pressure and total cholesterol. The four stages are:
1. Setup of the household structure.
2. Generate the socioeconomic variables.
3. Generate the behavioural variables.
4. Generate the biological variables.
In each stage, information from all previous stages is used. All the variables of the synthetic population for this study were informed by the Health Survey for England 2011 (HSE11). 11,12 The R language for statistical computing v3.2.0 and the R package 'simPopulation' v0.4.1 were used to implement the method. 4,13

STAGE 1: HOUSEHOLD STRUCTURE
The household size, and the age and sex of the individuals in each household that have been recorded in HSE11 were used to inform the synthetic population, stratified by Strategic Health Authority (SHA) * .

STAGE 2: SOCIOECONOMIC VARIABLES
Once the basic age, sex, household and spatial information of the synthetic population was generated, other socioeconomic information was built up. QIMD for each synthetic individual was generated dependent on the household size and the age and sex of the individuals, stratified by SHA. Then, the equivalised income quintile group 14 (EQV5) for each household was generated, dependent on 5-year age groups and sex, stratified by QIMD. Finally, the employment status of the head of the household (HPNSSEC8) was generated using the National Statistics Socio-Economic Classification 15 , dependent on 5-year age groups, sex and EQV5, stratified by QIMD.

STAGE 3: BEHAVIOURAL VARIABLES
* SHAs were 10 large geographic areas, part of the structure of the National Health Service in England before 2013. SHA is the only variable with spatial information in HSE11 and it was used as a proxy, to include some spatial information to the synthetic population.
In this stage, behavioural variables such as F&V portions per day, days achieving more than 30 min of moderate or vigorous PA per week, * smoking status, and exposure to ETS were generated, dependent on 5-year age groups, sex, HPNSSEC8 and EQV5, stratified by QIMD. Moreover, statins and antihypertensive medication use (two separate binary variables) were generated, dependent on 5-year age groups, sex and HPNSSEC8, stratified by QIMD. Other smoking related variables, such as cigarettes smoked per day for smokers, years since cessation for ex-smokers and pack-years for ever-smokers, were also generated in this step.

STAGE 4: BIOLOGICAL VARIABLES
The last stage is the generation of the biological variables. Widely accepted causal pathways that have been observed in cohort studies were used to identify associations between biological and behavioural variables. F&V consumption was used as a proxy for a healthy diet. Citations refer to specific evidence regarding the associations. BMI is associated with SBP 16-19 , TC 20 and DM 21 . Thus, BMI was the first to be generated in the synthetic population, dependent on 5-year age groups, sex, EQV5, F&V consumption 22 and PA 22-24 , stratified by QIMD. Then, DM was generated dependent on 5-year age groups, sex, HPNSSEC8 and QIMD, stratified by BMI deciles. The TC was generated dependent on 5-year age groups, sex, deciles of BMI, use of a statin and F&V consumption, stratified by QIMD.
Similarly, for the SBP the 5-year age groups, sex, deciles of BMI, smoking status 25,26 and deciles of salt consumption were used as predictors, stratified by QIMD. Socioeconomic variables were used as predictors for both behavioural and biological variables to allow for possible interaction between socioeconomic and behavioural variables.
In the end, a synthetic population of 55 million synthetic individuals was generated, with characteristics similar to the non-institutionalised population of England in 2011. The synthetic population was validated against the original HSE11 sample (see p35, Synthetic population validation).

IMPACTNCD implementation of individualised risk factor trajectories
IMPACTNCD only applies the previous process for the initial year of the simulation. As the simulation evolves over time, all variables are recalculated to take into account age and period effects. This feature justifies the classification of IMPACTNCD as a dynamic microsimulation. The process depends on the nature of each variable and the available information, but generally it uses HSE01-HSE12 12,27-37 to capture the time trends by age, sex, and QIMD and project them into the future.
* For PA, HSE2012 was used as HSE2011 did not measure PA.

AGE, SEX AND SOCIOECONOMIC VARIABLES
As the simulation progresses in annual cycles, the age of the synthetic individuals in the model increases by one year in each loop. The sex and socioeconomic variables remain unchanged. Therefore, social mobility is not simulated in the current version of IMPACTNCD.

FRUIT & VEG CONSUMPTION AND PHYSICAL ACTIVITY
Both F&V consumption (portions/day) and PA (days with more than 30 min of moderate or vigorous activity/week) were modelled as ordinal factor variables. A proportional odds logistic regression model * was fitted to the HSE01, HSE02, HSE04-11 individual level data with F&V consumption as the dependent variable and year, 2nd degree polynomial of age, sex, QIMD and their 1st order interactions as the independent variables.
For PA, a similar model † was fitted to the HSE06, HSE08 and HSE12 data. These models were used for individual level predictions about the synthetic individuals as the simulation evolved.
Tables S3 and S4 present coefficients of the two models. Footnotes have the direct links to the actual R objects in the GitHub repository.

SMOKING
The 'close to reality' synthetic population is an accurate snapshot of active, ex-, and never smokers in 2011, as it was observed in HSE11. Then IMPACTNCD uses transitional probabilities for smoking initiation ‡ , cessation § , and relapse ** , to generate and record smoking histories of the synthetic individuals. For smoking initiation and cessation probabilities, logistic regression models were fitted in HSE data with age, sex, and QIMD as the independent variables. A similar approach was followed for relapse probabilities with years since cessation, sex and QIMD as the independent variables. Tables S5, S6, and S7 present coefficients of the models. Footnotes have the direct links to the actual R objects in the GitHub repository.

ENVIRONMENTAL TOBACCO SMOKING
For ETS we assumed a linear relation between smoking prevalence and ETS, stratified by QIMD. The models are estimated dynamically during the simulation. We assumed no intercept; when smoking prevalence reaches 0, ETS prevalence will be 0 too.
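The no-intercept assumption can be illustrated with a minimal sketch. The function names and the prevalence figures below are hypothetical and chosen only for illustration; they are not taken from IMPACTNCD or from HSE data. The sketch fits a least-squares line through the origin for one QIMD group, so that predicted ETS prevalence is 0 when smoking prevalence is 0.

```python
def fit_slope_through_origin(smoking_prev, ets_prev):
    """Least-squares slope for ets = slope * smoking (no intercept term)."""
    numerator = sum(x * y for x, y in zip(smoking_prev, ets_prev))
    denominator = sum(x * x for x in smoking_prev)
    return numerator / denominator


def predict_ets(smoking_prevalence, slope):
    """Predicted ETS prevalence; equals 0 when smoking prevalence is 0."""
    return slope * smoking_prevalence


# Hypothetical prevalences for a single QIMD group over four years.
smoking = [0.30, 0.27, 0.25, 0.22]
ets = [0.15, 0.135, 0.125, 0.11]

slope = fit_slope_through_origin(smoking, ets)
print(round(slope, 3), predict_ets(0.0, slope))
```

In the model this relation would be re-estimated separately for each QIMD group as the simulation advances.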
Secondly, because the variance of the risk factor distributions increases with age and we wanted to model this phenomenon. Below we describe the stages:
Figure S2 illustrates the previous example. Although individuals preserve their percentile for the respective risk factor throughout the simulation (vertical position in Figure S2), this stage remains stochastic, because each time this stage is implemented a different sample from the synthetic population is drawn. Finally, the distance from the mean for each risk factor is calculated, stratified by 5-year age group, sex, and QIMD. For instance, if a synthetic individual has an SBP of 140 mmHg and the mean SBP in the respective group of the same age group, sex and QIMD is 130 mmHg, the distance from the mean is 140 - 130 = 10 mmHg.

Stage 2:
Similarly to the approach followed for other variables, we fitted regression models to the HSE01-12 data. For BMI † , year, age, sex, QIMD and PA were the independent variables. For SBP ‡ , year, age, sex, QIMD, smoking status, BMI, and PA were the independent variables. Finally, for TC § , year, age, sex, QIMD, BMI, F&V consumption and PA were the independent variables. * These models are used to predict the mean of the relevant group. These predicted means are then added to the distances calculated in the previous stage. The result is the final value of the relevant risk factor that will be used for risk estimation. Tables S8, S9, and S10 present coefficients of the models. Footnotes have the direct links to the actual R objects in the GitHub repository.
* For the percentile rank the formula $p = (R - 1)/(n - 1)$ is used, where $p$ is the percentile rank and $R = (R_1, \dots, R_n)$ is the rank vector constructed from a random observation vector $(X_1, \dots, X_n)$. In IMPACTNCD specifically, vector $X$ is constructed from the subset of the respective continuous risk factor values, by 5-year age group, sex and QIMD, for each year of the simulation.
† https://github.com/ChristK/IMPACTncd/blob/CVD-policy-options/Lagtimes/bmi.svylm.rda
‡ https://github.com/ChristK/IMPACTncd/blob/CVD-policy-options/Lagtimes/sbp.svylm.rda
§ https://github.com/ChristK/IMPACTncd/blob/CVD-policy-options/Lagtimes/chol.svylm.rda
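The two-stage logic for continuous risk factors can be sketched as follows. This is a simplified illustration, not the IMPACTNCD code: the function names are hypothetical, the group is assumed to have at least two members and no tied values, and the predicted group mean (128 mmHg) is an arbitrary number standing in for the output of a fitted regression model.

```python
from statistics import mean


def percentile_ranks(values):
    """Percentile rank p = (R - 1) / (n - 1), where R is the rank (1 = smallest).

    Assumes n > 1 and no ties (an assumption of this sketch).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return [(r - 1) / (n - 1) for r in ranks]


def distances_from_mean(values):
    """Stage 1: distance of each individual from the mean of their group."""
    group_mean = mean(values)
    return [v - group_mean for v in values]


def final_values(distances, predicted_mean):
    """Stage 2: add the model-predicted group mean to each distance."""
    return [predicted_mean + d for d in distances]


# One hypothetical group (same 5-year age group, sex and QIMD), SBP in mmHg.
sbp = [120.0, 130.0, 140.0]
ranks = percentile_ranks(sbp)
distances = distances_from_mean(sbp)       # the 140 mmHg individual is +10
projected = final_values(distances, 128.0)
print(ranks, distances, projected)
```

The individual with an SBP of 140 mmHg keeps a distance of 10 mmHg from the group mean, so a predicted group mean of 128 mmHg gives a final value of 138 mmHg.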

DIABETES MELLITUS
As with smoking, the 'close to reality' synthetic population is an accurate snapshot of diagnosed and non-diagnosed diabetics in 2011, as observed in HSE11. We assumed DM is an incurable chronic condition. IMPACTNCD uses the QDiabetes algorithm (formerly QDScore), which has been validated for the English population, to calculate the annual transitional probabilities of non-diabetic synthetic individuals developing DM. 38
* As before, the independent variables for each risk factor were selected based on known associations from longitudinal studies. Therefore, only the magnitude of the association is informed by cross-sectional data and possibly attenuated due to reverse causality.

Lag times
All the functions described above for risk factor trajectories include time and age (in years) among the independent variables. Therefore, lag times can potentially be considered on a per risk factor basis. For instance, let us consider a 50-year-old synthetic individual in 2015 and an assumed lag time of 5 years for F&V. When IMPACTNCD calculates the probabilities for F&V consumption of this individual, it will use time - (lag time) = 2015 - 5 = 2010 and age - (lag time) = 50 - 5 = 45. So, when the 'disease' module of IMPACTNCD uses the exposure to F&V to estimate a disease incidence transitional probability, the lagged exposure will be used.
For the sake of simplicity, in this study we assumed that the lag time between exposure and CVD is 5 years. 39-41 The lag time was roughly informed by risk reversibility trials, when available, or by the median observation times of the cohort studies we used to inform the risk magnitude for each risk factor.
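The lag-time arithmetic in the example above can be written as a short sketch. The function name is hypothetical and not part of IMPACTNCD; it only shows which time and age would be passed to a risk factor model for a given lag.

```python
def lagged_inputs(year, age, lag_years):
    """Return the (year, age) pair used to evaluate a lagged exposure."""
    return year - lag_years, age - lag_years


# A 50-year-old in 2015 with an assumed 5-year lag for F&V consumption.
lagged_year, lagged_age = lagged_inputs(2015, 50, 5)
print(lagged_year, lagged_age)  # 2010 45
```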

Birth engine (Step 4)
The Office for National Statistics (ONS) principal-assumption fertility projections for England are used to estimate the number of new synthetic individuals entering the model through birth, in every simulated year. 42 The birth engine only becomes important for simulations featuring a horizon of more than 30 years. Therefore, we do not describe this step in detail here.

CHAPTER 2. DISEASE MODULE
The disease module contains the last three steps of the model (Figure S1). The risk (probability) for each synthetic individual aged 30-84 to develop each of the modelled diseases is estimated in step 5, conditional on the exposure to relevant risk factors. The step ends by selecting synthetic individuals to develop the modelled diseases. Finally, in steps 6 and 7 the risk of dying from one of the modelled diseases or any other cause is estimated and applied. Steps 2 to 7 are then repeated for the surviving individuals until the simulation horizon is reached.

Estimating the annual individualised disease risk and incidence (Step 5)
In order to estimate the individualised annual probability of a synthetic individual developing a specific disease, conditional on his/her relevant risk exposures, we follow a 3-stage approach:
1. The proportion of incidence attributable to each modelled risk factor by age group and sex is estimated, assuming a specific time lag.
2. Assuming multiplicative risks, the portion of the disease incidence attributable to all the modelled risk factors is estimated and subtracted from the total incidence.
3. For each individual in the synthetic population, the probability of developing the disease is estimated and then used in an independent Bernoulli trial to select those who finally develop the disease.
Next, the implementation of the above method is described in more detail using CHD as an example.
The same process is used for stroke.

Stage 1
The population attributable fraction (PAF) is an epidemiological measure that estimates the proportion of the disease attributable to an associated risk factor. 43 It depends on the relative risk associated with the risk factor and the prevalence of the risk factor in the population. Specifically, for each modelled binary risk factor associated with CHD, PAF was calculated by 5-year age group and sex using the formula:
$$PAF = \frac{p\,(RR - 1)}{1 + p\,(RR - 1)}$$
where $p$ is the prevalence of the risk factor in the population, and $RR$ is the relative risk of the risk factor. For categorical risk factors with $n$ levels of exposure, we used the formula:
$$PAF = \frac{\sum_{i=1}^{n} p_i\,(RR_i - 1)}{1 + \sum_{i=1}^{n} p_i\,(RR_i - 1)}$$
where $p_i$ is the prevalence of the risk factor at level $i$ in the population and $RR_i$ is the relative risk associated with level $i$ of exposure. The same formula was used to approximate the PAF of continuous risk factors because they behave like discrete variables in the model. Consistent with findings from the respective meta-analyses that were used for IMPACTNCD, SBP below 115 mmHg, TC below 3.8 mmol/l and BMI below 20 kg/m² were considered to have a relative risk of 1. Similarly, consumption of 8 or more portions of F&V and 5 or more days with more than 30 minutes of moderate to vigorous activity per week were also considered to have a relative risk of 1.
For the estimation of prevalence of risk factors, we used their prevalence in the synthetic population, taking into account any assumed lag time. All the relative risks were taken from published meta-analyses and cohort studies.
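As an illustration of stage 1, the sketch below computes the PAF in its standard (Levin) form, p(RR - 1) / (1 + p(RR - 1)), and its multi-level generalisation. It is assumed here that this is the form intended in the text. The function names and the numerical inputs are hypothetical and do not come from the IMPACTNCD input files.

```python
def paf_binary(prevalence, relative_risk):
    """PAF for a binary risk factor: p(RR - 1) / (1 + p(RR - 1))."""
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


def paf_categorical(prevalences, relative_risks):
    """PAF for a risk factor with several exposure levels.

    prevalences[i] and relative_risks[i] refer to the same exposure level;
    a level with relative risk 1 contributes nothing to the sum.
    """
    excess = sum(p * (rr - 1.0) for p, rr in zip(prevalences, relative_risks))
    return excess / (1.0 + excess)


# Hypothetical inputs: 20% exposed with RR = 2, and a three-level factor.
paf_example = paf_binary(0.20, 2.0)
paf_levels = paf_categorical([0.5, 0.3, 0.2], [1.0, 1.5, 2.0])
print(round(paf_example, 4), round(paf_levels, 4))
```

With a single exposed level the categorical form reduces to the binary one, which is a quick consistency check on any implementation.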

Stage 2
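Assuming multiplicative risk factors with no interactions, the incidence of CHD not attributable to the modelled risk factors can be estimated by the formula:
$$I_{\text{Theoretical Minimum}} = I_{\text{Observed}} \times (1 - PAF_1) \times (1 - PAF_2) \times \dots \times (1 - PAF_m)$$
where $I_{\text{Observed}}$ is the CHD incidence and $PAF_{1 \dots m}$ are the PAF of each risk factor from stage 1.
$I_{\text{Theoretical Minimum}}$ represents CHD incidence if all the modelled risk factors were at optimal levels.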
The theoretical minimum incidence is calculated by 5-year age group and sex only in the initial year of the simulation and it is assumed stable thereafter.
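A minimal sketch of stage 2 follows, assuming the product form implied by multiplicative risks with no interactions: the observed incidence multiplied by (1 - PAF) for each modelled risk factor. The function name and the numbers are hypothetical.

```python
from math import prod


def theoretical_minimum_incidence(observed_incidence, pafs):
    """Incidence remaining after removing the share attributable to each factor.

    Assumes multiplicative, non-interacting risk factors (as stated in the text).
    """
    return observed_incidence * prod(1.0 - paf for paf in pafs)


# Hypothetical age-sex group: observed incidence 1% and two risk factors.
minimum = theoretical_minimum_incidence(0.01, [0.2, 0.5])
print(round(minimum, 6))  # 0.01 * 0.8 * 0.5
```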

Stage 3
Assuming that $I_{\text{Theoretical Minimum}}$ is the baseline annual probability of a synthetic individual developing CHD for a given age and sex due to risk factors not included in the model (e.g. genetics), the individualised annual probability of developing CHD, $\mathbb{P}(\text{CHD} \mid \text{age}, \text{sex}, \text{exposures})$, given his/her risk factors, was estimated by the formula:
$$\mathbb{P}(\text{CHD} \mid \text{age}, \text{sex}, \text{exposures}) = I_{\text{Theoretical Minimum}} \times RR_1 \times RR_2 \times \dots \times RR_m$$
where $RR_{1 \dots m}$ are the relative risks that are related to the specific risk exposures of the synthetic individual, same as in stage 1. Depending on data availability, this method can be further stratified by QIMD; however, data were not available for this in the current study.
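Stage 3 can be sketched as below: the baseline (theoretical minimum) probability is multiplied by the relative risks matching the individual's exposures, and the result feeds an independent Bernoulli trial. The function names are hypothetical, the numbers are illustrative, and capping the probability at 1 is an assumption of this sketch rather than something stated in the text.

```python
import random
from math import prod


def individual_probability(theoretical_minimum, relative_risks):
    """Annual probability of disease given the individual's relative risks."""
    probability = theoretical_minimum * prod(relative_risks)
    return min(probability, 1.0)  # cap at 1: an assumption of this sketch


def develops_disease(probability, rng):
    """Independent Bernoulli trial with the given success probability."""
    return rng.random() < probability


rng = random.Random(2011)  # fixed seed so the sketch is reproducible
p_chd = individual_probability(0.004, [1.5, 2.0])
new_cases = sum(develops_disease(p_chd, rng) for _ in range(100_000))
print(p_chd, new_cases)
```

Applied to 100,000 identical synthetic individuals, the number of new cases should be close to 100,000 times the individual probability, with the scatter expected from independent trials.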
The above method can be used only when the incidence of the disease in the population is known.
The true incidence of CHD (and stroke), though, is largely unknown. Several estimates exist; nonetheless, all have limitations. Therefore, for the estimation of CHD incidence by age and sex we opted for a modelling solution to synthesise all the available sources of information and minimise bias.
Specifically, we used ONS CHD mortality (ICD10 I20-I25) for England in 2011, 44 self-reported prevalence of CHD from HSE11, incidence of angina from primary care data 45 and incidence of acute myocardial infarction (AMI) from mortality and hospital statistics 46 to inform the World Health Organisation (WHO) DISMOD II model. 47 DISMOD II is a multi-state life table model that is able to estimate the incidence, prevalence, mortality, fatality and remission of a disease, when information about at least three of these indicators is available. A similar approach has been followed by the Global Burden of Disease team and others. 48,49 We considered CHD an incurable chronic disease (i.e. remission rate was set to 0); therefore, the derived DISMOD II incidence refers to the first ever manifestation of angina or AMI excluding any recurrent episodes. For the DISMOD II calculations, we assumed that incidence and case-fatality had been declining by 3% (relative), over the last 20 years.
The derived CHD incidence, prevalence, and fatality were used as an input for IMPACTNCD. A similar approach was used for stroke.
For the initial year of the simulation, some synthetic individuals need to be allocated as prevalent cases for each of the modelled diseases. DISMOD II model 47 is used again to estimate the number of prevalent cases of the disease by age and sex. Then, the estimated number of prevalent cases are sampled independently from the individuals in the population with weights proportional to their relevant exposures to the associated risk factors.
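One way to allocate prevalent cases with probability weights is sketched below. It uses the Efraimidis-Spirakis key method for weighted sampling without replacement, which is a stand-in chosen for this illustration and not necessarily the sampling routine used in IMPACTNCD. The function name and the weights are hypothetical; in practice the weights would reflect each individual's exposure to the relevant risk factors.

```python
import random


def sample_prevalent_cases(weights, n_cases, rng):
    """Pick n_cases distinct indices with chances that increase with weight.

    Each individual gets the key u ** (1 / w) with u uniform on [0, 1);
    the n_cases largest keys are selected. Zero-weight individuals are skipped.
    """
    keys = []
    for index, weight in enumerate(weights):
        if weight > 0:
            keys.append((rng.random() ** (1.0 / weight), index))
    keys.sort(reverse=True)
    return [index for _, index in keys[:n_cases]]


rng = random.Random(42)
exposure_weights = [1.0, 1.0, 2.5, 0.0, 4.0, 1.2]  # hypothetical weights
cases = sample_prevalent_cases(exposure_weights, 3, rng)
print(sorted(cases))
```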

Simulating disease histories (Step 6)
In the current stage of development, IMPACTNCD does not contain a detailed disease history module.
However, Step 6 is used to simulate significant aspects of the disease. For CVD, this was used to simulate the observable spike of short-term (30 days) mortality after the first event of AMI or stroke.
Data about short term mortality were used from the 'Coronary heart disease statistics 2012 edition' report. 45

Simulating mortality (Step 7)
All synthetic individuals are exposed to the risk of dying from any of their acquired modelled diseases or any other non-modelled cause. However, the algorithm behaves differently depending on the age and life course trajectory of the synthetic individual.
For ages 0 to 29, we used all-cause mortality rate by age, sex, and QIMD to inform an independent Bernoulli trial and select synthetic individuals that die every year. For years 2011 to 2013 we used the observed mortality rates as were reported from ONS. 44 For years after 2013, functional demographic models by sex and QIMD were fitted to the ONS reported annual mortality rates, from years 2002 to 2013, and then they were projected to the simulation horizon using the R package 'demography'. 50 Functional demographic models are generalisations of the Lee-Carter demographic model, influenced by ideas from functional data analysis and non-parametric smoothing. 51 The same approach as above was followed for synthetic individuals aged 85 to 100. We considered a mortality rate of 1 for all synthetic individuals reaching the age of 100. Hence, IMPACTNCD maximum synthetic individual age is 100 years.
Finally, for synthetic individuals with ages between 30 and 84 the all-cause mortality was decomposed into modelled-diseases specific mortality and any-other cause mortality. The former applies only to the prevalent cases of each modelled disease in the synthetic population. For this, case-fatality rates by age and sex were estimated by DISMOD II for each modelled disease, as described before. Then case-fatality rates are used in a Bernoulli trial to select prevalent cases that die from the disease in a year.
For the any-other cause mortality, a process similar to the one described for ages 0 to 29 and 85 to 100 is followed. However, this time CHD and stroke specific mortality are removed from the observed mortality and mortality projections to avoid double counting.
The case mortality and fatality rates are further parametrised and individualised based on established epidemiological evidence. The 'male British doctors' and DECODE studies have shown that smokers and diabetics have increased overall mortality even when CVD was excluded. 52,53 IMPACTNCD adjusts for that by inflating the any-other cause mortality rate for smokers and diabetics and deflating it for non-smokers and non-diabetics, while constraining the sum to remain the same as before the adjustments. Furthermore, we assumed that CVD case-fatality is improving by 3%, and that there is a constant case-fatality socioeconomic gradient of approximately 5% by QIMD level (halved for ages over 70) for CHD, and 2% for stroke. The socioeconomic gradient forces the more deprived to experience worse disease outcomes. These assumptions are based on empirical evidence. 45 Finally, synthetic individuals who remain alive after this step progress to the next year and start again from step 2, unless the simulation horizon has been reached.
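The inflate-and-deflate adjustment can be sketched for a single exposure such as smoking. This is one possible reading of the constraint, assumed for illustration: the rates for exposed and unexposed individuals are chosen so that their prevalence-weighted average equals the original group rate. The function name, the rate ratio and the prevalence are hypothetical.

```python
def split_mortality(rate, exposed_share, rate_ratio):
    """Split a group mortality rate into exposed and unexposed rates.

    The exposed rate is rate_ratio times the unexposed rate, and the
    prevalence-weighted average of the two equals the original rate.
    """
    unexposed = rate / (exposed_share * rate_ratio + (1.0 - exposed_share))
    exposed = unexposed * rate_ratio
    return exposed, unexposed


# Hypothetical group: any-other cause mortality 1%, 20% smokers, rate ratio 2.
smoker_rate, non_smoker_rate = split_mortality(0.01, 0.20, 2.0)
average = 0.20 * smoker_rate + 0.80 * non_smoker_rate
print(round(smoker_rate, 5), round(non_smoker_rate, 5), round(average, 5))
```

Because the weighted average is unchanged, the expected number of deaths in the group stays the same while its distribution shifts towards the exposed.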

CHAPTER 3. SCENARIOS
The method described above is used for the baseline scenario. In general, primary prevention interventions or policies can then be modelled as counterfactual scenarios, through their estimated effects on the relevant risk factors. The modelled scenarios are generated from alterations of the baseline scenario, mainly in three ways:
1. Population-wide interventions can be modelled by altering the intercept or the coefficients of the regression equations that are used to estimate risk factor exposures. For example, when continuous risk factors are considered, adding to or subtracting from the intercept increases or decreases the related risk factor for each synthetic individual and, therefore, the mean of the risk factor for the whole population. Altering the year coefficient accelerates, decelerates or reverses the trend for the whole population. Likewise, altering the QIMD coefficients and/or the coefficient of the interaction between year and QIMD can simulate differential effects and trends by QIMD. A similar approach can sometimes be used for the non-continuous risk factors. The benefit is that, by altering just a few parameters, the changes are translated down to individual level characteristics in a computationally efficient way.
2. Targeted interventions can be modelled by selecting synthetic individuals with a specific trait or combination of traits and applying an intervention to them by changing their attributes. For example, to simulate the effect of statins, a simple approach would be to randomly select 30% of the synthetic individuals with TC higher than 4 mmol/l not currently on statins, and apply a 25% reduction of their TC between steps 4 and 5 (Figure S1).
3. Some hybrid combination of the previous methods, or some 'exotic' approaches such as stopping time at a specific year or running it backwards to simulate disaster scenarios.
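The first two approaches can be sketched as follows. The data layout (a dictionary of coefficients and a list of dictionaries for individuals), the function names and the coefficient values are hypothetical; only the 30% share, the 4 mmol/l threshold and the 25% reduction come from the statin example above.

```python
import random


def shift_intercept(coefficients, delta):
    """Population-wide intervention: copy of the model with a shifted intercept."""
    shifted = dict(coefficients)
    shifted["intercept"] += delta
    return shifted


def apply_statins(population, rng, share=0.30, tc_threshold=4.0, reduction=0.25):
    """Targeted intervention: treat a random share of eligible individuals."""
    eligible = [p for p in population
                if p["tc"] > tc_threshold and not p["on_statin"]]
    treated = rng.sample(eligible, round(share * len(eligible)))
    for person in treated:
        person["tc"] *= 1.0 - reduction
        person["on_statin"] = True
    return len(treated)


# Hypothetical SBP model and a small synthetic population (TC in mmol/l).
sbp_model = {"intercept": 120.0, "year": -0.3, "age": 0.4}
lower_sbp_model = shift_intercept(sbp_model, -2.0)

people = [{"tc": 5.0, "on_statin": False} for _ in range(10)]
people.append({"tc": 3.5, "on_statin": False})  # below threshold, not eligible
n_treated = apply_statins(people, random.Random(7))
print(lower_sbp_model["intercept"], n_treated)
```

The first function changes a few parameters and lets the regression propagate the effect to every individual, while the second edits the attributes of selected individuals directly.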
In the following paragraphs, we highlight some details of the scenarios that we used in the main paper.
They are meant to be read in conjunction with the scenario description in the main text.

Universal screening
This was a typical targeted intervention, so this scenario was built with the second approach, as described above. The high-risk synthetic participants eligible for treatment were selected based on the QRISK2 score. 54 The score requires extra information about the synthetic individual that was not originally modelled and at the current stage is used exclusively for the calculation of the QRISK2 score.
This includes information about ethnicity, specific type of diabetes (I or II), family history of CVD, chronic kidney disease (stage 4 or 5), atrial fibrillation, rheumatoid arthritis, and the TC/HDL ratio. To model these extra attributes for the synthetic individuals we fitted multinomial, logistic or generalised linear regression models to HSE data, and then we used the models to predict synthetic individuals' status. Exceptions, to this approach were type I diabetes and rheumatoid arthritis prevalence. We assumed a prevalence of 0.5% for type I diabetes and we extracted age and sex specific rheumatoid arthritis prevalence from published data. 55 To simulate ethnicity of synthetic individuals, a multinomial model was fitted to HSE data with 5-year age group, sex and QIMD as the independent variables. * To simulate family history of CVD, a logistic regression model was fitted in HSE06 data that contained this information, with age and QIMD as the independent variables. † For the prevalence of atrial fibrillation, a logistic regression model was fitted Where Prescription is a binary variable (whether Atorvastatin was prescribed (1), or not (0)), Persistence is a binary variable (whether the synthetic individual continue with the medication (1), or not (0)), and Adherence is a value between 0 and 1 modelling the proportion of daily dose taken. For these variables, values where drawn from distributions (Table S14).
A similar approach was used for antihypertensive medication. Given the numerous antihypertensive treatment combinations, we assumed that medication could fully control hypertension for all synthetic individuals down to an SBP target of 115 mmHg. We applied the same approach as above to adjust treatment effectiveness for prescription, persistence, and adherence.
Information regarding medication prescription after a Health Check was extracted from Forster et al. 58 This study was conducted while the recommendation for primary prevention statin prescription was

Population-wide intervention
Many of the interventions in this scenario were modelled by altering the coefficients of the models that were used to estimate the attributes of the synthetic individuals. Specifically, this approach was followed for BMI and SBP. Smoking and F&V consumption interventions were modelled by altering the attributes of synthetic individuals after they were estimated in step 2 (Figure S1). Given the existing limitations in measuring the direct effect of a structural population-wide intervention, we inflated the uncertainty around the inputs we used for this scenario (Table S14).

Risk factors trajectories
The effects of all modelled scenarios on population risk factors are summarised in the following graphs, for ages 30 to 84.

CHAPTER 4. SENSITIVITY ANALYSIS
Here we present the full output of the three scenarios that were produced as variations of the main scenarios with modified assumptions; namely the '20% treatment threshold universal screening', the 'socioeconomic differential uptake universal screening', and the 'diet-only population-wide intervention'. Tables S11, S12, and S13 summarise the results.

CHAPTER 5. UNCERTAINTY
IMPACTNCD implements a second-order Monte Carlo approach to estimate uncertainty distributions of the outputs for each scenario. 60,61 Each simulation runs 1000 times. For each iteration, a different set of input parameters is used, by sampling from the respective distributions * of input parameters, and a different sample of the synthetic population is drawn. However, the scenarios are 'paired': the nth iteration of every scenario runs with the same set of input parameters and on the same initial synthetic population sample. † This explains why the uncertainty of between-scenario comparisons is considerably smaller than the uncertainty of isolated scenarios.
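The effect of pairing on between-scenario comparisons can be illustrated with a minimal sketch. This is not the IMPACTNCD code: the toy model, the function names `simulate_cases` and `run`, the parameter distribution and all numbers are illustrative assumptions, and the resampling of the synthetic population is omitted for brevity.

```python
import random
import statistics

def simulate_cases(baseline_risk, risk_reduction, population=100_000):
    """Toy model (hypothetical): expected cases for a given annual risk."""
    return population * baseline_risk * (1.0 - risk_reduction)

def run(n_iter=1000, paired=True, seed=1):
    """Return the per-iteration difference in cases between two scenarios."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_iter):
        # Parameter uncertainty: the baseline risk is sampled once per iteration.
        risk_a = rng.gauss(0.05, 0.01)
        # Paired design: the intervention scenario reuses the same draw.
        # Unpaired design: it receives an independent draw.
        risk_b = risk_a if paired else rng.gauss(0.05, 0.01)
        baseline = simulate_cases(risk_a, risk_reduction=0.0)
        intervention = simulate_cases(risk_b, risk_reduction=0.10)
        diffs.append(baseline - intervention)
    return diffs

paired_sd = statistics.stdev(run(paired=True))
unpaired_sd = statistics.stdev(run(paired=False))
print(f"SD of difference, paired: {paired_sd:.0f}; unpaired: {unpaired_sd:.0f}")
```

In this toy setting the spread of the paired difference is much smaller than that of the unpaired difference, because the shared parameter draw cancels out of the comparison.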
The framework allows stochastic uncertainty, parameter uncertainty and individual heterogeneity to be reflected in the reported uncertainty intervals (UI). The following example illustrates the different types of uncertainty that were considered in IMPACTNCD. Let us assume that the annual risk for CHD is 5%. If we apply this risk to all individuals and randomly draw from a Bernoulli distribution with p = 5% to select those who will manifest CHD, we only consider stochastic uncertainty. If we allow the annual risk for CHD to be conditional on individual characteristics (i.e. age, sex, exposure to risk factors), then individual heterogeneity is considered. Finally, when the uncertainty of the relative risks due to sampling errors is considered in the estimation of the annual risk for CHD, the parameter uncertainty is considered. Of these three types of uncertainty, only the parameter uncertainty can be reduced by better studies in the future.
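The three types of uncertainty in the CHD example can be sketched as layers of one toy simulation. Only the 5% annual risk comes from the text; the single binary risk factor, its 20% prevalence, the relative risk of 2.0, its lognormal spread and the function name `simulate_chd_cases` are assumptions made for illustration.

```python
import random

def simulate_chd_cases(n_people, seed, heterogeneity=False,
                       parameter_uncertainty=False):
    """Count simulated CHD cases in one year under the chosen uncertainty layers."""
    rng = random.Random(seed)
    mean_risk = 0.05       # annual CHD risk from the example in the text
    exposed_share = 0.2    # hypothetical prevalence of a binary risk factor
    rr = 2.0               # hypothetical relative risk for exposed individuals
    if parameter_uncertainty:
        # Parameter uncertainty: the relative risk is itself sampled from a
        # lognormal distribution with median 2.0 (the spread of 0.1 is assumed).
        # It only affects individual risks when heterogeneity is enabled.
        rr *= rng.lognormvariate(0.0, 0.1)
    # Risk of the unexposed, chosen so that the population mean risk stays 5%.
    unexposed_risk = mean_risk / (1.0 - exposed_share + exposed_share * rr)
    cases = 0
    for _ in range(n_people):
        risk = mean_risk
        if heterogeneity:
            # Individual heterogeneity: risk depends on personal exposure.
            exposed = rng.random() < exposed_share
            risk = unexposed_risk * (rr if exposed else 1.0)
        # Stochastic uncertainty: a Bernoulli draw decides who develops CHD.
        if rng.random() < risk:
            cases += 1
    return cases

stochastic_only = simulate_chd_cases(10_000, seed=1)
with_heterogeneity = simulate_chd_cases(10_000, seed=1, heterogeneity=True)
all_three = simulate_chd_cases(10_000, seed=1, heterogeneity=True,
                               parameter_uncertainty=True)
print(stochastic_only, with_heterogeneity, all_three)
```

Repeating each call with different seeds would give the output distribution attributable to the layers that are switched on.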
Due to lack of information and for computational efficiency, not all three types of uncertainty are considered in every step (Figure S1) of IMPACTNCD. Specifically, stochastic uncertainty is included in every step, individual heterogeneity in every step except 1 and 4, and parameter uncertainty in step 5.
Of course, parameter uncertainty of scenario targets is also estimated in steps 2 and 3.
The structure of the model is grounded on fundamental epidemiological ideas and well-established causal pathways; therefore, we considered structural uncertainty relatively small and did not study it. However, mortality from each of the modelled diseases and from any other cause (steps 6 and 7) is calculated serially, one modelled disease at a time. To avoid the bias that this approach might introduce, the order of the modelled diseases in each mortality estimation is randomised.
* We assumed lognormal distributions for relative risks and hazard ratios, normal distributions for coefficients of regression equations, and PERT distributions for scenario-specific and other parameters. Specifically for relative risks and hazard ratios, the distributions were bounded above 1 when the mean was above 1 and vice versa.
† Individual life-course trajectories, however, are not paired. The same normotensive individual may develop hypertension under scenario 'A' but not under scenario 'B' due to chance, and not as a direct effect of the scenarios.
From our experience in communicating our results to policy makers and researchers, we realised that they tend to misinterpret 95% UIs as 95% confidence intervals (CI) and overlapping UIs as 'evidence against statistical significance'. This does not apply in our model because the scenarios share common sources of uncertainty, as explained above; therefore, scenarios are not independent. We decided to present medians and interquartile ranges (IQRs) precisely to avoid this confusion between UIs and CIs. We hope that readers will mentally visualise the distribution from medians and IQRs rather than attempt to apply frequentist statistical inference and hypothesis testing rules, which do not apply in this particular situation. In any case, all our output distributions were approximately normal and their standard deviation can be approximated by dividing the IQR by 1.35. Then, z scores can be used to approximate any probability of the UI.
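The normal approximation described above can be sketched as follows. The input values are the median and IQR for deaths prevented or postponed under universal screening, as reported in the abstract; the function name `approx_normal` and the choice of summaries are illustrative assumptions.

```python
from statistics import NormalDist

IQR_TO_SD = 1.35  # the IQR of a normal distribution spans about 1.35 SDs

def approx_normal(median, q1, q3):
    """Approximate an output distribution from its median and IQR."""
    return NormalDist(mu=median, sigma=(q3 - q1) / IQR_TO_SD)

# Universal screening: 3000 deaths prevented or postponed (IQR -1000 to 6000).
deaths = approx_normal(3000, -1000, 6000)

# Probability that the scenario prevents or postpones at least one death.
prob_benefit = 1.0 - deaths.cdf(0.0)

# Approximate 95% uncertainty interval implied by the same normal curve.
ui_low, ui_high = deaths.inv_cdf(0.025), deaths.inv_cdf(0.975)
print(f"P(deaths prevented > 0) = {prob_benefit:.2f}")
print(f"approximate 95% UI: {ui_low:.0f} to {ui_high:.0f}")
```

Under these assumptions the probability of a net benefit is roughly 0.72. Because the reported quartiles are rounded and not perfectly symmetric around the median, such figures are only rough guides.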

CHAPTER 6. EQUITY METRICS

Absolute and relative equity slope index
The 'absolute equity slope index' and the 'relative equity slope index' are two regression-based metrics that measure the impact of the modelled interventions on absolute and relative socioeconomic health inequalities. They are inspired by the slope index of inequality (SII) and the relative index of inequality (RII); 62 however, instead of directly measuring inequalities in a population, as SII and RII do, they measure the impact of an intervention on existing inequalities.
The basic principles of the metrics are illustrated in the following simplified example. Consider a population that consists of only two mutually exclusive and same-sized socioeconomic groups, the 'deprived' and the 'affluent'. The two groups experience different incidence of a disease; say, 50 and 10 incident cases among the deprived and the affluent, respectively, every year. Hence, the absolute socioeconomic inequality for disease incidence is 50 − 10 = 40 cases and the relative socioeconomic inequality is 50 / 10 = 5. If a hypothetical intervention 'A' prevents the same number of cases in both groups, absolute inequality will remain stable. Similarly, if intervention 'A' prevents more cases in the affluent group, absolute inequality will increase, and vice versa. For relative inequality to remain stable, the decrease in cases needs to be proportional to the observed number of cases. For example, a hypothetical intervention 'B' that reduces cases by 10% in each group will have no effect on relative inequality. If the proportional reduction is higher in the affluent group than in the deprived group, then relative inequality will increase, and vice versa.
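The arithmetic of the two-group example can be checked with a short sketch. The 50 and 10 incident cases and the 10% reduction come from the text; the figure of 5 cases prevented per group under intervention 'A' and the function name `inequalities` are assumptions for illustration.

```python
def inequalities(deprived, affluent):
    """Return (absolute, relative) inequality for two same-sized groups."""
    return deprived - affluent, deprived / affluent

# Before any intervention: 50 and 10 incident cases per year.
before = inequalities(50, 10)

# Intervention 'A' (hypothetical size): prevents 5 cases in each group.
after_a = inequalities(50 - 5, 10 - 5)

# Intervention 'B': prevents 10% of cases in each group.
after_b = inequalities(50 * 0.9, 10 * 0.9)

print("before:", before)
print("after A:", after_a)
print("after B:", after_b)
```

Intervention 'A' leaves the absolute inequality at 40 cases but raises the relative inequality from 5 to 9, whereas intervention 'B' lowers the absolute inequality to 36 cases and leaves the relative inequality at 5.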
As in many real-world examples, IMPACTNCD uses QIMD to classify the population into five socioeconomic groups of unequal sizes. In this case, SII and RII can be used to measure absolute and relative socioeconomic inequalities in health, respectively. The same principles about intervention effectiveness and inequalities described in the previous paragraph also apply here. If an intervention prevents an equal number of cases in all QIMD groups, SII will remain unchanged, while if the proportional reductions of cases in all QIMD groups are equal, RII will remain unchanged. * Inspired by SII and RII, the absolute equity slope index is the slope of the regression line fitted to the number of cases prevented or postponed by an intervention (dependent variable) on ridit scores 63 of QIMD (independent variable). Ridit scores reflect the average cumulative frequency of each QIMD group. † As in SII and RII, they are used to account for the different sizes of each QIMD group (the distribution of inequality), and allow for comparisons between populations. A positive slope means that the intervention prevents more cases in the more deprived QIMD groups and reduces absolute inequality in the population, and vice versa. The magnitude of the slope is proportional to the reduction in absolute inequality. The relative equity slope index is constructed and interpreted similarly, except that the proportion of cases prevented or postponed over the total cases in each socioeconomic group is the dependent variable, and it measures the effect on relative inequality.
* Assuming that the deaths prevented by the intervention do not change the relative size of the socioeconomic groups.
† So, if 14%, 22%, 22%, 24% and 18% of the population live in QIMD 1, 2, 3, 4 and 5 areas respectively, the cumulative frequency is 14%, 36%, 58%, 82% and 100% and the ridit scores are 0 + 0.14/2 = 0.07, (0.
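The ridit scores and the absolute equity slope index described above can be sketched as follows. The population shares are those of the footnote example; the numbers of cases prevented are hypothetical, QIMD 5 is assumed to be the most deprived group, and an unweighted least-squares fit is assumed because the text does not specify the regression details.

```python
def ridit_scores(shares):
    """Midpoint of each group's range on the cumulative population scale."""
    scores, cumulative = [], 0.0
    for share in shares:
        scores.append(cumulative + share / 2.0)
        cumulative += share
    return scores

def ols_slope(x, y):
    """Slope of the ordinary least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Population shares of QIMD 1 to 5 from the footnote example.
shares = [0.14, 0.22, 0.22, 0.24, 0.18]
ridits = ridit_scores(shares)  # about 0.07, 0.25, 0.47, 0.70, 0.91

# Hypothetical cases prevented or postponed in QIMD 1 to 5.
cases_prevented = [100, 150, 200, 300, 400]

absolute_equity_slope = ols_slope(ridits, cases_prevented)
print([round(r, 2) for r in ridits])
print(absolute_equity_slope > 0)
```

A positive slope here reflects that the hypothetical intervention prevents more cases in the more deprived groups. The relative index would follow the same pattern with the proportion of cases prevented in each group as the dependent variable.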

Equity summary chart
The equity summary chart presents, in a simple two-dimensional chart, the impact of the interventions on disease incidence, and on absolute and relative socioeconomic inequality. The horizontal axis represents the number of cases prevented or postponed and the vertical axis represents the decrease (or increase) in absolute inequality. An 'equity' curve (or line) divides the graph into two parts.
Interventions above the equity curve decrease relative inequality and interventions below it increase relative inequality. The underlying assumption of the equity summary chart and the equity curve is that for a given overall reduction of disease burden in the whole population, attributable to an intervention, there is one and only one way to distribute the reduction among the socioeconomic groups that can reduce absolute socioeconomic inequality and have no impact on relative socioeconomic inequality.
Let us consider the simple example of a population that consists of only two mutually exclusive socioeconomic groups, the 'deprived' and the 'affluent', with different disease incidence. Then the disease incident cases of the whole population are I = I_d + I_a, where I_d is the cases in the 'deprived' group and I_a is the cases in the 'affluent' group. Also by definition, absolute inequality = I_d − I_a, and relative inequality = I_d / I_a.
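For this two-group case, the single distribution of a given overall reduction that leaves relative inequality unchanged can be sketched as follows. This is derived only from the definitions of absolute and relative inequality given above and is not necessarily the authors' formulation; the function name `proportional_split`, the starting values of 50 and 10 cases and the overall reduction of 6 cases are illustrative assumptions.

```python
def proportional_split(deprived, affluent, reduction):
    """Split an overall reduction in proportion to each group's cases.

    With two groups this is the only split that keeps the ratio
    deprived / affluent (relative inequality) unchanged.
    """
    total = deprived + affluent
    return reduction * deprived / total, reduction * affluent / total

deprived, affluent, reduction = 50, 10, 6  # hypothetical numbers

cut_d, cut_a = proportional_split(deprived, affluent, reduction)
new_deprived, new_affluent = deprived - cut_d, affluent - cut_a

relative_before = deprived / affluent
relative_after = new_deprived / new_affluent
absolute_drop = (deprived - affluent) - (new_deprived - new_affluent)

# Point on the 'equity' line for this overall reduction.
equity_point = (reduction, absolute_drop)
print(relative_before, relative_after, equity_point)
```

In this sketch, an intervention that achieves the same overall reduction but a larger drop in absolute inequality would sit above the equity line and would decrease relative inequality.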
For a given overall reduction in disease incident cases across groups (ΔI), the distribution of the reduction among the two groups can be described as

The generalisation of the previous example to populations with more than two levels of socioeconomic deprivation and unequal sizes of the socioeconomic groups is explained below.
Complex measures to summarise absolute and relative socioeconomic inequalities already exist for this situation. 62,64 We will use the slope index of inequality (SII) to summarise absolute inequality in the population and the relative index of inequality (RII) to summarise relative inequality. They are regression-based approaches: the first represents the difference in the health measure between the most and least deprived individuals, and the second the ratio of the health measure between the most and least deprived individuals. After the intervention, the new incidence ′ = ′ * ′ + (1 − ′) * ′ and will result in a new ′ and ′ . Because ′ is dependent on the impact of intervention on the different socioeconomic groups, the 'equity line' cannot be defined as in the previous simplified example.
However, for a given intervention, I′ can be estimated and, assuming relative inequality is unchanged (RII = RII′), there is a corresponding SII′′.
The horizontal axis of the equity summary chart represents ΔI = I − I′, the number of cases prevented or postponed, and the vertical axis ΔSII = SII − SII′. The graph in the main text was created by plotting ΔI and ΔSII for each scenario (in year 2015). Then each scenario was plotted on the graph. For each scenario, (ΔI, SII − SII′′) was also plotted. To improve readability of the graph and for presentation purposes only, a constrained B-spline curve was fitted to these points to represent the 'equity' curve. Scenarios above the equity curve decrease relative inequality and scenarios below the equity curve increase it. The vertical distance from the curve roughly represents the impact of the scenario on relative inequality. Consequently, the equity summary chart presents, on a two-axis chart, the impact of the intervention on CVD incidence, absolute inequality and relative inequality.

CHAPTER 7. VALIDATION
In this chapter, we first present the internal validation of the synthetic population and the risk factor trends, as evidence that the synthetic population used in IMPACTNCD was similar to the English population. Then, we present the predictive validation of IMPACTNCD by comparing observed to predicted mortality rates for years 2006 to 2013 by age group, sex, QIMD, and modelled disease.
Specifically for the predictive validation, IMPACTNCD was calibrated to data up to 2006. The only exception was the regression models that were used for individual predictions of exposure to risk factors (steps 2 and 3 in Figure S1). These models were fitted to data from 2001 to 2012 as described before. This is appropriate, as the main use of IMPACTNCD is to translate changes in risk factors into changes in CVD * incidence and mortality, and not forecasting. Finally, we present the comparison of the IMPACTNCD baseline scenario with BAMP, a Bayesian age-period-cohort model. 65

Synthetic population validation
The

Risk factor trends validation
Here we compare the mean exposure of the IMPACTNCD synthetic population to the observed exposure from relevant nationally representative surveys. We stratified by sex, age group and, when data allowed, by QIMD. Overall, the plots provide evidence that the regression models used (steps 2 and 3 in Figure S1) have captured trends by age, sex and QIMD well enough.

Mortality concurrent and predictive validation
Here we validate the IMPACTNCD estimated mortality against the observed mortality in England between 2006 and 2013. Furthermore, we present the comparison of IMPACTNCD predicted CVD mortality rates with forecasts from BAMP v1.3.0.1. 65 BAMP is software that performs Bayesian age-period-cohort forecasting; we used it to fit a model, with second-order random walk priors, to the observed CVD mortality in England for years 2002 to 2013 and then project it up to 2030. A more detailed description of the BAMP forecasting can be found elsewhere. 67 Overall, the plots support the argument that IMPACTNCD can translate changes in risk factor prevalence into changes in disease mortality rather accurately.

Population module
Immigration is not considered.
Social mobility is not considered.