Breast cancer mortality in 500 000 women with early invasive breast cancer diagnosed in England, 1993-2015: population based observational cohort study

Abstract Objectives To describe long term breast cancer mortality among women with a diagnosis of breast cancer in the past and estimate absolute breast cancer mortality risks for groups of patients with a recent diagnosis. Design Population based observational cohort study. Setting Routinely collected data from the National Cancer Registration and Analysis Service. Participants All 512 447 women registered with early invasive breast cancer (involving only breast and possibly axillary nodes) in England during January 1993 to December 2015, with follow-up to December 2020. Main outcome measures Annual breast cancer mortality rates and cumulative risks by time since diagnosis, calendar period of diagnosis, and nine characteristics of patients and tumours. Results For women with a diagnosis made within each of the calendar periods 1993-99, 2000-04, 2005-09, and 2010-15, the crude annual breast cancer mortality rate was highest during the five years after diagnosis and then declined. For any given time since diagnosis, crude annual breast cancer mortality rates and risks decreased with increasing calendar period. Crude five year breast cancer mortality risk was 14.4% (95% confidence interval 14.2% to 14.6%) for women with a diagnosis made during 1993-99 and 4.9% (4.8% to 5.0%) for women with a diagnosis made during 2010-15. Adjusted annual breast cancer mortality rates also decreased with increasing calendar period in nearly every patient group, by a factor of about three in oestrogen receptor positive disease and about two in oestrogen receptor negative disease. Considering just the women with a diagnosis made during 2010-15, cumulative five year breast cancer mortality risk varied substantially between women with different characteristics: it was <3% for 62.8% (96 085/153 006) of women but ≥20% for 4.6% (6962/153 006) of women. Conclusions These five year breast cancer mortality risks for patients with a recent diagnosis may be used to estimate breast cancer mortality risks for patients today. The prognosis for women with early invasive breast cancer has improved substantially since the 1990s. Most can expect to become long term cancer survivors, although for a few the risk remains appreciable.


Figure S1
Crude all-cause (blue), breast cancer (red) and non-breast-cancer (green) mortality rates in 512,447 women with early breast cancer by combinations of numbers of positive nodes and tumour size, for categories of attained age 7 Figure S2 Composition of the study population among women diagnosed with invasive breast cancer in England during January 1993 to December 2015 8 Text S2 Statistical Methods 9 Table S1 Patterns of missing values in the study population of 512,447 women with early breast cancer 15  Table S4 Numbers of women diagnosed with early breast cancer, crude annual breast cancer mortality rates and cumulative breast cancer mortality risks by calendar period of diagnosis, and time since diagnosis 22 Table S5 Numbers of women diagnosed with early breast cancer at ages 50-64 years, crude annual breast cancer mortality rates and cumulative breast cancer mortality risks by screening status, calendar period of diagnosis, and time since diagnosis 23 Table S6 Numbers of women diagnosed with early breast cancer, crude annual breast cancer mortality rates and cumulative breast cancer mortality risks by ER status, and time since diagnosis Adjusted annual breast cancer mortality rates during 5-15 years after diagnosis in women with early breast cancer with ER-positive or ER-negative disease according to nine characteristics, and time since diagnosis 31 Figure S10 Adjusted annual breast cancer mortality rates during 15+ years after diagnosis in women with early breast cancer with ER-positive or ER-negative disease according to nine characteristics, and time since diagnosis 32 Figure S11 Adjusted annual breast cancer mortality rates during 0-5 years after diagnosis in women with early breast cancer with ER-positive or ER-negative disease by calendar period of diagnosis, according to age at diagnosis 33 Figure S12 Adjusted annual breast cancer mortality rates in women with early breast cancer with ER-positive or ER-negative disease by calendar period of diagnosis, according to breast cancer laterality, index of multiple deprivation and region of residence 34 Figure S13 Adjusted annual breast cancer mortality rates in women with early breast cancer with ER-positive or ER-negative disease by calendar period of diagnosis, according to age at diagnosis

Comparable Studies
Table S11 Other studies of breast cancer specific mortality in populations of patients with breast cancer or of relative mortality in populations of patients with breast cancer compared with the general population 68 4 Tabulation   5 Text S1: Data collating and checking 1. Datafiles The central dataset received from the National Cancer Registration and Analysis Service (NCRAS) contained one record for every woman registered with invasive breast cancer during the period 1 January 1993 to 31 December 2015. Women who had previously been registered with an invasive cancer were excluded. The file contained a pseudonymised patient identifier, a pseudonymised tumour identifier for each woman's first ('index') breast cancer, and details of that cancer. A total of eighteen other datasets were also received from NCRAS. Each record in these datasets either related to an individual woman and contained the pseudonymised patient identifier and information about that woman, including any previous non-invasive breast cancers and any subsequent invasive cancers, or related to the index tumour and contained the pseudonymised tumour identifier and information relating to that tumour. Many of the files included multiple records per woman or multiple records per tumour.

Statistical Methods and Basic
To create a single datafile for use in analyses, it was necessary to identify the relevant records and variables in each dataset and merge them together to create a new datafile with a single record per woman. This datafile included: A. Patient details: age of the woman when the index breast cancer was diagnosed, quintile of multiple deprivation 1 (a measure of poverty) based on her address at diagnosis, region of residence (based on the regions covered by the former regional cancer registries), and, if relevant, month and year of any prior non-invasive breast cancers and of any subsequent invasive cancers, month and year of embarkation or month and year of death, together with cause of death.
B. Tumour details for the index breast cancer: year and month of registration, death certificate only registration (yes/no), screen-detected 2 (yes/no), surgery (mastectomy/breast conserving/none), International Classification of Diseases (ICD)_O2 code, morphology, behaviour, histology (categorised into cancer subtype), grade, pathological tumour size, number of involved axillary nodes, oestrogen receptor (ER) status, laterality, indication of metastatic disease at diagnosis (yes/no), indication of neoadjuvant therapy (yes/no). In all analyses, the status screen-detected=yes was only allocated to women aged 50-64 years at diagnosis plus, from 2005, women aged 65-70 years at diagnosis, in order to reflect the coverage of the National Health Service screening program.
Before conducting any analyses, consistency and range checks were carried out. The checks described below were also conducted. Where appropriate, inconsistencies were queried with NCRAS staff for clarification.

Review of diagnostic information
For all the index cancers, and wherever there was an indication that the woman had been diagnosed with a previous non-invasive breast cancer, all the available pathological and diagnostic information for that tumour was reviewed by a team of three oncologists, while a pathologist reviewed all the histology codes. Any inconsistencies were discussed with NCRAS staff and, if required, the information recorded was corrected.
For any women recorded as having a second cancer, the date of the first subsequent primary cancer (ignoring non-melanoma skin cancer, records of a diagnoses with no accompanying cancer site, and non-invasive tumours) was ascertained. Women whose index cancer was followed by a second primary cancer or a contralateral invasive breast cancer within 3 months were excluded from the study. Figure S1: Crude all-cause (blue), breast cancer (red) and non-breast-cancer (green) mortality rates in 512,447 women with early breast cancer by combinations of numbers of positive nodes and tumour size, for categories of attained age 8 Figure S2: Composition of the study population among women diagnosed with invasive breast cancer in England during January 1993 to December 2015 *Women with metastatic disease were identified as follows: record of metastatic disease, drug usually given for metastatic disease, or palliative radiotherapy within a year of breast cancer diagnosis (N=17,434). This category also includes women with no surgery recorded (N=121,477). Reasons for no surgery may include metastatic disease, and treatment of ER-positive disease with endocrine therapy only in some elderly women.

Tabulation of person-years at risk and observed events
For each woman who was eligible for the study, her contribution to the person-years at risk was calculated. Women started to contribute to the person-years at risk from 3 months after diagnosis of breast cancer and stopped on the earliest of: date of death, emigration, 95th birthday or end of followup.
In the analysis, calendar period was considered in five-year categories except that women diagnosed during 1993/1994 were included with the 1995-1999 period, and women diagnosed during 2015 were included with the 2010-2014 period. These categories were chosen (before any data were analysed) for the following reasons. The individual contributions to the person-years at risk were then added together. The number of women whose contribution to the person-years was terminated by death from breast cancer (or from all causes) was also obtained. Both the numbers of person-years and the numbers of deaths were then tabulated simultaneously according to all the factors considered in the study. It was assumed throughout that the date of each woman's cancer registration was her date of diagnosis.

Nature and amount of missing data
The following tables show the nature and amount of missing data: Tables: S1, S2 Information on the following variables was available for all women in the study: year and month of breast cancer diagnosis, age at breast cancer diagnosis, screening status, quintile of index of multiple deprivation (IMD) and region of residence, together with month and year of emigration if she had emigrated, and month and year of death, with cause of death, if she had died before the end of her follow-up. For 130,745 (25.5%) of the women, information was also available on all the following characteristics: laterality, tumour grade, HER2 status (during the calendar period 2010-2015), tumour size, number of positive lymph nodes, and ER status (Table S1). However, for 26.7%, 28.8%, 14.3%, 4.0%, 0.6% and 0.04% respectively information was missing on 1, 2, 3, 4, 5, or all 6 of these variables.
For each variable with some missing values, the proportion that were missing was strongly correlated with calendar period of cancer diagnosis (Table S2). This trend is likely to have arisen from the increasing effort put into obtaining these characteristics by cancer registration staff in recent years. All the variables concerned would have been collected, or missed, at -or close to -the time the woman was registered with breast cancer.
From 2013 onwards, the Cancer Outcomes and Services Dataset started to enable those providing information on cancer registrations and subsequent cancer treatments to submit their data to NCRAS using a variety of different systems. However, its implementation was phased and it made little difference to our study, apart from the fact that there were fewer missing values during 2013-2015 than during previous years. Notifications of death dates and causes continued to be supplied to NCRAS from the Office of National Statistics without interruption.

Multiple imputation
The following tables and figures use the imputed datasets: Tables: 1, S3, S6-S10 Figures: 1-8 In order to be able to include all the women in every analysis, irrespective of whether data on some characteristics were missing, the method of multiple imputation was used. 1,2 In this method, multiple datasets are created in which the missing values are replaced by imputed (i.e. predicted) values that have been sampled from their predictive distributions. The predictive distributions take into account any correlations between the known values of the variables with missing values and other variables in the data, thus enabling the imputed values to take appropriate account of the correlations present in the data. In addition, differences between the different imputed datasets enable the uncertainty arising from the imputed values to be taken into account in any ensuing confidence intervals and significance tests.
All the variables for which any values were missing were categorical and there were no complex design features in the data. For each missing value, its predictive distribution was obtained using chained ordered logistic regressions. Independent variables in the predictions were age at diagnosis, screening status, index of multiple deprivation, region of residence, breast cancer death (yes or no), non-breastcancer death (yes or no) and the Nelson-Aalen estimate of the cumulative hazard function for breastcancer deaths. Cancer subtype (16 categories) was also included as an auxiliary variable in the prediction because of its association with ER-status. As the study covered cancer diagnoses during a period of over 20 years, prediction was carried out separately for each calendar period of diagnosis. This allows for possible interactions between calendar period and all the other variables in the prediction model. The assumption was then made that, within each calendar period, missing values in the other random variables in the prediction model occurred at random.
Calculation was carried out using the multiple imputation suite of programs in Stata. 3 The default burn-in period was 10 iterations and, to confirm that this number was sufficient, trace plots for a burnin period of 100 iterations were produced and examined. To examine the plausibility of the imputed values, their distribution for each variable in every calendar period category was compared with that of the known values for that variable. In every case good agreement was found. In addition, the distributions of the variables for which values were missing were compared with distributions in the published literature for other comparable populations of breast cancer patients and discussed with a number of experienced oncologists. The tables and analyses presented in this report are based on a total of 60 imputed datasets. This number was chosen in accordance with the guidelines recommended by van Buuren ie. that the number of imputations should be similar to the percentage of cases that are incomplete. 2 In order to confirm that it was sufficient, several of the analyses were conducted using both 40 and 100 imputations and the results were found to differ little.
For the variables where imputation was necessary, the tables and figures in this report present the number of women in each category averaged over the imputed datasets. Analyses were carried out separately on each imputed dataset, as described in sections 4. and 5. below. The resulting estimates and their variances were then combined via Rubin's rules. 1 Tests of statistical significance of heterogeneity (screening status and region of residence) or for a linear trend (all other variables) were conducted. These assumed that the between-imputation variance was proportional to the withinimputation variance. 1 No corrections have been made to p-values for multiple testing. However, the large number of tests that have been conducted should be borne in mind when interpreting our results and, to aid with this, we have presented p-values in scientific notation. The following tables and figures make use of smoothing:  Tables: S4, S5, S6  Figures: 1, S3-S5, S23-S25 Poisson regression 4 was used to estimate the breast cancer mortality rates by time since diagnosis. These rates were then approximated by a continuous function of the logarithm of time since diagnosis, and this function provided a smoothed estimate of the crude annual breast cancer mortality rate for all women combined.

Smoothed crude annual breast cancer mortality rates and cumulative risks
The function was derived by considering restricted cubic splines with three degrees of freedom, interior knots at the 33 rd and 66 th percentiles of the event time distribution and boundary knots at the minimum and maximum values of time since diagnosis. The appropriate number of interior knots was determined both graphically and formally, using the Bayesian information criterion. Sensitivity analyses confirmed that the shape of the smoothed crude annual breast cancer mortality rate was not sensitive to the placement of the knots. The mortality rate was then estimated for each 1 month interval using Poisson regression with the number of deaths as the dependent variable, the log of the person-years as a fixed offset, and the spline basis variables and covariate of interest as independent variables.
Separate models were fitted for each covariate considered. In each of these models, an interaction between each category of the covariate (e.g. calendar period of diagnosis) and each spline basis variable was included. In this way, the models provided smoothed estimates for the different category levels of the covariate that were not constrained to be proportional to each other.
Before proceeding further, a comparison was made between smoothed and unsmoothed rates to check for any systematic effect that might have been introduced by the smoothing algorithm. These provided reassurance that the changes in the crude rates with time since diagnosis were well described in all the calendar period categories (see e.g. Figure S3).
Where imputation was needed, the mortality rates were calculated as described above within each imputed dataset, and then combined using Rubin's rules.
Cumulative risks were derived by first summing the smoothed mortality rates over time since diagnosis, with weighting proportional to the length of the time-interval covered, to give the cumulative rate, Λ, and then calculating 1-exp(-Λ). The standard errors of the cumulative risks were based on the standard errors of the cumulative rates. The cumulative risk calculation for each calendar period category continued until the maximum time since diagnosis, described in section 1 above, was reached.

Adjusted annual breast cancer mortality rates
The following figures display adjusted annual mortality rates: To simplify the calculation of adjusted annual breast cancer mortality rates, a two-stage approach was adopted.
In the first step, separate models were fitted for women with ER-positive and ER-negative disease. In these models, adjusted annual breast cancer mortality rate ratios were estimated using Poisson regression, with the numbers of deaths as the dependent variable and the log of the person-years, which were assumed to be fixed, as an offset. Time since diagnosis was classified into one-year intervals for the first five years and five-year intervals thereafter, and it was considered as a categorical variable. This variable, and all the other variables listed in Table 1 (apart from ER-status and HER2 status), were included in the model simultaneously as independent variables using the categories displayed in Table 1.
In addition, in order to investigate the pattern of breast cancer mortality with time since diagnosis, a series of models were fitted to the data which included all the variables described above and also a two-way interaction between calendar period of diagnosis and each one of the other variables in turn. These models provided estimates of the adjusted annual mortality rates and their confidence intervals and the rate ratios and their confidence intervals separately for women with ER-positive and ERnegative disease.
In the second step, the estimated coefficients arising from the above models were used to obtain adjusted annual mortality rates for women with ER-positive disease and ER-negative disease that were directly comparable with each other. This was done by means of the technique of adjusted predictions, as described by Williams, 5 and using the Stata command 'margins'. This comprised predicting, for each variable, the number of deaths in each category of the variable for each combination of all the other variables included in the model. The adjusted annual mortality rates were then obtained by averaging these predicted numbers of deaths by the proportions of the total person years in the entire study (i.e. both women with ER-positive and women with ER-negative disease) which had that combination of all the other variables included in the model. This is equivalent to fitting a single model that includes an interaction between ER-status and every other variable, but is much easier and quicker to fit. Further explanation is given in Williams 5 and examples 1 and 15 of the relevant section of the documentation for Stata 3 (available in https://www.stata.com/manuals/rmargins.pdf ).
As HER2 status was only available for women diagnosed during 2010-2015, it was not considered in analyses of women diagnosed during the whole study period, 1993-2015. HER2 status was, however, included in a separate analysis just of women diagnosed during 2010-2015. For women diagnosed during 2010-2015, the women were categorised simultaneously by all of the five available tumour characteristics (ER status, HER2 status, grade, tumour size and number of positive nodes) as well as by their age and screening status. This gave rise to 576 possible groups of women with different combinations of the available characteristics. Some of these groups contained hundreds or thousands of women, but some contained few women, or no women at all. To avoid basing risk estimates on very sparse data, it was decided to focus just on groups that contained at least 40 women on average across the 60 imputed datasets. There were 253 such groups and they included 97.9% (i.e. 153,006/156,338) of the women diagnosed during 2010-2015.
In 248 of the 253 groups, there was at least one death within five years of breast cancer diagnosis in at least three quarters (i.e. 45) of the 60 imputed datasets. In the remaining 5/253 groups there were no deaths within five years of breast cancer diagnosis in at least 45 of the 60 imputed datasets.
13 Different methods were used to estimate the five-year cumulative mortality risk for these two different types of group.

Method for 248 groups with at least one death in at least 45 of the 60 imputed datasets.
The annual mortality rate during the first five years of follow-up (Λ) was calculated by dividing the total number of deaths from breast cancer (or from all causes) by the corresponding number of person-years at risk within each of the 60 imputed datasets. Cumulative annual mortality rates were then calculated as wΛ, where w is the length of the interval being considered (i.e. 4.75 years, as followup began at 3 months after diagnosis and ended at 5 years after diagnosis), and percentage cumulative mortality risks were calculated as 100 [1-exp(-wΛ)]. The standard errors of the cumulative risks were based on the standard errors of the cumulative rates and the estimates and their variances were then combined via Rubin's rules, as in section 3 above.
6.2 Method for 5 groups in which there were no deaths in at least 45 of the 60 imputed datasets. For these 5 groups, the total numbers of deaths in all 60 imputed datasets were 14, 5, 2, 1 and 1 respectively. The large sample assumptions used for estimating the standard errors in section 6.1 above were, therefore, inappropriate and a simpler, approximate, approach was taken. The annual mortality rate, Λ, in each group was estimated by the sum of the deaths in all the 60 imputed datasets for the group divided by the sum of the person-years in all 60 imputed datasets for the group. The cumulative mortality rate was then estimated as wΛ, where, as above, w is the length of the interval being considered, and the percentage cumulative mortality risk was estimated as 100 [1-exp(-wΛ)]. An upper 95% confidence limit for the percentage cumulative risk was derived assuming that the total number of deaths across the 60 imputed datasets had a Poisson distribution. Therefore, if OU is the upper 95% confidence limit for the mean of the underlying Poisson distribution, then the approximate upper limit of the 95% confidence interval for the percentage cumulative mortality risk is 100[1-exp(-wOU/P)], where P is the total number of person-years observed in all 60 imputed datasets. For these groups the lower limit of the confidence interval was set to zero.

Estimating risks for broader groups of women
If estimates of the five-year cumulative mortality risk are desired for broader groups of women than those displayed in Table S9, then the following approximate method can be used.
The annual mortality rate, Λ, in the broader group can be estimated by the sum of the numbers of deaths shown in Table S9 for all the groups that are being combined, divided by the sum of the personyears in all the groups that are being combined. The cumulative mortality rate in the broader group is then estimated as wΛ, where, as above, w is the length of the interval being considered (i.e. 4.75 years), and the percentage cumulative mortality risk can be estimated as 100[1-exp(-wΛ)]).
If the number of deaths in the groups being combined into the broader group is large (e.g. more than about 15), then an approximate 95% confidence interval for the percentage cumulative risk is 100{1-exp(-w(Λ±1.96√ /P))}, where O is the total number of deaths in the groups being combined and P is the total number of person-years. This method does not account for the variance due to the imputation.
Alternatively, if the total number of deaths observed in the broader group is small (e.g. less than about 15), then a 95% confidence interval for the percentage cumulative risk can then be estimated as in section 6.2 above.
For example, for a trial including women with tumour size 1-20 mm, grade 1 or 2, node negative, and oestrogen receptor positive disease, five year breast cancer mortality risk and its confidence interval 14 may be estimated by combining the relevant rows in table S9. In this example, combining rows provides a total of 350 deaths and 251,737.9 person-years. Hence, the cumulative risk is calculated by Therefore, these women have an estimated five year breast cancer mortality risk of 0.66% (95% confidence interval 0.59% to 0.73%).

Figure S11: Adjusted annual breast cancer mortality rates during 0-5 years after diagnosis in women with early breast cancer with ER-positive or ER-negative disease by calendar period of diagnosis, according to age at diagnosis
In this figure all age-groups have the same length of follow-up.
Rates are adjusted for all the characteristics shown in figure S7.
Vertical lines are 95% confidence intervals.
Breast cancer mortality (to accompany figure 3)

Figure S12: Adjusted annual breast cancer mortality rates in women with early breast cancer with ER-positive or ERnegative disease by calendar period of diagnosis, according to breast cancer laterality, index of multiple deprivation and region of residence
Further details are in figures S18-S20.
Vertical lines are 95% confidence intervals.  Breast cancer mortality (to accompany figure 3) Cancer screen-detected

Figure S14: Adjusted annual breast cancer mortality rates in women with early breast cancer with ER-positive or ERnegative disease by calendar period of diagnosis, according to screening status (Figure 4)
For each characteristic, rates are adjusted for all the characteristics shown in figure S7. Breast cancer mortality (to accompany figure 4) Tumour size

Figure S15: Adjusted annual breast cancer mortality rates in women with early breast cancer with ER-positive or ERnegative disease by calendar period of diagnosis, according to tumour size (Figure 5)
For each characteristic, rates are adjusted for all the characteristics shown in figure S7.     Breast cancer mortality (to accompany figure 6) Tumour grade

Figure S17: Adjusted annual breast cancer mortality rates in women with early breast cancer with ER-positive or ERnegative disease by calendar period of diagnosis, according to tumour grade (Figure 7)
For each characteristic, rates are adjusted for all the characteristics shown in figure S7. Breast cancer mortality (to accompany figure 7) Breast cancer laterality ER-positive ER-negative

Figure S18: Adjusted annual breast cancer mortality rates in women with early breast cancer with ER-positive or ERnegative disease by calendar period of diagnosis, according to breast cancer laterality (Figure S12)
For each characteristic, rates are adjusted for all the characteristics shown in figure S7.  Breast cancer mortality (to accompany figures 3-7)      Breast cancer mortality (to accompany figures 3-7)

ER
All-cause mortality (to accompany figure 2) All-cause mortality (to accompany figure 2) Figure S28: Adjusted annual all-cause mortality rates in women with early breast cancer with ER-positive or ERnegative disease according to nine characteristics, and time since diagnosis For each characteristic, rates are adjusted for all the other characteristics shown including time since diagnosis. All-cause mortality (to accompany For each characteristic, rates are adjusted for all the characteristics shown in figure 2 and also for time since diagnosis, breast cancer laterality, index of multiple deprivation and region of residence.

ER-positive ER-negative
Vertical lines are 95% confidence intervals.
All-cause mortality (to accompany figures 3-7) Age at diagnosis Figure S30: Adjusted annual all-cause mortality rates (dashed lines) and breast cancer mortality rates (solid lines) in women with early breast cancer with ER-positive or ER-negative disease by calendar period of diagnosis, according to age at diagnosis Rates are adjusted for all the characteristics shown in figure 2 and also for time since diagnosis, breast cancer laterality, index of multiple deprivation and region of residence. Vertical lines are 95% confidence intervals