- James Raftery, professor of health technology assessment
- Maria Chorozoglou, research fellow
- Correspondence to: J P Raftery
- Accepted 18 November 2011
Objective To assess the claim in a Cochrane review that mammographic breast cancer screening could be doing more harm than good by updating the analysis in the Forrest report, which led to screening in the United Kingdom.
Design Development of a life table model, which replicated Forrest’s results before updating and extending them with data from relevant systematic reviews, trials, and other models based on purposive literature searches.
Participants Women aged 50 and over invited for breast cancer screening.
Main outcome measures Quality adjusted life years (QALYs), combining life years gained from screening with losses of quality of life from false positive diagnoses and surgery.
Results Inclusion of the effects of harms reduced the updated estimate of net cumulative QALYs gained after 20 years from 3301 to 1536 or by more than half. The best estimates from the Cochrane review generated negative QALYs for the first seven years of screening, 70 QALYs after 10 years, and 834 QALYs after 20 years. Sensitivity analysis showed these results were robust to a range of assumptions, particularly up to 10 years. It also indicated the importance of the level and duration of harms from surgery.
Conclusions This analysis supports the claim that the introduction of breast cancer screening might have caused net harm for up to 10 years after the start of screening.
The Forrest report in 1986,1 which led to the introduction of mammographic breast screening in the United Kingdom, analysed the costs and benefits in terms of quality adjusted life years (QALYs). One of the earliest uses of QALYs to guide policy, it suggested that screening would reduce the death rate from breast cancer by almost one third with few harms and at low cost (for details see appendix on bmj.com).
The key data used in the Forrest report were drawn from two randomised trials, the Swedish two counties trial2 and the Health Insurance Plan (HIP) New York trial.3 The Forrest report claimed that overdiagnosis was not a problem, based on the New York trial, but noted that the Swedish trial found possible overdiagnosis of 20%. It stated that “further follow up is required to find out whether this excess persisted.” We have updated the Forrest report’s estimates for mortality and extended them to include the effects of false positives and overdiagnosis.
Since the Forrest report, the harms of mammographic breast cancer screening have been acknowledged. A WHO report defined false positives and overdiagnosis:
“The term false positive refers to an abnormal mammogram (one requiring further assessment) in a woman ultimately found to have no evidence of cancer. Overdiagnosis refers to the diagnosis and treatment of cancer that would never have caused symptoms. Thus a false positive result can be found only in a woman without cancer, while overdiagnosis can only be made for women with cancer.”4
It went on to note that “overdiagnosis is a foreign concept to most prospective screenees (and many clinicians).”
The WHO report noted that a considerable part of overdiagnosis involved ductal carcinoma in situ, which accounts for around a fifth of mammographically detected cancers. While this is a risk factor for breast cancer, only a minority of these develop into breast cancer. Indeed the inclusion of the term “carcinoma” in ductal carcinoma in situ has been questioned.5
The WHO report claimed that the success of breast cancer screening programmes should be assessed only in terms of mortality: “Screening programmes should ultimately be monitored in terms of deaths, the measure directly related to the purpose of screening.” A focus solely on deaths, however, implies ignoring harms to the living.
Gøtzsche and Nielsen’s Cochrane review6 raised the disturbing possibility that mammographic breast cancer screening could be doing more harm than good. This was because of their lower estimate of the reduction in mortality from breast cancer and their inclusion of the harms from overtreatment. They said that “this means that for every 2000 women invited for screening throughout 10 years, one will have her life prolonged, and 10 healthy women, who would not have been diagnosed if there had not been screening, will be diagnosed as breast cancer patients and will be treated unnecessarily. Furthermore, more than 200 women will experience important psychological distress for many months because of false positive findings. It is thus not clear whether screening does more good than harm.”6
Their meta-analysis included eight randomised trials, three of which they considered adequately randomised and five suboptimally randomised. Only the suboptimally randomised trials found a significant effect of screening on deaths ascribed to breast cancer. For all eight trials taken together, the relative risk reduction for mortality from breast cancer was 19% (95% confidence interval 13% to 26%) after 13 years. Given the quality of the evidence, Gøtzsche and Nielsen’s best estimate of the effect of screening was a 15% decline in mortality.
The increased risk of surgery was the basis of Gøtzsche and Nielsen’s estimate of unnecessary treatment. Four trials provided data on breast operations (mastectomies and lumpectomies), with more performed in the screened groups than in the control groups: the relative risk increase was 31% (22% to 42%) for the two adequately randomised trials and 35% (26% to 44%) for all four trials. For false positive results, Gøtzsche and Nielsen stated “it seems that screening inflicts important psychological distress for many months on more than a 10th of the healthy population of women who attend a screening program.”
A systematic review and meta-analysis by Nelson and colleagues for the US Preventive Services Task Force independently analysed the same eight clinical trials in the Cochrane review but by age group.7 8 This put the reduction in mortality from breast cancer at 15% for those aged 39-49, 14% for those aged 50-59, and 32% for those aged 60-69. It used US registry data to suggest that about 10% of those screened would have a false positive result requiring further investigation.7 It differed from the Cochrane review in relation to overdiagnosis. “Rates of overdiagnosis vary from less than 1% to 30% with most from 1% to 10%. Estimates differ by outcome (invasive vs in situ breast cancer), by whether cases are incident or prevalent, and by age. The studies are too heterogeneous to combine statistically.”7 These studies, it should be noted, included both randomised trials and observational studies.
Thus the two systematic reviews agreed that screening reduced mortality from breast cancer but differed in how much. Nelson and colleagues estimated a false positive rate around 10% per round of screening, while Gøtzsche and Nielsen put it at around 10% over 10 years. Only Gøtzsche and Nielsen provided data on the increased relative risk of surgery with screening, with two estimates: 31% based on the better quality trials and 35% based on all trials reporting this outcome.
We assessed the claim of Gøtzsche and Nielsen by updating the Forrest report framework, extended to include harms. The Forrest report used life tables to estimate the number of women surviving by year up to 15 years in two cohorts aged 50, only one of which was screened. Deaths could be from breast cancer or all other causes. Baseline mortality and the reduction from triennial breast cancer screening were based on the two randomised controlled trials then available. The difference in life years between the two cohorts after 15 years was expressed in QALYs by reducing their quality of life by 8% to reflect the effects of treatment.
The Southampton model used the same life table approach as Forrest to estimate life years. To ensure that the Southampton model was fully compatible, we confirmed that use of Forrest inputs generated the same number of deaths in our model.
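The two-cohort life table logic can be sketched in a few lines of Python. This is an illustration only, not the Southampton model: the function name `cohort_life_years`, the annual mortality rates, and the five year lag before any benefit are placeholder assumptions, not inputs from table 1. Only the 19% relative risk reduction (relative risk 0.81) comes from the text.

```python
def cohort_life_years(cohort_size, bc_mortality, other_mortality, years,
                      relative_risk=1.0, lag_years=0):
    """Return cumulative life years lived by one cohort, year by year.

    Rates are annual probabilities of death among women still alive.
    `relative_risk` scales breast cancer mortality from `lag_years` onward
    (1.0 means no screening effect). All inputs here are hypothetical.
    """
    alive = float(cohort_size)
    total = 0.0
    cumulative = []
    for year in range(years):
        rr = relative_risk if year >= lag_years else 1.0
        deaths = alive * (bc_mortality * rr + other_mortality)
        alive -= deaths
        total += alive  # simplification: each survivor adds one life year
        cumulative.append(total)
    return cumulative


# Placeholder mortality rates (assumptions, NOT the paper's values).
unscreened = cohort_life_years(100_000, 0.0010, 0.0050, years=20)
screened = cohort_life_years(100_000, 0.0010, 0.0050, years=20,
                             relative_risk=0.81, lag_years=5)

# Life years gained from screening, by year since entry.
life_years_gained = [s - u for s, u in zip(screened, unscreened)]
```

Because the two cohorts differ only in breast cancer mortality after the lag, the difference between the two series is the life years attributable to screening. Harms would then be subtracted to reach net QALYs.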
Forrest took baseline mortality from breast cancer from the two trials then available but acknowledged that as this was below the English mortality rate from breast cancer, his results were underestimates. In updating Forrest, we corrected this by using the mortality rate from breast cancer for England.9 We also took the baseline risk of surgery for breast cancer from the English NHS.10 11 Data for both these baselines were for 1985, the latest year before screening for which we could locate data. These changes meant more favourable results for screening than if we used the control arms of trials as baselines.
We drew parameter inputs for the Southampton model (table 1) from the published literature, giving priority to systematic reviews, followed by randomised clinical trials and other published models, and then observational data supplemented by clearly stated assumptions when necessary. Sensitivity analysis varied mean estimates to their 95% confidence intervals and other inputs by ±33%. The results of individual sensitivity analyses are reported in the appendix on bmj.com. Probabilistic sensitivity analysis varied key inputs simultaneously by sampling from their probability distributions for 10 000 iterations.
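A one-way (deterministic) sensitivity analysis of the ±33% kind can be sketched as follows. The toy function `net_qalys`, the parameter names, and the cohort-level quantities are hypothetical stand-ins for the full model. Only the 6% loss from surgery and the false positive loss of 5% for 0.2 years (0.01 QALY) are taken from the text.

```python
def net_qalys(params):
    """Toy stand-in for the full model: gains minus harms, in QALYs."""
    gains = params["life_years_gained"]
    fp_harm = params["false_positives"] * params["fp_qaly_loss"]
    surgery_harm = params["surgery_woman_years"] * params["surgery_qaly_loss"]
    return gains - fp_harm - surgery_harm


def one_way_sensitivity(model, base, fraction=0.33):
    """Vary each input by +/- `fraction`, holding all others at base values."""
    results = {}
    for name, value in base.items():
        low = dict(base)
        low[name] = value * (1 - fraction)
        high = dict(base)
        high[name] = value * (1 + fraction)
        results[name] = (model(low), model(high))
    return results


base_case = {
    "life_years_gained": 3000.0,     # hypothetical
    "false_positives": 20000.0,      # hypothetical count
    "fp_qaly_loss": 0.01,            # 5% of full health for 0.2 years
    "surgery_woman_years": 20000.0,  # hypothetical years lived after extra surgery
    "surgery_qaly_loss": 0.06,       # 6% loss from surgery
}

base_result = net_qalys(base_case)
ranges = one_way_sensitivity(net_qalys, base_case)
```

Each entry of `ranges` holds the net QALYs at the low and high value of one input, which is the information usually displayed in a tornado plot. The sketch covers only the ±33% variation, not the variation of mean estimates to their 95% confidence limits.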
All the input values are listed in table 1 with sources and discussed more fully in the appendix on bmj.com. In brief, the changed relative risks for breast cancer mortality and surgery from screening were based on the meta-analyses of the relevant trials.6 7 8 The losses of quality of life from false positive results and surgery were based on a systematic review, supplemented by relevant randomised trials and values used in previous models. The extent and duration of the loss of quality of life from surgery have been least researched. We assumed a 6% permanent loss from surgery, less than in the Forrest report but informed by recent randomised trials.12 13 Sensitivity analyses explored changing the extent and duration of these and other values. Figure 1 illustrates the modelling approach.
Setting, participants, and outcome measures
The setting was England. The outcomes of 100 000 women aged 50 were modelled in two cohorts, one screened, the other not. The outcome measures were deaths from breast cancer, deaths from all other causes, and the number of women having false positive diagnoses and surgery, which we combined into the main outcome, quality adjusted life years (QALYs).
Figure 2 graphically presents the five scenarios, and table 2 summarises the results. Scenario 1 shows the QALY gains that Forrest would have got if he had used English breast cancer mortality rates as baseline with his risk reductions. Scenario 2 updates this with the reduction in mortality from breast cancer for all ages from all eight trials. The losses of quality of life from surgery and false positive diagnoses were added in scenario 3. Scenario 4 used the reduction in mortality from breast cancer suggested by Gøtzsche and Nielsen.6 In scenario 5, we used the reductions in mortality from breast cancer by age group from Nelson et al.7 8 The results are based on 100 000 women being invited for mammographic screening, with 73% attending, and are presented for each year up to 20 years after the entry to the screening programme.
Scenario 1 accumulated just over 3300 net QALYs after 20 years. This is what Forrest would have got had he used as baseline the breast cancer mortality rate for England and the mortality reduction from the two trials. When we updated the estimate for reduction in breast cancer mortality for all ages, with the meta-analysis of the eight trials (scenario 2), the net cumulative QALY gain at 20 years fell to around 3100 QALYs or by about 6%. When we added harms in scenario 3, this was reduced to just over 1500 QALYs or by half. When we changed the reduction in mortality from breast cancer to that suggested by Gøtzsche and Nielsen,6 the net QALYs at year 20 fell to 834 (scenario 4). Scenario 5, based on the reductions in mortality from breast cancer by age group suggested by Nelson et al,7 8 generated 1685 QALYs by year 20.
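The relative size of these reductions can be checked with simple arithmetic on the 20 year figures reported in the text. The sketch below is a convenience calculation only. Scenario 2 is omitted because only an approximate value (around 3100) is given, and the helper name `relative_reduction` is illustrative.

```python
# Net cumulative QALYs at 20 years, as reported in the text.
qalys_20y = {
    "scenario 1": 3301,  # Forrest with English baseline mortality
    "scenario 3": 1536,  # updated mortality reduction plus harms
    "scenario 4": 834,   # Gøtzsche and Nielsen's mortality reduction
    "scenario 5": 1685,  # Nelson et al's reductions by age group
}


def relative_reduction(reference, value):
    """Fractional fall in QALYs relative to a reference scenario."""
    return (reference - value) / reference


reference = qalys_20y["scenario 1"]
reductions = {
    name: relative_reduction(reference, value)
    for name, value in qalys_20y.items()
    if name != "scenario 1"
}
```

Measured against scenario 1, scenario 3 falls by a little over half, consistent with the summary in the abstract.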
Scenarios 3, 4, and 5 had negative cumulative QALY values for the first four, seven, and eight years, respectively, but had positive values after 10 years. The harms from surgery and false positive diagnoses applied from the start because they were linked to each round of screening. Mortality from breast cancer, however, was reduced only after several years, but the reduction accumulated over time, so that net QALYs were positive by 10 and especially by 20 years.
Sensitivity analyses explored the effects of varying input values independently (see appendix on bmj.com). When we combined four key parameters (reduced mortality from breast cancer, increased surgery for breast cancer, and losses of quality of life from false positive results and from surgery) in a probabilistic sensitivity analysis for scenario 3 (Forrest updated including harms), the net cumulative gain in QALYs after 20 years was between 771 and 2136 (mean 1532), with lower values in the earlier years (fig 3). The mean number of years with negative QALYs was four, with a range of two to nine.
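The mechanics of a probabilistic sensitivity analysis over these four parameters can be sketched as below. This is not the analysis reported here: the toy `net_qalys` function and its scaling constants are hypothetical, and the uniform distributions are an assumption made for simplicity (the distributions actually used are described in the appendix). The ranges for the two relative risks follow the confidence intervals quoted from the Cochrane review. The ranges for the quality of life losses are the base values ±33%.

```python
import random
import statistics


def net_qalys(rr_mortality, rr_surgery, fp_loss, surgery_loss):
    """Toy net QALY function; the scaling constants are hypothetical."""
    gains = 16000.0 * (1.0 - rr_mortality)          # fewer deaths, more QALYs
    surgery_harm = 60000.0 * (rr_surgery - 1.0) * surgery_loss
    fp_harm = 20000.0 * fp_loss
    return gains - surgery_harm - fp_harm


def run_psa(iterations=10_000, seed=0):
    """Sample all four inputs at once and collect the resulting net QALYs."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    draws = []
    for _ in range(iterations):
        draws.append(net_qalys(
            rr_mortality=rng.uniform(0.74, 0.87),   # 13% to 26% reduction
            rr_surgery=rng.uniform(1.26, 1.44),     # 26% to 44% increase
            fp_loss=rng.uniform(0.01 * 0.67, 0.01 * 1.33),
            surgery_loss=rng.uniform(0.06 * 0.67, 0.06 * 1.33),
        ))
    return draws


draws = run_psa()
summary = {
    "min": min(draws),
    "mean": statistics.mean(draws),
    "max": max(draws),
}
```

Sampling all inputs simultaneously, rather than one at a time, is what distinguishes this from the deterministic analyses. The spread of `draws` reflects joint uncertainty in the inputs.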
We did not include the duration of harms from surgery in the probabilistic sensitivity analysis because of uncertainty about the appropriate distribution. Instead we used deterministic sensitivity analyses to explore reducing the duration of harms from surgery, assumed as permanent in the base case, to five and 10 years (see appendix on bmj.com). This led to unchanged net QALYs up to five and 10 years but with more QALYs over longer periods.
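Why a shorter duration of harm leaves early results unchanged can be seen from a per-woman sketch. The function below is illustrative only: it considers a single woman operated on at year 0, ignores mortality and discounting, and uses a hypothetical name, `cumulative_surgery_loss`. In the model itself, surgery occurs at successive screening rounds.

```python
def cumulative_surgery_loss(horizon_years, annual_loss=0.06, duration_years=None):
    """QALYs lost by one woman operated on at year 0, up to `horizon_years`.

    `duration_years=None` means the loss is permanent (the base case).
    Mortality and discounting are ignored in this simplified sketch.
    """
    if duration_years is None:
        affected_years = horizon_years
    else:
        affected_years = min(horizon_years, duration_years)
    return annual_loss * affected_years


# Same loss at a five year horizon whether harm is permanent or lasts five years.
loss_permanent_5y = cumulative_surgery_loss(5)
loss_limited_5y = cumulative_surgery_loss(5, duration_years=5)

# At 20 years, a five year duration gives a smaller cumulative loss.
loss_permanent_20y = cumulative_surgery_loss(20)
loss_limited_20y = cumulative_surgery_loss(20, duration_years=5)
```

Up to the assumed duration, the limited and permanent cases are identical. They diverge only beyond it, which is why net QALYs improve only over longer periods.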
Assessment of the effects of mammographic breast screening in terms of mortality or life years inevitably shows positive benefits because of the omission of harms. Despite its espousal of a QALY framework, the Forrest report focused mainly on life years gained, which it adjusted for quality of life only from necessary surgery and ignored all other harms. Our analysis shows that inclusion of the harms from false positive results and unnecessary surgery reduced the benefits of screening by about half with negative net QALYs in the early years after the introduction of screening.
We assumed that the loss of quality of life in women who had unnecessary surgery was the same as for those who had had “necessary” surgery. A key feature of overtreatment is that individuals affected cannot be identified. Of Gøtzsche and Nielsen’s 10 women who had unnecessary surgery, all believed that it was necessary.6 This has been dubbed the paradox of overtreatment—“overdiagnosis and overtreatment create a paradoxical popularity because each individual justifies their experience by believing they have had a dramatic benefit.”20 The more people are (over)treated, the more people think screening saved their lives.
Would knowing whether or not treatment was necessary affect quality of life? None of the surveys of quality of life included overtreatment, implicitly assuming all surgery was necessary. To answer this question surveys would have to ask each woman whether her quality of life would be affected if it could be shown that her surgery had been unnecessary. While the methodological problems of measuring quality of life in cancer screening are considerable,21 ignoring overtreatment is inexcusable.
Ways of reducing the harms from screening might include less frequent screens, particularly for younger women. While further modelling might explore the clinical and cost effectiveness of various options, conclusions will inevitably be limited without better estimates of the level and impact of overtreatment.
Strengths and limitations
Our analysis does have limitations. Following the Forrest report, it relies heavily on clinical trials, most of which were completed in other countries several decades ago. As mortality from breast cancer in 1985 was higher in England than in those trials, we took as baseline the rate for England for 1985, before screening was introduced. We have assumed that the risk reductions shown in the trials apply to this higher baseline rate. An assessment of the value added by screening today might require disentangling the effects of screening from the effects of improved treatments, which is difficult.22 23 24 It would also require consideration of how screening methods have changed. Double view mammography and improved imaging might reduce the false positive rate but could have increased overtreatment by creating more (harmful) true positive diagnoses.25
As with breast cancer mortality, our baseline risk of breast cancer surgery was that for England in 1985. We assumed that the risk increase shown in the trials applied to this risk. Observational studies, including those summarised in the US systematic review, provided a wide range of estimates of overtreatment from 1%7 to 52%6 7 26 27 28 but have also been criticised for poor quality.29 We assigned a single loss of quality of life to all forms of surgery but acknowledge that lesser harms are likely with lumpectomy than with mastectomy. Against this, we have not included the harms from radiotherapy and chemotherapy. Future studies on quality of life might usefully distinguish between the effects of different treatments.
How plausible are the losses in QALYs from surgery? The Forrest report estimate of 8% seems to have been based on a single small study from which it took the lowest estimate30 (see appendix on bmj.com). A 2010 systematic review of health state utilities in breast cancer15 found only two relevant studies. One put the loss in utility at 8% in year one, 4% in intervening years, and 11% in the last year of life. The other put the loss at 38% in year one after diagnosis, 31% in years one to five, and 29% after five years. The 2010 UK COMICE trial put the loss in quality of life from surgery in 1625 women with a low risk breast cancer at 5% after 12 months.12 The five year follow-up to the PRIME trial showed that the quality of life losses after surgery were unchanged after five years.13 Overall, our assumption of a permanent 6% loss in quality of life from surgery does not seem unreasonable, but more robust estimates are needed.
We assumed the base case to be a loss in quality of life from false positive results of 5% of full health for 0.2 years. This is lower than that of a relevant US model16 but similar to the Dutch model.17 The 2010 systematic review of the utility losses from breast cancer included estimates of this loss of between 11% and 34%15 but warned that the studies could not be synthesised.
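The base case loss per false positive result follows directly from these two numbers. A small sketch makes the arithmetic explicit. The function name and the example count of 1000 false positives are hypothetical.

```python
def false_positive_qaly_loss(n_false_positives, utility_loss=0.05,
                             duration_years=0.2):
    """Total QALYs lost: utility loss x duration x number of women affected."""
    return n_false_positives * utility_loss * duration_years


# One false positive: 5% of full health for 0.2 years.
loss_per_woman = false_positive_qaly_loss(1)

# Hypothetical count, for illustration only.
loss_per_1000 = false_positive_qaly_loss(1000)
```

On these assumptions each false positive costs 0.01 QALY, so the cohort-level harm depends mainly on how many women receive a false positive result at each screening round.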
The time frame of up to 20 years is long relative to the duration of the trials, results of which have been synthesised up to 13 years. Extrapolation required an assumption of constant benefits and harms from additional rounds of screening. Longer time frames generate greater net QALYs but rely on increasingly strong assumptions, both to do with the rate of survival from breast cancer and the pattern of losses in quality of life over time.
We assumed no recurrence of cancer, despite the 10 year survival rate for breast cancer being 72% in the UK.30 We also assumed no re-operations, even though 17% of women with tumours detected at screening in the UK had more than one therapeutic operation in 2006-7.31 While it is possible that some cancers that were detected early by screening might have progressed in longer time frames, a recent analysis has shown no decline in the incidence of advanced breast cancer.32 Modelling the longer term effects of breast cancer screening should include these factors.
Finally, our list of benefits and harms excluded the potential reassurance from a negative result on mammography. As a negative mammogram has little predictive value, any reassurance is limited to relief at not having cancer at that time.33 The 2010 systematic review of utility states in breast cancer15 found no evidence of improved quality of life from negative results.
Comparison with other studies
Our results can be compared with attempts to model the effectiveness of mammographic screening in terms of cost per QALY. Although Stout et al16 included only losses of quality of life in a sensitivity analysis, their inclusion roughly doubled the cost per QALY. The Dutch MISCAN study concluded that including the effects on quality of life of both treatment and false positive results had little consequence,17 but this seems to be because of the relatively low level of surgery assumed in that model. In a review of the cost effectiveness of extending the age range for the UK breast screening programme, Madan et al18 showed that inclusion of losses of quality of life from false positive results considerably increased the cost per QALY.
Conclusions and policy implications
Overall, our study supports the suggestion by Gøtzsche and Nielsen that mammographic breast cancer screening could be causing more harm than good after 10 years.6 Scenario 4, based on Gøtzsche and Nielsen’s best estimate, had negative QALYs for the first seven years after screening and minimal gains of 70 QALYs after 10 years. Thereafter, net QALYs accumulate but much less than would be expected by our updating of the Forrest report. The uncertainty around this result, explored in scenarios 3 and 5 and in greater detail in other scenarios, applies more to the longer than the shorter term. Harms largely offset the gains up to 10 years, after which the gains accumulate at an increasing rate.
More research is required on the extent of unnecessary treatment and its impact on quality of life. Most of the observational studies of overtreatment have focused on the relation between the incidence of breast cancer and mortality rather than on the levels of treatment, especially surgery. The effects of treatment on quality of life could be established observationally or in longer follow-up studies of trials.13 Improved ways of identifying those most likely to benefit from surgery and for measuring the levels and duration of the harms from surgery should be research priorities.
As randomised trials might be the only way to resolve the extent of overtreatment, researchers in countries that have not yet implemented breast cancer screening should consider trials that include the harms of screening. There have been suggestions for more sophisticated approaches to the prevention and treatment of breast cancer.33 34 From a public perspective, the meaning and implications of overdiagnosis and overtreatment need to be much better explained and communicated to any woman considering screening.
What is already known on this topic
Mammographic screening for breast cancer saves lives but also imposes losses in quality of life from false positive results and unnecessary treatment
It has been suggested that the harms outweigh the benefits, but this has not been quantified
What this study adds
By expressing life years saved and losses in quality of life together as quality adjusted life years (QALYs), this study combined the benefits and harms into a single measure
The net QALYs from screening were negative for the early years after the introduction of screening, after which net positive QALYs accumulated but by much less than predicted by the Forrest report
Cite this as: BMJ 2011;343:d7627
We thank Ruairidh Milne for noting that the Forrest report used QALYs, the authors of both the Cochrane review (Peter Gøtzsche) and US Preventive Task Force review (Heidi Nelson) for help with queries, David Turner, Jason Madan and Lily Yao for advice on modelling, and Maryrose Tarpey for ongoing insights. Responsibility for errors remains with the authors.
Contributors: JR was responsible for the idea, early modelling, writing up of successive drafts, and is guarantor. MC was responsible for the detailed modelling. Both authors were involved in the final drafts of both the paper and the appendix.
Funding: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.
Ethical approval: Not required.
Data sharing: No additional data available.
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/2.0/ and http://creativecommons.org/licenses/by-nc/2.0/legalcode.