Quantifying and monitoring overdiagnosis in cancer screening: a systematic review of methods
BMJ 2015; 350 doi: https://doi.org/10.1136/bmj.g7773 (Published 07 January 2015) Cite this as: BMJ 2015;350:g7773
- Jamie L Carter, resident physician1,
- Russell J Coletti, resident physician2,
- Russell P Harris, professor of medicine3
- 1Department of Medicine, University of California, San Francisco, San Francisco, CA 94110, USA
- 2Division of General Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- 3Sheps Center for Health Services Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Correspondence to: R P Harris
- Accepted 21 October 2014
Objective To determine the optimal method for quantifying and monitoring overdiagnosis in cancer screening over time.
Design Systematic review of primary research studies of any design that quantified overdiagnosis from screening for nine types of cancer. We used explicit criteria to critically appraise individual studies and assess strength of the body of evidence for each study design (double blinded review), and assessed the potential for each study design to accurately quantify and monitor overdiagnosis over time.
Data sources PubMed and Embase up to 28 February 2014; hand searching of systematic reviews.
Eligibility criteria for selecting studies English language studies of any design that quantified overdiagnosis for any of nine common cancers (prostate, breast, lung, colorectal, melanoma, bladder, renal, thyroid, and uterine); excluded case series, case reports, and reviews that only reported results of other studies.
Results 52 studies met the inclusion criteria. We grouped studies into four methodological categories: (1) follow-up of a well designed randomized controlled trial (n=3), which has low risk of bias but may not be generalizable and is not suitable for monitoring; (2) pathological or imaging studies (n=8), drawing conclusions about overdiagnosis by examining biological characteristics of cancers, a simple design limited by the uncertain assumption that the measured characteristics are highly correlated with disease progression; (3) modeling studies (n=21), which can be done in a shorter time frame but require complex mathematical equations simulating the natural course of screen detected cancer, the fundamental unknown question; and (4) ecological and cohort studies (n=20), which are suitable for monitoring over time but are limited by a lack of agreed standards, by variable data quality, by inadequate follow-up time, and by the potential for population level confounders. Some ecological and cohort studies, however, have addressed these potential weaknesses in reasonable ways.
Conclusions Well conducted ecological and cohort studies in multiple settings are the most appropriate approach for quantifying and monitoring overdiagnosis in cancer screening programs. To support this work, we need internationally agreed standards for ecological and cohort studies and a multinational team of unbiased researchers to perform ongoing analysis.
Overdiagnosis, the detection and diagnosis of a condition that would not go on to cause symptoms or death in the patient’s lifetime, is an inevitable harm of screening. Overdiagnosis in cancer screening can result from non-progression of the tumor or from competing mortality due to other patient conditions (that is, other conditions that would lead to the patient’s death before the cancer would have caused symptoms). The consequences of overdiagnosis include unnecessary labeling of people with a lifelong diagnosis as well as unneeded treatments and surveillance that cause physical and psychosocial harm.1 A patient who is overdiagnosed cannot benefit from the diagnosis or treatment but can only be harmed.2
Patients, healthcare providers, and policy makers need information about the frequency of overdiagnosis as they weigh the benefits and harms of screening. Several studies have found that patients want to factor information about overdiagnosis into their decisions about screening for breast or prostate cancer.3 4 5 On a policy level, accurate measurement of the frequency of overdiagnosis is essential for monitoring the effects over time of new screening technology (which could result in either increased or decreased overdiagnosis), new treatments, and interventions to reduce overdiagnosis.
Because it is impossible to distinguish at the time of diagnosis between an overdiagnosed cancer and one that will become clinically meaningful, measurement of overdiagnosis is not straightforward. Researchers have used various methods to indirectly quantify overdiagnosis resulting from cancer screening, but the magnitude of these estimates varies widely. We conducted a systematic review to identify and evaluate the methods that have been used for measuring overdiagnosis of cancer. We also analyzed the advantages and disadvantages of each method for providing valid and reliable estimates of the magnitude of overdiagnosis, and for monitoring overdiagnosis over time.
We have the following key questions:
1: What research methods have been used to measure overdiagnosis resulting from cancer screening tests?
2: What are the advantages and disadvantages of each method for:
Providing a valid and reliable estimate of the frequency of overdiagnosis?
Monitoring overdiagnosis over time?
We included studies that have quantified the frequency of overdiagnosis resulting from cancer screening in an asymptomatic adult population. We limited the scope of the review to studies of overdiagnosis in the nine types of solid tumors with the highest incidence in the United States in 2012: prostate, breast, lung, colorectal, melanoma, bladder, renal, thyroid, and uterine cancers.6 Studies in English from any setting and time frame were included. All study designs were included except non-systematic reviews, case reports, and case series. Systematic reviews were excluded if they simply summarized studies that had each quantified overdiagnosis (for example, by combining data from several estimates of overdiagnosis). We included systematic reviews that used data from identified studies to independently compute a new estimate of overdiagnosis.
We accepted any of three definitions of overdiagnosis, each with excess incidence attributable to screening in the numerator: (1) cancers diagnosed by screening; (2) all cancers diagnosed by any method during the screening period; and (3) all cancers diagnosed by any method over the patient’s lifetime (or long term follow-up).
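The three definitions share a numerator and differ only in the denominator, so the same excess incidence yields different percentages. The following is a minimal sketch of that arithmetic; the class name, field names, and case counts are hypothetical and are not taken from any included study.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ScreeningCounts:
    """Hypothetical case counts for one screened population (illustrative only)."""

    excess_cases: float      # excess incidence attributable to screening (numerator)
    screen_detected: float   # (1) cancers diagnosed by screening
    during_screening: float  # (2) all cancers diagnosed during the screening period
    long_term: float         # (3) all cancers diagnosed over long term follow-up


def overdiagnosis_by_definition(counts: ScreeningCounts) -> dict[str, float]:
    """Return the overdiagnosis fraction under each of the three denominators."""
    return {
        "screen_detected": counts.excess_cases / counts.screen_detected,
        "during_screening": counts.excess_cases / counts.during_screening,
        "long_term": counts.excess_cases / counts.long_term,
    }


# Made-up numbers chosen only to show how the denominator changes the result.
example = ScreeningCounts(excess_cases=50, screen_detected=250,
                          during_screening=400, long_term=500)
fractions = overdiagnosis_by_definition(example)
for name, value in fractions.items():
    print(f"{name}: {value:.1%}")
```

With these invented counts, the same 50 excess cases correspond to 20%, 12.5%, and 10% overdiagnosis under definitions (1), (2), and (3), which is why estimates are comparable only when the denominator is stated.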
Study identification and selection
We conducted a systematic search of PubMed and Embase on 28 February 2014 with no limits placed on dates or study design (see appendices for search strategy). To further find relevant studies, we also hand searched reference lists of systematic and narrative reviews identified during the initial search. Abstracts and full texts were reviewed independently by two reviewers for inclusion. Any disagreements about inclusion or exclusion of these studies were resolved by consensus, and a third senior reviewer was consulted to resolve any remaining disagreements.
One reviewer extracted relevant data into a standardized form. These data were verified by a second reviewer, and discrepancies were resolved by consensus.
Risk of bias assessment
We created standard criteria to evaluate risk of bias for each of the four main types of studies found in this review: modeling studies, pathological and imaging studies, ecological and cohort studies, and follow-up of a randomized controlled trial. Two reviewers independently rated the risk of bias for each study, and we resolved discrepancies by consensus. We adapted the criteria for ecological and cohort studies from quality criteria used in a recent systematic review of observational studies of breast cancer screening.7 Risk of bias criteria for randomized controlled trial follow-up and pathological and imaging studies were adapted from standard criteria used by the US Preventive Services Task Force (USPSTF).8 We developed a new set of criteria for evaluating modeling studies for the purpose of this review, outlined in table 1⇓.
Based on these criteria, we rated a study as having high, moderate, or low risk of bias. Studies with high risk of bias had a fatal flaw that made their results very uncertain; studies with low risk of bias met all criteria, making their results more certain. Studies that did not meet all criteria but had no fatal flaw (thus making their results somewhat uncertain) were rated as having moderate risk of bias. We give general deficiencies of the studies in each study type category in the appropriate section.
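The three-level rule described above can be written as a short decision function. This is only an illustrative sketch: the function name and the criterion names are hypothetical, and the actual criteria are those listed in the tables.

```python
from __future__ import annotations


def rate_risk_of_bias(criteria_met: dict[str, bool], has_fatal_flaw: bool) -> str:
    """Fatal flaw -> high; all criteria met -> low; otherwise moderate."""
    if has_fatal_flaw:
        return "high"
    if all(criteria_met.values()):
        return "low"
    return "moderate"


# Hypothetical criterion names, used only to exercise the rule.
study = {"data_sources_justified": True, "external_validation": False}
print(rate_risk_of_bias(study, has_fatal_flaw=False))
```

In this sketch a fatal flaw takes precedence over the other criteria, reflecting the statement that such a flaw makes a study's results very uncertain.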
Strength of evidence assessment
We developed criteria to evaluate overall strength of evidence for the body of literature for each study type based on criteria used by the USPSTF8 and the Grading of Recommendations Assessment, Development and Evaluation (GRADE) working group.9 Each individual study was evaluated for risk of bias, directness (see below), external validity, and precision. Ecological and cohort studies and randomized controlled trials were also rated on the appropriateness of their analysis and time frame. For these studies, analysis is a central consideration because a study can be well designed and performed with minimal bias but still provide an unreliable estimate of overdiagnosis because of a faulty analysis. Two reviewers independently determined ratings for each of these criteria, and we resolved discrepancies by consensus. We adapted criteria for evaluating external validity of individual studies from the USPSTF procedure manual.8 Although we initially assessed the external validity of studies based on their relevance to a general US adult population, we then reassessed external validity based on relevance to a Western European population, finding no change in our conclusions.
The GRADE working group defines directness as the extent to which the evidence being assessed reflects a single direct link between the interventions of interest and the ultimate outcome.9 In this review, we evaluated the extent to which the evidence links the screening test directly to the health outcome of excess cases of cancer attributable to screening without making assumptions. A study with good directness requires minimal assumptions to draw conclusions about the magnitude of overdiagnosis and avoids extrapolating over gaps in the evidence.
We combined the ratings for risk of bias, directness, analysis, time frame, external validity, and precision with an evaluation of the consistency of the results to determine the strength of evidence for the overall body of evidence for each study design and cancer type. Table 2⇓ outlines our definitions of these terms; a complete list of criteria used to evaluate risk of bias and strength of evidence by study design can be found in the online supplemental tables 3 and 4.
Based on the criteria above, we rated the strength of evidence for each study type as being high (that is, met all criteria), moderate (did not meet all criteria but had no fatal flaw), or low (had at least one fatal flaw that made estimates highly uncertain). We give general deficiencies of the literature for each study type in the appropriate section, including examples of what we regarded as fatal flaws.
Data synthesis and analysis
We performed qualitative data synthesis, organizing the results by study design and cancer type. We did not attempt to perform quantitative synthesis because of the heterogeneity of the study designs, populations, and results. Using our critical appraisal of individual studies and the body of evidence for each study design, we identified strengths and weaknesses of each study design used to measure overdiagnosis. We did not assess publication bias.
We reviewed 968 abstracts and 120 full texts, including 52 individual studies. When we identified multiple reports from the same authors investigating the same population or model, we included only the most recent study. The figure⇓ shows the flow diagram of the study selection process.10 The included studies fell into four methodological groups, which we categorized as modeling studies (n=21), pathological and imaging studies (n=8), ecological and cohort studies (n=20), and follow-up of a randomized controlled trial (n=3).
Characteristics of included studies: modeling studies
We included 21 modeling studies in this review: 10 of prostate cancer,11 12 13 14 15 16 17 18 19 20 seven of breast cancer,21 22 23 24 25 26 27 three of lung cancer,28 29 30 and one of colon cancer overdiagnosis.31 In general, these studies model the way cancer would hypothetically occur without screening, and then the way cancer occurs with screening, comparing the two to determine the frequency of overdiagnosis. These studies modeled a variety of screening situations and schedules. Some studies modeled only the non-progressive disease and not the competing mortality component of overdiagnosis23 26 27; such studies almost certainly underestimate its magnitude. Table 3⇓ summarizes the evidence from the included modeling studies.
Risk of bias: modeling studies
Several concerns raised the risk of bias in modeling studies. First, no modeling study discussed the potential biases in the data sources used in their models. Only two modeling studies13 28 provided a table of assumptions and data sources. No studies were supported by systematically reviewed evidence; most studies picked data inputs from a variety of sources without justification, raising the risk of bias to achieve a desired output.
Second, several studies found that mean sojourn time was a key uncertain variable for which sensitivity analyses should be performed, yet only six studies specifically varied mean sojourn time or its equivalent in univariate or probabilistic sensitivity analyses.11 13 24 25 27 28 All other studies either performed minimal sensitivity analyses that did not directly address key uncertain variables or did not perform sensitivity analyses at all, both of which we considered fatal flaws with high risk of bias.
Third, no study validated their model using a dataset and population different from the one to which the model was calibrated. Several studies used a dataset to calibrate the model and then “validated” the model by fitting it to the same original dataset. Performing true external validation would lend more credibility to the assumptions made in the model and would make it more likely that the calibrated parameters are applicable to other populations. Furthermore, all modeling studies adjusted for mean sojourn time or lead time using model-derived estimates of these values which are obtained with overdiagnosed cancers included in the calculation, resulting in incorrectly prolonged estimates of lead time which bias the overdiagnosis results toward zero.32
Overall, 15 of 21 modeling studies had a high risk of bias because they had the fatal flaw of not performing key sensitivity analyses. The five studies that performed univariate sensitivity analyses for mean sojourn time were rated as having moderate risk of bias.11 24 25 27 28
Strength of evidence: modeling studies
We rated overall strength of evidence as low for breast, prostate, lung, and colon cancer modeling studies. We rated directness as poor for all modeling studies, as they used insufficiently supported assumptions to draw conclusions about overdiagnosis, especially progression of cancer in the absence of screening. The frequency and rate of this progression are fundamental to overdiagnosis; the estimates from such models are by nature indirect. We rated the overall risk of bias for modeling studies as high, external validity as good, and consistency as poor. Precision could not be determined.
Pathological and imaging studies
Characteristics of included studies: pathological and imaging studies
We found eight studies that drew conclusions about overdiagnosis based on a pathological or imaging characteristic.33 34 35 36 37 38 39 40 These studies examined only overdiagnosis resulting from non-progressive disease and not competing mortality; thus, they underestimated total overdiagnosis. Table 4⇓ summarizes the evidence from the included pathological and imaging studies.
Risk of bias: pathological and imaging studies
Several problems increased risk of bias for these studies: inability to obtain complete follow-up information on the included patients,35 36 38 non-management of potential confounders,33 35 39 40 use of inconsistent methods for determining tumor characteristics,38 40 and invalid or unreliable ascertainment of cause of death.35 Overall, three lung cancer studies had a high risk of bias,35 38 and three had a moderate risk of bias. Both prostate cancer studies had a high risk of bias.
Strength of evidence: pathological and imaging studies
We rated the strength of evidence as low for all prostate and lung cancer pathological and imaging studies. With one exception, directness was poor for all pathological and imaging studies because the validity of the study estimates was contingent on the unexamined assumption that the pathological or imaging characteristics were directly and strongly correlated with cancer related morbidity and mortality. We rated the overall risk of bias of pathological and imaging studies as moderate to high, external validity as fair to good, and consistency as poor. Precision could not be determined.
Ecological and cohort studies
Characteristics of included studies: ecological and cohort studies
We found 20 ecological and cohort studies that met our criteria, 18 of breast cancer41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 and two of prostate cancer.59 60 The breast cancer studies were typically European with screening programs involving biennial mammography for women aged 50–69 years. Table 5⇓ summarizes the evidence from the included ecological and cohort studies.
Risk of bias: ecological and cohort studies
In general, ecological and cohort studies have a high risk of selection bias and confounding due to the comparison of non-randomized populations or cohorts. The included studies used several variations of unscreened reference populations with varying potential for bias. Most studies modeled the prescreening incidence trend through the study period to determine reference incidence, though this assumes that incidence would have continued at the same rate without non-linear changes. Several studies45 47 53 used contemporary geographic areas without screening programs as the reference population; this approach could introduce confounders that are distributed differently between the two geographic areas. The use of a historical control group is complicated by potential confounders that may have changed in a substantial way between time periods. Two studies used a combination of three control groups, including a contemporary unscreened group and historical groups in the regions with and without screening.49 51 These studies are better able to control for differences in incidence growth between regions but could still be biased by differential influence of confounders between regions. Some studies took additional steps to reduce the probability of selection bias and confounding, including adjusting for risk factors on a population level41 48 50 and considering “extreme” scenarios.41 Because of such additional steps, we were able to rate 18 of 20 ecological and cohort studies as having moderate risk of selection bias and confounding. We rated two studies as having high risk of bias because they compared screening attenders with non-attenders, groups with known differences in general health and health behaviors.44 55 We rated 17 of 20 breast cancer studies as moderate risk of measurement bias because they did not discuss the validity and reliability of their data sources.
Overall, we rated 17 of 20 ecological and cohort studies as having a moderate risk of bias. Three breast cancer studies,44 53 55 however, had a high risk of bias overall due to a high risk of confounding. Both prostate cancer studies had a moderate risk of bias overall.
Analysis: ecological and cohort studies
Several analysis issues related to measuring and calculating overdiagnosis are unique to ecological and cohort studies. Screening advances the time of diagnosis of preclinical cancers by the lead time, such that incidence is predictably increased in a screened population during the screening period. After the screening period, in the absence of overdiagnosis, cancers that would have presented clinically have already been detected by screening, so cumulative incidence tends to increase more gradually, and incidence rate declines. Because lead time varies among cancers and individuals, and with different screening strategies, there is no single lead time that correctly captures this time period for a population.
Often, overdiagnosis is calculated by determining the number of excess cases of cancer in a screened population (compared with a non-screened population) during the screening period, subtracting the deficit of cases in post-screening women compared with an unscreened reference, thus estimating the absolute difference in long term cumulative incidence attributable to screening. Studies that obtain follow-up data for a short period after screening ends may not sufficiently capture the post-screening deficit of cases and thus can overestimate overdiagnosis. Other studies instead perform a statistical adjustment for lead time as an alternative to achieving longer term follow-up. The validity of these statistical adjustments, however, is not clear. For example, adjusting for a “mean” lead time is likely biased because overdiagnosis depends not just on “average” lead time but also on the distribution of individual lead times, which is much more difficult to estimate and may be less generalizable from population to population. Also, most estimates of lead time are derived from models which include overdiagnosed cancers in their calculation of lead time, leading to underestimates of true overdiagnosis.32
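The excess-minus-deficit calculation, and its sensitivity to the length of post-screening follow-up, can be sketched as follows. The function name and all case counts are hypothetical; the yearly counts assume equally sized screened and reference cohorts and are not drawn from any included study.

```python
from typing import Sequence


def net_excess_cases(
    screened_during: float,
    reference_during: float,
    screened_post: Sequence[float],
    reference_post: Sequence[float],
    follow_up_years: int,
) -> float:
    """Excess cases during screening minus the post-screening deficit observed so far."""
    excess = screened_during - reference_during
    deficit = sum(
        ref - scr
        for scr, ref in zip(screened_post[:follow_up_years],
                            reference_post[:follow_up_years])
    )
    return excess - deficit


# Made-up yearly case counts after screening ends: the screened cohort shows a
# deficit that shrinks to zero by the fourth year.
screened_post = [70, 80, 90, 100, 100]
reference_post = [100, 100, 100, 100, 100]
for years in (0, 1, 3, 5):
    print(years, net_excess_cases(1200, 1000, screened_post, reference_post, years))
```

In this invented example the apparent excess falls from 200 cases with no follow-up to 140 cases once the whole deficit has been observed, which illustrates why short follow-up can overestimate overdiagnosis.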
We rated the adequacy of the time frame of included ecological and cohort studies. Because the lead time magnitude and distributions are largely unknown, we used these ratings as a general guide to highlight where biased estimation might be occurring. When studies performed a statistical adjustment for lead time43 45 50 52 we did not rate their time frame. Two studies with no follow-up post-screening received poor ratings.53 56 We rated as good a cohort study that achieved at least 10 years follow-up post-screening on all women44 and a study41 performed over a 30 year period during which screening had reached a steady state.44 We rated the remaining ecological and cohort studies as fair, as they achieved variable amounts of follow-up time (4–14 years) post-screening. Studies that performed an unjustified statistical adjustment for lead time introduced greater uncertainty into the analysis and greater concern about bias; we thus rated their analysis as poor.43 45 50 52
Six ecological and cohort studies calculated overdiagnosis as the risk ratio of cumulative incidence of cancer in the screening group compared with the reference group over the screening period and a period of follow-up post-screening.51 54 55 59 60 These studies defined overdiagnosis as the proportion of all cancers (including ones diagnosed after the screening period) that would never have caused clinical problems. The inclusion of cases diagnosed after the screening period in the denominator dilutes the estimate of overdiagnosis and makes the frequency of overdiagnosis highly dependent on the length of follow-up time. We rated the analysis of these studies as poor, because they all provided underestimates of overdiagnosis according to our definition. We rated as good studies that calculated overdiagnosis as the absolute excess of cases in the screened population divided by cases diagnosed in the screened population during the screening period.41 44 46 47 57 58
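The dilution effect can be made concrete with a small calculation. The sketch below is a simplified stand-in for the risk ratio approach: it holds the number of excess cases fixed and adds post-screening cases to the denominator. The function name, parameters, and numbers are hypothetical.

```python
def overdiagnosis_fraction(
    excess_cases: float,
    screening_period_cases: float,
    post_screening_cases_per_year: float = 0.0,
    follow_up_years: int = 0,
) -> float:
    """Excess cases divided by cases in the screened group.

    With follow_up_years=0 the denominator is restricted to the screening
    period; larger values add post-screening cases to the denominator.
    """
    denominator = (screening_period_cases
                   + post_screening_cases_per_year * follow_up_years)
    return excess_cases / denominator


# Made-up numbers: 150 excess cases, 1000 cases in the screened group during
# the screening period, and 100 further cases per year afterwards.
for years in (0, 5, 10):
    value = overdiagnosis_fraction(150, 1000, 100, years)
    print(f"{years} years of post-screening follow-up: {value:.1%}")
```

Here the same 150 excess cases give 15% with a screening-period denominator, 10% after five years of follow-up, and 7.5% after ten, so the estimate depends on follow-up length rather than on any change in overdiagnosis itself.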
Directness: ecological and cohort studies
We rated directness as good for 15 of the 20 ecological and cohort studies because they directly quantified excess cumulative incidence in a screened population. The exceptions were the studies that performed an unjustified statistical adjustment for lead time,43 45 50 52 which we rated as having poor directness because these studies require additional assumptions about cancer progression, and one study which excluded data from the prevalence screening round.42
Strength of evidence: ecological and cohort studies
We rated the strength of evidence as low for the overall body of ecological and cohort studies. However, five breast cancer ecological studies stood out among the others for having a moderate risk of bias, an unbiased analysis, and fair to good time frames.41 46 47 57 58 The estimates of overdiagnosis from these studies gave greater confidence of accuracy. We rated the overall directness of ecological and cohort studies as good (n=15) or poor (n=5), external validity as good, precision as fair, consistency as poor, analysis as good (n=6) or poor (n=14), and time frame as fair.
Follow-up of a randomized controlled trial
Characteristics of included studies: follow-up of a randomized controlled trial
We included three long term follow-up studies of randomized controlled trials: one of the Malmo randomized controlled trial in Sweden,61 which randomized women aged 44 to 69 years to several mammography rounds or no screening and followed them for 15 years; the second of the National Lung Screening Trial,62 which randomized high risk US patients aged 55–74 years and followed them for up to seven years; and the third of the Canadian National Breast Screening trial,63 which randomized Canadian women aged 40–59 and followed them for an average of 22 years. Table 6⇓ summarizes the evidence from the included follow-up studies of a randomized controlled trial.
Risk of bias: follow-up of a randomized controlled trial
All studies had a low risk of selection bias and confounding. We rated the risk of measurement bias as moderate in the Malmo and National Lung Screening Trial studies because the authors did not describe the validity and reliability of their data sources, particularly over the long term follow-up periods. In the National Lung Screening Trial follow-up, measurement bias was also moderate because lung cancer incidence assessment was not masked. Overall risk of bias was low for all three studies.
Strength of evidence: follow-up of a randomized controlled trial
We rated the time frame of the Malmo and Canadian studies as good because they achieved complete 15 year follow-up of all women in the study and 22 year follow-up on average, respectively. We rated the National Lung Screening Trial time frame as fair because it achieved only seven years of follow-up. The initial analysis of the Malmo study received a poor rating for diluting the overdiagnosis estimate, but the re-analysis performed by Welch and colleagues was unbiased.64 The Welch re-analysis used the denominator of cases diagnosed during the screening period and a numerator of excess cases diagnosed in the screening group, resulting in 18% overdiagnosis rather than 10%. Overall strength of evidence was moderate for both cancer types, as only one or two studies represented each type. We rated the overall directness for follow-up of randomized controlled trials as good, external validity as good, precision as fair, consistency as not applicable, analysis as good, and time frame as fair to good. The overall rating was moderate.
Principal findings of the review
This review identified four major research methods that have been used to measure overdiagnosis from cancer screening: modeling studies, pathological and imaging studies, ecological and cohort studies, and follow-up of a randomized controlled trial. Using the frameworks for evaluating risk of bias and strength of evidence, we identified strengths and weaknesses of each of these methods for providing valid and reliable estimates of the frequency of overdiagnosis and the suitability for monitoring overdiagnosis over time (table 7⇓).
For the purposes of estimation of overdiagnosis at a point in time, follow-up of a randomized controlled trial is ideal for minimizing biases and directly addressing the question of interest. However, because these studies require significant time and resources and often have limited external validity, they are less useful for monitoring overdiagnosis over time.
Modeling studies require less time than randomized controlled trials and, with the help of sometimes unexamined assumptions, are able to project through areas of uncertainty. These projections, however, do not change the fact of uncertainty. Sensitivity analyses demonstrate that varying key uncertain inputs such as the distribution of sojourn time substantially changes overdiagnosis estimates.11 Most of the included studies made no efforts to mitigate these uncertainties with unbiased selection of data sources, sensitivity analyses, or external validation, and most had a high risk of bias. Because the effectiveness of treatment and the sensitivity of screening tests change over time (which may change the natural history of both treated and untreated cancer), models would need constant modification to provide valid monitoring of overdiagnosis over time. Finally, models would need to continually adjust for changes in competing mortality risks, which also change over time.
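To illustrate why sojourn time and competing mortality matter so much, consider a deliberately simple toy calculation. It is not any of the reviewed models: it assumes that every screen detected cancer would progress, that time to clinical presentation is exponentially distributed with a given mean sojourn time, and that death from other causes occurs at a constant hazard. Under those assumptions, the probability that other-cause death comes first is the hazard divided by the sum of the two rates. All names and parameter values below are hypothetical.

```python
def competing_mortality_overdiagnosis(
    mean_sojourn_years: float,
    other_cause_hazard: float,
) -> float:
    """Probability that other-cause death precedes clinical presentation.

    Toy assumptions: exponential time to clinical presentation (rate equal to
    1 / mean sojourn time) and a constant other-cause mortality hazard.
    Non-progressive disease is ignored.
    """
    if mean_sojourn_years <= 0 or other_cause_hazard < 0:
        raise ValueError("mean sojourn time must be > 0 and hazard must be >= 0")
    progression_rate = 1.0 / mean_sojourn_years
    return other_cause_hazard / (other_cause_hazard + progression_rate)


# One-way sensitivity analysis over mean sojourn time, with an assumed
# other-cause mortality hazard of 0.02 per year.
for mean_sojourn in (2, 5, 10):
    share = competing_mortality_overdiagnosis(mean_sojourn, 0.02)
    print(f"mean sojourn {mean_sojourn} years: {share:.1%}")
```

Even in this toy setting, moving the assumed mean sojourn time from 2 to 10 years raises the overdiagnosed share roughly fourfold, and changing the assumed mortality hazard shifts it again, which is consistent with the need for explicit sensitivity analyses and periodic re-estimation.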
Pathological and imaging studies tend to over-simplify overdiagnosis, with an arbitrary cutoff of a defining characteristic such as volume doubling time. Furthermore, both modeling studies and pathological and imaging studies are indirect because they require assumptions about cancer progression.
Some ecological and cohort studies are limited by confounding and problematic analyses, including uncertain statistical adjustments. However, when well designed and interpreted in combination with studies from other geographic areas and time periods, these studies can provide credible estimates of overdiagnosis. They are also suitable for monitoring of “real world” overdiagnosis over time.
Similar to models that require an estimate of lead time, some ecological and cohort studies have performed a statistical adjustment for lead time in their analyses, which introduces uncertainty. Lead time is not only uncertain, but actually prospectively unknowable, as it varies among individuals and by screening practice. It is possible to calculate an average lead time from randomized trials, though these estimates are biased because they include overdiagnosed cancers in the calculation if a model is used. The heterogeneity of cancer and of individuals within and between populations, as well as variation and changes in the sensitivity of screening tests and treatment effectiveness, makes estimating the lead time distribution a highly uncertain endeavor.
Ongoing ecological and cohort studies within established national or regional screening programs, however, with appropriate collection of information about cancer incidence, potential confounders, screening adherence, and treatments used, have the ability to compare cancer incidence in areas with one type of screening program to incidence in areas with a different type of screening program. When carefully analyzed in an unbiased manner, such international ecological and cohort studies have the potential to help us better understand the effects of different screening programs on overdiagnosis, as well as trends in overdiagnosis as screening programs and treatments change over time. The potential credibility and usefulness of ecological and cohort studies are greater than those of modeling studies for these purposes.
Strengths and weaknesses of the review
The major strengths of this study are that it is a systematic review, that it offers specific criteria for evaluating studies measuring overdiagnosis, and that it looks broadly at studies of overdiagnosis of different types of cancer. There are several limitations of our review. We had to modify criteria for strength of evidence to fit the different research designs; readers should examine these criteria when interpreting our findings. We combined certain studies when multiple studies were available from the same authors or were using the same model and population, and it is possible that we missed some of the variability in the data available. We also limited the scope of our review to include only the nine types of solid tumors with the highest incidence in US adults, and no overdiagnosis estimates were available for melanoma and bladder, renal, thyroid, and uterine cancers.
Strengths and weaknesses in relation to other studies
To our knowledge, there are no other systematic reviews that have comprehensively identified all studies that measure overdiagnosis. Several systematic and non-systematic reviews have explored a subset of the overdiagnosis literature. Biesheuvel and colleagues systematically reviewed studies of breast cancer overdiagnosis with a focus on potential sources of bias in the estimates.65 We disagree that statistical adjustment and exclusion of prevalence screening data, which they recommend, are adequate to manage problems of lead time. Furthermore, they advocate the “cumulative incidence method,” in which overdiagnosis is calculated as a risk ratio of cumulative incidences several years after screening has ended. This method has been a major source of confusion for other researchers who have referenced their review. It is problematic because it dilutes the overdiagnosis estimate and makes it dependent on the length of follow-up time.65 Puliti and colleagues reviewed European observational studies of breast cancer overdiagnosis, making note of which studies adequately adjusted for breast cancer risk and lead time.66 We question their assessment, as they favorably rated studies that statistically adjusted for lead time as well as studies that included post-screening follow-up years in the analysis.
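The dilution described above can be illustrated with a small sketch. All numbers are hypothetical and chosen only for arithmetic convenience: 1100 cases in a screened group and 1000 in a comparison group by the end of screening, with both groups then accruing 100 further cases per year. The function names are illustrative and do not come from any reviewed study. Under these assumptions the risk ratio of cumulative incidences shrinks toward 1 as follow-up lengthens, whereas the absolute excess divided by cases diagnosed during screening does not change.

```python
def cumulative_incidence_ratio(screened_cases, comparison_cases,
                               later_cases_per_year, followup_years):
    """Risk ratio of cumulative incidences after a given post-screening follow-up.

    Assumes (hypothetically) that both groups accrue the same number of new
    cases per year once screening has ended.
    """
    later_cases = later_cases_per_year * followup_years
    return (screened_cases + later_cases) / (comparison_cases + later_cases)


def excess_over_screening_cases(screened_cases, comparison_cases):
    """Absolute excess of cases divided by cases diagnosed during screening."""
    return (screened_cases - comparison_cases) / screened_cases


# Hypothetical counts, not taken from any study in the review.
SCREENED = 1100        # cases in the screened group by the end of screening
COMPARISON = 1000      # cases in the comparison group over the same period
LATER_PER_YEAR = 100   # cases per year in each group after screening ends

ratios = {
    years: cumulative_incidence_ratio(SCREENED, COMPARISON, LATER_PER_YEAR, years)
    for years in (0, 5, 10, 20)
}
fixed_estimate = excess_over_screening_cases(SCREENED, COMPARISON)

for years, ratio in ratios.items():
    print(f"follow-up {years:2d} y: risk ratio {ratio:.3f}, "
          f"excess / screening-period cases {fixed_estimate:.3f}")
```

The same 100 excess cases are spread over an ever larger denominator in the ratio, so the apparent excess depends on where follow-up stops.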
Etzioni and colleagues non-systematically reviewed studies of breast and prostate cancer overdiagnosis.67 They label ecological and cohort studies that do not statistically adjust for lead time as the “excess incidence approach” of overdiagnosis estimation and argue that these studies may yield a biased estimate if the early years of screening dissemination are included. They advocate excluding the first few years of screening data to make an overdiagnosis estimate less biased. We agree that a study that includes only the first few years of screening dissemination, without any post-screening follow-up, can overestimate overdiagnosis, but most existing studies appropriately measure incidence during a screening and post-screening follow-up period and are thus able to measure overdiagnosis without this bias.67
Etzioni and colleagues also discuss modeling studies for measuring overdiagnosis, which they refer to as the “lead time approach.” They claim that the main limitation of modeling studies is their lack of transparency, and that prior publication of the model in peer reviewed statistics literature is a positive indicator of the model’s validity.67 Rather than lack of transparency, we found that the primary limitations of modeling studies are their inherent lack of directness and the potential for key uncertain inputs to greatly alter overdiagnosis estimates. Prior model publication in the statistics literature is not a sufficient indicator of a model’s validity, and authors of modeling studies should be encouraged to take steps to increase the validity of their study by using systematically reviewed data inputs and performing sensitivity analyses and external validation.
Finally, Etzioni and colleagues point out a dichotomy in the studies they chose to present, where modeling studies tended to have much lower estimates of overdiagnosis than ecological studies, particularly among breast cancer studies.67 However, we found several breast cancer modeling studies22 25 27 with much higher overdiagnosis estimates than the ones they presented, as well as ecological studies51 54 55 with lower estimates than those presented. Their suggestion that all ecological and cohort studies overestimate overdiagnosis is unfounded.
Meaning of the review: implications for future practice and research
We suggest that the public health policy community begin a coordinated effort to develop an international ecological and cohort data monitoring system for cancer screening programs, including monitoring overdiagnosis. We found that well conducted ecological and cohort studies performed in a variety of settings can give accurate estimates and enable us to compare overdiagnosis among different screening programs and to monitor overdiagnosis over time. Some of this research is ongoing, especially in European countries with breast cancer screening programs, but it is not being performed in a uniform way. We suggest the formation of a group of unbiased international experts to set standards for ecological and population cohort studies, for countries to adopt these standards in their registries, and then for unbiased methodological experts to conduct ongoing studies to monitor screening and overdiagnosis over time.
These standards should include an adequate time frame that achieves sufficient follow-up post-screening, such that all participants in the post-screening age groups have previously been offered screening. Researchers should determine standard population level confounders, unique to each cancer type, that should be monitored and adjusted for. In addition to considering cancer risk factors as potential confounders, information systems should monitor screening strategies, screening adherence, treatments used, and patient outcomes (such as complications, morbidity, and mortality). Finally, standards for analysis should include calculation of overdiagnosis as an absolute excess of cases attributable to screening divided by a denominator of cases diagnosed during the screening period.
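The calculation recommended above can be sketched as follows. This is a minimal illustration under stated assumptions, not a standard drawn from any of the reviewed studies: the annual case counts are hypothetical, the comparison population is assumed to be the same size as the screened population and already adjusted for confounders, and the function name `overdiagnosis_estimate` is invented for this example.

```python
from typing import Sequence


def overdiagnosis_estimate(screened: Sequence[float],
                           comparison: Sequence[float],
                           screening_years: int) -> float:
    """Absolute excess of cases divided by cases diagnosed during screening.

    `screened` and `comparison` hold annual case counts for the same calendar
    years: the screening period first, then any post-screening follow-up.
    `screening_years` is the number of leading years in which screening
    was offered.
    """
    if len(screened) != len(comparison):
        raise ValueError("both series must cover the same years")
    if not 0 < screening_years <= len(screened):
        raise ValueError("screening_years must lie within the series")
    # Excess accumulated over the whole window, including follow-up years.
    excess = sum(screened) - sum(comparison)
    # Denominator: cases diagnosed in the screened group during screening only.
    denominator = sum(screened[:screening_years])
    return excess / denominator


# Hypothetical counts: 10 years of screening, then 5 years of follow-up in
# which incidence dips below the comparison level (earlier diagnosis due to
# lead time) before returning to it.
screened = [150, 120, 110, 110, 110, 110, 110, 110, 110, 110,
            70, 80, 90, 100, 100]
comparison = [100] * 15

with_followup = overdiagnosis_estimate(screened, comparison, screening_years=10)
without_followup = overdiagnosis_estimate(screened[:10], comparison[:10],
                                          screening_years=10)

print(f"with post-screening follow-up:    {with_followup:.1%}")
print(f"without post-screening follow-up: {without_followup:.1%}")
```

In this toy series, stopping at the end of screening counts cases that were merely diagnosed earlier as excess cases, so the estimate without follow-up is larger than the estimate with it. This is why the standards call for sufficient post-screening follow-up.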
Setting up these registries and information systems may be challenging for some countries, but others have already made great strides in this direction. There will certainly need to be an initial investment of resources, but, once established, the potential benefits from these information systems are large. These systems could examine the effects of variations in screening programs on the magnitude of harms and costs of overdiagnosis, and could determine when a screening program is no longer effective because of improved treatment. Beyond overdiagnosis, these studies may also provide real world information about the benefits and harms of newer screening technologies, helping to make policy decisions about which programs to implement more widely. Such information systems could also provide platforms on which new screening programs could be efficiently tested in randomized controlled trials.
Researchers have measured overdiagnosis using four main methods. Follow-up of a randomized trial is ideal for internal validity but requires extended time, may lack external validity, and is not useful for monitoring. Modeling studies and pathological and imaging studies are simpler to perform but introduce uncertainty through their lack of directness and their reliance on assumptions about cancer progression. Ecological and cohort studies can be limited by confounding and require careful analysis, but when performed well they can provide a more valid and reliable estimate of overdiagnosis. They are also well suited to monitoring and comparing screening programs over time. A group of unbiased researchers should set standards for these studies and monitor overdiagnosis and other outcomes of cancer screening programs in multiple countries. Monitoring screening programs is important not only for reducing overdiagnosis but also for maximizing the benefits of cancer screening while minimizing the harms and costs.
What is already known on this topic
- Studies of cancer overdiagnosis, using various methods, have found an extremely wide range of results
- It is unclear how to evaluate the methods of these studies in order to interpret the conflicting results and how to better perform such studies in the future
What this study adds
- This systematic review highlights the high potential for bias and the reliance on unproven assumptions in modeling studies and studies that quantify overdiagnosis using pathological or imaging characteristics
- We recommend that well done ecological or cohort studies performed by unbiased researchers be used to quantify and monitor overdiagnosis in various settings worldwide
Contributors: All the authors were involved in conceptualizing the work, performing abstract and full text review, performing quality rating of included studies, synthesizing the results and drawing conclusions, and drafting and reviewing the manuscript. JLC performed the literature search, was involved in all aspects of the systematic review process, and was the lead author of the manuscript. RJC was heavily involved in abstract and full text review, data abstraction, quality rating, and data synthesis and analysis. RPH was the lead reviewer and editor of the manuscript in addition to being involved in the review process. JLC and RPH are the guarantors of the paper.
Funding: This project was supported by the Agency for Healthcare Research and Quality (AHRQ) Research Centers for Excellence in Clinical Preventive Services (grant No P01 HS021133). The content is solely the responsibility of the authors and does not necessarily represent the official views of the Agency for Healthcare Research and Quality. The funders had no role in study design; in the collection, analysis, or interpretation of data; in the writing of the report; and in the decision to submit the article for publication.
Competing interests: All authors have completed the Unified Competing Interest form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.
Ethical approval: Not required.
Transparency: The lead author affirms that this manuscript is an honest, accurate and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from this study as planned have been explained.
Data sharing: Technical appendices are available from the corresponding author at firstname.lastname@example.org.
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.