Quantification of harms in cancer screening trials: literature review
BMJ 2013;347:f5334 doi: https://doi.org/10.1136/bmj.f5334 (Published 16 September 2013)
- Bruno Heleno, PhD fellow1,
- Maria F Thomsen, registrar1,
- David S Rodrigues, consultant general practitioner2,
- Karsten J Jørgensen, senior researcher3,
- John Brodersen, associate professor1
- 1Research Unit for General Practice and Section of General Practice, Department of Public Health, University of Copenhagen, Øster Farimagsgade 5, PO Box 2099, 1014 Copenhagen K, Denmark
- 2Family Medicine Department, Nova Medical School, Campo dos Mártires da Pátria 130, 1169-056 Lisbon, Portugal
- 3Nordic Cochrane Centre, Rigshospitalet, Department 7811, Blegdamsvej 9, 2100 Copenhagen, Denmark
- Correspondence to: B Heleno bruno.heleno@sund.ku.dk
- Accepted 16 August 2013
Abstract
Objectives To assess how often harm is quantified in randomised trials of cancer screening.
Design Two authors independently extracted data on harms from randomised cancer screening trials. Binary outcomes were described as proportions and continuous outcomes with medians and interquartile ranges.
Data sources For cancer screening previously assessed in a Cochrane review, we identified trials from their reference lists and updated the search in CENTRAL. For cancer screening not assessed in a Cochrane review, we searched CENTRAL, Medline, and Embase.
Eligibility criteria for selecting studies Randomised trials that assessed the efficacy of cancer screening for reducing incidence of cancer, cancer specific mortality, and/or all cause mortality.
Data extraction Two reviewers independently assessed articles for eligibility. Two reviewers, who were blinded to the identity of the study’s authors, assessed whether absolute numbers or incidence rates of outcomes related to harm were provided separately for the screening and control groups. The outcomes were false positive findings, overdiagnosis, negative psychosocial consequences, somatic complications, invasive follow-up procedures, all cause mortality, and withdrawals because of adverse events.
Results Out of 4590 articles assessed, 198 (57 trials, 10 screening technologies) matched the inclusion criteria. False positive findings were quantified in two of 57 trials (4%, 95% confidence interval 0% to 12%), overdiagnosis in four (7%, 2% to 18%), negative psychosocial consequences in five (9%, 3% to 20%), somatic complications in 11 (19%, 10% to 32%), use of invasive follow-up procedures in 27 (47%, 34% to 61%), all cause mortality in 34 (60%, 46% to 72%), and withdrawals because of adverse effects in one trial (2%, 0% to 11%). The median percentage of space in the results section that reported harms was 12% (interquartile range 2-19%).
Conclusions Cancer screening trials seldom quantify the harms of screening. Of the 57 cancer screening trials examined, the most important harms of screening—overdiagnosis and false positive findings—were quantified in only 7% and 4%, respectively.
Introduction
Cancer screening can lead to harm as well as benefit.1 2 3 Harm related to screening can be somatic or psychosocial.4 5 6 7 8 9 10 11 12 13 Harms result from the screening test itself, from investigations because of false positive findings, and from overdiagnosis with subsequent overtreatment.3 5 12 13 Given the potential for serious harms in healthy individuals, screening should be offered only when the benefits are firmly documented and considered to outweigh the harms, which should be equally well quantified. The determination of benefit from screening requires assessment in randomised clinical trials, which are also capable of providing high quality evidence on harms.14 15 In general, however, harms are poorly reported in randomised trials,16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 and there is some evidence that reporting of harms is worse in non-pharmacological trials than in trials assessing drugs.22 23 24
At least three additional arguments support the importance of reporting harms in randomised trials of cancer screening. Firstly, screening is offered to healthy individuals and is an intervention initiated by the healthcare system, not at the request of a patient to solve a health problem. Secondly, interventions for which the benefits are modest or uncertain merit detailed consideration of harms,32 and systematic reviews of randomised trials of screening have shown either modest33 34 35 or no36 reductions in cancer specific mortality. Thirdly, a benefit for some will come at the expense of harm to others.37 38 39
The minimum evidence required to assess the harms of screening includes the frequencies of false positive findings, overdiagnosis, and complications of diagnostic investigations and treatment.13 In addition, withdrawals because of harms19 and the use of invasive follow-up procedures can be considered as proxy measures of severe harms. We hypothesised that cancer screening trials would not consistently or sufficiently quantify the expected associated harms.
Methods
Eligibility criteria
We included trials that evaluated breast cancer screening with mammography, self examination, or clinical examination; colorectal cancer screening with sigmoidoscopy or colonoscopy, faecal occult blood testing, or virtual colonoscopy; liver cancer screening with ultrasonography, α fetoprotein, or a combination; lung cancer screening with chest radiography or low dose spiral computed tomography of chest; ovarian cancer screening with ultrasonography, serological markers, or a combination; oral cancer screening with visual inspection; prostate cancer screening with prostate specific antigen, digital rectal examination, or a combination; and testicular cancer screening with self examination or clinical examination.
Publications reporting randomised trials were eligible if the trial compared a group of participants undergoing a cancer screening intervention with either no screening or an alternative screening intervention. Participants could be part of the general population or of a high risk population, such as heavy smokers. Trials had to assess the efficacy of cancer screening, defined as a reduction in the incidence of cancer, cancer specific mortality, or all cause mortality. Trials were included regardless of risk of bias. Individual articles were eligible if they provided data for both the screening and control groups, and if they did not pool data from randomised trials and observational studies. Finally, to be eligible, articles must have provided data for all participants, a random sample of all participants, or all participants enrolled in a single centre of a multicentre trial.
Search strategy for the identification of articles
We extracted the references to trials from Cochrane Systematic Reviews when these were available and performed an updated search in the Cochrane Central Register of Controlled Trials using the search terms described in each Cochrane review to find articles published since the review (appendix 1).
When no Cochrane Systematic Review was available, we sought clinical trial reports in the Cochrane Central Register of Controlled Trials. Our search strategies used a combination of controlled vocabulary (MeSH terms) and free text terms. They had three dimensions: terms related to cancer, terms related to the screening technology, and the term “screening” and its synonyms (appendix 1). We planned to apply no language restrictions, but because of a lack of resources we were unable to translate 12 articles identified for assessment of eligibility (six in Mandarin and six in Russian). Our last search was in May 2012.
It subsequently became clear that our search strategy missed some potentially relevant articles. We amended the protocol and designed new searches including either the name of trials known to us or the name of the principal investigators of the trials. These new searches were performed in Medline (1946-14 Aug 2012), Medline In-process and other non-indexed citations (to 14 Aug 2012), and Embase (1974-14 Aug 2012) (appendix 1).
We did not contact study authors nor did we perform searches of grey literature, such as doctoral theses or conference proceedings, as this would not reflect the information that is readily available in the literature.
Data collection and analysis
Selection of studies
BH and DSR independently scanned the titles and abstracts from reference lists of Cochrane Systematic Reviews and from the electronic searches. When the title or abstract did not provide sufficient data to rule out eligibility, the full text was obtained. Disagreements were resolved through consensus.
Data extraction and management
All articles were collected in a digital file format (pdf). Two weeks before data extraction, BH concealed information about authors, affiliations, date of publication, journal, and references with the stamp function in Adobe Acrobat 9 Pro (version 9.5.2). The pdf files were encrypted with a password, which restricted changes to the file security settings. BH and MFT independently extracted the data using standardised forms, blinded to each other’s results. Both authors extracted the data from the encrypted pdf files, which concealed author identification, year of publication, and the name of the journal, but not trial identification. Disagreements were resolved through consensus.
When results from a single trial were reported in multiple publications, data were collected in separate forms for each publication, but our unit of analysis was at trial level. If a single publication provided data from two or more trials, information for each trial was collected in separate forms.
Harm data
We included seven types of harms related to cancer screening: overdiagnosis, false positive findings, somatic complications caused by screening or follow-up procedures, negative psychosocial consequences caused by the screening test or follow-up procedures, the additional number of participants subjected to invasive procedures, all cause mortality (which might increase if harms include, for example, invasive follow-up procedures or substantial overdiagnosis and overtreatment), and withdrawals because of adverse events. For the qualitative assessment of harms, we required that two criteria were met before considering that an outcome had been reported: the absolute numbers or incidence rates had to be provided, and the outcome must have been explicitly mentioned. We accepted the trial authors’ definition of the outcome and did not assess whether that definition was appropriate. For example, in sigmoidoscopy trials we considered that false positives had been reported if they were mentioned, regardless of whether they were defined as a positive screening test with no cancer, a positive screening test with no advanced adenoma or cancer, or a positive screening test with no polyp.
We also extracted a crude quantitative measure of harm reporting. We marked the results section of each publication using a 1 cm2 grid. Thereafter, we measured the space devoted to the results section and the space devoted to reporting any of the harms mentioned above. The quantitative measure of harm reporting was the percentage of space devoted to harms out of the total space in the results section. Similar quantitative measures have been used in previous reviews of reporting of harm in randomised trials.19 23 24 30
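As a minimal illustration of this measure (the grid counts below are hypothetical, not trial data; the authors’ extracted data and R script are available from them), the percentage is simply the harm related area divided by the total results section area:

    # Minimal sketch in R; the grid counts are hypothetical examples
    results_section_cm2 <- 250  # total area of a results section (1 cm2 grid)
    harms_cm2 <- 30             # area of that section devoted to harm outcomes
    pct_space_on_harms <- 100 * harms_cm2 / results_section_cm2
    pct_space_on_harms          # 12, that is, 12% of the results section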
Other publication parameters
We extracted information on the type of screening technology, the year of recruitment of the first participant, whether disease specific mortality or incidence had been quantified, target population (general population or high risk group), geographical location, type of control group (unscreened group or alternative screening technology), and whether participants were individually or cluster randomised. If publications from the same trial mentioned different dates for the year of recruitment of the first participant, we chose the earliest date.
Data at trial level
We pooled data from all articles from the same trial. We considered that a trial had quantified a specific harm if at least one article from that trial did so. We applied the same criterion to data on incidence of cancer and mortality. When trials were reported in more than one article, the space devoted to the results section was defined as the sum of the space of the results sections in each article. Likewise, the space devoted to harms was the sum of the space devoted to harms in each of the articles reporting that trial.
Prespecified analyses
Analyses consisted of a descriptive assessment of included variables. We used proportions and exact confidence intervals for binary outcomes and medians and interquartile ranges for continuous outcomes. In the protocol, we hypothesised that the date of enrolment of participants could be an explanatory variable for quantification of harm and that this could be tested with regression models. Because of a lack of data, however, this could not be done. All statistical analyses were performed in R version 3.0.1.
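The following R sketch illustrates these descriptive methods (illustrative only, not the authors’ actual script; binom.test in base R returns the exact Clopper-Pearson interval). Using the observed count of two of 57 trials quantifying false positive findings reproduces the interval reported in the results:

    # Exact (Clopper-Pearson) 95% confidence interval for a binary outcome:
    # two of 57 trials quantified false positive findings
    fp <- binom.test(x = 2, n = 57)
    round(100 * fp$estimate)  # 4 (per cent)
    round(100 * fp$conf.int)  # 0 and 12: the reported 0% to 12% interval

    # Median and interquartile range for a continuous outcome, for example the
    # percentage of results section space devoted to harms (hypothetical data)
    space_pct <- c(0, 1, 2, 8, 12, 16, 19, 25, 40)
    median(space_pct)                           # 12
    quantile(space_pct, probs = c(0.25, 0.75))  # interquartile range bounds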
Additional analyses
In the original protocol (appendix 2), we specified that we would include only articles that reported data from all trial arms. While BH and DSR were assessing the articles for eligibility, we noted that several articles contained relevant information on harms only for the screened participants. Hence, BH reassessed all articles related to the included trials and identified those that reported data on harms only in the intervention group. These articles were included in an unplanned subsidiary analysis to test the robustness of the findings of the main analysis. As 12 articles were not translated, we performed another unplanned sensitivity analysis assuming that these articles had quantified harms.
We tabulated harm quantification according to the screening technology, geographical location, and type of control group (unscreened group or alternative screening technology). Given the small number of trials, we did not perform stratified statistical analyses. Because in some trials screening led to a reduced incidence of cancer (making any overdiagnosis impossible to detect), and in others the study design might have been inappropriate for assessing it (short follow-up or use of another screening intervention as the control group), we performed an analysis that excluded both these groups of trials from the denominators.
The protocol for this literature review and its amendment are available in appendix 2.
Results
Out of 4590 titles identified, we found 63 trials that aimed to assess the effect of cancer screening on cancer specific or all cause mortality, or both. Of these, only 57 had published results in at least one article that matched our eligibility criteria (figure and appendix 3). Of the six remaining trials, one trial was not completed because of low compliance, one trial had not started enrolling participants, and four trials are not yet completed. These four trials have reported results for the screened group but not for the control group. The 57 trials assessed 10 different screening interventions and enrolled 3 419 036 participants. We found no trials on testicular cancer screening or colorectal cancer screening with virtual colonoscopy.
Some of the 57 trials were reported in several articles. We found 198 articles that included data on both the screened and the control groups and used these in our main analyses. We also found 44 articles that reported data for the screened groups but not for the control groups. Our analyses were replicated in the combined 242 articles to assess whether harms had been quantified in at least the screened groups.
Table 1 shows the proportion of trials that quantified each individual outcome in at least one of the eligible articles describing that trial (the individual assessment of the trials and respective characteristics are available from the authors). Overall, cancer specific mortality and cancer specific incidence were quantified more often than harm related outcomes. While the former two were quantified in more than 80% of the trials, false positive findings were quantified for only two trials (4%, 95% confidence interval 0% to 12%), and overdiagnosis was quantified in four trials (7%, 2% to 18%). Only one trial (2%, 0% to 11%) quantified the number of withdrawals because of adverse effects. The median percentage of space in the results section devoted to harms data was 12% (interquartile range 2-19%). Table 2 shows quantification of harm stratified by type of screening. Quantification of harm stratified by geographical location of the trial and type of control group is shown in tables S1 and S2 in appendix 4.
We performed several sensitivity analyses with less strict criteria. When we also considered the 44 articles that reported data from only the screened groups, the proportion of trials that quantified some of the harms increased (table 1). For example, false positive findings were now quantified in 32% (95% confidence interval 20% to 45%) of trials, while overdiagnosis remained quantified in only 7% of trials. The results of the sensitivity analyses—which assumed the non-translated articles had quantified all the outcomes—were similar to those of the main analyses (table S3 in appendix 4). We also restricted the analysis of overdiagnosis to the trials which, after extended follow-up, found a higher incidence of cancer in screened participants than in the unscreened control group. In this subset, overdiagnosis was quantified in two out of 12 trials (17%, 2% to 48%).
Discussion
Summary of main results
The most important harms of screening—overdiagnosis and false positive findings—were quantified in only a minority of trials. Out of 57 cancer screening trials, 7% quantified overdiagnosis40 41 42 43 and 4% quantified false positive results.44 45 Only one trial reported the number of withdrawals because of harmful events (2%),46 and the median amount of space devoted to reporting of harms in the results section of the trial reports was 12%. Consequently, cancer screening trials rarely report what is considered the minimum amount of evidence required to quantify the harms of screening. In contrast, the effect of cancer screening on cancer specific mortality was reported in 89% of trials. It is therefore often difficult or impossible to weigh benefits against harms in cancer screening.
Interpretation of the results
We found few trials that met our criteria for minimal harm reporting, which suggests poor reporting of harms. An alternative explanation would be that our assessment criteria focus on irrelevant outcomes or that they do not capture the important aspects of harm reporting. Their relevance, however, is supported in several concept papers and editorials about the harms of screening.4 5 6 7 8 9 10 11 12 13 The exception is withdrawals, which is an unusual concept in screening literature and was taken from previous reviews of harm reporting.16 19 We included this outcome because it reflects the ultimate decision of the participant, the physician, or both, to discontinue an intervention.16 The importance of withdrawals, however, is controversial,32 and trialists might have considered it irrelevant. When the decision to withdraw is made by a clinician, it is possible to recognise this from the participant’s case report form; but when the decision is taken by the participant it might be difficult to distinguish it from other causes of loss to follow-up. Additionally, in some trials there was no direct contact with the control group and their information was collected from registries. These participants could not withdraw as they were unaware that they were part of a trial. In these trials it would be inappropriate to require withdrawal data from the controls. In summary, there are arguments against the relevance of withdrawals as a surrogate of harm; however, even when we excluded this outcome, the general pattern of poor reporting of harm persisted.
Three aspects of the criteria used to appraise the harm outcomes could be discussed. The first is whether these outcomes can be assessed for all trials. It is possible to argue that overdiagnosis cannot be assessed in all trials as this ideally requires a persistent increase in incidence after a long follow-up.41 47 We chose to present overdiagnosis by including all 57 trials in the denominator as it had been specified in our review protocol. For completeness, we also restricted the analysis to trials with long term follow-up, an increased incidence of cancer in the screened group, and an unscreened control group. Although the proportion of trials reporting overdiagnosis is higher in this subset (17% compared with 7% in the main analysis), it is still unacceptably low and does not change our conclusion. We found no reason to exclude trials from the denominators of the other analyses.
The second point is whether it is relevant to collect data from the unscreened arms for all harm outcomes. Reporting harm outcomes for the intervention and control groups is a central recommendation in the guidelines for reporting randomised trials,48 49 and we therefore assessed whether each screening harm was reported in both groups of the trials. Symptoms and incidental findings in unscreened participants can lead to invasive procedures, adverse psychosocial consequences, complications of diagnosis and treatment, and mortality from other causes. Hence, these four outcomes should be reported for the control groups to make it possible to assess the level of any surplus harm in the screened group. In contrast, it can be argued that it is necessary to undergo a screening test to experience a false positive result. If we accept this premise, participants in unscreened control groups cannot experience false positive findings and it would be adequate to report false positives only for the screened group when the comparator is no screening. This would mean that 18 trials (32%) would have reported false positive findings adequately, instead of the two (4%) that report them for both arms. Even if we allow for a less strict criterion, the vast majority of trials do not provide data on false positive findings. It is also possible to argue that it is not meaningful to report overdiagnosis for the screened and the unscreened groups because it can be quantified only indirectly (that is, comparing the incidence between groups). We considered, however, that overdiagnosis was reported when incidence data had been provided for both arms and the term “overdiagnosis” or equivalent had been used.
Thirdly, we could have included data presented outside of the results section. Some trials presented data on harm in the discussion section but not in the results section: five trials presented numbers for overdiagnosis,50 51 52 53 54 three trials presented the number of false positive results in the screened group,55 56 57 and one presented the number of invasive procedures in the screened group.57 In the case of overdiagnosis, the trials presented the difference in incidence of cancer in the results section and interpreted this as overdiagnosis in the discussion section. Addition of these to the numerator would mean that of 57 trials, nine (16%, 95% confidence interval 7% to 28%) reported overdiagnosis. When we considered the smaller subset of 12 trials with long term follow-up, increased incidence of cancer in the screened group, and an unscreened control group, overdiagnosis was reported in four trials (33%, 10% to 65%). Thus, in the most optimistic scenario only a third of the trials report the most serious harm of screening.
Strengths and weaknesses
We identified trial articles from either Cochrane Reviews or electronic searches in three different databases. It is therefore unlikely that we missed any important article for the trials included in this review. Our searches also identified trials of screening interventions that we were not aware of before this study (for example, cervical cancer screening with visual inspection or human papillomavirus typing). We have not included these in this review, which is therefore not fully comprehensive.
We tried to avoid underestimation of quantification of harm in four ways. Firstly, we assessed only whether any quantification of harms was present, not whether the harms were adequately reported or correctly defined. Secondly, we considered that harms had been reported even when the quality of the data on harms was lower than that for benefits (that is, when harms data were reported for only a subset of the included participants). Thirdly, when data on harms were provided in a table or a figure, we included the entire table or figure area in the numerator of our estimate of space devoted to harms. Data on harms often accounted for a small number of lines in a table. Finally, and unlike previous surveys of harm reporting,16 17 18 19 20 21 22 23 24 25 27 we extracted data from multiple articles reporting on a single trial.
Most of the concept papers and editorials about harms of screening were published within the past decade.4 7 8 9 10 11 12 Also, the data monitoring committee of the PLCO trial has recently recommended early stopping of the prostate cancer screening part because of concerns about harms.58 This suggests more concern about screening harms in recent years. Because of the small number of trials, however, we could not assess whether harm reporting has improved in recent years, as originally planned in the review protocol.
Strengths and weaknesses in relation to other studies
Several authors have stated that adequate evidence on the harms of cancer screening is lacking.5 12 13 59 We identified only one previous study that attempted to quantify the problem. This was a review of research articles about mammography screening including study designs other than randomised trials. Thirty eight per cent of the included articles did not report any harms; 23% mentioned harms but presented them as unimportant; and 39% acknowledged the existence of harms.60 Our results extend this finding to other types of cancer screening.
Several literature reviews have assessed harms reported in randomised trials of other types of medical interventions.16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 In these reviews, the proportion of trials reporting absolute numbers for various harms ranged from 41% to 88%,19 23 24 27 29 the proportion reporting withdrawals because of harms ranged from 25% to 94%,19 21 23 24 26 27 30 and the median space devoted to harms in the results section ranged from 0% to 14%.19 20 23 30 Although cancer screening trials assess a preventive activity targeted at healthy individuals, our results show that reporting of harm is no better than what was found in reviews of therapeutic interventions.
Meaning of the study
The trials we reviewed included large numbers of participants, followed them for long periods of time, required many resources, and provided valuable information about the impact of screening on cancer specific mortality. However, we found that the harms were poorly reported. Healthcare decision makers, healthcare practitioners, and, ultimately, patients therefore cannot make informed choices about cancer screening. This is problematic as many cancer screening programmes have important associated harms.
While we acknowledge that collecting data on harms will complicate cancer screening trials, this is not a sound argument against the strong ethical obligation to collect such data. If trialists do not report certain outcomes because they consider that the harms will be either rare or irrelevant when compared with the potential decrease in mortality, such information will not be available for people who judge these outcomes differently. We think that future screening trials should collect and report the expected harms of screening (false positives, overdiagnosis and overtreatment, psychosocial consequences, somatic complications, and all cause mortality). Adequate reporting of harm requires data from the control group as these provide a reference level and help to interpret harms data from the screened group.
Implications for future research
The CONSORT statement,48 49 which aims to improve reporting of clinical trials, has an extension specifically devoted to reporting of harm.16 Although most of the examples in this extension come from pharmacological trials, the extension is also applicable to screening trials. There are some topics, however, where direct application of the CONSORT statement to cancer screening trials seems difficult. How can withdrawals because of harmful events be distinguished from other sources of loss to follow-up in screening trials? Are there specific harms in cancer screening for which data from the intervention group alone are enough? Can scales be used to grade screening harms for severity? A discussion of these questions could help to standardise harm reporting in randomised trials of cancer screening and will hopefully lead to more complete evidence that allows informed decisions.
What is known about this topic
Cancer screening programmes require detailed consideration of harms as they target healthy people
Harms from screening include overdiagnosis and overtreatment, false positive findings, additional invasive procedures, negative psychosocial consequences, and somatic complications
It is unknown whether trials that assess cancer screening routinely quantify harms
What this study adds
The most important harms of screening, overdiagnosis and false positive findings, were quantified in only 7% and 4% of 57 cancer screening trials, respectively
There was variation in the proportion of trials quantifying other outcomes (from 9% quantifying psychological consequences to 60% quantifying all cause mortality)
Notes
Cite this as: BMJ 2013;347:f5334
Footnotes
We thank Rajeswari Aghoram, James Dickinson, and Yuk Tsan Wun for providing details about the search strategy used in the Cochrane Systematic Review of liver cancer screening. We thank Michael Nixon for helpful comments on an earlier draft.
Contributors: BH drafted the protocol, and KJJ and JB provided comments. BH and DSR assessed references for eligibility. BH and MFT extracted data. BH analysed data and drafted the manuscript. MFT, DSR, KJJ, and JB contributed to revisions with important intellectual content. All authors had full access to all data (including statistical reports and tables) in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. BH is guarantor.
Funding: BH is supported by Fundação para a Ciência e Tecnologia (governmental agency) grant SFRH/BD/74640/2010. The funder had no role in study design or data collection, analysis, or interpretation.
Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.
Ethical approval: Not required.
Data sharing: Comma separated files (csv) with the extracted data and the R script with the analyses are available from the authors.
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/3.0/.