Very large treatment effects in randomised trials as an empirical marker to indicate whether subsequent trials are necessary: meta-epidemiological assessment
BMJ 2016; 355 doi: https://doi.org/10.1136/bmj.i5432 (Published 27 October 2016) Cite this as: BMJ 2016;355:i5432
- Myura Nagendran, honorary clinical research fellow1,
- Tiago V Pereira, research scientist2,
- Grace Kiew, medical student3,
- Douglas G Altman, professor4,
- Mahiben Maruthappu, public health registrar5,
- John P A Ioannidis, professor6,
- Peter McCulloch, professor7
- 1Division of Anaesthetics, Pain Medicine and Intensive Care, Imperial College London, UK
- 2Health Technology Assessment Unit, Institute of Education and Sciences, Hospital Alemão Oswaldo Cruz, Sao Paulo, Brazil
- 3Gonville and Caius College, University of Cambridge, UK
- 4Centre for Statistics in Medicine, Oxford, UK
- 5Department of Epidemiology and Public Health, University College London, UK
- 6Departments of Medicine, of Health Research and Policy, and of Statistics, and Meta-Research Innovation Center at Stanford (METRICS), Stanford University, USA
- 7Nuffield Department of Surgical Science, John Radcliffe Hospital, University of Oxford, Oxford OX3 9DU, UK
- Correspondence to: P McCulloch
- Accepted 29 September 2016
Objective To examine whether a very large effect (VLE; defined as a relative risk of ≤0.2 or ≥5) in a randomised trial could be an empirical marker that subsequent trials are unnecessary.
Design Meta-epidemiological assessment of existing published data on randomised trials.
Data sources Cochrane Database of Systematic Reviews (2010, issue 7) with data on subsequent large trials updated to 2015, issue 12.
Eligibility criteria All binary outcome forest plots were selected, which contained an index randomised trial with a VLE that was nominally statistically significant (P<0.05), included a subsequent large randomised trial (≥200 events and ≥200 non-events) for validation of the effect, assessed a primary outcome of the review, and was not a subgroup or sensitivity analysis.
Results Of 3082 reviews yielding 85 002 forest plots, only 44 (0.05%) satisfied the inclusion criteria. Index trials were generally small, with a median sample of 99 (median 14 events). Few index trials were rated at low risk of bias (9 of 44; 20%). The relative risk was closer to the null in the subsequent large trials in 43 of 44 cases. Subsequent large trial data failed to find a statistically significant (P<0.05) effect in the same direction in 19 cases (43%, 95% confidence interval 29% to 58%). Even when the subsequent large trials did find a significant effect in the same direction, the additional primary outcomes in most of these trials would have to be considered before deciding in favour of using the intervention. Subsequent large trial data found a statistically significant effect in the same direction in 19 of 21 cases when the index trial also had a value of P<0.001.
Conclusions The frequency of VLEs followed by a large trial is vanishingly small, and where they occur they do not appear to be a reliable marker for a benefit that is reproducible and directly actionable. An empirical rule using a VLE in a randomised controlled trial as a marker that further trials are unnecessary would be neither practical nor useful. Caution should be taken when interpreting small studies with very large treatment effects.
Randomised controlled trials are perceived as the gold standard for settling interventional questions and maintain a dominant position in the hierarchy of medical evidence.1 Under ideal circumstances, their data can provide essential information on efficacy and harms to clinicians and act as a powerful guide for policy makers. However, the value of conducting trials can be limited by both logistical factors that inhibit recruitment and recognised deficiencies in reporting (bias, selective publication, and lack of transparency).2 A further crucial aspect of conducting such trials is the ethical requirement for clinical equipoise between treatments. Reaching a consensus agreement within the medical community on whether such equipoise exists in a given situation can often be difficult.3
Some clinicians might find equipoise more difficult than others,4 and where initial reports have generated enthusiasm in the clinical community, the argument that the superiority of the new treatment is “obvious” and that a further trial would therefore be “unethical” is frequently advanced. This has led to serious problems in areas such as surgery where it has proved difficult or impossible to conduct randomised controlled trials of new techniques and devices because of strong beliefs based on weak evidence of large benefits. Therefore, the question of when an effect is so obvious that it does not require further testing has real practical importance.
There are some situations in which treatment effects are so large that bias, while perhaps having some impact on the overall effect size, is unlikely to affect the large clinical and statistical significance of the result.5 Although most healthcare interventions tend to provide only modest benefits,6 there might be a subset where a very large effect (VLE) is seen.7 If a set of conditions could be defined where it could be demonstrated that VLE sizes made it highly unlikely that the superiority of the treatment would be refuted by further trials, such trials would be wasteful of resources as well as potentially unethical.8
Therefore, we set out to identify trials showing a VLE (relative risk of ≤0.2 or ≥5) that were followed by a further large trial (≥200 events and ≥200 non-events) within the Cochrane Database of Systematic Reviews, and to evaluate this relative risk threshold as an empirical marker indicating that further trials are unlikely to be useful or necessary.
Definition of a VLE, index trial, and large trial
For consistency, we focused only on binary outcomes in randomised trials. We based our definition of a VLE on that used in previous empirical work on assessing large treatment effects.7 Pereira and colleagues formed a definition based on the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) scale for relative risks in non-randomised data. Within this scale, relative risks of two to five are defined as large, and those greater than five as very large.9 The relative risk was preferred over the odds ratio because the odds ratio may be substantially larger when outcomes are very common. Accepting that point estimates of effect might not provide useful information when the confidence intervals are wide and the effect is not nominally significant, we included only trials with a relative risk of five or more (or ≤0.20) that had a nominally statistically significant effect based on a Fisher exact test (P<0.05).
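The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' extraction code: the function names are hypothetical, the two-sided Fisher P value uses the common "sum of tables no more probable than the observed one" convention, and treating a control arm with zero events as an infinite relative risk is an assumption the source does not address.

```python
from math import comb

def relative_risk(events_trt, n_trt, events_ctl, n_ctl):
    """Risk in the treatment arm divided by risk in the control arm."""
    if events_ctl == 0:
        return float("inf")  # assumption: zero control events -> infinite RR
    return (events_trt / n_trt) / (events_ctl / n_ctl)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= observed * (1 + 1e-9))

def is_significant_vle(events_trt, n_trt, events_ctl, n_ctl, alpha=0.05):
    """True if RR <= 0.2 or RR >= 5 and the Fisher exact P value is below alpha."""
    rr = relative_risk(events_trt, n_trt, events_ctl, n_ctl)
    p_value = fisher_exact_two_sided(events_trt, n_trt - events_trt,
                                     events_ctl, n_ctl - events_ctl)
    return (rr <= 0.2 or rr >= 5) and p_value < alpha

# Hypothetical small trial: 1/50 events on treatment v 13/49 on control
print(is_significant_vle(1, 50, 13, 49))
```

The example trial is invented for illustration; it has an estimated relative risk of about 0.08, well inside the VLE range.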
The index trial was defined as a trial with a nominally statistically significant VLE that was then followed by at least one large trial. A large trial was defined a priori as one with at least 200 events and 200 non-events. This choice is arbitrary, but it selects for trials that have a very large power to detect not only a relative risk exceeding five, but also a much smaller relative risk. For example, with 200 events and 200 non-events and 1:1 ratio of participants in the compared arms, the power is 95% to detect a relative risk of 1.44 at an α of 0.05.
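A rough check of this power statement can be made with a two-sided normal approximation for the difference between two proportions. The sketch below is an assumption about how such a figure might be derived, not the authors' calculation: it takes the overall event rate from the stated events and non-events, splits participants equally, and solves for the two arm risks that give the target relative risk. An exact or continuity-corrected test would give a somewhat different value.

```python
from math import sqrt
from statistics import NormalDist

def is_large_trial(events, non_events, minimum=200):
    """Large trial as defined in the text: >=200 events and >=200 non-events."""
    return events >= minimum and non_events >= minimum

def approx_power(rr, events=200, non_events=200, alpha=0.05):
    """Approximate power to detect relative risk `rr` with 1:1 allocation."""
    n_arm = (events + non_events) / 2
    overall_risk = events / (events + non_events)
    # Arm risks whose mean is the overall risk and whose ratio is rr
    p_ctl = 2 * overall_risk / (1 + rr)
    p_trt = rr * p_ctl
    se = sqrt(p_trt * (1 - p_trt) / n_arm + p_ctl * (1 - p_ctl) / n_arm)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p_trt - p_ctl) / se - z_crit)

print(is_large_trial(200, 200))
print(round(approx_power(1.44), 3))
```

Under these assumptions a trial at the minimum size already has high power for a relative risk of 1.44, and essentially complete power for a relative risk of five.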
We used the Cochrane Database of Systematic Reviews (2010, issue 7) as in a previous study.7 However, for our final dataset of included forest plots, we also manually checked whether there were any newer versions of the Cochrane review published since 2010; where such newer versions contained newer trials, our database was updated with these extra trials using the Cochrane Database of Systematic Reviews up to 2015, issue 12.
Inclusion and exclusion criteria
Quantitative summary data on treatment comparisons and outcomes are presented in forest plots. Inclusion was assessed in two stages: an automated computerised algorithm followed by manual scrutiny. The automatic algorithm selected forest plots for further assessment if they satisfied the following initial inclusion criteria: two or more studies, a VLE in one trial, a P<0.05 for the VLE by Fisher’s exact test, and at least one large trial following the trial with the VLE. If two or more trials were published in the same year and it was not feasible to identify which was published first, we randomly picked one as the index trial. We included VLEs regardless of either the choice of intervention or treatment comparison.
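The automated stage could look something like the sketch below. The `Trial` record and the function name are hypothetical, the VLE flag is assumed to have been computed beforehand, and two simplifications are assumptions rather than details from the source: the earliest VLE trial is taken as the index trial, and "followed by" is read as a strictly later publication year (the random tie-break for same-year trials is not implemented).

```python
from dataclasses import dataclass

@dataclass
class Trial:
    name: str
    year: int
    events: int
    non_events: int
    significant_vle: bool  # RR <= 0.2 or >= 5 with Fisher exact P < 0.05

def find_index_and_large_trials(trials):
    """Return (index_trial, later_large_trials), or None if the plot is ineligible."""
    if len(trials) < 2:
        return None  # criterion: two or more studies
    vle_trials = sorted((t for t in trials if t.significant_vle),
                        key=lambda t: t.year)
    if not vle_trials:
        return None  # criterion: a nominally significant VLE in one trial
    index = vle_trials[0]  # assumption: earliest VLE trial is the index trial
    later_large = [t for t in trials
                   if t.year > index.year
                   and t.events >= 200 and t.non_events >= 200]
    if not later_large:
        return None  # criterion: followed by at least one large trial
    return index, later_large

# Hypothetical forest plot
plot = [
    Trial("small index", 1995, 14, 85, True),
    Trial("medium", 1998, 60, 140, False),
    Trial("large", 2003, 320, 900, False),
]
index, large = find_index_and_large_trials(plot)
print(index.name, [t.name for t in large])
```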
Additional inclusion criteria during the manual scrutinisation process were:
- The VLE was explicitly defined as a primary outcome of the review in which it appeared
- The forest plot was not a sensitivity analysis
- The forest plot was not a subgroup analysis
- If two forest plots satisfying the first three criteria had overlapping trials, only the plot with the largest number of trials was included.
We excluded forest plots using outcomes measured on continuous scales and those not including the year of publication of each trial (because it would not be possible to determine if the trial was followed by a large trial). We also excluded reviews with issues preventing adequate data extraction in their structure (that is, information that could not be parsed or with inconsistent data hierarchy), methodological reviews, and protocols.
The primary data extraction of eligible forest plots was performed using an automated algorithm. Full details are described elsewhere.7 Briefly, raw data from each of the 3545 available reviews within issue 7 (2010) of the Cochrane Database of Systematic Reviews are stored under a hierarchical structure. Python scripts were applied to these data to parse and extract the required information from each review. This approach has previously been validated by hand using 200 randomly selected forest plots with 100% agreement.7 Updating of the eligible topics using issue 12 (2015) of the Cochrane Database of Systematic Reviews was performed manually.
For each eligible forest plot, we automatically extracted the following characteristics: Cochrane Database of Systematic Reviews identification, title, comparison, outcome, subgroup, total number of trials, year of publication of index trial, and relative risk of index trial. Two authors (MN and GK) then independently conducted the manual scrutiny of potentially eligible forest plots. In cases of disagreement, consensus was obtained by discussion with a third author.
There were some cases where the list of outcomes within the review was not explicitly split by the Cochrane review authors into primary and secondary. Where this occurred, the data extractors for this study made a judgment to include the forest plot if they thought that the outcome was highly likely to represent an outcome of critical or primary importance, given the stated objective of the review. Any case where this occurred was automatically referred to the third author for arbitration. Only cases with agreement from all three authors were accepted. Further characteristics extracted in this final subset of forest plots included relative risk sizes of all subsequent large trials, numbers of events and non-events, and time lag between index and large trials.
We used two approaches to assess whether an index trial VLE was upheld or refuted by subsequent large trials. Firstly, we deemed a VLE refuted if at least one subsequent large trial presented a statistically significant effect in the opposite direction or a non-significant result. Secondly, if more than one large trial followed the index trial, we performed a fixed effect meta-analysis of all large trials to assess whether this effect estimate refuted the VLE (that is, a statistically significant effect in the opposite direction or a non-significant result).
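The two approaches can be illustrated as follows. This is a sketch under stated assumptions: the source does not say which fixed effect method was used, so inverse variance weighting of log relative risks is assumed here, with large-sample (Wald) P values; all function names and the example trials are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

def log_rr_and_se(events_trt, n_trt, events_ctl, n_ctl):
    """Log relative risk and its large-sample standard error."""
    log_rr = log((events_trt / n_trt) / (events_ctl / n_ctl))
    se = sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)
    return log_rr, se

def upholds(index_rr, log_rr, se, alpha=0.05):
    """True if the effect is significant and in the index trial's direction."""
    p_value = 2 * (1 - NormalDist().cdf(abs(log_rr / se)))
    same_direction = (log_rr > 0) == (index_rr > 1)
    return p_value < alpha and same_direction

def refuted_by_any_large_trial(index_rr, large_trials, alpha=0.05):
    """Approach 1: refuted if at least one large trial fails to uphold the VLE."""
    return any(not upholds(index_rr, *log_rr_and_se(*t), alpha)
               for t in large_trials)

def refuted_by_pooled_estimate(index_rr, large_trials, alpha=0.05):
    """Approach 2: refuted if the fixed effect pooled estimate fails to uphold it."""
    estimates = [log_rr_and_se(*t) for t in large_trials]
    weights = [1 / se ** 2 for _, se in estimates]  # inverse variance weights
    pooled = sum(w * lr for w, (lr, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return not upholds(index_rr, pooled, pooled_se, alpha)

# Hypothetical large trials as (events_trt, n_trt, events_ctl, n_ctl)
trial_a = (150, 500, 200, 500)  # RR 0.75, significant
trial_b = (200, 500, 205, 500)  # RR about 0.98, non-significant
print(refuted_by_any_large_trial(0.1, [trial_a, trial_b]))
print(refuted_by_pooled_estimate(0.1, [trial_a, trial_b]))
```

With these invented trials the two approaches disagree: one non-significant large trial is enough to refute under the first rule, while the pooled estimate remains significant in the index trial's direction.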
Risk of bias assessment by the Cochrane reviewers was manually extracted for all index trials.10 Specifically, number of bias domains rated at low risk and total number of bias domains assessed were extracted. An index trial was classified as being at low risk of bias if all domains were rated at low risk.
Descriptive statistics are expressed as medians with interquartile ranges or absolute counts and percentages. The magnitude of effect was captured by the relative risk metric, but the absolute risk difference is also presented for comparability. Because most of the index trials were small with few events and non-events, we calculated 95% confidence intervals for the relative risks via an exact approach.11 This method has been shown to provide confidence intervals with better coverage probability than asymptotic methods when sample sizes are small.11 For the absolute risk difference, we calculated 95% confidence intervals using the Wilson method.12 For larger trials, we computed P values and 95% confidence intervals using asymptotic approaches.
Comparisons between independent groups were performed with Fisher’s exact, Mann-Whitney U, and Kruskal-Wallis tests, as appropriate. Data analyses were performed using Stata (version 12.1, Stata Corp) and R 3.1.0 (R Core Team, 2014, www.R-project.org/). All P values were two tailed with nominal statistical significance claimed for P<0.05.
Patients were not involved in any aspect of the study design, conduct, or in the development of the research question or outcome measures. This study was a meta-epidemiological assessment of existing published research and therefore there was no active patient recruitment for data collection.
Selection of forest plots for analysis
Of 3545 reviews within the Cochrane Database of Systematic Reviews up to issue 7 in 2010, 3082 reviews provided 85 002 forest plots for investigation. Of these forest plots, 294 (0.35%) satisfied the computerised selection algorithm for containing at least one trial with a nominally statistically significant VLE (index trial) followed by at least one further trial with at least 200 events and 200 non-events (a large trial). Figure 1 summarises the flow of forest plots through the selection process.
From these 294 plots, in-depth scrutiny was performed (fig 1) to exclude non-eligible ones. Before arbitration, the initial κ score between the two authors was 0.85 (95% confidence interval 0.77 to 0.92). After discussions between the two authors and arbitration with the third author, consensus was reached on the final inclusion of 44 plots (0.05% of the 85 002 plots assessed by the computerised algorithm).
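For readers unfamiliar with the κ statistic, the sketch below shows how Cohen's κ is computed for two reviewers making include or exclude decisions. The counts are invented for illustration and are not the study's data; the function name is hypothetical.

```python
def cohen_kappa(both_include, only_a, only_b, both_exclude):
    """Cohen's kappa for two raters making a binary include/exclude decision."""
    total = both_include + only_a + only_b + both_exclude
    observed = (both_include + both_exclude) / total
    a_include = (both_include + only_a) / total
    b_include = (both_include + only_b) / total
    # Agreement expected by chance if the raters decided independently
    expected = a_include * b_include + (1 - a_include) * (1 - b_include)
    return (observed - expected) / (1 - expected)

# Hypothetical counts: 100 plots, 90 agreements, 10 disagreements
print(round(cohen_kappa(20, 4, 6, 70), 2))
```

κ rescales the raw agreement (here 90%) by the agreement expected from chance alone, which is why it is lower than the raw proportion.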
Baseline characteristics of eligible forest plots
Table 1 presents baseline characteristics of the index trial, forest plot, and subsequent large trials. The relative risks displayed in table 1 have been consistently coined so that all are above one (that is, a relative risk of 0.2 becomes 5). The median relative risk was 7.95 (interquartile range 5.5-12.8; range 5.0-48.6). Obstetrics and gynaecology was the best represented specialty with 10 topics. Index trials were generally small with a median of 14 events and 91 non-events. Twenty one topics had an updated version of the Cochrane review after 2010. The updates contributed one new trial each to two topics, and four new trials each to two topics.
Few index trials were rated at low risk of bias (9/44; 20%) and very few forest plots assessed mortality (7/44; 16%). The median proportion of events contributed by an index trial to its forest plot was 1.4% (interquartile range 0.6-3.0). Most forest plots had an I² statistic suggesting moderate heterogeneity (median 49% (interquartile range 20-66)). The median number of studies was 17 (interquartile range 9-22). There was a median of six years (interquartile range 3-12) between the index trial and the largest trial, and the largest trial had a median of 320 events (interquartile range 243-485). The median proportion of events contributed by the largest trials to the forest plot was 37% (interquartile range 19-54).
Comparison of index trials to subsequent large trials
At least one subsequent large trial refuted the index trial VLE in 19 of 44 cases (43%, 95% confidence interval 29% to 58%). The relative risk was closer to the null in the subsequent large trials in 43 of 44 cases. Of the 44 forest plots, 27 had only one subsequent large trial, nine plots had two subsequent large trials, and eight plots had three or more subsequent large trials. In the 17 plots with at least two subsequent large trials, the fixed effect meta-analysis of all subsequent large trials upheld the index result in six cases and refuted it in 11. In the 17 cases where both approaches (that is, at least one subsequent large trial refuting VLE versus fixed effect meta-analysis of all subsequent large trial data) could be directly compared, there was agreement in 16 cases (11 both refuted, five both upheld). One case was refuted by at least one large trial but upheld by the meta-analysis of all large trials. Index trials that were upheld by a large trial had a higher median number of events than those that were refuted (21 v 9, P<0.01). Of the 19 plots where an index trial VLE was refuted, there were two cases in which the large trial data presented a statistically significant effect in the opposite direction to the index trial.
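The interval quoted for the proportion of refuted VLEs can be checked quickly. The source does not state which method was used for this interval; the sketch below assumes a simple Wald (normal approximation) interval, which happens to reproduce the rounded figures. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(successes, n, level=0.95):
    """Normal approximation confidence interval for a binomial proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

low, high = wald_interval(19, 44)
print(f"{19 / 44:.0%} ({low:.0%} to {high:.0%})")
```

With only 44 plots the interval spans roughly 30 percentage points, which is worth keeping in mind when reading the subgroup results that follow.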
Figure 2 plots the index trial VLE size against the coined (that is, all >1) relative risk in subsequent large trial data. Even with a stricter cutoff value in relative risk of at least 10 (or ≤0.1), only seven of 14 index trial VLEs were upheld. Figure 3 plots the index trial P value against the size of coined (that is, all >1) relative risk of large trial data. Most refuted cases occurred when the index trials had a P value between 0.05 and 0.001. If the index trial had a P value of less than 0.001, the effect was upheld in 19 of the 21 cases by subsequent large trials. Table 2 demonstrates the positive predictive value with our data for a range of different cutoff values of relative risks and P values. Confidence intervals for the estimates were extremely wide owing to the small number of cases.
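The positive predictive values implied by the counts reported in the text can be worked through directly. The sketch uses only figures stated in the article (25 of 44 upheld overall; 19 of 21 upheld when the index trial had P<0.001) and derives the remaining stratum by subtraction; the variable names are illustrative.

```python
def ppv(upheld, total):
    """Positive predictive value: proportion of index VLEs upheld by large trials."""
    return upheld / total

overall = ppv(25, 44)              # all index trials with P < 0.05
strict = ppv(19, 21)               # index trials with P < 0.001
weaker = ppv(25 - 19, 44 - 21)     # index trials with 0.001 <= P < 0.05

print(f"overall {overall:.0%}, P<0.001 {strict:.0%}, remaining {weaker:.0%}")
```

By subtraction, only six of the 23 index trials with a P value between 0.05 and 0.001 were upheld, which accounts for 17 of the 19 refuted cases.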
Upheld index trial VLEs
Information on the 25 plots in which the index trial VLE was upheld by subsequent large trial data is displayed in table 3 (for both relative risk and absolute risk difference). All but three of the 25 interventions were compared with an inactive control rather than another active treatment. The vast majority of forest plots also pertained to primary outcomes that are unlikely to be the only primary outcome of interest that might dictate whether the intervention is adopted (that is, specific adverse events or surrogate laboratory measures as opposed to hard clinical endpoints). There was large variability of the absolute risk differences across the index trials (range 0.01-0.89) and across the upholding subsequent large trials (range 0.00-0.94). Even among the 25 upheld topics, only eight had an absolute risk difference exceeding 10%.
In only four of the 25 cases was the confirmatory large trial effect also very large. There was only one subsequent large trial in each case. In three cases, a treatment of known effectiveness for the outcome measure was compared with placebo (hepatitis B antibody seroconversion with hepatitis B vaccine; rise in haemoglobin with iron in pregnant women; improvement in postoperative pain with rofecoxib).13 14 15 In the other case, the comparator intervention was practically the outcome, giving a control value of 100% (misoprostol v surgery in women with miscarriage; outcome: surgical evacuation of the fetus).16 Given the choice of controls and outcomes used, these results are unsurprising.
Of the index trial VLEs that were upheld, only one pertained to mortality. This plot assessed the effect of early nitrate anti-hypertensive treatment on all cause mortality up to day two in patients with an acute cardiovascular event. There was, however, no statistically significant benefit to early nitrate treatment at the co-primary outcome time points of days 3-10 and day 30 onwards. Hence, the forest plot providing evidence on mortality up to day two did not translate to any tangible, lasting clinical benefit.
Additional tables are available in the online supplementary appendix. Table S1 contains the references to Cochrane Reviews for table 3, table S2 contains the equivalent data of table 3 but for refuted studies, and table S3 details how many large trials followed each index trial VLE and the number of large trials that refuted the index trial VLE.
In this study, there were only rare instances where an initial very large treatment effect in a trial from a primary outcome forest plot was followed by a large trial (0.05% of more than 85 000 binary outcome forest plots within the Cochrane Database). Most VLEs occurred in small studies with very few events. Just over half of the VLEs were subsequently upheld as nominally statistically significant by a subsequent large trial, although typically the effect estimate was heavily attenuated. Even when the effect was upheld, the specific primary outcome was often one of many primary outcomes that would have to be considered before adopting the intervention. Furthermore, it would be important to also consider the absolute risk reduction in deciding whether a treatment was to be used.
Our main objective in this study was to evaluate the usefulness of a VLE in a randomised controlled trial as an empirical marker that further trials were unlikely to be necessary. Theoretically, this kind of empirical marker could highlight where resources might be wasted on unnecessary follow-up trials. The scale of waste within the research process has been well acknowledged.8 17 Unfortunately, our results show that a simple rule of thumb based on relative risk size in randomised controlled trials appears impractical, given the low frequency of VLEs and the positive predictive value of the rule. In nearly half of cases, the rule we chose would have given an incorrect reassurance, but the rarity of VLEs would strictly limit its usefulness in any case.
Comparison with other studies
The previous empirical evaluation of very large treatment effects in the literature, by Pereira and colleagues,7 demonstrated that most of these effects represented regression to the mean, with subsequent trials usually reporting smaller effects. We used the same data source and a similar initial extraction approach, with a focus on the feasibility of an empirical rule for the reliability of VLEs. So far, there has been no empirical evaluation of such a rule, although Glasziou and colleagues discussed the circumstances under which observational evidence might be accepted when the signal (effect size) to noise (bias) ratio is large.5 They suggested that relative risks beyond 10 are highly likely to reflect real treatment effects, even if confounding factors associated with the treatment may have contributed to the size of the observed associations. More stringent criteria, such as a relative risk cutoff value of at least 10 (or ≤0.1), led to an even poorer positive predictive value of only 50% (seven of 14 cases) in our data (table 2). While another selection criterion, the presence of P<0.001 in the index trial, improved the positive predictive value substantially, there were still cases where the subsequent large trial did not uphold the index trial’s finding.
Conclusions and policy implications
Our findings show that even a relative risk of five is a rare event, and mostly occurs in small trials with large confidence intervals. Because index trials with VLEs for primary outcomes are so rare, attempts to improve the positive predictive value by making the criteria more stringent would effectively rule out nearly all trials (eg, only four trials that we assessed had a relative risk of ≥20 or ≤0.05). Even when these criteria are satisfied, issues of heterogeneity in treatment effect could still mean that the results apply only to a narrow population and therefore need further trials in different patient groups or circumstances.18
Methodological problems in interpreting the results of small studies have been well documented.19 20 Reversals in the medical literature, even for randomised controlled trials, are common.21 22 Therefore, it might actually be dangerous to consider a case open and shut after a single trial with a VLE. A more important practical lesson from this study could be that the place of small randomised controlled trials needs re-evaluation. If even very large treatment effects in small trials are unreliable evidence of significant benefit, perhaps we should avoid conducting small trials (unless explicitly justified for any case specific reason—eg, rare diseases) and aim instead to conduct studies that are larger and properly powered to detect modest effects. This has serious implications for complex interventions such as surgery, where large randomised controlled trials are known to be more difficult to deliver.23
Strengths and limitations of study
Using the large number of forest plots available within the Cochrane Database as a source of data was a major strength of our work given the rarity of VLEs. Furthermore, our systematic approach to obtaining a set of independent VLEs and assessing them under a range of possible cutoff values for relative risks and P values also lends further credence to our conclusion that an empirical rule using a VLE would be neither practical nor useful.
However, our findings must be considered in light of several limitations. Firstly, our definition of a VLE, while based on previous empirical work,7 necessarily imposes an arbitrary cutoff value on a continuum. Our stringent rule left very few eligible topics compared with the vast number of topics handled by Cochrane. One might speculate whether a more lenient rule would change our inferences. However, if anything, smaller effects are likely to be even less commonly upheld than the VLE that we studied.
Secondly, we considered effects in the context of the primary outcome of the Cochrane review in which they appeared rather than the primary outcome of the trial itself, mainly on logistical grounds. Thirdly, clinical and statistical significance are not synonymous. There might be statistically significant upheld effects that are attenuated in size to a point where they lose clinical significance, and vice versa.
Fourthly, it was difficult to accurately ascertain whether an effect pertained to a subgroup or sensitivity analysis where such analyses were not explicitly defined in the Cochrane review. We attempted to ensure objectivity by using a review process involving two independent authors and discussion with a third author in cases of ambiguity. Fifthly, while the Cochrane Database of Systematic Reviews represents a considerable body of trial meta-analyses, it nonetheless provides imperfect coverage of the entire body of randomised trial evidence. However, there is no obvious reason to believe that non-covered topics are likely to be substantially different with regard to VLE prevalence and validation.
Finally, the decision to perform a subsequent large trial when a VLE has been seen in one trial is not a random process. Trials with VLEs might be less likely to have subsequent large trials done on the same question, if the early trials are considered to be well done and their findings are deemed conclusive. If so, our data underestimate the proportion of VLEs that are true. However, subsequent large trials might be less likely to be performed if the original trial results are thought to lack credibility or be unreliable. If so, our data overestimate the proportion of VLEs that are true. Given that the early trials showing VLEs are almost ubiquitously very small ones, it is more likely that our data overestimate the proportion of VLEs that are true.
Our study suggests that the frequency of VLEs followed by a large trial is vanishingly small in the Cochrane Database of Systematic Reviews, and where they occur they do not appear to be a reliable marker for a reproducible and clinically actionable benefit. An empirical rule using a VLE as a marker that further trials are unnecessary would be neither practical nor useful. Caution should be taken when interpreting small studies with very large treatment effects.
What is already known on this topic
Most healthcare interventions provide modest benefits, but randomised trials occasionally report very large improvements over existing treatments or inactive controls; this often leads to speculation that further trials might be unnecessary
The use of very large treatment effects as an empirical marker could highlight where resources might be wasted on unnecessary follow-up trials
However, large effect estimates are usually downgraded in subsequent trials, and the profile of their appearance and shift suggests regression to the mean as the cause
What this study adds
There does not appear to be an effect size large enough to be confident that future large (reliable) trials will always show a significant effect rather than one that could be due to chance
Most very large effect estimates come from small trials with large confidence intervals that should be interpreted with caution
These findings are highly relevant to fields such as surgery, where the average size of trials is usually much smaller than for drug trials, for logistical reasons
Contributors: PM and JPAI conceived the study. MN, TVP, GK, and MM extracted and sorted data for the study. MN and TVP performed the analysis. MN wrote the first draft of the manuscript. All authors contributed to critical revision of the manuscript for important intellectual content and approved the final version. MN and PM are the guarantors.
Funding: No specific funding was provided for this study.
Competing interests: All authors have completed the ICMJE uniform disclosure at www.icmje.org/coi_disclosure.pdf and declare no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.
Ethical approval: No ethical approval required as a meta-epidemiological study.
Data sharing: Raw data and analysis available on request from the authors.
The lead authors affirm that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/3.0/.