CCBYNC Open access

Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey

BMJ 2016; 352 doi: (Published 08 February 2016) Cite this as: BMJ 2016;352:i493


The arguments and analyses of Franklin et al. [1] are flawed and misleading. We disagree with their claim that our analysis [2] and all the analyses included in a recent Cochrane review [3] that compare observational studies and RCTs with a ratio-of-odds ratio (ROR) approach are incorrect because the “use of the ROR to quantify bias is just as flawed and dependent on the direction of comparison chosen for each study by investigators” [1].

It is trivial that the direction of comparisons is essential in meta-epidemiological research comparing analytic approaches. It is equally essential to have a rule that sets the direction of comparisons consistently. The fact that there are theoretically multiple ways to define such rules and apply the ratio-of-odds ratio method doesn’t invalidate the approach in any way. What matters is to transparently select a reasonable, replicable and useful rule for standardizing the direction of effects, one that reflects the perspective of the research question.

In our study, we took the perspective of clinicians who face new evidence, have no randomized trials, and must decide whether to use a promising new treatment. In this situation, a treatment would be seen as promising when the RCD-study indicates beneficial effects, which we defined as better survival than the comparator (that is, an OR < 1 for mortality in the RCD-study). This is exactly the scenario for which Franklin et al. and others promote the use of such observational studies. The basic architecture of our design reflects this scenario: we started from RCD-studies that were published first and then examined what subsequent RCTs found. It makes no sense to start from what the subsequent studies find. The inversion rule we applied is not selective; it is systematic, transparent and replicable, and Franklin et al. used our data and could replicate our results. Moreover, it is the only reasonable and useful selection rule in real life and captures exactly the key clinical dilemma: a clinician has some results from non-randomized RCD-analyses that indicate a beneficial treatment, but no evidence from RCTs proving this. Can s/he trust the RCD-study estimate? How often would such potential benefits truly manifest if or when trials were to become available in the future? In many or even most RCD-studies (in our data and even more so in everyday uses of RCD data) there is no clear experimental arm (e.g. a drug seeking a license) and control arm, but only two variations of usual care (e.g. two different durations of treatment, two treatment strategies, or two active drugs neither of which is clearly experimental). Therefore it is not possible to apply a selection rule based on what is experimental or control and to label all comparisons as “experimental treatment versus control”.
The only rational choice is to reflect the clinical scenario and to orient the comparisons systematically, so that the treatment that seems best from the current perspective of the clinician making the decision is always listed first, i.e. all relative risks for mortality in the RCD-studies would be <1. If the clinician were to use this treatment as the best, how much would s/he be off? This is what matters.

Franklin et al., however, took the perspective of statisticians who already know the answers to all clinical questions from clinical trials and then retrospectively explore whether their modeling would obtain the same results. In clinical reality, however, such subsequent RCTs unfortunately will not even exist in most cases to provide such answers. Clearly, an inversion rule based on the RCTs can give totally misleading results. The theoretical simulation of Franklin et al., which makes all relative risk estimates <1 in the RCTs, makes no sense in real life and has no relevance for patient care or health-care decision making.

A simple example illustrates the clinical consequences: suppose for one topic where treatment A is compared with treatment B, the odds ratio (OR) for mortality in an RCD study is 0.5 and in a subsequent RCT the OR is 1. For another topic where treatment C is compared with treatment D, the OR for mortality in an RCD study is 1.2 and in a subsequent RCT the OR is 0.6. The clinician would choose treatments A and D, because they appear to be better than treatments B and C in the absence of RCT data. However, when the RCT data come out subsequently, the clinician will then realize that treatment B was not really any worse than A (and might even have been preferable, if it had other advantages, e.g. lower cost and better tolerance), and that treatment C was actually a much better choice than D. Our approach captures that the RCT results in both examples deviate 2-fold (overestimating the mortality benefit) from what the RCD originally suggested. Conversely, the Franklin et al. approach will estimate that these two major differences cancel out and the summary ROR is 1.0 suggesting perfect agreement between RCT and RCD (on average) even though RCD were highly misleading in both cases.
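The arithmetic of this example can be checked with a minimal sketch. It uses only the four odds ratios given above; the function names, the convention ROR = OR(RCT) / OR(RCD), and the unweighted geometric mean used as the "summary" are assumptions made for illustration, not the weighted meta-analytic model of the original study.

```python
import math

# (topic, OR for mortality in the RCD study, OR in the subsequent RCT),
# taken from the two-topic example in the text.
PAIRS = [("A vs B", 0.5, 1.0), ("C vs D", 1.2, 0.6)]


def ror(or_rcd, or_rct, anchor):
    """Ratio of odds ratios after orienting the comparison.

    anchor="rcd": invert both ORs when the RCD OR is > 1 (clinician's view).
    anchor="rct": invert both ORs when the RCT OR is > 1 (hindsight view).
    Assumed convention: ROR = OR_RCT / OR_RCD, so ROR > 1 means the RCD
    study showed a more favourable mortality effect than the RCT.
    """
    reference = or_rcd if anchor == "rcd" else or_rct
    if reference > 1:
        or_rcd, or_rct = 1 / or_rcd, 1 / or_rct
    return or_rct / or_rcd


def summary_ror(anchor):
    """Unweighted geometric mean of the RORs (illustrative simplification)."""
    logs = [math.log(ror(rcd, rct, anchor)) for _, rcd, rct in PAIRS]
    return math.exp(sum(logs) / len(logs))


for anchor in ("rcd", "rct"):
    per_topic = {name: round(ror(rcd, rct, anchor), 2) for name, rcd, rct in PAIRS}
    print(anchor, per_topic, "summary:", round(summary_ror(anchor), 2))
```

Anchoring on the RCD studies gives a 2-fold deviation for both topics, whereas anchoring on the RCTs gives RORs of 2 and 0.5 that average to 1 on the log scale, which is the cancellation described above.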

The approach used by Franklin et al. to calculate the observed and expected overlap in confidence intervals is interesting but suffers from major problems. First, it is notoriously underpowered to detect any difference between RCD and RCT. Second, it is totally open to manipulation through the choice of confidence interval (95%, 90%, 75%, 50%, 20%, etc.). Third, it is clinically irrelevant. It is impossible to say whether a 10% difference in the expected (60%) versus observed (50%) overlap is clinically relevant. Mortality differences between compared interventions are almost ubiquitously very small or modest across all medical specialties. Extremely few treatments have very large treatment effects even when compared against placebo/no treatment [4], and the differences between different active treatments (the standard question addressed by RCD) are even smaller and typically tiny. A seemingly minor difference, e.g. OR 0.96 versus OR 1.04, may be a substantial difference clinically, even if the confidence intervals amply overlap in a single comparison of one RCD-study and one respective RCT. The method proposed by Franklin et al. will rarely be able to capture these statistically elusive but clinically major differences. This is why meta-analyses of ROR approaches are needed, because they are far more powerful to detect clinically important differences with proper statistical documentation.
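The gain in power from pooling can be illustrated with a minimal sketch using inverse-variance (fixed-effect) pooling of log RORs. Only the two odds ratios (0.96 and 1.04) come from the text; the standard error of 0.05 for each log OR, the number of pairs (10), the assumption that the pairs are independent and identical, and all function names are hypothetical values chosen purely for illustration.

```python
import math


def log_ror_and_se(or_rcd, se_log_rcd, or_rct, se_log_rct):
    """Log ROR (assumed as OR_RCT / OR_RCD) and its standard error,
    treating the RCD and RCT estimates as independent."""
    log_ror = math.log(or_rct) - math.log(or_rcd)
    se = math.sqrt(se_log_rcd ** 2 + se_log_rct ** 2)
    return log_ror, se


def pool_fixed_effect(estimates):
    """Inverse-variance weighted mean of (log_ror, se) pairs."""
    weights = [1 / se ** 2 for _, se in estimates]
    pooled = sum(w * est for w, (est, _) in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))


def ci95(log_estimate, se):
    """95% confidence interval on the ratio scale."""
    return (math.exp(log_estimate - 1.96 * se), math.exp(log_estimate + 1.96 * se))


# Hypothetical inputs: SE of 0.05 on each log OR, and 10 identical pairs.
single = log_ror_and_se(0.96, 0.05, 1.04, 0.05)
pooled = pool_fixed_effect([single] * 10)

print("single pair ROR 95% CI:", ci95(*single))
print("pooled ROR 95% CI:     ", ci95(*pooled))
```

Under these assumed inputs the interval for a single RCD-RCT pair includes 1, while the pooled interval does not, because the pooled standard error shrinks with the square root of the number of pairs.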

The analysis of the agreement of the first RCT with the other RCTs within our sample is extremely misleading. It is trivial that the effects of one experiment are larger than the mean of multiple replications of the experiment. This is simply what regression to the mean predicts. For clinical trials, this phenomenon was already described two decades ago [5-10]. Interestingly, Franklin et al. included in their analysis a clinical question where both subsequent trials were published simultaneously, making it impossible to clearly determine which one is the first (Gnerlich 2007). Franklin et al. selected the data that better fit their claim. They also failed to provide a more balanced view by describing that the first randomized trial in our dataset typically indicated the better treatment correctly. Our trial data are in perfect agreement with a recent, much larger analysis of 647 meta-analyses reporting that “When the first trial is statistically significant, 84.1% (95% CI: 79.4%, 88.8%) of the corresponding meta-analyses is both in the same direction and statistically significant” [7]. For our dataset, the effect estimate of the first trial was in the same direction as the corresponding meta-analysis in 73% of the clinical questions (8 of 11 clinical questions with multiple RCTs; using the trial publication dates and dropping the one case where all trials were published simultaneously, which did not allow us to determine the first one). Moreover, the vast majority of trials had the same direction as the final summary: for 5 clinical questions all trials had concordant effects, and in the other clinical questions there was never more than one trial disagreeing in direction. All three first trials (3 of 11 clinical questions, 27%) that disagreed with the final direction had only 1 vs 2 events, so they would be very unlikely to lead to any misguided care decisions.
For further details, we refer to the large literature on the agreement of individual clinical trial results and subsequent trial evidence (see for example References [5-10]).

Overall, this and the claim of Franklin et al. that it is misleading to combine ROR estimates on the same outcome only prove that they intentionally ignore or are not familiar with an extensive literature of two decades and hundreds of papers in meta-epidemiology.

Lars G Hemkens, Despina G Contopoulos-Ioannidis, John P A Ioannidis

1. Franklin JM, Dejen S, Huybrechts KF, et al. A Bias in the Evaluation of Bias Comparing Randomized Trials with Nonexperimental Studies. Epidemiologic Methods 2017.
2. Hemkens LG, Contopoulos-Ioannidis DG, Ioannidis JP. Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey. BMJ 2016;352:i493 doi: 10.1136/bmj.i493.
3. Anglemyer A, Horvath HT, Bero L. Healthcare outcomes assessed with observational study designs compared with those assessed in randomized trials. Cochrane Database Syst Rev 2014(4):MR000034 doi: 10.1002/14651858.MR000034.pub2.
4. Pereira TV, Horwitz RI, Ioannidis JP. Empirical evaluation of very large treatment effects of medical interventions. JAMA 2012;308(16):1676-84 doi: 10.1001/jama.2012.13444.
5. Gartlehner G, Dobrescu A, Evans TS, et al. Average effect estimates remain similar as evidence evolves from single trials to high-quality bodies of evidence: a meta-epidemiologic study. J Clin Epidemiol 2016;69:16-22 doi: 10.1016/j.jclinepi.2015.02.013.
6. LeLorier J, Gregoire G, Benhaddad A, et al. Discrepancies between meta-analyses and subsequent large randomized, controlled trials. N Engl J Med 1997;337(8):536-42 doi: 10.1056/NEJM199708213370806.
7. Tam WW, Tang JL, Di MY, et al. How often does an individual trial agree with its corresponding meta-analysis? A meta-epidemiologic study. PLoS One 2014;9(12):e113994 doi: 10.1371/journal.pone.0113994.
8. Herbison P, Hay-Smith J, Gillespie WJ. Meta-analyses of small numbers of trials often agree with longer-term results. J Clin Epidemiol 2011;64(2):145-53 doi: 10.1016/j.jclinepi.2010.02.017.
9. Ioannidis J, Lau J. Evolution of treatment effects over time: empirical insight from recursive cumulative metaanalyses. Proc Natl Acad Sci U S A 2001;98(3):831-6 doi: 10.1073/pnas.021529998.
10. Lau J, Antman EM, Jimenez-Silva J, et al. Cumulative meta-analysis of therapeutic trials for myocardial infarction. N Engl J Med 1992;327(4):248-54 doi: 10.1056/NEJM199207233270406.

Competing interests: All authors have completed the ICMJE uniform disclosure form at and declare: DCI and JPAI had no financial support for this project; LGH had support from the Commonwealth Fund for the submitted work; all authors declare no financial relationships with any organization that might have an interest in the submitted work in the previous three years and no other relationships or activities that could appear to have influenced the submitted work.

14 September 2017
Lars G Hemkens
senior researcher
Despina G Contopoulos-Ioannidis, John P A Ioannidis
Stanford Prevention Research Center, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA , and Basel Institute for Clinical Epidemiology and Biostatistics, University Hospital Basel, Basel, Switzerland
Stanford, USA