# Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey

BMJ 2016;352 doi: https://doi.org/10.1136/bmj.i493 (Published 08 February 2016). Cite this as: BMJ 2016;352:i493

## All rapid responses

The practice of 'flipping' ratios used in the article by Hemkens et al has been used in several previously published meta-epidemiological studies; see, for example, (1-4). All these articles first estimate meta-analytic ratios (odds ratios, OR, or rate ratios) for two study subgroups. The ratios of odds (or rate) ratios (ROR) are then pooled across meta-analyses. The term "coining" has been used in these articles to describe a variety of practices.

In Hemkens et al, the OR of one of the two study subgroups has been "coined" (inverted) so that all ORs are less than one. Franklin et al showed how this produces biased results.

In another article, by Evangelou et al. (which GS co-authored), both subgroup ORs have been "coined" so that the total (pooled) OR of the meta-analysis is larger than one. The dataset in the Appendix of that paper comprises 92 genetic associations and compares the OR from unrelated case-control studies (ORU) with the OR from family-based studies (ORF). Simulations, briefly summarised below, show that the type of "coining" employed by Evangelou et al. also produces biased results.

We simulated 10 000 pairs of two subgroups (representing case-control and family-based studies) under different scenarios for the underlying RORs and subgroup-specific heterogeneities. We set the standard errors of the (assumed observed) log(ORU) and log(ORF) by re-sampling from the observed 92 standard errors. The simulation scenarios, the R code, the data and the results can be found in the GitHub repository https://github.com/esm-ispm-unibe-ch/Simulations_coining

The ROR estimate using coining is unbiased only in the special case where there is no association between exposure and outcome (trueORU = trueORF = ROR = 1), there is no heterogeneity in ROR, and the average standard errors of the observed logORU and logORF are the same. As soon as there is heterogeneity in ROR, the subgroups estimate the effects with different precision, or the ROR differs from 1, the coining method is biased. The two figures in the simulations document available on GitHub illustrate these biases.
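The mechanism can be sketched in a few lines of Python (a hypothetical, minimal version of the kind of scenario described, not the actual repository code; the values of tau and the two standard errors are illustrative assumptions): with heterogeneity in the true log ROR and unequal subgroup precisions, sign-flipping both estimates whenever the pooled estimate is below the null pushes the average coined log ROR away from its true value of 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                            # simulated meta-analyses (illustrative scenario)
tau, s1, s2 = 0.5, 0.2, 0.6            # ROR heterogeneity; unequal subgroup precisions
d = rng.normal(0, tau, n)              # true log ROR per meta-analysis (mean 0)
x = d / 2 + rng.normal(0, s1, n)       # observed log(ORU)
y = -d / 2 + rng.normal(0, s2, n)      # observed log(ORF)
w1, w2 = 1 / s1**2, 1 / s2**2
pooled = (w1 * x + w2 * y) / (w1 + w2)     # fixed-effect pooled log OR
flip = np.where(pooled < 0, -1.0, 1.0)     # coin both ORs so the pooled OR is > 1
ror = flip * x - flip * y                  # coined log ROR
print(np.exp(ror.mean()))                  # ≈ 1.3, although the true mean log ROR is 0
```

Under this scenario the coined pooled ROR is biased upward by roughly 30%; setting tau to 0 or s1 equal to s2 makes the bias vanish, matching the special case described above.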

Evangelou et al. did not identify any important differences between subgroups, and the ROR was close to one. This is because the subgroup-specific heterogeneity variances (the heterogeneities around logORU and logORF) were close (0.06 and 0.08), which means that the ROR had little heterogeneity. Consequently, it is possible that the finding in this paper is correct. However, Figure 2 in our simulation document shows that coining can produce an estimated ROR of 1 when the true ROR is as large as 1.5.

The findings in the study by Evangelou et al are possibly misleading. However, this empirical evaluation was published 12 years ago; the field has moved on, and single-gene association studies and their meta-analyses, like those evaluated empirically in the paper, have been overshadowed by powerful genome-wide association studies. Nevertheless, clearly identifying erroneous practices and labelling them accordingly in the publishing system (via errata or justified retractions) is the only way to stop their propagation.

1. Trikalinos TA, Ntzani EE, Contopoulos-Ioannidis DG, Ioannidis JPA. Establishment of genetic associations for complex diseases is independent of early study findings. Eur J Hum Genet. 2004 Sep;12(9):762-9.

2. Kavvoura FK, Liberopoulos G, Ioannidis JPA. Selection in reported epidemiological risks: an empirical assessment. PLoS Med. 2007 Mar;4(3):e79.

3. Siontis KCM, Patsopoulos NA, Ioannidis JPA. Replication of past candidate loci for common diseases and phenotypes in 100 genome-wide association studies. Eur J Hum Genet. 2010 Jul;18(7):832-7.

4. Evangelou E, Trikalinos TA, Salanti G, Ioannidis JPA. Family-based versus unrelated case-control designs for genetic associations. PLoS Genet. 2006 Aug;2(8):e123.

**Competing interests:**
No competing interests

**02 February 2018**

I would like to add some mathematical considerations regarding the simulations provided in the recent response by Franklin et al.

The method by Hemkens et al. is based on an inversion rule, i.e. setting the sign of log(OR_RCD) to be negative. This means that log(OR_RCD) cannot be assumed to follow a normal distribution, as it is truncated to negative values. In fact, if the log(OR_RCD) before inversion follows a normal, after the inversion it follows a folded normal distribution. The distribution of log(OR_RCT) on the other hand remains unaffected.

In other words, if we start with a distribution of log(OR_RCD) centered at a specific value, after applying the inversion rule the mean will always shift to a smaller value, while the mean of log(OR_RCT) will not change. Moreover, the log(ROR), which is defined as the difference between the two, cannot be assumed to follow a normal distribution after implementing the inversion rule.

The bias that was shown in the simulations by Franklin et al. can be calculated analytically. For example, in the first scenario, where the mean is 0 and the standard deviation is sigma=0.5, the mean of the folded normal distribution (which logOR_RCD follows after the rule) is sigma*sqrt(2/Pi)=0.40. On the other hand, log(OR_RCT) still follows the normal distribution, and its mean is still 0. The overall bias on the log(ROR) scale is the difference between the two, i.e. 0.40, which translates into a ROR=exp(0.40)=1.49, exactly as found in the simulations.
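The analytic claim is easy to check numerically; a short sketch (not part of the original response) compares the closed-form folded-normal mean with a Monte Carlo estimate:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
# Mean of the folded normal |N(0, sigma^2)|: analytic formula vs Monte Carlo
analytic = sigma * math.sqrt(2 / math.pi)            # ≈ 0.399
mc = np.abs(rng.normal(0, sigma, 1_000_000)).mean()  # ≈ 0.399
# Bias on the ROR scale: exp(0.40) ≈ 1.49, matching the simulations
print(round(analytic, 2), round(mc, 2), round(math.exp(analytic), 2))
```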

Finally, note that when treatment effects are very large and/or the standard deviations are sufficiently small, the normal and the corresponding folded normal distributions are almost identical. In such cases, the bias introduced in ROR by using this inversion rule will be negligible.

Orestis Efthimiou

**Competing interests:**
No competing interests

**30 November 2017**

We are grateful to Franklin et al. for highlighting the additional graphs. They are very helpful for trying to bring some closure to our discussion. Contrary to their assertion, what their simulations show is the classic, well-known principle of regression-to-the-mean: when measurements are sampled from an empirical distribution using some selection rule that selects only for one side or one tail of the distribution (or in any way that the mean of the selected sample is less than the true mean), then a repeat (e.g. re-measurement or subsequent replication study) will regress towards the mean. One of us has written repeatedly on how this principle contributes (along with several other factors) towards inflating the results of early studies in diverse fields of research. For a representative overview, please see (1).

Take the null true effect simulation, for example. The inversion makes all RCD estimates have logOR<0, while the true logOR is 0. In the 9 non-inverted cases (where the selection rule is: logOR<0 in the RCD studies), the subsequent RCTs will regress towards the mean, to the truth of logOR=0 in this case (and of course some RCTs may “overshoot” even to values logOR>0 in the process, due to random error); i.e. the logOR of the RCTs is expected, on average, to be larger. Similarly, in the 7 inverted cases (where the selection rule is: before inversion, logOR>0 in the RCD studies), again the subsequent RCTs will regress to the truth of logOR=0; again, due to the inversion, the logOR of the inverted RCTs is expected, on average, to be larger than the logOR of the inverted RCD studies. Thus an ROR>1 is consistently generated.
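The null scenario described here can be reproduced in a few lines (a hedged sketch assuming independent normal log ORs with the sigma=0.5 used in this exchange, not the Franklin et al. code):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 800_000
sigma = 0.5
rcd = rng.normal(0, sigma, n)          # RCD log ORs; true effect is 0
rct = rng.normal(0, sigma, n)          # RCT log ORs; same truth
flip = np.where(rcd > 0, -1.0, 1.0)    # inversion rule: force every RCD log OR below 0
# After flipping, the RCTs still average 0 (they "regress to the mean"),
# while the flipped RCD estimates average -E|logOR| = -sigma*sqrt(2/pi)
log_ror = flip * rct - flip * rcd
print(np.exp(log_ror.mean()))          # ≈ 1.49 even though the true ROR is 1
```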

Similarly, in the other simulated cases, the subsequent RCTs regress towards the mean, towards the assumed true value (which is non-null in these cases). Thus, the ROR will similarly be above 1.

The pattern that Franklin et al. describe so nicely in these graphs is exactly one of the reasons (besides many other reasons, as we explained before) why, in the clinical scenario where only RCD results are available and they show that one of two compared treatments is the best, this evidence should be viewed with great caution. Clinicians may often get it way wrong if they choose the treatment that seems to be the best based on these RCD studies.

Interestingly, this pattern (regression-to-the-mean as we call it, or bias as Franklin et al. non-specifically call it) is not that prominent in our data compared with what is shown in the simulated graphs by Franklin et al. Even without any inversion, we estimate a summary ROR of 1.25, not much different from 1.31. This may be because only 3 of the 16 comparisons had to be inverted so as to match the meaningful clinical question in our real data, while 7 to 13 of the 16 comparisons were inverted in the Franklin et al. graphed simulations. In situations where this pattern (regression-to-the-mean, or whatever one wants to call it) might be more prominent (e.g. as in the Franklin et al. graphs), caution about using the early RCD evidence would be even greater, not less.

Lars G. Hemkens

Despina G. Contopoulos-Ioannidis

John P.A. Ioannidis

References

1. Ioannidis JP. Why most discovered true associations are inflated. Epidemiology. 2008 Sep;19(5):640-8.

**Competing interests:**
No competing interests

**29 November 2017**

Hemkens et al. still do not acknowledge the bias of the inversion method that they used. This bias has nothing to do with regression to the mean. As a complement to the previously presented mathematical proof, we have posted a summary of some simulation results that may be easier to digest, indicating the magnitude and direction of the bias. It is available at http://www.drugepi.org/faculty-staff-trainees/faculty/jessica-franklin/.

**Competing interests:**
No competing interests

**28 November 2017**

We worry that the letter by Franklin et al. [1], ending with a heated plea, takes the conversation to an inappropriate level. We will try again to respond to their main concern, aiming to promote transparency and a better understanding of the methods used and their meaning. We would also like to put the overall relevance of this issue more into perspective.

We clearly stated upfront in the Introduction our objective: “We systematically compared the findings from such studies [with routinely collected data] on various clinical questions (which have never been addressed in trials before), with the findings from subsequent randomized controlled trials.” [2]. This is the frame of our question, and the methods that we use are appropriate to answer this question and are also clinically relevant in decision-making. If a clinician has at hand only a published study using routinely collected data (RCD) to inform her decision-making, and that study shows that a treatment A has more favorable results (for example, a benefit of 10% less mortality) compared with another treatment B, how different could the results of a subsequent RCT be? We were interested in the specific situation when no randomized trials exist, and this is why we highlight, even in the title, the comparison with subsequent trials.

Conversely, the methods proposed by Franklin et al. are not relevant for addressing this clinical question, they have no clinical meaning, and they are also grossly underpowered and easy to manipulate, as we discussed in our previous response [3]. We did not directly evaluate the theoretical issue of whether RCD studies are “biased” [4], but aimed to quantify the clinical uncertainty for exactly the situation without available RCTs, where RCD studies might have their highest relevance, and we concluded that this uncertainty is high.

We used multiple measures to try to answer our question, including: “how frequently the treatment effect estimates from RCD studies and randomized controlled trials were in the opposite direction, how often the confidence intervals did not overlap, and how often the RCD study’s confidence interval did not include the effect estimate demonstrated by later available trials”, as well as the ratio of odds ratios [2]. Franklin et al. claim that the latter method is biased in our study and in all other similar studies included in the related Cochrane review [5]. However, all other methods that we used, and which Franklin et al. do not question, agree with our conclusion: results can be different in RCD versus subsequent RCTs and one needs to use early RCD-studies with caution.

For the ratio-of-odds-ratios approach, we coined the RCD point estimate to be an odds ratio <1.0 for 3 of the 16 topics (see below). This needs to be done if our clinically important question on claimed treatment benefits is to be answered (mortality benefits are by definition odds ratios <1.0). In contrast to many meta-epidemiological evaluations, we could not focus on effects of the “experimental” treatments here, because it was typically not possible to tell which treatment is “experimental” and which one is “control” in an RCD comparison. However, and this is a crucial point, by consistently focusing on the benefits we have preemptively addressed potential problems with “inconsistent directions” in the ROR approach that Franklin et al. criticize and call “bias”. To our knowledge, there is no better way to handle this issue.

What Franklin et al. describe practically represents regression-to-the-mean. Subsequent studies are expected to deviate in such a way that the estimated difference between A and B will become, on average, smaller in the subsequent studies, compared with their original studies’ counterparts. If the original and subsequent studies were theoretically addressing the same (unknown) true effect, the ratios of these odds ratios would still not be 1.0, but >1.0. We agree with Franklin et al. up to this point.

With large-scale RCD evidence, regression-to-the-mean is not an issue. With limited RCD evidence, as e.g. in the example of Hahn et al. 2010 (in our sample), regression-to-the-mean can be substantial. Thus, in theory, some of the large difference between RCD study results and the subsequent RCT results in this specific example could reflect regression-to-the-mean. This does not invalidate our results and our conclusion: the subsequent RCT results may be substantially different and, as compared to these RCT results, the estimate seen in the early RCD study is exaggerated. Regression-to-the-mean is not different from what we describe; it is actually part of the explanation for the phenomenon of why we may see exaggerated estimates in early RCD studies. Nevertheless, it reinforces our message that one has to be cautious in making clinical decisions when only some RCD data are available. The subsequent evidence may be substantially different.

There are numerous mechanisms in play affecting the reported estimate from such an RCD study. Some are related to the RCD study itself (such as confounding and other biases, which we discussed elsewhere [6]), and others relate to the context (e.g. publication bias) or to regression-to-the-mean effects. Clinicians who have to make a treatment decision based on just one RCD study at hand when there are no RCTs need to know how much better the best treatment option is versus other options, and how uncertain such a decision would be. If the uncertainty is high (and that is what we aimed to explore), they need to take this into account in their decision making. The exact underlying reasons (be they confounding, publication biases, regression-to-the-mean, or other unknown reasons) are usually very difficult to decipher and can be hotly debated, but they are not of much interest to the clinician.

We also want to focus attention on the estimate of the summary difference that we observed. A 31% difference on an odds ratio scale is a huge difference, if we recall that all the data that we analyzed reflect mortality benefits. Very few treatments have a clear benefit on mortality, and those that do typically have benefits that are <10% on an odds ratio scale [7]. When two active treatments are compared (the typical situation with RCD), the clinically meaningful difference sought that may drive decision making can be 3% or even less. Therefore, the noise here can be many-fold the size of the true treatment effect (if any exists). It is extremely implausible that all of this huge difference that we observed is due to a mechanism of regression-to-the-mean. However, regardless of the mechanism causing the difference, our results and conclusions are fully valid: one has to be cautious in making clinical decisions based on this evidence.

Regression-to-the-mean due to coining would not be an issue if we chose RCD treatment effects the way they were reported in the RCD papers, without any subsequent coining on our side. This approach is probably suboptimal from a clinical perspective (and this is why we did not use it and instead focused on the benefits), but it helps remove this regression-to-the-mean influence. We used the coining approach, which Franklin et al. criticize, for 3 of the 16 comparisons analyzed. If we include these 3 comparisons without performing any such coining, the summary ratio of odds ratios does not change much (ROR 1.25; 95% CI 0.99 to 1.58; I2 0%). If we exclude these 3 comparisons, the summary ratio of odds ratios remains similar (ROR 1.35; 95% CI 1.04 to 1.75; I2 0%). This highlights that the matter is theoretically interesting from a methodological perspective, but in our study it is inconsequential. Moreover, all the results from all the other measures of agreement which we report are totally unaffected by this detail.

We have no reason, conflict or bias to try to espouse one type of result or some specific conclusion. We did our best to set a clinically relevant question, collect the best available data, and use the most appropriate available methods to address this question; we conducted more than 10 sensitivity analyses and then concluded that “caution is needed to prevent misguided clinical decision making”. We even acknowledged that “Randomized controlled trials are not necessarily a perfect gold standard. When their results differ against those of observational studies on the same question, it may not be certain that the trials are correct and the observational data are wrong”, included 9 full paragraphs discussing caveats and limitations in our Discussion, and shared all our data. How much more scientifically careful, cautious and disinterested could we have been, only to have a totally inappropriate threat of retraction thrown at us by Franklin et al.?

Despite our disagreements, we fully respect Franklin et al. and we are grateful for their criticism (despite its tone) and for highlighting various interesting aspects. But we wonder about their emotional statements, which lead us to question their impartiality. While we feel that the methods they propose in their paper are clinically irrelevant and misleading, and can be even more problematic and open to manipulation than what already exists, we certainly did not ask to have their paper retracted.

We hope that this discussion will offer fertile ground for future constructive and impartial methodological work.

Lars G. Hemkens

Despina G. Contopoulos-Ioannidis

John P.A. Ioannidis

References

1. Franklin JM, Dejene S, Huybrechts KF, Wang SV, Kulldorff M, Rothman KJ. Re: Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey. http://www.bmj.com/content/352/bmj.i493/rr-11 (accessed 15 Nov 2017)

2. Hemkens LG, Contopoulos-Ioannidis DG, Ioannidis JPA. Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey. BMJ 2016;352:i493. doi:10.1136/bmj.i493

3. Hemkens LG, Contopoulos-Ioannidis DG, Ioannidis JPA. Response: Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey. http://www.bmj.com/content/352/bmj.i493/rr-9 (accessed 15 Nov 2017)

4. Franklin JM, Dejene S, Huybrechts KF, et al. A bias in the evaluation of bias comparing randomized trials with nonexperimental studies. Epidemiol Methods 2017.

5. Anglemyer A, Horvath HT, Bero L. Healthcare outcomes assessed with observational study designs compared with those assessed in randomized trials. Cochrane Database Syst Rev 2014(4):MR000034 doi: 10.1002/14651858.MR000034.pub2.

6. Hemkens LG, Contopoulos-Ioannidis DG, Ioannidis JPA. The authors respond. http://www.bmj.com/content/352/bmj.i493/rr-4 (accessed 15 Nov 2017)

7. Pereira TV, Horwitz RI, Ioannidis JP. Empirical evaluation of very large treatment effects of medical interventions. JAMA 2

**Competing interests:**
No competing interests

**15 November 2017**

We acknowledge receipt of this letter, which includes a request for retraction of the paper. We take this request very seriously. Before we make a decision on this request, we (The BMJ's editors and statisticians) are reviewing all the available information. We hope to reach a decision that will maintain the integrity of the scientific literature, acknowledge legitimate differences of opinion about the methods used in the analysis of the data, and be fair to all the participants in the debate. We will post a rapid response once we make a decision on this issue.

**Competing interests:**
No competing interests

**13 November 2017**

In a recent paper [1], we provided mathematical proof that the inversion rule used in the analysis of Hemkens et al. [2] results in positive bias of the pooled relative odds ratio (ROR). Hence, their conclusions regarding the comparison of findings from randomized controlled trials (RCTs) and routinely collected data (RCD) are invalid. When the treatments are switched so that all ORs in the RCD studies are on the same side of null, then a ROR summarizing the studies will be biased even if all the RCTs and all the RCD studies each have perfect unbiased estimates of the true effect size. In their response, Hemkens et al [3] do not address this core statistical problem with their analysis.

A side note: To illustrate the bias with their statistical method, our paper included a minor supplementary analysis that compared the first RCT within each clinical question with subsequent RCTs published on the same topic, obtaining a pooled ROR from this analysis of 1.46, 95% CI 0.97 to 2.18. For one clinical question, both available RCTs were published simultaneously, making it impossible to determine which came first. Hemkens et al. [3] state that when we assigned one of these as first, “Franklin et al. selected the data which better fit to their claim.” There was no cause for Hemkens et al. to accuse us of such a dishonest approach. The fact is that since the ORs from these particular two trials are in opposite directions (1.17 versus 0.70), the inversion method yields results that are mathematically identical regardless of which one is assigned as the “first” one. A re-analysis of the data readily confirms that. Hence, we did not select the data to fit our claim.

We applaud the transparency with which Hemkens et al reported their analyses, which allowed us to replicate their findings independently as well as to illustrate the inherent bias in their statistical method. Our paper was originally submitted to BMJ, as recently revealed by a journal editor [4], and it was reviewed there by two prominent biostatisticians and an epidemiologist. All three reviewers recognized that we had described a fundamental flaw in the statistical approach invented and used by Hemkens et al. We believe that everyone makes mistakes, and acknowledging an honest mistake is a badge of honor. Thus, based on our paper and those three reviews, we expected Hemkens et al. and the journal editors simply to acknowledge the problem and to retract the paper. Their reaction to date is disappointing.

1 Franklin JM, Dejene S, Huybrechts KF, et al. A bias in the evaluation of bias comparing randomized trials with nonexperimental studies. Epidemiol Methods 2017.

2 Hemkens LG, Contopoulos-Ioannidis DG, Ioannidis JPA. Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey. BMJ 2016;352:i493. doi:10.1136/bmj.i493

3 Hemkens LG, Contopoulos-Ioannidis DG, Ioannidis JPA. Response: Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey. BMJ Published Online First: 27 September 2017. http://www.bmj.com/content/352/bmj.i493/rr-9 (accessed 2 Oct 2017).

4 Merino JJ. Re: Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey. BMJ Published Online First: 2 October 2017. http://www.bmj.com/content/352/bmj.i493/rr-8 (accessed 2 Oct 2017).

**Competing interests:**
No competing interests

**02 October 2017**

The rapid response of Hemkens, Contopoulos-Ioannidis, and Ioannidis overlooks the fact that a metric of comparison can be systematic, transparent, replicable, and also wrong. Franklin et al. clearly explain and demonstrate that inverting the OR based on the RCD study result (or on the RCT result) yields a misleading statistic. When estimates from the two types of studies are drawn from the same underlying distribution, the true ROR = 1, while the ROR obtained using the inversion method will necessarily be > 1. Use of this flawed metric does not allow us to conclude that RCD studies will necessarily estimate more extreme effects than RCTs. However, with minor amendments I think the bullet points under "What this Study Adds" provide good advice (italics mine):

• *The first published study will often* substantially overestimate mortality benefits of medical treatments compared with subsequent trials investigating the same question

• *The first published study* might not necessarily provide very reliable answers on how to best treat patients; caution is needed to prevent misguided clinical decision making.

• If no randomized trials exist, clinicians and funders of care should consider that treatment effects are probably more uncertain and substantially smaller than *the first published study* suggests; decisions for widespread adoption and reimbursement of expensive interventions might be best withheld until *evidence from several studies accumulates.*

**Competing interests:**
No competing interests

The arguments and analyses of Franklin et al. [1] are flawed and misleading. We do not agree with their claim that our analysis [2], and all the analyses included in a recent Cochrane review [3] comparing observational studies and RCTs with a ratio-of-odds-ratios (ROR) approach, would be incorrect because the “use of the ROR to quantify bias is just as flawed and dependent on the direction of comparison chosen for each study by investigators” [1].

It is trivial that the direction of comparisons is essential in meta-epidemiological research comparing analytic approaches. It is also essential to have a rule for consistent coining of the direction of comparisons. The fact that there are theoretically multiple ways to define such rules and apply the ratio-of-odds-ratios method does not invalidate the approach in any way. It is just important to transparently select a reasonable, replicable and useful rule for standardizing the direction of effects, one which reflects the perspective of the research question.

We took in our study the perspective of clinicians facing new evidence, having no randomized trials, and having to decide whether to use a new promising treatment. In this situation, a treatment would be seen as promising when there are indications of beneficial effects in the RCD study, which we defined as having better survival than the comparator (that is, an OR < 1 for mortality in the RCD study). This is exactly the scenario for which Franklin et al. and others promote the use of such observational studies. The basic architecture of our design reflects this scenario, because we started from RCD studies that were published first and then tried to see what subsequent RCTs found. It makes no sense to start from what the subsequent studies find. The inversion rule we applied is absolutely not selective; it is systematic, transparent and replicable: Franklin et al. used our data and could replicate our results. Moreover, it is the only reasonable and useful selection rule in real life, and it captures exactly the key clinical dilemma: a clinician has some results from non-randomized RCD analyses that indicate a beneficial treatment, but no evidence from RCTs proving this. Can s/he trust the RCD-study estimate? How often would such potential benefits truly manifest if or when trials were to become available in the future? In many or even most RCD studies (in our data, and even more so in everyday uses of RCD data) there is no clear experimental arm (e.g. a drug seeking a license) and control arm, but only two variations of usual care (e.g. two different durations of treatment, two treatment strategies, or two active drugs, neither of which is clearly experimental). Therefore it is not possible to apply a selection rule based on what is experimental or control and to coin all directions of comparisons as “experimental treatment versus control”.
The only rational choice is to reflect the clinical scenario and to systematically coin the comparisons in such a way that the seemingly best treatment from the current perspective of the clinician who makes the decision is always listed first, i.e. all relative risks for mortality would be <1. If the clinician were to use this treatment as the best, how much would s/he be off? This is what matters.

Franklin et al., however, took the perspective of statisticians who already know the answers to all clinical questions, as provided by clinical trials, and then retrospectively explored whether their modeling would get the same results. However, in clinical reality, such subsequent RCTs will unfortunately not even exist in most cases to provide such answers. Clearly, an inversion rule based on the RCTs can give totally misleading results. The theoretical simulation of Franklin et al. making all relative risk estimates <1 in RCTs makes no sense in real life and is without any relevance for patient care or health-care decision making.

A simple example illustrates the clinical consequences: suppose for one topic where treatment A is compared with treatment B, the odds ratio (OR) for mortality in an RCD study is 0.5 and in a subsequent RCT the OR is 1. For another topic where treatment C is compared with treatment D, the OR for mortality in an RCD study is 1.2 and in a subsequent RCT the OR is 0.6. The clinician would choose treatments A and D, because they appear to be better than treatments B and C in the absence of RCT data. However, when the RCT data come out subsequently, the clinician will then realize that treatment B was not really any worse than A (and might even have been preferable, if it had other advantages, e.g. lower cost and better tolerance), and that treatment C was actually a much better choice than D. Our approach captures that the RCT results in both examples deviate 2-fold (overestimating the mortality benefit) from what the RCD originally suggested. Conversely, the Franklin et al. approach will estimate that these two major differences cancel out and the summary ROR is 1.0 suggesting perfect agreement between RCT and RCD (on average) even though RCD were highly misleading in both cases.
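The arithmetic of this two-topic example can be checked directly. The following is a short illustrative sketch (the `log_ror` helper and the coining rule are simplified stand-ins for the perspectives being contrasted, not the study's actual analysis code):

```python
import math

def log_ror(or_rcd, or_rct, coin_to_benefit=True):
    """log(ROR) = log(OR_RCT) - log(OR_RCD).
    If coining is on and the RCD OR shows no benefit (>1), flip the
    comparison direction for both studies so the RCD OR is <1."""
    if coin_to_benefit and or_rcd > 1:
        or_rcd, or_rct = 1 / or_rcd, 1 / or_rct
    return math.log(or_rct) - math.log(or_rcd)

topics = [(0.5, 1.0), (1.2, 0.6)]   # (OR_RCD, OR_RCT) for A-vs-B and C-vs-D
# Benefit-oriented coining: each topic shows a 2-fold deviation (ROR = 2)
coined = [math.exp(log_ror(rcd, rct)) for rcd, rct in topics]
# Comparisons taken as reported, no benefit-oriented coining: RORs of 2 and 0.5
raw = [math.exp(log_ror(rcd, rct, coin_to_benefit=False)) for rcd, rct in topics]
# The two raw deviations cancel: pooled (geometric mean) ROR = 1.0
pooled_raw = math.exp(sum(math.log(r) for r in raw) / 2)
print(coined, raw, pooled_raw)
```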

The approach used by Franklin et al. to calculate the observed and expected overlap in confidence intervals is interesting but suffers from major problems. First, it is notoriously underpowered to detect any difference between RCD and RCTs. Second, it is totally open to manipulation by selecting which confidence interval should be considered (95%, 90%, 75%, 50%, 20%, etc.). Third, it is clinically irrelevant. It is impossible to say whether a 10% difference in the expected (60%) versus observed (50%) overlap is clinically relevant. Mortality differences between compared interventions are almost ubiquitously very small or modest across all medical specialties. Extremely few treatments have very large treatment effects even when compared against placebo/no treatment [4], and the differences between different active treatments (the standard question addressed by RCD) are even smaller, and typically tiny. A seemingly minor difference, e.g. OR 0.96 versus OR 1.04, may be a substantial difference clinically, even if the confidence intervals amply overlap in a single comparison of one RCD study and one respective RCT. The method proposed by Franklin et al. will rarely be able to capture these statistically elusive but clinically major differences. This is why meta-analyses of ROR approaches are needed: they are far more powerful to detect clinically important differences with proper statistical documentation.

The analysis of the agreement of the first RCT with the other RCTs within our sample is extremely misleading. It is trivial that the effects of one experiment are larger than the mean of multiple replications of that experiment; this is simply what the regression-to-the-mean principle predicts. For clinical trials, this phenomenon was described two decades ago [5-10]. Interestingly, Franklin et al. included in their analysis a clinical question where both subsequent trials were published simultaneously, making it impossible to determine which one came first (Gnerlich 2007). Franklin et al. selected the data that better fit their claim. They also failed to provide a more balanced view by noting that the first randomized trial in our dataset typically indicated the better treatment correctly. Our trial data are in perfect agreement with a recent, much larger analysis of 647 meta-analyses reporting that “When the first trial is statistically significant, 84.1% (95% CI: 79.4%, 88.8%) of the corresponding meta-analyses is both in the same direction and statistically significant” [7]. For our dataset, the effect estimate of the first trial was in the same direction as the corresponding meta-analysis in 73% of the clinical questions (8 of 11 clinical questions with multiple RCTs, using the trial publication dates and dropping the one case where all trials were published simultaneously, which did not allow us to determine the first one). Moreover, the vast majority of trials had the same direction as the final summary: for 5 clinical questions all trials had concordant effects, and there was never more than one trial disagreeing in direction for the other clinical questions. All three first trials (3 of 11 clinical questions, 27%) that disagreed with the final direction had only 1 vs 2 events, and thus would be very unlikely to lead to any misguided care decisions.
For further details, we refer to the large literature on the agreement between individual clinical trial results and subsequent trial evidence (see, for example, references [5-10]).

Overall, this, together with the claim of Franklin et al. that it is misleading to combine ROR estimates on the same outcome, only proves that they intentionally ignore, or are not familiar with, an extensive two-decade literature of hundreds of papers in meta-epidemiology.

Lars G Hemkens, Despina G Contopoulos-Ioannidis, John P A Ioannidis

References:

1. Franklin JM, Dejene S, Huybrechts KF, et al. A Bias in the Evaluation of Bias Comparing Randomized Trials with Nonexperimental Studies. Epidemiologic Methods 2017.

2. Hemkens LG, Contopoulos-Ioannidis DG, Ioannidis JP. Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey. BMJ 2016;352:i493 doi: 10.1136/bmj.i493.

3. Anglemyer A, Horvath HT, Bero L. Healthcare outcomes assessed with observational study designs compared with those assessed in randomized trials. Cochrane Database Syst Rev 2014(4):MR000034 doi: 10.1002/14651858.MR000034.pub2.

4. Pereira TV, Horwitz RI, Ioannidis JP. Empirical evaluation of very large treatment effects of medical interventions. JAMA 2012;308(16):1676-84 doi: 10.1001/jama.2012.13444.

5. Gartlehner G, Dobrescu A, Evans TS, et al. Average effect estimates remain similar as evidence evolves from single trials to high-quality bodies of evidence: a meta-epidemiologic study. J Clin Epidemiol 2016;69:16-22 doi: 10.1016/j.jclinepi.2015.02.013.

6. LeLorier J, Gregoire G, Benhaddad A, et al. Discrepancies between meta-analyses and subsequent large randomized, controlled trials. N Engl J Med 1997;337(8):536-42 doi: 10.1056/NEJM199708213370806.

7. Tam WW, Tang JL, Di MY, et al. How often does an individual trial agree with its corresponding meta-analysis? A meta-epidemiologic study. PLoS One 2014;9(12):e113994 doi: 10.1371/journal.pone.0113994.

8. Herbison P, Hay-Smith J, Gillespie WJ. Meta-analyses of small numbers of trials often agree with longer-term results. J Clin Epidemiol 2011;64(2):145-53 doi: 10.1016/j.jclinepi.2010.02.017.

9. Ioannidis J, Lau J. Evolution of treatment effects over time: empirical insight from recursive cumulative metaanalyses. Proc Natl Acad Sci U S A 2001;98(3):831-6 doi: 10.1073/pnas.021529998.

10. Lau J, Antman EM, Jimenez-Silva J, et al. Cumulative meta-analysis of therapeutic trials for myocardial infarction. N Engl J Med 1992;327(4):248-54 doi: 10.1056/NEJM199207233270406.

**Competing interests:**
All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: DCI and JPAI had no financial support for this project; LGH had support from the Commonwealth Fund for the submitted work; all authors declare no financial relationships with any organization that might have an interest in the submitted work in the previous three years and no other relationships or activities that could appear to have influenced the submitted work.

**14 September 2017**

## The authors reply

We thank Salanti and Efthimiou for contributing to this interesting discussion. However, their very general critique of the meta-epidemiological method at large is unfounded; it bypasses the research questions of our study and ignores the context of the other 4 cited studies. Three (1-3) of the 4 papers they mention did not use any ratio of odds ratios (ROR) calculations to compare evidence from two types of studies. They use "coining" simply to describe absolute magnitudes of reported effect sizes ("to focus consistently on the extent of deviation from the 'null'" (2)). This is fully correct and appropriate.

Only the genetics meta-epidemiologic study that Salanti co-authored (4) used ROR. However, the coining was very different from that in our BMJ paper (5): it involved the summary odds ratio of a meta-analysis combining all the studies of both designs to determine a consistent comparison, not the odds ratio of the first published study/design. While our BMJ paper (5) explored how the benefits of a promising treatment postulated by a certain type of study would agree with future evidence, the perspective of the genetics study (4) was totally different. In genetics, it is typically impossible to know beforehand which allele increases risk. This may seem similar to having two compared active treatments without clear experimental and control arms. However, in contrast to the clinical question addressed in our BMJ paper (5), in the evaluation of family-based versus unrelated-control genetic epidemiology designs (4) there was no consistent first published and subsequent study design, and this was not the research question. With both designs on an equal footing, the summary of all data (i.e. the result of the meta-analysis across all available studies) represented what geneticists would consider the best evidence; it was thus the appropriate choice for determining the direction of the allele comparison. Twelve years later, the conclusions of the genetics meta-epidemiologic study (4) are widely considered correct, and thousands of papers use unrelated case-control designs as interchangeable with family-based designs for detecting genetic signals. Therefore, none of the 4 papers mentioned by Salanti and Efthimiou requires any correction.

However, their comment offers an opportunity to clarify a broader issue: it is important to carefully consider the specific research question of each meta-epidemiological study before making sweeping, unfounded claims about "biased coining" and "erroneous practices". One may need to apply different types of "coining" depending on the specific, properly framed, relevant question that is asked. Such inversion rules and their validity are entirely context-dependent, which requires adequate subject-matter knowledge of the field (be it genetics, clinical medicine, or health-care decision making). This is essential for addressing relevant questions and using coining properly, whenever needed. In particular, when one wants to incorporate the "bias" contributed by the fact that a study is done first (called the "winner's curse" in genetics and the "decline effect" or "regression to the mean" in other fields), coining by the first study is appropriate. When the "winner's curse" is part of the problem for decision-making based on early evidence, and the research question is exactly about this problem, then of course one has to consistently evaluate the "winners", i.e. in our study (5) the "winning" treatments in the first study done with routine data. This "curse" is exactly the "bias" that has been precisely dissected in various previous responses. It is not an "erroneous practice" of the meta-epidemiological method; it is a phenomenon that consistently affects clinical decision-making in exactly the situation addressed by our research question, and which we therefore integrated into our analytical framework by coining 3 of the 16 effects. Regardless, as we showed in a previous response, this part of the problem was not as relevant in our study (5) as it may sound, because analyzing the data without coining did not change the results much. However, an analysis without coining addresses a different research question.
Other research questions may need different coining in other circumstances. Statistical analyses do not occur in a vacuum but should be fit for purpose (as exemplified by (4) and (5)).
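To make the distinction concrete, here is a minimal sketch (in Python, with hypothetical helper functions; not the authors' actual code) of the two coining rules discussed, one anchored on the first study and one anchored on the pooled summary:

```python
def coin_by_first(or_first, or_subsequent):
    """Coin on the first study (winner's-curse framing, as in the BMJ
    paper's research question): flip both ORs so the treatment that
    "won" in the first study is always the reference (OR < 1)."""
    if or_first <= 1:
        return or_first, or_subsequent
    return 1 / or_first, 1 / or_subsequent

def coin_by_summary(or_a, or_b, or_pooled):
    """Coin on the summary OR of a meta-analysis across both designs
    (as in the family-based vs unrelated-control genetics study):
    flip both ORs whenever the pooled OR is below 1."""
    if or_pooled >= 1:
        return or_a, or_b
    return 1 / or_a, 1 / or_b

# The same pair of estimates is coined differently under each rule:
print(coin_by_first(1.2, 0.6))         # both flipped: first-study OR > 1
print(coin_by_summary(1.2, 0.6, 1.1))  # unchanged: pooled OR already > 1
```

Neither rule is "the" correct one in general; as argued above, the choice follows from the research question being asked.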

References:

1. Trikalinos TA, Ntzani EE, Contopoulos-Ioannidis DG, Ioannidis JPA. Establishment of genetic associations for complex diseases is independent of early study findings. Eur J Hum Genet EJHG. 2004 Sep;12(9):762-9.

2. Kavvoura FK, Liberopoulos G, Ioannidis JPA. Selection in reported epidemiological risks: an empirical assessment. PLoS Med. 2007 Mar;4(3):e79.

3. Siontis KCM, Patsopoulos NA, Ioannidis JPA. Replication of past candidate loci for common diseases and phenotypes in 100 genome-wide association studies. Eur J Hum Genet EJHG. 2010 Jul;18(7):832-7.

4. Evangelou E, Trikalinos TA, Salanti G, Ioannidis JPA. Family-based versus unrelated case-control designs for genetic associations. PLoS Genet. 2006 Aug;2(8):e123.

5. Hemkens LG, Contopoulos-Ioannidis DG, Ioannidis JPA. Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey. BMJ 2016;352:i493.

**Competing interests:** No competing interests

**05 February 2018**