Open access (CC BY-NC)

Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey

BMJ 2016; 352 doi: (Published 08 February 2016) Cite this as: BMJ 2016;352:i493
  1. Lars G Hemkens, senior researcher1 2,
  2. Despina G Contopoulos-Ioannidis, clinical associate professor3 4,
  3. John P A Ioannidis, professor1 4 5 6
  1. Stanford Prevention Research Center, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
  2. Basel Institute for Clinical Epidemiology and Biostatistics, University Hospital Basel, Basel, Switzerland
  3. Department of Pediatrics, Division of Infectious Diseases, Stanford University School of Medicine, Stanford, California, USA
  4. Meta-Research Innovation Center at Stanford (METRICS)
  5. Department of Health Research and Policy, Stanford University School of Medicine, Stanford, California, USA
  6. Department of Statistics, Stanford University School of Humanities and Sciences, Stanford, California, USA
  1. Correspondence to: J P A Ioannidis jioannid{at}
  • Accepted 8 January 2016


Objective To assess differences in estimated treatment effects for mortality between observational studies with routinely collected health data (RCD; that are published before trials are available) and subsequent evidence from randomized controlled trials on the same clinical question.

Design Meta-epidemiological survey.

Data sources PubMed searched up to November 2014.

Methods Eligible RCD studies were published up to 2010 that used propensity scores to address confounding bias and reported comparative effects of interventions for mortality. The analysis included only RCD studies conducted before any trial was published on the same topic. The direction of treatment effects, confidence intervals, and effect sizes (odds ratios) were compared between RCD studies and randomized controlled trials. The relative odds ratio (that is, the summary odds ratio of trial(s) divided by the RCD study estimate) and the summary relative odds ratio were calculated across all pairs of RCD studies and trials. A summary relative odds ratio greater than one indicates that RCD studies gave more favorable mortality results.

Results The evaluation included 16 eligible RCD studies, and 36 subsequent published randomized controlled trials investigating the same clinical questions (with 17 275 patients and 835 deaths). Trials were published a median of three years after the corresponding RCD study. For five (31%) of the 16 clinical questions, the direction of treatment effects differed between RCD studies and trials. Confidence intervals in nine (56%) RCD studies did not include the RCT effect estimate. Overall, RCD studies showed significantly more favorable mortality estimates by 31% than subsequent trials (summary relative odds ratio 1.31 (95% confidence interval 1.03 to 1.65; I²=0%)).

Conclusions Studies of routinely collected health data could give different answers from subsequent randomized controlled trials on the same clinical questions, and may substantially overestimate treatment effects. Caution is needed to prevent misguided clinical decision making.


Routinely collected health data (RCD), such as electronic health records or patient registries, are proposed to assess comparative treatment effects of medical interventions. In theory, observational studies collecting this type of data could complement randomized controlled trials.1 The most important limitation of RCD studies is their inherent risk of bias due to confounding by indication. While only proper randomization can pre-emptively eliminate such bias, approaches such as propensity scores are frequently used to deal with bias in observational research. The propensity score reflects the probability that a patient will be selected for a treatment and is estimated by use of information on known factors affecting the treatment choice, for example, disease severity.2 3 Many other methods are increasingly used, but propensity scores are probably the most popular method used to inform healthcare decisions.3 4 5 Studies using data not collected for the purpose of a specific research project face many challenges and are prone to various specific biases related to the very nature of these data.1 A major challenge is the accuracy and reliability of the collected data, which are typically lower than in many clinical trials with standardized and predefined outcome assessments. This might be less problematic for mortality, because it is an unambiguous outcome and less prone to data accuracy problems.
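To make the propensity score concept concrete, the following minimal sketch estimates the probability of receiving a treatment from a single known factor (disease severity) with a logistic model. The data, the function names, and the plain gradient ascent fit are illustrative assumptions only; they are not taken from any of the analyzed RCD studies, which typically used many covariates and standard statistical software.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit_propensity(severity, treated, lr=0.01, steps=5000):
    """Fit P(treated | severity) = sigmoid(b0 + b1 * severity) by gradient ascent
    on the log likelihood (a deliberately simple stand-in for logistic regression)."""
    b0 = b1 = 0.0
    n = len(severity)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, t in zip(severity, treated):
            p = sigmoid(b0 + b1 * x)
            g0 += t - p          # gradient with respect to the intercept
            g1 += (t - p) * x    # gradient with respect to the slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical patients: severity score and whether the treatment was given (1) or not (0)
severity = [1, 2, 3, 4, 5, 6, 7, 8]
treated = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_propensity(severity, treated)
scores = [sigmoid(b0 + b1 * x) for x in severity]
for x, s in zip(severity, scores):
    print(f"severity {x}: estimated propensity {s:.2f}")
```

In this made-up sample, sicker patients are treated more often, so the estimated propensity rises with severity. Matching or adjusting on such a score can balance only the factors that entered the model, which is why residual confounding remains a concern.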

Although their limitations should not be underestimated,1 RCD studies could provide the best available evidence to inform healthcare decisions when randomized controlled trials are not available. However, it is unknown whether such studies offer highly reliable answers on vital clinical questions, for example, whether the estimated treatment effects from RCD studies agree with effects demonstrated in subsequent randomized controlled trials. Most RCD studies are published on questions where there is already available evidence from trials. For example, a 2010 survey showed that almost 70% of 337 RCD studies based on propensity scores already had randomized controlled trials published on the same question.6 The authors of these RCD studies are likely to be consciously or unconsciously influenced by the already available results of the respective trials. To directly assess whether RCD studies can predict the results of subsequent randomized controlled trials, one needs to focus on topics where no prior trial evidence is available to influence what might be considered as reasonable effects to report by the RCD studies.

We therefore aimed to obtain insights on the concordance between RCD studies and randomized controlled trials with a comprehensive meta-epidemiological study. The present study used RCD studies that analyzed a critical healthcare question, used propensity scores to deal with bias, and evaluated effects on mortality. We systematically compared the findings from such studies on various clinical questions (which had never been addressed in trials before) with the findings from subsequent randomized controlled trials.


Eligibility criteria and identification of routine data studies

Eligible RCD studies compared one treatment with another or no intervention, usual care, or standard treatment; were performed before any randomized controlled trial on the same clinical question; assessed mortality effects; and used propensity scores based analyses for mortality. We considered studies that used only data that were routinely collected. Any type of such data was considered eligible,7 8 including those from health insurance claims, electronic health or medical records, and registries (even if registries also comprised some actively collected data for the purpose of the registry rather than only passive, routine data collection).9 We considered studies evaluating drugs, biologics, dietary supplements, devices, diagnostic procedures, surgeries, or radiotherapies in any patient population with any condition, and mortality outcome (all cause or cause specific) that were published in English. We included studies published up to 2010 to ensure sufficient time for randomized evidence, if any, to appear.

We searched PubMed (last search November 2014) combining terms for RCD (such as “routine*”, “database*”, “claim*”, “health record*”, “registr*”, and covering all terms used in the National Library of Medicine search strategy for electronic health or medical records10), with terms for mortality and propensity scores. For further details on inclusion criteria, definitions, and search strategies, see reference 6. One reviewer (LGH) screened titles and abstracts, obtained full texts of potentially relevant articles, and determined eligibility.

Data extraction from RCD studies

For each eligible study, we extracted all clinical questions reported in the abstract following the PICO structure (patient, intervention, comparison, outcome).11 We formulated separate clinical questions for each combination of patients and compared interventions (experimental and comparator) for which any result was reported in the abstract. We considered clinically relevant variations of treatment characteristics (such as timing or dose) or patient conditions (eg, comorbidities) as separate PICO clinical questions. We also considered specific subquestions separately—such as when the main comparison looked at coronary stenting versus no stenting, and subanalyses compared drug eluting stents with bare metal stents separately. We did not consider separately specific age subgroups within adult populations and demographic subpopulations (sex, race, or ethnicity).

For each clinical question, we searched the complete article for a comparative effect between the compared interventions on mortality outcomes based on analyses that used propensity scores in any way (adjustment, selection of compared populations, both, or other). If we identified such an effect estimate, we screened the full text and references for randomized evidence on the same clinical question (not necessarily evaluating mortality outcomes). We excluded any clinical questions with existing prior trial evidence. We then extracted data on RCD study characteristics and the mortality effect estimate with 95% confidence intervals. If a study reported multiple estimates, we used the analysis with results first mentioned in the abstract (as a prespecified rule to avoid subjectivity in the selection of effects). One reviewer (LGH) extracted the data and screened the articles.

Eligibility criteria and identification of randomized controlled trials

For each eligible clinical question, we systematically searched PubMed (to November 2014) for randomized controlled trials or systematic reviews or meta-analyses of trials that also addressed this question and reported any mortality outcome. We created standardized search strategies for each topic by combining search terms for the intervention, comparator, and condition. We used the PubMed standard filters for study design, limited results to the English language, and added terms for mortality to increase specificity when we searched for trials and diagnostic topics (web appendix 1 and reference 6). For RCD studies published up to 2007, we also searched all relevant modules of the Cochrane Library, but found no pertinent randomized controlled trial that was not also identified via PubMed; thus for newer RCDs, we only searched PubMed.

We screened titles and abstracts, obtained full texts of potentially relevant articles, and determined eligibility. The randomized controlled trials derived from these searches were considered for further analyses. We tested the completeness of our search by using the related articles function in PubMed for each eligible trial (screening the first 20 related articles), and in no case did we find an additional trial. These processes were all done by one reviewer (LGH) who marked studies if he was uncertain about eligibility. These studies were discussed with a second reviewer (DCI), who also confirmed the eligibility of all identified pertinent trials and spot checked all excluded full texts for verification. Discrepancies were discussed to reach consensus. We excluded from further analyses any clinical questions for which preceding trials (that is, published up until the year before the RCD study was published) were identified with the above searches.

Data extraction from randomized controlled trials

For each eligible trial, we extracted the number of randomized patients and deaths per treatment group (we preferred intention to treat data wherever possible). If a trial had multiple mortality endpoints, we preferred the same type of outcome definition as in the RCD study (all cause or cause specific mortality) and the most similar follow-up period (eg, inhospital and 30 day mortality). We extracted the proportions of patients not initiating the randomized treatment and patients switching to the non-allocated treatment during the study (treatment crossover). Data extraction was performed by one reviewer (LGH).

Risk of bias assessment

We assessed the risk of bias for RCD studies (DCI, JPAI) and randomized controlled trials (LGH, and an external researcher experienced in systematic reviewing), using the Cochrane risk of bias tools.12 13 Discrepancies were discussed to reach consensus.

Statistical analysis

For consistency, we inverted the RCD effect estimates where necessary so that each RCD study indicated an odds ratio less than 1 (that is, swapping the study groups so that the first study group has lower mortality risk than the second). We assumed that reported relative risks or hazard ratios were approximations to the odds ratio, a reasonable assumption because death was a relatively uncommon event (median across treatment comparisons 3% (interquartile range 2-9%)). For each clinical question, we also calculated the odds ratio for mortality using data from randomized controlled trials for the same clinical question. Multiple trials were meta-analytically combined with random effects models to obtain a summary odds ratio.14 We used Peto’s approach for event rates less than 1%.15
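As an illustration of the pooling step, the sketch below computes an odds ratio for each trial from deaths and patients per group and combines them with a random effects model. It assumes the DerSimonian-Laird estimator and a 0.5 continuity correction for zero cells; the text above cites reference 14 without naming either, and the trial counts are invented for illustration. Peto's approach for rare events is not shown.

```python
import math

def log_odds_ratio(deaths_a, n_a, deaths_b, n_b):
    """Return (log odds ratio, variance) for one trial.
    Adds 0.5 to every cell if any cell is zero (an assumed correction)."""
    a, b = deaths_a, n_a - deaths_a
    c, d = deaths_b, n_b - deaths_b
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

def pool_random_effects(estimates):
    """DerSimonian-Laird pooling of (log odds ratio, variance) pairs.
    Returns (summary odds ratio, lower 95% limit, upper 95% limit, tau squared)."""
    y = [e[0] for e in estimates]
    v = [e[1] for e in estimates]
    w = [1 / vi for vi in v]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    df = len(y) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-trial variance
    w_star = [1 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return (math.exp(pooled), math.exp(pooled - 1.96 * se),
            math.exp(pooled + 1.96 * se), tau2)

# Invented counts: deaths and patients in the first group, then in the second group
trials = [(12, 400, 15, 410), (30, 900, 28, 880), (5, 150, 9, 160)]
estimates = [log_odds_ratio(*t) for t in trials]
summary_or, lower, upper, tau2 = pool_random_effects(estimates)
print(f"summary odds ratio {summary_or:.2f} ({lower:.2f} to {upper:.2f})")
```

With a single trial the between-trial variance is set to zero, so the summary odds ratio equals that trial's own odds ratio.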

We recorded how frequently the treatment effect estimates from RCD studies and randomized controlled trials were in the opposite direction, how often the confidence intervals did not overlap, and how often the RCD study’s confidence interval did not include the effect estimate demonstrated by later available trials.
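These three agreement checks can be sketched as follows. The function name and the tuple layout are hypothetical; the numbers are the clopidogrel estimates quoted in the discussion of this article (RCD study odds ratio 0.59, 95% confidence interval 0.35 to 0.99; subsequent trials 1.11, 0.85 to 1.45).

```python
def agreement(rcd, trial):
    """Compare two estimates given as (odds ratio, lower 95% limit, upper 95% limit)."""
    rcd_or, rcd_lo, rcd_hi = rcd
    trial_or, trial_lo, trial_hi = trial
    return {
        # Point estimates lie on opposite sides of the null value of 1
        "opposite_direction": (rcd_or - 1) * (trial_or - 1) < 0,
        # The two confidence intervals share at least one value
        "intervals_overlap": rcd_lo <= trial_hi and trial_lo <= rcd_hi,
        # The RCD interval contains the trial point estimate
        "rcd_interval_includes_trial_estimate": rcd_lo <= trial_or <= rcd_hi,
    }

result = agreement(rcd=(0.59, 0.35, 0.99), trial=(1.11, 0.85, 1.45))
print(result)
```

For this pair the intervals overlap, yet the effects point in opposite directions and the RCD interval does not contain the trial estimate, which shows why overlap alone is a weak criterion for agreement.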

We also calculated for each clinical question the relative odds ratio (ratio of odds ratios) by dividing the summary odds ratio of all subsequent randomized controlled trials by the estimated odds ratio in the RCD study. Confidence intervals of relative odds ratios were calculated by use of the sum of the variances of the trial summary odds ratio and of the RCD study odds ratio estimate. We then combined the individual relative odds ratios across all questions to calculate the summary value. A summary relative odds ratio greater than 1 indicates that the RCD study found more favorable mortality outcomes than subsequent trials. Calculations were done after log-transformation.
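The relative odds ratio calculation can be sketched as below, again with the clopidogrel estimates quoted in the discussion. The sketch assumes that each standard error can be recovered from the width of the reported 95% confidence interval on the log scale, so the resulting interval may differ slightly from the values in fig 4, which were computed from the underlying data.

```python
import math

Z = 1.96  # normal quantile for a 95% confidence interval

def se_from_ci(lower, upper):
    """Standard error of a log odds ratio, recovered from its 95% confidence limits."""
    return (math.log(upper) - math.log(lower)) / (2 * Z)

def relative_odds_ratio(trial, rcd):
    """Trial odds ratio divided by RCD odds ratio, with a 95% confidence interval.
    Each argument is (odds ratio, lower 95% limit, upper 95% limit)."""
    trial_or, trial_lo, trial_hi = trial
    rcd_or, rcd_lo, rcd_hi = rcd
    log_ror = math.log(trial_or) - math.log(rcd_or)
    # The variance of the log ratio is the sum of the two variances
    se = math.sqrt(se_from_ci(trial_lo, trial_hi) ** 2 + se_from_ci(rcd_lo, rcd_hi) ** 2)
    return math.exp(log_ror), math.exp(log_ror - Z * se), math.exp(log_ror + Z * se)

ror, ror_lo, ror_hi = relative_odds_ratio(trial=(1.11, 0.85, 1.45), rcd=(0.59, 0.35, 0.99))
print(f"relative odds ratio {ror:.2f} ({ror_lo:.2f} to {ror_hi:.2f})")
```

Because the RCD estimates were oriented to lie below 1, a relative odds ratio above 1 means the trials found a less favorable mortality effect than the RCD study.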

We conducted several sensitivity analyses:

  • Used fixed effect models instead of random effects models to combine effect sizes from randomized controlled trials14

  • Excluded trials with a high risk of bias

  • Excluded trials reporting high treatment crossover rates (>20% in any group) or asymmetric crossover (between group difference >10%)

  • Included only trials clearly reporting low treatment crossover rates (<10% in all groups)

  • Excluded trials with frequent non-initiation of randomized treatment (>10% in any group)

  • Excluded trials in which the median age differed by more than two standard deviations from the median age in the RCD study

  • Used the effect estimates from two mutually exclusive patient subgroups instead of the main effect from one RCD study16 and compared them with the summary odds ratio for the trials representing effects specifically for these subgroups

  • Excluded one clinical question where all pertinent trials were already used for another treatment comparison17 18

  • Used only trials identified by search strategies of existing systematic reviews

  • Included only RCD studies with low risk of bias for all assessed domains (with the exception of “bias due to confounding,” which was deemed moderate for all RCD studies).

We used Stata 13.1 (Stata Corp) for all analyses and reported 95% confidence intervals. All P values were two tailed.

Patient involvement

No patients were involved in setting the research question or the outcome measures, nor were they involved in developing plans for design or implementation of the study. No patients were asked to advise on interpretation or writing up of results. There are no plans to disseminate the results of the research to study participants or the relevant patient community.


In the search for RCD studies, we identified 929 records and evaluated 420 in full text (fig 1). We found preceding randomized evidence on all evaluated clinical questions in 231 studies, did not find any subsequent randomized controlled trials in 90 studies, and excluded 83 studies for different reasons (fig 1). We eventually analyzed 16 RCD studies on clinical questions that did not have preceding trials and for which subsequent pertinent trials were identified (table 1). One study reported on three clinical questions with one primary result (which we included in our main analysis) and two subgroup effects (included alternatively in sensitivity analyses).16


Fig 1 Study flow diagram. RCT=randomized controlled trial

Table 1

Description of analyzed treatment comparisons in routinely collected data studies

RCD studies were published between 2000 and 2010 and used diverse types of routine data including registries, hospital databases, and administrative data. Most studies were relevant to cardiology (12 (75%) of 16), and 11 (69%) compared two active interventions. All RCD studies assessed all cause mortality, and comparative effect estimates were based on a median of 2086 patients per analysis (interquartile range 734-8658; table 1). While we deemed the risk of bias due to confounding moderate for all studies, most had a low risk of bias for other types of bias. The overall risk of bias was therefore low to moderate for all RCD studies (web appendix 2).

We identified 36 subsequent randomized controlled trials32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 with 17 275 patients and 835 deaths overall, addressing the same clinical question as the RCD studies. All trials reported all cause mortality, and were published between 2003 and 2014, a median of three years after the RCD study. For each clinical question, we included a median of 985 randomized patients (interquartile range 287-1696; fig 2 and fig 3). We deemed the risk of bias high for 10 trials, mainly due to lack of blinding (web appendix 3).


Fig 2 Meta-analyses of comparative effects of medical interventions on mortality reported in randomized controlled trials published after the same clinical question was investigated in RCD studies (part one). For each clinical question investigated in a RCD study, the trials published subsequently are shown. Diamonds=result of meta-analyses combining these subsequent trials as summary odds ratios (using random effects models)


Fig 3 Meta-analyses of comparative effects of medical interventions on mortality reported in randomized controlled trials published after the same clinical question was investigated in RCD studies (part two). For each clinical question investigated in a RCD study, the trials published subsequently are shown. Diamonds=result of meta-analyses combining these subsequent trials as summary odds ratios (using random effects models)

Agreement of treatment effects

Across 16 clinical questions, eight RCD studies found significant treatment effects (fig 4). Confidence intervals were wide and overlapped between RCD studies and randomized controlled trials in all 16 treatment comparisons. However, in more than half of cases (nine of 16; 56%), the confidence intervals of the RCD based estimate did not include the mortality effect found in subsequent randomized trial evidence. For five (31%) of 16 clinical questions, treatment effects from randomized evidence were in the opposite direction to the RCD study estimate. None of these five trial estimates was significant, and one RCD study estimate was significant.


Fig 4 Treatment effects on mortality in RCD studies and randomized controlled trials. Left panel shows comparative effects of medical interventions on mortality reported in RCD studies and results of subsequently published trials on the same treatment comparisons. White circles=effect estimates reported in RCD studies; blue circles=pooled summary effects from subsequent trials (corresponding meta-analyses are shown in fig 2 and fig 3); lines=95% confidence intervals. Right panel shows for each clinical question the ratio of mortality effects reported in trial evidence versus RCD study effects (as relative odds ratios). Blue squares (lines)=relative odds ratio (95% confidence intervals); blue diamond=pooled summary relative odds ratio (meta-analysis of relative odds ratio) across all clinical questions. A relative odds ratio greater than 1 indicates more favorable mortality outcomes in RCD studies than in subsequent trials

When data were synthesized, RCD studies showed significantly inflated results compared with randomized controlled trials, with an average overestimation of mortality benefits by 31% (summary relative odds ratio 1.31 (95% confidence interval 1.03 to 1.65); table 2, fig 4). There was no heterogeneity between topics (I²=0% (0% to 45%)). The results were quite similar in all sensitivity analyses (table 2), with estimates of summary relative odds ratios ranging between 1.20 and 1.34 and their 95% confidence intervals excluding the null in six of the 10 sensitivity analyses. We found the smallest estimate of a difference between RCD studies and trials (summary relative odds ratio 1.20) when we considered only RCD studies with a low risk of bias on all dimensions (except for confounding bias, where a moderate risk is probably the best one can expect for this type of study design).
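For readers unfamiliar with the heterogeneity statistic reported above, the sketch below shows how it is commonly derived from Cochran's Q as the share of variability across estimates beyond what chance alone would produce. The function name and input values are hypothetical, and the confidence interval for the statistic is not computed here.

```python
def i_squared(log_estimates, variances):
    """Heterogeneity statistic in percent, from log effect estimates and their variances."""
    weights = [1 / v for v in variances]
    mean = sum(w * y for w, y in zip(weights, log_estimates)) / sum(weights)
    q = sum(w * (y - mean) ** 2 for w, y in zip(weights, log_estimates))  # Cochran's Q
    df = len(log_estimates) - 1
    if q <= df:
        return 0.0  # no more variability than expected by chance
    return (q - df) / q * 100

# Hypothetical log relative odds ratios with equal precision
print(i_squared([0.3, 0.3, 0.3], [0.04, 0.04, 0.04]))  # identical estimates
print(i_squared([0.0, 1.0], [0.01, 0.01]))             # clearly discordant estimates
```

A value of 0% therefore means that the observed spread of relative odds ratios across clinical questions was no larger than their sampling error would predict, not that every question gave the same estimate.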

Table 2

Agreement of treatment effects reported in RCD studies and subsequent randomized trial evidence


Principal findings

In our comprehensive analysis of various clinical questions on topics never evaluated in randomized controlled trials before, we found that studies using routinely collected health data frequently do not agree with subsequent randomized trials. We analyzed 16 clinical questions with 36 corresponding subsequent trials published a median of three years later. Although our results need to be interpreted cautiously given the relatively small numbers of studies, the emerging pattern was that RCD studies systematically and substantially overestimated the mortality benefits of medical treatments compared with subsequent trials investigating the same question.

The overall findings suggest that results from RCD studies in the absence of randomized controlled trials need to be seen with substantial caution. RCD studies might not necessarily provide reliable answers on how to best treat patients. As an example, the clinical consequences might be illustrated by the clinical question in our analysis with the largest body of randomized evidence—that is, on the duration of clopidogrel treatment after use of drug eluting stents.18 Here, the RCD based estimate suggested substantial and significant reductions in mortality (odds ratio 0.59 (95% confidence interval 0.35 to 0.99)), leaving the study authors to conclude that “longer (≥12 months) planned duration of clopidogrel results in reduced 12-month mortality . . . Randomized studies are urgently needed to address this issue.”18 However, later trial evidence showed no benefit of longer clopidogrel treatment and rather indicated harm, and the confidence intervals were not compatible with the early findings in the RCD study (odds ratio 1.11 (95% confidence interval 0.85 to 1.45)). This shows that RCD studies have a substantial risk of misguiding patient care.1

Comparison with other studies

A recent Cochrane review identified 14 previous meta-epidemiological studies comparing randomized and observational study results.68 Most focused on traditional observational epidemiology rather than on RCD studies, and only two meta-epidemiological analyses compared propensity score analyses with randomized controlled trials.69 70 A further empirical evaluation was excluded from the Cochrane review.71

In their analysis of mortality effects across 22 clinical questions in the field of surgery, Lonjon and colleagues found a point estimate of a summary relative odds ratio that was similar to our analysis (1.20, 95% confidence interval 0.96 to 1.54; original results inverted to allow comparison with this study).69 For subjective outcomes, they found a summary relative odds ratio close to 1 (0.93, 95% confidence interval 0.75 to 1.15). The authors interpreted the lack of statistical difference between study designs as evidence for equivalent effects. However, 20-30% relative changes in the odds of mortality are substantial, because most differences in mortality with treatments across medicine are of this magnitude or even smaller.72 Kuss and colleagues analyzed only one treatment comparison (off pump v on pump cardiac bypass surgery) and similarly interpreted lack of statistical difference as signaling equivalence.70 Dahabreh and colleagues analyzed mortality effects of treatments in the setting of acute coronary syndrome.71 They also found that propensity score analyses gave significantly larger effect sizes than RCTs.70

Strengths and limitations of study

All these previous empirical evaluations were restricted to specific topics and none evaluated clinical questions where all the data from randomized controlled trials were published subsequently to the RCD studies. However, many RCD studies are specifically undertaken to explore whether trial results can be replicated in the real world.6 In such cases, the trial evidence provides some prior knowledge that could inhibit the publication of findings that deviate greatly from the trial experience. Thus, our approach provides a cleaner assessment of the ability of RCD results to predict the results of trials.

Some caveats should be considered in our study. Although we screened many RCD studies using propensity scores, only a fraction of the entire RCD literature was eligible for our analyses. This was largely due to the high number of clinical questions that were already addressed by some randomized trials, as we have previously discussed.1 6 However, we followed a systematic approach to derive a reproducible sample of RCD studies that covers a wide range of diverse healthcare questions. Although many relate to cardiovascular conditions, they represent various types of interventions, including surgery, devices, drug treatment, or treatment concepts. The generalizability to other conditions and diseases might also need to be assessed in the future.

The RCD studies included in our sample encompass a wide spectrum of data sources, from administrative hospital databases to committed registries. These data sources might differ with regard to their granularity, validation processes, and completeness. The sample was too small to allow a meaningful evaluation of differences across different subgroups of routine data sources. We have no detailed information on the accuracy of the key information of interest for our analyses (mortality and treatment allocation). Although we assume high data accuracy given the type of outcome (death is difficult to err on) and the clinical prominence of the assessed interventions, we cannot rule out that accuracy problems further reduce the reliability of such research.

Our PubMed search strategy for subsequent randomized controlled trials was relatively specific. It would be difficult to conduct thorough systematic reviews from scratch with highly sensitive search strategies for all the 106 RCD studies without preceding trials that we evaluated. Instead, we used a standardized search approach, systematically integrated existing systematic reviews and validated the search results with alternative identification algorithms—that is, the related article function in PubMed. Although the number of included clinical questions with pairs of RCD study and trials could have been higher with a more sensitive strategy for subsequent trials, we had similar results in sensitivity analyses restricted to trial results obtained from search strategies of existing systematic reviews.

We assessed only mortality effects. Other more subjective clinical outcomes would probably be collected less accurately in the routinely collected datasets. This might further reduce the validity of treatment effect estimates and further limit the reliability of RCD studies to guide clinical decision making. Conversely, some other types of outcomes might have much larger treatment effects than mortality, and thus they might be easier to separate from noise due to bias in RCD studies. However, treatment benefits for other outcomes (eg, hospital admission) might not necessarily translate to benefits for mortality or other hard benefits.73

We compared the RCD effects with early evidence from subsequent randomized trials, which sometimes overestimates treatment effects.74 Thus, our results might even be conservative and we may have underestimated the inflated and optimistic effects from RCD studies.

Randomized controlled trials are not necessarily a perfect gold standard. When their results differ from those of observational studies on the same question,75 76 it may not be certain that the trials are correct and the observational data are wrong. We explored the potential effect of risk of bias in the randomized and non-randomized studies. None of the RCD studies and only a few trials were deemed to have high risk of bias. When we compared only the effects from studies without high bias potential, we found similar effects as in the main analysis.

We used intention to treat effects for our comparison with RCD studies, because this is the most robust approach against bias. Such effects could be conservative in trials without active controls, low adherence to the allocated treatment, or high dropout rates. However, most trials compared active treatments, most had only very few patients not starting the allocated treatment or switching to the other treatment during the study, and none had a high risk of bias due to missing outcome information (dropouts). In various sensitivity analyses, we found no indication that use of intention to treat effects affected our main findings.

For RCD studies, the assessment of the risk of bias is not straightforward. Use of propensity score methods helps to reduce confounding, but it is unlikely that confounding can be eliminated. It is difficult even to judge to what exact extent confounding has been reduced with different propensity adjustments or other approaches. For other dimensions of potential bias beyond confounding, our selected studies might have been at lower risk for bias than many other RCD studies that look at outcomes other than mortality. For non-mortality outcomes, missing information, measurement errors, and availability of diverse definitions and analyses could be more prominent than for death. Our results remained largely similar in different sensitivity analyses, although we did see the lowest estimate for a summary relative odds ratio (indicating closest convergence of results from randomized controlled trials and RCD studies) when we considered RCD studies with low risk of bias in all dimensions (other than confounding). We cannot exclude the possibility that RCD studies become better in predicting trial results when bias is minimized, although much more data are needed to make a conclusive statement about this.

Genuine differences in estimated effect sizes could still exist between the two methods. Nevertheless, we tried to make the PICO structure highly comparable in the juxtaposed RCD studies and randomized controlled trials that we evaluated. It is also unclear whether those questions where subsequent trials were performed are qualitatively different from those where subsequent trials are never performed once an effect has been described in the observational literature. When strong, conclusive effects are seen in RCD studies, there may be less likelihood to perform a subsequent trial.72 However, it is unlikely that such strong, conclusive effects are commonly seen.

Conclusions and policy implications

Despite the wide and increasing use of routinely collected health data in comparative effectiveness research, the reliability of this approach needs to be questioned, especially when effectiveness outcomes are concerned and randomized controlled trials might be feasible to conduct. Of course, for some outcomes (especially on safety or harms), it may be difficult to obtain definitive evidence from large trials, and RCD could then offer the best possible guidance.

If no randomized trials exist, clinicians and funders of care can still act on the results from observational RCD and other evidence, but they should consider that treatment effects could be more uncertain and substantially smaller than what RCD studies suggest. Therefore, decisions for widespread adoption and reimbursement of expensive interventions with evidence based entirely on RCD may be best withheld until trial evidence becomes available. Large randomized trials might still be needed to address critically important clinical questions for patient relevant outcomes.1 77 78

What is already known on this topic

  • Observational studies using routinely collected data (RCD studies) are increasingly used to inform healthcare decisions when RCTs are not available

  • However, observational studies have an inherent risk of bias due to confounding by indication

  • Another difficulty is the accuracy and reliability of routinely collected data

What this study adds

  • RCD studies systematically and substantially overestimate mortality benefits of medical treatments compared with subsequent trials investigating the same question

  • Observational RCD studies might not necessarily provide very reliable answers on how to best treat patients; caution is needed to prevent misguided clinical decision making

  • If no randomized trials exist, clinicians and funders of care should consider that treatment effects are probably more uncertain and substantially smaller than RCD studies suggest; decisions for widespread adoption and reimbursement of expensive interventions might be best withheld until trial evidence becomes available


  • We thank Hannah Ewald, University of Basel, for support in the risk of bias assessment.

  • Contributors: LGH and JPAI conceived the study. All authors extracted and analyzed the data and interpreted the results. LGH wrote the first draft and all authors made revisions on the manuscript. All authors read and approved the final version of the paper. JPAI is the guarantor.

  • Funding: This study was supported by the Commonwealth Fund, a private independent foundation based in New York City. The views presented here are those of the authors and not necessarily those of the Commonwealth Fund, its directors, officers, or staff. The Basel Institute for Clinical Epidemiology and Biostatistics received support from Santésuisse, the umbrella association of Swiss social health insurers. The Meta-Research Innovation Center at Stanford is funded by a grant from the Laura and John Arnold Foundation. The work of JPAI is supported by an unrestricted gift from Sue and Bob O’Donnell. The funders had no role in design and conduct of the study; collection, management, analysis, and interpretation of the data; and preparation, review, or approval of the manuscript or its submission for publication.

  • Competing interests: All authors have completed the ICMJE uniform disclosure form at and declare: DCI and JPAI had no financial support for this project; LGH had support from the Commonwealth Fund for the submitted work; all authors declare no financial relationships with any organization that might have an interest in the submitted work in the previous three years and no other relationships or activities that could appear to have influenced the submitted work.

  • Ethical approval: Not required for this study.

  • Data sharing: No additional data available.

  • The corresponding author affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained.

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See:

