- Richard L Tannen, professor of medicine,
- Mark G Weiner, associate professor of medicine,
- Dawei Xie, assistant professor of biostatistics and epidemiology
- University of Pennsylvania School of Medicine, 295 John Morgan Building, 36th and Hamilton Walk, Philadelphia, PA 19104, USA
- Correspondence to: R L Tannen
- Accepted 19 October 2008
Objectives To determine whether observational studies that use an electronic medical record database can provide valid results of therapeutic effectiveness and to develop new methods to enhance validity.
Design Data from the UK general practice research database (GPRD) were used to replicate previously performed randomised controlled trials, to the extent that was feasible aside from randomisation.
Studies Six published randomised controlled trials.
Main outcome measure Cardiovascular outcomes analysed by hazard ratios calculated with standard biostatistical methods and a new analytical technique, prior event rate ratio (PERR) adjustment.
Results In nine of 17 outcome comparisons, there were no significant differences between results of randomised controlled trials and database studies analysed using standard biostatistical methods or PERR analysis. In eight comparisons, Cox adjusted hazard ratios in the database differed significantly from the results of the randomised controlled trials, suggesting unmeasured confounding. In seven of these eight, PERR adjusted hazard ratios differed significantly from the Cox adjusted hazard ratios; in five of the eight the PERR adjusted hazard ratios did not differ significantly from the trial results, and in three they were more similar to the trial hazard ratio, so that overall the PERR results were significantly closer to those of the randomised controlled trials than the Cox results (P<0.05).
Conclusions Although observational studies using databases are subject to unmeasured confounding, our new analytical technique (PERR), applied here to cardiovascular outcomes, worked well to identify and reduce the effects of such confounding. These results suggest that electronic medical record databases can be useful to investigate therapeutic effectiveness.
The future widespread implementation of electronic records in clinical practice will provide an enormous opportunity for research related to medical treatments, provided this information is compiled into robust, well designed databases and analysed with appropriate methods. By contrast, incorrect analyses could have important negative effects on medical treatment and health policy. Therefore, before implementation of this approach for assessing effectiveness of treatment, we need to assess the validity of the results from studies using such databases and of the study design and analytical strategies that are most likely to yield valid results. The need for further investigation into these strategies is widely supported.1 2 3 4 5 6 7
Two major potential problems could arise in the use of medical record databases to provide reliable information concerning treatment outcomes: the quality of the data contained within the database and the ability of analyses of observational—that is, non-experimental—data to provide valid results.
Considerable controversy exists over whether observational studies can provide reliable information on effectiveness of therapeutics.1 2 8 9 10 11 12 13 14 15 Because of their ability to balance measured and unmeasured confounders, randomised controlled trials remain the highest level of evidence, whereas the quality of evidence from observational studies is lower because of confounding by indication and other biases related to the effects of unmeasured covariates. Several comparative analyses suggest that observational studies often yield results reasonably consistent with those of randomised controlled trials. Nevertheless, there are several well documented examples where the results from observational studies were misleading.1 2 3 7 8 16 17 Hormone treatment to protect against coronary artery disease in postmenopausal women is one highly cited example.7 18 19 20 Some authorities believe that the results of observational studies should be an important component of evidence based medicine; some suggest their reliability is limited to conditions where confounding by indication is unlikely, as, for example, in studies of unanticipated adverse effects of drugs,21 whereas others are sceptical of their value.1 2 7 8 9 10 11 12 13 14 15
An important limitation of previous comparative analyses is that most of the observational studies did not have rigorous inclusion and exclusion criteria, exposure definitions, and outcomes identical to those of the randomised controlled trials, so that lack of randomisation was not the only important difference.1 2 15 22 23
To overcome these limitations in validating an observational study, we tested the value of a comprehensive longitudinal electronic clinical database, the UK General Practice Research Database (GPRD), using studies designed to replicate the design of previously performed randomised controlled trials to the extent that was feasible aside from randomisation.24 Validity of the method was measured by comparing the outcomes of the replicated GPRD study with those of the randomised controlled trial.25 26 27 28 29 The GPRD study results depended on both the quality of the information in the database and whether observational data can reproduce results from a randomised controlled trial.
We examined both the potential research value of the electronic medical record database and the validity of observational studies. We also used a new analytical method, prior event rate ratio (PERR) adjustment, to enhance the validity of the results.
The UK GPRD database contains information from the electronic medical records of primary care practices encompassing a representative sample of about 5.7% of the UK population during 1990-2000 and contains records of over eight million patients.24 25 It includes the complete primary care medical record, comprehensive information on essentially all medications prescribed, and information from outside consultants and admissions to hospital. The box details limitations and advantages of the database.
The GPRD database
Comprehensive national healthcare system
Representative sample of entire population
All care centralised in general practice record
All medications prescribed by general practitioner, generated by computer
Contains around eight million patients
Lacks direct link to laboratory data (laboratory data inadequate)
Missing data on smoking, systolic blood pressure, body mass index (about 30%)
Limited data on onset of menopause
Limited data on admission to hospital
Lacks direct link to death certificates (cause of death not reliable)
GPRD study protocol
Table 1 summarises database replications of six randomised controlled trials that have been performed and reported in detail elsewhere.20 30 31 32 33 34 As far as possible the database studies used the same inclusion and exclusion criteria, a similar study time frame, and a similar treatment regimen as the randomised trials.25 26 27 28 29 Thus the primary difference was the lack of randomisation in the database studies, although other factors, such as use of placebo, nature of healthcare delivery, and some characteristics of subjects entered into randomised trials compared with those in the general population, can also differ between a randomised trial and a database study.
Selection of the subjects for inclusion in the database studies followed the outline shown in figure 1. First the exposed cohort was selected from all database subjects who met the inclusion criteria and received the study treatment during a predefined recruitment interval. The exposed cohort was finalised after elimination of patients with exclusion criteria. Their start time was the day of the first prescription of the study drug. The unexposed cohort was selected from all patients who met the inclusion criteria but did not receive the study drug during the recruitment interval. They were then age and sex matched to the exposed patients with a computerised random selection program, and their start time was considered identical to that of the matched exposed patients. Then, those who had exclusion criteria were eliminated.
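As a rough sketch of this selection step, the matching can be implemented as a random draw without replacement from age and sex matched candidates, with the matched patient inheriting the exposed patient's start time. The record layout and function name below are illustrative, not the actual GPRD matching code:

```python
import random

# Hypothetical record layout: each patient is a dict with id, age, sex,
# and (for exposed patients) a start time set by their first prescription.
def match_unexposed(exposed, candidates, seed=42):
    """For each exposed patient, randomly pick an age and sex matched
    unexposed candidate, without replacement, and assign the candidate
    the same start time as the matched exposed patient."""
    rng = random.Random(seed)
    pool = list(candidates)
    matches = []
    for pt in exposed:
        eligible = [c for c in pool
                    if c["sex"] == pt["sex"] and c["age"] == pt["age"]]
        if not eligible:
            continue  # no eligible match; the exposed patient is dropped
        chosen = rng.choice(eligible)
        pool.remove(chosen)            # matching without replacement
        chosen["start"] = pt["start"]  # inherit matched start time
        matches.append((pt, chosen))
    return matches
```

Exclusion criteria would then be applied to the matched unexposed patients, as the text describes, after the start time has been assigned.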
The selection process differed for the database matched to the Syst-Eur study because study entry and start time for both the exposed and unexposed cohorts was determined by measured blood pressure that indicated systolic hypertension.25
All database studies ended on a predefined date or on outcome stop points defined in the randomised controlled trial. Patients were considered lost to follow-up if they left the practice or the practice was eliminated from the database before the end date. We analysed database studies using a simulated “intention to treat” paradigm where subsequent treatment of the exposed and unexposed patients did not modify study end time and also an “as treated” analysis in which the study ended for an exposed or unexposed patient who deviated from their treatment protocol.
We determined Cox unadjusted and adjusted hazard ratios for all outcomes. The adjusted hazard ratios used a predetermined set of potential confounders that included key demographics, medications at baseline, and identified medical conditions. We imputed missing values for systolic blood pressure, body mass index, and smoking35 and created five separate datasets. The final estimates combined the results from the five datasets, as described previously.26 35 36 37
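The combination of estimates across the five imputed datasets follows Rubin's rules: the pooled log hazard ratio is the mean of the per-dataset estimates, and its variance adds a between-imputation component to the mean within-imputation variance. A minimal illustrative implementation (not the authors' code):

```python
from math import exp, sqrt

def pool_imputations(log_hrs, ses):
    """Pool a log hazard ratio estimated on m imputed datasets using
    Rubin's rules. log_hrs are per-dataset log hazard ratios; ses are
    their standard errors. Returns the pooled hazard ratio and the
    standard error of its logarithm."""
    m = len(log_hrs)
    pooled = sum(log_hrs) / m                       # mean of the estimates
    within = sum(se ** 2 for se in ses) / m         # mean within-imputation variance
    between = sum((b - pooled) ** 2 for b in log_hrs) / (m - 1)
    total_var = within + (1 + 1 / m) * between      # Rubin's total variance
    return exp(pooled), sqrt(total_var)
```

When the five datasets agree exactly, the between-imputation term vanishes and the pooled standard error reduces to the within-imputation value.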
We also analysed results with a propensity score approach, which used all demographics, drug use at baseline, and identified medical conditions as confounders.25 26 38 39 40 Propensity scores were estimated using logistic regression with the outcome being the indicator of treatment and the covariates being all confounders considered. For those with no missing data, all covariates were used; whereas for those with missing data (for body mass index, systolic blood pressure, or smoking), we used separate logistic regression models, which excluded the missing covariates, to estimate propensity scores. Analysis stratified by the propensity scores balances the treated and untreated groups with respect to the observed covariates used in estimating the propensity scores. We determined outcome hazard ratios for each fifth of the propensity score and combined the five hazard ratios to determine an overall hazard ratio using a Cox model treating the fifths as strata with different baseline hazards. The propensity score thereby accounts for missing confounders in a different fashion from the multiple imputation method used with the Cox analysis. The matched database study for Syst-Eur, our first study, was analysed only with propensity score analysis.25
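A simplified illustration of pooling across propensity score fifths is sketched below. For brevity it combines per-stratum log rate ratios by inverse variance weighting rather than fitting a Cox model with the fifths as strata, so it is a stand-in for the actual analysis; the inputs are hypothetical event counts and person-time:

```python
from math import exp, log, sqrt

def combine_strata(strata):
    """Inverse variance pooling of per-stratum log rate ratios across
    propensity score fifths: a simplified stand-in for a Cox model that
    treats the fifths as strata with different baseline hazards.
    Each stratum is (exposed_events, exposed_time,
                     unexposed_events, unexposed_time)."""
    num = den = 0.0
    for e1, t1, e0, t0 in strata:
        log_rr = log((e1 / t1) / (e0 / t0))
        var = 1 / e1 + 1 / e0          # approximate variance of a log rate ratio
        num += log_rr / var
        den += 1 / var
    pooled = num / den
    return exp(pooled), sqrt(1 / den)  # overall ratio and SE of its log
```

Stratifying before pooling is what balances the treated and untreated groups on the observed covariates within each fifth, as the text describes.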
We also used a prior event rate ratio (PERR) approach to adjust the Cox hazard ratio, as described recently.28 29 This analysis requires that neither the exposed nor unexposed patients are treated with the study drug before the start of the study. It assumes that the hazard ratio of the exposed to unexposed for a specific outcome before the start of the study reflects the combined effect of all confounders (both measured and unmeasured) independent of any influence of treatment.
To apply the PERR adjustment method, we divided the unadjusted hazard ratio of exposed versus unexposed groups during the study by the unadjusted hazard ratio of exposed versus unexposed “before” the study. Thus if p=prior events and s=study events, the calculation is: PERR adjusted HR=HRs/HRp. We obtained confidence intervals for the PERR adjusted hazard ratio using a bootstrap technique.28 Hazard ratios are reported because of variable observation times for patients both before and during the study; though incidence rate ratios produced similar results.
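The PERR calculation and its bootstrap confidence interval can be sketched in a few lines. Here rate ratios stand in for hazard ratios (which, as noted above, gave similar results), and the patient record layout is hypothetical:

```python
import random

def perr_estimate(patients):
    """PERR adjusted ratio from patient-level records. Each record is a
    dict with an exposed flag plus prior/study event counts and
    follow-up times. The study period exposed/unexposed ratio is
    divided by the prior period ratio."""
    def rate(group, phase):
        events = sum(p[phase + "_events"] for p in group)
        time = sum(p[phase + "_time"] for p in group)
        return events / time
    exp_grp = [p for p in patients if p["exposed"]]
    unexp_grp = [p for p in patients if not p["exposed"]]
    study_rr = rate(exp_grp, "study") / rate(unexp_grp, "study")
    prior_rr = rate(exp_grp, "prior") / rate(unexp_grp, "prior")
    return study_rr / prior_rr

def perr_bootstrap_ci(patients, n_boot=1000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for the PERR adjusted
    ratio, resampling patients with replacement."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        sample = [rng.choice(patients) for _ in patients]
        try:
            reps.append(perr_estimate(sample))
        except ZeroDivisionError:
            continue  # resample left one group without events or time
    reps.sort()
    lo = reps[int(len(reps) * alpha / 2)]
    hi = reps[int(len(reps) * (1 - alpha / 2)) - 1]
    return lo, hi
```

Dividing by the prior period ratio is the step that is assumed to cancel the aggregate effect of measured and unmeasured confounders.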
In all studies we carried out the PERR analysis using a subset of patients who did not take the study drug at any time before the start of the study. In no instance did Cox adjusted hazard ratios for this subset differ meaningfully from the results in the overall cohort.
The time interval used to assess previous events encompassed 1 January 1987 to the patient’s start time. If a patient had no medical or treatment record before that date, their time interval began on the earliest subsequent date with a record. If they had no records before the study start time, they were not used in this analysis. The duration of the previous period, averaged over all the outcomes assessed, was 3.52 years (range 2.8-3.9 years). Analysis of the impact of the duration of the previous time period using the empirical data in these studies suggested that encompassing events from 3-4.5 years before study start time did not meaningfully influence the results of the PERR analysis.
We compared differences between the hazard ratio from the randomised trial and the database using a standard normal z test, where the z score was obtained from the difference between the logarithms of the two hazard ratios divided by the standard error of that difference.25
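This comparison can be written out directly; the function below is an illustrative sketch (its name and arguments are not from the original analysis), taking each hazard ratio together with the standard error of its logarithm:

```python
from math import erf, log, sqrt

def compare_hrs(hr1, se1, hr2, se2):
    """Two sided z test for the difference between two hazard ratios on
    the log scale. se1 and se2 are standard errors of the log hazard
    ratios; returns the z score and the normal two sided P value."""
    z = (log(hr1) - log(hr2)) / sqrt(se1 ** 2 + se2 ** 2)
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))  # standard normal CDF at |z|
    return z, 2 * (1 - phi)
```

Working on the log scale makes the sampling distribution of the difference approximately normal, which is why the test divides the difference of logarithms, not of the ratios themselves, by its standard error.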
Comparability between replicated database study and randomised controlled trial
The size of the unexposed group in the database study was always larger than the placebo group of the randomised controlled trials (table 1). The exposed group in the database study, however, was smaller than the treated cohort in half the randomised controlled trials. Furthermore, the database was inadequate to replicate several randomised controlled trials because of an insufficient number of exposed patients.
Although entry criteria were similar for the database studies and randomised trials, the database cohort typically differed from the respective trials in their baseline demographic characteristics, existing comorbidities, and use of cardiovascular drugs.25 26 27 28 29
The database treatment protocol precisely replicated the trial in only one study (WHI-hysterectomy; see table 1). The other database studies used the same class of drug, rather than the specific drug used in the trial. Furthermore, identical dosing regimens could not be replicated. It is worth noting that the prescription database in GPRD can actually track data on medication prescribing better than many randomised controlled trials.
Finally in contrast with the randomised controlled trials, where randomisation resulted in similar baseline health profiles of the treated and placebo arms, all the database studies except Syst-Eur exhibited differences in the baseline characteristics of the exposed and unexposed groups.
Comparison of outcomes in the database studies and randomised controlled trials
We focused on randomised controlled trials with primary cardiovascular outcomes because they could be replicated reasonably without the need for laboratory data. We report on death, myocardial infarction, stroke, and coronary revascularisation (coronary artery bypass grafts or percutaneous transluminal coronary angioplasty (CABG/PTCA) or both). These cardiovascular outcomes should be the least susceptible to misclassification errors. Other outcome results are provided in the primary publications, and the results for breast cancer, colon cancer, and hip fracture were similar in both the “intact uterus” and “hysterectomy” WHI randomised controlled trials and their respective database studies.25 26 27 28 29
Table 2 and figure 2 show cardiovascular outcomes and statistical comparisons for the six database studies and trials. We have shown simulated “intention to treat” results, but results of the “as treated” analyses did not differ meaningfully. Cox adjusted and prior event rate ratio (PERR) adjusted hazard ratios (performed in five studies) are also shown.
Propensity score analyses (table 3) did not differ meaningfully from the analysis with Cox adjusted hazard ratios. A minor exception was the death outcomes in the HOPE and EUROPA studies, where the propensity score adjusted hazard ratios were slightly lower than the Cox adjusted hazard ratios and slightly more similar to the hazard ratios from the randomised controlled trial.
Results from the WHI randomised controlled trial for the entire cohort and also subdivided by age were reported.20 31 41 We compared the database studies to the overall WHI randomised controlled trial and also to the results restricted to women aged <70, an age profile more comparable with the study cohorts in the database.
In nine of 17 comparisons of cardiovascular outcomes there was no significant difference between the Cox adjusted hazard ratios from the database and the hazard ratios from the randomised controlled trials (see table 2, which compares the trial hazard ratio, the database Cox adjusted hazard ratio, and the database-PERR adjusted hazard ratios). In none of these nine comparisons did the PERR analysis differ significantly from either the trial hazard ratios or the Cox adjusted hazard ratios.
In eight of the 17 comparisons, however, the Cox adjusted hazard ratios differed significantly from the trial hazard ratios, suggesting the presence of unmeasured confounding. In seven of these eight instances the PERR adjusted hazard ratios differed significantly from the Cox adjusted hazard ratios, and either did not differ significantly (five outcomes) or were more similar (two outcomes) to the trial hazard ratio. In the other outcome the PERR hazard ratio was more similar to the trial but did not differ significantly from the Cox adjusted hazard ratio. A Wilcoxon signed rank test showed that when the Cox adjusted hazard ratio differed significantly from the trial hazard ratio (n=8), the PERR adjusted hazard ratio was significantly (P<0.05) more similar to the trial hazard ratio than the Cox adjusted hazard ratio.
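For a sample of only eight paired differences, an exact Wilcoxon signed rank test is feasible by enumerating all sign assignments. The sketch below is illustrative (not the authors' code), and the input differences, for example |Cox − trial| − |PERR − trial| on the log scale, would be hypothetical:

```python
from itertools import product

def wilcoxon_exact(diffs):
    """Exact one sided Wilcoxon signed rank test for small n: the
    probability, under random sign flips, that the rank sum of positive
    differences is at least as large as observed. Zeros are dropped and
    ties in |d| receive average ranks."""
    d = [x for x in diffs if x != 0]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1            # average rank for a tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_obs = sum(r for r, x in zip(ranks, d) if x > 0)
    count = 0
    for signs in product((1, -1), repeat=n):  # all 2^n sign assignments
        w = sum(r for r, s in zip(ranks, signs) if s > 0)
        if w >= w_obs:
            count += 1
    return count / 2 ** n
```

With n=8 the enumeration covers only 256 sign patterns, so the exact distribution is cheap to compute and no large-sample approximation is needed.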
As the 17 outcomes analysed came from six studies, it is reasonable to question the analysis of each outcome as an independent data point. As shown in the 4S study, however, the two outcomes (myocardial infarction and coronary revascularisation) clearly behaved independently of one another. Unmeasured confounding affected revascularisation but had no discernible effect on myocardial infarction. As the PERR analysis is outcome specific and derived entirely in that fashion, it seems reasonable to analyse the data assuming that each individual outcome is independent.
In the aggregate, when the outcome results from the database studies analysed by conventional statistical methods are confirmed or corrected by the PERR method, they are largely comparable with the results from the respective randomised controlled trials.
The large confidence intervals in the PERR analysis of all the WHI outcomes, which limit the interpretation of these data, were due to the small number of previous events. Surprisingly, despite this limitation, the PERR adjusted hazard ratio was significantly higher than the Cox adjusted hazard ratio and not different from the randomised controlled trial hazard ratio for both the myocardial infarction and stroke outcomes in the WHI-hysterectomy study, suggesting the presence of unmeasured confounding. The likelihood that unmeasured confounding influenced these two outcomes is consistent with the significant difference between the Cox adjusted hazard ratios and the randomised controlled trial hazard ratios.
We have shown only Cox adjusted hazard ratios for death because PERR adjustment cannot be applied to death as an outcome (a prior death event cannot occur). The Cox and the propensity score adjusted hazard ratios for death resembled the randomised controlled trial results in three studies; however, they were higher than the trial in the Syst-Eur study and lower in both the WHI studies. The WHI results on death should be interpreted cautiously because in both studies a subset of the overall cohort that was not missing any data on baseline body mass index, systolic blood pressure, or smoking did not show a significant decrease in death, despite results comparable with the overall cohort for all other outcomes.
Despite its shortcomings, this careful, albeit not exhaustive, comparison between randomised controlled trials and observational studies using data from an electronic primary care medical record database reveals several important insights. From an overall perspective, our results suggest that observational studies using databases might produce valid results concerning the efficacy of cardiovascular drug treatments.
Rigour of database studies
Our studies comparing performance of the database and randomised controlled trials were performed in as rigorous a fashion as possible.
In addition to using similar inclusion and exclusion criteria and relatively similar time frames, we analysed studies with both a simulated “intention to treat” and an “as treated” design. We analysed data with multiple imputation plus Cox adjusted hazard ratios, and also propensity score plus stratified Cox unadjusted hazard ratios. The propensity score is useful to identify heterogeneity and also incorporates missing data into the analysis in a fashion different from the multiple imputations used with the primary Cox method. We used a subset of the overall cohort without “missing data” on the key confounders (systolic blood pressure, body mass index, and smoking) as a secondary verification analysis to ensure that missing data did not influence the results in the overall cohort. We assessed use of non-study drugs to confirm that cointervention during the study did not account for the results. Computerised random matching and thereby start time delineation for the unexposed group obviated the potential for unanticipated bias related to start time in the unexposed group.
Overall study results
We analysed the outcomes for myocardial infarction, stroke, coronary revascularisation, and death in six comparative studies (table 2 and fig 2). We examined the aggregate database study results with conventional biostatistical analyses (Cox adjusted hazard ratios or propensity score analyses, or both) and our newly described prior event rate ratio (PERR) adjustment technique.28 29
When analysed with conventional biostatistical analyses, the database outcome results (independent of death) did not differ significantly from those in the randomised controlled trial in nine of the 17 comparisons. In no instance did the PERR analysis differ significantly from the randomised controlled trial, when there was no difference between the conventional analyses and the trial.
As shown in table 2 and figure 2, when the database outcomes analysed with conventional biostatistical techniques differed significantly from the trial, the PERR analysis results were either not significantly different from or much more similar to the trial results.
The instances where the database results analysed by conventional biostatistical methods differed importantly from the results in the trial presumably reflect unmeasured confounding by indication in the database studies. Thus our findings support concerns that the validity of observational studies must always be viewed with circumspection. The studies reported herein, however, suggest that the PERR technique can identify (by differing from the results with standard statistical methods) and largely correct for the effects of unmeasured confounding, when it exists. The availability in the database of previous event rates, rather than only prevalence data, permitted performance of this analysis.
PERR analytical technique
The underlying hypothesis of the PERR analytical technique is that a comparison between the event rate for a specific outcome in a cohort’s exposed and unexposed patients before entry into the study should reflect the effect of all confounders on that specific outcome independent of the effect of treatment. This assumption holds only when neither the exposed nor unexposed patients have been treated with the study drug before the start of the study. If so, the ratio between the previous events in the exposed and unexposed patients should reflect the aggregate effect of all identified and unidentified confounders.
Therefore, when the unadjusted incidence rate ratio or hazard ratio of that outcome during the study is divided by the ratio for that outcome before the study, this adjustment should correct for the aggregate effects of all identified and unmeasured confounders.
When there are no unmeasured confounders, reflected by similar results of the database Cox adjusted hazard ratio and the randomised controlled trial hazard ratio, the PERR adjusted results should be similar to the Cox adjusted hazard ratio. Based on the empirical findings in these studies, the PERR adjustment seemed to function in this fashion.
When there are unmeasured confounders, presumably resulting from confounding by indication, the results of the PERR adjusted hazard ratio and the Cox adjusted hazard ratio should differ. Our empirical results show that in every instance where the comparison of the Cox adjusted hazard ratio in the database study differed from the results of the trial, suggesting the presence of “unidentified confounding,” the PERR adjustment yielded a result much more consistent with the findings in the trial. Of most importance in all but one instance where unmeasured confounding seemed to be present, the PERR adjusted value identified the presence of unmeasured confounding by differing significantly from the Cox adjusted hazard ratio.
Identification of the PERR method emerged from these studies because the direct comparison of the database observational study and the randomised controlled trial provided a presumed correct answer against which to validate the database results. Further investigation is necessary to fully validate the PERR technique. More extensive statistical simulation studies would determine its limitations and applications and the applicability of the method to additional outcomes. It is also important to appreciate that this technique is outcome specific; it cannot be extrapolated from one outcome to another. Finally, it is restricted to outcomes for which previous events can be ascertained. If an outcome was a study exclusion criterion, it cannot be analysed with this approach, nor can it be applied to death.
The PERR method differs from, and seems to be more widely applicable than, other methods that have been developed in an attempt to address hidden bias.42 As confirmed in our studies, propensity score analysis does not overcome unmeasured confounding. When combined with sensitivity analyses, however, it might provide results that can be interpreted as unlikely to have been influenced by unmeasured covariates.43 44 45 Recently, propensity scores combined with regression calibration were used to address unobserved variables under certain conditions.46 47
Instrumental variable analysis, used commonly in economics, has also been used to address unmeasured confounding. An instrumental variable analysis requires identification of a factor that affects the assignment to treatment but has no direct effect on the outcome.48 49 50 Its applicability and validity for studies of therapeutic efficacy have not been widely examined.42 51 52 Some have suggested that this technique is most suited to address health policy issues rather than specific clinical issues of treatment effectiveness.48
Both the propensity score calibration and the instrumental variable analysis methods have important constraints. The propensity score calibration technique requires the presence of a validation study, whereas the instrumental variable analysis requires identification of an appropriate instrument. These requirements limit their applicability to a wide variety of studies.
Of interest, the DID (difference in differences) method used in economic studies has some similarities to the PERR method, in that it compares the before and after change in the treated group with that in a comparison group.53 54 55 The key assumption behind the DID method, similar to PERR, is that the distribution of the unobserved confounding variables in the treated and comparison groups, and the effect of these variables on the outcome, remain the same before and during the study period. The DID method is also used commonly in psychology, where it is called the before and after design with an untreated comparison group.56 57
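The parallel between the two methods can be made concrete: DID subtracts before/after changes on an additive scale, whereas PERR divides the study period ratio by the prior period ratio, which amounts to the same operation on the log scale. The functions and inputs below are illustrative:

```python
def did_additive(pre_treated, post_treated, pre_control, post_control):
    """Classic difference in differences on an additive scale: the
    treated group's before/after change minus the control group's."""
    return (post_treated - pre_treated) - (post_control - pre_control)

def perr_multiplicative(prior_ratio, study_ratio):
    """PERR works analogously on a ratio scale: the study period
    exposed/unexposed ratio divided by the prior period ratio."""
    return study_ratio / prior_ratio
```

Taking logarithms of the PERR expression turns the division into the same subtraction of before/after differences that DID performs on means.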
The death rate was significantly higher in one of our database studies (Syst-Eur) and seemed to be significantly lower in both of the database comparisons with the WHI randomised controlled trial; for the reasons enumerated above, however, these latter results should be interpreted cautiously.
Future perspective and study limitations
Thus it seems from our studies that an electronic medical record database can be an important tool for making evidence based decisions with regard to treatment. To maximise the value of future databases they should be designed with all the advantages enumerated for GPRD and also should overcome its limitations (see box). Ideally future databases should be much larger than GPRD, which includes about eight million patients. On the basis of our work to date, we estimate that 40-50 million patients are needed for the breadth of future studies we can envisage.
Studies using such databases would not replace the need to do randomised controlled trials but could serve as an important tool to supplement the contributions of trials to evidence based medicine. One example among many is to generalise the results of randomised controlled trials. Although we have not comprehensively examined this issue, our studies have shown the feasibility of further generalising the results of the Syst-Eur and WHI randomised controlled trials.25 58 59
As well as the need for further validation of the PERR technique, several other limitations apply to this investigative effort. The PERR technique should be viewed currently as applicable only to analysis of a study using a design similar to ours, which includes similar inclusion and exclusion criteria for the exposed and unexposed and a defined study start, recruitment interval, and end time. Furthermore, the random matching technique might be critical to assure that bias does not exist in the start time for unexposed patients. Application of the PERR technique to other study designs will require its validation under those conditions.
Another potential shortcoming of our studies is the inability to exactly replicate all aspects of the randomised controlled trial independent of randomisation, such as exact dose of study drug, the role of placebos, the possibilities of differences in health care, and other differences between participants entered in randomised controlled trials and those in the general population. In addition, there is also the possibility of inaccuracy of information in the database (for instance, misclassification of outcome, ascertainment bias, etc). The reasonably similar results of the database studies and comparative randomised controlled trials, however, suggest these were not major problems.
Our current view is that the PERR analysis should not be performed in isolation. We would recommend its use along with conventional biostatistical analyses. When the conventional and PERR analyses are similar, “unmeasured confounding” would seem unlikely; whereas when they differ “unmeasured confounding” would seem likely. When unmeasured confounding seems to be present, the PERR analysis seems to yield a more valid result, but additional evaluation is required to ascertain the veracity of this suggestion.
What is already known on this topic
Two major potential problems could impede the capability of an electronic medical record database to provide reliable information concerning drug efficacy: the quality of the data contained within the database and the ability of analyses of observational—that is, non-experimental—data to provide valid results
The quality of evidence from observational studies is less than from randomised controlled trials because of confounding by indication and other biases related to the effects of unmeasured covariates
What this study adds
Although observational studies are subject to unmeasured confounding, a new analytical technique, prior event rate ratio (PERR) adjustment, can identify and reduce unmeasured confounding
Data from properly constructed electronic medical record databases, when analysed with standard statistical methods along with the PERR method, can reveal important insights into the efficacy of medical treatment
Cite this as: BMJ 2009;338:b81
We gratefully acknowledge the assistance of Xingmei Wang, who assisted with the biostatistical analyses, and James Lewis and Stephen Kimmel for their insightful review of this manuscript.
Contributors: RLT and MGW contributed to conception and design; analysis and interpretation of data; drafting and revision of article; and final approval of published version. DX contributed to design, analysis and interpretation of data, drafting and revision of article, and final approval of published version.
Funding: This work was supported by the National Institutes of Health research grant R01-HL073911.
Competing interests: None declared.
Ethical approval: Not required.
Provenance and peer review: Not commissioned; externally peer reviewed.
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.