CCBY Open access

Emulating the GRADE trial using real world data: retrospective comparative effectiveness study

BMJ 2022; 379 doi: (Published 03 October 2022) Cite this as: BMJ 2022;379:e070717
  1. Yihong Deng, principal health services analyst (1, 2)
  2. Eric C Polley, associate professor (3)
  3. Joshua D Wallach, assistant professor (4)
  4. Sanket S Dhruva, assistant professor of medicine (5, 6)
  5. Jeph Herrin, assistant professor (7, 8)
  6. Kenneth Quinto, senior medical advisor for real world evidence analytics (9)
  7. Charu Gandotra, medical officer (10)
  8. William Crown, distinguished research scientist (11)
  9. Peter Noseworthy, professor of medicine (12)
  10. Xiaoxi Yao, associate professor of health services research and medicine (1, 12)
  11. Timothy D Lyon, assistant professor of urology (13)
  12. Nilay D Shah, managing director (14)
  13. Joseph S Ross, professor of medicine and public health (15, 16)
  14. Rozalina G McCoy, associate professor of medicine (1, 17)
  1. Robert D and Patricia E Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, USA
  2. OptumLabs, Eden Prairie, MN, USA
  3. Department of Public Health Sciences, University of Chicago, Chicago, IL, USA
  4. Department of Environmental Health Sciences, Yale School of Public Health, New Haven, CT, USA
  5. Section of Cardiology, San Francisco Veterans Affairs Health Care System, San Francisco, CA, USA
  6. Department of Medicine, UCSF School of Medicine, San Francisco, CA, USA
  7. Section of Cardiovascular Medicine, Yale School of Medicine, New Haven, CT, USA
  8. Flying Buttress Associates, Charlottesville, VA, USA
  9. Office of Medical Policy, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, USA
  10. Office of New Drugs, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, USA
  11. Florence Heller Graduate School, Brandeis University, Waltham, MA, USA
  12. Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
  13. Department of Urology, Mayo Clinic, Jacksonville, FL, USA
  14. Delta Air Lines, Atlanta, GA, USA
  15. Department of Internal Medicine, Yale School of Medicine, New Haven, CT, USA
  16. Department of Health Policy and Management, Yale School of Public Health, New Haven, CT, USA
  17. Division of Community Internal Medicine, Geriatrics, and Palliative Care, Department of Medicine, Mayo Clinic, Rochester, MN 55905, USA
  1. Correspondence to: R G McCoy mccoy.rozalina{at} (or @RozalinaMD on Twitter)
  • Accepted 30 August 2022


Objective To emulate the GRADE (Glycemia Reduction Approaches in Diabetes: A Comparative Effectiveness Study) trial using real world data before its publication. GRADE directly compared second line glucose lowering drugs for their ability to lower glycated hemoglobin A1c (HbA1c).

Design Observational study.

Setting OptumLabs® Data Warehouse (OLDW), a nationwide claims database in the US, 25 January 2010 to 30 June 2019.

Participants Adults with type 2 diabetes and HbA1c 6.8-8.5% while using metformin monotherapy, identified according to the GRADE trial specifications, who also used glimepiride, liraglutide, sitagliptin, or insulin glargine.

Main outcome measures The primary outcome was time to HbA1c ≥7.0%. Secondary outcomes were time to HbA1c >7.5%, incident microvascular complications, incident macrovascular complications, adverse events, all cause hospital admissions, and all cause mortality. Propensity scores were estimated using the gradient boosting machine method, and inverse propensity score weighting was used to emulate randomization of the treatment groups, which were then compared using Cox proportional hazards regression.

Results 8252 people were identified (19.7% of adults starting the study drugs in OLDW) who met eligibility criteria for the GRADE trial (glimepiride arm=4318, liraglutide arm=690, sitagliptin arm=2993, glargine arm=251). The glargine arm was excluded from analyses owing to small sample size. Median times to HbA1c ≥7.0% were 442 days (95% confidence interval 394 to 480 days) for glimepiride, 764 (741 to not calculable) days for liraglutide, and 427 (380 to 483) days for sitagliptin. Liraglutide was associated with lower risk of reaching HbA1c ≥7.0% compared with glimepiride (hazard ratio 0.57, 95% confidence interval 0.43 to 0.75) and sitagliptin (0.55, 0.41 to 0.73). Results were consistent for the secondary outcome of time to HbA1c >7.5%. No significant differences were observed among treatment groups for the remaining secondary outcomes.

Conclusions In this emulation of the GRADE trial, liraglutide was statistically significantly more effective at maintaining glycemic control than glimepiride or sitagliptin when added to metformin monotherapy. Generating timely evidence on medical treatments using real world data as a complement to prospective trials is of value.


Type 2 diabetes is a common serious chronic health condition, impacting 11.3% (37.3 million) of the US population1 and 9.3% (463 million) of people worldwide.2 Moderate glycemic control, defined by achieving glycated hemoglobin (HbA1c) between 7% and 8%, improves microvascular and macrovascular outcomes.34 Current clinical practice guidelines recommend targeting HbA1c <7% for most non-pregnant adults.5 Timely and appropriate treatment intensification is fundamental to maintaining glycemic control6 and preventing complications.78910 Metformin is the preferred glucose lowering drug owing to its efficacy, tolerability, and low cost.11121314 Type 2 diabetes is, however, a progressive disease, and most patients ultimately require intensification of treatment. Recent US population level estimates suggest that nearly one third of people with HbA1c ≥7% are treated with only one glucose lowering drug15 and as such would benefit from treatment intensification. Clinical practice guidelines advise that choice of second line treatment should be informed by clinical and situational considerations specific to each individual, recognizing the knowledge gaps stemming from the lack of direct comparisons of currently available second line drugs.11121314

The GRADE (Glycemia Reduction Approaches in Diabetes: A Comparative Effectiveness Study) trial is a recently completed, but still unpublished, pragmatic, randomized, parallel arm clinical trial that seeks to address this knowledge gap by comparing four second line glucose lowering drugs among adults with moderately uncontrolled type 2 diabetes who are in receipt of metformin monotherapy.1617 The drugs represent four classes: glimepiride (sulfonylurea), sitagliptin (dipeptidyl-peptidase 4 inhibitor), liraglutide (glucagon-like peptide-1 receptor agonist), and insulin glargine (basal analog insulin). The GRADE trial was designed (2008) and launched (July 2013) before US Food and Drug Administration approval of sodium-glucose cotransporter-2 inhibitors and several cardiovascular outcomes trials that showed reduction in atherosclerotic cardiovascular and kidney disease outcomes with use of glucagon-like peptide-1 receptor agonists, and in heart failure and kidney disease outcomes with use of sodium-glucose cotransporter-2 inhibitors. This highlights a key limitation of large prospective randomized controlled trials: such trials are time consuming to conduct, potentially hindering the ability to answer questions in a clinically meaningful time frame. Thus, it is of value to efficiently generate timely evidence on medical treatments using observational research methods applied to real world data as a complement to prospective trials.

Advances in the quantity, quality, and granularity of real world data, combined with improvements in statistical methods used to account for confounding, treatment allocation bias, and time related bias, have provided opportunities to use large scale real world data to inform the understanding of drug effectiveness and safety. Ideally, studies using real world data would be conducted before the publication of the results from randomized controlled trials, thereby minimizing potential biases that could be introduced by trying to replicate known results from such trials. As an illustrative test case of the opportunities and limitations of using observational research methods to emulate randomized controlled trials, and building on parallel analyses emulating the PRONOUNCE (A Trial Comparing Cardiovascular Safety of Degarelix Versus Leuprolide in Patients With Advanced Prostate Cancer and Cardiovascular Disease) trial,18 we used claims and laboratory results data from OptumLabs® Data Warehouse (OLDW), a deidentified national dataset of privately insured and Medicare Advantage beneficiaries, to emulate the GRADE trial. We used published1619 and publicly available17 information on the GRADE trial’s study design to emulate the methods and anticipated results as closely as possible, with the goal of directly comparing the effectiveness of glimepiride, sitagliptin, liraglutide, and insulin glargine in achieving and maintaining HbA1c <7.0% among adults with type 2 diabetes and HbA1c 6.8-8.5% while in receipt of metformin monotherapy. We also examined the secondary metabolic, microvascular, macrovascular, and safety endpoints planned in the GRADE trial as feasible using the data available within OLDW. Our study therefore had two complementary objectives. 
First, a clinical objective, to examine four second line glucose lowering drugs for lowering or maintaining, or both, HbA1c <7.0%, filling an important clinical knowledge gap in the comparative effectiveness of these commonly used and guideline recommended drug classes. Second, a methodologic objective, to ascertain whether routinely available claims data can be used to emulate a prospective randomized clinical trial ahead of its publication, filling important methodologic and regulatory policy needs in the use of real world data to predict clinical trial results.


Study design

We retrospectively analyzed medical and pharmacy claims data from OLDW, a deidentified claims dataset that includes healthcare utilization information for beneficiaries of private health plans (adults of working age and their dependents) and Medicare Advantage plans. The latter are Medicare approved plans offered by private companies to beneficiaries who are eligible for Medicare (eg, adults aged ≥65 years, individuals with disability, people with end stage kidney disease) as a private alternative to Original Medicare. Just as with private insurance, Medicare Advantage plans typically bundle medical and pharmacy coverage. OLDW contains longitudinal health information on enrollees in these health plans, representing a diverse mixture of ages, ethnicities, and geographic regions across the US.2021 The study is reported according to the Reporting of studies Conducted using Observational Routinely-collected Data (RECORD) reporting guideline.22

Study population

We first assembled a cohort of adults (≥18 years) who initially started glimepiride, sitagliptin, liraglutide, or insulin glargine between 25 January 2010 (date of liraglutide approval by the FDA; remaining study drugs were approved earlier) and 30 June 2019 (see supplemental figure S1). The index date was set to the date of the first claim for the study drug. People who started ≥2 study drugs on the index date were excluded. Individuals were required to be adherent to metformin for ≥8 weeks before that first study drug fill date. This was established by identifying all metformin fills before the index date, establishing continuous treatment episodes based on prescription fill dates and the days’ supply for each fill (allowing up to a 30 day gap between fills), and requiring that the last metformin treatment episode before the index date be at least eight weeks long. To ensure consistent and adequate capture of baseline comorbidities and treatment data, people were required to have six months of continuous enrollment with medical and pharmacy coverage before the index date. We excluded those with fills for any glucose lowering drugs other than metformin during the baseline period and those with type 1 diabetes, defined using ICD-9 and ICD-10 (international classification of diseases, ninth and 10th revisions, respectively) codes. Individuals were further required to have valid personal (age, sex, region) data and HbA1c results both within three months before the index date (baseline HbA1c) and during follow-up. Next, we adapted the eligibility criteria for the GRADE trial161719 and applied these to beneficiaries included in OLDW, as detailed in supplemental table S1. Supplemental tables S2 and S3 summarize the relevant diagnosis codes and drugs. All eligible individuals in OLDW were included in the cohort.
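The episode construction described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's SAS or R code: the function names, the tuple layout of a fill, and the choice to truncate the last episode at the index date are assumptions, and stockpiled supply from early refills is not carried forward.

```python
from datetime import date, timedelta


def build_episodes(fills, max_gap_days=30):
    """Merge pharmacy fills into continuous treatment episodes.

    fills: iterable of (fill_date, days_supply) tuples (hypothetical layout).
    Returns a list of (start, supply_end) date pairs.
    """
    episodes = []
    for fill_date, days_supply in sorted(fills):
        run_out = fill_date + timedelta(days=days_supply)
        if episodes and (fill_date - episodes[-1][1]).days <= max_gap_days:
            # Gap since the previous supply ran out is within the allowance:
            # extend the current episode.
            start, end = episodes[-1]
            episodes[-1] = (start, max(end, run_out))
        else:
            # Gap too long (or first fill): start a new episode.
            episodes.append((fill_date, run_out))
    return episodes


def adherent_before_index(fills, index_date, min_weeks=8, max_gap_days=30):
    """True if the last episode before the index date lasted >= min_weeks.

    Assumption: episode length is measured from its start to the earlier of
    its supply end and the index date.
    """
    prior = [f for f in fills if f[0] < index_date]
    episodes = build_episodes(prior, max_gap_days)
    if not episodes:
        return False
    start, end = episodes[-1]
    return (min(end, index_date) - start).days >= min_weeks * 7


# Hypothetical metformin fills for one person.
fills = [(date(2015, 1, 1), 30), (date(2015, 2, 5), 30), (date(2015, 6, 1), 30)]
episodes = build_episodes(fills)
```

In this made-up example the first two fills merge into one episode because the gap between them is five days, whereas the third fill starts a new episode after a gap of more than 30 days.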


The primary outcome was time to primary metabolic failure, calculated as days to HbA1c ≥7.0% while treated with the assigned drug, with the period of eligibility starting at month 3 after the index date (analogous to the first quarterly HbA1c assessment in the GRADE trial). Unlike the GRADE trial protocol, we did not require a confirmatory HbA1c owing to variation in real world HbA1c testing intervals. To assess for potential bias in outcome ascertainment as the result of different frequencies of HbA1c testing and varying intervals between tests among the treatment groups, we compared the number, frequency, and timing of available HbA1c test results and found no difference between the groups (see supplemental table S4). Because testing frequency is guided by baseline HbA1c, we also examined intervals between sequential HbA1c tests stratified by baseline HbA1c and found no differences between the treatment groups (see supplemental table S5).
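As a rough illustration of this outcome definition, the sketch below finds the first qualifying HbA1c result for one person. It is not the study code: the function name and arguments are hypothetical, "month 3" is approximated as 90 days, and treatment end is passed in as a precomputed date.

```python
from datetime import date


def days_to_metabolic_failure(index_date, hba1c_results, treatment_end,
                              threshold=7.0, strict=False,
                              eligible_after_days=90):
    """Days from index to the first HbA1c at or above the threshold.

    hba1c_results: iterable of (test_date, value_percent) tuples.
    strict=True uses '>' instead of '>=' (eg, for HbA1c >7.5%).
    Returns None if no qualifying result is seen while on treatment.
    """
    for test_date, value in sorted(hba1c_results):
        days = (test_date - index_date).days
        if days < eligible_after_days:
            continue  # before the period of eligibility (assumed 90 days)
        if test_date > treatment_end:
            break  # no longer treated with the assigned drug
        failed = value > threshold if strict else value >= threshold
        if failed:
            return days  # no confirmatory HbA1c is required
    return None


# Hypothetical results for one person.
index = date(2016, 1, 1)
results = [(date(2016, 2, 1), 7.4), (date(2016, 4, 15), 6.9), (date(2016, 7, 10), 7.2)]
primary = days_to_metabolic_failure(index, results, treatment_end=date(2016, 12, 31))
```

Here the February result is ignored because it falls before the eligibility period, and the July result is the first to meet the primary threshold.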

Secondary metabolic, cardiovascular, and microvascular outcomes were analyzed as specified in the GRADE trial’s statistical analysis plan17 if they were feasible to ascertain using claims data (see supplemental table S6). Individuals were followed until they experienced the outcome of interest, reached the anticipated follow-up duration of the trial (seven years), reached the end of the study period (31 July 2019), lost insurance coverage, or died. For outcomes assessed while individuals were being treated with the assigned regimen, follow-up continued until the assigned drug was discontinued (defined as not refilling the drug within 30 days after the end of the last treatment episode), with the goal of emulating the definitions of these outcomes in the GRADE trial (ie, while being treated with the originally assigned drugs).16

Independent variables

Individual level age, sex, race or ethnicity, and annual household income were identified from OLDW enrollment files at the time of the index date. A detailed description of the source data for these variables is available in the supplemental methods. Comorbidities (ascertained from all claims during the six months preceding the index date) included retinopathy, nephropathy, neuropathy, coronary artery disease, cerebrovascular disease, peripheral vascular disease, heart failure, and previous severe hypoglycemia and hyperglycemia, as detailed in supplemental table S2. Specialties of treating physicians were categorized as primary care, endocrinology, cardiology, nephrology, other, and unknown. Baseline drugs, included as surrogates for burden of complications, were identified from fills in the six months preceding the index date (see supplemental table S3).

Statistical analysis

Inverse probability of treatment weighting was used to balance the differences in baseline characteristics among the treatment groups. Propensity scores were used as probability of treatment; these propensity score weights were estimated using generalized boosted models including the baseline variables presented in table 1. Using generalized boosted models involves an iterative process with multiple regression trees to capture complex and non-linear relations between treatment assignments and the pretreatment covariates, with the propensity score model leading to the best balance among the treatment groups.23 The supplemental methods provide additional detail on the models. We calculated stabilized weights with multiple treatments by dividing the marginal probability of treatment by the propensity score of treatment received.24 Supplemental figure S2 shows the distribution of weights. Standardized mean differences were used to assess the balance of covariates after weighting; a standardized mean difference ≤0.1 was considered a good balance and ≤0.2 was considered acceptable.25 Before evaluation of the outcomes, we examined the weighted sample sizes and ability to account for baseline confounding to determine the feasibility of including each treatment group.
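The weighting and balance checks described above can be illustrated with a short Python sketch. It assumes the propensity scores have already been estimated (the generalized boosted model itself is not reproduced here), and the function names and data layout are hypothetical. The standardized mean difference shown is one common formulation using weighted means and weighted variances; the study's exact formula may differ.

```python
from collections import Counter
from math import sqrt


def stabilized_weights(treatments, propensity):
    """Stabilized weights for multiple treatments.

    treatments: list of arm labels, one per person.
    propensity: list of dicts mapping arm -> estimated P(arm | covariates).
    Weight = marginal probability of the arm received / propensity score
    of the arm received.
    """
    n = len(treatments)
    marginal = {arm: count / n for arm, count in Counter(treatments).items()}
    return [marginal[t] / ps[t] for t, ps in zip(treatments, propensity)]


def _weighted_mean_var(x, w):
    total = sum(w)
    mean = sum(wi * xi for xi, wi in zip(x, w)) / total
    var = sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w)) / total
    return mean, var


def standardized_mean_difference(x_a, w_a, x_b, w_b):
    """Absolute weighted standardized mean difference between two arms."""
    mean_a, var_a = _weighted_mean_var(x_a, w_a)
    mean_b, var_b = _weighted_mean_var(x_b, w_b)
    pooled_sd = sqrt((var_a + var_b) / 2)
    if pooled_sd == 0:
        # Degenerate case: no variation in either arm.
        return 0.0 if mean_a == mean_b else float("inf")
    return abs(mean_a - mean_b) / pooled_sd


# Hypothetical four-person example with made-up propensity scores.
arms = ["glimepiride", "glimepiride", "sitagliptin", "liraglutide"]
scores = [{"glimepiride": 0.5}, {"glimepiride": 0.25},
          {"sitagliptin": 0.25}, {"liraglutide": 0.5}]
weights = stabilized_weights(arms, scores)
```

Under the thresholds stated above, a covariate with a standardized mean difference ≤0.1 after weighting would count as well balanced, and one ≤0.2 as acceptable.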

Table 1

Baseline characteristics in weighted cohort. Values are numbers (percentages) unless stated otherwise


The cumulative incidence of the primary (time to first HbA1c ≥7.0%) and secondary (time to first HbA1c >7.5%) metabolic failures within each treatment arm was estimated with the inverse probability of treatment weighting Kaplan-Meier method. We used propensity score weighted Cox proportional hazards regression models adjusted for baseline HbA1c values to compare the outcomes between treatment groups. As the primary outcome can only be observed from the third month, we set the start of the at risk time for the proportional hazards model to three months after the index date. Results are presented as median times to metabolic failure and expected proportions of people experiencing metabolic failure at one and two years. All pairwise comparisons between the treatment groups were estimated, and we applied the Holm method to adjust the P values for multiple testing. We tested the proportional hazards assumption using Schoenfeld residuals. Similar analyses were performed for other time-to-event outcomes. The at risk start time for modeling secondary metabolic, cardiovascular, and microvascular disease outcomes was set at the study index date. Repeated measures HbA1c trends by treatment group were estimated by using the inverse probability of treatment weighting mean HbA1c results by treatment group in three month time intervals. The follow-up time by treatment arm was estimated using the same propensity score weights as the primary analysis and the inverse probability of treatment weighting Kaplan-Meier method for the censoring distribution.26
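Two of the steps in this paragraph, the weighted Kaplan-Meier estimate and the Holm adjustment, are simple enough to sketch directly. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: function names are hypothetical, the inputs are made up, and the three month delayed at risk start and the Cox models are not reproduced.

```python
def weighted_km(times, events, weights):
    """Weighted Kaplan-Meier survival estimate.

    times: follow-up time per person; events: 1 if failure, 0 if censored;
    weights: inverse probability of treatment weight per person.
    Returns [(time, survival)] at each distinct event time; cumulative
    incidence is 1 - survival.
    """
    data = sorted(zip(times, events, weights))
    at_risk = float(sum(weights))  # weighted size of the risk set
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failed = 0.0   # weighted number of events at time t
        leaving = 0.0  # weight leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            _, event, weight = data[i]
            if event:
                failed += weight
            leaving += weight
            i += 1
        if failed > 0:
            survival *= 1.0 - failed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def holm_adjust(p_values):
    """Holm step-down adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, k in enumerate(order):
        # Smallest P value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[k]))
        adjusted[k] = running_max
    return adjusted


# Made-up data: four people with unit weights, and three pairwise P values.
curve = weighted_km([1, 2, 2, 3], [1, 1, 0, 1], [1.0, 1.0, 1.0, 1.0])
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

With three pairwise comparisons, as in the three arm analysis, the smallest P value is multiplied by three, the second smallest by two, and the largest by one, subject to the monotonicity constraint.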

All primary analyses were conducted using the per protocol censoring approach for the primary outcome and for the secondary outcomes of secondary metabolic failure and insulin initiation, censoring at the time of treatment drug discontinuation, disenrollment from the health plan, end of study period, or death, whichever came first (see supplemental figure S3). Time receiving treatment for each drug was determined by calculating continuous coverage episodes based on available fills, in the same way as for baseline metformin treatment. Remaining secondary outcomes were analyzed using the intention-to-treat censoring approach, censoring the participant at the time of health plan disenrollment, end of study, or death, whichever came first. P<0.05 was considered statistically significant for all two sided tests. All analyses were performed using SAS 9.4 (SAS Institute, Cary, NC) and R version 4.0.2 (R Foundation).

Subgroup analyses

A priori defined subgroup analyses were performed as a function of baseline HbA1c (<7.0% v ≥7.0%), age group (<65 years, ≥65 years), sex (men v women), and race or ethnicity (white, black, Hispanic, Asian).

Sensitivity analyses

First, to examine the comparative effectiveness of study drugs while treated only with them and not with any other drug, accounting for real world treatment practices, we repeated all analyses using the as treated censoring approach, censoring at the time a new drug class was added, the assigned drug was discontinued, health plan disenrollment, end of study, or death, whichever came first (see supplemental figure S3). Second, we assessed residual confounding by testing a falsification endpoint that was unlikely to be associated with the studied drugs: diagnosis of pneumonia (see supplemental table S2) during the follow-up period.
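The three censoring approaches used in this study (per protocol, intention-to-treat, and as treated) differ only in which dates compete to end follow-up. The Python sketch below makes that explicit. It is illustrative only: the function and argument names are hypothetical, and each date is assumed to have been derived beforehand, with None marking an event that did not occur.

```python
from datetime import date


def censoring_date(approach, *, discontinuation, new_class_added,
                   disenrollment, study_end, death):
    """Earliest censoring date under the chosen approach.

    approach: 'per_protocol', 'intention_to_treat', or 'as_treated'.
    """
    if approach not in ("per_protocol", "intention_to_treat", "as_treated"):
        raise ValueError(f"unknown censoring approach: {approach}")
    # All approaches censor at disenrollment, end of study, or death.
    candidates = [disenrollment, study_end, death]
    if approach in ("per_protocol", "as_treated"):
        # Also censor when the assigned drug is discontinued.
        candidates.append(discontinuation)
    if approach == "as_treated":
        # Additionally censor when a new drug class is added.
        candidates.append(new_class_added)
    return min(d for d in candidates if d is not None)


# Hypothetical person: added a drug class, later stopped the assigned drug.
person = dict(
    discontinuation=date(2017, 5, 1),
    new_class_added=date(2017, 3, 1),
    disenrollment=date(2018, 1, 1),
    study_end=date(2019, 7, 31),
    death=None,
)
per_protocol = censoring_date("per_protocol", **person)
as_treated = censoring_date("as_treated", **person)
intention_to_treat = censoring_date("intention_to_treat", **person)
```

For this made-up person, follow-up ends earliest under the as treated approach and latest under intention-to-treat, which is the ordering expected from the definitions.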

Patient and public involvement

Patients were not involved in the design, conduct, or dissemination of this study. However, this study was informed by the need to identify preferred glucose lowering treatment strategies in the absence of direct comparisons across the examined drugs; and to examine whether and how data collected in the process of routine patient care can be used to emulate prospective clinical trials. Because this study seeks to inform drug regulatory policy and procedures, investigators from the FDA contributed to the design of the study and interpretation of study findings; they are included as coauthors on this publication.


Study population

We identified 18 365 adults with type 2 diabetes who started glimepiride, 12 818 who started sitagliptin, 5021 who started liraglutide, and 5659 who started insulin glargine and had the required baseline enrollment and available HbA1c results (see supplemental figure S1). Eligibility criteria of the GRADE trial were met by 19.7% (8252 of 41 863) of these individuals, ranging from 4.4% (251 of 5659) using glargine to 23.5% (4318 of 18 365) using glimepiride. The most prevalent reasons for ineligibility (see supplemental table S7) were HbA1c outside the prespecified range (ranging from 51.7% (6631 of 12 818) of individuals using sitagliptin to 81.1% (4591 of 5659) using glargine) and not being treated with metformin monotherapy at the time of study drug initiation (ranging from 43.5% (7997 of 18 365) of individuals using glimepiride to 68.5% (3874 of 5659) using glargine). The final cohort comprised 4318 individuals in the glimepiride arm, 2993 in the sitagliptin arm, 690 in the liraglutide arm, and 251 in the glargine arm (see supplemental table S8 for all included drugs).

Supplemental table S9 shows baseline characteristics of the included individuals before weighting. Across the four treatment groups, there were significant differences (largest standardized mean difference >0.2) in age, race or ethnicity, annual household income, and prescribing physician specialty. Individuals in the liraglutide arm were more likely to be younger, white, on a higher income, and treated by an endocrinologist than those in the other treatment arms. Individuals in the glargine arm were most likely to be on a low income and they had the highest prevalence of all examined comorbidities.

The glargine arm was excluded from all analyses because of small sample size (n=251, weighted n=179) and inability to achieve good control of confounders after weighting. The propensity score model was estimated on the remaining three groups. After weighting, mean participant ages were 62.0 years (standard deviation (SD) 11.1 years) in the glimepiride arm, 62.0 (SD 11.0) years in the sitagliptin arm, and 60.5 (SD 10.4) years in the liraglutide arm (table 1). Women comprised 48.2% (2009 of 4168) of the glimepiride arm, 49.1% (1374 of 2800) of the sitagliptin arm, and 50.5% (289 of 572) of the liraglutide arm. White people comprised 64.7% (2695 of 4168), 64.2% (1798 of 2800), and 65.8% (376 of 572) of the treatment arms, respectively. Mean baseline HbA1c levels were 7.63% (SD 0.48%), 7.61% (SD 0.47%), and 7.60% (SD 0.48%), respectively. Supplemental table S10 presents the pairwise standardized mean differences for all baseline covariates; all values were <0.2.

Primary metabolic failure (HbA1c ≥7.0%)

Median follow-up until per protocol censoring was 238 days (95% confidence interval 226 to 255 days) in the glimepiride arm, 124 (100 to 150) days in the liraglutide arm, and 186 (179 to 201) days in the sitagliptin arm (see supplemental figure S4). Mean HbA1c decreased most in the liraglutide arm and least in the sitagliptin arm, with differences most pronounced between months 3 and 6 of treatment (fig 1). The median times to primary metabolic failure were 442 days (95% confidence interval 394 to 480 days) in the glimepiride arm, 764 (741 to not calculable) days in the liraglutide arm, and 427 (380 to 483) days in the sitagliptin arm (fig 2). Liraglutide was associated with lower risk of primary metabolic failure compared with glimepiride (hazard ratio 0.57, 95% confidence interval 0.43 to 0.75) and sitagliptin (0.55, 0.41 to 0.73); table 2. No significant difference was observed between sitagliptin and glimepiride (1.03, 0.94 to 1.13). By one year, the estimated cumulative incidence rates of primary metabolic failure were 0.28 (95% confidence interval 0.19 to 0.36) in the liraglutide arm, 0.44 (0.42 to 0.46) in the glimepiride arm, and 0.46 (0.43 to 0.48) in the sitagliptin arm (table 3). These trends in cumulative incidence rates of primary metabolic failure persisted at two years.

Fig 1

Mean hemoglobin A1c (HbA1c) levels over time. Results are based on observed receipt of treatment trajectories, with no imputation of missing HbA1c levels

Fig 2

Cumulative incidence rates of primary metabolic failure in propensity score weighted individuals included in the study

Table 2

Hazard ratios for primary and secondary metabolic outcomes

Table 3

Cumulative incidence rates of primary and secondary metabolic failure by treatment arm


Secondary metabolic failure (HbA1c >7.5%)

Time to secondary metabolic failure was longest in the liraglutide arm (see supplemental figure S5). Liraglutide was associated with lower risk of secondary metabolic failure compared with glimepiride (0.61, 0.43 to 0.87) and sitagliptin (0.59, 0.41 to 0.85); table 2. By one year, the estimated cumulative incidence rates of secondary metabolic failure were 0.11 (95% confidence interval 0.06 to 0.17) in the liraglutide arm, 0.20 (0.19 to 0.22) in the glimepiride arm, and 0.22 (0.19 to 0.24) in the sitagliptin arm (table 3). The difference in event rates persisted at two years.

Other secondary outcomes

Insulin was started by 84 of 4168 (2.0%) people in the glimepiride arm, 11 of 572 (1.9%) in the liraglutide arm, and 50 of 2800 (1.8%) in the sitagliptin arm, with no significant difference among the three groups (hazard ratios for pairwise comparisons are shown in table 2). Overall, 37 patients experienced emergency department visits or hospital admissions for hypoglycemia during the study period, including <11 in the liraglutide and sitagliptin arms, precluding statistical analyses.

Heart failure, end stage kidney disease, pancreatitis, pancreatic cancer, thyroid cancer, and all cause mortality could not be analyzed owing to too few (<11) events in all treatment groups (supplemental table S11 presents the event rates). No statistically significant differences were observed between groups for major adverse cardiovascular events, retinopathy, neuropathy, other cardiovascular events, cancer, and all cause admissions to hospital (see supplemental table S12).

Subgroup analyses

Liraglutide was associated with lower risk of primary metabolic failure compared with glimepiride (hazard ratio 0.59, 95% confidence interval 0.44 to 0.78) and sitagliptin (0.58, 0.43 to 0.79) among patients with baseline HbA1c ≥7.0%. No significant differences were observed among the treatment groups in individuals with baseline HbA1c <7.0% (see supplemental table S13). Liraglutide was associated with lower risk of primary metabolic failure compared with glimepiride (0.54, 0.42 to 0.71) and sitagliptin (0.58, 0.44 to 0.77) among those aged <65 years. No significant differences were observed among groups in people aged ≥65 years. Liraglutide was also associated with lower risks of primary metabolic failure than glimepiride and sitagliptin in women, but not in men, and in white and Hispanic individuals, but not in black or Asian individuals. Findings were similar for secondary metabolic failure (see supplemental table S14).

Sensitivity analyses

Another glucose lowering drug was added before discontinuation of the assigned treatment in 423 of 4168 (10%) people in the glimepiride arm, 237 of 572 (41%) in the liraglutide arm, and 419 of 2800 (15%) in the sitagliptin arm. Sensitivity analyses using the as treated censoring approach were consistent with the primary analyses (see supplemental figure S6 and table S15). No significant differences were observed among the treatment groups for the pneumonia falsification endpoint (see supplemental table S16).


Principal findings

In our emulation of the GRADE trial using real world data from an administrative claims database we found that liraglutide was statistically significantly more effective at maintaining glycemic control, defined by time to HbA1c ≥7.0% (primary metabolic failure) and HbA1c >7.5% (secondary metabolic failure) than either glimepiride or sitagliptin. These differences are clinically meaningful, with over 40% more patients in control of their HbA1c when treated with liraglutide than when treated with glimepiride or sitagliptin. We were unable to include insulin glargine in the comparisons because of the small number of individuals treated with this drug who met the GRADE trial eligibility criteria. This was not surprising as treatment with basal insulin in the clinical context examined by the GRADE trial is outside the standard of care and mainstream practice. Additionally, the analytic framework implemented in this work shows that real world data may be an important complement to prospective trials, allowing for efficient and timely examination of pressing clinical questions and inquiries of comparative effectiveness and safety.

Our efforts to emulate all specifications of the GRADE trial were hindered because study conditions are not adequately represented in real world practice as they are not supported by clinical practice guidelines. Although all four study drugs were frequently used by the OLDW population, 80% of adults starting these drugs had to be excluded because they did not meet the prespecified eligibility criteria for the GRADE trial. Nevertheless, this proportion of included participants is still higher than the 9.1% generalizability estimated by the GRADE trial team compared with the overall US population with diabetes.19 Most of the people (58.6% overall) were excluded because they did not meet the baseline HbA1c level requirements, including 81.1% of people who started glargine, 71.8% who started liraglutide, 52.8% who started glimepiride, and 51.7% who started sitagliptin. According to current guidelines, the target HbA1c for most non-pregnant adults is 7.0%, such that treatment intensification would not be warranted for some people. Initiation of insulin, in particular, is advised when HbA1c is >9-10%,1427 so starting glargine as a second line drug at HbA1c levels <8.5% would not be consistent with the standard of care1427 or contemporary practice.282930 The fact that most people treated with the studied drugs in clinical practice are not represented in the study population raises concerns about the utility and generalizability of the GRADE trial’s findings and its impact on diabetes management, underscoring the important complementary insights that can be gleaned from analyses of real world data (which can be designed to use more pragmatic and generalizable eligibility criteria) as adjuncts to randomized controlled trials.

Comparison with other studies

We met our objective to conduct all analyses before publication of the GRADE trial findings, and it will be important to ultimately compare our findings with those of the GRADE trial. The greater effectiveness of liraglutide compared with both glimepiride and sitagliptin is consistent with previous studies.29313233 Additionally, subgroup analyses showing greater effectiveness of liraglutide among people with raised baseline HbA1c and in younger patients generated important hypotheses about the optimal use of liraglutide (and potentially other glucagon-like peptide-1 receptor agonists) in clinical practice to be explored in future research. When the GRADE trial was conceived, drugs’ ability to lower HbA1c was at the forefront of clinical decision making when choosing glucose lowering treatment. Similarly, the sodium-glucose cotransporter-2 inhibitors class of glucose lowering drugs had not yet been incorporated into practice and therefore was excluded as a comparator treatment when the GRADE trial was conceived and designed.

Strengths and limitations of this study

Our study is strengthened by application of advanced analytic methods that account for measured differences between treatment arms that otherwise confound analyses and preclude causal inference. The generalized boosted models used to estimate the propensity score are more flexible and less sensitive to model misspecification than logistic regression. The large and diverse population within OLDW made emulation efforts uniquely possible despite the narrow eligibility criteria specified by the GRADE trial.
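How propensity score weighting balances a measured covariate between treatment arms can be illustrated with a minimal sketch. Everything in it is hypothetical: the toy cohort, the binary covariate, and the cell counts are invented for illustration, and the propensity scores are simple stratum frequencies standing in for a boosted model. Inverse probability weighting is assumed here as one common way to apply a propensity score; it is not taken from the text above.

```python
# Hypothetical toy cohort: (high baseline HbA1c flag, treatment arm, number of people).
CELLS = [
    (0, "liraglutide", 80),
    (0, "glimepiride", 20),
    (1, "liraglutide", 40),
    (1, "glimepiride", 60),
]
ARMS = ("liraglutide", "glimepiride")


def expand(cells):
    """Turn cell counts into one (covariate, arm) row per person."""
    return [(x, arm) for x, arm, n in cells for _ in range(n)]


def stratum_propensity(rows):
    """Estimate P(arm | covariate) as the observed share of each arm per stratum.

    Assumption: this frequency estimate stands in for a fitted propensity model.
    """
    counts = {}
    for x, arm in rows:
        counts.setdefault(x, {}).setdefault(arm, 0)
        counts[x][arm] += 1
    return {
        x: {arm: n / sum(arms.values()) for arm, n in arms.items()}
        for x, arms in counts.items()
    }


def covariate_mean(rows, arm, propensity=None):
    """Mean of the covariate in one arm, optionally inverse probability weighted."""
    numerator = 0.0
    denominator = 0.0
    for x, a in rows:
        if a != arm:
            continue
        weight = 1.0 if propensity is None else 1.0 / propensity[x][a]
        numerator += weight * x
        denominator += weight
    return numerator / denominator


rows = expand(CELLS)
ps = stratum_propensity(rows)
crude = {arm: covariate_mean(rows, arm) for arm in ARMS}
weighted = {arm: covariate_mean(rows, arm, ps) for arm in ARMS}
```

In this toy cohort the crude share of people with high baseline HbA1c differs between arms (one third versus three quarters), while the weighted share is one half in both arms. Weighting can only balance covariates that are measured and included in the propensity model.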

Despite rigorous causal inference analytic methods, observational studies are inevitably subject to residual confounding. For the metabolic endpoints, there was evidence of non-proportional hazards, which makes the single summary hazard ratio calculated from the Cox proportional hazards model an imperfect estimate for the time varying risk. However, with the goal of emulating the GRADE trial, where the statistical analysis plan was to estimate single summary hazard ratios, we report the same estimate in the emulation. We were also unable to operationalize every component of the GRADE trial’s eligibility criteria and endpoints. For example, we did not require confirmatory HbA1c results to meet the metabolic endpoints and were not able to maintain the same standard timeframe for HbA1c ascertainment as specified in the GRADE trial. Additionally, while the GRADE trial analyses were conducted using the intention-to-treat principle, we a priori chose to use per protocol analysis for the metabolic endpoints because in the absence of randomization, reasons for changing a treatment typically depend on post-initiation factors that could confound the association between the treatment group and the outcome. While advanced statistical methods can account for post-baseline differences between groups in key characteristics, these methods require accurate estimation of the reasons to stop or change treatment, and such estimation is not feasible in this setting using claims data. Duration of follow-up was also different among the treatment arms, which is unavoidable when studying real world practice patterns. In particular, a higher proportion of individuals initiating liraglutide filled only one cycle of treatment before either switching to a different treatment or not refilling their prescription, potentially because of poor tolerability, the need to be administered subcutaneously, or high cost.
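The point about non-proportional hazards can be made concrete with a small worked example. The numbers below are entirely hypothetical and are not study results; crude incidence rate ratios within two follow-up periods are used as a simple stand-in for period-specific hazard ratios, and the pooled rate ratio as a stand-in for a single summary estimate. A Cox model would not give exactly the pooled value, so this is a sketch of the idea only.

```python
def incidence_rate(events, person_years):
    """Events per person-year of follow-up."""
    return events / person_years


def rate_ratio(arm_a, arm_b):
    """Ratio of incidence rates; each arm is an (events, person_years) pair."""
    return incidence_rate(*arm_a) / incidence_rate(*arm_b)


# Hypothetical follow-up periods for two arms, "a" and "b".
PERIODS = {
    "early": {"a": (30, 1000.0), "b": (15, 1000.0)},
    "late": {"a": (10, 800.0), "b": (20, 800.0)},
}


def pooled(arm):
    """Total events and person-years for one arm across all periods."""
    events = sum(period[arm][0] for period in PERIODS.values())
    person_years = sum(period[arm][1] for period in PERIODS.values())
    return events, person_years


# Period-specific ratios show how the relative risk changes over time.
period_ratios = {
    name: rate_ratio(period["a"], period["b"]) for name, period in PERIODS.items()
}

# A single summary ratio averages over that change.
summary_ratio = rate_ratio(pooled("a"), pooled("b"))
```

Here the ratio is 2.0 in the early period and 0.5 in the late period, yet the pooled ratio is about 1.14, which describes neither period well. This is the sense in which a single summary estimate is an imperfect description of a time varying risk.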

Not all people with claims data in OLDW have available laboratory data, as laboratory results are available for a subset of patients based on data sharing agreements between OptumLabs and commercial laboratories. The availability of laboratory results, however, is independent of treatment regimen, and we do not expect it to bias our analyses. The schedule of HbA1c testing in real world practice is contingent on an individual’s current HbA1c level and ability to access care, and on the clinician’s anticipation of changing HbA1c levels. This may have confounded study results by delaying the time to HbA1c reassessment and reaching the study endpoint in people with low baseline HbA1c or with barriers to care. Our evaluation could not account for inclusion and exclusion criteria that could not be operationalized using claims data, including drugs obtained without insurance coverage (eg, obtained through a low cost generic programme,34 a patient assistance programme, or a sample), comorbidities that were not coded and billed in a clinical encounter, and information on family history. However, previous studies found the likely number of glucose lowering drugs missing from claims to be low.35 Finally, the study cohort comprised people with private and Medicare Advantage health plans, such that results may not fully generalize to people with public health plans or those without insurance coverage.

Policy implications

Contemporary clinical practice guidelines increasingly focus on the impacts of glucose lowering treatments on hard outcomes that are important to patients beyond HbA1c, such as macrovascular and microvascular complications and death.36 Indeed, most recent clinical practice guidelines recommend consideration of glucagon-like peptide-1 receptor agonists and sodium-glucose cotransporter-2 inhibitors even as preferred treatments and independent of the HbA1c level among people at high risk for atherosclerotic cardiovascular disease, kidney disease, and heart failure.14 For these outcomes, robust evidence favors liraglutide (of the drug classes examined) in individuals at high risk for atherosclerotic cardiovascular disease,3738 further underscoring the advantage of this drug. It will be important, in future research, to compare the effectiveness of glycemic control achieved by glucagon-like peptide-1 receptor agonists with that of sodium-glucose cotransporter-2 inhibitors, as sodium-glucose cotransporter-2 inhibitors are similarly recommended for people at high risk for cardiovascular disease, kidney disease, and heart failure.14

Analytic methods such as those implemented in this study, and in the parallel emulation of PRONOUNCE,18 can be leveraged for more timely evaluations of drug effectiveness and safety as long as the treatments being considered are already used in clinical practice. Indeed, work is currently underway to examine the comparative effectiveness of sulfonylurea, glucagon-like peptide-1 receptor agonist, dipeptidyl-peptidase 4 inhibitor, and sodium-glucose cotransporter-2 inhibitor drugs for atherosclerotic cardiovascular disease and other hard outcomes among people at moderate risk for atherosclerotic cardiovascular disease using observational data from real world practice.39


Better understanding of the comparative effectiveness and safety of second line glucose lowering drugs is urgently needed to inform shared decision making in diabetes. Ultimately, the population included in this study and our findings should be compared with those of the GRADE trial, once published in peer reviewed literature, to assess the fidelity and generalizability of results and to improve our understanding of the use of real world data to emulate clinical trials.

What is already known on this topic

  • Real world data are an important source of information about clinical practice, comparative effectiveness and safety, and health outcomes

  • Such data also have the potential to generate timely, pragmatic evidence on medical treatments as a complement to prospective clinical trials

  • Multiple classes of second line glucose lowering drugs have been approved for the management of type 2 diabetes, with limited evidence about their comparative effectiveness for glycemic control

What this study adds

  • This study emulated the Glycemia Reduction Approaches in Diabetes: A Comparative Effectiveness Study (GRADE) randomized clinical trial using data from a US administrative claims database to identify the strengths and limitations of using real world data to emulate prospective comparative effectiveness trials, particularly when examining drugs in contexts that may not be the standard of care

  • Liraglutide was found to be more effective than glimepiride and sitagliptin at lowering glycated hemoglobin (HbA1c), supporting its preferential use when substantial glycemic control is needed

  • Advanced causal inference analytic methods applied to observational data can be used to emulate clinical trials efficiently and effectively

Ethics statements

Ethical approval

All study data are deidentified consistent with Health Insurance Portability and Accountability Act of 1996 (HIPAA) expert deidentification determination. The study was therefore exempt from review by the Mayo Clinic institutional review board.

Data availability statement

This study was conducted using deidentified claims data from OptumLabs Data Warehouse. Raw data are not publicly available. The study protocol, code sets, and statistical analysis plan are available online.40


  • Contributors: RGM had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. RGM is the guarantor. YD analyzed the data and co-drafted the manuscript. ECP co-designed the study, supervised the analyses, and reviewed and edited the manuscript. JDW, SSD, JH, KQ, CG, WC, PN, XY, and TDL provided feedback on study design and reviewed and edited the manuscript. JSR and NDS secured funding, supervised the study, provided feedback on study design, and reviewed and edited the manuscript. RGM co-drafted the manuscript, co-designed the study, and supervised the analyses. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

  • Funding: This publication is supported by the Food and Drug Administration of the US Department of Health and Human Services (HHS) as part of a financial assistance award (Center of Excellence in Regulatory Science and Innovation grant to Yale University and Mayo Clinic U01FD005938) totaling $250 000 with 100% funded by FDA/HHS. RGM is also supported by the National Institutes of Health (NIH) National Institute of Diabetes and Digestive and Kidney Diseases (grant number K23DK114497). The funders had no role in considering the study design or in the collection, analysis, interpretation of data, writing of the report, or decision to submit the article for publication. The contents are those of the authors and do not necessarily represent the official views of, nor an endorsement by, FDA/HHS, NIH, or the US government.

  • Competing interests: All authors have completed the ICMJE uniform disclosure form and declare: support from the US Food and Drug Administration (YD, JDW, NDS, JSR, JH, RGM) and the National Institute of Diabetes and Digestive and Kidney Diseases (RGM) for the submitted work. In the previous three years, authors report support from the FDA (SSD), Johnson & Johnson through the Yale Open Data Access project (JDW, JSR), the National Institute on Alcohol Abuse and Alcoholism of the National Institutes of Health (NIH; JDW), AHRQ (JSR), PCORI (RGM), Department of Veterans Affairs Health Services Research and Development (SSD), Department of Veterans Affairs Office of Rural Health (SSD), NIH (SSD), Arnold Ventures (JDW, SSD, JSR), National Evaluation System for Health Technology Coordinating Center (SSD), National Institute for Health Care Management (SSD), Institute for Clinical and Economic Review California Technology Assessment Forum (SSD), American College of Cardiology (SSD), National Academy of Medicine (SSD), American Diabetes Association (RGM), Medical Device Innovation Consortium (JSR), and Bristol Myers Squibb (TDL). JSR has served as an expert witness. JDW serves as a consultant for Hagens Berman Sobol Shapiro LLP and Dugan Law Firm APLC. NDS is currently employed by Delta Air Lines; he was an employee of Mayo Clinic when this research was conducted. JSR is a co-founder of medRxiv and an associate research editor for The BMJ. Other declarations are: no financial relationships with any organizations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.

  • The lead author (RGM) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.

  • Dissemination to participants and related patient and public communities: Study results will be disseminated to patient and public communities through peer reviewed publication, reporting of study results to the Food and Drug Administration, and sharing of the results on social media.

  • Provenance and peer review: Not commissioned; externally peer reviewed.

This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited.