Practices and impact of primary outcome adjustment in randomized controlled trials: meta-epidemiologic study
BMJ 2013; 347 doi: https://doi.org/10.1136/bmj.f4313 (Published 12 July 2013) Cite this as: BMJ 2013;347:f4313
- Nazmus Saquib, postdoctoral research fellow,
- Juliann Saquib, postdoctoral research fellow,
- John P A Ioannidis, professor
- Stanford Prevention Research Center, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Correspondence to: J P A Ioannidis
- Accepted 27 June 2013
Objective To assess adjustment practices for primary outcomes of randomized controlled trials and their impact on the results.
Design Meta-epidemiologic study.
Data sources 25 biomedical journals with the highest impact factor according to Journal Citation Reports 2009.
Study selection Randomized controlled trials published in print in 2009 that reported primary outcomes. The search yielded 684 eligible papers of randomized controlled trials, of which 200 were randomly selected.
Data extraction Two researchers independently extracted data on study population, intervention, primary outcome, and the adjustment plan for primary outcomes. They also recorded the magnitude and statistical significance of the intervention effect with and without adjustments, and estimated whether adjustment made a difference in the level of nominal significance. They also compared the analysis plan for model adjustment in the published trial versus the trial protocol with information on the protocol collected from registries, design papers, and communication with all corresponding authors.
Results 54% of the trials used stratified randomization, 96% presented baseline characteristics in the compared arms, and 46% also evaluated differences in baseline factors with statistical testing. Half of the trials performed adjusted analyses for the main outcome, as the sole analysis (29%) or along with unadjusted analyses (21%). Adjustment for stratification variables and for baseline variables was performed in 39% (42/108) and 42% (84/199) of the trials, respectively. Among 40 comparisons with both adjusted and unadjusted analyses, 43% had statistically significant effects, 40% had non-significant effects, and 18% had significant effects with only one of the two analyses, but not with the other. Information on analysis plan regarding model adjustment was available in 6% (9/162) of trial registry entries, 78% (21/27) of design papers, and 74% (40/54) of protocols obtained from authors. The analysis plan disagreed between the published trial and the registry, protocol, or design paper in 47% (28/60) of the studies.
Conclusions There is large diversity in whether and how analyses of primary outcomes are adjusted in randomized controlled trials, and these choices can sometimes change the nominal significance of the results. Registered protocols should explicitly specify adjustment plans for main outcomes, and analyses should follow these plans.
The results of primary outcomes in randomized controlled trials are often influenced by factors other than the treatment. Examples include recruiting site characteristics in multicenter trials such as the type and number of sites1 as well as various participants’ characteristics such as age, sex, and body weight.2 These variables are often used as stratification factors during randomization.3 Some participants’ characteristics are asymmetrically distributed between the study arms despite randomization; and the likelihood of imbalance is higher in trials with a small sample and when randomization procedures are not followed properly. At present there is no consistent practice as to how to handle the imbalance between the study groups in analysis.2 4 5 Some studies adjust the outcome model for baseline differences among study groups. Other studies consider baseline differences to be chance findings and therefore not to be adjusted for, and many methodologists argue against even checking for them.6
Different choices for adjustment may lead to different estimates of the treatment effect and levels of statistical significance. The variability in treatment effects estimates due to multiple analytical choices has been described as “vibration of effects”: each analysis with or without various adjustments may give somewhat different results.7 Making adjustments should have little impact on the conclusions, when the effect size is large and its clinical importance is not contestable or even when the effect size is modest but the amount of evidence is large. Most randomized controlled trials are, however, not large8 and when the presence of an effect is tenuous (for example, when the results hover around the “attractive” P value of 0.05),9 decisions regarding adjustment may influence the interpretation of the study outcome. Vibration of effects may lead to biased results, if multiple possible adjustment schemes are performed retrospectively and the most favorable result is then selectively reported or highlighted.
Several measures can be taken to reduce the disparate practices in the analysis of data from randomized controlled trials and to increase the transparency and overall reproducibility of clinical research. The ideal practice would be for trial investigators to make the protocol, dataset, and analytical code publicly available. Alternatively, investigators could provide as much raw data as possible using detailed tables and figures either in the trial publications or in the registry. At a minimum, investigators should follow the standardized reporting guidelines that are already in place for randomized controlled trials. For example, the CONSORT statement and ICH E9 (statistical principles for clinical trials) provide explicit instructions on statistical analyses in randomized controlled trials, including model adjustment for baseline differences and stratified randomization factors.10 11
In this analysis we evaluated a sample (n=200) of randomized controlled trials published in high impact journals in 2009. We gathered all relevant information about planned statistical analysis of the primary outcome from trial registries, design papers, and protocols provided by the authors. We assessed differences in the statistical plan for model adjustment between the trial protocol and the trial publication for those trials with available protocols. For the sample of all 200 trials we also assessed the congruency of model adjustment for primary outcomes between the trial methods and results section. We examined which factors were used for adjustment and evaluated the extent to which adjusted and unadjusted treatment effects differ in level of nominal statistical significance.
We searched PubMed to locate the relevant trials. We categorized the search terms according to study type (randomized controlled trial) and journal (BMJ, American Journal of Psychiatry, American Journal of Respiratory and Critical Care Medicine, Annals of Internal Medicine, Annals of Neurology, Archives of General Psychiatry, Archives of Internal Medicine, Blood, Brain, Circulation, European Heart Journal, Gastroenterology, Gut, Hepatology, Journal of Allergy and Clinical Immunology, Journal of the American College of Cardiology, Journal of Clinical Oncology, Journal of the National Cancer Institute, JAMA, Lancet, Lancet Infectious Diseases, Lancet Neurology, Lancet Oncology, New England Journal of Medicine, and PLoS Medicine). The 25 biomedical journals were selected as having the highest impact factor (per Journal Citation Reports 2009 edition) among journals that may publish clinical trials. The “AND” Boolean operator was used to combine search terms between the categories and the “OR” operator was used within the category for journals. We limited the search to studies that involved human participants and were published in 2009.
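The Boolean structure of the search described above can be sketched as follows. This is a minimal illustration, not the query the authors ran: the field tags (`[Journal]`, `[Publication Type]`, `[MeSH Terms]`, `[dp]`) are standard PubMed search tags, but the journal title strings, the helper name `build_query`, and the exact way the human and date limits were applied are assumptions.

```python
# Hypothetical reconstruction of the search logic; the exact query is not reported.
JOURNALS = [
    "BMJ", "American Journal of Psychiatry",
    "American Journal of Respiratory and Critical Care Medicine",
    "Annals of Internal Medicine", "Annals of Neurology",
    "Archives of General Psychiatry", "Archives of Internal Medicine",
    "Blood", "Brain", "Circulation", "European Heart Journal",
    "Gastroenterology", "Gut", "Hepatology",
    "Journal of Allergy and Clinical Immunology",
    "Journal of the American College of Cardiology",
    "Journal of Clinical Oncology", "Journal of the National Cancer Institute",
    "JAMA", "Lancet", "Lancet Infectious Diseases", "Lancet Neurology",
    "Lancet Oncology", "New England Journal of Medicine", "PLoS Medicine",
]


def build_query(journals, year="2009"):
    """Join journals with OR, then combine the categories with AND."""
    journal_block = " OR ".join(f'"{name}"[Journal]' for name in journals)
    return (
        f"({journal_block})"
        " AND Randomized Controlled Trial[Publication Type]"
        " AND humans[MeSH Terms]"
        f" AND {year}[dp]"
    )


query = build_query(JOURNALS)
print(query)
```

Keeping the journal list as data rather than a hand-typed string makes it easy to confirm that all 25 journals entered the search.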
We screened the abstracts using the inclusion criteria of randomized controlled trial, published in print in 2009, and main trial paper with primary outcome. We randomly selected 200 of the eligible articles.
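The random selection step could be made reproducible along the following lines. This is only a sketch: the article does not describe the sampling tool, so the seed value and the integer identifiers standing in for the 684 eligible papers are assumptions.

```python
import random

N_ELIGIBLE = 684  # eligible papers reported by the study
N_SAMPLE = 200    # papers randomly selected

# Hypothetical identifiers standing in for the eligible articles.
eligible_ids = list(range(1, N_ELIGIBLE + 1))

# Arbitrary seed (an assumption) so that the draw can be repeated exactly.
rng = random.Random(2009)

# Simple random sample without replacement.
selected = sorted(rng.sample(eligible_ids, N_SAMPLE))
print(len(selected), selected[:5])
```

Recording the seed alongside the list of eligible articles would let a reader regenerate the same sample.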
Protocol assessment of included randomized controlled trials
Registration entries—We searched the trial papers to determine whether the study was registered. If the registry number was available, we checked the appropriate trial registry online and extracted information on the use of model adjustment for the primary outcome from general registration information, study results, or full study protocols that might be available in the registry for each trial.
Original protocol through personal communication—We identified the corresponding author and his or her contact information for each trial. We emailed each author with a brief description of our study and requested the full original study protocol. We waited two weeks for a response and sent a reminder email to those who had not responded.
Design paper search—We searched the registry as well as the reference list of the trial publication to see whether a design paper had been published. We obtained full text of the design papers through PubMed and extracted information on the use of model adjustment for the primary outcome. Additionally, some authors sent us a design paper during our email communication with them.
Data extraction from publications of randomized controlled trials
From each trial we extracted the following information: study arms, primary outcomes, sample size, number of sites, whether randomization was stratified, and whether arms were compared for baseline characteristics. We considered baseline differences to be statistically significant when P values were given or the text included the word “significant” for the comparison.
We recorded whether any adjustments were planned for the primary outcome in the Methods section and whether these analyses were respectively reported in the Results section. We defined a model as adjusted if explicit statements regarding adjustment were made, covariates were listed, or a statistical test that by definition is multivariate (for example, multiple logistic regression, multivariate analysis of variance) was reported. Whenever adjustments were made, we recorded whether they used variables that had been used in the stratification of the randomization or the baseline characteristics, and which variables were involved. To avoid double counting, if a variable had been used for the stratified randomization then it was not counted as a baseline characteristic; otherwise, baseline variables include all those mentioned in the text or table as measured at baseline. Finally, we recorded the unadjusted estimate and 95% confidence interval of the treatment effect of the primary outcome and any adjusted estimates for the same treatment comparison. For trials with multiple primary outcomes, we selected the outcome on which the sample size calculation was based.
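The extraction rules in this paragraph can be expressed as two small functions. This is a hedged sketch of the stated definitions, not the authors' extraction form: the function names, the keyword set, and the example variables are hypothetical.

```python
# Hypothetical keyword set for tests that are multivariate by definition.
MULTIVARIATE_TESTS = {
    "multiple logistic regression",
    "multivariate analysis of variance",
}


def is_adjusted(explicit_statement, covariates, tests):
    """A model counts as adjusted if any one of the three criteria holds."""
    uses_multivariate_test = bool(set(tests) & MULTIVARIATE_TESTS)
    return bool(explicit_statement or covariates or uses_multivariate_test)


def split_adjusters(adjusters, stratifiers):
    """Classify adjustment variables without double counting.

    A variable used to stratify the randomization is counted only as a
    stratification variable, never also as a baseline characteristic.
    """
    adjusters, stratifiers = set(adjusters), set(stratifiers)
    return {
        "stratification": adjusters & stratifiers,
        "baseline": adjusters - stratifiers,
    }


# Made-up trial: randomization stratified by site, model adjusted for three variables.
example = split_adjusters({"age", "sex", "site"}, {"site"})
print(example)
```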
Two researchers (NS and JS) independently extracted data. The pilot phase involved data extraction from 30 articles; the model adjustment data were in full agreement in two thirds of the cases. At the end of the pilot phase, the two researchers met with the senior investigator (JPAI) to resolve the discrepancies, to finalize the definition, and to streamline the protocol. At the end of the full data extraction phase, the two researchers discussed discrepancies in model adjustment data and were able to reach a consensus in all but four cases, which were arbitrated by the senior investigator.
We compared the model adjustment plan provided in the main trial publication with the corresponding information extracted from the registry, protocol, or design paper, to see if these did or did not agree. We also calculated the proportion of trials that reported statistically significant effects in each of the categories of agreement. For trials where the protocol and the publication had different analysis plans, we also recorded what the results were for analyses that had not been specified in the protocol but had been added in the publication.
From the sample of all 200 trials we generated frequency tables for categorical variables and measures of central tendency (that is, mean, median) and range for continuous variables pertaining to general trial characteristics and adjustment procedures. We identified the most common variables adjusted for and calculated the percentage of trials that reported them.
To achieve consistency across trials, we presented effect sizes in such a manner that relative risk metrics <1.00 and risk difference and mean difference metrics <0 indicate that the experimental intervention is better than the control intervention. Whenever both adjusted and unadjusted treatment effects were reported, we examined the concordance of the level of nominal statistical significance, based on 95% confidence intervals being entirely on one side of the null, P values <0.05, or a statement in the text. Whenever nominal significance was reached with only one analysis but not the other, we also noted whether the authors of the trial focused on the significant or non-significant result primarily in interpreting their findings.
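The confidence interval rule for nominal significance and the resulting concordance categories can be sketched as below. The interval values are made up for illustration, and the function names are hypothetical; the sketch covers only the confidence interval criterion, not P values or statements in the text.

```python
# Null value depends on the metric: 1 for relative risk metrics, 0 for differences.
NULL_VALUE = {"ratio": 1.0, "difference": 0.0}


def is_significant(ci_low, ci_high, metric):
    """Nominally significant if the 95% CI lies entirely on one side of the null."""
    null = NULL_VALUE[metric]
    return ci_high < null or ci_low > null


def concordance(unadjusted_sig, adjusted_sig):
    """Classify a comparison that reports both unadjusted and adjusted results."""
    if unadjusted_sig and adjusted_sig:
        return "both significant"
    if not unadjusted_sig and not adjusted_sig:
        return "both non-significant"
    return "unadjusted only" if unadjusted_sig else "adjusted only"


# Made-up relative risk intervals, for illustration only.
unadj = is_significant(0.70, 1.02, "ratio")  # interval crosses 1
adj = is_significant(0.66, 0.97, "ratio")    # interval entirely below 1
label = concordance(unadj, adj)
print(label)
```

In this made-up case a small shift in the interval after adjustment is enough to move the comparison into the discordant category.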
Statistical analyses were conducted in SAS version 9.2 (Cary, NC). All P values are two tailed.
The search resulted in 1123 articles. We excluded a total of 439 articles; 163 papers were not randomized controlled trials, 94 papers were published in print in 2010 (published electronically in 2009, but appeared in print in 2010), and 182 were secondary studies using the population of the original randomized controlled trial (that is, extended follow-up or subset analyses). Then, of the 200 randomly selected articles, two papers each reported the results from two separate trials and these were considered separately; conversely, we excluded three trials that had not analyzed primary outcomes between the study arms. Analysis included 199 trials (see supplementary file).
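The selection arithmetic in this paragraph can be checked with a few lines. The counts are taken from the text; the variable names and category labels are ours.

```python
retrieved = 1123  # articles returned by the search

# Exclusions reported in the text.
excluded = {
    "not a randomized controlled trial": 163,
    "published in print in 2010": 94,
    "secondary study of an original trial": 182,
}

eligible = retrieved - sum(excluded.values())

sampled_articles = 200   # randomly selected from the eligible articles
extra_trials = 2         # two papers each reported two separate trials
dropped_trials = 3       # no analysis of primary outcomes between arms

analysed = sampled_articles + extra_trials - dropped_trials
print(sum(excluded.values()), eligible, analysed)
```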
Comparison between trial protocols and main paper
Most of the 199 trials (81%, n=162) were registered; the most common registries included clinicaltrials.gov (n=109), International Standard Randomized Controlled Trial Number (ISRCTN) (n=41), and Australia New Zealand Clinical Trial Registry (ANZCTR) (n=6). General information on registration was available for the majority of the trials; study results were available for only 12% (20/162) of the trials. Statistical procedures such as model adjustment were found only in 6% (9/162) of the trials. Of the trials, 14% (27/199) had published design papers and 78% (21/27) of them provided information on model adjustment. We evaluated 54 original protocols, sent to us by the authors, and found that 74% (40/54) included an analysis section with a model adjustment plan. In total, 31% (61/199) of the trials had available information on adjustment from the registry, design paper, or protocol (figure).
We compared the analysis plans from the main paper with the plan reported in the registry, protocol, or design paper for 60 trials (the plan was unclear for one published trial). The adjustment plan matched in 53% (32/60) of the trials; of them, 62% (20/32) reported a significant effect. Among the trials where the adjustment plan did not match, 54% (15/28) reported a significant effect. Of the 28 trials where the plan in the protocol did not match what was eventually reported in the publication, 75% (n=21) reported analyses that had not been specified in the protocol and, of them, 62% (13/21) were statistically significant; whereas 25% (n=7) did not report analyses that had been specified in the protocol and 29% (2/7) of the reported analyses were statistically significant.
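The counts in this paragraph nest inside one another, which makes them easy to check. The sketch below uses only the counts stated in the text; the dictionary layout and the helper `pct` are illustrative choices.

```python
# (number of trials, number reporting a statistically significant effect)
plan_matched = (32, 20)
plan_mismatched = (28, 15)

# Breakdown of the mismatched trials.
added_unplanned_analyses = (21, 13)    # analyses not specified in the protocol
omitted_planned_analyses = (7, 2)      # protocol-specified analyses not reported


def pct(numerator, denominator):
    """Percentage to one decimal place."""
    return round(100 * numerator / denominator, 1)


compared = plan_matched[0] + plan_mismatched[0]
print("compared:", compared)
print("matched:", pct(plan_matched[0], compared))
print("significant if matched:", pct(*reversed(plan_matched)))
print("significant if mismatched:", pct(*reversed(plan_mismatched)))
print("added, significant:", pct(*reversed(added_unplanned_analyses)))
print("omitted, significant:", pct(*reversed(omitted_planned_analyses)))
```

The subgroup totals (21 + 7 trials, 13 + 2 significant results) add up to the mismatched group, so the breakdown is internally consistent.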
Characteristics of selected trials
Three quarters (77%) of the 199 reviewed trials had only two study arms, and a similar percentage (76%) of the trials assessed a single primary outcome. Most trials (69%) enrolled participants from multiple study sites (median 20, range 2-862 sites). More than half (54%) used stratified randomization. Almost all trials (96%) presented baseline characteristics per arm. About half (46%) tested statistically for differences in baseline characteristics, and 22% claimed statistically significant differences in baseline characteristics (table 1).
Among the 199 trials, the most common approach specified in the Methods section was to use an unadjusted analysis either alone (48%) or in conjunction with adjusted analysis (21%), whereas 30% of the trials planned adjusted analysis only. A similar picture was seen in the Results section. There were six trials where the plan in the Methods section was incongruent with what was shown in the Results section: three trials where adjusted analyses were promised in the Methods section but not shown in the Results section and three trials where adjusted analyses appeared in the Results section without being stated in the Methods section. Whether adjustment was used was unclear in both the Methods and the Results sections for two trials (table 2).
Ninety nine trials reported adjusted analysis in the Results section (adjusted only n=57, in conjunction with unadjusted n=42). In 92% (91/99) of the trials that used adjustments, the authors presented clearly the exact adjusting variables, whereas in 8% (8/99) of the trials the list of adjusting variables remained unclear. The median number of variables used in the adjustment was 3 (range 1-13, n=91). Ninety two per cent (84/91) adjusted for baseline variables and 46% (42/91) adjusted for stratification variables. The most common variables used for adjustment were age (31%), study center/site (27%), sex (23%), socioeconomic status (9%), smoking (6%), race (5%), body mass index or body weight (5%), and self rated health (5%). Adjustment for baseline variables was substantially more common when significant baseline differences had been found (60% v 38%, P=0.02).
Comparison of nominal significance between unadjusted and adjusted analyses
Of the 42 trials that performed both unadjusted and adjusted analyses for the selected primary outcomes, the determination of nominal statistical significance was possible for 38 trials.12-49 Twenty eight of them provided specific effect estimates for both types of analyses (table 3); 10 trials provided nominal significance for both types of analyses but were missing one or both effect estimates (table 4). Two trials19 36 had two comparisons each (two different active interventions versus control), making a total of 40 comparisons: 43% (17/40) had a statistically significant effect with both analyses, 40% (16/40) had non-significant effects with both analyses, and 18% (7/40) had a significant effect with only one of the two analyses (three unadjusted only, four adjusted only) (table 3).
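The three percentages above add up to 101% because two of them sit exactly on a half (42.5% and 17.5%). The sketch below reproduces the reported figures from the stated counts, assuming halves were rounded upward; that rounding rule is our inference, not something the article states. Python's built-in `round()` rounds halves to the nearest even number, so `decimal` is used instead.

```python
from decimal import Decimal, ROUND_HALF_UP

# Counts of comparisons taken from the text.
counts = {
    "significant with both analyses": 17,
    "non-significant with both analyses": 16,
    "significant with only one analysis": 7,
}
total = sum(counts.values())  # 38 trials, two of which contributed two comparisons


def percent_half_up(numerator, denominator):
    """Whole-number percentage with halves rounded upward (assumed rule)."""
    exact = Decimal(100 * numerator) / Decimal(denominator)
    return int(exact.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


percentages = {label: percent_half_up(n, total) for label, n in counts.items()}
print(total, percentages, sum(percentages.values()))
```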
Among the seven comparisons (six trials)15 19 20 37 42 45 with discrepant levels of nominal significance in unadjusted versus adjusted analyses, the authors interpreted the trial focusing primarily on the statistically significant result in four trials (five comparisons)19 37 42 45 where the experimental treatment tended to be better. Conversely, they focused primarily on the non-statistically significant result in one trial15 where the experimental treatment was worse. Finally, the authors of one trial carefully balanced between the significant and non-significant results, admitting that according to the primary analysis plan the results were non-significant.20
This empirical evaluation shows that there is wide diversity in adjustment practices for the analysis of primary outcomes in randomized controlled trials. The use or not of adjustments can make a substantial difference in the statistical inferences for several trials. In 18% of the comparisons that provided both adjusted and unadjusted analysis results, nominal statistical significance was attained with only one of the two approaches; in six of these seven cases, authors focused primarily on the result that was more favorable or less unfavorable for the experimental treatment. The lack of standardization in adjustment practices offers an opportunity for subjectively steering the results of randomized controlled trials towards specific conclusions. Registries generally did not include information on this aspect of the analysis, few trials had published design papers, and protocols could be obtained from authors for the minority of the trials. Even then, analysis plans regarding model adjustment disagreed between the available information in the protocol and the published trial in about half of the cases.
In the current practice, it is not possible to evaluate whether the analysis plan changed between the original protocol and main trial paper using the trial registries. Although most trials that we evaluated were registered, the information regarding statistical analyses, specifically model adjustment, was rarely provided. Only 12% of registry records provided study results and in only half of those cases was information on adjustment given. Trial investigators should provide full protocols with detailed analysis plans to the registry prior to conducting the study. Analysis plans in the registry, protocol, or design paper disagreed with the plans in the published paper in approximately half of the studies. It is unlikely that the disagreement rate would be lower in the trials where the protocol was neither publicly available nor retrievable from the authors. Our findings are consistent with a growing literature on selective reporting of outcomes and analyses in diverse study designs, including randomized controlled trials. Other empirical studies50 51 52 have shown selective reporting based on comparisons of protocols and published results; many outcomes specified in the protocols were missing in the published reports, while new outcomes were introduced.
We observed large variability in terms of the practices of adjustment for stratification or baseline variables. Less than half of the trials that used a stratified randomization technique adjusted the model for the stratification variables, even though this is suggested by both the CONSORT and the ICH E9 guidelines. Although we do not advocate testing for differences in baseline variables, our results showed that more trials adjusted for baseline variables when significant differences had been noted between the compared arms in these variables; however, there were many exceptions to this pattern. Many studies with significant differences at baseline did not adjust for them and many studies without significant differences still adjusted for baseline factors. This inconsistency is not surprising since a consensus does not exist on testing and adjusting for multicenter trials. The studies that adjusted for center typically did not report the exact model that they used to adjust for site. Multiple approaches exist to account for site in the analysis, and results may differ among them.1 Another empirical study found that only 25% among a sample of trials published in 2000 and 2006 reported adjusted analyses and only 5% and 10% of trials in these two years reported both adjusted and unadjusted analyses. The respective proportions were higher in our survey, but they were still low, and the difference may reflect the more select nature of our sample of trials (trials published in major journals) rather than improvement in reporting of adjustments over time.53
The adjusted model occasionally lacked information regarding which covariates it included. Sometimes the text specified the list of covariates that were tested in univariate analyses against the outcome, but did not report which of them qualified for the final adjusted model54; whereas in other trials the text reported covariates that were significantly associated with the outcome without mentioning whether the adjusted model also contained any non-significant covariates.55 56 Flexibility in the use of different combinations of covariates provides an additional mechanism for “vibration of effects.” Different adjustment choices may yield different results and open a window for potential selective reporting.
Limitations of this study
Our analysis has limitations. Inaccurate reporting of the methods and results cannot be excluded. However, the Methods and Results sections were almost always congruent in specifying the types of analyses. Moreover, some data were missing in the published reports. The estimates for the intervention effect and the associated significance levels were not mentioned in several trials that performed both unadjusted and adjusted analysis. Thus one would be unable to see the magnitude of the differences, if any, between the two analytical approaches. Finally, despite our intense efforts we could not retrieve the protocols for more than two thirds of the trials. It does not necessarily follow that these trials suffer from selective reporting, but lack of transparency creates opportunities for such bias.
We have shown the potential for selective reporting of the primary outcomes of trials published in the most high profile clinical journals, owing to the alternative choices on whether and how to adjust the results of these outcomes. Adjustments may affect the interpretation of the study findings and eventually their impact on medical practice and policy. Other authors have also identified that non-significant results in randomized controlled trials are subject to “spin” in their interpretation to make them seem more favorable.57 Overall, the plethora of analytical and interpretation options may infuse subjectivity into the evidence procured by randomized controlled trials. One way to help minimize these problems is to require that registered protocols present in meticulous detail any adjustment plans for the main outcomes. Significant advances have been made in trial registration58 59 and there is impetus also to improve the quality of registered protocols. For example, the SPIRIT initiative recommends that protocols explicitly specify whether an adjusted analysis will be undertaken and, if so, list the adjustment variables as well as the techniques used to handle them in the model.60 This is in line with efforts in other biomedical fields, where the reproducible research movement has led to requests to deposit detailed protocols, analytical code, and the raw data of published studies.61 62 63 64 65 Increased transparency regarding the choices of adjusted and unadjusted analyses may enhance the reliability of inferences from randomized trials.
What is already known on this topic
At present the imbalance between study groups in randomized controlled trials is inconsistently handled
Some studies adjust the outcome model for baseline differences among study groups, whereas others consider them chance findings or not even worth consideration
Different choices for adjustment may lead to different estimates of the treatment effect and levels of statistical significance
What this study adds
Covariates are handled in diverse ways in the analysis of primary outcomes of randomized controlled trials
It is common for analysis plans on adjustments to differ between protocols and published papers, but this can be discerned only when protocols are obtained from authors, since detailed information on the analysis plan is rarely available
Moreover, unadjusted versus adjusted results sometimes differ in the level of nominal significance, and investigators usually select to report the more favorable results
We thank the corresponding authors who responded to our email request and contributed to this work by providing us with the trial protocols or design papers.
Contributors: JPAI had the original idea and all three authors conceived and designed the study. NS and JS identified the eligible trials and extracted the relevant data. All three authors performed the statistical analyses, wrote the manuscript, and approved the final version of the manuscript. JPAI is the guarantor. All authors, external and internal, had full access to all of the data (including statistical reports and tables) in the study and can take responsibility for the integrity of the data and the accuracy of the data analysis.
Funding: This study received no funding.
Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organization for the submitted work; no financial relationships with any organization that might have an interest in the submitted work in the previous three years, no other relationships or activities that could appear to have influenced the submitted work.
Ethical approval: Not required.
Data sharing: Datasets are available from the corresponding author.
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/3.0/.