Novel methods to deal with publication biases: secondary analysis of antidepressant trials in the FDA trial registry database and related journal publications
BMJ 2009; 339 doi: http://dx.doi.org/10.1136/bmj.b2981 (Published 07 August 2009) Cite this as: BMJ 2009;339:b2981
- Santiago G Moreno, research student1,
- Alex J Sutton, professor of medical statistics1,
- Erick H Turner, assistant professor2,
- Keith R Abrams, professor of medical statistics1,
- Nicola J Cooper, senior research fellow1,
- Tom M Palmer, research associate3,
- A E Ades, professor of public health science4
- 1Department of Health Sciences, University of Leicester, Leicester LE1 7RH
- 2Department of Psychiatry, Oregon Health and Science University, Portland Veterans Affairs Medical Center, Portland, Oregon, USA
- 3MRC Centre for Causal Analyses in Translational Epidemiology, Department of Social Medicine, University of Bristol
- 4Department of Community Based Medicine, University of Bristol
- Correspondence to: S G Moreno
- Accepted 10 May 2009
Objective To assess the performance of novel contour enhanced funnel plots and a regression based adjustment method to detect and adjust for publication biases.
Design Secondary analysis of a published systematic literature review.
Data sources Placebo controlled trials of antidepressants previously submitted to the US Food and Drug Administration (FDA) and matching journal publications.
Methods Publication biases were identified using novel contour enhanced funnel plots, a regression based adjustment method, Egger’s test, and the trim and fill method. Results were compared with a meta-analysis of the gold standard data submitted to the FDA.
Results Severe asymmetry was observed in the contour enhanced funnel plot that appeared to be heavily influenced by the statistical significance of results, suggesting publication biases as the cause of the asymmetry. Applying the regression based adjustment method to the journal data produced a similar pooled effect to that observed by a meta-analysis of the FDA data. Contrasting journal and FDA results suggested that, in addition to other deviations from study protocol, switching from an intention to treat analysis to a per protocol one would contribute to the observed discrepancies between the journal and FDA results.
Conclusion Novel contour enhanced funnel plots and a regression based adjustment method worked convincingly and might have an important part to play in combating publication biases.
In 2008 Turner et al published a study in the New England Journal of Medicine showing that the scientific journal literature on antidepressants was biased towards “favourable” results.1 The authors compared the results in journal based reports of trials with data on the corresponding trials submitted to the US Food and Drug Administration (FDA) when applying for licensing. The discrepancies observed in the journal based reports were due to publication biases. Although the term publication bias has been used historically to refer to the suppression of whole studies based on (the lack of) statistical significance or “interest level,” a range of mechanisms can distort the published literature. These include, in addition to the suppression of whole studies, selective reporting of outcomes or subgroups; data “massaging,” such as the selective exclusion of patients from the analysis; and biases regarding timelines.2 A good umbrella term for all these is dissemination biases3 4; in keeping with common usage we refer to them as publication biases. If such biases are present, any decision making based on the literature could be misleading,5 6 not least through obtaining inflated clinical effects from meta-analysis.7
The FDA dataset is assumed to be an unbiased (but not the complete) body of evidence in the specialty of antidepressants and so is regarded as a gold standard data source owing to the legal requirements of submitting evidence in its entirety to the FDA and its careful monitoring for deviations from protocol.8 9 10 A gold standard dataset will not, however, be available in most contexts. In the absence of a gold standard, meta-analysts have had to rely on analytical methods to both detect and adjust for publication biases. This has been an active area of methodology development over the past decades, with much written on approaches to deal with publication biases in a meta-analysis context.2 These include graphical diagnostic approaches and formal statistical tests to detect the presence of publication bias, and statistical approaches to modify effect sizes to adjust a meta-analysis estimate when the presence of publication bias is suspected.2 While the performance of many of these methods has been evaluated using simulation studies, concerns remain as to whether the simulations reflect real life situations and therefore whether their perceived performance is representative of what would happen if they were used in practice. Understandably this has led to caution in the use of the methods, particularly for those that adjust effect sizes for publication biases6; but ultimately this is what is required for rational decision making if publication biases exist.
We consider what we believe are currently the best methods for identifying and adjusting for publication biases—both of which have been described only recently. Specifically, we consider a funnel plot (a scatter plot of effect size versus associated standard error) enhanced by contours separating areas of statistical significance from non-significance.11 These contours help distinguish publication biases from other factors that lead to asymmetry in the funnel plot. The method used to adjust a meta-analysis for publication bias is based on a regression line fitted to the funnel plot.12 The adjusted effect size is obtained by extrapolating the regression line to predict the effect size that would be seen in a hypothetical study of infinite size—that is, which has an effect size with zero associated standard error. For comparison and completeness we consider established methods to deal with publication bias. These are the regression based Egger’s test for funnel asymmetry,13 and the trim and fill method,14 which adjusts a meta-analysis for publication bias by imputing studies to rectify any asymmetry in the funnel plot.
The dataset from Turner et al provides a unique opportunity to evaluate the performance of these analytical methods against a gold standard. We present the results of applying the diagnostic and adjustment methods to the journal published results and compare the findings with those obtained through (gold standard) analysis of the data submitted to the FDA.
A full description of the dataset, how it was obtained, and the references to the trials associated with it have been published previously.1 Briefly, Turner et al identified the cohort of all phase II and phase III short term double blind placebo controlled trials used for the licensing of antidepressant drugs between 1987 and 2004 by the FDA. Seventy four trials registered with the FDA and involving 12 drugs and 12 564 patients were identified. To compare drug efficacy reported by the published literature with that of the FDA gold standard, Turner et al collected data on the primary outcome from both sources. Once the primary outcome data were extracted from the FDA trial registry, they searched the published scientific literature for publications matching the same trials. When a match was identified, they extracted data on the article’s apparent primary efficacy outcome. Because studies reported their outcomes on different scales, they expressed all effect sizes as standardised mean differences using Hedges’ g scores (accompanied by corresponding variances).15 Among the 74 studies registered with the FDA, 23 (31%), accounting for 3449 participants, were not published. Overall, larger effects were derived from the journal data than from the FDA data. Among the 38 studies with results viewed by the FDA as statistically significant, only one was unpublished. Conversely, inconclusive studies were, with three exceptions, either not published (22 studies) or published in conflict with the FDA findings (11 studies). Moreover, 94% of published studies reported a positive significant result for their primary outcome, compared with 51% according to the FDA. Data for the analysis were extracted from the previous paper (table C in the appendix),1 in which two studies were combined, making a total of 73 studies in our assessment.
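As an illustration of the effect size metric described above, the following is a minimal sketch of how a Hedges' g score and its variance can be computed from two-arm summary data. It uses the commonly cited approximation to the small sample correction factor and a large sample variance formula; the function name `hedges_g` and the input values are hypothetical and are not taken from the trial data.

```python
import math

def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Return (g, variance of g) for a two-arm trial with a continuous outcome."""
    df = n_t + n_c - 2
    # Pooled within-group standard deviation.
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    d = (mean_t - mean_c) / sd_pooled            # uncorrected standardised mean difference
    j = 1.0 - 3.0 / (4.0 * df - 1.0)             # approximate small sample correction
    # Large sample approximation to the variance of d, rescaled for g.
    var_d = (n_t + n_c) / (n_t * n_c) + d**2 / (2.0 * (n_t + n_c))
    return j * d, j**2 * var_d

# Hypothetical summary data (not from the FDA or journal datasets).
g, var_g = hedges_g(mean_t=12.0, mean_c=10.0, sd_t=5.0, sd_c=5.0, n_t=50, n_c=50)
```

With 50 patients per arm the corrected score is only slightly smaller than the uncorrected standardised mean difference of 0.4.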
We applied two novel methods to the journal dataset: the contour enhanced funnel plot11 16 to detect publication biases, and a regression based adjustment method12 to adjust for them. For completeness and comparison we also applied to the dataset the most established and commonly used methods to deal with publication biases—namely, Egger’s regression test13 for detecting bias, and the trim and fill adjustment method (fixed effects linear estimator).14 17 18 19 The trim and fill method is an iterative non-parametric technique that uses rank based data augmentation to adjust for publication bias by imputing studies estimated to be missing from the dataset. We use fixed effect models for the primary analysis in this paper; we also reanalysed the data using random effects models as a sensitivity analysis. Stata v.9.2 was used for all the analyses.
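For readers unfamiliar with the fixed effect model used in the primary analysis, the sketch below shows standard inverse variance pooling. It is an illustration in Python rather than the Stata code used for the paper; the function name `fixed_effect_pool` and the example numbers are assumptions made for illustration.

```python
import math

def fixed_effect_pool(effects, variances):
    """Inverse variance weighted average with a normal approximation 95% interval."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    se = math.sqrt(1.0 / total)                  # standard error of the pooled estimate
    return pooled, se, (pooled - 1.96 * se, pooled + 1.96 * se)

# Hypothetical effect sizes and variances for three trials.
pooled, se, ci = fixed_effect_pool([0.2, 0.4, 0.3], [0.01, 0.04, 0.02])
```

More precise trials receive more weight, so the pooled value is pulled towards the estimate with the smallest variance.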
Contour enhanced funnel plots
In its simplest form a funnel plot is a scatter plot of study effect sizes (x axis) against their estimated standard errors (y axis).20 When no bias is present such a plot should be symmetrical, with increasing variability in effect sizes being observed in the less precise studies towards the bottom of the plot, producing a funnel shape. Asymmetry in this plot may indicate that publication biases are present through the lack of observed data points in a region of the plot.20 Asymmetry alone does not necessarily imply publication biases exist, however, since alternative explanations for the asymmetry may be present.21 For example, confounding factors (that is, any unmeasured variable associated with both study precision and effect size) may distort the appearance of the plot. It has been observed that certain aspects of trial quality may influence the estimates of effect size,22 23 24 25 and empirical evidence suggests that small studies are, on average, of lower quality and this could induce asymmetry on a funnel plot.26 Mechanisms such as this lead to what have been termed small study effects,21 26 27 28 and their presence will also make funnel plots asymmetrical.
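Funnel asymmetry of this kind is what Egger's regression test quantifies. The sketch below uses the classic formulation, in which each study's standardised effect (effect divided by standard error) is regressed on its precision (one divided by standard error) and an intercept far from zero indicates asymmetry. The function name and data are hypothetical, and the P value, which would come from a t distribution with n-2 degrees of freedom, is left out to keep the example within the standard library.

```python
import math

def egger_intercept(effects, ses):
    """Ordinary least squares of effect/se on 1/se; returns (intercept, se, t statistic)."""
    n = len(effects)
    if n < 3:
        raise ValueError("need at least 3 studies")
    x = [1.0 / s for s in ses]                   # precision
    y = [e / s for e, s in zip(effects, ses)]    # standardised effect
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    resid_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_int = math.sqrt(resid_ss / (n - 2) * (1.0 / n + x_bar**2 / sxx))
    return intercept, se_int, intercept / se_int

# Hypothetical data: small alternating noise, with and without a small study effect.
ses = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
noise = [0.01, -0.01, 0.01, -0.01, 0.01, -0.01]
asym = [0.3 + 1.5 * s + e for s, e in zip(ses, noise)]   # effects grow with se
sym = [0.3 + e for e in noise]                            # no dependence on se

b_asym, _, t_asym = egger_intercept(asym, ses)
b_sym, _, t_sym = egger_intercept(sym, ses)
```

In the asymmetric example the intercept is close to the built-in slope of 1.5, whereas in the symmetric example it stays near zero.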
With a view to disentangling genuine publication biases from other causes of funnel asymmetry, the funnel plot can be enhanced by including contours that partition it into areas of statistical significance and non-significance11 16 based on the standard Wald test, marking traditionally perceived milestones of significance—for example, the 1%, 5%, and 10% levels.29 In this way the level of statistical significance of every study’s effect estimate is identified. Since there is evidence that publication biases are related to these milestones,30 31 this can aid interpretation of the funnel plot—that is, if studies seem to be missing in areas of statistical non-significance, then this adds credence to the notion that the asymmetry is due to publication biases. In such cases an attempt should be made to adjust for such biases (in the absence of being able to obtain gold standard data unaffected by publication biases, such as data from regulatory authorities like the FDA). Conversely, if the parts of the funnel where studies are perceived to be missing are found in areas of higher statistical significance, the cause of asymmetry is more likely to be due to factors other than publication biases.
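The contours can be sketched as follows, assuming a two sided Wald test of the null hypothesis of no effect. A study with effect size y and standard error s lies on a contour when |y| equals the critical z value times s, which is why the contours appear as straight lines on a plot of effect size against standard error. The function names are hypothetical.

```python
import math

def wald_p_value(effect, se):
    """Two sided P value from the standard normal distribution."""
    z = abs(effect) / se
    return math.erfc(z / math.sqrt(2.0))         # equals 2 * (1 - Phi(z))

def significance_region(effect, se):
    """Name the contour band in which a study falls."""
    p = wald_p_value(effect, se)
    if p < 0.01:
        return "p < 0.01"
    if p < 0.05:
        return "0.01 <= p < 0.05"
    if p < 0.10:
        return "0.05 <= p < 0.10"
    return "p >= 0.10"

def contour_effect(se, z_crit):
    """Effect size at which a study with this standard error sits on the contour."""
    return z_crit * se

# Critical values for the 10%, 5%, and 1% contours (two sided).
Z_CRIT = {"10%": 1.645, "5%": 1.960, "1%": 2.576}
right_5pct_contour = [contour_effect(se, Z_CRIT["5%"]) for se in (0.05, 0.10, 0.20)]
```

Studies missing from the "p >= 0.10" band, but present elsewhere, would be the pattern that points towards publication biases.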
Regression based adjustment
The regression based adjustment method fits a regression line to the data presented on a funnel plot.32 An adjusted pooled estimate of effect is obtained by predicting, from the regression line, the pooled effect size for an ideal study of infinite size (hence with zero standard error), which would be located at the top of a funnel plot, since it is hypothesised that there would be no bias in studies of that size. This idea has been discussed in the literature33 34 35 (and additionally, such metaregressions are commonly used to test for the presence of publication bias),13 but only recently has the notion been formally evaluated.12 In that evaluation the performance of several different regression models was considered over an extensive range of meta-analytical and publication bias scenarios. The best models were shown to consistently outperform the established trim and fill method. One of these, the quadratic version of the original Egger’s regression test,13 is implemented here. This assumes a linear trend between the effect size and its variance (rather than its standard error, as assumed in the original Egger’s test). Other models considered in the simulation study were designed for binary outcomes exclusively and are not considered here.
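The extrapolation idea can be sketched as a weighted least squares regression of effect size on its variance, with the intercept read off as the adjusted estimate. The use of inverse variance weights is an assumption of this sketch, and the confidence interval is omitted because it depends on the error structure of the particular model; neither should be read as the exact specification used in the paper. The function name and data are hypothetical.

```python
def regression_adjusted_effect(effects, variances):
    """Weighted least squares of effect size on variance.

    Returns (intercept, slope). The intercept is the predicted effect for a
    hypothetical study with zero variance, that is, the adjusted estimate.
    """
    if len(set(variances)) < 2:
        raise ValueError("need at least 2 studies with differing variances")
    weights = [1.0 / v for v in variances]       # inverse variance weights (assumed)
    sw = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, variances)) / sw
    y_bar = sum(w * y for w, y in zip(weights, effects)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, variances))
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, variances, effects))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data in which less precise studies report larger effects.
variances = [0.01, 0.02, 0.04, 0.08]
effects = [0.3 + 4.0 * v for v in variances]
adjusted, slope = regression_adjusted_effect(effects, variances)
# Unadjusted inverse variance weighted average, for comparison.
naive = sum(y / v for y, v in zip(effects, variances)) / sum(1.0 / v for v in variances)
```

Here the unadjusted average is inflated by the small studies, while the intercept recovers the underlying value of 0.3 that was built into the example.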
Figure 1A displays a contour enhanced funnel plot of the studies submitted to the FDA, with the corresponding fixed effect meta-analysis pooled estimate providing a weighted average of effect sizes across trials (g score 0.31, 95% confidence interval 0.27 to 0.35). This funnel plot is reasonably symmetrical (Egger’s test P=0.10), which is consistent with the hypothesis that the FDA is an unbiased and appropriate gold standard data source.
The contour enhanced funnel plot for the journal data (fig 1B) is different and highly asymmetrical (Egger’s test P<0.001). A meta-analysis of these data results in a higher average effect size (g score 0.41, 0.37 to 0.45). Most of the study estimates now lie above (but many close to) the right contour line, indicating a statistically significant benefit at the 5% level, with few studies located below this 5% contour line—that is, not reaching significance at the 5% level. Crucially, the area where studies seem to be “missing” is contained within the area where non-significant studies would be located, inside the triangle defined by the P=0.10 contour boundaries. This adds further credence to the hypothesis that the observed asymmetry is caused by publication biases. Hence, even without the availability of the corresponding funnel plot for the FDA data (fig 1A), a contour enhanced funnel plot has convincingly identified publication biases as a major problem for the journal data.
For the journal dataset, the trim and fill method imputed a total of 18 “missing” studies (all in the region of non-statistical significance indicated by squares in figure 1C). This agrees reasonably well with the truth, as 23 studies identified through the FDA registry were not identified in the journal literature. The application of the trim and fill method reduced the average effect size to 0.35 (95% confidence interval 0.31 to 0.39), which is about halfway between the FDA and journal estimates (all three estimates are presented in figure 1C).
The fitted line corresponding to the regression based adjustment method is plotted in figure 1D (orange dashed line). The adjusted estimate is obtained by extrapolating the line to where the standard error is 0 (at the top of figure 1D). This produces an adjusted average effect size of 0.29 (95% confidence interval 0.23 to 0.35), which is close to the estimate produced by the meta-analysis of the FDA data (0.31, 0.27 to 0.35).
The situation is complicated by the fact that among the FDA non-significant studies that were published in medical journals, most were published as if they were significant. This is investigated in figure 2A by linking the effect sizes from each study where estimates were available from both data sources (69% (n=50) of all the trials), using arrows indicating the magnitude and direction of change between FDA and published effect sizes. The effect size differed between FDA and journal analyses in 62% (n=31) of the 50 trials by at least a g score of 0.01. Of these, the journal published effects were larger in 77% (n=24) of the studies (arrows pointing to the right). As expected, a meta-analysis of these data produces a higher average effect size for the journal data (g score=0.41, 95% confidence interval 0.37 to 0.45) compared with the matched FDA data (0.37, 0.33 to 0.41). About eight studies in figure 2 achieve statistical significance at the 5% level when published in medical journals, contradicting their non-significant FDA submission, whereas no journal publication revokes statistical significance previously reported to the FDA. This suggests that reporting biases within published studies are directed towards the realisation of statistical significance. Similarly, 96% (n=21) of the 22 unpublished studies (in journals) were non-significant when submitted to the FDA (fig 2B), which again supports the hypothesis of the presence of publication biases. The fixed effect meta-analysis estimate for these 22 unpublished studies (0.15, 95% confidence interval 0.08 to 0.22) was far lower than the one for published studies (0.41, 0.37 to 0.45; fig 2B), adding further support to the view that serious publication biases are present in the journal data.
A reanalysis of the data using random effects models produced similar results to the fixed effect analysis (proportion of total variability explained by heterogeneity (I²) was 16% for the FDA data and 0% for the journal data).36 Details are available on request from the first author.
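The heterogeneity statistic quoted above can be sketched as follows, using the usual definition based on Cochran's Q: the percentage of Q in excess of its degrees of freedom, truncated at zero. The function name and the example numbers are hypothetical.

```python
def i_squared(effects, variances):
    """Return (Q, I squared as a percentage) for a set of study estimates."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0.0:
        return q, 0.0                            # identical estimates: no heterogeneity
    return q, max(0.0, (q - df) / q) * 100.0

# Hypothetical examples: identical estimates, then widely scattered ones.
q_same, i2_same = i_squared([0.3, 0.3, 0.3], [0.01, 0.02, 0.04])
q_diff, i2_diff = i_squared([0.1, 0.5, 0.9], [0.01, 0.01, 0.01])
```

When the observed scatter is no larger than expected from sampling error alone, the statistic is reported as 0%.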
The application of two novel approaches to identify and adjust for publication biases in a dataset derived from a journal publication, where a gold standard dataset exists, produced encouraging results. Firstly, detection of publication biases was convincing using a contour enhanced funnel plot. Secondly, the regression based method produced a corrected average effect size, which was close to that obtained from the FDA dataset (and closer than that obtained by the trim and fill method).
This assessment does, however, have limitations. Firstly, the findings relate to a single dataset and thus are not necessarily generalisable to other examples. Specifically, all the trials were sponsored by the pharmaceutical industry and we make the assumption that the FDA data are completely unbiased. Furthermore, the methods under evaluation were designed primarily for the assessment of efficacy outcomes and they might not be appropriate for safety outcomes—for example, there may be incentives to suppress statistically significant safety outcomes (rather than non-significant ones). This is an area that requires more research.
Debate is ongoing about the usefulness of funnel plots and related tests for the identification of publication biases. Although their use is widely advocated2 37 some question their validity,27 38 39 40 41 including in this journal.42 We think the analysis presented here provides strong evidence that they do have a useful role.
Recently there has been a lot of research into refining tests for funnel plot asymmetry,13 26 43 44 45 and while we support the formalisation of such an assessment, none of the tests (nor the trim and fill or regression adjustment methods) considers the statistical significance of the available study estimates. For this reason we consider the contours on the funnel plot to be an essential component of distinguishing publication biases from other causes of funnel plot asymmetry. We make no claim that the contours can distinguish between the different mechanisms for publication bias—for example, whether it is missing whole studies, selectively reported outcomes, or “massaged” data that have led to the distorted funnel plot. (Because we have the FDA data, we do go on to disentangle this (fig 2) but generally this will not be possible.) But we do not think this is an important limitation because all these biases have the same effect in a meta-analysis—that is, they are all assumed to be related to statistical significance and they all result in an exaggeration of the pooled effect. There is empirical evidence to support this notion for the effect of reporting biases within published clinical trials in general46 47 48 and for trials on antidepressants in particular.1 49 50 Potential mechanisms that are known to induce this include: (a) selectivity in which outcomes are reported or labelled as primary in journal publications; (b) post hoc searches for statistical significance using numerous hypothesis tests—that is, data dredging or fishing; and (c) selectivity in the analysis methods applied to the data for journal publication.
Regarding the last point, the FDA makes its recommendations based on the intention to treat principle,51 52 whereas only half the journal publications are analysed and reported using this approach.53 54 55 56 The usual alternative—the per protocol approach to analysis—excludes dropouts and non-adherents (or patients with protocol deviations in general) and aims to estimate drug efficacy, which will tend to inflate effect sizes compared with the intention to treat approach, which estimates effectiveness.57 58 59 60 An estimate from a per protocol analysis will generally have less precision than for the associated intention to treat analyses owing to the removal of patients with protocol deviations,61 62 which would result in a shift downwards along the y axis of a funnel plot. This is consistent with what is observed in figure 2A, where most arrows are in a downward (as well as right moving) direction. How much such a mechanism commonly contributes to funnel plot asymmetry would be worthy of further investigation.
Few methods for specifically addressing outcome63 64 and subgroup reporting biases65 exist, and further development of analytical methods to specifically tackle aspects of reporting biases within studies is encouraged. Nevertheless, it is reassuring that the methods used in this article to address publication and related biases generally seem to work well in the presence of multiple types of publication biases. We no longer advocate the use of the trim and fill method because of problems identified through simulation studies.12 40 66 The regression adjustment method, which is easy to carry out,67 consistently outperformed the trim and fill method in an extensive simulation study12 (as well as within this particular dataset).
We consider technical issues relating to the influence of the choice of outcome metric on the robustness of the results, and to the analysis methods used within the assessments. Firstly, the Hedges’ g score outcome metric was used throughout the analysis. This includes a correction for small sample size. An alternative metric, without the correction, is the Cohen’s d score, which could also have been used. However, this would have negligible influence on the funnel plots presented here since the correction is still modest even for the smallest trials (n=25). An additional consideration is that the contours on the funnels are constructed assuming normality of the effect size since they are based on the Wald test. We acknowledge that this may not be exactly the statistical test used in the original analyses for some of the trials. For example, for trials with small sample sizes, a t test may have been used. However, as the Wald and t test statistics converge as the sample size increases, this is only going to affect the assessment of the most imprecise trials at the bottom of the funnel, and all our findings are clearly robust to this.
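A rough sense of how modest the correction is can be had from the commonly used approximation to the correction factor, written here as J with ν degrees of freedom. Whether n=25 refers to the total sample size or to the size of each arm is not stated, so both readings are shown as assumptions.

```latex
g = J(\nu)\, d, \qquad J(\nu) \approx 1 - \frac{3}{4\nu - 1}, \qquad \nu = n_1 + n_2 - 2

% Assumption 1: n = 25 is the total sample size, so \nu = 23
J(23) \approx 1 - \frac{3}{91} \approx 0.967

% Assumption 2: n = 25 per arm, so \nu = 48
J(48) \approx 1 - \frac{3}{191} \approx 0.984
```

Under either reading the g score is within about 3% of the corresponding d score.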
The 73 randomised controlled trials considered here correspond to 12 different antidepressants. Despite this, there was little statistical heterogeneity in both datasets and so we carried out fixed effect analyses for simplicity (and findings are consistent if random effects are used). There is an ever present tension in meta-analysis between “lumping and splitting” studies, and an argument could be made for allowing for specific differences in drug treatment by stratifying them and carrying out 12 separate analyses. Challenges would arise if attempting to detect and adjust for publication biases in each of the analyses independently owing to the difficulty of interpreting funnel plots with small numbers of studies and the limited power of statistical methods.26 We agree with the suggestions of Shang et al,68 in their assessment of biases in the homoeopathy trial literature (which has some commonalities with the analysis presented here), that it is advantageous to “borrow strength” from a large number of trials and provide empirical information to assist reviewers and readers in the interpretation of findings from small meta-analyses that focus on a specific intervention. Furthermore, investigations of extensions of the existing statistical methods that would formalise such ideas for borrowing strength to produce stratum specific tests and estimates of bias are under way.
Given the apparent biases in the journal based literature for these placebo controlled trials on antidepressants, we are concerned about the validity of the findings of a recent high profile network meta-analysis69 of non-placebo controlled trials on antidepressants, as no assessment of potential publication biases seems to have been carried out.70
Undoubtedly the best solution to publication biases is to prevent them from occurring in the first place.2 Using a gold standard data source, such as the FDA trial registry database, is one way of achieving this. However, this is still a long way off from becoming a reality for many analyses. Hence we often have to rely on analytical methods to deal with the problem, and we believe that the contour enhanced funnel plot and the regression based adjustment method provide important developments in the toolkit to combat publication biases.
What is already known on this topic
Publication biases exaggerate clinical effects resulting in potentially erroneous clinical decision making
While most of the attention has focused on the non-publication of whole studies, the problem of reporting biases within published studies is receiving increased attention
What this study adds
Mechanisms including suppression of whole studies, selective outcome reporting, and data “massaging” (for example, selective exclusion of patients from the analysis) may act simultaneously, but may be motivated by underlying statistical significance
Contour enhanced funnel plots and a regression based adjustment method to identify and adjust for multiple publication biases using real data where a gold standard exists showed promising results
Contributors: AJS conceived the project and led the research together with SGM. AJS and SGM carried out the statistical analyses and interpretation of the data. EHT, AEA, KRA, and NJC participated in data analysis and interpretation. TMP made a substantial contribution by designing and developing the plots. SGM and AJS drafted the paper, which was revised by all coauthors through substantial contributions to the contents of the paper. All authors approved the final version of the paper for publication. SGM is the guarantor.
Funding: SGM was supported by a Medical Research Council Health Services Research Collaboration studentship in the UK. AEA was funded by the Medical Research Council Health Services Research Collaboration. The funding agreement ensured the authors’ independence in designing the study, interpreting the data, and writing and publishing the report. The corresponding author as well as the other authors had access to all the data and take responsibility for the integrity of the data and the accuracy of the data analysis.
Competing interests: None declared.
Ethical approval: Not required.
Data sharing: Data are available on request from the first author.
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/2.0/ and http://creativecommons.org/licenses/by-nc/2.0/legalcode.