The case of the misleading funnel plot
BMJ 2006; 333 doi: https://doi.org/10.1136/bmj.333.7568.597 (Published 14 September 2006) Cite this as: BMJ 2006;333:597
- Joseph Lau, professor (1)
- John P A Ioannidis, professor (2)
- Norma Terrin, associate professor (1)
- Christopher H Schmid, professor of medicine (1)
- Ingram Olkin, professor (3)
- (1) Institute for Clinical Research and Health Policy Studies, Tufts-New England Medical Center, Boston, MA 02111, USA
- (2) Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece
- (3) Statistics Department, Stanford University, Palo Alto, USA
- Correspondence to: J Lau
- Accepted 16 June 2006
The advent of evidence based medicine has generated considerable interest in developing and applying methods that can improve the appraisal and synthesis of data from diverse studies. Some methods have become an integral part of systematic reviews and meta-analyses, with reviewers, editors, instructional handbooks, and guidelines encouraging their routine inclusion. However, the evidence for using these methods is sometimes lacking, as the reliance on funnel plots shows.
What is a funnel plot?
The funnel plot is a scatter plot of the component studies in a meta-analysis, with the treatment effect on the horizontal axis and some measure of weight, such as the inverse variance, the standard error, or the sample size, on the vertical axis. Light and Pillemer proposed in 1984: “If all studies come from a single underlying population, this graph should look like a funnel, with the effect sizes homing in on the true underlying value as n increases. [If there is publication bias] there should be a bite out of the funnel.”[1] Many meta-analyses show funnel plots or apply tests for funnel plot asymmetry, and then interpret the results directly as evidence for or against the presence of publication bias.
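As an illustration of how the plot is built, the minimal Python sketch below computes funnel plot coordinates from a few 2x2 tables. It makes several choices that are assumptions, not conventions set by the article. The effect is the log odds ratio and the vertical-axis measure is its standard error. The "pseudo 95% confidence limits" that form the funnel's edges are drawn around the fixed-effect inverse-variance pooled estimate. The function names and the trial data are hypothetical.

```python
import math

def log_odds_ratio(a, b, c, d):
    """Log odds ratio and its standard error from one trial's 2x2 table.

    a, b: events and non-events in the treatment arm
    c, d: events and non-events in the control arm
    Assumes no zero cells (no continuity correction is applied).
    """
    lor = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return lor, se

def funnel_coordinates(tables, z=1.96):
    """Funnel plot points, pooled estimate, and pseudo confidence limits.

    Returns a list of (log OR, SE) points, the fixed-effect inverse-variance
    pooled log OR, and a function mapping an SE to the (lower, upper)
    pseudo 95% limits, pooled +/- z * SE, that form the edges of the funnel.
    """
    points = [log_odds_ratio(*t) for t in tables]
    weights = [1 / se ** 2 for _, se in points]  # inverse-variance weights
    pooled = sum(w * y for w, (y, _) in zip(weights, points)) / sum(weights)

    def limits(se):
        return pooled - z * se, pooled + z * se

    return points, pooled, limits

# Hypothetical trials: (events_T, non-events_T, events_C, non-events_C)
tables = [(10, 90, 20, 80), (30, 270, 45, 255), (4, 21, 8, 17)]
points, pooled, limits = funnel_coordinates(tables)
for lor, se in points:
    print(f"log OR = {lor:+.3f}, SE = {se:.3f}")
print("pooled log OR:", round(pooled, 3))
print("pseudo 95% limits at SE = 0.5:", limits(0.5))
```

Plotting is left out. The points can be passed to any scatter-plot routine, usually with the SE axis inverted so that the most precise studies sit at the top.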
The plot's wide popularity followed an article published in the BMJ in 1997.[2] By December 2005 that pivotal article had received more than 800 citations in the Web of Science. Only two other papers published by the BMJ in the past decade have been cited more often. The authors were careful to list many reasons why funnel plot asymmetry need not reflect publication bias. However, many readers apparently did not go beyond the title, “Bias in meta-analysis detected by a simple, graphical test.”
The influential Cochrane Handbook adopts a relatively conservative view and acknowledges that there are problems with the concept.[3] Yet it devotes more than four pages to the subject, far more than to any other test of bias or heterogeneity in meta-analysis. The widely accepted quality of reporting of meta-analyses (QUOROM) statement simply requires, in its proposed checklist, a description of “any assessment for publication bias.”[4] By contrast, its equally accepted counterpart for meta-analyses of observational studies in epidemiology (MOOSE) states that “methods should be used to aid in the detection of publication bias, eg, fail safe methods or funnel plots.”[5] Even we advocated funnel plots, devoting a figure and considerable text to them, in an article on quantitative synthesis in systematic reviews commissioned by the American College of Physicians.[6]
Use and abuse
Hand searching of the BMJ from July 2003 to June 2005 showed that funnel plots were mentioned in 20 of the 47 systematic reviews that included some quantitative data synthesis (see bmj.com for full details). In all 20 cases, the plots were mentioned specifically as tests for evaluating publication bias. Four of the 20 systematic reviews ultimately did not perform these tests because they judged that too few studies were available (a maximum of 3 to 10 per meta-analysis). One mentioned the tests only in the methods. Only one performed the tests and acknowledged that “the funnel plot may not detect publication bias when the number of studies is small.” The other 14 systematic reviews did not question the inferences from these tests. They typically made categorical statements about conclusively finding or excluding publication bias with these methods.
A total of 34 meta-analyses were evaluated with these methods. Of these, 14 had nine or fewer studies and 18 had significant between-study heterogeneity. Only five of the 34 had 10 or more studies and no significant between-study heterogeneity. Ten studies is not an adequate number for a funnel plot,[7] but we chose it as a cut-off to show that systematic reviewers did not meet even this liberal criterion.
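The article does not say how "significant between-study heterogeneity" was judged in the reviews it surveyed. One common approach is Cochran's Q test, often reported alongside I^2 = (Q - df)/Q, truncated at zero. The Python sketch below assumes that approach. It computes the chi-square P value with a small standard-library series so that no external package is needed. All names and data are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df
    degrees of freedom, via the series for the regularized lower incomplete
    gamma function. Adequate for the moderate Q values of typical
    meta-analyses; not tuned for very large x."""
    if x <= 0:
        return 1.0
    a, h = df / 2, x / 2
    term = total = 1 / a
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= h / (a + n)
        total += term
    lower = math.exp(-h + a * math.log(h) - math.lgamma(a)) * total
    return min(1.0, max(0.0, 1.0 - lower))

def heterogeneity(effects, ses):
    """Cochran's Q, its chi-square P value (k - 1 df), and I^2."""
    w = [1 / s ** 2 for s in ses]  # inverse-variance weights
    pooled = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - pooled) ** 2 for wi, y in zip(w, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, chi2_sf(q, df), i2

effects = [-0.5, -0.1, -0.9, 0.2]  # hypothetical log odds ratios
ses = [0.2, 0.25, 0.3, 0.35]
q, p, i2 = heterogeneity(effects, ses)
print(f"Q = {q:.2f} on {len(effects) - 1} df, P = {p:.3f}, I^2 = {i2:.0%}")
```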
Different tests applied to the same meta-analysis were notably interpreted in inconsistent ways. For example, in a meta-analysis of breastfeeding and blood pressure in later life,[w3] the results said: “evidence of such [publication bias] was provided by a funnel plot. The Egger test was significant (P = 0.033), but not the Begg test (P = 0.186)” and a figure shows “Begg's funnel plot (pseudo 95% confidence limits).” Even the same test was sometimes interpreted differently in the results and in the discussion. For example, in a meta-analysis of metformin for polycystic ovary syndrome,[w4] the results stated that “the funnel plot implies publication bias,” whereas the discussion concluded that “these data seem robust with no evidence of major publication bias.”
Accuracy of test
The evaluation of a methodological test is directly analogous to the evaluation of a clinical diagnostic test. Fryback and Thornbury have proposed a six level model for evaluating a diagnostic test,[8] which provides a useful framework for discussion. The six expectations of a clinical diagnostic test are technical feasibility, diagnostic accuracy, diagnostic effect, treatment effect, effect on patient outcome, and societal effect. If the conclusions of evidence based medicine rest on poor tests, the eventual harm may be considerable. So we must closely examine at least the technical feasibility and diagnostic accuracy of these methods.
An evaluation of the technical feasibility of the funnel plot shows many problems that are difficult to solve. Strong empirical evidence exists that the appearance of the plot may be affected by the choice of the coding of the outcome (binary versus continuous),[9] the choice of the metric (risk ratio, odds ratio, or logarithms thereof), and the choice of the weight on the vertical axis (inverse variance, inverse standard error, sample size, etc).[10,11] Figure 1 gives an example of how these choices can make a difference.
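A small numerical example shows how the choice of metric moves a study on the plot. The Python sketch below takes two hypothetical trials of equal size (200 patients each) with the same risk ratio of 0.5. One has a low control-group event rate and the other a high one. The function name and the data are illustrative assumptions. The sketch computes each trial's log risk ratio and log odds ratio together with their standard errors.

```python
import math

def ratio_measures(events_t, n_t, events_c, n_c):
    """Log risk ratio and log odds ratio, each with its standard error.

    Assumes no zero cells; no continuity correction is applied.
    """
    a, b = events_t, n_t - events_t
    c, d = events_c, n_c - events_c
    log_rr = math.log((a / n_t) / (c / n_c))
    se_log_rr = math.sqrt(1 / a - 1 / n_t + 1 / c - 1 / n_c)
    log_or = math.log((a * d) / (b * c))
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return {"log_rr": log_rr, "se_log_rr": se_log_rr,
            "log_or": log_or, "se_log_or": se_log_or}

# Hypothetical trials: same size and same risk ratio (0.5),
# but different control-group event rates (10% versus 60%).
low_risk = ratio_measures(5, 100, 10, 100)
high_risk = ratio_measures(30, 100, 60, 100)

for name, m in [("low risk", low_risk), ("high risk", high_risk)]:
    print(f"{name}: RR = {math.exp(m['log_rr']):.2f}, "
          f"OR = {math.exp(m['log_or']):.2f}, "
          f"SE(log RR) = {m['se_log_rr']:.3f}, "
          f"SE(log OR) = {m['se_log_or']:.3f}")
```

On the risk ratio scale the two trials share a horizontal position. On the odds ratio scale the high-risk trial moves further from the null. With sample size on the vertical axis both trials sit at the same height, but with inverse variance they do not.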
Even in the unlikely event that agreement is reached on which metric and which expression of weight to use on the axes, enormous uncertainty and subjectivity remain in the visual interpretation of the same plot by different researchers. Our team recently designed a survey to examine this question using simulated plots with or without publication bias.[12] The ability of researchers to identify publication bias using a funnel plot was practically identical to chance (53% accuracy).
Formal statistical tests may eliminate the subjectivity in visual inspection of asymmetry. Investigators commonly use the rank correlation test[13] or one of many tests based on regression.[2,7,10,11,14] The validity of these tests depends on assumptions that are often unmet in practice, however, and the choice of test introduces further subjectivity into the procedure. In theory, the methods require a considerable number of studies, generally at least 30 for sufficient power. But the number needed depends on the size of the studies and on the true treatment effect. For example, for an odds ratio of 0.67, even 60 studies are not adequate.[7] Most meta-analyses of clinical trials, however, have far fewer studies. For instance, the average Cochrane meta-analysis has fewer than 10.[15] Thus the tests typically have low power[16] and may be inappropriate.
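To show what the two families of tests compute, the Python sketch below implements textbook forms of both. The first is the Begg and Mazumdar rank correlation test. It takes Kendall's tau between standardized deviates and variances and uses a normal approximation with no correction for ties. The second is the Egger regression test, an ordinary least squares regression of y/SE on 1/SE whose intercept measures asymmetry. These are simplified versions for illustration, not any particular package's implementation, and the data are hypothetical. To stay within the standard library, the Egger function returns the intercept's t statistic rather than a P value. The P value would come from a t distribution with k - 2 df, for example via scipy.stats.t.sf.

```python
import math

def _fixed_effect(effects, variances):
    """Inverse-variance pooled estimate and its variance."""
    w = [1 / v for v in variances]
    return sum(wi * y for wi, y in zip(w, effects)) / sum(w), 1 / sum(w)

def _sign(x):
    return (x > 0) - (x < 0)

def begg_test(effects, ses):
    """Begg and Mazumdar rank correlation test (normal approximation,
    no correction for ties). Returns (z, two-sided P)."""
    v = [s ** 2 for s in ses]
    pooled, v_pooled = _fixed_effect(effects, v)
    # Standardized deviates of each effect from the pooled estimate
    t = [(y - pooled) / math.sqrt(vi - v_pooled) for y, vi in zip(effects, v)]
    k = len(effects)
    # Kendall's S: concordant minus discordant pairs of (deviate, variance)
    s = sum(_sign(t[i] - t[j]) * _sign(v[i] - v[j])
            for i in range(k) for j in range(i + 1, k))
    z = s / math.sqrt(k * (k - 1) * (2 * k + 5) / 18)
    return z, math.erfc(abs(z) / math.sqrt(2))

def egger_test(effects, ses):
    """Egger regression test: OLS of y_i/se_i on 1/se_i (needs k >= 3).
    Returns (intercept, its SE, t statistic on k - 2 df)."""
    x = [1 / s for s in ses]                   # precision
    y = [e / s for e, s in zip(effects, ses)]  # standardized effect
    k = len(x)
    mx, my = sum(x) / k, sum(y) / k
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_int = math.sqrt(rss / (k - 2) * (1 / k + mx ** 2 / sxx))
    t_stat = intercept / se_int if se_int > 0 else float("nan")
    return intercept, se_int, t_stat

effects = [-0.9, -0.6, -0.4, -0.3, -0.2, -0.15]  # hypothetical log ORs
ses = [0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
z, p = begg_test(effects, ses)
b0, se_b0, t = egger_test(effects, ses)
print(f"Begg: z = {z:.2f}, P = {p:.3f}")
print(f"Egger: intercept = {b0:.2f} (SE {se_b0:.2f}), "
      f"t = {t:.2f} on {len(ses) - 2} df")
```

With only six studies, as here, either test has little power, which is the point made above.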
Even ignoring statistical concerns of power and choice of metric and weights, it is still unclear whether funnel plots really diagnose publication bias. Strictly speaking, funnel plots probe whether studies with little precision (small studies) give different results from studies with greater precision (larger studies). Asymmetry in the funnel plot may therefore result not from systematic under-reporting of negative trials but from an essential difference between smaller and larger studies that arises from inherent between-study heterogeneity. For example, small studies may focus on high risk patients, for whom the treatment is more effective because such patients have more events that could potentially be prevented.[17] Alternatively, studies with small weight may generally have shorter follow-up and differ because the treatment effect decreases with time.[18] Early studies may target different populations (with different effect sizes) than subsequent studies.[19] Subsequent studies may also be much larger, testing the concept in less selected patients. Variation in quality can also affect the shape of the funnel plot, with smaller, lower quality studies showing greater benefit of treatment.[20]
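The high-risk-patients mechanism can produce an "asymmetric" funnel with no missing studies at all. The Python sketch below assumes a hypothetical meta-analysis in which every trial is published. The smaller trials (larger SE) enrolled higher-risk patients, so their true log odds ratio is more negative. For clarity each trial reports its true effect exactly, with no sampling error. The Egger intercept is nonetheless clearly nonzero.

```python
def egger_intercept(effects, ses):
    """Intercept of the ordinary least squares regression of y/se on 1/se."""
    x = [1 / s for s in ses]
    y = [e / s for e, s in zip(effects, ses)]
    k = len(x)
    mx, my = sum(x) / k, sum(y) / k
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx

# Hypothetical trials, all published: the true effect grows with the SE
# because smaller trials enrolled higher-risk patients.
ses = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
effects = [-0.2 - 0.8 * s for s in ses]
print(round(egger_intercept(effects, ses), 6))  # prints -0.8
```

Because y/SE = -0.2/SE - 0.8 here, the intercept is exactly -0.8. A regression test would flag this pattern, although nothing was suppressed.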
Summary points
- Methods used by evidence based medicine should be evaluated with rigorous standards
- The funnel plot is widely used in systematic reviews and meta-analyses as a test for publication bias
- Asymmetry of the funnel plot, either visually interpreted or statistically tested, does not accurately predict publication bias
- Inappropriate or misleading use of funnel plot tests may do more harm than good
Heterogeneity may sometimes be both statistically and clinically obvious: the studies may be examining different questions.[21] Yet the authors of a meta-analysis may still pool all studies together for the funnel plot, even though they analysed them separately in the main analysis. The meta-analysis investigating the relation between garlic consumption and cancer is one such case.[21] In other cases, it may not be possible to identify a source for the existing heterogeneity.[22] Simulation studies of funnel plots have found that bias may be incorrectly inferred if studies are heterogeneous.[21,23]
For example, figure 2 shows the funnel plot for a meta-analysis of inhaled disodium cromoglicate as maintenance therapy in children with asthma.[24] The authors found both statistical and clinical heterogeneity, yet they published a funnel plot (fig 2, top), stating: “Studies with low precision and negative outcome are under-represented, indicating publication bias.” Grouping the studies according to age of participants (fig 2, middle) and study design (fig 2, bottom) creates a different impression.
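The effect of pooling clinically different groups into one funnel can be reproduced with a toy calculation. The Python sketch below uses two hypothetical subgroups, not the cromoglicate data. Subgroup A contains small trials with a larger true benefit and subgroup B contains large trials with a smaller one. Within each subgroup the trials report the same effect with no sampling error. Each subgroup's Egger intercept is therefore zero, while the pooled funnel shows marked "asymmetry".

```python
def egger_intercept(effects, ses):
    """Intercept of the ordinary least squares regression of y/se on 1/se."""
    x = [1 / s for s in ses]
    y = [e / s for e, s in zip(effects, ses)]
    k = len(x)
    mx, my = sum(x) / k, sum(y) / k
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx

# Hypothetical subgroups: (standard errors, true log odds ratio)
subgroups = {
    "A (small trials)": ([0.3, 0.4, 0.5], -0.6),
    "B (large trials)": ([0.05, 0.1, 0.15], -0.1),
}
all_ses, all_effects = [], []
for name, (group_ses, theta) in subgroups.items():
    group_effects = [theta] * len(group_ses)  # every trial reports theta
    all_ses += group_ses
    all_effects += group_effects
    print(f"{name}: Egger intercept = "
          f"{egger_intercept(group_effects, group_ses):.3f}")
print(f"pooled: Egger intercept = {egger_intercept(all_effects, all_ses):.3f}")
```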
Finally, we have no gold standard against which to compare the results of funnel plot tests. A true standard measure of publication bias would require prospective registries of trials with detailed knowledge of which studies have been published and which are unpublished. It would then be feasible to check whether tests of publication bias accurately capture the presence of unpublished studies, and whether one variant performs better than others. Because efforts at study registration have only recently started,[25] this evaluation is currently difficult. Although a large number of alternative tests for publication bias exist,[26] none has been validated against such a standard.
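If such a registry-based reference standard existed, each test could be scored exactly like a diagnostic test. The Python sketch below assumes a hypothetical setup in which each meta-analysis is labelled by the registry as omitting, or not omitting, completed but unpublished trials. Each test variant then flags it as biased or not. The names and data are illustrative only. The sketch reports sensitivity, specificity, and overall accuracy for each variant.

```python
from dataclasses import dataclass

@dataclass
class Accuracy:
    sensitivity: float  # share of truly biased meta-analyses the test flags
    specificity: float  # share of unbiased meta-analyses the test clears
    accuracy: float     # share of all meta-analyses classified correctly

def _ratio(num, den):
    return num / den if den else float("nan")

def evaluate_bias_test(test_positive, has_unpublished):
    """Compare a publication bias test with a reference standard.

    test_positive[i]: whether the test flagged meta-analysis i as biased
    has_unpublished[i]: whether, per the (hypothetical) trial registry,
        meta-analysis i omits completed but unpublished trials
    """
    pairs = list(zip(test_positive, has_unpublished))
    tp = sum(1 for t, truth in pairs if t and truth)
    tn = sum(1 for t, truth in pairs if not t and not truth)
    n_pos = sum(1 for _, truth in pairs if truth)
    n_neg = len(pairs) - n_pos
    return Accuracy(
        sensitivity=_ratio(tp, n_pos),
        specificity=_ratio(tn, n_neg),
        accuracy=_ratio(tp + tn, len(pairs)),
    )

# Hypothetical reference standard and two test variants
truth = [True, True, True, False, False, False, False, True]
variant_1 = [True, False, True, False, True, False, False, False]
variant_2 = [True, True, False, True, True, False, True, True]
print(evaluate_bias_test(variant_1, truth))
print(evaluate_bias_test(variant_2, truth))
```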
Prevention of bias
In conclusion, evidence based methods, including the funnel plot, should be evidence based. If treatment decisions are made on the basis of misleading methodological tests, the costs to patients and society could be high. Decisions guided by the easy assurance of a symmetrical funnel plot may overlook serious bias. Equally, it may be misleading to discredit and abandon valid evidence simply because of an asymmetrical funnel plot. The prevention of publication bias is much more desirable than any diagnostic or corrective analysis.
Details of BMJ systematic reviews mentioning funnel plots are on bmj.com
Contributors and sources: The authors have worked for a long time on methodological research in systematic reviews and meta-analyses. The idea was generated by JL and expanded by the other authors. The manuscript was written by JPAI and JL and commented on critically by the other authors. All authors approved the final version. JL is the guarantor.
Funding: Supported in part by AHRQ grant R01 HS10254. Competing interests: None declared.