Reasons or excuses for avoiding meta-analysis in forest plots
BMJ 2008; 336 doi: https://doi.org/10.1136/bmj.a117 (Published 19 June 2008) Cite this as: BMJ 2008;336:1413
- John P A Ioannidis, professor1,
- Nikolaos A Patsopoulos, research fellow1,
- Hannah R Rothstein, professor2
- 1Clinical Trials and Evidence Based Medicine Unit, Department of Hygiene and Epidemiology, University of Ioannina School of Medicine and Biomedical Research Institute, Foundation for Research and Technology-Hellas, Ioannina 45110, Greece
- 2Department of Management, Zicklin School of Business, City University of New York, New York, NY 10010, USA
- Correspondence to: J P A Ioannidis
- Accepted 30 March 2008
Some systematic reviews simply assemble the eligible studies without performing meta-analysis. This may be a legitimate choice. However, an interesting situation arises when reviews present forest plots (quantitative effects and uncertainty per study) but do not calculate a summary estimate (the diamond at the bottom). These reviews imply that it is important to visualise the quantitative data but final synthesis is inappropriate. For example, a review of sexual abstinence programmes for HIV prevention claimed that owing to “data unavailability, lack of intention-to-treat analyses, and heterogeneity in programme and trial designs… a statistical meta-analysis would be inappropriate.”1 As we discuss, options almost always exist for quantitative synthesis and sometimes they may offer useful insights. Reviewers and clinicians should be aware of these options, reflect carefully on their use, and understand their limitations.
Why meta-analysis is avoided
Of the 1739 systematic reviews that included at least one forest plot with at least two studies in issue 4 of the Cochrane Database of Systematic Reviews (2005), 135 reviews (8%) had 559 forest plots with no summary estimate.
The reasons provided for avoiding quantitative synthesis typically revolved around heterogeneity (table 1). The included studies were thought to be too different, either statistically or in clinical (including methodological) terms. Differences in interventions, metrics, outcomes, designs, participants, and settings were implied.
How large is too large heterogeneity?
This question of lumping versus splitting is difficult to answer objectively for clinical heterogeneity. Logic models based on the PICO (population-intervention-comparator-outcomes) framework may help to deal with the challenges of deciding what to include and what to exclude. Still, different reviewers, readers, and clinicians may disagree on the (dis)similarity of interventions, outcomes, designs, participant characteristics, and settings.
No widely accepted quantitative measure exists to grade clinical heterogeneity. Nevertheless, it may be better to examine clinical differences in a meta-analysis rather than use them as a reason for not conducting one. For example, a review identified 40 trials of diverse interventions to prevent falls in elderly people.2 Despite large diversity in the trials, the authors did a meta-analysis and also examined the effectiveness of different interventions. The analysis suggested that evidence was stronger for multifactorial risk assessment and management programmes and exercise and more inconclusive for environmental modifications and education.
Statistical heterogeneity can be measured, for example by calculating I² and its uncertainty.3 4 5 I², the proportion of variation across studies that is due to heterogeneity rather than chance, takes values from 0% to 100%. In the 22 forest plots including four or more studies that avoided synthesis because of heterogeneity, I² ranged between 35% and 98% with a median of 71% (figure 1). Yet, 86 of the 1011 forest plots where reviewers had no hesitation in performing meta-analysis had I² exceeding 71%.5 The lower 95% confidence limit of I² was <25% in 11 of the 22 non-summarised forest plots; that is, for half of them we cannot exclude that statistical heterogeneity is limited. Therefore, even for statistical heterogeneity, there is substantial variability in what different reviewers consider too much. Statistical heterogeneity alone is a weak and inconsistently used argument for avoiding quantitative synthesis.
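To make the role of uncertainty concrete, the following is a minimal sketch of how I² and an approximate 95% interval can be computed from study effect estimates and their variances. It assumes inverse-variance weights, Cochran's Q, and a test-based interval built on ln(H), where H² = Q/(k − 1); the article does not state which interval method was used for the survey, so this choice is an assumption. The function name and the input values are hypothetical.

```python
import math

def i_squared(effects, variances, z=1.96):
    """Return (Q, I2 in percent, (lower, upper) interval for I2 in percent)."""
    k = len(effects)
    if k < 3:
        raise ValueError("the interval used here needs at least three studies")
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))  # Cochran's Q
    df = k - 1

    def h_to_i2(h):
        # I2 = (H^2 - 1) / H^2, truncated at zero
        return 0.0 if h <= 1.0 else 100.0 * (h * h - 1.0) / (h * h)

    h = math.sqrt(q / df)
    # Test-based standard error of ln(H) (assumed method, see lead-in)
    if q > k:
        se_ln_h = 0.5 * (math.log(q) - math.log(df)) / (
            math.sqrt(2 * q) - math.sqrt(2 * k - 3)
        )
    else:
        se_ln_h = math.sqrt((1.0 / (2 * (k - 2))) * (1.0 - 1.0 / (3 * (k - 2) ** 2)))
    lower = h_to_i2(h * math.exp(-z * se_ln_h))
    upper = h_to_i2(h * math.exp(z * se_ln_h))
    return q, h_to_i2(h), (lower, upper)

# Hypothetical log odds ratios and variances for four small studies
q, i2, (low, high) = i_squared([0.1, 0.5, 0.2, 0.6], [0.04, 0.04, 0.04, 0.04])
print(f"Q={q:.2f}  I2={i2:.0f}%  interval {low:.0f}% to {high:.0f}%")
```

With only four studies, as in this made-up example, the interval is wide and its lower limit reaches 0%, which is the situation described above where limited statistical heterogeneity cannot be excluded.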
Potential methods for use in the presence of heterogeneity
Table 2 provides methodological approaches to quantitative synthesis of data that some researchers may deem unsuitable for meta-analysis. It is unknown whether researchers preparing systematic reviews were aware of these methods but thought that they were inapplicable; were aware of their existence but lacked the necessary experience and software; or were unaware of their existence. Detailed discussion of methods is beyond our scope here, but we present the principal options and caveats and provide references for interested readers. Some methods are experimental and extra caution is needed.
Models that can accommodate statistical heterogeneity between studies include traditional random effects (models that assume that different studies have different true treatment effects),6 meta-regressions (regressions that examine whether the treatment effect is related to one or more characteristics of the studies or patients),7 and bayesian methods (methods that combine various prior assumptions with the observed data).8 Each has caveats. Random effects do not explain the heterogeneity, can distort estimates when large and smaller studies differ in results and smaller studies are more biased, and can be unstable with limited evidence;9 meta-regressions may suffer from post hoc selection of variables, the ecological fallacy, and poor performance with few studies;10 and bayesian results may depend on prior specifications.8 Meta-analysis of data at the individual level may permit fuller exploration of heterogeneity, but these data are usually unavailable.11
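The caveat about random effects and small studies can be illustrated with a short sketch. It assumes the DerSimonian-Laird moment estimator of the between-study variance, which is one common random effects method but is not confirmed by the text as the intended one. The function name and the three hypothetical studies (one large study showing no effect, two small studies showing a large effect) are illustrative only.

```python
import math

def pool(effects, variances):
    """Return fixed effect and DerSimonian-Laird random effects summaries."""
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance (fixed effect) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi * wi for wi in w) / sw
    # Moment estimate of the between-study variance, truncated at zero
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]  # random effects weights
    sw_star = sum(w_star)
    random_effects = sum(wi * yi for wi, yi in zip(w_star, effects)) / sw_star
    return {
        "fixed": fixed,
        "se_fixed": math.sqrt(1.0 / sw),
        "tau2": tau2,
        "random": random_effects,
        "se_random": math.sqrt(1.0 / sw_star),
    }

# Hypothetical data: one large null study, two small studies with big effects
result = pool([0.0, 1.0, 1.0], [0.01, 0.25, 0.25])
print({name: round(value, 3) for name, value in result.items()})
```

In this made-up example the random effects summary moves well away from the large study and towards the two small ones, because adding the between-study variance to every study makes the weights more equal. If the small studies are the more biased ones, that shift is the distortion described above.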
The availability of multiple interventions for the same condition and indication is increasingly common. Different regimens may be merged in common groups, but differences in treatment effects of merged regimens may remain unrecognised. Multiple treatments meta-analysis could be used to examine all the different treatments used for a given condition. For example, 242 chemotherapy trials are available covering 137 different regimens for advanced colorectal cancer.12 The number of possible comparisons is prohibitive. A meta-analysis grouped these regimens into 12 treatment types and then performed a network analysis that evaluated their relative effectiveness. Instead of taking one comparison at a time, the network considered concomitantly all the data from all relevant comparisons. Networks integrate information from both direct and indirect comparisons of different treatments.13 14 15 Main caveats include possible inconsistency in results between direct and indirect comparisons and the still limited experience with networks.13 14 15 16
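The idea of combining direct and indirect evidence can be sketched with the simplest building block of a network: an adjusted indirect comparison of treatments A and C through a common comparator B, followed by a check of whether the direct and indirect estimates agree. This is not the full multiple treatments model used in the colorectal cancer example, only an illustration under stated assumptions: effects are on an additive scale such as the log odds ratio, the two sets of trials are independent, and all numbers and function names are hypothetical.

```python
import math
from statistics import NormalDist

def indirect_comparison(effect_ab, var_ab, effect_cb, var_cb):
    """A versus C estimated through the common comparator B."""
    # The difference of two independent estimates; their variances add
    return effect_ab - effect_cb, var_ab + var_cb

def inconsistency(direct, var_direct, indirect, var_indirect):
    """Difference between direct and indirect estimates, with a z test."""
    diff = direct - indirect
    se = math.sqrt(var_direct + var_indirect)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    return diff, se, p_value

# Number of pairwise comparisons among 137 regimens
print(math.comb(137, 2))

# Hypothetical log odds ratios: A v B, C v B, and a direct A v C estimate
indirect, var_indirect = indirect_comparison(-0.5, 0.04, -0.2, 0.05)
diff, se, p_value = inconsistency(-0.1, 0.04, indirect, var_indirect)
print(round(indirect, 2), round(var_indirect, 2), round(diff, 2), round(p_value, 2))
```

The indirect estimate is always less precise than either of the comparisons it is built from, and a test of inconsistency based on few trials has little power, so a non-significant difference does not show that direct and indirect evidence agree.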
Clinical trials on the same topic also commonly use many different outcomes. Meta-analysis of one outcome at a time offers a fragmented picture. Some outcomes simply differ in their measures—for example, global clinical improvement measured on a continuous scale or as a binary end point (yes/no). Continuous scales can be converted into binary ones and standardised metrics (popular in the social sciences)17 can accommodate different outcomes that measure the same construct (such as various psychometric scales). However, for medical applications, many clinicians think that anything other than plain absolute risk is insufficiently intelligible to inform practice and policy.18 19 Finally, some outcomes may represent truly different end points with partial correlation among themselves (for example, serum creatinine, creatinine clearance, progression to end stage renal disease, initiation of renal replacement therapy) and multivariate meta-analysis models can cater for two or more correlated outcomes.20 21 22 Such models borrow strength from all the available outcomes across trials. The main caveats are specification of correlations and sparse data.
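As an illustration of how standardised metrics can put different scales on a common footing, the sketch below computes a standardised mean difference with the usual small sample correction (Hedges' g) and converts it to a log odds ratio. The conversion assumes that the underlying continuous outcome follows a logistic distribution, which is an assumption of this example and not a claim made in the text. The function names and trial summaries are hypothetical.

```python
import math

def hedges_g(mean1, sd1, n1, mean2, sd2, n2):
    """Standardised mean difference with small sample correction, and its variance."""
    pooled_sd = math.sqrt(
        ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    )
    d = (mean1 - mean2) / pooled_sd
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    j = 1 - 3 / (4 * (n1 + n2 - 2) - 1)  # small sample correction factor
    return j * d, j ** 2 * var_d

def smd_to_log_odds_ratio(smd, var_smd):
    """Convert a standardised mean difference to a log odds ratio (logistic assumption)."""
    factor = math.pi / math.sqrt(3)
    return smd * factor, var_smd * factor ** 2

# Two hypothetical trials measuring the same construct on different scales
g_scale_a, var_a = hedges_g(20, 10, 50, 15, 10, 50)  # scale with SD of 10
g_scale_b, var_b = hedges_g(4, 2, 50, 3, 2, 50)      # scale with SD of 2
print(round(g_scale_a, 3), round(g_scale_b, 3))
print(smd_to_log_odds_ratio(g_scale_a, var_a))
```

Both hypothetical trials yield the same standardised effect despite using different units, so they could enter the same forest plot. The result is expressed in standard deviation units or as an odds ratio, not as an absolute risk, which is the intelligibility concern raised above.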
The combination of data from randomised and non-randomised studies is possible using traditional meta-analysis models. The main caveats are the spurious precision,23 confounding, and potentially stronger selective reporting biases in observational studies.24 However, the generalised synthesis of both randomised and non-randomised studies on the same topic may offer complementary information.25 26 27 Other designs that require special care in meta-analysis include cluster28 and crossover trials.29
The authors of several systematic reviews state only that “data synthesis is inappropriate” or allude vaguely to “clinical heterogeneity.” Specifying the reasons would improve transparency of the implicit judgments. Finally, some reviews argue that data are too limited. However, meta-analysis is feasible even with two studies. For most medical questions, only a few studies exist. Limited data typically yield uncertain estimates, but the quantitative accuracy of meta-analysis may actually be a reason to avoid narrative interpretation without synthesis. Limited data may also result from asking questions that are too narrow, trying to make data too similar before inclusion in the same forest plot. Forced similarity may fragment information; it is almost unavoidable that trials will differ in at least minor ways.
To synthesise or not?
If the limitations of these methods are properly acknowledged, the use of quantitative synthesis may be preferable to qualitative interpretation of the results, or hidden quasi-quantitative analysis—for example, judging studies based on P values of single studies being above or below 0.05. Such an approach can actually lead to the wrong conclusion, especially when statistical power is low.31 For example, if an intervention is effective but two studies are done with 40% power each, the chance of both of them getting a significant result is only 16%.
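The 16% figure is simply 0.4 × 0.4 for two independent studies. The sketch below reproduces it and, as a contrast, estimates the power of a pooled analysis of the same two studies. The pooled calculation rests on assumptions that are not in the text: the studies are of equal size and estimate the same true effect, they are combined with a fixed effect model, the test is two sided at 0.05, a normal approximation holds, and the negligible chance of a significant result in the wrong direction is ignored. The function names are hypothetical.

```python
from statistics import NormalDist

def chance_all_significant(power, n_studies):
    """Probability that every one of n independent studies is significant."""
    return power ** n_studies

def pooled_power(power_each, n_studies, alpha=0.05):
    """Approximate power of a fixed effect pooled analysis of equal sized studies."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Expected z statistic of a single study implied by its power
    delta = z_crit + norm.inv_cdf(power_each)
    # Pooling n equal studies multiplies the expected z statistic by sqrt(n)
    return norm.cdf(delta * n_studies ** 0.5 - z_crit)

print(round(chance_all_significant(0.4, 2), 2))  # both studies significant
print(round(pooled_power(0.4, 2), 2))            # pooled analysis significant
```

Under these assumptions the pooled analysis detects the effect about two thirds of the time, whereas requiring both single studies to have P<0.05 succeeds only 16% of the time.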
More complex “home made” qualitative rules may further compound the methodological problems. This applies not only to reviews that avoid the final synthesis but also to entirely narrative reviews without any forest plots. For example, the reviewers of interventions to promote physical activity in children and adolescents “used scores to indicate effectiveness—that is, whether there was no difference in effect between control and intervention group (0 score), a positive or negative trend (+ or −), or a significant difference (P<0.05) in favour of the intervention or control group (++ or −−, respectively) . . . If at least two thirds (66.6%) of the relevant studies were reported to have significant results in the same direction then we considered the overall results to be consistent.”32 Such rules have poor performance validity.
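How such a rule behaves can be checked with an exact binomial calculation. The sketch below simplifies the quoted rule to one condition, namely that at least two thirds of the studies are significant in favour of the intervention, and assumes independent studies that all have the same power. Both the simplification and the function name are assumptions made for illustration.

```python
from math import comb

def chance_rule_declares_consistency(power, n_studies):
    """Probability that at least two thirds of the studies are significant."""
    total = 0.0
    for successes in range(n_studies + 1):
        # Integer form of successes / n_studies >= 2/3 avoids rounding problems
        if 3 * successes >= 2 * n_studies:
            total += (
                comb(n_studies, successes)
                * power ** successes
                * (1 - power) ** (n_studies - successes)
            )
    return total

# Six studies of a truly effective intervention, each with 40% power
print(round(chance_rule_declares_consistency(0.4, 6), 4))
```

With six studies of 40% power each, the rule would recognise a genuinely effective intervention only about 18% of the time, so a narrative verdict of inconsistent results may say more about low power than about the intervention.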
Meta-analysis is often understood solely as a means of combining information to produce a single overall estimate of effect. However, one of its advantages is to assess, examine, and model the consistency of effects and improve understanding of moderator variables, boundary conditions, and generalisability.8 33 Different patients and different studies are unavoidably heterogeneous. This diversity and the uncertainty associated with it should be explored whenever possible. Obtaining estimates of treatment effect (rather than simple narrative evaluations) may allow more rational decisions about the use of interventions in specific patients or settings. More sophisticated methods may also capture and model uncertainties more fully and thus may actually reach more conservative conclusions than more naive approaches. However, it is then essential that their assumptions and limitations are clearly stated and inferences drawn cautiously. Any meta-analysis method, simple or advanced, may be misleading if we do not understand how it works.
Summary points
- Some reviews extract numerical data and generate forest plots but avoid meta-analysis
- The typical reason for not doing meta-analysis is high heterogeneity across studies
- Appropriate quantitative methods exist to handle heterogeneity and may be considered if their assumptions and limitations are acknowledged
- Narrative summaries may sometimes be misleading
Contributors and sources: The authors have a longstanding interest in meta-analysis and sources of heterogeneity in clinical research. JPAI had the original idea for the survey. NAP and JPAI extracted the data for the survey and NAP also did the statistical heterogeneity analyses. HRR rekindled the interest in pursuing the project further and the discussion evolved with interactions between JPAI, NAP, and HRR. We thank Iain Chalmers and Alex Sutton for comments on the manuscript. JPAI wrote the manuscript and the coauthors commented on it and approved the final draft.
Competing interests: None declared.