Measuring inconsistency in meta-analyses
BMJ 2003; 327 doi: https://doi.org/10.1136/bmj.327.7414.557 (Published 04 September 2003) Cite this as: BMJ 2003;327:557
I² Is Subject to the Same Statistical Power Problems as Cochran's Q
Tania Huedo-Medina, post-doctoral fellow1,2, Blair T Johnson, professor
of psychology1
1Center for Health, Intervention, and Prevention
(CHIP), University of Connecticut, 2006 Hillside Road, Unit 1248, Storrs, CT
06269-1248 USA. 2Correspondence: tania.huedo-medina{at}uconn.edu.
In their popular article, Higgins and colleagues (2003) provided a valuable explanation of the importance of assessing heterogeneity in overall meta-analytic findings, and of how their new index helps scholars attain this goal. As these authors review, there are three general ways to assess heterogeneity in meta-analysis, but each has a liability for interpretation. First, one can assess the between-studies variance, τ², but its values depend on the particular effect-size metric used, along with other factors. The second is Cochran's Q, which follows a chi-square distribution and permits inferences about the null hypothesis of homogeneity. (It is actually not a test of heterogeneity, as Higgins and colleagues assert, but of the hypothesis of homogeneity.) The problem with Q is that it has poor power to detect true heterogeneity when the number of studies is small. Because neither of these first two methods has a standardized scale, they are poorly equipped for comparing the degree of homogeneity across meta-analyses.
The third and final way to assess heterogeneity is to calculate a scale-free index of variability. The Birge ratio, introduced in 1932, has been the most commonly used scale-free index for quantifying the consistency of study findings; it is defined as the ratio of a chi-square statistic to its degrees of freedom. Because the degrees of freedom are the expected value of the chi-square, the Birge ratio is close to 1.00 when the chi-square reflects only random variation. Thus, to the extent that the Birge ratio exceeds 1.00, the results of a set of studies lack homogeneity; that is, they are more varied than one would expect from sampling error alone.
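The quantities just described are straightforward to compute. The sketch below, using made-up effect sizes and variances purely for illustration, calculates Cochran's Q with inverse-variance weights and then the Birge ratio Q/df:

```python
# Sketch: Cochran's Q and the Birge ratio for a small meta-analysis.
# The effect sizes and variances below are hypothetical illustrative numbers.

def cochran_q(effects, variances):
    """Q = sum of w_i * (y_i - pooled)^2, with inverse-variance weights w_i."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))

effects = [0.30, 0.10, 0.55, 0.25]    # hypothetical study effect sizes
variances = [0.04, 0.09, 0.05, 0.02]  # hypothetical within-study variances

q = cochran_q(effects, variances)
df = len(effects) - 1
birge_ratio = q / df  # ~1.00 under homogeneity; >1.00 suggests excess variation
print(round(q, 3), round(birge_ratio, 3))
```

Under homogeneity the expected value of Q is its degrees of freedom, which is why dividing by df yields a ratio centred near 1.00.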
Higgins and Thompson (2002; Higgins et al., 2003) extended the Birge ratio to the I² index in an effort to overcome the shortcomings of Q and τ². Like the Birge ratio, the I² index is a scale-free index of variability defined from the ratio of Q to its degrees of freedom. The advantage of this new index is its easier interpretation, because it expresses variability on a scale-free range as a percentage from 0 to 100%. Although Higgins et al. claimed that an advantage of the I² index is that it “does not inherently depend on the number of studies in the meta-analysis” (p. 559), they provided no evidence to support this claim.
Direct comparisons of I² to Q are difficult because only the latter index has a known sampling distribution that can be used to estimate the probability of observing a particular value. To counter this problem with I², Higgins and Thompson (2002) developed approximate confidence intervals for I² based on the Birge ratio (which they termed the H index). Huedo-Medina et al. (2006) used these confidence intervals to compare the performance of I² to Q in a Monte Carlo simulation across a wide variety of potential meta-analytic conditions. Their results demonstrated that, like Q, I² suffers from low statistical power when the number of studies is small. Specifically, the confidence intervals around I² behave very similarly to tests of Q in terms of Type I error and statistical power. Readers can examine this conclusion for themselves: in each of the 14 examples that Higgins et al. (2003) provided, the inference about consistency reached from the I² index is identical to that reached from Q.
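The link between Q, H, and I² can be made concrete. The sketch below computes I² from a hypothetical Q and k, together with an approximate 95% confidence interval built on H = √(Q/df); the standard-error expressions for ln(H) are written from the test-based formulas in Higgins and Thompson (2002), so the exact forms should be checked against that paper before any serious use:

```python
import math

def i_squared_with_ci(q, k, z=1.96):
    """I^2 from Cochran's Q with k studies (k >= 3), plus an approximate
    95% CI built on H = sqrt(Q/df), per Higgins & Thompson (2002)."""
    df = k - 1
    h = math.sqrt(q / df)
    # Standard error of ln(H): test-based formulas from Higgins & Thompson.
    if q > k:
        se_ln_h = (math.log(q) - math.log(df)) / (
            2 * (math.sqrt(2 * q) - math.sqrt(2 * k - 3)))
    else:
        se_ln_h = math.sqrt((1.0 / (2 * (k - 2))) * (1 - 1.0 / (3 * (k - 2) ** 2)))

    def to_i2(hh):
        # I^2 = 100% * (H^2 - 1) / H^2, truncated at 0.
        return max(0.0, 100.0 * (hh * hh - 1) / (hh * hh))

    lo_h = math.exp(math.log(h) - z * se_ln_h)
    hi_h = math.exp(math.log(h) + z * se_ln_h)
    return to_i2(h), to_i2(lo_h), to_i2(hi_h)

# Hypothetical example: Q = 29 from k = 10 studies.
i2, lo, hi = i_squared_with_ci(29.0, 10)
```

If the resulting interval reaches down to 0%, the data are compatible with homogeneity, which is exactly why these intervals behave like the Q test and share its power limitations with few studies.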
We concur with Higgins and colleagues that (1) reporting Q (with its associated p value) and I² (with its confidence intervals) makes it easier to interpret the degree of consistency in a set of study outcomes; (2) using I² greatly facilitates comparisons across meta-analyses; and (3) the values of I² themselves do not depend on the number of studies. Nonetheless, inferences from both Q and I² can be misleading when the number of studies is small. Under such circumstances, analysts should still interpret results with caution.
References
Birge, R. T. (1932). The calculation of errors by the method of least squares. Physical Review, 40, 207-227.
Higgins, J. P. T., & Thompson, S. G. (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21, 1539-1558.
Higgins, J. P. T., Thompson, S. G., Deeks, J. J., & Altman, D. G. (2003). Measuring inconsistency in meta-analyses. British Medical Journal, 327, 557-560.
Huedo-Medina, T. B., Sánchez-Meca, J., Marín-Martínez, F., & Botella, J. (2006). Assessing heterogeneity in meta-analysis: I² or Q statistic? Psychological Methods, 11, 193-206.
Competing interests: None declared
A better method of dealing with inconsistency in meta-analyses
First, assessing heterogeneity does not solve the problem of heterogeneity in meta-analyses; the random effects meta-analysis was proposed to address this. However, in the presence of a heterogeneous set of studies, a random effects meta-analysis will award relatively more weight to smaller studies than such studies would receive in a fixed effect meta-analysis. If, for some reason, the results of smaller studies are systematically different from the results of larger ones, which can happen as a result of publication bias or low study quality bias [1, 2], then a random effects meta-analysis will exacerbate the effects of the bias.
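The weighting shift described above is easy to demonstrate. The sketch below, with a hypothetical small study and large study, estimates the between-study variance τ² by the DerSimonian-Laird method and compares the small study's share of the total weight under fixed-effect and random-effects weighting:

```python
# Sketch: under a random-effects model (DerSimonian-Laird tau^2), a small
# study gets a relatively larger share of the weight than under a fixed
# effect model. All numbers are hypothetical.

def dl_tau2(effects, variances):
    """DerSimonian-Laird estimate: tau^2 = max(0, (Q - df) / C)."""
    w = [1.0 / v for v in variances]
    pooled = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - pooled) ** 2 for wi, y in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - df) / c)

effects = [0.60, 0.10]    # small study vs large study (hypothetical)
variances = [0.20, 0.01]  # the small study has the larger variance

tau2 = dl_tau2(effects, variances)
fe_w = [1.0 / v for v in variances]           # fixed-effect weights
re_w = [1.0 / (v + tau2) for v in variances]  # random-effects weights
fe_share = fe_w[0] / sum(fe_w)  # small study's weight share, fixed effect
re_share = re_w[0] / sum(re_w)  # small study's weight share, random effects
```

Because τ² is added to every study's variance, the weights are flattened toward equality, so if small studies are biased, that bias gains influence on the pooled estimate.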
Second, if the quality of the primary material is inadequate, this may invalidate the conclusions of the review, regardless of the random-effects model. Such inadequacy may occur accidentally or deliberately, in various ways: in the randomization process, in the masking of the allocated treatment, in the random generation of number sequences, in the analysis, or when double-blind masking is not implemented. The need for analysis of the quality of these studies has therefore become obvious, and the solution involves more than just inserting a random term based on heterogeneity [3], as is done with the random effects model.
To solve this problem, replacing the random-effects meta-analysis with a quality effects meta-analysis has been proposed [4]. This approach incorporates the heterogeneity of effects in the analysis of the overall interventional efficacy. However, unlike the random effects model, which is based on observed between-trial heterogeneity, it introduces an adjustment based on measured methodological heterogeneity between studies. A simple noniterative procedure for computing the combined effect size under this model has been published [4], and this could represent a more convincing alternative to the random effects model.
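To give a feel for quality-based adjustment, the toy sketch below pools effects with weights scaled by a per-study quality score. This is emphatically not the published quality effects estimator of reference 4, which uses a specific redistribution scheme; it is only a minimal illustration of the underlying idea of letting measured methodological quality, rather than observed heterogeneity, modify the weights:

```python
# Toy illustration only -- NOT the Doi & Thalib quality effects estimator.
# quality[i] in (0, 1]: 1 = methodologically sound, <1 = down-weighted.

def quality_weighted_pool(effects, variances, quality):
    """Pool effects with inverse-variance weights scaled by quality scores."""
    w = [qi / v for qi, v in zip(quality, variances)]
    return sum(wi * y for wi, y in zip(w, effects)) / sum(w)

# Two hypothetical studies with equal variances but unequal quality:
# the pooled estimate is pulled toward the higher-quality study.
pooled = quality_weighted_pool([0.60, 0.10], [0.04, 0.04], [1.0, 0.5])
```

With equal variances a conventional pooled estimate would sit midway between the two effects; scaling by quality shifts it toward the better-conducted study, which is the intuition the quality effects model formalizes.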
References
1. Poole C, Greenland S. Random-effects meta-analyses are not always
conservative. Am J Epidemiol 1999; 150:469-75.
2. Kjaergard LL, Villumsen J, Gluud C. Reported methodologic quality
and discrepancies between large and small randomized trials in meta-
analyses. Ann Intern Med 2001; 135:982-9.
3. Senn S. Trying to be precise about vagueness. Stat Med 2007;
26:1417-30.
4. Doi SA, Thalib L. A quality effects model for meta-analysis. Epidemiology 2008; 19:94-100.
Competing interests: None declared