
Rapid response to:

Research Methods & Reporting

Interpretation of random effects meta-analyses

BMJ 2011; 342 doi: https://doi.org/10.1136/bmj.d549 (Published 10 February 2011) Cite this as: BMJ 2011;342:d549

Rapid Response:

Now: Prediction intervals - was I² really an advantage?

We followed with interest the discussion of random effects meta-analyses by Riley et al. (1). Basing treatment recommendations on carefully conducted meta-analyses of independent studies is of paramount importance, particularly because licensing of a new drug requires, as a standard, the submission of two successful clinical trials (2,3). As both studies need to be significant, the key question in the evaluation of a meta-analysis is not the significance of the treatment effect, but to what extent the included studies support each other (an assessment of consistency, or heterogeneity).

In 2002 and 2003 Higgins and Thompson (4,5) criticised the traditional heterogeneity test based on Cochran's Q (6) as unhelpful for deciding on the presence or clinical importance of heterogeneity in a meta-analysis. They further highlighted that Cochran's heterogeneity test depends on the number of studies and has low power (4,5). Higgins et al. proposed the new quantity I² as an alternative description of heterogeneity in meta-analysis and argued that I² does not inherently depend on the number of included studies. To date, the British Medical Journal publication by these authors (5) has been cited more than 400 times, and in this recent publication (1) I² was again recommended instead of Q.

Effectively, I² is (Q − (k − 1)) divided by Q, where k denotes the number of studies; I² therefore depends directly on Q and vice versa. For this reason we do not see the claimed advantages and, even further, find the assessment of heterogeneity with I² misleading.
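
Spelled out (our own rearrangement, using the untruncated form of I² for simplicity):

    I² = (Q − (k − 1)) / Q,   which is equivalent to   Q = (k − 1) / (1 − I²),

so that any threshold placed on I² is nothing but an equivalent threshold placed on Q for the given number of studies; for instance, I² > 0.5 corresponds to Q > 2(k − 1).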

Table 1 displays, for low, moderate and high I² values and typical numbers of studies, the Q statistics and corresponding p-values at which both strategies arrive at the same conclusion. For example, regarding heterogeneity as important if I² is larger than 0.5 translates, for 2 studies, into taking p-values lower than 0.157 as indicating substantial heterogeneity.
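
To make the correspondence concrete, the entries of such a table can be computed directly from the chi-squared distribution with k − 1 degrees of freedom. The following short Python sketch (our illustration; the chosen study numbers and I² thresholds are assumptions, not necessarily those of Table 1) inverts I² = (Q − (k − 1))/Q and evaluates the upper tail probability with scipy:

# Sketch: translate I^2 thresholds into equivalent Q values and p-values
# of Cochran's Q test (chi-squared with k - 1 degrees of freedom).
# The k values and I^2 thresholds below are illustrative assumptions.
from scipy.stats import chi2

i2_thresholds = [0.25, 0.50, 0.75]   # "low", "moderate", "high" I^2
study_numbers = [2, 3, 5, 10, 20]    # illustrative numbers of studies k

for i2 in i2_thresholds:
    for k in study_numbers:
        df = k - 1
        q = df / (1.0 - i2)          # inverts I^2 = (Q - (k - 1)) / Q
        p = chi2.sf(q, df)           # upper tail probability of Q
        print(f"I^2 > {i2:.2f}, k = {k:2d}: Q > {q:6.2f}, p < {p:.3f}")

For k = 2 and I² > 0.5 this reproduces the boundary p < 0.157 quoted above; for a fixed I² threshold the equivalent p-value boundary shrinks further as k grows.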

Several authors (e.g. 7,8) suggest using critical p-values in the order of 0.1 to 0.2 for Cochran's Q test when assessing heterogeneity. It is obvious that the proposed thresholds for I² are more generous than those proposed for Cochran's Q, and the larger the number of studies, the more heterogeneity is acceptable to I². In some instances "more than significant heterogeneity" is still acceptable in an assessment by means of I². We find it contradictory to criticise Q for having low power on the one hand and, on the other hand, to define a measure (and an assessment rule) that would require the heterogeneity test to be even more significant. For statisticians this should be even less plausible, because the findings of a statistical test that is already adjusted for the number of studies under investigation, through its degrees of freedom, are ignored. Even further, a test designed to reject the null hypothesis of homogeneity is used in this situation to provide evidence of homogeneity, so that large p-values are taken as indicating homogeneity.
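
The converse calculation (again our own sketch, with illustrative choices of k and of the significance level) shows how much heterogeneity I² tolerates: for a Q statistic that is only just significant at the conventional 5% or 10% level, the implied I² falls well below the "important" range once the number of studies is moderately large.

# Sketch: I^2 implied by a Q statistic that is just significant at level
# alpha in Cochran's test; k values and alpha levels are illustrative.
from scipy.stats import chi2

for alpha in (0.10, 0.05):
    for k in (2, 5, 10, 20):
        df = k - 1
        q = chi2.isf(alpha, df)      # critical Q at significance level alpha
        i2 = max(0.0, (q - df) / q)  # usual (truncated) definition of I^2
        print(f"alpha = {alpha:.2f}, k = {k:2d}: Q = {q:5.2f}, I^2 = {i2:.2f}")

Under these assumptions, with 20 studies a Q test that is significant at the 5% level corresponds to an I² of only about 0.37, that is, heterogeneity flagged by the test would still be labelled as no more than moderate by I².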

In conclusion, the direct comparison of I² and Cochran's Q in several realistic scenarios revealed that I² is misleading for the assessment of heterogeneity in meta-analyses, because it generally tolerates more heterogeneity than the statistical test. It is therefore questionable why a less critical measure should be used instead of Cochran's Q.

As this development started in the BMJ and "new" measures have now been proposed, we feel that it is important to take stock and to ask for what purpose we wish to assess heterogeneity and which rules should be applied. We see a general tendency to downgrade signals of heterogeneity.

Random effects models and prediction intervals are means of circumventing an in-depth discussion of what the potential sources of heterogeneity might be.

Anika Grosshennig, Theodor Framke, Armin Koch

correspondence to: grosshennig.anika@mh-hannover.de

References:

(1) Riley RD, Higgins JP, Deeks JJ. Interpretation of random effects meta-analyses. BMJ 2011;342:d549.

(2) Committee for Proprietary Medicinal Products (CPMP). Points to
consider on application with 1. meta-analyses; 2. one pivotal trial:
CPMP/EWP/2330/99, http://www.eudra.org, 2000.

(3) Food and Drug Administration U.S. Department of Health and Human
Services. Guidance for industry: Providing clinical evidence of
effectiveness for human drug and biological products.
http://www.fda.gov/cder/guidance/index.htm 1998.

(4) Higgins JP, Thompson SG. Quantifying heterogeneity in a meta-analysis.
Stat Med 2002;21(11):1539-58.

(5) Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency
in meta-analyses. BMJ 2003;327(7414):557-60.

(6) Cochran WG. The combination of estimates from different experiments.
Biometrics 1954;10:101-29.

(7) Jackson D. The power of the standard test for the presence of
heterogeneity in meta-analysis. Stat Med 2006;25(15):2688-99.

(8) Koch A, Roehmel J. Why are some meta-analyses more credible than
others? Drug Inf J 2001;35:1019-30.

Table 1: Critical boundaries for assessing whether heterogeneity is of relevance


Competing interests: No competing interests

17 May 2011
Anika Grosshennig
research associate
Theodor Framke, Armin Koch
Medical School Hannover, Department of Biostatistics