Interpretation of random effects meta-analyses
BMJ 2011; 342 doi: https://doi.org/10.1136/bmj.d549 (Published 10 February 2011) Cite this as: BMJ 2011;342:d549
All rapid responses
We followed with interest the discussion about random effects
meta-analyses by Riley et al. (1). Basing treatment recommendations on
carefully conducted meta-analyses of independent studies is of paramount
importance, particularly because licensing of a new drug requires
submission of two successful clinical trials as a standard (2,3). As both
studies need to be significant, the key question in the evaluation of a
meta-analysis is not the significance of the treatment effect, but how
far the included studies support each other (an assessment of consistency, or
In 2002 and 2003 Higgins and Thompson (4,5) criticized the
traditional heterogeneity test with Cochran's Q (6) for not being helpful
in deciding on the presence or clinical importance of heterogeneity in a
meta-analysis. They further highlighted that Cochran's heterogeneity test
depends on the number of studies and has low power (4,5). Higgins
et al. proposed the new quantity I2 as an alternative to describe
heterogeneity in meta-analysis and argued that I2 does not inherently
depend on the number of included studies. To date, the British Medical
Journal publication by these authors (5) has been cited more than 400
times, and in this recent publication I2 was again recommended instead of Q.
Effectively, I2 = (Q - (k-1)) / Q, where k denotes the number of
studies; I2 therefore depends directly on Q and vice versa. For this
reason we do not see the claimed advantages and, further, find the
assessment of heterogeneity with I2 misleading.
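The one-to-one relationship between I2 and Q described above can be sketched in a few lines of Python. This is an illustration only, not the authors' computation: the helper names (chi2_sf, i2_from_q, q_from_i2, p_threshold) are hypothetical, the truncation of I2 at zero is the customary convention rather than something stated in the letter, and the chi-square tail is computed from standard closed forms for integer degrees of freedom so that no external library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Closed forms Q(1/2, y) = erfc(sqrt(y)) and Q(1, y) = exp(-y), extended
    # with the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    if df % 2:
        a, tail = 0.5, math.erfc(math.sqrt(y))
    else:
        a, tail = 1.0, math.exp(-y)
    while a < df / 2.0:
        tail += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return tail

def i2_from_q(q, k):
    """I2 = (Q - (k - 1)) / Q, truncated at zero (assumed convention)."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - (k - 1)) / q)

def q_from_i2(i2, k):
    """Invert the definition: Q = (k - 1) / (1 - I2), valid for 0 <= I2 < 1."""
    return (k - 1) / (1.0 - i2)

def p_threshold(i2, k):
    """p-value of Cochran's Q test at the Q value implied by a given I2 and k."""
    return chi2_sf(q_from_i2(i2, k), k - 1)

if __name__ == "__main__":
    # For each number of studies k, the Q value and p-value implied by I2 = 0.5.
    for k in (2, 5, 10, 20):
        print(k, round(q_from_i2(0.5, k), 2), round(p_threshold(0.5, k), 3))
```

For k = 2 and I2 = 0.5 the implied Q is 2 on one degree of freedom, which gives a p-value of about 0.157. Running the loop for larger k shows how the implied p-value changes with the number of studies when the I2 cut-off is held fixed.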
Table 1 displays, for low, moderate and high I2 values and typical
numbers of studies, the Q statistics and corresponding p-values at which
both strategies arrive at the same conclusions. For example,
regarding heterogeneity as important if I2 is larger than 0.5 would
translate, for 2 studies, into taking p-values lower than 0.157 as indicating
Several authors (e.g. 7,8) suggest using critical values in the order
of 0.1 to 0.2 for the p-value of Cochran's Q test in the assessment of
heterogeneity. The proposed thresholds for I2 are clearly more generous
than those proposed for Cochran's Q, and the larger the number of studies,
the more heterogeneity is acceptable to I2. In some instances "more than
significant heterogeneity" is still acceptable in an assessment by means
of I2. We find it illogical to criticise Q for having low power on the
one hand and, on the other, to define a measure (and an assessment rule)
that would require the heterogeneity test to be even more significant.
For statisticians this should be even less plausible, because it ignores
the findings of a statistical test that is already corrected for the
number of studies under investigation by means of its degrees of freedom.
Furthermore, a test for rejecting the null hypothesis of homogeneity is
used in this situation to provide evidence of homogeneity. Thus large
p-values should be taken as indicating homogeneity.
In conclusion, the direct comparison of I2 and Cochran's Q in several
realistic scenarios revealed that I2 is misleading for the assessment of
heterogeneity in meta-analyses, because it generally allows more
heterogeneity than a statistical test. It is thus questionable why a less
critical measure should be used instead of Cochran's Q.
As this development started in the BMJ and "new" measures have now
been proposed, we feel that it is important to redress the balance and
to ask for what purpose we wish to assess heterogeneity and which rules
should be applied. We see a general tendency to downgrade signals of
heterogeneity.
Random effects models and prediction intervals are a means of
circumventing an in-depth discussion of the potential sources of
heterogeneity.
Anika Grosshennig, Theodor Framke, Armin Koch
correspondence to: firstname.lastname@example.org
(1) Riley RD, Higgins JP, Deeks JJ. Interpretation of random effects meta-
(2) Committee for Proprietary Medicinal Products (CPMP). Points to
consider on application with 1. meta-analyses; 2. one pivotal trial:
CPMP/EWP/2330/99, http://www.eudra.org, 2000.
(3) Food and Drug Administration U.S. Department of Health and Human
Services. Guidance for industry: Providing clinical evidence of
effectiveness for human drug and biological products.
(4) Higgins JP, Thompson SG. Quantifying heterogeneity in a meta-analysis.
Stat Med 2002;21(11):1539-58.
(5) Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency
in meta-analyses. BMJ 2003;327(7414):557-60.
(6) Cochran WG. The combination of estimates from different experiments.
(7) Jackson D. The power of the standard test for the presence of
heterogeneity in meta-analysis. Stat Med 2006;25(15):2688-99.
(8) Koch A, Roehmel J. Why are some meta-analyses more credible than
others? Drug Inf J 2001;35:1019-30.
Table 1: Critical boundaries for assessing heterogeneity being of relevance
Competing interests: No competing interests
I read with interest the interpretation of random effects
meta-analyses by Riley et al. While the statistical interpretation is all
very well, several issues make their interpretation untenable. Let us
begin by breaking through the statistical undertones that overshadow the
interpretation of random effects meta-analyses and understand what it all
means in a logical and intuitive fashion.
A random effects meta-analysis is simply the weighted average of the
effect sizes of a group of studies. The weight that is applied in this
process of weighted averaging in a random effects meta-analysis is
achieved in two steps:
Step 1: inverse variance weighting
Step 2: Un-weighting of this inverse variance weighting by applying a
random effects variance component (REVC) that is simply derived from the
extent of variability of the effect sizes of the underlying studies. This
means that the greater this variability in effect sizes (otherwise known
as heterogeneity), the greater the un-weighting and this can reach a point
when the random effects meta-analysis result becomes simply the un-
weighted average effect size across the studies. At the other extreme,
when all effect sizes are similar (or variability does not exceed sampling
error), no REVC is applied and the random effects meta-analysis defaults
to simply a fixed effect meta-analysis (only inverse variance weighting).
The extent of this reversal is solely dependent on two factors:
1. Heterogeneity of precision
2. Heterogeneity of effect size
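The two-step weighting described above can be illustrated with a minimal sketch. It assumes the DerSimonian and Laird method-of-moments estimator for the random effects variance component (the letter does not name a specific estimator), and the function name, the returned field names, and the example effect sizes and variances are all hypothetical values chosen for illustration.

```python
def dersimonian_laird(effects, variances):
    """Fixed and random effects pooling for k >= 2 studies (illustrative sketch)."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Step 1: inverse variance weighting (the fixed effect weights).
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw

    # Cochran's Q and the method-of-moments estimate of the REVC (tau^2).
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Step 2: "un-weighting" by adding tau^2 to every within-study variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    sw_re = sum(w_re)
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sw_re

    return {
        "fixed": fixed,
        "random": pooled,
        "q": q,
        "tau2": tau2,
        "fixed_shares": [wi / sw for wi in w],
        "random_shares": [wi / sw_re for wi in w_re],
    }

if __name__ == "__main__":
    # Hypothetical data: one large precise study and three smaller ones.
    variances = [0.01, 0.04, 0.09, 0.25]
    similar = dersimonian_laird([0.2, 0.2, 0.2, 0.2], variances)
    spread = dersimonian_laird([0.0, 0.8, -0.5, 1.2], variances)
    print("similar effects:", similar["tau2"], similar["random_shares"])
    print("spread effects: ", spread["tau2"], spread["random_shares"])
```

With identical effect sizes the estimated variance component is zero and the random effects weights coincide with the inverse variance weights. With widely spread effect sizes the variance component dominates the within-study variances and the weight shares move towards equality, which is the behaviour the letter describes.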
Since there is absolutely no reason to assume that a larger
variability in study sizes or effect sizes automatically indicates
a faulty larger study or more reliable smaller studies, the
redistribution of weights under this model bears no relationship to what
these studies have to offer. Indeed, there is no reason why the results of
a meta-analysis should be associated with this method of reversal of the
inverse variance weighting process of the included studies. As such, the
changes in weight introduced by this model (to each study) result in a
pooled estimate that can have no possible interpretation and, thus, bears
no relationship with what the studies actually have to offer.
To compound the problem further, Riley et al are proposing that we
take an estimate that has no meaning and compute a prediction interval
around it. This is akin to taking a random guess at the effectiveness of a
therapy and, under the false belief that it is meaningful, trying to
expand on its interpretation. Unfortunately, there is no statistical
manipulation that can replace common sense. While heterogeneity might be due to
underlying true differences in study effects, it is more than likely that
such differences are brought about by systematic error. The best we can do
in terms of addressing heterogeneity is to look up the list of studies and
attempt to un-weight (from inverse variance) based on differences in
evidence of bias rather than effect size or precision that are
consequences of these failures. We have thus devised a model that replaces
these untenable interpretations that abound in the literature and anyone
interested in exploring meaningful meta-analysis when heterogeneity is
present is welcome to download our meta-analysis software freely from
1. Senn S. Trying to be precise about vagueness. Stat Med 2007;
2. Al Khalaf MM, Thalib L, Doi SA. Combining heterogenous studies
using the random-effects model is a mistake and leads to inconclusive
meta-analyses. J Clin Epidemiol 2011; 64:119-23.
Competing interests: No competing interests
There continues to be confusion regarding the nature of meta-analysis
inference. Certain early meta-analyses (1) used what now is termed the
fixed effects model, an approach that assumes the included trials all
estimate the same underlying treatment effect. The study question is
designed to ensure that constituent trials show little important clinical
heterogeneity, and the meta-analysis estimates the overall treatment
effect. Advocates of the fixed effects model describe inference as
conditional on the trials used (2). A corollary of this model is that
considerations of clinical heterogeneity are paramount, and the
calculation of statistical heterogeneity is secondary.
Meanwhile a second approach, the random effects model, appeared (3).
It assumes trial treatment effects do not all estimate the same underlying
effect but rather that trial treatment effects are drawn from a
distribution. The calculation of the overall estimate in this approach
includes an additional term, the between-trial variance, and this approach
tends to yield larger confidence intervals than the fixed effects model.
It is asserted that the formulation of this approach allows a broader
inference - the inference applies to any future trial drawn from the
distribution. This expansive inference target was viewed with skepticism
by early commentators (4), as was the underlying conceptual construct of
the random effects model, the "universe of trials" (2). Additionally, the
random effects model suffers from an inability to reflect the large
uncertainty seen in estimating the between-trial variance term when the
number of constituent trials is small.
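For reference, the two models contrasted above are commonly written as follows. The notation is an assumption of this sketch and is not taken from the letter: y_i is the observed effect in trial i, v_i its within-trial variance, theta the common effect under the fixed effects model, and mu and tau^2 the mean and variance of the distribution of trial effects under the random effects model.

```latex
\begin{align*}
\text{Fixed effects:}\quad
  & y_i \sim N(\theta,\ v_i),
  & \hat\theta &= \frac{\sum_i y_i / v_i}{\sum_i 1 / v_i},
  & \operatorname{Var}(\hat\theta) &= \frac{1}{\sum_i 1 / v_i} \\[4pt]
\text{Random effects:}\quad
  & y_i \sim N(\theta_i,\ v_i),\quad \theta_i \sim N(\mu,\ \tau^2),
  & \hat\mu &= \frac{\sum_i y_i / (v_i + \tau^2)}{\sum_i 1 / (v_i + \tau^2)},
  & \operatorname{Var}(\hat\mu) &= \frac{1}{\sum_i 1 / (v_i + \tau^2)}
\end{align*}
```

Because v_i + tau^2 is never smaller than v_i, the variance of the random effects estimate is at least that of the fixed effects estimate, which is why its confidence intervals tend to be wider. In practice tau^2 is replaced by an estimate that is then treated as known, so the formula above does not carry the uncertainty in that estimate, which is the limitation noted for small numbers of trials.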
Riley et al (BMJ 342:d549) make a valuable addition to this
discussion. To better inform the strengths and limitations of inference,
they suggest including the 95% prediction interval which depends centrally
on the between-trial variance estimate. With this addition one has a more
valid sense of prediction for a future trial. However, on a deeper level,
this addition may only be palliative. The underlying "universe of trials"
concept remains problematic and casts a question over random effects
inferences generally. This presents a difficulty, of course, for those
relying on a broad inference from a random effects model. Clearly, the
debate between the random and fixed effects models will continue, and it
may need to be broadened to developing new models of analysis that can
quantify treatment effects across trials in the presence of substantial
clinical heterogeneity more reliably than the random effects meta-analysis
Kent Johnson, MD
1. Yusuf S, Peto R, Lewis J, Collins R, Sleight P. Beta blockade
during and after myocardial infarction: an overview of the randomized
trials. Prog Cardiovasc Dis 1985;27:335-371.
2. Peto R. Why do we need systematic overviews of randomized Trials? With
Statist Med 1987; 6:233-244
3. DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled
Clinical Trials 1986;7:177-188.
4. Thompson S, Pocock S. Can meta-analysis be trusted? Lancet 1991;338:127
Competing interests: No competing interests