Measuring inconsistency in meta-analyses
BMJ 2003; 327 doi: http://dx.doi.org/10.1136/bmj.327.7414.557 (Published 04 September 2003) Cite this as: BMJ 2003;327:557
 Julian P T Higgins, statistician (julian.higgins{at}mrcbsu.cam.ac.uk)^{1},
 Simon G Thompson, director^{1},
 Jonathan J Deeks, senior medical statistician^{2},
 Douglas G Altman, professor of statistics in medicine^{2}
 ^{1}MRC Biostatistics Unit, Institute of Public Health, Cambridge CB2 2SR,
 ^{2}Cancer Research UK/NHS Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
 Correspondence to: J P T Higgins
Cochrane Reviews have recently started including the quantity I^{2} to help readers assess the consistency of the results of studies in meta-analyses. What does this new quantity mean, and why is assessment of heterogeneity so important to clinical practice?
Systematic reviews and meta-analyses can provide convincing and reliable evidence relevant to many aspects of medicine and health care.1 Their value is especially clear when the results of the studies they include show clinically important effects of similar magnitude. However, the conclusions are less clear when the included studies have differing results. In an attempt to establish whether studies are consistent, reports of meta-analyses commonly present a statistical test of heterogeneity. The test seeks to determine whether there are genuine differences underlying the results of the studies (heterogeneity), or whether the variation in findings is compatible with chance alone (homogeneity). However, the result of the test depends strongly on the number of trials included in the meta-analysis. We have developed a new quantity, I^{2}, which we believe gives a better measure of the consistency between trials in a meta-analysis.
Need for consistency
Assessment of the consistency of effects across studies is an essential part of meta-analysis. Unless we know how consistent the results of studies are, we cannot determine the generalisability of the findings of the meta-analysis. Indeed, several hierarchical systems for grading evidence state that the results of studies must be consistent or homogeneous to obtain the highest grading.2–4
Tests for heterogeneity are commonly used to decide on methods for combining studies and for concluding consistency or inconsistency of findings.5 6 But what does the test achieve in practice, and how should the resulting P values be interpreted?
Testing for heterogeneity
A test for heterogeneity examines the null hypothesis that all studies are evaluating the same effect. The usual test statistic (Cochran's Q) is computed by summing the squared deviations of each study's estimate from the overall meta-analytic estimate, weighting each study's contribution in the same manner as in the meta-analysis.7 P values are obtained by comparing the statistic with a χ^{2} distribution with k − 1 degrees of freedom (where k is the number of studies).
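The calculation described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming inverse-variance weights (a common choice for a fixed effect meta-analysis; the text says only that the weights match those used in the meta-analysis). The function name cochran_q and the example numbers are hypothetical and are not taken from any of the reviews discussed here.

```python
def cochran_q(estimates, variances):
    """Return (Q, df, pooled) for per-study estimates and their variances.

    Assumption: each study is weighted by the inverse of its variance.
    """
    if len(estimates) != len(variances) or len(estimates) < 2:
        raise ValueError("need at least two studies, one variance per estimate")
    weights = [1.0 / v for v in variances]
    # Overall (pooled) estimate: weighted mean of the study estimates.
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    # Q: weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1  # k - 1 degrees of freedom
    return q, df, pooled


if __name__ == "__main__":
    # Hypothetical log odds ratios and variances for four studies.
    log_odds_ratios = [-0.20, -0.90, -0.45, -1.60]
    variances = [0.04, 0.10, 0.06, 0.30]
    q, df, pooled = cochran_q(log_odds_ratios, variances)
    print(f"Q = {q:.2f} on {df} degrees of freedom; pooled = {pooled:.2f}")
```

The P value would then come from the upper tail of a χ^{2} distribution with df degrees of freedom; that step is left out here to keep the sketch free of third-party libraries.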
The test is known to be poor at detecting true heterogeneity among studies as significant. Meta-analyses often include small numbers of studies,6 8 and the power of the test in such circumstances is low.9 10 For example, consider the meta-analysis of randomised controlled trials of amantadine for preventing influenza (fig 1).11 The treatment effects in the eight trials seem inconsistent: the reductions in odds vary from 16% to 93%, with some of the confidence intervals not overlapping. But the test of heterogeneity yields a P value of 0.09, conventionally interpreted as being non-significant. Because the test is poor at detecting true heterogeneity, a non-significant result cannot be taken as evidence of homogeneity. Using a cut-off of 10% for significance12 ameliorates this problem but increases the risk of drawing a false positive conclusion (type I error).10
Conversely, the test arguably has excessive power when there are many studies, especially when those studies are large. One of the largest meta-analyses in the Cochrane Database of Systematic Reviews is of clinical trials of tricyclic antidepressants and selective serotonin reuptake inhibitors for treatment of depression.13 Over 15 000 participants from 135 trials are included in the assessment of comparative dropout rates, and the test for heterogeneity is significant (P = 0.005). However, this P value does not reasonably describe the extent of heterogeneity in the results of the trials. As we show later, a little inconsistency exists among these trials but it does not affect the conclusion of the review (that serotonin reuptake inhibitors have lower discontinuation rates than tricyclic antidepressants).
Since systematic reviews bring together studies that are diverse both clinically and methodologically, heterogeneity in their results is to be expected.6 For example, heterogeneity is likely to arise through diversity in doses, lengths of follow up, study quality, and inclusion criteria for participants. So there seems little point in simply testing for heterogeneity when what matters is the extent to which it affects the conclusions of the meta-analysis.
Quantifying heterogeneity: a better approach
We developed an alternative approach that quantifies the effect of heterogeneity, providing a measure of the degree of inconsistency in the studies' results.14 The quantity, which we call I^{2}, describes the percentage of total variation across studies that is due to heterogeneity rather than chance. I^{2} can be readily calculated from basic results obtained from a typical meta-analysis as I^{2} = 100% × (Q − df)/Q, where Q is Cochran's heterogeneity statistic and df the degrees of freedom. Negative values of I^{2} are put equal to zero so that I^{2} lies between 0% and 100%. A value of 0% indicates no observed heterogeneity, and larger values show increasing heterogeneity.
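The formula translates directly into code. The sketch below is illustrative rather than definitive: the function name i_squared is hypothetical, and returning 0% when Q is zero is an added assumption to avoid dividing by zero. The input values in the usage lines are made up for the example.

```python
def i_squared(q, df):
    """Return I^2 as a percentage: 100 * (Q - df) / Q, truncated at zero.

    q  -- Cochran's heterogeneity statistic
    df -- its degrees of freedom (number of studies minus one)
    """
    if q <= 0:
        # Assumption: with no observed variation, report 0% rather than divide by zero.
        return 0.0
    # Negative values are set to zero, so the result lies between 0% and 100%.
    return max(0.0, 100.0 * (q - df) / q)


if __name__ == "__main__":
    print(i_squared(20.0, 10))  # 50.0
    print(i_squared(5.0, 10))   # 0.0, because Q is smaller than df
```

Because only Q and df are needed, the same calculation can usually be applied to heterogeneity statistics reported in a published meta-analysis.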
Examples of values of I^{2}
The principal advantage of I^{2} is that it can be calculated and compared across meta-analyses of different sizes, of different types of study, and using different types of outcome data. Table 1 gives I^{2} values for six published meta-analyses along with 95% uncertainty intervals. The upper limits of these intervals show that conclusions of homogeneity in meta-analyses of small numbers of studies are often unjustified.11 13 15–19
The tamoxifen and streptokinase meta-analyses, in which all the included studies found similar effects,16 17 have I^{2} values of 3% and 19% respectively. These indicate little variability between studies that cannot be explained by chance. For the review comparing dropouts on selective serotonin reuptake inhibitors with tricyclic antidepressants, I^{2} is 26%, indicating that although the heterogeneity is highly significant, it is a small effect.
The reviews of trials of magnesium after myocardial infarction (I^{2} = 63%) and case-control studies investigating the effects of electromagnetic radiation on leukaemia (69%) both included studies with diverse results. The high I^{2} values show that most of the variability across studies is due to heterogeneity rather than chance. Although no significant heterogeneity was detected in the review of amantadine,11 the inconsistency was moderately large (I^{2} = 44%).
Figure 2 shows the observed values of I^{2} from 509 meta-analyses in the Cochrane Database of Systematic Reviews. Almost half of these meta-analyses (250) had no inconsistency (I^{2} = 0%). Among meta-analyses with some heterogeneity, the distribution of I^{2} is roughly flat.
Further applications of I^{2}
I^{2} can also be helpful in investigating the causes and type of heterogeneity, as in the three examples below.
Methodological subgroups
Figure 3 shows the six case-control studies of magnetic fields and leukaemia broken down into two subgroups based on assessment of their quality.19 If heterogeneity is identified in a meta-analysis, a common option is to subgroup the studies. Because of loss of power, non-significant heterogeneity within a subgroup may be due not to homogeneity but to the smaller number of studies. Here, the P values for the heterogeneity test are higher for the two subgroups (P = 0.3 and P = 0.009) than for the complete data (P = 0.007), which suggests greater consistency within the subgroups. However, the values of I^{2} show that the three low quality studies are more inconsistent (I^{2} = 79%) than all six (I^{2} = 69%) (table 2). Substantially less inconsistency exists among the high quality studies (I^{2} = 15%), although uncertainty intervals for all of the I^{2} values are wide.
Heterogeneity related to choice of effect measure
A systematic review of clinical trials of human albumin administration in critically ill patients concluded that albumin may increase mortality.20 These studies had no inconsistency in risk ratio estimates (I^{2} = 0%) and a narrow uncertainty interval. Table 2 shows the heterogeneity statistics for risk differences as well as for risk ratios. Six trials with no deaths in either treatment group do not contribute information on risk ratios, but they all provide estimates of risk differences. Using P values to decide which scale is more consistent with the data21 is inappropriate because of the differing numbers of studies. I^{2} values may validly be compared and show that the risk differences are less homogeneous, as is often the case.22
Clinically important subgroups
I^{2} can also be used to describe heterogeneity among subgroups. Table 2 includes results for the outcome of recurrence in the meta-analysis of trials of tamoxifen for women with early breast cancer. There was highly significant (P = 0.00002) and important heterogeneity (I^{2} = 50%) among the trials.16 However, a potentially important source of heterogeneity is the duration of treatment. The authors divided the trials into three duration categories and presented an overall heterogeneity test, a test comparing the three subgroups, and a test for heterogeneity within the subgroups. I^{2} values corresponding to each test show that 96% of the variability observed among the three subgroups cannot be explained by chance. This is not clear from the P values alone. The extreme inconsistency among all 55 trials in the odds ratios for recurrence (I^{2} = 50%) is substantially reduced (I^{2} = 13%) once differences in treatment duration are accounted for.
How much is too much heterogeneity?
A naive categorisation of values for I^{2} would not be appropriate for all circumstances, although we would tentatively assign adjectives of low, moderate, and high to I^{2} values of 25%, 50%, and 75%. Figure 2 shows that about a quarter of meta-analyses have I^{2} values over 50%. Quantification of heterogeneity is only one component of a wider investigation of variability across studies, the most important being diversity in clinical and methodological aspects. Meta-analysts must also consider the clinical implications of the observed degree of inconsistency across studies. For example, interpretation of a given degree of heterogeneity across several studies will differ according to whether the estimates show the same direction of effect.
Advantages of I^{2}
Focuses attention on the effect of any heterogeneity on the meta-analysis
Interpretation is intuitive—the percentage of total variation across studies due to heterogeneity
Can be accompanied by an uncertainty interval
Simple to calculate and can usually be derived from published meta-analyses
Does not inherently depend on the number of studies in the meta-analysis
May be interpreted similarly irrespective of the type of outcome data (eg dichotomous, quantitative, or time to event) and choice of effect measure (eg odds ratio or hazard ratio)
Wide range of applications
Summary points
Inconsistency of studies' results in a meta-analysis reduces the confidence of recommendations about treatment
Inconsistency is usually assessed with a test for heterogeneity, but problems of power can give misleading results
A new quantity I^{2}, ranging from 0% to 100%, is described that measures the degree of inconsistency across studies in a meta-analysis
I^{2} can be directly compared between meta-analyses with different numbers of studies and different types of outcome data
I^{2} is preferable to a test for heterogeneity in judging consistency of evidence
An alternative quantification of heterogeneity in a meta-analysis is the among-study variance (often called τ^{2}), calculated as part of a random effects meta-analysis. This is more useful for comparisons of heterogeneity among subgroups, but values depend on the treatment effect scale. We believe that I^{2} offers advantages over existing approaches to the assessment of heterogeneity (box). Focusing on the effect of heterogeneity also avoids the temptation to perform so-called two-stage analyses, in which the meta-analysis strategy (fixed or random effects method) is determined by the result of a statistical test. Such strategies have been found to be problematic.23 24 We therefore believe that I^{2} is preferable to the test of heterogeneity when assessing inconsistency across studies.
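To illustrate why τ^{2} depends on the treatment effect scale, the sketch below uses one common method-of-moments estimator (DerSimonian and Laird). This choice is an assumption: the text does not say which estimator is meant, and others exist. The function name dl_tau_squared and the data are hypothetical.

```python
def dl_tau_squared(estimates, variances):
    """Method-of-moments (DerSimonian-Laird) estimate of the among-study variance.

    Assumption: inverse-variance weights, and this particular estimator.
    """
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum_w
    # Cochran's Q and its degrees of freedom.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # Scaling constant that converts the excess of Q over df into a variance.
    c = sum_w - sum(w * w for w in weights) / sum_w
    return max(0.0, (q - df) / c)


if __name__ == "__main__":
    estimates = [0.0, 2.0]   # hypothetical effect estimates
    variances = [1.0, 1.0]   # hypothetical within-study variances
    print(dl_tau_squared(estimates, variances))
    # Same data expressed on a scale 10 times larger:
    # estimates are multiplied by 10 and variances by 100.
    print(dl_tau_squared([10 * y for y in estimates],
                         [100 * v for v in variances]))
```

Rescaling the effect measure leaves Q, and therefore I^{2}, unchanged, whereas the τ^{2} estimate is multiplied by the square of the scale factor.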
Acknowledgments
We thank Keith O'Rourke and Ian White for useful comments.
Footnotes

Contributors The authors all work as statisticians and have extensive experience in methodological, empirical and applied research in meta-analysis. JH, JD, and DA are co-convenors of the Cochrane Statistical Methods Group. The views expressed in the paper are those of the authors. All authors contributed to the development of the methods described. JH and ST worked more closely on the development of I^{2}. JH is guarantor.

Funding This work was funded in part by MRC

Competing interests None declared