Research Methods & Reporting

A GRADE Working Group approach for rating the quality of treatment effect estimates from network meta-analysis

BMJ 2014; 349 doi: https://doi.org/10.1136/bmj.g5630 (Published 24 September 2014) Cite this as: BMJ 2014;349:g5630

This article has a correction.

  1. Milo A Puhan1,
  2. Holger J Schünemann2,
  3. Mohammad Hassan Murad3,
  4. Tianjing Li4,
  5. Romina Brignardello-Petersen5,
  6. Jasvinder A Singh6,
  7. Alfons G Kessels7,
  8. Gordon H Guyatt2
  9. for the GRADE Working Group
  1. Epidemiology, Biostatistics and Prevention Institute–Epidemiology, Hirschengraben 84, Zurich 8001, Switzerland
  2. Department of Clinical Epidemiology and Biostatistics, McMaster University, Health Sciences Centre, Hamilton, Ontario L8N 3Z5, Canada
  3. Mayo Clinic–Preventive Medicine, Rochester, Minnesota 55905, USA
  4. Johns Hopkins Bloomberg School of Public Health–Epidemiology, Baltimore, Maryland, USA
  5. University of Toronto–Clinical Epidemiology and Health Care Research, Toronto, Ontario, Canada
  6. University of Alabama–Clinical Immunology and Rheumatology, Birmingham, Alabama, USA
  7. University of Maastricht, Maastricht, Netherlands
  • Correspondence to: M A Puhan miloalan.puhan{at}uzh.ch
  • Accepted 22 August 2014

Network meta-analysis (NMA), combining direct and indirect comparisons, is increasingly being used to examine the comparative effectiveness of medical interventions. Minimal guidance exists on how to rate the quality of evidence supporting treatment effect estimates obtained from NMA. We present a four-step approach to rate the quality of evidence in each of the direct, indirect, and NMA estimates based on methods developed by the GRADE working group. Using an example of a published NMA, we show that the quality of evidence supporting NMA estimates varies from high to very low across comparisons, and that quality ratings given to a whole network are uninformative and likely to mislead.

Network meta-analysis (NMA) that simultaneously addresses the comparative effectiveness and/or safety of multiple interventions through combining direct and indirect estimates of effect is rapidly gaining popularity and influence.1 2 3 4 5 6 Although NMA approaches appear attractive,6 7 8 application of their results requires understanding the quality of the evidence. By quality of evidence, we mean the degree of confidence or certainty one can place in estimates of treatment effects.

NMA is sufficiently new that terminology differs between authors and continues to evolve. Box 1 presents a glossary of terms used in this article.

Box 1: Glossary of terms (in order they appear in the text)

  • Ranking—Ordering of treatments according to their relative effectiveness. The first ranked treatment is most likely to be the most effective treatment with respect to a particular outcome compared with the other treatments in the network

  • Direct estimates—Estimate of effect provided by a head-to-head comparison (such as trials of A versus B when A v B is the comparison of interest)

  • Indirect estimates—Estimate of effect provided by two or more head-to-head comparisons that share a common comparator (such as trials of A v C and trials of B v C when A v B is the comparison of interest)

  • Network—A collection of trials of alternative interventions for a clinical condition that allow, through direct and indirect comparisons, calculation of the relative effects of all treatment versus placebo or standard care, and versus one another, on a particular outcome (for example, fig 1)

  • Loops—Two or more head-to-head comparisons that contribute to an indirect estimate. First order loops are those loops that involve only a single additional intervention. For example, if we are interested in A versus B, the direct estimates of A versus C and B versus C constitute a first order loop (see red solid line in fig 2). A second order loop would involve two other interventions (such as A v C, C v D, and D v B; see green and blue dashed lines in fig 2). Higher order loops involve additional interventions

  • Intransitivity—Differences in study characteristics that may modify treatment effect in the direct comparisons (such as A v C and B v C) that form the basis for the indirect estimate of effect of the comparison of interest (A v B), and thus bias the indirect assessment of A versus B. Factors that may modify treatment effects include differing patient characteristics; differing co-interventions; differing extent to which interventions of interest are optimally administered; differing comparators; and differences in measurement of outcome

  • Heterogeneity—Differences in estimates of effect across studies that assessed the same comparison

  • Incoherence—Differences between direct and indirect estimates of effect

Rationale for an approach to rate the quality of evidence from NMA

Recently, several articles have provided guidance regarding identification of the evidence for a NMA,9 statistical aspects of conducting NMA,10 11 12 13 14 15 16 17 and critical appraisal and interpretation of published NMA.18 19 Few of these, however, provide explicit guidance on how to rate the quality of the evidence.4 20 21

Reports of NMAs often describe the risk of bias of trials included in a NMA (such as method of randomisation, concealment of random allocation, masking, etc).22 23 24 For example, a recent NMA compared the effects of coronary artery bypass grafting, various stents, and medical treatment on mortality, myocardial infarction, and the need for revascularisation among patients with stable coronary artery disease. The authors stated that appropriate methods of concealment of random allocation were reported for 71 trials (71%).25 Fifty six trials (56%) reported blind adjudication of clinical outcomes, and for 69 trials (69%) data from intention to treat analyses were available. Although such an assessment of risk of bias describes the entire body of evidence (that is, all trials contributing evidence to the NMA), it does not acknowledge that the risk of bias is likely to differ across the comparisons of the network.1 For example, the risk of bias of studies comparing sirolimus eluting stents versus medical treatment may be considerably less than the risk of bias of studies comparing coronary artery bypass grafting with medical treatment. In addition, risk of bias is only one determinant of quality of evidence. Our confidence in effect estimates will, for instance, also decrease if there are large differences in results from study to study (for example, some studies suggest benefit, but others suggest harm) or if results are imprecise (that is, small numbers of patients and resulting wide confidence intervals, see box 2). 

Furthermore, the popular approach of treatment rankings (for example, probability that coronary artery bypass grafting is the most effective treatment to lower the risk of mortality) will result in misleading inferences when most evidence is of low or very low quality, or when evidence supporting higher ranked treatments (such as coronary artery bypass grafting) is of much lower quality than evidence supporting lower ranked treatments (such as drug eluting stents). Patients and clinicians may choose a lower ranked treatment with supporting evidence they can trust over a higher ranked treatment with supporting evidence they cannot trust.

The GRADE Working Group and its approach to rate the quality of evidence

The Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group began in the year 2000 as an informal collaboration of people with an interest in addressing the shortcomings of present grading systems in health care. The GRADE Working Group has developed a sensible and transparent approach to grading quality of evidence (box 2).20 26 27 28 29 30 31 The goal of this approach is to provide ratings for the confidence in the estimates of effect for a specific comparison (such as sirolimus eluting stents v medical treatment) for all outcomes of importance to patients (such as all-cause mortality, recurrent angina). If all trials are at low risk of bias; if all included populations, interventions, and outcomes are applicable to practice; if trials show similar estimates of treatment effects; if the effect estimates from meta-analysis are precise (for example, narrow 95% confidence interval); and if suspicion of publication bias is low, we will judge the quality of evidence as high (that is, we can be confident that the true effect lies close to that of the estimate of the effect). If, however, trials are at high risk of bias; show inconsistent estimates of effects across trials; included highly selected patients or used surrogate outcomes; if the estimates of treatment effect are imprecise; or if we have a high suspicion of publication bias, we will judge the evidence as lower quality (that is, the confidence in estimates of treatment effect is only moderate, low or very low, box 2).

Box 2: GRADE approach for rating the quality of estimates of treatment effect

Goal
  • Provides a rating for the quality of the estimates of effect for a specific comparison and a specific outcome

Ratings
  • High quality (⊕⊕⊕⊕)—We are very confident that the true effect lies close to that of the estimate of the effect

  • Moderate quality (⊕⊕⊕O)—We are moderately confident in the effect estimate: the true effect is likely to be close to the estimate of the effect, but there is a possibility that it is substantially different

  • Low quality (⊕⊕OO)—Our confidence in the effect estimate is limited: the true effect may be substantially different from the estimate of the effect

  • Very low quality (⊕OOO)—We have very little confidence in the effect estimate: the true effect is likely to be substantially different from the estimate of effect

Starting point
  • If randomised trials form the evidence base, the quality rating starts at high. If observational studies form the evidence base, the quality rating starts at low: lack of randomisation typically leads to an initial rating of low

Down rating
  • The quality rating may be rated down by −1 (serious concern) or −2 (very serious concern) for the following reasons

    • Risk of bias (such as failure to conceal random allocation or blind participants28 in randomised controlled trials or failure to adequately control for confounding in observational studies)

    • Inconsistency (such as heterogeneity of estimates of effects across trials29)

    • Indirectness (such as surrogate outcomes, study populations or interventions that differ from those of interest,20 or intransitivity;19 for further explanation see box 3)

    • Imprecision (for example, 95% confidence intervals are wide and include or are close to null effect30)

    • Publication bias31

Up rating
  • Rating up is typically applied only to observational studies; the most common reason is for a large or very large effect seen over a short period of time and altering a clear downward trajectory

Explanations
  • For reasons of transparency and for better understanding of the ratings, reasons for rating are documented, typically in a summary table’s footnotes
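The rating logic of box 2 can be sketched in a few lines of Python. This is a minimal illustration, not an official GRADE tool: the function name `grade_rating`, the dictionary format for down ratings, and the floor at "very low" are assumptions made for the example, and rating up is deliberately left out.

```python
# Ordered GRADE levels, lowest to highest (labels follow box 2).
LEVELS = ["very low", "low", "moderate", "high"]

# Domains for rating down, as listed in box 2.
DOMAINS = {"risk of bias", "inconsistency", "indirectness",
           "imprecision", "publication bias"}


def grade_rating(randomised, downgrades):
    """Return a GRADE level after applying down ratings.

    randomised: True if randomised trials form the evidence base.
    downgrades: dict mapping a domain to 1 (serious) or 2 (very serious).
    """
    for domain, steps in downgrades.items():
        if domain not in DOMAINS:
            raise ValueError(f"unknown domain: {domain}")
        if steps not in (1, 2):
            raise ValueError("rate down by 1 or 2 levels only")
    start = LEVELS.index("high" if randomised else "low")
    # Assumption: the rating cannot fall below "very low".
    final = max(0, start - sum(downgrades.values()))
    return LEVELS[final]


print(grade_rating(True, {}))
print(grade_rating(True, {"imprecision": 1}))
print(grade_rating(True, {"risk of bias": 1, "imprecision": 2}))
```

In practice each down rating is a judgment that should be documented with its reasons; the arithmetic only records the outcome of those judgments.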

Box 3: Clarification of GRADE rating down for indirectness

Indirectness, a term established by the GRADE Working Group,20 refers to two different concepts.

One concept relates to differences between the question of interest and the body of evidence that is identified and used to inform the question. We may rate down the quality for this type of indirectness when patients of interest (as defined by the question) overlap only partly with patients enrolled in trials (for example, the population of interest is the very elderly, few of whom participated in the trials); interventions of interest differ from those regimens tested in trials (for example, the intensity of anticoagulation control differed in trials compared with the community setting); or outcomes of interest differ from those measured in trials (for example, trials in diabetes measured blood glucose, a surrogate endpoint, rather than cardiovascular events).

The second concept on indirectness, particularly relevant for NMA, relates to biased evidence from indirect comparisons. In accordance with the NMA literature, in this article we refer to this second concept as intransitivity (see box 1 for definition of intransitivity).15 16

In this paper we describe the GRADE Working Group’s approach to rating the quality of evidence for specific comparisons included in a NMA. Discussion of an approach to rate the confidence in estimates of effect from NMA began at an international meeting on NMA at Johns Hopkins University (Baltimore MD, USA)1 and with a face-to-face GRADE Working Group meeting in 2010 (Keystone CO, USA). An iterative process including more face-to-face meetings, electronic conferences, email discussions, and multiple iterations of a draft manuscript followed. The final meeting took place at a GRADE Working Group meeting in 2014 (Barcelona, Spain).

The GRADE four-step approach for rating the quality of treatment effect estimates from NMA

Rating the quality of treatment effect estimates from NMA requires best estimates from direct, indirect, and NMA (combined direct + indirect) evidence, as well as quality ratings for the direct and indirect comparisons. We propose the following four steps to assess the quality of treatment effect estimates from NMA (fig 1):

  1. Present direct and indirect treatment estimates for each comparison of the evidence network. The direct estimate of effect is provided by a head-to-head comparison (trials of A v B), and the indirect estimate is provided by two or more head-to-head comparisons that share a common comparator (for example, we infer the effects of A v B from trials of A v C and trials of B v C).

  2. Rate the quality of each direct and indirect effect estimate.

  3. Present the NMA estimate for each comparison of the evidence network.

  4. Rate the quality of each NMA effect estimate.

Fig 1 Approach for rating the quality of network meta-analysis (NMA) estimates

Example used for illustration

We use a recent NMA to illustrate the application of the GRADE approach. This article will not present details of the underlying systematic reviews and statistical aspects of the NMA; these are reported elsewhere.8 In brief, the NMA included randomised trials that compared drug treatments to prevent fragility fractures in individuals with or at risk of osteoporosis. The target population was postmenopausal women at risk of developing fragility fractures, but a small number of eligible trials enrolled men or women irrespective of risk. The drug treatments included bisphosphonates (alendronate, risedronate, zoledronate, and ibandronate), teriparatide, selective oestrogen receptor modulators (raloxifene), denosumab, and calcium and/or vitamin D.

Here, we present the hip fracture outcome data from 40 trials that included 139 647 participants, of whom 2567 (1.8%) had a hip fracture. The results presented in this paper are identical to those of the primary report,8 and we did not perform any new NMA for this paper. We did, however, apply our new GRADE approach to rating the quality of evidence of each comparison (this was not done in the original article). Figure 2 shows the evidence network for the available direct comparisons.

Fig 2 Evidence network of randomised trials comparing the effects of drugs to prevent osteoporotic hip fractures. The size of the circle is proportional to the number of participants randomised to that treatment. Width of the lines is proportional to the number of trials for that comparison. Coloured dashed lines refer to loops for indirect evidence (see text).

Step 1: Presenting direct and indirect effect estimates and 95% CI

Making valid inferences on the basis of a NMA requires understanding of both the direct and indirect evidence that contributes to the NMA effect estimates. Several approaches exist for calculating indirect estimates.12 32 33 For the example presented here we use a method referred to as node splitting, which separates evidence on a particular comparison (a “node”) into direct and indirect estimates of treatment effect.12 For example, direct evidence for the comparison of alendronate versus raloxifene in our fracture prevention example shows an odds ratio of 0.49.8 Because the trial directly comparing the two agents is small, the 95% confidence interval is wide (0.04 to 5.45, fig 1). The indirect evidence (odds ratio 0.53, 95% confidence interval 0.30 to 0.90) includes a first order loop (first order loops are those loops that involve only a single additional intervention, such as vitamin D plus calcium, see red solid line in fig 2) and second order loops (loops that involve two other interventions, such as calcium, vitamin D, and placebo, see green and blue dashed lines in fig 2).
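To make the idea of an indirect estimate concrete, the sketch below uses the simpler adjusted indirect comparison through a single common comparator, not the node splitting method used for the example in this article. On the log odds ratio scale, the indirect A v B estimate is the difference between the A v C and B v C estimates, and the variances add. The function names, the recovery of standard errors from 95% intervals (which assumes intervals symmetric on the log scale), and the input numbers are all assumptions made for illustration.

```python
import math

Z95 = 1.959963984540054  # standard normal quantile for a 95% interval


def se_from_ci(lower, upper):
    """Standard error of a log odds ratio recovered from a 95% interval.

    Assumes the interval is symmetric on the log scale.
    """
    return (math.log(upper) - math.log(lower)) / (2 * Z95)


def indirect_or(or_ac, ci_ac, or_bc, ci_bc):
    """Indirect odds ratio for A v B via common comparator C, with 95% CI."""
    log_or = math.log(or_ac) - math.log(or_bc)
    # Variances of the two direct estimates add on the log scale.
    se = math.sqrt(se_from_ci(*ci_ac) ** 2 + se_from_ci(*ci_bc) ** 2)
    return (math.exp(log_or),
            math.exp(log_or - Z95 * se),
            math.exp(log_or + Z95 * se))


# Hypothetical inputs, not taken from table 1.
est, lo, hi = indirect_or(0.60, (0.40, 0.90), 1.20, (0.80, 1.80))
print(f"A v B indirect OR {est:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Because the variances add, an indirect estimate from one loop is always less precise than either of the direct estimates that feed it.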

Step 2: Rating of quality of direct and indirect effect estimates

Investigators rate the quality of evidence separately for direct and indirect evidence. The confidence estimates for the direct comparisons involve an application of the GRADE principles (box 2) to each comparison for which head-to-head trials are available. For the network of drugs to prevent osteoporotic fractures, we found seven direct comparisons to warrant high or moderate confidence and nine direct comparisons to warrant low or very low confidence (table 1).

Table 1

 Estimates of effects and quality ratings for comparison of drugs to prevent osteoporotic hip fractures

Depending on the size and structure of the evidence network, one, few, or many loops can contribute indirect evidence to the comparisons of interest. To keep the quality rating of the indirect evidence manageable, we suggest a focus on first order loops, which usually contribute most information to the indirect estimate. To identify the relevant loops a network graph such as figure 2 is needed (red solid line represents the first order loop for the indirect comparison of alendronate v raloxifene).
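Identifying first order loops amounts to finding the interventions that have head-to-head trials against both treatments of interest. A minimal sketch follows, assuming the network is stored as a list of pairs; the function name `first_order_loops` is hypothetical, and the edge list contains only the comparisons named in the text of this article, not the full network of figure 2.

```python
def first_order_loops(edges, a, b):
    """Return the common comparators C such that A v C and B v C both have trials."""
    neighbours = {}
    for x, y in edges:
        neighbours.setdefault(x, set()).add(y)
        neighbours.setdefault(y, set()).add(x)
    common = neighbours.get(a, set()) & neighbours.get(b, set())
    return sorted(common - {a, b})


# Illustrative edge list: a small subset, not the full network of fig 2.
edges = [
    ("alendronate", "raloxifene"),
    ("alendronate", "vitamin D plus calcium"),
    ("raloxifene", "vitamin D plus calcium"),
    ("risedronate", "placebo"),
    ("vitamin D plus calcium", "placebo"),
]

print(first_order_loops(edges, "alendronate", "raloxifene"))
print(first_order_loops(edges, "risedronate", "vitamin D plus calcium"))
```

With the complete network, a comparison may have more than one first order loop, and each would need its own rating.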

The rating of the quality of the indirect estimate is then based on the ratings of the two pairwise estimates (such as A v C and B v C) that contribute to the indirect estimate of the comparison of interest (A v B); these ratings can follow established GRADE guidance.26 For example, when comparing alendronate versus raloxifene, the comparisons of alendronate versus vitamin D plus calcium and raloxifene versus vitamin D plus calcium (fig 2, red solid line) create the first order loop. The lower confidence rating of the two direct comparisons constitutes the confidence rating of the indirect comparison. In this case, for both comparisons, the confidence rating is moderate: therefore, the initial rating of the indirect evidence warrants moderate confidence.

There is, however, an additional issue that may further reduce confidence in estimates from the indirect comparison: intransitivity (see box 1). If the trials forming the basis for the indirect estimate (such as trials of A v C and of B v C) differ in important ways, the likelihood of intransitivity may be high. As a consequence, the indirect estimate of the comparison of interest (A v B) may be biased. In the presence of intransitivity, we would rate down further from the lower of the confidence ratings of the contributing direct comparisons.
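The rule for rating an indirect estimate from one first order loop (take the lower of the two contributing ratings, then rate down further if intransitivity is judged present) can be written as a short function. The name `rate_indirect`, the parameter `intransitivity_steps`, and the floor at "very low" are assumptions of this sketch.

```python
# Ordered GRADE levels, lowest to highest.
LEVELS = ["very low", "low", "moderate", "high"]


def rate_indirect(rating_ac, rating_bc, intransitivity_steps=0):
    """Rating of the indirect A v B estimate from one first order loop.

    rating_ac, rating_bc: ratings of the two contributing direct comparisons.
    intransitivity_steps: levels to rate down for intransitivity (0, 1, or 2).
    """
    if intransitivity_steps not in (0, 1, 2):
        raise ValueError("rate down by 0, 1, or 2 levels")
    lower = min(LEVELS.index(rating_ac), LEVELS.index(rating_bc))
    # Assumption: the rating cannot fall below "very low".
    return LEVELS[max(0, lower - intransitivity_steps)]


# Alendronate v raloxifene via vitamin D plus calcium: both ratings moderate.
print(rate_indirect("moderate", "moderate"))
# Hypothetical loop with intransitivity judged serious.
print(rate_indirect("high", "moderate", intransitivity_steps=1))
```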

Consider, for example, the indirect comparison for risedronate versus vitamin D plus calcium (fig 2). The trials with placebo as common comparator provide most of the indirect evidence. Risedronate was tested in 20 trials for the prevention of fragility fractures. In half of these trials, patients were using glucocorticoid treatment or had a chronic disease that might modify bone metabolism (such as inflammatory bowel disease).8 This contrasts with the trials of vitamin D plus calcium versus placebo, in which participants were included only if they did not take drugs and did not have diseases that modify bone metabolism.34 As a consequence of these differences between the trials of risedronate and vitamin D plus calcium versus the common comparator placebo, we decided to down rate the indirect comparison of risedronate versus vitamin D plus calcium for intransitivity.

It is conceivable that a substantial proportion of indirect comparisons of any NMA warrant down rating for indirectness because of these two reasons. Although we suggest a low threshold for down rating for indirectness, authors should be explicit and report reasons for down rating in the footnotes of the table that presents the direct, indirect, and network estimates of effect. For the network of drugs to prevent osteoporotic fractures, we found 10 indirect comparisons to be of high or moderate quality and 41 indirect comparisons to be of low or very low quality (table 1).

Steps 3 and 4: Presenting and rating of quality of NMA effect estimates

If only direct or indirect evidence is available for a given comparison, the network quality rating will be based on that estimate. When, for a particular comparison, both direct and indirect evidence are available, we suggest using the higher of the two quality ratings as the quality rating for the NMA estimate (for example, moderate quality if quality of the direct estimate is moderate and quality of the indirect estimate is low).
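As a sketch of this default rule, the function below returns the only available rating when a single source of evidence exists and otherwise the higher of the two. The name `rate_nma` is hypothetical, and the sketch covers only the default: judgments about incoherence, imprecision of the pooled estimate, or the relative contribution of each source are not modelled.

```python
# Ordered GRADE levels, lowest to highest.
LEVELS = ["very low", "low", "moderate", "high"]


def rate_nma(direct=None, indirect=None):
    """Default rating of an NMA estimate from the direct and indirect ratings."""
    available = [r for r in (direct, indirect) if r is not None]
    if not available:
        raise ValueError("need a direct rating, an indirect rating, or both")
    # Use the higher of the available ratings.
    return max(available, key=LEVELS.index)


print(rate_nma(direct="moderate", indirect="low"))
print(rate_nma(indirect="low"))
```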

There are two reasons we have chosen this approach. First, if direct and indirect estimates are similar (coherent, see box 1), the lower quality estimate can only bolster the higher (it would make no sense to add evidence that would lower the quality of estimates). Second, in general, we expect the higher rated estimate to be the more precise (and thus dominating) body of evidence.

In the rarer instances in which the less precise estimate warrants higher confidence, it likely means that there are no other reasons for down rating that estimate. On the other hand, if the more precise estimate warrants lower confidence than the less precise estimate, there must be serious problems (risk of bias, inconsistency, publication bias, indirectness). If the direct and indirect estimates are coherent, the serious problems with risk of bias in the lower confidence estimate are unlikely to have biased the results. If there is serious incoherence, then we default to the following guidance regarding what to do in the presence of incoherence.

The assessment of coherence (others use different terminology such as inconsistency) addresses the assumption that direct and indirect evidence are similar enough to be pooled. A commonly used approach to investigate coherence is to test the statistical significance of the difference between direct and indirect estimates.11 12 29 In addition, the magnitude of differences between the direct and indirect estimates should bear on addressing incoherence.
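One simple version of such a test compares the direct and indirect estimates on the log odds ratio scale with a z statistic. The sketch below is an illustration rather than the procedure used in the cited methods papers: the function names are hypothetical, and standard errors are recovered from the reported 95% intervals on the assumption that they are symmetric on the log scale. The inputs are the alendronate versus raloxifene estimates quoted earlier in this article.

```python
import math

Z95 = 1.959963984540054  # standard normal quantile for a 95% interval


def se_from_ci(lower, upper):
    """Standard error of a log odds ratio from a 95% interval (assumed log-symmetric)."""
    return (math.log(upper) - math.log(lower)) / (2 * Z95)


def incoherence_test(or_direct, ci_direct, or_indirect, ci_indirect):
    """Return (ratio of odds ratios, z statistic, two sided P value)."""
    diff = math.log(or_direct) - math.log(or_indirect)
    se = math.sqrt(se_from_ci(*ci_direct) ** 2 + se_from_ci(*ci_indirect) ** 2)
    z = diff / se
    # Two sided P value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return math.exp(diff), z, p


# Alendronate v raloxifene: direct 0.49 (0.04 to 5.45), indirect 0.53 (0.30 to 0.90).
ratio, z, p = incoherence_test(0.49, (0.04, 5.45), 0.53, (0.30, 0.90))
print(f"ratio of odds ratios {ratio:.2f}, z = {z:.2f}, P = {p:.2f}")
```

A large P value here mainly reflects the imprecision of the direct estimate, which is why the size of the difference between the two estimates also needs to be considered.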

Consider table 2, which presents results from a NMA of the impact of alternative surgical approaches to open tibial fractures on reoperation (from Foote CJ, Guyatt GH, Vignesh KN, et al “Systematic review of prospective investigation of surgical treatment of open tibial shaft fractures (SPRINT review): a network meta-analysis” submitted for publication). In the comparison of unreamed versus reamed nailing, the direct estimate suggests unreamed is superior. The indirect evidence also suggests unreamed is superior, but the effect is much larger, the confidence intervals of the two estimates are virtually non-overlapping, and the statistical test of interaction generates a P value of 0.02. This suggests major incoherence between direct and indirect estimates. On the other hand, for the comparison of unreamed nailing versus external fixation (table 2) we would conclude the results are coherent.

Table 2

 Illustration of coherence and incoherence from a network meta-analysis of alternative surgical approaches to open tibial fractures

In the face of large incoherence in a particular comparison we do not advocate discarding or modifying the NMA (for instance, by excluding the incoherent data) without a strong rationale. NMA authors can guide users of the NMA in one of two ways. The first is to focus attention on the direct or indirect estimate warranting greater confidence, rather than the NMA estimate, as the best estimate of effect. This is the approach authors used in the NMA of open tibial fractures. An alternative is to focus on the NMA estimate but rate down the quality of that estimate for incoherence (in this example, also for imprecision, thus leading to a judgment of low quality).

The optimal strategy is likely to depend on the circumstances. If the difference in quality between the two estimates is large, and one of the two is of higher quality, the former approach may be desirable. If the difference in quality between the two estimates is smaller, and neither is of high quality, using the NMA estimate and rating down for incoherence may be preferable. When there is only indirect evidence it is not possible to assess incoherence.33 In such situations issues regarding intransitivity may warrant particular attention, and the threshold for rating down for intransitivity may be lower.

In the example of preventing osteoporotic fractures, table 1 presents the NMA estimates and the final quality ratings. For most of the comparisons, there is only indirect evidence (such as alendronate v zoledronate), and the quality rating of the indirect comparison also represents the quality of the NMA estimate. For the comparison of vitamin D plus calcium versus risedronate, direct evidence had a very low confidence rating and contributed substantially more to the NMA estimate than indirect evidence; therefore, the quality rating for the NMA estimate was also very low. Across the network, we found three comparisons (5% of all comparisons) of high quality, 13 (24%) of moderate quality, 19 (35%) of low quality, and 20 (36%) of very low quality.

Finally, one further criterion warrants consideration. While the direct and indirect estimates may each cross a threshold that warrants rating down for imprecision, the pooled network estimate, because it is more precise, may not. For example, we rated down both the direct and indirect comparisons of calcium versus calcium plus vitamin D for imprecision (table 1). Because the pooled estimate was more precise, which increased our confidence that calcium plus vitamin D is more effective than calcium alone, we did not rate down the NMA estimate for imprecision.
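The gain in precision can be illustrated by combining a direct and an indirect estimate with inverse variance weights. This is a simplification: NMA estimates come from a model fitted to the whole network, not from pooling two numbers, and judging imprecision involves more than whether an interval includes 1. The function names and all input values below are hypothetical.

```python
import math

Z95 = 1.959963984540054  # standard normal quantile for a 95% interval


def se_from_ci(lower, upper):
    """Standard error of a log odds ratio from a 95% interval (assumed log-symmetric)."""
    return (math.log(upper) - math.log(lower)) / (2 * Z95)


def pool(or_direct, ci_direct, or_indirect, ci_indirect):
    """Inverse variance pooled odds ratio with 95% CI."""
    w_d = 1 / se_from_ci(*ci_direct) ** 2
    w_i = 1 / se_from_ci(*ci_indirect) ** 2
    log_or = (w_d * math.log(or_direct) + w_i * math.log(or_indirect)) / (w_d + w_i)
    se = math.sqrt(1 / (w_d + w_i))
    return (math.exp(log_or),
            math.exp(log_or - Z95 * se),
            math.exp(log_or + Z95 * se))


# Hypothetical estimates: both intervals include 1.
direct = (0.75, (0.55, 1.02))
indirect = (0.78, (0.58, 1.05))
est, lo, hi = pool(direct[0], direct[1], indirect[0], indirect[1])
print(f"pooled OR {est:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

In this made-up example each input interval includes 1 while the pooled interval does not, which mirrors the situation described for calcium versus calcium plus vitamin D.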

Variability of quality of treatment effect estimates and ranking

Quality of estimates can vary greatly across comparisons within the network. Indeed, in our illustrative example, quality varied from high to very low (table 1). In making inferences regarding choice of intervention, recognising the quality of each comparison is far more valuable than the single risk of bias assessment across an evidence network typically reported in most NMA articles.22 23 24 25

An example of the necessity for rating the quality of individual paired comparisons arises from the initial report of the fracture NMA we present here. Using the standard ranking approach in NMA, the authors concluded teriparatide had the largest fracture reduction of the 10 treatments studied (odds ratio 0.42 against no treatment, table 1) and the highest probability of being ranked first across the treatments. Our quality ratings of teriparatide against placebo and other comparators are, however, low or very low (table 1). Other agents (zoledronate or denosumab) had high or moderate confidence ratings of superiority over placebo and over vitamin D plus calcium. The quality ratings suggest that clinicians and patients seeking a drug that prevents hip fractures will be better off choosing zoledronate or denosumab than teriparatide.

What the GRADE guidance adds to existing NMA guidance documents

A wealth of literature addressing NMA has accumulated over recent years. For example, Cipriani and colleagues provided excellent guidance on key statistical aspects of NMA.10 A task force of the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) published three documents addressing the conduct and interpretation of NMA as well as a checklist for critical appraisal of NMA.18 19 35 A recent overview of reporting practices paves the way for an extension of the PRISMA statement addressing NMA.36

Some of these documents addressed risk of bias for the entire network or specific comparisons of a NMA.14 18 19 35 Only one, a more statistically oriented paper, provides guidance on how to rate the quality of indirect and NMA estimates considering not only risk of bias but also other criteria that affect confidence in estimates of effect (box 2 and criteria specific to NMA).21 The four-step approach of the GRADE Working Group fills a gap by providing guidance to determine quality ratings for each estimate of effect in a NMA.

Research needs

There are a number of studies that would be useful to refine the four-step approach presented here. A previous study showed that inter-rater agreement is high if the raters are familiar with GRADE and if calibration exercises are done.37 It is important to conduct inter-rater agreement studies for NMA in order to identify those aspects of the rating process that require additional guidance and calibration exercises. Meta-epidemiological studies addressing the effects of specific criteria (such as intransitivity) on estimates of effect provided by NMA would be useful to inform how readily one should down rate the quality.

Finally, we do not currently support the use of weights (reflecting the amount of information) to decide if the quality rating of the direct or indirect estimate should determine the quality of the NMA estimate. Statistical approaches to determine weights are already incorporated in standard statistical packages. There is, however, little experience in the interpretation and use of such weights in different NMA (that may differ by, for example, their geometry). Studies that inform the optimal use of weights may lead to a revised approach for generating quality ratings for NMA estimates.

Conclusion

The GRADE Working Group approach following four steps highlights the necessity for authors of NMA to present direct, indirect, and NMA estimates as well as quality ratings for all direct comparisons. If authors do not present these estimates, scepticism regarding any inferences from the NMA is warranted.

Notes

Footnotes

  • Contributors: MP had the study idea, developed the grading approach, drafted the article and is guarantor of the article; HJS had the study idea and contributed to development of the grading approach and critical revision of the article; MHM and TL contributed to development of the grading approach, data analyses and critical revision of the article; RBP, JAS and AGK contributed to development of the grading approach and critical revision of the article; GG had the study idea, developed the grading approach and drafted the article.

  • Funding: No funding support.

  • Competing interests: We have read and understood the BMJ Group policy on declaration of interests and have no relevant interests to declare.
