GRADE approach to drawing conclusions from a network meta-analysis using a partially contextualised framework
BMJ 2020; 371 doi: https://doi.org/10.1136/bmj.m3907 (Published 10 November 2020) Cite this as: BMJ 2020;371:m3907
- Romina Brignardello-Petersen, methodologist1,
- Ariel Izcovich, clinician2,
- Bram Rochwerg, intensive care physician1,
- Ivan D Florez, clinician and methodologist1 3,
- Glen Hazlewood, rheumatologist4,
- Waleed Alhazanni, intensive care physician1,
- Juan Yepes-Nuñez, allergist5,
- Nancy Santesso, methodologist1,
- Gordon H Guyatt, internist1,
- Holger J Schünemann, internist1 6
- on behalf of the GRADE working group
- 1Department of Health Research Methods, Evidence and Impact, McMaster University, 1280 Main Street West, Hamilton, ON L8S 4L8, Canada
- 2Internal Medicine Service, German Hospital, Buenos Aires, Argentina
- 3Department of Pediatrics, School of Medicine, University of Antioquia, Medellín, Colombia
- 4Department of Medicine, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- 5School of Medicine, Universidad de los Andes, Bogota, Colombia
- 6GRADE Centre and Department of Medicine, McMaster University, Hamilton, ON, Canada
- Correspondence to: R Brignardello-Petersen (or @rominabrigpet on Twitter)
- Accepted 2 October 2020
Systematic review authors should draw conclusions on how interventions compare to others with regard to specific health outcomes, considering the estimates of effects comparing those interventions and the certainty of the evidence (confidence in evidence, quality of evidence).1 When review authors conduct network meta-analysis (NMA), they might also have information on how likely each intervention is to be the most beneficial or harmful for the outcome (rankings). The large amount of information that emerges from an NMA (that is, a relative estimate and its certainty for each comparison, in addition to the rankings) raises challenges in reaching appropriate conclusions that consider all key information.
The GRADE (grading of recommendations assessment, development, and evaluation) working group has presented guidance for evaluating the certainty of the evidence in NMA,23 how to avoid spurious judgments when addressing imprecision,4 and how to assess incoherence.5 In addition, we have provided suggestions for how to present the findings from an NMA in a summary of findings table.6 No guidance so far, however, exists on how to draw conclusions from the comparisons in the NMA.
Based on our experience (and that of the experts who provided feedback), it is unlikely that one intervention is definitely superior to all other interventions for a particular outcome, especially in large networks, for several reasons. Firstly, although treatments can be ranked statistically from best to worst, the effects of interventions that rank higher might not be importantly different from those of interventions that rank lower. In other words, differences in an effect might be trivial, small but important, moderate, or large, and the implications vary importantly across these categories.7 Moreover, certainty of the evidence usually varies from high to very low across the often many comparisons in an NMA. Interventions that rank high might have low or very low certainty evidence, while other interventions might rank low and have higher associated certainty.89 Therefore, there is seldom one intervention with high or moderate certainty evidence indicating that it is clearly superior to all other interventions.
This article describes how to interpret findings of an NMA for each outcome. Depending on the context, interpretation can be done with a minimally contextualised framework, in which value judgments regarding the importance of the magnitude of the effects are minimised; or a partially contextualised framework, in which review authors consider the importance of the magnitude of the effects on an outcome without regard to other outcomes. This article focuses on the partially contextualised framework.
The description of this framework assumes familiarity with the basic concepts of NMA, the implications of GRADE’s certainty of the evidence, and the degrees of contextualisation. This article constitutes official guidance from the GRADE working group. This framework was developed, tested, and refined by the named authors with feedback from the entire GRADE working group that ultimately approved the paper as GRADE guidance.
Network meta-analysis (NMA) rarely establishes that, for a single outcome, one intervention is better than all others
Classification of interventions into those with a trivial, small, moderate, or large effect can better reflect the results and is consistent with guidance by the GRADE (grading of recommendations assessment, development, and evaluation) working group on how to communicate findings
This classification and the resulting conclusions should consider the estimates of effect, certainty of the evidence, and treatment rankings
This article describes GRADE guidance on how to draw conclusions from NMA for one outcome using a partially contextualised approach, which categorises interventions according to the magnitude of effect and further considers the certainty of the evidence and rankings
NMA users and reviewers that apply GRADE should use the new approach to ensure appropriate, informative, user friendly conclusions
This project was conducted under the auspices of the GRADE NMA project group. First, we conducted a systematic survey of the literature showing that no methods have been proposed to make conclusions from an NMA for one outcome that simultaneously considers the results from an NMA and the certainty in the evidence. A core team of experts in systematic review methodology and NMAs then developed an initial framework using a minimally contextualised framework.10
Reviewing the potential benefits of contextualisation, another team of experts (HJS, NS, RB-P) proposed an alternative framework in which the magnitude of the effect and its healthcare interpretation has a central role. This contextualised framework is built on GRADE’s Evidence to Decision frameworks111213 and GRADE’s guidance on how to interpret findings from pairwise comparisons.14
We obtained feedback about this initial framework from other experts in systematic reviews methodology, biostatisticians, and systematic review authors, both with and without experience in NMA, and who were and were not members of the GRADE working group. We also tested this framework in several examples (some examples included in the appendix). Finally, we presented the final framework to the GRADE working group at the meeting in Hamilton, Canada (June 2019) and Adelaide, Australia (November 2019), to obtain approval to publish this framework as GRADE guidance.
The partially contextualised framework to make conclusions from NMA has two guiding principles and four steps, which we describe below. The principles are similar to those for the minimally contextualised framework, but the conceptual underpinnings, some steps, and judgments required differ substantially.
The framework to draw conclusions from NMA, for one outcome, is based on two principles. Firstly, categories of interventions should be considered (eg, those with a trivial effect, small effect, moderate effect, or large effect). The effect can be either a benefit or a harm, depending on the context. In addition, depending on the results of each NMA, there might not be interventions in all the categories that describe the magnitude of the effect. Secondly, the judgments that place interventions in categories will rely on the estimates of effect, and the intervention rankings; and the conclusions will then consider the certainty of the evidence. None of the pieces of information can be used alone to determine whether an intervention is better than others.
Use of partially contextualised framework to draw conclusions from network meta-analyses
The process for drawing conclusions from NMAs has four steps. Review authors must conduct this process after they have finalised ratings of the certainty of the evidence for each comparison in the NMA. We illustrate each of the steps using an example NMA of pharmacological and nutritional interventions for treating acute diarrhoea and gastroenteritis in children.15 The primary outcome of this systematic review was diarrhoea duration, and the treatment effects were measured as difference in hours. The NMA included 138 randomised controlled trials in which researchers recruited 20 256 participants and assessed the effects of 27 interventions. The network has a complex geometry (fig 1), with 62 direct comparisons and 289 indirect comparisons. We present more examples in the appendix,161718 which include dichotomous outcomes.
Step 1: Choose reference intervention and thresholds for effects
Review authors should choose the intervention most connected to the other interventions in the network and use that intervention as a reference. Network estimates that are calculated with direct evidence are more likely to be judged as higher certainty evidence than those calculated with indirect evidence only, which results in classifying the treatments using the highest certainty evidence. In addition, this increases the likelihood of better differentiating between the interventions and achieving a more informative classification than if the classification was based on lower certainty evidence.
The reference intervention must be used for the process of drawing conclusions, but it does not necessarily have to be used as the reference for the purpose of presenting results if other treatments less connected to the network are more clinically meaningful as a reference. In the NMA of interventions for acute diarrhoea in children,15 the reference intervention was standard treatment that included arms characterised as “no active treatment,” “placebo,” or “only oral rehydration solution.”
Similar to GRADE guidance for communicating the results from systematic reviews,14 reviewers assessing the evidence must make judgments for what constitutes a trivial to no effect, small but important effect, moderate effect, and large effect. These judgments will serve as the basis for the classification of the interventions into groups, and should be established by informed review teams that possess the required health knowledge, ideally based on input from key stakeholders. The process for making these choices should be explicit and transparent; they might not be the same, even within the same NMA in different contexts. In addition, and consistent with GRADE guidance, these choices should be made based on absolute estimates rather than relative estimates of effect.
Absolute values (as in our example here) are the natural way to report continuous outcomes. The same, however, is not true for binary outcomes, for which the NMA will yield estimates of relative effect that then need translation into absolute effects. Translation to absolute effects is necessary because judgments of importance (or judgments of magnitude of effect as small, moderate, or large) cannot be made on the basis of relative effects. For example, a 50% relative risk reduction with a baseline risk of 2% represents a 1% absolute risk reduction that might be considered unimportant or, if important, a small effect. That same 50% relative risk reduction, in the setting of a baseline risk of 40%, represents a 20% absolute risk reduction that could be judged as very important and large. For the purpose of this illustration, the authors of the review of interventions for acute diarrhoea15 determined that a small but important effect was a reduction or increase in diarrhoea duration from 3 to 12 hours, a moderate effect was a reduction or increase from 12 to 24 hours, and a large effect was a reduction or increase of 24 hours or more (fig 2).
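The relative-to-absolute arithmetic in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the GRADE guidance: the function name `absolute_risk_reduction` is hypothetical, and the sketch assumes that the relative risk reduction is constant across baseline risks.

```python
def absolute_risk_reduction(baseline_risk: float, relative_risk_reduction: float) -> float:
    """Absolute risk reduction implied by a relative risk reduction.

    Both arguments are proportions between 0 and 1. Hypothetical helper;
    assumes the relative effect is the same at every baseline risk.
    """
    if not 0.0 <= baseline_risk <= 1.0:
        raise ValueError("baseline_risk must be between 0 and 1")
    if not 0.0 <= relative_risk_reduction <= 1.0:
        raise ValueError("relative_risk_reduction must be between 0 and 1")
    return baseline_risk * relative_risk_reduction


# The two scenarios described in the text: the same 50% relative reduction
# applied to a 2% and to a 40% baseline risk.
low_baseline = absolute_risk_reduction(0.02, 0.50)
high_baseline = absolute_risk_reduction(0.40, 0.50)
print(f"Baseline 2%:  {low_baseline:.0%} absolute risk reduction")
print(f"Baseline 40%: {high_baseline:.0%} absolute risk reduction")
```

The same relative effect yields absolute effects that differ twentyfold, which is why the thresholds need to be set on the absolute scale.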
Step 2: Classification based on comparison with reference
In this step, review authors should use the point estimate comparing each of the interventions against the reference. This point estimate, which represents the best estimate of effect, should be assessed against the thresholds for small, moderate, and large effects established in the previous step. Depending on the point estimate, each intervention should be classified as being in the range of trivial, small but important, moderate, or large effects. Depending on its direction, the effect can either be a benefit or a harm when compared with the reference. Figure 2 illustrates this classification in the NMA of interventions for acute diarrhoea.15
The number of groups that result from this classification will depend on the specific NMA. The NMA of interventions for acute diarrhoea in children15 had five groups of interventions: small harm, trivial to no effect, small benefit, moderate benefit, or large benefit (table 1).
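As a minimal sketch of step 2, the classification against the thresholds from the diarrhoea example (3, 12, and 24 hours) could be written as follows. The function name `classify` and the point estimates in the usage lines are hypothetical and are not data from the review. The sketch also assumes that a negative mean difference (shorter diarrhoea duration than the reference) is a benefit, and that a value exactly on a boundary falls into the larger category, which the text does not specify.

```python
# Thresholds in hours, taken from the worked example in the text.
SMALL_THRESHOLD = 3.0
MODERATE_THRESHOLD = 12.0
LARGE_THRESHOLD = 24.0


def classify(mean_difference_hours: float) -> str:
    """Classify a point estimate (intervention minus reference, in hours).

    Only the point estimate is used; the confidence interval plays no
    role at this step. Boundary handling is an assumption of this sketch.
    """
    size = abs(mean_difference_hours)
    if size < SMALL_THRESHOLD:
        return "trivial to no effect"
    if size < MODERATE_THRESHOLD:
        magnitude = "small"
    elif size < LARGE_THRESHOLD:
        magnitude = "moderate"
    else:
        magnitude = "large"
    # Negative difference = shorter diarrhoea duration = benefit.
    direction = "benefit" if mean_difference_hours < 0 else "harm"
    return f"{magnitude} {direction}"


# Hypothetical point estimates, for illustration only.
for estimate in (-30.0, -15.0, -5.0, -1.0, 4.0):
    print(f"{estimate:+.1f} h -> {classify(estimate)}")
```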
Step 3: Identification according to certainty of evidence
In this third step, review authors should use the certainty of the evidence for every treatment, when compared with the reference, in order to make the level of certainty explicit for each comparison with the reference. Review authors can choose to group interventions with high or moderate certainty evidence together, and those with low or very low certainty evidence. This classification might be reasonable in a network with several interventions and with many comparisons across all levels of evidence; however, interventions with low or very low certainty evidence should not be grouped together if most of the interventions have either low or very low certainty when compared with the reference, because review authors would lose the opportunity to differentiate according to evidence certainty. Table 2 shows the classification of interventions for acute diarrhoea and gastroenteritis in children,15 sorted by groups according to the magnitude of the effect and specifying the certainty of the evidence.
Review authors should draw conclusions about how likely it is that each intervention has the magnitude of effect specified, according to the certainty of the evidence. For instance, authors can state that “LGG [Lactobacillus rhamnosus GG] probably has moderate benefits when compared to standard therapy,” and that “Micronutrients may have trivial to no effect compared to standard therapy.”14
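The link between certainty of the evidence and the wording of a conclusion can be sketched as a simple lookup. This is an illustration only: the mapping reflects the qualifiers used in the statements quoted in this article ("probably has", "may have") together with the unqualified form for high certainty evidence and an "uncertain" statement for very low certainty evidence, and the function name `conclusion` and its parameters are hypothetical.

```python
# Qualifier by certainty of the evidence (assumed mapping for this sketch).
QUALIFIER = {
    "high": "has",
    "moderate": "probably has",
    "low": "may have",
}


def conclusion(intervention: str, magnitude: str, certainty: str,
               reference: str = "standard therapy") -> str:
    """Build a plain-language statement for one comparison with the reference."""
    if certainty == "very low":
        return f"The effect of {intervention} compared with {reference} is uncertain."
    if certainty not in QUALIFIER:
        raise ValueError(f"unknown certainty rating: {certainty!r}")
    return f"{intervention} {QUALIFIER[certainty]} {magnitude} compared with {reference}."


print(conclusion("LGG", "a moderate benefit", "moderate"))
print(conclusion("Micronutrients", "trivial to no effect", "low"))
```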
Step 4: Checking consistency with pairwise comparisons and rankings
In the fourth step, review authors should make sure that the classification is consistent with the pairwise comparisons not considered in the process (that is, the comparisons between pairs of interventions that are not the reference) and their certainty. The classification can be reviewed and adjusted if the pairwise comparisons suggest a different conclusion with high or moderate certainty evidence.
In this step, reviewers should consider the possibility that an intervention appears superior to another in relation to the reference intervention but not in a direct comparison between the two. For instance, consider a situation in which intervention A achieves a large benefit relative to placebo (the reference) and intervention B achieves only a moderate benefit relative to placebo. Intervention A will then be ranked higher than intervention B, but this ranking would be problematic if the interventions are directly compared with each other and B does better than A in achieving benefit. Although unlikely to happen, reviewers should be alert to these situations.
When looking at the indirect comparisons between non-reference interventions in the NMA of interventions for acute diarrhoea in children,15 we saw no indications that the classification was not appropriate. For example, when looking at the comparison between Saccharomyces boulardii + zinc (classified as moderate certainty of a large beneficial effect) and yoghurt (classified as very low certainty of a moderate beneficial effect), the estimate comparing them was a mean difference of −22.96 hours of diarrhoea duration (95% confidence interval −42.15 to −4.44, very low quality evidence). This difference suggests that S boulardii + zinc could have a larger benefit than yoghurt. Similarly, when comparing interventions smectite + zinc (classified as moderate certainty of a large benefit) with vitamin A (classified as very low certainty of a small benefit), the estimate (mean difference −29.54 hours (−56.09 to −2.84), moderate quality evidence) suggests that smectite + zinc (M) could have a larger benefit than vitamin A.
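One way to organise the pairwise check in step 4 is to flag any comparison between two non-reference interventions whose point estimate runs against the order of their categories and that is supported by high or moderate certainty evidence. The sketch below is an illustration under stated assumptions: the `Pairwise` record, the function `contradictions`, and the interventions A and B are hypothetical, and a negative mean difference is taken to favour the first intervention.

```python
from dataclasses import dataclass

# Categories ordered from least to most beneficial.
CATEGORY_ORDER = [
    "small harm",
    "trivial to no effect",
    "small benefit",
    "moderate benefit",
    "large benefit",
]


@dataclass
class Pairwise:
    first: str
    second: str
    mean_difference: float  # first minus second, in hours; negative favours first
    certainty: str          # "high", "moderate", "low", or "very low"


def contradictions(categories: dict, comparisons: list) -> list:
    """Return comparisons that contradict the classification.

    Only high or moderate certainty comparisons are considered, following
    the text; using the point estimate alone is an assumption of this sketch.
    """
    flagged = []
    for comparison in comparisons:
        if comparison.certainty not in {"high", "moderate"}:
            continue
        rank_first = CATEGORY_ORDER.index(categories[comparison.first])
        rank_second = CATEGORY_ORDER.index(categories[comparison.second])
        first_ranked_higher_but_worse = rank_first > rank_second and comparison.mean_difference > 0
        second_ranked_higher_but_worse = rank_first < rank_second and comparison.mean_difference < 0
        if first_ranked_higher_but_worse or second_ranked_higher_but_worse:
            flagged.append(comparison)
    return flagged


# Hypothetical scenario from the text: A is classified above B,
# but the comparison of A with B favours B.
example_categories = {"A": "large benefit", "B": "moderate benefit"}
example_comparisons = [Pairwise("A", "B", 5.0, "moderate")]
for item in contradictions(example_categories, example_comparisons):
    print(f"Review classification of {item.first} and {item.second}")
```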
Review authors can also use the rankings, rank probabilities, SUCRA (surface under the cumulative ranking curve) values, or P scores, if available, to check whether the classification in the groups is sensible, and can adjust the classification if necessary. For example, consider again an intervention with a large effect ranked higher than an intervention with a moderate effect; if the first intervention has a considerably lower SUCRA value than the second intervention, it suggests a problem. In the NMA of interventions for acute diarrhoea in children,15 the SUCRA values decreased from the intervention group with a large benefit to the intervention group with a small harm (table 2), indicating no need to revise the classification.
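The SUCRA check can be sketched in the same spirit. The text does not define what counts as a "considerably lower" SUCRA value, so the `margin` parameter below is an arbitrary assumption, as are the function name `sucra_inconsistencies` and the example values, which are not taken from the review. The sketch assumes that higher SUCRA values indicate better performance on the outcome.

```python
from itertools import combinations

# Categories ordered from least to most beneficial.
CATEGORY_ORDER = [
    "small harm",
    "trivial to no effect",
    "small benefit",
    "moderate benefit",
    "large benefit",
]


def sucra_inconsistencies(interventions: dict, margin: float = 0.2) -> list:
    """Flag pairs where the higher-category intervention has a much lower SUCRA.

    `interventions` maps a name to (category, SUCRA between 0 and 1).
    `margin` is a hypothetical cut-off for "considerably lower".
    Returns (higher_category_name, lower_category_name) tuples.
    """
    flagged = []
    for (name_a, (cat_a, sucra_a)), (name_b, (cat_b, sucra_b)) in combinations(
        interventions.items(), 2
    ):
        rank_a = CATEGORY_ORDER.index(cat_a)
        rank_b = CATEGORY_ORDER.index(cat_b)
        if rank_a > rank_b and sucra_b - sucra_a > margin:
            flagged.append((name_a, name_b))
        elif rank_b > rank_a and sucra_a - sucra_b > margin:
            flagged.append((name_b, name_a))
    return flagged


# Hypothetical values, for illustration only.
example = {
    "A": ("large benefit", 0.40),
    "B": ("moderate benefit", 0.75),
    "C": ("small benefit", 0.30),
}
print(sucra_inconsistencies(example))
```

In this made-up example only the pair A and B is flagged, which would prompt review authors to revisit how those two interventions were classified.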
If the assumptions of NMA are met, the likelihood that step 4 changes the classification is low. Review authors should consider the amount of information provided by the pairwise comparisons not considered in previous steps, and safeguard against any possible mistake. After finishing these four steps, review authors can describe this classification to make their conclusions. According to GRADE guidance on how to communicate findings, the conclusions in this example15 are:
When considering all the interventions, symbiotics have a large beneficial effect on diarrhoea duration
When considering all the interventions, S boulardii + zinc and smectite + zinc probably have a large beneficial effect on diarrhoea duration
When considering all the interventions, zinc + probiotics might have a large beneficial effect on diarrhoea duration
When considering all the interventions, zinc + lactose-free formula, zinc, loperamide, and zinc + micronutrients probably have a moderate beneficial effect on diarrhoea duration
When considering all the interventions, all probiotics, racecadotril, S boulardii, and S boulardii + zinc + lactose-free formula classified as low certainty of a moderate beneficial effect might have a moderate beneficial effect on diarrhoea duration
When considering all the interventions, micronutrients might have a trivial effect on diarrhoea duration
For the rest of the interventions, the effect is uncertain because the certainty of the evidence was very low.
This article describes the GRADE working group guidance for drawing conclusions from an NMA using a partially contextualised framework. This framework allows review authors to classify interventions in different groups considering the magnitude of effect, certainty of the evidence, and rankings, if available; and to draw appropriate conclusions. The number of resulting categories depends on the evidence available, how many interventions are included in the NMA, how the interventions compare with one another, and the thresholds of magnitude of effect. This framework follows similar guiding principles to the minimally contextualised framework10 with one important difference.
The main difference between the minimally contextualised framework and the partially contextualised framework is that the categorisation in the partially contextualised framework does not emphasise imprecision over other GRADE domains to determine whether an effect is present (imprecision together with all other GRADE domains is considered when rating the certainty of the network estimate). In contrast, in the minimally contextualised framework we present elsewhere,10 the initial classification relative to the reference standard focuses (as does the subsequent classification considering differences between non-reference interventions) on whether the confidence interval excludes an established threshold.
Using the minimally contextualised framework, for instance, in comparison to the reference standard and using a no-effect decision threshold, categorisation would differ for an absolute risk reduction of 20% (95% confidence interval 1% to 39%) versus the same point estimate with a 95% confidence interval of −1% to 41%. Using the partially contextualised approach, the initial classification would be made on the basis of the point estimate (and with the same point estimate relative to the reference would be placed in the same category), and whether the confidence interval crosses the null would be irrelevant. This partially contextualised approach acknowledges, for example, that an intervention effect with a confidence interval of 1% to 39% that is rated down for risk of bias should not be more trustworthy than an effect with a confidence interval of −1% to 41% that is rated down for imprecision.
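The contrast described above can be illustrated with a short sketch. The two estimates are the ones given in the text (absolute risk reduction of 20%, with 95% confidence intervals of 1% to 39% and −1% to 41%). Everything else is an assumption made for illustration: the function names are hypothetical, and the thresholds for small, moderate, and large effects on this binary outcome are invented, because the article does not specify any.

```python
def excludes_threshold(ci_lower: float, ci_upper: float, threshold: float = 0.0) -> bool:
    """Minimally contextualised check: does the interval exclude the threshold?"""
    return ci_lower > threshold or ci_upper < threshold


# Hypothetical thresholds in percentage points (NOT from the article).
HYPOTHETICAL_THRESHOLDS = [(10.0, "large"), (5.0, "moderate"), (2.0, "small")]


def magnitude_from_point_estimate(arr_points: float) -> str:
    """Partially contextualised check: only the point estimate is used."""
    size = abs(arr_points)
    for lower_bound, label in HYPOTHETICAL_THRESHOLDS:
        if size >= lower_bound:
            return label
    return "trivial to no effect"


# (point estimate, lower limit, upper limit) in percentage points, from the text.
estimates = {
    "interval 1% to 39%": (20.0, 1.0, 39.0),
    "interval -1% to 41%": (20.0, -1.0, 41.0),
}
for label, (point, lower, upper) in estimates.items():
    print(
        f"{label}: excludes no effect = {excludes_threshold(lower, upper)}, "
        f"magnitude = {magnitude_from_point_estimate(point)}"
    )
```

Under the first check the two estimates fall into different categories; under the second they fall into the same one, and the difference in precision is left to the certainty rating.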
However, both frameworks, and indeed almost any system of classification, are vulnerable to the arbitrariness of thresholds. In the minimally contextualised framework, one threshold of focus is no difference between interventions. In the partially contextualised framework, the thresholds of focus are the boundaries between ranges: in the current NMA example,15 a difference of 3.01 hours would be classified differently from a difference of 2.99 hours.
The partially contextualised framework might be particularly appealing in contexts where the specific magnitude of the potential benefit or harm (and whether it represents a trivial, small, moderate, or large effect) is key in helping review authors draw conclusions. This categorisation has an important role in the development of healthcare guidelines when panels judge the balance between health benefits and harms. In such contexts, this framework allows contextualisation through the thresholds of small, moderate, and large effects and other Evidence to Decision criteria.
This framework is described as partially contextualised because it requires the reviewer of the evidence to make explicit and transparent value judgments regarding magnitudes of effect that represent small, moderate, or large benefits or harms. Review authors make value judgments, regardless of the degree of contextualisation, by identifying critical and important outcomes for inclusion in their systematic review. Currently, many systematic reviews are done for specific purposes, for example, to inform a guideline or a health technology assessment. Thus, the guideline panel will require value judgments to make recommendations.
We have not established rules of thumb for these judgments, and they might vary across different settings. In the context of systematic reviews that are designed to inform guidelines, these judgments should be made by the panel of experts and should be informed by evidence regarding patients’ values regarding each outcome.1920 Ideally, the judgments are made by establishing close collaboration between the review team and members of the decision making group (eg, the guideline panel) early on in the process of developing recommendations.21 Authors of guidelines using existing systematic reviews can establish their own thresholds and reclassify the interventions according to their needs; if made transparent, these thresholds can then be reviewed and modified by decision makers.22 In the context of systematic reviews that are not specifically designed to inform guidelines, these judgments can be made by the clinical experts involved in the systematic review team, considering the relative importance of each outcome.
We developed this framework after we recognised that contextualising the classification by considering the importance of the magnitude of the effect could be desirable in many instances. In NMAs where the evidence for most of the comparisons is indirect, and that are more likely to have wide and imprecise estimates, use of this partially contextualised framework also maximises the chances to differentiate among interventions. In this partially contextualised framework, the width of the confidence intervals is accounted for when assessing imprecision and not used again for drawing conclusions.
The main limitation of this framework is that the conclusions depend substantially on the thresholds established, but from our experience in working with many guideline panels we are confident that calibration takes place easily. Thus, while this limitation might be considered a problem, it is no different from what happens when systematic review authors draw conclusions regarding the magnitude of an effect in the context of any meta-analysis. When using this approach, however, review authors must be explicit about the thresholds and should establish them using absolute estimates of effect. Thus, the step of establishing the thresholds is likely to make review authors more aware of the implication of such thresholds and to make them put more thought into the thresholds than usual.
Secondly, although use of only one intervention as the reference might mean that review authors can ignore a large amount of information, the fourth step of the process requires review authors to confirm that the pairwise comparisons between non-reference interventions and the rankings are consistent with the classification. Therefore, review authors have the chance to adjust the classification using the information not considered initially (although this adjustment is probably not needed in an NMA that was designed appropriately and meets the basic assumptions of an NMA). Finally, despite some concern about use of the point estimates alone for making conclusions, the point estimate has been argued to be the best estimate of effect and information regarding any uncertainty reflected in confidence intervals is captured in the rating of the certainty of the evidence.
In summary, this partially contextualised framework guides review authors to make conclusions from NMA, considering all the crucial pieces of information. This framework is likely to be the most appropriate in scenarios where most of the evidence is indirect and when the systematic reviews with NMAs are conducted to inform decisions such as in guidelines or coverage decisions following a health technology assessment.
We thank all members of the GRADE NMA project group and GRADE working group for their input on this manuscript, in particular Monica Hultcranz, Reem Mustafa, Derek Chu, and Ilse Verst.
Contributors: RB-P, JY-N, and HJS developed the principles and initial version of the framework. BR, NS, and GHG provided input that resulted in important modifications. RB-P and AI tested the framework in several examples. IDF, BR, GH, and WA provided data from the examples included in this article. RB-P, AI, and HJS drafted and edited the manuscript, based on feedback from all the authors and members of the GRADE working group. All authors approved the final version of the manuscript. HJS, who had a major role at all stages of this project, is the guarantor of this article. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.
Funding: This project did not receive funding.
Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work
Ethical approval: Not applicable. All the work was developed using published data.
The lead author affirms that this manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained.
Provenance and peer review: Not commissioned; externally peer reviewed.
Patient and public involvement: Due to the nature of this work, we did not include patients and public.