Analysis Rating quality of evidence and strength of recommendations

What is “quality of evidence” and why is it important to clinicians?

BMJ 2008; 336 doi: (Published 01 May 2008) Cite this as: BMJ 2008;336:995
  1. Gordon H Guyatt, professor1,
  2. Andrew D Oxman, researcher2,
  3. Regina Kunz, associate professor3,
  4. Gunn E Vist, researcher2,
  5. Yngve Falck-Ytter, assistant professor4,
  6. Holger J Schünemann, associate professor5
  7. for the GRADE Working Group
  1. 1Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, ON, Canada L8N 3Z5
  2. 2Norwegian Knowledge Centre for the Health Services, PO Box 7004, St Olavs plass, 0130 Oslo, Norway
  3. 3Basel Institute of Clinical Epidemiology, University Hospital Basel, Hebelstrasse 10, 4031 Basel, Switzerland
  4. 4Division of Gastroenterology, Case Medical Center, Case Western Reserve University, Cleveland, OH 44106, USA
  5. 5Department of Epidemiology, CLARITY Research Group, Italian National Cancer Institute Regina Elena, Rome, Italy
  1. Correspondence to: G H Guyatt, CLARITY Research Group, Department of Clinical Epidemiology & Biostatistics, Room 2C12, 1200 Main Street West Hamilton, ON, Canada L8N 3Z5 guyatt{at}

Guideline developers use a bewildering variety of systems to rate the quality of the evidence underlying their recommendations. Some are facile, some confused, and others sophisticated but complex.

In 2004 the Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group presented its initial proposal for patient management.1 In this second of a series of five articles focusing on the GRADE approach to developing and presenting recommendations we show how GRADE has built on previous systems to create a highly structured, transparent, and informative system for rating quality of evidence.

Summary points

  • A guideline’s formulation should include a clear question with specification of all outcomes of importance to patients

  • GRADE offers four levels of evidence quality: high, moderate, low, and very low

  • Randomised trials begin as high quality evidence and observational studies as low quality evidence

  • Quality may be downgraded as a result of limitations in study design or implementation, imprecision of estimates (wide confidence intervals), variability in results, indirectness of evidence, or publication bias

  • Quality may be upgraded because of a very large magnitude of effect, because of a dose-response gradient, or because all plausible biases would reduce an apparent treatment effect

  • Critical outcomes determine the overall quality of evidence

  • Evidence profiles provide simple, transparent summaries

A guideline’s formulation should include a clear question

Any question addressing clinical management has four components: patients, an intervention, a comparison, and the outcomes of interest.2 For example, consider the following: in patients with pancreatic carcinoma undergoing surgery what is the impact of a modified resection that preserves the pylorus compared with a standard wide tumour resection—variations of the Whipple procedure—on short term and long term mortality, blood transfusions, bile leaks, hospital stay, and problems with gastric emptying?

Perhaps the most common error in formulating the question is a failure to include all the outcomes that are of importance to patients.3 Critics have, for example, documented the inadequate measurement of side effects and toxicity in randomised trials,4 5 6 7 a limitation that carries over to summaries on evidence. Guideline developers may give excessive credence to surrogate outcomes such as exercise capacity rather than quality of life, or bone density rather than fracture rate. In the Whipple procedure example, a focus on blood loss or operative time rather than blood transfusion and duration of hospital stay would represent such a limitation.

Failure to fully consider all relevant alternatives constitutes another potential problem in treatment recommendations. This may be particularly problematic when guidelines target a global audience; full consideration of less costly alternatives becomes particularly important in such situations.

Guideline developers should address the importance of their outcomes

Ultimately those making recommendations must trade off the benefits and downsides of alternative management strategies. GRADE challenges guideline developers not only to specify all outcomes of importance to patients as they begin the process of guideline development but also to differentiate outcomes that are critical for decision making from those that are important but not critical and those that are not important. Because experts, clinicians, and patients may have different values and preferences,8 input from those affected by the decision—patients or members of the public—may strengthen this process, as long as selection of public representatives avoids conflicts of interest.9 Although exploration of optimal strategies for making decisions about relative importance remains limited, the desirability of making the process transparent is beyond doubt.

Figure 1 presents a hierarchy of patient important outcomes regarding the impact of phosphate lowering drugs in patients with renal failure. GRADE suggests a nine-point scale to judge importance. The upper end of the scale, 7 to 9, identifies outcomes of critical importance. Ratings of 4 to 6 represent outcomes that are important but not critical to decision making. Ratings of 1 to 3 identify outcomes of limited importance to decision making. Guideline panels should strive for the sort of explicit approach that this example represents.


Fig 1 Hierarchy of outcomes according to importance to patients to assess effect of phosphate lowering drugs in patients with renal failure and hyperphosphataemia

Judging the quality of evidence requires consideration of the context

Clinicians have an intuitive sense of the importance of designating evidence as higher or lower quality. Inferences are clearly stronger for higher quality than for lower quality evidence. GRADE uses four levels for quality of evidence: high, moderate, low, and very low. These levels imply a gradient of confidence in estimates of treatment effect, and thus a gradient in the consequent strength of inference.

GRADE provides a specific definition for the quality of evidence in the context of making recommendations. The quality of evidence reflects the extent to which confidence in an estimate of the effect is adequate to support a particular recommendation. This definition has two important implications. Firstly, guideline panels must make judgments about the quality of evidence relative to the specific context in which they are using the evidence. Secondly, because systematic reviews do not—or at least should not—make recommendations, they require a different definition. In this case the quality of evidence reflects the extent of confidence that an estimate of effect is correct.

The following example illustrates how guideline developers must make judgments about quality in the context of their particular recommendations. Bear in mind that because quality has to do with our confidence in estimates of benefits and risks, lack of precision (wide confidence intervals) is one factor that decreases the quality of evidence.

Let us say that a systematic review of randomised trials of a therapy to prevent major strokes yields a pooled estimate of absolute reduction in strokes of 1.3%, with a 95% confidence interval of 0.6% to 2.0%, over one year’s treatment (fig 2). This implies that 77 patients must be treated for one year to prevent one major stroke. The 95% confidence interval around the number needed to treat (NNT)—50 to 167—means that although 77 is the best estimate, it remains plausible that as few as 50 or as many as 167 people may need to be treated for one year to prevent one major stroke.
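The arithmetic behind these figures is the reciprocal relation between the NNT and the absolute risk reduction (ARR); the confidence limits for the NNT come from the reciprocals of the ARR limits (the larger the ARR, the fewer patients need to be treated). A minimal sketch of the stroke example:

```python
# Illustrative arithmetic only (not part of GRADE itself): the number needed
# to treat (NNT) is the reciprocal of the absolute risk reduction (ARR).

def nnt(arr: float) -> int:
    """Number needed to treat for a given absolute risk reduction."""
    return round(1 / arr)

# Pooled estimate from the stroke example: ARR 1.3% (95% CI 0.6% to 2.0%)
point = nnt(0.013)   # 77
lower = nnt(0.020)   # 50  (larger ARR -> fewer patients to treat)
upper = nnt(0.006)   # 167 (smaller ARR -> more patients to treat)
print(point, lower, upper)   # 77 50 167
```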


Fig 2 Downgrading for imprecision: thresholds are key (threshold number needed to treat (NNT) of 200 does not require downgrading whereas the same result with a threshold of 100 requires downgrading)

Let it be assumed that the intervention is a drug with no serious adverse effects, minimal inconvenience, and modest cost. Under such circumstances we may be willing to enthusiastically recommend the intervention were it to reduce strokes by as little as 0.5% (blue line in fig 2)—this implies that the NNT=200. The confidence interval around the treatment effect excludes a benefit this small. We can therefore conclude that the precision (and thus the quality of the evidence) is sufficient to support a strong recommendation for the intervention.

What if, however, treatment is associated with serious toxicity and higher cost? Under these circumstances we may be reluctant to recommend treatment unless the absolute reduction in strokes is at least 1% (NNT=100; red dashed line in fig 2). The results fail to exclude an absolute benefit appreciably less than 1%. Under these circumstances the precision (and thus the quality of the evidence) is insufficient to support a strong recommendation for treatment. The thresholds chosen in this example are consistent with empirical explorations of patient values and preferences.10

In summary, greater levels of precision may be required to support a recommendation when advantages and disadvantages are closely balanced. Thus when this fine balance exists it is more likely that guideline developers will need to downgrade the evidence for imprecision.
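The threshold logic in fig 2 can be sketched in a few lines, under the assumption (made explicit in the text) that a panel downgrades for imprecision when the 95% confidence interval for the absolute benefit fails to exclude the smallest effect that would still warrant a strong recommendation:

```python
# Hedged sketch of the fig 2 reasoning: compare the lower confidence limit
# of the absolute risk reduction (ARR) against the panel's threshold of
# worthwhile benefit. Thresholds below are the ones used in the article.

def adequate_precision(arr_ci_lower: float, threshold: float) -> bool:
    """True if the whole confidence interval lies above the threshold."""
    return arr_ci_lower > threshold

arr_ci_lower = 0.006   # lower limit of the ARR 95% CI in the stroke example

# Benign, cheap drug: worthwhile down to an ARR of 0.5% (NNT 200)
print(adequate_precision(arr_ci_lower, 0.005))   # True -> no downgrade

# Toxic, costly drug: worthwhile only above an ARR of 1% (NNT 100)
print(adequate_precision(arr_ci_lower, 0.010))   # False -> downgrade
```

The same evidence thus supports a strong recommendation under one threshold but must be downgraded for imprecision under the other, which is the point of the example.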

This example illustrates that although judgments are not arbitrary when evidence is of sufficiently high quality, they rely heavily on underlying values and preferences.11 Guideline developers must therefore be transparent both in making such decisions and in providing a justification. In doing so they will find it useful to specifically consider the domains of quality assessment that GRADE has identified (see box).12

Factors in deciding on quality of evidence

Factors that might decrease quality of evidence
  • Study limitations

  • Inconsistency of results

  • Indirectness of evidence

  • Imprecision

  • Publication bias

Factors that might increase quality of evidence
  • Large magnitude of effect

  • Plausible confounding, which would reduce a demonstrated effect

  • Dose-response gradient

Study design is important in determining the quality of evidence

Early systems of grading the quality of evidence focused almost exclusively on study design.13 Study design remains critical to judgments about the quality of evidence. For recommendations addressing alternative management strategies—as opposed to issues of establishing prognosis or the accuracy of diagnostic tests—randomised trials provide, in general, stronger evidence than do observational studies. Rigorous observational studies provide stronger evidence than uncontrolled case series. In the GRADE approach to quality of evidence, randomised trials without important limitations constitute high quality evidence. Observational studies without special strengths or important limitations constitute low quality evidence. Limitations or special strengths can, however, modify the quality of the evidence.

Five limitations can reduce the quality of the evidence

The GRADE approach involves making separate ratings for quality of evidence for each patient important outcome and identifies five factors that can lower the quality of the evidence (see box). These limitations can downgrade the quality of observational studies as well as randomised controlled trials.

Study limitations

Confidence in recommendations decreases if studies have major limitations that may bias their estimates of the treatment effect.14 These limitations include lack of allocation concealment; lack of blinding, particularly if outcomes are subjective and their assessment highly susceptible to bias; a large loss to follow-up; failure to adhere to an intention to treat analysis; stopping early for benefit15; or selective reporting of outcomes (typically failing to report those for which no effect was observed). For example, a randomised trial suggests that danaparoid sodium is of benefit in treating heparin induced thrombocytopenia complicated by thrombosis.16 That trial, however, was unblinded and the key outcome was the clinicians’ assessment of when the thromboembolism had resolved, a subjective judgment.

Most of the randomised trials examining the relative impact of a standard compared with a modified Whipple procedure were limited by lack of optimal allocation concealment, lack of blinding of patients and of adjudicators of outcome, and substantial losses to follow-up. Thus the quality of evidence for each of the important outcomes is no higher than moderate (table 1).

Table 1

 GRADE evidence profile for impact of surgical alternatives for pancreatic cancer from systematic review and meta-analysis of randomised controlled trials in inpatient hospitals of pylorus preserving versus standard Whipple pancreaticoduodenectomy for pancreatic or periampullary cancer by Karanicolas et al19


Inconsistent results

Widely differing estimates of the treatment effect (heterogeneity or variability in results) across studies suggest true differences in the underlying treatment effect. Variability may arise from differences in populations (for example, drugs may have larger relative effects in sicker populations), interventions (for example, larger effects with higher drug doses), or outcomes (for example, diminishing treatment effect with time). When heterogeneity exists but investigators fail to identify a plausible explanation then the quality of evidence decreases.

For example, the randomised trials of alternative approaches to the Whipple procedure yielded widely differing estimates of effects on gastric emptying, thus further decreasing the quality of the evidence (fig 3).


Fig 3 Effect on delayed gastric emptying of pylorus preserving pancreaticoduodenectomy compared with standard Whipple procedure for pancreatic adenocarcinoma

Indirectness of evidence

Guideline developers face two types of indirectness of evidence. The first occurs when, for example, a panel is considering which of two active drugs to use. Although randomised comparisons of the drugs may be unavailable, randomised trials may have compared one of the drugs with placebo and the other with placebo. Such trials allow indirect comparisons of the magnitude of effect of the two drugs. This evidence is of lower quality than head to head comparisons of the drugs would provide.

Increasingly, recommendations must simultaneously tackle multiple interventions. For example, possible approaches to thrombolysis in myocardial infarction include streptokinase, alteplase, reteplase, and tenecteplase. Attempts to deal with multiple interventions inevitably involve indirect comparisons. A variety of recently developed statistical methods may help in generating estimates of the relative effectiveness of multiple interventions.17 Their confident application requires, in addition to evidence from indirect comparisons, substantial evidence from direct comparisons—evidence that is often unavailable.17

The second type of indirectness comprises differences between the population, intervention, comparator, and outcome of interest and those included in the relevant studies. Table 2 presents examples of each.

Table 2

 Quality of evidence is weaker if comparisons in trials are indirect



Imprecision

When studies include relatively few patients and few events, and thus have wide confidence intervals, a guideline panel judges the quality of the evidence to be lower because of the resulting uncertainty in the results. For example, the confidence intervals around most of the outcomes for alternatives to the Whipple procedure include both important effects and no effect at all, and some include important differences in both directions.

Publication bias

The quality of evidence will be reduced if investigators fail to report studies they have undertaken. Unfortunately guideline panels must often guess about the likelihood of publication bias. A prototypical situation that should elicit suspicion of publication bias occurs when published evidence is limited to a small number of trials, all of which are funded by industry. For example, 14 trials of flavonoids in patients with haemorrhoids have shown apparent large benefits, but enrolled a total of only 1432 patients.18 The heavy involvement of sponsors in most of these trials raises questions of whether unpublished trials suggesting no benefit exist.

A particular body of evidence can have more than one of these limitations, and the greater the limitations the lower the quality of the evidence. For example, despite the availability of five randomised trials only very low quality evidence exists for the effect of alternative surgical procedures in patients with pancreatic carcinoma on the incidence of gastric emptying problems (table 1).19

Three factors can increase the quality of evidence

Although well done observational studies generally yield low quality evidence, in unusual circumstances they may produce moderate or even high quality evidence (see box).20

Firstly, when methodologically strong observational studies yield large or very large and consistent estimates of the magnitude of a treatment effect, we may be confident about the results. In those situations, although the observational studies are likely to have provided an overestimate of the true effect, the weak study design is unlikely to explain all of the apparent benefit.

The larger the magnitude of effect, the stronger becomes the evidence. For example, a meta-analysis of observational studies showed that bicycle helmets reduce the risk of head injuries in cyclists involved in a crash by a large margin (odds ratio 0.31, 95% confidence interval 0.26 to 0.37).21 This large effect suggests a rating of moderate quality evidence. A meta-analysis of observational studies evaluating the impact of warfarin prophylaxis in cardiac valve replacement found that the relative risk for thromboembolism with warfarin was 0.17 (95% confidence interval 0.13 to 0.24).22 This very large effect suggests a rating of high quality evidence.
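The article does not give numeric cut-offs, but later GRADE guidance commonly treats a relative effect greater than 2 (or less than 0.5) as "large" (upgrade one level) and greater than 5 (or less than 0.2) as "very large" (upgrade two levels). Taking those cut-offs as an assumption, a sketch reproduces the ratings given for the two examples above:

```python
# Assumed cut-offs (from later GRADE guidance, not stated in this article):
# relative effect > 2 or < 0.5 -> large (upgrade 1 level);
# relative effect > 5 or < 0.2 -> very large (upgrade 2 levels).

def upgrade_levels(relative_effect: float) -> int:
    """Levels to upgrade low quality observational evidence for magnitude."""
    magnitude = max(relative_effect, 1 / relative_effect)
    if magnitude > 5:
        return 2   # very large effect: low -> high
    if magnitude > 2:
        return 1   # large effect: low -> moderate
    return 0

print(upgrade_levels(0.31))   # 1  (bicycle helmets: low -> moderate)
print(upgrade_levels(0.17))   # 2  (warfarin prophylaxis: low -> high)
```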

Secondly, on occasion all plausible biases from observational studies may be working to underestimate the true treatment effect. If, for example, only sicker patients receive an experimental intervention or exposure, yet those patients still fare better, it is likely that the actual intervention or exposure effect is larger than the data suggest. A rigorous systematic review of observational studies that included a total of 38 million patients found higher death rates in private for profit hospitals than in private not for profit hospitals. Biases related to different disease severity in patients in the two hospital types, and the spillover effect from well insured patients, would both lead to estimates in favour of for profit hospitals.23 Therefore the evidence from these observational studies might be considered as of moderate quality rather than low quality—that is, the effect is likely to be at least as large as was observed and may be larger.

Thirdly, the presence of a dose-response gradient may increase confidence in the findings of observational studies and thereby increase the assigned quality of evidence. For example, the observation that, in patients receiving anticoagulation with warfarin, there is a dose-response gradient between higher levels of the international normalised ratio and an increased risk of bleeding increases confidence that supratherapeutic anticoagulation levels increase the risk of bleeding.24

Critical outcomes determine the rating of evidence quality across outcomes

Recommendations depend on evidence for several patient important outcomes and the quality of evidence for each of those outcomes. This presents two challenges. Firstly, how should guideline developers decide which outcomes are important enough to consider, and which are critical? We suggest that guideline developers should explicitly consider these problems, taking into account the views of those affected.

Secondly, how should the quality of evidence be rated across outcomes if quality differs? This occurred in the Whipple procedure example, in which the evidence varied from moderate to very low quality (table 1).

In cases such as the Whipple procedure example, guideline developers should consider whether undesirable consequences of therapy are important but not critical to the decision on the optimal management strategy, or whether they are critical. If an outcome for which evidence is of lower quality is critical for decision making then the rating of quality of the evidence across outcomes must reflect this lower quality evidence. If the outcome for which evidence is lower quality is important but not critical, the GRADE approach suggests a rating across outcomes that reflects the higher quality evidence from the critical outcomes. Thus for the Whipple procedure example, if those making recommendations thought that gastric emptying problems were critical, the rating of evidence quality across outcomes would be very low. If gastric emptying was important but not critical, the quality rating across outcomes would be low (on the basis of results from the clearly critical perioperative mortality) despite the presence of moderate quality evidence on survival at five years (table 1).
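The rule described above reduces to a simple minimum: the overall rating equals the lowest quality among the outcomes judged critical, while outcomes that are important but not critical do not drag it down. A sketch using the Whipple example (the criticality judgments are illustrative, as the text itself stresses they are a matter for the panel):

```python
# Sketch of the GRADE rule for rating quality across outcomes: overall
# quality = lowest quality among the critical outcomes.

QUALITY = {"high": 4, "moderate": 3, "low": 2, "very low": 1}

def overall_quality(outcomes: dict) -> str:
    """outcomes maps outcome name -> (quality rating, is_critical)."""
    critical = [q for q, is_critical in outcomes.values() if is_critical]
    return min(critical, key=QUALITY.get)

# Whipple procedure example (ratings as in table 1; criticality illustrative)
whipple = {
    "perioperative mortality": ("low", True),
    "survival at five years": ("moderate", True),
    "gastric emptying problems": ("very low", False),  # important, not critical
}
print(overall_quality(whipple))   # low

# If the panel instead judged gastric emptying problems critical:
whipple["gastric emptying problems"] = ("very low", True)
print(overall_quality(whipple))   # very low
```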

Evidence profiles provide simple, transparent summaries

Busy clinicians—and busy patients and policy makers—require succinct, transparent, easily digested summaries of evidence. The GRADE process facilitates the creation of such summaries. Table 1, which presents the relative effect of the standard Whipple procedure compared with more limited resection (pylorus preservation) for patients with pancreatic carcinoma, informs us that more limited resection may decrease blood loss and perioperative mortality without increasing long term adverse outcomes, but that the evidence remains limited.


GRADE provides a clearly articulated, comprehensive, and transparent methodology for rating and summarising the quality of evidence supporting management recommendations. Although judgments will always be required for each step, the systematic and transparent GRADE approach facilitates scrutiny of and debate about those judgments.


  • This is a series of five articles that explain the GRADE system for rating the quality of evidence and strength of recommendations

  • Contributors: All authors, including the members of the GRADE Working Group, contributed to the development of the ideas in the manuscript and read and approved the manuscript. GG wrote the first draft and collated comments from authors and reviewers for subsequent iterations. He is guarantor for this manuscript. All authors listed in the byline contributed ideas about structure and content, provided examples, and reviewed successive drafts of the manuscript and provided feedback.

  • The members of the GRADE Working Group are Phil Alderson, Pablo Alonso-Coello, Jeff Andrews, David Atkins, Hilda Bastian, Hans de Beer, Jan Brozek, Francoise Cluzeau, Jonathan Craig, Ben Djulbegovic, Yngve Falck-Ytter, Beatrice Fervers, Signe Flottorp, Paul Glasziou, Gordon H Guyatt, Margaret Haugh, Robin Harbour, Mark Helfand, Sue Hill, Roman Jaeschke, Katharine Jones, Ilkka Kunnamo, Regina Kunz, Alessandro Liberati, Merce Marzo, James Mason, Jacek Mrukowics, Susan Norris, Andrew D Oxman, Vivian Robinson, Holger J Schünemann, Tessa Tan Torres, David Tovey, Peter Tugwell, Mariska Tuut, Helena Varonen, Gunn E Vist, Craig Wittington, John Williams, and James Woodcock.

  • Funding: No specific funding.

  • Competing interests: All authors are involved in the dissemination of GRADE, and GRADE’s success has a positive influence on their academic career. Authors listed in the byline have received travel reimbursement and honorariums for presentations that included a review of GRADE’s approach to rating quality of evidence and grading recommendations. GHG acts as a consultant to UpToDate; his work includes helping UpToDate in their use of GRADE. HJS is documents editor and methodologist for the American Thoracic Society; one of his roles in these positions is helping implement the use of GRADE. He is supported by “The human factor, mobility and Marie Curie actions scientist reintegration European Commission grant: IGR 42192—GRADE.”

  • Provenance and peer review: Not commissioned; externally peer reviewed.

