BMJ  2008;336:1049-1051 (10 May), doi:10.1136/bmj.39493.646875.AE

Analysis

Rating quality of evidence and strength of recommendations

Going from evidence to recommendations

Gordon H Guyatt, professor1, Andrew D Oxman, researcher2, Regina Kunz, associate professor3, Yngve Falck-Ytter, assistant professor4, Gunn E Vist, researcher2, Alessandro Liberati, associate professor5, Holger J Schünemann, associate professor6, for the GRADE Working Group

1 CLARITY Research Group, Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, ON, Canada L8N 3Z5, 2 Norwegian Knowledge Centre for the Health Services, Oslo, Norway, 3 Basel Institute of Clinical Epidemiology, University Hospital Basel, Basel, Switzerland, 4 Division of Gastroenterology, Case Medical Center, Case Western Reserve University, Cleveland, OH 44106, USA, 5 University of Modena and Reggio Emilia and Agenzia Sanitaria Regionale, Bologna, Italy, 6 Department of Epidemiology, CLARITY Research Group, Italian National Cancer Institute Regina Elena, Rome, Italy

Correspondence to: G H Guyatt guyatt@mcmaster.ca

Analysis, doi: 10.1136/bmj.39490.551019.BE
Analysis, doi: 10.1136/bmj.39489.470347.AD

The GRADE system classifies recommendations made in guidelines as either strong or weak. This article explores the meaning of these descriptions and their implications for patients, clinicians, and policy makers.


Summary points

  • The strength of a recommendation reflects the extent to which we can be confident that desirable effects of an intervention outweigh undesirable effects
  • GRADE classifies recommendations as strong or weak
  • Strong recommendations mean that most informed patients would choose the recommended management and that clinicians can structure their interactions with patients accordingly
  • Weak recommendations mean that patients’ choices will vary according to their values and preferences, and clinicians must ensure that patients’ care is in keeping with their values and preferences
  • Strength of recommendation is determined by the balance between desirable and undesirable consequences of alternative management strategies, quality of evidence, variability in values and preferences, and resource use.


This is the third of a series of five articles describing the GRADE approach to developing and presenting recommendations for management of patients. In it, we deal with how GRADE suggests clinicians should interpret the strength of a recommendation.

What do we mean by strength of recommendation?

The strength of a recommendation reflects the extent to which we can, across the range of patients for whom the recommendations are intended, be confident that the desirable effects of an intervention outweigh the undesirable effects. Alternatively, in considering two or more possible management strategies, a recommendation’s strength represents our confidence that the net benefit clearly favours one alternative or another.

Desirable effects of an intervention include reduction in morbidity and mortality, improvement in quality of life, reduction in the burden of treatment (such as having to take drugs or the inconvenience of having blood tests or going to the doctor’s office for monitoring), and reduced resource expenditures. Undesirable consequences include adverse effects that may have a deleterious impact on morbidity, mortality, or quality of life or increase use of resources.

Previous grading systems have sometimes used complex systems of recommendations with up to nine categories of strength of recommendations.1 GRADE has taken a very different approach, with only two categories. Although in this article we will characterise these as strong and weak, guideline panels may choose different words to characterise the two categories of strength. When using GRADE, panels make strong recommendations when they are confident that the desirable effects of adherence to a recommendation outweigh the undesirable effects. When they make a weak recommendation, the panel has concluded that the desirable effects of adherence to a recommendation probably outweigh the undesirable effects, but it is not confident. Another way to state the significance of a weak recommendation is to say that most patients will be better off if clinicians follow the recommendation, but many other patients will not.

Strong and weak recommendations provide specific guidance

The great merit of GRADE’s binary classification of strength of recommendations is that it provides clear direction to patients, clinicians, and policy makers. The implications of a strong recommendation are:

  • For patients—most people in your situation would want the recommended course of action and only a small proportion would not; request discussion if the intervention is not offered
  • For clinicians—most patients should receive the recommended course of action
  • For policy makers—the recommendation can be adopted as a policy in most situations.

The implications of a weak recommendation are:

  • For patients—most people in your situation would want the recommended course of action, but many would not
  • For clinicians—you should recognise that different choices will be appropriate for different patients and that you must help each patient to arrive at a management decision consistent with her or his values and preferences
  • For policy makers—policy making will require substantial debate and involvement of many stakeholders.

As clinicians are becoming more aware of variability in patients’ values and preferences, and how differences in values and preferences influence management decisions, they are increasingly turning to structured decision aids to facilitate the decision making process.2 A strong recommendation indicates that use of a decision aid is unnecessary and may be an inefficient use of time and energy—almost all informed patients will make the same choice. A weak recommendation indicates that a decision aid, if available, could be useful.

Managers of healthcare systems worldwide are becoming more interested in taking measures to ensure the quality of care. The United States is the site of the most aggressive approach: the physician voluntary reporting program seems to presage "pay for performance" initiatives. Practices based on high quality evidence in which desirable consequences far exceed undesirable consequences constitute appropriate candidates for quality of care criteria. When evidence is lower quality, or desirable and undesirable consequences are more closely balanced, variable management is reasonable and management practices should be considered discretionary and not candidates for quality assessment. Guidelines should provide guidance to differentiate these two situations. GRADE provides the clearest possible statement on these matters: the management options associated with strong, but not with weak, recommendations are candidates for quality criteria. When a recommendation is weak, discussing with patients and families the relative merits of the alternative management strategies may become a quality criterion.

Four key factors determine the strength of a recommendation

Balance between desirable and undesirable effects
The first key determinant of the strength of a recommendation is the balance between the desirable and undesirable consequences of the alternative management strategies, on the basis of the best estimates of those consequences (table). Consider, for instance, the use of antenatal steroids in women destined to deliver an infant prematurely. High quality evidence shows that administration of steroids to mothers decreases the risk of infant respiratory distress syndrome with minimal side effects, inconvenience, and costs. The advantages of administration of steroids hugely outweigh the disadvantages, indicating the appropriateness of a strong recommendation.


Table: Determinants of strength of recommendation

 
When advantages and disadvantages are closely balanced, a weak recommendation becomes appropriate. Consider, for instance, patients with atrial fibrillation at low risk of stroke. Warfarin can reduce that low risk even further but adds inconvenience and an increased risk of bleeding. The right choice under such circumstances is not self evident and is likely to differ between patients.

As with all other aspects of a grading system, a tension exists between the important goal of simplicity and the danger of oversimplification. We have presented the trade-off between advantages and disadvantages as a dichotomy: clear difference versus a close call. Of course, the reality is a continuum between these extremes. Nevertheless, the forced dichotomisation allows simplification of a process that many people already find complex and may enhance the transparency of decision making.

Quality of evidence
The second factor that determines the strength of a recommendation is the quality of the evidence. If we are uncertain of the magnitude of the benefits and harms of an intervention, making a strong recommendation for or against that intervention becomes problematic. Thus, even when an apparent large gradient exists in the balance of advantages and disadvantages, guideline developers will be appropriately reluctant to offer a strong recommendation if the quality of the evidence is low.

For instance, graduated compression stockings have an apparent large effect in reducing deep venous thrombosis in people making long plane journeys. The randomised trials from which the estimate of effect comes were, however, seriously flawed—randomisation was unconcealed, the techniques for measuring deep venous thrombosis were not reproducible, and the studies were not blinded. Despite the apparent large benefit, and the only major disadvantage being inconvenience, use of stockings warrants only a weak recommendation.3

Values and preferences
The third determinant of the strength of recommendation is uncertainty about, or variability in, values and preferences. Given that alternative management strategies will always have advantages and disadvantages, and thus a trade-off will occur, how a guideline panel values benefits, risks, and inconvenience is critical to any recommendation and the strength of the recommendation. One could argue that, given the very limited study the subject has received, large uncertainty always exists about values and preferences. On the other hand, some systematic study of values and preferences has been completed, and clinicians’ experience with patients provides additional insight.

Consider, for instance, prevention of stroke in patients with atrial fibrillation. Warfarin, relative to no antithrombotic treatment, reduces the risk of stroke, in relative terms, by approximately 65%, but at an appreciable increased risk of severe gastrointestinal bleeding. Devereaux and colleagues asked 63 physicians and 61 patients how many serious gastrointestinal bleeds they would tolerate in 100 patients and still be willing to prescribe or take warfarin to prevent eight strokes (four minor and four major) in 100 patients.4 Figure 1 shows the results. Whereas physicians gave a wide diversity of responses, most patients placed a high value on avoiding a stroke and were ready to accept a bleeding risk of 22% to reduce their chances of having a stroke by 8%. Even among patients, however, diversity in values and preferences was apparent; a few patients were ready to accept only a small risk of bleeding to reduce their stroke risk by 8%. These data suggest that only in patients at high risk of stroke would a strong recommendation for warfarin be warranted.
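The trade-off that respondents were asked to make can be restated as simple arithmetic. The sketch below is an illustration only and is not part of the study: the function name is hypothetical, and the only inputs are the figures quoted above (a threshold of 22 bleeds per 100 patients against eight strokes prevented per 100 patients).

```python
def max_bleeds_per_stroke_prevented(bleed_threshold_per_100: float,
                                    strokes_prevented_per_100: float) -> float:
    """Return how many serious bleeds a respondent tolerates per stroke prevented.

    Hypothetical helper for illustration; it only divides the two quoted figures.
    """
    if strokes_prevented_per_100 <= 0:
        raise ValueError("strokes prevented must be positive")
    return bleed_threshold_per_100 / strokes_prevented_per_100


# Figures quoted in the text: 22 bleeds tolerated to prevent 8 strokes per 100 patients.
ratio = max_bleeds_per_stroke_prevented(22, 8)
print(f"{ratio:.2f} bleeds tolerated per stroke prevented")  # 2.75
```

A respondent with a much lower threshold would decline warfarin for the same stroke reduction, which is the kind of variability in values and preferences that argues against a strong recommendation in patients at low risk.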


Fig 1 Varying thresholds of major gastrointestinal bleeding found acceptable by patients and physicians for prevention of eight strokes in 100 patients

 
Contrast this with the decision faced by pregnant women with deep venous thrombosis. Warfarin therapy between the sixth and 12th week of pregnancy puts women’s unborn infants at risk of relatively minor developmental abnormalities. The alternative, heparin, eliminates the risk to the child. This benefit, however, comes with disadvantages of pain (heparin injections), inconvenience, and cost. Nevertheless, clinicians’ experience is that women overwhelmingly place a very high value on preventing fetal complications. As a result, a strong recommendation for substitution of heparin is warranted.

Given the paucity of empirical examinations of patients’ values and preferences, well resourced guideline panels will usually have to rely on consultation with individual patients and patients’ groups to gain insight into patients’ values. Less well resourced panels must rely on their intuitive impressions of these values. In either case, when a recommendation is particularly dependent on values and preferences, panels must state the values underlying their decision. For instance, the following assumption provided the basis for recommendations in a guideline for antithrombotic treatment in pregnant women: "While we are unaware of any research specifically addressing women’s preferences regarding antithrombotic therapy in pregnancy, anecdotal evidence suggests that many, though not all women, give higher priority to the impact of any treatment on the health of their unborn baby than to effects on themselves."5

Costs
The final determinant of the strength of a recommendation is cost. One could consider cost as one of the outcomes when weighing up the advantages and disadvantages of competing management strategies. Cost, however, is much more variable over time, geographical areas, and implications than are other outcomes. Drug costs tend to plummet when patents expire, and charges for the same drug differ widely across jurisdictions. In addition, the resource implications vary widely. For instance, a year’s prescription of the same expensive drug may pay for a single nurse’s salary in the United States, six nurses’ salaries in Poland, and 30 nurses’ salaries in China.
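The nurse salary comparison is, in effect, an opportunity cost calculation. A minimal sketch follows, using only the ratios quoted above; the dictionary, the function name, and the cohort size of 1000 patient years are illustrative assumptions and not data from this article.

```python
# Nurse salaries that one patient year of the drug could fund instead,
# taken from the ratios quoted in the text (1, 6, and 30).
NURSE_SALARIES_PER_PATIENT_YEAR = {
    "United States": 1,
    "Poland": 6,
    "China": 30,
}


def nurse_salaries_forgone(setting: str, patient_years: int) -> int:
    """Return the number of annual nurse salaries forgone by funding the drug."""
    return NURSE_SALARIES_PER_PATIENT_YEAR[setting] * patient_years


# Hypothetical cohort: 1000 patient years of treatment in each setting.
for setting in NURSE_SALARIES_PER_PATIENT_YEAR:
    print(setting, nurse_salaries_forgone(setting, 1000))
```

The same drug price therefore displaces very different amounts of other care depending on the setting, which is why resource implications cannot be treated as a fixed outcome.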

Thus, although higher costs will reduce the likelihood of a strong recommendation in favour of a particular intervention, the context of the recommendation will be critical. In considering resource allocation, guideline panels must thus be very specific about the setting to which a recommendation applies and the perspective that is used—that is, which costs were considered. Furthermore, recommendations that are heavily influenced by costs are likely to change over time as resource implications evolve.
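The way the four determinants combine can be sketched in code, with an important caveat: GRADE panels reach these judgments qualitatively, and the all-or-nothing rule below is a deliberate simplification assumed for illustration, not an algorithm taken from this article. All class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PanelJudgments:
    """A panel's qualitative judgments on the four determinants (hypothetical names)."""
    clear_net_benefit: bool        # desirable effects clearly outweigh undesirable ones
    high_quality_evidence: bool    # confidence in the estimates of benefit and harm
    consistent_values: bool        # little uncertainty or variability in values and preferences
    acceptable_resource_use: bool  # costs do not undermine net benefit in the stated setting


def suggested_strength(j: PanelJudgments) -> str:
    """Suggest 'strong' only when every determinant supports it, otherwise 'weak'."""
    supports = (j.clear_net_benefit, j.high_quality_evidence,
                j.consistent_values, j.acceptable_resource_use)
    return "strong" if all(supports) else "weak"


# Antenatal steroids: large net benefit, high quality evidence, minimal costs.
# Consistent values are assumed here for illustration.
steroids = PanelJudgments(True, True, True, True)
# Compression stockings on long flights: apparent large benefit, seriously flawed trials.
# The values and cost judgments are assumed here for illustration.
stockings = PanelJudgments(True, False, True, True)

print(suggested_strength(steroids))   # strong
print(suggested_strength(stockings))  # weak
```

In practice each field hides a judgment on a continuum, so a sketch like this is best read as a checklist of what a panel should make explicit rather than as a scoring rule.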

Strong recommendations may not be important from all perspectives

If the consequences of the choice are relatively unimportant, some patients may not bother with even strong recommendations. This is particularly likely if they are faced with many new drugs or many suggestions to change their lifestyle.

Because governments and public health officials must consider several factors beyond the strength of a recommendation, they may consider that some strong recommendations that are important for individual patients have low priority. These factors, generally of little relevance to recommendations directed at clinicians, include the prevalence of the health problem (higher priority for more prevalent conditions), considerations of equity (higher priority for interventions that tackle health inequities by targeting disadvantaged populations), total cost to society (lower priority for interventions with very high total costs), and the potential for improvement in quality of care (higher priority for underused interventions). Thus, if guideline panels are addressing funders or health system managers, they must make transparent the manner in which factors related to prevalence, equity, cost, and improving quality of care influence their recommendations.

Recommendations to use interventions in research context may be appropriate

Guideline panels may face decisions about promising interventions associated with appreciable harms or costs and with insufficient evidence of benefit to support their use. They may be reluctant to close the door on such an intervention or inappropriately provide a weak recommendation for its use—a course of action that may lead to wide dissemination and resulting harm. Consider, for instance, the impact of recommendations in favour of hormone replacement therapy to prevent cardiovascular disease in postmenopausal women. When interventions are expensive, an additional problem with premature recommendations in favour of an intervention is the risk of irretrievable allocation of resources that would be better spent elsewhere. Nevertheless, a guideline panel’s fears will be realised if the appropriate strong, or even weak, recommendation against use of the intervention in clinical practice has the effect of stifling further investigation.

Recommendations for use of an intervention only in the context of research may ameliorate these problems. Furthermore, such a recommendation may provide an important boost to efforts to answer important research questions, thus resolving uncertainty about optimal patient management.

Recommendations for use of interventions exclusively in the context of research will be appropriate when two conditions are met. Firstly, insufficient evidence must exist for a panel to suggest using or not using an intervention. Secondly, further research must have a large potential for reducing uncertainty about the effects of the intervention and for doing so at a reasonable cost. Guideline panels that do not have the skills and knowledge to set or apply criteria for research priorities should refrain from making recommendations about the use of interventions exclusively in the context of research. Organisations such as the National Institute for Health and Clinical Excellence may be well equipped to make such judgments: of its first 95 technology appraisals, eight led to recommendations for use in the context of research.
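The two conditions can be written as a simple check. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the second condition is split into its two parts (potential to reduce uncertainty, and reasonable cost), and each input stands for a panel judgment rather than a measured quantity. The last lines only restate the appraisal figures quoted above as a proportion.

```python
def research_only_appropriate(insufficient_evidence: bool,
                              research_likely_to_reduce_uncertainty: bool,
                              research_cost_reasonable: bool) -> bool:
    """Return True when both conditions for a research-only recommendation hold."""
    return (insufficient_evidence
            and research_likely_to_reduce_uncertainty
            and research_cost_reasonable)


print(research_only_appropriate(True, True, True))   # True
print(research_only_appropriate(True, True, False))  # False

# Figures quoted in the text: eight of the first 95 technology appraisals.
share = 8 / 95
print(f"{share:.1%}")  # 8.4%
```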

Various presentations of quality of evidence and strength of recommendations may be appropriate

Most guideline panels have used letters and numbers to summarise their recommendations. Because of highly variable use of numbers and letters—for instance, some organisations have chosen letters for quality of evidence and numbers for strength of recommendations, and some the reverse—the situation is potentially very confusing.5

Symbolic representations of quality of evidence and strength of recommendations are appealing in that they are free of this history. On the other hand, organisations may have good reasons for choosing letters and numbers. Clinicians seem to be very comfortable with numbers and letters, and these are particularly suitable for verbal communication.

GRADE has decided to offer preferred symbolic representations and, for organisations that wish to use numbers and letters, a preferred number/letter representation for quality of evidence and grades of recommendation (fig 2).6


Fig 2 Representations of quality of evidence and strength of recommendations

 



This is the third in a series of five articles that explain the GRADE system for rating the quality of evidence and strength of recommendations

The members of the GRADE Working Group are Phil Alderson, Pablo Alonso-Coello, Jeff Andrews, David Atkins, Hilda Bastian, Hans de Beer, Jan Brozek, Francoise Cluzeau, Jonathan Craig, Ben Djulbegovic, Yngve Falck-Ytter, Beatrice Fervers, Signe Flottorp, Paul Glasziou, Gordon H Guyatt, Robin Harbour, Margaret Haugh, Mark Helfand, Sue Hill, Roman Jaeschke, Katharine Jones, Ilkka Kunnamo, Regina Kunz, Alessandro Liberati, Merce Marzo, James Mason, Jacek Mrukowicz, Andrew D Oxman, Susan Norris, Vivian Robinson, Holger J Schünemann, Tessa Tan Torres, David Tovey, Peter Tugwell, Mariska Tuut, Helena Varonen, Gunn E Vist, Craig Whittington, John Williams, and James Woodcock.

Contributors: All listed authors, and other members of the GRADE working group, contributed to the development of the ideas in the manuscript and read and approved the manuscript. GHG wrote the first draft and collated comments from authors and reviewers for subsequent iterations. All other listed authors contributed ideas about structure and content, provided examples, and reviewed successive drafts of the manuscript and provided feedback. GHG is the guarantor.

Funding: None.

Competing interests: All authors are involved in the dissemination of GRADE, and GRADE’s success has a positive influence on their academic careers. Authors listed on the byline have received travel reimbursement and honoraria for presentations that included a review of GRADE’s approach to rating quality of evidence and grading recommendations. GHG acts as a consultant to UpToDate; his work includes helping UpToDate in their use of GRADE. HJS is documents editor and methodologist for the American Thoracic Society; one of his roles in these positions is helping implement the use of GRADE. HJS is supported by "The human factor, mobility and Marie Curie Actions Scientist Reintegration European Commission Grant: IGR 42192—GRADE." AL is helping the use of GRADE by different institutions in the Italian health service, and in this role he has implemented GRADE to produce clinical recommendations in oncology through Grant No 249 (2005-7), Bando Ricerca Finalizzata, Ministero della Salute, Roma, Italy.

Provenance and peer review: Not commissioned; externally peer reviewed.

References

  1. Fleisher LA, Bass EB, McKeown P. Methodological approach: American College of Chest Physicians guidelines for the prevention and management of postoperative atrial fibrillation after cardiac surgery. Chest 2005;128:17-23S.
  2. O’Connor AM, Stacey D, Entwistle V, Llewellyn-Thomas H, Rovner D, Holmes-Rovner M, et al. Decision aids for people facing health treatment or screening decisions. Cochrane Database Syst Rev 2003;(1):CD001431.
  3. Geerts W, Ray JG, Colwell CW, Bergqvist D, Pineo GF, Lassen MR, et al. Prevention of venous thromboembolism. Chest 2005;128:3775-6.
  4. Devereaux PJ, Anderson DR, Gardner MJ, Putnam W, Flowerdew GJ, Brownell BF, et al. Differences between perspectives of physicians and patients on anticoagulation in patients with atrial fibrillation: observational study. BMJ 2001;323:1218-22.
  5. Bates SM, Greer IA, Pabinger I, Sofaer S, Hirsh J. Venous thromboembolism, thrombophilia, antithrombotic therapy and pregnancy: ACCP evidence-based clinical practice guidelines (eighth edition). Chest (in press).
  6. Schünemann HJ, Best D, Vist G, Oxman AD. Letters, numbers, symbols and words: how to communicate grades of evidence and recommendations. CMAJ 2003;169:677-80.
