Analysis: Rating quality of evidence and strength of recommendations

Going from evidence to recommendations

BMJ 2008; 336 doi: (Published 08 May 2008) Cite this as: BMJ 2008;336:1049

This article has a correction.

  1. Gordon H Guyatt, professor (1),
  2. Andrew D Oxman, researcher (2),
  3. Regina Kunz, associate professor (3),
  4. Yngve Falck-Ytter, assistant professor (4),
  5. Gunn E Vist, researcher (2),
  6. Alessandro Liberati, associate professor (5),
  7. Holger J Schünemann, associate professor (6)
  for the GRADE Working Group

  1. CLARITY Research Group, Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, ON, Canada L8N 3Z5
  2. Norwegian Knowledge Centre for the Health Services, Oslo, Norway
  3. Basel Institute of Clinical Epidemiology, University Hospital Basel, Basel, Switzerland
  4. Division of Gastroenterology, Case Medical Center, Case Western Reserve University, Cleveland, OH 44106, USA
  5. University of Modena and Reggio Emilia and Agenzia Sanitaria Regionale, Bologna, Italy
  6. Department of Epidemiology, CLARITY Research Group, Italian National Cancer Institute Regina Elena, Rome, Italy

  Correspondence to: G H Guyatt guyatt{at}

    The GRADE system classifies recommendations made in guidelines as either strong or weak. This article explores the meaning of these descriptions and their implications for patients, clinicians, and policy makers

    Summary points

    • The strength of a recommendation reflects the extent to which we can be confident that desirable effects of an intervention outweigh undesirable effects

    • GRADE classifies recommendations as strong or weak

    • Strong recommendations mean that most informed patients would choose the recommended management and that clinicians can structure their interactions with patients accordingly

    • Weak recommendations mean that patients’ choices will vary according to their values and preferences, and clinicians must ensure that patients’ care is in keeping with their values and preferences

    • Strength of recommendation is determined by the balance between desirable and undesirable consequences of alternative management strategies, quality of evidence, variability in values and preferences, and resource use

    This is the third of a series of five articles describing the GRADE approach to developing and presenting recommendations for management of patients. In it, we deal with how GRADE suggests clinicians should interpret the strength of a recommendation.

    What do we mean by strength of recommendation?

    The strength of a recommendation reflects the extent to which we can, across the range of patients for whom the recommendations are intended, be confident that the desirable effects of an intervention outweigh the undesirable effects. Alternatively, in considering two or more possible management strategies, a recommendation’s strength represents our confidence that the net benefit clearly favours one alternative or another.

    Desirable effects of an intervention include reduction in morbidity and mortality, improvement in quality of life, reduction in the burden of treatment (such as having to take drugs or the inconvenience of having blood tests or going to the doctor’s office for monitoring), and reduced resource expenditures. Undesirable consequences include adverse effects that may have a deleterious impact on morbidity, mortality, or quality of life or increase use of resources.

    Previous grading systems have sometimes been complex, with up to nine categories of strength of recommendation.1 GRADE has taken a very different approach, with only two categories. Although in this article we will characterise these as strong and weak, guideline panels may choose different words to characterise the two categories of strength. When using GRADE, panels make strong recommendations when they are confident that the desirable effects of adherence to a recommendation outweigh the undesirable effects. When they make a weak recommendation, the panel has concluded that the desirable effects of adherence to a recommendation probably outweigh the undesirable effects, but it is not confident. Another way to state the significance of a weak recommendation is to say that most patients will be better off if clinicians follow the recommendation, but many other patients will not.

    Strong and weak recommendations provide specific guidance

    The great merit of GRADE’s binary classification of strength of recommendations is that it provides clear direction to patients, clinicians, and policy makers. The implications of a strong recommendation are:

    • For patients—most people in your situation would want the recommended course of action and only a small proportion would not; request discussion if the intervention is not offered

    • For clinicians—most patients should receive the recommended course of action

    • For policy makers—the recommendation can be adopted as a policy in most situations.

    The implications of a weak recommendation are:

    • For patients—most people in your situation would want the recommended course of action, but many would not

    • For clinicians—you should recognise that different choices will be appropriate for different patients and that you must help each patient to arrive at a management decision consistent with her or his values and preferences

    • For policy makers—policy making will require substantial debate and involvement of many stakeholders.

    As clinicians are becoming more aware of variability in patients’ values and preferences, and how differences in values and preferences influence management decisions, they are increasingly turning to structured decision aids to facilitate the decision making process.2 A strong recommendation indicates that use of a decision aid is unnecessary and may be an inefficient use of time and energy—almost all informed patients will make the same choice. A weak recommendation indicates that a decision aid, if available, could be useful.

    Managers of healthcare systems worldwide are becoming more interested in taking measures to ensure the quality of care. The United States is the site of the most aggressive approach: the physician voluntary reporting program seems to presage “pay for performance” initiatives. Practices based on high quality evidence in which desirable consequences far exceed undesirable consequences constitute appropriate candidates for quality of care criteria. When evidence is lower quality, or desirable and undesirable consequences are more closely balanced, variable management is reasonable and management practices should be considered discretionary and not candidates for quality assessment. Guidelines should provide guidance to differentiate these two situations. GRADE provides the clearest possible statement on these matters: the management options associated with strong, but not with weak, recommendations are candidates for quality criteria. When a recommendation is weak, discussing with patients and families the relative merits of the alternative management strategies may become a quality criterion.

    Four key factors determine the strength of a recommendation

    Balance between desirable and undesirable effects

    The first key determinant of the strength of a recommendation is the balance between the desirable and undesirable consequences of the alternative management strategies, on the basis of the best estimates of those consequences (table). Consider, for instance, the use of antenatal steroids in women destined to deliver an infant prematurely. High quality evidence shows that administration of steroids to mothers decreases the risk of infant respiratory distress syndrome with minimal side effects, inconvenience, and costs. The advantages of administration of steroids hugely outweigh the disadvantages, indicating the appropriateness of a strong recommendation.

    Table Determinants of strength of recommendation

    When advantages and disadvantages are closely balanced, a weak recommendation becomes appropriate. Consider, for instance, patients with atrial fibrillation at low risk of stroke. Warfarin can reduce that low risk even further but adds inconvenience and an increased risk of bleeding. The right choice under such circumstances is not self evident and is likely to differ between patients.

    As with all other aspects of a grading system, a tension exists between the important goal of simplicity and the danger of oversimplification. We have presented the trade-off between advantages and disadvantages as a dichotomy: clear difference versus a close call. Of course, the reality is a continuum between these extremes. Nevertheless, the forced dichotomisation allows simplification of a process that many people already find complex and may enhance the transparency of decision making.

    Quality of evidence

    The second factor that determines the strength of a recommendation is the quality of the evidence. If we are uncertain of the magnitude of the benefits and harms of an intervention, making a strong recommendation for or against that intervention becomes problematic. Thus, even when an apparent large gradient exists in the balance of advantages and disadvantages, guideline developers will be appropriately reluctant to offer a strong recommendation if the quality of the evidence is low.

    For instance, graduated compression stockings have an apparent large effect in reducing deep venous thrombosis in people making long plane journeys. The randomised trials from which the estimate of effect comes were, however, seriously flawed—randomisation was unconcealed, the techniques for measuring deep venous thrombosis were not reproducible, and the studies were not blinded. Despite the apparent large benefit, and the only major disadvantage being inconvenience, use of stockings warrants only a weak recommendation.3

    Values and preferences

    The third determinant of the strength of recommendation is uncertainty about, or variability in, values and preferences. Given that alternative management strategies will always have advantages and disadvantages, and thus a trade-off will occur, how a guideline panel values benefits, risks, and inconvenience is critical to any recommendation and the strength of the recommendation. One could argue that, given the very limited study the subject has received, large uncertainty always exists about values and preferences. On the other hand, some systematic study of values and preferences has been completed, and clinicians’ experience with patients provides additional insight.

    Consider, for instance, prevention of stroke in patients with atrial fibrillation. Warfarin, relative to no antithrombotic treatment, reduces the risk of stroke—in relative terms—by approximately 65%, but at an appreciable increased risk of severe gastrointestinal bleeding. Devereaux and colleagues asked 63 physicians and 61 patients how many serious gastrointestinal bleeds they would tolerate in 100 patients and still be willing to prescribe or take warfarin to prevent eight strokes (four minor and four major) in 100 patients.4 Figure 1 shows the results. Whereas physicians gave a wide diversity of responses, most patients placed a high value on avoiding a stroke and were ready to accept a bleeding risk of 22% to reduce their chances of having a stroke by 8%. Even among patients, however, diversity in values and preferences was apparent; a few patients were ready to accept only a small risk of bleeding to reduce their stroke risk by 8%. These data suggest that only in patients at high risk of stroke would a strong recommendation for warfarin be warranted.
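
    The arithmetic behind this survey scenario can be made explicit. The following minimal Python sketch uses only the figures quoted above (eight strokes prevented per 100 patients and a tolerated bleeding risk of 22 per 100). The function names and the list of respondent thresholds are hypothetical, invented purely for illustration; they are not data from the Devereaux study.

```python
# Figures taken from the scenario described in the text.
STROKES_PREVENTED_PER_100 = 8
BLEEDS_TOLERATED_PER_100 = 22


def number_needed_to_treat(events_prevented_per_100):
    """Patients who must be treated to prevent one event."""
    return 100 / events_prevented_per_100


def bleeds_tolerated_per_stroke_prevented(max_bleeds_per_100, strokes_prevented_per_100):
    """Serious bleeds a respondent would accept for each stroke prevented."""
    return max_bleeds_per_100 / strokes_prevented_per_100


def share_accepting(thresholds_per_100, expected_bleeds_per_100):
    """Fraction of respondents whose tolerated bleed threshold is at least
    the expected number of bleeds per 100 treated patients."""
    accepting = [t for t in thresholds_per_100 if expected_bleeds_per_100 <= t]
    return len(accepting) / len(thresholds_per_100)


if __name__ == "__main__":
    # Hypothetical thresholds (bleeds per 100), NOT survey data.
    example_thresholds = [22, 22, 5, 30]
    print(number_needed_to_treat(STROKES_PREVENTED_PER_100))
    print(bleeds_tolerated_per_stroke_prevented(
        BLEEDS_TOLERATED_PER_100, STROKES_PREVENTED_PER_100))
    print(share_accepting(example_thresholds, expected_bleeds_per_100=10))
```

    Under these assumptions, a respondent who tolerates 22 bleeds per 100 patients is accepting up to 2.75 serious bleeds for each stroke prevented. The share of respondents who would accept warfarin falls as the expected bleeding risk rises past individual thresholds, which is why variability in thresholds argues for a weak recommendation.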


    Fig 1 Varying thresholds of major gastrointestinal bleeding found acceptable by patients and physicians for prevention of eight strokes in 100 patients

    Contrast this with the decision faced by pregnant women with deep venous thrombosis. Warfarin therapy between the sixth and 12th week of pregnancy puts women’s unborn infants at risk of relatively minor developmental abnormalities. The alternative, heparin, eliminates the risk to the child. This benefit, however, comes with disadvantages of pain (heparin injections), inconvenience, and cost. Nevertheless, clinicians’ experience is that women overwhelmingly place a very high value on preventing fetal complications. As a result, a strong recommendation for substitution of heparin is warranted.

    Given the paucity of empirical examinations of patients’ values and preferences, well resourced guideline panels will usually have to rely on consultation with individual patients and patients’ groups to gain insight into patients’ values. Less well resourced panels must rely on their intuitive impressions of these values. In either case, when a recommendation is particularly dependent on values and preferences, panels must state the values underlying their decision. For instance, the following assumption provided the basis for recommendations in a guideline for antithrombotic treatment in pregnant women: “While we are unaware of any research specifically addressing women’s preferences regarding antithrombotic therapy in pregnancy, anecdotal evidence suggests that many, though not all women, give higher priority to the impact of any treatment on the health of their unborn baby than to effects on themselves.”5


    Costs (resource allocation)

    The final determinant of the strength of a recommendation is cost. One could consider cost as one of the outcomes when weighing up the advantages and disadvantages of competing management strategies. Cost, however, is much more variable over time, geographical areas, and implications than are other outcomes. Drug costs tend to plummet when patents expire, and charges for the same drug differ widely across jurisdictions. In addition, the resource implications vary widely. For instance, a year’s prescription of the same expensive drug may pay for a single nurse’s salary in the United States, six nurses’ salaries in Poland, and 30 nurses’ salaries in China.

    Thus, although higher costs will reduce the likelihood of a strong recommendation in favour of a particular intervention, the context of the recommendation will be critical. In considering resource allocation, guideline panels must thus be very specific about the setting to which a recommendation applies and the perspective that is used—that is, which costs were considered. Furthermore, recommendations that are heavily influenced by costs are likely to change over time as resource implications evolve.
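
    GRADE treats the strength of a recommendation as a panel judgment rather than the output of a formula. Purely as an illustration, the Python sketch below records the four determinants described in this section and lists which of them argue for caution. The class, the field names, and the rule that any single concern yields a weak recommendation are assumptions made for this example and are not part of GRADE.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Determinants:
    """Hypothetical record of a panel's judgments on the four determinants."""
    balance_clearly_favours_option: bool
    evidence_quality_is_low: bool
    values_uncertain_or_variable: bool
    resource_use_is_high: bool


def reasons_for_caution(d: Determinants) -> List[str]:
    """Return the determinants that argue against a strong recommendation."""
    reasons = []
    if not d.balance_clearly_favours_option:
        reasons.append("desirable and undesirable effects closely balanced")
    if d.evidence_quality_is_low:
        reasons.append("low quality evidence")
    if d.values_uncertain_or_variable:
        reasons.append("uncertain or variable values and preferences")
    if d.resource_use_is_high:
        reasons.append("high resource use")
    return reasons


def suggested_strength(d: Determinants) -> str:
    """Illustrative rule only: any reason for caution suggests 'weak'."""
    return "weak" if reasons_for_caution(d) else "strong"


# Examples paraphrased from the text; the values and cost flags for each
# example are assumptions where the article does not state them.
antenatal_steroids = Determinants(True, False, False, False)
compression_stockings = Determinants(True, True, False, False)
warfarin_low_risk_af = Determinants(False, False, True, False)
```

    With these inputs, the antenatal steroids example comes out as strong, whereas compression stockings (low quality evidence) and warfarin in low risk atrial fibrillation (close balance, variable preferences) come out as weak, matching the judgments described above. A real panel would weigh the factors together rather than apply a fixed rule.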

    Strong recommendations may not be important from all perspectives

    If the consequences of the choice are relatively unimportant, some patients may not bother with even strong recommendations. This is particularly likely if they are faced with many new drugs or many suggestions to change their lifestyle.

    Because governments and public health officials must consider several factors beyond the strength of a recommendation, they may consider that some strong recommendations that are important for individual patients have low priority. These factors (generally of little relevance to recommendations directed at clinicians) include the prevalence of the health problem (higher priority for more prevalent conditions), considerations of equity (higher priority for interventions that tackle health inequities by targeting disadvantaged populations), total cost to society (lower priority for interventions with very high total costs), and the potential for improvement in quality of care (higher priority for underused interventions). Thus, if guideline panels are addressing funders or health system managers, they must make transparent the manner in which factors related to prevalence, equity, cost, and improving quality of care influence their recommendations.

    Recommendations to use interventions in research context may be appropriate

    Guideline panels may face decisions about promising interventions associated with appreciable harms or costs and with insufficient evidence of benefit to support their use. They may be reluctant to close the door on such an intervention or inappropriately provide a weak recommendation for its use—a course of action that may lead to wide dissemination and resulting harm. Consider, for instance, the impact of recommendations in favour of hormone replacement therapy to prevent cardiovascular disease in postmenopausal women. When interventions are expensive, an additional problem with premature recommendations in favour of an intervention is the risk of irretrievable allocation of resources that would be better spent elsewhere. Nevertheless, a guideline panel’s fears will be realised if the appropriate strong, or even weak, recommendation against use of the intervention in clinical practice has the effect of stifling further investigation.

    Recommendations for use of an intervention only in the context of research may ameliorate these problems. Furthermore, such a recommendation may provide an important boost to efforts to answer important research questions, thus resolving uncertainty about optimal patient management.

    Recommendations for use of interventions exclusively in the context of research will be appropriate when two conditions are met. Firstly, insufficient evidence must exist for a panel to suggest using or not using an intervention. Secondly, further research must have a large potential for reducing uncertainty about the effects of the intervention and for doing so at a reasonable cost. Guideline panels that do not have the skills and knowledge to set or apply criteria for research priorities should refrain from making recommendations about the use of interventions exclusively in the context of research. Organisations such as the National Institute for Health and Clinical Excellence may be well equipped to make such judgments: of its first 95 technology appraisals, eight led to recommendations for use in the context of research.

    Various presentations of quality of evidence and strength of recommendations may be appropriate

    Most guideline panels have used letters and numbers to summarise their recommendations. Because of highly variable use of numbers and letters—for instance, some organisations have chosen letters for quality of evidence and numbers for strength of recommendations, and some the reverse—the situation is potentially very confusing.5

    Symbolic representations of quality of evidence and strength of recommendations are appealing in that they are free of this history. On the other hand, organisations may have good reasons for choosing letters and numbers. Clinicians seem to be very comfortable with numbers and letters, and these are particularly suitable for verbal communication.

    GRADE has decided to offer preferred symbolic representations and, for organisations that wish to use numbers and letters, a preferred number/letter representation for quality of evidence and grades of recommendation (fig 2).6


    Fig 2 Representations of quality of evidence and strength of recommendations


    • Analysis, doi: 10.1136/bmj.39490.551019.BE
    • Analysis, doi: 10.1136/bmj.39489.470347.AD
    • This is the third in a series of five articles that explain the GRADE system for rating the quality of evidence and strength of recommendations

    • The members of the GRADE Working Group are Phil Alderson, Pablo Alonso-Coello, Jeff Andrews, David Atkins, Hilda Bastian, Hans de Beer, Jan Brozek, Francoise Cluzeau, Jonathan Craig, Ben Djulbegovic, Yngve Falck-Ytter, Beatrice Fervers, Signe Flottorp, Paul Glasziou, Gordon H Guyatt, Robin Harbour, Margaret Haugh, Mark Helfand, Sue Hill, Roman Jaeschke, Katharine Jones, Ilkka Kunnamo, Regina Kunz, Alessandro Liberati, Merce Marzo, James Mason, Jacek Mrukowics, Andrew D Oxman, Susan Norris, Vivian Robinson, Holger J Schünemann, Tessa Tan Torres, David Tovey, Peter Tugwell, Mariska Tuut, Helena Varonen, Gunn E Vist, Craig Wittington, John Williams, and James Woodcock.

    • Contributors: All listed authors, and other members of the GRADE working group, contributed to the development of the ideas in the manuscript and read and approved the manuscript. GHG wrote the first draft and collated comments from authors and reviewers for subsequent iterations. All other listed authors contributed ideas about structure and content, provided examples, and reviewed successive drafts of the manuscript and provided feedback. GHG is the guarantor.

    • Funding: None.

    • Competing interests: All authors are involved in the dissemination of GRADE, and GRADE’s success has a positive influence on their academic careers. Authors listed on the byline have received travel reimbursement and honoraria for presentations that included a review of GRADE’s approach to rating quality of evidence and grading recommendations. GHG acts as a consultant to UpToDate; his work includes helping UpToDate in their use of GRADE. HJS is documents editor and methodologist for the American Thoracic Society; one of his roles in these positions is helping implement the use of GRADE. HJS is supported by “The human factor, mobility and Marie Curie Actions Scientist Reintegration European Commission Grant: IGR 42192—GRADE.” AL is helping the use of GRADE by different institutions in the Italian health service, and in this role he has implemented GRADE to produce clinical recommendations in oncology through Grant No 249 (2005-7), Bando Ricerca Finalizzata, Ministero della Salute, Roma, Italy.

    • Provenance and peer review: Not commissioned; externally peer reviewed.