GRADE: an emerging consensus on rating quality of evidence and strength of recommendations
BMJ 2008; 336 doi: http://dx.doi.org/10.1136/bmj.39489.470347.AD (Published 24 April 2008). Cite this as: BMJ 2008;336:924
- Gordon H Guyatt, professor1,
- Andrew D Oxman, researcher2,
- Gunn E Vist, researcher2,
- Regina Kunz, associate professor3,
- Yngve Falck-Ytter, assistant professor4,
- Pablo Alonso-Coello, researcher5,
- Holger J Schünemann, professor6
- for the GRADE Working Group
- 1Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, ON, Canada L8N 3Z5
- 2Norwegian Knowledge Centre for the Health Services, PO Box 7004, St Olavs Plass, 0130 Oslo, Norway
- 3Basel Institute of Clinical Epidemiology, University Hospital Basel, Hebelstrasse 10, 4031 Basel, Switzerland
- 4Division of Gastroenterology, Case Medical Center, Case Western Reserve University, Cleveland, OH 44106, USA
- 5Iberoamerican Cochrane Center, Servicio de Epidemiología Clínica y Salud Pública (Universidad Autónoma de Barcelona), Hospital de Sant Pau, Barcelona 08041, Spain
- 6Department of Epidemiology, Italian National Cancer Institute Regina Elena, Rome, Italy
- Correspondence to: G H Guyatt, CLARITY Research Group, Department of Clinical Epidemiology and Biostatistics, Room 2C12, 1200 Main Street, West Hamilton, ON, Canada L8N 3Z5
Failure to consider the quality of evidence can lead to misguided recommendations; hormone replacement therapy for post-menopausal women provides an instructive example
High quality evidence that an intervention’s desirable effects are clearly greater than its undesirable effects, or are clearly not, warrants a strong recommendation
Uncertainty about the trade-offs (because of low quality evidence or because the desirable and undesirable effects are closely balanced) warrants a weak recommendation
Guidelines should inform clinicians what the quality of the underlying evidence is and whether recommendations are strong or weak
The Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach provides a system for rating quality of evidence and strength of recommendations that is explicit, comprehensive, transparent, and pragmatic and is increasingly being adopted by organisations worldwide
Guideline developers around the world are inconsistent in how they rate quality of evidence and grade strength of recommendations. As a result, guideline users face challenges in understanding the messages that grading systems try to communicate. Since 2006 the BMJ has requested in its “Instructions to Authors” on bmj.com that authors should preferably use the Grading of Recommendations Assessment, Development and Evaluation (GRADE) system for grading evidence when submitting a clinical guidelines article. What was behind this decision?
In this first in a series of five articles we will explain why many organisations use formal systems to grade evidence and recommendations and why this is important for clinicians; we will focus on the GRADE approach to recommendations. In the next two articles we will examine how the GRADE system categorises quality of evidence and strength of recommendations. The final two articles will focus on recommendations for diagnostic tests and GRADE’s framework for tackling the impact of interventions on use of resources.
GRADE has advantages over previous rating systems (box 1). Other systems share some of these advantages, but none, other than GRADE, combines them all.1
Box 1 Advantages of GRADE over other systems
Developed by a widely representative group of international guideline developers
Clear separation between quality of evidence and strength of recommendations
Explicit evaluation of the importance of outcomes of alternative management strategies
Explicit, comprehensive criteria for downgrading and upgrading quality of evidence ratings
Transparent process of moving from evidence to recommendations
Explicit acknowledgment of values and preferences
Clear, pragmatic interpretation of strong versus weak recommendations for clinicians, patients, and policy makers
Useful for systematic reviews and health technology assessments, as well as guidelines
What is “quality of evidence” and why is it important?
In making healthcare management decisions, patients and clinicians must weigh up the benefits and downsides of alternative strategies. Decision makers will be influenced not only by the best estimates of the expected advantages and disadvantages but also by their confidence in these estimates. The cartoon depicting the weather forecaster’s uncertainty captures the difference between an assessment of the likelihood of an outcome and the confidence in that assessment (figure). The usefulness of an estimate of the magnitude of intervention effects depends on our confidence in that estimate.
Expert clinicians and organisations offering recommendations to the clinical community have often erred as a result of not taking sufficient account of the quality of evidence.2 For a decade, organisations recommended that clinicians encourage postmenopausal women to use hormone replacement therapy.3 Many primary care physicians dutifully applied this advice in their practices.
A belief that such therapy substantially decreased women’s cardiovascular risk drove this recommendation. Had a rigorous system of rating the quality of evidence been applied at the time, it would have shown that because the data came from observational studies with inconsistent results, the evidence for a reduction in cardiovascular risk was of very low quality.4 Recognition of the limitations of the evidence would have tempered the recommendations. Ultimately, randomised controlled trials have shown that hormone replacement therapy fails to reduce cardiovascular risk and may even increase it.5 6
The US Food and Drug Administration licensed the antiarrhythmic agents encainide and flecainide for use in patients on the basis of the drugs’ ability to reduce asymptomatic ventricular arrhythmias associated with sudden death. This decision failed to acknowledge that, because arrhythmia reduction reflected only indirectly on the outcome of sudden death, the evidence for the drugs’ benefit was of low quality. Subsequently, a randomised controlled trial showed that the two drugs increase the risk of sudden death.7 Appropriate attention to the low quality of the evidence would have saved thousands of lives.
Failure to recognise high quality evidence can cause similar problems. For instance, expert recommendations lagged a decade behind the evidence from well conducted randomised controlled trials that thrombolytic therapy achieved a reduction in mortality in myocardial infarction.8
Insufficient attention to quality of evidence risks inappropriate guidelines and recommendations that may lead clinicians to act to the detriment of their patients. Recognising the quality of evidence will help to prevent these errors.
How should guideline developers alert clinicians to quality of evidence?
A formal system that categorises quality of evidence (for example, from high to very low) represents an obvious strategy for conveying quality of evidence to clinicians. Some limitations, however, do exist. Quality of evidence is a continuum; any discrete categorisation involves some degree of arbitrariness. Nevertheless, advantages of simplicity, transparency, and vividness outweigh these limitations.
What is “strength of recommendation” and why is it important?
A recommendation to offer patients a particular treatment may arise from large, rigorous randomised controlled trials that show consistent impressive benefits with few side effects and minimal inconvenience and cost. Such is the case with using a short course of oral steroids in patients with exacerbations of asthma. Clinicians can offer such treatments to almost all their patients with little or no hesitation.
Alternatively, treatment recommendations may arise from observational studies and may involve appreciable harms, burdens, or costs. Deciding whether to use antithrombotic therapy in pregnant women with prosthetic heart valves involves weighing the magnitude of reduction in valve thrombosis against inconvenience, cost, and risk of teratogenesis. Clinicians offering such treatments must help patients to weigh up the desirable and undesirable effects carefully according to their values and preferences.
Guidelines and recommendations must therefore indicate whether (a) the evidence is high quality and the desirable effects clearly outweigh the undesirable effects, or (b) there is a close or uncertain balance. A simple, transparent grading of the recommendation can effectively convey this key information.
There are limitations to formal grading of recommendations. Like the quality of evidence, the balance between desirable and undesirable effects reflects a continuum. Some arbitrariness will therefore be associated with placing particular recommendations in categories such as “strong” and “weak.” Most organisations producing guidelines have decided that the merits of an explicit grade of recommendation outweigh the disadvantages.
What makes a good grading system?
Not all grading systems separate decisions regarding the quality of evidence from strength of recommendations. Those that fail to do so create confusion. High quality evidence doesn’t necessarily imply strong recommendations, and strong recommendations can arise from low quality evidence.
For example, patients who experience a first deep venous thrombosis with no obvious provoking factor must, after the first months of anticoagulation, decide whether to continue taking warfarin long term. High quality randomised controlled trials show that continuing warfarin will decrease the risk of recurrent thrombosis but at the cost of increased risk of bleeding and inconvenience. Because patients with varying values and preferences will make different choices, guideline panels addressing whether patients should continue or terminate warfarin should—despite the high quality evidence—offer a weak recommendation.
Consider the decision to administer aspirin or paracetamol (acetaminophen) to children with chicken pox. Observational studies have observed an association between aspirin administration and Reye’s syndrome.9 Because aspirin and paracetamol are similar in their analgesic and antipyretic effects, the low quality evidence regarding the association between aspirin and Reye’s syndrome does not preclude a strong recommendation for paracetamol.
Systems that classify “expert opinion” as a category of evidence also create confusion. Judgment is necessary for interpretation of all evidence, whether that evidence is high or low quality. Expert reports of their clinical experience should be explicitly labelled as very low quality evidence, along with case reports and other uncontrolled clinical observations.
Grading systems that are simple with respect to judgments both about the quality of the evidence and the strength of recommendations facilitate use by patients, clinicians, and policy makers.1 Detailed and explicit criteria for ratings of quality and grading of strength will make judgments more transparent to those using guidelines and recommendations.
Although many grading systems to some extent meet these criteria,1 a plethora of systems makes their use difficult for frontline clinicians. Understanding a variety of systems is neither an efficient nor a realistic use of clinicians’ time. The GRADE system is used widely: the World Health Organization, the American College of Physicians, the American Thoracic Society, UpToDate (an electronic resource widely used in North America, www.uptodate.com), and the Cochrane Collaboration are among the more than 25 organisations that have adopted GRADE. This widespread adoption of GRADE reflects GRADE’s success as a methodologically rigorous, user friendly grading system.
How does the GRADE system classify quality of evidence?
To achieve transparency and simplicity, the GRADE system classifies the quality of evidence in one of four levels—high, moderate, low, and very low (box 2). Some of the organisations using the GRADE system have chosen to combine the low and very low categories. Evidence based on randomised controlled trials begins as high quality evidence, but our confidence in the evidence may be decreased for several reasons, including:
Inconsistency of results
Indirectness of evidence
Although observational studies (for example, cohort and case-control studies) start with a “low quality” rating, grading upwards may be warranted if the magnitude of the treatment effect is very large (such as hip replacement for severe hip osteoarthritis), if there is evidence of a dose-response relation, or if all plausible biases would decrease the magnitude of an apparent treatment effect.
Box 2 Quality of evidence and definitions
High quality— Further research is very unlikely to change our confidence in the estimate of effect
Moderate quality— Further research is likely to have an important impact on our confidence in the estimate of effect and may change the estimate
Low quality— Further research is very likely to have an important impact on our confidence in the estimate of effect and is likely to change the estimate
Very low quality— Any estimate of effect is very uncertain
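The rating logic described above can be sketched in a few lines of Python. This is only an illustration and not an official GRADE tool: the names `Quality` and `rate_quality` are hypothetical, and the sketch assumes that each downgrade or upgrade moves the rating by exactly one level and that the result is capped at the four levels in box 2. In practice these are judgments made by a guideline panel.

```python
from enum import IntEnum


class Quality(IntEnum):
    """The four quality levels listed in box 2."""
    VERY_LOW = 1
    LOW = 2
    MODERATE = 3
    HIGH = 4


def rate_quality(study_design: str, downgrades: int = 0, upgrades: int = 0) -> Quality:
    """Return an illustrative quality level (hypothetical helper).

    study_design: "rct" (starts high) or "observational" (starts low).
    downgrades:   number of levels removed, e.g. for inconsistency or indirectness.
    upgrades:     number of levels added, e.g. for a very large effect or a
                  dose-response relation.
    Assumption: each concern or strength shifts the rating by one level.
    """
    if downgrades < 0 or upgrades < 0:
        raise ValueError("downgrades and upgrades must be non-negative")
    if study_design == "rct":
        start = Quality.HIGH
    elif study_design == "observational":
        start = Quality.LOW
    else:
        raise ValueError(f"unknown study design: {study_design!r}")

    level = int(start) - downgrades + upgrades
    # Assumption: keep the result inside the four defined levels.
    level = max(int(Quality.VERY_LOW), min(int(Quality.HIGH), level))
    return Quality(level)


# Observational studies with inconsistent results, as in the hormone
# replacement therapy example: low, downgraded once.
hrt = rate_quality("observational", downgrades=1)
print(hrt.name)
```

Under these assumptions the hormone replacement therapy example comes out as very low quality, which matches the assessment given earlier in the article.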
How does the GRADE system consider strength of recommendation?
The GRADE system offers two grades of recommendations: “strong” and “weak” (though guideline panels may prefer terms such as “conditional” or “discretionary” instead of weak). When the desirable effects of an intervention clearly outweigh the undesirable effects, or clearly do not, guideline panels offer strong recommendations. On the other hand, when the trade-offs are less certain (either because of low quality evidence or because evidence suggests that desirable and undesirable effects are closely balanced), weak recommendations become mandatory.
In addition to the quality of the evidence, several other factors affect whether recommendations are strong or weak (table 1).
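The two-grade rule can be written as a small Python sketch. The names `Balance` and `recommendation_strength` are hypothetical, and the sketch assumes the panel has already judged the balance of desirable and undesirable effects; quality of evidence enters only through that judgment, which is why it is not a separate input here.

```python
from enum import Enum


class Balance(Enum):
    """Panel's judgment of desirable versus undesirable effects (hypothetical labels)."""
    CLEARLY_FAVOURS = "desirable effects clearly outweigh undesirable effects"
    CLEARLY_AGAINST = "desirable effects clearly do not outweigh undesirable effects"
    CLOSE_OR_UNCERTAIN = "trade-offs closely balanced or uncertain"


def recommendation_strength(balance: Balance) -> str:
    """Map the panel's judgment to "strong" or "weak" (illustrative only)."""
    if balance in (Balance.CLEARLY_FAVOURS, Balance.CLEARLY_AGAINST):
        return "strong"
    return "weak"


# Long term warfarin after a first unprovoked deep venous thrombosis:
# high quality evidence, but the choice depends on patients' values.
warfarin = recommendation_strength(Balance.CLOSE_OR_UNCERTAIN)

# Paracetamol rather than aspirin for children with chicken pox:
# low quality evidence, but the balance is clear.
paracetamol = recommendation_strength(Balance.CLEARLY_FAVOURS)

print(warfarin, paracetamol)
```

The two examples from earlier in the article show why quality and strength are kept separate: high quality evidence yields a weak recommendation for warfarin, while low quality evidence still supports a strong recommendation for paracetamol.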
This is the first in a series of five articles that explain the GRADE system for rating the quality of evidence and strength of recommendations.
The Iberoamerican Cochrane Center is part of the Spanish public health network CIBER de Epidemiología y Salud Pública.
The members of the GRADE Working Group are Phil Alderson, Pablo Alonso-Coello, Jeff Andrews, David Atkins, Hilda Bastian, Hans de Beer, Jan Brozek, Francoise Cluzeau, Jonathan Craig, Ben Djulbegovic, Yngve Falck-Ytter, Beatrice Fervers, Signe Flottorp, Paul Glasziou, Gordon Guyatt, Robin Harbour, Margaret Haugh, Mark Helfand, Sue Hill, Roman Jaeschke, Kathatrine Jones, Ilkka Kunnamo, Regina Kunz, Alessandro Liberati, Merce Marzo, James Mason, Jacek Mrukovics, Susan Norris, Andrew Oxman, Vivian Robinson, Holger Schünemann, Tessa Tan Torres, David Tovey, Peter Tugwell, Mariska Tuut, Helena Varonen, Gunn Vist, Craig Wittington, John Williams, and James Woodcock.
Contributors: All members of the GRADE working group contributed to the development of the ideas in the manuscript, and read and approved the manuscript. GHG wrote the first draft and collated comments from authors and reviewers for subsequent iterations. ADO, GEV, RK, YF-Y, PA-C, and HJS contributed ideas about structure and content, provided examples, reviewed successive drafts of the manuscript, and provided feedback. GHG is the guarantor.
Funding: No specific funding.
Competing interests: All authors are involved in the dissemination of GRADE, and GRADE’s success has a positive influence on their academic career. Authors listed in the byline have received travel reimbursement and honorariums for presentations that included a review of GRADE’s approach to rating quality of evidence and grading recommendations. GHG acts as a consultant to UpToDate; his work includes helping UpToDate in their use of GRADE. HJS is documents editor and methodologist for the American Thoracic Society; one of his roles in these positions is helping implement the use of GRADE. He is supported by “The human factor, mobility and Marie Curie actions scientist reintegration European Commission grant: IGR 42192—GRADE.”
Provenance and peer review: Not commissioned; externally peer reviewed.