
Analysis: Rating quality of evidence and strength of recommendations

Grading quality of evidence and strength of recommendations for diagnostic tests and strategies

BMJ 2008; 336 doi: https://doi.org/10.1136/bmj.39500.677199.AE (Published 15 May 2008) Cite this as: BMJ 2008;336:1106
  1. Holger J Schünemann, professor12,
  2. Andrew D Oxman, researcher3,
  3. Jan Brozek, research fellow1,
  4. Paul Glasziou, professor4,
  5. Roman Jaeschke, clinical professor5,
  6. Gunn E Vist, researcher3,
  7. John W Williams Jr, professor6,
  8. Regina Kunz, associate professor7,
  9. Jonathan Craig, associate professor8,
  10. Victor M Montori, associate professor9,
  11. Patrick Bossuyt, professor10,
  12. Gordon H Guyatt, professor2
  13. for the GRADE Working Group
  1. Department of Epidemiology, Italian National Cancer Institute Regina Elena, 00144 Rome, Italy
  2. CLARITY Research Group, Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada L8N 3Z5
  3. Norwegian Knowledge Centre for the Health Services, PO Box 7004, 0130 Oslo, Norway
  4. Centre for Evidence-Based Medicine, Department of Primary Health Care, University of Oxford, Oxford OX3 7LF
  5. Department of Medicine, McMaster University, 1200 Main Street West, Hamilton, Ontario, Canada L8N 3Z5
  6. Department of Medicine, Duke University and Durham VA Medical Center, Durham, NC 27705, USA
  7. Basel Institute of Clinical Epidemiology, University Hospital Basel, Hebelstrasse 10, 4031 Basel, Switzerland
  8. Screening and Test Evaluation Program, School of Public Health, University of Sydney, Department of Nephrology, Children’s Hospital at Westmead, Sydney, Australia
  9. Knowledge and Encounter Research Unit, Department of Medicine, Mayo Clinic College of Medicine, Rochester, MN 55905, USA
  10. Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Centre, University of Amsterdam, Amsterdam 1100 DE, Netherlands

  Correspondence to: H J Schünemann schuneh@mcmaster.ca

The GRADE system can be used to grade the quality of evidence and strength of recommendations for diagnostic tests or strategies. This article explains how patient-important outcomes are taken into account in this process.

Summary points

  • As for other interventions, the GRADE approach to grading the quality of evidence and strength of recommendations for diagnostic tests or strategies provides a comprehensive and transparent approach for developing recommendations

  • Cross sectional or cohort studies can provide high quality evidence of test accuracy

  • However, test accuracy is a surrogate for patient-important outcomes, so such studies often provide low quality evidence for recommendations about diagnostic tests, even when the studies do not have serious limitations

  • Inferring from data on accuracy that a diagnostic test or strategy improves patient-important outcomes will require the availability of effective treatment, reduction of test related adverse effects or anxiety, or improvement of patients’ wellbeing from prognostic information

  • Judgments are thus needed to assess the directness of test results in relation to consequences of diagnostic recommendations that are important to patients

In this fourth article of the five part series, we describe how guideline developers are using GRADE to rate the quality of evidence and move from evidence to a recommendation for diagnostic tests and strategies. Although recommendations on diagnostic testing share the fundamental logic of recommendations on treatment, they present unique challenges. We will describe why guideline panels should be cautious when they use evidence of the accuracy of tests (“test accuracy”) as the basis for recommendations and why evidence of test accuracy often provides low quality evidence for making recommendations.

Testing makes a variety of contributions to patient care

Clinicians use tests that are usually referred to as “diagnostic”—including signs and symptoms, imaging, biochemistry, pathology, and psychological testing—for various purposes.1 These purposes include identifying physiological derangements, establishing prognosis, monitoring illness and response to treatment, and diagnosis. This article focuses on diagnosis: the use of tests to establish the presence or absence of a disease (such as tuberculosis), target condition (such as iron deficiency), or syndrome (such as Cushing’s syndrome).

Whereas some tests naturally report positive and negative results (for example, pregnancy tests), other tests report their results as a categorical variable (for example, imaging) or a continuous variable (for example, metabolic measures), with the likelihood of disease increasing as the test result becomes more extreme. For simplicity, in this discussion we assume a diagnostic approach that ultimately categorises test results as positive or negative.

Guideline panels considering a diagnostic test should begin by clarifying its purpose. The purpose of a test under consideration may be for triage (to minimise use of an invasive or expensive test), replacement (of tests with greater burden, invasiveness, or cost), or add-on (to enhance diagnosis beyond existing tests).2 The panel should identify the limitations for which alternative tests offer a putative remedy; for example, eliminating a high proportion of false positive or false negative results, enhancing availability, decreasing invasiveness, or decreasing cost. This process will lead to identification of sensible clinical questions that, as with other management problems, have four components: patients, diagnostic intervention (strategy), comparison, and outcomes of interest.3 4 The box shows an example of a question for a replacement test.

Example question for replacement test

In patients in whom coronary artery disease is suspected, does multislice spiral computed tomography of coronary arteries as a replacement for conventional invasive coronary angiography reduce complications with acceptable rates of false negatives associated with coronary events and false positives leading to unnecessary treatment and complications?5 6

Clinicians often use diagnostic tests as a package or strategy. For example, in managing patients with apparently operable lung cancer on computed tomography, clinicians may proceed directly to thoracotomy or may apply a strategy of imaging the brain, bone, liver, and adrenal glands, with subsequent management depending on the results. Furthermore, a testing sequence may use an initial sensitive but non-specific test, which, if positive, is followed by a more specific test (for example, faecal occult blood followed by colonoscopy). Thus, one can often think of evaluating or recommending not a single test but a diagnostic strategy.
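
To make the arithmetic of such a sequence concrete, the sketch below (ours, not part of the original article) shows how the expected accuracy of a two-step strategy follows from the accuracy of its component tests, under the simplifying and often optimistic assumption that the two tests err independently of each other; the function name and all accuracy values are illustrative.

```python
# Illustrative sketch: accuracy of a serial testing strategy in which a
# sensitive initial test is followed, if positive, by a more specific
# confirmatory test. Assumes the tests err independently (often optimistic).

def serial_strategy(sens1, spec1, sens2, spec2):
    """Sensitivity and specificity of 'first test positive AND second test positive'."""
    sensitivity = sens1 * sens2                     # a case must be detected by both tests
    specificity = 1 - (1 - spec1) * (1 - spec2)     # a false positive must slip past both tests
    return sensitivity, specificity

# Hypothetical values: a sensitive screening test followed by a specific confirmatory test
sens, spec = serial_strategy(sens1=0.95, spec1=0.80, sens2=0.90, spec2=0.98)
print(f"strategy sensitivity {sens:.2f}, specificity {spec:.3f}")
# Sensitivity falls slightly (0.86) while specificity rises markedly (0.996),
# which is the usual trade-off of a "sensitive then specific" sequence.
```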

Test accuracy is a surrogate for patient-important outcomes

The main contribution of this article is that it presents a framework for thinking about diagnostic tests in terms of their impact on outcomes important to patients (“patient-important outcomes”). Usually, when clinicians think about diagnostic tests they focus on test accuracy (that is, how well the test classifies patients correctly as having or not having a disease). The underlying assumption is, however, that obtaining a better idea of whether a target condition is present or absent will result in superior management of patients and improved outcome. In the example of imaging for metastatic disease in patients presenting with apparently operable lung cancer, the presumption is that positive additional tests will spare patients the morbidity and early mortality associated with futile thoracotomy.

The example of computed tomography for coronary artery disease described in the box illustrates another common rationale for a new test: replacement of another test (coronary computed tomography instead of conventional angiography) to avoid complications associated with a more invasive and expensive alternative.2 In this paradigmatic situation, the new test would need only to replicate the sensitivity and specificity (accuracy) of the reference standard to show superiority.

However, if a test fails to improve important outcomes, no reason exists for its use, whatever its accuracy. Thus, the best way to assess a diagnostic strategy is a controlled trial in which investigators randomise patients to experimental or control diagnostic approaches and measure mortality, morbidity, symptoms, and quality of life.7 Figure 1 illustrates two generic study structures that investigators can use to evaluate the impact of testing.


Fig 1 Two generic ways in which a test or diagnostic strategy can be evaluated. On the left, patients are randomised to a new test or strategy or to an old test or strategy. Those with a positive test result (cases detected) are randomised (or were previously randomised) to receive the best available management (second step of randomisation for management not shown). Investigators evaluate and compare patient-important outcomes in all patients in both groups.2 On the right, patients receive both a new test and a reference test (old or comparator test or strategy). Investigators can then calculate the accuracy of the test compared with the reference test (first step). To make judgments about importance to patients of this information, patients with a positive test (or strategy) in either group are (or have been in previous studies) submitted to treatment or no treatment; investigators then evaluate and compare patient-important outcomes in all patients in both groups (second step)

When diagnostic intervention studies—ideally randomised controlled trials but also observational studies—comparing alternative diagnostic strategies with assessment of direct patient-important outcomes are available (fig 1, left), guideline panels can use the GRADE approach described for other interventions in previous articles in this series.12 13 If studies measuring the impact of testing on patient-important outcomes are not available, guideline panels must focus on studies of test accuracy and make inferences about the likely impact on patient-important outcomes (fig 1, right).14 In the second situation, diagnostic accuracy is a surrogate outcome for benefits and harms to patients.1

The key questions are whether the number of false negatives (cases missed) or false positives will be reduced, with corresponding increases in true positives and true negatives; how accurately similar or different patients are classified by the alternative testing strategies; and what outcomes occur in both patients labelled as cases and those labelled as not having disease. Table 1 presents examples that illustrate these questions. We discuss these questions in subsequent sections of this article, all of which will focus on using studies of diagnostic accuracy to develop recommendations.

Table 1

 Examples and implications of different testing scenarios focusing on accuracy


Using indirect evidence to make inferences about impact on patient-important outcomes

A recommendation associated with a diagnostic question depends on the balance between the desirable and undesirable consequences of the diagnostic test or strategy and should be based on a systematic review that focuses on the clinical question. We will use a simplified approach that classifies test results into those yielding true positives (patients correctly classified above the treatment threshold—table 1 and fig 2), false positives (patients incorrectly classified above the treatment threshold), true negatives (patients correctly classified below the test threshold), and false negatives (patients incorrectly classified below the test threshold).
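
As a rough illustration of this classification (ours, with hypothetical numbers), the expected numbers of true and false positives and negatives per 1000 people tested follow directly from the pre-test probability and the test's sensitivity and specificity:

```python
# Illustrative sketch: expected classification of 1000 tested patients given a
# pre-test probability (prevalence) and a test's sensitivity and specificity.
# All values are hypothetical, not taken from the article.

def classify_per_1000(prevalence, sensitivity, specificity, n=1000):
    diseased = n * prevalence
    healthy = n - diseased
    return {
        "true positives":  sensitivity * diseased,        # correctly classified above the treatment threshold
        "false negatives": (1 - sensitivity) * diseased,  # incorrectly classified below the test threshold
        "true negatives":  specificity * healthy,         # correctly classified below the test threshold
        "false positives": (1 - specificity) * healthy,   # incorrectly classified above the treatment threshold
    }

# Hypothetical test: sensitivity 90%, specificity 80%, pre-test probability 20%
for label, count in classify_per_1000(prevalence=0.20, sensitivity=0.90, specificity=0.80).items():
    print(f"{label}: {count:.0f} per 1000 tested")
```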


Fig 2 Test and treatment thresholds. What clinicians expect of a good test is that its results change the probability of disease sufficiently to confirm or exclude a diagnosis. A test, however, only alters the probability that the condition of interest is present. If a test result moves the probability of the condition below the test threshold, the condition is very unlikely, the downsides associated with any further testing and treatment for the condition outweigh any anticipated benefit, and no further testing or treatment for that condition should follow. If the test result raises the probability of the condition above the treatment threshold, the condition is very likely, confirmatory testing that raises the probability further is unnecessary, and the anticipated benefits of treatment outweigh the potential harms. If the pre-test probability is already above the treatment threshold, further confirmatory testing would not be helpful; if it is already below the test threshold, further exclusionary testing would not be useful. When the probability lies between the test and treatment thresholds, testing will be useful. Test results are of greatest value when they shift the probability across either threshold
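
The threshold logic in fig 2 can also be expressed numerically: a test result multiplies the pre-test odds of disease by the likelihood ratio of that result, and the resulting post-test probability is compared with the test and treatment thresholds. The sketch below is our illustration; the thresholds, pre-test probability, and likelihood ratios are all hypothetical.

```python
# Illustrative sketch of the test/treatment threshold logic shown in fig 2.
# Post-test probability is derived from pre-test odds and the likelihood ratio
# of the observed result (Bayes' theorem in odds form). All numbers are hypothetical.

def post_test_probability(pre_test_prob, likelihood_ratio):
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

def decision(prob, test_threshold=0.05, treatment_threshold=0.80):
    if prob < test_threshold:
        return "below test threshold: no further testing or treatment"
    if prob > treatment_threshold:
        return "above treatment threshold: treat without further confirmatory testing"
    return "between thresholds: further testing is useful"

pre_test = 0.30                                        # hypothetical pre-test probability
for label, lr in [("positive result", 12.0), ("negative result", 0.1)]:   # hypothetical likelihood ratios
    p = post_test_probability(pre_test, lr)
    print(f"{label}: post-test probability {p:.2f} -> {decision(p)}")
```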

However, inferring from data on accuracy that a diagnostic test or strategy improves patient-important outcomes requires the availability of effective treatment.1 Alternatively, even without an effective treatment, an accurate test may be beneficial if it reduces test related adverse effects or reduces anxiety by excluding an ominous diagnosis, or if confirming a diagnosis improves patients’ wellbeing through the prognostic information it imparts.

For instance, the results of genetic testing for Huntington’s chorea, an untreatable condition, may provide either welcome reassurance that a patient will not have the condition or the ability to plan for his or her future knowing that he or she will develop the condition. In this case, the ability to plan is analogous to an effective treatment and the benefits of planning need to be balanced against the downsides of receiving an early diagnosis.15 16 17 We will now describe the factors that influence judgments about the balance of desirable and undesirable effects, focusing on the quality of evidence.

Judgments about quality of underlying evidence

Study design

GRADE’s four categories of quality of evidence imply a gradient of confidence in estimates of the effect of a diagnostic test strategy on patient-important outcomes.13 High quality evidence comes from randomised controlled trials directly comparing the impact of alternative diagnostic strategies on patient-important outcomes (for example, trials of B type natriuretic peptide for heart failure as described in figure 1) without limitations in study design and conduct, imprecision (that is, powered to detect differences in patient-important outcomes), inconsistency, indirectness, and reporting bias.13 18 20

Although valid studies of accuracy also start as high quality in the diagnostic framework, such studies are vulnerable to limitations and often provide low quality evidence for recommendations, particularly as a result of the indirect evidence they usually offer on the impact of subsequent management on patient-important outcomes. Table 2 describes how GRADE deals with the particular challenges of judging the quality of evidence on desirable and undesirable consequences of alternative diagnostic strategies.

Table 2

 Factors that decrease quality of evidence for studies of diagnostic accuracy and how they differ from evidence for other interventions


Study limitations (risk of bias)

Valid studies of diagnostic test accuracy include representative and consecutive patients in whom legitimate diagnostic uncertainty exists—that is, the sort of patients to whom clinicians would apply the test in the course of regular clinical practice. If studies fail this criterion—and, for example, enrol severe cases and healthy controls—the apparent accuracy of a test is likely to be misleadingly high.21 22

Valid studies of diagnostic tests involve a comparison between the test or tests under consideration and an appropriate reference (sometimes called “gold”) standard. Investigators’ failure to make such a comparison in all patients increases the risk of bias. The risk of bias is further increased if the people who carry out or interpret the test are aware of the results of the reference or gold standard test or vice versa. Guideline panels can use existing instruments to assess the risk of bias in studies evaluating accuracy of diagnostic tests, and the results may lead to downgrading of the quality of evidence if serious limitations exist.23 24 25

Directness

We described considerations about directness for other interventions in a previous article.13 Judging directness presents additional, perhaps greater, challenges for guideline panels making recommendations about diagnostic tests. If a new test reduces false positives and false negatives, to what extent will that reduction lead to improvement in patient-important outcomes? Alternatively, a new test may be simpler to do, with lower risk and cost, but may produce false positives and false negatives. Consider the consequences of replacing invasive angiography with coronary computed tomography scanning for the diagnosis of coronary artery disease (tables 3 and 4). True positive results will lead to the administration of treatments of known effectiveness (drugs, angioplasty and stents, bypass surgery); true negative results will spare patients the possible adverse effects of the reference standard test; false positive results will result in adverse effects (unnecessary drugs and interventions, including the possibility of follow-up angioplasty) without apparent benefit; and false negatives will result in patients not receiving the benefits of available interventions that help to reduce the subsequent risk of coronary events.

Table 3

 Key findings of diagnostic accuracy studies—should multislice spiral computed tomography rather than conventional coronary angiography* be used to diagnose coronary artery disease in a population with a low (20%) pre-test probability?6

Table 4

 Consequences of key findings of diagnostic accuracy studies—should multislice spiral computed tomography rather than conventional coronary angiography* be used to diagnose coronary artery disease in a population with a low (20%) pre-test probability?6


Thus, inferences that minimising false positives and false negatives will benefit patients, and that increasing them will have a negative impact on patient-important outcomes, are relatively strong. As for outcomes in other intervention studies, the importance of these consequences to patients varies and should be considered by guideline panels when balancing desirable and undesirable consequences; for example, patients will place a greater value on preventing myocardial infarctions than on preventing a mild episode of angina. The impact of inconclusive test results is less clear; they are clearly undesirable, however, in that they are likely to induce anxiety and may lead to unnecessary interventions, further testing, or delays in the application of effective treatment. The complications of invasive angiography—infarction and death—although rare, are important.

Because our knowledge of the consequences of the rates of false positives, false negatives, inconclusive results, and complications with the alternative diagnostic strategies is fairly secure, and those outcomes are important, we can make strong inferences about the relative impact of computed tomography scanning and conventional angiography on patient-important outcomes. In this example, with a relatively low probability of coronary artery disease, computed tomography scanning results in a large number of false positives, leading to unnecessary anxiety and further testing, including coronary angiography, after time and resources have been spent on computed tomography scanning (table 4). It also leads to about 1% of patients who have coronary artery disease being missed (false negatives).

Uncertainty about the consequences of the false positive and false negative results will weaken inferences about the balance between desirable and undesirable consequences. Consider the consequences of false positive and false negative results of diagnostic imaging for patients in whom acute sinusitis is suspected. Because the primary benefit of treatment is shortening of the duration of illness and symptoms, the balance of the consequences important to patients is less clear: patients with false negative results are deprived of antibiotics and will have a longer duration of symptoms and an increased risk of complications from the infection, but experience no side effects from antibiotics; patients with false positive results receive antibiotics they should not have received, but may feel relieved that they have received care and treatment. Furthermore, guideline panels will have to consider the societal consequences (such as antibiotic resistance) of administering antibiotics to false positive cases.3

Consider once again the use of B type natriuretic peptide for heart failure (fig 1). The test may be accurate, but if clinicians are already making the diagnosis with near perfect accuracy, and instituting appropriate treatment, the test will not improve outcomes for patients. Even if clinicians are initially inaccurate but correct their mistakes as the clinical picture unfolds (for example, by withdrawing initial unnecessary diuretic treatment or subsequently recognising the need for diuretic treatment), patients’ outcomes may be unaffected. The link between test result and outcome here is sufficiently weak that, other considerations aside, the diagnostic accuracy information alone would provide only low quality evidence. In this case, however, two randomised controlled trials showed that (at least in their setting) B type natriuretic peptide testing reduced admissions to hospital and length of stay in hospital without apparent adverse consequences.

Guideline panels considering questions of diagnosis also face the same sort of challenges regarding indirectness as do panels making recommendations for other interventions. Test accuracy may vary across populations of patients: panels therefore need to consider how well the populations included in studies of diagnosis correspond to the populations that are the subject of the recommendations. Similarly, panels need to consider how comparable new tests and reference tests are to the tests used in the settings for which the recommendations are made. Finally, when evaluating two or more alternative new tests or strategies, panels need to consider whether these diagnostic strategies were compared directly (in one study) or indirectly (in separate studies) with a common (reference) standard.26 27 28

Arriving at a bottom line for study quality

Table 5 shows the evidence profile and the quality assessment for all critical outcomes of computed tomography angiography in comparison with invasive angiography. The original accuracy studies were well planned and executed, the results are precise, and we do not suspect important publication bias. Little or no uncertainty exists about the directness of the evidence (for test results) for patient-important outcomes for true positives, false positives, and true negatives (tables 1 and 5). However, some uncertainty about the extent to which limitations in test accuracy will have deleterious consequences on patient-important outcomes for false negatives led to downgrading of the quality of evidence from high to moderate (that is, we believe the evidence is indirect for false negatives because we are uncertain whether delayed diagnosis of coronary artery disease leads to worse outcomes).

Table 5

 Quality assessment of diagnostic accuracy studies—example: should multislice spiral computed tomography be used instead of conventional coronary angiography for diagnosis of coronary artery disease?*


Problems with inconsistency also exist. Reviewers considering the relative merits of computed tomography and invasive angiography for diagnosis of coronary disease found important heterogeneity in the results for the proportion of angiography negative patients with a positive computed tomography test result (specificity) and in the results for the proportion of angiography positive patients with a negative computed tomography test result (sensitivity) that they could not explain (fig 3). This heterogeneity was also present for other measures of diagnostic tests (that is, positive and negative likelihood ratios and diagnostic odds ratios). Unexplained heterogeneity in the results across studies further reduced the quality of evidence for all outcomes. Major uncertainty about the impact of false negative tests on patient-important outcomes would have led to downgrading of the quality of evidence from high to low for the other examples in table 1.


Fig 3 Example of heterogeneity in diagnostic test results. Sensitivity and specificity of multislice coronary computed tomography compared with coronary angiogram (from Hamon et al4). This heterogeneity also existed for likelihood ratios and diagnostic odds ratios
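
For readers less familiar with these summary measures, the likelihood ratios and diagnostic odds ratio referred to above are simple functions of sensitivity and specificity; the sketch below is ours and uses hypothetical values rather than results from the computed tomography studies.

```python
# Illustrative definitions of the summary accuracy measures mentioned above.
# The sensitivity and specificity used here are hypothetical.

def likelihood_ratios(sensitivity, specificity):
    lr_positive = sensitivity / (1 - specificity)   # how much a positive result raises the odds of disease
    lr_negative = (1 - sensitivity) / specificity   # how much a negative result lowers the odds of disease
    return lr_positive, lr_negative

def diagnostic_odds_ratio(sensitivity, specificity):
    lr_positive, lr_negative = likelihood_ratios(sensitivity, specificity)
    return lr_positive / lr_negative                # equivalently (TP x TN) / (FP x FN)

lr_pos, lr_neg = likelihood_ratios(sensitivity=0.95, specificity=0.85)
print(f"LR+ {lr_pos:.1f}, LR- {lr_neg:.2f}, DOR {diagnostic_odds_ratio(0.95, 0.85):.0f}")
```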

Arriving at a recommendation

The balance of presumed patient-important outcomes resulting from true and false positives and negatives, together with test complications, determines whether a guideline panel makes a recommendation for or against applying a test.12 Other factors influencing the strength of a recommendation include the quality of the evidence, the uncertainty about values and preferences associated with the tests and presumed patient-important outcomes, and cost.

Coronary computed tomography scanning avoids the adverse consequences of invasive angiography, which can include myocardial infarction and death. These consequences are, however, very rare. As a result, a guideline panel evaluating multislice spiral computed tomography as a replacement test for coronary angiography could, despite its lower cost, make a weak recommendation against its use in place of invasive coronary angiography. This recommendation follows from the large number of false positives and the risk of missing patients with the disease who could be treated effectively (false negatives). It also follows from the evidence for the new test being only low quality and from the consideration of values and preferences. Despite the general preference for less invasive tests with lower risks of complications, most patients would probably favour the more invasive approach (angiography). This reasoning follows from the assumption that patients would place a higher value on reassurance about the absence or presence of coronary disease, and on instituting risk reducing strategies, than on avoiding complications of angiography. Guideline panels considering the use of coronary computed tomography compared with no direct imaging of the coronary arteries (for instance, in settings with inadequate access to angiography where computed tomography is not a replacement for angiography but a triage tool) may find the evidence of higher quality and make a strong recommendation for its use in identifying patients who could be referred for angiography and further treatment.

An alternative way to conceptualise the formulation of strong and weak recommendations relates to figure 2. Test strategies that result in patients moving below the test threshold or above the treatment threshold (given that effective treatment exists) will often lead to strong recommendations.

In addition, users of recommendations on diagnostic tests should check whether the pre-test probability range is applicable. The likelihood of the disease (prevalence or pre-test probability) in the patient in front of them will often influence the probability of a true positive or false positive test result in that patient. Recommendations for populations with different baseline risks or likelihood of disease may therefore be appropriate. In particular, recommendations for screening (low risk populations) will almost always differ from recommendations for using a test for diagnosis in populations of patients in whom a disease is suspected.
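
One way to see why the pre-test probability matters (our illustration, with hypothetical accuracy figures) is that, for a test with fixed sensitivity and specificity, the probability that a positive result is a true positive falls sharply as prevalence falls, which is part of why recommendations for screening populations usually differ from those for patients with suspected disease.

```python
# Illustrative sketch: how the pre-test probability (prevalence) changes what a
# positive result means for a test with fixed accuracy. The sensitivity and
# specificity below are hypothetical.

def positive_predictive_value(prevalence, sensitivity, specificity):
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

sens, spec = 0.90, 0.90                    # hypothetical test accuracy
for prevalence in (0.01, 0.20, 0.50):      # screening population vs suspected disease
    ppv = positive_predictive_value(prevalence, sens, spec)
    print(f"pre-test probability {prevalence:.0%}: chance a positive result is a true positive {ppv:.0%}")
```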

Finally, individual clinicians together with their patients will be establishing treatment and test thresholds on the basis of the individual patient’s values and preferences. For example, a patient averse to the risks of coronary angiography might choose computed tomography imaging over angiograms, whereas most patients who are averse to the risk of false positives and negatives, place a high value on reassurance and knowledge about coronary disease, and are willing to accept the risk of an angiogram will choose an angiogram instead of computed tomography. As for other recommendations, the exploration and integration of patients’ values and preferences are critical to the development and implementation of recommendations for diagnostic tests.

Conclusion

The GRADE approach to grading the quality of evidence and strength of recommendations for diagnostic tests provides a comprehensive and transparent approach for developing these recommendations. We have presented an overview of the approach, based on the recognition that test results are surrogate markers for benefits to patients. The application of the approach requires a shift in clinicians’ thinking to clearly recognise that, whatever their accuracy, diagnostic tests are of value only if they result in improved outcomes for patients.

Footnotes

  • Analysis, doi: 10.1136/bmj.39490.551019.BE
  • Analysis, doi: 10.1136/bmj.39489.470347.AD
  • Analysis, doi: 10.1136/bmj.39493.646875.AE
  • This is the fourth in a series of five articles that explain the GRADE system for grading the quality of evidence and strength of recommendations

  • We thank the many people and organisations that have contributed to the progress of the GRADE approach through funding of meetings and feedback on the work described in this article.

  • The members of the GRADE Working Group who participated in this work were Phil Alderson, Pablo Alonso-Coello, Jeff Andrews, David Atkins, Hilda Bastian, Hans de Beer, Jan Brozek, Francoise Cluzeau, Jonathan Craig, Ben Djulbegovic, Yngve Falck-Ytter, Beatrice Fervers, Signe Flottorp, Paul Glasziou, Gordon H Guyatt, Robin Harbour, Margaret Haugh, Mark Helfand, Sue Hill, Roman Jaeschke, Katharine Jones, Ilkka Kunnamo, Regina Kunz, Alessandro Liberati, Merce Marzo, James Mason, Jacek Mrukowics, Andrew D Oxman, Susan Norris, Vivian Robinson, Holger J Schünemann, Tessa Tan Torres, David Tovey, Peter Tugwell, Mariska Tuut, Helena Varonen, Gunn E Vist, Craig Wittington, John Williams, and James Woodcock.

  • Contributors: All listed authors, and other members of the GRADE working group, contributed to the development of the ideas in the manuscript, and read and approved the manuscript. HJS wrote the first draft and collated comments from authors and reviewers for subsequent iterations. All other listed authors contributed ideas about structure and content and provided feedback. HJS is the guarantor.

  • Funding: This work was partially funded by a “The human factor, mobility and Marie Curie Actions Scientist Reintegration” European Commission Grant: IGR 42192-“GRADE” to HJS.

  • Competing interests: The authors are members of the GRADE Working Group. The work with this group probably advanced the careers of some or all of the authors and group members. Authors listed in the byline have received travel reimbursement and honorariums for presentations that included a review of GRADE’s approach to grading the quality of evidence and strength of recommendations. GHG acts as a consultant to UpToDate; his work includes helping UpToDate in their use of GRADE. HJS is documents editor and methodologist for the American Thoracic Society; he supports the implementation of GRADE by this and other organisations worldwide. VMM supports the implementation of GRADE in several North American not for profit professional organisations.

  • Provenance and peer review: Not commissioned; externally peer reviewed.
