- Frederick A Spencer, professor of medicine¹,
- Alfonso Iorio, associate professor of clinical epidemiology and biostatistics¹ ²,
- John You, assistant professor of medicine¹ ²,
- M Hassad Murad, associate professor of medicine³,
- Holger J Schünemann, professor and chair of clinical epidemiology and biostatistics¹ ²,
- Per O Vandvik, associate professor of medicine⁴ ⁵,
- Mark A Crowther, professor of medicine and molecular medicine¹ ⁶,
- Kevin Pottie, associate professor of family medicine and epidemiology and community medicine⁷,
- Eddy S Lang, senior researcher⁸,
- Joerg J Meerpohl, deputy director of German Cochrane Centre⁹,
- Yngve Falck-Ytter, assistant professor of medicine¹⁰,
- Pablo Alonso-Coello, senior researcher¹¹,
- Gordon H Guyatt, professor of medicine and clinical epidemiology and biostatistics²
- ¹Department of Medicine, McMaster University, Hamilton ON L8N 4A6, Canada
- ²Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton
- ³Division of Preventive Medicine, Mayo Clinic, Rochester, Minnesota, USA
- ⁴Department of Medicine, Inlandet Hospital Trust, Gjøvik, Norway
- ⁵Norwegian Knowledge Centre for Health Services, Oslo, Norway
- ⁶Department of Molecular Medicine, McMaster University, Hamilton
- ⁷Departments of Family Medicine and Epidemiology and Community Medicine, University of Ottawa, Ottawa, Canada
- ⁸Division of Emergency Medicine, University of Calgary, Calgary, Canada
- ⁹German Cochrane Center, Institute of Medical Biometry and Medical Informatics, University Medical Center, Freiburg, Germany
- ¹⁰Department of Medicine, Case Western Reserve University, Cleveland, USA
- ¹¹Iberoamerican Cochrane Centre, CIBERESP-IIB Sant Pau, Barcelona, Spain
- Correspondence to: F A Spencer
- Accepted 29 October 2012
The GRADE system provides a framework for assessing confidence in estimates of the effect (“quality of evidence”) of alternative management strategies on outcomes that are important to patients.1 2 3 4 5 6 The GRADE system includes consideration of risk of bias, publication bias, imprecision, inconsistency, and indirectness and their impact on confidence in estimates of benefits and harms. The evaluation of each of these issues has, thus far, focused almost exclusively on their potential impact on estimates of relative effect. Because, in most instances, estimates of relative effect of a therapy are similar across different baseline risks, one can apply these relative estimates to the best estimates of overall baseline risk or, if available, estimates from subgroups that differ in baseline risk.
Using the GRADE approach, guideline panellists apply the best estimate of relative effect to the best available estimate of baseline risk to obtain an estimate of absolute effect (see box). Limitations of the evidence with respect to risk of bias, publication bias, imprecision, inconsistency, or indirectness may reduce confidence in estimates of the relative risk reduction and affect the strength of guideline recommendations.
Estimates of absolute effect
When patients and clinicians are trading off desirable and undesirable consequences of an intervention they require estimates of absolute effect. For instance, patients with atrial fibrillation need to trade off risk of strokes versus risk of major bleeding, and they need to know how many strokes anticoagulation will prevent, and how many bleeds it will cause. This is best done by applying estimates of relative effect to estimates of baseline risk, such as by means of the CHADS2 scoring system:
Patients with a CHADS2 score of 1 have a yearly risk of stroke of about 22 per 1000
The relative risk of stroke in patients receiving warfarin is 0.34
Therefore the risk of stroke in treated patients is 22×0.34 per 1000 = 7 per 1000
Thus, the absolute reduction in risk is 22−7 = 15 per 1000
Patients whose CHADS2 score is 2 have a yearly risk of stroke of about 45 per 1000
The relative risk of stroke in patients receiving warfarin is also 0.34 in this group
Therefore the risk of stroke in treated patients is 45×0.34 per 1000 = 15 per 1000
Thus, the absolute reduction in risk is 45−15 = 30 per 1000
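The arithmetic in the box can be sketched in a few lines of Python. The function names and constants below are illustrative only and are not part of GRADE or AT9; the inputs are the CHADS2 figures quoted in the box, and results are rounded to whole events per 1000.

```python
def treated_risk(baseline_per_1000: float, relative_risk: float) -> float:
    """Risk in treated patients (per 1000): baseline risk times relative risk."""
    return baseline_per_1000 * relative_risk


def absolute_risk_reduction(baseline_per_1000: float, relative_risk: float) -> float:
    """Events prevented per 1000 treated: baseline risk minus treated risk."""
    return baseline_per_1000 - treated_risk(baseline_per_1000, relative_risk)


# Figures quoted in the box: yearly stroke risk per 1000 by CHADS2 score,
# and a relative risk of stroke of 0.34 with warfarin.
BASELINE_BY_CHADS2 = {1: 22, 2: 45}
RELATIVE_RISK_WARFARIN = 0.34

for score, baseline in BASELINE_BY_CHADS2.items():
    prevented = absolute_risk_reduction(baseline, RELATIVE_RISK_WARFARIN)
    print(f"CHADS2 {score}: about {round(prevented)} strokes prevented per 1000 per year")
```

The same relative risk yields roughly twice the absolute benefit at the higher baseline risk, which is why the quality of the baseline estimate matters.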
As with estimates of relative effect, the quality of evidence supporting estimates of baseline risk can vary. At present, GRADE—and all other systems that address confidence in estimates of treatment effect—fails to fully explore issues of confidence in estimates of baseline risk. Nor do these systems incorporate the 95% confidence interval of a baseline risk estimate when deriving their absolute risk estimates. Thus, evaluating uncertainty in baseline risk, and its impact on confidence in absolute estimates of treatment effect, remains an important outstanding issue.
We suggest that the domains currently used in GRADE (risk of bias, publication bias, imprecision, inconsistency, and indirectness) can also help to understand issues of confidence in baseline risk estimates. In this article we use examples from the Antithrombotic Therapy and Prevention of Thrombosis, 9th edition (AT9) to examine how these issues may influence estimates of baseline risk and the subsequent impact on derived estimates of absolute effect.
Risk of bias
In addressing treatment effects, evidence from observational studies generally warrants lower confidence than evidence from randomised controlled trials. However, community based or population based observational studies can provide better estimates of the baseline risk associated with a given clinical condition than randomised controlled trials, which often enrol highly selected populations. This will be true, however, only if the relevant observational studies are at low risk of bias in ascertaining event rates.
In the AT9 guidelines addressing atrial fibrillation,7 the panellists derived baseline risk estimates of non-fatal stroke for patients with atrial fibrillation from pooled event rates in the control arms of six randomised controlled trials conducted in the early 1990s.8 The panellists acknowledged limitations in these estimates, including the fact that the trials enrolled less than 10% of patients screened. In addition, the authors noted that more recent data from a large administrative database including a broader spectrum of patients suggested lower rates of non-fatal thromboembolism in untreated patients (4.2 v 2.1 per 100 patient years).9 These lower rates may be more reflective of event rates in the current era and would make an important difference in the estimated absolute risk reduction (that is, a more modest effect) associated with anticoagulation in this class of patients.
The panel chose, however, to rely on the trial data because of concern that the lower estimate of stroke derived from the large administrative database reflected under-ascertainment of stroke (that is, a high risk of bias).
Publication bias
Relative risk estimates for the impact of a therapeutic strategy in relation to a comparator on a target outcome are ideally drawn from a systematic review of relevant studies. These estimates are biased if the included studies are unrepresentative because of preferential publication of studies favouring a stronger or weaker effect.10 11 In GRADE, systematic review and guideline authors may rate down their confidence in effect estimates if they believe publication bias is likely.12
Publication bias may similarly affect estimates of baseline risk. Ideally, systematic reviews of large observational studies including a representative sample of the target population will inform estimates of baseline risk. However, observational studies reporting higher undesirable event rates may be less likely to be published than studies reporting lower event rates. This may be particularly true for surgical series, in which surgeons experiencing a higher rate of adverse events than their colleagues may be reluctant to display their less enviable record to the surgical world.
Imprecision
Examination of 95% confidence intervals for estimates of absolute effects provides the optimal approach to determine precision of the estimate.13 For practice guidelines, rating down the confidence in absolute estimates of effect is warranted if clinical action would differ if the upper versus the lower boundary of the confidence interval represented the truth.
Imprecision in estimates of baseline risk will affect the derived absolute effect of a given therapy. The AT9 guidelines suggest venous thromboprophylaxis with low dose, low molecular weight heparin (LMWH) for women undergoing assisted reproduction who develop severe ovarian hyperstimulation syndrome.14 The authors estimate that use of low dose LMWH will prevent 26 venous thromboembolic events (95% confidence interval 13 to 33) per 1000 patients treated. Their estimate comes from applying indirect evidence of the relative risk reduction associated with low dose LMWH from existing surgical literature (relative risk 0.36 (95% confidence interval 0.20 to 0.67)) to a baseline venous thromboembolic event rate of 4.1%. The quality of evidence for the resulting recommendation was rated down for indirectness (relative risk estimate derived from a general surgical population).
This baseline risk of 4.1% was, however, derived from a sample of just 49 patients with severe ovarian hyperstimulation syndrome from a cohort of 2748 cycles of assisted reproduction therapy.15 The 95% confidence interval around the 4.1% point estimate is 1.1% to 13.7%. Therefore, depending on the baseline risk selected (and applying the 95% confidence interval of the relative risk), use of low dose LMWH in such patients may prevent as few as four or as many as 110 events per 1000 treated. The lower estimate of four events prevented per 1000 treated would make any recommendation for thromboprophylaxis in this population far less attractive than would the higher estimate of 110. Such imprecision is likely to arise in rare conditions.
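The range of four to 110 events can be reproduced by pairing each end of the baseline risk interval with the opposite end of the relative risk interval. The sketch below is illustrative, and its function names are hypothetical. Two inputs are assumptions rather than facts from the source: the count of 2 events among the 49 patients (inferred from the 4.1% rate) and the use of a Wilson score interval, which happens to reproduce the published 1.1% to 13.7%. The article does not state how its interval was calculated.

```python
import math


def wilson_interval(events: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (assumed method, see lead-in)."""
    p = events / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


def events_prevented_range(
    baseline_low: float, baseline_high: float, rr_low: float, rr_high: float
) -> tuple[float, float]:
    """Fewest and most events prevented per 1000 treated.

    The fewest pairs the lowest baseline risk with the weakest effect
    (highest relative risk); the most pairs the highest baseline risk
    with the strongest effect (lowest relative risk).
    """
    fewest = 1000 * baseline_low * (1 - rr_high)
    most = 1000 * baseline_high * (1 - rr_low)
    return fewest, most


# Assumed: 2 events among 49 patients (about 4.1%).
low, high = wilson_interval(events=2, n=49)
# Relative risk interval quoted in the text: 0.20 to 0.67.
fewest, most = events_prevented_range(low, high, rr_low=0.20, rr_high=0.67)
print(f"baseline risk: {low:.1%} to {high:.1%}")
print(f"events prevented per 1000: {round(fewest)} to {round(most)}")
```

Combining the extremes of both intervals in this way gives a sensitivity bound rather than a formal 95% confidence interval for the absolute effect.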
Inconsistency
In GRADE, confidence in estimates of effect from a body of evidence may be rated down if the magnitude of treatment effect varies substantially across relevant studies.16 Inconsistency may also undermine estimates of baseline risk. Guideline developers often derive baseline risk estimates by pooling event rates from observational studies using similar populations. Event rates among individual studies may vary greatly from the pooled estimate, thus decreasing confidence in this estimate.
In the chapter of the AT9 guidelines addressing prophylaxis for venous thromboembolism in surgical patients, the authors suggest an average risk of 2.1% for venous thromboembolism in patients undergoing craniotomy and suggest use of lower extremity external compression devices as prophylaxis.17 This risk estimate was pooled from event rates observed in eight studies of neurosurgical patients using external compression devices.18 Based on this estimate, and applying a relative risk estimate of 0.56, the authors calculated that use of LMWH instead of external compression devices would prevent nine non-fatal symptomatic venous thromboembolic events per 1000 patients treated. Using a similar approach, they calculated that LMWH will cause 11 more non-fatal intracranial bleeds. Based on these estimates of absolute benefit and harm, they provided a weak recommendation for mechanical prophylaxis over LMWH for venous thromboembolism.
The venous thromboembolic event rates in the included studies varied from 0% to 10%. This inconsistency decreases our confidence in the baseline risk estimates and consequently in the recommendation. If true venous thromboembolic event rates are closer to 10% despite use of external compression devices, LMWH would prevent 44 non-fatal venous thromboembolic events per 1000 treated. Based on this estimate of absolute effect, it is less clear which prophylactic strategy should be recommended.
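A short sensitivity sweep shows how the estimated benefit of LMWH moves across the 0% to 10% range of event rates. This is a sketch only: the function name is hypothetical, the 5% baseline is an intermediate value added for illustration, and holding the 11 additional bleeds fixed is a simplifying assumption made here, not a claim from the guideline.

```python
def events_prevented_per_1000(baseline_risk: float, relative_risk: float) -> float:
    """Events prevented per 1000 treated for a given baseline risk (as a proportion)."""
    return 1000 * baseline_risk * (1 - relative_risk)


RELATIVE_RISK_LMWH = 0.56   # LMWH versus external compression devices, as quoted
EXTRA_BLEEDS_PER_1000 = 11  # quoted harm estimate, held fixed for illustration

# 2.1% is the pooled estimate; 0% and 10% are the extremes across studies;
# 5% is a hypothetical intermediate value.
for baseline in (0.0, 0.021, 0.05, 0.10):
    prevented = events_prevented_per_1000(baseline, RELATIVE_RISK_LMWH)
    print(
        f"baseline {baseline:.1%}: {round(prevented)} events prevented "
        f"v {EXTRA_BLEEDS_PER_1000} extra bleeds per 1000"
    )
```

The sweep does not weigh a venous thromboembolic event against an intracranial bleed; it shows only that the balance of benefit and harm depends heavily on which baseline rate is believed.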
Indirectness
Direct evidence in the GRADE framework includes studies that have enrolled the populations of interest, delivered the intervention in the manner of interest, and measured the outcomes important to patients over the time frame of interest.19 A guideline panel will have concerns about indirectness when the population, intervention, or outcome differs from those in which they are interested—what one might otherwise call limitations of applicability.
The evidence supporting a baseline risk estimate can also be indirect. This occurs when baseline risk estimates are derived from a population that differs significantly from the population to whom the resulting guidelines are directed. Given the lack of high quality evidence documenting outcome event rates for specific disease states in community settings, estimates of baseline risks for outcome events are often derived from event rates in the control arms of randomised controlled trials. In general, patients enrolled in such trials are younger, have less comorbidity, and have better outcomes than patients encountered in clinical practice. Therefore, application of relative risk estimates for a given intervention to a baseline risk rate derived from a randomised controlled trial may underestimate both the absolute benefits and harms associated with that intervention in the community setting.
Indirectness may also lead to overestimates of absolute effects. As discussed above, baseline risk estimates of non-fatal stroke for patients with atrial fibrillation in the AT9 guidelines were derived from the pooled event rates in the control arms of six randomised controlled trials comparing warfarin with aspirin in the early 1990s.8 For CHADS2 (stroke risk) scores of 0, 1, 2, and 3–6, respectively, baseline event rates of 0.8%, 2.2%, 4.5%, and 9.6% per year were used to generate estimates of absolute benefit with warfarin. Rates of non-fatal thromboembolism in untreated patients were significantly lower in a more current and representative population than seen in the older trials (for CHADS2 scores of 0, 1, 2, 3, and 4–6, respectively, absolute event rates of 0.4%, 1.2%, 2.5%, 3.9%, and 6.3% were reported).9
Use of the estimates from the more current observational database would have resulted in a substantial decrease in the calculated absolute benefit of warfarin over one year. For example, using the baseline risk estimates from the older trials, warfarin use is predicted to prevent 30 non-fatal strokes per 1000 (95% confidence interval 23 to 35 strokes prevented) in patients with a CHADS2 score of 2. With the lower baseline risk estimates, however, the absolute benefit of warfarin decreases—resulting in prevention of only 16 (13 to 19) non-fatal thromboembolic events per 1000 treated. Similarly, absolute benefit for patients with a CHADS2 score of 1 would have declined from 15 (11 to 17) fewer events to eight (6 to 9) fewer events without a change in estimated harm due to bleeding. These revised absolute benefits would potentially alter recommendations—possibly changing the direction of the recommendation for warfarin in patients with a CHADS2 score of 1 and reducing the strength of the recommendation from strong to weak for warfarin over aspirin in patients with a CHADS2 score of 2.
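The point estimates in this comparison can be reproduced by applying one relative risk to each baseline source in turn. The sketch below is illustrative: the function and variable names are hypothetical, and it assumes that the relative risk of 0.34 quoted earlier in the article applies equally to both populations. The confidence intervals are not reproduced because the interval for the relative risk is not given in the text.

```python
RELATIVE_RISK_WARFARIN = 0.34  # relative risk of stroke quoted earlier in the article

# Yearly event rates per 1000 for CHADS2 scores 1 and 2, as quoted in the text.
BASELINE_PER_1000 = {
    "older trials": {1: 22, 2: 45},
    "observational database": {1: 12, 2: 25},
}


def events_prevented(baseline_per_1000: float, relative_risk: float) -> float:
    """Events prevented per 1000 treated per year."""
    return baseline_per_1000 * (1 - relative_risk)


def compare_sources(baselines: dict, relative_risk: float) -> dict:
    """Return {CHADS2 score: {source: rounded events prevented per 1000}}."""
    table: dict = {}
    for source, by_score in baselines.items():
        for score, baseline in by_score.items():
            prevented = round(events_prevented(baseline, relative_risk))
            table.setdefault(score, {})[source] = prevented
    return table


for score, by_source in sorted(compare_sources(BASELINE_PER_1000, RELATIVE_RISK_WARFARIN).items()):
    print(f"CHADS2 {score}: {by_source}")
```

Because the relative risk is held constant, the whole difference in absolute benefit between the two rows comes from the choice of baseline risk source.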
Conclusion
Adopted by over 60 groups worldwide, the GRADE approach represents an important innovation in interpreting evidence from systematic reviews, health technology assessments, and clinical practice guidelines. At present, the approach focuses on evaluating confidence in estimates of the relative effect of one treatment strategy over another and then, in most cases, assuming that this confidence also applies to estimates of absolute effects. Estimates of baseline risks, however, directly affect estimates of absolute risks and benefits of a treatment. We suggest that the confidence in estimates of baseline risks is subject to the same issues as evidence for relative effects of a treatment strategy.
To date guidelines have rarely considered issues of baseline risk. As our examples illustrate, GRADE’s structure can be usefully adapted to better understand issues regarding confidence in baseline risk.
This discussion has only illustrated the problem. We are not yet ready to offer specific guidance on how to rate down confidence in estimates of baseline risk. As with other methodological problems previously encountered, a great deal of work studying specific examples needs to be done before we can offer concrete solutions. This article represents a first step in this process.
Summary points
Uncertainty in baseline risk estimates and its impact on confidence in absolute estimates of treatment effect are not adequately evaluated in systems for judging confidence in estimates of treatment effect, including GRADE
Risk of bias, publication bias, imprecision, inconsistency, and indirectness can affect confidence in estimates of baseline risk and subsequently confidence in derived estimates of absolute effect of diagnostic and treatment modalities
GRADE’s structure can be easily and effectively adapted to better understand issues regarding confidence in baseline risk. Concerns can be categorised into one or more of the same domains used by GRADE to evaluate evidence supporting a relative risk estimate
Cite this as: BMJ 2012;345:e7401
Contributors: FAS and GHG conceived the study and take responsibility for the integrity of the data and the accuracy of the data analysis. FAS is the guarantor. FAS, AI, and GHG designed the study. FAS, AI, JY, MHM, POV, and GHG analysed and interpreted the data. FAS, AI, MHM, and GHG drafted the manuscript. All authors critically revised the manuscript.
Funding: This study was not externally funded.
Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; all authors are members of the GRADE working group and were contributors to Antithrombotic Therapy and Prevention of Thrombosis, 9th edition, but this manuscript is not submitted on behalf of either group.
Ethical approval: Not required.
Provenance: All authors were contributors to the Antithrombotic Therapy and Prevention of Thrombosis, 9th edition, which used the GRADE system to assess quality of evidence underlying subsequent recommendations. During the development of these guidelines, authors (in particular FAS, AI, POV, and GHG) struggled with issues of confidence in estimates of baseline risk and how to evaluate and categorise uncertainty in baseline risk estimates. Further discussion of these issues among panel members prompted the development of this manuscript.