What is new?
Key finding

We created and validated an item bank, entitled the “RTI item bank,” to evaluate risk of bias and precision for observational studies of interventions or exposures included in systematic literature reviews. It accommodates a variety of observational study design types, including studies with controls (cohort and case–control) and without controls that rely on changes or differences in exposure (cross-sectional and case series).
What this adds to what was known?

No gold standard exists for evaluating the risk of bias of observational studies. Existing tools require modification or may not be applicable for specific designs such as cross-sectional or case series. In practice, review groups often develop their own critical appraisal tool. These ad hoc tools may lack validated questions and adequate instructions for reviewers, leading to inconsistent evaluations within and across reviews.
We created a practical and validated item bank for evaluating the conduct of observational studies of interventions or exposures that (1) is comprehensive, capturing all of the risk of bias and precision domains critical for evaluating this type of research; (2) can be easily adapted to different topic areas and study types (e.g., cohort, case–control, cross-sectional, and case series studies); and (3) provides instruction to assist reviewers in creating and applying the best tool for varied topics.
What is the implication, what should change now?

Systematic reviewers should adopt validated tools that enable greater transparency and consistency in evaluating risk of bias and precision of observational studies. The RTI item bank is one such tool.
In the past decade, the number of publications included in PubMed has increased at an average annual rate of nearly 6%, from 467,364 citations in 1998 to 816,597 in 2008. This steady expansion in the volume of published studies increases the complexity and variability of information that policy makers, clinicians, and patients need to evaluate to make informed health care choices. Systematic reviews that compare interventions play a key role in synthesizing the evidence [1]. The assessment of the design and conduct of individual studies is central to this synthesis and is routinely used for interpreting results and grading the strength of the body of evidence. Systematic reviewers may also use these assessments to select studies for inclusion in the review or meta-analysis and to interpret heterogeneous findings [2].
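The growth rate quoted above can be checked directly as a compound annual growth rate (CAGR); the sketch below uses only the citation counts stated in the text, and the CAGR formula is the standard one:

```python
# Compound annual growth rate (CAGR) check for the PubMed citation counts
# quoted above: 467,364 citations (1998) to 816,597 (2008), i.e., 10 years.
citations_1998 = 467_364
citations_2008 = 816_597
years = 2008 - 1998

# CAGR = (end / start)^(1 / years) - 1
cagr = (citations_2008 / citations_1998) ** (1 / years) - 1
print(f"average annual growth: {cagr:.1%}")  # → average annual growth: 5.7%
```

A rate of about 5.7% per year is consistent with the "nearly 6%" figure in the text.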
Although well-designed and well-implemented randomized controlled trials (RCTs) have long been considered the gold standard for evidence, they frequently cannot answer all relevant clinical questions. RCTs may be unethical [3], limited in their ability to address harms because of small size or short follow-up [4], or lacking in applicability to vulnerable subpopulations [5]. Observational studies (lacking randomization, allocation concealment, blinding of participants and interventionists, and in some instances, control groups) may fill these gaps, but the trade-off is a wider range of sources of bias, including potential biases in selection, performance, detection of effects, and attrition; these biases have the potential to alter effect sizes unpredictably [6], [7].
The inclusion of non-RCT studies in systematic reviews requires validated tools to assess the likelihood of bias. Approaches to critical appraisal of study methodology and related terminology have varied and are evolving. Overlapping terms include quality, internal validity, risk of bias, or study limitations, but a central goal is an assessment of the believability of the findings. We use the phrase “assessment of risk of bias and precision” as the most representative of the goal of evaluating the degree to which the effects reported by the study represent the “true” causal relationship between exposure and outcome, that is, the accuracy of the estimation. The accuracy of an estimate depends on its validity (the absence of bias or systematic error in selection, performance, detection, measurement, attrition, and reporting and adequacy in addressing potential confounders) and precision (the absence of random error through adequate study size and study efficiency) [8]. Thorough assessment of these threats to the validity and precision of an estimate is critical to understanding the believability of a study.
Table 1 presents a taxonomy and description of threats to validity and precision, drawing on two well-cited sources: the Cochrane Handbook for Systematic Reviews of Interventions [7] and Modern Epidemiology [8].
Several reviews of critical appraisal tools, including Deeks et al. [9] and West et al. [10], identified key quality domains but found no gold standard for evaluating quality [9], [10], [11], [12]. Deeks et al. reviewed quality appraisal tools for nonrandomized studies. Of 213 identified tools, only six [13], [14], [15], [16], [17], [18] met their criteria of evaluating six core elements of internal validity (creation of groups, comparability of groups at the analysis stage, allocation to intervention, similarity of groups for key prognostic characteristics by design, identification of prognostic factors, and the use of case-mix adjustment) and were specifically designed for use in systematic reviews [9]. These tools vary in the criteria covered [9] and in their overall approach. Tools focus on either a description or reporting of methods (questions regarding whether authors reported a particular element of the study in a manuscript) or a judgment of risk of bias (questions regarding whether the conduct of the study altered the believability of results).
Existing tools also have other constraints. Some tools, such as the Newcastle–Ottawa Scale [14], are scales that rely mostly or entirely on uniform weights for all questions. The use of uniform weights may be difficult to justify in all contexts [7]; for example, when a single flaw substantially increases the risk of bias for a particular topic. Tools may require modification or may not be applicable for specific designs such as cross-sectional or case series. In practice, the idiosyncrasies of individual topics often require each review group to develop its own critical appraisal tool. These ad hoc tools may lack validated questions and adequate instruction for reviewers, leading to inconsistent evaluations within and across reviews.
Our objective was to create a practical and validated item bank for evaluating the conduct of observational studies of interventions or exposures that (1) is comprehensive, capturing all of the risk of bias and precision domains critical for evaluating this type of research; (2) can be easily adapted to different topic areas and study types (e.g., cohort, case–control, cross-sectional, and case series studies); and (3) provides instruction to assist reviewers in creating and applying the best tool for varied topics.
Our resulting risk of bias and precision item bank provides a means to assess threats to the accuracy of an estimate provided in a study and is applicable to evaluating
- studies of interventions or exposures that lack random allocation to an intervention and rely on associations between changes or differences in exposure or interventions and changes or differences in an outcome of interest [19]. It is not designed to evaluate diagnostic studies.
- a variety of observational study design types, including studies with controls (cohort and case–control) and without controls that rely on changes or differences in exposure (cross-sectional and case series) [20].
- internal validity only and not external validity (applicability).
Although we did not test the reliability of our item bank for other study designs, we believe that it can be used for evaluating such designs as well, with some modifications. For instance, evaluations of quasi-experimental studies will need to supplement questions from our item bank with questions from a validated RCT appraisal tool on allocation concealment and blinding of patients and interventionists. We anticipate that systematic review study directors (referred to as principal investigators [PIs]) will select specific items based on the needs of the review topic and the most likely potential sources of bias and threats to precision in the included studies.
As noted above, Deeks et al. [9, p23] identified two approaches to evaluating the quality of observational studies, focusing on either a description of methods (the evaluation of the “objective characteristics of each study's methods as they are described by the primary researchers”) or an evaluation of the risk of bias and threats to precision. Study appraisal based on risk of bias enumerates potential sources of bias (Table 1), relies heavily on judgment, and is supported by transparency in recording reasons for the judgment. One constraint of this approach is that threats to validity and precision can occur at various points in the study. Assessing these threats without explicit reference to the methods used at each stage of research would require a relatively abstract evaluation and could result in poor interrater reliability. The alternative “methods description” approach is easier to implement because methods for each stage of research tend to correspond well with how manuscripts are written. This approach relies less on reviewer judgment [9] but may fall short of evaluating believability. One solution, which we have adopted, combines both approaches: it uses the methods description for each stage of research as the primary framework, to facilitate ease of review, while evaluating how the design and conduct of the study at that stage addresses threats to validity and precision. This approach requires the reviewer to judge risk of bias in the context of adequate reporting and description of methods. In developing our item bank, we identified questions relevant to each of the 12 “methods” domains identified by Deeks et al. [9]: (1) background/context, (2) sample definition and selection, (3) interventions/exposure, (4) outcomes, (5) creation of treatment groups, (6) blinding, (7) soundness of information, (8) follow-up, (9) analysis comparability, (10) analysis outcome, (11) interpretation, and (12) presentation and reporting.
The item bank provides a tool for abstractors to review a manuscript to identify the risk of bias and threats to precision for these domains.
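To make the domain-based structure concrete, the sketch below shows one hypothetical way a review team might encode an item bank in software and filter it down to a design-specific appraisal tool, as the PI-driven item selection described above implies. The 12 domain names come from the list above, but the item identifiers, question texts, and design tags are illustrative assumptions, not the actual RTI item bank.

```python
# Hypothetical encoding of a risk-of-bias item bank. Each item pairs a
# methods-description question with the observational designs it applies to;
# a PI then filters the bank to build a topic- and design-specific tool.
DOMAINS = [
    "background/context", "sample definition and selection",
    "interventions/exposure", "outcomes", "creation of treatment groups",
    "blinding", "soundness of information", "follow-up",
    "analysis comparability", "analysis outcome", "interpretation",
    "presentation and reporting",
]

# Illustrative items only (IDs and wording are invented for this sketch).
ITEM_BANK = [
    {"id": "S1", "domain": "sample definition and selection",
     "question": "Were inclusion/exclusion criteria prespecified?",
     "designs": {"cohort", "case-control", "cross-sectional", "case series"}},
    {"id": "G1", "domain": "creation of treatment groups",
     "question": "Were comparison groups assembled concurrently?",
     "designs": {"cohort", "case-control"}},
]

def select_items(bank, design):
    """Return the items applicable to a given observational study design."""
    return [item for item in bank if design in item["designs"]]

# A case series has no control group, so group-creation items drop out.
print([item["id"] for item in select_items(ITEM_BANK, "case series")])  # → ['S1']
print([item["id"] for item in select_items(ITEM_BANK, "cohort")])       # → ['S1', 'G1']
```

Tagging each item with its domain and applicable designs is what lets one bank serve cohort, case–control, cross-sectional, and case series reviews without a separate tool per design.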