Concerns about composite reference standards in diagnostic research
BMJ 2018; 360 doi: https://doi.org/10.1136/bmj.j5779 (Published 18 January 2018) Cite this as: BMJ 2018;360:j5779
- Nandini Dendukuri, associate professor1,
- Ian Schiller, research assistant1,
- Joris de Groot, assistant professor2,
- Michael Libman, professor3,
- Karel Moons, professor2,
- Johannes Reitsma, associate professor2,
- Maarten van Smeden, assistant professor2
- 1Division of Clinical Epidemiology, McGill University Health Centre—Research Institute, Canada
- 2Julius Center for Health Sciences and Primary Care, University Medical Centre Utrecht, Netherlands
- 3Division of Infectious Diseases, McGill University Health Centre, Canada
- Accepted 20 November 2017
Composite reference standards define a fixed, transparent rule to classify subjects into disease positive and disease negative groups based on existing imperfect tests
They are widely regarded as appropriate for determining sensitivity and specificity of a new test in the absence of a perfect reference test
Though a composite reference standard is attractive for its simple and transparent construction, it can result in biased estimates as it makes suboptimal use of data
Bias due to a composite reference standard can worsen as more information is gathered and the new test’s accuracy can be overestimated if the errors made by the composite reference standard and the new test are correlated
Composite reference standards cannot aid standardisation across settings when disease prevalence varies
Appropriately constructed latent class models should be used to make complete use of the information gathered from multiple imperfect tests
For many diseases, a perfect diagnostic test may not exist or cannot be applied owing to costs or ethical concerns. Researchers evaluating a new test for the disease have no perfect reference against which to compare it. For example, GeneXpert (Xpert) is a nucleic acid amplification test for paediatric pulmonary tuberculosis (TB). Culture is an inadequate reference standard owing to its poor sensitivity,1 so the accuracy of Xpert is often evaluated by comparing it to a composite reference standard based on multiple imperfect tests.23
Table 1 shows a composite reference standard defined using data gathered in a South African cohort study of symptomatic children.6 It classifies a child who is positive on culture, smear microscopy, chest radiography, or the tuberculin skin test as having TB. The apparent advantage of this composite reference standard is that it would identify more TB cases than culture alone. Studies using a standard such as this typically treat it as an error-free reference test to estimate the new test’s sensitivity (proportion of all patients with the disease who are correctly identified as positive by the new test) and specificity (proportion of all patients without the disease who are correctly identified as negative by the new test).37 Ensuring the new test is not used to define the composite reference standard is thought to protect against overestimating the new test’s accuracy.8
Composite reference standards are used for diverse conditions including Chlamydia trachomatis infection, apathy, and prostate cancer (see web table 1). Some are required by regulatory authorities for approval of new tests.9 Despite their widespread use, the number of tests necessary to define an adequate standard is unknown. Composite reference standards based on two or three tests are most common,710 but some are based on upwards of eight or nine tests (web table 1).211
Consequences of both the composite reference standard and the new test making the same errors are also not well understood. We discuss previously ignored concerns related to composite reference standards using numerical examples and data from a study of childhood TB and draw attention to better methods. We focus on composite reference standards based on the OR rule, which classifies patients with at least one positive test as disease positive and those with all negative tests as disease negative.3 Other possible composite rules include the AND rule, which classifies a patient as disease positive only if all tests are positive, or K positive rules, which classify a patient as disease positive only if at least K tests are positive.12
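The three composite rules described above differ only in how many positive component tests are required, so they can be sketched with a single helper. The function names below are hypothetical and chosen for illustration; they do not come from the original study.

```python
from typing import Sequence


def composite_result(results: Sequence[bool], k: int = 1) -> bool:
    """Classify a patient as disease positive if at least k component tests are positive.

    k = 1 gives the OR rule, k = len(results) gives the AND rule,
    and any value in between gives a K positive rule.
    """
    if not 1 <= k <= len(results):
        raise ValueError("k must be between 1 and the number of component tests")
    return sum(results) >= k


def or_rule(results: Sequence[bool]) -> bool:
    # Disease positive if any component test is positive.
    return composite_result(results, k=1)


def and_rule(results: Sequence[bool]) -> bool:
    # Disease positive only if every component test is positive.
    return composite_result(results, k=len(results))


# Hypothetical patient: positive on chest radiography only (second test).
patient = [False, True, False, False]
print(or_rule(patient), and_rule(patient), composite_result(patient, k=2))
```

For this hypothetical patient, only the OR rule returns a disease positive classification.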
In these examples, the sensitivities of the component tests used to define the composite reference standard are moderate to high and their specificities are near perfect. An OR rule composite reference standard is, therefore, anticipated to have higher sensitivity than a single imperfect reference test.
We generated data for a sample of 1000 people assuming the composite reference standard was made up of component tests with sensitivity of 70% and specificities of 98-100% (detailed explanation in the web appendix). Disease prevalence was assumed to be 10%—that is, 100 patients were disease positive and 900 were disease negative. The new test under evaluation was set to have a sensitivity of 90% and a specificity of 98%.
Increasing the number of component tests
Misclassification of disease status
Depending on their accuracy, adding more component tests to the composite reference standard might cause more misclassification rather than less. We start with the ideal situation where each component test has perfect specificity of 100% and where the different component tests are conditionally independent (meaning that they are not prone to making the same false positive or false negative errors). Table 2 lists the frequency of results on the component tests and the composite reference standards based on them. As we move from a single reference test to a composite reference standard based on two or three component tests, the number of patients correctly classified as having the disease increases from 70 (34+15+15+6) to 91 (34+15+15+6+15+6) to 97 (34+15+15+6+15+6+6). The gain at the second step is less than at the first. After about five component tests (data not shown), additional tests increase costs but don’t result in any gain, as all 1000 patients are correctly identified.
Table 2 shows how misclassification changes when the specificity of the component tests decreases to 98%. The classification of disease positive patients remains the same, but the number of misclassified disease negative patients increases from 17 to 51 as we move from a single reference test to a composite reference standard with three component tests. For this example, the composite reference standard with three component tests resulted in more misclassified patients overall (three false negatives and 51 false positives) than the single reference test (30 false negatives and 17 false positives). The overall number of misclassified patients will continue to rise as more component tests are added to the composite reference standard.12
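The trade-off between missed cases and false alarms can be sketched with expected counts, assuming an OR rule and conditionally independent component tests with identical accuracy. The function name and defaults are illustrative assumptions. The expected values differ slightly from the tabulated counts quoted above (for example, 18 rather than 17 misclassified disease negative patients for a single test), which may reflect rounding of cell frequencies in the table.

```python
def expected_misclassified(n_tests, sens=0.70, spec=0.98, prevalence=0.10, n=1000):
    """Expected false negatives and false positives of an OR rule composite.

    Assumes every component test has the same sensitivity and specificity
    and that the tests are conditionally independent given disease status.
    """
    diseased = n * prevalence
    non_diseased = n - diseased
    # A diseased patient is missed only if all component tests are negative.
    composite_sens = 1 - (1 - sens) ** n_tests
    # A non-diseased patient is correctly classified only if all tests are negative.
    composite_spec = spec ** n_tests
    false_negatives = diseased * (1 - composite_sens)
    false_positives = non_diseased * (1 - composite_spec)
    return false_negatives, false_positives


for n_tests in (1, 2, 3, 5):
    fn, fp = expected_misclassified(n_tests)
    print(f"{n_tests} test(s): {fn:.1f} false negatives, {fp:.1f} false positives")
```

Setting `spec=1.0` reproduces the ideal scenario, in which false positives stay at zero and only the false negatives shrink as tests are added.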
Sensitivity and specificity of new tests
Using the data in table 2 we can show that when all component tests have perfect specificity and are conditionally independent, the sensitivity of the new test is estimated accurately at its value of 90% irrespective of the number of tests in the composite reference standard (fig 1a). The estimate of the specificity of the new test steadily improves with every added test until it reaches the true value (fig 1b).
When the specificity of the component tests falls to 98%, the specificity estimates of the new test are almost identical to those obtained previously, but the sensitivity of the new test is now underestimated (fig 1). When the composite reference standard was composed of three tests, for example, the estimated sensitivity was 59%, much lower than the true value of 90%. This underestimation worsens with every test added to the composite reference standard.
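A minimal sketch of the calculation behind these estimates, assuming an OR rule composite, identical component tests, and conditional independence between all tests. The function name is hypothetical. With three component tests of 98% specificity it gives an apparent sensitivity of about 59%, in line with the figure quoted above.

```python
def apparent_accuracy(n_tests, comp_sens=0.70, comp_spec=0.98,
                      new_sens=0.90, new_spec=0.98, prevalence=0.10):
    """Sensitivity and specificity of the new test as estimated against
    an OR rule composite reference standard treated as error free."""
    composite_sens = 1 - (1 - comp_sens) ** n_tests
    composite_spec = comp_spec ** n_tests

    # Joint probabilities of true disease status and composite classification.
    pos_diseased = prevalence * composite_sens
    pos_non_diseased = (1 - prevalence) * (1 - composite_spec)
    neg_diseased = prevalence * (1 - composite_sens)
    neg_non_diseased = (1 - prevalence) * composite_spec

    # Proportion of composite positives that the new test also calls positive.
    est_sens = (pos_diseased * new_sens + pos_non_diseased * (1 - new_spec)) / (
        pos_diseased + pos_non_diseased)
    # Proportion of composite negatives that the new test also calls negative.
    est_spec = (neg_non_diseased * new_spec + neg_diseased * (1 - new_sens)) / (
        neg_diseased + neg_non_diseased)
    return est_sens, est_spec


for n_tests in range(1, 6):
    est_sens, est_spec = apparent_accuracy(n_tests)
    print(f"{n_tests} test(s): sensitivity {est_sens:.3f}, specificity {est_spec:.3f}")
```

Calling the function with `comp_spec=1.0` returns the true sensitivity of 90% for any number of tests, because no non-diseased patients dilute the composite positive group.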
Overestimating sensitivity or specificity of the new test
So far, we have assumed that all tests—component tests in the composite reference standard and the new test—are conditionally independent. In practice, however, errors made by multiple tests might be correlated.13 In studies evaluating new tests for Chlamydia trachomatis, for example, the component tests and the new test are typically nucleic acid amplification tests, which risk making the same false positive error of detecting a non-viable organism.14 In these situations, the composite reference standard can overestimate the accuracy of the new test because it does not adjust for the presence of conditional dependence—that is, the errors of the new test remain undetected.
To study the effects of conditional dependence, we generated data from a setting where both tests are likely to make the same false positive errors even though their specificity remains high at 98% (see web table 2 for details).6 As in our previous example, the sensitivity of the new test is underestimated, and this worsens with each component test added to the composite reference standard (fig 1). The specificity of the new test is underestimated when compared to a single imperfect test but becomes overestimated as component tests are added to the composite reference standard (fig 1). As the number of component tests increases, the estimated specificity of the new test converges to a value higher than the true value.12 In our example, the new test’s specificity will converge at 99.94%, compared with its true specificity of 98%. This may not seem like a large magnitude of bias but could lead to an important underestimate of the number of false positives the new test will produce in a low prevalence population.
Comparability across studies
When a new test is compared to the same composite reference standard in two different studies, the reported value of the new test’s accuracy will depend on the disease prevalence in each study. Using a standardised composite reference standard therefore does not ensure comparability across studies. We considered a composite reference standard based on three conditionally independent component tests, each with sensitivity 70% and specificity 98%. Disease prevalence ranged from low (5%) to high (30%), as might be expected across geographic regions or healthcare settings. The new test’s sensitivity was assumed to be 90% and its specificity 98%, as before. We found that the estimated sensitivity of the new test ranged from 43% when the prevalence was 5% to 79% when the prevalence was 30% (web figure 1). Estimated specificity did not vary greatly with prevalence for the settings we used.
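The dependence on prevalence can be made explicit with a short derivation, assuming conditional independence between all tests. The symbols are introduced here for illustration and are not taken from the original paper: \pi is the prevalence, Se_C and Sp_C are the sensitivity and specificity of the composite reference standard, and Se_N and Sp_N are those of the new test.

```latex
% Composite reference standard built from three tests with the OR rule
Se_C = 1 - (1 - 0.70)^3 = 0.973, \qquad Sp_C = 0.98^3 \approx 0.941

% Apparent sensitivity of the new test: proportion of composite positives
% that are also positive on the new test
\widehat{Se}_N(\pi) =
  \frac{\pi \, Se_C \, Se_N + (1 - \pi)(1 - Sp_C)(1 - Sp_N)}
       {\pi \, Se_C + (1 - \pi)(1 - Sp_C)}

% With Se_N = 0.90 and Sp_N = 0.98
\widehat{Se}_N(0.05) \approx 0.43, \qquad \widehat{Se}_N(0.30) \approx 0.79
```

When prevalence is low, false positives of the composite reference standard make up a larger share of its positive group, and the new test correctly calls most of them negative, which pulls the apparent sensitivity down.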
Latent class models can make better use of data
The drawbacks of the composite reference standard can be overcome using a statistical modelling approach called latent class analysis.3 Instead of classifying subjects into fixed disease categories, latent class analysis estimates the probability that each patient has the disease using all observed tests, including the test under evaluation (web figure 2). It adjusts for the sensitivity and specificity of each test as well as the possibility of conditional dependence between them. Simply put, latent class analysis considers how certain we are about classifying patients into diseased or non-diseased groups rather than making a black and white decision. Column 8 of table 1 shows the estimated risk of TB for each observed combination of tests based on a recent latent class analysis for the childhood TB data.6
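The idea of assigning each patient a probability of disease, rather than a fixed class, can be illustrated with Bayes' theorem. This is only a sketch under strong assumptions: the prevalence, sensitivities, and specificities are treated as known and the tests as conditionally independent, whereas a real latent class analysis estimates these quantities from the data and can model conditional dependence. All parameter values and names below are hypothetical.

```python
from math import prod


def posterior_disease_probability(results, sens, spec, prevalence):
    """Probability of disease given a pattern of test results.

    results: list of booleans (True = positive), one per test
    sens, spec: assumed sensitivity and specificity of each test
    prevalence: assumed prior probability of disease
    """
    # Likelihood of the observed pattern if the patient is diseased.
    like_diseased = prod(s if r else 1 - s for r, s in zip(results, sens))
    # Likelihood of the observed pattern if the patient is not diseased.
    like_non_diseased = prod(1 - c if r else c for r, c in zip(results, spec))
    numerator = prevalence * like_diseased
    return numerator / (numerator + (1 - prevalence) * like_non_diseased)


# Hypothetical accuracy values for three tests; not estimates from the TB study.
sens = [0.70, 0.70, 0.90]
spec = [0.98, 0.98, 0.98]
for pattern in ([True, True, True], [True, False, False], [False, False, False]):
    p = posterior_disease_probability(pattern, sens, spec, prevalence=0.10)
    print(pattern, round(p, 3))
```

A patient with a single positive result receives an intermediate probability instead of being counted as a definite case, which is the key difference from an OR rule composite.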
Notably, the estimated risk of TB from this latent class analysis follows the gradation proposed by an expert group’s clinical case definition (column 9 of table 1),4 which classifies subjects into confirmed TB, probable TB, possible TB, and unlikely TB groups, using the same four tests as the composite reference standard. Our data show that the composite reference standard would classify confirmed, probable, and possible TB cases all as having TB, resulting in an estimated prevalence of 94%. Using culture as a reference would only consider the confirmed TB cases, resulting in an underestimate of the prevalence (16.4%). Latent class analysis estimates a 100% risk of TB for most cases of confirmed TB, though the risk is lower for unusual patterns. The risk of TB among the probable and possible TB cases ranges from 9% to 52% among patients with a negative Xpert test but increases when Xpert is positive. The resulting prevalence estimate based on latent class analysis is 26.7%. Because the latent class analysis adjusts for conditional dependence between culture and Xpert, it also provides a more realistic estimate of Xpert sensitivity (49.4%) than would be obtained with culture (74.4%) or the composite reference standard (22.5%) (see web material for how the latent class analysis estimates were calculated).
The advantages of latent class analysis are accompanied by the challenges of using a more sophisticated analytical technique. Construction of these models requires interdisciplinary expertise of both methodologists and clinicians6 to determine the particular tests, covariates, conditional dependence structure, and previous knowledge to be considered. Validation of these models against a perfect reference may not always be possible. Sometimes competing models cannot be distinguished using standard statistical methods.15 This is not a drawback of latent class analysis, but a reflection of the uncertainty in our knowledge due to the lack of a perfect reference test. Comparison with external information, such as the experts’ clinical case definition, can aid in assessing whether the model provides sensible results. This step is important because, as with all statistical models, incorrect model specification can lead to biased results.16
Composite reference standards are considered valid for estimating diagnostic accuracy when no perfect reference standard exists.37 But we have shown that the OR rule based composite reference standard leads to biased estimates of the accuracy of a new test unless stringent conditions hold. The additional information gathered from each component test can result in worsening bias. Our previous work has shown these observations also apply to composite reference standards based on the AND rule or the K positive rule.12
Composite reference standards may be considered clinically meaningful17 as they resemble clinical decision rules, which classify patients into mutually exclusive categories to support decision making (for example, rules identifying whether a subject is a candidate for TB treatment). Such decision rules are not necessary in research settings as no black or white decision needs to be made. Clinical decision rules might indicate the best possible management strategy, but are recognised by clinicians as imperfect.16 Yet similar rules are used to define composite reference standards for diagnostic accuracy studies with no such recognition.
In the absence of a perfect reference test, a new test could be evaluated in terms of outcomes such as diagnostic yield or effect on patient management instead of accuracy.18 Latent class analysis would also be relevant in such analyses to estimate percentage of overdiagnosis or overtreatment,619 eventually supporting the development of optimal clinical decision rules.
As our ability to measure results on multiple tests/biomarkers increases, development of appropriate latent class models should be pursued in the absence of a perfect reference test to make optimal use of the data gathered.
More detail on the problems with composite reference standards https://www.ncbi.nlm.nih.gov/pubmed/26555849
A review paper on use of latent class models for diagnostic research https://www.ncbi.nlm.nih.gov/pubmed/24272278
Guidelines for reporting latent class models https://www.equator-network.org/reporting-guidelines/stard-blcm/
Contributors and sources: The authors include biostatisticians, epidemiologists, and clinicians with expertise in diagnostic research and a particular interest in methods for evaluating diagnostic accuracy in the absence of a perfect reference test. All authors participated in planning and writing the paper. IS generated the numerical examples. ND is the guarantor.
Funding: This work was supported by funding from the Canadian Institutes of Health Research (Grant number 89857).
Competing interests: All authors have completed the ICMJE uniform disclosure form at http://www.icmje.org/coi_disclosure.pdf and declare support from the Canadian Institutes of Health Research for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.