Verification problems in diagnostic accuracy studies: consequences and solutions
BMJ 2011; 343 doi: https://doi.org/10.1136/bmj.d4770 (Published 02 August 2011) Cite this as: BMJ 2011;343:d4770
- Joris A H de Groot, clinical epidemiologist1,
- Patrick M M Bossuyt, professor of clinical epidemiology2,
- Johannes B Reitsma, associate professor of clinical epidemiology2,
- Anne W S Rutjes, senior researcher3,
- Nandini Dendukuri, assistant professor clinical epidemiology and biostatistics4,
- Kristel J M Janssen, clinical epidemiologist1,
- Karel G M Moons, professor of clinical epidemiology1
- 1Julius Center for Health Sciences and Primary care, UMC Utrecht, PO Box 85500, 3508GA Utrecht, Netherlands
- 2Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Center Amsterdam, 1100 DE Amsterdam, Netherlands
- 3Division of Clinical Epidemiology and Biostatistics, Institute of Social and Preventive Medicine-University of Bern, 3012 Bern, Switzerland
- 4Royal Victoria Hospital, Quebec, Canada H3A 1A1
- Correspondence to: J A H de Groot
- Accepted 2 May 2011
The accuracy of a diagnostic test or combination of tests (such as in a diagnostic model) is the ability to correctly identify patients with or without the target disease. In studies of diagnostic accuracy, the results of the test or model under study are verified by comparing them with results of a reference standard, applied to the same patients, to verify disease status (see first panel in figure).1 Measures such as predictive values, post-test probabilities, receiver operating characteristic (ROC) curves, sensitivity, specificity, likelihood ratios, and odds ratios express how well the results of an index test agree with the outcome of the reference standard.2 Biased and exaggerated estimates of diagnostic accuracy can lead to inefficiencies in diagnostic testing in practice, unnecessary costs, and physicians making incorrect treatment decisions.
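As a minimal sketch of how these measures relate to one another, the following computes them from a single 2×2 table of index test results against the reference standard. The class name, function name, and cell counts are illustrative assumptions and are not taken from any study discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cross-classification of index test result against reference standard."""
    tp: float  # index test positive, reference standard positive
    fp: float  # index test positive, reference standard negative
    fn: float  # index test negative, reference standard positive
    tn: float  # index test negative, reference standard negative


def accuracy_measures(t: TwoByTwo) -> dict:
    """Return common accuracy measures for a fully verified 2x2 table."""
    sens = t.tp / (t.tp + t.fn)
    spec = t.tn / (t.tn + t.fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": t.tp / (t.tp + t.fp),  # positive predictive value
        "npv": t.tn / (t.tn + t.fn),  # negative predictive value
        "lr_positive": sens / (1 - spec),
        "lr_negative": (1 - sens) / spec,
        "diagnostic_odds_ratio": (t.tp * t.tn) / (t.fp * t.fn),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example = accuracy_measures(TwoByTwo(tp=90, fp=30, fn=10, tn=170))
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

All of these measures presuppose that every cell of the table has been filled by the same reference standard, which is the assumption that the verification problems described in this paper violate.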
The reference standard ideally provides error-free classification of the disease outcome presence or absence. In some cases, it is not possible to verify the definitive presence or absence of disease in all patients with the (single) reference standard, which may result in bias. In this paper, we describe the most important types of disease verification problems using examples from published diagnostic accuracy studies. We also propose solutions to alleviate the associated biases.
Often not all study subjects who undergo the index test receive the reference standard, leading to missing data on disease outcome (see middle panel in figure). The bias associated with such situations of partial verification is known as partial verification bias, work-up bias, or referral bias.3 4 5
Clinical examples of partial verification
Various mechanisms can lead to partial verification (see examples in table 1).
When the condition of interest produces lesions that need biopsy and subsequent histological verification (as in many cancers), it is impossible to verify negative index test results (“where to biopsy?”). An example is F-18 fluorodeoxyglucose positron emission tomography (FDG-PET) to detect possible distant metastases before planning curative surgery in patients with carcinoma of the oesophagus: only the hotspots detected by PET can be sampled by biopsy and verified histologically.6
Ethical reasons can also play a role in withholding a reference standard. Angiography is still considered the best method for detecting pulmonary embolisms, but, because of its invasiveness and risk of serious complications, it is now considered unethical to perform this reference standard in low risk patients, such as those with a low clinical probability and negative D-dimer result.10
Sometimes the reference standard may be temporarily unavailable, or patients and doctors may decide to refrain from disease verification. In a study evaluating the accuracy of digital rectal examination and prostate specific antigen (PSA) for the early detection of prostate cancer, 145 out of 1000 men fulfilled the criterion for verification by the reference standard (transrectal ultrasound combined with biopsy). However, 54 of these men did not undergo the reference standard, for unknown reasons.7 In another study the accuracy of dobutamine-atropine stress echocardiography for the diagnosis of coronary artery disease was assessed, with coronary angiography as the reference.8 Only a small proportion of patients received this reference standard because the clinicians’ decision to refer to angiography depended on the patient’s history and test results.
Potential for bias
The above examples show that partial disease verification, and thus missing disease outcome status in some of the patients, is often not completely at random or non-selective. It is usually based on results of the index test under study or other observed patient variables or test results. If so, the missing outcome status is selectively missing, as the reason for disease verification is associated with other information. For example, patients with a positive index test result or with a high clinical suspicion based on other variables (that is, high probability before the index test) are often more likely to be verified by the reference test than patients with negative test results or a low probability before the index test. Simply leaving such selectively unverified patients out of the analysis will leave a non-random (selective) part of the original group for analysis and thus generate biased estimates of the accuracy of the index test under study.
The direction and size of this bias will depend on how selective the reason for non-verification is, the number of patients whose results are not verified, and the ratio between the number of patients with positive and negative index test results that remain unverified.5 The bias always occurs in the estimates of the sensitivity and specificity of the diagnostic index test or model under study, and often also in the predictive value. When the reason for partially missing outcomes is based only on the results of the index test, the predictive values of this index test will indeed be unbiased (see below). If, however, the reason for referral for reference testing is not only due to the index test results but also to other patient information, the predictive values of the index test will be affected.15
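The pattern described above can be illustrated with expected counts. The sketch below assumes a hypothetical, fully verified cohort and then keeps only a fraction of patients, where the fraction depends on the index test result alone. The cell counts and the verification fractions (80% of index test positives, 20% of negatives) are assumptions made purely for illustration.

```python
def measures(tp, fp, fn, tn):
    """Sensitivity, specificity, and predictive values from 2x2 cell counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical cohort as it would look if everyone had been verified.
full = {"tp": 90, "fp": 30, "fn": 10, "tn": 170}

# Assumed verification fractions that depend on the index test result only.
P_VERIFY_POSITIVE = 0.8
P_VERIFY_NEGATIVE = 0.2

# Expected cell counts among the verified patients only.
verified = {
    "tp": full["tp"] * P_VERIFY_POSITIVE,
    "fp": full["fp"] * P_VERIFY_POSITIVE,
    "fn": full["fn"] * P_VERIFY_NEGATIVE,
    "tn": full["tn"] * P_VERIFY_NEGATIVE,
}

truth = measures(**full)
naive = measures(**verified)

for name in truth:
    print(f"{name:12s} all patients={truth[name]:.3f}  verified only={naive[name]:.3f}")
```

Under these assumptions the analysis restricted to verified patients overestimates sensitivity and underestimates specificity, whereas both predictive values are unchanged, because within each index test result the verified patients are a random subset. If verification also depended on other patient information, this equality of the predictive values would no longer be guaranteed.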
Corrections for partial verification bias
One of the early methods to correct for partial verification bias was developed by Begg and Greenes.16 Briefly, this method uses only the pattern of disease and non-disease verified by the reference standard among the patients with a positive or negative result of the (single) index test under study. This pattern is then used to calculate the expected number of diseased and non-diseased among the non-verified patients with a positive or negative index test result to obtain an inflated 2×2 table as if all patients were verified by the reference standard. This correction method assumes that the reason for referral to the reference test is only due to the result of the index test under study. Hence, conditional on these index test results, the decision to verify is in fact a random process. The method can also be extended to more than one test result, but this requires exact knowledge of the reasons and patterns behind the partial verification.16 17
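A minimal sketch of this type of correction for a single dichotomous index test is given below. It follows the description in the preceding paragraph rather than reproducing the original publication: the function name, argument names, and counts are hypothetical, and it relies on the stated assumption that verification depends on the index test result only.

```python
def begg_greenes(tp_v, fp_v, fn_v, tn_v, unverified_pos, unverified_neg):
    """Inflate a partially verified 2x2 table to the full cohort.

    tp_v, fp_v, fn_v, tn_v: cell counts among verified patients.
    unverified_pos, unverified_neg: numbers of non-verified patients with
    a positive or negative index test result.
    Assumes that, given the index test result, verification is random.
    """
    n_pos = tp_v + fp_v + unverified_pos  # all index test positives
    n_neg = fn_v + tn_v + unverified_neg  # all index test negatives

    # Disease pattern observed among verified patients, per index result.
    p_disease_given_pos = tp_v / (tp_v + fp_v)
    p_disease_given_neg = fn_v / (fn_v + tn_v)

    # Apply that pattern to everyone with the same index test result.
    tp = n_pos * p_disease_given_pos
    fp = n_pos - tp
    fn = n_neg * p_disease_given_neg
    tn = n_neg - fn

    return {
        "table": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical study: 96 of 120 index positives and 36 of 180 index
# negatives were verified by the reference standard.
corrected = begg_greenes(tp_v=72, fp_v=24, fn_v=2, tn_v=34,
                         unverified_pos=24, unverified_neg=144)
print(corrected)
```

In this example the uncorrected sensitivity among verified patients would be 72/74, whereas the corrected estimate is 0.90, with a corrected specificity of 0.85. The sketch returns point estimates only; confidence intervals need to account for the extra uncertainty introduced by the non-verified patients.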
More recently, multiple imputation methods have been proposed to correct for partial verification problems.18 19 Multiple imputation can be viewed as a “statistical” workout of the intuitive “diagnostic reasoning” of the clinician. Just as a clinician in practice decides whether to refer a patient for disease verification by a (more invasive, burdensome, or costly) reference standard based on all available patient information, multiple imputation techniques also use all available information of a patient—and that of similar patients—to estimate the most likely value of the missing reference test result in non-verified patients.
Imputation methods comprise two phases—an imputation phase where each missing reference test result is estimated and imputed from all available patient information, and an analysis phase where accuracy estimates of the diagnostic index test or model are computed by standard procedures based on the now completed dataset. Several imputation variants are available, ranging from single imputation of missing reference test values to multiple imputation.20 21 Instead of filling in a single value for each missing value, as with single imputation, multiple imputation procedures replace each missing value with a set of plausible values to represent the uncertainty about the imputed value. These multiple imputed datasets are then analysed, one by one, again by standard procedures. The results from these analyses are combined to produce accuracy estimates of the diagnostic index test(s) or model and confidence intervals that properly reflect the uncertainty due to missing values.20 21
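The two phases can be sketched as follows. This is a deliberately simplified illustration, not the procedure used in the cited papers: the imputation model conditions on the index test result only and draws disease probabilities from a Beta distribution with a flat prior, whereas applied work would normally use a regression-based imputation model with all available patient information. The cohort, the seed, the number of imputations, and all function names are assumptions. The pooling step uses Rubin's rules for the point estimate and its total variance.

```python
import random
from statistics import mean, variance


def make_cohort():
    """Hypothetical cohort of (index_positive, disease); None = not verified."""
    rows = [(True, True)] * 72 + [(True, False)] * 24 + [(True, None)] * 24
    rows += [(False, True)] * 2 + [(False, False)] * 34 + [(False, None)] * 144
    return rows


def impute_once(rows, rng):
    """Imputation phase: fill each missing disease status with one draw."""
    p_disease = {}
    for result in (True, False):
        verified = [d for t, d in rows if t == result and d is not None]
        diseased = sum(verified)
        # Draw the probability itself so that its uncertainty is propagated.
        p_disease[result] = rng.betavariate(1 + diseased,
                                            1 + len(verified) - diseased)
    completed = []
    for t, d in rows:
        if d is None:
            d = rng.random() < p_disease[t]
        completed.append((t, d))
    return completed


def sensitivity_with_variance(completed):
    """Analysis phase: standard estimate on one completed dataset."""
    tp = sum(1 for t, d in completed if t and d)
    fn = sum(1 for t, d in completed if not t and d)
    sens = tp / (tp + fn)
    return sens, sens * (1 - sens) / (tp + fn)


def pool_rubin(estimates, variances):
    """Combine results across imputed datasets using Rubin's rules."""
    m = len(estimates)
    within = mean(variances)
    between = variance(estimates)
    return mean(estimates), within + (1 + 1 / m) * between


rng = random.Random(2011)  # fixed seed so the sketch is reproducible
results = [sensitivity_with_variance(impute_once(make_cohort(), rng))
           for _ in range(20)]
pooled_sens, pooled_var = pool_rubin([r[0] for r in results],
                                     [r[1] for r in results])
print(f"pooled sensitivity {pooled_sens:.3f}, standard error {pooled_var ** 0.5:.3f}")
```

The between-imputation component of the variance is what distinguishes multiple from single imputation: it widens the confidence interval to reflect that the imputed disease statuses are estimates rather than observations.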
For optimal application of multiple imputation techniques to address partial verification, it is important for researchers to collect as much detailed data as possible on study subjects that could potentially drive the (selective) referral for reference testing. The performance of the multiple imputation or other correction methods will improve with more and better information that may be involved in disease verification decisions. The flexibility of the multiple imputation method enables the incorporation of multiple pieces of observed patient information, not only the results of the index test under study, thereby increasing the likelihood of correctly imputing missing reference test values in patients in whom the disease status was selectively not verified.17 18 19
The discussed mathematical methods to correct for selectively missing verification, and thus partial verification bias, make use of observed (patient) information or variables. They assume that the reasons for missing verification depend on the observed information only. Clearly, this assumption cannot be tested with the data at hand, since non-observed information is, by definition, not available. If one expects selectively missing reference test results as a result of unobserved information, there are methods to perform additional (sensitivity) analysis to quantify to what extent the diagnostic accuracy estimates of the index test change under these situations.22 23
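One simple way to explore this, shown here only as an illustration and not as the specific methods of the cited references, is to vary the assumed disease frequency among non-verified patients and observe how the accuracy estimate responds. The counts, the scenario multipliers, and the function name below are hypothetical.

```python
# Hypothetical verified counts and numbers of non-verified patients.
TP_VERIFIED, FP_VERIFIED = 72, 24
FN_VERIFIED, TN_VERIFIED = 2, 34
UNVERIFIED_POSITIVE, UNVERIFIED_NEGATIVE = 24, 144


def sensitivity_given(p_disease_unverified_negative):
    """Sensitivity under an assumed disease frequency among
    non-verified patients with a negative index test result."""
    # Index test positives: assume non-verified resemble verified patients.
    ppv = TP_VERIFIED / (TP_VERIFIED + FP_VERIFIED)
    tp = TP_VERIFIED + UNVERIFIED_POSITIVE * ppv
    # Index test negatives: use the assumed frequency for the non-verified.
    fn = FN_VERIFIED + UNVERIFIED_NEGATIVE * p_disease_unverified_negative
    return tp / (tp + fn)


# Disease frequency observed among verified index test negatives.
observed_rate = FN_VERIFIED / (FN_VERIFIED + TN_VERIFIED)

# Scenarios from "non-verified are healthier" to "non-verified are sicker".
scenarios = [0.5 * observed_rate, observed_rate,
             2 * observed_rate, 3 * observed_rate]
curve = [(p, sensitivity_given(p)) for p in scenarios]

for p, sens in curve:
    print(f"assumed disease frequency {p:.3f} -> sensitivity {sens:.3f}")
```

The second scenario reproduces the estimate obtained when missingness is assumed to depend on the index test result only. The spread across the remaining scenarios indicates how strongly the conclusion rests on that untestable assumption.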
Another common approach in diagnostic accuracy studies is to use an alternative, second best, reference test in those subjects for whom the first, preferred reference test cannot or will not be used (see third panel in figure). Although this seems a clinically appealing and ethical approach, bias arises when the results of the two reference tests are treated as interchangeable. Both reference tests are, almost by definition, of different quality in terms of classification of the target disease or may even define the target disease differently.24 25 Hence, simply combining all disease outcome data in a single analysis (table 2), as if both reference tests are yielding the same disease outcomes, does not reflect the “true” pattern of disease presence and absence. Such an estimation of disease prevalence differs from what one would have obtained if all subjects had undergone the preferred reference standard. Consequently, all estimated measures of the accuracy of the diagnostic index test or model will be biased. This is called differential verification bias.3 4
When evaluating a new marker for acute appendicitis, histopathology of the appendix is the preferred reference test, but clinical follow-up is sometimes used as an alternative (for example, if histopathology is considered too invasive). Compared with histopathology, clinical follow-up is likely to have a higher implicit threshold to detect appendicitis, so it will label more patients as non-diseased (that is, no appendicitis). Thus, these two reference tests define the target condition in a different way. Histopathology might seem the preferred reference test because it reveals even the smallest number of inflamed cells, but one could argue that the more relevant information for clinical practice is not whether the patient has inflamed cells but whether the patient will recover without intervention. This would make clinical follow-up the preferred reference, even though it would be unethical to adopt for all subjects and to withhold surgery. This does mean that accuracy estimates from a combination of histopathology and follow-up will differ systematically from what one would have obtained if all index test results had been verified by either clinical follow-up or histology.
Because accuracy estimates of the new index test ignore the use of different reference tests, they are also difficult to interpret. In situations of differential verifications such as this, the results should be corrected and reported separately for each reference standard to provide informative and unbiased measures of accuracy of the diagnostic index test or model. We illustrate this with a clinical example from the recent literature.
In a recent study11 the elbow extension test (EET) was examined for its accuracy in ruling out elbow fractures. The preferred reference test was radiography. For unstated reasons (costs, efficiency, or minimising radiation exposure), radiography was planned in patients with a positive EET result whereas the patients with a negative EET received a structured follow-up assessment by telephone after 7–10 days to verify whether elbow fracture was absent (the alternative reference test). Only patients who met any of the pre-specified recall criteria were asked to return to the emergency department for radiography. The rest were considered not to have a clinically significant elbow fracture. The resulting data are shown in table 3.
The authors reported overall estimates of accuracy of the EET, ignoring the use of different reference standards (table 4, first row). Though both radiography and structured follow-up are useful verification methods, their results are not necessarily interchangeable.
The availability of 181 patients with a negative EET who were, after all, evaluated by radiography (“protocol violations” in table 3) enables us to apply the above-mentioned correction methods for partial verification, under the assumption that, conditional on the index test result, the decision to verify is a random process.
The corrected values of sensitivity and specificity clearly show the consequences of differential verification (table 4, second row). The estimates of EET accuracy differ depending on whether verification bias is simply ignored or adjusted for. The negative predictive value with respect to radiography alone (the measure of primary interest when ruling out elbow fractures) was lower than the value reported by the authors and fell below the desired value of ≥97%. This clearly shows that two reference tests should not be viewed as one.
(For a more detailed discussion of this example and the possibilities to correct for differential verification, see de Groot et al, 2011.26)
Further corrections for differential verification bias
Recently, a Bayesian method was proposed for simultaneously adjusting for differential verification bias and for the fact that these multiple reference tests were imperfect.26 The method produces accuracy measures both with respect to the latent disease status and with respect to the use of different reference tests. The former can be considered as a more general measure of performance of the index test with respect to a theoretically defined target condition or disease status since none of the reference tests used is considered perfect. However, the index test’s accuracy measures for each of the reference standards may be considered of greater clinical relevance, as these reflect the accuracy against the reference tests that are commonly also performed in daily practice, and on which patient management decisions will often be based.
In diagnostic accuracy studies, all efforts should be made to verify as many test results as possible, preferably all, with the optimal reference test to avoid bias. In practice, the burden on patients, costs, or other reasons often prevent this from happening (table 1).27
If test outcome is verified by the reference test for only some of the patients, which is usually selective disease verification based on other observed patient information, we advise the use of the mathematical correction methods described above.16 17 19
There is insufficient knowledge to make general statements about what proportion of missing reference standard results might be acceptable and at which point correction methods will become unreliable. Following various statistical guidelines,18 19 20 21 28 29 we recommend the use of correction methods even with small rates of missing verification data. Even small proportions of missing outcomes may yield biased accuracy estimates of the index test(s) or model under study if the non-verified sample is highly selective.
What upper limits of missing reference test data can still be corrected for is even harder to say.4 Recently Janssen et al showed that, even for large amounts of missing data, imputation leads to less biased results than simply ignoring the (selectively) non-measured subjects.28 The authors warn that this possibility for imputation depends on how selective or different the observed and non-observed subjects are and how many results remain to build “good enough” imputation models. In any case, authors applying correction or imputation methods for addressing partial verification should provide insight into both issues: how many subjects had missing reference test values, and how different the verified and non-verified patients were, as judged by comparing both groups on their observed characteristics.29 30
If the preferred reference test is not possible and thus missing in complete subgroups, applying a different, usually inferior, reference test will obviously produce different information about the disease status. In such cases, the results should be reported separately for each reference test to provide more clinically informative and unbiased measures of diagnostic accuracy.3 If in these situations one still wants to quantify the accuracy of the diagnostic index test or model with regard to the same underlying target condition, one should also correct for possible imperfections of the applied reference tests.26
In studies of diagnostic accuracy, ideally all patients undergoing the index test are verified by the reference standard.
This is not always possible, and incomplete or improper disease verification is one of the major sources of bias in diagnostic accuracy studies.
Partial verification bias occurs when not all patients are verified by the reference standard; instead, disease verification is related to other, previous (index) test results or patient characteristics. Multiple imputation methods can be used to correct for partial verification bias.
An alternative reference test may be used for those cases where verification with the preferred reference test is not possible. This can result in differential verification bias if the results of both reference tests are treated as equal and interchangeable, when they are really of different quality or define the target condition differently. Instead, the estimated accuracy of the diagnostic index test should be reported separately for each reference test.
Contributors: KGMM is guarantor for the article and heads a research team aimed at improving methods for quantification of the diagnostic and prognostic value of medical tests, biomarkers, and other devices. KJMJ and JAHdG are clinical epidemiologists in his team. PMMB and JBR lead the Biomarker and Test Evaluation (BiTE) programme to develop and appraise methods for evaluating medical tests and biomarkers, and spearheaded the STARD initiative to improve the reporting of diagnostic accuracy studies. AWSR’s PhD thesis was on sources of bias and variation in diagnostic accuracy studies and she currently works to update QUADAS, a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. ND is engaged in research in the area of methods for diagnostic studies.
Funding: We acknowledge the support of the Netherlands Organization for Scientific Research (projects 9120.8004 and 918.10.615).
Competing interests: All authors have completed the ICMJE unified disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare that they have no financial or non-financial interests that may be relevant to the submitted work.
In diagnostic accuracy studies the ability of a test or combination of tests to correctly identify patients with or without the target condition is verified by applying a reference standard in all patients who have undergone the index test. Incomplete or improper disease verification is one of the major sources of bias in diagnostic accuracy studies. This study describes the various types of disease verification problems, including empirical examples, and proposes solutions to alleviate the associated biases.
Provenance and peer review: Not commissioned; externally peer reviewed.