# Systematic reviews of evaluations of diagnostic and screening tests

BMJ 2001; 323 doi: https://doi.org/10.1136/bmj.323.7305.157 (Published 21 July 2001) Cite this as: BMJ 2001;323:157

## All rapid responses

*The BMJ* reserves the right to remove responses which are being wilfully misrepresented as published articles.

Further to our rapid response to Juni P et al. Systematic reviews in health care: assessing the quality of controlled clinical trials (BMJ 2001;323:42-6), we would also like to focus on how dramatic the effect of intra- and inter-observer variation on the sensitivity and specificity of a diagnostic or screening test can be [1]. We demonstrated, for example, that with observer agreement proportions of 0.33 and 0.71 for abnormal and normal tests, respectively, and an assumed sensitivity and false positive rate of 50% and 17%, respectively, the latter may vary from 0% to 100% and from 0% to 33%, respectively [1].

Systematic reviews including large studies and/or many small studies with many observers will tend to be biased towards average values, whereas systematic reviews including a large study with a single observer will tend to be biased towards the particular (high, average, or low) results of that study. Shouldn't the largely random effect arising from intra-observer variation, and the random and systematic effects arising from inter-observer variation [2], also be obligatorily discussed in systematic reviews of diagnostic and screening tests?

References

1. Bernardes J, Costa-Pereira A. How should we interpret RCTs based on unreproducible methods? http://www.bmj.com/cgi/content/full/322/7300/1457#EL9
2. Grant A. Principles for clinical evaluation of methods of perinatal monitoring. J Perinat Med 1984;12:227-31.

**Competing interests:** Bernardes and Costa-Pereira are involved in the development and validation of reproducible computerised diagnostic tests.

**05 September 2001**

EDITORS - Deeks, in the third of four articles on evaluations of diagnostic and screening tests,1 promoted the odds ratio as often being constant regardless of the diagnostic threshold. We agree with Deeks' statement that the choice of threshold varies according to the prevalence of the disease. However, the statement that the odds ratio is generally constant regardless of the diagnostic threshold can be misleading. The value of an odds ratio, like that of other measures of test performance (e.g. sensitivity, specificity, likelihood ratios), depends on prevalence.2 For example, a test with a diagnostic odds ratio of 10.00 is considered to be a very good test by current standards. It is easy to verify that this is generally true only in high-risk populations. A diagnostic odds ratio of 10.00 in a low-risk population may well represent a very weak association between the experimental test and the gold standard test. This is so because the observable range of values for an odds ratio increases as the prevalence of the disease decreases (i.e. moves away from 1/2).
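To make the clinical point concrete, here is a small illustrative sketch (ours, not the correspondent's): a fixed diagnostic odds ratio of about 10 translates into very different positive predictive values as prevalence falls. The sensitivity and specificity values are hypothetical, chosen only so that the DOR is approximately 10.

```python
# Sketch: a fixed diagnostic odds ratio (DOR) of ~10 can still mean weak
# clinical discrimination in a low-prevalence population.
# DOR = [sens/(1-sens)] / [(1-spec)/spec]

sens, spec = 0.75, 0.77  # hypothetical values; DOR ≈ 10
dor = (sens / (1 - sens)) / ((1 - spec) / spec)

def ppv(prevalence, sens, spec):
    """Positive predictive value via Bayes' theorem."""
    tp = prevalence * sens              # expected true positive fraction
    fp = (1 - prevalence) * (1 - spec)  # expected false positive fraction
    return tp / (tp + fp)

for prev in (0.50, 0.10, 0.01):
    print(f"prevalence={prev:.2f}  PPV={ppv(prev, sens, spec):.2f}")
```

At 50% prevalence a positive result is informative, but at 1% prevalence the same test leaves the post-test probability very low, which is one way to see why a "good" DOR is no guarantee of clinical usefulness.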

Nicole Jill-Marie Blackman
Senior Biostatistician
GlaxoSmithKline, 1250 South Collegeville Road, P.O. Box 5089, Collegeville, PA, USA
email: nicole_blackman-1@gsk.com

Competing interests: None

REFERENCES

1. Deeks JJ. Systematic reviews of evaluations of diagnostic and screening tests. BMJ 2001;323:157-62.
2. Kraemer HC. The robustness of common measures of 2 × 2 association to misclassification.

**Competing interests:** No competing interests

There seem to be errors in Figure 2 of BMJ 2001;323:157-62.

In the Sensitivity plot on the left, some of the point estimates appear to be incorrect. For example, we are told that the second Nasri study has a sensitivity of 1.00 (6/6), but the point estimate is at about 0.7 on the graph; there is no point estimate for Tavani; and the point estimate for Varner is at about 0.45 rather than 0.5.

Is the scale reversed at the bottom of the Specificity plot? For example, the numbers for the Botsis study (14/114) suggest a low specificity of 0.12, but the point estimate is at about 0.88 (1 - 0.12). On the other hand, we are told that the Perti study has a specificity of 96/131, or 0.73, but the point estimate is at about 0.27 (1 - 0.73).

**Competing interests:** No competing interests

An author's error has been made in the labelling of the right-hand graph in Figure 2, and publisher's errors have been made in the positioning of some of the points in both panels of this figure.

The numerators given in the right-hand panel of Figure 2 are the numbers of false positives (not true negatives as labelled). This column is also incorrectly labelled in the corresponding book chapter (Systematic Reviews in Health Care: Meta-analysis in Context, page 266).

The sensitivity point estimates for Nasri(b) and Taviani should be sensitivity=1.0. The 1-specificity point estimate for Goldstein should be at 1-(16/27)=0.41 and not at 0.6 as depicted. The positioning of all of these points in the corresponding book chapter is correct (Systematic Reviews in Health Care: Meta-analysis in Context, page 266).

I thank the keen-eyed readers who have pointed out these errors.

**Competing interests:** No competing interests

**27 July 2001**

Dear Sir,

We enjoyed Deeks’ excellent presentation of two very important topics in systematic reviews of diagnostic accuracy research.1 We should like to draw attention to three points on which Deeks has perhaps simplified matters too much.

First, in his ‘Framework for considering study quality and likelihood of bias’ he states that the reference diagnosis should be available for *all* (our italics) patients. However, when the reference test is dangerous and/or expensive it may be wise (and more ethical) to restrict the performance of the reference diagnosis to a random sample of patients who tested negatively on the experimental test. Since this approach does not affect the relative frequency of disease presence among patients with a negative experimental test result, the two approaches will produce identical diagnostic odds ratios (DOR). However, the sampling approach may become statistically infeasible when very small numbers of false negatives are to be expected. Investigators who use the sampling approach should report the sampling fraction to put their readers and reviewers in a position to recalculate the correct sensitivities, specificities and likelihood ratios (LR), because these, unlike the DOR, will not be the same for the two approaches.
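The invariance of the DOR under this sampling scheme, and the correction of sensitivity using the sampling fraction, can be sketched as follows; the 2 × 2 counts are invented purely for illustration.

```python
# Sketch: verifying the reference diagnosis in only a random fraction of
# test-negative patients leaves the diagnostic odds ratio (DOR) unchanged,
# but naive sensitivity must be corrected with the sampling fraction.
# All counts below are hypothetical.

def dor(tp, fp, fn, tn):
    """Diagnostic odds ratio from a 2x2 table."""
    return (tp * tn) / (fp * fn)

# Full verification: everyone receives the reference test.
tp, fp, fn, tn = 90, 100, 10, 800
full_dor = dor(tp, fp, fn, tn)
full_sens = tp / (tp + fn)              # 0.90

# Verify only a fraction f of test-negatives with the reference test.
f = 0.25
fn_s, tn_s = fn * f, tn * f             # expected sampled counts
sampled_dor = dor(tp, fp, fn_s, tn_s)   # identical: f cancels in the ratio
naive_sens = tp / (tp + fn_s)           # biased upwards
corrected_sens = tp / (tp + fn_s / f)   # scale sampled negatives back by 1/f

print(full_dor, sampled_dor)            # both 72.0
print(full_sens, naive_sens, corrected_sens)
```

Because the sampling fraction multiplies both cells of the test-negative row, it cancels in the DOR but not in the row-wise proportions, which is exactly why the fraction must be reported.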

Consider an even more extreme example to illustrate the futility of what might be called the “filling-the-fourfold-table-reflex” in diagnostic accuracy research. Consider a study on an experimental test that claims to give clinicians more certainty in situations where they have only a few indications that disease may be present. However, let’s assume that the indications are not strong enough to justify the performance of truly invasive and unpleasant tests. Without the new experimental test these patients would be sent home. The value of the new test lies in its ability to identify - in a relatively non-invasive and inexpensive fashion - those patients who have the disease and would benefit from treatment. In this scenario, the analysis of only those patients who test positively on the experimental test (two cells filled of the fourfold table) suffices to learn about its clinical usefulness.

Second, Deeks ends his explanation of the application of the likelihood ratio by stating that: “Knowledge of other characteristics of a particular patient that either increase or decrease their prior probability of endometrial cancer can be incorporated into the calculation by adjusting the pretest probability accordingly.” However, this assumes constancy of likelihood ratios (an assumption that seems difficult to eradicate), which should not be assumed because it is very unlikely and usually incorrect. In clinical practice, the knowledge of other patient characteristics (known by performing other ‘tests’) will influence the magnitude of the LRs of subsequent tests. This is so because when a chain of diagnostic tests (history taking, physical examination, lab tests, imaging) is performed on a patient, certain results from his clinical history make certain lab results more (or less) likely, which in turn influence the chances of finding certain imaging results. In other words, the results of the component tests are not mutually independent. For example, on average, women with a positive test on ultrasound (thickened endometrium) are more likely to test positively on hysteroscopy, in which the endometrial thickness is also assessed, albeit in a different manner. The theoretical solution to this problem is the calculation of LRs that are conditional on the results of the preceding tests in the diagnostic test chain. In practice, this is usually not feasible for lack of sufficient data, and most investigators use logistic regression models to account for all these dependencies. These models, however, yield DORs, not LRs. It is partly this complexity that hampers the application of simple diagnostic accuracy studies to clinical practice. We support Deeks where he calls for more clinically relevant diagnostic studies.
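The mechanics of the pretest-to-posttest calculation Deeks describes (convert probability to odds, multiply by the LR, convert back) can be sketched as below. The LR and pretest probabilities are illustrative, and the sketch deliberately treats the LR as fixed, which is exactly the assumption questioned above.

```python
# Sketch: updating a pretest probability with a likelihood ratio (LR).
# Numbers are illustrative; note the LR is held constant here, the very
# assumption the letter argues is usually incorrect in test chains.

def posttest_probability(pretest_prob, lr):
    """pretest probability -> odds, multiply by LR, convert back."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1 + posttest_odds)

# A hypothetical positive LR of 9 applied to two different pretest risks:
for pretest in (0.10, 0.30):
    post = posttest_probability(pretest, 9)
    print(f"pretest={pretest:.2f} -> posttest={post:.2f}")
```

A pretest probability of 10% rises to about 50%, and 30% to about 79%, under the same LR of 9, showing how strongly the result depends on the pretest estimate even before any LR non-constancy is considered.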

Third, we agree with Deeks that DORs and summary receiver operating characteristic curves may be difficult to interpret. Deeks gives the example of a DOR of 29 and explains that it could belong to a sensitivity of 0.95 and a specificity of 0.60, or vice versa. This ‘vice versa-ness’, according to Deeks, limits the DOR’s clinical application. However, given the ranges of sensitivities and specificities in the material from his case study (0.8-1.0 and 0.27-0.87, respectively, excluding one study reporting a sensitivity of 0.50 based on two patients), the choice between these two combinations should be in favour of the former. More generally, in instances where a diagnostic test allows the clinician to (technically) adjust the cutoff at a point along the (summary) receiver operating characteristic curve, such a curve may be quite useful in selecting the most (cost-)effective cutoff.

Finally, for those who like to faithfully reproduce Deeks’ graphs and summaries: in figure 2, the numerators in the second column of the right-hand panel represent the number of false positives, not the true negatives. This error may have large numerical consequences if it goes undetected.

Gerben ter Riet, MD, clinical epidemiologist 1,2

Alphons G.H. Kessels, MD, MSc, medical statistician 2

Lucas M. Bachmann, MD, research fellow 3

1 Dept Epidemiology, Maastricht University, Maastricht, The Netherlands

2 Dept Clinical Epidemiology & Medical Technology Assessment, Maastricht University Hospital, Maastricht, The Netherlands.

3 Horten Centre, University of Zurich, Zurich, Switzerland

Competing interests: none

1. Deeks JJ. Systematic reviews of evaluations of diagnostic and screening tests. BMJ 2001;323: 157-62.

**Competing interests:** No competing interests

**26 July 2001**

Dear Sir- The "ROC" curves in fig 3 are hard to understand. The units of the ordinates and abscissas in the centre and right squares do not seem to correspond to likelihood ratios or diagnostic odds. Anyway, the scatter of the supposed false positive values looks much the same in each section. One would expect this, as diagnostic odds are derived from the LR+/LR- ratios, which in turn come from sensitivity and specificity estimates. No extra information has been added. The apparent reduction in variance must come from the forced choice between sensitivity or specificity as determinants, as the authors hint. Using one "cut-off" point for diagnosis (here it is 5 mm) means that only one point on the ROC curve is addressed and estimated. It cannot give any idea of the whole curve (as shown in fig 1), its discriminatory capacity, or its variance.

It is a useful reminder of the fact that test information is not infallible and that things like sensitivity and specificity are variables with errors attached.

GH Hall MD
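Hall's point that the diagnostic odds add no information beyond the likelihood ratios rests on the algebraic identity DOR = LR+ / LR-, which a quick sketch (with illustrative values) confirms:

```python
# Sketch: the diagnostic odds ratio equals LR+ / LR-, so it carries no
# information beyond the sensitivity and specificity estimates themselves.
# Illustrative values:

sens, spec = 0.80, 0.70

lr_pos = sens / (1 - spec)        # likelihood ratio of a positive test
lr_neg = (1 - sens) / spec        # likelihood ratio of a negative test
dor_from_lrs = lr_pos / lr_neg
dor_direct = (sens / (1 - sens)) / ((1 - spec) / spec)

print(round(dor_from_lrs, 3), round(dor_direct, 3))  # both 9.333
```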

**Competing interests:** No competing interests

## Corrected Correction

This is hardly a Rapid Response, being 8 years overdue. The correction, prompted by readers and the author, leaves the impression that the specificity is given by (false positives)/(true negatives + false positives), whereas it was correctly stated in the original article as (true negatives)/(true negatives + false positives). The problem arises because the points to be plotted have been calculated from the false positive fraction.

The correction should really be corrected to make it clear that it is (1 - specificity) that is being plotted. What is really required is a new figure 2.

**14 March 2009**

**Competing interests:** None declared