How to read a paper: Papers that report diagnostic or screening tests
BMJ 1997; 315 doi: http://dx.doi.org/10.1136/bmj.315.7107.540 (Published 30 August 1997) Cite this as: BMJ 1997;315:540
- Trisha Greenhalgh, senior lecturer a
- a Unit for Evidence-Based Practice and Policy Department of Primary Care and Population Sciences University College London Medical School/Royal Free Hospital School of Medicine Whittington Hospital London N19 5NF
If you are new to the concept of validating diagnostic tests, the following example may help you. Ten men are awaiting trial for murder. Only three of them actually committed a murder; the seven others are innocent of any crime. A jury hears each case and finds six of the men guilty of murder. Two of the convicted are true murderers. Four men are wrongly imprisoned. One murderer walks free.
This information can be expressed in what is known as a two by two table (table 1). Note that the “truth” (whether or not the men really committed a murder) is expressed along the horizontal title row, whereas the jury's verdict (which may or may not reflect the truth) is expressed down the vertical column.
These figures, if they are typical, reflect several features of this particular jury:
the jury correctly identifies two in every three true murderers;
it correctly acquits three out of every seven innocent people;
if this jury has found a person guilty, there is still only a one in three chance that they are actually a murderer;
if this jury found a person innocent, he or she has a three in four chance of actually being innocent; and
in five cases out of every 10 the jury gets it right.
These five features constitute, respectively, the sensitivity, specificity, positive predictive value, negative predictive value, and accuracy of this jury's performance. The rest of this article considers these five features applied to diagnostic (or screening) tests when compared with a “true” diagnosis or gold standard. A sixth feature—the likelihood ratio—is introduced at the end of the article.
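The arithmetic behind these five features can be sketched in a few lines of Python. The counts come straight from the jury example; the function and variable names are illustrative assumptions rather than anything defined in this article.

```python
def two_by_two_features(tp, fp, fn, tn):
    """Return the five features of a test from the cells of a two by two table.

    tp: test positive, truly positive   fp: test positive, truly negative
    fn: test negative, truly positive   tn: test negative, truly negative
    """
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
        "accuracy": (tp + tn) / total,
    }


# Jury example: 2 murderers convicted, 4 innocent men convicted,
# 1 murderer acquitted, 3 innocent men acquitted.
jury = two_by_two_features(tp=2, fp=4, fn=1, tn=3)

for name, value in jury.items():
    print(f"{name}: {value:.2f}")
```

The printed values (0.67, 0.43, 0.33, 0.75, and 0.50) correspond to the two in three, three in seven, one in three, three in four, and five in 10 figures listed above.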
Validating tests against a gold standard
Our window cleaner told me that he had been feeling thirsty recently and had asked his general practitioner to be tested for diabetes, which runs in his family. The nurse in his surgery had asked him to produce a urine specimen and dipped a stick in it. The stick stayed green, which meant, apparently, that there was no sugar in his urine. This, the nurse had said, meant that he did not have diabetes.
New tests should be validated by comparison against an established gold standard in an appropriate spectrum of subjects
Diagnostic tests are seldom 100% accurate (false positives and false negatives will occur)
A test is valid if it detects most people with the target disorder (high sensitivity) and excludes most people without the disorder (high specificity), and if a positive test usually indicates that the disorder is present (high positive predictive value)
The best measure of the usefulness of a test is probably the likelihood ratio—how much more likely a positive test is to be found in someone with, as opposed to without, the disorder
I had trouble explaining that the result did not necessarily mean this, any more than a guilty verdict necessarily makes someone a murderer. The definition of diabetes, according to the World Health Organisation, is a blood glucose level above 8 mmol/l in the fasting state, or above 11 mmol/l two hours after a 100 g oral glucose load, on one occasion if the patient has symptoms and on two occasions if he or she does not.1 These stringent criteria can be termed the gold standard for diagnosing diabetes (although purists have challenged this notion2).
The dipstick test, however, has some distinct practical advantages over the full blown glucose tolerance test. To assess objectively just how useful the dipstick test for diabetes is, we would need to select a sample of people (say 100) and do two tests on each of them: the urine test (screening test) and a standard glucose tolerance test (gold standard). We could then see, for each person, whether the result of the screening test matched the gold standard (see table 2). Such an exercise is known as a validation study.
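The bookkeeping of such a validation study amounts to sorting each person's pair of results into one of four cells. A minimal sketch follows, assuming each person is recorded as a pair of yes/no results; the function name and the six example pairs are invented for illustration and are not real data.

```python
from collections import Counter


def tally_two_by_two(paired_results):
    """Count the four cells of a two by two table.

    paired_results: iterable of (screening_positive, gold_standard_positive)
    pairs of booleans, one pair per person in the validation study.
    """
    counts = Counter()
    for screening_positive, gold_positive in paired_results:
        if screening_positive and gold_positive:
            counts["true_positive"] += 1
        elif screening_positive and not gold_positive:
            counts["false_positive"] += 1
        elif not screening_positive and gold_positive:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts


# Invented results for six people, for illustration only (not real data).
example_pairs = [
    (True, True),
    (False, True),
    (False, False),
    (False, False),
    (True, False),
    (False, False),
]

cells = tally_two_by_two(example_pairs)
print(dict(cells))
```

Every person must contribute a result for both tests; anyone missing the gold standard result cannot be placed in the table at all.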
The validity of urine testing for glucose in diagnosing diabetes has been looked at by Andersson and colleagues,3 whose data I have adapted for use (expressed as a proportion of 1000 subjects tested) in table 3.
From the calculations of important features of the urine dipstick test for diabetes (box), you can see why I did not share the window cleaner's assurance that he did not have diabetes. A positive urine glucose test is only 22% sensitive, which means that the test misses nearly four fifths of people who have diabetes. In the presence of classical symptoms and a family history, the window cleaner's baseline chance (pretest likelihood) of having the condition is pretty high, and it is reduced to only about four fifths of this (the negative likelihood ratio, 0.78; see below) after a single negative urine test. This man clearly needs to undergo a more definitive test.
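The likelihood ratios themselves follow directly from sensitivity and specificity. The sketch below uses the 22% sensitivity quoted above; the specificity of 0.99 is an assumed value chosen only for illustration, since the text does not state it, and the function name is likewise hypothetical.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR) from sensitivity and specificity."""
    # How much more often a positive result occurs with the disorder than without.
    positive_lr = sensitivity / (1 - specificity)
    # How much more often a negative result occurs with the disorder than without.
    negative_lr = (1 - sensitivity) / specificity
    return positive_lr, negative_lr


SENSITIVITY = 0.22  # dipstick sensitivity quoted in the text
SPECIFICITY = 0.99  # assumed value for illustration; not given in the text

positive_lr, negative_lr = likelihood_ratios(SENSITIVITY, SPECIFICITY)
print(f"positive likelihood ratio: {positive_lr:.1f}")
print(f"negative likelihood ratio: {negative_lr:.2f}")
```

With any specificity close to 1, the negative likelihood ratio comes out close to the 0.78 quoted above, because it is dominated by the 78% of people with diabetes whom the dipstick misses.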
Does the paper validate the test?
Question 1: Is this test potentially relevant to my practice?
Sackett and colleagues call this the utility of the test.6 Even if this test were 100% valid, accurate, and reliable, would it help me? Would it identify a treatable disorder? If so, would I use it in preference to the test I use now? Could I (or my patients or the taxpayer) afford it? Would my patients consent to it? Would it change the probabilities for competing diagnoses sufficiently for me to alter my treatment plan?
Question 2: Has the test been compared with a true gold standard?
You need to ask, firstly, whether the test has been compared with anything at all. Assuming that a “gold standard” test has been used, you should verify that it merits the description, perhaps by using the questions listed in question 1. For many conditions, there is no gold standard diagnostic test. Unsurprisingly, these tend to be the conditions for which new tests are most actively sought. Hence, the authors of such papers may need to develop and justify a combination of criteria against which the new test is to be assessed. One specific point to check is that the test being validated in the paper is not being used to define the gold standard.
Question 3: Did this validation study include an appropriate spectrum of subjects?
Although few investigators would be naive enough to select only, say, healthy male medical students for their validation study, only 27% of published studies explicitly define the spectrum of subjects tested in terms of age, sex, symptoms or disease severity, and specific eligibility criteria.7 Importantly, the test should be verified on a population which includes mild and severe disease, treated and untreated subjects, and those with different but commonly confused conditions.6
Although the sensitivity and specificity of a test are virtually constant whatever the prevalence of the condition, the positive and negative predictive values depend crucially on prevalence. This is why general practitioners are sceptical of the utility of tests developed exclusively in a secondary care population, and why a good diagnostic test is not necessarily a good screening test.
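The dependence of predictive values on prevalence is easy to demonstrate numerically. In this sketch the sensitivity, specificity, and the two prevalences are hypothetical figures chosen for illustration, and the function name is an assumption.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given prevalence."""
    # Express each cell of the two by two table as a fraction of the population.
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical test characteristics and prevalences, for illustration only.
SENSITIVITY = 0.90
SPECIFICITY = 0.90

for setting, prevalence in (("hospital clinic", 0.30), ("general practice", 0.01)):
    ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"{setting}: prevalence {prevalence:.0%}, PPV {ppv:.1%}, NPV {npv:.1%}")
```

The same test, with unchanged sensitivity and specificity, yields a far lower positive predictive value in the low prevalence setting, because false positives from the large healthy majority swamp the few true positives.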
Question 4: Has workup bias been avoided?
This is easy to check. It simply means, “Did everyone who got the new diagnostic test also get the gold standard, and vice versa?” There is clearly a potential bias in studies where the gold standard test is performed only on people who have already tested positive for the test being validated.7
Question 5: Has expectation bias been avoided?
Expectation bias occurs when pathologists and others who interpret diagnostic specimens are subconsciously influenced by the knowledge of the particular features of the case—for example, the presence of chest pain when interpreting an electrocardiogram. In the context of validating diagnostic tests against a gold standard, all such assessments should be “blind.”
Question 6: Was the test shown to be reproducible?
If the same observer performs the same test on two occasions on a subject whose characteristics have not changed, they will get different results in a proportion of cases. Similarly, it is important to confirm that reproducibility between different observers is at an acceptable level.9
Question 7: What are the features of the test as derived from this validation study?
All the above standards could have been met, but the test might still be worthless because the sensitivity, specificity, and other crucial features of the test are too low—that is, the test is not valid. What counts as acceptable depends on the condition being screened for. Few of us would quibble about a test for colour blindness that was 95% sensitive and 80% specific, but nobody ever died of colour blindness. The Guthrie heel-prick screening test for congenital hypothyroidism, performed on all babies in Britain soon after birth, is over 99% sensitive but has a positive predictive value of only 6% (it picks up almost all babies with the condition at the expense of a high false positive rate),10 and rightly so. It is more important to pick up every baby with this treatable condition who would otherwise develop severe mental handicap than to save hundreds the minor stress of a repeat blood test.
Question 8: Were confidence intervals given?
A confidence interval, which can be calculated for virtually every numerical aspect of a set of results, expresses the possible range of results within which the true value will probably lie. If the jury in the first example had found just one more murderer not guilty, the sensitivity of its verdict would have gone down from 67% to 33%, and the positive predictive value of the verdict from 33% to 20%. This enormous (and quite unacceptable) sensitivity to a single case decision is, of course, because we validated the jury's performance on only 10 cases. The larger the sample, the narrower the confidence interval, so it is particularly important to look for confidence intervals if the paper you are reading reports a study on a relatively small sample.11
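To see how wide the uncertainty is with so few cases, one can compute a confidence interval for the jury's sensitivity. The sketch below uses the Wilson score method, which is my choice for illustration (the article does not prescribe a method), and the larger sample of 300 is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (Wilson score method)."""
    p = successes / n
    denominator = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denominator
    return centre - half_width, centre + half_width


# Jury example: 2 of 3 true murderers convicted (sensitivity 67%).
low, high = wilson_interval(2, 3)
print(f"sensitivity 2/3: 95% CI {low:.2f} to {high:.2f}")

# Same proportion in a hypothetical sample 100 times larger.
low_big, high_big = wilson_interval(200, 300)
print(f"sensitivity 200/300: 95% CI {low_big:.2f} to {high_big:.2f}")
```

With only three murderers in the sample, the interval runs from about 0.21 to about 0.94, which is compatible with anything from a poor to an excellent jury; the larger sample narrows it to a few percentage points either side of 67%.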
Question 9: Has a sensible “normal range” been derived?
If the test gives non-dichotomous (continuous) results—that is, if it gives a numerical value rather than a yes/no result—someone will have to say what values count as abnormal. Defining relative and absolute danger zones for a continuous variable (such as blood pressure) is a complex science, which should take into account the actual likelihood of the adverse outcome which the proposed treatment aims to prevent. This process is made considerably more objective by the use of likelihood ratios (see below).
Question 10: Has this test been placed in the context of other potential tests in the diagnostic sequence?
In general, we treat high blood pressure simply on the basis of a series of resting blood pressure readings. Compare this with the sequence we use to diagnose coronary artery stenosis. Firstly, we select patients with a typical history of effort angina. Next, we usually do a resting electrocardiogram, an exercise electrocardiogram, and, in some cases, a radionuclide scan of the heart. Most patients come to a coronary angiogram only after they have produced an abnormal result on these preliminary tests.
If you sent 100 ordinary people for a coronary angiogram, the test might show positive and negative predictive values (and even sensitivity and specificity) very different from those it showed in the ill population on which it was originally validated. This means that the various aspects of validity of the coronary angiogram as a diagnostic test are virtually meaningless unless these figures are expressed in terms of what they contribute to the overall diagnostic work up.
A note on likelihood ratios
Question 9 above described the problem of defining a normal range for a continuous variable. In such circumstances, it can be preferable to express the test result not as “normal” or “abnormal” but in terms of the actual chances of a patient having the target disorder if the test result reaches a particular level. Take, for example, the use of the prostate specific antigen (PSA) test to screen for prostate cancer. Most men will have some detectable antigen in their blood (say, 0.5 ng/ml), and most of those with advanced prostate cancer will have high concentrations (above about 20 ng/ml). But a concentration of, say, 7.4 ng/ml may be found either in a perfectly normal man or in someone with early cancer. There simply is not a clean cutoff between normal and abnormal.12
We can, however, use the results of a validation study of this test against a gold standard for prostate cancer (say a biopsy of the prostate gland) to draw up a whole series of two by two tables. Each table would use a different definition of an abnormal test result to classify patients as “normal” or “abnormal.” From these tables, we could generate different likelihood ratios associated with an antigen concentration above each different cutoff point. When faced with a test result in the “grey zone” we would at least be able to say, “This test has not proved that the patient has prostate cancer, but it has increased [or decreased] the odds of that diagnosis by a factor of x.”
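The idea of drawing up one two by two table per cutoff can be sketched as follows. The antigen concentrations and the three cutoffs are invented for illustration and are not data from any study; the function name is likewise an assumption.

```python
def likelihood_ratios_by_cutoff(diseased_values, healthy_values, cutoffs):
    """For each cutoff, treat values above it as 'abnormal' and return
    (cutoff, sensitivity, false positive rate, positive likelihood ratio)."""
    rows = []
    for cutoff in cutoffs:
        true_pos = sum(value > cutoff for value in diseased_values)
        false_pos = sum(value > cutoff for value in healthy_values)
        sensitivity = true_pos / len(diseased_values)
        false_pos_rate = false_pos / len(healthy_values)
        # With no false positives the ratio cannot be estimated from this sample.
        positive_lr = sensitivity / false_pos_rate if false_pos else None
        rows.append((cutoff, sensitivity, false_pos_rate, positive_lr))
    return rows


# Invented antigen concentrations in ng/ml, for illustration only (not real data).
biopsy_positive = [3.1, 5.6, 7.4, 9.8, 12.5, 18.0, 22.4, 35.0]
biopsy_negative = [0.5, 0.8, 1.2, 2.0, 2.9, 3.5, 4.4, 6.1, 7.4, 11.0]

rows = likelihood_ratios_by_cutoff(biopsy_positive, biopsy_negative, (4.0, 10.0, 20.0))
for cutoff, sensitivity, false_pos_rate, positive_lr in rows:
    print(cutoff, sensitivity, false_pos_rate, positive_lr)
```

Raising the cutoff lowers the sensitivity but raises the likelihood ratio of a result above it, which is exactly the trade-off a single "normal range" conceals.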
The likelihood ratio thus has enormous practical value, and it is becoming the preferred way of expressing and comparing the usefulness of different tests.6 For example, if a person enters my consulting room with no symptoms at all, I know that they have a 5% chance of having iron deficiency anaemia, since I know that one person in 20 in the population has this condition (in the language of diagnostic tests, the pretest probability of anaemia is 0.05).13
Now, if I do a diagnostic test for anaemia, the serum ferritin concentration, the result will usually make the diagnosis of anaemia either more or less likely. A moderately reduced serum ferritin concentration (between 18 and 45 μg/l) has a likelihood ratio of 3, so the chance of a patient with this result having iron deficiency anaemia is approximately 0.05x3, or 0.15 (15%). This value is known as the post-test probability of the serum ferritin test. Strictly speaking, a likelihood ratio multiplies the odds rather than the probability: pretest odds of 0.05/0.95 (about 0.053) multiplied by 3 give post-test odds of about 0.16, or a probability of about 14%, so the shortcut is a fair approximation only when the pretest probability is low. The likelihood ratio of a very low serum ferritin concentration (below 18 μg/l) is 41, making the odds of iron deficiency anaemia in a patient with this result greater than unity (about 2.2, a probability of about 68%). On the other hand, a very high concentration (above 100 μg/l; likelihood ratio 0.13) would reduce the chances of the patient being anaemic from 5% to less than 1%.13
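Because a probability cannot exceed 1, the calculation is properly done on odds. The sketch below compares multiplying the probability directly by the likelihood ratio with the calculation via odds, using the pretest probability and likelihood ratios quoted above; the function and variable names are illustrative assumptions.

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert to odds, multiply by the likelihood ratio, convert back."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


PRETEST = 0.05  # pretest probability of iron deficiency anaemia from the text

# Likelihood ratios quoted in the text for three serum ferritin results.
ferritin_results = {
    "moderately reduced (18 to 45 ug/l)": 3,
    "very low (below 18 ug/l)": 41,
    "very high (above 100 ug/l)": 0.13,
}

for result, lr in ferritin_results.items():
    shortcut = PRETEST * lr  # probability multiplied directly by the ratio
    exact = post_test_probability(PRETEST, lr)
    print(f"{result}: shortcut {shortcut:.3f}, via odds {exact:.3f}")
```

For small likelihood ratios the two methods nearly agree, but for the ratio of 41 the direct multiplication gives an impossible 2.05, whereas the calculation via odds gives a post-test probability of about 68%.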
Figure 1 shows a nomogram, adapted by Sackett and colleagues from an original paper by Fagan,14 for working out post-test probabilities when the pretest probability (prevalence) and likelihood ratio for the test are known. The lines A, B, and C, drawn from a pretest probability of 25% (the prevalence of smoking among British adults), are the trajectories through likelihood ratios of 15, 100, and 0.015, respectively—three different tests for detecting whether someone is a smoker.15 Actually, test C detects whether the person is a non-smoker, since a positive result in this test leads to a post-test probability of only 0.5%.
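A nomogram of this kind works because, on a logarithmic scale, applying a likelihood ratio is simple addition: post-test log odds equal pretest log odds plus the logarithm of the likelihood ratio. The sketch below reproduces lines A, B, and C numerically from the figures quoted above; the function names are hypothetical.

```python
import math


def logit(probability):
    """Log odds of a probability."""
    return math.log(probability / (1 - probability))


def nomogram(pretest_probability, likelihood_ratio):
    """Post-test probability: add log(LR) to the pretest log odds,
    which is the straight line a likelihood ratio nomogram draws."""
    post_log_odds = logit(pretest_probability) + math.log(likelihood_ratio)
    return 1 / (1 + math.exp(-post_log_odds))


PRETEST = 0.25  # prevalence of smoking quoted in the text

# Likelihood ratios of lines A, B, and C in the text.
for line, lr in (("A", 15), ("B", 100), ("C", 0.015)):
    print(f"line {line}: LR {lr} -> post-test probability {nomogram(PRETEST, lr):.1%}")
```

Line C comes out at about 0.5%, in agreement with the figure read off the nomogram.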
The articles in this series are excerpts from How to read a paper: the basics of evidence based medicine. The book includes chapters on searching the literature and implementing evidence based findings.
Thanks to Dr Sarah Walters and Dr Jonathan Elford for advice, and in particular to Dr Walters for the jury example.