- Daniel Pewsner, senior research fellow1,
- Markus Battaglia, senior research fellow1,
- Christoph Minder, professor of medical statistics1,
- Arthur Marx, senior registrar2,
- Heiner C Bucher, professor of clinical epidemiology3,
- Matthias Egger, professor of epidemiology and public health medicine4
- 1 Division of Epidemiology and Biostatistics, Department of Social and Preventive Medicine, University of Bern, Switzerland
- 2 Department of General Internal Medicine, Inselspital, University of Bern, Switzerland
- 3 Basel Institute for Clinical Epidemiology, University Hospitals, Basel, Switzerland
- 4 MRC Health Services Research Collaboration, Department of Social Medicine, University of Bristol, Bristol
- Correspondence to: M Egger, Department of Social and Preventive Medicine, University of Bern, Finkenhubelweg 11, CH-3012 Berne, Switzerland
- Accepted 20 April 2004
Dr X is back from her annual leave. Dr Y, the locum doctor, reports on the patients he saw during her absence, including a 40 year old teacher who had sprained her right ankle. Returning from a conference, she had stumbled while walking down the stairs with a heavy bag. Examination revealed a moderately swollen lateral right ankle. The patient was able to walk but was clearly in pain. Her breath smelt of alcohol.
Ruling diagnoses in and out with SpPIns and SnNOuts
Dr Y had applied the Ottawa ankle rules—decision rules designed to exclude fractures of the malleolus and the midfoot—and found no bone tenderness.1 He had previously visited the website of a centre for evidence based medicine2 and printed out a list of diagnostic tests that can rule out, or rule in, the condition in question without requiring further investigations.
The probability of disease, given a positive or negative test result (post-test probability), is usually obtained by calculating the likelihood ratio of the test result and using formulas based on Bayes's theorem (see box 1), or a nomogram,3 to convert the estimated probability of the suspected diagnosis before the test result was known (pretest probability) into a post-test probability, which takes the result into account.4 Likelihood ratios indicate how many times more likely a test result is to be expected in a patient with the disease compared with a person free of the disease and thus measure a test's ability to modify pretest probabilities.
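The odds-based conversion described here can be sketched in a few lines of Python (a minimal illustration; the function name and example numbers are ours, not taken from any of the cited studies):

```python
def post_test_probability(pretest_prob, likelihood_ratio):
    """Convert a pretest probability into a post-test probability
    via the odds form of Bayes's theorem (see box 1)."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

# A pretest probability of 50% and a likelihood ratio of 10
# give a post-test probability of about 91% (odds 1 x 10 = 10; 10/11).
print(round(post_test_probability(0.50, 10), 2))
```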
David Sackett and others have argued that such calculations are unnecessary when a test is highly sensitive or highly specific.4–6 In this situation the likelihood ratio of a negative test will generally be very small, and the likelihood ratio of a positive test very large. A negative test will thus rule out, and a positive result rule in, disease. Two mnemonics that capture the properties of such tests have been coined: SnNOut (high sensitivity, negative, rules out) and SpPIn (high specificity, positive, rules in).4 This concept has become increasingly popular, with many websites for evidence based medicine listing such tests and inviting users to nominate further SpPIn and SnNOut tests. The understanding of the SnNOut principle among medical students was recently examined in a randomised trial.7
Summary points

- Negative results from highly sensitive tests can rule a diagnosis out (sensitive, negative, out = SnNOut), and positive results from highly specific tests can rule a diagnosis in (specific, positive, in = SpPIn)
- Studies quoted as showing SpPIn or SnNOut properties may be affected by spectrum bias, partial verification bias, or incorporation bias; others may be too small to define test characteristics with sufficient precision
- The power of a test to rule a diagnosis out does not depend exclusively on its sensitivity, as suggested by the SnNOut rule, but is reduced by low specificity; similarly, the power to rule in depends on both specificity and sensitivity
- The evidence from studies of a test's accuracy should be critically assessed, and post-test probabilities (with 95% confidence intervals) should be calculated when evaluating potential SnNOut or SpPIn tests
- Assuming that a diagnosis can be ruled in or out with confidence, when in reality it cannot, could have serious consequences for patients
The website listed the Ottawa ankle rules as a SnNOut test,2 indicating that in the teacher's case a fracture could safely be ruled out without radiography. Indeed, the patient made an uneventful and full recovery within four weeks. Alerted by the patient's alcoholic breath, Dr Y wondered whether an alcohol problem might have contributed to the accident and used the CAGE questions (see box 2) to investigate this further. According to the same website,2 the CAGE instrument has SpPIn properties, ruling the diagnosis in if two or more questions are answered affirmatively. The patient confirmed that she felt she should cut down on alcohol and that she had felt bad repeatedly about her drinking. Dr X, who had known her for over 10 years, explained that the patient's alcohol intake was moderate and well controlled, and that she was socially well integrated but very health conscious and somewhat anxious. A few days later, Dr X got a telephone call from the patient, who was clearly upset about the locum doctor's suggestion that she had an alcohol problem. A further consultation was required to clarify the situation and restore trust.
Box 1: Definitions of concepts and terms
Sensitivity—The proportion of people with the disease who are correctly identified by a positive test result (“true positive rate”)
Specificity—The proportion of people free of the disease who are correctly identified by a negative test result (“true negative rate”)
SnNOut—Mnemonic to indicate that a negative test result (N) of a highly sensitive test (Sn) rules out the diagnosis (Out)
SpPIn—Mnemonic to indicate that a positive test result (P) of a highly specific test (Sp) rules in the diagnosis (In)
Likelihood ratios—Measure of a test result's ability to modify pretest probabilities. Likelihood ratios indicate how many times more likely a test result is in a patient with the disease compared with a person free of the disease.
Likelihood ratio of a positive test result (LR+)—The ratio of the true positive rate to the false positive rate: sensitivity/(1-specificity)
Likelihood ratio of a negative test result (LR-)—The ratio of the false negative rate to the true negative rate: (1-sensitivity)/specificity
Pretest probability (prevalence)—The probability that an individual has the target disorder before the test is carried out
Post-test probability—The probability that an individual with a specific test result has the target condition (post-test odds/[1+post-test odds])
Pretest odds—The odds that an individual has the target disease before the test is carried out (pretest probability/[1-pretest probability])
Post-test odds—The odds that a patient has the target disease after being tested (pretest odds×LR)
Positive predictive value (PPV)—The proportion of individuals with positive test results who have the target condition. This equals the post-test probability given a positive test result
Negative predictive value (NPV)—The proportion of individuals with negative test results who do not have the target condition. This equals one minus the post-test probability given a negative test result
Critical appraisal of test evaluation studies
In this article, we examine examples of test evaluation studies that websites and a textbook of evidence based medicine4 have cited as showing that the tests had SpPIn or SnNOut properties. The studies were chosen to illustrate methodological issues. We assessed the quality of studies as described elsewhere,8 extracted the two-by-two table from the original publication, and calculated likelihood ratios and post-test probabilities with exact binomial 95% confidence intervals based on the pretest probabilities observed in the studies. Finally, we examined whether the post-test probability of the condition in question in the population studied was compatible with the notion of safely ruling the condition in or out, and considered the transferability of study results to other settings and populations. Tables 1 and 2 summarise the study characteristics and results from our critical appraisal.
Box 2: CAGE questionnaire for detecting alcohol misuse
Have you ever felt you should Cut down on your drinking?
Have people Annoyed you by criticising your drinking?
Have you ever felt bad or Guilty about your drinking?
Have you ever had a drink first thing in the morning to steady your nerves or to get rid of a hangover (Eye opener)?
Random error and bias
A diagnostic study may be too small to define test performance with sufficient precision. For example, a website2 and the textbook4 interpreted a study of ankle swelling in patients with suspected ascites10 as demonstrating SnNOut properties. The absence of a history of ankle swelling is thus assumed to rule out ascites.2 4 However, the study included only 15 patients with ascites, and the confidence intervals were correspondingly wide: the 95% confidence interval for sensitivity extended down to 68%. This means that absence of ankle swelling is still compatible with a 15.8% probability of ascites, which is clearly unacceptably high (table 1).
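The effect of small numbers can be checked directly. The stdlib-only Python sketch below computes an exact (Clopper-Pearson) binomial confidence interval by bisection; the counts used (14 of 15) are illustrative, not necessarily the exact figures reported in the study:

```python
import math

def binom_pmf(x, n, p):
    """Binomial probability mass function."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a proportion k/n,
    found by bisection on the binomial tail probabilities."""
    def bisect(tail, increasing):
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            below = tail(mid) < alpha / 2 if increasing else tail(mid) > alpha / 2
            lo, hi = (mid, hi) if below else (lo, mid)
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else bisect(
        lambda p: sum(binom_pmf(x, n, p) for x in range(k, n + 1)), True)
    upper = 1.0 if k == n else bisect(
        lambda p: sum(binom_pmf(x, n, p) for x in range(0, k + 1)), False)
    return lower, upper

# With only 15 diseased patients, even 14 of 15 correctly detected
# leaves a wide interval around the estimated sensitivity.
low, high = clopper_pearson(14, 15)
print(f"sensitivity 14/15 = {14/15:.0%}, 95% CI {low:.0%} to {high:.0%}")
```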
Studies with methodological flaws tend to overestimate the accuracy of diagnostic tests.21 Bias can be introduced when tests are evaluated in patients known to have the disease and in people known to be free of it—so called diagnostic case-control studies. In this situation patients with borderline or mild expressions of the disease, and conditions mimicking the disease are excluded, which can lead to exaggeration of both sensitivity and specificity.21 This is called spectrum bias because the spectrum of study patients will not be representative of patients seen in practice. For example, the textbook considered auscultatory percussion in the diagnosis of pleural effusion as a SpPIn test.4 This assessment was based on a study that compared patients who were selected because of the presence or absence of radiological signs of effusion.16 The impressive results (100% specificity and 96% sensitivity (table 2)) may therefore not be reliable. The textbook and a website22 also claim that the presence of retinal vein pulsation in ophthalmoscopy excludes increased intracranial pressure (a SnNOut test). This is based on a study that compared patients known to have increased pressure with people not suspected to have increased intracranial pressure.11
Partial verification bias may be introduced when the reference test or tests are not applied consistently to confirm negative results of the index test: some patients are either excluded or simply counted as true negatives. This may lead to overestimation of sensitivity and underestimation of specificity, or to overestimation of both.21 The textbook considered the CAGE questionnaire for diagnosing alcohol misuse (box 2) to be a SpPIn test.4 This is based on a study that subjected only a fraction of CAGE-negative persons to further testing (liver enzymes, medical record review, and physician interviews (table 2)),17 thus possibly introducing bias.
Similarly, incorporation bias may be present if the test under evaluation is also part of the reference test.21 This will lead to overestimation of test accuracy because experimental and reference tests are no longer independent. For example, a website listed abdominojugular reflux for the diagnosis of congestive heart failure as a SpPIn test,22 on the basis of a study that used clinicoradiographic criteria, including abdominojugular reflux, as the reference test (table 2).18
Sensitivity and specificity
The likelihood ratio associated with a negative test result does not depend on the test's sensitivity alone, as the SnNOut rule suggests, but also on its specificity. For example, a website considered that the clinical criteria for the diagnosis of Alzheimer's disease had SnNOut properties2 based on a sensitivity of 93% (table 1).12 However, despite this high sensitivity, the likelihood ratio of a negative test was a modest 0.3, because of the test's low specificity of 23% ((1 - 0.93)/0.23 = 0.3, see box 1). Indeed, in the population studied, the probability of Alzheimer's disease, given a negative test, was 25% (table 1). The power to rule out a diagnosis thus depends on both sensitivity and specificity.
Similarly, the ability to rule in depends not only on specificity, as suggested by the SpPIn rule, but also on sensitivity. A study examining the presence of a third heart sound in the diagnosis of congestive heart failure (table 2)19—which a website interpreted as demonstrating SpPIn properties23—is an example of a highly specific test (99%) that suffers from a low sensitivity (24%). The figure shows how the power to rule a disease in or out is eroded when highly specific tests are not sufficiently sensitive, or highly sensitive tests are not sufficiently specific.
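This erosion is easy to demonstrate numerically. Holding sensitivity fixed at the 93% of the Alzheimer's example and varying specificity (only the 23% figure comes from the cited study; the other values are chosen for illustration) shows the likelihood ratio of a negative test drifting towards the uninformative value of 1:

```python
def lr_negative(sensitivity, specificity):
    """Likelihood ratio of a negative result: (1 - sensitivity)/specificity."""
    return (1 - sensitivity) / specificity

# LR- rises from 0.07 to 0.30 as specificity falls from 95% to 23%
for spec in (0.95, 0.50, 0.23):
    print(f"sensitivity 93%, specificity {spec:.0%}: "
          f"LR- = {lr_negative(0.93, spec):.2f}")
```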
Transferability and applicability
The performance of a diagnostic test often varies considerably from one setting to another. This may be due to differences in the definition of the disease; in the exact nature of the test and its calibration; and in the characteristics of those with and without the disease in a given setting.21 For example, patients attending primary care practices will generally have disease at an earlier stage than patients in secondary and tertiary care, which may reduce a test's sensitivity. Patients free of the disease in tertiary care will tend to have other conditions, which could reduce the specificity of a diagnostic test. Interpreting data on a test's accuracy thus requires defining the exact nature of the test used, the disease, and the patient population studied. For example, the website that listed the CAGE questionnaire as a SpPIn test for alcohol dependence2 cited, as the evidence for this, a study performed in black women admitted to a trauma centre in the United States,20 the results of which may not be applicable to other populations and settings.
Even when we assume that sensitivity and specificity do not change between settings and patient populations, test results will have different interpretations depending on whether a test is performed in a low risk population, such as in primary care, or high risk patients in a referral centre. For example, in the study evaluating the third heart sound in the diagnosis of heart failure,19 the pretest probability or prevalence in a general practice setting was 16%. In this situation, a positive test with a likelihood ratio of 18 will not allow the diagnosis to be ruled in with confidence: the post-test probability is only increased to 77% (table 2). If the pretest probability were 50%, however—such as in a cardiology outpatient clinic—the same positive test would produce a post-test probability of 95% (see box 1 for formula).
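The third heart sound figures can be reproduced with a few lines of Python (the likelihood ratio of 18 and the two pretest probabilities come from the text above; the helper names are ours):

```python
def to_odds(probability):
    """Convert a probability to odds."""
    return probability / (1 - probability)

def to_probability(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)

LR_POSITIVE = 18  # likelihood ratio of a positive third heart sound

for pretest in (0.16, 0.50):  # general practice v cardiology outpatient clinic
    post_test = to_probability(to_odds(pretest) * LR_POSITIVE)
    print(f"pretest {pretest:.0%} -> post-test {post_test:.0%}")
# prints: pretest 16% -> post-test 77%
#         pretest 50% -> post-test 95%
```

The same likelihood ratio thus yields a post-test probability that is decisive in one setting and equivocal in another, which is the point of the paragraph above.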
The interpretation of studies will be strongly influenced by the nature of the condition and the invasiveness of further investigations. For example, a study assessing urinary albumin:creatinine ratios below 1.8 g/mol for ruling out microalbuminuria in men with type 2 diabetes in primary care13 and the study examining the absence of a history of ankle swelling for ruling out ascites in men admitted to general internal medicine wards10 both produced post-test probabilities of about 3%. In the first case, we accepted a website's conclusion that the urinary albumin:creatinine ratio had SnNOut properties2: we thought that the post-test probability of microalbuminuria was sufficiently low with a negative test result, considering that guidelines recommend regular testing of patients with type 2 diabetes.24 In the second case, however (and unlike the textbook4), we thought that a 3% probability of ascites, a sign often associated with serious conditions, was still too high in men with suspected ascites but no history of ankle swelling, and that sonography should be used to rule the diagnosis in or out.25 As mentioned above, a further problem with this study is its small sample size, which resulted in wide confidence intervals.
Prompted by a colleague's experience with the Ottawa ankle rules and the CAGE questionnaire, we examined diagnostic test evaluation studies that websites and a textbook of evidence based medicine interpreted as demonstrating tests' ability to conclusively rule a diagnosis in or out (SpPIn or SnNOut tests). We calculated likelihood ratios to measure the tests' power to rule the target conditions in or out, assessed the study designs for possible bias, and gave examples of tests for which these methodological issues raise questions about whether they truly have SpPIn or SnNOut properties.
We believe that the concept of SpPIn or SnNOut tests can help clinicians in interpreting diagnostic test results. However, identifying and promoting tests with SpPIn or SnNOut properties should be based on a careful appraisal of the evidence, including the methodological quality of the test evaluation studies, and not simply on the test's sensitivity or specificity. Likelihood ratios and typical post-test probabilities should be calculated and reported, together with measures of statistical uncertainty, as recommended in the Standards for Reporting Diagnostic Accuracy Studies (STARD).26 Assessments should ideally be based on a systematic review of all available studies, which may include a meta-analysis to increase the precision of estimates of test accuracy. For example, a recent meta-analysis of 27 test accuracy studies of the Ottawa ankle rules confirmed that in many settings this decision aid can indeed exclude fractures and reduce the number of unnecessary radiographs.14 The fact that results may not be transferable to other populations and settings should be stressed, and the information required to judge transferability and applicability should be provided. Clearly, assuming that a diagnosis can be ruled in or ruled out with confidence, when in reality it cannot, could have serious consequences for patients.
We thank Nicola Low for helpful comments on an earlier version of this manuscript.
Contributors DP had the idea of critically appraising test accuracy studies that were interpreted as demonstrating SpPIn or SnNOut properties. CM advised on statistical issues. ME and DP wrote the first draft of the article. All authors contributed to the appraisal of the evidence from studies and to writing the final draft of the article.
Funding The Krankenfürsorgestiftung der Gesellschaft für das Gute und Gemeinnützige (GGG), Basle, Switzerland, and the Swiss Academy of Medical Sciences supported this study.
Competing interests None declared.