BMJ 2004;329:209-213 (24 July), doi:10.1136/bmj.329.7459.209
Daniel Pewsner, senior research fellow1, Markus Battaglia, senior research fellow1, Christoph Minder, professor of medical statistics1, Arthur Marx, senior registrar2, Heiner C Bucher, professor of clinical epidemiology3, Matthias Egger, professor of epidemiology and public health medicine4
1 Division of Epidemiology and Biostatistics, Department of Social and Preventive Medicine, University of Bern, Switzerland, 2 Department of General Internal Medicine, Inselspital, University of Bern, Switzerland, 3 Basel Institute for Clinical Epidemiology, University Hospitals, Basel, Switzerland, 4 MRC Health Services Research Collaboration, Department of Social Medicine, University of Bristol, Bristol
Correspondence to: M Egger, Department of Social and Preventive Medicine, University of Bern, Finkenhubelweg 11, CH-3012 Berne, Switzerland egger{at}ispm.unibe.ch
The probability of disease, given a positive or negative test result (post-test probability), is usually obtained by calculating the likelihood ratio of the test result and using formulas based on Bayes's theorem (see box 1), or a nomogram,3 to convert the estimated probability of the suspected diagnosis before the test result was known (pretest probability) into a post-test probability, which takes the result into account.4 Likelihood ratios indicate how many times more likely a test result is to be expected in a patient with the disease compared with a person free of the disease and thus measure a test's ability to modify pretest probabilities.
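The conversion described above can be written out explicitly. The following is a sketch of the standard odds form of Bayes's theorem, not a reproduction of box 1; the symbols $p_{\text{pre}}$, $p_{\text{post}}$, $T$ (the observed test result) and $D$ (disease present) are notation introduced here for illustration.

```latex
% Likelihood ratio of an observed test result T
\mathrm{LR}(T) = \frac{P(T \mid D)}{P(T \mid \text{not } D)}

% Pretest probability -> pretest odds
\text{odds}_{\text{pre}} = \frac{p_{\text{pre}}}{1 - p_{\text{pre}}}

% Bayes's theorem in odds form
\text{odds}_{\text{post}} = \text{odds}_{\text{pre}} \times \mathrm{LR}(T)

% Post-test odds -> post-test probability
p_{\text{post}} = \frac{\text{odds}_{\text{post}}}{1 + \text{odds}_{\text{post}}}
```

A likelihood ratio of 1 leaves the pretest probability unchanged; the further the ratio lies from 1, the more the test result shifts the probability.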
David Sackett and others have argued that such calculations are unnecessary when a test is highly sensitive or highly specific.4-6 In this situation the likelihood ratio of a negative test will generally be very small, and the likelihood ratio of a positive test very large. A negative test will thus rule out, and a positive result rule in, disease. Two mnemonics that capture the properties of such tests have been coined: SnNOut (high sensitivity, negative, rules out) and SpPIn (high specificity, positive, rules in).4 This concept has become increasingly popular, with many websites for evidence based medicine listing such tests and inviting users to nominate further SpPIn and SnNOut tests. The understanding of the SnNOut principle among medical students was recently examined in a randomised trial.7
The website listed the Ottawa ankle rules as a SnNOut test,2 indicating that in the teacher's case a fracture could safely be ruled out without radiography. Indeed, the patient made an uneventful and full recovery within four weeks. Alerted by the patient's alcoholic breath, Dr Y wondered whether an alcohol problem might have contributed to the accident and used the CAGE questions (see box 2) to investigate this further. According to the same website,2 the CAGE instrument has SpPIn properties, ruling the diagnosis in if two or more questions are answered affirmatively. The patient confirmed that she felt she should cut down on alcohol and that she had felt bad repeatedly about her drinking. Dr X, who had known her for over 10 years, explained that the patient's alcohol intake was moderate and well controlled, and that she was socially well integrated but very health conscious and somewhat anxious. A few days later, Dr X got a telephone call from the patient, who was clearly upset about the locum doctor's suggestion that she had an alcohol problem. A further consultation was required to clarify the situation and restore trust.
Random error and bias
A diagnostic study may be too small to define test performance with sufficient precision. For example, a website2 and the textbook4 interpreted a study of ankle swelling in patients with suspected ascites10 as demonstrating SnNOut properties. The absence of a history of ankle swelling is thus assumed to rule out ascites.2,4 However, the study was based on only 15 patients with ascites and confidence intervals were wide, with the 95% confidence interval for the sensitivity extending down to 68%. This means that absence of ankle swelling is still compatible with a 15.8% probability of ascites, which clearly is unacceptably high (table 1).
Studies with methodological flaws tend to overestimate the accuracy of diagnostic tests.21 Bias can be introduced when tests are evaluated in patients known to have the disease and in people known to be free of it (so called diagnostic case-control studies). In this situation patients with borderline or mild expressions of the disease, and conditions mimicking the disease, are excluded, which can lead to exaggeration of both sensitivity and specificity.21 This is called spectrum bias because the spectrum of study patients will not be representative of patients seen in practice. For example, the textbook considered auscultatory percussion in the diagnosis of pleural effusion as a SpPIn test.4 This assessment was based on a study that compared patients who were selected because of the presence or absence of radiological signs of effusion.16 The impressive results (100% specificity and 96% sensitivity (table 2)) may therefore not be reliable. The textbook and a website22 also claim that the presence of retinal vein pulsation on ophthalmoscopy excludes increased intracranial pressure (a SnNOut test). This is based on a study that compared patients known to have increased pressure with people not suspected to have increased intracranial pressure.11
Partial verification bias may be introduced when the reference test or tests are not applied consistently to confirm negative results of the index test. Some patients are either excluded or considered true negatives. This may lead to overestimation of sensitivity and underestimation of specificity or to overestimation of sensitivity and specificity.21 The textbook considered the CAGE questionnaire for diagnosing alcohol misuse (box 2) to be a SpPIn test.4 This is based on a study that subjected only a fraction of CAGE-negative persons to further testing (liver enzymes, medical record review, and physician interviews (table 2)),17 thus possibly introducing bias.
Similarly, incorporation bias may be present if the test under evaluation is also part of the reference test.21 This will lead to overestimation of test accuracy because experimental and reference tests are no longer independent. For example, a website listed abdominojugular reflux for the diagnosis of congestive heart failure as a SpPIn test,22 on the basis of a study that used clinicoradiographic criteria, including abdominojugular reflux, as the reference test (table 2).18
Sensitivity and specificity
The likelihood ratio associated with a negative test result does not depend on its sensitivity alone, as suggested by the SnNOut rule, but also on its specificity. For example, a website considered that the clinical criteria for the diagnosis of Alzheimer's disease had SnNOut properties2 based on a sensitivity of 93% (table 1).12 However, despite this high sensitivity, the likelihood ratio of a negative test was a modest 0.3, because of the test's low specificity of 23% ((100 - 93)/23 = 0.3, see box 1). Indeed, in the population studied, the probability of Alzheimer's disease, given a negative test, was 25% (table 1). The power to rule out a diagnosis thus depends on both sensitivity and specificity.
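The arithmetic behind this example can be checked with a short Python sketch. The function name `likelihood_ratios` is hypothetical, and the formulas are the standard definitions for a dichotomous test rather than a reproduction of box 1; the inputs are the sensitivity and specificity quoted above for the Alzheimer's disease criteria.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a dichotomous test.

    Both arguments are proportions strictly between 0 and 1.
    """
    if not (0 < sensitivity < 1 and 0 < specificity < 1):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    # LR+ : how much more likely a positive result is in diseased vs non-diseased people
    lr_positive = sensitivity / (1 - specificity)
    # LR- : how much more likely a negative result is in diseased vs non-diseased people
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


# Clinical criteria for Alzheimer's disease: sensitivity 93%, specificity 23%
lr_pos, lr_neg = likelihood_ratios(0.93, 0.23)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.1f}")  # LR+ = 1.2, LR- = 0.3
```

Both ratios lie close to 1, so neither a positive nor a negative result shifts the pretest probability much, despite the high sensitivity.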
Similarly, the ability to rule in depends not only on specificity, as suggested by the SpPIn rule, but also on sensitivity. A study examining the presence of a third heart sound in the diagnosis of congestive heart failure (table 2),19 which a website interpreted as demonstrating SpPIn properties,23 is an example of a highly specific test (99%) that suffers from a low sensitivity (24%). The figure shows how the power to rule a disease in or out is eroded when highly specific tests are not sufficiently sensitive, or highly sensitive tests are not sufficiently specific.
Transferability and applicability
The performance of a diagnostic test often varies considerably from one setting to another, which may be due to differences in the definition of the disease, the exact nature of the test and its calibration, and the characteristics of those with and without the disease in a given setting.21 For example, patients attending primary care practices will generally have disease at an earlier stage than patients in secondary and tertiary care, which may reduce a test's sensitivity. Patients free of the disease in tertiary care will tend to have other conditions, which could reduce the specificity of a diagnostic test. Interpreting data on a test's accuracy thus requires defining the exact nature of the test used, the disease, and the patient population studied. For example, the website that listed the CAGE questionnaire as a SpPIn test for alcohol dependence2 cited, as the evidence for this, a study that had been performed in black women admitted to a trauma centre in the United States,20 which may not be applicable to other populations and settings.
Even when we assume that sensitivity and specificity do not change between settings and patient populations, test results will have different interpretations depending on whether a test is performed in a low risk population, such as in primary care, or high risk patients in a referral centre. For example, in the study evaluating the third heart sound in the diagnosis of heart failure,19 the pretest probability or prevalence in a general practice setting was 16%. In this situation, a positive test with a likelihood ratio of 18 will not allow the diagnosis to be ruled in with confidence: the post-test probability is only increased to 77% (table 2). If the pretest probability were 50%, however, such as in a cardiology outpatient clinic, the same positive test would produce a post-test probability of 95% (see box 1 for formula).
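The two post-test probabilities quoted above can be reproduced with a minimal Python sketch. The function name `post_test_probability` is hypothetical, and the calculation uses the standard odds form of Bayes's theorem, which is assumed to match the formula in box 1; the likelihood ratio of 18 and the pretest probabilities are taken from the text.

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert a pretest probability into a post-test probability via odds."""
    if not 0 <= pretest_probability < 1:
        raise ValueError("pretest probability must be in [0, 1)")
    if likelihood_ratio < 0:
        raise ValueError("likelihood ratio must be non-negative")
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


LR_POSITIVE = 18  # likelihood ratio of a third heart sound, as quoted in the text

settings = [
    ("general practice", 0.16),
    ("cardiology outpatient clinic", 0.50),
]
for setting, pretest in settings:
    post = post_test_probability(pretest, LR_POSITIVE)
    print(f"{setting}: pretest {pretest:.0%} -> post-test {post:.0%}")
# general practice: pretest 16% -> post-test 77%
# cardiology outpatient clinic: pretest 50% -> post-test 95%
```

The same likelihood ratio therefore yields quite different conclusions depending on the setting in which the test is used.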
The interpretation of studies will be strongly influenced by the nature of the condition and the invasiveness of further investigations. For example, a study assessing urinary albumin:creatinine ratios below 1.8 g/mol for ruling out microalbuminuria in men with type 2 diabetes in primary care13 and the study examining the absence of a history of ankle swelling for ruling out ascites in men admitted to general internal medicine wards10 both produced post-test probabilities of about 3%. In the first case, we accepted a website's conclusion that the urinary albumin:creatinine ratio had SnNOut properties2: we thought that the post-test probability of microalbuminuria was sufficiently low with a negative test result, considering that guidelines recommend regular testing of patients with type 2 diabetes.24 In the second case, however, and unlike the textbook,4 we thought that in men with suspected ascites but no history of ankle swelling a 3% probability of ascites, a sign often associated with serious conditions, was still too high and that sonography should be used to rule the diagnosis in or out.25 As mentioned above, another problem with this study is the small sample size, which resulted in wide confidence intervals.
We believe that the concept of SpPIn or SnNOut tests can help clinicians in interpreting diagnostic test results. However, identifying and promoting tests with SpPIn or SnNOut properties should be based on a careful appraisal of the evidence, including the methodological quality of the test evaluation studies, and not simply on the test's sensitivity or specificity. Likelihood ratios and typical post-test probabilities should be calculated and reported, together with measures of statistical uncertainty, as recommended in the Standards for Reporting Diagnostic Accuracy Studies (STARD).26 Assessments should ideally be based on a systematic review of all available studies, which may include a meta-analysis to increase the precision of estimates of test accuracy. For example, a recent meta-analysis of 27 test accuracy studies of the Ottawa ankle rules confirmed that in many settings this decision aid can indeed exclude fractures and reduce the number of unnecessary radiographs.14 The fact that results may not be transferable to other populations and settings should be stressed, and the information required to judge transferability and applicability should be provided. Clearly, assuming that a diagnosis can be ruled in or ruled out with confidence, when in reality it cannot, could have serious consequences for patients.
Contributors: DP had the idea of critically appraising test accuracy studies that were interpreted as demonstrating SpPIn or SnNOut properties. CM advised on statistical issues. ME and DP wrote the first draft of the article. All authors contributed to the appraisal of the evidence from studies and to writing the final draft of the article.
Funding: The Krankenfürsorgestiftung der Gesellschaft für das Gute und Gemeinnützige (GGG), Basle, Switzerland, and the Swiss Academy of Medical Sciences supported this study.
Competing interests: None declared.