# Risky business: doctors’ understanding of statistics

BMJ 2014; 349 doi: https://doi.org/10.1136/bmj.g5619 (Published 17 September 2014) Cite this as: BMJ 2014;349:g5619

## All rapid responses


Thanks to those who have written to point out that 1 in 51 is a better estimate of the positive predictive value of the hypothetical test than the figure of 1 in 50 that I gave. Trying to make my explanation as straightforward as possible, I rounded up. But I should have written: ‘So the chance of someone with a positive test actually having the disease is *about* one in 50.’

**Competing interests:**
No competing interests

My criticism of the way test usefulness is portrayed is its tortuous complexity, which is illustrated by the ‘problem’ posed here to apparently test doctors’ understanding of risk. Instead it is a vivid example of how current teaching is causing widespread confusion and putting students and trainee doctors off the subject. This is partly for historical reasons: the terms were ‘borrowed’ from WW2 radar, where adjusting a dial could change the sensitivity and specificity of the signal to maximise the chance of spotting aircraft and minimise the chance of seeing birds.

If the ability of a simple (positive/negative) test to predict a diagnosis is being assessed, all we need to know is how often the diagnosis was found when the test was positive (e.g. 80/200 patients) and how often the test was positive when the diagnosis was present (e.g. 80/100 patients). If we are also told how many people were studied (e.g. 1000), then we have the basic facts about that study. However, similar information will be needed about all other symptoms, signs, test results and diagnoses. Sadly, this is not available and we have to guess in order to do our job.

Given the above values of 80/200, 80/100 and 1000, the enthusiastic can say that the sensitivity is 0.8, the positive predictive value is 0.4, the false discovery rate is 0.6, the diagnostic prevalence is 0.1, the positive test result prevalence is 0.2, the false negative rate is also 0.2, the false positive rate is 0.133, the specificity is 0.867, the prevalence of the absence of a diagnosis in the study population is 0.9, the negative predictive value is 0.975, etc. This information is of little if any value clinically but is often used to torment people by asking them to work backwards, calculating the positive predictive value from various combinations of the above values. But the answer is obvious from the main study observation of 80/200. It is 80/200 = 0.4.
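For readers who want to check the arithmetic, here is a short Python sketch (not part of the original response) that derives every figure above from the three study observations:

```python
# Hypothetical study from the text: 1000 people studied,
# 200 positive tests, 100 with the diagnosis, 80 with both.
n = 1000
test_pos = 200
diseased = 100
true_pos = 80

false_pos = test_pos - true_pos        # 120 positive tests without the diagnosis
false_neg = diseased - true_pos        # 20 missed diagnoses
true_neg = n - diseased - false_pos    # 780 negative tests without the diagnosis

sensitivity = true_pos / diseased            # 0.8
ppv = true_pos / test_pos                    # 0.4
false_discovery_rate = false_pos / test_pos  # 0.6
prevalence = diseased / n                    # 0.1
test_pos_prevalence = test_pos / n           # 0.2
false_neg_rate = false_neg / diseased        # 0.2
false_pos_rate = false_pos / (n - diseased)  # ≈ 0.133
specificity = true_neg / (n - diseased)      # ≈ 0.867
npv = true_neg / (true_neg + false_neg)      # 0.975
```

Note that the positive predictive value falls straight out of the main study observation, 80/200, without any of the intermediate quantities.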

We need more clarity and less obfuscation. In order to do sensible research, we also need to understand the set theory of differential diagnostic reasoning, diagnostic confirmation and treatment selection [1].

Reference:

1. Llewelyn H, Ang AH, Lewis K, Abdullah A. The Oxford Handbook of Clinical Diagnosis, 3rd edition. Oxford University Press, Oxford, 2014, pages 625 to 627.

**Competing interests:**
No competing interests

**26 September 2014**

Dear Editors

In providing an explanation of the answer to the Casscells et al question, which is given as follows:

"(In case you’re struggling, one way of thinking about the question is to imagine that 1000 people are given the test. Since the prevalence is one in 1000, one of these people will have the disease. But a false positive rate of 5% means that 50 people will have a positive test result. So the chance of someone with a positive test actually having the disease is one in 50.)"

Mr Martyn ironically exposes his own lack of true understanding of how the real answer should be derived.

The "common sense" reasoning from the actual Casscells et al paper uses the usual equation for the Positive Predictive Value (PPV), namely

PPV = TP / (TP + FP)

where TP is the number of true positives and FP the number of false positives.

Based on this equation:

1 out of any 1000 people has the disease

So 999 people do not have the disease.

The False Positive Rate of the test for the disease is 5%.

Of the 999 people without the disease, about 0.05 × 999 = 49.95, ie just under 50, would have tested positive.

Of the 1000 people tested, at most 1 person can test positive and have the disease (a true positive).

Just under 50 false positives plus at most 1 true positive means that, at most, just under 51 people out of 1000 would have tested positive.

So PPV is TP / (TP + FP) = (maximum possible 1) / (just under 51) = not more than 1/51*

* which is the number stated in the 1978 Casscells et al paper, and not the 1/50 that Mr Martyn suggests. The values are close, but the working shown is nevertheless wrong.
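The steps above can be checked with a few lines of Python (a sketch of the response's arithmetic, not code from the Casscells paper):

```python
# 1 in 1000 has the disease; the test's false positive rate is 5%.
n = 1000
diseased = 1
false_pos = 0.05 * (n - diseased)   # 0.05 × 999 = 49.95, just under 50
true_pos = 1                        # the maximum possible number of true positives

ppv = true_pos / (true_pos + false_pos)   # 1 / 50.95 ≈ 0.0196, about 1 in 51
```

If the 5% is instead applied to all 1000 people tested, giving 50 false positives, the familiar round figure of 1 true positive among 51 positives (PPV = 1/51) drops out.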

**Competing interests:**
No competing interests

It seems ironic that, in the rapid response by Dr Llewelyn, who is listed as an Hon Fellow in mathematics and a consultant physician, it was asserted that:

"Doctors use symptoms and other findings by considering their differential diagnoses, often ranking the diagnoses in order of the frequency with which they occur in the finding. We then look for other findings that occur commonly in one or more diagnoses and rarely or never by definition in others. By doing this we try to assemble a combination of findings within which the frequency of patients with one of the differential diagnoses is high"

and that:

"The specificity, false positive rate and likelihood ratio have never been a part of differential diagnostic reasoning."

While it is true that no one (that I know of, anyway) carries around a set of numbers describing the statistical properties of medical tests, understanding sensitivity, specificity, true and false positives/negatives and the like is precisely what allows us to steer towards a set of differential diagnoses, since the formulation of a diagnosis is based on accepting or rejecting signs and symptoms (where looking for the presence of a specific symptom is itself a test) found on examination.

For example, if one accepts that the absence of a pathognomonic symptom or sign does not rule out the possibility of a diagnosis, then one accepts the idea of a false negative.

Similarly, doctors need to understand the impact and rationale of ordering investigations, and how the results would (or should) change the final diagnosis, particularly when tests are ordered on a screening/'fishing' basis or for confirmation of a diagnosis.

The classic example would be choosing between a V/Q scan and CTPA when investigating patients assessed as having high or low risk of pulmonary embolism. Barring issues with radiation and renal dysfunction (in the use of intravenous contrast), understanding the results of PIOPED and PIOPED II (with the associated pre-/post-test probabilities and likelihood ratios) should guide physicians in selecting the test that will provide a result the physician is willing to accept, without wasting time and money, or incurring unnecessary risk, on a test that is not going to be helpful.

Perhaps we are too used to pressing a button and getting a seemingly definitive answer without the statistical caveats attached to the test result. This is a common illusion we are all guilty of entertaining.

Understanding the role of medical statistics, and accepting that no single test is definitive without caveats, is an important step back on the path to practising good medicine.

**Competing interests:**
No competing interests

Test is positive.

Risk in population with positive test is 1 in 50.

Risk of having disease in a particular patient with positive test is 95%.

**Competing interests:**
No competing interests

Actually, I agree with John Lowe, as I realised the minute after I fired off my last response. Probability and uncertainty are difficult, especially when predicting the future!

But in fact that does mean we are both on the same page in the real world: real doctors have serious matters to deal with, and will need a whole new way of thinking to grasp the real relevance of risk!

Gerd Gigerenzer thought natural frequencies would be the solution, but I think pictures and fuzzy thinking, not precision maths, are probably the way to go.

**Competing interests:**
Having a life

It is possible to observe directly the proportion of patients with a symptom or other finding who turn out to have a diagnosis. There is no need to ‘calculate’ it as the ‘positive predictive value’ from the ‘sensitivity’, ‘specificity’ and ‘false positive rate’, which are confusing terms that are alien to doctors. Doctors use symptoms and other findings by considering their differential diagnoses, often ranking the diagnoses in order of the frequency with which they occur in the finding. We then look for other findings that occur commonly in one or more diagnoses and rarely or never by definition in others. By doing this we try to assemble a combination of findings within which the frequency of patients with one of the differential diagnoses is high.

The specificity, false positive rate and likelihood ratio have never been a part of differential diagnostic reasoning. There is also a misconception that the likelihood ratio (sensitivity divided by ‘one minus the specificity’) is a constant for any test result that can be applied in any clinical setting. This is wrong [1]. The resulting dubious practice of multiplying the prior odds by the likelihood ratio to give posterior odds should be dropped in favour of traditional differential diagnostic reasoning, which is what doctors actually do day in, day out. This will allow tests for use in diagnosis to be assessed sensibly.
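Dr Llewelyn argues against this practice; for readers unfamiliar with it, the following Python sketch simply shows the arithmetic being criticised, applied to the article's hypothetical test (prevalence 1 in 1000, 5% false positive rate; a sensitivity of 1.0 is an assumption for illustration). Within a single study population the odds form reproduces the directly observed proportion:

```python
prevalence = 1 / 1000
sensitivity = 1.0          # assumed for illustration; the article gives no value
specificity = 0.95         # ie a 5% false positive rate

lr_pos = sensitivity / (1 - specificity)    # likelihood ratio of a positive test ≈ 20
prior_odds = prevalence / (1 - prevalence)  # 1/999
post_odds = prior_odds * lr_pos             # ≈ 20/999
post_prob = post_odds / (1 + post_odds)     # ≈ 0.0196, about 1 in 51
```

This is the same "about 1 in 51" obtained by counting true and false positives directly, so the disagreement in the thread is about clinical usefulness, not arithmetic.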

Reference:

1. Llewelyn H, Ang AH, Lewis K, Abdullah A. The Oxford Handbook of Clinical Diagnosis, 3rd edition. Oxford University Press, Oxford, 2014, pages 625 to 627.

**Competing interests:**
No competing interests

**24 September 2014**

Thanks to Sam Lewis for his wry observation and valiant attempt to boil the problem down to a manageable clinical compote, but . . . I have to disagree. The true probability does NOT lie between 1 in 50 and 1 in 51, as he says, but rather between 0 in 50 and 1 in 51. To make things more concrete, take the former case, ie when the test in question has zero sensitivity (implausible in practice but perfectly conceivable in theory). Then none of our positive test results will be true, ie none will include the one unlucky person in a thousand who has the disease, no matter how many times we repeat the test. Indeed, under these conditions a negative result would be more predictive than a positive one: if we exclude all 50 positives (all false), then 1 of the 1000 - 50 = 950 negatives will have the disease, a reverse 'sensitivity' of 1 in 950, which is slightly better than chance, at 1 in 1000.
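The bounds described above can be made explicit by treating the unstated sensitivity as a free parameter; a minimal Python sketch (`ppv` is an illustrative helper, not from the response):

```python
def ppv(sensitivity, n=1000, diseased=1, fp_rate=0.05):
    """Positive predictive value for the hypothetical 1-in-1000 test."""
    true_pos = sensitivity * diseased
    false_pos = fp_rate * (n - diseased)   # 49.95 false positives
    return true_pos / (true_pos + false_pos)

zero_sens = ppv(0.0)   # 0.0: with zero sensitivity, no positive is ever true
full_sens = ppv(1.0)   # ≈ 0.0196, about 1 in 51 with perfect sensitivity
```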

**Competing interests:**
No competing interests

Sorry to be picky, but surely the precise answer here is 1 in 51? The author gives no sensitivity for the hypothetical test. Assuming it is decent and our 1 in 1000 "diseased" patient would be correctly identified by a true positive test result, one should not count this in the 5% false positive rate. There will be 51 test-positives in a room of any 1000 people: 1 true positive and 50 false positives. The chance that the test-positive patient truly has the disease (the positive predictive value) would then be 1 in 51.

**Competing interests:**
No competing interests

## Re: Risky business: doctors’ understanding of statistics

Christopher Martyn highlights the lamentable state of knowledge of doctors with regard to diagnostic tests[1]. Since we can assume that doctors are intelligent beings, and that intelligent beings only learn what is useful and don’t recall what they don’t use regularly, we might surmise that, despite the protestations of the medical statisticians, knowledge of the characteristics of diagnostic tests is not a prime requirement of a practising doctor. Spurred by Dr Martyn’s article we recently performed an informal email survey of 18 local GPs, of whom 16 responded. We asked them about two diseases that are managed by GPs, often without the input of hospital specialists: asthma and type II diabetes.

We asked the GPs how they diagnosed the disease. For asthma, most followed the BTS/SIGN guidelines but there was no single test to make a clinical decision, and clinical history and examination are just as important as investigations such as spirometry. For type II diabetes some used tests based on glucose and some the more recent test based on HbA1c%. Only two GPs, one for asthma and one for diabetes, knew the sensitivity and specificity of the tests they were using. However, the general message was that these diseases are diagnosed by a combination of tests, and it would be difficult to know what the sensitivity and specificity of the combination would be, which may help explain the GPs’ ignorance of, and lack of interest in, the values.

Dr Martyn’s article was based on a survey of doctors in Boston[2]. We wondered how useful the paradigm used to test the doctors in Boston would be for GPs in the UK. Essentially, given the sensitivity, specificity and prevalence, one should be able to work out the positive predictive value. We have shown that most GPs in our (albeit small) survey do not do this in practice. The problem, as highlighted by Dr Martyn, is particularly acute when the prevalence is low, since even highly sensitive and specific tests have low positive predictive value, which is commonly overestimated. However, it is unclear what the relevant prevalence or pre-test probability is in these circumstances.

Thus GPs are likely to fail the same test as the Harvard doctors, for a number of reasons:

1) Diagnosis is based on a combination of tests and clinical examination and there is little research based on the sensitivity and specificity of the combination of different examinations as opposed to a one-off test, which is why GPs are unlikely to know the values.

2) It is unclear what is meant by the prevalence of asthma or diabetes for these GPs. It is not the proportion of people in the population with the disease, but rather the proportion of people who come to consult who have the disease (perhaps with similar age and clinical history). This proportion is likely to be quite high and so the issue of overestimating the positive predictive value is less important.

3) The prevalence of the disease will also depend on the severity of the disease being tested for and so this also muddles the calculations.

In conclusion, we feel that testing doctors on the interpretation of diagnostic tests, whilst presenting them with an intellectual challenge which one feels they should be capable of answering, does not reflect the day-to-day reality of many doctors’ lives, which may be why many doctors do it poorly. Given the lack of data in practice on pre-test probability (indeed, even simple prevalence data are difficult to get hold of) and on the sensitivity and specificity of the test, it is not possible to do the calculation.

One is left with the unsatisfactory combination of scientific principles and intuitive guesswork, what David Sackett referred to as “the science of the art of the clinical examination”[3].

MJ Campbell, Professor of Medical Statistics

I Rotherham, Medical Student

C Sefton, Medical Student

J Dickson, GP and NIHR Clinical Lecturer

References

1. Martyn C. Risky business: doctors’ understanding of statistics. BMJ 2014; 349:g5619 doi: 10.1136/bmj.g5619

2. Manrai AK, Bhatia G, Strymish J, Kohane IS, Jain SH. Medicine’s uncomfortable relationship with math: calculating positive predictive value. JAMA Intern Med 2014; 174:991-3.

3. Sackett DL, Rennie D. The science of the art of the clinical examination. JAMA 1992;267:2650-2.

**Competing interests:**
No competing interests

**05 November 2014**