# Sifting the evidence—what's wrong with significance tests? Another comment on the role of statistical methods

BMJ 2001;322 doi: https://doi.org/10.1136/bmj.322.7280.226 (Published 27 January 2001). Cite this as: BMJ 2001;322:226

## All rapid responses

*The BMJ* reserves the right to remove responses which are being wilfully misrepresented as published articles.

Understanding statistical significance testing

Editor – Few readers of the BMJ appear to understand statistical significance testing(1, 2). This is not surprising. Imagine a patient with a BP of 170/110 asking a doctor for the probability that the ‘hypertension’ would be replicated, that is, that the reading would again be higher than 150/90 if the measurement were repeated. Instead of being told ‘about 95%’, say, the patient is told that if we assume the blood pressure was really 120/80, then the likelihood of the observed BP being 170/110 or higher by chance would be 4%. The BP of 120/80 is analogous to a null hypothesis and the likelihood of 4% is analogous to a P value.

In a crossover trial, 14 out of 19 patients respond better to drug A than to drug B. The ‘P value’ is calculated using the binomial theorem by imagining 19 patients selected at random from a hypothetical population in which ‘A is better than B’ in 0.5 of cases (the null hypothesis). The proportion of such samples showing the observed result of 14/19, or a more extreme one (15/19 up to 19/19), would be 3.18% (i.e. P = 0.0318).
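The arithmetic is easy to check. A minimal sketch in Python (my own illustration, not from the letter), using only the standard library:

```python
from math import comb

def binom_tail(n, k_min, p):
    """P(X >= k_min) for X ~ Binomial(n, p): the one-sided tail probability."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# Under the null hypothesis, 'A is better than B' for each patient with p = 0.5.
# The chance of 14 or more of 19 patients favouring A is the quoted P value:
p_value = binom_tail(19, 14, 0.5)
print(round(p_value, 4))  # 0.0318
```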

A doctor might instead wonder about the probability of replication(3), i.e. of drug A still proving better than drug B if the study were repeated with the same numbers (i.e. a result of 10/19 or 11/19, up to 19/19). We can estimate this by selecting 19 patients from a hypothetical population, made up of patients from pooled studies with a result of 14/19, in which the proportion responding better to A is 0.737 (= 14/19). Using the binomial theorem again, the proportion of repeat studies with a result of 10/19 or over is 96.06%: the ‘probability of replication’ in this case.
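This too can be checked directly with a short Python sketch (again my own, with illustrative variable names). On exact calculation the tail from 10/19 upwards comes to about 98.7%, while the quoted figure of 96.06% corresponds to the tail beginning at 11/19:

```python
from math import comb

def binom_tail(n, k_min, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

p = 14 / 19  # response proportion in the hypothetical pooled population

print(binom_tail(19, 11, p))  # tail from 11/19: about 0.9606
print(binom_tail(19, 10, p))  # tail from 10/19: about 0.9870
```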

Note that this probability of replication of 96.06% is similar to 1 – P = 100 – 3.18 = 96.82%. This is the case when ‘replication’ means getting a similar result again, in the special case that just excludes the null hypothesis. The probability of replication would rise or fall conditional on other factors, such as the specified replication limits (these may be confidence limits), the accuracy of the study’s description, and the similarity of the patients and geographical areas.

Replicating clinical findings and study results is familiar to doctors and scientists. To understand something, we have to use models based on our own familiar experiences. Statistical hypotheses are concerned with replication. Scientific hypotheses are based on much imagination and draw on other models too, not necessarily statistical ones(4).

D E H Llewelyn

Consultant Physician

KRUF Centre, West Wales General Hospital, Carmarthen SA31 2AF

1. Editor’s choice. BMJ 2001;322:0 (27 January).

2. Sterne JAC, Smith GD. BMJ 2001;322:226-231 (27 January).

3. Llewelyn DEH. Assessing the validity of diagnostic tests and clinical decisions. MD thesis, University of London, 1988.

4. Llewelyn DEH, Hopkins A. Editors’ introduction. In: Analysing how we reach clinical decisions. London: Royal College of Physicians Publications, 1993.

**Competing interests: **
No competing interests

I wish to commend Sterne et al for an enlightening article. However, I wonder if they were perhaps a bit too critical of epidemiology. Sterne and colleagues appropriately recognize that there is too much emphasis on P values (and confidence intervals, for that matter), and that bias and other deficiencies in study design are far more important as causes of spurious results than chance is. I am an epidemiologist, as are Sterne and his co-authors; epidemiology and clinical research are therefore the fields we know best.

I suspect that in other fields, whether they be nuclear physics, econometrics or genomics, there is as much controversy, contradiction and disagreement at the "bleeding edge" of research as there is in epidemiology. What makes epidemiology unique is that epidemiologists study outcomes such as cancer and coronary artery disease, for which most of us are at risk, and risk factors such as diet, smoking and physical activity, which affect most of our daily lives. Furthermore, on the surface, epidemiologic results are more easily explained to the general public than are the results of the latest experiment in quantum mechanics. This almost guarantees that the latest provocative hiccough from an epidemiologist's database will be published in a high-visibility medical journal and will ultimately end up as a story on the evening television news.

So the problem of discrepant results is probably not unique to epidemiology; the visibility of epidemiologic controversy is what makes it unique.

**Competing interests: **
No competing interests

Sir

Ever obedient to your magisterial injunctions in Editor's Choice, I essayed an ascent of the dizzy heights of Sterne et al's article but came to grief in the foothills.

The obsessional concern with abstruse mathematics that is so apparent in the professional press puts me in mind of the futile conjecturing of generations past concerning the number of angels who could stand on the head of a pin.

As a model of clarity I commend to you the style of the "Improvised Munitions Handbook" (Headquarters, Department of the Army. TM 31-210. Sections 13-14), in which the authors append a comment for each material: "This material was tested. It is effective."

All else is recondite, if elegant, but ultimately fruitless ratiocination.

Yours sincerely

Steven Ford

**Competing interests: **
No competing interests

Could you please consider providing an assessment of the “likelihood” approach(1) for your readers? (See Thomas Perneger’s response to last week’s article(2) on significance testing.) I can’t imagine that many researchers would want to see another P value if they knew about the likelihood approach. Suitable software to implement the methods routinely is another matter.

The likelihood approach can interpret some studies that are difficult to interpret under the frequentist or Bayesian approaches, such as a clinical trial that is stopped early because the results “look good”. It avoids the Bayesian requirement to specify prior beliefs (vague prior beliefs cannot be accurately represented by a necessarily precise probability distribution(1)) and avoids some of the frustrations of the frequentist approach(1), e.g. being unable to collect a little extra data to clarify your result.
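For readers curious what the approach looks like in practice: under the "law of likelihood", evidence is measured by the ratio of the likelihoods of two hypotheses rather than by a tail probability. A minimal Python sketch (my own illustration, reusing the 14-of-19 crossover example from an earlier response, not an example from Royall's book):

```python
from math import comb

def binom_likelihood(n, k, p):
    """Likelihood of success probability p given k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, k = 19, 14  # 14 of 19 patients responded better to drug A

# How strongly do the data support p = 14/19 over the null value p = 1/2?
lr = binom_likelihood(n, k, k / n) / binom_likelihood(n, k, 0.5)
print(round(lr, 1))  # about 9.2
```

No significance level or prior distribution is involved; the ratio itself is the measure of the strength of the evidence.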

I think that Royall’s book (mentioned by Thomas Perneger) is a concise and readable introduction to this approach for people of modest statistical literacy.

(1) Royall RM. Statistical Evidence: A Likelihood Paradigm. Chapman & Hall/CRC, 1997.

(2) Sterne JAC, Smith GD. Sifting the evidence – what’s wrong with significance tests? BMJ 2001;322:226-231.

**Competing interests: **
No competing interests

I can agree with much of what Sterne and Davey Smith say ('Sifting the evidence...' BMJ 2001;322:226-231), but I think the problems would be better rephrased in a more combative way.

The reason why the BMJ is increasingly full of articles about statistics, and thinks medics are deficient in statistics (à la Richard Smith), is that the BMJ confuses statistics with natural science. It is why the journal will soon be of more use to NHS managers than to clinicians.

The confusion between a statistical hypothesis and a scientific one is widespread. A scientific theory does not have an infinite distribution in the sense of probability theory. Newtonian physics is either right or wrong. There is not an infinite array of Newtonian theories merging with those of Einstein. The two theories are qualitatively different. Each theory orders reality in a discontinuous way. They are not summaries of reality, nor can they be considered to have errors in a statistical sense. You don't do a systematic review of the Ptolemists and Copernicus and then draw a Cochrane plot. The planets either move in a certain space-time position or they don't.

The view of a scientific theory as something that brings coherence to nature, as a revealer of 'hidden likenesses', has little to do with probability theory, nor is it compatible with the sort of 'pop' Bayesian theorising we are so often subjected to. David Hume described the fatal weakness of inductionism a long time ago, and even if the logical positivists forgot or pretended to forget his reasoning, we shouldn't. In a clinical context, the idea that you can take, say, ten thousand people with a stroke, randomise them to either of two treatments, and expect to get sense at the end is naïve. It is effectively non-scientific, or at least very low-level operations research. If you engage in this sort of random intervention you should expect little sympathy, nor be surprised that when the experiment (sic) is repeated the results are different.

The argument that Sterne and Davey Smith make about the importance of the power of testing therefore has little to do with science. You may argue that you should accept a lower P value, but this of course means that some experiments will not be done, and it is in reality beside the point. Indeed, a steady trickle of letters to mainstream medical journals over the last twenty years has pointed out how doing small studies is unethical. This faulty logic is now institutionalised by every ethics committee in the land. Should you do one study of a single hypothesis on a hundred patients, or test a larger number of hypotheses on smaller numbers? How do you choose? Certainly not by performing power calculations. Instead I would argue that the chief consideration is the strength of the scientific hypothesis guiding the experiment. P values are not markers of truth. If they were, we would have invented an epistemological engine. We haven't, nor can we, for reasons that Popper laid out in formal language but that are really intuitively obvious. If you genuinely go into a trial not knowing whether your intervention is going to have an effect, I argue you shouldn't be doing the trial in the first place. In a peculiar reversal of the commonplace mantra, clinical trials should only be done when you are pretty certain of the answer.

Statistical theory has made major contributions to many areas of biological and physical science. The RCT remains a powerful technique in medical research. What brings statistics into disrepute within mainstream medicine is the attempt to separate those with an understanding of biology and patients from those who go around adding up other people's P values. It is quite noticeable that wherever there is heated discussion about the statistics on a topic, little progress is made. By contrast, genetics is full of statistical theory, but with few exceptions the results published in genetics journals are robust, simply because those involved understand both statistics and the use of statistical testing to test scientific hypotheses. This is true of engineering and a whole host of sciences, but the same cannot be said of many clinical trials or much epidemiology. Or do we really imagine that those busy systematic souls at York and Oxford will now do systematic reviews of quantum mechanics, linguistics, evolutionary genetics and so on, and expect to reveal the structure of nature? No, of course not. The idea that averaging the data is a way to understand nature would be laughable if it did not do so much harm to both genuine clinical discovery and patient care. Perhaps they should work on turbulence.

Professor Jonathan Rees FMedSci

Dermatology,

University of Edinburgh

jonathan.rees@ed.ac.uk

**Competing interests: **
No competing interests

Dear Editor,

I fully agree that P values are grossly misunderstood. In this letter I want to make a few comments, for both authors and readers, on the statistical significance of an association.

First, statistical significance should never be viewed as a clear-cut yes or no statement, but rather merely as a guide to action. A statistically significant result does not mean that chance cannot have accounted for the findings, only that such an explanation is unlikely. Similarly, a result that is not statistically significant does not mean that chance is responsible for the results, only that it cannot be excluded as a likely explanation. The absolute magnitude of the P value, as well as the contribution of sample size as seen from the confidence interval, must be considered in interpreting the utility of the results.

Second, the presence of a statistically significant association provides no information about whether the exposure under study is itself responsible for the observed effect. While a significant P value indicates that chance is an unlikely explanation of the findings, it cannot assess the adequacy of the study design or evaluate the possibility that the results may be attributable to bias or confounding. Conversely, the absence of statistical significance does not imply that the association cannot be one of cause and effect. It may merely mean that the sample size was inadequate to exclude chance as a likely explanation of the findings.

Finally, statistical significance cannot address whether the differences observed are important for health or longevity, in other words whether they have any biologic importance. The British statistician Sir Austin Bradford Hill referred to such undue emphasis on the results of tests of statistical significance when he stated that all too often "the glitter of the t-table diverts attention from the inadequacy of the fare".

Regards,

Zubair Kabir

References:

Sterne JAC, Smith GD. Sifting the evidence – what's wrong with significance tests? BMJ 2001;322:226-231.


**Competing interests: **
No competing interests

Dear Editor

Congratulations on a particularly fascinating and thought provoking edition of the BMJ. Jonathan Sterne and Professor George Davey Smith (pp 226ff) illustrate the fallibility of statistical tests of significance. Socially responsible people are shocked that government promotes a dogma and then finds evidence to back it (Sally Macintyre et al, pp 222ff). It seems that a guideline may be considered good, bad, irrelevant, wrong, premature or tardy depending on who is speaking (vigabatrin guidelines, pp 236ff).

Colleagues, let us mistrust everything. Whereas once I thought this a cynical cop-out, I now realise that it is an intellectually respectable stance, meeting Professor Davey Smith's explanation of a Bayesian position on statistical truth. If I understand him correctly, this means: this is what I think I know; now let's see if you lot can shake my view.

Given that the validity and statistical robustness of evidence is so fragile, what else are practitioners to do? Certainly not put our trust in consensus statements, such as those emanating from NICE. Though these may be an improvement on "Tendentious Opinions Selectively Heralded", or TOSH, their advice may yet, it seems, prove to be nothing more than "Current Right Advice... Probably", ie CRAP.

In any case, such edicts are relics of the time when, by Bolam's test, it was a defence against complaint to appeal to an agreed body of professional opinion. With evidence as contentious as your contributors highlight, the test of "best current opinion" becomes illusory. There will always be another expert opinion a standard deviation away. So practitioners remain at risk of being DUMPED, ie Disgraced Under Motives Political, Expedient or Duplicitous. Or even worse, being SUED, or Savaged Unless Expectations Delivered.

There is an irony in this. As evidence becomes devalued and more relativistic, "arriving at a considered judgement" becomes more important. Such judgement used to be called professionalism. It predated guidelines, until it was undermined by the joint efforts of the GMC (or "Get Ministers Clapping") and the Department of Health, TDOH (or "Targeting Doctors Offers Headlines").

In my view there is an urgent need for a new institute to support those of us hoping to preserve professional medical practice, where we try our best to do our best for our patients, taking into account their individual circumstances and wishes, while drawing on our own experience of practising medicine. The institute will not deny the need for research nor the importance of the P value. It will embrace the controlled trial yet not dismiss the n=1 study. It will accept guidelines, but only as aide-memoires and not gospel. Since everyone else has an institute with a heart-warming title, may I propose The Institute for Perfect Understanding Seldom Happens, Opinion Flirts with Facts. "PUSH OFF" will do nicely!


Yours not entirely tongue in cheek

Dr Michael Apple

GP

Garston Medical Centre, Watford WD25 9GP

e-mail m.apple@virgin.net


**Competing interests: **
No competing interests

EDITOR – Sterne and Smith(1) addressed the problems of the perennially perplexing P value, though their advice to the authors of research papers regarding P values may make little difference to the reader. In my survey of 100 junior doctors, only 39% knew what 'P value' meant! Other statistical terms were understood even less well: 'power' by 22% and 'type 1 error' by only 12%. In view of these results, the readers rather than the authors of research may be accused of needing statistical education.

It has been suggested that, rather than attempting to teach all medical undergraduates analytical methods, they may benefit more from understanding basic concepts(2) appropriate to reading research papers. Perhaps postgraduate education should focus on methods of data analysis and presentation of results that are more easily understood by the rest of us.

Guy Nash

Research Fellow, Hammersmith Hospital, London

guy.nash@ic.ac.uk

1. Sterne JAC, Smith GD. Sifting the evidence – what's wrong with significance tests? BMJ 2001;322:226-231.

2. Appleton DR. What statistics should we teach medical undergraduates and graduates? Statistics in Medicine 1990;9(9):1013-21.

**Competing interests: **
No competing interests

When I trained as a statistician at Imperial College in the early 1950s, I learnt that in his first book Fisher treated p=0.05 as grounds for repeating an experiment and p=0.01 as the minimum level for publishing.

At the time he was working at Rothamsted, trying to bring some order to a mass of undigested data. Furthermore, a "p=0.01" published result would typically lead to further tests by real farmers under real farming conditions. It was most emphatically not a ground for assuming the universal truth of the experimental result.

**Competing interests: **
No competing interests

## Nations’ subconscious and lifestyle scares

Sir,

Sterne and Smith (ref.) describe medical research as incrementally contributing to an existing body of knowledge, an approach that leads naturally to Bayesian statistics.

Consequently, popular reactions and beliefs might be used as prior evidence of plausibility for research on epidemiological topics, such as those contributing to "lifestyle scares" (mobile phones and tumours, depleted uranium and leukaemias, etc.). The authors point out that this kind of "subconscious Bayesianism", albeit cynical (sceptical), can often prove quite rational.

In my opinion, what holds true for Great Britain and northern Europe does not hold for Latin countries, where popular reactions are rather more emotional. Although I favour a Bayesian approach to statistical inference, I wonder how pieces of prior evidence can be used with the same effectiveness in different parts of Europe when research is done on "lifestyle scares". Should "subconscious Bayesianism" be submitted to euro-regulation in the future?

Sterne JAC, Smith GD. Sifting the evidence – what's wrong with significance tests? BMJ 2001;322:226-31.

Giuseppe Giocoli

**Competing interests: **
No competing interests

11 March 2001