# P values

BMJ 2008; 337 doi: https://doi.org/10.1136/bmj.a201 (Published 24 September 2008) Cite this as: BMJ 2008;337:a201

## All rapid responses

*The BMJ* reserves the right to remove responses which are being wilfully misrepresented as published articles.

As Michael Campbell writes, this is well worn ground and I wouldn’t disagree with any of the correct interpretations and explanations above, though some of it is a little technical for the average reader.

A clinician’s interest is often in whether a positive study result is a chance finding or real. We are usually less interested in negative results (though we shouldn’t be). RA Fisher, the statistician who proposed hypothesis testing, saw the P value as showing the strength of evidence against the null hypothesis. Sterne and Davey Smith repeat this advice in their summary points: “When there is a meaningful null hypothesis, the strength of evidence against it should be indexed by the P value. The smaller the P value, the stronger is the evidence” (1).

Interpreting the P value as the precise probability that the null hypothesis is true will be misleading if the P value is large and if other evidence is ignored, but is unlikely to cause much problem if the P value is very small. Clinicians should not put much store by large P values nor interpret studies in isolation. This does cover most situations and is, I believe, easier to remember.

1. Sterne JAC, Smith GD. Sifting the evidence – what’s wrong with significance tests? BMJ 2001;322:226-31.

Competing interests: None declared

Whilst of course Harper Gilmour is strictly correct in his frequency definition of a P value, I can’t help feeling that we are going over old ground. Much of what has been said is covered by Sterne and Davey Smith (1) in the BMJ. The problem is that the frequentist approach just doesn’t reflect what researchers do and, I suspect, is misunderstood by the majority of researchers.

Jennifer Baker is right to reference Steven Goodman’s papers, where he advocates the use of Bayes factors instead of P values. The advantage of these is that, if one is prepared to specify to what extent one believes the null hypothesis, the Bayes factor will tell you how the data change this belief. Dr Baker also mentions that the P value also has a Bayesian interpretation. We discussed this in a paper some time ago (2). One point is that it is a one-sided P value that has this interpretation. Thus in Dr Gilmour’s example the one-sided P value is 0.5, and so with a vague prior, having collected the data and found the two means to be the same, there is an even chance whether or not the null hypothesis is true. Whilst I believe we should still teach the frequentist definition, I don’t think we should blind ourselves to these other approaches, not least because the frequentist approach is so open to criticism.

1. Sterne JAC, Smith GD. Sifting the evidence – what’s wrong with significance tests? BMJ 2001;322:226-31.

2. Burton PR, Gurrin LC, Campbell MJ. Clinical significance not statistical significance: a simple Bayesian alternative to P values. J Epidemiol Community Health 1998;52:318-23.
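The point about the one-sided P value can be checked with a minimal numerical sketch. The standard error below is an illustrative number of mine, not from the letter, and a normal approximation is assumed: with a flat (vague) prior, the posterior probability that the true difference lies on the null side of zero is numerically the one-sided P value.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution

d_obs = 0.0   # equal sample means, as in Dr Gilmour's example
se = 3.58     # standard error of the difference (illustrative value)

# One-sided frequentist P value for the alternative "true difference > 0".
p_one_sided = 1 - z.cdf(d_obs / se)

# With a flat prior, the posterior for the true difference is
# Normal(d_obs, se^2), so the posterior probability that it lies on the
# null side (<= 0) is the same number.
posterior_prob_null_side = z.cdf((0 - d_obs) / se)

print(p_one_sided, posterior_prob_null_side)  # both 0.5
```

With equal means the observed difference is zero, so both quantities are 0.5: an even chance, as the letter says.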

Competing interests: None declared

**15 October 2008**

In the notes accompanying the answer to this question Fletcher states that it will not matter much if a P value is interpreted as the probability that the null hypothesis is true. It can be seriously misleading to interpret a P value in this way, especially in relation to an underpowered study.

Suppose we are interested in a comparison of means from two independent samples where an independent samples t test might be appropriate. For example, suppose the variable of interest were systolic blood pressure (SBP) and two groups of 10 subjects both yielded identical means of 140 mmHg and within-group standard deviations of 8 mmHg. Since the sample means are equal, the value of the t statistic is 0 and the P value is 1.0. With Fletcher’s interpretation, the probability that the null hypothesis is true is 1.0; in other words it has been proven to be true and (if these data arose from a small clinical trial) the two treatments under comparison have been proven to be equivalent.

This of course is not the correct interpretation of the results of this study, as can be seen from the 95% CI for the difference between the two population means, which is from –7.5 to +7.5 mmHg. Thus although the best estimate of the difference in population means is 0, we cannot rule out a difference of up to 7.5 mmHg in either direction. In the context of a trial comparing two antihypertensive drugs, a difference of this magnitude would be of clinical importance. The correct interpretation of this study is that it is inconclusive, and that a larger study with adequate power to detect clinically important differences is required. Thus, following Fletcher’s advice would clearly lead to a serious misinterpretation of the results. Although this is an extreme example, the same misinterpretation could occur in relation to any underpowered study.
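Gilmour’s figures are easy to verify from the summary statistics alone. A short sketch, using the pooled-variance formulas and taking the 97.5th percentile of the t distribution on 18 degrees of freedom as 2.101 (from tables):

```python
import math

n1 = n2 = 10
mean1 = mean2 = 140.0   # mmHg
sd = 8.0                # within-group standard deviation, both groups

# Pooled SD equals 8 here because both groups have the same SD.
se_diff = sd * math.sqrt(1 / n1 + 1 / n2)

# Independent-samples t statistic; zero, so the two-sided P value is 1.0.
t_stat = (mean1 - mean2) / se_diff

# 95% CI half-width for the difference in means, with
# t(0.975, 18 df) = 2.101 taken from statistical tables.
half_width = 2.101 * se_diff

print(t_stat)                 # 0.0
print(round(half_width, 1))   # 7.5 -> CI is (-7.5, +7.5) mmHg
```

The interval of ±7.5 mmHg, not the P value of 1.0, is what tells us the study is inconclusive.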

The BMJ has for many years been active in promoting the statistical education of its readers and has published many excellent articles by medical statisticians, such as the Statistics Notes series, one of which is particularly relevant to this issue (1). In view of this it is disappointing to find such a grossly misleading statement appearing in an item intended to promote statistical education.

Reference

1. Altman DG, Bland JM. Statistics notes: absence of evidence is not evidence of absence. BMJ 1995;311:485.

Competing interests: None declared

But test performance has nothing to do with prior disease probability. Negative and positive predictive values do, but that’s different. Test performance, in the usual sense, is determined by sensitivity and specificity.

Competing interests: None declared

Paolo Tomasi states that a test will ‘perform much worse’ in a GP’s surgery than in a tertiary referral centre, because the prior probability of disease is lower, and so the predictive value of a positive result is correspondingly reduced. But that’s only half the story: the negative predictive value will be higher in the GP setting, and very often the GP will be using the test to exclude disease. So the test is not performing worse - it’s just performing differently.
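Both halves of the story fall out of a short Bayes’ theorem calculation. The sensitivity, specificity, and prevalences below are illustrative numbers of mine, not from the letters: the same 90% sensitive, 90% specific test at 1% prevalence (a hypothetical GP setting) and at 50% prevalence (a hypothetical referral centre).

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values via Bayes' theorem."""
    tp = sensitivity * prevalence            # true positives
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    fn = (1 - sensitivity) * prevalence        # false negatives
    tn = specificity * (1 - prevalence)        # true negatives
    return tp / (tp + fp), tn / (tn + fn)

# Same test, two settings.
ppv_gp, npv_gp = predictive_values(0.90, 0.90, 0.01)    # GP surgery
ppv_ref, npv_ref = predictive_values(0.90, 0.90, 0.50)  # referral centre

print(round(ppv_gp, 3), round(npv_gp, 4))    # 0.083 0.9989
print(round(ppv_ref, 3), round(npv_ref, 3))  # 0.9 0.9
```

At low prevalence most positives are false (PPV about 8%), but a negative result is almost conclusive (NPV above 99.8%): worse for ruling in, better for ruling out.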

But he’s right that there is a lot of ignorance out there concerning the effect of prior probability (or disease prevalence) on test performance. It’s something we haven’t taught our medical students in the past, and although some of us are doing our best to remedy that, it will take a while. Even more importantly, we should be teaching the public (and the politicians and the editors of the popular press) a few of the basic principles underlying the use of clinical tests, especially in the screening context.

But I’ll leave that for the next generation - life’s too short.

Competing interests: None declared

ENDGAMES: John Fletcher. P values. BMJ 2008;337:a201

The correct definition is option (e): a P value is the probability of obtaining data as extreme, or more extreme, than those observed if the null hypothesis is correct. This is a frequentist definition, but is also correct in a Bayesian interpretation. An excellent pair of articles by Goodman [1,2] in the Annals of Internal Medicine in 1999 explains these concepts well.
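The frequentist definition can be demonstrated directly by simulation: generate many studies in which the null hypothesis is exactly true and count how often the test statistic comes out at least as extreme as the one observed. The observed t of 2.0 and groups of 10 below are illustrative assumptions.

```python
import random
import statistics

random.seed(1)

def two_sample_t(x, y):
    """Pooled-variance t statistic for two independent samples."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * statistics.variance(x)
           + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / (sp2 * (1 / nx + 1 / ny)) ** 0.5

# Pretend the observed study gave t = 2.0 with 10 subjects per group.
t_observed = 2.0

# Simulate studies where the null is true: both groups drawn from the
# same normal distribution.
n_sims = 20_000
as_extreme = 0
for _ in range(n_sims):
    x = [random.gauss(0, 1) for _ in range(10)]
    y = [random.gauss(0, 1) for _ in range(10)]
    if abs(two_sample_t(x, y)) >= t_observed:
        as_extreme += 1

p_empirical = as_extreme / n_sims
print(p_empirical)  # close to the two-sided P for t = 2.0 on 18 df (~0.06)
```

The empirical proportion converges on the tabulated two-sided P value: the P value is a statement about hypothetical repetitions under the null, not about the null itself.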

Clinicians (and other non-statisticians) should not be encouraged to believe, or taught, anything else. It leads to continuing poor interpretation of clinical trials, especially in the regulatory field.

Jennifer Baker MSc (medical statistics) MBBS, grad. dip. Public Health

1. Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Ann Intern Med 1999;130:995-1004.

2. Goodman SN. Toward evidence-based medical statistics. 2: The Bayes factor. Ann Intern Med 1999;130:1005-13.

Competing interests: None declared

Indeed, one of the most common mistakes in clinical medicine is that many doctors are not aware that a diagnostic test performs in a very different way if the a priori probability of disease changes. This has important practical consequences: for example, where the a priori probability is low (e.g. GP surgeries), a test will usually perform much worse than in situations of high a priori probability (e.g. a tertiary referral centre full of competent specialists). In the GP surgery many of the positives will be false positives, whereas in the tertiary centre most will be true positives, using the same test with the same sensitivity and the same specificity.

GPs have a hard diagnostic life!

Competing interests: None declared

Fletcher asserts that the probability that the null hypothesis is true is frequently assumed to be the practical interpretation of a P value, and that while not strictly correct it will not matter much when working to this assumption. The probability that the null hypothesis is true is, broadly speaking, the question that we want an answer to; unfortunately, incorrectly equating this probability to the P value matters a great deal under many circumstances.

The correct definition of a P value is the probability of obtaining data as extreme, or more extreme, than those observed if the null hypothesis is correct. As all good Bayesians know, the probability of observing x (the observed data) given y (the null hypothesis) is not the same as the probability of y given x (the incorrect interpretation of the P value). To answer the real question of interest based on a P value it is also necessary to know the prior probability that the null hypothesis is correct and the statistical power to detect an effect if the alternative hypothesis is correct. For a detailed discussion of these concepts see Wacholder et al (2004) (1).

The concepts are essentially the same as those needed to interpret the characteristics of a diagnostic test. The test characteristics are the sensitivity (equivalent to power) and specificity (equivalent to 1 minus the P value). The question that the diagnostician really wants answered is not the probability that the patient tests positive if he does not have the disease, but the probability that he does or does not have the disease if the test is positive. To answer that latter question it is also necessary to know the prevalence of the disease (equivalent to the prior probability). Assuming that 1 minus the specificity is the same as the positive predictive value would lead to serious errors in interpretation of the test, and the same applies to assuming the P value is the same as the probability that the null hypothesis is true. Perhaps the major problem in interpretation comes from over-estimation of the prior probability for the alternative hypothesis.

Reference

1. Wacholder S, Chanock S, Garcia-Closas M, El Ghormli L, Rothman N. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J Natl Cancer Inst 2004;96:434-42.
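The quantity Wacholder and colleagues discuss, the probability that the null hypothesis is true given a significant result, follows from Bayes’ theorem applied to the significance level, the power, and the prior. The priors and power values below are illustrative assumptions of mine:

```python
def prob_null_given_significant(prior_h1, alpha, power):
    """P(null true | significant result), a sketch of the
    false positive report probability framing."""
    prior_h0 = 1 - prior_h1
    # Total probability of a significant result: false positives from
    # true nulls plus true positives from true alternatives.
    p_sig = alpha * prior_h0 + power * prior_h1
    return alpha * prior_h0 / p_sig

# A significant result at alpha = 0.05 from a well-powered study of a
# plausible hypothesis is probably real...
low = prob_null_given_significant(prior_h1=0.5, alpha=0.05, power=0.8)

# ...but the same P value from an underpowered test of a long-shot
# hypothesis is more likely than not a false positive.
high = prob_null_given_significant(prior_h1=0.01, alpha=0.05, power=0.2)

print(round(low, 3), round(high, 3))  # 0.059 0.961
```

The same P value can thus correspond to a 6% or a 96% chance that the null is true, which is exactly why the P value alone cannot answer the question of interest.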

Competing interests: None declared

**26 September 2008**

## P-value and its role and limitations in biology disciplines

To answer the P value question: there could be more than one answer, depending on the field of study. In non-biological sciences, for example, the given answer might be very close to the truth. If we look at the P value from the clinical standpoint, however, this answer has some limitations. If a P value is non-significant, there is either no difference between groups or there were too few subjects to demonstrate such a difference (insufficient power), alongside other related statistical questions (time of follow-up, its duration, etc.). In biological systems there is a further limitation when we try to use a single cut-off point and make it the arbitrary threshold; we are then making the unnecessary assumption that treatment outcomes are dichotomous: drug success, yes or no (1).

1. Guyatt G, Jaeschke R, Heddle N, Cook D, Shannon H, Walter S. Basic statistics for clinicians: 1. Hypothesis testing. Can Med Assoc J 1995;152:27-32.

Competing interests: None declared

**28 October 2008**