Interpreting and reporting clinical trials with results of borderline significance
BMJ 2011; 343 doi: https://doi.org/10.1136/bmj.d3340 (Published 04 July 2011) Cite this as: BMJ 2011;343:d3340
All rapid responses
Hackshaw and Kirkwood's1 recent article addresses a difficult dilemma
for clinical researchers in reporting results of borderline significance.
They refer to our recent study of an intervention to improve secondary
prevention of coronary heart disease (CHD) in general practice2 which also
exemplifies how a trial may be confounded by changes in practice which are
contemporaneous with its execution. Our comparator group was the existing
standard management of CHD, which has improved over recent years, with
increased recognition of the value of disease registers, recall systems
and regular reviews. Hackshaw and Kirkwood highlight our problem: against this background, the effects of our added complex intervention components (with an emphasis on social cognitive theory and goal-setting by patients) could be expected to be smaller than anticipated when we planned our target sample size on the basis of previous data, with the consequent difficulty of obtaining a small P value from our findings. These are real difficulties in obtaining conclusive research evidence for the value of innovative interventions, as clinical researchers must endeavour to work within the context of ever-changing policy and practice. The appropriate interpretation of borderline results thus has great importance for policy making.
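As a minimal illustration of this difficulty (with purely hypothetical event rates and sample size, not those of our trial, and ignoring the clustered design), a standard power calculation shows how the power achieved by a fixed sample size falls once improvements in usual care shrink the achievable difference:

```python
# Illustration only: hypothetical event rates and sample size, ignoring the
# clustered design (clustering would reduce power further).
from scipy.stats import norm

def power_two_proportions(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    p_bar = (p1 + p2) / 2
    se = (2 * p_bar * (1 - p_bar) / n_per_arm) ** 0.5
    return norm.cdf(abs(p1 - p2) / se - norm.ppf(1 - alpha / 2))

# Planned on historical data: 30% v 20% event rates, 400 patients per arm
print(round(power_two_proportions(0.30, 0.20, 400), 2))   # ~0.90
# If routine care has already improved, the achievable contrast may halve
print(round(power_two_proportions(0.25, 0.20, 400), 2))   # ~0.39
```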
As noted previously, P-values may be considered a 'Quasimodo'
approach to statistical inference - ringing a bell to bring a result to
people's notice. A significant P value provides evidence that an effect exists but gives no impression of the likely magnitude of that effect. Interval estimation does provide such an estimate, from which a more practical conclusion can be drawn.
Given the perspective of Hackshaw and Kirkwood, we might have been 'softer' in reporting the absence of an effect of our intervention, although our conclusion was made in the context of finding no intervention effect on the other primary outcomes, analysed both as categorical and continuous measures. Of interest, we are about to undertake a six year follow-up of the study, which may help towards a definitive conclusion about the intervention's value.
1 Hackshaw A, Kirkwood A. Interpreting and reporting clinical trials
with results of borderline significance. BMJ 2011;343:d3340
2 Murphy AW, Cupples ME, Smith SM, Byrne M, Byrne MC, Newell J, et
al. Effect of tailored practice and patient care plans on secondary
prevention of heart disease in general practice: cluster randomised
controlled trial. BMJ 2009;339:b4220
Competing interests: AW Murphy holds unrestricted educational grants from MSD, Pfizer and Mepha.
Dear Editor
Hackshaw and Kirkwood [1] make elementary errors in their interpretation
of confidence intervals (CIs) and P values. They quote a hazard ratio 0.83
(95% CI 0.65 to 1.05, P=0.12) and then state 'there is only a 6% chance
that it (the true effect) exceeds 1'. As explained in countless
statistics books, the correct interpretation of a (one-sided) P value is
that if the true effect is 1 there is a 6% chance of getting a hazard
ratio of 1.20 (=1/0.83) or greater, from such an experiment. This is the
rather arcane frequentist argument, which they are implicitly using since
they refer to 'P values'. Similarly, their statement that 'there is a 50% chance that the true hazard ratio is between 0.77 and 0.90' is also incorrect. It raises the question of how one evaluates 'chance' in a frequentist model. What they mean is that if one repeatedly did the
experiment and constructed a confidence interval based on 0.674 standard
errors either side of the estimated value, 50% of the time this would
include the true hazard ratio.
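For readers who wish to check these figures, the quantities above can be recovered approximately from the published summary alone, assuming a normal approximation for the log hazard ratio; the sketch below uses only the reported hazard ratio and 95% confidence limits:

```python
# Recovering the reported quantities from the hazard ratio 0.83 (95% CI 0.65
# to 1.05), assuming a normal approximation on the log scale.
from math import exp, log
from scipy.stats import norm

hr, lo, hi = 0.83, 0.65, 1.05
se = (log(hi) - log(lo)) / (2 * 1.96)        # SE of the log hazard ratio, ~0.12
z = log(hr) / se                             # ~ -1.52
p_two_sided = 2 * norm.cdf(z)                # ~0.13 (the paper reports 0.12)
p_one_sided = norm.cdf(z)                    # ~0.06, the figure in question
z50 = norm.ppf(0.75)                         # 0.674, for a 50% interval
ci50 = (exp(log(hr) - z50 * se), exp(log(hr) + z50 * se))
print(round(p_two_sided, 2), round(p_one_sided, 3),
      [round(x, 2) for x in ci50])           # ~(0.76, 0.90); the paper's 0.77 reflects its unrounded SE
```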
It may appear that the frequentist approach leads to some contorted
reasoning. However, there is a Bayesian correspondence to this method, as we discussed some years ago, and which we advocated for studies which unavoidably have low statistical power [2]. Thus IF one is prepared to
assume an 'uninformative prior', i.e. we have no prior beliefs about where
the true hazard ratio is likely to be before the trial, THEN using Bayes
theorem we can describe the posterior distribution and obtain a
'prediction interval', which in simple cases will be numerically the same
as the confidence interval and we can make statements such as those that
Hackshaw and Kirkwood wish to make. However, it is sloppy reasoning to
confuse a frequentist and Bayesian approach, and the confusion can easily
lead to errors in inference.
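As a minimal sketch of this correspondence (assuming the flat prior and normal approximation just described), the posterior probability that the true hazard ratio exceeds 1 coincides numerically with the one-sided P value, which is what would make a statement such as 'a 6% chance that the true effect exceeds 1' legitimate in a Bayesian, though not a frequentist, reading:

```python
# Assumes a flat (uninformative) prior on the log hazard ratio, so the
# posterior is approximately Normal(log 0.83, se^2).
from math import log
from scipy.stats import norm

hr = 0.83
se = (log(1.05) - log(0.65)) / (2 * 1.96)    # SE recovered from the 95% CI
p_exceeds_1 = 1 - norm.cdf((log(1.0) - log(hr)) / se)
print(round(p_exceeds_1, 3))                 # ~0.064, i.e. about 6%
```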
One can extend the Bayesian argument further. As explained in 'Statistics at Square One' [3], Goodman [4] has advocated the use of 'Bayes factors'. Suppose we were testing two competing hypotheses and our prior assumption was that they were equally likely. Then one can show that if, after the study, we obtained P=0.05 against the null hypothesis, the probability of the null hypothesis being true is still at least 0.13. P=0.05 is an arbitrary cut-off and in these circumstances it would not appear to be strong evidence against the null hypothesis.
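The arithmetic behind Goodman's argument can be sketched in a few lines, assuming the minimum Bayes factor exp(-z^2/2) for a two-sided P value and the two hypotheses taken as equally likely beforehand:

```python
# Goodman's minimum Bayes factor for a two-sided P = 0.05, with prior odds of
# 1 (the two hypotheses assumed equally likely beforehand).
from math import exp
from scipy.stats import norm

p = 0.05
z = norm.ppf(1 - p / 2)                      # 1.96
bf_min = exp(-z * z / 2)                     # minimum Bayes factor, ~0.15
posterior_odds = bf_min * 1.0                # prior odds of 1
posterior_prob_null = posterior_odds / (1 + posterior_odds)
print(round(bf_min, 2), round(posterior_prob_null, 2))   # ~0.15, ~0.13
```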
Two other comments. The BMJ has long advocated the use of confidence intervals and it is a pity the authors did not see fit to reference the BMJ's own publication 'Statistics with Confidence' [5]. In the Figure in the paper, they seem to suggest that the distribution (which we may assume is a posterior distribution) is Normally distributed. For a hazard ratio it is more likely to be log-Normally distributed, as indicated by the fact that the estimated hazard ratio of 0.83 is closer to the lower limit of 0.65 than to the upper limit of 1.05.
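A quick check of this asymmetry, using only the published figures:

```python
from math import log

hr, lo, hi = 0.83, 0.65, 1.05
print(round(hr - lo, 2), round(hi - hr, 2))            # 0.18 v 0.22: asymmetric on the natural scale
print(round(log(hr / lo), 2), round(log(hi / hr), 2))  # 0.24 v 0.24: roughly symmetric on the log scale
```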
It is surprising that the BMJ referees allowed this paper to be
published as it stands.
References
1. Hackshaw A, Kirkwood A. Interpreting and reporting clinical trials with results of borderline significance. BMJ 2011;343:d3340.
2. Burton PR, Gurrin LC, Campbell MJ. Clinical significance not statistical significance: a simple Bayesian alternative to P values. J Epidemiol Community Health 1998;52:318-323.
3. Campbell MJ, Swinscow TDV. Statistics at Square One. 11th ed. Oxford: Wiley-Blackwell/BMJ Books. p69.
4. Goodman SN. Of P-values and Bayes: a modest proposal. Epidemiology 2001;12:295-7.
5. Altman DG, Machin D, Bryant TN, Gardner MJ (eds). Statistics with Confidence. 2nd ed. London: BMJ Publishing Group, 2000.
Competing interests: No competing interests
Whilst few would disagree that the assessment of the reliability of a study depends on more than just the p-value, it is nonetheless a key indicator that should not be dismissed lightly.
The use of the 5% p-value threshold appears to have become universal
in biomedical research, yet it does not seem to be based on any clear
statistical reasoning. So far as I can make out, the origin of this
threshold seems to lie in a discussion of the theoretical basis of
experimental design, published by the Cambridge geneticist and
statistician RA Fisher in 1926 [1].
Fisher's work laid the statistical foundation for the evolution of
randomised controlled trials, which he and others developed over the
subsequent 25 years. With regard to the use of p<0.05, this was an
arbitrary threshold that he adopted in order to discuss the broader issue
of statistical significance, which was then a novel concept. To quote
Fisher from this paper:
"...If one in twenty does not seem high enough odds, we may, if we
prefer it, draw the line at one in fifty or one in a hundred. Personally,
the writer prefers to set a low standard of significance at the 5 per cent
point, and ignore entirely all results which fail to reach this level. A
scientific fact should be regarded as experimentally established only if a
properly designed experiment rarely fails to give this level of
significance..."
My interpretation of this is that the 5% standard represents the
absolute minimum standard for a single study, with non-significant results
only being admissible if the study falls within the context of a broader
evidence base, composed of similar studies that did yield statistically
significant results.
So while it may be entirely reasonable to set a more stringent threshold than 5%, especially where methodological concerns raise the risk of bias, relaxing the threshold is much more difficult to justify.
Reference
1. Fisher RA. The arrangement of field experiments. Journal of the
Ministry of Agriculture of Great Britain 1926; 33:503-513.
Competing interests: I run a company that provides data analytic and health economic consultancy services to the pharmaceutical industry.
The arbitrary nature of thresholds used to interpret p-values has
always seemed strange. The only rationale for adherence to p<0.05 to
indicate significance is one of convention and that is hardly scientific.
To me a p-value gives the probability of observing an effect at least as large as that seen when the null hypothesis of no effect (in the population from which the sample is drawn) is true. Surely this is all about risk aversion and is directly linked to the seriousness of a particular condition and how we respond to it? For a particularly serious condition with no alternative
treatment we may well be prepared to accept more than a 5% risk of having
got it wrong.
Why not simply report p-values and not impose an interpretation on to
them? Confidence intervals are clearly more complicated but one option
would be to report (perhaps in addition to conventional ones) confidence
intervals at the maximum level at which they only contain positive
effects. It would then be up to the reader to determine whether being, for example, 60% confident of a positive effect is acceptable or not.
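As a minimal sketch of how such an interval could be computed (using, purely for illustration, the hazard ratio of 0.83 with 95% CI 0.65 to 1.05 discussed in other responses), the highest confidence level at which a symmetric interval still excludes the no-effect value is simply one minus the two-sided P value:

```python
# The largest confidence level at which the interval still excludes a hazard
# ratio of 1, recovered from the reported estimate and 95% CI.
from math import exp, log
from scipy.stats import norm

hr, lo, hi = 0.83, 0.65, 1.05
se = (log(hi) - log(lo)) / (2 * 1.96)
z = abs(log(hr)) / se
max_level = 1 - 2 * (1 - norm.cdf(z))                    # = 1 - two-sided P, ~0.87
ci = (exp(log(hr) - z * se), exp(log(hr) + z * se))      # upper limit sits exactly at 1.0
print(round(max_level, 2), [round(x, 2) for x in ci])    # 0.87 [0.69, 1.0]
```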
Competing interests: No competing interests
Sir
I am grateful to Dr Lewis for his offer [1] but, regrettably, I must
decline. The quid pro quo is far too one-sided. I stand to gain only his
agreement with my criticism of the paper by Hackshaw and Kirkwood [2]
while, in return, I am being asked to forfeit something of much greater
importance.
I will stick to my position that statistics-based research in
medicine - in other words, large-scale RCTs and epidemiological studies -
is fundamentally flawed and that we should stop pretending that it
produces anything that is reliable or valuable [3].
James Penston
e-mail: james.penston@nhs.net
References
1. Lewis LS. BMJ Rapid responses, 10th July 2011.
2. Hackshaw A, Kirkwood A. Interpreting and reporting clinical trials
with results of borderline significance. BMJ 2011;343:d3340.
3. Penston J. Stats.con - How we've been fooled by statistics-based
research in medicine. The London Press, November 2010.
Competing interests: No competing interests
Having rightly criticised Hackshaw et al for pursuing P-values beyond the conventional 5% distribution tail, you too seem to want the penny AND the bun! You overstate the matter with "Statistics-based
research in medicine is fundamentally flawed [2] and attempting to repair
individual defects is a waste of time. We should stop pretending that
large RCTs and epidemiological studies produce anything that is reliable
or valuable."
I believe that large effects are quickly obvious in small trials, but
the plain fact remains that small effects will require very large trials
to discern them reliably. I agree with your contention that abandoning the convention of the p=0.05 cut-off will open the floodgates to spurious effects - but only if you agree that there are right (and wrong) uses of
statistical methods.
Dr Sam Lewis
Competing interests: No competing interests
Sir,
Medical researchers, the executives of pharmaceutical companies and
members of special interest groups will be dancing in the streets. The
paper by Hackshaw and Kirkwood [1] is nothing but a license to make
unfounded claims about the efficacy of treatment from the results of
randomised trials. Data manipulation is already rife in the research
literature [2] and the proposed re-interpretation of the statistical
analysis will only make matters worse.
The authors' stated aim is to avoid the situation in which the
results of a study are ignored because the data fail to achieve
conventional statistical significance. "If a clinically important effect
is observed with a P-value of just above 0.05 (or an upper or lower
confidence limit close to the no effect value) it is incorrect to conclude
no effect and not consider further what is likely to be an effective
intervention..." [1] Thus, provided that there is a "clinically important"
effect - an arbitrary judgement - and that the P-value or confidence interval is borderline - another arbitrary judgement - then we should
claim that there is evidence of an effect. But the authors want it both
ways: they criticise the use of the cut-off level for the P-value at 0.05
while, at the same time, they propose changes based on equally arbitrary
criteria.
It would be naive to believe that this re-interpretation will not be
seized upon by those with a vested interest in the outcome of medical
research in order to distort the findings, just as it would be naive to
believe that the authors' call for the use of "moderate words" would lead
to anything but an abuse of language.
Statistics-based research in medicine is fundamentally flawed [2] and
attempting to repair individual defects is a waste of time. We should stop
pretending that large RCTs and epidemiological studies produce anything
that is reliable or valuable.
James Penston
e-mail: james.penston@nhs.net
References
1. Hackshaw A, Kirkwood A. Interpreting and reporting clinical trials
with results of borderline significance. BMJ 2011;343:d3340.
2. Penston J. Stats.con - How we've been fooled by statistics-based
research in medicine. The London Press, November 2010.
Competing interests: No competing interests
While Hackshaw and Kirkwood do an excellent job of describing problems encountered when interpreting trials with p-values close to the pre-specified alpha, they missed the biggest problem and largest opportunity. The problem is the use of frequentist statistics in medical research and the opportunity is the use of Bayesian methods. Here is a partial list of the advantages of Bayesian statistics: they focus on estimating the size of a difference rather than declaring that there is (or is not) a difference; they allow prior knowledge about the trial's topic to be incorporated; and they allow sensitivity analyses regarding that prior knowledge as well as regarding important aspects of the trial's design (for example, examination of how bias could affect results). For these and a host of other reasons, we should stop trying to patch frequentist methods and migrate to more sensible and useful Bayesian methods.
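As a minimal sketch of what such an analysis might look like (the sceptical prior below is purely illustrative, the data are summarised by the hazard ratio of 0.83 with 95% CI 0.65 to 1.05 discussed elsewhere on this page, and a normal approximation is used on the log scale):

```python
# Illustrative Bayesian update on the log hazard ratio scale. The sceptical
# prior (centred on "no effect", 95% prior interval roughly 0.5 to 2.0) is a
# made-up choice for demonstration only.
from math import exp, log, sqrt
from scipy.stats import norm

# Likelihood summarised by a trial estimate: HR 0.83, 95% CI 0.65 to 1.05
est = log(0.83)
se_est = (log(1.05) - log(0.65)) / (2 * 1.96)

prior_mean = log(1.0)
prior_sd = log(2.0) / 1.96

# Conjugate normal update: precision-weighted average of prior and data
w_prior, w_data = 1 / prior_sd**2, 1 / se_est**2
post_mean = (w_prior * prior_mean + w_data * est) / (w_prior + w_data)
post_sd = sqrt(1 / (w_prior + w_data))

print("posterior HR:", round(exp(post_mean), 2))
print("95% credible interval:",
      round(exp(post_mean - 1.96 * post_sd), 2), "to",
      round(exp(post_mean + 1.96 * post_sd), 2))
print("P(true HR < 1):", round(norm.cdf(-post_mean / post_sd), 2))
```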
Competing interests: No competing interests
Re: A confusion of frequentist and Bayesian arguments
The precise definition of a (frequentist) 95% confidence interval is that, over repeated samples, 95% of such intervals would contain the population (true) effect.
As Professor Campbell correctly points out, this is different to the
wording we used, which implied that the true effect lies within a
specified range with a certain probability. We admit we used the looser
language, but did so only because this is how many researchers interpret
confidence intervals, and therefore we used terminology with which most are familiar (rightly or wrongly). For practical purposes, we believe little is lost by this. So, for example, where we say that "there is a 50% chance
that the true hazard ratio is between 0.77 and 0.90", it would be more
appropriate to say "there is a 50% probability that the interval 0.77 to
0.90 contains the true effect".
However, what we really wanted readers to focus on is that the middle
of a confidence interval is more likely to contain the true effect than
either extreme end (rather than the precise definition of a confidence
interval). This is one of the two main purposes of our article. It has
indeed been stated before (eg Altman et al)[1] but is still a feature of which many researchers are currently unaware. However, it has important
consequences for interpreting results where one end of a confidence
interval just overlaps the no effect value.
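As a simple numerical illustration of this point, under the usual normal approximation to the log hazard ratio in our example, a value at the centre of the 95% confidence interval is supported by the data roughly seven times more strongly than a value at either limit:

```python
# Relative support (likelihood ratio) for the centre of the 95% CI versus its
# limit, under a normal approximation on the log hazard ratio scale.
from math import log
from scipy.stats import norm

hr = 0.83
se = (log(1.05) - log(0.65)) / (2 * 1.96)      # SE recovered from the reported 95% CI
centre, limit = log(hr), log(hr) + 1.96 * se   # centre and upper limit, log scale
lik = lambda theta: norm.pdf(log(hr), loc=theta, scale=se)
print(round(lik(centre) / lik(limit), 1))      # ~6.8
```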
We made it clear in our article that the overuse of p-values is not
new, by giving two references, almost 10 years apart [2,3]. What we were
trying to point out was that the practice of focussing on p-values to
interpret data is still endemic in research, despite recommendations not
to do so, and is particularly problematic when the p-value just exceeds
0.05. This was the second main purpose of the article.
We thank Professor Campbell for helping to clarify the language used
to interpret confidence intervals, but hope that this does not detract
from the two key messages.
1. Altman DG, Machin D, Bryant TN, Gardner MJ (eds). Statistics with Confidence. 2nd ed. London: BMJ Publishing Group, 2000.
2. Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ 1995;311:485.
3. Alderson P. Absence of evidence is not evidence of absence. BMJ 2004;328:476. doi:10.1136/bmj.328.7438.476
Competing interests: No competing interests