Listen to the data when results are not significant
BMJ 2008; 336 doi: https://doi.org/10.1136/bmj.39379.359560.AD (Published 03 January 2008) Cite this as: BMJ 2008;336:23
All rapid responses
Hewitt, Mitchell & Torgerson (1) are right when they state that
“authors may claim that the non-significant result is due to lack of power
rather than lack of effect”, or that “no firm conclusions can be drawn
because of the modest sample size”. As a matter of fact, it is quite easy
to find thousands of examples of the type “we have not found significance,
but with a larger sample size we probably would have” in current research
reports. The main problem with such a statement is that it is true, or, to
be more precise, that it is always true.
This is one of the basic deficiencies of statistical hypothesis
testing based on p-values: the magnitude of p depends on the sample size.
Everyone knows that, given a large enough sample, you will be able to
reject the null hypothesis. This is closely related to the fact that a
point null hypothesis cannot be exactly true, because it represents only
one point among the infinitely many points on a line, so its probability is
zero. These and several other weaknesses of conventional Null Hypothesis
Significance Testing (NHST) have been pointed out repeatedly over the
decades (2, 3, 4, 5, 6); consequently, it is not possible to share the
affirmation that “Statistical significance is important … to guide us in
the interpretation of a study’s results”.
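The dependence of p on the sample size is easy to demonstrate with a small simulation. The sketch below (a two-sided one-sample z-test with known variance; the tiny effect size and the sample sizes are invented purely for illustration) shows a fixed, negligible true effect drifting into “significance” as n grows:

```python
import math
import random

random.seed(1)

def z_test_p(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0, with known sigma."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

true_effect = 0.02  # tiny but non-zero, so the point null is (barely) false
for n in (100, 10_000, 1_000_000):
    sample = [random.gauss(true_effect, 1.0) for _ in range(n)]
    print(n, round(z_test_p(sample), 4))  # p shrinks as n grows
```

With the same negligible effect, p eventually falls below any fixed threshold once n is large enough, which is precisely the sense in which “with a larger sample size we would have found significance” is always true.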
The double standard used to maintain such an incongruous,
impoverished, and potentially misleading procedure becomes obvious when
one notes that nobody says “we have found significance, but with a smaller
sample size we probably would not have found it”, which is always true as
well.
One explanation for the almost universal application of frequentist
inference is that most people make ritual use of it, believing that these
tools do what they actually do not: many researchers think that p is a
measure of the probability that the null hypothesis is true (7, 8, 9), or
that a 95% confidence interval for a difference contains the true effect
with probability 0.95. Recall Cohen’s well-known remark: “What's wrong
with NHST? Well, among many other things, it does not tell us what we
want to know, and we so much want to know what we want to know that, out
of desperation, we nevertheless believe that it does!” (10). The same can
be said about confidence intervals.
Hewitt, Mitchell & Torgerson themselves misunderstand the frequentist
nature of a confidence interval. They erroneously say that the 51%
confidence interval “shows where, more often than not, the true treatment
estimate will lie”. The true treatment difference is a constant that
either lies between the extremes of this interval or does not lie there;
it is not a quantity that is sometimes within this range and sometimes
not. The claims that “each value within the confidence interval is not
equally plausible” and that “values that are close to the point estimate
are more likely to correspond to the true value than estimates towards the
extreme of the confidence interval” reflect a commonplace misconception.
Concerning this specific interval, one can only be sure that it was
obtained using a procedure that would, in 51% of repetitions, produce
intervals containing the (constant) value of the difference. Only for a
probability interval (not a confidence interval), obtained by means of a
Bayesian approach, would the quoted statements be valid.
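The correct frequentist reading of a 51% interval can be checked by simulation. In the sketch below (the true mean, variance, and sample size are assumed purely for illustration), the true mean is a fixed constant and it is the intervals that vary from sample to sample; the coverage is a property of the procedure, never of any single realised interval:

```python
import random
from statistics import NormalDist

random.seed(2)
true_mean, sigma, n = 5.0, 1.0, 30  # a fixed constant, unknown to the analyst
z_star = NormalDist().inv_cdf(0.5 + 0.51 / 2)  # critical value for a 51% interval

trials, hits = 20_000, 0
for _ in range(trials):
    xbar = sum(random.gauss(true_mean, sigma) for _ in range(n)) / n
    half = z_star * sigma / n ** 0.5
    hits += (xbar - half) <= true_mean <= (xbar + half)

print(hits / trials)  # close to 0.51: each individual interval either covers or not
```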
References
1. Hewitt C, Mitchell N, Torgerson D. Heed the data when results are not significant. BMJ 2008; 336: 23-25.
2. Bakan D. The test of significance in psychological research. Psychological Bulletin 1966; 66: 423-437.
3. Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ 1986; 292: 746-750.
4. Hunter JE. Needed: a ban on the significance test. Psychological Science 1997; 8: 3-7.
5. Goodman SN. Toward evidence-based medical statistics (I): the p value fallacy. Annals of Internal Medicine 1999; 130: 995-1004.
6. Matthews RA. Facts versus factions: the use and abuse of subjectivity in scientific research. European Science and Environment Forum Working Paper; reprinted in Morris J (ed). Rethinking Risk and the Precautionary Principle. Oxford: Butterworth, 2000.
7. Lecoutre MP, Poitevineau J, Lecoutre B. Even statisticians are not immune to misinterpretations of Null Hypothesis Significance Tests. International Journal of Psychology 2000; 38: 337-345.
8. Haller H, Krauss S. Misinterpretations of significance: a problem students share with their teachers? Methods of Psychological Research Online 2002; 7: 1-20.
9. Gigerenzer G, Krauss S, Vitouch O. The null ritual: what you always wanted to know about significance testing but were afraid to ask. In Kaplan D (ed). The Handbook of Methodology for the Social Sciences. 2004 (Ch. 21).
10. Cohen J. The earth is round (p < .05). American Psychologist 1994; 49: 997-1003.
Competing interests: None declared
Following on from Evan Lloyd's response, might I suggest that the
public interest would best be served if it became obligatory for details
of the source of funding for trials and studies, and that of their
sponsoring bodies, to be clearly set out within their reports.
Perhaps a Government Health Warning should then be printed at the
foot, so that all might see whether science or commerce was the driving
force behind the work. At present it almost appears that a pre-printed
form would be required.
Is there still any totally independent medical research body in the
country? The MRC was intended to be such a body, but little is heard from it.
Competing interests: Statin damaged patient
The paper by Hewitt, Mitchell & Torgerson (1) is fascinating and
valuable, and I hope some changes will result from their analyses. They
suggest that negative or nil results may not be accepted because the
investigators have invested a lot of intellectual capital (and
professional standing?) in the idea. There may also be pressure because
businesses may have funded the study, or may see financial opportunities
in a positive outcome. I would like to suggest that this scenario
happened with the dietary fat/heart disease hypothesis.
During the 1970s, despite a large body of clinical and scientific
knowledge and evidence to the contrary, a growing number of ‘experts’
claimed that dietary fat produced cholesterol, which in turn caused
Coronary Heart Disease (CHD), including death. To confirm the theory,
long-term controlled studies were set up to lower the dietary fat content
in the study group. After 10 years, the first study, MRFIT (Multiple Risk
Factor Intervention Trial), reported its results in October 1982 (2).
Despite the very stringent dietary restrictions (fat in the diet reduced
by 25%), cholesterol showed only a small (5%) fall, and there was no
difference between the groups in the incidence of CHD or deaths. The
study therefore failed to support the theory. By some strange timing, a
World Health Organisation (WHO) committee of experts had produced a report
in the summer of 1982 (3) calling for a fundamental change, i.e. a
reduction of fat intake, in the Western diet. This recommendation was
therefore made before the MRFIT results were available to the scientific
and medical community. It seems likely that the ‘experts’ on the WHO
committee were the same ‘experts’ involved in the MRFIT trial and, being
aware of the disappointing outcome of the trial, were trying to pre-empt
the final decision. A similar European trial (4) also reported
disappointing results. The proponents of the fat theory rationalised the
results by saying that the members of the trial groups had not been trying
hard enough and that the advice to reduce fat intake should be rolled out
to everyone. I can personally remember Prof M Oliver (Professor of
Cardiology at Edinburgh University) saying that, if all the effort put
into the trials had not produced the desired result, there was no
possibility that the public at large could be ‘persuaded’ to alter their
diet further than the trial subjects had.
Despite the negative evidence from these trials, a conference (5) in
1984 decided that the advice to lower cholesterol levels should be applied
to all, even those with NORMAL cholesterol levels. This decision was based
on the results of the Lipid Research Clinics (LRC-CPPT) trial (6), whose
‘successful’ outcome was announced at a press conference before the
results were published. In that study (6), people who had a genetic
protein abnormality, which resulted in VERY HIGH cholesterol levels and a
VERY HIGH risk of cardiac death, had their cholesterol levels reduced to
normal levels by chemical means. The outcome was that the risk of heart
disease, and death, returned to normal levels. (An opportunity for the
pharmaceutical giants?) Prof M Oliver disagreed with the conclusions of
the congress and noted that the panel making the decision was ‘packed’
with supporters of the theory (7). The title of his letter (7),
“Consensus or Nonsensus conference on Coronary Heart Disease”, was very
apt. Since then, pronouncements on diet, all supporting the ‘consensus’,
seem to be made by ‘panels of experts’ without reference to research
findings.
However, an analysis of nine MRFIT-type studies (8) showed that none
had been effective, making a total of over 170,000 subjects studied with
no ‘positive’ results. In a town on the south coast of England there was
a far higher intake of saturated fat, but a much lower level of CHD
deaths, than in a town in the north of England (9). An analysis of
worldwide figures (10) also shows that climate is a much better predictor
of cholesterol levels and CHD deaths than diet. Moreover, though the
incidence of CHD deaths is now falling in many countries (11, 12),
including Scotland, in all these countries the level of fat consumption
remained the same throughout the period of the rise and subsequent fall
(12). The already low incidence of CHD deaths in Japan has been falling
further despite the level of fat in the diet rising steadily (12).
The fatty diet/cholesterol/CHD thesis has produced a vast army of the
“worried well”, who keep going to their doctors for checks on their
normal cholesterol levels. The tests cost money, personal or state, and
more funds (£3bn per year) go to the firms who provide the
cholesterol-lowering drugs. On the ‘real’ evidence, is this a waste of
money?
It is important to remember that ‘consensus’ doesn’t always mean the
stated facts are true. After all, at one time there was a consensus that
the world was flat. Isn’t it time this subject was reviewed
dispassionately?
References
1. Hewitt C, Mitchell N, Torgerson D. Heed the data when results are not significant. BMJ 2008; 336: 23-25.
2. MRFIT Research Group. Multiple risk factor intervention trial. JAMA 1982; 248: 1465-1477.
3. WHO. Prevention of Coronary Heart Disease. WHO Technical Report Series, No. 678. Geneva: WHO, 1982.
4. WHO European Collaborative Group. Multifactorial trial in the prevention of heart disease: incidence and mortality results. Eur Heart J 1983; 4: 141-147.
5. Consensus Conference. Lowering blood cholesterol to prevent heart disease. JAMA 1985; 253: 2080-2087.
6. Lipid Research Clinics Programme. LRC-CPPT results. JAMA 1984; 251: 351-356.
7. Oliver M. Consensus or nonsensus conference on coronary heart disease. Lancet 1985; i: 1087.
8. Ebrahim S. Systematic review of randomised controlled trials of multiple-risk-factor interventions for preventing coronary heart disease. BMJ 1997; 314: 1666-1673.
9. Cade JE, Barker DJP, Margetts BM, et al. Diet and inequalities of health in three English towns. BMJ 1988; i: 1359-1362.
10. Lloyd EL. The role of cold in ischaemic heart disease: a review. Public Health 1991; 105: 205-215.
11. Walker WJ. Coronary mortality: what is going on? JAMA 1974; 227: 1045-1046.
12. le Fanu J. Eat Your Heart Out: the fallacy of the healthy diet. London: Macmillan, 1987. p 109.
Competing interests: None declared
There are plenty of examples showing that trialists are rarely neutral about their research. The findings of a study need to be judged on the likelihood that they are true. On Bayesian principles, the evidence that an intervention improves the outcome of a whole population needs to be much stronger than the evidence that it does not.
Nature does nothing uselessly.
For example, it is currently common practice to interfere with the normal transition from the fetal to the adult pattern of circulation by clamping the cord immediately after birth. Why should it be necessary to carry out a randomised controlled trial to prove that such an intervention is harmful? On principle, it should only be necessary to prove that it is not beneficial, and much weaker evidence is required to reach such a conclusion. There is already substantial evidence that immediate cord clamping is harmful to the neonate, yet the practice continues.
When will we start to heed data with significant results?
Competing interests: None declared
Listen to all the evidence when results are not significant
Dear Editor,
This interesting paper by Hewitt et al (1) discusses an important
issue in relation to research methodology. However, it is unfortunate that
the length of their article precluded a less selective and more balanced
representation of our work.
Hewitt et al seem to believe that our overall conclusion was that the
intervention should be used; however, basing this solely on the ‘What this
study adds’ box, as they did, is unreasonable given the more detailed
discussion and interpretation in the article text. Firstly, we stated five
times in the article that the intervention was not associated with a
reduction in injuries, and three times that it was associated with an
increase in the primary care injury attendance rate. The “What this study
adds” box also included the statement that “larger differences in safety
practices may be required to affect injury rates”. Unfortunately, Hewitt
et al fail to mention these points in their discussion of our paper.
Hewitt et al claim that we “seem to use proxy measures of outcome as
justification for the intervention”. However, these measures, which
include safety equipment possession and use and parental satisfaction,
were defined as secondary outcome measures in our paper, and we clearly
stated in the first sentence of our section on “interpretation of the
findings” that “the increased possession and use of safety equipment
among families in the intervention arm did not translate into a lower
injury rate”. It is unfortunate that in table 2 they report the results
from the analysis of our primary outcome measure (any medically attended
injury) but include our interpretation from the analysis of the secondary
outcome measures (safety equipment possession and use), and not our
interpretation of the analysis of our primary outcome measure. Hewitt et
al include a quote from our paper in which we were positive about safety
equipment schemes such as those organised by SureStart. We feel it would
have been more balanced if they had also included our adjoining sentence:
“However, our findings also highlight the importance of rigorously
evaluating the widespread provision of equipment not only in terms of
safety practices but also in terms of injury outcomes and uptake of
schemes by those most at risk.”
Hewitt et al argue that we noted it was unlikely that the
intervention would not reduce injury rates because "several observational
studies have shown a lower risk of injury among people with a range of
safety practices." They also state that “it is surprising to seek
reassurance from non-randomised data when a randomised trial shows the
"wrong" result”. Hewitt et al clearly took exception to our reference to
observational studies here. Yet, despite our pointing out in the article
that there are very few RCTs in this area measuring injury outcomes (all
of which were underpowered to detect a plausible reduction in medically
attended injury rates), they fail to appreciate that the majority of
evidence in this area comes from observational studies. Are they really
suggesting that all such evidence should be ignored?
Our analyses of primary outcome measures at the child level
demonstrated that the increase in injury rates for any medically attended
injury was confined to primary care attendances. Secondary care
attendances (IRR 1.02, 95% CI 0.90 to 1.13) and hospital admissions (IRR
1.02, 95% CI 0.70 to 1.40) were not increased by the intervention. We
stated that several explanations are possible for the higher attendance
rate in primary care among intervention arm children and argued that this
may have resulted either from increased awareness among intervention arm
parents, with subsequent increased reporting of minor injuries, or from
risk compensation, whereby parents feel safer because of having the safety
equipment and consequently change other behaviours that are protective
against injury. It is plausible that raising parental awareness might
increase primary care attendances for more minor injuries but not
secondary care attendances or hospital admissions for more severe
injuries. We believe this is less likely to be the case for risk
compensation, since if parents change other protective behaviours this
would be unlikely to affect only minor injuries. We stated that “further
work is required to explore these hypotheses further”, but Hewitt et al
fail to include this in their discussion of our paper.
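For readers less familiar with rate ratios, an interval estimate like “IRR 1.02, 95% CI 0.90 to 1.13” is conventionally a Wald interval built on the log scale and then exponentiated. The sketch below uses invented event counts and person-time (not the trial's actual data) chosen only to give a ratio near 1.02:

```python
import math

def irr_ci(events_a, time_a, events_b, time_b, z=1.96):
    """Wald CI for an incidence rate ratio, computed on the log scale."""
    irr = (events_a / time_a) / (events_b / time_b)
    se_log = math.sqrt(1 / events_a + 1 / events_b)  # approximate SE of log(IRR)
    lo = math.exp(math.log(irr) - z * se_log)
    hi = math.exp(math.log(irr) + z * se_log)
    return irr, lo, hi

# Illustrative counts only (assumed, not taken from the trial):
print(irr_ci(events_a=520, time_a=1000.0, events_b=510, time_b=1000.0))
```

Because the interval is symmetric on the log scale, its lower limit can sit well below 1 even when the point estimate is essentially null, which is why the lower limits quoted above leave room for a real reduction.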
It is also possible that some minor injuries occurred in the
intervention arm as a direct result of having the safety equipment, e.g.
children trapping fingers in stair gates. However, families were informed
by both the equipment fitters and the health visitors that if they
encountered problems with the equipment then they should contact their
health visitor as soon as possible, and although a very small number of
parents did so in relation to the refitting of some equipment, none of the
families reported injuries involving the equipment. This suggests that
this mechanism does not explain the higher primary care attendance rate
seen in the intervention arm.
We agree with Hewitt et al that the decision to use a P value of 0.05
or a 95% confidence interval to determine statistical significance is
arbitrary but widely accepted; we consider their use of 67% and 51%
limits to be equally arbitrary but without wide acceptance. Wider use of
these limits would greatly increase the chances of ‘detecting’ both
beneficial and harmful effects of new interventions when no such effects
exist, leading to unnecessary costs and public concern.
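The scale of that increase is easy to quantify: judging significance by whether a 51% interval excludes zero is equivalent to testing at alpha = 0.49, so under a true null almost half of all trials would appear "significant". A small simulation (assumed normal data with known variance; the trial size and number of trials are illustrative only) makes the point:

```python
import random
from statistics import NormalDist

random.seed(3)
n, sigma = 50, 1.0  # assumed per-trial sample size and known variance

def excludes_zero(level):
    """One simulated null trial; True if the level-CI excludes the true effect (0)."""
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    xbar = sum(random.gauss(0.0, sigma) for _ in range(n)) / n
    half = z_star * sigma / n ** 0.5
    return not (xbar - half <= 0.0 <= xbar + half)

for level in (0.95, 0.51):
    rate = sum(excludes_zero(level) for _ in range(10_000)) / 10_000
    print(level, rate)  # roughly 0.05 and 0.49: the false positive rate is 1 - level
```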
We believe that further research is urgently needed to examine the
protective effect of specific items of equipment on the injuries they
could potentially prevent. Hewitt et al argue that safety advice and
safety equipment should not be given to families with young children
because doing so increases the risk of harm and cost. Is this a
reasonable conclusion, bearing in mind that the increased primary care
attendance rate could be due to increased parental reporting of injury,
that our study's ability to demonstrate reductions in injury was limited
by the higher than expected baseline prevalence of safety equipment, and
that the lower 95% confidence limits include the possibility that the
intervention reduces secondary care attendances by 10% and hospital
admissions by 30%? Throwing the baby out with the bath water seems
premature under these circumstances.
Reference:
1. Hewitt C, Mitchell N, Torgerson D. Heed the data when results are not
significant. BMJ 2008; 336: 23-25.
Competing interests: We are authors of one of the articles discussed in the paper by Hewitt et al.