Peer review of statistics in medical research: the other problem
BMJ 2002;324 doi: https://doi.org/10.1136/bmj.324.7348.1271 (Published 25 May 2002). Cite this as: BMJ 2002;324:1271. All rapid responses
Conclusions presented in a peer reviewed publication are meant to be
the 'gold standard' against which medical treatment is measured. Popular
press articles, books, even letters to the same journal are relegated to
'so what' status. 'Peer review,' medics say; the rest is just so much
personal opinion, unverified anecdote, to be ignored.
Now we hear that peer review is badly flawed - you could swap the
rejected and accepted piles and no one would notice, it is suggested.
I think the victims of bad medical science would, and have.
Competing interests: No competing interests
The messages of Richard Smith and Peter Bacchetti come across quite
clearly. Peer-review processes can be highly error-prone, a problem that
is "especially acute for highly innovative research."
However, such high error-proneness demands more profound reforms
than those proposed by Bacchetti. The "changes in the culture" should
include recognition that improving the probability of correct evaluations
in error-prone environments requires a restructuring of peer-review
processes as currently practiced.
The basic rules, quite familiar to those who survive on Wall Street,
are (i) hedge your bets, and (ii) look at the track record. For more on
this, please see my peer-review website at
http://post.queensu.ca/~forsdyke/peerrev.htm
Sincerely,
Donald R. Forsdyke M.B., B.S., Ph.D.
Department of Biochemistry, Queen's
University, Canada
Competing interests: No competing interests
I am delighted to see Professor Altman et al. reiterate that
confidence intervals are preferable to post-hoc power calculations, but I
must disagree with the reasons they cite for nevertheless presenting prior
power calculations for completed studies. Just as confidence intervals
more directly and clearly address uncertainty than power calculations, so
too the information that they see flowing from prior power calculations
can instead be presented more directly. Papers can provide information
about early stopping, recruitment rates, and any relevant departures from
expectations (some may not be scientifically important) without any
reference to power calculations. While the importance of prespecification
can be debated (is it really impossible for an effect to be important if
the outcome was not prespecified?), it is easy enough to simply state the
planned primary and secondary outcomes, again without reference to power
calculations. The clinical importance of a given effect size similarly
does not rely on power calculations; instead, power calculations usually
rely on (subjective) assessments of clinical importance.
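[A minimal sketch may make the redundancy concrete. This is my illustration, not Professor Bacchetti's: assuming a two-arm comparison of means under a normal approximation, with invented summary figures, the "observed" power computed after a study depends on the data only through the test statistic, i.e. through the p-value, whereas the confidence interval reports the uncertainty directly.]

```python
# Minimal sketch (illustrative numbers, not from any real study): a two-arm
# comparison of means under a normal approximation. "Observed" power after
# the study depends on the data only through z, i.e. through the p-value,
# so it adds nothing that the confidence interval does not already show.
from math import sqrt
from scipy.stats import norm

diff, sd, n = 3.0, 10.0, 50      # assumed mean difference, per-group SD, size
se = sd * sqrt(2.0 / n)          # standard error of the difference
z = diff / se

p_value = 2 * norm.sf(abs(z))    # two-sided p-value
ci = (diff - 1.96 * se, diff + 1.96 * se)

# Post hoc "observed" power: chance of a significant result if the true
# effect equalled the observed one.
z_crit = norm.ppf(0.975)
obs_power = norm.sf(z_crit - abs(z)) + norm.cdf(-z_crit - abs(z))

print(f"p = {p_value:.2f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f}), "
      f"observed power = {obs_power:.2f}")
# -> p = 0.13, 95% CI = (-0.9, 6.9), observed power = 0.32
```

[With these numbers, p = 0.13 and observed power is about 0.32; both restate the same fact, but only the interval shows the range of effects compatible with the data.]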
The remaining point they cite is that prior power calculations
provide assurance that the study was 'properly planned'. The scientific
relevance of this is unclear to me. If a study finds important
information by blind luck instead of good planning, so be it. I still
want to know the results. I discussed in the paper the frequent
difficulty of conforming to common notions of 'proper' sample size
planning, and Dr. Horrobin has expanded on this. In addition, even when
investigators are able to conform to the ideal model, the assumptions
frequently turn out to be inaccurate. Does this mean that the studies
were poorly planned? Should they be disregarded? Or should we learn from
them what we can? Requiring prior power calculations may provide back-end
enforcement of the sort of ethics-based sample size review that Dr.
Horrobin describes. One problem with this is that it extends the reach of
a process that is already unrealistically rigid and obstructionist.
Another is that, unlike front-end enforcement, it is easily circumvented
by providing a power calculation that was not really done beforehand. If
a study passed ethical review before commencing and provides information
important enough to warrant publication, it seems reasonable to assume
that its potential to produce such information was high enough that asking
for the subjects' cooperation was ethical.
If the reasons for presenting prior power calculations are weak, the
question remains: why not? I believe the most important reason is that it
contributes to the very problem it is supposed to mitigate:
misinterpreting p>0.05 as providing strong support for the null
hypothesis. Authors, reviewers, and readers may reasonably interpret the
presence of a power calculation as having some direct relevance for
interpreting the results. Given the oblique (I have argued) reasons for
presenting power calculations, they may understandably assume that the
calculation is there to assure us that the sample size is adequate to
support the classical statistical hypothesis testing approach of either
'accepting' or 'rejecting' the null hypothesis. And despite subtleties
that statistical theorists may try to convey, 'accepting' the null
hypothesis is usually described as, 'There was no difference'. The needed
corrective for this type of erroneous, binary interpretation is to
encourage investigators and readers to pay attention to estimated effects
and the confidence intervals around them. I contend that presenting power
calculations works against this. In addition, presenting power
calculations opens the study to a second round of unhelpful sample size
criticisms, of which Dr. Zwarenstein's experience provides an extreme
example.
There is certainly room for disagreement about what practices and
recommendations will improve the use of statistical methods in medical
research. I thank Professor Altman et al. for contributing their
perspective and hope that this clarifies where and why I disagree.
Competing interests: No competing interests
Merrick Zwarenstein has recounted how he had papers rejected because
the trials failed to reach the planned sample size. Peter Bacchetti, in
reply, said that he often recommends against presenting sample size
planning information in reports of completed trials, because confidence
intervals more directly address uncertainty in the study's results. He
also observed that "Unfortunately, many published standards for presenting
studies' results, as well as scales for rating article quality, insist
that power calculations are necessary even after the results are known and
speculation about power is no longer needed (either p was <0.05 or
not)." We wish to explain why the CONSORT statement [1,2] includes the
recommendation that RCT reports say how the sample size was determined,
including details of a prior power calculation if done.
There is indeed little merit in calculating the statistical power
once the results of the trial are known; the power is then appropriately
indicated by confidence intervals. And we agree that failing to reach the
planned sample size is not a reason to reject a paper. However, power
calculations are still of importance to readers, both directly and
indirectly.
First, if the achieved sample size differs from the planned sample
size, the reader will wish to know why: was this just because of an
overestimate of the likely recruitment rate, or because the trial stopped
early because of a statistically significant result (perhaps after
multiple looks at the data, and, if so, was a formal stopping rule or
guideline used)?
Second, a power calculation indicates strongly what is or should be
the principal outcome measure for the trial (although it may not indicate
how the analysis will be performed). This is a safeguard against changing
horses in midstream and claiming a big effect on an outcome that was not a
primary outcome or even not prespecified. Also, a power calculation is
explicit evidence that the trial was properly planned and that some
thought was given to the size of effect that would be clinically important
(even though we all know the values used are often rather optimistic in
order to keep the sample size down).
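[For readers who want the arithmetic behind such a prior calculation, here is a hedged sketch using the standard normal-approximation formula for comparing two means; the SD and the clinically important difference below are invented for illustration, not taken from the letter.]

```python
# Hedged sketch of the arithmetic behind a prior power calculation for a
# two-arm comparison of means (normal approximation). The SD and the
# clinically important difference below are invented for illustration.
from math import ceil
from scipy.stats import norm

alpha, power = 0.05, 0.90
sigma, delta = 10.0, 5.0          # assumed SD; smallest difference worth finding
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
n_per_arm = ceil(2 * z**2 * (sigma / delta) ** 2)
print(n_per_arm)                  # -> 85 patients per arm
```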
Douglas G Altman(a), David Moher (b), Kenneth F Schulz(c)
(a) Cancer Research UK Medical Statistics Group, Centre for
Statistics in Medicine, Institute of Health Sciences, Oxford, UK
(b) Thomas C Chalmers Centre for Systematic Reviews, University of
Ottawa, Ontario, Canada
(c) Family Health International and Department of Obstetrics and
Gynecology, School of Medicine, University of North Carolina at Chapel
Hill, North Carolina, USA
1 Moher D, Schulz KF, Altman DG for the CONSORT Group. The CONSORT
statement: revised recommendations for improving the quality of reports of
parallel-group randomised trials. Lancet 2001;357:1191-4.
2 Altman DG, Schulz KF, Moher D, Egger M, Davidoff F, Elbourne D,
Gøtzsche PC, Lang T for the CONSORT Group. The revised CONSORT statement
for reporting randomized trials: explanation and elaboration. Ann Intern
Med 2001;134:663-94.
Competing interests: All authors are members of the CONSORT group.
DGA is one of the BMJ's statistical advisers.
May I support those responders to Peter Bacchetti's interesting
article who argue that the problem is rigid adherence to rules by people
with a limited understanding of statistics. On the issue of sample size,
to quote from a recent paper by Williamson et al (1): 'Their (sample size
calculations) purpose is not to give an exact number, but rather to
subject the study design to scrutiny, including an assessment of the
validity and reliability of data collection, and to give an estimate to
distinguish whether tens, hundreds or thousands of participants are
required....The view is taken that all studies should include an honest
assessment of the power and effect size of a study, but that an ethics
committee need not automatically reject studies of low power.'
I think one solution is improved feedback to referees and reviewers.
I sit on a scholarship panel, and now routinely provide feedback as to
whether there are 'hawks' and 'doves' on the panel and who they are. I
believe the panel finds this helpful!
Conflict of interest. I am on the BMJ Statistical advisory panel.
Ref:
1. Williamson P, Hutton JL, Bliss J, Blunt J, Campbell MJ, Nicolson E.
Statistical review by research ethics committees. J Roy Statist Soc A
2000;163:5-13.
The article by Bacchetti (BMJ, 20 May), with its comments about uncertainties surrounding power calculations, prompted me to seek the advice of your correspondents about an issue which seems to be becoming more common and which has major implications for clinical research.
Laxdale is a pharmaceutical company which, like many others, frequently conducts pilot studies on new entities. These involve the administration of novel, potentially therapeutic compounds to patient populations which have never been exposed to that compound previously. Our usual practice is to state in the protocol that in this situation there is no reasonable basis for a power calculation. In order to collect information about an effect size (if any) and the variance of that effect size, we state that we plan to randomise a modestly sized group of patients to two or three doses of the active drug and to placebo or a known active treatment. On the basis of such pilot studies, which, depending on the indication and on the advice we receive from experienced clinicians, might typically include 15-30 patients per group, we then plan further studies with at least some basis on which we can do a sensible power calculation.
However, recently we have encountered Multicentre Regional Ethics Committees (MRECs) which, although satisfied about other aspects of a pilot study, have insisted on a formal power calculation as an ethical requirement. One MREC argued that, in the absence of information about our product in a particular patient population, we should look at results in similar patient populations obtained with other products with quite different mechanisms of action. It was argued that we should base our power calculations on these other results, and on the minimum useful clinical improvement which might be expected. In vain we pointed out that there was no scientific basis for a power calculation built on results with a quite different product. Moreover, by basing the power calculation for a pilot study on the minimum useful benefit, group sizes would have to be very large, and unnecessarily large populations of patients might thus be exposed either to placebo, or to an ineffective drug, or even to a drug which might be toxic in this new patient population. However, these arguments were not accepted and we were told that ethical approval would be withheld until we performed a power calculation. So we complied.
There appears to be little literature on the use of power calculations in pilot studies. More surprisingly, we have not been able to find a single published study which, in a consecutive series of trials, compared pre-study power calculations with the actual clinical results obtained. There are studies of power calculations in published papers, but that is very different from prospectively evaluating whether power calculations have real validity or whether they require so many assumptions that in practice they are of limited value. There is a strong theoretical basis for power calculations, but the complete absence of any published prospective evaluation raises suspicions. I can think of no other procedure in clinical research or clinical practice which has become a standard requirement with so little evidence from real world studies.
I am seeking two pieces of advice from your readers. The answers are relevant not just to us, but to anyone from either industry or academia trying to do pilot studies of new compounds:
1. What is the rationale and where is the evidence base for requiring power calculations in pilot studies of new entities being studied for the first time in a new population?
2. Is the requirement for power calculations for any study anything more than a theoretical construct or does there exist a prospectively collected series of studies which demonstrates that predictions made using power calculations are consistently validated by what actually happens during the study?
Does the application of power calculations have a strong experimental basis which validates the procedures used? Or are the inevitable assumptions which must be made so flawed that calculated power frequently bears little relation to actual power?
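[As a purely illustrative sketch of this last question (my addition, with all numbers assumed rather than drawn from any real trial), one can compute how far calculated power drifts from actual power when the planning assumptions are modestly wrong.]

```python
# Illustrative sketch (assumed numbers throughout): plan a two-arm trial
# for 80% power assuming SD = 10 and effect = 5, then ask what power the
# planned size actually delivers if those assumptions were wrong.
from math import ceil, sqrt
from scipy.stats import norm

ALPHA = 0.05
Z_A = norm.ppf(1 - ALPHA / 2)

def n_per_arm(delta, sigma, power):
    """Planned size per arm from the usual normal-approximation formula."""
    return ceil(2 * (Z_A + norm.ppf(power)) ** 2 * (sigma / delta) ** 2)

def actual_power(n, delta, sigma):
    """Power actually achieved with n per arm (the negligible other tail is ignored)."""
    return norm.sf(Z_A - delta / (sigma * sqrt(2.0 / n)))

n = n_per_arm(delta=5, sigma=10, power=0.80)
print(n, actual_power(n, delta=5, sigma=10))   # 63 per arm, power ~ 0.80
print(actual_power(n, delta=5, sigma=14))      # ~ 0.52 if the SD was underestimated
print(actual_power(n, delta=3, sigma=10))      # ~ 0.39 if the true effect is smaller
```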
David F. Horrobin, DPhil, BM, BCh
Research Director, Laxdale Ltd,
King’s Park House, Stirling, UK FK7 9JQ
Competing interests: No competing interests
Sir,
I read with interest Professor Peter Bacchetti's excellent article
[1], highlighting "the other problem of peer review of finding flaws that
are not really there based on unfounded statistical criticism, and its
demoralizing effect on authors". I wish to add some thoughts to the
debated issues.
Professor David Horrobin's original classics on the subject [2,3]
have not yet been surpassed. They were updated recently [4] and prompted
some contributory thoughts [5]. Having plenty of experience as an author
of rejected articles and some as a peer reviewer, I find that the most
devastating effect on an author's morale comes from making no comment,
giving no reason for rejection, or not replying at all. The BMJ is guilty
on this account, as an article of mine was rejected that was accepted
elsewhere after minor editing [6]. The BMJ, however, is in the good
company of most biomedical journals that apply the COPE rules.
The article lacked statistics of any kind, which perhaps might be one
of the reasons it was disliked at the BMJ. To the Editors' credit,
however, it took only about a month to say 'No', which caused no loss of
momentum, unlike other journals that reach the same verdict on other
articles after 6 months or a year, dragging on another year or two before
the author can recover and gather enough time, interest and energy to
face the damn thing again. One subtle aim of that article [6], mentioned
to the BMJ Editors, was an attempt to say that "there is science, and in
particular evidence based medicine, without statistics".
Playing devil's advocate, one might say that statistics has been made
not only into a "big lie" but also into a 'false god'. It was invented
elsewhere but is currently worshipped mostly at medical and surgical
journals. A look at Science and Nature testifies that such prestigious
magazines have reduced statistics to its real size and value as a "tool
for testing a hypothesis". It is not too basic a question for every
biomedical peer reviewer to ask what the exact role, aims and limitations
of statistics are. Some of this was mentioned in an article [7] that
nobody noticed save the late great Professor GD Chisholm, editor of the
British Journal of Urology. It was based on a study that had been
rejected by a grant committee. It aimed at resolving two of the most
serious puzzles of current clinical practice, postoperative hyponatraemia
and the multiple organ dysfunction or failure syndromes [8].
However, giving data and statistics [7,8] before clarifying the
theories [9] has proved as wrong as putting the cart before the horse.
Einstein's method in proposing the special and general theories of
relativity is the correct way. When statistics was hailed in the sixties,
everyone thought it was the only means of discovering "The Unifying
Theory". This has proved both immensely costly and wrong. The basic fact
is that statistics cannot, was not intended to, and never will be able to
make a discovery. Observation, mental experiments and the X factor are
the only way to make a discovery, long before it is verified and proved
by practical studies and statistical tests. Before explaining the X
factor, allow me to tell a relevant true story that symbolizes the
current problem with statistics.
Two friends of mine in the UK had a disagreement, made a bet on a
round of drinks and decided that the first person to enter the hospital
club would be the judge. Guess who did? I did, but, having no clue how to
resolve the conflict, I suggested that a flip of a coin might be the best
way. They also agreed to my condition that while heads or tails would
determine the winner between them, if the coin stood on its edge the
judge should be the winner of all. It did, and I won. Another conflict
then started: who should buy the third round of drinks? Both agreed that
it was my turn. I explained that buying the third round would gain me
good company but lose all my winnings, and that my turn should be the
fifth round! The point is that statistics can tell the probability of
heads or tails and exclude the odd outcome, but when it evaluates to
either 0 or 100% and the truth is known, instead of expiring it generates
residual arguments.
Professor Richard Smith contributed to this debate by quoting Dr
Hedge on Professor Robert Fox's famous thought that swapping the rejects
with the accepts does not make a difference. He added that perhaps "it
has already been done at the BMJ" and asked "How can you know?" With due
respect, Sir, I frankly think nobody can. Despite the proven incremental
value of an average article, it makes no noticeable difference, nor is it
a great loss, to scientific advances. Statistically speaking, that means
a quality article submitted to the BMJ has a 50% chance of being accepted
or rejected. So why not save everybody the trouble and toss a coin? Here
is where statistics has shot itself in the foot: it gives an average
chance to the average and an odd chance to the odd, but cannot tell which
is important.
The odd chance of a tossed coin standing on its edge matches that of
a breakthrough scientific or medical article coming an editor's or peer
reviewer's way, but detecting such an article makes all the difference.
Some call it a hunch or gut feeling. Others qualify it by the
three-pronged tests of quality, relevance and civility. Identifying the
"X factor" that makes such an article stand out is worth all the trouble.
I honestly do not know what it is, but it is the arresting beauty found
in Einstein's famous papers, Newton's laws, Mozart's music and
Shakespeare's writing, among many examples that include medicine [2-4]. I
wrote two articles on such para-scientific, para-medical stuff to
identify the X factor, "Rules and lures of the science game" and "The
Mozarts of Science", sent to journals nearly two years ago, and have not
received a reply yet. I think a message of "Ignore the big-headed
bustard" arrived.
The people qualified to find out the X factor are COPE members.
Another question that requires a 'Yes' or 'No' answer would be: if any of
Einstein's papers were evaluated using the current peer review standards
and statistics adopted by most biomedical journals, would it be accepted?
Peer reviewers, take note. This BMJ "Rapid Response" site is a practical
warning, and I think its inventor has already earned a place in medical
history.
References
1. Bacchetti P. Education and debate. Peer review of statistics in
medical research: the other problem. BMJ 2002;324:1271-1273.
2. Horrobin DF. Referees and research administrators: barriers to
scientific research. BMJ 1974;2:216-218.
3. Horrobin DF. The philosophical basis of peer review and the
suppression of innovation. JAMA 1990;263:1438-1441.
4. Stephens WE. Basic philosophy and concepts underlying peer review.
Medical Hypotheses 1999;52:31-36.
5. Ghanem AN. Guidelines and code of ethics. Saudi Medical Journal
2000;21(7):694.
6. Ghanem AN. Features and complications of nephroptosis causing the
loin pain and haematuria syndrome: preliminary report. Saudi Medical
Journal 2002;23(2):197-205.
7. Ghanem AN, Ward JP. Osmotic and metabolic sequelae of volumetric
overload in relation to the TURP syndrome. Br J Urol 1990;66:71-78.
8. Ghanem AN. Magnetic field-like fluid circulation of a porous orifice
tube and relevance to the capillary-interstitial fluid circulation:
preliminary report. Medical Hypotheses 2001;56(3):325-334.
9. Ghanem AN. Serum sodium changes during and after transurethral
prostatectomy. Saudi Medical Journal 2002;23(4):477-479.
Competing interests: No competing interests
Over a two year period, during which I assessed 294 claimants for
Disability Living Allowance, I came up with some interesting observations.
I submitted this paper as a short report to the Journal of the Royal
College of General Practitioners.
It was rejected primarily on the basis that I had not sought approval
from an ethics committee, nor informed the claimants of the possibility
that I might write my observations up for the benefit of my colleagues.
As a working doctor I had experience and observations to share, but it
appears spontaneous observation and discussion is to be stifled by peer
reviewers of the type aptly described.
How many great discoveries would have been shelved if their discoverers
had had to go back and ask people whether they could anonymously share
their experiences and ideas?
Competing interests: No competing interests
I was interested to see Peter Bacchetti's critique of the
statistical peer review process, in particular his
highlighting the lack of tangible rewards for good
reviewers.
As one of the BMJ's statistical advisers, I aim when
writing a review to get weak papers rejected and strong
papers improved. The affirmation of this process is to
see suggested improvements to a paper incorporated
in the final version, showing that the authors have
valued the suggestions.
If authors were routinely asked to rate reviewers’
comments, and the ratings were given (with or without
reviewers’ names) at the end of the paper with the
acknowledgements, this would provide an opportunity
for good (and indeed bad!) reviewing to be recognised
in a structured way.
Competing interest: I am one of the BMJ's statistical
advisers.
Power and Responsibility
I agree with Professor Bacchetti (and most correspondents). Power is
of no relevance in interpreting a completed study. In his classic,
Planning of Experiments, Sir David Cox has this to say. 'Power is
important in choosing between alternative methods of analysing data and in
deciding on an appropriate size of experiment. It is quite irrelevant in
the actual analysis of data'(1).
I also agree with much of what Dr Horrobin says, but he overstates
the case against sample size determinations in pilot studies. In most
indications the variability is a function of the disease, not the
treatment, and the fact that the treatment has not been studied is no bar
to using an estimate. The difference you are seeking is not the same as
the difference you expect to find, and again you do not have to know what
the treatment will do to find a figure. This is common to all science. An
astronomer does not know the magnitude of new stars until he has found
them, but the magnitude of the star he is looking for determines how much
he has to spend on a telescope.
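[A minimal sketch of this argument (my illustration, not Professor Senn's): with alpha and power fixed, the required size per arm depends only on the standardized difference sought, delta/sigma, where sigma is estimable from prior studies of the disease itself and delta is the difference worth detecting; neither requires knowing what the new treatment will actually do.]

```python
# Sketch of the point above (assumed, illustrative setup): size per arm as
# a function of the standardized difference sought, delta/sigma, with
# sigma taken from the disease and delta chosen as worth detecting.
from math import ceil
from scipy.stats import norm

def n_per_arm(standardized_diff, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * (z / standardized_diff) ** 2)

for d in (0.25, 0.5, 1.0):        # sought differences, in SD units
    print(d, n_per_arm(d))        # -> 252, 63 and 16 per arm
```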
The definition of a medical statistician is 'one who will not accept
that Columbus discovered America...because he said he was looking for
India in the trial plan'(2). Columbus made an error in his power
calculation (he relied on an estimate of the size of the Earth that was
too small), but he made one nonetheless, and it turned out to have very
fruitful consequences.
References
1. Cox DR. Planning of Experiments. New York: Wiley, 1958: 161.
2. Senn SJ. Statistical Issues in Drug Development. Chichester: Wiley,
1997: 58.
Competing interest: I consult extensively for the pharmaceutical
industry, and my career as an academic is furthered by publication and
grants awarded.