Letters

Peer review of statistics in medical research

BMJ 2002; 325 doi: https://doi.org/10.1136/bmj.325.7362.491/a (Published 31 August 2002) Cite this as: BMJ 2002;325:491

Journal reviewers are even more baffled by sample size issues than grant proposal reviewers

  1. Merrick Zwarenstein (merrick.zwarenstein{at}mrc.ac.za), director, health systems research
  1. Medical Research Council, Tygerberg, South Africa, 7505

    EDITOR—With reference to the article by Bacchetti,1 the confusion surrounding sample size estimates in research protocols elicits quite strange responses from reviewers when they are faced with the completed research in a report submitted to a journal for publication. One of our submissions was rejected because the planned sample size was not attained, even though the effect size was greater than we had anticipated and the difference was therefore both clinically and statistically significant. Another submission, an equivalence trial, met the same fate for a similar reason: although the difference in effect between the intervention and control arms (and both limits of the confidence interval around this difference) lay entirely within the equivalence interval, the failure to reach the planned sample size somehow invalidated the result in the reviewer's mind.
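    To make the logic concrete, consider a hypothetical illustration (the figures are invented, not those of the rejected trial): with a prespecified equivalence margin of Δ = 5 percentage points, an observed difference of 1 point with a 95% confidence interval of −2 to +4 lies entirely inside (−Δ, +Δ); that is, the usual confidence interval criterion for equivalence,

    $$ -\Delta < \text{lower confidence limit} \quad\text{and}\quad \text{upper confidence limit} < +\Delta , $$

    is met on the data actually obtained, whatever sample size had been planned.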

    Although sample size estimation is useful in considering the feasibility of conducting a study (and protocol reviewers should discourage funding for studies that are plainly too small to be meaningful), attainment of the planned sample size does not seem to me to be a useful indicator by which journal reviewers should assess the validity of a completed research report in which clinically and statistically meaningful results have been obtained.

    Footnotes

    • Conflict of interest: My competing interest in relation to this question is my desire to publish research in the face of overoptimistic sample size estimates in my grant proposals.

    References

    1. Bacchetti P. Peer review of statistics in medical research: the other problem. BMJ 2002;324:1271-1273.

    Rationale for requiring power calculations is needed

    1. David F Horrobin (agreen{at}laxdale.co.uk), research director.
    1. Laxdale, Stirling FK7 9JQ

      EDITOR—The article by Bacchetti, with its comments about the uncertainties surrounding power calculations, prompted me to seek advice about an issue that has implications for clinical research.1 The company I work for, Laxdale Limited, often conducts pilot studies on new entities. Our usual practice is to state in the protocol that there is no reasonable basis for a power calculation and that, to collect information about the effect size (if any) and its variance, we plan to randomise a modestly sized group of patients to two or three doses of the active drug and to placebo. Depending on the indication and on advice from experienced clinicians, such studies might include 15-30 patients per group. With results in hand, we can then plan further studies with evidence on which to base a sensible power calculation.

      Recently a multicentre regional ethics committee insisted on a formal power calculation as an ethical requirement for a pilot study. We were told to base power calculations on results obtained with other products with different mechanisms of action and on the minimum useful clinical improvement that might be expected. In vain we pointed out that this procedure had no scientific basis and that, by basing the power calculation for a pilot study on the minimum useful benefit, group sizes would have to be large and many patients might unnecessarily be exposed to placebo, or an ineffective drug, or even a drug that might be toxic in this new patient population.
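      A rough worked example shows why group sizes balloon (the figures are a textbook approximation, not the committee's or the company's actual numbers). For a two group comparison of means with two sided α = 0.05 and 80% power, the usual formula is

      $$ n \text{ per group} \;\approx\; \frac{2\,(z_{1-\alpha/2}+z_{1-\beta})^{2}\,\sigma^{2}}{\delta^{2}} \;=\; \frac{2\,(1.96+0.84)^{2}\,\sigma^{2}}{\delta^{2}} \;\approx\; \frac{15.7\,\sigma^{2}}{\delta^{2}} , $$

      so powering on a minimum useful benefit of half a standard deviation (δ = 0.5σ) requires about 63 patients per group, several times the 15-30 per group that a pilot designed only to estimate the effect size and its variance would enrol.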

      Little literature is available on power calculations in pilot studies. We have not found any study that, for a consecutive series of trials of any type, compared prestudy power calculations with the results obtained. There are studies of power calculations in published papers, but that is different from prospectively evaluating whether power calculations have validity or whether they require so many assumptions that they are of limited practical use. There is a theoretical basis for power calculations, but the absence of any prospective evaluation raises suspicions. What other procedure in clinical research has become standard with so little evidence from real world studies?

      What is the rationale and where is the evidence base for requiring power calculations in pilot studies of new entities? More generally, does the use of power calculations for any studies have a strong experimental basis? Or are the assumptions so flawed that calculated power frequently bears little relation to actual power?

      References

      1. Bacchetti P. Peer review of statistics in medical research: the other problem. BMJ 2002;324:1271-1273.

      Reporting power calculations is important

      1. Douglas G Altman (doug.altman{at}cancer.org.uk), director,
      2. David Moher,
      3. Kenneth F Schulz
      1. Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
      2. Thomas C Chalmers Centre for Systematic Reviews, University of Ottawa, Children's Hospital of Eastern Ontario Research Institute
      3. Family Health International and Department of Obstetrics and Gynecology, School of Medicine, University of North Carolina at Chapel Hill, PO Box 13950 Research Triangle Park, NC 27709, USA

        EDITOR—In response to the article by Bacchetti, Zwarenstein (letter above) recounted how he had papers rejected because the trials failed to reach the planned sample size.1 Bacchetti responded on bmj.com, observing that, unfortunately, many published standards for presenting studies' results, as well as scales for rating article quality, insist that power calculations are necessary even after the results are known and speculation about power is no longer needed (either P was <0.05 or not).2 We wish to explain why the CONSORT statement includes the recommendation that reports of randomised controlled trials say how the sample size was determined, including details of a prior power calculation if done.3 4

        There is little merit in calculating the statistical power once the results of the trial are known; the power is then appropriately indicated by confidence intervals. We agree that failing to reach the planned sample size is not a reason to reject a paper. But power calculations are still of importance to readers, both directly and indirectly.

        Firstly, if the achieved sample size is smaller than the one planned, the reader will wish to know why: was this simply because the likely recruitment rate was overestimated, or because the trial was stopped early on the strength of a statistically significant result (perhaps after multiple looks at the data, and, if so, was a formal stopping rule or guideline used)?

        Secondly, a power calculation is a strong indication of what is or should be the principal outcome measure for the trial (although it may not indicate how the analysis will be performed). This is a safeguard against changing horses in midstream and claiming a big effect on an outcome that was not a primary outcome or was not even prespecified. Also, a power calculation is explicit evidence that the trial was properly planned and that some thought was given to the size of effect that would be clinically important (even though we all know the values used are often rather optimistic in order to keep the sample size down).

        All authors are members of the CONSORT group. DGA is one of the BMJ's statistical advisers.

        References

        1. Bacchetti P. Peer review of statistics in medical research: the other problem. BMJ 2002;324:1271-1273.
        2. 2.
        3. 3.
        4. 4.

        Author's thoughts on power calculations

        1. Peter Bacchetti (pbacchetti{at}epi.ucsf.edu), professor.
        1. University of California, San Francisco, CA 94143, USA

          EDITOR—I am delighted to see Altman et al reaffirm that confidence intervals are preferable to post hoc power calculations, but I disagree with their reasons for nevertheless presenting prior power calculations for completed studies. Just as confidence intervals more directly and clearly address uncertainty than power calculations, so too the information that they see flowing from prior power calculations can instead be presented more directly. Papers can provide information about early stopping, recruitment rates, and relevant departures from expectations without giving power calculations. The importance of prespecification can be debated, but it is easy enough to simply state the planned primary outcome without reference to power calculations. The clinical importance of a given effect size similarly does not rely on power calculations.

          The remaining point Altman et al cite is that prior power calculations provide assurance that the study was properly planned. The scientific relevance of this is unclear to me. If a study finds important information by blind luck instead of good planning, I still want to know the results. I discussed in the paper the frequent difficulty of conforming to common notions of “proper” sample size planning, and Horrobin has expanded on this. Even when investigators can approximate the ideal, assumptions often turn out to be inaccurate. Does this mean that the studies were poorly planned and should be disregarded? Or should we learn from them what we can? Requiring prior power calculations as back-end enforcement of the sort of ethics-based sample size review that Horrobin describes seems particularly misguided to me. If a study provides information important enough to warrant publication, it seems reasonable to assume that it did not have an unethically small sample size.

          An important reason not to present prior power calculations is that doing so contributes to the very problem it is supposed to mitigate: misinterpreting P>0.05 as providing strong support for the null hypothesis. Authors, reviewers, and readers may reasonably interpret the presence of a power calculation as having some direct relevance for interpreting the results. Given that the reasons for presenting power calculations are, as I have argued, oblique, they may understandably assume that the calculation's purpose is to assure us that the sample size is adequate for the classic statistical hypothesis testing approach of either accepting or rejecting the null hypothesis. Despite subtleties that statistical theorists may try to convey, “accepting” the null hypothesis is often described simply as “there is no difference.” The needed corrective for this type of erroneous interpretation is to encourage investigators and readers to pay attention to estimated effects and the confidence intervals around them, particularly when interpreting “negative” results. I contend that presenting power calculations works against this. In addition, presenting power calculations opens the study to a second round of unhelpful sample size criticisms, of which Zwarenstein's experience provides an extreme example.
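          As a minimal sketch of that corrective (the arm sizes and event counts below are hypothetical, and the simple Wald interval is used only for brevity), a “negative” trial can be reported as an estimate with its confidence interval rather than as a post hoc power:

```python
# Hypothetical two arm trial: report the effect estimate and its 95%
# confidence interval rather than a post hoc power figure.
from math import sqrt

events_a, n_a = 30, 120   # events and patients, intervention arm (invented)
events_b, n_b = 42, 118   # events and patients, control arm (invented)

p_a, p_b = events_a / n_a, events_b / n_b
diff = p_a - p_b                                            # risk difference
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)    # Wald standard error
low, high = diff - 1.96 * se, diff + 1.96 * se              # 95% confidence interval

print(f"risk difference {diff:.3f}, 95% CI {low:.3f} to {high:.3f}")
```

          The interval tells the reader how large or small the true effect could plausibly be, which is exactly the information that a retrospective power figure cannot convey.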

          There is certainly room for disagreement about what practices and recommendations will improve the use of statistical methods in medical research. I thank Altman et al for contributing their perspective and hope that this clarifies where and why I disagree. I also thank the other correspondents for the important information and thoughts they have provided.

          Reviewers' contributions should be thoughtful, constructive, and encouraging

          1. Graham A Barton (andrew.barton{at}phnt.swest.nhs.uk), coordinator.
          1. Research and Development Support Unit, PPMS, University of Plymouth, Room N17, ITTC Building, Tamar Science Park, Plymouth PL6 8BX

            EDITOR—Congratulations to Bacchetti for his paper on some of the difficulties inherent in our present peer review system for publications and grant applications.1 May I add to his examples the following opening gambit from one of our reviewers: “There is a great need regarding virtually all aspects of life for [patient group] across their life spans in those countries in which this population at least has been attended to.” The project team has to respond to this (and the 70 questions that follow) within a week. As we are unable to discern its meaning, or even whether it is a positive or negative comment, writing to the BMJ seems a more useful way to spend 30 minutes of the deadline.

            In our role as a resource for clinicians attempting to get published or applying for funding, and as researchers in our own right, staff at the research and development support unit see more reviews than most. Bacchetti says that the number of criticisms in a review is taken to be a measure of its quality. Reviewers attempting to achieve this “quality” all too often stray into areas about which they know little, statistics being the most obvious example (the English language is another), yet they feel empowered to make damning comments. May I add that it is the sneering attitude with which they feel obliged to do so that makes the process so profoundly disheartening for inexperienced researchers. We all review scientific papers and grant applications in the research and development support unit, as well as providing a peer review system for local trusts under the new research governance arrangements.

            When these reviews are blind I am embarrassed to find my own described by applicants or authors as coming from “the supportive” or “the encouraging” referee. It takes time and effort to put a funding bid together, and applicants are usually to be congratulated on doing so within a strict deadline. Inexperienced authors may still have an important message to convey and should be encouraged to do so. The honour roll is an attractive idea; if it were linked to, say, the research assessment exercise, heads of academic departments might have reason to be enthusiastic about it. This elevated status should be reserved for reviewers whose contribution is thoughtful, constructive, and encouraging.

            References

            1. Bacchetti P. Peer review of statistics in medical research: the other problem. BMJ 2002;324:1271-1273.

            Suggested solution may partly solve other problem

            1. Mary Fox (mfox{at}ex.ac.uk), research psychologist
            1. Department of Mental Health, University of Exeter, Exeter EX2 5AF

              EDITOR—I was delighted to read Bacchetti's article on the flaws of the peer review system.1 The power of ill-informed or undermotivated reviewers is disproportionate to their gatekeeping role. One of the main frustrations with the current system is that researchers are unable to respond to spurious criticism.

              However, a proposal we recently submitted to a medical research charity has given some hope in this area, as we were able to respond to reviewers' comments on it before it went to committee for adjudication. Some of the comments were valid, but most showed that the reviewers had not read the application properly. In either case we were able to amend the proposal or clarify the situation.

              This process added another two weeks before we heard the outcome. That is a small delay considering the many months it takes to put an application together, and it gives some confidence that, when articles or proposals are rejected, enormous amounts of effort have not been wasted for trivial reasons. I recommend that this process become part and parcel of the peer review system.

              References

              1. Bacchetti P. Peer review of statistics in medical research: the other problem. BMJ 2002;324:1271-1273.