Rapid Responses to:
|
|
Rapid Responses published:
|
|
|||
|
BM HEGDE, Vice Chancellor MANIPAL-576 119. India
|
Dear Sir, I could not agree more! How true! If the editors were to exchange the reject file with the accept file, I do not think that it makes much of a difference! Yours
|
|||
|
|
|||
|
Reinhard Vonthein, Statistician Tuebingen University Hospital, 72070 Tuebingen, Germany
|
I experienced similar problems, e.g. a scatter plot with regression line and confidence band was thought too complicated and confusing and should have been replaced by two means and standard error bars. Other comments were helpful, e.g. why draw a box plot and report mean and SD? In these cases we wrote as a team of specialists, at least a physician and a statistician. Just this month I assisted two referees. They consulted me, as they would have for writing, because the statistical procedures were beyond their judgement (after obtaining the editors' consent, of course). So we produced reviews of hopefully adequate quality, by matching the writing team with a reviewing team. Besides, we all learned something about applied statistics, and the authors earned praise for their considerable effort. In one case, my services were acknowledged by the editor as a review in its own right. The other case will have to appear on the bill for consultancy. So my new suggestion is to amend review rules and allow review teams. They might be useful in interdisciplinary research, too. Thus the reviewers' lists will grow ever longer, which calls for honours lists of Prof. Bacchetti's design.
|||
|
|
|||
|
Ann E. Smith, MRC Research Fellow University of Ulster, Jordanstown, Newtownabbey, Co. Antrim, BT37 0QB
|
The article by Peter Bacchetti gave an initial impression of displaying a slight tendency to the current popular sport of statistics bashing! That said, having read what the author of the article actually said, I can only agree with his suggestions and conclusions. As I understand it, he is stating that it is reviewers who have limited knowledge of statistics who are the real problem. These people have an inflexible approach to the subject and show "dogmatic" adherence to the rules because they do not understand the limitations of what can be achieved. For example, with sample size estimations in complex studies, it is frequently not possible to have an answer; one can only compromise and admit to the deficiencies in this. Probably statistics teaching has a lot to answer for, in that the subject has difficult concepts and the only way to try to get these across is to give the student something decisive to hang on to. I can remember as a PhD student being howled down for saying something was marginally significant. Either significant or not was the answer, with a clear cut-off (usually p = 0.05). The recent (not yet published) study by the BMJ (in which I was a participant) has pre-empted this article to some extent by going some way to addressing whether training can help in improving the quality of peer reviews.
|||
|
|
|||
|
Merrick Zwarenstein, director, health systems research medical research council, tygerberg south africa, 7505
|
The confusion surrounding sample size estimates in research protocols elicits quite strange responses from reviewers when they are faced with the completed research in a report submitted to a journal for publication. One of our submissions was rejected because the planned sample size was not attained. However, the effect size was greater in the study than we had anticipated, and thus the difference was of clinical and statistical significance. Another submission met the same fate for a similar reason: it was an equivalence trial, and even though the difference in effect between intervention and control arms (and both sides of the confidence interval around this difference) lay completely within the equivalence interval, the fact that the planned sample size was not attained in some way invalidated the result in the mind of the reviewer. While sample size estimation is useful in considering the feasibility of conducting a study (and protocol reviewers should discourage funding for studies which are plainly too small to be meaningful), attainment of the planned sample size does not seem to me to be a useful indicator by which journal reviewers should assess the validity of a completed research report in which clinically and statistically meaningful results have been obtained. My competing interest in relation to this question is my desire to publish research in the face of over-optimistic sample size estimates in my grant proposals.
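The equivalence argument above can be illustrated with a small calculation. This is only a sketch with invented numbers, not data from the trials described: it assumes a continuous outcome with a known common standard deviation, a normal approximation, and a symmetric equivalence margin. The function names `difference_ci` and `within_equivalence` are hypothetical, and the confidence level is left as a parameter because conventions for equivalence trials differ.

```python
from statistics import NormalDist


def difference_ci(diff, sd, n_a, n_b, conf=0.95):
    """Normal-approximation confidence interval for a difference in means.

    diff     : observed difference (intervention minus control)
    sd       : assumed common standard deviation of the outcome
    n_a, n_b : number of participants actually analysed in each arm
    """
    se = sd * (1 / n_a + 1 / n_b) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return diff - z_crit * se, diff + z_crit * se


def within_equivalence(ci, margin):
    """True if the whole interval lies strictly inside (-margin, +margin)."""
    lower, upper = ci
    return -margin < lower and upper < margin


# Hypothetical trial: 200 per arm were planned, only 120 per arm were recruited.
ci = difference_ci(diff=0.5, sd=10.0, n_a=120, n_b=120)
print(ci, within_equivalence(ci, margin=5.0))
```

In this invented case the interval still falls inside the margin despite the recruitment shortfall. The conclusion rests on the interval from the data actually collected; the planned sample size does not enter the calculation at all.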
|||
|
|
|||
|
Graham A Barton, Research & Development Support Unit Coordinator RDSU, PPMS, University of Plymouth, Room N17, ITTC Building, Tamar Science Park, Plymouth PL6 8BX
|
Congratulations to Peter Bacchetti for his excellent paper on some of the difficulties inherent in our present peer review system for publications and grant applications [1]. May I add to his examples the following opening gambit from one of our reviewers: “There is a great need regarding virtually all aspects of life for [patient group] across their lifespans in those countries in which this population at least has been attended to.” The project team has to respond to this (and the 70 questions that follow) by next Wednesday. As we are unable to discern its meaning, or even whether it is a positive or negative comment, writing to the BMJ seems a more useful way to spend the next 30 minutes. In our role as a resource for clinicians attempting to get published or applying for funding, and as researchers in our own right, Research and Development Support Unit (RDSU) staff see more reviews than most. Bacchetti rightly says that the number of criticisms in a review is taken to be a measure of its quality. Reviewers attempting to achieve this ‘quality’ all too often and obviously stray into areas about which they know little, and statistics is the most obvious example (the English language in our example). They nevertheless feel empowered to make damning comments. May I add that it is the sneering attitude with which they feel obliged to do so which makes the process so profoundly disheartening for inexperienced researchers. We all review scientific papers and grant applications in the RDSU, as well as providing a peer review system for local trusts under the new research governance arrangements. When these are blind I am embarrassed to find my own reviews described by applicants/authors as “the supportive” or “the encouraging” referee. It takes time and effort to put a funding bid together, and applicants are usually to be congratulated on doing so within a strict deadline. Inexperienced authors may still have an important message to convey and should be encouraged to do so.
The honour roll is an attractive idea and could be linked to, say, the RAE for heads of academic departments to be enthusiastic. This elevated status should be reserved for reviewers whose contribution is thoughtful, constructive, and encouraging. 1. Bacchetti P. Peer review of statistics in medical research: the other problem. BMJ 2002;324:1271-1273 |
|||
|
|
|||
|
Marina Cuttini, senior epidemiologist Regional Agency for Health of Tuscany, Via Vittorio Emanuele II 64, 50134 Florence and IRCCS Burlo G, Eva Buiatti
|
Bacchetti proposes the grading of the referees' performance by fellow reviewers of the same paper as a way to improve the review process and reward good quality work (BMJ 2002; 324: 1271-73). We share his concerns about "finding flaws that are not there" and discouraging authors with unfounded criticisms. Yet we wonder whether asking the referees for the additional effort of grading each other's reviews, after having graded the original paper, would not further discourage their unpaid and unrecognized work rather than improve it. We propose that a policy of a) simply forwarding each referee's comments not only to the author but also to all other fellow reviewers of the same paper, and b) routinely informing them of the final outcome of the review (i.e. publication of the paper, rejection, or request for modification) would provide the referees with a concrete sign that their work has been paid attention to, and with the opportunity for critical self-analysis and improvement. In many years of peer reviewing in the fields of pediatrics, ethics and epidemiology, only once did we obtain this kind of feedback from a scientific journal: it was an interesting and instructive experience.
|||
|
|
|||
|
Ross E Upshur, Primary Care Research Unit University of Toronto
|
I'm sympathetic to Professor Bacchetti's call for changes in the culture of peer review. I am not convinced that he has provided sufficient argumentation to support his claim that "Mistaken criticism is a general problem, but may be especially acute for statistics." I'm even more surprised that such claims are made in the absence of any statistical support. Professor Bacchetti's supporting argumentation uses quotations from selected grants without any sense of how these were chosen from the universe of possible examples. I agree with his sentiments that criticisms should be raised only with complete understanding of the matter at hand, but this is good advice generally and is not specific to statistical methods. The bigger question is how such mistaken statistical ideas arise in the first place. The proximate cause may be rooted in the statistical education of the reviewers. The potential role that statisticians play in mistaken criticism is an area not addressed by Professor Bacchetti.
|||
|
|
|||
|
axel ellrodt, american hospital of paris 63 Bd Victor Hugo. 92202 Neuilly sur Seine. France
|
Does small sample size really hamper the validity of positive studies? I am a layman in statistics, but I remember having been taught by my statistics teacher at medical school that the small number of subjects is taken into account by the statistical calculation. Therefore, if a well-designed trial on 30 patients shows that treatment A is significantly better than treatment B (p = 0.02, for example), this has no less value than a bigger trial. My feeling is that such a significant difference is more likely to be clinically relevant than if a 30,000-subject trial were needed to show the same difference. I'd appreciate any comment that would tell me where I'm wrong.
|||
|
|
|||
|
Richard Smith, Editor BMJ
|
Dr Hegde, a friend whom I long to join on his 4 am walk through the tropical scents of Karnataka, writes: "If the editors were to exchange the reject file with the accept file, I do not think that it makes much of a difference!" Robbie Fox, the great editor of the Lancet, famously had the same thought. But perhaps the BMJ has done this already. Perhaps nobody has noticed. How can you know? Richard Smith
|
|||
|
|
|||
|
Mary Fox, Research Psychologist Dept. of Mental Health, University of Exeter, EX2 5AF
|
I was delighted to read Prof Bacchetti's article on the flaws of the peer review system. The power of ill-informed or under-motivated reviewers is disproportionate to their gate-keeping role. One of the main frustrations with the current system is researchers' inability to respond to spurious criticism. However, a proposal recently submitted to a medical research charity has given some hope in this area, as we were able to respond to reviewers' comments on the proposal before it went to committee for adjudication. Some of the comments were valid, but most showed that the reviewers had not read the application properly. In either case we were able to amend or clarify the situation. This process added another two weeks before hearing the outcome. This is a small delay considering the many months an application takes to put together, and it gives some confidence that when articles or proposals are rejected, enormous amounts of effort have not been wasted for trivial reasons. I recommend that this iterative process become part and parcel of the peer review system.
|||
|
|
|||
|
Peter Bacchetti, Professor Univ. of Ca., San Francisco 94143 USA
|
Re: a partial solution to the other problem
I had suggested this very thing in an early version of this paper, but removed it to shorten the piece and because I worried that it would be considered unrealistic. I am delighted to learn that at least one research sponsor has added this to its process. This is much better than resubmitting for a later funding cycle, which is not always possible.
Re: journal reviewers even more baffled by sample size issues than grant proposal reviewers
Issues and misconceptions about sample size warrant much more discussion than I was able to provide in this piece. For studies that are completed, I often recommend against presenting sample size planning information, because confidence intervals around the estimated effect(s) more directly address what should be of concern: the uncertainty in the study's results. Unfortunately, many published standards for presenting studies' results, as well as scales for rating article quality, insist that power calculations are necessary even after the results are known and speculation about power is no longer needed (either p was <0.05 or not). Even so, I have had clients successfully respond to reviews by pointing out that presentation of confidence intervals eliminates the need to show power calculations. This case also brings up a troubling aspect of statistical practice: the meaning of the same data generated by the same biological processes can apparently vary depending on the intentions of the investigator. This case of Dr. Zwarenstein's is so extreme that I wonder whether this was a pretext rather than the real reason for rejection.
Re: Does small sample size really hamper the validity of positive studies?
The initial submission of this paper had included one additional example: "… the numbers of patients studied is too small to guarantee the significance of the P-value." I dropped this in order to meet the length requirement, but perhaps this type of misconception is more widespread than I had thought. There can be technical issues about the accuracy of a p-value calculation from a small sample, but an array of exact methods is now available to deal with these concerns. Dr. Ellrodt is correct that p=0.02 is just as hard to get under the null hypothesis with N=30 as it is with N=30,000. If p=0.02 with N=30, the observed advantage of Treatment A must be much larger than in the situation where p=0.02 with N=30,000, because more uncertainty remains with N=30. If a study with N=30,000 only finds p=0.02, it provides very strong evidence against any clinically meaningful effect: the 95% confidence interval does not include any values very far from zero, even though it also does not include zero itself.
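The comparison between N=30 and N=30,000 can be made concrete with a rough calculation. The following is a minimal sketch, not part of the response itself: it assumes a two-arm trial with equal group sizes, a continuous outcome, and a normal (z) approximation (with only 15 per group a t distribution would be somewhat more accurate). The function name `implied_effect` is hypothetical. It back-calculates the standardized difference (in standard deviation units) that would produce a given two-sided p-value, together with its confidence interval.

```python
from statistics import NormalDist


def implied_effect(p_two_sided, n_per_group, conf=0.95):
    """Standardized difference implied by a two-sided p-value.

    Assumes a two-group z-test with equal group sizes; the result is in
    units of the outcome's standard deviation.
    Returns (estimate, lower confidence limit, upper confidence limit).
    """
    norm = NormalDist()
    z_observed = norm.inv_cdf(1 - p_two_sided / 2)   # statistic giving this p
    se = (2 / n_per_group) ** 0.5                    # SE of the difference
    z_crit = norm.inv_cdf(1 - (1 - conf) / 2)
    estimate = z_observed * se
    return estimate, estimate - z_crit * se, estimate + z_crit * se


for total_n in (30, 30_000):
    est, lower, upper = implied_effect(0.02, total_n // 2)
    print(f"N={total_n}: effect {est:.3f} SD, 95% CI {lower:.3f} to {upper:.3f}")
```

Under these assumptions the same p-value corresponds to an estimated effect roughly 30 times larger in the small trial (the ratio is the square root of 1,000), with a much wider interval, while the large trial's interval excludes zero but sits entirely close to it.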
|||
|
|
|||
|
Tim J Cole, Professor of medical statistics Institute of Child Health, London WC1N 1EH, UK
|
I was interested to see Peter Bacchetti's critique of the statistical peer review process, in particular his highlighting the lack of tangible rewards for good reviewers. As one of the BMJ's statistical advisers, I aim when writing a review to get weak papers rejected and strong papers improved. The affirmation of this process is to see suggested improvements to a paper incorporated in the final version, showing that the authors have valued the suggestions. If authors were routinely asked to rate reviewers’ comments, and the ratings were given (with or without reviewers’ names) at the end of the paper with the acknowledgements, this would provide an opportunity for good (and indeed bad!) reviewing to be recognised in a structured way. Competing interest: I am one of the BMJ's statistical advisers. |
|||
|
|
|||
|
Richard D Colman, General and Occupational physician York
|
Over a two-year period, during which I assessed 294 claimants for Disability Living Allowance, I came up with some interesting observations. I submitted this paper as a short report to the Journal of the Royal College of General Practitioners. It was rejected primarily on the basis that I had not sought approval from an ethical committee nor informed the claimants that I might write my observations up for the benefit of my colleagues. As a working doctor I had experience and observations to share, but it appears spontaneous observation and discussion is to be stifled by peer reviewers of the type aptly described. How many great discoveries would have been shelved if their discoverers had had to go back and ask people if they could anonymously share their experiences and ideas?
|||
|
|
|||
|
DR Ahmed N. Ghanem, MD, FRCS, Consultant Urological Surgeon King Khalid Hospital, Najran, Saudi Arabia
|
Sir, I read with interest Professor Peter Bacchetti’s excellent article [1], highlighting “the other problem of peer review of finding flaws that are not really there based on unfounded statistical criticism, and its demoralizing effect on authors”. I wish to add some thoughts to the debated issues. Professor David Horrobin’s original classics on the subject [2,3] have not yet been surpassed. They were updated recently [4] and prompted some contributory thoughts [5]. Having enough experience as an author of rejected articles, and some as a peer reviewer, I find the most devastating effect on an author’s morale comes from making no comment, giving no reason for rejection, or not replying at all. The BMJ is guilty on this account, as an article of mine was rejected that was accepted elsewhere after minor editing [6]. The BMJ, however, is in the good company of most biomedical journals who apply the COPE rules. The article lacked statistics of any kind, which perhaps might be one of the reasons it was disliked at the BMJ. To the Editors’ credit, however, it took about a month to say ‘No’, which caused no loss of momentum, unlike other journals who reach the same verdict on other articles after 6 months or a year, which drags on another year or two before the author can recover and gather enough time, interest and energy to face the damn thing again. One subtle aim of that article [6], mentioned to the BMJ Editors, was an attempt to say that “there is science and in particular evidence based medicine without statistics”. It is playing devil’s advocate to say that statistics has not only been made into a “big lie” but also a ‘false God’. It was invented elsewhere but is currently worshipped only at most medical and surgical journals. A look at Science and Nature testifies that such prestigious magazines have reduced statistics to its real size and value as a “tool for testing a hypothesis”. It is not too basic a question for every biomedical peer reviewer to find out the exact role, aim and limitations of statistics.
Some of this was mentioned in an article [7] nobody noticed save the late great Professor GD Chisholm, editor of Br J Urology. It was based on a study that was rejected by a grant committee. It aimed at resolving 2 of the most serious puzzles of current clinical practice, postoperative hyponatraemia and the multiple organ dysfunction or failure syndromes [8]. However, giving data and statistics [7,8] before clarifying the theories [9] has proved as wrong as putting the cart in front of the horse. Einstein’s method in proposing the special and general theories of relativity is the correct way. When statistics was hailed in the sixties, everyone thought it was the only means to discover “The Unifying Theory”. This has proved both immensely costly and wrong. The basic fact is ‘statistics cannot, was not intended to and never could, make a discovery’. Observation, mental experiments and the X factor are the only way to make a discovery long before it is verified and proved by practical studies and statistical tests. Before explaining the X-factor, allow me to tell a relevant true story that symbolizes the current problem with statistics. Two friends of mine in the UK had a disagreement, made a bet on a round of drinks and decided the first person to enter the hospital club would be the judge. Guess who did? I did, but having no clue how to resolve the conflict I suggested that a flip of a coin might be the best way. They also agreed to my condition that while heads or tails would determine the winner among them, if the coin stood on edge the judge should be the winner of all. It did and I won. Another conflict started on: who should buy the 3rd round of drinks? Both agreed that it was my turn. I explained that buying the 3rd round would gain good company but lose all winnings, and my turn should be the 5th round!
The point is that statistics can tell the probability of heads or tails and exclude the odd outcome, but when the probability evaluates to either 0 or 100% and the truth is known, instead of expiring it generates residual arguments. Professor Richard Smith contributed to this debate by quoting Dr Hegde on Professor Robert Fox’s famous thought that “swapping the rejects with the accepts does not make a difference.” He added that “perhaps it has already been done at BMJ” and asked “How can you know?” With due respect Sir, I frankly think nobody can. Despite the proven incremental value of an average article, it does not make a noticeable difference or great loss to scientific advances. Statistically speaking that means a quality article submitted to the BMJ has a 50% chance of being accepted or rejected. So, why not save everybody the trouble and toss a coin? Here is where statistics has shot itself in the foot. It gives an average chance to the average and an odd chance to the odd but can’t tell which is important. The odd chance of a tossed coin standing on edge matches that of a breakthrough scientific or medical article coming an editor’s or peer reviewer’s way, but detecting such an article makes all the difference. Some call it a hunch or gut feeling. Others qualify it by the three-pronged tests of quality, relevance and civility. Identifying the “X-factor” that makes such an article stand out is worth all the trouble. I honestly do not know what it is, but it is the arresting beauty found in Einstein’s famous papers, Newton’s laws, Mozart’s music and Shakespeare’s writing, among many examples that include medicine [2-4]. I wrote 2 articles on such para-scientific, para-medical stuff to identify the X-factor, “Rules and lures of the science game” and “The Mozarts of Science”, sent to journals nearly two years ago, and have not received a reply yet. I think a message of “Ignore the big headed bustard” arrived. The people qualified to find the X-factor are COPE members.
Another question that requires a ‘Yes’ or ‘No’ answer would be: if any of Einstein’s papers were evaluated using the current peer review standards and statistics adopted by most biomedical journals, would it be accepted? Peer reviewers take note: this BMJ “Rapid Response” site is a practical warning, and I think its inventor has already earned a place in medical history.
References
1. Bacchetti P. Education and debate. Peer review of statistics in medical research: the other problem. BMJ 2002; 324: 1271-1273.
2. Horrobin DF. Referees and research administrators: Barriers to scientific research. BMJ 1974; 2: 216-218.
3. Horrobin DF. The philosophical basis of peer review and the suppression of innovation. JAMA 1990; 263: 1438-1441.
4. Stephens WE. Basic philosophy and concepts underlying peer review. Medical Hypotheses 1999; 52: 31-36.
5. Ghanem AN. Guidelines and Code of Ethics. Saudi Medical Journal 2000; 21 (7): 694.
6. Ghanem AN. Features and Complications of Nephroptosis Causing the Loin Pain and Haematuria Syndrome: Preliminary Report. Saudi Med J 2002; 23 (2): 197-205.
7. Ghanem AN, Ward JP. Osmotic and metabolic sequelae of volumetric overload in relation to the TURP syndrome. Br J Urol 1990; 66: 71-78.
8. Ghanem AN. Magnetic field-like fluid circulation of a porous orifice tube and relevance to the capillary-interstitial fluid circulation: Preliminary report. Medical Hypotheses 2001; 56 (3): 325-334.
9. Ghanem AN. Serum sodium changes during and after transurethral prostatectomy. Saudi Medical Journal 2002; 23 (4): 477-479.
|||
|
|
|||
|
David F Horrobin, Research Director Laxdale Ltd, Stirling, FK7 9JQ
|
The article by Bacchetti (BMJ, 20 May) with its comments about uncertainties surrounding power calculations prompted me to seek the advice of your correspondents about an issue which seems to be becoming more common and which has major implications for clinical research.
Laxdale is a pharmaceutical company which, like many others, frequently conducts pilot studies on new entities. These involve the administration of novel, potentially therapeutic compounds to patient populations which have never been exposed to that compound previously. Our usual practice is to state in the protocol that in this situation there is no reasonable basis for a power calculation. In order to collect information about an effect size (if any) and the variance of that effect size, we state that we plan to randomise a modestly sized group of patients to two or three doses of the active drug and to placebo or a known active treatment. On the basis of such pilot studies which, depending on the indication and on the advice we receive from experienced clinicians, might typically include 15-30 patients per group, we then plan further studies with at least some basis on which we can do a sensible power calculation. However, recently we have encountered Multicentre Regional Ethics Committees (MRECs) which, although satisfied about other aspects of a pilot study, have insisted on a formal power calculation as an ethical requirement. One MREC argued that in the absence of information about our product in a particular patient population we should look at results in similar patient populations obtained with other products with quite different mechanisms of action. It was argued that we should base our power calculations on these other results, and on the minimum useful clinical improvement which might be expected. In vain we pointed out that there was no scientific basis for basing a power calculation on results with a quite different product. Moreover, by basing the power calculation for a pilot study on the minimum useful benefit, group sizes would have to be very large, and as a result unnecessarily large populations of patients might be exposed either to placebo, or an ineffective drug, or even a drug which might be toxic in this new patient population.
However, these arguments were not accepted and we were told that ethical approval would be withheld until we performed a power calculation. So we complied. There appears to be little literature on the use of power calculations in pilot studies. More surprisingly, we have not been able to find a single published study which, in a consecutive series of trials, compared pre-study power calculations with the actual clinical results obtained. There are studies of power calculations in published papers, but that is very different from prospectively evaluating whether power calculations have real validity or whether they require so many assumptions that in practice they are of limited value. There is a strong theoretical basis for power calculations, but the complete absence of any published prospective evaluation raises suspicions. I can think of no other procedure in clinical research or clinical practice which has become a standard requirement with so little evidence from real world studies. I am seeking two pieces of advice from your readers. The answers are relevant not just to us, but to anyone from either industry or academia trying to do pilot studies of new compounds: 1. What is the rationale and where is the evidence base for requiring power calculations in pilot studies of new entities being studied for the first time in a new population? 2. Is the requirement for power calculations for any study anything more than a theoretical construct, or does there exist a prospectively collected series of studies which demonstrates that predictions made using power calculations are consistently validated by what actually happens during the study? Does the application of power calculations have a strong experimental basis which validates the procedures used? Or are the inevitable assumptions which must be made so flawed that calculated power frequently bears little relation to actual power? David F. Horrobin, DPhil, BM, BCh
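The dependence of calculated power on its input assumptions can be sketched numerically. The example below is illustrative only and uses invented figures: it assumes a two-arm comparison of means with equal group sizes, a known common standard deviation, and the standard normal-approximation formula n = 2((z_{1-alpha/2} + z_{power}) * sd / delta)^2 per group. The function names `n_per_group` and `actual_power` are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_NORM = NormalDist()


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Participants needed per group to detect a mean difference `delta`."""
    z_alpha = _NORM.inv_cdf(1 - alpha / 2)
    z_power = _NORM.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


def actual_power(true_delta, sd, n, alpha=0.05):
    """Approximate power when the true difference is `true_delta`.

    Ignores the (negligible) chance of significance in the wrong direction.
    """
    z_alpha = _NORM.inv_cdf(1 - alpha / 2)
    se = sd * sqrt(2 / n)
    return _NORM.cdf(abs(true_delta) / se - z_alpha)


# Planned under the assumption of a 5-unit effect with SD 10.
planned_n = n_per_group(delta=5.0, sd=10.0)
print("planned n per group:", planned_n)
# If the true effect is only half the assumed one, power falls sharply.
print("power if true effect is 2.5:", round(actual_power(2.5, 10.0, planned_n), 2))
# Planning for the smaller effect roughly quadruples the required sample.
print("n per group for a 2.5-unit effect:", n_per_group(delta=2.5, sd=10.0))
```

In this sketch, halving the assumed effect size roughly quadruples the required group size, and a study sized for the larger effect has well under half its nominal power if the smaller effect is the true one. This is the sense in which the calculated power can bear little relation to actual power when the inputs are guesses.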
|
|||
|
|
|||
|
Michael J Campbell, Professor of Medical Statistics University of Sheffield
|
Can I support the responders to Peter Bacchetti's interesting article who argue that it is rigid adherence to rules by people with a limited understanding of statistics that is the problem? On the issue of sample size, to quote from a recent paper by Williamson et al (1): 'Their (sample size calculations) purpose is not to give an exact number, but rather to subject the study design to scrutiny, including an assessment of the validity and reliability of data collection, and to give an estimate to distinguish whether tens, hundreds or thousands of participants are required....The view is taken that all studies should include an honest assessment of the power and effect size of a study, but that an ethics committee need not automatically reject studies of low power.' I think one solution is improved feedback to referees and reviewers. I sit on a scholarship panel, and now routinely provide feedback as to whether there are 'hawks' and 'doves' on the panel and who they are. I believe the panel finds this helpful! Conflict of interest: I am on the BMJ Statistical advisory panel. Ref: 1. Williamson P, Hutton JL, Bliss J, Blunt J, Campbell MJ and Nicolson E. Statistical review by research ethics committees. J Roy Statist Soc A 2000; 163: 5-13.
|||
|
|
|||
|
Douglas G Altman, Director Centre for Statistics in Medicine, Institute of Health Sciences, Oxford, Douglas G Altman, David Moher , Kenneth F Schulz
|
Merrick Zwarenstein has recounted how he had papers rejected because the trials failed to reach the planned sample size. Peter Bacchetti, in reply, said that he often recommends against presenting sample size planning information in reports of completed trials, because confidence intervals more directly address uncertainty in the study's results. He also observed that "Unfortunately, many published standards for presenting studies' results, as well as scales for rating article quality, insist that power calculations are necessary even after the results are known and speculation about power is no longer needed (either p was <0.05 or not)." We wish to explain why the CONSORT statement [1,2] includes the recommendation that RCT reports say how the sample size was determined, including details of a prior power calculation if done. There is indeed little merit in calculating the statistical power once the results of the trial are known; the power is then appropriately indicated by confidence intervals. And we agree that failing to reach the planned sample size is not a reason to reject a paper. However, power calculations are still of importance to readers, both directly and indirectly. First, if the achieved smaller size differs from the planned sample size the reader will wish to know why: was this was just because of an overestimate of the likely recruitment rate or because the trial stopped early because of a statistically significant result (perhaps after multiple looks at the data, and, if so, was a formal stopping rule or guideline used)? Second, a power calculation indicates strongly what is or should be the principal outcome measure for the trial (although it may not indicate how the analysis will be performed). This is a safeguard against changing horses in midstream and claiming a big effect on an outcome that was not a primary outcome or even not prespecified. 
Also, a power calculation is explicit evidence that the trial was properly planned and that some thought was given to the size of effect that would be clinically important (even though we all know the values used are often rather optimistic in order to keep the sample size down).
Douglas G Altman (a), David Moher (b), Kenneth F Schulz (c)
(a) Cancer Research UK Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford, UK
(b) Thomas C Chalmers Centre for Systematic Reviews, University of Ottawa, Ontario, Canada
(c) Family Health International and Department of Obstetrics and Gynecology, School of Medicine, University of North Carolina at Chapel Hill, North Carolina, USA
1 Moher D, Schulz KF, Altman DG for the CONSORT Group. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials. Lancet 2001;357:1191-4.
2 Altman DG, Schulz KF, Moher D, Egger M, Davidoff F, Elbourne D, Gøtzsche PC, Lang T for the CONSORT Group. The revised CONSORT statement for reporting randomized trials: explanation and elaboration. Annals of Internal Medicine 2001;134:663–94.
Competing interests: All authors are members of the CONSORT group. DGA is one of the BMJ's statistical advisers.
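For readers unfamiliar with the arithmetic behind a prior power calculation of the kind discussed in this response, the following is a minimal sketch, not anything taken from the CONSORT statement. It assumes a two-arm parallel-group trial comparing means, equal group sizes, a two-sided test, and the usual normal approximation. The function name `n_per_group` and all input values are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate participants needed per arm to detect a mean difference
    `delta` given a common standard deviation `sigma` (normal approximation,
    two-sided test at level `alpha`, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# Hypothetical planning values: SD of 10 units, target difference of 5 units.
print(n_per_group(delta=5, sigma=10))
# An optimistic target difference of 10 units shrinks the requirement sharply.
print(n_per_group(delta=10, sigma=10))
```

Because the required size scales with the inverse square of the target difference, doubling the assumed effect cuts the sample size to about a quarter, which is why optimistic planning values keep the sample size down.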
|||
|
|
|||
|
Peter Bacchetti, Professor Univ of Ca, San Francisco, CA 94143 USA
|
I am delighted to see Professor Altman et al. reiterate that confidence intervals are preferable to post-hoc power calculations, but I must disagree with the reasons they cite for nevertheless presenting prior power calculations for completed studies. Just as confidence intervals more directly and clearly address uncertainty than power calculations, so too the information that they see flowing from prior power calculations can instead be presented more directly. Papers can provide information about early stopping, recruitment rates, and any relevant departures from expectations (some may not be scientifically important) without any reference to power calculations. While the importance of prespecification can be debated (is it really impossible for an effect to be important if the outcome was not prespecified?), it is easy enough to simply state the planned primary and secondary outcomes, again without reference to power calculations. The clinical importance of a given effect size similarly does not rely on power calculations; instead, power calculations usually rely on (subjective) assessments of clinical importance.
The remaining point they cite is that prior power calculations provide assurance that the study was 'properly planned'. The scientific relevance of this is unclear to me. If a study finds important information by blind luck instead of good planning, so be it. I still want to know the results. I discussed in the paper the frequent difficulty of conforming to common notions of 'proper' sample size planning, and Dr. Horrobin has expanded on this. In addition, even when investigators are able to conform to the ideal model, the assumptions frequently turn out to be inaccurate. Does this mean that the studies were poorly planned? Should they be disregarded? Or should we learn from them what we can?
Requiring prior power calculations may provide back-end enforcement of the sort of ethics-based sample size review that Dr. Horrobin describes. One problem with this is that it extends the reach of a process that is already unrealistically rigid and obstructionist. Another is that, unlike front-end enforcement, it is easily circumvented by providing a power calculation that was not really done beforehand. If a study passed ethical review before commencing and provides information important enough to warrant publication, it seems reasonable to assume that its potential to produce such information was high enough that asking for the subjects' cooperation was ethical.
If the reasons for presenting prior power calculations are weak, the question remains: why not? I believe the most important reason is that it contributes to the very problem it is supposed to mitigate: misinterpreting p>0.05 as providing strong support for the null hypothesis. Authors, reviewers, and readers may reasonably interpret the presence of a power calculation as having some direct relevance for interpreting the results. Given the oblique (I have argued) reasons for presenting power calculations, they may understandably assume that the calculation is there to assure us that the sample size is adequate to support the classical statistical hypothesis testing approach of either 'accepting' or 'rejecting' the null hypothesis. And despite subtleties that statistical theorists may try to convey, 'accepting' the null hypothesis is usually described as, 'There was no difference'. The needed corrective for this type of erroneous, binary interpretation is to encourage investigators and readers to pay attention to estimated effects and the confidence intervals around them. I contend that presenting power calculations works against this. In addition, presenting power calculations opens the study to a second round of unhelpful sample size criticisms, of which Dr. Zwarenstein's experience provides an extreme example.
There is certainly room for disagreement about what practices and recommendations will improve the use of statistical methods in medical research. I thank Professor Altman, et al. for contributing their perspective and hope that this clarifies where and why I disagree.
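The contrast drawn above between a binary reading of p>0.05 and attention to the estimate with its confidence interval can be illustrated with a small sketch. It assumes a comparison of two group means from summary statistics and a normal approximation. The function name `difference_with_ci` and every number used are hypothetical, invented purely for illustration and not drawn from any study mentioned on this page.

```python
import math
from statistics import NormalDist


def difference_with_ci(mean_a, mean_b, sd_a, sd_b, n_a, n_b, level=0.95):
    """Return (difference, lower, upper, p) for a difference in two means,
    using a normal approximation and unpooled standard errors."""
    diff = mean_a - mean_b
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)            # e.g. about 1.96 for 95%
    p = 2 * (1 - NormalDist().cdf(abs(diff) / se))      # two-sided p value
    return diff, diff - z * se, diff + z * se, p


# Hypothetical small trial: 20 per arm, SD 10, observed difference of 4 units.
diff, lower, upper, p = difference_with_ci(4, 0, 10, 10, 20, 20)
print(f"difference {diff:.1f}, 95% CI {lower:.1f} to {upper:.1f}, p = {p:.2f}")
```

With these made-up inputs the p value exceeds 0.05, yet the interval runs from roughly -2.2 to 10.2, so a difference of 5 units (taken here as a hypothetical clinically important effect) is entirely compatible with the data. Reporting 'There was no difference' would discard exactly that information.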
|||
|
|
|||
|
Donald R Forsdyke, Associate Professor Department of Biochemistry, Queen's University, Kingston, Ontario, Canada K7L3N6
|
The messages of Richard Smith and Peter Bacchetti come over quite clearly. Peer-review processes can be highly error-prone, a problem that is "especially acute for highly innovative research." However, the fact of high error-proneness demands profounder reforms than those proposed by Bacchetti. The "changes in the culture" should include recognition that improving the probability of correct evaluations in error-prone environments requires a restructuring of peer-review processes as currently practiced. The basic rules, quite familiar to those who survive in Wall Street, are (i) hedge your bets, and (ii) look at track record. For more on this please see my peer-review web-site at http://post.queensu.ca/~forsdyke/peerrev.htm Sincerely,
|
|||
|
|
|||
|
Brian Morgan, Freelance journalist Cardiff CF11 6LF
|
Conclusions presented in a peer-reviewed publication are meant to be the 'gold standard' against which medical treatment is measured. Popular press articles, books, even letters to the same journal are relegated to 'so what' status. "Peer review," medics say; the rest is just so much personal opinion and unverified anecdote, to be ignored. Now we hear that peer review is badly flawed: you could swap the rejected and accepted piles and no one would notice, it is suggested. I think the victims of bad medical science would, and have.
|||
|
|
|||
|
Stephen J Senn, Professor of Pharmaceutical and Health Statistics University College London WC1E 6BT
|
I agree with Professor Bacchetti (and most correspondents). Power is of no relevance in interpreting a completed study. In his classic, Planning of Experiments, Sir David Cox has this to say. 'Power is important in choosing between alternative methods of analysing data and in deciding on an appropriate size of experiment. It is quite irrelevant in the actual analysis of data'(1).
I also agree with much of what Dr Horrobin says but he overstates the case against sample size determinations in pilot studies. In most indications the variability is a function of the disease not the treatment and the fact that the treatment has not been studied is no bar to using an estimate. The difference you are seeking is not the same as the difference you expect to find and again you do not have to know what the treatment will do to find a figure. This is common to all science. An astronomer does not know the magnitude of new stars until he has found them but the magnitude of star he is looking for determines how much he has to spend on a telescope.
The definition of a medical statistician is 'one who will not accept that Columbus discovered America...because he said he was looking for India in the trial plan'(2). Columbus made an error in his power calculation, he relied on an estimate of the size of the Earth that was too small, but he made one nonetheless and it turned out to have very fruitful consequences.
References
1. Cox, D.R. Planning of Experiments, Wiley, New York, 1958, p161
2. Senn, S.J. Statistical Issues in Drug Development, Wiley, Chichester, 1997, p58
Competing interest. I consult extensively for the pharmaceutical industry and my career as an academic is furthered by publication and grants awarded.
|||