Rapid Responses to:

LETTERS:
Julie A Barber, Simon G Thompson, A Coomarasamy, D Van Der Berg, J G Williams, D R Cohen, and I T Russell
Open access follow up for inflammatory bowel disease
BMJ 2000; 320: 1730

Rapid Responses published:

Confidence intervals should be used in reporting trials
Martin Bland   (23 June 2000)
Re: Confidence intervals should be used in reporting trials
J G Williams   (6 July 2000)
Let's Get the Analyses Right
Vance W. Berger, Bethesda, MD 20892-7354   (23 July 2002)

Confidence intervals should be used in reporting trials 23 June 2000
Martin Bland,
Professor of Medical Statistics
St. George's Hospital Medical School

The dispute between Barber and Thompson, advocating a t test, and Williams, Cohen and Russell, advocating a Mann-Whitney U test, has its roots in the use of P values rather than confidence intervals. If Williams et al had reported their results as a confidence interval for the difference in mean cost, as recommended for the results of clinical trials published in BMJ, the question would not have arisen. The large sample Normal confidence interval for the difference in cost of secondary treatment (routine minus open access) is -£180 to +£238, the point estimate being £29.
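
To make the calculation concrete, the following is a minimal sketch in Python (standard library only) of a large sample Normal confidence interval for a difference in means. The function name and the simulated costs are hypothetical and are not the trial data; 1.96 is the usual two-sided 95% Normal quantile.

```python
import math
import random
from statistics import mean, variance


def normal_ci_diff_means(x, y, z=1.96):
    """Large sample Normal CI for mean(x) - mean(y).

    Returns (difference, lower limit, upper limit).
    """
    diff = mean(x) - mean(y)
    # Standard error from the two sample variances (not pooled).
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Simulated, skewed "costs" for illustration only (NOT the trial data).
    rng = random.Random(0)
    routine = [rng.lognormvariate(5.0, 1.0) for _ in range(80)]
    open_access = [rng.lognormvariate(5.0, 1.2) for _ in range(80)]
    diff, low, high = normal_ci_diff_means(routine, open_access)
    print(f"difference {diff:.0f}, 95% CI {low:.0f} to {high:.0f}")
```

An interval of this kind reports the size of the difference and its uncertainty together, which a P value alone does not.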

Barber and Thompson state that although t test methods are only strictly valid for data that are normally distributed, they are fairly robust and give a reliable comparison of means, provided that skewness is not too extreme and sample sizes are moderately large. This is true, but refers to the Type I error, not the Type II error. In other words, the chance of a t test producing a spurious positive result is 5% whatever the data. The power of the t test is undoubtedly reduced when data are highly skew. However, the sample size is surely large enough for the large sample Normal comparison, which does not require data to follow a Normal distribution and to which the t method approximates, to be valid, even with such highly skew data. This provides the confidence interval above, and the same P value as the t test.

Barber and Thompson are correct that the Mann-Whitney U test makes an overall comparison of distributions in the two groups, in terms of both shape and location, and does not specifically test for a difference in means. Although it is often described as a test of the difference between medians, this is only the case if we can assume that the two distributions being compared have exactly the same shape. Under these circumstances, it would be a test for the difference between two means also. This is not the case here, as the standard deviations are different. This is highly significant by an F test, although as this requires a Normal distribution for the data it is only approximate here. Thus Barber and Thompson are correct in arguing that a significant Mann-Whitney U test implies only a difference in distribution, not mean. Cost data typically have very uneven distributions, with many observations having the same low value and a few observations being very high. Distributions can differ considerably and yet have the same or similar means. I agree with Barber and Thompson that Williams et al have misinterpreted the Mann-Whitney result.
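
The point that distributions can differ while their means agree can be illustrated with a small sketch in Python (standard library only). The two samples are invented for illustration and are not the trial data, and the function name is hypothetical; the statistic is the usual count of pairs in which the first sample exceeds the second, with ties counted as one half.

```python
from statistics import mean


def mann_whitney_u(x, y):
    """U statistic for x: number of (x_i, y_j) pairs with x_i > y_j,
    counting each tie as one half."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


if __name__ == "__main__":
    # Invented costs: nine zeros and one large value, against a constant cost.
    skewed = [0] * 9 + [100]
    flat = [10] * 10
    u = mann_whitney_u(skewed, flat)
    print("means:", mean(skewed), mean(flat))
    # U / (n * m) estimates P(X > Y) + 0.5 * P(X = Y); it would be near 0.5
    # if the two distributions were the same.
    print("U:", u, "U/(n*m):", u / (len(skewed) * len(flat)))
```

In this invented example the means are identical, yet U is far from its null expectation of half the number of pairs, so a test based on U would respond to the difference in shape rather than to any difference in mean.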

Williams et al say that the Mann-Whitney U test was only an interim analysis, but I could find no mention of this in their original paper. The actual observed difference is only 5% of the mean for the standard treatment, so the statement by Williams et al that analysis to be published elsewhere confirms that open access greatly reduces secondary care costs is very surprising. I look forward to seeing it.

Re: Confidence intervals should be used in reporting trials 6 July 2000
J G Williams,
director
School of Postgraduate Studies in Medical & Health Care

Professor Bland's contribution to this debate is helpful and we will revisit our bootstrapping analysis.

It is regrettable that editorial change to our original authors' reply substituted the word 'greatly' for our word 'significantly' with regard to differences in secondary care costs. Any differences in these costs, whether significant or otherwise, were never claimed to be great.

Let's Get the Analyses Right 23 July 2002
Vance W. Berger,
Mathematical Statistician
National Cancer Institute,
Bethesda, MD 20892-7354

Bland makes a compelling case that "Potentially incorrect conclusions, based on faulty analyses, should not be allowed to remain in the literature to be cited uncritically by others" [1]. In fact, by chance faulty analyses may lead to proper conclusions. When they do, the harm is not in citing the results, but rather in emulating the faulty analyses in future studies [2]. As such, articles that use faulty analyses set a bad precedent [3], regardless of the results. Even more dangerous are articles endorsing, as opposed to simply using, faulty analyses, especially when these articles appear in the medical (as opposed to statistical) literature, thereby giving the mistaken impression that these faulty analyses are generally held to be correct by all methodologists. Unfortunately, such misleading articles are not as uncommon as many would like to believe.

For example, Barber and Thompson [4] recently endorsed the use of the t-test (which is exact only when the data are normally distributed) instead of the Wilcoxon test (which is exact with or without normality) for cost data. This preference for the t-test, which is based on its using as a test statistic the difference in raw means instead of the mean difference of ranks, was stated authoritatively, as if there could be no reasonable disagreement. In fact, the argument that the difference in raw means is a better test statistic than the difference in mean ranks is debatable [5]. Even granting this, however, one would still have to weigh the benefits of using this "better" test statistic against the need to assume normality presumably entailed by doing so. In fact, such a trade-off is unnecessary, because one could combine the raw mean test statistic with an exact test. This would be the exact t-test [5].

If one could define what constitutes superiority of one test to another in a given context, say cost studies, then it would be possible to objectively compare various tests. It is possible that such a comparison would favor the exact t-test over the Wilcoxon test, although this remains to be seen. Until such an argument is made convincingly, it would be premature to recommend one over the other, especially in the medical literature. If robustness to the normality assumption is a criterion considered in evaluating tests, then it is highly unlikely that the approximate t-test would be favored over the Wilcoxon test. Subject to certain caveats regarding situations in which an approximate method is preferable to the best available exact method [5][6], it is even less likely that the approximate t-test would be preferable to the exact t-test.

The time for publishing unilateral methodological recommendations in the medical literature is only after it is clear that these recommendations are, in fact, correct. So Barber and Thompson [4] are at best premature, and at worst misleading. If we want to ensure that only the best methodologies are used in the medical studies that drive future medical decisions, then recommendations must follow, and not precede (or preempt) serious discussion and debate in the methodological literature. Until this becomes the norm, the public will continue to question the reliability and relevance of the findings of medical studies.
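
As an illustration of combining the raw mean test statistic with an exact test, the following is a minimal sketch in Python (standard library only). It assumes that the "exact t-test" refers to a permutation (re-randomization) test whose statistic is the difference in raw means; that reading, the function name, and the toy data are assumptions made here for illustration, not details taken from reference [5].

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p(x, y):
    """Two-sided exact permutation P value using |mean(x) - mean(y)|.

    Enumerates every way of relabelling the pooled observations into
    groups of the original sizes, so it is only practical for small samples.
    """
    pooled = list(x) + list(y)
    n, m = len(x), len(y)
    total = sum(pooled)
    observed = abs(mean(x) - mean(y))
    as_extreme = 0
    n_splits = 0
    for idx in combinations(range(n + m), n):
        sum_x = sum(pooled[i] for i in idx)
        diff = abs(sum_x / n - (total - sum_x) / m)
        # Small tolerance guards against floating point ties.
        if diff >= observed - 1e-12:
            as_extreme += 1
        n_splits += 1
    return as_extreme / n_splits


if __name__ == "__main__":
    # Toy data for illustration only.
    print(exact_permutation_p([1, 2, 3], [4, 5, 6]))
```

Full enumeration grows combinatorially with sample size; for samples of realistic size the same idea is commonly approximated by drawing random relabellings rather than listing them all.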

References

[1]. Bland M, Fatigue and Psychological Distress: Statistics Are Improbable, BMJ 2000;320:515-516.

[2]. Altman DG, Poor-Quality Medical Research: What Can Journals Do?, JAMA 2002;287:2765-2767.

[3]. Altman DG, Statistics in Medical Journals, Statistics in Medicine 1982;1:59-71.

[4]. Barber JA, Thompson SG, Would Have Been Better To Use t-Test than Mann-Whitney U-test, BMJ 2000;320(7251):1730.

[5]. Berger VW, Lunneborg C, Ernst MD, Levine JG, Parametric Analyses in Randomized Clinical Trials, Journal of Modern Applied Statistical Methods 2002;1(1):74-82.

[6]. Berger VW, Comment on Ludbrook and Dudley, American Statistician 2000;54(1):85-86.