Mann-Whitney test is not just a test of medians: differences in spread can be important
BMJ 2001; 323 doi: https://doi.org/10.1136/bmj.323.7309.391 (Published 18 August 2001) Cite this as: BMJ 2001;323:391
Anna Hart, principal lecturer (ahart{at}uclan.ac.uk)
Accepted 20 February 2000
The Mann-Whitney (or Wilcoxon-Mann-Whitney) test is sometimes used for comparing the efficacy of two treatments in clinical trials. It is often presented as an alternative to a t test when the data are not normally distributed. Whereas a t test is a test of population means, the Mann-Whitney test is commonly regarded as a test of population medians. This is not strictly true, and treating it as such can lead to inadequate analysis of data.
Summary points
The Mann-Whitney test is used as an alternative to a t test when the data are not normally distributed
The test can detect differences in shape and spread as well as just differences in medians
Differences in population medians are often accompanied by equally important differences in shape
Researchers should describe the clinically important features of data and not just quote a P value
Use of Mann-Whitney test
The Mann-Whitney test is a test of both location and shape. Given two independent samples, it tests whether one variable tends to have values higher than the other. As Altman states, one form of the test statistic is an estimate of the probability that one variable is less than the other,1 although this statistic is not output by many statistical packages. In the case where the only distributional difference is a shift in location, this can indeed be described as a difference in medians. Hence, for example, the online help facility in Minitab 10.51 states that the Mann-Whitney test is “a two-sample rank test for the difference between two population medians … It assumes that the data are independent random samples from two populations that have the same shape.” Figure 1 shows two distributions for which this is the case. One distribution is shifted 0.75 units to the right: the medians differ by 0.75 units but the shapes are identical.
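The form of the test statistic that estimates the probability that one variable is less than the other can be obtained by dividing the Mann-Whitney U statistic by the product of the two sample sizes. The sketch below illustrates this in plain Python. The function name `mann_whitney_u` and the two small samples are hypothetical and chosen only for illustration; ties are counted as one half, which is a common convention.

```python
def mann_whitney_u(x, y):
    """Count the pairs (xi, yj) with xi < yj; a tie counts as one half."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi < yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Hypothetical samples, for illustration only (not data from any cited paper).
group_a = [1, 2, 3, 4, 5]
group_b = [3, 4, 5, 6, 7]

u = mann_whitney_u(group_a, group_b)
# Dividing U by the number of pairs estimates P(A < B), plus half of P(A = B).
p_hat = u / (len(group_a) * len(group_b))

print(f"U = {u:g}, estimated P(A < B) = {p_hat:.2f}")
# prints: U = 20.5, estimated P(A < B) = 0.82
```

A value near 0.5 would suggest that neither group tends to have higher values. Reporting this estimate alongside the P value gives the reader a direct measure of how strongly one group tends to exceed the other.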
Theoretically, in large samples the Mann-Whitney test can detect differences in spread even when the medians are very similar. An alternative form of the test is better than the standard Mann-Whitney test for this purpose,2 but it is not very efficient when population medians are unequal and is not widely available in statistical packages.
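A small simulation can make this point concrete. The sketch below is an illustration under assumed distributions, not an analysis from this paper: X is exponential with median 1, and Y is built to have the same median but three times the spread about it. For these particular choices a short calculation gives P(X < Y) of about 0.47 rather than 0.5, so a large enough sample would lead the Mann-Whitney test to reject even though the medians are equal. (If both distributions were symmetric about the same point, this probability would be exactly 0.5.) The seed, sample size, and helper name are arbitrary assumptions.

```python
import math
import random
import statistics
from bisect import bisect_left


def prob_x_less_than_y(x, y):
    """Estimate P(X < Y) for continuous data by counting pairs with x < y."""
    xs = sorted(x)
    # For each y value, bisect_left gives the number of x values strictly below it.
    below = sum(bisect_left(xs, yj) for yj in y)
    return below / (len(x) * len(y))


rng = random.Random(2001)   # fixed seed so the run is repeatable (arbitrary choice)
rate = math.log(2)          # an exponential with this rate has median 1
n = 5000                    # assumed sample size per group

x = [rng.expovariate(rate) for _ in range(n)]
# Y has the same median (1) but deviations from the median are three times as large.
y = [1 + 3 * (rng.expovariate(rate) - 1) for _ in range(n)]

p_hat = prob_x_less_than_y(x, y)
print(f"sample medians: {statistics.median(x):.2f} and {statistics.median(y):.2f}")
print(f"estimated P(X < Y): {p_hat:.3f}")
```

The departure from 0.5 is modest here, which is consistent with the remark that the standard test is not the best tool for detecting a difference in spread.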
Differences in population medians are often accompanied by other differences in spread and shape. Moreover, the difference in medians may not be the most striking or the most clinically important difference. It is important to look at distributional differences and discuss them. Figure 2 shows an example in which the median values are 0.65 and 1.14 units. The distribution with the larger median also has larger spread. The spread is shown clearly in Figure 3, which shows box plots of samples of 25 drawn from these two distributions. (The P value from the Mann-Whitney test is 0.02.) If the difference is assumed to be merely a difference in medians, other clinically important information could be ignored.
Methods
I examined the use of the Mann-Whitney test in papers published in the BMJ between September 1999 and August 2000. I did an online search of the electronic text of the journal using the keywords Wilcoxon, Mann, and Whitney. I identified five papers that had used the Mann-Whitney test but where, in my judgment, the information given suggested that there might be important distributional differences other than a shift in location. These are described briefly below.
Examples
Grande et al studied the impact on place of death of a hospital at home service for palliative care.3 The authors noted a significant difference among patients randomised to hospital at home care: “Patients in the hospital at home group who were admitted to the service survived significantly longer after referral than hospital at home patients who were not admitted (16 v 8 days).” There were 112 patients admitted to the service (median survival 16 days, interquartile range 5-42.5) and 73 patients who were not admitted (8, 3-18 days). The striking feature of these three summary statistics (the median and the two quartiles) is that each value in the admitted group is about twice the corresponding value in the group that was not admitted. This suggests that the difference between the two distributions might not be just a shift of 8 days: the difference might be multiplicative, not additive. That is, patients who were admitted might survive twice as long as those who were not admitted.
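The arithmetic behind this observation can be checked directly from the quoted summary statistics. The sketch below compares the differences and the ratios of the quartiles and median; the variable names are illustrative. Under a pure shift in location the three differences would be roughly equal, whereas under a multiplicative effect the three ratios would be roughly equal.

```python
# Summary statistics quoted above from Grande et al (survival in days).
admitted = {"lower quartile": 5, "median": 16, "upper quartile": 42.5}
not_admitted = {"lower quartile": 3, "median": 8, "upper quartile": 18}

differences = {k: admitted[k] - not_admitted[k] for k in admitted}
ratios = {k: admitted[k] / not_admitted[k] for k in admitted}

for k in admitted:
    print(f"{k}: difference = {differences[k]:g} days, ratio = {ratios[k]:.2f}")
# prints:
# lower quartile: difference = 2 days, ratio = 1.67
# median: difference = 8 days, ratio = 2.00
# upper quartile: difference = 24.5 days, ratio = 2.36
```

The differences range from 2 to 24.5 days, while the ratios stay between about 1.7 and 2.4, which fits a multiplicative description better than a constant shift of 8 days.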
Williams et al did a cost effectiveness study of open access follow up for inflammatory bowel disease.4 One of the measures was the total cost of secondary care, and this was compared for two groups: open access and routine visit. The mean (SD) cost was £582 (£807.94) for the 77 patients in the open access group and £611 (£475.47) for the 78 patients in the routine visit group. Although the mean is higher in the second group, the standard deviation is much higher in the first. There must, therefore, have been some very large values in the first group. Without further information it is difficult to be sure, but there seem to be distributional differences between the two groups. The choice of a Mann-Whitney test for these economic data has been criticised elsewhere.5 If total expenditure is the aspect of prime interest then a t test would have been more appropriate.6 If the interest lay in the distributions, it is unlikely that the medians alone would adequately have described the differences.
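A rough check on these figures is to compare each standard deviation with its mean. The sketch below computes the coefficient of variation and the total cost implied by each group's mean and size, using only the numbers quoted above; the function and variable names are illustrative. Because costs cannot be negative, a standard deviation larger than the mean is usually taken as a sign of strong right skew, although that reading is a heuristic rather than a formal test.

```python
# Figures quoted above from Williams et al (total cost of secondary care, in pounds).
open_access = {"n": 77, "mean": 582.0, "sd": 807.94}
routine = {"n": 78, "mean": 611.0, "sd": 475.47}


def coefficient_of_variation(group):
    """Standard deviation expressed as a fraction of the mean."""
    return group["sd"] / group["mean"]


def implied_total(group):
    """Total cost implied by the group size and the mean cost per patient."""
    return group["n"] * group["mean"]


for name, group in [("open access", open_access), ("routine visit", routine)]:
    print(
        f"{name}: CV = {coefficient_of_variation(group):.2f}, "
        f"implied total = {implied_total(group):.0f}"
    )
# prints:
# open access: CV = 1.39, implied total = 44814
# routine visit: CV = 0.78, implied total = 47658
```

The open access group has the lower implied total but by far the larger relative spread, which is the kind of distributional difference that a comparison of medians alone would not convey.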
Lux et al studied responses of local research ethics committees.7 A conclusion was that “The required number of complete copies of protocols and documents … was significantly lower for the local committees that used a fast track system.” The 44 committees in the fast track group required a median of three copies (95% percentiles 2 and 13) compared with 11 (1 and 15) copies for the 55 committees in the standard group. Not only are the medians different, the distributions must also be different. About half of the fast track committees asked for two or three copies, whereas about half of the other committees asked for 11-15 copies. These differences, which the authors did not comment on, relate to shape as well as location of the distributions.
Macleod et al studied women with breast cancer from affluent and deprived areas.8 One of their conclusions is “The time between the date of the referral letter and the first clinic was one day shorter in women from affluent areas.” The median (interquartile range) time was 6 (1-13) days in the affluent area and 7 (4-20) days in the deprived area. Although the medians differ by one day, the summary statistics suggest that the data for the deprived group are more right skewed, and differences between the two groups might be much more pronounced for the higher waiting times. It would have been helpful to discuss this in the paper.
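One simple way to quantify the skewness suggested by these summary statistics is the quartile (Bowley) skewness coefficient, which uses only the median and the quartiles. This is an illustrative measure chosen here, not one used in the paper being discussed. Values near 0 indicate that the quartiles are symmetric about the median, and values approaching 1 indicate a long right tail.

```python
def quartile_skewness(q1, median, q3):
    """Bowley's quartile skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


# Waiting times in days quoted above from Macleod et al: (Q1, median, Q3).
affluent = (1, 6, 13)
deprived = (4, 7, 20)

print(f"affluent: {quartile_skewness(*affluent):.2f}")
print(f"deprived: {quartile_skewness(*deprived):.2f}")
# prints:
# affluent: 0.17
# deprived: 0.62

# Differences between areas at each summary point (deprived minus affluent).
gaps = [d - a for a, d in zip(affluent, deprived)]
print(gaps)
# prints: [3, 1, 7]
```

The gap between the two areas is one day at the median but seven days at the upper quartile, which supports the suggestion that the difference is more pronounced for the longer waiting times.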
A similar feature is even more evident in data from a study of pain in blood glucose testing.9 A visual analogue scale was used to record pain at the ear or thumb. The authors report “The median pain score was 2 mm in the ear group and 8.5 mm in the thumb group … the difference in median pain score is small.” Although this is true, the box plots in the paper show that the spread of scores in the thumb group is much greater than for the ear group. In particular, at least three out of 30 people in the thumb group report a score that is at least twice the highest value in the ear group. Overall, values seem much higher in the thumb group. This is important because patients are likely to be more concerned with the worst pain they might experience than the median value.
Recommendations
Researchers should take care to describe their data and to be clear about the features that are most clinically important. They should use the statistical test that is most relevant for their hypotheses, and describe the features of the data that are likely to have caused a hypothesis to be rejected. As is always the case, it is not sufficient merely to report a P value. In the case of the Mann-Whitney test, differences in spread may sometimes be as clinically important as differences in medians, and these need to be made clear to the reader.
Footnotes
Funding: None.
Competing interests: None declared.