Statistics Notes: Detecting skewness from summary information
BMJ 1996; 313 doi: https://doi.org/10.1136/bmj.313.7066.1200 (Published 09 November 1996) Cite this as: BMJ 1996;313:1200
- a ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, PO Box 777, Oxford OX3 7LF
- b Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
- Correspondence to: Mr Altman.
As we have noted before, many statistical methods of analysis assume that the data have a normal distribution.1 When they do not, the data can often be transformed to make them more nearly normal.2 Readers of published papers may wish to be reassured that the authors have carried out an appropriate analysis. When authors present data in the form of a histogram or scatter diagram, readers can see at a glance whether the distributional assumption is met. If, however, only summary statistics are presented, as is often the case, this is much more difficult. If the summary statistics include the range of the data then some idea of the distribution may be gained. For example, a range from 7 to 41 around a mean of 15 suggests that the data have positive skewness. However, as the range is based on the two most extreme (and hence atypical) values, this inference is not reliable. Similar asymmetry affecting the lower and upper quartiles3 would be much more convincing evidence of a skewed distribution. Usually, however, the only summary statistics presented are the mean and either the standard deviation or the standard error. Such information cannot show that the data are near to a normal distribution, but it can sometimes show that they are not.
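The range example above comes down to simple arithmetic, which can be sketched as follows. The function name `range_asymmetry` is ours, chosen for illustration; the numbers are those quoted in the text.

```python
def range_asymmetry(minimum, mean, maximum):
    """Return the distances from the mean to the lower and upper ends of the range."""
    lower = mean - minimum   # how far the smallest value lies below the mean
    upper = maximum - mean   # how far the largest value lies above the mean
    return lower, upper


# Range 7 to 41 around a mean of 15, as in the text.
lower, upper = range_asymmetry(7, 15, 41)
print(lower, upper)  # 8 26
```

The upper distance is more than three times the lower one, which is what hints at positive skewness. Because only the two most extreme observations enter the calculation, the result is a hint and not a test.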
There are two useful tricks. The first rests on the fact that about 95% of a normal distribution lies within two standard deviations of the mean, so about 2.5% of values lie more than two standard deviations below it. It follows that for measurements which must be positive (like most of those encountered in medicine), if the mean is smaller than twice the standard deviation the data are likely to be skewed: a normal distribution with that mean and standard deviation would place an appreciable share of values below zero, which is impossible. Table 1 shows urinary cotinine levels related to number of cigarettes smoked daily. Clearly the data must be highly skewed, as the mean is smaller than the standard deviation in each group. This aspect of the data was not apparent in the original paper, which gave just the means and standard errors. (We added the standard deviations, derived simply as standard error × √n.) As a consequence, it was not easy to see that the use of t tests was incorrect.
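A minimal sketch of this first check, assuming only the mean, standard error, and group size are reported. The function names and the example values are hypothetical; they are not taken from table 1.

```python
import math


def sd_from_se(se, n):
    """Recover the standard deviation from a standard error and sample size."""
    return se * math.sqrt(n)


def suggests_positive_skew(mean, sd):
    """For a measurement that must be positive, flag mean < 2 x SD."""
    return mean < 2 * sd


# Hypothetical summary statistics for a single group (not from table 1).
mean, se, n = 30.0, 5.0, 25
sd = sd_from_se(se, n)
print(sd, suggests_positive_skew(mean, sd))  # 25.0 True
```

A `True` result indicates only that a normal distribution is implausible for a positive measurement. A `False` result does not show that the data are close to normal.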
The second indicator of skewness can be used when, as in table 1, there are data for several groups of individuals. As we have noted,2 deviations from the normal distribution and a relation between the standard deviation and mean across groups often go together. If the standard deviation increases as the mean increases then this is a good indication that the data are positively skewed, and specifically that a log transformation may be needed.2 There is a clear relation between mean and standard deviation for the cotinine data. Log transformation often both removes skewness and makes the standard deviations more similar.
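The second check can be sketched in the same way, given a (mean, standard deviation) pair for each group. The function names and the group values below are hypothetical and are not the cotinine data. The coefficient of variation (standard deviation divided by mean) is our addition: a roughly constant value across groups suggests that the standard deviation is proportional to the mean, the pattern that a log transformation is usually expected to stabilise.

```python
def sd_rises_with_mean(groups):
    """True if, with groups ordered by mean, the SD increases at every step.

    `groups` is a list of (mean, sd) pairs; at least two groups are required.
    """
    sds = [sd for _, sd in sorted(groups)]
    return len(sds) > 1 and all(a < b for a, b in zip(sds, sds[1:]))


def coefficients_of_variation(groups):
    """SD divided by mean for each group, in the order given."""
    return [sd / mean for mean, sd in groups]


# Hypothetical (mean, sd) pairs for three groups (not from table 1).
groups = [(10.0, 12.0), (20.0, 25.0), (40.0, 50.0)]
print(sd_rises_with_mean(groups))  # True
print(coefficients_of_variation(groups))
```

With only a handful of groups a strictly increasing pattern can arise by chance, so this is an informal aid to reading a table and not a formal test.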
In this example we can detect skewness from summary statistics, but we cannot tell what the effect of log transformation would have been. That requires the raw data.