Statistics notes: Transformations, means, and confidence intervals
BMJ 1996; 312 doi: https://doi.org/10.1136/bmj.312.7038.1079 (Published 27 April 1996) Cite this as: BMJ 1996;312:1079
- a Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
- b ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, PO Box 777, Oxford OX3 7LF
- Correspondence to: Professor Bland.
When we use transformed data in analyses,1 this affects the final estimates that we obtain. Figure 1 shows some serum triglyceride measurements, which have a skewed distribution. A logarithmic transformation is often useful for data which have positive skewness like this, and here the approximation to a normal distribution is greatly improved. For the untransformed data the mean is 0.51 mmol/l and the standard deviation 0.22 mmol/l. The mean of the $\log_{10}$ transformed data is -0.33 and the standard deviation is 0.17. If we take the mean on the transformed scale and back transform by taking the antilog, we get $10^{-0.33} = 0.47$ mmol/l. We call the value estimated in this way the geometric mean. The geometric mean is always less than the arithmetic mean of the raw data, unless all the observations are equal.
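The back transformation can be sketched in a few lines of Python. The measurements below are hypothetical values chosen to be positively skewed; they are not the triglyceride data of Figure 1, and the function names are ours.

```python
import math

# Hypothetical positively skewed measurements in mmol/l (NOT the Figure 1 data).
values = [0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60, 0.75, 1.00, 1.40]

def arithmetic_mean(xs):
    """Ordinary mean on the raw scale."""
    return sum(xs) / len(xs)

def geometric_mean(xs):
    """Mean of the log10 values, back transformed by taking the antilog."""
    mean_log = sum(math.log10(x) for x in xs) / len(xs)
    return 10 ** mean_log

am = arithmetic_mean(values)
gm = geometric_mean(values)
print(f"arithmetic mean = {am:.3f} mmol/l")
print(f"geometric mean  = {gm:.3f} mmol/l")

# The figure quoted in the text: the antilog of a log10 mean of -0.33.
quoted = 10 ** -0.33
print(f"antilog of -0.33 = {quoted:.2f} mmol/l")
```

For skewed data like these the geometric mean falls below the arithmetic mean, because the large observations pull the arithmetic mean upwards.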
When triglyceride is measured in mmol/l the log of a single observation is the log of a measurement in mmol/l. The average of n such transformed measurements is also the log of a number in mmol/l, so the antilog is back in the original units, mmol/l.
The antilog of the standard deviation, however, is not measured in mmol/l. Calculation of the standard deviation of the log transformed data requires taking the difference between each log observation and the log geometric mean. The difference between the logs of two numbers is the log of their ratio.2 As a ratio is a dimensionless pure number, the units in which serum triglyceride was measured would not matter; the standard deviation on the log scale would be the same. As a result, we cannot transform the standard deviation back to the original scale.
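A small numerical check may make the point about units concrete. This is a sketch using hypothetical measurements, not the data of Figure 1. Changing the units from mmol/l to micromol/l multiplies every observation by 1000: the standard deviation on the raw scale is multiplied by the same factor, but the standard deviation on the log scale is unchanged.

```python
import math
import statistics

# Hypothetical measurements in mmol/l (NOT the Figure 1 data).
values_mmol = [0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60, 0.75, 1.00, 1.40]

# The same measurements expressed in micromol/l.
factor = 1000.0
values_umol = [x * factor for x in values_mmol]

# On the raw scale the standard deviation carries the units of measurement.
sd_raw_mmol = statistics.stdev(values_mmol)
sd_raw_umol = statistics.stdev(values_umol)

# On the log scale each deviation from the mean is the log of a ratio,
# so the change of units cancels out.
sd_log_mmol = statistics.stdev([math.log10(x) for x in values_mmol])
sd_log_umol = statistics.stdev([math.log10(x) for x in values_umol])

print(f"raw SD: {sd_raw_mmol:.4f} mmol/l, {sd_raw_umol:.1f} micromol/l")
print(f"log SD: {sd_log_mmol:.4f} and {sd_log_umol:.4f} (no units)")
```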
If we want to use the standard deviation or standard error it is easiest to do all calculations on the transformed scale and transform back, if necessary, at the end. For example, the 95% confidence interval for the mean on the log scale is -0.35 to -0.31. To get back to the original scale we antilog the confidence limits on the log scale to give a 95% confidence interval for the geometric mean on the natural scale (0.47) of 0.45 to 0.49 mmol/l. For comparison, the 95% confidence interval for the arithmetic mean using the raw, untransformed data is 0.48 to 0.54 mmol/l. These limits are wider than those for the geometric mean. This is because with highly skewed data the extreme observations have a large influence on the arithmetic mean, making it more prone to sampling error. Lessening this influence is one advantage of using transformed data.
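The calculation can be sketched as follows, again with hypothetical measurements rather than the data of Figure 1. We assume a large-sample normal quantile (about 1.96) for the 95% interval; for a sample as small as this illustrative one, a quantile from the t distribution would be more appropriate.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical measurements in mmol/l (NOT the Figure 1 data).
values = [0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60, 0.75, 1.00, 1.40]

# Do all the calculations on the log scale.
logs = [math.log10(x) for x in values]
mean_log = statistics.mean(logs)
se_log = statistics.stdev(logs) / math.sqrt(len(logs))

# Assumption: normal quantile for a 95% interval (large-sample approximation).
z = NormalDist().inv_cdf(0.975)
lower_log = mean_log - z * se_log
upper_log = mean_log + z * se_log

# Transform back only at the end: antilog the mean and the two limits.
gm = 10 ** mean_log
lower = 10 ** lower_log
upper = 10 ** upper_log
print(f"geometric mean {gm:.3f} mmol/l, 95% CI {lower:.3f} to {upper:.3f} mmol/l")

# The limits quoted in the text, back transformed in the same way.
quoted_lower = 10 ** -0.35
quoted_upper = 10 ** -0.31
print(f"quoted limits: {quoted_lower:.2f} to {quoted_upper:.2f} mmol/l")
```

The back-transformed interval is not symmetric about the geometric mean on the original scale. It is symmetric in ratio terms: the upper limit divided by the geometric mean equals the geometric mean divided by the lower limit.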
If we use another transformation, such as the reciprocal or the square root,1 the same principle applies. We carry out all calculations on the transformed scale and transform back once we have calculated the confidence interval. This works for the sample mean and its confidence interval. Things become more complicated if we look at the difference between two means. We shall look at this in another Statistics Note.
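For a single sample mean, the principle can be written as one small function that takes a transformation and its inverse. This is a sketch under the same assumptions as before: hypothetical data, a large-sample normal quantile, and a function name, `back_transformed_ci`, that is ours. A decreasing transformation such as the reciprocal reverses the order of the limits when we transform back, so the sketch sorts them.

```python
import math
import statistics
from statistics import NormalDist

def back_transformed_ci(xs, forward, inverse, level=0.95):
    """Mean and confidence interval computed on the transformed scale,
    then transformed back. Returns (estimate, lower, upper)."""
    transformed = [forward(x) for x in xs]
    mean_t = statistics.mean(transformed)
    se_t = statistics.stdev(transformed) / math.sqrt(len(transformed))
    # Assumption: normal quantile (large-sample approximation).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Sort, because a decreasing transformation swaps the two limits.
    lower, upper = sorted((inverse(mean_t - z * se_t), inverse(mean_t + z * se_t)))
    return inverse(mean_t), lower, upper

# Hypothetical measurements in mmol/l (NOT the Figure 1 data).
values = [0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60, 0.75, 1.00, 1.40]

root = back_transformed_ci(values, math.sqrt, lambda y: y * y)
recip = back_transformed_ci(values, lambda x: 1 / x, lambda y: 1 / y)

print("square root: %.3f (%.3f to %.3f) mmol/l" % root)
print("reciprocal:  %.3f (%.3f to %.3f) mmol/l" % recip)
```

With the reciprocal transformation the back-transformed mean is the harmonic mean of the observations.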