Statistics Notes: Regression towards the mean
BMJ 1994;308:1499 (Published 4 June 1994). doi: https://doi.org/10.1136/bmj.308.6942.1499
- J M Bland,
- D G Altman
- Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
- Medical Statistics Laboratory, Imperial Cancer Research Fund, London WC2A 3PX.
The statistical term “regression,” from a Latin root meaning “going back,” was first used by Francis Galton in his paper “Regression towards Mediocrity in Hereditary Stature.”1 Galton related the heights of children to the average height of their parents, which he called the mid-parent height (figure). Children and parents had the same mean height of 68.2 inches. The ranges differed, however, because the mid-parent height was an average of two observations and thus had its range reduced. Now consider those parents with a mid-parent height between 70 and 71 inches. The mean height of their children was 69.5 inches, which was closer to the mean height of all children than the mean height of their parents was to the mean height of all parents. Galton called this phenomenon “regression towards mediocrity”; we now call it “regression towards the mean.” The same thing happens if we start with the children. For the children with height between 70 and 71 inches, the mean height of their parents was 69.0 inches. This is a statistical, not a genetic, phenomenon.
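The effect Galton observed can be reproduced in a small simulation. This is a sketch, not his data: the mean of 68.2 inches comes from the paper, but the standard deviation, correlation, and sample size used below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
mean = 68.2   # mean height (inches), from Galton's data
sd = 1.8      # assumed standard deviation (illustrative)
r = 0.5       # assumed mid-parent/child correlation (illustrative)

# Bivariate normal pair with the chosen correlation
z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)
midparent = mean + sd * z1
child = mean + sd * (r * z1 + np.sqrt(1 - r**2) * z2)

# Select the tall parents, as Galton did
tall = (midparent >= 70) & (midparent <= 71)
parent_group_mean = midparent[tall].mean()
child_group_mean = child[tall].mean()

# The children of tall parents are, on average, shorter than their
# parents but still taller than the overall mean: regression to the mean.
print(parent_group_mean, child_group_mean)
```

The same selection applied to tall children would show their parents' mean pulled towards 68.2 in exactly the same way, since the simulation is symmetric in the two variables.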
If we take each group of mid-parents by height and calculate the mean height of their children, these means will lie close to a straight line. This line came to be called the regression line, and hence the process of fitting such lines became known as “regression.”
In mathematical terms, if variables X and Y have standard deviations sX and sY, and correlation r, the slope of the familiar least squares regression line can be written rsY/sX. Thus a change of one standard deviation in X is associated with a change of r standard deviations in Y. Unless X and Y are exactly linearly related, so that all the points lie along a straight line, r is less than 1. For a given value of X the predicted value of Y is therefore always fewer standard deviations from its mean than X is from its mean. Regression towards the mean occurs unless r=1 (perfect correlation), so it always occurs in practice. We give some examples in a subsequent note.
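The identity between the least squares slope and rsY/sX can be checked numerically. The data below are simulated (the parameters are illustrative, not Galton's), but the identity holds for any sample:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data with an assumed linear relationship plus noise
n = 5000
x = rng.normal(68.2, 1.8, n)
y = 68.2 + 0.5 * (x - 68.2) + rng.normal(0, 1.5, n)

# Least squares slope fitted directly
slope = np.polyfit(x, y, 1)[0]

# The same slope expressed as r * sY / sX
r = np.corrcoef(x, y)[0, 1]
sx, sy = x.std(), y.std()

# Because |r| < 1, a value of x one standard deviation above its mean
# predicts y only r standard deviations above its mean.
print(slope, r * sy / sx)
```

Since slope = r sY/sX, the prediction in standard deviation units is always shrunk by the factor r, which is the algebraic statement of regression towards the mean.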