# Correlation in restricted ranges of data

BMJ 2011; 342 doi: http://dx.doi.org/10.1136/bmj.d556 (Published 11 March 2011)
Cite this as: BMJ 2011;342:d556
1. J Martin Bland, professor of health statistics¹
2. Douglas G Altman, professor of statistics in medicine²

¹ Department of Health Sciences, University of York, York YO10 5DD

² Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD

Correspondence to: Professor M Bland martin.bland{at}york.ac.uk

In a study of 1450 adult diabetic patients there was a strong correlation between abdominal circumference and body mass index (BMI) (r = 0.85).1 The authors went on to report that the correlation differed in different BMI categories as shown in the table.

Correlation between abdominal circumference and body mass index (BMI) in 1450 adult patients with diabetes

The authors’ interpretation of these data was that in patients with low or high BMI values (BMI <25 kg/m2 and BMI >35 kg/m2) the correlation was strong, but in those with BMI values between 25 and 35 kg/m2 the correlation was weak or absent. They concluded that measuring abdominal circumference is of particular importance in subjects in the most frequent BMI category (25 to 35 kg/m2).

When we restrict the range of one of the variables, a correlation coefficient will be reduced. For example, fig 1 shows some BMI and abdominal circumference measurements from a different population. Although these people are from a rather thinner population, the correlation coefficient is very similar, r = 0.82 (P<0.0001). When we divide the sample into four restricted ranges of BMI, using cut points at 20, 25, and 30 kg/m2, the correlation coefficient in each interval is smaller than the correlation coefficient for the whole sample. This phenomenon is to be expected; it is a result of restricting the range of data, not any particular property of BMI and abdominal circumference.
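The effect is easy to reproduce with simulated data. The sketch below does not use the study data or the data in fig 1: the sample size, means, slope, and scatter are invented, chosen only so that the overall correlation comes out near 0.8, and the function names are hypothetical. It computes r for the whole sample and then within BMI intervals cut at 20, 25, and 30 kg/m2.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=5000, seed=1):
    """Synthetic (BMI, abdominal circumference) pairs; every parameter is invented."""
    rng = random.Random(seed)
    bmi = [rng.gauss(25.0, 4.0) for _ in range(n)]
    # Straight-line relation plus independent scatter with constant SD.
    circ = [25.0 + 2.5 * b + rng.gauss(0.0, 7.0) for b in bmi]
    return bmi, circ


def r_by_range(bmi, circ, cuts=(20.0, 25.0, 30.0)):
    """Correlation within each BMI interval defined by the cut points."""
    edges = [-math.inf, *cuts, math.inf]
    results = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        pairs = [(b, c) for b, c in zip(bmi, circ) if lo <= b < hi]
        xs, ys = zip(*pairs)
        results[(lo, hi)] = pearson_r(xs, ys)
    return results


bmi, circ = simulate()
r_all = pearson_r(bmi, circ)
r_sub = r_by_range(bmi, circ)

print(f"whole sample: r = {r_all:.2f}")
for (lo, hi), r in r_sub.items():
    print(f"BMI in [{lo}, {hi}): r = {r:.2f}")
```

Because the same straight-line relation holds everywhere in the simulated data, any drop in r within an interval can only come from the narrower spread of BMI there.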

Fig 1 BMI and abdominal circumference in 202 men and women, with correlation coefficients in four restricted ranges and overall

One interpretation of the correlation coefficient r is that r² is the proportion of the variation in abdominal circumference explained or predicted by the variation in BMI. If we restrict the range of BMI values we reduce the variation in BMI, which will explain less variation in abdominal circumference, and r will fall. If we further reduce the variation in BMI until all remaining patients have the same BMI, then we cannot explain any variation in abdominal circumference and the correlation must be zero. (By contrast, within any of the sections of fig 1 the fitted regression line would be the same, apart from random variation.)
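This argument can be written out under a simple model that is assumed here for illustration and is not stated in the text above. Suppose abdominal circumference y depends on BMI x along a straight line with slope β, with independent scatter about the line of standard deviation σ_ε, and let σ_x be the standard deviation of BMI in the group being studied. These symbols are introduced only for this sketch.

$$
y = \alpha + \beta x + \varepsilon,
\qquad
r^2 = \frac{\beta^2 \sigma_x^2}{\beta^2 \sigma_x^2 + \sigma_\varepsilon^2}
$$

Restricting the range of BMI makes σ_x smaller while β and σ_ε stay the same, so r² falls, and as σ_x approaches zero so does r. The slope β does not involve σ_x at all, which is why the regression line is unaffected by the restriction.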

For another example, fig 2 shows the weights and heights of the same sample, with different symbols for men and women. Clearly, the lower end of the height range for men is higher than the lower end of the range for women, but the upper ends of the ranges are very similar. The men’s heights (SD 6.0 cm) are less variable than those of the women (SD 8.9 cm) or the heights of both sexes combined (also SD 8.9 cm). The correlation coefficients for women alone and for both sexes combined are very similar, and both are considerably larger than that for men alone.

Fig 2 Weight and height in 202 men and women, with correlation coefficients

The same phenomenon can arise when the sample is restricted using another variable related to the ones being studied. For example, the correlation between weight and height of schoolchildren will increase as the age range is increased. But a spurious correlation may also be seen in such a situation, for example between shoe size and spelling ability.2 Such an example illustrates the well worn phrase that an observed association does not imply causation.
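A small simulation can illustrate both points. In the sketch below everything is invented: the children, the age range, and the units of shoe size and spelling score are hypothetical, and the two measures are constructed to depend on age and on nothing else. The correlation between them is then computed within progressively wider age bands.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_children(n=20000, seed=2):
    """Synthetic children whose shoe size and spelling score depend only on age."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(5.0, 15.0)                # years
        shoe = age + rng.gauss(0.0, 1.5)            # arbitrary units
        spelling = 5.0 * age + rng.gauss(0.0, 8.0)  # arbitrary score
        rows.append((age, shoe, spelling))
    return rows


def r_in_age_band(rows, lo, hi):
    """Shoe size v spelling correlation among children with lo <= age < hi."""
    band = [(shoe, spell) for age, shoe, spell in rows if lo <= age < hi]
    shoes, spells = zip(*band)
    return pearson_r(shoes, spells)


rows = simulate_children()
r_narrow = r_in_age_band(rows, 9.0, 10.0)             # a single year of age
r_medium = r_in_age_band(rows, 8.0, 11.0)             # three years of age
r_wide = r_in_age_band(rows, -math.inf, math.inf)     # all simulated ages

print(f"age 9 to 10: r = {r_narrow:.2f}")
print(f"age 8 to 11: r = {r_medium:.2f}")
print(f"all ages:    r = {r_wide:.2f}")
```

In this construction shoe size has no effect on spelling at any age, so the correlation in the whole sample arises entirely from the shared dependence on age, and it shrinks as the age range is narrowed.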

Correlation coefficients are a property of the variables and also the population in which they are measured. If we look at a restricted population, we should not conclude that there is little or no relation between the variables because the correlation coefficient is small. But given a clear relation in the whole group, we see no point in looking within categories of one of the variables. In any case, regression is generally the preferred approach to considering the relation between two continuous variables.
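One reason to prefer regression can be shown with simulated data. The sketch below again uses invented values (the slope of 2.5, the scatter, and the sample size are assumptions, not estimates from any study) and a hypothetical helper `fit_line`. It fits a least squares line to the whole sample and within each BMI interval, and reports the slope alongside r.

```python
import math
import random


def fit_line(xs, ys):
    """Least squares slope and intercept of y on x, plus Pearson r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


TRUE_SLOPE = 2.5  # invented for the simulation

rng = random.Random(3)
bmi = [rng.gauss(25.0, 4.0) for _ in range(20000)]
circ = [25.0 + TRUE_SLOPE * b + rng.gauss(0.0, 7.0) for b in bmi]

# Fit the whole sample, then each BMI interval cut at 20, 25, and 30.
fits = {"all": fit_line(bmi, circ)}
edges = [-math.inf, 20.0, 25.0, 30.0, math.inf]
for lo, hi in zip(edges[:-1], edges[1:]):
    pairs = [(b, c) for b, c in zip(bmi, circ) if lo <= b < hi]
    xs, ys = zip(*pairs)
    fits[f"[{lo}, {hi})"] = fit_line(xs, ys)

for label, (slope, intercept, r) in fits.items():
    print(f"{label:>14}: slope = {slope:.2f}, r = {r:.2f}")
```

Within each interval the estimated slope stays close to the value used to generate the data, apart from random variation, whereas r is smaller than in the whole sample. The slope describes the relation itself; r also reflects how widely the particular sample happens to be spread.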

## Footnotes

• Acknowledgements: The data are taken from a student elective project by Dr Malcolm Savage.

• Contributors: JMB and DGA jointly wrote and agreed the text; JMB did the statistical analysis.

• Competing interests: All authors have completed the Unified Competing Interest form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous 3 years; no other relationships or activities that could appear to have influenced the submitted work.