Correlation in restricted ranges of data
BMJ 2011; 342 doi: https://doi.org/10.1136/bmj.d556 (Published 11 March 2011) Cite this as: BMJ 2011;342:d556
In a recent statistical note in this journal, Dr. Bland and Dr.
Altman cautioned against the use of the correlation between two random variables, X and
Y, in a restricted range of data (ref. 1). They concluded that when we
restrict the range of one of the random variables, say X, the correlation
coefficient between X and Y is naturally reduced; therefore, a smaller
correlation observed in a restricted range of one variable does not
necessarily imply a different relationship between the two
variables in that range. They further explained the reduction through
the meaning of the coefficient of determination, which measures the
proportion of the variability in Y explained by the variability in X: if
we restrict the range of X, we reduce the
variation in X. It therefore explains less of the variability in Y, and hence
the correlation between X and Y in the restricted range of X is
naturally reduced.
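The reduction described above can be seen in a small simulation. The following is a minimal sketch, not taken from the original note: the correlation of 0.7, the sample size, the seed, the restriction interval (-0.5, 0.5], and the helper names `pearson`, `simulate`, and `restrict` are all assumptions chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(r, n, seed):
    """Draw n pairs from a standard bivariate normal with correlation r."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)  # noise independent of x
        xs.append(x)
        ys.append(r * x + math.sqrt(1.0 - r * r) * z)
    return xs, ys


def restrict(xs, ys, a, b):
    """Keep only the pairs whose X value satisfies a < X <= b."""
    kept = [(x, y) for x, y in zip(xs, ys) if a < x <= b]
    return [x for x, _ in kept], [y for _, y in kept]


# Illustrative values only (assumptions, not from the source).
xs, ys = simulate(r=0.7, n=20000, seed=1)
full_r = pearson(xs, ys)
rx, ry = restrict(xs, ys, a=-0.5, b=0.5)
restricted_r = pearson(rx, ry)
print(f"full range: {full_r:.3f}, restricted range: {restricted_r:.3f}")
```

The relationship between X and Y is the same in both calculations; only the spread of X differs, which is what lowers the coefficient in the restricted sample.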
We congratulate Drs. Bland and Altman for this important observation
on the natural reduction of the correlation coefficient when one
of the variables is restricted to a particular range. However, the explanation of the
reduction could be made clearer and more general. In addition, a
discussion of the magnitude of this reduction would be of great interest.
For example, we may want to know how this reduction is related to the
probability of X falling in the restricted range. Below, we shall
explicitly explain why the reduction occurs and how the reduction depends
on the range of the restriction.
Suppose that we are interested in estimating the correlation between
X and Y based on a random sample of n paired observations from a bivariate
normal distribution. As a bivariate normal distribution can be
transformed to the standard bivariate normal distribution with the same
correlation coefficient, we assume, without loss of generality, that the
bivariate normal distribution has means (0, 0), variances (1, 1), and
correlation r. Now consider the correlation in the restricted
interval where X is between a and b. Let f(.) and F(.) denote the standard
normal probability density function and cumulative distribution function.
Similar to what we have done before (ref. 2), it can be shown that, when n
is large, the correlation in this restricted interval of X converges to
r*sqrt(V) / sqrt(1 - r^2 + r^2*V), where
V = 1 + [a*f(a) - b*f(b)] / [F(b) - F(a)] - {[f(a) - f(b)] / [F(b) - F(a)]}^2
is the variance of the truncated standard normal variable X
within the range (a, b).
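One way to see where such a limit comes from is the short derivation below. It is a sketch under the bivariate standard normal assumption stated above; the symbols Z and V are introduced here for illustration and are not necessarily the authors' notation.

```latex
\begin{align*}
Y &= rX + \sqrt{1-r^2}\,Z, \qquad Z \sim N(0,1) \text{ independent of } X,\\
V &= \operatorname{Var}(X \mid a < X \le b)
   = 1 + \frac{a f(a) - b f(b)}{F(b)-F(a)}
       - \left(\frac{f(a)-f(b)}{F(b)-F(a)}\right)^{2},\\
\operatorname{Cov}(X, Y \mid a < X \le b) &= rV,\\
\operatorname{Var}(Y \mid a < X \le b) &= r^{2}V + 1 - r^{2},\\
\operatorname{Corr}(X, Y \mid a < X \le b)
  &= \frac{rV}{\sqrt{V\,(r^{2}V + 1 - r^{2})}}
   = \frac{r\sqrt{V}}{\sqrt{1 - r^{2} + r^{2}V}}.
\end{align*}
```

Restricting X leaves Z untouched, so only the part of the variance of Y that comes from X shrinks. Squaring the last line shows that the restricted correlation is no larger than r in absolute value exactly when V(1 - r^2) <= 1 - r^2, that is, when V <= 1.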
Because the variance of the truncated standard normal variable
within the range (a, b) is smaller than or equal to 1, the variance of
the unrestricted X, the restricted correlation is less than or
equal to r in absolute value. Therefore, restricting X to this interval
attenuates the correlation between X and Y. Figure 1 illustrates graphically how the
level of attenuation depends on the range (a, b).
Specifically, it shows how the attenuation depends on the probability that X
<= a and the probability that a < X <= b.
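The dependence of the attenuation on the two probabilities can also be explored numerically. The sketch below assumes the large-sample limit r*sqrt(V)/sqrt(1 - r^2 + r^2*V) for a standard bivariate normal, with V the variance of the standard normal truncated to (a, b). The function names and the example probabilities are hypothetical and are not taken from the authors' figure.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def _pdf_and_moment(x):
    """Return (f(x), x*f(x)), both taken as 0 at +/- infinity."""
    if math.isinf(x):
        return 0.0, 0.0
    fx = STD_NORMAL.pdf(x)
    return fx, x * fx


def truncated_variance(a, b):
    """Variance of a standard normal variable truncated to (a, b)."""
    fa, afa = _pdf_and_moment(a)
    fb, bfb = _pdf_and_moment(b)
    mass = STD_NORMAL.cdf(b) - STD_NORMAL.cdf(a)  # P(a < X <= b)
    mean = (fa - fb) / mass
    return 1.0 + (afa - bfb) / mass - mean ** 2


def restricted_correlation(r, a, b):
    """Assumed large-sample correlation when X is restricted to (a, b)."""
    # Guard against tiny negative values from rounding in narrow intervals.
    v = max(truncated_variance(a, b), 0.0)
    return r * math.sqrt(v) / math.sqrt(1.0 - r * r + r * r * v)


def bounds_from_probabilities(p_below, p_inside):
    """Convert P(X <= a) and P(a < X <= b) into the cut points (a, b)."""
    upper = p_below + p_inside
    a = -math.inf if p_below <= 0.0 else STD_NORMAL.inv_cdf(p_below)
    b = math.inf if upper >= 1.0 else STD_NORMAL.inv_cdf(upper)
    return a, b


# Illustrative restriction: the middle half of the X distribution.
a, b = bounds_from_probabilities(p_below=0.25, p_inside=0.5)
for r in (-0.9, -0.5, 0.5, 0.9):
    print(f"r = {r:+.1f} -> restricted r = {restricted_correlation(r, a, b):+.3f}")
```

Working in terms of probabilities rather than cut points keeps the inputs on the same scale as the axes described for Figure 1. Setting `p_below=0` and `p_inside=1` recovers the unrestricted correlation.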
Contributors: LN and HC equally initiated, designed, and drafted the
paper. All authors approved this version of the paper.
Views expressed in this paper are the authors' professional opinions and do not necessarily represent the official positions of the U.S. Food and Drug Administration.
1. Bland JM, Altman DG. Correlation in restricted ranges of data. BMJ
2. Chu H, Nie L, Cole SR. Sample size and statistical power assessing the
effect of interventions in the context of mixture distributions with
detection limits. Stat Med 2006;25(15):2647-57.
Figure 1: The relationship between the correlation coefficient r_r
for the restricted interval of X and the probability of X <= a and the
probability of a < X <= b. The 19 lines presented in each plot
correspond to the (unrestricted) coefficient being -0.9 to 0.9 by 0.1
from top to bottom.
Competing interests: No competing interests