Statistics Notes: Correlation, regression, and repeated data

BMJ 1994; 308 doi: (Published 02 April 1994) Cite this as: BMJ 1994;308:896
  J M Bland, D G Altman
  Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE (J M Bland)
  Medical Statistics Laboratory, Imperial Cancer Research Fund, London WC2A 3PX (D G Altman)
  Correspondence to: Dr Bland.

    In clinical research we are often able to take several measurements on the same patient. The correct analysis of such data is more complex than if each patient were measured once. This is because the variability of measurements made on different subjects is usually much greater than the variability between measurements on the same subject, and we must take both kinds of variability into account. For example, we may want to investigate the relation between two variables and take several pairs of readings from each of a group of subjects. Such data violate the assumption of independence inherent in many analyses, such as t tests and regression.

    Researchers sometimes put all the data together, as if they were one sample. Most statistics textbooks do not warn the researcher not to do this. It is so ingrained in statisticians that this is a bad idea that it never occurs to them that anyone would do it.

    Consider the following example. The data were generated from random numbers, and there is no relation between X and Y at all. Firstly, values of X and Y were generated for each “subject,” then a further random number was added to make each individual “observation.” The data are shown in the table and figure. For each subject separately the correlation between X and Y is not significant. Using each subject's mean values, we have only five subjects and so only five points, and we get the correlation coefficient r=-0.67, df=3, P=0.22. However, if we put all 25 observations together we get r=-0.47, df=23, P=0.02. Even though this correlation coefficient is smaller in magnitude than that between the means, it becomes significant because it is based on 25 pairs of observations rather than five. The calculation is performed as if we had 25 subjects, so the number of degrees of freedom for the significance test is increased incorrectly and a spuriously significant correlation is produced. The extreme case would occur if we had only two subjects, with repeated pairs of observations on each. We would have two separate clusters of points centred at the subjects' means. We would get a high correlation coefficient, which would appear significant despite there being no relation whatsoever.
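
    The effect of the inflated degrees of freedom can be checked by hand. A correlation coefficient r on df degrees of freedom is conventionally tested with t = r√(df/(1-r²)). The sketch below, in Python using only the standard library, applies this to the two coefficients quoted above and then generates data in the way described: a random level for each subject plus a random component for each observation. The standard deviations, the seed, and the function names are assumptions made for illustration; they are not the values used to produce the table, so the simulated r will differ from the one reported.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_from_r(r, df):
    """t statistic for testing r against zero on df degrees of freedom."""
    return r * math.sqrt(df / (1.0 - r * r))


# Two-sided 5% critical values of Student's t, from standard tables.
T_CRIT = {3: 3.182, 23: 2.069}

# Coefficients quoted in the text.
t_means = t_from_r(-0.67, 3)     # five subject means, about -1.56
t_pooled = t_from_r(-0.47, 23)   # 25 pooled observations, about -2.55
print(abs(t_means) > T_CRIT[3])    # False: not significant at 5%
print(abs(t_pooled) > T_CRIT[23])  # True: "significant", but df is wrong


def simulate(n_subjects=5, n_obs=5, sd_between=10.0, sd_within=3.0, seed=1):
    """Unrelated X and Y: a random subject level plus random observation noise.

    sd_between, sd_within and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for subject in range(1, n_subjects + 1):
        level_x = rng.gauss(50.0, sd_between)
        level_y = rng.gauss(50.0, sd_between)
        for _ in range(n_obs):
            rows.append((subject,
                         level_x + rng.gauss(0.0, sd_within),
                         level_y + rng.gauss(0.0, sd_within)))
    return rows


rows = simulate()
r_pooled = pearson_r([x for _, x, _ in rows], [y for _, _, y in rows])
print(len(rows), round(r_pooled, 2))  # r varies with the seed
```

    Running `simulate` with several seeds shows how often a pooled analysis on 23 degrees of freedom appears significant when X and Y are unrelated by construction.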

    Figure: Simulated data showing five pairs of measurements of two uncorrelated variables for subjects 1, 2, 3, 4, and 5

    Table: Simulated data for five pairs of measurements of two uncorrelated variables (X and Y) for five subjects

    There are two simple ways to approach these types of data. If we want to know whether subjects with a high value of X tend also to have a high value of Y, we can use the subject means and find the correlation between them. If the number of observations differs between subjects, we can use a weighted analysis, weighting each subject's mean by the number of observations for that subject. If we want to know whether changes in one variable in the same subject are paralleled by changes in the other, we can estimate the relation within subjects using multiple regression. In either case we should not mix observations from different subjects indiscriminately, whether using correlation or the closely related regression analysis.
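
    The two approaches can be sketched as follows, again in Python with the standard library only. This is one possible reading of the text, not a definitive implementation: the function names and the example data are hypothetical. For the within-subject analysis the sketch centres each observation on its subject's mean, which gives the same slope as a multiple regression of Y on X with one indicator variable per subject; a full regression program would also supply the standard error and significance test.

```python
import math
from collections import defaultdict


def subject_summaries(rows):
    """Return {subject: (n, mean_x, mean_y)} from (subject, x, y) rows."""
    groups = defaultdict(list)
    for subject, x, y in rows:
        groups[subject].append((x, y))
    return {
        s: (len(obs),
            sum(x for x, _ in obs) / len(obs),
            sum(y for _, y in obs) / len(obs))
        for s, obs in groups.items()
    }


def between_subject_r(rows):
    """Correlation of subject means, weighted by observations per subject.

    Returns (r, df) with df = number of subjects - 2.
    """
    stats = list(subject_summaries(rows).values())
    total = sum(n for n, _, _ in stats)
    mx = sum(n * x for n, x, _ in stats) / total
    my = sum(n * y for n, _, y in stats) / total
    sxy = sum(n * (x - mx) * (y - my) for n, x, y in stats)
    sxx = sum(n * (x - mx) ** 2 for n, x, _ in stats)
    syy = sum(n * (y - my) ** 2 for n, _, y in stats)
    return sxy / math.sqrt(sxx * syy), len(stats) - 2


def within_subject_slope(rows):
    """Common slope of Y on X within subjects.

    Centring on subject means gives the same slope as a regression with
    one indicator variable per subject. Residual df = N - subjects - 1.
    """
    stats = subject_summaries(rows)
    sxy = sxx = 0.0
    for subject, x, y in rows:
        _, mean_x, mean_y = stats[subject]
        sxy += (x - mean_x) * (y - mean_y)
        sxx += (x - mean_x) ** 2
    return sxy / sxx, len(rows) - len(stats) - 1


# Hypothetical data: (subject, x, y), unequal numbers per subject.
example = [("A", 1, 2), ("A", 2, 4), ("A", 3, 6),
           ("B", 10, 1), ("B", 11, 3), ("B", 12, 5),
           ("C", 5, 9), ("C", 7, 13)]
r_between, df_between = between_subject_r(example)
slope_within, df_within = within_subject_slope(example)
print(round(r_between, 2), df_between)  # -0.19 1
print(slope_within, df_within)          # 2.0 4
```

    In this made-up example Y rises by 2 for each unit of X within every subject, yet the subject means are weakly negatively correlated. The two analyses answer different questions, and each uses degrees of freedom that reflect the number of subjects rather than treating all observations as independent.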