Statistics notes: Calculating correlation coefficients with repeated observations: Part 1—correlation within subjects
BMJ 1995;310:446 (published 18 February 1995). doi: http://dx.doi.org/10.1136/bmj.310.6977.446
- a Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
- b Medical Statistics Laboratory, Imperial Cancer Research Fund, PO Box 123, London WC2A 3PX
- Correspondence to: Dr Bland.
In an earlier Statistics Note [1] we commented on the analysis of paired data where there is more than one observation per subject, as shown in table I. We pointed out that it could be highly misleading to analyse such data by combining repeated observations from several subjects and then calculating the correlation coefficient as if the data were a simple sample. This note is a response to several letters about the appropriate analysis for such data.
The choice of analysis for the data in table I depends on the question we want to answer. If we want to know whether subjects with high values of intramural pH also tend to have high values of PaCO2 we are interested in whether the average pH for a subject is related to the subject's average PaCO2. We can use the correlation between the subject means, which we shall describe in a subsequent note. If we want to know whether an increase in pH within the individual was associated with an increase in PaCO2 we want to remove the differences between subjects and look only at changes within.
To look at variation within the subject we can use multiple regression. We make one of our variables, pH or PaCO2, the outcome variable, and we make the other variable and the subject the predictor variables. Subject is treated as a categorical factor using dummy variables [3, 4] and so has seven degrees of freedom, one fewer than the number of subjects. We use the analysis of variance table [3, 4] for the regression (table II), which shows how the variability in pH can be partitioned into components due to different sources. This method is also known as analysis of covariance and is equivalent to fitting parallel lines through each subject's data (see figure). The residual sum of squares in table II represents the variation about these lines.
We remove the variation due to subjects (and any other nuisance variables which might be present) and express the variation in pH due to PaCO2 as a proportion of what remains:
$$\frac{\text{sum of squares for PaCO}_2}{\text{sum of squares for PaCO}_2 + \text{residual sum of squares}}$$
The magnitude of the correlation coefficient within subjects is the square root of this proportion. For table II this is:
$$\sqrt{\frac{0.1153}{0.1153 + 0.3337}} = 0.51$$
The sign of the correlation coefficient is given by the sign of the regression coefficient for PaCO2. Here the regression slope is -0.108, so the correlation coefficient within subjects is -0.51. The P value is found either from the F test in the associated analysis of variance table or from the t test for the regression slope. It does not matter which variable we regress on which; we get the same correlation coefficient and P value either way.
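The calculation described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration rather than the software used for table II. The function name `within_subject_correlation` and the small data set are hypothetical (the values of table I are not reproduced here). The sum of squares for the covariate is taken as the drop in residual sum of squares when the covariate is added to a model that already contains the subject dummy variables.

```python
import numpy as np

def within_subject_correlation(subject, x, y):
    """Correlation within subjects, via regression of y on x plus subject dummies.

    Returns (r, slope, f_stat, resid_df), where r is
    sign(slope) * sqrt(SS_x / (SS_x + SS_residual)).
    """
    subject = np.asarray(subject)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # One indicator column per subject; together these columns carry the
    # intercept plus (number of subjects - 1) degrees of freedom.
    labels = np.unique(subject)
    dummies = (subject[:, None] == labels[None, :]).astype(float)

    def fit(design):
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        return float(resid @ resid), coef

    rss_subjects, _ = fit(dummies)                       # subjects only
    rss_full, coef = fit(np.column_stack([dummies, x]))  # subjects + covariate

    ss_x = rss_subjects - rss_full   # sum of squares for x, adjusted for subject
    slope = coef[-1]                 # common slope of the parallel lines
    r = np.sign(slope) * np.sqrt(ss_x / (ss_x + rss_full))

    resid_df = len(y) - len(labels) - 1
    f_stat = ss_x / (rss_full / resid_df)  # F on 1 and resid_df degrees of freedom
    return r, slope, f_stat, resid_df

# Hypothetical illustrative data: 3 subjects, 4 observations each.
subject = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
x = [1, 2, 3, 4, 3, 4, 5, 6, 5, 6, 7, 8]
y = [5.0, 4.1, 3.2, 1.9, 7.1, 5.8, 5.1, 4.0, 9.0, 8.1, 6.9, 6.2]

r, slope, f_stat, resid_df = within_subject_correlation(subject, x, y)
print(f"r within subjects = {r:.3f}, slope = {slope:.3f}, "
      f"F = {f_stat:.2f} on 1 and {resid_df} df")
```

With 47 observations on 8 subjects, as in table I, the same formula for the residual degrees of freedom gives 47 - 8 - 1 = 38. Swapping `x` and `y` in the call changes the slope but should leave `r` and `f_stat` unchanged, in line with the remark that it does not matter which variable is regressed on which.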
If we incorrectly calculate the correlation coefficient ignoring the fact that we have 47 observations on only 8 subjects, we get -0.07, P=0.7. Hence the correct analysis within subjects reveals a relation which the incorrect analysis misses.
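The contrast between the two analyses can be illustrated with a short Python sketch. The data below are invented for the purpose and are not those of table I; they are constructed so that the variable `y` falls as `x` rises within each subject while subjects with higher `x` also have higher `y`. The helper `centre_within` is a hypothetical name. Subtracting each subject's mean before correlating gives the same coefficient as the regression with dummy variables, although the P value must still be based on the residual degrees of freedom from that regression, not on the number of observations minus two.

```python
import numpy as np

# Hypothetical illustrative data: 3 subjects, 4 observations each.
subject = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
x = np.array([1, 2, 3, 4, 3, 4, 5, 6, 5, 6, 7, 8], dtype=float)
y = np.array([5.0, 4.1, 3.2, 1.9, 7.1, 5.8, 5.1, 4.0, 9.0, 8.1, 6.9, 6.2])

# Incorrect analysis: pool all observations as if they were a simple sample.
r_pooled = np.corrcoef(x, y)[0, 1]

def centre_within(values, groups):
    """Subtract each subject's own mean, removing differences between subjects."""
    centred = values.astype(float).copy()
    for g in np.unique(groups):
        mask = groups == g
        centred[mask] -= values[mask].mean()
    return centred

# Correlation within subjects: correlate the deviations from subject means.
r_within = np.corrcoef(centre_within(x, subject),
                       centre_within(y, subject))[0, 1]

print(f"pooled r = {r_pooled:.2f}, within-subject r = {r_within:.2f}")
```

In this invented example the pooled coefficient is positive and the coefficient within subjects is negative, so ignoring the repeated observations would suggest a relation in the wrong direction.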