Calculating correlation coefficients with repeated observations: Part 2—correlation between subjects
BMJ 1995; 310 doi: https://doi.org/10.1136/bmj.310.6980.633 (Published 11 March 1995). Cite this as: BMJ 1995;310:633
- a Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
- b Medical Statistics Laboratory, Imperial Cancer Research Fund, PO Box 123, London WC2A 3PX
- Correspondence to: Dr Bland.
This is the thirteenth in a series of occasional notes on medical statistics.
In earlier Statistics Notes [1, 2] we commented on the analysis of paired data where there is more than one observation per subject. It can be highly misleading to analyse such data by combining repeated observations from several subjects and then calculating the correlation coefficient as if the data were a simple sample [1]. The appropriate analysis depends on the question we wish to answer. If we want to know whether an increase in one variable within the individual is associated with an increase in the other we can calculate the correlation coefficient within subjects [2]. If we want to know whether subjects with high values of one variable also tend to have high values of the other we can use the correlation between the subject means, which we shall describe here.
The table shows the mean pH and Paco2 for each of eight subjects, with the number of pairs of observations for each. The 47 pairs of measurements from which these means were calculated were given previously [2]. Here we are interested in whether the average pH for a subject is related to the subject's average Paco2.
We can calculate the usual correlation coefficient for the mean pH and mean Paco2. For the data in the table this gives r=0.09, P=0.8.
This analysis does not take into account the different numbers of measurements on each subject. Whether this matters depends on how different the numbers of observations are and whether the measurements within subjects vary much compared with the means between subjects. We can calculate a weighted correlation coefficient, using the number of observations as weights. Many computer programs will calculate this, but it is not difficult to do by hand.
We denote the mean pH and Paco2 for subject i by $x_i$ and $y_i$, the number of observations for subject i by $m_i$, and the number of subjects by n. It is fairly obvious [4] that the weighted mean of the $x_i$ is $\sum m_i x_i / \sum m_i$. In the usual case, where there is one observation per subject, the $m_i$ are all one and this formula gives the usual mean $\sum x_i / n$.
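As a minimal sketch of the weighted mean just described, the following Python computes the sum of the products of weights and subject means divided by the sum of the weights. The function name `weighted_mean` and the numbers are hypothetical and chosen only for illustration; they are not the values from the table.

```python
def weighted_mean(values, weights):
    """Return sum(m_i * x_i) / sum(m_i) for subject means x_i and counts m_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(m * x for m, x in zip(weights, values)) / total_weight


# Hypothetical subject means and numbers of observations (illustrative only).
subject_means = [7.0, 7.2, 7.4]
n_obs = [4, 1, 1]

# The first subject contributes four observations, so it pulls the mean down.
print(weighted_mean(subject_means, n_obs))

# With one observation per subject the weights are all one and the
# result is the usual unweighted mean.
print(weighted_mean(subject_means, [1, 1, 1]))
```

With equal weights of any size the result would be the same as the unweighted mean, because only the relative sizes of the weights matter.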
An easy way to calculate the weighted correlation coefficient is to replace each individual observation by its subject mean. Thus the table would yield 47 pairs of observations, the first four of which would each be pH=6.49 and Paco2=4.04, and so on. If we use the usual formula for the correlation coefficient on the expanded data we will get the weighted correlation coefficient. However, we must be careful when it comes to the P value. We have only 8 observations (n in general), not 47. We should ignore any P value printed by our computer program, and use a statistical table instead.
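The expansion approach can be sketched in Python as follows. The helper names (`pearson`, `expand`, `t_statistic`) and the data are hypothetical, chosen only for illustration. The sketch also assumes the standard t statistic for testing a zero correlation, t = r√((n−2)/(1−r²)), which is one way of looking up the P value in a table; the point is that the degrees of freedom come from the number of subjects, not the number of expanded rows.

```python
import math


def pearson(x, y):
    """Usual (unweighted) product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def expand(values, counts):
    """Repeat each subject mean once for every observation on that subject."""
    return [v for v, m in zip(values, counts) for _ in range(m)]


def t_statistic(r, n_units):
    """t for testing r = 0, to be referred to a t table with n_units - 2 df."""
    return r * math.sqrt((n_units - 2) / (1 - r * r))


# Hypothetical subject means and numbers of observations (illustrative only).
x_bar = [1.0, 2.0, 3.0, 4.0]
y_bar = [1.0, 3.0, 2.0, 4.0]
m = [3, 1, 1, 3]

# The usual formula applied to the expanded data gives the weighted r.
r_weighted = pearson(expand(x_bar, m), expand(y_bar, m))

n_subjects = len(m)   # correct basis for the test
n_rows = sum(m)       # what a program would wrongly assume

t_correct = t_statistic(r_weighted, n_subjects)  # df = n_subjects - 2
t_inflated = t_statistic(r_weighted, n_rows)     # overstates the evidence

print(r_weighted, t_correct, t_inflated)
```

The inflated statistic is always further from zero than the correct one, which is why a P value printed for the expanded data would be too small.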
The actual formula for a weighted correlation coefficient is:

$$r = \frac{\sum m_i x_i y_i - \sum m_i x_i \sum m_i y_i / \sum m_i}{\sqrt{\left(\sum m_i x_i^2 - \left(\sum m_i x_i\right)^2 / \sum m_i\right)\left(\sum m_i y_i^2 - \left(\sum m_i y_i\right)^2 / \sum m_i\right)}}$$

where all summations are from i=1 to n. When all the $m_i$ are equal they cancel out, giving the usual formula for a correlation coefficient.
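A direct translation of this formula into Python might look like the sketch below. The function name `weighted_correlation` and the example numbers are hypothetical and are not the data from the table; the code assumes the denominator is the square root of the product of the two weighted sums of squares, as in the usual correlation formula.

```python
import math


def weighted_correlation(x, y, m):
    """Weighted correlation of subject means x, y with weights m (counts)."""
    if not (len(x) == len(y) == len(m)):
        raise ValueError("x, y and m must have the same length")
    sum_m = sum(m)
    sum_mx = sum(mi * xi for mi, xi in zip(m, x))
    sum_my = sum(mi * yi for mi, yi in zip(m, y))
    # Weighted sum of products and sums of squares about the weighted means.
    sxy = sum(mi * xi * yi for mi, xi, yi in zip(m, x, y)) - sum_mx * sum_my / sum_m
    sxx = sum(mi * xi * xi for mi, xi in zip(m, x)) - sum_mx ** 2 / sum_m
    syy = sum(mi * yi * yi for mi, yi in zip(m, y)) - sum_my ** 2 / sum_m
    if sxx <= 0 or syy <= 0:
        raise ValueError("correlation is undefined when a variable does not vary")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical subject means and numbers of observations (illustrative only).
x_bar = [1.0, 2.0, 3.0, 4.0]
y_bar = [1.0, 3.0, 2.0, 4.0]

# Unequal weights: subjects 1 and 4 have three observations each.
print(weighted_correlation(x_bar, y_bar, [3, 1, 1, 3]))

# Equal weights cancel out, so both calls give the usual coefficient.
print(weighted_correlation(x_bar, y_bar, [1, 1, 1, 1]))
print(weighted_correlation(x_bar, y_bar, [5, 5, 5, 5]))
```

The P value for the result should still be based on the number of subjects n, not on the total number of observations.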
For the data in the table the weighted correlation coefficient is r=0.08, P=0.9. There is no evidence that subjects with a high pH also have a high Paco2. However, as we have already shown [2], within the subject a rise in pH was associated with a fall in Paco2.