Analysis of cluster randomised trials with an assessment of outcome at baseline
BMJ 2018; 360 doi: https://doi.org/10.1136/bmj.k1121 (Published 20 March 2018) Cite this as: BMJ 2018;360:k1121
- Richard Hooper, reader (1),
- Andrew Forbes, professor (2),
- Karla Hemming, senior lecturer (3),
- Andrea Takeda, systematic reviewer (4),
- Lee Beresford, statistician (1)
- (1) Centre for Primary Care and Public Health, Queen Mary University of London, London E1 2AB, UK
- (2) Department of Epidemiology and Preventive Medicine, Monash University, Melbourne, Australia
- (3) Institute of Applied Health Research, University of Birmingham, Birmingham, UK
- (4) UCL Institute of Health Informatics, University College London, London, UK
- Correspondence to: R Hooper
- Accepted 23 February 2018
When designing a trial, the decision to randomise a trial in clusters is usually a pragmatic one.1 For example, the intervention might be delivered at cluster level, or there might otherwise be a risk of people in the same cluster sharing their treatments, and thus attenuating treatment effects. Box 1 lists three illustrative cluster randomised trials that include a baseline assessment of participants’ outcomes. Coventry and colleagues, for example, used a cluster randomised design with general practices as clusters to look at the effectiveness of an integrated collaborative care model for people with depression and long term physical conditions.2
Box 1: Examples of cluster randomised trials with assessments of participant outcomes at baseline
Integrated primary care for patients with mental and physical multimorbidity: cluster randomised controlled trial of collaborative care for patients with depression comorbid with diabetes or cardiovascular disease. Coventry et al, BMJ 20152
Objective: To test the effectiveness of an integrated collaborative care model for people with depression and long term physical conditions.
Primary outcome: Symptoms of depression on the self-reported symptom checklist-13 depression scale.
Planned sample size: 15 general practices (clusters) per arm and 15 patients per cluster (n=450) to detect a difference between groups equivalent to a standardised effect size of 0.4, with 80% power (α=0.05; intraclass correlation coefficient 0.06), allowing for 20% attrition.
Design and analysis: Patients registered at a participating practice with a record of diabetes or coronary heart disease were screened for depressive symptoms over the telephone, and face to face two weeks later, to determine eligibility. Participants’ depression scale scores were collected at baseline and again after four months’ follow-up (a cohort design). Outcomes at follow-up were compared between the intervention and control arms. Analysis was conducted at the individual level, adjusting for clustering and for baseline score.
Liverpool Care Pathway for patients with cancer in hospital: a cluster randomised trial. Costantini et al, Lancet 20143
Objective: To assess the effectiveness of the Liverpool Care Pathway translated into an Italian context (LCP-I) in improving the quality of end of life care for patients with cancer in hospitals and for their families.
Primary outcome: Mean score on the overall quality of care toolkit scale.
Planned sample size: 10 hospital wards (clusters) per arm and 20 patients per cluster (n=400) to detect an absolute increase of 10 points on the toolkit scale, with 80% power (α=0.05; intraclass correlation coefficient 0.1).
Design and analysis: In each ward, all patients who died in the three months before randomisation and in the six months after the conclusion of the LCP-I programme were identified (a repeated cross section design) and their quality of end of life care was assessed. Outcomes at follow-up were compared between the intervention and control arms. Analysis was at the individual level, adjusting for clustering and for the mean baseline assessment of outcome in that ward.
School based education programme to reduce salt intake in children and their families (School-EduSalt): cluster randomised controlled trial. He et al, BMJ 20154
Objective: To determine whether an education programme targeted at schoolchildren could lower salt intake in children and their families.
Primary outcome: Salt intake (measured as urinary sodium excretion, averaged over two consecutive 24 h collections).
Planned sample size: 12 schools (clusters) per arm and 10 children per cluster (n=240) to detect a difference in salt intake of 1.0 g a day, with 90% power (α=0.05; intraclass correlation coefficient 0.01).
Design and analysis: One class was selected from each school, and 10 children were randomly selected from each class. Each child’s urinary sodium was measured at baseline and again after six months’ follow-up (a cohort design). Analysis was at the individual level, and compared the change from baseline to follow-up between the intervention and control arms.
The sample size required for a cluster randomised trial is larger than for an individually randomised trial: how much larger depends on a parameter called the intracluster correlation—the correlation between the outcomes of two individuals from the same cluster.5 The higher the intracluster correlation, the more heterogeneity there is between clusters, and the greater the advantage in controlling for cluster differences, for example, with a baseline assessment.6 Researchers might choose to assess the same individuals at baseline and follow-up (a cohort design) or choose to take different samples from the same cluster on the two occasions (repeated cross sections), and this may again be a pragmatic decision. Costantini and colleagues (box 1) studied the quality of end of life care in patients with cancer, leading them to a design in which they recruited different sample groups from each hospital ward at baseline and follow-up.3
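To make the role of the intracluster correlation concrete, the sketch below applies the standard design effect for equal cluster sizes, 1 + (m − 1) × ICC, to the figures quoted in box 1 for the trial by Coventry and colleagues. The function names are our own, and the calculation is a textbook normal approximation that ignores attrition and any gain from baseline adjustment, so it is not necessarily the calculation the trial authors performed.

```python
from statistics import NormalDist


def design_effect(cluster_size: int, icc: float) -> float:
    """Variance inflation for a parallel cluster randomised trial
    with equal cluster sizes and a single follow-up assessment."""
    return 1 + (cluster_size - 1) * icc


def n_per_arm_individual(effect_size: float, power: float = 0.8,
                         alpha: float = 0.05) -> float:
    """Normal-approximation sample size per arm for an individually
    randomised trial comparing two means (standardised effect size)."""
    z = NormalDist().inv_cdf
    return 2 * (z(1 - alpha / 2) + z(power)) ** 2 / effect_size ** 2


# Inputs taken from box 1 (Coventry et al): 15 patients per practice,
# ICC 0.06, standardised effect size 0.4, 80% power, alpha 0.05.
de = design_effect(cluster_size=15, icc=0.06)
n_individual = n_per_arm_individual(effect_size=0.4, power=0.8, alpha=0.05)
n_clustered = n_individual * de

print(f"design effect: {de:.2f}")
print(f"per arm, individually randomised: {n_individual:.1f}")
print(f"per arm, cluster randomised: {n_clustered:.1f}")
```

With these inputs the design effect is 1.84, so a cluster randomised version of the trial needs nearly twice as many participants per arm as an individually randomised one, before any allowance for attrition.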
Summary points
Clinical trials that are randomised in clusters often include an assessment of participants’ outcomes in a baseline period
The analysis of cluster randomised trials is more complex than for individually randomised trials, with various methods suggested to allow for baseline assessments of outcome
We recommend either an analysis of covariance approach that takes account of cluster differences at baseline, or an analysis that treats assessments at baseline and follow-up as longitudinal but recognises that there will not be any systematic differences between the randomised groups at baseline
Simply comparing the difference between outcomes at baseline and follow-up in the two randomised groups—that is, calculating the difference of differences—is not the best approach, and can be misleading
Approaches to analysis
Difference of differences
To estimate the effect of an intervention in a trial with clusters randomised to intervention and control groups, and assessments at baseline and follow-up, one method is with a longitudinal, repeated measures analysis that tests for a statistical interaction between group and time. This is sometimes called a difference of differences analysis because it evaluates how much the groups differ in terms of the difference between outcomes at baseline and follow-up. This approach was used by He and colleagues to evaluate a school based education programme aimed at reducing salt intake in children and their families (box 1).4
In individually randomised trials, the shortcomings of a difference of differences approach (usually known in this case as a change score analysis) are well understood.7 8 If the baseline assessment is only poorly correlated with the follow-up, then subtracting the baseline outcome just adds random noise to the signal we are trying to detect. Worse still, if the two groups differ substantially (by chance) at baseline and then level themselves out at follow-up, we might be misled by the difference of differences into thinking that a change had been effected in one group but not the other (regression to the mean).9 The problem is that the analysis does not use what we know: before randomisation, outcomes in the two groups should, on average, be the same.
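The point about added noise can be illustrated with a short calculation for the individually randomised case. The sketch assumes equal outcome variance at baseline and follow-up, a correlation rho between the two assessments, and a large sample; under those assumptions the variance of a change score is 2(1 − rho) times the variance of the follow-up outcome alone, whereas adjusting for baseline by analysis of covariance leaves a residual variance of 1 − rho² times that variance. The function name is ours.

```python
def relative_variances(rho: float) -> dict:
    """Variance of the treatment effect estimate, relative to an analysis
    of follow-up outcomes that ignores baseline (reference value 1.0).

    Assumes equal variances at baseline and follow-up, baseline to
    follow-up correlation rho, and a large individually randomised trial.
    """
    return {
        "follow_up_only": 1.0,
        "change_score": 2 * (1 - rho),   # Var(Y1 - Y0) / Var(Y1)
        "ancova": 1 - rho ** 2,          # residual variance after adjustment
    }


for rho in (0.2, 0.5, 0.8):
    print(rho, relative_variances(rho))
```

Under these assumptions a change score analysis is less precise than ignoring the baseline altogether whenever rho is below 0.5, whereas the baseline adjusted analysis is never less precise than either alternative.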
Analysis of covariance
The method usually recommended for baseline adjustment in an individually randomised trial is analysis of covariance (ANCOVA).7 Adjustment in cluster randomised trials is more complex. We consider repeated cross section and cohort designs in turn.
In a repeated cross section design, where there are different individuals at baseline and follow-up, a simple way to deal with baseline assessments of outcomes is to bundle them up in each cluster as a mean or other aggregate measure, to form a cluster level covariate. Outcomes at follow-up can then be analysed either at aggregate cluster level or as individual outcomes, in either case adjusting for the cluster level baseline covariate. If individual outcomes are analysed at follow-up, then the analysis must also allow for clustering using mixed regression or generalised estimating equations, as with any cluster randomised trial.1 This method was used by Costantini and colleagues (box 1)3: each patient’s quality of care score at follow-up was adjusted for the mean quality of care score found in that hospital ward at baseline.
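A minimal sketch of the fully aggregated version of this analysis follows: each cluster contributes one baseline mean and one follow-up mean, and the follow-up means are regressed on randomised arm and baseline mean by ordinary least squares. The data structure, function names, and numbers are hypothetical and chosen only so that the result is easy to check; standard errors and significance tests are omitted.

```python
from statistics import mean


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting (a is a small square matrix)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def cluster_level_ancova(clusters):
    """Regress cluster mean follow-up outcome on arm and cluster mean
    baseline outcome. Returns [intercept, treatment effect, baseline slope]."""
    rows = [[1.0, float(c["arm"]), mean(c["baseline"])] for c in clusters]
    y = [mean(c["follow_up"]) for c in clusters]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)


# Hypothetical data: different individuals at baseline and follow-up.
clusters = [
    {"arm": 0, "baseline": [40, 44], "follow_up": [30, 32]},
    {"arm": 0, "baseline": [50, 54], "follow_up": [35, 37]},
    {"arm": 1, "baseline": [44, 48], "follow_up": [34, 38]},
    {"arm": 1, "baseline": [56, 60], "follow_up": [41, 43]},
]
intercept, effect, slope = cluster_level_ancova(clusters)
print(f"adjusted treatment effect: {effect:.2f}")
```

Because the analysis has one observation per cluster, no further allowance for clustering is needed. An analysis of individual follow-up outcomes adjusted for the same cluster level covariate would instead need mixed regression or generalised estimating equations, which this sketch does not attempt.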
In a cohort design, the same individuals are assessed at baseline and follow-up. In this design, an obvious approach to ANCOVA is to analyse each individual’s outcome at follow-up adjusted for that individual’s outcome at baseline, with mixed regression or generalised estimating equations to allow for differences between clusters. This ANCOVA approach with individual level baseline adjustment was used by Coventry and colleagues (box 1)2: each participant’s depression scale score at follow-up was adjusted for the score at baseline, thus allowing for participant differences. However, in a methodological paper, Klar and Darlington found that even more precise results could be obtained by adjusting an individual’s outcome at follow-up both for the individual’s baseline assessment and for the baseline cluster mean.10 In the Coventry example, this would be achieved by adjusting also for the mean of the baseline depression scores of all patients from the same practice. The baseline cluster mean captures more information about cluster differences than the individual baseline assessment.
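In practice, adjusting for both quantities means attaching the baseline cluster mean to each participant's record as an extra covariate before fitting the model. The sketch below shows only that data preparation step, using hypothetical field names and values; fitting the mixed regression or generalised estimating equations is left to statistical software.

```python
from collections import defaultdict
from statistics import mean


def add_baseline_cluster_mean(records):
    """Return copies of the participant records with an added
    'baseline_cluster_mean' field: the mean baseline outcome of all
    participants in the same cluster."""
    by_cluster = defaultdict(list)
    for r in records:
        by_cluster[r["cluster"]].append(r["baseline"])
    cluster_means = {k: mean(v) for k, v in by_cluster.items()}
    return [{**r, "baseline_cluster_mean": cluster_means[r["cluster"]]}
            for r in records]


# Hypothetical cohort data: one record per participant.
records = [
    {"cluster": "A", "arm": 1, "baseline": 2.0, "follow_up": 1.5},
    {"cluster": "A", "arm": 1, "baseline": 3.0, "follow_up": 2.0},
    {"cluster": "B", "arm": 0, "baseline": 2.5, "follow_up": 2.4},
    {"cluster": "B", "arm": 0, "baseline": 3.5, "follow_up": 3.1},
]
prepared = add_baseline_cluster_mean(records)
for row in prepared:
    print(row)
```

The follow-up outcome would then be modelled with fixed effects for arm, the individual baseline value, and the baseline cluster mean, together with an allowance for clustering.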
Constrained baseline analysis
In individually randomised trials, an alternative to ANCOVA is to treat outcomes collected at baseline and follow-up as longitudinal, and to use a repeated measures analysis to estimate the effect of the intervention being switched on in one of the randomised groups on the second of these occasions.11 This approach is sometimes referred to as a constrained baseline analysis because, unlike a difference of differences analysis, it assumes that there is no systematic difference between the groups at baseline. In cluster randomised trials, this is a special case of the approach to analysis recommended more generally for stepped wedge trials (cluster randomised trials where outcomes are assessed at multiple time points, and different clusters switch over to the intervention at different times).12 Two ways of doing this have been described in the literature. The first approach assumes that the correlation between two people from the same cluster is the same whether they are sampled in the same period or a different period.13 The second approach allows the correlation to be weaker between different periods.14 15 16 The distinction is technical but important. In a study using the Health Improvement Network general practice database, the health outcomes of glycated haemoglobin, systolic and diastolic blood pressure, body mass index, total cholesterol, and high density lipoprotein cholesterol were investigated. Researchers found that correlations between individuals from the same practice were between 12% and 51% smaller when those individuals were sampled from different 15 month periods,17 motivating the use of the second, more flexible model. A constrained baseline analysis that lacks this flexibility in the correlation structure is known to overstate the precision of the treatment effect, potentially leading to false positive findings.15
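The two ideas in this section, the constraint on the fixed effects and the choice of correlation structure, can be sketched as follows. The coding of the fixed effects and the parameter name cluster_autocorrelation are our own illustrative assumptions: the first function shows that a constrained baseline analysis has no term for a group difference at baseline, and the second expresses the correlation between two different individuals from the same cluster under the two approaches.

```python
def fixed_effects_row(arm: int, period: int, constrained: bool = True) -> dict:
    """Fixed effect covariates for one observation.

    arm: 0 = control, 1 = intervention; period: 0 = baseline, 1 = follow-up.
    The intervention is 'switched on' only in the intervention arm at follow-up.
    """
    treated = arm * period
    if constrained:
        # No main effect of arm: groups share a common baseline mean.
        return {"intercept": 1, "period": period, "treated": treated}
    # Difference of differences: the arm term lets baseline means differ.
    return {"intercept": 1, "arm": arm, "period": period,
            "arm_x_period": treated}


def within_cluster_correlation(icc: float, same_period: bool,
                               cluster_autocorrelation: float = 1.0) -> float:
    """Correlation between two different individuals in the same cluster.

    cluster_autocorrelation = 1 corresponds to the first approach (the same
    correlation within and between periods); a value below 1 corresponds to
    the second approach (weaker correlation between periods).
    """
    return icc if same_period else icc * cluster_autocorrelation


print(fixed_effects_row(arm=1, period=0))                     # baseline
print(fixed_effects_row(arm=1, period=1))                     # follow-up
print(fixed_effects_row(arm=1, period=0, constrained=False))
print(within_cluster_correlation(0.05, same_period=False,
                                 cluster_autocorrelation=0.8))
```

In this parameterisation, a between period correlation that is 12% to 51% smaller than the within period correlation corresponds to a cluster autocorrelation between 0.88 and 0.49.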
In the cluster randomised case, a constrained baseline analysis might be expected to produce similar results to ANCOVA, as in the individually randomised case.11 The method is extremely flexible, is available in cohort or repeated cross section forms, and allows an analysis based on individual level data, with no aggregation needed either at baseline or at follow-up.
Benefits of aggregating outcomes by cluster
When a cluster randomised trial involving relatively few clusters (eg, <40 clusters) is analysed with mixed regression or generalised estimating equations, there is known to be an increased risk of a false positive finding (that is, an inflated type I error rate) unless an appropriate correction is made.18 Corrections such as that of Kenward and Roger are increasingly accessible, and should be considered in such cases.19 For repeated cross section designs, one benefit of performing an ANCOVA entirely at the cluster level (aggregating follow-up outcomes by cluster and adjusting for the aggregate baseline outcome) is that an approach using mixed regression or generalised estimating equations is unnecessary. This simplification of the analysis keeps the risk of false positive findings under control. But could a cluster level analysis also help with cohort designs? Theory shows that if the correlation between the baseline cluster mean and the mean at follow-up in the same cluster is known, it is immaterial whether it was the same participants who were assessed on each occasion or different participants: the precision of the treatment effect estimate will be the same.14 Therefore, if we treated a cohort design as if it were a repeated cross section design, and adjusted purely at the aggregate, cluster level rather than the individual level, the analysis could perform just as well. Indeed, in some situations, cluster level and individual level analyses of cohort designs can give identical results (web appendix).
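The correlation between a cluster's baseline mean and its follow-up mean can be written down under a simple random effects model, which helps to show why only this correlation matters. The sketch assumes equal cluster sizes m, a within period intracluster correlation icc, a cluster autocorrelation (the factor by which the correlation between two different individuals shrinks across periods), and, for cohort designs, an individual autocorrelation such that the same person's two assessments correlate at icc × cluster_autocorr + (1 − icc) × individual_autocorr. These parameter names and example values are assumptions made for illustration.

```python
def cluster_mean_correlation(m: int, icc: float, cluster_autocorr: float,
                             individual_autocorr: float = 0.0) -> float:
    """Correlation between baseline and follow-up cluster means.

    individual_autocorr = 0 represents repeated cross sections (different
    individuals on the two occasions); a positive value represents a cohort.
    """
    numerator = m * icc * cluster_autocorr + (1 - icc) * individual_autocorr
    return numerator / (1 + (m - 1) * icc)


def variance_ratio_after_adjustment(r: float) -> float:
    """Approximate variance of a cluster level ANCOVA treatment effect,
    relative to an unadjusted comparison (many clusters assumed)."""
    return 1 - r ** 2


r_cross = cluster_mean_correlation(m=20, icc=0.1, cluster_autocorr=0.8)
r_cohort = cluster_mean_correlation(m=20, icc=0.1, cluster_autocorr=0.8,
                                    individual_autocorr=0.5)
print(f"cross section: r = {r_cross:.3f}, "
      f"variance ratio = {variance_ratio_after_adjustment(r_cross):.3f}")
print(f"cohort:        r = {r_cohort:.3f}, "
      f"variance ratio = {variance_ratio_after_adjustment(r_cohort):.3f}")
```

Two designs with the same value of r yield the same precision under this approximation, whether r arises from following the same participants or from sampling different ones.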
Individual level adjustment at baseline also runs into difficulties if some participants in a cohort design have missing baseline assessments. ANCOVA with individual level adjustment at baseline requires us in this case either to impute the missing baseline assessments or to exclude those individuals from the analysis. Adjusting only for a baseline cluster mean offers a straightforward analysis of all available data without the need to impute or exclude data, but there are still difficulties in this case. Participants who are assessed at baseline but have a missing follow-up assessment still warrant individual attention, because the baseline could offer a valuable clue to the unobserved outcome (and reasons for dropping out).20 As observed above, a constrained baseline analysis offers an alternative approach using all available data, but at the individual level. Finally, having a missing baseline assessment could, in some trials, mean that a participant is not eligible to be followed up.
The web appendix provides tutorials showing how to implement the analyses described above using the Stata package (Stata Corporation). The supplement also includes results of large scale simulations of trials in typical scenarios that illustrate some of the performance issues outlined. A constrained baseline analysis with a realistic correlation structure performed consistently well for both cohort and repeated cross section designs, as did (more surprisingly) an ANCOVA performed entirely at the cluster level. A constrained baseline analysis with less flexible correlation structure risked an inflated type I error rate in some scenarios, while a difference of differences analysis and ANCOVA with purely individual level baseline adjustment did not always achieve the statistical power that was expected.
Further work, including simulation studies, is needed to quantify the performance of different methods in a broader range of circumstances than we have been able to consider here. We only looked at balanced situations where equal numbers of participants are sampled from each cluster at each assessment. (Klar and Darlington also provided simulations of unbalanced situations.)10 Methods outlined in this article can be applied to unbalanced designs, although lack of balance introduces further subtleties. Aggregation of outcomes at cluster level, for example, might not be the most efficient way to weight the contribution of clusters of different size. We have emphasised the importance of getting the right model for the correlation structure, but there could be complexities beyond those we have considered—for example, outcomes at baseline and follow-up might correlate in different ways in the intervention and control groups.
The analysis of a cluster randomised trial with a baseline assessment of outcome is not as straightforward as it might seem, but the advice is similar for cohort and for cross sectional designs. ANCOVA should adjust for the baseline cluster mean, even in a cohort design where individual level adjustment at baseline is also possible. A good, all round alternative to ANCOVA is a constrained baseline analysis with a suitably flexible model for the correlation between individuals from the same cluster. We do not recommend a difference of differences analysis for a cluster randomised trial. Any analysis using mixed regression or generalised estimating equations has an increased risk of a false positive finding when there are relatively few clusters, so analysts should apply a correction in this case if one is available, or consider aggregating results at the cluster level.
Contributors: RH conceived the article and led the writing of the paper. He is the guarantor. RH, AT, and LB performed simulations and conducted scoping reviews to identify example studies. AF and KH contributed further ideas. All authors contributed to the final version of the manuscript.
Competing interests: We have read and understood BMJ policy on declaration of interests and declare that we have no competing interests.
Provenance and peer review: Not commissioned; externally peer reviewed.