Internal and external validity of cluster randomised trials: systematic review of recent trials
BMJ 2008; 336 doi: http://dx.doi.org/10.1136/bmj.39517.495764.25 (Published 17 April 2008) Cite this as: BMJ 2008;336:876
- Sandra Eldridge, professor of biostatistics1,
- Deborah Ashby, professor of medical statistics2,
- Catherine Bennett, statistician1,
- Melanie Wakelin, lecturer in medical statistics1,
- Gene Feder, professor of primary care research and development1
- 1Centre for Health Sciences, Barts and The London School of Medicine and Dentistry, London E1 2AT
- 2Wolfson Institute of Preventive Medicine, Barts and The London School of Medicine and Dentistry, London EC1M 6BQ
- Correspondence to: S Eldridge
- Accepted 25 February 2008
Objectives To assess aspects of the internal validity of recently published cluster randomised trials and explore the reporting of information useful in assessing the external validity of these trials.
Design Review of 34 cluster randomised trials in primary care published in 2004 and 2005 in seven journals (British Medical Journal, British Journal of General Practice, Family Practice, Preventive Medicine, Annals of Internal Medicine, Journal of General Internal Medicine, Pediatrics).
Data sources National Library of Medicine (Medline) via PubMed.
Data extraction To assess aspects of internal validity we extracted data on appropriateness of sample size calculations and analyses, methods of identifying and recruiting individual participants, and blinding. To explore reporting of information useful in assessing external validity we extracted data on cluster eligibility, cluster inclusion and retention, cluster generalisability, and the feasibility and acceptability of the intervention to health providers in clusters.
Results 21 (62%) trials accounted for clustering in sample size calculations and 30 (88%) in the analysis; about a quarter were potentially biased because of procedures surrounding recruitment and identification of patients; individual participants were blind to allocation status in 19 (56%) and outcome assessors were blind in 15 (44%). In almost half the reports, information relating to generalisability of clusters was poorly reported, and in two fifths there was no information about the feasibility and acceptability of the intervention.
Conclusions Cluster randomised trials are essential for evaluating certain types of interventions. Issues affecting their internal validity, such as appropriate sample size calculations and analysis, have been widely disseminated and are now better addressed by researchers. Blinding of those identifying and recruiting patients to allocation status is recommended but is not always carried out. There may be fewer barriers to internal validity in trials in which individual participants are not recruited. External validity seems poorly addressed in many trials, yet is arguably as important as internal validity in judging quality as a basis for healthcare intervention.
In cluster randomised trials, groups or clusters of individuals, rather than individuals themselves, are randomised. These trials are increasingly common in health services research, being particularly appropriate for evaluating interventions aimed at changing behaviour in patients or practitioners or changing organisation of services. Clusters might, for example, consist of patients in general practices or older people in nursing homes. Cluster randomised trials are pragmatic, measuring effectiveness rather than efficacy1 and should therefore be both internally and externally valid.2
Internal validity refers to the extent to which differences identified between randomised groups are a result of the intervention being tested. It thus depends on good design, conduct, and analysis of the trial, with minimal bias.3 4 5 In addition, without a sufficient sample size, differences that do exist between randomised groups that are a result of the intervention being tested might not be detected; sufficient sample size can also be considered a marker of internal validity.5 For cluster randomised trials, statisticians have repeatedly emphasised the importance of accounting for the clustered nature of the data in sample size calculations and analysis6 7 8 9 but investigators have not always heeded this guidance.10 11 12 13 14
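As an illustration of what accounting for clustering in a sample size calculation involves, the sketch below applies the standard design effect, 1 + (m - 1) x ICC, where m is the mean cluster size and ICC is the intracluster correlation coefficient. It assumes equal cluster sizes. The function names and the example figures (200 participants per arm, 20 per cluster, ICC of 0.05) are hypothetical and are not drawn from any trial in this review.

```python
import math


def design_effect(mean_cluster_size: float, icc: float) -> float:
    """Variance inflation from cluster randomisation, assuming equal cluster sizes."""
    if mean_cluster_size < 1:
        raise ValueError("mean_cluster_size must be at least 1")
    if not 0 <= icc <= 1:
        raise ValueError("icc must lie between 0 and 1")
    return 1 + (mean_cluster_size - 1) * icc


def clusters_per_arm(n_individual: int, mean_cluster_size: float, icc: float) -> int:
    """Clusters needed per arm, given the sample size an individually
    randomised trial would need (n_individual, per arm)."""
    inflated = n_individual * design_effect(mean_cluster_size, icc)
    return math.ceil(inflated / mean_cluster_size)


# Hypothetical example: 200 per arm under individual randomisation,
# 20 participants per cluster, ICC of 0.05.
print(design_effect(20, 0.05))
print(clusters_per_arm(200, 20, 0.05))
```

With these hypothetical inputs the required sample size almost doubles, which is why ignoring clustering at the design stage can leave a trial underpowered.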
A potential barrier to internal validity highlighted more recently is lack of blinding to allocation status of those identifying or recruiting individuals into a cluster randomised trial.15 16 Concealment of allocation from those recruiting and randomising participants is well recognised as a cornerstone of internal validity for individually randomised trials.17 In cluster randomised trials there are two levels of participant: the cluster and the individual. Identification or recruitment of individuals, or both, often takes place after randomisation (of clusters) and if those carrying out the identification or recruitment of patients at this post-randomisation stage are not blind to allocation status, bias can occur. Puffer and colleagues recommend that reports include a clear statement about when individual participants are identified and whether or not those recruiting are blind to allocation status.16
Lack of other types of blinding is associated with poor internal validity in individually randomised trials18 and might result in poor internal validity in cluster randomised trials. Lack of blinding in outcome assessment is usually considered the most serious potential source of bias19; in most cluster randomised trials it is possible to assess outcomes blind to allocation status. The nature of the intervention in most of these trials, however, means that it is rarely possible to blind those delivering components of the intervention to individual participants. For example, an intervention might involve educational outreach to all clinical staff in intervention general practices (clusters); these staff, who must then deliver enhanced care to patients, cannot be blind to whether or not they receive education.w1 In addition, it is not always possible to blind individual participants to the fact that they are receiving an intervention—for example, if they are receiving leafletsw2—although this does not necessarily mean that they know their allocation status. This inability to blind health professionals (and sometimes individual participants) is a distinctive feature of these trials.
External validity refers to the extent to which study results can be applied to other individuals or settings. Several frameworks have been developed that are helpful in assessing this.20 21 22 The RE-AIM framework (table 1) was developed by Glasgow and colleagues to characterise the public health impact of interventions.22 23 The framework has been used to assess the external validity of evaluations of interventions common in cluster randomised trials,23 24 25 although none of the previously published assessments specifically focuses on cluster randomised trials. Four features of RE-AIM are related to external validity: reach, adoption, implementation, and maintenance. We have focused on adoption and implementation because these factors can operate differently in individually and cluster randomised trials and are amenable to assessment from trial reports.
To judge adoption (the extent to which the settings included are representative of a wider population of settings and adequately described), a reader needs information on eligibility criteria for clusters, numbers of clusters randomised and analysed, and a discussion of generalisability of trial findings to clusters as well as individuals, all factors recommended in the extension to the CONSORT statement for cluster randomised trials.26 Cluster recruitment rate also contributes to an assessment of adoption. The implementation of an intervention as intended requires the cooperation of the clusters in potentially two distinct ways. Firstly, health professionals in clusters must comply with any intervention targeted at them (for example, an educational programme). Secondly, they must deliver components of the intervention they are supposed to be actively involved in (for example, extra counselling sessions to patients). Using terms defined by Bonell and colleagues in a framework for assessing generalisability, we refer to compliance with programmes targeted at health professionals in clusters as acceptability, and delivery of intervention as intended as feasibility (table 1).21
We reviewed recent cluster randomised trial reports to assess the extent to which trial investigators have ensured internal validity through appropriate sample size calculations and analyses, blinding of those identifying and recruiting individual participants to allocation status, and blinding of patients and of outcome assessors. We explored the reporting of information useful in assessing external validity, namely adoption (through cluster eligibility, inclusion, retention, and generalisability) and implementation (through the feasibility and acceptability of the intervention to health providers in clusters).
We included only trials in primary care to facilitate comparisons with results of trials selected from a previous review of cluster randomised trials.13 We defined primary care using a hybrid of the definitions used in the United Kingdom27 and the United States28; the rationale for this being that a definition that worked for these two different health services would also work in other countries. The definition is “accessible, often first contact, health care, usually provided within the community, which is either comprehensive, co-ordinated care involving sustained partnership with patients, or undifferentiated by age, gender, disease or organ. This includes comprehensive, co-ordinated care to particular subsets of the population sometimes for a fixed period, or care which focuses on sustaining health rather than treating illness.”13
We included trials reporting primary evaluations of effectiveness where randomisation was by cluster (for example, general practices) as long as there were some outcomes collected from observational units at a level below the randomisation unit (for example, individual patients). We excluded reports that referenced main trial findings elsewhere or did not report outcomes or where individual participants were randomised. SE searched the National Library of Medicine (Medline) database electronically for primary care trials published (including e-publications) in 2004 and 2005 in seven current journals that our previous research identified as publishing six or more cluster randomised trials in primary care in an earlier period (1997-2000) (British Medical Journal, British Journal of General Practice, Family Practice, Preventive Medicine, Annals of Internal Medicine, Journal of General Internal Medicine, Pediatrics). SE identified cluster randomised trials by examining the abstracts and, when necessary, full texts. On the basis of previous trends, we estimated that we needed to identify 40 trials, enough to provide sensible estimates of proportions of trials in certain categories. Two reviewers (SE and CB or MW) independently extracted appropriate information and resolved discrepancies by discussion or by referral to GF and DA.
To assess the extent to which investigators had followed recommendations about adequate power and appropriate analyses, we calculated proportions of reports correctly accounting for clustering in design and analysis. We compared these with similar proportions from trial reports in the same seven journals in 1997-2000 (unpublished data from our previous review). To assess the extent of blinding of those identifying or recruiting individual patients to allocation status, we grouped the trials into four categories:
Possibility of bias in recruitment/identification of participants—Bias was possible if those identifying or recruiting patients were not blind to allocation status and could have had an impact on who was identified or recruited or could have relayed information to patients to make them more or less likely to consent or if information given to patients at consent was clearly different in different intervention groups.
Bias unlikely in recruitment/identification of participants—Bias was unlikely if those identifying and recruiting patients were blind to allocation status or criteria for patient entry were such that recruiters could not have had a substantial impact on who was recruited, or both.
No possibility of bias in recruitment/identification of participants—If identification was blind to allocation status and there was no recruitment of individual participants bias could not exist. This can happen if, for example, general practices are recruited and outcomes from individual participants are assessed via routine data.w3
Unclear—Used if we could not put a trial into one of the above categories based on the trial report.
Many trials in our review would have started before publication of the key paper that highlighted inadequate blinding at recruitment of patients as a barrier to internal validity16 and investigators might not have been fully aware of this issue at identification and recruitment of patients. We therefore also assessed whether or not investigators seemed to be aware of the issue at the time they published, as evidenced by appropriate discussion within their trial report. To assess other types of blinding we recorded whether the reports indicated that patients and those who assessed the primary outcome were blind to allocation status, not blind, or whether this was unclear. We defined the primary outcome as that specified by authors or, if not specified, the outcome used in the calculation of sample size or, if there was no sample size calculation, the first outcome presented in the abstract.
To assess adoption we extracted information reported on cluster eligibility and numbers approached, recruited, and lost to follow-up; when possible we calculated cluster recruitment and attrition rates. We compared results with those for trials from the same seven journals in 1997-2000. We also extracted any phrases investigators used to discuss cluster generalisability. To assess implementation we identified whether investigators reported the extent of adherence to any components of the intervention targeted directly at health professionals in clusters (acceptability) and the extent to which health professionals delivered any of these components to patients as intended (feasibility). In this sense, feasibility is not specific to cluster randomised trials but might be particularly important in these trials where interventions are often multifaceted and complex. We also identified whether investigators reported any lack of adherence to trial protocol as an issue in their trial. In addition, we assessed whether there was evidence of a substantial evaluation of trial processes to try to ascertain and understand acceptability and feasibility.
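The rate calculations described above can be sketched as follows. This is a minimal illustration under assumed definitions: recruitment rate is taken as clusters recruited divided by clusters approached, and attrition rate as clusters lost to follow-up divided by clusters recruited. The function name and the example counts are hypothetical. A rate is returned as None when the counts needed to compute it are not reported, mirroring "when possible" in the text.

```python
from typing import Optional, Tuple


def cluster_rates(
    approached: Optional[int],
    recruited: Optional[int],
    lost: Optional[int],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (recruitment_rate, attrition_rate) for one trial report.

    Either rate is None if a count it depends on is missing or its
    denominator is zero.
    """
    recruitment = None
    if approached is not None and recruited is not None and approached > 0:
        recruitment = recruited / approached

    attrition = None
    if recruited is not None and lost is not None and recruited > 0:
        attrition = lost / recruited

    return recruitment, attrition


# Hypothetical report: 40 clusters approached, 20 recruited, 1 lost to follow-up.
print(cluster_rates(40, 20, 1))
# Hypothetical report that omits the number of clusters approached.
print(cluster_rates(None, 20, 0))
```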
We identified 40 potentially eligible trials and excluded six (in one, clusters were not fully randomised; two referenced main trial findings elsewhere; two did not report outcomes; and one was primarily a report of an individually randomised trial). We reviewed the remaining 34 trials, which involved various cluster types and interventions (table 2).w1-w34 Most disagreements on data extraction were resolved by discussion between data abstractors.
All reports contained information on analysis and 29 on sample size calculations. One report mentioned a sample size calculation reported elsewhere (we categorised this as not clear whether sample size calculation accounted for clustering). Sixty two per cent (21/34) definitely accounted for clustering in sample size calculations and 88% (30/34) in analyses compared with 15% (9/60) and 73% (44/60), respectively, for trials in the same seven journals in 1997-2000 (unpublished data from previous review) (table 3).
Bias caused by lack of blinding of those identifying and recruiting individual participants was impossible or unlikely in 62% of trials (21/34) and possible in 21% (7/34) (table 3). In 14 trials individual participants (usually patients) were not recruited; we judged that selection bias was impossible in 12 and possible in one where general practitioners identified relevant patients after randomisation,w2 and one trial report was not clear enough for us to make a judgmentw12 (table 4). Where individual participants were recruited (20 trials), we judged that bias was unlikely in nine, possible in six, and that we could not judge in five. Five reports commented on the possibility of bias in participant recruitment or identification; this was more likely if we had judged that there was a possibility of such bias in the trial (three out of seven trials).w7 w22 w28 Individual participants were reported to be blind to allocation status in 56% (19/34) of trials. This was the case in all trials in which participants were not recruited except for two which randomised families or family compoundsw10 w23 and in seven out of 20 trials in which participants were recruited. In all of the latter seven trials, investigators reported making a specific effort to ensure that individual participants were not given information about allocation status (table 4). Primary outcome assessment was blind to allocation status in 44% (15/34) of trials (table 3); blinding was more likely if participants were recruited (10/20), but this effect could have arisen by chance (odds ratio 1.8, 95% confidence interval 0.4 to 7.3).
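The odds ratio for blind outcome assessment can be checked with a short calculation. The report does not state how the confidence interval was obtained; the sketch below assumes the log (Woolf) method, and the function name is hypothetical. The comparison group of 5 blind trials out of 14 is not stated directly but follows from the totals given (15 of 34 overall, 10 of 20 where participants were recruited).

```python
import math
from typing import Tuple


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Odds ratio for the 2x2 table [[a, b], [c, d]] with a Woolf (log) interval.

    a, b: events and non-events in the first group
    c, d: events and non-events in the second group
    z:    normal quantile (1.96 for an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the log method")
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


# Blind outcome assessment: 10 of 20 trials with recruitment, 5 of 14 without.
estimate, lower, upper = odds_ratio_ci(10, 10, 5, 9)
print(f"{estimate:.1f} ({lower:.1f} to {upper:.1f})")
```

Under this assumed method the result rounds to 1.8 (0.4 to 7.3), in line with the figures reported. The width of the interval reflects how few trials fall in each cell.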
Most reports contained some information about cluster eligibility. We attempted to judge generalisability based on this and information about cluster inclusion and retention but found it difficult to do. Only 59% (20/34) of trial reports contained full information on numbers of clusters approached, recruited, and analysed (table 3); the comparable figure for trials from the same journals in 1997-2000 was 31% (19/60). We calculated cluster recruitment rates for 23 trials (median 50%, interquartile range 30-100%) and attrition rates for 27 (median 0%, 0-5%) (comparable figures for 1997-2000 trials were median 72%, 29-88%, and median 0%, 0-6%). Of the 18 trials with recruitment rates below 85%, only six reported a comparison of the characteristics of clusters approached and recruited (see table A on bmj.com). Two trials lost over a quarter of clusters after recruitment: one because of lack of eligible patients,w29 the other because some clusters did not allow data collection to be completed.w33
Fifty three per cent (18/34) of trial reports contained a discussion of cluster generalisability (table 3). This was more likely if they also reported full information on numbers of clusters approached, recruited, and analysed (13/20 v 5/14), although again this effect could have arisen by chance (odds ratio 3.3, 0.8 to 13.9). Most suggested that generalisability might be restricted, but only four explained how clusters included might differ from those not included: more interested, motivated, familiar with training methods, ready to change.w6 w17 w26 w33 None of these trials showed evidence of effectiveness of the intervention for the whole trial population and primary outcome.
Only two trials did not involve clusters in either an intervention targeted at them that they could opt out of or active involvement in intervention delivery; both assessed the effect of giving information to health professionals.w14 w19 Fifteen trials reported information about levels of intervention implementation, and four discussed it (see table A on bmj.com). In most of these trials implementation was less than optimal. No reasons were given for health professionals in the clusters not fully adhering to the intervention targeted at them (lack of acceptability). The most common reason given for less than optimal delivery of the intervention (lack of feasibility) was lack of time. Eight reports mentioned additional specific research, usually qualitative, which explored trial processes, acceptability, or feasibility.
When we divided the trials according to whether they were published in the BMJ or elsewhere, the BMJ scored higher than other journals on eight of the 10 criteria in table 3. The difference in the proportions of trials in which the primary outcome was assessed blind to allocation status (81% in the BMJ and 26% in other journals, odds ratio 12.7, 2.1 to 76.6) was particularly striking. All other differences could have arisen by chance.
The time trends in our data suggest an encouraging improvement in the extent to which investigators account for clustering in the design and analysis of cluster randomised trials. About a quarter of the trials were potentially biased because of procedures for selecting patients. Blinding of individual participants to allocation status was almost universal in trials in which individual participants were not recruited, but much less common in trials in which they were recruited. In less than half of the trials assessment of the primary outcome was blind to allocation status. In two fifths of reports there was no information about the implementation of the intervention; where there was information, implementation was almost always less than optimal. The reporting of information relating to cluster generalisability might have improved since the late 1990s but remains poor in almost half of the trials we reviewed. We were not able to assess time trends in procedures for selecting patients, blinding, or reporting of implementation because we had no data from earlier trials, but there seems to be considerable room for improvement. Because of small numbers of trials we are not able to draw substantive conclusions about the differences in quality between journals, although our results suggest that trials reported in the BMJ might be of higher quality than trials in many other journals in respect of blinding those who assessed the primary outcomes.
Strengths and limitations
We focused on recent trials and had rigorous review procedures. We could not, however, judge the extent of some of the barriers to internal and external validity because of lack of reporting and might have underestimated the extent to which investigators recognised and dealt with some barriers as a result. In addition, we did not consider all possible barriers to validity, in particular inadequate descriptions of interventions, lack of generalisability of patients, and lack of maintenance of effect. A consideration of adequate description of the intervention was beyond the scope of our study, but previous research suggests that many interventions of the sort evaluated in cluster randomised trials are not described in enough detail to enable their adoption in other settings29 and makes recommendations for description.30 Although we limited our review to trials in primary care to facilitate comparison with an earlier review, we have no reason to think that our general conclusions are not more widely applicable. Limitation of the review to trials published in journals that are more familiar with this type of trial design might have led to an overoptimistic assessment of quality in comparison with the quality of trials in other journals.
There have been several previous reviews of cluster randomised trials.10 11 12 13 14 16 31 32 33 34 Most have indicated poor quality in relation to accounting for clustering in sample size and analysis. Previous statistical publications could have contributed to the increase in trials correctly accounting for clustering.1 8 9 35 36 37 38 Few reviews have explored the other aspects of internal and external validity that we considered. Using slightly different methods, Puffer et al found similar levels of evidence of bias in selection of patients in 36 trials published in the BMJ, Lancet, and New England Journal of Medicine in 1997-2002.16 In reviewing eight experimental and quasi-experimental studies of HIV prevention, Bonell et al found that none commented on the extent to which study samples were representative of the targeted populations.21 Our research concurs with their more general conclusion that few studies assessed the generalisability of their results. Recent research suggests that evaluation of process in trials of complex interventions, such as those described here, is important39; such evaluations could facilitate an understanding of generalisability.21 Although we did not identify many trials that had separate process evaluations, we looked for evidence of this only within the trial reports.
Cluster randomised trials are essential for evaluating certain types of intervention and often afford an important advantage over individually randomised trials in terms of internal validity because they are less prone to contamination bias. Nevertheless, other design features of such trials might compromise internal validity, largely through lack of blinding of those delivering care or identifying and recruiting participants or of the individual participants themselves. Sometimes such lack of blinding is inevitable, and sometimes it can be avoided.
To avoid bias, trial investigators should ideally ensure that those who identify or recruit individual participants, or both, are blinded to allocation status. If knowledge of allocation status is unlikely to influence the characteristics of individual participants identified or recruited (for example, if the inclusion process is computerised or unlikely to be subverted for other reasons), investigators should report this. As suggested previously,16 26 investigators should report identification and recruitment strategies transparently, particularly in relation to the timing of randomisation and intervention delivery, who identifies and recruits individuals, and whether they are blind to allocation status. Investigators should also detail the information given to participants. Full information about the trial might lead to later unblinding of patients, and possibly performance bias, when they are exposed to a particular intervention, while different information given to intervention groups might result in differential recruitment or expectation bias in participants.40 A few reports we reviewed detailed information given to patients at recruitment; all those that did suggested that patients were given identical information regardless of intervention group, and in many cases an effort was made to ensure that they did not know their allocation status.
This strategy, which might reduce bias, is nevertheless at odds with the generally accepted ethical principle of fully informed consent that proposes that patients should be given full information about the trial that they are participating in.40 Trial investigators should be aware that this conflict between science and ethics is also present in trials in which individual participants are not recruited; blinding of participants is easy to maintain, but participants receive no information about the trial. When individual participants, those identifying or recruiting them, and outcome assessors cannot be blind to allocation status, this might or might not have serious consequences for internal validity; as some issues seem distinct in these trials we cannot necessarily assume that results regarding factors that affect bias transfer from individually randomised trials to cluster randomised trials. Our study was too small to assess whether these various potential barriers to internal validity actually lead to biased results. Further studies are needed to explore this. We suggest that at the design stage of their trials investigators should systematically identify potential biases arising from lack of blinding, the anticipated relative importance of these biases, and whether there is any potential for avoidance.
Judgment about external validity can be facilitated by the reporting of readily available information about numbers and characteristics of clusters approached, recruited, and analysed, and a discussion of generalisability. Information about the characteristics of included health professionals and organisations might be more important in cluster randomised trials, where clusters can have considerable impact on an intervention’s effect, than in individually randomised trials, where those delivering the intervention often have minimal impact on its effect. Nevertheless, we found it difficult to judge the generalisability of findings, even with this information. Indeed, a judgment about whether an intervention could be used in a different setting might depend on detailed knowledge of the area being researched, the setting and healthcare system of the country in which the trial takes place, and the setting and healthcare system to which the intervention might be transferred. Thus, while appropriate guidelines can govern how to assess internal validity, we might be able to assess only whether investigators have presented information that could be used to judge external validity. While frameworks for generalisability developed recently are helpful in this respect, uncertainty about external validity can still remain even when all the parameters of these frameworks are complied with. For cluster randomised trials one key element of this uncertainty is the current lack of knowledge about how clusters with different characteristics respond to different types of intervention. Indeed, most of the trial investigators in our review were not specific about the likely effect of the clusters included on external validity.
In individually randomised drug trials, judgments about differences in health status and morbidity between trial participants and other populations are generally easier to make and routine monitoring of drug use after licensing can facilitate a judgment of generalisability.
Although it has already been recommended,41 no monitoring system exists to assess the wider effectiveness of complex interventions such as those aimed at clusters. Studies to assess the implementation and impact of similar interventions in different types of cluster and setting21 and exploration and synthesis of empirical evidence from existing trials could also help to fill this knowledge gap. This will mean exploiting the developing science of evidence synthesis; meta-analyses of complex interventions are often not credible and narrative analyses do not provide estimates of the influence of patient or cluster factors on effect sizes. For most cluster randomised trials, investigators should discuss the implementation of the intervention. Again, a better understanding of factors affecting implementation in different circumstances and among different clusters, possibly through evaluations of trial process,40 would clarify implications for external validity. Our study is too small to form any substantive conclusions about the relation between statistical significance and external validity, although it may be that reporting of certain aspects of external validity is influenced by the statistical significance of findings; this is an issue for future research.
Further observations on validity
In individually randomised trials there is usually a clear distinction between internal and external validity. For example, selection of individual participants into a trial affects external validity, while allocation of individual participants affects internal validity; implementation of the intervention by health professionals affects external validity. In cluster randomised trials, however, this distinction becomes blurred. Lack of blinding to allocation status at identification and recruitment of individual participants might affect internal validity through differential recruitment in two groups but might also affect external validity through the overall profile of participants. Similarly, as health professionals in intervention clusters generally have to implement a wider range of components of an intervention than those in control clusters, failure to implement components will probably be more common in intervention clusters and this might affect internal validity. Thus, while we have focused on internal and external validity, these are to some extent arbitrary distinctions in these trials. Our concern is, nevertheless, to highlight features of these trials that are potential barriers to their validity, both internal and external.
Cluster randomised trials are essential for evaluating certain types of intervention and there are often strong scientific reasons to conduct them. Issues relating to the internal validity of these trials, such as appropriate calculations of sample size and analysis, have been widely disseminated and are now better addressed by the research community. The importance of blinding those who identify and recruit patients has been raised but, as yet, is not always well addressed. There might be fewer barriers to internal validity in trials in which individual participants are not recruited. External validity has received less attention in the literature and seems to be poorly addressed in many trials, yet is arguably as important as internal validity in judging the quality of trials as a basis for healthcare policy.
What is already known on this topic
Cluster randomised trials have not always been well designed and analysed
Lack of blinding in the identification and recruitment of individual participants can be a problem
What this study adds
The extent to which investigators are designing and analysing these trials appropriately has improved
Some trials still do not blind those recruiting and identifying participants
Information relating to cluster generalisability is generally poorly reported
Contributions: SE conceived the idea for the study, led the research, and wrote the initial draft. GF and DA contributed to design and interpretation. MW and CB extracted data. All authors contributed to the final paper. SE is guarantor.
Funding: SE received a HEFC promising research fellowship.
Competing interests: None declared.
Ethics approval: Not required.
Provenance and peer review: Not commissioned; externally peer reviewed.