Multiplicity of data in trial reports and the reliability of meta-analyses: empirical study
BMJ 2011; 343 doi: https://doi.org/10.1136/bmj.d4829 (Published 30 August 2011) Cite this as: BMJ 2011;343:d4829
- Britta Tendal, research fellow1,
- Eveline Nüesch, research fellow23,
- Julian P T Higgins, senior statistician4,
- Peter Jüni, head of division23,
- Peter C Gøtzsche, professor and director1
- 1Nordic Cochrane Centre, Rigshospitalet, Blegdamsvej 9, DK-2100, Copenhagen Ø, Denmark
- 2Institute of Social and Preventive Medicine, University of Bern, Switzerland
- 3CTU Bern, Bern University Hospital, Switzerland
- 4MRC Biostatistics Unit, Cambridge, UK
- Correspondence to: B Tendal
- Accepted 7 June 2011
Objectives To examine the extent of multiplicity of data in trial reports and to assess the impact of multiplicity on meta-analysis results.
Design Empirical study on a cohort of Cochrane systematic reviews.
Data sources All Cochrane systematic reviews published from issue 3 in 2006 to issue 2 in 2007 that presented a result as a standardised mean difference (SMD). We retrieved trial reports contributing to the first SMD result in each review, and downloaded review protocols. We used these SMDs to identify a specific outcome for each meta-analysis from its protocol.
Review methods Reviews were eligible if SMD results were based on two to ten randomised trials and if protocols described the outcome. We excluded reviews if they only presented results of subgroup analyses. Based on review protocols and index outcomes, two observers independently extracted the data necessary to calculate SMDs from the original trial reports for any intervention group, time point, or outcome measure compatible with the protocol. From the extracted data, we used Monte Carlo simulations to calculate all possible SMDs for every meta-analysis.
Results We identified 19 eligible meta-analyses (including 83 trials). Published review protocols often lacked information about which data to choose. Twenty-four (29%) trials reported data for multiple intervention groups, 30 (36%) reported data for multiple time points, and 29 (35%) reported the index outcome measured on multiple scales. In 18 meta-analyses, we found multiplicity of data in at least one trial report; the median difference between the smallest and largest SMD results within a meta-analysis was 0.40 standard deviation units (range 0.04 to 0.91).
Conclusions Multiplicity of data can affect the findings of systematic reviews and meta-analyses. To reduce the risk of bias, reviews and meta-analyses should comply with prespecified protocols that clearly identify time points, intervention groups, and scales of interest.
Meta-analyses of randomised clinical trials are crucial for making evidence based decisions. However, trial reports often present the same data in multiple forms when reporting different intervention groups, time points, and outcome measures.1 Although this multiplicity has always been a challenge in meta-analyses, its potential as a source of bias has received little attention.
The choice of the outcome of interest to include in systematic reviews is generally based on clinical judgment. However, because a fundamentally similar outcome might be measured on different scales, standardisation to a common scale is required before the outcome can be combined in a meta-analysis. This standardisation is typically achieved by calculating the standardised mean difference (SMD) for each trial, which is the difference in means between the two groups divided by the pooled standard deviation of the measurements.2 By this transformation, the outcome becomes dimensionless and the scales become comparable, because the results are expressed in standard deviation units. For example, a meta-analysis addressing pain might include trials measuring pain on a visual analogue scale and trials using a five point numerical rating scale. Combining outcomes measured on different scales potentially adds a layer of multiplicity, because the outcome of interest might be measured on more than one scale not only across trials but also within the same trial. Multiplicity of data in trial reports might lead to biased decisions about which data to include in meta-analyses and hence threaten the validity of their results. In this study, we empirically assessed whether selecting between multiple time points, scales, and treatment groups affected SMD results in a randomly selected sample of Cochrane reviews.
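As an illustration of the standardisation described above, the following minimal sketch computes an SMD as the difference in means divided by the pooled standard deviation. The function names, the large-sample standard error approximation, and all numbers are assumptions made for illustration; they are not taken from the study, and no small-sample correction is applied.

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def smd(mean1, sd1, n1, mean2, sd2, n2):
    """Standardised mean difference: difference in means over the pooled SD."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)

def smd_se(d, n1, n2):
    """Common large-sample approximation to the standard error of an SMD
    (an assumption here; the study does not state which formula was used)."""
    return math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))

# Hypothetical trial reporting pain on two scales (all numbers invented).
# Arguments: experimental mean, SD, n, then control mean, SD, n.
vas = smd(42.0, 20.0, 50, 50.0, 20.0, 50)  # 0 to 100 visual analogue scale
nrs = smd(2.1, 1.0, 50, 2.6, 1.0, 50)      # five point numerical rating scale

print(f"SMD on visual analogue scale: {vas:.2f}")
print(f"SMD on numerical rating scale: {nrs:.2f}")
```

In this invented example the same trial yields two different, equally eligible SMDs for the same outcome, which is the within-trial multiplicity the study examines.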
Data source and selection
We included all Cochrane systematic reviews published in the Cochrane Library over 1 year (between issue 3 in 2006 and issue 2 in 2007) that presented a result as an SMD. For every review, we retrieved reports of all randomised trials that contributed to the first SMD result, and downloaded the latest protocols for all reviews in June 2007. Reviews were eligible if the SMD result was based on two to ten randomised trials and if the review protocol described the outcome. We excluded reviews if they only presented results of subgroup analyses.
We defined the index SMD result as the first pooled SMD result presented in the abstract or in the main body of text of the review that was not based on a subgroup analysis. We used index SMD results to identify a specific outcome for each meta-analysis from its protocol. To ensure that the review authors had not received additional outcome data from the authors of relevant trials, we only considered the first SMD result that was based exclusively on published data.
Based on the published protocol of each review, two observers (BT, EN) independently extracted all data from the original trial reports that could be used to calculate the SMD for the outcome that met our inclusion criteria. From each trial report, we extracted data for all experimental or control groups, time points, and measurement scales, provided that they were compatible with the definitions in the review protocol. If any required data were unavailable, we made approximations as previously described.3 We did not include interim analyses. Disagreements were resolved by discussion. We did not contact trial authors for unpublished data. Selection of reviews and trials and the extraction of data from trial reports were prespecified (protocol available on request).
We used Monte Carlo simulations to determine the variation in meta-analysis results from different SMD estimates, calculated from multiple time points, intervention groups, and measurement scales. We also used this simulation to estimate the overall impact of multiplicity. During each simulation, we randomly sampled one SMD and the corresponding standard error for each component trial in a specific meta-analysis. We used sampling with replacement from the population of all possible SMDs caused by multiplicity, and selected one SMD per trial. We then used this randomly sampled SMD and the corresponding standard error for fixed or random effects meta-analysis (as originally done in the published reviews), and calculated a pooled SMD for each meta-analysis. We repeated this process 10 000 times—that is, we undertook each meta-analysis 10 000 times, with a random selection of one SMD per trial each time. We then examined the distribution of pooled SMDs in histograms.
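The simulation described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the trial data are invented, the function names are hypothetical, and the random effects option uses the DerSimonian-Laird estimator as an assumption, because the text does not name the estimator used in the reviews.

```python
import random

def pool_fixed(estimates):
    """Inverse-variance fixed effect pooling of (smd, se) pairs."""
    weights = [1.0 / se**2 for _, se in estimates]
    return sum(w * d for w, (d, _) in zip(weights, estimates)) / sum(weights)

def pool_random(estimates):
    """Random effects pooling with the DerSimonian-Laird estimate of tau^2."""
    w = [1.0 / se**2 for _, se in estimates]
    fixed = pool_fixed(estimates)
    q = sum(wi * (d - fixed) ** 2 for wi, (d, _) in zip(w, estimates))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(estimates) - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (se**2 + tau2) for _, se in estimates]
    return sum(wi * d for wi, (d, _) in zip(w_star, estimates)) / sum(w_star)

def simulate(trials, n_iter=10_000, pool=pool_fixed, seed=0):
    """Repeat the meta-analysis n_iter times, drawing one eligible
    (smd, se) pair per trial at random, with equal probability, each time."""
    rng = random.Random(seed)
    return [pool([rng.choice(candidates) for candidates in trials])
            for _ in range(n_iter)]

# Hypothetical meta-analysis of three trials (all numbers invented).
# Each inner list holds every eligible (SMD, standard error) pair for a trial.
trials = [
    [(-0.40, 0.20), (-0.10, 0.20)],                 # two eligible time points
    [(-0.30, 0.25)],                                # no multiplicity
    [(-0.60, 0.30), (-0.20, 0.30), (-0.05, 0.30)],  # three eligible scales
]

pooled = simulate(trials)
spread = max(pooled) - min(pooled)
print(f"Smallest pooled SMD: {min(pooled):.2f}")
print(f"Largest pooled SMD:  {max(pooled):.2f}")
print(f"Difference:          {spread:.2f}")
```

Passing `pool=pool_random` to `simulate` repeats the same exercise under the random effects model; the list `pooled` can then be plotted as a histogram.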
To estimate the impact of a single source of multiplicity (intervention groups, time points, measurement scales), we allowed only one source of multiplicity to vary at a time when randomly sampling SMDs for each trial. We standardised the other sources of multiplicity at prespecified standard values (group: pooled groups, time point: post treatment values, scale: first scale mentioned in text). For example, the analysis of multiplicity from different scales was based on post treatment values and pooled groups (if there were several possible groups). We would then randomly sample the values of the different scales for this time point and these groups to calculate the pooled SMD results. We expressed the variability of SMD results due to multiplicity as the difference between the smallest and largest pooled SMD results obtained from the Monte Carlo simulations. Only meta-analyses including trials with multiplicity contributed to these analyses. Finally, we compared the median pooled SMD from the Monte Carlo simulations to the index SMD that was published in the Cochrane review using a paired Wilcoxon test.
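The single-source analysis can be sketched in the same spirit. In the hypothetical example below, every record is labelled with its group, time point, and scale; only records at the standard values on the two sources held constant are kept, so that a single source varies. For brevity the sketch enumerates all combinations exhaustively instead of sampling them at random, which is a substitution for the Monte Carlo procedure and is feasible only when the number of combinations is small. Field names, labels, and numbers are invented.

```python
from itertools import product

# Standard values for the sources that are held constant (labels are hypothetical).
STANDARD = {"group": "pooled", "time": "post", "scale": "first"}

def pool_fixed(estimates):
    """Inverse-variance fixed effect pooling of (smd, se) pairs."""
    weights = [1.0 / se**2 for _, se in estimates]
    return sum(w * d for w, (d, _) in zip(weights, estimates)) / sum(weights)

def candidates(records, vary):
    """Keep the records of one trial that sit at the standard value on
    every source except the one allowed to vary."""
    held = [source for source in STANDARD if source != vary]
    return [(r["smd"], r["se"]) for r in records
            if all(r[source] == STANDARD[source] for source in held)]

def smd_range(trials, vary):
    """Difference between the largest and smallest pooled SMD when only
    one source of multiplicity varies. Assumes every trial has at least
    one record at the standard values."""
    per_trial = [candidates(records, vary) for records in trials]
    pooled = [pool_fixed(combo) for combo in product(*per_trial)]
    return max(pooled) - min(pooled)

# Hypothetical data: trial A reports two scales and two time points,
# trial B reports a single result.
trial_a = [
    {"group": "pooled", "time": "post", "scale": "first", "smd": -0.50, "se": 0.20},
    {"group": "pooled", "time": "post", "scale": "second", "smd": -0.20, "se": 0.20},
    {"group": "pooled", "time": "follow-up", "scale": "first", "smd": -0.30, "se": 0.20},
]
trial_b = [
    {"group": "pooled", "time": "post", "scale": "first", "smd": -0.40, "se": 0.20},
]

for source in ("group", "time", "scale"):
    print(f"{source}: {smd_range([trial_a, trial_b], source):.2f}")
```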
Figure 1 shows the flowchart for the selection of meta-analyses. The 19 eligible meta-analyses included 83 trials that contributed to our study.4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 Table 1 shows the characteristics of included reviews, which addressed various condition types: psychiatric (eight reviews), musculoskeletal (two), neurological (two), gynaecological (one), hepatological (one), respiratory (one), and other (four). We studied psychological interventions in 10 meta-analyses, pharmacological interventions in four, physical interventions in three, and other interventions in two (exercise and humidified air). The index outcomes analysed in the 19 meta-analyses were diverse: pain in three, another symptom in 13, and other outcomes in three.
Information in review protocols
Table 2 shows the level of information given in the review protocols. The protocols did not contain any information about which scales should be preferred. Eight protocols gave information about which time point or period to select, but only one gave enough information to avoid multiplicity, because the time point relevant for the selected index outcome was post-treatment, meaning that the data were collected by the end of treatment. A typical statement, which allowed for a potentially biased choice regarding the selection of a time point, was: “All outcomes were reported for the short term (up to 12 weeks), medium term (13 to 26 weeks), and long term (more than 26 weeks).”7 Another review, about humidified air for treating croup,15 stated: “The outcomes will be separately recorded for the week following treatment.” The selected outcome in this particular review was croup symptom score, and none of the three included trials ran for this length of time; instead, they reported symptoms 20 min to 12 hours after the intervention. Eighteen protocols described which type of control group to select, but none reported any hierarchy among similar control groups or any intention to combine such groups.
Furukawa and colleagues provided an example of a protocol with many possible intervention or control groups.8 The authors aimed to compare combined psychotherapy and pharmacotherapy with psychotherapy or pharmacotherapy alone. They defined psychotherapy broadly, as “any other psychological approach.”8 The pooled index SMD was based on seven trials, three of which had more than one possible intervention group.23 24 25 Three further trials,26 27 28 each with only one intervention group, contained three groups that could be used as control groups: one receiving pharmacotherapy only, one receiving psychotherapy only, and one receiving psychotherapy plus placebo.
Observed multiplicity in trial reports
Table 3 presents the extent of multiplicity observed in the eligible reviews. Table 4 gives an example of multiple eligible measurement scales, showing the different scales possible in the meta-analysis by Hunot and colleagues.10
Observed multiplicity in meta-analyses
In 11 (58%) meta-analyses, we identified at least one trial that provided data for more than one intervention or control group. Thirteen (68%) meta-analyses included at least one trial that reported more than one eligible time point and 12 (63%) included at least one trial that reported the index outcome using more than one eligible measurement scale. We identified one meta-analysis without multiplicity, because all three included trials only reported data of one intervention and control group, one eligible time point, and one measurement scale for the index outcome.18
Effects of multiplicity on meta-analysis results
Figure 2 shows the distributions of possible pooled SMDs in each meta-analysis, after we randomly selected one possible SMD result per trial. Any type of multiplicity of data in the included trials affected pooled SMD results in 17 (89%) of 19 meta-analyses. The remaining two meta-analyses were not affected: one did not have multiple data in the trial reports,18 and the observed multiplicity in the other had no effect on the pooled SMD results.7 In one meta-analysis, the Monte Carlo distribution does not include the published SMD, because the review authors used changes instead of end of follow-up values to calculate the SMD.18
In all 11 (58%) meta-analyses including at least one trial with more than one experimental or control group, we found variability in the pooled SMD results due to this type of multiplicity. In 12 (63%) meta-analyses, we found variability in the pooled SMD results due to multiplicity of data regarding time points (figure 2). In one meta-analysis with two trials that reported more than one eligible time point, these different time points did not produce variability in the pooled SMD results.7 In ten (53%) meta-analyses, we found variability in pooled SMD results from trial data of multiple measurement scales used for the index outcome. In two meta-analyses, one trial in each meta-analysis reported data for more than one measurement scale for the index outcome, but this multiplicity did not affect the pooled SMD results.6 22 In 12 (63%) reviews, the published pooled SMDs were more favourable for the experimental intervention than the median pooled SMD from the simulations (P=0.49).
Table 5 presents the variability of pooled SMD results according to different sources of multiplicity (that is, groups, time points, or scales). Eighteen meta-analyses included trials with multiple data for at least one source. In these 18 meta-analyses, the pooled treatment effect varied greatly because of multiplicity of data (median difference between the smallest and largest SMDs within the same meta-analysis 0.40 standard deviation units, range 0.04 to 0.91).
In 18 of the 19 meta-analyses in our study, we found multiplicity of data in at least one trial report within each meta-analysis, which frequently resulted in substantial variation in the pooled SMD results. The impact of multiple data in trial reports regarding intervention groups, time points, or measurement scales on meta-analysis results varied greatly across meta-analyses, ranging from almost no effect (0.04 standard deviation units) to a substantial one (0.91 standard deviation units, corresponding to a large treatment effect),29 with a median difference of 0.40 standard deviation units. We also estimated the effect of the individual sources of multiplicity, holding the other sources constant.
Example of potential implications of multiplicity of data
Table 6 provides an example of data from trials investigating the effects of pharmacotherapy on anxiety levels. Within the individual trials, the effect of pharmacotherapy varied widely from week to week, depending on which time point was examined. When we randomly selected one time point for each trial, SMDs varied from −0.76 (indicating a large benefit) to 0.05 (indicating little effect). For example, in the Fineberg 2005 trial, there was a large difference in the treatment effect between weeks 8 and 16.
If a meta-analysis were to pick only the most favourable trial results, its result would be biased and overly optimistic, which might affect clinical judgment about whether to use a particular treatment. If the protocol does not prespecify time points for the meta-analyses, the meta-analysts might make data driven decisions based on the trial results as a whole. In the example in table 6, one could argue for two strategies: to include the latest time point from all trials, or to use the length of the shortest trials and extract time points from the other trials that match this time point best. Another solution would be to include all time points in one analysis, similar to an analysis of repeated measures in an individual trial (see below).
Strengths and limitations of study
Our selection of Cochrane reviews in this study was random, and the variability of the SMD results did not seem related to particular types of interventions or outcomes. To estimate the impact of multiplicity on meta-analysis results, we randomly selected one SMD per trial from a pool of eligible SMDs with equal probability and used these to calculate pooled SMDs for each meta-analysis. However, in practice, implicit rules regarding data extraction might apply within specialties. For example, one scale, such as Hamilton’s depression scale, might be more commonly used than others. Such an implicit hierarchy of scales would be expected to reduce the multiplicity, but should be made explicit in protocols for systematic reviews.
Our results are transparent because we only included published results. Therefore, we probably underestimated the true level of multiplicity, since selective reporting of outcomes in trials is common.30 31 32 33 Positive, significant results are more likely to be published than non-significant results.34 Alternatively, if our random selection of SMDs for the meta-analyses did not reflect how review authors typically select in practice, we might have overestimated the observed effects of multiplicity.
Our study was possible because authors of Cochrane reviews must publish their protocols before they undertake and publish the review. We believe that most non-Cochrane meta-analyses do not have available protocols,35 and therefore the scope for multiplicity is probably greater than in Cochrane reviews. Although we examined three common sources of multiplicity of data in trial reports, there are other types of multiple data in trial reports, such as different types of analysis (for example, intention to treat and per protocol analyses). Review authors' decisions about how many and which outcomes to select might also be influenced by how favourable the results appear to be in the published trial reports. The effect of the selection of scales, time points, and control groups has not been systematically assessed in any of the published Cochrane reviews.
The extent of multiplicity of data identified in trial reports is a function of the information provided in the review protocols: we would expect a poorly specified outcome to increase multiplicity. Data extraction for a meta-analysis depends on the information given by trial reports, and therefore it cannot be fully specified in advance without knowledge of the included trials. However, to minimise data driven selection of time points, measurement scales, or intervention groups, researchers should specify these decisions at the protocol stage. If amendments to the protocol are indicated, they should be reported transparently.36 37
Comparison with other studies
To our knowledge, our study is the first to show empirically the extent to which multiplicity of data can compromise the reliability of meta-analysis results. We have previously reported results from an observer agreement study of ten meta-analyses included in the present study.2 We found that disagreements between observers were common and often large, mainly because of different choices of groups, time points, scales, and calculations; different decisions on the inclusion or exclusion of particular trials; and data extraction errors.2 Bender and colleagues describe the problem of multiple comparisons in systematic reviews.1 They identified common reasons for multiplicity in reviews, but did not estimate the impact on the meta-analysis results.1 In our study, we included meta-analyses of SMDs, which could be associated with multiplicity of data because of the use of different measurement scales in included trials. However, multiplicity of data due to selection of time points and groups is not unique to SMD results and could apply to other effect measures, such as those for binary outcomes.
Possible approaches to minimise bias due to multiplicity of data
One approach to dealing with multiplicity in systematic reviews is to extract, analyse, and report all data available for intervention groups, time points, and measurement scales. However, this method could lead to considerable problems of interpretation, in view of the potential discrepancies between different scales or time points. As with repeated measures in an individual trial, all available time points reported in included trials could be analysed in a single meta-analysis while fully accounting for the correlation of repeated measurements within a trial.38 Alternatively, assessments from different scales measuring similar concepts could be analysed in a single multivariate model, similar to the use of bivariate models in diagnostic research.39 Although the first approach of including repeated assessments in a single analysis could be easily understandable, many readers could find the second approach difficult to understand.
Another approach could be to provide detailed protocols for systematic reviews with clearly specified time points, scales, and groups. Protocols should also include explicit and transparent hierarchies of each source of data, or strategies to combine sources (for example, if there are several control groups). Clinical judgment will be important here. Ideally, the choice of time points and scales should be evidence based, but empirical evidence for the most interesting time points and a hierarchy of scales according to their validity and responsiveness are rarely available. In addition, it is difficult to foresee everything at the protocol stage, and the scope, methodological quality, and quality of reporting of included studies might require subsequent modifications.40 However, only Cochrane reviews are formally required to have a published protocol, and only about 10% of non-Cochrane reviews explicitly state a formal protocol.37 Protocol amendments could affect the results and conclusions of systematic reviews and should be made only after careful consideration and be reported transparently.36 37 Furthermore, the reporting of the methods and the results in meta-analyses must clearly explain how the results were achieved and how any multiplicity of data was handled.36
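To make the idea of an explicit hierarchy concrete, the sketch below shows how prespecified rules could select exactly one result per trial before any results are seen. The ordering of scales, the target time point, the tie-breaking rule, and the data are all hypothetical assumptions for illustration; they are not recommendations from this study.

```python
# Hypothetical protocol rules, fixed before data extraction.
SCALE_HIERARCHY = ["Hamilton depression scale", "Beck depression inventory"]
TARGET_WEEKS = 12

def select_record(records, scale_hierarchy=SCALE_HIERARCHY,
                  target_weeks=TARGET_WEEKS):
    """Pick one record per trial: the highest ranked scale first, then the
    time point closest to the target, then (as an assumed tie-breaker) the
    earlier time point. Scales not in the hierarchy rank last."""
    def rank(record):
        if record["scale"] in scale_hierarchy:
            scale_rank = scale_hierarchy.index(record["scale"])
        else:
            scale_rank = len(scale_hierarchy)
        return (scale_rank, abs(record["weeks"] - target_weeks), record["weeks"])
    return min(records, key=rank)

# Hypothetical trial reporting one outcome on two scales at several time points.
trial = [
    {"scale": "Beck depression inventory", "weeks": 12, "smd": -0.60},
    {"scale": "Hamilton depression scale", "weeks": 10, "smd": -0.35},
    {"scale": "Hamilton depression scale", "weeks": 16, "smd": -0.20},
    {"scale": "Hamilton depression scale", "weeks": 26, "smd": -0.10},
]

chosen = select_record(trial)
print(chosen["scale"], chosen["weeks"], chosen["smd"])
```

Because the rule never looks at the `smd` value, the selection cannot be driven by how favourable a result appears.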
Multiplicity of data in trial reports and review protocols lacking a detailed specification of eligible time points, scales, and treatment groups can lead to substantial variability in meta-analysis results. Authors of systematic reviews should anticipate and consider the multiplicity of data in trial reports when writing protocols. To enhance reliability of meta-analyses, protocols should clearly define time points to be extracted, provide a hierarchy of scales, clearly define eligible treatment and control groups, and present strategies for handling multiplicity of data.
What is already known on this topic
Considerable observer variation exists in data extraction, which can be attributed to different choices and errors in the data extraction
The extent to which multiplicity of data in trial reports can compromise the reliability of meta-analysis results is unknown
What this study adds
Multiplicity of data in trial reports is substantial and has an important effect on meta-analysis results
Contributors: All authors had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. BT and EN contributed to the study equally. BT, EN, and PCG contributed to the study concept and design. BT and EN contributed to the acquisition of data and drafted the manuscript. JPTH, EN, and BT contributed to the analysis and interpretation of data. All the authors critically reviewed the manuscript for publication. PCG provided administrative, technical, and material support, and was the study supervisor and guarantor.
Funding: This study was part of a PhD (BT) funded by IMK Charitable Fund. The funding source had no role in the design and conduct of the study; data collection, management, analysis, and interpretation; preparation, review, and approval of the manuscript; or the decision to submit the paper for publication.
Competing interests: All authors have completed the ICMJE uniform disclosure form at http://www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: this study is part of a PhD funded by IMK Charitable Fund; no financial relationships with any organisations that might have an interest in the submitted work in the previous 3 years; no other relationships or activities that could appear to have influenced the submitted work.
Data sharing: No additional data available.
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/2.0/ and http://creativecommons.org/licenses/by-nc/2.0/legalcode.