Published 13 August 2009, doi:10.1136/bmj.b3128
Cite this as: BMJ 2009;339:b3128
Britta Tendal, PhD student1, Julian P T Higgins, senior statistician4, Peter Jüni, head of division2,3, Asbjørn Hróbjartsson, senior researcher1, Sven Trelle, associate director2,3, Eveline Nüesch, PhD student2,3, Simon Wandel, PhD student2,3, Anders W Jørgensen, PhD student1, Katarina Gesser, PhD student5, Søren Ilsøe-Kristensen, PhD student5, Peter C Gøtzsche, director1
1 Nordic Cochrane Centre, Rigshospitalet, Dept 3343, Blegdamsvej 9, DK-2100 Copenhagen, Denmark, 2 Institute of Social and Preventive Medicine, University of Bern, Switzerland, 3 CTU Bern, Bern University Hospital, Switzerland, 4 MRC Biostatistics Unit, Institute of Public Health, University of Cambridge, Cambridge, 5 Faculty of Pharmaceutical Sciences, University of Copenhagen, Denmark
Correspondence to: B Tendal bt{at}cochrane.dk
Design Observer agreement study.
Data sources A random sample of 10 Cochrane reviews that presented a result as a standardised mean difference (SMD), the protocols for the reviews and the trial reports (n=45) were retrieved.
Data extraction Five experienced methodologists and five PhD students independently extracted data from the trial reports for calculation of the first SMD result in each review. The observers did not have access to the reviews but to the protocols, where the relevant outcome was highlighted. The agreement was analysed at both trial and meta-analysis level, pairing the observers in all possible ways (45 pairs, yielding 2025 pairs of trials and 450 pairs of meta-analyses). Agreement was defined as SMDs that differed less than 0.1 in their point estimates or confidence intervals.
Results The agreement was 53% at trial level and 31% at meta-analysis level. Including all pairs, the median disagreement was SMD=0.22 (interquartile range 0.07-0.61). The experts agreed somewhat more than the PhD students at trial level (61% v 46%), but not at meta-analysis level. Important reasons for disagreement were differences in selection of time points, scales, control groups, and type of calculations; whether to include a trial in the meta-analysis; and data extraction errors made by the observers. In 14 out of the 100 SMDs calculated at the meta-analysis level, individual observers reached different conclusions than the originally published review.
Conclusions Disagreements were common and often larger than the effect of commonly used treatments. Meta-analyses using SMDs are prone to observer variation and should be interpreted with caution. The reliability of meta-analyses might be improved by having more detailed review protocols, more than one observer, and statistical expertise.
There is often a multiplicity of data in trial reports that makes it difficult to decide which ones to use in a meta-analysis. Furthermore, data are often incompletely reported,2 3 which makes it necessary to perform calculations or impute missing data, such as missing standard deviations. Different observers may get different results, but previous studies on observer variation have not been informative, because of few observers, few trials, or few data.4 5 We report here a detailed study of observer variation that explores the sources of disagreement when extracting data for calculation of standardised mean differences.
We included reviews that reported at least one result as a standardised mean difference (SMD). The SMD is used when trial authors have used different scales for measuring the same underlying outcome—for example, pain can be measured on a visual analogue scale or on a 10-point numeric rating scale. In such cases, it is necessary to standardise the measurements on a uniform scale before they can be pooled in a meta-analysis. This is typically achieved by calculating the SMD for each trial, which is the difference in means between the two groups, divided by the pooled standard deviation of the measurements.1 By this transformation, the outcome becomes dimensionless and the scales become comparable, as the results are expressed in standard deviation units.
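The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: the function names are hypothetical, no small-sample correction is applied, and the standard error uses a common large-sample approximation that is an assumption rather than something stated in the text.

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def smd(mean_exp, sd_exp, n_exp, mean_ctl, sd_ctl, n_ctl):
    """Return (SMD, approximate standard error) for one trial.

    SMD = difference in means / pooled SD, as described in the text.
    The standard error formula is an assumed large-sample approximation.
    """
    d = (mean_exp - mean_ctl) / pooled_sd(sd_exp, n_exp, sd_ctl, n_ctl)
    n_total = n_exp + n_ctl
    se = math.sqrt(n_total / (n_exp * n_ctl) + d**2 / (2 * n_total))
    return d, se

# Hypothetical trial: pain on a 0-100 scale, 50 patients per group.
d, se = smd(30.0, 20.0, 50, 40.0, 20.0, 50)
print(f"SMD = {d:.2f}, SE = {se:.3f}")
```

Because the result is expressed in standard deviation units, the same function applies whichever scale the trial used, as long as both groups were measured on that scale.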
The first SMD result in each review that was not based on a subgroup result was selected as our index result. The index result had to be based on two to 10 trials and on published data only (that is, there was no indication that the review authors had received additional outcome data from the trial authors).
Five methodologists with substantial experience in meta-analysis and five PhD students independently extracted the necessary data from the trial reports for calculation of the SMDs. The observers had access to the review protocols but not to the completed Cochrane reviews and the SMD results. An additional researcher (BT) highlighted the relevant outcome in the protocols, along with other important issues such as pre-specified time points of interest, which intervention was the experimental one, and which was the control. If information was missing regarding any of these issues, the observers decided by themselves what to select from the trial reports. The observers received the review protocols, trial reports, and a copy of the Cochrane Handbook for Systematic Reviews6 as PDF files.
The data extraction was performed during one week when the 10 observers worked independently at the same location in separate rooms. The observers were not allowed to discuss the data extraction. If the data were available, the observers extracted means, standard deviations, and number of patients for each group; otherwise, they could calculate or impute the missing data, such as from an exact P value. The observers also interpreted the sign of the SMD results—that is, whether a negative or a positive result indicated superiority of the experimental intervention. If the observers were uncertain, the additional researcher retrieved the paper that originally described the scale, and the direction of the scale was based on this information. All calculations were documented, and the observers provided information about any choices they made regarding multiple outcomes, time points, and data sources in the trial reports. During the week of data extraction the issue of whether the observers could exclude trials emerged, as there were instances where the observers were unable to locate any relevant data in the trial reports or felt that the trial did not meet the inclusion criteria in the Cochrane protocol. It was decided that observers could exclude trials, and the reasons for exclusion were documented.
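As an illustration of how a missing standard deviation might be derived from an exact P value, the sketch below back-calculates a common standard deviation from a two-sided P value for a difference in means. It assumes a large-sample normal approximation in place of the t distribution and assumes both groups share the same standard deviation; the function name is hypothetical, and the observers in the study may have used other approaches.

```python
import math
from statistics import NormalDist

def sd_from_p_value(mean_diff, p_two_sided, n1, n2):
    """Approximate common SD implied by a two-sided P value.

    Assumption: normal approximation to the t distribution, so the
    result is only reasonable for moderately large groups.
    """
    if not 0 < p_two_sided < 1:
        raise ValueError("P value must lie strictly between 0 and 1")
    if mean_diff == 0:
        raise ValueError("cannot recover an SD when the mean difference is 0")
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)  # test statistic implied by P
    se = abs(mean_diff) / z                        # standard error of the difference
    return se / math.sqrt(1 / n1 + 1 / n2)         # common SD in each group

# Hypothetical report: mean difference of 10 points, P=0.05, 50 patients per group.
print(round(sd_from_p_value(10.0, 0.05, 50, 50), 1))
```

A P value reported only as a threshold (for example, P<0.05) gives at best a bound on the standard deviation, which is one way that different observers can arrive at different values from the same report.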
Based on the extracted data, the additional researcher calculated trial and meta-analysis SMDs for each observer using Comprehensive Meta-Analysis Version 2. To allow comparison with the originally published meta-analyses, the same method (random effects or fixed effect model) was used as that in the published meta-analysis. In cases where the observers had extracted two sets of data from the same trial—for example, because there were two control groups—the data were combined so that only a single SMD resulted from each trial.1
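For readers unfamiliar with the two models, the following sketch shows one conventional way of pooling trial SMDs: inverse-variance weighting for the fixed effect model and the DerSimonian-Laird estimator for the random effects model. The choice of these estimators is an assumption made for illustration; the text states only that Comprehensive Meta-Analysis was used, not which estimators it applied. The function names and data are hypothetical.

```python
import math

def pool_fixed(smds, ses):
    """Inverse-variance fixed effect pooled estimate and its standard error."""
    weights = [1 / se**2 for se in ses]
    pooled = sum(w * d for w, d in zip(weights, smds)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))

def pool_random(smds, ses):
    """Random effects pooled estimate (DerSimonian-Laird, assumed here)."""
    w = [1 / se**2 for se in ses]
    fixed, _ = pool_fixed(smds, ses)
    q = sum(wi * (d - fixed) ** 2 for wi, d in zip(w, smds))  # heterogeneity statistic
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    # Between-trial variance, truncated at zero; undefined for a single trial.
    tau2 = max(0.0, (q - (len(smds) - 1)) / c) if c > 0 else 0.0
    w_star = [1 / (se**2 + tau2) for se in ses]
    pooled = sum(wi * d for wi, d in zip(w_star, smds)) / sum(w_star)
    return pooled, math.sqrt(1 / sum(w_star))

# Hypothetical SMDs and standard errors from two trials.
print(pool_fixed([-0.2, -0.4], [0.1, 0.1]))
print(pool_random([-0.2, -0.4], [0.1, 0.1]))
```

With equally precise trials the two models give the same point estimate, but the random effects model yields a wider confidence interval when the trial results differ more than chance would suggest.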
Agreement between pairs of observers was assessed at both meta-analysis and trial level, pairing the 10 observers in all possible ways (45 pairs). This provides an indication of the likely agreement that might be expected in practice, since two independent observers are recommended when extracting data from papers for a systematic review.1 2 5 6 Agreement was defined as SMDs that differed less than 0.1 in their point estimates and in their confidence intervals. The cut point of 0.1 was chosen because many commonly used treatments have an effect of 0.1 to 0.5 compared with placebo2; furthermore, an error of 0.1 can be important when two active treatments have been compared, for there is usually little difference between active treatments. Confidence intervals for the agreement rates were not calculated, as the data from the pairings were not independent.
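The pairing and the agreement rule can be illustrated as follows. This is a sketch under one stated assumption: that "differed less than 0.1 in their confidence intervals" means that both the lower and the upper confidence limits differ by less than 0.1. The observer labels and SMD values are hypothetical.

```python
from itertools import combinations

def agrees(a, b, cut=0.1):
    """a and b are (smd, ci_low, ci_high) tuples from two observers.

    Assumption: agreement requires the point estimate and both
    confidence limits to differ by less than the cut point.
    """
    return all(abs(x - y) < cut for x, y in zip(a, b))

def pairwise_agreement(results):
    """results maps observer -> (smd, ci_low, ci_high) for one trial or meta-analysis."""
    pairs = list(combinations(sorted(results), 2))
    n_agree = sum(agrees(results[a], results[b]) for a, b in pairs)
    return n_agree, len(pairs)

# Hypothetical data: five observers report -0.30 and five report -0.50.
results = {}
for i in range(10):
    point = -0.30 if i < 5 else -0.50
    results[f"observer{i}"] = (point, point - 0.2, point + 0.2)

n_agree, n_pairs = pairwise_agreement(results)
print(f"{n_agree} of {n_pairs} pairs agree")
```

Ten observers give 45 pairs; applied to 45 trials and 10 meta-analyses, this yields the 2025 trial-level and 450 meta-analysis-level comparisons analysed in the study.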
To determine the variation in meta-analysis results that could be obtained from the multiplicity of different SMD estimates across observers, we conducted a Monte Carlo simulation for each meta-analysis. In each iteration of the simulation, we randomly sampled one observer for each trial and entered his or her SMD (and standard error) for that trial into a meta-analysis. Thus each sampled meta-analysis contained SMD estimates from different observers. If the sampled observer excluded the trial from his or her meta-analysis, the simulated meta-analysis also excluded that trial. We examined the distribution of meta-analytic SMD estimates across 10 000 simulations.
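The simulation described above can be sketched as follows. This is an illustrative outline with hypothetical data, not the code used in the study: for brevity it pools with a fixed effect inverse-variance model only, whereas the study used the model of the published meta-analysis, and the function names and seed are assumptions.

```python
import random

def pool_fixed(estimates):
    """Inverse-variance fixed effect pooled SMD from (smd, se) tuples."""
    weights = [1 / se**2 for _, se in estimates]
    return sum(w * d for w, (d, _) in zip(weights, estimates)) / sum(weights)

def simulate(extractions, n_iter=10_000, seed=0):
    """extractions: one list per trial, holding each observer's (smd, se),
    or None where that observer excluded the trial."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(n_iter):
        # Sample one observer per trial; drop trials the sampled observer excluded.
        sampled = [rng.choice(trial) for trial in extractions]
        included = [e for e in sampled if e is not None]
        if included:  # skip iterations in which every trial was excluded
            pooled.append(pool_fixed(included))
    return pooled

# Hypothetical meta-analysis of two trials with three observers each.
extractions = [
    [(-0.50, 0.10), (-0.30, 0.10), (-0.45, 0.12)],
    [(-0.20, 0.20), None, (-0.60, 0.20)],
]
estimates = sorted(simulate(extractions))
print(estimates[len(estimates) // 2])  # median of the simulated pooled SMDs
```

The spread of the simulated estimates shows how much a meta-analysis result could vary depending on which observer happened to extract each trial.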
Agreement at trial level
In table 2 the different levels of agreement are shown. Across trials, the agreement was 53% for the 2025 pairs (61% for the 450 pairs of methodologists, 46% for the 450 pairs of PhD students, and 52% for the 1125 mixed pairs). The agreement rates for the individual trials ranged from 4% to 100%. Agreement between all observers was found for four of the 45 trials.
Agreement at meta-analysis level
Across the meta-analyses, the agreement was 31% for the 450 pairs (33% for the 100 pairs of methodologists, 27% for the 100 pairs of PhD students, and 31% for the 250 mixed pairs) (table 2). The agreement rates for the individual meta-analyses ranged from 11% to 80% (table 4). Agreement between all observers was not found for any of the 10 meta-analyses.
1). The last 18 pairs (4%) were not quantifiable since one observer excluded all the trials from two meta-analyses. The median disagreement was SMD=0.22 for the 432 quantifiable pairs, with an interquartile range from 0.07 to 0.61. There were no differences between the methodologists and the PhD students (table 2).
The disagreement depended on the reporting of data in the trial reports and on how much room was left for decision in the review protocols. One of the reviews exemplified the variation arising from a high degree of multiplicity in the trial reports combined with a review protocol leaving much room for choice.11 In the review protocol, the time point was described as "long term (more than 26 weeks)," but in the two trials included in the meta-analysis there were several options. For one trial,19 there were two: end of treatment (which lasted 9 months) or three month follow-up. For the other,20 21 22 there were three: 6, 12, and 18 month follow-up (treatment lasted 3 weeks). The observers used all the different time points, and all had a plausible reason for their choice: concordance with the time point used in the other trial, the maximum period of observation, or the lowest dropout of patients.
Strengths and weaknesses
The primary strength of our study is that we took a broad approach and showed that there are other important sources of variation in meta-analysis results than simple errors. Furthermore, we included a considerable number of experienced as well as inexperienced observers and a large number of trials to elucidate the sources of variation and their magnitude. Finally, the study setup ensured independent observations according to the blueprint laid out in the review protocols and likely mirrored the independent data extraction that ideally should happen in practice.
The experimental setting also had limitations. Single data extraction produces more errors than double data extraction.5 In real life, some of the errors we made would therefore probably have been detected before the data were used for meta-analyses, as it is recommended for Cochrane reviews that there should be at least two independent observers and that any disagreement should be resolved by discussion and, if necessary, arbitration by a third person.1 We did not perform a consensus step, as the purpose of our study was to explore how much variation would occur when data extraction was performed by different observers. However, given the amount of multiplicity in the trial reports and the uncertainties in the protocols, it is likely that even pairs of observers would disagree considerably with other pairs.
Other limitations were that the observers were under time pressure, although only one person needed more time, as he fell ill during the assigned week. The observers were presented with protocols they had not developed themselves, based on research questions they had not asked, and in disease areas where they were mostly not experts. Another limitation is that, even though one of the exclusion criteria was that the authors of the Cochrane review had not obtained unpublished data from the trial authors, it became apparent during data extraction that some of the trial reports did not contain the data needed for the calculation of an SMD. It would therefore have been helpful to contact trial authors.
Other similar research
The SMD is intended to give clinicians and policymakers the most reliable summary of the available trial evidence when the outcomes have been measured on different continuous or numeric rating scales. Surprisingly, the method has not previously been examined in any detail for its own reliability. Previous research has been sparse and has focused on errors in data extraction.2 4 5 In one study, the authors found errors in 20 of 34 Cochrane reviews, but, as they gave no numerical data, it is not possible to judge how often these were important.4 In a previous study of 27 meta-analyses, of which 16 were Cochrane reviews,2 we could not replicate the SMD result for at least one of the two trials we selected for checking from each meta-analysis within our cut point of 0.1 in 10 of the meta-analyses. When we tried to replicate these 10 meta-analyses, including all the trials, we found that seven of them were erroneous; one was subsequently retracted, and in two a significant difference disappeared or appeared.2 The present study adds to the previous research by also highlighting the importance of different choices when selecting outcomes for meta-analysis. The results of our study apply more broadly than to meta-analyses using the SMD, as many of the reasons for disagreement were not related to the SMD method but would be important also when analysing data using the weighted mean difference method, which is the method of choice when the outcome data have been measured on the same scale.
Conclusions
Disagreements were common and often larger than the effect of commonly used treatments. Meta-analyses using SMDs are prone to observer variation and should be interpreted with caution. The reliability of meta-analyses might be improved by having more detailed review protocols, more than one observer, and statistical expertise.
Review protocols should be more detailed and made permanently available, also after the review is published, to allow other researchers to check that the review was done according to the protocol. In February 2008, the Cochrane Collaboration updated its guidelines and recommended that researchers in their protocols list possible ways of measuring the outcomes—such as using different scales or time points—and specify which ones to use. Our study provides strong support for such precautions. Reports of meta-analyses should also follow published guidelines1 23 to allow for sufficient critical appraisal. Finally the reporting of trials needs to be improved, according to the recommendations in the CONSORT statement,24 reducing the need for calculations and imputation of missing data.
Funding: This study is part of a PhD funded by IMK Charitable Fund and the Nordic Cochrane Centre. The sponsors had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; and preparation, review, or approval of the manuscript. The researchers were independent from the funders.
Competing interests: None declared.
Ethical approval: Not required
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/2.0/ and http://creativecommons.org/licenses/by-nc/2.0/legalcode.