Reporting and interpretation of SF-36 outcomes in randomised trials: systematic review
BMJ 2009; 338 doi: https://doi.org/10.1136/bmj.a3006 (Published 12 January 2009) Cite this as: BMJ 2009;338:a3006
- Despina G Contopoulos-Ioannidis, assistant professor 1,2
- Anastasia Karvouni, research fellow 3
- Ioanna Kouri, research fellow 3
- John P A Ioannidis, professor 3,4
- 1 Department of Paediatrics, University of Ioannina School of Medicine, Ioannina, Greece
- 2 Department of Paediatrics, George Washington University, School of Medicine and Health Sciences, Washington, DC, USA
- 3 Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Greece
- 4 Institute for Clinical Research and Health Policy Studies, Tufts University School of Medicine, Boston, MA, USA
- Correspondence to: J P A Ioannidis, Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina 45110, Greece
- Accepted 19 September 2008
Objective To determine how often health surveys and quality of life evaluations reach different conclusions from those of primary efficacy outcomes and whether discordant results make a difference in the interpretation of trial findings.
Design Systematic review.
Data sources PubMed, contact with authors for missing information, and author survey for unpublished SF-36 data.
Study selection Randomised trials with SF-36 outcomes (the most extensively validated and used health survey instrument for appraising quality of life) that were published in 2005 in 22 journals with a high impact factor.
Data extraction Analyses on the two composite and eight subdomain SF-36 scores that corresponded to the time and mode of analysis of the primary efficacy outcome.
Results Of 1057 screened trials, 52 were identified as randomised trials with SF-36 results (66 separate comparisons). Only eight trials reported all 10 SF-36 scores in the published articles. For 21 of the 66 comparisons, SF-36 results were discordant for statistical significance compared with the results for primary efficacy outcomes. Of 17 statistically significant SF-36 scores where primary outcomes were not also statistically significant in the same direction, the magnitude of effect was small in six, moderate in six, large in three, and not reported in two. Authors modified the interpretation of study findings based on SF-36 results in only two of the 21 discordant cases. Among 100 additional randomly selected trials not reporting any SF-36 information, at least five had collected SF-36 data but only one had analysed it.
Conclusions SF-36 measurements sometimes produce different results from those of the primary efficacy outcomes but rarely modify the overall interpretation of randomised trials. Quality of life and health related survey information should be utilised more systematically in randomised trials.
Quality of life outcomes and surveys of overall health status are considered useful to incorporate in randomised trials.1 2 3 4 5 6 7 Such data would be important to collect and report systematically, regardless of whether the results agree with the primary outcomes or not. It is unknown whether it is common for quality of life and health survey results to reach different conclusions from those of the primary efficacy outcomes, whether there is selective reporting of outcomes, or whether discordant results in these outcomes modify the conclusions of these trials. We therefore evaluated recently published trials (2005) in 22 leading journals.
Many generic and disease specific quality of life and health survey measures exist.2 8 Some are difficult to compare or are prone to methodological shortcomings and suboptimal validation.1 2 9 10 11 To maximise comparability across trials covering diverse diseases and interventions we focused on trials using the short form-36 (SF-36). Originally developed as a multipurpose health survey instrument, SF-36 has been translated for use in more than 50 countries as part of the international quality of life assessment project and has become the most extensively validated and used generic instrument for measuring quality of life. It has extensive applications in population health surveys, comparisons of the relative burden of diseases, and differentiation of health benefits across groups produced by diverse interventions.11 12 13 14
We considered randomised trials with data on SF-36 published in 2005 in five major general medicine journals (New England Journal of Medicine, JAMA, Lancet, BMJ, PLoS Medicine) and in the 17 specialty journals that had the highest impact factor in the 2005 Journal Citation Reports among those publishing research from clinical trials (Circulation, Journal of the American College of Cardiology, Gastroenterology, Hepatology, Journal of the National Cancer Institute, Journal of Clinical Oncology, Blood, Annals of Internal Medicine, Diabetes, Diabetes Care, Brain, Annals of Neurology, American Journal of Respiratory and Critical Care Medicine, Journal of the American Society of Nephrology, Arthritis and Rheumatism, American Journal of Psychiatry, Archives of General Psychiatry).
Randomised trials were eligible if they reported any of the two composite SF-36 scores (physical, mental) or the eight subdomain scores (physical functioning, role physical, bodily pain, general health, vitality, social functioning, role emotional, mental health). When an article referred to separate publications reporting primary efficacy or SF-36 outcomes, we retrieved these as well. We also considered trials using SF-12, a shorter version of SF-36 (for composite scores). No restriction was set on disease or on the interventions compared. Whenever information was not reported on all 10 scores, we asked authors for the missing information.
We searched the 22 target journals through PubMed using limits for randomised clinical trial (type of study) and 2005 (year of publication). Identified articles were downloaded in PDF format and screened electronically with the Acrobat Reader “Find” tool for the following keywords: quality of life, SF36, SF 36, SF-36, short form 36, short form-36, SF-12, SF12, mental composite score, physical composite score, medical outcome study, MOS 36, MOS-36, and Ware. Articles passing electronic screening were further evaluated by two independent investigators (AK and IK). Disagreements were resolved by consensus, and any that remained were resolved by DGC-I.
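The electronic keyword screen can be illustrated with a short sketch. This is not the tool used in the review (the screening was done with the Acrobat Reader “Find” tool); it assumes the article text has already been extracted to a plain string, and the function names `matched_keywords` and `passes_screen` are hypothetical.

```python
import re

# Keywords listed in the search strategy; matching is case-insensitive.
KEYWORDS = [
    "quality of life", "SF36", "SF 36", "SF-36", "short form 36",
    "short form-36", "SF-12", "SF12", "mental composite score",
    "physical composite score", "medical outcome study", "MOS 36",
    "MOS-36", "Ware",
]


def matched_keywords(text):
    """Return the screening keywords that occur in the text (hypothetical helper)."""
    # Collapse runs of whitespace so phrases broken across lines still match.
    normalised = re.sub(r"\s+", " ", text).lower()
    return [kw for kw in KEYWORDS if kw.lower() in normalised]


def passes_screen(text):
    """An article passes the electronic screen if any keyword is found."""
    return bool(matched_keywords(text))
```

Plain substring matching of this kind is deliberately over-inclusive (for example, "Ware" also matches "software"), which is one reason a manual evaluation step follows the electronic screen.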
To probe whether SF-36 data may have remained unpublished we communicated (three emails, each sent three weeks apart) with the corresponding authors of 100 trials randomly selected among those not reporting SF-36 data. Selection was based on a list of 100 numbers generated randomly and applied to the 1057 retrieved articles, ordered serially per journal, after excluding the 52 eligible articles.
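The random selection described above could be reproduced along the following lines. This is a minimal sketch under stated assumptions: the serial positions of the 52 eligible articles are not given in the text, so `eligible_positions` is a placeholder input, and the function name and seed are hypothetical.

```python
import random


def select_survey_sample(n_retrieved=1057, eligible_positions=(), n_sample=100, seed=2005):
    """Randomly pick n_sample serial positions among the retrieved articles,
    after excluding the positions of the eligible (SF-36 reporting) articles."""
    excluded = set(eligible_positions)
    candidates = [i for i in range(1, n_retrieved + 1) if i not in excluded]
    rng = random.Random(seed)  # fixed seed (an assumption) so the draw is reproducible
    return sorted(rng.sample(candidates, n_sample))


# Placeholder positions for the 52 eligible articles (illustrative only).
sample = select_survey_sample(eligible_positions=range(1, 53))
```

Sampling without replacement from the 1005 remaining positions ensures that no trial is contacted twice.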
Data were extracted by three independent investigators (IK, AK, and DGC-I). Discrepancies were resolved by consensus. Remaining disagreements were resolved by JPAI.
From each eligible article we extracted information on authors, journal, design (superiority or non-inferiority), condition, interventions compared, sample size (randomised, analysed for SF-36), definition of primary efficacy outcome (as reported; if not clarified, we selected the outcome used for sample size calculations), time points and statistical analysis for the primary outcome and SF-36 assessments, whether SF-36 was a co-primary outcome, and whether any other quality of life and health related survey scales were used. We also recorded which SF-36 scores were reported and for which we could obtain missing information from authors.
For the primary efficacy outcome and for each of the presented SF-36 assessments we recorded whether the difference between compared arms was statistically significant (P<0.05) favouring the experimental arm, non-statistically significant, or statistically significant favouring the control arm. For trials with more than two arms we considered the comparison of each experimental intervention against control separately. We considered all comparisons and also present results separately for superiority and non-inferiority trials.15
Data on SF-36 outcomes were extracted for the reported analyses that corresponded as closely as possible to the same time points as for primary outcome data. Specifically, when measurements for primary or SF-36 outcomes were carried out at several time points, for primary efficacy outcomes we preferred analyses accounting for multiple measurements (for example, repeated measurement analysis) over analyses of single time points. If the primary outcome was a time to event analysis or incorporated serial longitudinal measurements, we preferred the analysis of serial longitudinal SF-36 measurements; if this was unavailable, we recorded whether there was a formal statistically significant difference at any time point when SF-36 had been appraised. When the primary outcome was appraised at a single time point, we recorded the SF-36 outcomes at the same (or closest) time point. In two comparisons where co-primary outcomes existed and could not be prioritised, we based the evaluation of statistical significance on the authors’ overall interpretation.
We considered SF-36 results as statistically significant when at least one of the composite or subdomain scores showed a statistically significant result in favour of or against the experimental intervention. There were no situations where some of the specific SF-36 scores were significant for the experimental intervention and others were significant against.
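The classification rules above can be written out as a small sketch. The three categories and the "at least one significant score" rule come from the text; the function names, and the choice to treat any difference in category between the primary outcome and SF-36 as discordance, are assumptions made for illustration.

```python
FAVOURS_EXPERIMENTAL = "favours experimental"
NOT_SIGNIFICANT = "not significant"
FAVOURS_CONTROL = "favours control"


def classify(p_value, favours_experimental, alpha=0.05):
    """Classify one between-arm difference using the P<0.05 threshold."""
    if p_value >= alpha:
        return NOT_SIGNIFICANT
    return FAVOURS_EXPERIMENTAL if favours_experimental else FAVOURS_CONTROL


def classify_sf36(score_results):
    """Combine the categories of the reported SF-36 scores: the overall result is
    significant when at least one composite or subdomain score is significant."""
    significant = {r for r in score_results if r != NOT_SIGNIFICANT}
    if not significant:
        return NOT_SIGNIFICANT
    if len(significant) > 1:
        # Did not occur in the review; flagged here rather than resolved silently.
        raise ValueError("SF-36 scores significant in both directions")
    return significant.pop()


def is_discordant(primary, sf36):
    """Assumption: discordance means the two categories differ."""
    return primary != sf36
```

For example, a comparison with a non-significant primary outcome and a single significant SF-36 vitality score favouring the experimental arm would be classed as discordant under this sketch.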
For statistically significant SF-36 effects when the respective primary efficacy outcome was discordant, we extracted information on the effect size of SF-36. Roughly, standardised mean differences of less than 0.30 standard deviations are small effects, 0.30-0.80 are moderate, and more than 0.80 are large.16 17 18 19 20 The corresponding cut-offs for raw scores are less than 4, 4-10, and more than 10 points.
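The rough cut-offs above translate into a simple lookup. This is a sketch only: the function name `classify_effect` and the `scale` labels are hypothetical, and the boundary values (0.30 and 0.80, or 4 and 10 points) are assigned to "moderate", following the ranges as stated.

```python
def classify_effect(value, scale="smd"):
    """Label an SF-36 effect as small, moderate, or large.

    scale="smd": standardised mean difference, in standard deviations.
    scale="raw": difference in raw SF-36 points.
    """
    cutoffs = {"smd": (0.30, 0.80), "raw": (4.0, 10.0)}
    low, high = cutoffs[scale]
    size = abs(value)  # the direction of the effect does not affect its magnitude
    if size < low:
        return "small"
    if size <= high:
        return "moderate"
    return "large"
```

Taking the absolute value means the same labels apply whether the difference favours the experimental or the control arm.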
For comparisons with discordant statistical significance on SF-36 and primary outcome results, we recorded whether the authors had discussed the SF-36 results at all, whether they commented on the discrepancy and if so with what arguments, and if SF-36 findings changed the interpretation of the trial results.
Overall 1057 trials were screened and 52 eligible trials identified w1-w52 with 66 eligible comparisons (figure and web extra table). Additional data were presented in other published articles on primary efficacy for one trial w43 and SF-36 for eight trials.w4 w21 w24 w29 w35 w36 w46 w51 Additional SF-36 data were provided directly by the authors in 11 trials with 13 comparisons (see web extra fig 1). Forty two trials (56 comparisons) addressed superiority, and 10 (10 comparisons) non-inferiority. In seven trials (10 comparisons) w2 w8 w35 w39 w40 w44 w45 SF-36 was described as a co-primary outcome. Additional quality of life or health survey instruments appeared in 16 trials (16 comparisons).
Eventually, data for the physical composite score and the mental composite score were available for 34 trials (39 comparisons) and 35 trials (40 comparisons), respectively (see web extra fig 1). Data on at least one of the eight subdomain scores were available for 36 trials (48 comparisons). Data on all possible SF-36 scores were available for 18 trials (eight published, 10 obtained from authors). Six trials w6 w23 w29 w31 w35 w44 had collected information a priori only for specific subdomains.
Concordance of results
Of the 66 comparisons, 21 (32%) had discordant statistical significance for primary efficacy and SF-36 results (table 1). Moreover, of the 56 comparisons of superiority trials 19 had discordant primary efficacy and SF-36 results (see web extra fig 2).
In one of the 21 discrepancies, SF-36 was a co-primary outcome.w44 In seven discrepancies, additional quality of life or health survey instruments were also used. In two trials w14 w51 the additional instruments agreed with SF-36, and in five w12 w15 w31 w44 w47 they agreed with the primary efficacy outcome.
In the 13 discordant comparisons with only SF-36 significant results (nine comparisons in favour and four against the experimental intervention; in seven trials w7 w14 w21 w31 w44 w46 w47 and three trials, w15 w43 w51 respectively) there were 17 statistically significant specific scores (five normalised, 10 raw, two reporting only statistical significance without effect size); effect sizes were small in six, moderate in six, and large in three.
Interpretation of trial findings in discordant settings
Improved primary outcome only—SF-36 results did not modify the trial’s interpretation of these 11 comparisons (eight trials, table 2).w4 w12 w16 w18 w41 w42 w43 w51 In five comparisons (four trials), SF-36 outcomes were only tabulated or alluded to in the results, without further discussion.w12 w16 w18 w42 In the other four trials the authors focused on other non-primary outcomes,w4 claimed that SF-36 was not sensitive enough to detect improvements,w41 adopted a non-intention to treat analysis for SF-36 with significant results,w43 or dismissed the importance of the negative effects on SF-36 in the face of benefits in disease-free survival.w51
Improved SF-36 only—SF-36 modified the interpretation of only two trials.w31 w44 The authors favoured the peer modelling videotape for breast cancer based on the significant and large improvement on SF-36 vitality despite no improvement on the IES-R (revised impact of events scale) (both were co-primary outcomes, table 2).w44 In the chronic renal failure anaemia trial the benefit in vitality score from erythropoietin was acknowledged as clinically important.w31 In the other five comparisons (three trials), benefits on SF-36 did not change the interpretation.w7 w14 w21 One trial dismissed the SF-36 difference as transient and weak,w14 one trial considered the non-statistically significant benefits in efficacy as clinically important, whereas the significant improvements in SF-36 vitality scores were considered clinically unimportant and the authors then even questioned the use of SF-36 in trials on diabetes,w21 and in another trial the authors considered that the clinical significance of statistically significant differences in SF-36 domains in patients with fibromyalgia could not be evaluated.w7
Improved SF-36, non-inferiority on primary outcome—SF-36 did not modify the interpretation of these two trials.w46 w47 Both trials already concluded favourably for the experimental intervention that achieved the desired non-inferiority, and in one of them w47 the observed benefit in SF-36 was considered possibly due to chance.
Only SF-36 worsened—In one trial w15 where SF-36 worsened with the experimental intervention, the investigators interpreted the results as showing no consistent differences in quality of life, because an additional instrument (EQ5D) showed no significant differences.
Probing unpublished data
Authors of 69 of the 100 additional randomly selected trials responded. SF-36 data had actually been collected in five trials. The data had been analysed for only one trial and did not show any statistically significant differences for SF-36 or the primary efficacy outcome.
In one third of the trial comparisons in our empirical evaluation, differential effects on primary efficacy outcomes compared with SF-36 were identified. However, when SF-36 and efficacy outcomes reached discordant conclusions, SF-36 rarely affected the interpretation of these trials. What we observed was generally a tendency to belittle rather than to emphasise discordant results. Several trials did not discuss the SF-36 findings at all, and most did not report all the tested SF-36 scores. Considering post hoc an instrument as insensitive or not worth reporting contradicts the initial choice to use this instrument as a trial outcome.
In most trials for chronic conditions, quality of life and surveys of health status are useful to consider. SF-36 was reported in fewer than 5% of the trials we screened, and our author survey suggested that some additional trials (at least five of 100) had collected information on SF-36 but without analysing or publishing it two or three years after the publication of the main trial results. Quality of life seems to remain undervalued in clinical research: few trials collect quality of life related data, fewer report on them, data are only partially presented, and quality of life rarely affects the trial interpretation.
We should acknowledge some caveats. Firstly, by selecting high impact journals we identified trials with high visibility and probably also high quality.21 It is unlikely that this strategy would have selected for discordant results between outcomes. Secondly, selective analysis and reporting bias may affect primary outcomes and not just SF-36,22 23 24 25 but this should not have increased the perceived rate of discrepancies between outcomes. Thirdly, discordance at the level of statistical significance does not necessarily mean that results for different outcomes differ beyond chance. Among statistically significant results, chance findings and non-clinically important differences are possible, and primary outcomes should be given more weight in the discussion than secondary outcomes. Given that trials are typically powered to address the primary outcome, a significant result in the primary outcome with a non-significant result in quality of life or health survey assessments may sometimes simply reflect lack of power for the quality of life or health survey outcome. Therefore we also examined the SF-36 effect sizes and the circumstances and discussion of discordant results. Fourthly, we did not carry out the same in-depth evaluation for trials where efficacy and SF-36 outcomes were concordant. It is unlikely that authors would then have modified their inferences, but SF-36 may have strengthened the conclusions. Finally, we did not examine trials using only other quality of life or health survey instruments beyond SF-36. However, SF-36 is the most robustly standardised and widely used one, and we wanted to maximise comparability. Although other scales may also be used, one study found that only 4.2% of trials reported any quality of life outcome.2
Although quality of life and health survey scales have been used in clinical trials for over 25 years, several issues remain debated.26 Besides problems of fragmented, selectively reported information, it is sometimes impossible to say whether and which analyses are based on a priori analytical plans.11 27 Proper attention to the importance of these outcomes should be given in clinical trials. Otherwise, with the growth in administrative paperwork for clinical trials,28 outcomes such as SF-36 may become routine compulsory assessments without a genuine interest in learning from them.
Overall, quality of life and health survey assessments provide a different window into patient outcomes and deserve to be included in more trials, with complete reporting of results and standardised interpretation. Unbiased data on these outcomes may enhance our ability to improve clinical decision making.
What is already known on this topic
Quality of life and related health survey outcomes could be essential in deciding whether an intervention is worth adopting
It is unknown whether such outcomes reach different conclusions from those of primary efficacy outcomes or whether they affect the interpretation of current clinical trials
What this study adds
Several randomised trials published in influential journals have had discordant results on primary efficacy outcomes compared with SF-36
When SF-36 and efficacy outcomes reached discordant conclusions, SF-36 rarely modified the interpretation of these trials
Quality of life and health related survey information deserves more standardised and systematic use in randomised trials
Cite this as: BMJ 2009;338:a3006
We thank the following for clarifying their results or providing additional data: N Assefi, D Buchwald, and C Jacobsen (for Assefi et al w1); A Avenell and JA Cook (for Avenell et al w2); J Carratala, LJ Crofford, J Pepper, and B Lees (for De Arenaza et al w10); H Escobar-Morreale, RM Greenhalgh, and LC Brown (for EVAR 1 w14 and EVAR 2 w15); I Gilron, S Hewlett, F Hill-Briggs, CA Hukins, P Jellema, and DA van der Windt (for Jellema et al w23); JA Klaber-Moffett, K Linde, PS Parfrey, and RN Foley (for Parfrey et al w31); DJ Torgerson (for Porthouse et al w32); D Revicki and JM Miranda (for Revicki et al w35); M Rienstra and IC van Gelder (for Rienstra et al w36); BL Rollman, MF Scheier, and S Colvin (for Scheier et al w40); N Shaheen, BN Singh, AL Stanton, and G Bleijenberg (for Stulemeijer et al w45); D van der Heijde and JH Stone (for Wegener’s Granulomatosis Etanercept Trial w50); and WA Whitelaw.
Contributors: JPAI conceived the study and is guarantor. All authors designed the protocol, analysed and interpreted the data, and approved the final manuscript. DGC-I, IK, and AK collected the data. DGC-I and JPAI drafted the manuscript. IK and AK critically revised the manuscript for important intellectual content.
Competing interests: None declared.
Ethical approval: Not required.
Provenance and peer review: Not commissioned; externally peer reviewed.
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.