Interpreting treatment effects in randomised trials
BMJ 1998; 316 doi: http://dx.doi.org/10.1136/bmj.316.7132.690 (Published 28 February 1998) Cite this as: BMJ 1998;316:690
- Gordon H Guyatt, professor (a),
- Elizabeth F Juniper, professor (a),
- Stephen D Walter, professor (a),
- Lauren E Griffith, research biostatistician (a),
- Roger S Goldstein, professor (b)
- a Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada L8N 3Z5,
- b Division of Respiratory Medicine, University of Toronto, Toronto, Ontario, Canada
- Correspondence to: Dr Guyatt
- Accepted 5 October 1997
The need to measure the impact of treatments on health related quality of life has led to a rapid increase in the variety of instruments available and in their use as measures of outcome in clinical trials. One limitation of instruments that purport to measure health related quality of life is difficulty interpreting their results. In the past decade, investigators have progressed in making these questionnaire results interpretable. For example, we have shown that when questionnaires present response options in the form of seven point scales with verbal descriptions for each option (see box), the smallest difference that patients consider important is often approximately 0.5 per question. A moderate difference corresponds to a change of approximately 1.0 per question, and changes of greater than 1.5 can be considered large. Thus, for example, in a domain with four items, patients will consider a 1 point change in two or more items as important. This finding applies across different areas of function, including dyspnoea, fatigue, and emotional function in patients with chronic airflow limitation1; and symptoms, emotional function, and activity limitations in adults2 and children3 with asthma, parents of children with asthma,4 and adults with rhinoconjunctivitis.5 Initially, we used comparisons in the same patient to establish this difference, but more recently we have replicated this finding using differences between patients.6
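The thresholds above can be sketched as a small classifier. This is only an illustration: the text gives approximate anchors (about 0.5, about 1.0, greater than 1.5), so the exact cut points between categories and the function names below are assumptions rather than part of the published method.

```python
def mean_change_per_question(item_changes):
    """Average change across the items of one domain (seven point scales)."""
    return sum(item_changes) / len(item_changes)


def classify_change(change_per_question):
    """Label the size of a change per question.

    Assumed cut points: below 0.5 trivial, 0.5 up to 1.0 small but
    important, 1.0 to 1.5 moderate, above 1.5 large.
    """
    size = abs(change_per_question)
    if size < 0.5:
        return "trivial"
    if size < 1.0:
        return "small but important"
    if size <= 1.5:
        return "moderate"
    return "large"


# The four item example from the text: a 1 point change in two of four items.
domain_change = mean_change_per_question([1, 1, 0, 0])
print(domain_change, classify_change(domain_change))
```

A 1 point change in two of four items averages 0.5 per question, which is why it reaches the smallest difference that patients consider important.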
Several questionnaires on quality of life related to health are available, but interpreting their results may be difficult
For some questionnaires, we now know that the smallest change in score that patients consider important is 0.5
Even if the mean difference between a treatment and a control is appreciably less than the smallest change that is important, treatment may have an important impact on many patients
A method for estimating the proportion of patients who benefit from a treatment when the outcome is a continuous variable has been developed
The method is outlined using two examples, one a crossover trial and the other a parallel group design
This approach emphasises the need to establish ranges of health related changes that represent trivial, small but important, moderate, and large changes in addition to mean differences
Clinicians and investigators tend to assume that if the mean difference between a treatment and a control is appreciably less than the smallest change that is important, then the treatment has a trivial effect. This may not be so. Let us assume that a randomised clinical trial shows a mean difference of 0.25 in a questionnaire in which the minimal important difference is 0.5. It might be concluded that the difference is unimportant and that the result does not support giving the treatment. This interpretation assumes that every patient treated scored 0.25 better than they would have done had they received the control and ignores the possibility that treatment might have a heterogeneous effect. Depending on the true distribution of results, the appropriate interpretation might be different.
Consider a situation in which 25% of the treated patients improved by a magnitude of 1.0, while the other 75% did not improve at all (change of 0). This would mean that 25% of those treated obtained a moderate benefit from the intervention. Using the number needed to treat, a method recently developed for interpreting the size of treatment effects, investigators have found that doctors often treat 25 to 50 patients, even as many as 100, in order to prevent a single adverse event.7 8 Thus, the hypothetical treatment with a mean difference of 0.25 and a number needed to treat value of 4 proves to have a powerful effect.
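The arithmetic behind this hypothetical example can be checked with a few lines of Python. The variable names are illustrative, and the sketch assumes that no patient would have improved on control, so that the net proportion benefiting equals the proportion of responders.

```python
# Hypothetical heterogeneous response from the text: 25% of treated patients
# improve by 1.0 per question, the remaining 75% do not change.
fraction_responders = 0.25
responder_gain = 1.0
non_responder_gain = 0.0

# The mean difference averages over responders and non-responders.
mean_difference = (fraction_responders * responder_gain
                   + (1 - fraction_responders) * non_responder_gain)

# Assumption: nobody improves on control, so the net proportion who benefit
# is simply the fraction of responders.
number_needed_to_treat = 1 / fraction_responders

print(mean_difference)         # 0.25
print(number_needed_to_treat)  # 4.0
```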
We have developed a method for estimating the proportion of patients who benefit from a treatment when the outcome is a continuous variable. We outline this method using two examples, one a crossover trial and the other a parallel group design.
Seven point scale with verbal descriptors
The following options were given for response to the question “How short of breath have you felt during the last two weeks while climbing stairs?”
1—extremely short of breath
2—very short of breath
3—quite a bit short of breath
4—moderate shortness of breath
5—some shortness of breath
6—a little shortness of breath
7—not at all short of breath
In the seven point scales used in this study, 7 represents the best possible function, and 1 the worst possible function.
To complete the asthma quality of life questionnaire, patients rate the impairments they have experienced during the previous 14 days and respond to 32 questions on seven point scales similar to that in the box.9 In a multicentre double blind crossover randomised trial lasting 12 weeks, 140 patients received salmeterol (50 μg, twice daily), salbutamol (200 μg, four times daily) or placebo plus salbutamol (to be opened as needed). Each patient received all three regimens and used the questionnaire to rate their quality of life in relation to their asthma at the end of each study period.10
The mean differences between salmeterol and salbutamol, and between salmeterol and placebo, met conventional criteria for significance. In the current analysis, we examined and compared the distribution of different scores in the salmeterol, salbutamol, and placebo periods. We reasoned that the number of patients who had obtained important benefit from treatment would be the number with a difference of 0.5 or more favouring the treatment period, minus the number with a difference of 0.5 or more favouring the control period. This measure is analogous to the conventional risk difference, with 1 divided by the difference in risk being the number needed to treat.
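A minimal sketch of this calculation for a crossover design follows. It assumes each patient contributes one difference score (treatment period minus control period); the function names and the ten example scores are invented for illustration and are not trial data.

```python
def net_proportion_benefiting(differences, mid=0.5):
    """Proportion with a difference of at least `mid` favouring treatment,
    minus the proportion with a difference of at least `mid` favouring control.

    `differences` holds one value per patient: treatment score minus
    control score. `mid` is the minimal important difference.
    """
    n = len(differences)
    n_better = sum(1 for d in differences if d >= mid)
    n_worse = sum(1 for d in differences if d <= -mid)
    return (n_better - n_worse) / n


def number_needed_to_treat(net_proportion):
    """1 divided by the net proportion; None when there is no net benefit."""
    if net_proportion <= 0:
        return None
    return 1 / net_proportion


# Invented difference scores for ten patients (not data from the trial).
differences = [1.25, 0.75, 0.5, 0.5, 0.25, 0.0, 0.0, -0.25, -0.5, -1.0]
net = net_proportion_benefiting(differences)
nnt = number_needed_to_treat(net)
print(net, nnt)
```

In this invented sample four patients improve by at least 0.5 and two deteriorate by at least 0.5, so the net proportion is 2 in 10 and the number needed to treat is 5.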
The figure shows the distribution of differences between the salmeterol and salbutamol treatment periods in the activity domain of the asthma quality of life questionnaire and the difference in the proportion of the distribution in the important benefit compared with the important deterioration ranges. The distribution is approximately normal.
Table 1 shows that for both comparisons, differences between treatments failed to reach the threshold of the minimal important difference for the activity limitation section of the asthma quality of life questionnaire. In the symptom section of the questionnaire, the difference between salmeterol and salbutamol bordered on the minimal important difference. The only comparison in which the minimal important difference was clearly exceeded was that between salmeterol and placebo in the symptom section of the questionnaire.
In contrast to these mean differences, many patients had scores that were more than 0.5 better for salmeterol compared with salbutamol treatment for both symptoms and activity limitations. Fewer had scores that were 0.5 or more better for salbutamol compared with salmeterol. The difference in the proportions is even greater for the comparison between salmeterol and placebo (table 1).
Comparing salmeterol and salbutamol, clinicians would need to treat 4.5 patients for one patient to gain important benefit in the activity domain (or 45 for 10 to benefit). However, the number needed to treat for salmeterol compared with placebo in the activity domain is 2.9.
Parallel group trial
The chronic respiratory questionnaire, which includes 20 items measuring dyspnoea, fatigue, emotional function, and mastery (the extent to which patients feel in control), was developed for use in patients with moderate or severe chronic airflow limitation, and uses seven point scale response options.11 Seventy eight patients with chronic airflow limitation were randomly allocated to a six month programme of respiratory rehabilitation or to conventional community care. For the current analysis, we used the differences between patients' chronic respiratory questionnaire scores at baseline and after 24 weeks, as reported in the primary analysis of the trial results.12 Mean differences between treatment and control for three domains reached significance.
The analysis of the parallel group trial provides additional challenges beyond those of the crossover trial. In theory, to calculate the proportion who improved on treatment we would have needed to know how rehabilitation patients would have fared had they received standard care, and how the standard care patients would have fared had they received rehabilitation. However, we could not observe these data directly because patients received only one treatment or the other. We do, however, know the proportion who improved, remained the same, and deteriorated relative to their baseline status in both treatment and control groups (table 2).
Table 2 shows the proportion of patients in the rehabilitation and control groups whose dyspnoea scores increased by more than 0.5 (improved), changed between −0.5 and 0.5 (unchanged), and fell by more than 0.5 (deteriorated). We can refer to the proportions improved, unchanged, and deteriorated in the two groups as the “marginals.” Given these marginals, there is, in general, no single way of filling in the individual cells in table 2; indeed, there are many possibilities. We have assumed that treatment and control responses are independent. Making this assumption, we obtain estimates of the individual cell values by multiplying the corresponding marginals (for instance, in table 2 we obtain the value for cell ax by multiplying the proportion improved in the rehabilitation group by the proportion improved in the standard care group). In table 2, cells ax, by, and cz represent patients whose outcome is the same irrespective of treatment. Patients in cells ay, az, and bz fared better receiving standard care than rehabilitation, and patients in cells bx, cx and cy fared better receiving rehabilitation than standard care. Thus, the proportion who received benefit from treatment is (bx+cx+cy)−(ay+az+bz), which in this case is (0.24+0.11+0.10)−(0.12+0.03+0.05)=0.25 (0.24 without rounding error). The number needed to treat value is therefore 1/0.24, or 4.2.
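The independence calculation can be sketched as follows. Two things here are assumptions: the marginal proportions are invented (the trial's marginals are in the table, not in the text), and the mapping of a, b, c to improved, unchanged, deteriorated on standard care and x, y, z to the same outcomes on rehabilitation is inferred from which cells the text counts as favouring each treatment.

```python
from itertools import product


def net_benefit_parallel(control, rehab):
    """Estimate cell proportions under independence and the net benefit.

    `control` and `rehab` are (improved, unchanged, deteriorated)
    proportions relative to baseline. Rows a, b, c index the control
    outcome and columns x, y, z the rehabilitation outcome (assumed mapping).
    """
    control_labels = "abc"
    rehab_labels = "xyz"
    # Independence: each cell is the product of its two marginals.
    cells = {
        control_labels[i] + rehab_labels[j]: control[i] * rehab[j]
        for i, j in product(range(3), repeat=2)
    }
    better_on_rehab = cells["bx"] + cells["cx"] + cells["cy"]
    better_on_control = cells["ay"] + cells["az"] + cells["bz"]
    return cells, better_on_rehab - better_on_control


# Invented marginals for illustration only (not the trial's values).
control_marginals = (0.3, 0.5, 0.2)
rehab_marginals = (0.5, 0.4, 0.1)
cells, net = net_benefit_parallel(control_marginals, rehab_marginals)
print(round(net, 2), round(1 / net, 1))

# The rounded cell values reported in the text give the same kind of sum.
reported_net = (0.24 + 0.11 + 0.10) - (0.12 + 0.03 + 0.05)
print(round(reported_net, 2))
```

The nine cells always sum to 1 when each set of marginals sums to 1, which is a quick check that the table has been filled in consistently.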
Table 3 gives the full results and shows that the mean difference between treatment and control groups exceeded the minimal important difference in two of the four domains. However, for all four domains, the difference in the proportion improved compared with deteriorated in the two treatment groups was similar, leading to consistent number needed to treat values of between 2.5 and 4.4.
Interpretation of treatment effects
The notion of taking a continuous variable, specifying a threshold that defines an important difference, and examining the proportions of patients who reach that threshold is not new. In considering the treatment of hypertension, Rose emphasised the difference between mean differences in populations and the impact these differences might have on individuals. In one specific example, Duffy argues persuasively that knowledge of mean changes in alcohol consumption in a population does not allow one to estimate change in the proportion of heavy drinkers. Rather, ascertaining the proportion of heavy drinkers requires direct measurement.13 Another good example of this approach comes from a recent controlled trial of tissue plasminogen activator treatment in patients with acute stroke.14 In reporting the results of this study, the authors presented both mean values of functional measures and differences in the proportions of patients who reached a threshold level of function.
What we have done that is new is to anchor the threshold difference using the smallest difference that patients consider important—the minimal important difference. We have shown how the method can be applied in both crossover and parallel group trials, how to generate the number needed to treat for one patient to benefit from therapy, and how superficial examination of mean differences can produce very misleading conclusions.
When mean differences fall below the minimal important difference, clinicians may intuitively conclude that the treatment has a small, and possibly unimportant, effect. Similarly, doctors who observe a mean difference that is appreciably greater than the minimal important difference may be ready to assume that each patient benefits. This is not necessarily the case. For example, we found a mean difference of 0.7 in the mastery domain of the chronic respiratory questionnaire between those who received and did not receive rehabilitation. Despite this substantial difference, the number needed to treat was 2.5. This means that for every five patients who complete a rehabilitation programme, only two will be better off—a result that may have major implications for the cost effectiveness of the intervention.
Our approach is not restricted to health related quality of life or functional status measures, but applies to any clinical variable. For instance, the interpretation of changes in pulmonary function, exercise capacity, or renal or cardiac function could all be analysed in this way. For these variables, however, the concept of the minimal important difference may be questioned. If renal failure requires dialysis or if cardiac function deteriorates to the point that a heart transplant is necessary, the importance for the patient is clear. Smaller changes in physiological function are important not in themselves, but rather through their effects on patient function and her or his health related quality of life. When considering differences that are important to patients, it may be more appropriate to measure function and health related quality of life directly rather than physiological variables.
Our approach is a way of making data more interpretable—we do not advocate its use as the only analysis. Power may be lost when converting continuous variables to dichotomous or categorical variables. We believe the initial analysis should examine whether differences in continuous variables meet criteria for significance. Once investigators have excluded chance as an explanation for differences between groups they can examine the proportions of patients who have deteriorated, remained the same, or improved as an aid in interpreting the importance of the results.
This approach emphasises the need to establish ranges of health related quality of life, symptoms, and functional status questionnaire changes that represent trivial, small but important, moderate, and large changes. When they understand these ranges, investigators reporting clinical trials should present not only mean differences but also the difference in the proportion of patients who experience important improvement, and the associated number needed to treat. Presenting the results in both ways will reduce the risk of important misinterpretation of randomised trials that directly measure aspects of living that are important to patients.
Funding: Supported in part by a grant from the Medical Research Council of Canada.
Conflict of interest: None.