Randomised controlled trials: understanding power
BMJ 2015; 350 doi: https://doi.org/10.1136/bmj.h3229 (Published 18 June 2015) Cite this as: BMJ 2015;350:h3229
- Philip Sedgwick, reader in medical statistics and medical education
- Correspondence to: P Sedgwick
The effects of manual lymph drainage on the development of lymphoedema related to breast cancer were investigated using a randomised controlled trial.1 The intervention was a six months’ treatment programme consisting of guidelines about prevention of lymphoedema, exercise therapy, and manual lymph drainage. Control treatment consisted of the same programme as the intervention but without the guidelines about manual lymph drainage. Participants were consecutive patients with breast cancer and unilateral axillary lymph node dissection. The length of follow-up was 12 months after surgery. The setting was hospitals in Belgium.
The outcome measures included the cumulative incidence of arm lymphoedema by follow-up. Arm lymphoedema was defined as an increase in arm volume of 200 mL or more compared with the value before surgery. The sample size was based on having 80% power to detect a difference between treatment groups of 20% in the cumulative incidence of arm lymphoedema, assuming a cumulative incidence of 30% for the control group at follow-up. The sample size calculation assumed a two sided hypothesis test and critical level of significance of 0.05 (5%). In total, 146 patients were required. To account for an estimated dropout rate of 10%, the required sample size was adjusted to 160 patients. In total, 160 patients were recruited, with 79 allocated to the intervention, and 81 allocated to control.
Overall, 154 (96.3%) patients completed follow-up, with four patients in the intervention group and two in the control group lost to follow-up. At 12 months after surgery, the percentage of patients with arm lymphoedema was higher in the intervention group than in the control group, although the difference was not significant (24% (n=18) versus 19% (n=15); difference 5%, 95% confidence interval −8% to 18%; P=0.45). It was concluded that there was no evidence that manual lymph drainage in addition to guidelines and exercise therapy after axillary lymph node dissection for breast cancer reduced the incidence of arm lymphoedema in the short term.
Which of the following statements, if any, are true?
a) The proposed difference of 20% between treatment groups in the primary outcome used to calculate the sample size was the smallest effect of clinical interest.
b) To have 100% statistical power would require sampling the entire population.
c) The trial was overpowered for the statistical test of the primary outcome.
d) Because the difference between treatment groups in the primary outcome was not significant, it can be assumed the intervention is equally as effective as the control.
Statements a, b, and c are true, whereas d is false.
The above trial was a superiority trial by design; the aim was to establish whether the intervention was superior in effectiveness to the control treatment and reduced the cumulative incidence of arm lymphoedema, or whether the control treatment was superior. Superiority trials have been described in a previous question.2 Although it was anticipated that the intervention would reduce the cumulative incidence of arm lymphoedema compared with the control treatment, sometimes results are unexpected and it was important that statistical hypothesis testing allowed for the possibility of the control treatment being superior. Therefore, traditional statistical hypothesis testing with a two sided alternative hypothesis was used to compare treatment groups in the primary outcome.3 When calculating the required sample size for the above trial, it was necessary to stipulate that the alternative hypothesis was two sided, since this influences the required sample size. It was also necessary to specify the critical level of significance; the standard level of 0.05 (5%) is typically used when calculating the required sample size for trials.
It was essential that the researchers calculated the optimal sample size before starting the trial. The required number of participants was based on the clinical significance of the difference between treatment groups in the primary outcome. It was assumed that the cumulative incidence of arm lymphoedema at 12 months would be 30% for the control group. The assumption was based on previous research. For the intervention to be considered clinically effective and superior to the control treatment, the intervention group was required to demonstrate a 20% reduction in the cumulative incidence of arm lymphoedema at follow-up. This difference in the primary outcome is called the smallest effect of clinical interest (a is true). The smallest effect of clinical interest was proposed by the researchers on the basis of clinical experience or previous research. Obviously, larger differences between treatment groups would show clinical superiority, whereas smaller differences would not. If the control group demonstrated a reduction of 20% or more in the cumulative incidence of arm lymphoedema when compared with intervention, statistical significance would also be demonstrated.
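The way these inputs drive the required sample size can be sketched with the simple normal approximation for comparing two proportions. This is an illustration only: the source does not state which formula the researchers used, the function name `n_per_group` is hypothetical, and the 20% difference is read here as an absolute difference (30% versus 10%), which is an assumption.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p_control, p_intervention, alpha=0.05, power=0.80):
    """Participants per group, simple normal approximation, two sided test.

    Assumes equal allocation and no continuity correction. This is NOT
    necessarily the method used in the trial described above.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance = (p_control * (1 - p_control)
                + p_intervention * (1 - p_intervention))
    difference = p_control - p_intervention
    return ceil((z_alpha + z_beta) ** 2 * variance / difference ** 2)


# Assumed reading of the trial's inputs: 30% in control, 10% in intervention.
print(n_per_group(0.30, 0.10))              # smallest effect of clinical interest
print(n_per_group(0.30, 0.20))              # a smaller difference needs more people
print(n_per_group(0.30, 0.10, power=0.90))  # higher power needs more people
```

Under this approximation the first call gives 59 per group (118 in total), fewer than the 146 reported. The trial's calculation therefore presumably used a more conservative method, such as a continuity correction or an exact test; the source does not say which.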
The smallest effect of clinical interest may not exist for the population. That is, the difference in cumulative incidence of arm lymphoedema between treatment groups at 12 months follow-up that would be seen if the treatments were applied to the entire population may be less than 20%. However, if the smallest effect of clinical interest does exist for the population, then the probability that it is observed in the trial as statistically significant needs to be maximised. To achieve this, an optimal sample size was required. This underlies the concept of statistical power. Statistical power is based on the hypothetical situation of repeating the above trial an infinite number of times under the same conditions; in particular, the samples would be of the same size. Random sampling would be employed to select the samples from the population, and therefore the samples would have different sample estimates for the population parameter of the difference between treatments in the primary outcome. Each trial would involve a statistical hypothesis test, resulting in a P value for the comparison of treatments in effectiveness. These P values would vary in magnitude. The percentage of these repeated samples that would demonstrate the smallest effect of clinical interest (if it existed in the population) as a statistically significant difference (P<0.05) is the statistical power of the calculated sample size. The magnitude of statistical power is stipulated when calculating the required sample size. In the above trial the power was set at 80%. Therefore, 80% of the hypothetical infinite samples would demonstrate the smallest effect of clinical interest (if it existed in the population) as statistically significant (P<0.05).
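The idea of repeating the trial many times can be imitated by simulation. The sketch below is a minimal illustration under stated assumptions: the population incidences are taken as 30% (control) and 10% (intervention), 73 participants are allocated to each group (half of 146), and each simulated trial is analysed with a pooled two sample z test, which is not necessarily the test used in the actual trial. The function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_sided_p(events_a, n_a, events_b, n_b):
    """Two sided P value from a pooled z test comparing two proportions."""
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # no events, or all events, in both groups
        return 1.0
    z = (events_a / n_a - events_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulated_power(n_per_group, p_control, p_intervention,
                    n_trials=5000, alpha=0.05, seed=1):
    """Proportion of simulated trials with P < alpha."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    significant = 0
    for _ in range(n_trials):
        control = sum(rng.random() < p_control for _ in range(n_per_group))
        treated = sum(rng.random() < p_intervention for _ in range(n_per_group))
        if two_sided_p(control, n_per_group, treated, n_per_group) < alpha:
            significant += 1
    return significant / n_trials


# Assumed scenario: the smallest effect of clinical interest exists.
print(f"Estimated power: {simulated_power(73, 0.30, 0.10):.1%}")
```

The estimate need not equal exactly 80%, because a finite number of trials is simulated and because the trial's sample size calculation may have been based on a different test. Setting both incidences to the same value shows the other side of the design: the proportion of significant results then approximates the 5% significance level.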
If the smallest effect of clinical interest existed for the population, then it was essential that the probability it was observed in the trial (that is, the statistical power) was as high as possible. However, increased statistical power is associated with a larger sample size. This is intuitive, because as sample size increases and approaches that of the population, the sample estimate of the difference in the cumulative incidence in arm lymphoedema in the trial would become similar to the population parameter. Therefore, as sample size increases so also does power since the smallest effect of clinical interest is more likely to be seen in the trial if it exists in the population. To have 100% statistical power would require sampling the entire population (b is true), although this would not be feasible. Therefore, a compromise is made between power and sample size when determining the required sample size for a trial. The power was set to 80% in the above trial, this being the minimum generally recommended when calculating sample size in clinical trials.
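The trade-off between power and sample size can be tabulated with a normal approximation. This is a rough sketch, assuming population incidences of 30% and 10%, equal allocation, and a hypothetical helper `approx_power`; it ignores the negligible chance of a significant result in the wrong direction and is not the calculation used in the trial.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_group, p_control, p_intervention, alpha=0.05):
    """Approximate power of a two sided test comparing two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt((p_control * (1 - p_control)
               + p_intervention * (1 - p_intervention)) / n_per_group)
    return NormalDist().cdf(abs(p_control - p_intervention) / se - z_alpha)


# Power rises with the number of participants per group.
for n in (20, 40, 60, 80, 100, 200):
    print(f"n per group = {n:3d}  power = {approx_power(n, 0.30, 0.10):.3f}")
```

Each additional participant buys less extra power than the one before, which is why a compromise such as 80% or 90% is usually chosen.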
When designing a trial, determining the optimal sample size is an important ethical consideration. A trial needs to be adequately powered. If the sample size is too small the trial may not have adequate power and fail to identify the smallest effect of clinical interest, if it exists in the population. This might be unethical because time, effort, and resources will have been spent running a trial that has little potential to demonstrate clinical significance. Equally, if the sample size is too large then more participants will have been recruited than needed to show the smallest effect of clinical interest. This would also be considered unethical because time, effort, and resources would have been spent in recruiting too many participants.
For the sample size calculation for the above trial, statistical power was set at 80%. If the smallest effect of clinical interest existed in the population, then to observe it in the trial and demonstrate it as statistically significant with 80% power, 146 women needed to be recruited. The required sample size was adjusted for an estimated dropout rate of 10%, resulting in a total sample size of 160. The sample size calculation assumed equal numbers of participants would be allocated to each treatment group. In total, 160 women were recruited, of whom 154 women completed follow-up (intervention n=75, control n=79). The observed dropout rate was 3.7%, less than the estimated dropout rate of 10% that was used to adjust the sample size. Therefore, more participants were recruited than needed to demonstrate the smallest effect of clinical interest. Since more participants were recruited than was necessary, the statistical power was in excess of the 80% as specified in the sample size calculation for the required smallest effect of clinical interest. When a trial has power in excess of that stipulated in the sample size calculation, it is referred to as overpowered (c is true).
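The arithmetic in this paragraph can be checked directly. The counts below are taken from the text; the two ways of inflating the sample size for dropout are common conventions shown for comparison, and the source does not say which, if either, the researchers used.

```python
required = 146            # participants needed for 80% power (from the text)
recruited = 160           # participants actually recruited
completed = {"intervention": 75, "control": 79}
estimated_dropout = 0.10  # dropout rate assumed at the design stage

# Two common conventions for allowing for dropout (assumed, not from the source).
inflate_by_multiplying = required * (1 + estimated_dropout)  # roughly 160.6
inflate_by_dividing = required / (1 - estimated_dropout)     # roughly 162.2

total_completed = sum(completed.values())
observed_dropout = (recruited - total_completed) / recruited
surplus = total_completed - required  # completers beyond the required number

print(f"Inflated sample size: {inflate_by_multiplying:.1f} or {inflate_by_dividing:.1f}")
print(f"Completed follow-up: {total_completed}")
print(f"Observed dropout: {observed_dropout:.2%}")
print(f"Completers in excess of requirement: {surplus}")
```

The observed dropout is 6/160, or 3.75%, which the text reports as 3.7%. Both conventions give a figure close to the 160 reported.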
When making conclusions based on the results from a trial, it is important to appreciate the association between statistical power and the statistical significance of the comparison between treatment groups in the primary outcome. In particular, differences between treatment groups in the primary outcome that are identified as statistically significant may not be clinically significant. The above trial was overpowered. The implication of this was that it was possible to maintain the specified statistical power of 80%, yet report differences between the treatment groups in the cumulative incidence of arm lymphoedema that were smaller than the smallest effect of clinical interest as statistically significant. Hence it was possible that differences smaller than the required smallest effect of clinical interest, and therefore not clinically significant, could have been identified as statistically significant. An example has been described in a previous question.4
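How the detectable difference shrinks as more participants are included can be sketched by solving for the reduction in incidence that a given sample size detects with 80% power. The sketch assumes a control incidence of 30%, equal allocation, and a normal approximation; the helper names are hypothetical and the numbers are illustrative rather than a reconstruction of the trial's own calculation.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_group, p_control, p_intervention, alpha=0.05):
    """Approximate power of a two sided test comparing two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt((p_control * (1 - p_control)
               + p_intervention * (1 - p_intervention)) / n_per_group)
    return NormalDist().cdf(abs(p_control - p_intervention) / se - z_alpha)


def detectable_reduction(n_per_group, p_control=0.30, power=0.80, alpha=0.05):
    """Smallest absolute reduction detected with the given power (bisection)."""
    low, high = 0.0, p_control
    for _ in range(60):  # interval halves each step; 60 steps is ample
        mid = (low + high) / 2
        if approx_power(n_per_group, p_control, p_control - mid, alpha) < power:
            low = mid   # reduction too small to reach the target power
        else:
            high = mid  # target power reached; try a smaller reduction
    return high


for n in (59, 73, 77, 150):
    print(f"n per group = {n:3d}  detectable reduction = {detectable_reduction(n):.3f}")
```

As the number per group grows, the reduction detectable with 80% power falls, so differences below the smallest effect of clinical interest become more likely to reach statistical significance.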
At 12 months after surgery, the percentage of patients with arm lymphoedema was higher in the intervention group than in the control group (24% versus 19%). However, the difference was not statistically significant. The results were probably unexpected, since it would have been anticipated that the intervention would reduce the cumulative incidence of arm lymphoedema when compared with control. Although the difference between treatment groups in the primary outcome was not significant, it cannot be inferred that the intervention was equally as effective as the control in reducing the cumulative incidence of arm lymphoedema (d is false). The inference that can be made based on the above result is that there was no evidence of a difference between the treatment groups in the cumulative incidence of arm lymphoedema, not that there is no difference. A previous question described how “absence of evidence is not evidence of absence.”5 That is, just because statistical hypothesis testing fails to find a difference between treatment groups in an outcome, it does not mean that a difference does not exist in the population. Although a difference may exist in the population, the study may not have demonstrated one because the trial participants were a single sample from the population; another sample may give a different sample estimate that might lead to a significant result. This is despite the trial having relatively high statistical power of 80%.
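The reported difference, confidence interval, and P value can be approximately reproduced from the counts given in the text (18 of 75 in the intervention group and 15 of 79 in the control group). The sketch below assumes a Wald confidence interval and a pooled two sample z test; the source does not state which methods were used, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def compare_proportions(events_a, n_a, events_b, n_b, level=0.95):
    """Difference in proportions with a Wald CI and a pooled z test P value."""
    p_a, p_b = events_a / n_a, events_b / n_b
    diff = p_a - p_b

    # Confidence interval: unpooled (Wald) standard error.
    se_ci = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    ci = (diff - z_crit * se_ci, diff + z_crit * se_ci)

    # Hypothesis test: standard error pooled under the null hypothesis.
    pooled = (events_a + events_b) / (n_a + n_b)
    se_test = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se_test))
    return diff, ci, p_value


diff, (lower, upper), p_value = compare_proportions(18, 75, 15, 79)
print(f"difference {diff:.0%}, 95% CI {lower:.0%} to {upper:.0%}, P={p_value:.2f}")
```

Under these assumptions the figures agree, after rounding, with those reported. The interval spans both a sizeable reduction and a sizeable increase in incidence, which is why the result is read as no evidence of a difference rather than evidence of no difference.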
Competing interests: None declared.