Interpreting the results of secondary end points and subgroup analyses in clinical trials: should we lock the crazy aunt in the attic?
BMJ 2001; 322 doi: http://dx.doi.org/10.1136/bmj.322.7292.989 (Published 21 April 2001) Cite this as: BMJ 2001;322:989
- Nick Freemantle, professor of clinical epidemiology and biostatistics
- Accepted 7 December 2000
Impressive results for secondary outcomes or subgroup analyses pose problems for those trying to value the benefits observed in clinical trials. In the prospective randomised amlodipine survival evaluation study, comparing amlodipine with placebo in patients with severe heart failure, a prospectively defined subgroup of patients with non-ischaemic heart failure showed a 46% reduction in the risk of death (95% confidence interval 21% to 63%).1 This was achieved alongside a non-significant reduction in death from any cause or admission to hospital for major cardiovascular events (P=0.31), the prospectively defined primary outcome measure, and no observed benefits in the ischaemic group. The authors of the report commented: “Although this benefit was seen only in a subgroup of patients, it is likely that it reflects a true effect of amlodipine, since the randomisation procedure was stratified according to the cause of heart failure and a significant difference between the ischaemic and non-ischaemic strata was noted for both the primary and secondary end points of the study.”1
This article examines the interpretation that may be placed on the results of secondary end points and subgroup analyses in the context of clinical practice and health policy. With regard to health policy, it emphasises the need for discipline in interpreting clinical trials.
Summary points
Impressive results in subgroup analyses and secondary outcomes can be hard to interpret
For individual patients, subgroup analyses and secondary end points can provide the best guide for clinical intervention
Health policy decisions such as those taken by NICE aim to guide the treatment of future patients and will be difficult to change
Health policy should be protected from undue inference by considering the results of predetermined primary outcomes
Prospectively declared primary outcomes
Randomised trials commonly include a range of patients with a particular disorder and estimate the average effect of the intervention being studied. Clinicians usually want to know the likely benefits and risks for an individual patient. However, attributing benefits to secondary outcomes or specific subgroups in a trial is problematic.
Registration trials require the development of prospective protocols and statistical analysis plans.2 These describe the inclusion and exclusion criteria for patients, treatment and its delivery, outcome assessment, and the statistical analyses. A key feature is the prospective identification of a primary outcome measure.
Clinical trials are major undertakings for sponsors and investigators. It would be odd for a single outcome to encompass all that interests investigators. Frequently, clinical trials include several outcome measures, raising the problem that the likelihood of finding a statistically significant result by chance alone increases with the number of tests undertaken. This is the “penalty for peeking.”3 One approach is to use a Bonferroni adjustment, modifying the P value to account for the multiple tests performed and the increased probability of chance findings achieving significance. This is too high a price to pay, however, since we are not equally interested in all the statistical tests, and the statistical adjustment increases the probability of failing to detect a true effect of treatment.4
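The cost of a Bonferroni adjustment can be illustrated with a minimal sketch. The five P values below are hypothetical and are not taken from any trial discussed here; the function name is likewise chosen only for illustration.

```python
def bonferroni_adjust(p_values):
    """Multiply each P value by the number of tests performed, capping at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

# Hypothetical P values for five outcome measures in a single trial.
p_values = [0.004, 0.03, 0.04, 0.20, 0.65]
adjusted = bonferroni_adjust(p_values)

# An outcome remains "significant" only if its adjusted P value is still <= 0.05.
significant = [p <= 0.05 for p in adjusted]
```

In this made-up example, three outcomes would be nominally significant before adjustment and only one afterwards, which shows how the adjustment reduces the chance of detecting a true effect when every test is treated as equally important.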
Identifying prospectively a primary outcome measure simplifies the situation. Suppose a trial examines the effect of a clinical treatment through a single outcome measure, and the difference between the outcome in the treatment and control groups achieves a two sided P value of 0.05. This means that, if the treatment truly had no effect, a difference between the groups as large as that observed (or larger) would occur by chance alone only five times in 100. If two outcomes are examined, the situation is complicated. Indeed, if the outcomes are unrelated, the probability of at least one of the P values reaching 0.05 by chance is approximately doubled (to slightly less than 10 times in 100). Declaring at the outset that an outcome is of principal importance protects the trial from the need to deal with this problem, but it relegates secondary outcomes and prospectively defined subgroup analyses to the status of descriptors.
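The figure of slightly less than 10 times in 100 follows from simple arithmetic, sketched below under the assumption that the outcomes are independent and that no treatment effect exists. The function name is hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests reaches P <= alpha
    when the treatment truly has no effect on any outcome."""
    return 1 - (1 - alpha) ** n_tests

one_outcome = familywise_error_rate(1)    # a single primary outcome
two_outcomes = familywise_error_rate(2)   # two unrelated outcomes
ten_outcomes = familywise_error_rate(10)  # ten unrelated outcomes
```

With one outcome the chance is 0.05; with two it is 1 - 0.95 x 0.95 = 0.0975; with ten unrelated outcomes it rises to about 0.40. Correlated outcomes, which are common in practice, would give somewhat smaller values than this independence calculation.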
When licensing pharmaceuticals, the US Food and Drug Administration nearly always requires two well designed randomised trials to achieve a one sided P value of 0.025 (an overall P value of 0.001) for the prospectively identified primary outcome measures against an appropriate comparator.5 Estimation (using confidence intervals) rather than hypothesis testing (using P values) is likely to be more helpful in interpreting the results of trials. Standard statistical procedures for estimation provide the most likely value (the point estimate of treatment effect) and a plausible range of values (95% confidence intervals) which are taken to describe the probable range of the true population effect.6
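Both calculations in this paragraph can be sketched briefly. The event counts are hypothetical, and the interval uses the usual large sample normal approximation on the log relative risk, which is one standard method among several and is not claimed to be the one used in any trial cited here.

```python
import math
from statistics import NormalDist

def relative_risk_ci(events_treat, n_treat, events_ctrl, n_ctrl, level=0.95):
    """Point estimate and confidence interval for a relative risk,
    using the normal approximation on the log scale."""
    rr = (events_treat / n_treat) / (events_ctrl / n_ctrl)
    se_log_rr = math.sqrt(1 / events_treat - 1 / n_treat
                          + 1 / events_ctrl - 1 / n_ctrl)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Hypothetical trial: 30 of 200 events on treatment, 50 of 200 on control.
rr, lower, upper = relative_risk_ci(30, 200, 50, 200)

# Chance that two independent trials both reach a one sided P of 0.025
# when there is truly no effect: 0.025 squared, roughly 0.001.
joint_false_positive = 0.025 ** 2
```

Here the point estimate is a relative risk of 0.6, and the interval (roughly 0.40 to 0.90) conveys how large or small the true effect might plausibly be, which a P value alone does not.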
In 1980, the US Food and Drug Administration published its critique of the anturane reinfarction trial: “We are aware that it is unusual for an FDA critique of a clinical trial to be published in the medical literature. We believe that it is important in this instance, however, because … it illustrates so clearly the problems that may arise from subgroup analyses and exclusion of patients from analysis after they have completed a study … Our review … indicates that the cause-of-death classification and all conclusions based on it are unreliable, and that the favorable effect of sulfinpyrazone on overall mortality, especially during the first six months, depends heavily on the after-the-fact exclusion of certain deaths from the analysis.”7
There are good grounds to suggest that a prospectively determined primary outcome based on data from all randomised patients should be used to make policy decisions.8 This strategy will protect the decision maker from the substantive risk of undue inference.
Significant secondary end point or subgroup result
If the primary outcome measure is not statistically significant, what is the correct interpretation of the results of significant secondary outcomes or subgroup analyses? These analyses are analogous, although people often place greater confidence in secondary outcomes. Moyé comments: “The primary end point, chosen from many possible end points and afforded particular and unique attention during the trial, becomes unceremoniously unseated when it is discovered to be negative at the trial's conclusion. Like the ‘crazy aunt in the attic,’ the negative primary end point receives little attention in the end, is referred to only obliquely or in passing, and is left to languish in scientific backwaters.”8
Dr Milton Packer, representing the sponsor, made the following comments to the US Food and Drug Administration representatives (Drs Wood and Shepherd) during the licensing process for carvedilol:5
Dr Packer: Almost all of these P values are 0.00 something, so you can do this in a variety of ways, checking for robustness of the data by adding and subtracting endpoints, obviously post hoc, after the fact, and it all comes out the same way.
Dr Wood: Except for the primary endpoints.
Dr Packer: The primary endpoints don't make it, no matter how creative you are.
Without the benefits of hindsight, the decision to license carvedilol may not have served the public interest because of the prospective uncertainty about the result. β blockers have subsequently been shown to be effective in the treatment of mild to moderate heart failure.9 10 However, when the decision was taken, all available statistical power had been “spent” on the primary outcome, and the play of chance could have considerable influence even though the secondary outcomes seemed to be statistically significant.
Assmann and colleagues argue that statistical inspection of subgroups should not simply rely on P values for the subgroup comparison but on tests for statistical interaction between groups.11 That is, on tests of whether the treatment effect in one group of patients differs significantly from that in the other patients in the trial. They suggest that “only if a statistical interaction test supports a subgroup effect should the results be influenced.” The suggestion is not new and echoes that of Peto et al.12 Although sensible, it is not failsafe, as is shown by the results of the second prospective randomised amlodipine survival evaluation study, which were reported recently.1 This study, which included only patients with non-ischaemic disease, identified no benefits for amlodipine in the treatment group. Pooled results from both trials indicate no benefits from amlodipine for the patient population as a whole or for the patients with non-ischaemic heart failure. This is despite findings of P<0.001 for all cause mortality in the subgroup of patients with non-ischaemic disease and P=0.004 for the interaction term between cause of heart failure and treatment in the first study.1
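One simple form of interaction test compares the treatment effect estimated in two subgroups directly. The sketch below is a minimal illustration that assumes a z test on the difference between two log relative risks; the counts are hypothetical, and this is not the analysis used in the amlodipine trials, which would have had its own prespecified method.

```python
import math
from statistics import NormalDist

def log_rr_and_se(events_treat, n_treat, events_ctrl, n_ctrl):
    """Log relative risk and its approximate standard error."""
    log_rr = math.log((events_treat / n_treat) / (events_ctrl / n_ctrl))
    se = math.sqrt(1 / events_treat - 1 / n_treat
                   + 1 / events_ctrl - 1 / n_ctrl)
    return log_rr, se

def interaction_test(subgroup_a, subgroup_b):
    """z statistic and two sided P value for the difference in
    log relative risk between two independent subgroups."""
    log_rr_a, se_a = log_rr_and_se(*subgroup_a)
    log_rr_b, se_b = log_rr_and_se(*subgroup_b)
    z = (log_rr_a - log_rr_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical counts: (events on treatment, n, events on control, n).
subgroup_a = (40, 300, 70, 300)  # apparent benefit
subgroup_b = (80, 300, 78, 300)  # no apparent benefit
z, p = interaction_test(subgroup_a, subgroup_b)
```

A small P value from such a test says only that the two subgroup estimates differ by more than chance would readily explain in this one data set. As the second amlodipine study shows, even a significant interaction can fail to replicate.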
Oxman and Guyatt developed a series of questions to help clinicians decide whether apparent differences in subgroup responses are real.13 These are given in the box.
Are apparent differences in subgroup response real?
1 Is the magnitude of the difference clinically important?
2 Was the difference statistically significant?
3 Did the hypothesis precede rather than follow the analysis?
4 Was the subgroup analysis one of a small number of hypotheses tested?
5 Was the difference suggested by comparisons within rather than between studies?
6 Was the difference consistent across studies?
7 Is there indirect evidence that supports the hypothesised difference?
An individual patient faced with a serious condition may have only one opportunity to benefit from a potentially helpful treatment. Whatever the statistical results, the subgroup or secondary outcome results could provide the best available estimate of treatment effects for individual patients. Health policy decisions relate not just to the individual patient but to all patients in the future. These decisions require greater rigour because an incorrect decision will be hard to rectify. It may consign future patients to unnecessary treatment with associated risks (but no benefit) and use scarce healthcare resources futilely rather than allocating them to interventions likely to achieve worthwhile improvements in health status.
Thus, for health policy purposes the list in the box should be prefixed by a question asking whether the primary outcome measure was statistically significant. If the answer to that question is yes, then it may be appropriate to consider the remaining questions. A purist view suggests that when the primary end point is not significant the results should be used only for generating hypotheses. Even when the primary outcome is statistically significant, attention should be directed at the way statistical power is spent in the trial, and consideration should be given to the likelihood that findings in subgroups or secondary end points represent chance rather than reliable findings.
This suggestion has substantial implications. Interim guidance from the National Institute for Clinical Excellence (NICE) to sponsors (box) includes various references to the identification and description of subgroups of patients who will benefit from treatments in a manner that may be considered cost effective.
Guidance from NICE on identifying subgroups who may benefit14
Information should be provided in order that the clinical effectiveness of the technology can be evaluated—both qualitatively and quantitatively—in relation to those conditions for which it is indicated (both in general and for relevant subgroups)
The manufacturer or sponsor should include data supporting specific claims (for example, improved efficacy, safety, or diagnostic reliability). Data supporting claims in specific target groups of patients, in whom there may be particular advantages, should also be presented even if these are not specifically identified in the product literature
Manufacturers and sponsors should, as appropriate, provide an overall assessment of the health gain that has resulted, or will result, from the routine adoption of the new technology and in special patient subgroups
Methodological arguments counsel against the use of subgroups of patients—particularly those not prospectively defined—and, worse, against using subgroups derived on the basis of observed results. The National Institute for Clinical Excellence's recommendations for considering the cost effectiveness of drugs and devices deviate substantially from purist rigour and may be regarded as ill conceived or even irresponsible.
A review of the experience of the analogous Australian Pharmaceutical Benefits Economics Subcommittee described problems in the economic analysis in two thirds of submissions to the scheme, and it found that two thirds of these problems concerned the interpretation of clinical data.15 Purist rigour in licensing of pharmaceuticals is challenged by current practice in cost effectiveness analysis.16 As health systems increasingly consider cost effectiveness analyses as part of the decision making process for the reimbursement of drugs and devices, it is important that research evidence is properly interpreted, otherwise inappropriate pharmaceuticals will be incorporated in clinical practice.
Competing interests: NF has received funding for research from various pharmaceutical and device companies, the Department of Health, the Medical Research Council and other medical charities.