# Personalized evidence based medicine: predictive approaches to heterogeneous treatment effects

BMJ 2018; 363 doi: https://doi.org/10.1136/bmj.k4245 (Published 10 December 2018) Cite this as: BMJ 2018;363:k4245

^{1}Predictive Analytics and Comparative Effectiveness Center, Institute for Clinical Research and Health Policy Studies, Tufts Medical Center, Boston, MA 02111, USA

^{2}Department of Biomedical Data Sciences, Leiden University Medical Center, Einthovenweg 20, 2333 ZC, Leiden, Netherlands

- Correspondence to: D M Kent dkent1{at}tuftsmedicalcenter.org

## Abstract

The use of evidence from clinical trials to support decisions for individual patients is a form of “reference class forecasting”: implicit predictions for an individual are made on the basis of outcomes in a reference class of “similar” patients treated with alternative therapies. Evidence based medicine has generally emphasized the broad reference class of patients qualifying for a trial. Yet patients in a trial (and in clinical practice) differ from one another in many ways that can affect the outcome of interest and the potential for benefit. The central goal of personalized medicine, in its various forms, is to narrow the reference class to yield more patient specific effect estimates to support more individualized clinical decision making. This article will review fundamental conceptual problems with the prediction of outcome risk and heterogeneity of treatment effect (HTE), as well as the limitations of conventional (one-variable-at-a-time) subgroup analysis. It will also discuss several regression based approaches to “predictive” heterogeneity of treatment effect analysis, including analyses based on “risk modeling” (such as stratifying trial populations by their risk of the primary outcome or their risk of serious treatment-related harms) and analysis based on “effect modeling” (which incorporates modifiers of relative effect). It will illustrate these approaches with clinical examples and discuss their respective strengths and vulnerabilities.

## Introduction

Austin Bradford Hill, the epidemiologist who formalized randomized clinical trial (RCT) methods, noted in the 1960s that although RCTs can determine the better treatment on average, they “do not answer the practicing doctor’s question: what is the most likely outcome when this particular drug is given to a particular patient?”1 But, if not with an RCT, how can we forecast outcomes in individuals under alternative treatments?

Kahneman and others have described two distinct approaches to single case prediction, the “inside view” and the “outside view.”23 The inside view considers a problem by focusing on the specifics of each case and understanding the many characteristics that make it unique. It is the view prioritized by “traditional” physicians who emphasize clinical experience and expert judgment and the view we spontaneously adopt for making decisions in virtually all aspects of life. By contrast, the outside view predicts by explicitly identifying a group of similar cases (a “reference class”) and ignoring some potentially important particulars; the reference class provides a statistical basis for prediction. This is referred to as “reference class forecasting.”

The central premise of evidence based medicine (EBM) is the recognition that Hill’s assertion was (at least partially) wrong: RCTs can be used to guide clinical decision making for individuals. In emphasizing this, RCTs were repurposed from tools to establish causality into tools for prediction, through reference class forecasting, in individual patients. There is now a wealth of evidence—in medicine and other fields—that predictions based on the inside view (even by “experts”) are vulnerable to all manner of cognitive biases, and that prioritizing impersonal data generally improves decision making.24 EBM has become the dominant paradigm both for medical decision making and for clinical practice guidelines.

Nevertheless, it is easy to recognize that Hill’s view was, in part, right. The result of a positive RCT only provides evidence that at least some of the enrolled patients benefited from the intervention. Logically, the impact this knowledge has on decision making in an individual (even one qualifying for the trial) is unclear when treatments can have very different effects in different patients. For example, thrombolysis in acute ischemic stroke can improve functional outcomes (through recanalization) but also worsen functional outcomes (through intracerebral hemorrhage); angiotensin converting enzyme inhibitors can prevent progression of renal insufficiency but can also cause it in some patients; antihypertensives prevent serious cardiac events but can also cause them; bisphosphonates can prevent fractures from osteoporosis but can also cause them5; carotid endarterectomy for symptomatic carotid stenosis can prevent strokes but can also cause them.6 Moreover, individual patients have many characteristics that might affect the likelihood of an outcome and the benefits or harms of treatment. Determining the best treatment for a given patient, the task of a clinician, is thus very different from determining the best treatment on average.

Thus, interest in understanding how a treatment’s effect varies across patients—a concept described as heterogeneity of treatment effects (HTE)—has been growing. This concept is central to the agenda for both personalized (or precision) medicine and comparative effectiveness research. HTE has been defined as non-random variability in the direction or magnitude of a treatment effect, in which the effect is measured using clinical outcomes.7 Despite this definition, the broad concept of HTE accommodates different perspectives8 and different goals,9 which have at times confused discussions.10

In this article, we focus on what we consider the most essential goal of HTE analysis for clinical decision making: prediction in the individual patient of outcomes under alternative treatments. Although we discuss fundamental difficulties in the prediction of treatment effects for individuals, we emphasize this goal because HTE analysis is of little value if it does not improve our ability to make predictions and decisions one patient at a time. Below, we discuss: fundamental difficulties with the prediction of “individual” risk and treatment effect common to all approaches; limitations of conventional (one-variable-at-a-time) subgroup analysis; and several different regression based approaches to “predictive” HTE analysis.

## Sources and selection criteria

This narrative review provided background for a larger project supported by both a 14 member technical expert panel and an evidence review committee. We used our extensive libraries for the review of basic epidemiological and statistical concepts relevant to HTE. For emerging methods related to predictive approaches to HTE, articles recommended by the technical expert panel and two targeted systematic searches by the evidence review committee were also used. The aims were to discover consensus based methodological recommendations for predictive HTE analysis in RCTs and to identify methodological papers evaluating regression based approaches to predictive HTE analysis. Key search terms included “heterogeneity of treatment effect”, “treatment effect”, “regression”, “statistical models”, “randomized controlled trials” (as topic), and “precision medicine”. These search terms were combined using appropriate Boolean operators to yield 2851 abstracts, which were hand searched. The evidence review committee prepared an annotated bibliography (see supplemental table 1).

## Conceptual background

Although the goal of predictive HTE analysis is to improve the prediction of the treatment effect and decision making in each patient,911 we acknowledge that this enterprise has fundamental limitations. Both risks and treatment effects can be determined only at the group level.12131415 Indeed, under a deterministic framework (that is, when outcomes in patients are viewed as being fully determined by prior causes and conditions), given complete knowledge, the only “true risk” for an individual would be either 0 or 1 for a binary outcome (such as death), and risk prediction should be regarded as a quantification of the limits of our knowledge, rather than an intrinsic property of the patient. Even if we accept the existence of a “true” risk for an individual (that is, a fundamentally stochastic universe), this true risk cannot be directly measured. Instead, a person’s risk is estimated by examining the frequency of outcomes in a group of other “similar” patients. But because similarity can in practice always be defined in many different ways (as we will discuss), a person’s risk cannot typically be uniquely determined; rather, it is a “model dependent” property.1415

The prediction of treatment effect in individual patients is even more challenging than prediction of outcomes. This is because treatment effects at the person level are inherently unobservable even in retrospect; outcomes under two counterfactual treatment conditions cannot be ascertained in the same person simultaneously. Thus, predicting treatment effect, and evaluating models that predict treatment effect, is fundamentally different from (and more difficult than) predicting outcome risk, because we are attempting to predict an “outcome” (that is, the difference in potential outcomes, with and without treatment) that is only partially observable in any patient.
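In the potential outcomes notation often used for this problem (the symbols below are assumed for illustration and do not appear in the original text), the point can be written as:

$$
\tau_i = Y_i(1) - Y_i(0), \qquad Y_i^{\text{obs}} = T_i\,Y_i(1) + (1 - T_i)\,Y_i(0)
$$

Here $Y_i(1)$ and $Y_i(0)$ are the outcomes patient $i$ would have with and without treatment, and $T_i \in \{0, 1\}$ is the treatment actually assigned. Only one of the two potential outcomes is ever observed, so $\tau_i$ itself is never observed. What a randomized trial can identify is an average over a group of patients sharing characteristics $X = x$:

$$
\tau(x) = E[\,Y(1) - Y(0) \mid X = x\,] = E[\,Y \mid T = 1, X = x\,] - E[\,Y \mid T = 0, X = x\,]
$$

The second equality relies on randomization, and $\tau(x)$ is a property of the reference class defined by $x$, not of any single patient.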

Thus, both risk and the prediction of treatment effect must rely on assigning patients to groups (reference classes) to which the individual of interest is similar. But how can similarity be defined? Mathematician John Venn pointed out in 1876 that “every single thing or event has an indefinite number of properties or attributes observable in it, and might therefore be considered as belonging to an indefinite number of different classes of things.”16 Alternative methods of classifying patients will lead to different inferences for any given patient. This “reference class problem” has been subject to much discussion in other fields but has received surprisingly scant attention in the EBM literature.

The approach of EBM to the reference class problem has generally been to emphasize the broad reference class of the RCT population. Guyatt and colleagues’ classic *User’s Guide to the Medical Literature II* stated: “if the patient meets all the inclusion criteria, and doesn’t violate any of the exclusion criteria—there is little question that the results are applicable.”17 The enthusiasm for pragmatic trials, enrolling ever broader populations, represents an extrapolation of the view that broad based populations provide the most useful reference class for clinical decisions.18

Another approach to the reference class problem was suggested by Reichenbach, the theorist who first coined the term. He recommended calibration to “the narrowest reference class for which reliable statistics can be compiled,”19 but matching on just 10 binary characteristics gives rise to more than 1000 distinct subgroups (and 20 binary characteristics give rise to more than a million). Thus, this approach is limited by the problem of small samples, leaving the reference class problem unresolved. The narrowest possible class is the patient himself or herself, who is unique; the uniqueness of each case is why medicine at times becomes an improvisational, “inside view” enterprise so dependent on “clinical intuition.” What is needed is a principled way of prioritizing relevant patient characteristics.
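The arithmetic behind the small sample problem can be sketched in a few lines of Python. The trial size of 10 000 is a hypothetical figure chosen for illustration, and the even spread of patients across subgroups is a best case assumption.

```python
def reference_classes(n_binary_traits: int) -> int:
    """Number of distinct subgroups when matching on n binary characteristics."""
    return 2 ** n_binary_traits


def mean_patients_per_class(trial_size: int, n_binary_traits: int) -> float:
    """Average patients per subgroup if patients were spread evenly (a best case)."""
    return trial_size / reference_classes(n_binary_traits)


# Hypothetical trial of 10,000 patients
for k in (1, 5, 10, 20):
    print(k, reference_classes(k), round(mean_patients_per_class(10_000, k), 3))
```

With 10 binary characteristics, even this large hypothetical trial averages fewer than 10 patients per subgroup, split across both arms.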

The selection of an appropriate reference class is the central problem when using group evidence to forecast outcomes (or treatment effects) in individuals.20 The mapping of an individual to a group of similar (but non-identical) patients always requires (implicitly or explicitly) a model or scheme, whether that be the inclusion criteria of the overall trial or some narrower classification scheme. In this article we will review three broad analytic approaches used to derive more personalized treatment effect estimates: conventional (one-variable-at-a-time) subgroup analysis, risk based subgroup analysis (or risk modeling), and treatment effect modeling.

## Conventional subgroup analysis

The most common approach to HTE analysis is to divide patients serially on the basis of single characteristics defined at baseline (such as male *v* female; old *v* young) and to serially test whether the treatment effect varies across the levels of each attribute. The literature and guidance on the conduct of subgroup analyses is extensive (and largely pejorative).2122232425262728293031323334 Nevertheless, subgroups remain routinely reported, often in the form of forest plots (fig 1). Understanding these analyses and their limitations is central to the understanding of predictive HTE analysis.

### Why most positive subgroup analyses are false

It is often emphasized that the appropriate statistical method for assessing HTE is to test for the contrast in effects among the levels of a baseline variable with a statistical test for interaction.38394041 This typically compares the relative risk (or the odds ratio or hazard ratio) across the levels of the subgrouping variable and corresponds to the epidemiologic concept of effect modification. A common mistake is to claim heterogeneity on the basis of separate tests of treatment effects within each subgroup,2223 for example when a P value reaches statistical significance in one group (say, men) but not in another (say, women).

However, even when adhering to the recommended practice of performing interaction tests, the credibility of “statistically significant” subgroup effects should be regarded cautiously. Several recent meta-epidemiological studies have shown that very few are corroborated in subsequent studies.244243 A recent empirical evaluation of sex-by-treatment interactions in 109 topics found only eight (7%) with statistically significant sex-by-treatment interactions42—a result that was not much greater than what would be expected by chance if relative effects between the sexes were always identical. These results suggest that most statistically significant subgroup effects represent false discoveries.24 Well known examples of misleading positive subgroup analyses include not just the influence of astrological signs on the effects of aspirin for patients with myocardial infarction,44 but far more plausible and therefore more harmful results (eg, aspirin is ineffective in secondary stroke prevention in women,45 beta blockers are ineffective in inferior wall myocardial infarction).2246

The low credibility of positive subgroup results is understandable because RCTs are powered for the main effect of treatment; at least four times the sample size would be needed to provide similar power for an interaction effect of similar magnitude (eg, for a relative odds ratio equal to the odds ratio of the main effect), even for a perfectly balanced subgroup. Alternatively phrased, these interaction effects are anticipated to be powered at about 30% for perfectly balanced subgroups (eg, males *v* females) in trials powered at 80% for the main treatment effect,3847 and less for unbalanced subgroups or for smaller effects. Moreover, because subgroup analyses are typically viewed as being without cost, they are often performed promiscuously across variables, with far less previous evidence than for the main effect in an RCT (which is typically not undertaken without a reasonable probability of success). The combination of a low proportion of anticipated true effects and low power explains the high proportion of false discoveries among “statistically significant” effects (fig 2). Thus, subgroup analyses generally provide the essential conditions for the reliable generation of false discoveries: weak theory and noisy data, that is, exploratory analyses testing multiple hypotheses performed in databases with low power.4850 In addition to false discovery, effect exaggeration (that is, “testimation bias,” also known as the “winner’s curse”)4951 can be anticipated because overestimated effects are preferentially selected through the use of a statistical criterion (such as a P value threshold). These two concerns are important not only in conventional subgroup analysis, but also when considering how best to develop multivariable prediction models to estimate effects for individual patients, which is the focus of this article.
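Both numerical claims can be checked with a short calculation. The sketch below uses a normal approximation and assumes two equal sized, independent subgroups, so that the standard error of the interaction contrast is twice that of the overall effect. The prior proportions of true effects passed to `false_discovery_proportion` are hypothetical values chosen for illustration, not figures from the article.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_two_sided(effect_over_se: float, alpha: float = 0.05) -> float:
    """Power of a two sided z test when the true effect is this many standard errors from zero."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(effect_over_se - z_crit) + Z.cdf(-effect_over_se - z_crit)


def interaction_power(main_power: float = 0.80, alpha: float = 0.05) -> float:
    """Power for an interaction as large as the main effect, with two equal subgroups."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    # Effect size (in standard errors) that yields main_power for the main effect,
    # ignoring the negligible rejection probability in the opposite tail.
    main_signal = z_crit + Z.inv_cdf(main_power)
    # Each half-sample estimate has twice the variance of the overall estimate;
    # their difference has four times the variance, so the standard error doubles.
    return power_two_sided(main_signal / 2, alpha)


def false_discovery_proportion(prior_true: float, power: float, alpha: float = 0.05) -> float:
    """Expected share of 'significant' results that are false positives."""
    true_positives = prior_true * power
    false_positives = (1 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


p_int = interaction_power()
print(round(p_int, 3))
# Hypothetical priors: 10% of tested subgroup effects are real versus 50% for main effects
print(round(false_discovery_proportion(0.10, p_int), 3))
print(round(false_discovery_proportion(0.50, 0.80), 3))
```

Under these assumptions the interaction test has a little under 30% power, and with a 10% prior most significant subgroup findings would be false discoveries. Because the standard error doubles, restoring the original power requires four times the sample size.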

### Why claims of “consistency of effect” are often misleading

Results similar to those shown in fig 1 (in which none of the tested subgroup interaction effects reach statistical significance) are often the basis for claims of “consistency of effects.” However, because trials are usually underpowered for subgroup analyses, the inability to find significant interactions should be anticipated. For example, fig 1A (the Occluded Artery Trial35) shows how clinically significant differences in effects between men and women and between young and old patients may not be statistically significant, even in large trials, and even when the point estimate of these effects is qualitatively different (harm in one stratum and benefit in another). Additionally, even when results seem to be highly consistent across “clinically important subgroups” (as in the Danish Multicenter Randomized Study on Fibrinolytic Therapy Versus Acute Coronary Angioplasty in Acute Myocardial Infarction (DANAMI-2) trial; fig 1B), null subgroup analyses do not imply that benefit-harm trade-offs are likely to be similar across all trial enrollees or that the overall treatment effect applies similarly across trial subjects. Indeed, a core assumption of personalized medicine is that, at the person level, HTE is ubiquitous (some patients benefit and others don’t, and this is not totally random).1352 Because one-variable-at-a-time subgroup analyses compare groups of patients who differ systematically on only a single variable, whereas individual patients differ from one another across many variables simultaneously, the conventional approach greatly under-represents the heterogeneity clinicians observe clinically (that is, at the person level).
Subgrouping schemes, defined more comprehensively across many clinically salient variables simultaneously, may detect important differences in treatment effects that are obscured in conventional subgroup analysis.53 Indeed, clinically important HTE was subsequently identified in the DANAMI-2 trial when a risk modeling approach was applied.37

### Why conventional subgroup analyses are incongruent with the goals of predictive HTE analysis

Conventional subgroup analysis may detect “relative effect modification.” This can help inform theories about conditions under which treatments are especially effective or ineffective. However, this approach does not directly address the reference class problem: that all patients belong to multiple different subgroups, each of which may yield different inferences. For example, even assuming that the subgroup effects shown for both age and sex in the Occluded Artery Trial (fig 1A) are wholly credible, the optimal treatment for a young woman (or an old man) would be unclear. Because a patient has an indefinite number of attributes and can thus belong to an indefinite number of different reference classes, there are as many probabilities for a given patient (and by extension estimable treatment effects) as there are specifiable classes.

The application of conventional subgroup analysis to clinical decision making is further complicated because HTE is typically tested (and presented) on a relative scale (eg, odds ratio or relative risk), whereas the absolute risk difference (ARD) scale (or its inverse, number needed to treat (NNT)) is the most important scale for clinical decision making.13545556 Although the literature sometimes emphasizes the distinction between “predictive factors” (relative effect modifiers) and “prognostic factors,” this distinction is somewhat artificial and can be as confusing as it is clarifying. This is because prognostic factors are “predictive” (that is, effect modifying) when effect is considered on the clinically important absolute scale, and predictive factors typically have “prognostic” effects that complicate clinical interpretation. For clinical decision making, prognostic and predictive effects should be considered simultaneously, because the ARD is a product of both the outcome risk and the relative treatment effect (fig 3). Thus, the presence of statistically significant heterogeneity on the relative scale does not necessarily imply clinically important HTE, which should always be assessed on the ARD scale (fig 3). Indeed, prognostic modeling can often reveal clinically important HTE, because differences in outcome risk are just as important as similar changes in relative risk when determining the ARD. Moreover, prognostic factors are much easier to model than relative effect modifiers, given abundant prior knowledge and much greater statistical power for main effect analyses rather than tests for interaction.
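A minimal numerical sketch shows how a purely prognostic difference becomes effect modifying on the absolute scale. The control event rates and the relative risk of 0.8 are hypothetical values chosen for illustration.

```python
def absolute_risk_difference(cer: float, relative_risk: float) -> float:
    """ARD = CER - EER, where the event rate on treatment is EER = CER * RR."""
    return cer * (1 - relative_risk)


def number_needed_to_treat(ard: float) -> float:
    """NNT is the inverse of the absolute risk difference."""
    return 1 / ard


# Same relative effect (hypothetical RR = 0.8) applied at three levels of outcome risk
for cer in (0.02, 0.10, 0.30):
    ard = absolute_risk_difference(cer, 0.8)
    print(f"CER={cer:.2f}  ARD={ard:.3f}  NNT={number_needed_to_treat(ard):.0f}")
```

With no relative effect modification at all, the NNT ranges from 250 in the lowest risk group to about 17 in the highest, which is the sense in which a prognostic factor is "predictive" on the absolute scale.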

### Limitations of guidance for subgroup analysis

Guidance for analyzing, reporting, and interpreting subgroup analysis typically includes key recommendations13: subgroups should be fully defined a priori (to prevent data dredging); be limited in number (or corrected for multiplicity, or both); be well motivated by clinical reasoning or previous empirical studies; be in the expected (and pre-specified) direction922; be supported by formal tests for interaction; and be fully reported and cautiously interpreted.212230575859 It has also been recommended that the type of subgroup analysis (eg, exploratory (fun to look at) or confirmatory (potentially actionable)) should be specified.95660 A further refinement is the development of an instrument to help evaluate the credibility of any positive subgroup effects.213061

Although this guidance thoughtfully deals with one aspect of the central dilemma of subgroups—the risk of a falsely positive subgroup—it mostly ignores the other term: the risk of overgeneralizing summary results to all patients who meet the enrollment criteria. Although the potential importance of HTE is increasingly recognized,346263646566 trialists, peer reviewers, and regulators have very little guidance on which subgroup analyses should be routine, expected, and necessary for the results to be considered fully and transparently reported.

## Predictive approaches to heterogeneous treatment effects

Predictive approaches to HTE are intended to ameliorate many of the above limitations of one-variable-at-a-time subgroup analysis. The goal of predictive HTE analysis is to develop models that can be used to predict which of two or more treatments will be best for individual patients when multiple variables that influence the benefits or harms of treatment are taken into account. We divide this type of analysis into two subcategories:

Firstly, risk modeling: an approach to predictive HTE analysis whereby a multivariable model (either externally or internally developed) that predicts the risk of an outcome (usually the primary study outcome) is applied to disaggregate patients in trials so that treatment effects can be examined across risk groups.

Secondly, treatment effect modeling (or “effect modeling”): an approach to predictive HTE analysis that develops a model directly on trial data to predict treatment effects (that is, the difference in outcome risks under two alternative treatment conditions). Unlike risk modeling, such a model incorporates a term for treatment assignment and permits the inclusion of treatment by covariate interaction terms.
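One way to make the distinction concrete is with logistic regression, a common choice for binary outcomes. The functional form and symbols here are assumptions for illustration, not a specification taken from the text. A risk model, fitted without reference to treatment assignment, might be written as:

$$
\operatorname{logit} P(Y = 1 \mid x) = \alpha + \beta^{\top} x
$$

Patients are then grouped by predicted risk, and treatment effects are estimated within groups. An effect model adds treatment assignment $t \in \{0, 1\}$ and, optionally, treatment by covariate interactions:

$$
\operatorname{logit} P(Y = 1 \mid x, t) = \alpha + \beta^{\top} x + \gamma t + t\,\delta^{\top} x
$$

The predicted absolute benefit for a patient with covariates $x$ is then $P(Y = 1 \mid x, t = 0) - P(Y = 1 \mid x, t = 1)$. Setting $\delta = 0$ gives a constant odds ratio $e^{\gamma}$, in which case the predicted absolute benefit still varies with baseline risk.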

### Risk modeling

We have previously proposed a framework for risk modeling that prioritizes the reporting of relative and absolute treatment effects across risk strata for the primary trial outcome and suggests that these should be routinely reported.56 Why should outcome risk be prioritized as a subgrouping variable over other variables, such as age, sex, or comorbidities? Unlike other variables that may or may not modify treatment effect, outcome risk is a mathematical determinant of treatment effect. Table 1 shows the definition of several different measures of treatment effect. All of these measures depend on the outcome rate in the control group (the control event rate; CER), which is itself an observable proxy for outcome risk. Because outcome risk typically varies substantially in a trial population when risk is described through a combination of factors,67 the CER will also vary across the trial population when it is disaggregated with a prediction model. Except when trials have null effects, the ARD will generally vary when CER varies across the population (fig 3). Mathematically, only one measure of treatment effect (at most) can remain consistent when risk varies across the population.
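For reference, the usual definitions of these measures can be written as follows. EER, the event rate in the treated group, is an assumed symbol that the text does not use, and the presentation may differ from that of table 1.

$$
\begin{aligned}
\text{ARD} &= \text{CER} - \text{EER} \\
\text{RR} &= \text{EER} / \text{CER} \\
\text{RRR} &= 1 - \text{RR} \\
\text{OR} &= \frac{\text{EER} / (1 - \text{EER})}{\text{CER} / (1 - \text{CER})} \\
\text{NNT} &= 1 / \text{ARD}
\end{aligned}
$$

To see why at most one measure can stay constant, suppose the odds ratio is fixed. Solving the definition of OR for EER gives

$$
\text{EER} = \frac{\text{OR} \cdot \text{CER}}{1 - \text{CER} + \text{OR} \cdot \text{CER}}, \qquad
\text{RR} = \frac{\text{OR}}{1 - \text{CER} + \text{OR} \cdot \text{CER}}
$$

so both RR and ARD change whenever CER changes (unless OR = 1). As a worked case with OR = 0.5: at CER = 0.10, EER ≈ 0.053, RR ≈ 0.53, and ARD ≈ 0.047; at CER = 0.50, EER ≈ 0.333, RR ≈ 0.67, and ARD ≈ 0.167.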

Figure 4 shows the 30 day mortality risk estimates for 1058 patients with ST elevation myocardial infarction based on pretreatment clinical and electrocardiographic variables.69 The risk of mortality in the quarter of patients at highest risk is about 16 times higher than it is in the quarter at lowest risk. Doctors know (and simple algebra confirms) that for interventions that carry some risk of serious treatment related harm, benefit-harm trade-offs differ in patients at such different risks of mortality. However, it is common practice in research to aggregate these patients together in a trial and emphasize the overall summary results, thereby obscuring whether the differences in treatment effect across risk categories are clinically important. Thus, our view is that trial results are incompletely disclosed unless both outcome rates and treatment effects across risk groups are described.56667071

Figure 4 illustrates another commonly observed property6772—that the distribution of the predicted risk is skewed, such that the risk of mortality is lower than the average risk for about 75% of patients; the risk of mortality in the “typical” (median risk) patient is about 3%, about half the average risk that would be reflected in the summary result. The higher mortality risk is driven by the influential quarter of patients at highest risk. When the risk distribution is skewed, the overall benefit for a treatment seen in the trial’s summary results may not reflect the benefits or the benefit-to-harm trade-offs even in patients who are at typical risk (especially when there is some treatment related harm).6672
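The skew can be reproduced with a toy simulation. The sketch below assumes that predicted risks are normally distributed on the logit scale, with hypothetical parameters chosen so that the median risk is roughly 3%. It does not use the data behind fig 4.

```python
import math
import random
from statistics import mean, median


def simulate_predicted_risks(n: int, mean_logit: float, sd_logit: float, seed: int = 0) -> list:
    """Draw n predicted risks whose logits follow a normal distribution (an assumption)."""
    rng = random.Random(seed)
    return [1 / (1 + math.exp(-rng.gauss(mean_logit, sd_logit))) for _ in range(n)]


# Hypothetical parameters: median logit of -3.5 corresponds to a risk of about 3%
risks = simulate_predicted_risks(10_000, mean_logit=-3.5, sd_logit=1.2)
avg, med = mean(risks), median(risks)
share_below_average = sum(r < avg for r in risks) / len(risks)

print(f"mean risk   = {avg:.3f}")
print(f"median risk = {med:.3f}")
print(f"share of patients below the mean = {share_below_average:.2f}")
```

In this sketch the mean exceeds the median and most simulated patients have a risk below the average, the same qualitative pattern described for fig 4.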

An understanding of the underlying distribution of risk for patients in RCTs can help inform anticipated subgroup effects, which by their nature are more credible than unanticipated subgroup effects (in the same way that confirmatory subgroup analysis is more credible than exploratory subgroup analysis (fig 2)). For example, when considering the use of a potentially effective invasive procedure (such as percutaneous coronary intervention; PCI) with a small risk of serious treatment related harm, it is anticipated that the benefit-harm trade-offs would be very different across the risk distribution shown in fig 4. Thus, despite “consistency of effects” in conventional subgroup analysis of the DANAMI-2 trial (fig 1B) (which compared PCI versus medical therapy in patients with ST-elevation myocardial infarction (STEMI)), clinically important HTE emerged when the population was subsequently stratified by mortality risk using the TIMI (thrombolysis in myocardial infarction) score (fig 5A). A risk stratified analysis based on an internally derived model using the data from the RITA-3 trial, which compared an invasive to a non-invasive approach for patients with non-STEMI/unstable angina, showed similar results (fig 5B).
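A risk stratified analysis of this kind can be sketched as follows. The `Patient` record and the function names are hypothetical. The sketch assumes each patient already has a predicted risk from a model developed without reference to treatment assignment, and that every stratum contains patients in both arms.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Patient:
    predicted_risk: float  # from a risk model blinded to treatment assignment
    treated: bool
    event: bool


@dataclass
class StratumResult:
    control_event_rate: float
    treated_event_rate: float
    ard: float            # absolute risk difference, CER - EER
    relative_risk: float  # EER / CER


def event_rate(patients: List[Patient]) -> float:
    """Observed proportion with the outcome; assumes the list is not empty."""
    return sum(p.event for p in patients) / len(patients)


def risk_stratified_effects(patients: List[Patient], n_strata: int = 4) -> List[StratumResult]:
    """Split patients into equal sized strata of predicted risk and estimate effects in each."""
    ordered = sorted(patients, key=lambda p: p.predicted_risk)
    size = len(ordered) / n_strata
    results = []
    for i in range(n_strata):
        stratum = ordered[round(i * size): round((i + 1) * size)]
        cer = event_rate([p for p in stratum if not p.treated])
        eer = event_rate([p for p in stratum if p.treated])
        rr = eer / cer if cer > 0 else float("nan")
        results.append(StratumResult(cer, eer, cer - eer, rr))
    return results
```

Reporting both `ard` and `relative_risk` for each stratum mirrors the recommendation to present treatment effects on both scales across risk groups. A real analysis would add confidence intervals and a formal test for interaction.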

The pattern observed in these trials is not rare. Rather, risk distributions seem to conform to predictable patterns, based on the prevalence of the outcome and the discriminatory performance of the prediction model.67 Other examples in which effects in high risk subpopulations obscure the lack of benefit (and even harm) in many typical or low risk patients include more intensive versus less intensive thrombolytic therapy in STEMI,73 activated protein C for sepsis (https://s3-us-west-2.amazonaws.com/drugbank/fda_labels/DB00055.pdf?1265922807),74 enoxaparin or tirofiban in acute coronary syndrome,757677 anticoagulation for stroke prevention in non-valvular atrial fibrillation,7879 fidaxomicin versus vancomycin to prevent recurrence of *Clostridium difficile* infection, and many others.6738081828384

The examples in fig 5 show how risk modeling can lead not only to important changes on the ARD scale but to statistically significant HTE on the relative scale. This interaction can emerge for many reasons but should be expected when there are known treatment related harms that are reflected in the primary outcome, because similar degrees of treatment related harm will outweigh (or substantially reduce) the benefits in low risk patients but not high risk patients.5366 At the same time, the importance of a significant “P value for interaction” should not be overemphasized when subgroups have very different outcome rates because the clinical importance of HTE needs to be determined on the absolute scale. For example, the Diabetes Prevention Program (DPP) trial tested both a lifestyle modification program and metformin pharmacotherapy against usual care in patients with pre-diabetes. It provides an interesting case where statistically significant relative effect modification was shown for one intervention (lifestyle modification) but not the other (metformin), even though clinically important HTE was shown for both interventions when effects were examined on the absolute scale (fig 6).
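The mechanism by which treatment related harm induces relative effect modification can be illustrated with a simplified additive model. The numbers are hypothetical: a constant relative risk of 0.7 on the treatment responsive part of the outcome, plus a fixed 1% absolute risk of treatment related harm that also counts toward the primary outcome. Overlap between harms and prevented events is ignored.

```python
def treated_event_rate(cer: float, relative_risk: float, absolute_harm: float) -> float:
    """Event rate on treatment: constant relative benefit plus a fixed absolute harm."""
    return cer * relative_risk + absolute_harm


def net_effects(cer: float, relative_risk: float, absolute_harm: float) -> dict:
    """Net absolute risk difference and the relative risk that would be observed."""
    eer = treated_event_rate(cer, relative_risk, absolute_harm)
    return {"ard": cer - eer, "observed_rr": eer / cer}


# Hypothetical values: RR = 0.7 for benefit, 1% absolute harm
for cer in (0.02, 0.10, 0.30):
    effects = net_effects(cer, relative_risk=0.7, absolute_harm=0.01)
    print(f"CER={cer:.2f}  net ARD={effects['ard']:+.3f}  observed RR={effects['observed_rr']:.2f}")
```

In this sketch the observed relative risk moves from above 1 (net harm) in the lowest risk group to well below 1 in the highest, even though the underlying relative benefit is identical. The break-even control event rate is the harm divided by (1 − RR), about 3.3% here.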

The importance of risk as a determinant of absolute benefit is widely accepted. The concept has entered guidelines, notably in the recommended approach to lipid lowering treatment for the prevention of coronary artery disease.88 The concept also underpins several algebraic approaches to “individualizing” evidence that are based on risk predictions and an assumption of consistent relative effects.89909192 Risk based analyses of RCTs permit this assumption to be examined.

#### External versus internal models

Although an applicable externally derived model would enable translation into practice, especially if well validated and clinically accepted, many of the above examples used internally developed risk models. These were derived on trial data “blinded” to treatment assignment. As long as good modeling practice (such as a large number of events per independent variable and a priori selection of risk variables based on previous literature) has been adhered to, models derived directly from RCT data provide “honest” (internally valid) treatment effect estimates within risk strata.5193 Although some researchers recommend that only the control arm be used to model risk,949596 this approach can potentially induce differential model fit on the two trial arms, biasing treatment effect estimates across risk strata, and exaggerating HTE.97 Indeed, with this approach, overfitting on the control arm can make completely innocuous and ineffective treatments appear to be beneficial in high risk patients and harmful in low risk patients. Various cross validation techniques have been proposed to mitigate this bias.98 However, given the small scale of the ARD compared with the predicted outcome risk, even very modest overfitting on the control arm can substantially bias estimates of the treatment effect.

Although internally derived (or endogenous) prognostic models can provide reliable estimates of treatment effects within trial risk strata,98 the implementation of an externally valid prognostic model is necessary for translation into practice. The finding of clinically important HTE across risk strata within a trial provides an important impetus for implementing an externally valid model. It should be noted that external validity is a general concern for RCT results and is not confined to results subgrouped using risk models.

#### Other dimensions of risk: heterogeneity of treatment related harm

It is also important to examine whether treatment related harms vary across risk strata because the treatment burden might not be constant across strata defined by outcome risk. When the two dimensions of risk are highly correlated (when high risk patients are also at greatest risk of treatment related harms), it becomes more difficult to segregate treatment favorable patients from treatment unfavorable ones.99100 Thus, to facilitate the interpretation of benefit-harm trade-offs, important treatment related harms should be reported at the same level of disaggregation (that is, in each of the risk strata) as the primary outcome.

For treatments with serious treatment related harm, a better understanding of the variation in the risk of these adverse events may help to “deselect” patients with unfavorable benefit-harm trade-offs.101 Figure 7 illustrates two recent analyses that showed clinically important variation in the benefit-harm trade-offs in patients who were stratified by internal risk models for the treatment related harm (fracture in the case of pioglitazone; bleeding in the case of long course versus short course dual antiplatelet therapy). Although these analyses can be highly informative, differential overfitting may occur when the adverse outcome is rare in the control group, underscoring the importance of model validation.

Several trials have been stratified by combining models for outcome risk and for treatment related harm to make more comprehensive benefit-harm models.673104 Although this is ultimately the goal of evidence personalization, the arithmetic combination of predictions from different models poses serious challenges related to the calibration of predictions that are beyond the scope of this discussion. Finally, because the primary outcome is sometimes a composite of outcomes with treatment responsive causes and those with treatment unresponsive (or competing) causes, it may also be useful to stratify the trial population by an index that predicts the fraction of outcomes attributable to the treatment responsive cause.105106107 For example, implantable cardiac defibrillators may be of greater benefit in those who have a higher risk of sudden cardiac death compared with their risk of pump failure death108; PFO closure may be more beneficial in a subset of patients with stroke and PFO who are more likely to have a stroke that is caused by PFO rather than another occult mechanism109110; an anti-endotoxin specific therapy may be of greater benefit in patients with sepsis who are at higher risk of Gram negative rather than Gram positive causes of sepsis. Stratification of patients by prediction models that estimate risk of important competing events might also be informative in some circumstances.109110
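The logic of stratifying by the treatment responsive fraction can be sketched algebraically. Assume, purely for illustration, that the composite outcome has two mutually exclusive causes with risks $r^{R}$ (treatment responsive) and $r^{U}$ (treatment unresponsive), that events are rare enough for the risks to add, and that treatment reduces only $r^{R}$, by a constant relative amount $\rho$ (none of these symbols appear in the original):

$$
r = r^{R} + r^{U}, \qquad f = \frac{r^{R}}{r}, \qquad \mathrm{ARR} = \rho\, r^{R} = \rho\, f\, r
$$

Under these assumptions, two patients with the same overall risk $r$ can differ in expected absolute benefit in proportion to $f$, which is why an index that predicts $f$ may add information beyond overall outcome risk.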

### Treatment effect modeling

Although subgrouping on the basis of prognostic modeling has advantages over conventional subgroup analyses, outcome risk may not represent the optimal classification scheme. Prediction models developed on RCT data “unblinded” to treatment assignment have the potential to capture relative effect modification through the inclusion of treatment-by-covariate interaction terms. This may be important for determining (both relative and absolute) treatment effects and highly important for optimizing treatment selection.111 Indeed, approaches to stratified and personalized medicine have often focused exclusively on the discovery of effect modifiers on the relative scale,112 and some researchers reserve the term HTE to refer only to heterogeneity on the relative scale.113 When strong and well established effect modifiers exist—such as time from onset of symptoms to treatment for reperfusion therapies in myocardial infarction—treatment interaction effects can be included in the model, regardless of statistical significance. For example, stratification by predicted benefit (predicted outcome risk with treatment minus predicted outcome risk without treatment) could then stratify some lower risk patients with acute myocardial infarction who present very early as being more treatment favorable than some higher risk patients who present later.
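The reperfusion example can be illustrated with a toy logistic model that contains a treatment-by-time interaction. This is a minimal sketch: every coefficient is hypothetical and chosen only to show the mechanism, the main effect of time on outcome risk is omitted for simplicity, and benefit is expressed as a risk reduction (risk without treatment minus risk with treatment) so that positive values favor treatment.

```python
import numpy as np

# Hypothetical log-odds coefficients, for illustration only
INTERCEPT = -3.0
B_RISK = 1.0           # per unit of a baseline risk score
B_TREAT = -0.9         # treatment effect when given at symptom onset
B_TREAT_HOURS = 0.15   # treatment-by-time interaction: effect fades per hour of delay

def predicted_risk(risk_score, hours, treated):
    """Predicted outcome risk from the toy logistic model."""
    linear_predictor = (INTERCEPT + B_RISK * risk_score
                        + treated * (B_TREAT + B_TREAT_HOURS * hours))
    return 1.0 / (1.0 + np.exp(-linear_predictor))

def predicted_benefit(risk_score, hours):
    """Predicted absolute risk reduction: untreated risk minus treated risk."""
    return (predicted_risk(risk_score, hours, 0)
            - predicted_risk(risk_score, hours, 1))

# Patient A: lower baseline risk, presents 1 hour after onset
# Patient B: higher baseline risk, presents 5 hours after onset
risk_a, benefit_a = predicted_risk(1.0, 1.0, 0), predicted_benefit(1.0, 1.0)
risk_b, benefit_b = predicted_risk(2.0, 5.0, 0), predicted_benefit(2.0, 5.0)

print(f"A: untreated risk {risk_a:.3f}, predicted benefit {benefit_a:.3f}")
print(f"B: untreated risk {risk_b:.3f}, predicted benefit {benefit_b:.3f}")
```

In this toy setting, patient A has the lower untreated risk but the larger predicted benefit, so ranking by predicted benefit and ranking by outcome risk give different orderings.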

However, the incorporation of relative effect modifiers (treatment interaction terms) that were selected on the basis of modeling on the trial itself into prediction models has special challenges. The selection of “statistically significant” relative effect modifiers for inclusion in a prediction model is identical in many respects to one-variable-at-a-time subgroup analysis and has many of the same vulnerabilities—weak theory and noisy data—that can lead to “false positives” and exaggerated effects (from testimation bias49 and other forms of overfitting). The number of events per interaction term needed for more accurate modeling of effect modification is many times greater than the number needed for main prognostic effects and has not been well studied. “Treatment benefit” prediction models using naive regression to select “statistically significant” interactions should be expected to provide misleading estimates of within strata effects because of unreliable, exaggerated, and highly influential interaction terms.114115 The vulnerability to overfitting leaves this approach prone to discovering false subgroup effects, even for treatments that are completely ineffective.

Nevertheless, the further individualization of treatment selection often depends on the discovery of treatment effect modifiers that are not well established. One promising approach is to select a set of variables anticipated to be relative effect modifiers on the basis of a priori clinical reasoning, and to use an omnibus test for significance (with the appropriate degrees of freedom) across all the included putative interaction terms. If the result of this overall test is statistically significant, all interactions are included in the model; otherwise, none are. Because interaction terms are still prone to overfitting, this process should be combined with penalized regression methods (such as lasso regression,116117 ridge regression,118119 or elastic net regularization regression),120121 which shrink model coefficients on the basis of model complexity to yield better predictions of the absolute treatment effect within new populations. Alternatively, when developing models “unblinded” to treatment assignment, a different set of data should be used for variable and model selection (that is, to define the reference class or subgrouping scheme) and for estimation of the treatment effect across strata. There is intense research interest in methods that combine effect modifier (biomarker) discovery with treatment effect estimation, including both machine learning approaches and regression based methods122123124125126127128129130131 (see supplemental table 1 for additional examples), although clinical application remains limited.121 These more complex and aggressive prediction approaches require more rigorous validation.
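The two step idea (an omnibus test across prespecified interactions, followed by penalized estimation) might be sketched as follows. This is an illustration under stated assumptions rather than a recommended implementation: the data are simulated with no true interactions, the helper functions `fit_logistic` and `log_likelihood` are hypothetical, ridge is used as the penalty, only the interaction terms are penalized, and the penalty weight of 10 is arbitrary (in practice it would usually be tuned, for example by cross validation).

```python
import numpy as np

def fit_logistic(design, y, penalty, n_iter=50):
    """Newton-Raphson for logistic regression with a ridge penalty.
    `penalty` holds one non-negative weight per column (0 = unpenalized)."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(design @ beta)))
        gradient = design.T @ (y - p) - penalty * beta
        hessian = (design * (p * (1 - p))[:, None]).T @ design + np.diag(penalty)
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def log_likelihood(design, y, beta):
    """Unpenalized Bernoulli log-likelihood."""
    lp = design @ beta
    return float(np.sum(y * lp - np.logaddexp(0.0, lp)))

# Simulated trial: 4 prespecified covariates, constant treatment effect
rng = np.random.default_rng(1)
n, k = 600, 4
covariates = rng.normal(size=(n, k))
treatment = rng.integers(0, 2, size=n).astype(float)
true_lp = -1.0 + 0.5 * covariates[:, 0] - 0.5 * treatment
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_lp))).astype(float)

main = np.column_stack([np.ones(n), covariates, treatment])
full = np.column_stack([main, covariates * treatment[:, None]])

# Step 1: omnibus likelihood ratio test across all k interaction terms
beta_main = fit_logistic(main, y, np.zeros(main.shape[1]))
beta_full = fit_logistic(full, y, np.zeros(full.shape[1]))
lr_statistic = 2 * (log_likelihood(full, y, beta_full)
                    - log_likelihood(main, y, beta_main))
CHI2_95_DF4 = 9.488  # 95th percentile of chi-square with 4 degrees of freedom
include_interactions = lr_statistic > CHI2_95_DF4

# Step 2: shrink the interaction coefficients with a ridge penalty
penalty = np.concatenate([np.zeros(main.shape[1]), np.full(k, 10.0)])
beta_ridge = fit_logistic(full, y, penalty)

print("LR statistic:", round(lr_statistic, 2), "include:", include_interactions)
print("interaction coefficients, unpenalized:", beta_full[-k:])
print("interaction coefficients, ridge:      ", beta_ridge[-k:])
```

The penalized interaction coefficients are pulled towards zero relative to the unpenalized fit, which is the behavior that is intended to temper exaggerated interaction effects.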

The SYNTAX score II (fig 8) is an example of a model for predicting benefit; eight variables were used as both prognostic variables and effect modifiers (in treatment interaction terms), in a score that predicts outcomes for patients with non-acute coronary artery disease under two revascularization strategies—coronary artery bypass graft surgery (CABG) versus PCI.133 Although the overall trial showed substantial benefit for CABG (the primary outcome was reduced from 17.8% with PCI to 12.4% with CABG; P=0.002),132 stratification by predicted benefit according to the SYNTAX score II indicated that the benefits of population-wide CABG may largely be achieved by targeting to the most treatment favorable quarter of patients, potentially avoiding the substantial trauma and morbidity associated with an open chest procedure in most patients.

#### Evaluating models that predict treatment benefit

The evaluation of a prediction model intended to estimate benefits using the usual metrics for outcome discrimination (eg, c-statistic) and calibration does not provide information on how well a model performs for predicting benefit (that is, the difference between outcome risk with two alternative strategies). Efforts to develop measures to assess model accuracy for predicting benefit are hampered by the fundamental problem of causal inference.134 Unlike individual patient outcomes, individual patient treatment effects (that is, who benefits and who does not) are inherently unobservable because each patient receives only one of the treatments being compared, so the outcome under the alternative treatment is never seen.135

Recently, the c-statistic, commonly used to measure discrimination in outcome risk models, has been adapted to evaluate the prediction of treatment effect.136 To do this, two patients who are discordant on treatment assignment are matched according to their predicted benefit (the absolute difference in their outcome risk with and without treatment). These matched pairs of patients with a similar “propensity for benefit” can then be classified into three categories according to their “observed benefit” by comparing outcomes in the control and experimental patient—benefit (1, 0); no effect (1, 1 or 0, 0); or harm (0, 1)—where 1 represents a bad outcome and 0 represents a good outcome in each of the two study arms; the c-statistic assesses how well the model discriminates pairs of patients on the basis of this trinary “outcome.”136 This approach assumes no correlation in the distribution of outcomes under the two treatments, conditional on the variables in the prediction model; this strong assumption leads to generally low values of the “c-for-benefit” statistic. Similarly, a model based ROC (receiver operating characteristic) measure has been proposed for treatment selection markers using a potential outcomes framework, but this approach relies on the assumption that model predictions are correct.137
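The matching and scoring steps described above can be sketched in a few lines. This is a simplified illustration, not the published procedure: it assumes equal arm sizes and pairs patients by rank of predicted benefit within each arm, and the function name `c_for_benefit` and the toy data are hypothetical.

```python
import numpy as np

def c_for_benefit(benefit_control, y_control, benefit_treated, y_treated):
    """Concordance between predicted and 'observed' benefit in matched pairs.
    Outcomes are coded 1 = bad, 0 = good. Arms must be the same size."""
    benefit_control = np.asarray(benefit_control, dtype=float)
    benefit_treated = np.asarray(benefit_treated, dtype=float)
    y_control = np.asarray(y_control)
    y_treated = np.asarray(y_treated)

    # Pair the i-th ranked control with the i-th ranked treated patient
    order_c = np.argsort(benefit_control)
    order_t = np.argsort(benefit_treated)
    pair_predicted = (benefit_control[order_c] + benefit_treated[order_t]) / 2
    # Observed pair benefit: 1 = benefit (1, 0); 0 = no effect; -1 = harm (0, 1)
    pair_observed = y_control[order_c] - y_treated[order_t]

    concordant, comparable = 0.0, 0
    n_pairs = len(pair_observed)
    for i in range(n_pairs):
        for j in range(i + 1, n_pairs):
            if pair_observed[i] == pair_observed[j]:
                continue  # pairs with the same observed benefit are not comparable
            comparable += 1
            better, worse = (i, j) if pair_observed[i] > pair_observed[j] else (j, i)
            if pair_predicted[better] > pair_predicted[worse]:
                concordant += 1.0
            elif pair_predicted[better] == pair_predicted[worse]:
                concordant += 0.5
    return concordant / comparable if comparable else float("nan")

# Toy data: predicted benefit rises across patients and so does observed benefit
predicted = [0.1, 0.2, 0.3, 0.4]
c_perfect = c_for_benefit(predicted, [0, 0, 1, 1], predicted, [1, 0, 1, 0])
c_reversed = c_for_benefit(predicted, [1, 1, 0, 0], predicted, [0, 1, 0, 1])
print(c_perfect, c_reversed)
```

In the toy data the matched pairs have observed benefit of harm, no effect, no effect, and benefit in order of increasing predicted benefit, so every comparable pair of pairs is ranked correctly in the first call and incorrectly in the second.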

Ultimately, the usefulness of a model depends not just on its ability to predict accurately and provide honest estimates of within strata treatment effects, but on its ability to improve decisions. This depends on model performance relative to a specific decision threshold, that is, a risk reduction that exactly balances the burdens, harms, and costs of treatment. Decision curve analysis138 has been proposed to evaluate the clinical usefulness of prediction models and has been adapted to evaluate models that predict HTE in trials.139 These methods evaluate whether a particular prediction-decision strategy optimizes net benefit for a population at a particular decision threshold, compared with the best overall strategy (that is, treat all or treat none).140 The ultimate test of a predictive approach is to compare decisions (or outcomes) in settings that use such predictions with usual care in an experiment,141 such as a cluster randomized trial.
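One way to compute net benefit for a treatment selection strategy in a two arm trial is sketched below. It assumes a particular formulation (events avoided relative to treating no one, minus the decision threshold times the share of patients treated); the function names and the eight patient data set are hypothetical and serve only to make the arithmetic checkable by hand.

```python
import numpy as np

def event_rate(y, mask):
    """Mean outcome in a subgroup; 0.0 when the subgroup is empty.
    (An empty cell only occurs here when its weight in the sum is zero.)"""
    return float(y[mask].mean()) if mask.any() else 0.0

def net_benefit(y, treated, recommend, threshold):
    """Net benefit of 'treat when recommend is True' relative to treating no
    one, in events avoided per patient. Outcomes under the strategy are taken
    from treated patients where treatment is recommended and from control
    patients where it is not."""
    share_treated = float(recommend.mean())
    rate_if_treated = event_rate(y, recommend & treated)
    rate_if_untreated = event_rate(y, ~recommend & ~treated)
    strategy_rate = (share_treated * rate_if_treated
                     + (1 - share_treated) * rate_if_untreated)
    treat_none_rate = event_rate(y, ~treated)
    return (treat_none_rate - strategy_rate) - threshold * share_treated

# Toy trial: the first four patients are flagged by a model as treatment favorable
y = np.array([1, 1, 0, 0, 0, 1, 0, 1])
treated = np.array([False, False, True, True, False, False, True, True])
model_recommend = np.array([True, True, True, True, False, False, False, False])
threshold = 0.1  # treating 10 patients is judged worth preventing 1 event

nb_model = net_benefit(y, treated, model_recommend, threshold)
nb_treat_all = net_benefit(y, treated, np.ones(8, dtype=bool), threshold)
nb_treat_none = net_benefit(y, treated, np.zeros(8, dtype=bool), threshold)
print(nb_model, nb_treat_all, nb_treat_none)
```

In this toy example the model based strategy achieves the same reduction in events as treating everyone while treating only half of the patients, so its net benefit is higher at this threshold. Repeating the calculation over a range of thresholds gives the decision curve.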

#### Use of observational data for predictive HTE analysis

Observational data have tremendous appeal for predictive HTE analyses. In particular, the growing availability of large databases that capture electronic health records and claims on millions of patients can provide statistical power far beyond that typically achieved by single or pooled RCTs.142143 In addition, because these databases capture a broader, more heterogeneous population, representing the full spectrum of patients seen in routine practice, they may be an excellent substrate for risk prediction. Nevertheless, because randomization remains the gold standard for unbiased estimation of causal treatment effects, RCTs are also the preferred substrate for HTE analysis. Although modern methods for de-confounding may produce unbiased average treatment effect estimates in observational data, it is not possible to know whether all model assumptions are met in any given analysis.144 In addition, for HTE analyses, the assumptions necessary for deconfounding need to be met within each stratum, a more stringent requirement than for the estimation of an overall average treatment effect. Apart from confounding by indication, large observational data sources collected from routine care are often plagued by missing data and misclassification. A growing body of research is focused on improving the understanding of the necessary conditions for trustworthy, unbiased observational results, including research on methods to achieve balance in covariates across subgroups.145146147 Nevertheless, the use of observational data potentially compounds and complicates the well known problems with credibility that already undermine subgroup analyses even in RCTs.

## Conclusion

Although a positive RCT result provides strong evidence that an intervention works for at least some patients included in the trial, clinicians still need to understand how a patient’s multiple characteristics combine to influence his or her potential treatment benefit (that is, the difference between outcome risk with and without the treatment). Disaggregation of the overall results according to absolute risk can yield more informative, narrower reference classes for more patient specific effect estimates of benefit and support more patient specific decision making. Routine use of absolute risk modeling is usually feasible for large phase III trials; journal editors, funders, and the research community should insist on these analyses. New statistical approaches, devised to model treatment effect directly, may offer additional advantages (increasing “benefit discrimination”), although with greater potential for statistical overfitting, false discovery, and biased predictions in new patient populations. These approaches merit more research.

Nonetheless, substantial barriers still need to be overcome.148 We list a few of the outstanding research questions related to the problems covered in this article in the Questions for future research box. In addition, we need research aimed at:

- Improving the integration of clinical prediction into practice149
- Improving our understanding of how to individualize clinical practice guidelines
- Establishing or extending reporting guidelines150
- Establishing new models of data ownership to facilitate data pooling151
- Re-engineering the clinical research infrastructure to support substantially larger clinically integrated trials sufficiently powered to determine the HTE, or to develop our ability to predict when observational data will probably be sufficiently de-biased for reliable HTE determination, or both.146152

Many recent and ongoing organizational and technical advances should enable this evolution.

As Hill pointed out, at the level of the individual the right decision is fundamentally under-determined by the results of a trial. Even in retrospect, it is usually impossible to tell whether the right decision was made for any individual patient. Thus, although the goal of prediction is to improve decisions in each patient, paradoxically, like any other intervention, this can be assessed only by examining whether more precise prediction improves outcomes at the population level. As experience with these approaches grows, in addition to stronger methodological and evidentiary standards, we will need empirical studies to ensure that these more flexible (and manipulable) methods realize in practice their potential to improve population outcomes.

### Glossary

- **Effect modification:** This occurs when the size of the effect of a treatment or exposure on an outcome depends on the level of a third variable (eg, patient characteristics). In the presence of effect modification, the use of an overall effect estimate is inappropriate.
- **Heterogeneity of treatment effect (HTE):** Non-random variability in the direction or size of a treatment effect, measured using clinical outcomes. HTE is fundamentally a scale dependent concept and therefore, for clarity, the scale should generally be specified. (It should be noted that some people reserve the term to describe variability on a relative scale only, such as changes in the odds ratio or relative risk.)
- **Clinically important HTE:** This occurs when variation in the risk difference across patient subgroups spans a decisionally important threshold, which depends on treatment burden (including treatment related harms and costs). It is generally assessed on the absolute scale.
- **Predictive HTE analysis:** The main goal of predictive HTE analysis is to develop models that can be used to predict which of two or more treatments will be better for an individual by taking into account multiple relevant variables.
- **Risk modeling approach:** An approach to predictive HTE analysis in which a multivariable model that predicts the risk of an outcome (usually the primary study outcome) is applied to disaggregate patients in trials and examine risk based variation in treatment effects.
- **External risk models versus endogenous/internal risk models:** External risk models have been developed from an external trial or cohort but can be used for HTE analysis of other trials. Internal risk models are developed directly from the trial population.
- **Treatment effect modeling approach:** An approach to predictive HTE analysis that develops a model directly on randomized trial data to predict treatment effects (the difference in outcome risks under two alternative treatment conditions). Unlike risk modeling, the model incorporates a term for treatment assignment and permits the inclusion of treatment by covariate interaction terms.
- **Net benefit:** A decision analytic measure that puts benefits and harms on the same scale. This is achieved by specifying an exchange rate on the basis of the relative value of benefits and harms associated with interventions. The exchange rate is related to the probability threshold determining whether a patient is classified as being positive or negative for a model outcome, or (when applied to trial analysis) as being treatment favorable versus treatment unfavorable.
- **Overfitting:** A situation where predictions do not generalize to new subjects outside the sample under study. Overfitting occurs when a model conforms too closely to the idiosyncrasies or “noise” of the limited data sample from which it is derived and is a threat to the validity of a model.
- **Penalized regression:** A set of regression methods, developed to prevent overfitting, in which the coefficients assigned to covariates are penalized for model complexity. Penalized regression is sometimes referred to as shrinkage or regularization. Examples of penalized regression include lasso, ridge, and elastic net regularization.
- **Predictive factors:** Patient characteristics that result in modification of the treatment effect and are often assessed using statistical interaction terms on the relative scale. Generally, predictive factors are substantially harder to identify than prognostic factors, given the more limited a priori information on their effects and the greater statistical power needed to test interactions.
- **Prognostic factors:** Patient characteristics that influence the risk of the outcome of interest. These factors may also help discriminate patient groups with different degrees of absolute benefit. A single characteristic may be both predictive and prognostic.
- **Reference class:** A group of similar cases that is used to make predictions for an individual patient of interest. The “reference class problem” refers to the fact that similarity can be defined in an indefinite number of different ways because individuals have many different potentially relevant attributes.
- **Testimation bias:** Refers to the fact that, on average, the effect sizes of newly discovered true (non-null) associations are inherently inflated. Testimation bias arises from the use of statistical thresholds in the process of discovering associations or of selecting variables for a model. Inflation is expected when an association has to pass a certain threshold of statistical significance to be deemed positive (or included in a model) and the study has suboptimal power. The problem is also referred to as the “winner’s curse.”

### How patients were involved in the creation of this article

To gain insight into the importance of heterogeneity of treatment effects from the patient’s perspective, we held three 90 minute webinar enabled group discussions with patient stakeholder representatives of three patient powered research networks (PPRNs): ARthritis Partnership with Comparative Effectiveness Researchers (AR-PoWER), the Health eHeart Alliance, and iConquerMS.

### Questions for future research

- How can we jointly predict multiple important outcomes or risk dimensions (eg, risk of the primary outcome versus risk of treatment related harm)?
- How can we determine when relative effect modifiers are sufficiently reliable for inclusion in treatment effect models?
- Do machine learning techniques have distinct advantages over traditional statistical approaches for predicting treatment effect? If so, under what conditions?
- How can models be updated and recalibrated in the absence of new randomized trials?
- Under what conditions can observational big data sources provide a substrate for reliable predictive heterogeneity of treatment effect analysis?

## Acknowledgments

The authors acknowledge the supportive work of Jessica Paulus, the PARC Technical Expert Panel, and Evidence Review Committee, and of Jennifer Lutz in manuscript preparation.

## Footnotes

Series explanation: State of the Art Reviews are commissioned on the basis of their relevance to academics and specialists in the US and internationally. For this reason they are written predominantly by US authors.

Contributors: The concepts of this manuscript were discussed among all authors. DMK prepared the initial draft of the manuscript. Substantial revisions were made by all authors.

Funding: This work was partially supported through two Patient-Centered Outcomes Research Institute (PCORI) grants (the Predictive Analytics Resource Center (PARC) (SA.Tufts.PARC.OSCO.2018.01.25) and Methods Award (ME-1606-35555)), as well as by the National Institutes of Health (U01NS086294).

Competing interests: All authors have read and understood BMJ policy on declaration of interests and declare no competing interests.

Provenance and peer review: Commissioned; externally peer reviewed.

Disclosures: All statements in this report, including its findings and conclusions, are solely those of the authors and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI), its Board of Governors, or Methodology Committee.