# Alternative approaches for confounding adjustment in observational studies using weighting based on the propensity score: a primer for practitioners

BMJ 2019; 367 doi: https://doi.org/10.1136/bmj.l5657 (Published 23 October 2019) Cite this as: BMJ 2019;367:l5657

- Rishi J Desai, assistant professor of medicine^{1}
- Jessica M Franklin, assistant professor of medicine^{1}

^{1}Division of Pharmacoepidemiology and Pharmacoeconomics, Brigham and Women’s Hospital and Harvard Medical School, 1620 Tremont Street, Boston, MA 02120, USA

- Correspondence to: R J Desai rdesai{at}bwh.harvard.edu

- Accepted 5 August 2019

This report aims to provide methodological guidance to help practitioners select the most appropriate weighting method based on propensity scores for their analysis out of many available options (eg, inverse probability treatment weights, standardised mortality ratio weights, fine stratification weights, overlap weights, and matching weights), and outlines recommendations for transparent reporting of studies using weighting based on the propensity scores.

Propensity scores1 have become a cornerstone of confounding adjustment in observational studies evaluating outcomes of treatment use in routine care. Propensity score based methods target causal inference in observational studies in a manner similar to randomised experiments by facilitating the measurement of differences in outcomes between the treated population and a reference population.2 Despite the conceptual equivalence between randomised experiments and observational studies using propensity scores, randomised experiments can successfully achieve exchangeability between treated and reference populations with respect to both measured and unmeasured characteristics, whereas observational studies can only achieve exchangeability with respect to the measured characteristics.

Propensity scores, formally defined as patients’ predicted probability of receiving a certain treatment given their characteristics, need to be estimated using observed data based on a statistical model such as a logistic regression model. After estimation, confounding adjustment through conditioning on the propensity scores can be done in many ways, including matching, stratification, adjustment as a regressor, and weighting.3 Previous research has suggested that the traditional outcome regression model provides generally equivalent confounding adjustment to various propensity score based approaches in cohort studies with a large sample size and sufficient number of outcome events to support multivariable model fit.4 However, some key advantages of propensity scores, including the ability to clearly define the target population of inference and the ability to identify and exclude patients in atypical circumstances with near zero probability of receiving a certain treatment,5 have made use of these scores a method of choice for analysing observational data for many researchers.

Matching each treated observation to a fixed number of reference observations if their propensity scores are within a prespecified range (the caliper) has often been the preferred approach of using propensity scores for confounding adjustment.6 However, this method has an important limitation of discarding unmatched observations falling within the caliper after a prespecified number of observations are found for each treated observation. More recently, a paradoxical phenomenon of increasing rather than decreasing covariate imbalance after propensity score matching has been described by King and Nielsen.7 Notably, other methods of using propensity scores in analysis (including stratification, adjustment as a regressor, and weighting) are not affected by this paradox.

Weighting on the propensity score has several advantages. Firstly, unlike matching, weighting keeps most observations in the analysis and hence, can offer increased precision when estimating treatment effects. Secondly, unlike regression adjustment by the propensity score,8 weighting lends itself easily to transparent reporting of the balance achieved between treatment and reference populations. Finally, weighting on the propensity score is arguably the most flexible approach of using propensity scores in the analysis with multiple available variations that allow targeting specific populations for inference. In addition to traditional approaches of propensity score weighting that use inverse probability treatment weights (IPTW) or standardised mortality ratio weights (SMRW), several newer approaches (including propensity score fine stratification weights,9 matching weights,1011 and overlap weights12) have been proposed to overcome important limitations of traditional weighting approaches.

In this report, we describe implementation of alternative propensity score weighting methods along with key features of each approach to help practitioners choose the most appropriate method for their analysis. We also provide recommendations for key diagnostic and reporting parameters to evaluate the validity of an analysis using propensity score weighting. The objective of this report is not to compare performance of different weighting methods, but rather to demonstrate implementation and provide insights into the process of selecting a specific approach for a particular study. For additional technical details and comparative performance of the various weighting approaches described here, we refer readers to previously published studies that have proposed and rigorously evaluated these approaches under various scenarios.45891011121314

#### Summary points

- Propensity score based weighting approaches provide an alternative to propensity score matching and are especially useful when preserving a large majority of the study sample is needed to maximise precision
- Propensity score based weighting approaches can target treatment effect estimation in specific populations including the average treatment effect in the whole population, average treatment effect among the treated population, or average treatment effect in a subpopulation with clinical equipoise
- Principles outlined in this report are intended to help investigators in identifying the most suitable propensity score based weighting approach for their analysis and provide a framework for transparent reporting

## Basic principle of weighting methods based on propensity scores

The propensity score is a balancing score that allows for simultaneous balance on a large set of covariates between the treated and reference populations. Matching and traditional stratification of the propensity score (also referred to as subclassification)1 achieve balance by ensuring that treated and reference populations on average have comparable propensity scores (within each stratum if using subclassification). However, weighting methods use a function of the propensity score to reweight the populations and achieve balance by creating a pseudo-population where the treatment assignment is independent of the observed covariates.15 A weighted outcome regression model can be implemented with treatment status as the only independent variable to derive adjusted treatment effect estimates, because covariates are expected to be balanced in the weighted population. To account for the fact that the pseudo-population size is inflated or deflated relative to the original study population and that weights are estimated (rather than known with certainty), a robust, sandwich type estimator is recommended for variance estimation for the treatment effect estimates.16
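As a rough illustration of the weighted outcome model described above, the sketch below computes a weighted risk difference, which is what a weighted linear model with treatment status as the only independent variable would return as its treatment coefficient. The function names and the toy data are hypothetical, and the sketch deliberately omits the robust (sandwich type) variance estimation that the text recommends.

```python
def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_risk_difference(treated, outcome, weights):
    """Weighted outcome risk in the treated group minus that in the reference group."""
    y1 = [y for t, y in zip(treated, outcome) if t == 1]
    w1 = [w for t, w in zip(treated, weights) if t == 1]
    y0 = [y for t, y in zip(treated, outcome) if t == 0]
    w0 = [w for t, w in zip(treated, weights) if t == 0]
    return weighted_mean(y1, w1) - weighted_mean(y0, w0)


# Hypothetical toy data: treatment indicator, binary outcome, and weights
# that would come from one of the propensity score weighting approaches.
treated = [1, 1, 1, 0, 0, 0]
outcome = [1, 0, 0, 1, 1, 0]
weights = [1.0, 1.0, 2.0, 2.0, 1.0, 1.0]

rd = weighted_risk_difference(treated, outcome, weights)
print(rd)
```

In practice the point estimate and a robust confidence interval would usually come from a statistical package rather than from hand-written code.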

## Target of inference (estimand)

The target of inference refers to the patient population to which the estimated treatment effect applies and will generally be study specific. Investigators should consider the following central question when conceptualising the target of inference for a specific study—would it be feasible to treat all eligible patients included in the study with the treatment of interest?

If the answer to this question is yes, then the target of inference might be defined as the average treatment effect (ATE). An example of where ATE could be the target of inference might be in a study comparing the effectiveness of a newly approved treatment with an existing treatment for a certain condition, for example, dabigatran versus warfarin for prevention of stroke in atrial fibrillation.17 Because both of these treatments are indicated as exchangeable options for atrial fibrillation in the absence of specific contraindications, all patients meeting the study inclusion criteria—namely, the diagnosis of atrial fibrillation—are eligible to receive dabigatran.

If the answer to the central question is no, the treatment would not be given to everyone in the eligible population, and only patients with certain characteristics who actually received the treatment would be ideal candidates for treatment; then the target of inference might be defined as average treatment effect among the treated population (ATT). An example of where ATT could be the target of inference might be in a study evaluating the safety of a particular drug treatment or class in a population of vulnerable patients, for example, antipsychotic drugs for pregnant women.18 Because of the concerns and uncertainty related to malformation risks associated with antipsychotics, not all patients meeting the study inclusion criteria—namely, the diagnosis of schizophrenia, bipolar disorder, or psychosis—might be considered for treatment. Therefore, only women with greater severity of these conditions would receive treatment with antipsychotics during pregnancy, making the ATT the relevant target of inference. There might also be circumstances when the interest is in targeting ATE only among a subset of patients with certain characteristics leading to clinical equipoise. Weighting approaches based on the propensity score can accommodate all three of these targets of inference. The key features and mathematical formulas of each weighting approach are summarised in table 1 and described in detail below. In the absence of treatment effect heterogeneity by patient characteristics, ATE and ATT will coincide.

## Considerations when selecting a propensity score weighting method for confounding adjustment

We describe a stepwise process (fig 1) that investigators can consider when selecting an appropriate weighting method based on the propensity score for their study. We use a cohort study of dabigatran versus warfarin initiation on the risk of ischaemic stroke or systemic embolism conducted using commercial insurance claims data from the United States19 as a recurring case study throughout this manuscript to demonstrate various concepts as they relate to alternate propensity score weighting approaches.

### Step 1: Correct specification of the propensity score model

The first critical step in an analysis using the propensity score for confounding adjustment is avoiding misspecification of the propensity score model. Because an investigator is unlikely to know the true structural association between treatment assignment and all covariates, model misspecification is possible when estimating the propensity score from a simple logistic regression model that only includes main effects and not interactions among variables. Other approaches to estimate the propensity score—for instance, the covariate balancing propensity scores or machine learning approaches such as neural networks—could provide alternatives that are less prone to misspecification.2021 Regardless of the approach used for constructing propensity score models, researchers should emphasise inclusion of outcome risk factors in the model22 and exclusion of strong predictors of treatment that are not associated with outcomes (that is, an instrumental variable) from the model to avoid increased variance and amplification of bias due to unmeasured confounding.23

When considering weighting based on the propensity score, the impact of model misspecification could vary across approaches. Approaches that use the score directly to create weights such as IPTW are theoretically more prone to increased bias and variance from misspecification of the propensity score model.2425 On the other hand, the weighting approach based on propensity score stratification might be more robust against misspecification of the propensity score model, because it can be conceptualised as a semiparametric implementation of propensity score weighting that uses the score only to create stratums and then uses the counts of observations within each stratum to derive weights. A simple diagnostic step of checking covariate balance between the treatment and reference populations after applying weights based on the propensity score can alert researchers to potential model misspecification that might need attention.20 For reporting the balance in each individual covariate between treated and reference populations, a measure such as the standardised difference in prevalence (or means for continuous variables) is recommended.26 Investigators might also consider reporting an overall measure of balance, such as the post weighting C statistic, where values closer to 0.5 would indicate achievement of balance in aggregate over all included covariates.27 Box 1 summarises the recommended diagnostic and reporting steps for analyses conducted using propensity score based weighting.
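The balance check described above can be sketched as a weighted standardised difference for a single covariate. This is a minimal sketch under stated assumptions: the difference in weighted means is divided by the square root of the average of the two weighted group variances, and the variance uses a simple weighted formula without any small sample correction. The function names and toy data are hypothetical.

```python
import math


def weighted_mean_and_variance(x, w):
    """Weighted mean and (uncorrected) weighted variance of x."""
    total = sum(w)
    mean = sum(wi * xi for xi, wi in zip(x, w)) / total
    var = sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w)) / total
    return mean, var


def standardised_difference(x, treated, weights):
    """Weighted standardised difference of covariate x, treated minus reference."""
    x1 = [xi for xi, t in zip(x, treated) if t == 1]
    w1 = [wi for wi, t in zip(weights, treated) if t == 1]
    x0 = [xi for xi, t in zip(x, treated) if t == 0]
    w0 = [wi for wi, t in zip(weights, treated) if t == 0]
    m1, v1 = weighted_mean_and_variance(x1, w1)
    m0, v0 = weighted_mean_and_variance(x0, w0)
    return (m1 - m0) / math.sqrt((v1 + v0) / 2)


# Hypothetical binary covariate (eg, prior stroke) in 4 treated and 4 reference patients.
x = [1, 1, 0, 0, 1, 0, 0, 0]
treated = [1, 1, 1, 1, 0, 0, 0, 0]

crude = standardised_difference(x, treated, [1.0] * 8)
# Hypothetical weights that up-weight the one reference patient with the covariate.
weighted = standardised_difference(x, treated, [1, 1, 1, 1, 3, 1, 1, 1])
print(round(crude, 3), round(weighted, 3))
```

A value close to 0 after weighting suggests that the covariate is balanced between groups in the weighted sample.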

### Box 1: Recommended diagnostics and reporting practices for studies using a propensity score weighting method for confounding adjustment

- Evaluate the weight distribution, and consider weight truncation or trimming when extreme weights are encountered
- Describe the study population overall to clearly identify the population for which inference is being made
- Describe the population by exposure groups to evaluate balance achieved across included covariates between treated and reference groups. Consider reporting an overall measure of balance in the weighted sample such as the post weighting C statistic
- Report the crude and weighted effect estimates along with confidence intervals calculated using robust variance that accounts for weighting

In the case example, a propensity score model was constructed with dabigatran initiation as the dependent variable and 66 prespecified patient characteristics as independent variables in a logistic regression model in a cohort of 79 265 patients with atrial fibrillation, 22 809 (29%) of whom were dabigatran initiators. Conditioning on the propensity scores derived from this model through various weighting approaches (described below) led to balance among included covariates (table 2, fig 2). Achievement of balance suggests that propensity score model specification was probably adequate in this example.

### Step 2: Evaluation of propensity score distributional overlap between exposed and reference groups

Evaluation of the propensity score distributional overlap between the treatment and reference groups is the next important step in an analysis using the propensity score. High overlap in the propensity score distribution generally indicates a reasonable degree of clinical equipoise in treatment selection between the comparator groups. The general recommendation of trimming the regions of non-overlap to ensure restriction to regions where patients had a non-zero probability of receiving either treatment3 is especially important when considering weighting based on the propensity score. Probabilities close to 0 or 1 could result in large weights that unduly influence the analysis by over-representing patients in atypical circumstances who were certain to receive one of the two treatments. If a large portion of the sample is lost after trimming regions of non-overlap, it could indicate insufficient overlap between distributions. Furthermore, exclusion of observations through trimming because of non-overlap can lead to important changes in the composition of the study population and therefore, could alter the target of inference. In the case example, we assessed propensity score distributional overlap between the dabigatran and warfarin groups and noted substantial overlap between the two groups (fig 3). Trimming non-overlapping regions of the propensity score distribution resulted in the exclusion of only 10 patients, which confirmed sufficient overlap.
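One simple way to operationalise trimming of non-overlapping regions is sketched below. It assumes that the region of overlap is defined by the range of propensity scores common to both groups (from the larger of the two group minimums to the smaller of the two group maximums); other definitions exist, and the function name and toy values are hypothetical.

```python
def trim_non_overlap(ps, treated):
    """Flag patients whose propensity score lies in the range common to both groups."""
    ps_treated = [p for p, t in zip(ps, treated) if t == 1]
    ps_reference = [p for p, t in zip(ps, treated) if t == 0]
    lower = max(min(ps_treated), min(ps_reference))
    upper = min(max(ps_treated), max(ps_reference))
    keep = [lower <= p <= upper for p in ps]
    return keep, (lower, upper)


# Hypothetical propensity scores: four reference patients, then four treated patients.
ps = [0.02, 0.10, 0.30, 0.55, 0.20, 0.45, 0.80, 0.97]
treated = [0, 0, 0, 0, 1, 1, 1, 1]

keep, bounds = trim_non_overlap(ps, treated)
n_excluded = keep.count(False)
print(bounds, n_excluded)
```

Reporting the number of excluded patients, as done in the case example, helps readers judge whether overlap was sufficient and whether the target of inference has shifted.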

Evaluating the propensity score distribution in the treatment and reference groups further revealed that the distribution was bimodal for the warfarin group. The first peak comprised a subset of the warfarin initiators who had a low probability of receiving dabigatran, while the second peak comprised the remaining warfarin initiators, who had a relatively higher probability of receiving dabigatran. Examining the distribution after applying weights under different approaches suggested that the patients receiving warfarin in the first peak were down-weighted substantially under all weighting approaches except for the weights targeting the ATE (IPTW and fine stratification weights (ATE)). If the investigators deem that it is important to generate inference that is applicable to all patients with atrial fibrillation initiating dabigatran or warfarin, then it may be appropriate to use weighting approaches that target the ATE in the whole population. However, if investigators consider patients receiving warfarin in the first peak to be a special group of patients with atrial fibrillation where there is little uncertainty over treatment choice (that is, warfarin is always preferred over dabigatran), then it may be appropriate to target the ATT or ATE in the overlap population.

### Step 3a: (If sufficient overlap in the propensity score distribution in step 2) Selection of target of inference

As different approaches for weighting based on the propensity score result in estimates targeting different populations, investigators should pay close attention to their target of inference and select a corresponding weighting approach.

#### Average treatment effect (ATE) in the whole population

Two weighting approaches are available for targeting the ATE, both of which aim to make the distribution of covariates in the treated and reference groups similar to each other and similar to the distribution of the overall study sample.

*Inverse probability treatment weighting (IPTW)*—This method involves weighting by the inverse probability of receiving the study treatment actually received (1/propensity score for the treated group and 1/(1−propensity score) for the reference group). As the propensity score is directly used to create weights, extreme weights are commonly observed whenever the propensity score is near 0 for a treated patient or near 1 for a reference patient. Weight truncation, which is commonly implemented by setting the maximum and minimum weights at prespecified values based on the observed distribution (eg, 1st and 99th percentile), is routinely necessary to address extreme weights and prevent variance inflation.16 Although selecting the cutoff value for truncation is often an arbitrary decision, researchers must appreciate that weight truncation involves a bias-variance trade off where truncating more observations by setting a lower threshold (eg, 95th *v* 99th percentile) will further reduce variance inflation, but at a cost of added bias.28
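The IPTW calculation and percentile based truncation described above can be sketched as follows. The nearest-rank percentile helper is a simple stand-in for a library routine, and the 80th percentile is used only because the hypothetical toy sample has five patients; the text mentions the 1st and 99th percentiles as a typical choice.

```python
import math


def iptw(ps, treated):
    """1/PS for treated patients and 1/(1 - PS) for reference patients."""
    return [1 / p if t == 1 else 1 / (1 - p) for p, t in zip(ps, treated)]


def percentile(values, q):
    """Nearest-rank percentile for q between 0 and 100 (illustrative helper)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def truncate(weights, lower_q=1, upper_q=99):
    """Set weights outside the chosen percentiles to the percentile values."""
    lo = percentile(weights, lower_q)
    hi = percentile(weights, upper_q)
    return [min(max(w, lo), hi) for w in weights]


# Hypothetical data; the third patient is treated despite a propensity score near 0.
ps = [0.5, 0.25, 0.001, 0.8, 0.5]
treated = [1, 1, 1, 0, 0]

weights = iptw(ps, treated)
truncated = truncate(weights, lower_q=1, upper_q=80)
print(max(weights), max(truncated))
```

The treated patient with a propensity score of 0.001 receives a weight of about 1000 before truncation, which illustrates how a single observation can dominate the weighted analysis.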

In the case example, IPTW as high as 155 417 was observed; truncation at the 99th percentile of the weight distribution led to a maximum weight of 9.91. Another solution to prevent extreme weights is stabilisation by incorporating the marginal probability of receiving the treatment actually received in the numerator.29 However, stabilising weights in this manner might not completely address all extreme weights, making truncation necessary. In our case example, incorporating marginal probabilities still led to weights of over 100 in 49 observations (>1000 in 20 observations).
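Stabilisation as described above can be sketched by placing the marginal probability of the treatment actually received in the numerator. The function name and toy data are hypothetical; the second patient shows that a stabilised weight can still be very large when the propensity score is close to 0 for a treated patient.

```python
def stabilised_iptw(ps, treated):
    """IPTW with the marginal probability of the received treatment in the numerator."""
    p_treated = sum(treated) / len(treated)
    return [
        p_treated / p if t == 1 else (1 - p_treated) / (1 - p)
        for p, t in zip(ps, treated)
    ]


# Hypothetical data: half of the patients are treated.
ps = [0.5, 0.001, 0.5, 0.2]
treated = [1, 1, 0, 0]

sw = stabilised_iptw(ps, treated)
print(sw)
```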

A special setting where IPTW is routinely used is in marginal structural modelling.30 Marginal structural models are particularly useful when accounting for time-varying confounding, formally defined as confounding induced by outcome risk factors that are affected by previous treatment and affect future treatment. In this setting, IPTW calculated at multiple time points throughout the follow-up period are commonly combined with inverse probability of censoring weights to address time-varying confounding and selection bias introduced by informative censoring in a single model.30 Previously published articles provide additional details on this method.2831

*Fine stratification weights targeting the average treatment effect (ATE)*—This method does not use the propensity score directly to calculate weights; instead, propensity scores are used to create fine stratums.9 Stratums can be created in several ways, based on the following:

- The propensity score distribution of the whole cohort
- The propensity score distribution of the smaller of the two exposure groups
- A fixed width of probabilities (eg, 0-0.02 stratum 1, >0.02-0.04 stratum 2, and so on).

For low exposure prevalence, the approach of creating stratums based on the propensity score distribution of the exposed patients ensures assignment of all exposed individuals to stratums and minimises loss of information. Following stratification, weights for both treated and reference patients in all stratums with at least one treated patient and one reference patient are subsequently calculated based on the total number of patients within each stratum. Stratums with no exposed or reference patients are dropped out before weight calculation. As long as an appropriate stratification procedure is selected to avoid sparse stratums, extreme weights due to propensity scores that are very close to 0 or 1 are unlikely, which is an important strength in circumstances where exposure prevalence is low and propensity score distribution is skewed. These weights are mathematically equivalent to marginal mean weights described in the education literature.32
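The sketch below illustrates one common formulation of fine stratification weights targeting the ATE, stated here as an assumption: within each retained stratum, the weight is the stratum's share of all patients divided by the stratum's share of the patient's own exposure group. Strata are created with a fixed width (0.25 rather than 0.02 only because the hypothetical toy sample is tiny), strata lacking either treated or reference patients are dropped before totals are computed, and dropped patients are given a weight of 0.

```python
import math
from collections import Counter


def assign_strata(ps, width):
    """Fixed-width strata with inclusive upper bounds, numbered from 1."""
    return [max(1, math.ceil(p / width)) for p in ps]


def fine_stratification_ate_weights(stratum, treated):
    n_all = Counter(stratum)
    n_trt = Counter(s for s, t in zip(stratum, treated) if t == 1)
    # Keep only strata with at least one treated and one reference patient.
    kept = {s for s in n_all if 0 < n_trt[s] < n_all[s]}
    total = sum(n_all[s] for s in kept)
    total_trt = sum(n_trt[s] for s in kept)
    total_ref = total - total_trt
    weights = []
    for s, t in zip(stratum, treated):
        if s not in kept:
            weights.append(0.0)  # patient falls in a dropped stratum
            continue
        share = n_all[s] / total
        if t == 1:
            weights.append(share / (n_trt[s] / total_trt))
        else:
            weights.append(share / ((n_all[s] - n_trt[s]) / total_ref))
    return weights


# Hypothetical propensity scores and treatment indicators.
ps = [0.10, 0.20, 0.15, 0.05, 0.40, 0.30, 0.35, 0.45, 0.90]
treated = [1, 0, 0, 0, 1, 1, 0, 0, 1]

stratum = assign_strata(ps, width=0.25)
w_ate = fine_stratification_ate_weights(stratum, treated)
print(stratum, [round(w, 3) for w in w_ate])
```

In this toy example the last patient sits alone in a stratum with no reference patients and is dropped, while the remaining weights stay between 0.75 and 1.5 because they depend on stratum counts rather than on the propensity score itself.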

#### Average treatment effect among the treated population (ATT)

Two weighting approaches are available for targeting the ATT, both of which aim to make the distribution of covariates in the reference group similar to the distribution observed in the treatment group.

*Standardised mortality ratio weighting (SMRW)*—This method involves setting weights to 1 for the treated patients and weighting reference patients by the odds of treatment probability: (propensity score/(1−propensity score)).29 Similar to IPTW, SMRW is potentially vulnerable to extreme weights because the propensity score is used directly for calculating the weights. Weight truncation could be considered if large weights are observed.
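The SMRW calculation described above can be sketched in a few lines. The function name and toy values are hypothetical.

```python
def smrw(ps, treated):
    """Weight of 1 for treated patients; odds of treatment, PS/(1 - PS), for reference patients."""
    return [1.0 if t == 1 else p / (1 - p) for p, t in zip(ps, treated)]


# Hypothetical data: one treated patient followed by three reference patients.
ps = [0.5, 0.2, 0.8, 0.25]
treated = [1, 0, 0, 0]

w_smr = smrw(ps, treated)
print([round(w, 3) for w in w_smr])
```

Reference patients who resemble treated patients (high propensity score) are up-weighted, and a reference patient with a propensity score near 1 would receive a very large weight.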

*Fine stratification weights targeting the average treatment effect among the treated population (ATT)*—Similar to the fine stratification weights targeting the ATE, propensity scores are used to create fine stratums, but weights for the treated group are set to 1 and reference patients are reweighted based on the number of treated patients residing within their stratum, so that reference patients contribute proportionally to the relative number of total patients within a stratum.9 Extreme weights are uncommon because the propensity score is not used directly to calculate the weights, but they remain possible if some stratums are highly imbalanced with respect to the number of treated and reference patients.33
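A minimal sketch of fine stratification weights targeting the ATT follows, assuming one common formulation: treated patients receive a weight of 1, and reference patients receive the stratum's share of all treated patients divided by the stratum's share of all reference patients. The stratum labels are taken as given and, like the function name, are hypothetical; strata without both groups are dropped and their patients given a weight of 0.

```python
from collections import Counter


def fine_stratification_att_weights(stratum, treated):
    n_all = Counter(stratum)
    n_trt = Counter(s for s, t in zip(stratum, treated) if t == 1)
    # Keep only strata with at least one treated and one reference patient.
    kept = {s for s in n_all if 0 < n_trt[s] < n_all[s]}
    total_trt = sum(n_trt[s] for s in kept)
    total_ref = sum(n_all[s] - n_trt[s] for s in kept)
    weights = []
    for s, t in zip(stratum, treated):
        if s not in kept:
            weights.append(0.0)  # patient falls in a dropped stratum
        elif t == 1:
            weights.append(1.0)
        else:
            n_ref = n_all[s] - n_trt[s]
            weights.append((n_trt[s] / total_trt) / (n_ref / total_ref))
    return weights


# Hypothetical stratum labels and treatment indicators.
stratum = [1, 1, 1, 1, 2, 2, 2, 2]
treated = [1, 0, 0, 0, 1, 1, 0, 0]

w_att = fine_stratification_att_weights(stratum, treated)
print([round(w, 3) for w in w_att])
```

A stratum with many treated patients and very few reference patients would produce a large reference weight, which is the imbalance scenario noted above.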

#### Average treatment effect (ATE) in a subset with clinical equipoise

The next two weighting approaches, matching weights and overlap weights, have a variable target of inference that is heavily influenced by overlap in the propensity score distribution. Broadly, these approaches target the ATE in a subset of the overall population with some clinical equipoise. In other words, these approaches aim to make the distribution of covariates in the treated and reference group similar to each other and similar to the distribution in a subset of the overall study sample where patients are eligible to receive either the treatment of interest or the reference treatment.

*Matching weights*—This method involves weighting patients based on a ratio of the lower of the two predicted probabilities to the predicted probability of the actually received treatment.1011 A key feature is that extreme weights are impossible because weights are bounded between 0 and 1 by design, eliminating the need for weight truncation. When propensity score distributions have good overlap, the target of inference is close to the ATE in the whole population if groups are equally sized, and close to the ATT in the group with fewer observations if groups are unequally sized. In circumstances of limited overlap in propensity score distribution, this approach targets treatment effect estimation in a subpopulation that is neither the set of patients receiving the treatment of interest in routine care nor the whole study population.
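The matching weight calculation described above can be sketched as follows, with hypothetical names and toy values.

```python
def matching_weights(ps, treated):
    """min(PS, 1 - PS) divided by the probability of the treatment actually received."""
    weights = []
    for p, t in zip(ps, treated):
        received = p if t == 1 else 1 - p
        weights.append(min(p, 1 - p) / received)
    return weights


# Hypothetical data: treated and reference patients at two propensity score values.
ps = [0.2, 0.2, 0.7, 0.7]
treated = [1, 0, 1, 0]

w_match = matching_weights(ps, treated)
print([round(w, 3) for w in w_match])
```

Because the numerator can never exceed the denominator, every weight lies between 0 and 1.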

*Overlap weights*—This method involves weighting patients based on the predicted probability of receiving the opposite treatment.12 Similar to matching weights, extreme weights are impossible as weights are bounded between 0 and 1 by design and, therefore, no truncation is necessary. Further, an attractive feature is that, when the propensity score is estimated by logistic regression, this weighting method yields exact balance in the means of the covariates included in the model between treated and reference groups by construction. However, the target of inference is the ATE in the overlap population, which might be different from the ATT or the ATE in the whole study population.
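The overlap weight calculation described above can be sketched as follows, with hypothetical names and toy values.

```python
def overlap_weights(ps, treated):
    """Probability of receiving the opposite treatment: 1 - PS if treated, PS if reference."""
    return [1 - p if t == 1 else p for p, t in zip(ps, treated)]


# Hypothetical data: treated and reference patients at two propensity score values.
ps = [0.2, 0.2, 0.7, 0.7]
treated = [1, 0, 1, 0]

w_overlap = overlap_weights(ps, treated)
print([round(w, 3) for w in w_overlap])
```

Patients whose treatment was least predictable from their characteristics (propensity scores near 0.5) contribute the most, while those with scores near 0 or 1 are smoothly down-weighted rather than excluded.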

For the case example, we calculated the treatment effect comparing dabigatran and warfarin for the risk of major bleeding before and after weighting for all approaches. The results are reported in figure 4, along with confidence intervals calculated using robust variance estimators. The crude estimate suggested a substantially lower bleeding risk with dabigatran versus warfarin, which attenuated after adjustment for confounding through all weighting approaches. Overall, hazard ratio estimates for approaches with a similar target of inference were nearly identical. Hazard ratios for approaches targeting the ATE and ATT were somewhat different (0.73 *v* 0.79). One potential explanation of this difference could be effect measure modification by patient characteristics. Because these estimates apply to populations with varying distribution of patient characteristics (as seen in table 1), presence of effect measure modification could lead the estimates to diverge.

### Step 3b: (If insufficient overlap in the propensity score distribution in step 2) Consider alternative comparison groups or other design modifications

Insufficient distributional overlap could indicate two treatments that are used in completely different populations or for different indications. In this circumstance, investigators should reconsider their design choices with respect to the comparison group or study inclusion criteria. If sufficient overlap is achieved after such modifications, then use of weighting based on the propensity score could be considered, based on the considerations summarised in step 3a. If alternative comparison groups or design modifications fail to achieve sufficient overlap, investigators might need to reconsider the study question.

## Propensity score based weighting approaches for confounding adjustment in evaluations of comparative outcomes in more than two treatment groups

Certain weighting approaches readily extend to settings of more than two treatment groups. Specifically, weight calculations for IPTW, matching weights, and SMRW in settings of two groups have direct equivalents for settings of three or more treatment groups. All these approaches involve generating propensity scores for three or more treatments in a multinomial logistic regression model. IPTWs are calculated based on the inverse of the propensity of the treatment actually received, and target ATE in the whole population regardless of the number of treatment groups. For matching weights in settings of three or more groups, the numerator includes the minimum of all available propensity scores for each patient and the denominator includes propensity of the treatment actually received.11 Similar to settings of two treatment groups, when treatment groups are equally sized and covariate overlap is substantial across three or more treatment groups, matching weights target ATE in the whole population; when one of the treatment groups is small and covariate overlap is substantial, matching weights target ATT in the smallest group.11 For SMRW, investigators can target ATT for a specific treatment group by setting weights for patients receiving the target treatment to 1 and calculating weights for other treatment groups as a ratio of propensity of the target treatment to propensity of the treatment actually received. 
An extension of overlap weights, termed generalised overlap weights, has been proposed for settings of three or more groups, where weights are constructed as the product of the inverse probability weights and the harmonic mean of the generalised propensity scores. These weights target the population with the most overlap in covariates across the multiple treatments.13 Extension to settings of three or more groups for the weighting approaches based on fine stratification requires simultaneous stratification on a multinomial propensity score, which would increase the number of stratums exponentially and could result in variable estimates.34
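The weight calculations for three or more treatment groups described in this section can be sketched for a single patient as follows. The sketch assumes that generalised propensity scores have already been estimated (for example, from a multinomial logistic regression model) and are supplied as a mapping from treatment to probability; the function name, treatment labels, and values are hypothetical.

```python
def multi_group_weights(gps, received, target):
    """Weights for one patient given generalised propensity scores.

    gps: dict mapping each treatment to its predicted probability
    received: the treatment the patient actually received
    target: the treatment group whose ATT is targeted by SMRW
    """
    e_received = gps[received]
    harmonic_mean = len(gps) / sum(1 / e for e in gps.values())
    return {
        "iptw": 1 / e_received,
        "matching": min(gps.values()) / e_received,
        "smrw": 1.0 if received == target else gps[target] / e_received,
        "generalised_overlap": harmonic_mean / e_received,
    }


# Hypothetical patient who received treatment "b" among three options.
gps = {"a": 0.5, "b": 0.25, "c": 0.25}
w = multi_group_weights(gps, received="b", target="a")
print(w)
```

With two treatment groups the IPTW, matching weight, and SMRW expressions reduce to the formulas given earlier for the binary setting.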

## Conclusion

Weighting based on the propensity score represents a valuable tool for confounding adjustment in observational studies of treatment use and is increasingly being used in epidemiological investigations. In this article, we outline key considerations involved in selection and implementation of an appropriate weighting approach based on the propensity score that could provide a framework for practitioners in designing and reporting their analysis.

## Footnotes

Contributors: RJD and JMF have jointly developed this manuscript. RJD is the guarantor of the content of this article.

Funding: This study was supported through internal funding from the Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Harvard Medical School/Brigham and Women’s Hospital.

Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: support from Harvard Medical School/Brigham and Women’s Hospital for the submitted work; RJD is principal investigator of research grants from Bayer, Novartis, and Vertex, to the Brigham and Women’s Hospital for unrelated work; no other relationships or activities that could appear to have influenced the submitted work.

Data sharing: Patient level data are not made available publicly according to the data use agreement. Any aggregate level data not presented in the manuscript can be requested from the corresponding author.

The lead author affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.