Original Article
Using the outcome for imputation of missing predictor values was preferred

https://doi.org/10.1016/j.jclinepi.2006.01.009

Abstract

Background and Objective

Epidemiologic studies commonly estimate associations between predictors (risk factors) and outcome. Most software automatically excludes subjects with missing values. This commonly causes bias because missing values seldom occur completely at random (MCAR) but rather selectively based on other (observed) variables, that is, missing at random (MAR). Multiple imputation (MI) of missing predictor values using all observed information including the outcome is advocated to deal with selective missing values. This seems a self-fulfilling prophecy.

Methods

We tested this hypothesis using data from a study on diagnosis of pulmonary embolism. We selected five predictors of pulmonary embolism without missing values. Their regression coefficients and standard errors (SEs) estimated from the original sample were considered as “true” values. We assigned missing values to these predictors (both MCAR and MAR) and repeated this 1,000 times using simulations. Per simulation we multiply imputed the missing values without and with the outcome, and compared the regression coefficients and SEs to the truth.

Results

Regression coefficients based on MI including outcome were close to the truth. MI without outcome yielded very biased—underestimated—coefficients. SEs and coverage of the 90% confidence intervals were not different between MI with and without outcome. Results were the same for MCAR and MAR.

Conclusion

For all types of missing values, imputation of missing predictor values using the outcome is preferred over imputation without outcome and is no self-fulfilling prophecy.

Introduction

No matter how hard investigators try to prevent it, missing values occur in all types of epidemiologic studies. Analysis of epidemiologic studies typically includes estimation of the association between risk factors or predictors and the outcome. Most statistical packages exclude subjects with a missing value on any of the variables analyzed, making the complete or available case analysis the most common form of analysis. It is often not appreciated that simply excluding subjects with missing values not only affects precision but commonly causes biased results as well, as the occurrence of missing values is seldom completely at random [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11].

Generally, three types or mechanisms of missing values are distinguished [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. When subjects with missing data form a random subset of the study sample—for example, because a tube with blood material was accidentally broken—missing values are denoted as missing completely at random (MCAR). If the probability that an observation is missing depends on unobserved subject information or characteristics, missing values are called missing not at random (MNAR) or nonignorable. When missingness depends on characteristics that are observed, missing values are—confusingly—called missing at random (MAR); missing values are random conditional on other available information.
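The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the study: the variables `age` and `x`, the 30% missingness rate, and the logistic coefficients are all assumptions chosen for the example. It generates one predictor `x` and deletes values under each mechanism, so that the mean of the remaining observed values can be compared with the mean in the full sample.

```python
import math
import random


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_missingness(n=20000, seed=1):
    """Return rows with a predictor x and three hypothetical missingness flags."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.gauss(0.0, 1.0)      # always observed (standardised, assumed)
        x = rng.gauss(0.5 * age, 1.0)  # predictor that will receive missing values
        rows.append({
            "age": age,
            "x": x,
            # MCAR: missingness ignores all subject information
            "miss_mcar": rng.random() < 0.3,
            # MAR: missingness depends only on the observed variable age
            "miss_mar": rng.random() < expit(-1.0 + 1.5 * age),
            # MNAR: missingness depends on the (unobserved) value of x itself
            "miss_mnar": rng.random() < expit(-1.0 + 1.5 * x),
        })
    return rows


def mean_x_observed(rows, flag):
    """Mean of x among rows where x is not flagged as missing."""
    kept = [r["x"] for r in rows if not r[flag]]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    rows = simulate_missingness()
    print("full sample mean of x:", sum(r["x"] for r in rows) / len(rows))
    for flag in ("miss_mcar", "miss_mar", "miss_mnar"):
        print(flag, "observed mean of x:", mean_x_observed(rows, flag))
```

Under MCAR the observed mean stays close to the full-sample mean. Under MAR the overall observed mean shifts, but within a narrow band of `age` it does not, which is the sense in which the values are random conditional on other available information. Under MNAR no observed variable fully accounts for the shift.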

When missing values are MCAR—which can easily be checked from the data—complete and available case analysis give unbiased though less precise results [1], [2], [4], [5], [8], [10], [12], [13], [14], [15], [16], [17], [18], [19]. However, two other simple and frequently used techniques, i.e., imputation of missing values by the overall—unconditional—mean, or by regarding missing values as a separate category (the so-called missing indicator method), still lead to biased results when missing values are MCAR, which is also illustrated in the paper by Donders et al in this issue [1], [2], [4], [10], [12], [14], [15], [18]. When missing values are MNAR, there is no uniform method of handling the missing values properly [1], [2], [4], [5], [8], [10], [12], [13], [14], [15], [16], [17], [18], [19].

Mostly, in epidemiologic research missing values are neither MCAR nor MNAR [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. Missingness is typically MAR, that is, related to other observed subject characteristics including, directly or indirectly, the outcome. For example, in a diagnostic study among children with neck stiffness we quantified which combination of predictors from patient history and physical examination predicts the absence of bacterial meningitis (outcome), and which blood tests (e.g., C-reactive protein level) have added predictive value [20]. Patients presenting with severe signs such as convulsions, which commonly occur among those with bacterial meningitis, often received additional (blood) testing before full completion of patient history and physical examination, which in turn were largely missing. On the other hand, patients with very mild symptoms, who frequently had no bacterial meningitis, were more likely to have a completed history and physical examination but less likely to undergo additional tests, because the physician had already ruled out a serious disease [20], [21]. Missingness on particular tests was related to other observed test results and, although indirectly, to the outcome. This mechanism of missingness, that is, MAR, may similarly occur in prognostic, etiologic, and therapeutic studies. Moreover, in therapeutic studies loss to follow-up is often directly related to the outcome under study. In such instances, a complete case analysis obviously results in biased associations between predictors and outcome because the patient subset that is completely observed is highly selected. Moreover, when data are MAR, all above-mentioned simple techniques to handle missing values commonly result in biased estimates of the regression coefficients and overestimation of precision [1], [2], [4], [5], [8], [10], [12], [13], [14], [15], [16], [17], [18], [19]. Only if missingness of covariate values is not at all related to the outcome, directly or indirectly, may complete case analysis and other simple techniques still yield valid results [1], [5], [6], [14]. But usually in epidemiologic research, missingness of covariate values is related to the outcome, if not directly then indirectly, via other variables that are related to the missing covariate and to the outcome.

More advanced methods, such as maximum likelihood estimation (e.g., using the EM-algorithm) and multiple imputation of missing values based on all other observed subject characteristics—that is, conditional imputation—have been advocated to properly deal with missing values [1], [2], [4], [5], [6], [9], [10], [11], [12], [14], [17], [18], [19], [22], [23], [24], [25], [26], [27]. Maximum likelihood estimations are particularly applied in multilevel or repeated-measurement analysis in which predictors or risk factors are documented more than once. In the present article, we focus on studies where predictors and outcome are measured once, for which multiple imputation is the advocated method [1], [2], [4], [5], [6], [9], [10], [11], [12], [14], [17], [18], [19], [24], [25], [26], [27].
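After multiple imputation, each completed data set yields its own estimate and SE, and these are usually combined with Rubin's rules. The text shown here does not state how the estimates were pooled, so the following is a generic sketch of those standard rules rather than a description of the authors' scripts; the function name `pool_rubin` is hypothetical.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, standard_errors):
    """Pool m per-imputation estimates and SEs with Rubin's rules.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    # Within-imputation variance: average of the squared SEs
    within = mean([se ** 2 for se in standard_errors])
    # Between-imputation variance: sample variance of the estimates (divisor m - 1)
    between = variance(estimates)
    total = within + (1.0 + 1.0 / m) * between
    return pooled_estimate, math.sqrt(total)


if __name__ == "__main__":
    # Made-up numbers, for illustration only
    print(pool_rubin([1.0, 1.2, 1.4], [0.2, 0.2, 0.2]))
```

The between-imputation term is what lets the pooled SE reflect the uncertainty caused by the missing values; analysing a single imputed data set as if it were fully observed omits it.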

As missingness on a predictor or covariate is commonly related to other predictors (covariates) and directly or indirectly to the outcome, the advice in conditional multiple imputation of missing values is to use all observed patient data, that is, all other predictors plus the outcome. Ignoring the association between the predictors with missing values and the outcome in the imputation will dilute the estimated association (that is, the regression coefficients) between the predictors and outcome [1], [2], [4], [5], [10], [12], [14], [18], [19], [25], [26]. This is simply because the (prediction) model that is used to estimate the conditional distribution of the covariate(s) with missing values misses an important variable, that is, the outcome. Hence, the estimated conditional distribution of the covariate, from which random values are drawn (multiple times) to replace the missing covariate values, is biased [2]. On the other hand, using the outcome to multiply impute missing predictors and subsequently estimating the association between these same predictors and the outcome may seem a self-fulfilling prophecy in which the regression coefficients of the predictors will be overestimated, that is, biased away from the null. However, it has empirically been shown that imputation of outcome values that are missing at random using all observed patient information including the predictors under study (i.e., conditional imputation) causes less bias in the associations between these same predictors and the outcome than unconditional imputation [8], [17], [26], [28], [29]. Accordingly, and based on firm statistical reasoning [1], [2], [4], [5], [10], [12], [14], [18], [19], [25], [26], imputation of predictor values that are missing at random without using the outcome would also result in more biased associations than imputation conditional on the outcome. This has hardly been illustrated empirically [6].
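The dilution argument can be illustrated with a deliberately simplified simulation. This is a sketch under stated assumptions, not the authors' procedure: it uses a single binary predictor and a binary outcome (so the logistic regression coefficient equals the log odds ratio of the 2x2 table), MCAR missingness, and plug-in imputation probabilities estimated from the observed cases rather than drawn from a posterior distribution. The prevalence of 0.4, intercept of -1.0, true coefficient of 1.2, and the function names are all invented for the example.

```python
import math
import random


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_odds_ratio(xs, ys):
    """Log odds ratio of a 2x2 table; equals the logistic coefficient for one binary x."""
    n11 = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 0)
    return math.log((n11 * n00) / (n10 * n01))


def run_demo(n=20000, true_beta=1.2, p_missing=0.4, m=5, seed=2006):
    rng = random.Random(seed)
    x = [1 if rng.random() < 0.4 else 0 for _ in range(n)]
    y = [1 if rng.random() < expit(-1.0 + true_beta * xi) else 0 for xi in x]
    missing = [rng.random() < p_missing for _ in range(n)]  # MCAR deletion of x

    observed = [(xi, yi) for xi, yi, mi in zip(x, y, missing) if not mi]
    # Imputation model without the outcome: marginal P(x = 1)
    p_marginal = sum(xi for xi, _ in observed) / len(observed)
    # Imputation model with the outcome: P(x = 1 | y)
    p_given_y = {}
    for level in (0, 1):
        x_in_level = [xi for xi, yi in observed if yi == level]
        p_given_y[level] = sum(x_in_level) / len(x_in_level)

    def pooled_estimate(use_outcome):
        estimates = []
        for _ in range(m):
            completed = []
            for xi, yi, mi in zip(x, y, missing):
                if mi:  # replace the missing value by a random draw
                    p = p_given_y[yi] if use_outcome else p_marginal
                    xi = 1 if rng.random() < p else 0
                completed.append(xi)
            estimates.append(log_odds_ratio(completed, y))
        return sum(estimates) / m  # average over the m completed data sets

    return {
        "complete_case": log_odds_ratio([xi for xi, _ in observed],
                                        [yi for _, yi in observed]),
        "mi_without_outcome": pooled_estimate(False),
        "mi_with_outcome": pooled_estimate(True),
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(name, round(value, 3))
```

In this setup the imputed values drawn without the outcome are independent of the outcome by construction, so the pooled coefficient is pulled toward the null, whereas drawing from the distribution conditional on the outcome preserves the association present in the observed cases instead of inflating it.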

Our primary aim was to test the hypothesis of self-fulfilling prophecy by comparing the results when missing predictor values, generated by different mechanisms, are multiply imputed conditional on all other predictors plus the outcome to the results obtained after multiple imputation (MI) conditional on other predictors only (i.e., without the outcome). A secondary aim was to compare the performance of MI with and without the outcome to that of complete case analysis. For both aims we used empirical data from an epidemiologic study on the detection of pulmonary embolism (PE) as the basis of realistic simulation studies. Our aim was not to compare differences between methods for imputation of missing outcome values, as this has already been illustrated by others [7], [8], [17], [28], [29], [30].
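Comparing methods against the “true” values comes down to two summary measures over the simulation runs: the bias of the estimated coefficient and the coverage of its confidence interval. The sketch below shows one common way to compute them; it assumes normal-approximation intervals with z = 1.645 for the 90% level mentioned in the abstract, and the function name and example numbers are hypothetical.

```python
from statistics import mean


def bias_and_coverage(estimates, standard_errors, truth, z=1.645):
    """Summarise simulation runs for one coefficient.

    bias     = mean estimate minus the true value
    coverage = fraction of runs whose interval estimate +/- z * SE contains the truth
    """
    bias = mean(estimates) - truth
    covered = [
        est - z * se <= truth <= est + z * se
        for est, se in zip(estimates, standard_errors)
    ]
    return bias, sum(covered) / len(covered)


if __name__ == "__main__":
    # Three made-up simulation runs, for illustration only
    print(bias_and_coverage([1.0, 1.2, 1.5], [0.1, 0.1, 0.1], truth=1.2))
```

A method can show little bias and still have poor coverage if its SEs are too small, which is why both measures are reported side by side.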

Description of the empirical data set

Data were taken from a prospective diagnostic study among adult patients (older than 18 years) who were suspected of PE. The patients underwent a systematic patient history and physical examination, and additionally, blood gas analyses and chest X-ray. Finally, they underwent ventilation-perfusion lung scanning or pulmonary angiography (reference test) to determine the presence or absence of PE (outcome). For specific details and main results of this study we refer to the literature [31], [32], [33].

Bias

Figure 1 shows for each MI method including complete case analyses, the amount of bias in the estimation of the intercept and each predictor (including age). For both MI methods, use of the outcome yielded much lower bias in the models' intercept compared to MI without the outcome, which applied to all types of missingness. There were no major differences across the two MI methods. Similar results were found for the regression coefficients of the four imputed predictors: the bias was smaller

Discussion

Using a combination of empirical data and simulation studies, we quantified whether the estimated regression parameters of a multivariable logistic model differ when missing predictors (independent variables) are multiply imputed with use of the outcome as compared to no use of the outcome. The bias in estimated regression coefficients, including the intercept, was minimal when the outcome was included in the MI models, irrespective of whether the missing values were completely at

Acknowledgments

We gratefully acknowledge the support by The Netherlands Organization for Scientific Research (ZON-MW 904-10-006 and 917-46-360). The computer (R) scripts applied in this study—on how to use empirical data as a basis for simulations, how missing values are created according to different mechanisms (MCAR, MAR, and MNAR), and to illustrate the effect of different conditional multiple imputation models on the final study results compared to complete case analysis—are available on request by the

References (38)

  • C.R. Weinberg et al. Imputation for exposure histories with gaps, under an excess relative risk model. Epidemiology (1996)
  • K. Meyer et al. A new suggestion for the classification of missing values in the outcome of clinical trials. Clin Res Regul Affairs (1998)
  • K. Unnebrink et al. Intention-to-treat: methods for dealing with missing values in clinical trials of progressively deteriorating diseases. Stat Med (2001)
  • J.L. Schafer et al. Missing data: our view of the state of the art. Psychol Methods (2002)
  • D.B. Rubin. Multiple imputation for nonresponse in surveys (1987)
  • A. Donner. The relative effectiveness of procedures commonly used in multiple regression analyses for dealing with missing values. Am Stat (1982)
  • W. Vach et al. Biased estimates of the odds ratio in case-control studies due to the use of ad hoc methods of correcting for missing values for confounding variables. Am J Epidemiol (1991)
  • J.M. Robins et al. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc (1994)
  • W. Vach. Some issues in estimating the effect of prognostic factors from incomplete covariate data. Stat Med (1997)