Education And Debate

Reader's guide to critical appraisal of cohort studies: 2. Assessing potential for confounding

BMJ 2005; 330 doi: (Published 21 April 2005) Cite this as: BMJ 2005;330:960
  1. Muhammad Mamdani, senior scientist1,
  2. Kathy Sykora, senior biostatistician1,
  3. Ping Li, analyst1,
  4. Sharon-Lise T Normand, professor of health care policy (biostatistics)2,
  5. David L Streiner, professor3,
  6. Peter C Austin, senior scientist1,
  7. Paula A Rochon, senior scientist4,
  8. Geoffrey M Anderson, chair in health management strategies (geoff.anderson{at}
  1. 1Institute for Clinical Evaluative Sciences, Toronto, ON Canada
  2. 2Department of Health Care Policy, Harvard Medical School, Boston, USA
  3. 3Department of Psychiatry, University of Toronto, ON, Canada
  4. 4Kunin-Lunenfeld Applied Research Unit, Baycrest Centre for Geriatric Care, Toronto, ON, Canada
  5. 5Department of Health Policy, Management and Evaluation, Faculty of Medicine, University of Toronto, Toronto, ON Canada
  1. Correspondence to: G M Anderson, Institute for Clinical Evaluative Sciences, 2075 Bayview Avenue, Toronto, ON M4N 3M5, Canada
  • Accepted 18 February 2005

Although confounding is an important problem of cohort studies, its effects can be minimised to enable valid comparison


In cohort studies, who does or does not receive an intervention is determined by practice patterns, personal choice, or policy decisions. This raises the possibility that the intervention and comparison groups may differ in characteristics that affect the study outcome, a problem called selection bias. If these characteristics have independent effects on the observed outcome in each group, they will create differences in outcomes between the groups apart from those related to the interventions being assessed. This effect is known as confounding.1 In the first paper in the series we dealt with the design and use of cohort studies and how to identify selection bias.2 This paper focuses on the definition and assessment of confounders.

What is a confounder?

For a characteristic to be a confounder in a particular study, it must meet two criteria.1 The first is that it must be related to the outcome in terms of prognosis or susceptibility. For example, in the study of the association between antipsychotic use and hip fracture that we considered in the first paper,2 age is known to be related to risk of hip fracture and therefore has the potential to be a confounder.

The second criterion that defines a confounder is that the distribution of the characteristic is different in the groups being compared. It can differ in terms of either the mean or the variability of that characteristic. For example, for age to be a confounder in a cohort study, either the average age or the variation in age in the groups being compared would have to be different. Assessing variation as well as average values is important because groups can have the same average value but very different variation. For example, one group with an average age of 70 could include only people aged 70 and another with the same average age could consist of equal proportions of individuals aged 50 and 90. Conversely, even a characteristic that is a strong predictor of outcome will not be a confounder if its distribution is balanced between the comparison groups.
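The age example can be checked numerically. The following is a minimal sketch, assuming hypothetical groups of four people each (the group sizes are chosen only for illustration), showing that two groups can share a mean age of 70 while differing greatly in spread.

```python
from statistics import mean, pstdev

# Hypothetical groups, sized only for illustration.
group_a = [70, 70, 70, 70]   # everyone aged 70
group_b = [50, 50, 90, 90]   # equal proportions aged 50 and 90

for name, ages in (("A", group_a), ("B", group_b)):
    # pstdev is the population standard deviation of the listed ages
    print(f"group {name}: mean age = {mean(ages)}, SD = {pstdev(ages)}")
```

A comparison of means alone would report these groups as identical for age, which is why the spread also needs to be examined.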

In assessing cohort studies, it is important to identify potential confounders and to examine their distribution in the intervention and comparison groups. Below we describe the three questions that need to be answered.

Has there been a systematic effort to identify and measure potential confounders?

Although currently available evidence helps identify potential confounders, the imperfect state of knowledge means that some characteristics related to the outcome may not have been discovered (unknown confounders). Even if a confounder is known, there may be insufficient data to evaluate it.

In randomised controlled trials, all potential confounders (known or unknown) are expected to be evenly distributed between the groups being compared.3 Cohort studies, however, have no similar protection against confounding and are especially vulnerable to unknown confounders. This does not mean that all cohort studies are inherently invalid. The unknown potential confounders may not have a large independent effect on the outcome of interest and, therefore, even if unevenly distributed, might not result in much bias. Unknown potential confounders may also be evenly distributed between the groups. Nevertheless, all cohort studies should recognise that unknown confounders could affect the results and, as outlined in the next article in this series,4 investigators should make an effort to determine how sensitive the results are to unknown confounders.
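The contrast between randomised and non-randomised allocation can be illustrated with a small simulation. This is only a sketch under stated assumptions: the cohort is simulated, age stands in for any prognostic characteristic (known or unknown), and the rule linking age to the chance of treatment (`p_treat`) is hypothetical.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical cohort: 10 000 people, ages drawn from a normal distribution.
ages = [random.gauss(75, 8) for _ in range(10_000)]

def mean_age_gap(treated_flags):
    """Mean age of treated people minus mean age of untreated people."""
    treated = [a for a, t in zip(ages, treated_flags) if t]
    untreated = [a for a, t in zip(ages, treated_flags) if not t]
    return mean(treated) - mean(untreated)

def p_treat(age):
    """Assumed selection rule: older people are more likely to be treated."""
    return min(max((age - 55) / 40, 0.0), 1.0)

# Randomised allocation: treatment decided by a coin toss.
randomised = [random.random() < 0.5 for _ in ages]

# Non-randomised allocation: treatment depends on age.
selected = [random.random() < p_treat(a) for a in ages]

gap_randomised = mean_age_gap(randomised)
gap_selected = mean_age_gap(selected)
print(f"age gap with random allocation:   {gap_randomised:.2f} years")
print(f"age gap with age-based selection: {gap_selected:.2f} years")
```

Under random allocation the age gap should be close to zero, whereas under the assumed selection rule the treated group is clearly older, whether or not anyone has measured age.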

Although unknown confounders are difficult to deal with in cohort studies, a systematic approach can be used to identify known confounders. This should start with a well designed search of comprehensive databases such as Medline. In the context of the study of the relation between antipsychotic use and the outcome of a hip fracture, a review of the literature suggests that risk factors for hip fracture can be broken down into four categories5-10:

  • Features of medical history—for example, stroke, osteoporosis

  • Exposure to drugs—for example, benzodiazepines, oestrogens

  • Demographics—for example, age and sex

  • Social and behavioural factors—for example, exercise and diet.

Once the potential confounders have been identified, the next step is to develop ways to measure these in the groups being studied. In many cases, especially when using administrative databases, it may not be possible to measure all known confounders. Even if they are measured, the reliability and validity of the measurement technique may be unclear. In the hip fracture and atypical antipsychotic example (see for details of how the cohort was created) we used administrative databases to measure known confounders. These databases are poor sources of information on behavioural and social factors. The failure to include measures of these factors has been identified as a key issue in cohort studies of hip fracture,11 and lack of control for lifestyle factors has been suggested to have a key role in the differences in risk of cardiovascular disease seen in cohort and randomised controlled studies of hormone replacement therapy.12 Although the administrative databases can provide some information on patient history, such as previous falls, they may underestimate the true prevalence of such events. It is important to know which confounders have been measured in the study and how well they have been measured.

Is there information on distribution of potential confounders between groups?

Information on the distribution of potential confounders in the intervention and comparison groups is usually provided in the first table of the paper. Confounding is a problem only if these characteristics are unevenly distributed between the intervention and comparison groups. The table provides information on potential confounders for two comparisons examining the association between atypical antipsychotic use and hip fracture. Tables similar to this should be included in all cohort studies so that the reader can have an overview of the potential for selection bias and confounding.

Baseline characteristics of study groups in comparisons of atypical antipsychotic versus no drug in all older people, and atypical versus typical antipsychotic drug in older people with dementia. Values are numbers (percentages) of patients unless stated otherwise

Cohort characteristics can confound only if they vary between comparison groups


What methods are used to assess differences in distribution of potential confounders?

Perhaps the most common strategy to identify important imbalances in individual confounders between intervention and comparison groups is to use significance tests such as χ² tests (for dichotomous variables) or t tests (for continuous variables). A problem with these tests is that the significance levels are sensitive to sample size, and the tests are usually not very meaningful when applied to studies with very large numbers of subjects (as is often the case for cohort studies). Under such circumstances, the differences may be significant but not clinically meaningful. For example, in the comparison restricted to people with dementia in the table, a difference of about three months in mean age between groups is significant (P < 0.001) but may not be clinically relevant. Alternatively, if the samples are small, differences that are clinically meaningful may not be significant. For these reasons significance testing is of little value for assessing differences between groups.
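The dependence of significance on sample size can be shown with a short calculation. The sketch below is illustrative only: it swaps the t test for a simpler z test with a common, known standard deviation, and the three month age difference, the standard deviation of 7 years, and the group sizes are all hypothetical values rather than figures from the table.

```python
from math import erfc, sqrt

def two_sided_p(diff, sd, n_per_group):
    """Two-sided P value from a z test for a difference in means,
    assuming equal group sizes and a common, known standard deviation."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = diff / standard_error
    # For a standard normal variable, P(|Z| > z) = erfc(z / sqrt(2))
    return erfc(abs(z) / sqrt(2))

# Hypothetical values: 0.25 years (three months) difference, SD of 7 years.
diff, sd = 0.25, 7.0

for n in (100, 20_000):
    print(f"n per group = {n}: P = {two_sided_p(diff, sd, n):.4f}")
```

The same three month difference is far from significant with 100 people per group yet falls below P = 0.001 with 20 000 per group, although its clinical relevance has not changed.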

An alternative to traditional significance testing is to use standardised differences, or effect sizes, to examine between group differences in patient characteristics. A standardised difference expresses the difference between group means in units of the standard deviation: the difference between groups is divided by the pooled standard deviation of the two groups. This measure is not as sensitive to sample size as traditional tests and provides a sense of the relative magnitude of differences. Standardised differences of greater than 0.1 (that is, 10% of the pooled standard deviation) are typically felt to be meaningful.13
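A standardised difference can be computed in a few lines. The sketch below assumes one common convention for the pooled standard deviation (the square root of the average of the two group variances); the text does not say which pooling formula was used, and the function names and input values are hypothetical.

```python
from math import sqrt

def std_diff_continuous(mean_1, sd_1, mean_2, sd_2):
    """Standardised difference for a continuous characteristic.
    Pooled SD assumed to be sqrt((sd_1^2 + sd_2^2) / 2)."""
    pooled_sd = sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)
    return (mean_1 - mean_2) / pooled_sd

def std_diff_binary(p_1, p_2):
    """Standardised difference for a dichotomous characteristic,
    where p_1 and p_2 are the proportions with it in each group."""
    pooled_sd = sqrt((p_1 * (1 - p_1) + p_2 * (1 - p_2)) / 2)
    return (p_1 - p_2) / pooled_sd

# Hypothetical inputs, not taken from the table.
age = std_diff_continuous(70.25, 7.0, 70.0, 7.0)  # three months' difference
history = std_diff_binary(0.30, 0.20)             # 30% v 20% with a history

for label, d in (("mean age", age), ("medical history", history)):
    flag = "meaningful" if abs(d) > 0.1 else "small"
    print(f"{label}: standardised difference = {d:.3f} ({flag})")
```

Neither function uses the sample size, so the result reflects only the size of the imbalance relative to the spread of the characteristic.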

In the table, traditional significance testing found that all 19 potential confounders were significantly different (P < 0.001) in comparison 1, and that 13 of the 19 characteristics had standardised differences greater than 0.1. Of particular note is the large standardised difference for history of dementia. Restriction of the study to people with dementia eliminates the possibility of confounding from this characteristic. For comparison 2, traditional significance tests showed that 8 of the 18 potential confounders were significantly different (P < 0.001) but only two had a standardised difference greater than 0.1. The use of the standardised differences technique shows that comparison 1 has substantial selection bias, particularly for dementia, whereas comparison 2 has much less potential for bias.

Both traditional significance testing and standardised differences focus on one potential confounder at a time and do not provide an overall perspective on how the comparison groups differ. For example, two groups could have the same mean age and proportion of women, but one could contain old men and young women and the other old women and young men. An increasingly common approach to the analysis of cohort studies of health care interventions is to use propensity score methods14 15—a technique that involves multivariate assessment of confounders (see for a brief discussion and an example).
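The limitation of examining one characteristic at a time can be made concrete. In this minimal sketch the two groups are hypothetical and contain four people each; they match on mean age and on the proportion of women, yet the age of the women differs by 20 years.

```python
from statistics import mean

# Hypothetical groups: each person is recorded as (age, sex).
group_1 = [(80, "M"), (80, "M"), (60, "F"), (60, "F")]  # old men, young women
group_2 = [(60, "M"), (60, "M"), (80, "F"), (80, "F")]  # young men, old women

def summary(group):
    """Marginal summaries plus one joint summary (mean age of women)."""
    ages = [age for age, _ in group]
    ages_of_women = [age for age, sex in group if sex == "F"]
    return {
        "mean_age": mean(ages),
        "prop_women": len(ages_of_women) / len(group),
        "mean_age_women": mean(ages_of_women),
    }

for name, group in (("1", group_1), ("2", group_2)):
    print(f"group {name}: {summary(group)}")
```

Only a summary that considers age and sex together reveals the imbalance, which is the motivation for multivariate approaches.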

Selection bias in cohort studies can result in confounding. Here we have defined questions that can help identify potential confounders. In the next article we will examine statistical methods that can be used to reduce the effect of confounding and strategies that can be used to determine if the results of a study are plausible.

Key questions

Has there been a systematic effort to identify and measure potential confounders?

Is there information on how the potential confounders are distributed between the comparison groups?

What methods are used to assess differences in the distribution of potential confounders?

This is the second of three articles on appraising cohort studies


We thank Jennifer Gold and Monica Lee for help in preparing the manuscript.


  • Contributors and sources The series is based on discussions that took place at regular meetings of the Canadian Institute for Health Research chronic disease new emerging team. MM is a clinician with extensive research experience in cohort studies of prescription drugs who wrote the first draft of this article and is the guarantor. SLTN, DLS, and PCA are statisticians who commented on drafts of this paper. KS and PL programmed and conducted analyses. PAR and GMA conceived the idea for the series and GMA worked on drafts of this article and coordinated the development of the series.

  • Funding This work was supported by a CIHR operating grant (CIHR No MOP 53124) and a CIHR chronic disease new emerging team programme (NET-54010).

  • Competing interests None declared.

  • Further details on the study cohort and propensity scores are on

