Epidemiological methods are widely applied in medical research, and even doctors who do not themselves carry out surveys will find that their clinical practice is influenced by epidemiological observations. Which oral contraceptive is the best option for a woman of 35? What prognosis should be given to parents whose daughter has developed spinal scoliosis? What advice should be given to the patient who is concerned about newspaper reports that living near electric power lines causes cancer? To answer questions such as these, the doctor must be able to understand and interpret epidemiological reports.
Interpretation is not always easy, and studies may produce apparently inconsistent results. One week a survey is published suggesting that low levels of alcohol intake reduce mortality. The next, a report concludes that any alcohol at all is harmful. How can such discrepancies be reconciled? This chapter sets out a framework for the assessment of epidemiological data, breaking the exercise down into three major components.
The first step in evaluating a study is to identify any major potential for bias. Almost all epidemiological studies are subject to bias of one sort or another. This does not mean that they are scientifically unacceptable and should be disregarded. However, it is important to assess the probable impact of biases and to allow for them when drawing conclusions. In what direction is each bias likely to have affected outcome, and by how much?
If the study has been reported well, the investigators themselves will have addressed this question. They may even have collected data to help quantify bias. In a survey of myopia and its relation to reading in childhood, information was gathered about the use of spectacles and the educational history of subjects who were unavailable for examination. This helped to establish the scope for bias from the incomplete response. Usually, however, evaluation of bias is a matter of judgement.
When looking for possible biases, three aspects of a study are particularly worth considering:
How were subjects selected for investigation, and how representative were they of the target population with regard to the study question?
What was the response rate, and might responders and nonresponders have differed in important ways? As with the choice of the study sample, incomplete response matters only if responders are atypical in relation to the study question.
How accurately were exposure and outcome variables measured? Here the scope for bias will depend on the study question and on the pattern of measurement error. Random errors in assessing intelligence quotient (IQ) will produce no bias at all if the aim is simply to estimate the mean score for a population. On the other hand, in a study of the association between low IQ and environmental exposure to lead, random measurement errors would tend to obscure any relation – that is, to bias estimates of relative risk towards one. If the errors in measurement were nonrandom, the bias would be different again. For example, if IQs were selectively under-recorded in subjects with high lead exposure, the effect would be to exaggerate risk estimates.
There is no simple formula for assessing biases. Each must be considered on its own merits in the context of the study question.
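The attenuating effect of random, nondifferential measurement error can be sketched numerically. In this illustration (all counts hypothetical), exposure is misclassified with the same sensitivity and specificity in cases and controls, which pulls the observed odds ratio from its true value towards one:

```python
# Hypothetical sketch: nondifferential (random) error in classifying
# exposure biases the odds ratio towards one.

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: a,b = exposed/unexposed cases; c,d = controls."""
    return (a * d) / (b * c)

# True counts in a hypothetical case-control study (true OR = 4.0).
cases_exposed, cases_unexposed = 100, 50
controls_exposed, controls_unexposed = 50, 100

# Imperfect exposure measurement, identical in cases and controls (nondifferential).
sensitivity, specificity = 0.8, 0.9

def misclassify(exposed, unexposed):
    """Expected observed counts after imperfect exposure classification."""
    obs_exposed = exposed * sensitivity + unexposed * (1 - specificity)
    obs_unexposed = exposed * (1 - sensitivity) + unexposed * specificity
    return obs_exposed, obs_unexposed

true_or = odds_ratio(cases_exposed, cases_unexposed,
                     controls_exposed, controls_unexposed)
oc_e, oc_u = misclassify(cases_exposed, cases_unexposed)
oo_e, oo_u = misclassify(controls_exposed, controls_unexposed)
observed_or = odds_ratio(oc_e, oc_u, oo_e, oo_u)

print(f"true OR = {true_or:.2f}, observed OR = {observed_or:.2f}")
# The observed OR is pulled from 4.0 towards 1 (about 2.6 here).
```

Differential error (for example, under-recording confined to the highly exposed) would instead shift the estimate in a direction that depends on the pattern of the error.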
Even after biases have been taken into account, study samples may be unrepresentative just by chance. An indication of the potential for such chance effects is provided by statistical analysis.
Traditionally, statistical inference has been based on hypothesis testing. This can most easily be understood if the study sample is viewed in the context of the larger target population about which conclusions are to be drawn. A null hypothesis about the target population is formulated. Then, starting with this null hypothesis and with the assumption that the study sample is an unbiased subset of the target population, a p value is calculated. This is the probability of obtaining, simply by chance, an outcome in the study sample that deviates from the null hypothesis at least as much as that observed. For example, in a case-control study of the relation between renal stones and dietary oxalate, the null hypothesis might be that in the target population from which the study sample was derived there is no association between renal stones and oxalate intake. A p value of 0.05 would imply that under this assumption of no overall association between renal stones and oxalate, the probability of selecting a random sample in which the association was as strong as that observed in the study would be one in 20. The lower the calculated p value, the more one is inclined to reject the null hypothesis and adopt a contrary view – for example, that there is an association between dietary oxalate and renal stones. Often a p value below a stated threshold (for example, 0.05) is deemed to be (statistically) significant, but this threshold is arbitrary. There is no reason to attach much greater importance to a p value of 0.049 than to a value of 0.051.
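As a concrete sketch (all counts hypothetical), a p value for a 2x2 case-control table such as the oxalate example can be obtained from Pearson's chi-squared test; with one degree of freedom the survival function of the chi-squared distribution is erfc(sqrt(x/2)), which the standard library provides:

```python
import math

def chi2_p_2x2(a, b, c, d):
    """Pearson chi-squared test (1 df) for a 2x2 table.
    Returns (statistic, p value); for 1 df, p = erfc(sqrt(x/2))."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical counts: high/low oxalate intake among cases with renal
# stones (40, 60) and among controls (25, 75).
stat, p = chi2_p_2x2(40, 60, 25, 75)
print(f"chi-squared = {stat:.2f}, p = {p:.3f}")
```

With these invented counts p falls a little below the conventional 0.05 threshold; shifting a handful of subjects between cells would move it just above, underlining how arbitrary the cut-off is.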
A p value depends not only on the magnitude of any deviation from the null hypothesis, but also on the size of the sample in which that deviation was observed. Failure to achieve a specified level of statistical significance will have different implications according to the size of the study. A common error is to weigh “positive” studies, which find an association to be significant, against “negative” studies, in which it is not. Two case-control studies could indicate similar odds ratios, but because they differed in size one might be significant and the other not. Clearly such findings would not be incompatible.
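The point about sample size can be made numerically. In this sketch (counts hypothetical), two case-control studies have identical odds ratios, but only the larger reaches conventional significance:

```python
import math

def chi2_p_2x2(a, b, c, d):
    """Pearson chi-squared (1 df) p value for a 2x2 table."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.erfc(math.sqrt(stat / 2))

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

small = (8, 12, 5, 15)      # small hypothetical study
large = (80, 120, 50, 150)  # same proportions, ten times the size

or_small, p_small = odds_ratio(*small), chi2_p_2x2(*small)
or_large, p_large = odds_ratio(*large), chi2_p_2x2(*large)
print(f"small: OR = {or_small:.1f}, p = {p_small:.3f}")
print(f"large: OR = {or_large:.1f}, p = {p_large:.3f}")
# Both odds ratios are 2.0, but only the larger study gives p < 0.05.
```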
Because of the limitations of the p value as a summary statistic, epidemiologists today prefer to base statistical inference on confidence intervals. A statistic of the study sample, such as an odds ratio or a mean haemoglobin concentration, provides an estimate of the corresponding population parameter (the odds ratio or mean haemoglobin concentration in the target population from which the sample was derived). Because the study sample may by chance be atypical, there is uncertainty about the estimate. A confidence interval is a range within which, assuming there are no biases in the study method, the true value for the population parameter might be expected to lie. Most often, 95% confidence intervals are calculated. The formula for the 95% confidence interval is set in such a way that on average 19 out of 20 such intervals will include the population parameter. Large samples are less prone to chance error than small samples, and therefore give tighter confidence intervals.
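For an odds ratio, a 95% confidence interval is conventionally obtained on the log scale (Woolf's method): take log(OR) plus or minus 1.96 standard errors, where the standard error is the square root of the sum of the reciprocals of the four cell counts, then exponentiate. A minimal sketch with hypothetical counts:

```python
import math

# Hypothetical 2x2 table: a,b = exposed/unexposed cases;
# c,d = exposed/unexposed controls.
a, b, c, d = 40, 60, 25, 75

or_hat = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)  # SE of log(OR), Woolf's method
lo = math.exp(math.log(or_hat) - 1.96 * se_log_or)
hi = math.exp(math.log(or_hat) + 1.96 * se_log_or)
print(f"OR = {or_hat:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

Note how the interval is asymmetric about the point estimate on the original scale, and that multiplying all four counts by the same factor would narrow it, reflecting the tighter intervals given by larger samples.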
Whether statistical inference is based on hypothesis testing or confidence intervals, the results must be viewed in context. Assessment of the contribution of chance to an observation should also take into account the findings of other studies. An epidemiological association might be highly significant statistically, but if it is completely at variance with the balance of evidence from elsewhere, then it could still legitimately be attributed to chance. For example, if a cohort study with no obvious biases suggested that smoking protected against lung cancer, and no special explanation could be found, we would probably conclude that this was a fluke result. Unlike p values or confidence intervals, the weight that is attached to evidence from other studies cannot be precisely quantified.
Confounding versus causality
If an association is real and not explained by bias or chance, the question remains as to how far it is causal and how far the result of confounding. The influence of some confounders may have been eliminated by matching or by appropriate statistical analysis. However, especially in observational studies, the possibility of unrecognised residual confounding remains. Assessment of whether an observed association is causal depends in part on what is known about the biology of the relation. In addition, certain characteristics of the association may encourage a causal interpretation. A dose-response relation in which risk increases progressively with higher exposure is generally held to favour causality, although in theory it might arise through confounding. In the case of hazards suspected of acting early in a disease process, such as genotoxic carcinogens, a latent interval between first exposure and the manifestation of increased risk would also support a causal association. Also important is the magnitude of the association as measured by the relative risk or odds ratio. If an association is to be completely explained by confounding then the confounder must carry an even higher relative risk for the disease and also be strongly associated with the exposure under study. A powerful risk factor with, say, a 10-fold relative risk for the disease would probably be recognised and identified as a potential confounder.
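The arithmetic of confounding can be sketched with hypothetical counts: here a crude odds ratio of 2.25 disappears entirely once the data are stratified by a confounder, because within each stratum the odds ratio is 1.0.

```python
# Hypothetical counts in which a crude association is entirely explained
# by a confounder: within each stratum of the confounder the OR is 1.0.

def odds_ratio(a, b, c, d):
    """a,b = exposed/unexposed cases; c,d = exposed/unexposed controls."""
    return (a * d) / (b * c)

stratum1 = (80, 20, 40, 10)  # confounder present
stratum2 = (10, 40, 20, 80)  # confounder absent

# Collapsing over the confounder gives the crude table (90, 60, 60, 90).
crude = tuple(x + y for x, y in zip(stratum1, stratum2))

print(f"crude OR     = {odds_ratio(*crude):.2f}")
print(f"stratum 1 OR = {odds_ratio(*stratum1):.2f}")
print(f"stratum 2 OR = {odds_ratio(*stratum2):.2f}")
```

In these invented data the confounder is itself strongly associated with the disease (among the unexposed, its odds ratio for disease is 4.0, well above the crude 2.25), in keeping with the point that a confounder capable of explaining an association must be a stronger risk factor than the exposure it mimics.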
The evaluation of possible pathogenic mechanisms and the importance attached to dose-response relations and evidence of latency are also a matter of judgement. It is because there are so many subjective elements to the interpretation of epidemiological findings that experts do not always agree. However, if sufficient data are available then a reasonable consensus can usually be achieved.