Chapter 4. Measurement error and bias

Epidemiological studies measure characteristics of populations. The parameter of interest may be a disease rate, the prevalence of an exposure, or more often some measure of the association between an exposure and disease. Because studies are carried out on people and have all the attendant practical and ethical constraints, they are almost invariably subject to bias.

Selection bias

Selection bias occurs when the subjects studied are not representative of the target population about which conclusions are to be drawn. Suppose that an investigator wishes to estimate the prevalence of heavy alcohol consumption (more than 21 units a week) in adult residents of a city. He might try to do this by selecting a random sample from all the adults registered with local general practitioners, and sending them a postal questionnaire about their drinking habits. With this design, one source of error would be the exclusion from the study sample of those residents not registered with a doctor. These excluded subjects might have different patterns of drinking from those included in the study. Also, not all of the subjects selected for study will necessarily complete and return questionnaires, and non-responders may have different drinking habits from those who take the trouble to reply. Both of these deficiencies are potential sources of selection bias. The possibility of selection bias should always be considered when defining a study sample. Furthermore, when responses are incomplete, the scope for bias must be assessed. The problems of incomplete response to surveys are considered further in.

Information bias

The other major class of bias arises from errors in measuring exposure or disease. In a study to estimate the relative risk of congenital malformations associated with maternal exposure to organic solvents such as white spirit, mothers of malformed babies were questioned about their contact with such substances during pregnancy, and their answers were compared with those from control mothers with normal babies. With this design there was a danger that "case" mothers, who were highly motivated to find out why their babies had been born with an abnormality, might recall past exposure more completely than controls. If so, a bias would result with a tendency to exaggerate risk estimates.
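The size of such a recall bias can be illustrated with simple arithmetic. The sketch below is not taken from the study described: the counts and recall probabilities are hypothetical, the odds ratio is used as the estimate of relative risk (as is usual in case-control designs), and it is assumed that no unexposed mother wrongly reports exposure.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio from a case-control two by two table."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def reported_counts(n, true_exposed_fraction, recall_probability):
    """Expected numbers reporting and not reporting exposure when only a
    fraction of the truly exposed subjects recall it (no false reports)."""
    reported = n * true_exposed_fraction * recall_probability
    return reported, n - reported


# Hypothetical figures: 1000 case mothers and 1000 control mothers,
# 20% of each group truly exposed, so the true odds ratio is 1.
N = 1000
TRUE_EXPOSED = 0.20

true_or = odds_ratio(200, 800, 200, 800)

# Assumed recall: case mothers remember 90% of exposures, controls only 60%.
case_exposed, case_unexposed = reported_counts(N, TRUE_EXPOSED, recall_probability=0.9)
ctrl_exposed, ctrl_unexposed = reported_counts(N, TRUE_EXPOSED, recall_probability=0.6)
observed_or = odds_ratio(case_exposed, case_unexposed, ctrl_exposed, ctrl_unexposed)

print(f"true odds ratio: {true_or:.2f}")
print(f"odds ratio with differential recall: {observed_or:.2f}")
```

Under these assumptions an exposure with no real effect appears to raise risk by about 60%, purely because of the difference in recall between the two groups.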

Another study looked at risk of hip osteoarthritis according to physical activity at work, cases being identified from records of admission to hospital for hip replacement. Here there was a possibility of bias because subjects with physically demanding jobs might be more handicapped by a given level of arthritis and therefore seek treatment more readily.

Bias cannot usually be totally eliminated from epidemiological studies. The aim, therefore, must be to keep it to a minimum, to identify those biases that cannot be avoided, to assess their potential impact, and to take this into account when interpreting results. The motto of the epidemiologist could well be "dirty hands but a clean mind" (manus sordidae, mens pura).

Measurement error

As indicated above, errors in measuring exposure or disease can be an important source of bias in epidemiological studies. In conducting studies, therefore, it is important to assess the quality of measurements. An ideal survey technique is valid (that is, it measures accurately what it purports to measure). Sometimes a reliable standard is available against which the validity of a survey method can be assessed. For example, a sphygmomanometer's validity can be measured by comparing its readings with intra-arterial pressures, and the validity of a mammographic diagnosis of breast cancer can be tested (if the woman agrees) by biopsy. More often, however, there is no sure reference standard. The validity of a questionnaire for diagnosing angina cannot be fully known: clinical opinion varies among experts, and even coronary arteriograms may be normal in true cases or abnormal in symptomless people. The pathologist can describe changes at necropsy, but these may say little about the patient's symptoms or functional state. Measurements of disease in life are often incapable of full validation.

In practice, therefore, validity may have to be assessed indirectly. Two approaches are used commonly. A technique that has been simplified and standardised to make it suitable for use in surveys may be compared with the best conventional clinical assessment. A self administered psychiatric questionnaire, for instance, may be compared with the majority opinion of a psychiatric panel. Alternatively, a measurement may be validated by its ability to predict future illness. Validation by predictive ability may, however, require the study of many subjects.

Analysing validity

When a survey technique or test is used to dichotomise subjects (for example, as cases or non-cases, exposed or not exposed) its validity is analysed by classifying subjects as positive or negative, firstly by the survey method and secondly according to the standard reference test. The findings can then be expressed in a contingency table as shown below.

Table 4.1 Comparison of a survey test with a reference test

                      Reference test result
Survey test result    Positive                                     Negative                                     Totals
Positive              True positives correctly identified = (a)    False positives = (b)                        Total test positives = (a + b)
Negative              False negatives = (c)                        True negatives correctly identified = (d)    Total test negatives = (c + d)
Totals                Total true positives = (a + c)               Total true negatives = (b + d)               Grand total = (a + b + c + d)

From this table four important statistics can be derived:

Sensitivity - A sensitive test detects a high proportion of the true cases, and this quality is measured here by a/(a + c).

Specificity - A specific test has few false positives, and this quality is measured by d/(b + d).

Systematic error - For epidemiological rates it is particularly important for the test to give the right total count of cases. This is measured by the ratio of the total numbers positive to the survey and the reference tests, or (a + b)/(a + c).

Predictive value - This is the proportion of positive test results that are truly positive, or a/(a + b). It is important in screening, and will be discussed further in Chapter 10.
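As a minimal sketch, the four statistics can be computed directly from the cell counts a, b, c, and d of Table 4.1. The class name and the example counts below are hypothetical and serve only to illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class ValidityTable:
    a: int  # survey positive, reference positive (true positives)
    b: int  # survey positive, reference negative (false positives)
    c: int  # survey negative, reference positive (false negatives)
    d: int  # survey negative, reference negative (true negatives)

    def sensitivity(self) -> float:
        # Proportion of true cases detected by the survey test.
        return self.a / (self.a + self.c)

    def specificity(self) -> float:
        # Proportion of true non-cases correctly called negative.
        return self.d / (self.b + self.d)

    def systematic_error(self) -> float:
        # Ratio of the survey test's case count to the reference count.
        return (self.a + self.b) / (self.a + self.c)

    def predictive_value(self) -> float:
        # Proportion of positive survey results that are truly positive.
        return self.a / (self.a + self.b)


# Hypothetical counts for illustration only.
table = ValidityTable(a=40, b=30, c=10, d=920)
print(f"sensitivity      {table.sensitivity():.2f}")
print(f"specificity      {table.specificity():.3f}")
print(f"systematic error {table.systematic_error():.2f}")
print(f"predictive value {table.predictive_value():.2f}")
```

In this made up example the survey test overcounts cases by 40% (systematic error 1.4) even though its specificity is above 96%, because non-cases greatly outnumber cases.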

It should be noted that both systematic error and predictive value depend on the relative frequency of true positives and true negatives in the study sample (that is, on the prevalence of the disease or exposure that is being measured).
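The dependence on prevalence can be seen by holding sensitivity and specificity fixed and varying the proportion of true positives. The following sketch uses hypothetical values (90% sensitivity and 90% specificity) and works with expected proportions rather than observed counts.

```python
def expected_statistics(sensitivity, specificity, prevalence):
    """Expected systematic error and predictive value for a test with the
    given sensitivity and specificity, applied at the given prevalence."""
    a = sensitivity * prevalence                # true positives
    c = (1 - sensitivity) * prevalence          # false negatives
    b = (1 - specificity) * (1 - prevalence)    # false positives
    systematic_error = (a + b) / (a + c)
    predictive_value = a / (a + b)
    return systematic_error, predictive_value


# Hypothetical test characteristics, for illustration only.
SENSITIVITY = 0.9
SPECIFICITY = 0.9

summary = {}
for prevalence in (0.01, 0.10, 0.50):
    summary[prevalence] = expected_statistics(SENSITIVITY, SPECIFICITY, prevalence)
    error, value = summary[prevalence]
    print(f"prevalence {prevalence:.2f}: systematic error {error:.2f}, "
          f"predictive value {value:.2f}")
```

With these assumed figures the same test overestimates the number of cases roughly tenfold when prevalence is 1%, yet gives the right total when prevalence is 50%.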

Sensitive or specific? A matter of choice

If the criteria for a positive test result are stringent then there will be few false positives but the test will be insensitive. Conversely, if criteria are relaxed then there will be fewer false negatives but the test will be less specific. In a survey of breast cancer alternative diagnostic criteria were compared with the results of a reference test (biopsy). Clinical palpation by a doctor yielded fewest false positives (93% specificity), but missed half the cases (50% sensitivity). Criteria for diagnosing "a case" were then relaxed to include all the positive results identified by doctor's palpation, nurse's palpation, or x ray mammography: few cases were then missed (94% sensitivity), but specificity fell to 86%.

By choosing the right test and cut off points it may be possible to get the balance of sensitivity and specificity that is best for a particular study. In a survey to establish prevalence this might be when false positives balance false negatives. In a study to compare rates in different populations the absolute rates are less important, the primary concern being to avoid systematic bias in the comparisons: a specific test may well be preferred, even at the price of some loss of sensitivity.
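The effect of moving a cut off point can be sketched as follows. The measurements are invented for illustration (arbitrary units, higher values suggesting disease), and the rule that values at or above the cut off count as positive is an assumption of the example.

```python
# Hypothetical measurements in subjects classified by a reference test.
cases = [4, 5, 6, 6, 7, 8, 8, 9, 9, 10]
non_cases = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 2, 3, 1, 4, 3, 2]


def classify(cut_off):
    """Return (sensitivity, specificity, false positives, false negatives)
    when values at or above cut_off are counted as test positive."""
    true_pos = sum(v >= cut_off for v in cases)
    false_neg = len(cases) - true_pos
    false_pos = sum(v >= cut_off for v in non_cases)
    true_neg = len(non_cases) - false_pos
    return (true_pos / len(cases), true_neg / len(non_cases),
            false_pos, false_neg)


results = {cut: classify(cut) for cut in range(3, 10)}
for cut, (sens, spec, false_pos, false_neg) in results.items():
    print(f"cut off {cut}: sensitivity {sens:.2f}, specificity {spec:.2f}, "
          f"false positives {false_pos}, false negatives {false_neg}")

# For a prevalence survey: the cut off at which false positives most
# nearly balance false negatives.
balanced = min(results, key=lambda cut: abs(results[cut][2] - results[cut][3]))
print("most nearly balanced cut off:", balanced)
```

Raising the cut off trades sensitivity for specificity. A comparative study that favours specificity would choose a higher cut off than the balanced one found here.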


Repeatability

When there is no satisfactory standard against which to assess the validity of a measurement technique, then examining its repeatability is often helpful. Consistent findings do not necessarily imply that the technique is valid: a laboratory test may yield persistently false positive results, or a very repeatable psychiatric questionnaire may be an insensitive measure of, for example, "stress". However, poor repeatability indicates either poor validity or that the characteristic that is being measured varies over time. In either of these circumstances results must be interpreted with caution.

Repeatability can be tested within observers (that is, the same observer performing the measurement on two separate occasions) and also between observers (comparing measurements made by different observers on the same subject or specimen). Assessment of repeatability may be built into a study - a sample of people undergoing a second examination or a sample of radiographs, blood samples, and so on being tested in duplicate. Even a small sample is valuable, provided that (1) it is representative and (2) the duplicate tests are genuinely independent. If testing is done "off line" (perhaps as part of a pilot study) then particular care is needed to ensure that subjects, observers, and operating conditions are all adequately representative of the main study. It is much easier to test repeatability when material can be transported and stored - for example, deep frozen plasma samples, histological sections, and all kinds of tracings and photographs. However, such tests may exclude an important source of observer variation - namely the techniques of obtaining samples and records.

Reasons for variation in replicate measurements

Independent replicate measurements in the same subjects are usually found to vary more than one's gloomiest expectations. To interpret the results, and to seek remedies, it is helpful to dissect the total variability into its four components:

Within observer variation - Discovering one's own inconsistency can be traumatic; it highlights a lack of clear criteria of measurement and interpretation, particularly in dealing with the grey area between "normal" and "abnormal". It is largely random - that is, unpredictable in direction.

Between observer variation - This includes the first component (the instability of individual observers), but adds to it an extra and systematic component due to individual differences in techniques and criteria. Unfortunately, this may be large in relation to the real difference between groups that it is hoped to identify. It may be possible to avoid this problem, either by using a single observer or, if material is transportable, by forwarding it all for central examination. Alternatively, the bias within a survey may be neutralised by random allocation of subjects to observers. Each observer should be identified by a code number on the survey record; analysis of results by observer will then indicate any major problems, and perhaps permit some statistical correction for the bias.

Random subject variation - When measured repeatedly in the same person, physiological variables like blood pressure tend to show a roughly normal distribution around the subject's mean. Nevertheless, surveys usually have to make do with a single measurement, and the imprecision will not be noticed unless the extent of subject variation has been studied. Random subject variation has some important implications for screening and also in clinical practice, when people with extreme initial values are recalled. Thanks to a statistical quirk this group then seems to improve because its members include some whose mean value is normal but who by chance had higher values at first examination: on average, their follow up values necessarily tend to fall (regression to the mean). The size of this effect depends on the amount of random subject variation. Misinterpretation can be avoided by repeat examinations to establish an adequate baseline, or (in an intervention study) by including a control group.

Biased (systematic) subject variation - Blood pressure is much influenced by the temperature of the examination room, as well as by less readily standardised emotional factors. Surveys to detect diabetes find a much higher prevalence in the afternoon than in the morning; and the standard bronchitis questionnaire possibly elicits more positive responses in winter than in summer. Thus conditions and timing of an investigation may have a major effect on an individual's true state and on his or her responses. As far as possible, studies should be designed to control for this - for example, by testing for diabetes at one time of day. Alternatively, a variable such as room temperature can be measured and allowed for in the analysis.
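Regression to the mean, described above under random subject variation, can be illustrated by simulation. The sketch below is an assumption laden toy model: each subject has a stable true mean, each measurement adds independent normally distributed variation, and all the numerical values (population mean, standard deviations, recall threshold) are hypothetical. No treatment is applied between the two examinations.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the run is reproducible

N_SUBJECTS = 10_000
POPULATION_MEAN = 80.0   # hypothetical mean of subjects' true values
BETWEEN_SD = 10.0        # hypothetical spread of true values between subjects
WITHIN_SD = 8.0          # hypothetical random subject variation
RECALL_THRESHOLD = 95.0  # subjects at or above this at first examination are recalled

true_means = [random.gauss(POPULATION_MEAN, BETWEEN_SD) for _ in range(N_SUBJECTS)]
first = [random.gauss(m, WITHIN_SD) for m in true_means]
second = [random.gauss(m, WITHIN_SD) for m in true_means]

# Select subjects on the basis of an extreme first measurement only.
recalled = [i for i, value in enumerate(first) if value >= RECALL_THRESHOLD]

first_mean = statistics.mean(first[i] for i in recalled)
second_mean = statistics.mean(second[i] for i in recalled)

print(f"subjects recalled: {len(recalled)}")
print(f"mean at first examination:  {first_mean:.1f}")
print(f"mean at second examination: {second_mean:.1f}")
```

The recalled group's mean falls at the second examination although nothing has changed in any subject. Increasing WITHIN_SD relative to BETWEEN_SD enlarges the apparent improvement, in line with the statement that the size of the effect depends on the amount of random subject variation.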

Analysing repeatability

The repeatability of measurements of continuous numerical variables such as blood pressure can be summarised by the standard deviation of replicate measurements or by their coefficient of variation (standard deviation/mean). When pairs of measurements have been made, either by the same observer on two different occasions or by two different observers, a scatter plot will conveniently show the extent and pattern of observer variation.
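For duplicate measurements, one common way of estimating the standard deviation of replicates is from the paired differences. The sketch below uses that estimator (an assumption, since the text does not specify one) and divides by the overall mean to obtain the coefficient of variation. The readings are hypothetical.

```python
import math

# Hypothetical duplicate systolic blood pressure readings (mm Hg),
# one pair per subject.
pairs = [(120, 124), (138, 130), (110, 114), (150, 146), (128, 128)]


def replicate_sd(pairs):
    """Standard deviation of replicate measurements estimated from pairs:
    the square root of (sum of squared differences / (2 * number of pairs))."""
    squared_differences = sum((x - y) ** 2 for x, y in pairs)
    return math.sqrt(squared_differences / (2 * len(pairs)))


def coefficient_of_variation(pairs):
    """Replicate standard deviation divided by the mean of all readings."""
    overall_mean = sum(x + y for x, y in pairs) / (2 * len(pairs))
    return replicate_sd(pairs) / overall_mean


sd = replicate_sd(pairs)
cv = coefficient_of_variation(pairs)
print(f"standard deviation of replicates: {sd:.2f} mm Hg")
print(f"coefficient of variation: {100 * cv:.1f}%")
```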

For qualitative attributes, such as clinical symptoms and signs, the results are first set out as a contingency table:

Table 4.2 Comparison of results obtained by two observers


              Observer 1
Observer 2    Positive    Negative
Positive      a           b
Negative      c           d

The overall level of agreement could be represented by the proportion of the total in cells a and d. This measure unfortunately turns out to depend more on the prevalence of the condition than on the repeatability of the method. This is because in practice it is easy to agree on a straightforward negative; disagreements depend on the prevalence of the difficult borderline cases. Instead, therefore, repeatability is usually summarised by the kappa (κ) statistic, which measures the level of agreement over and above what would be expected from the prevalence of the attribute.
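A statistic of this kind in common use is Cohen's kappa, and the sketch below assumes that is the one intended. It takes the four cells of the two observer table (a = both positive, b and c = the two kinds of disagreement, d = both negative) and compares observed agreement with the agreement expected from the marginal totals alone. The counts are hypothetical.

```python
def crude_agreement(a, b, c, d):
    """Proportion of subjects on whom the two observers agree."""
    return (a + d) / (a + b + c + d)


def kappa(a, b, c, d):
    """Cohen's kappa for a two by two table of paired observations."""
    n = a + b + c + d
    observed = (a + d) / n
    # Agreement expected by chance, given each observer's marginal totals.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1 - expected)


# Two hypothetical surveys with the same crude agreement (85%) but
# different prevalence of the attribute.
common = dict(a=20, b=5, c=10, d=65)
rare = dict(a=2, b=5, c=10, d=83)

for label, cells in (("common attribute", common), ("rare attribute", rare)):
    print(f"{label}: crude agreement {crude_agreement(**cells):.2f}, "
          f"kappa {kappa(**cells):.2f}")
```

In this illustration crude agreement is identical in the two tables, but kappa is much lower when the attribute is rare, because most of the agreement then consists of easy negatives.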