Validating scales and indexes
BMJ 2002; 324 doi: http://dx.doi.org/10.1136/bmj.324.7337.606 (Published 09 March 2002) Cite this as: BMJ 2002;324:606
- J Martin Bland, professor of medical statistics (a)
- Douglas G Altman, professor of statistics in medicine (b)
- (a) Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
- (b) Cancer Research UK Medical Statistics Group, Centre for Statistics in Medicine, Institute for Health Sciences, Oxford OX3 7LF
- See also: Papers p 569
- Correspondence to: Professor Bland
An index of quality is a measurement like any other, whether it is assessing a website, as in today's BMJ,1 a clinical trial used in a meta-analysis,2 or the quality of life experienced by a patient.3 As with all measurements, we have to decide whether it measures what we want it to measure, and how well.
The simplest measurements, such as length and distance, can be validated by an objective criterion. The earliest criteria must have been biological: the length of a pace, a foot, a thumb. The obvious problem, that the criterion varies from person to person, was eventually solved by establishing a fundamental unit and defining all others in terms of it. Other measurements can then be defined in terms of a fundamental unit. To define a unit of weight we find a handy substance which appears the same everywhere, such as water. The unit of weight is then the weight of a volume of water specified in the basic unit of length, such as 100 cubic centimetres. Such measurements have criterion validity, meaning that we can take some known quantity and compare our measurement with it.
For some measurements no such standard is possible. Cardiac stroke volume, for example, can be measured only indirectly. Direct measurement, by collecting all the blood pumped out of the heart over a series of beats, would involve rather drastic interference with the system. Our criterion becomes agreement with another indirect measurement. Indeed, we sometimes have to use as a standard a method which we know produces inaccurate measurements.
Some quantities are even more difficult to measure and evaluate. Cardiac stroke volume does at least have an objective reality; a physical quantity of blood is pumped out of the heart when it beats. Anxiety and depression do not have a physical reality but are useful artificial constructs. They are measured by questionnaire scales, where answers to a series of questions related to the concept we want to measure are combined to give a numerical score. Website quality is similar. We are measuring a quantity which is not precisely defined, and there is no instrument with which we can compare any measure we might devise. How are we to assess the validity of such a scale?
The relevant theory was developed in the social sciences in the context of questionnaire scales.4 First we might ask whether the scale looks right, whether it asks about the sorts of thing which we think of as being related to anxiety or website quality. If it appears to be correct, we call this face validity. Next we might ask whether it covers all the aspects which we want to measure. A phobia scale which asked about fear of dogs, spiders, snakes, and cats but ignored height, confined spaces, and crowds would not do this. We call appropriate coverage of the subject matter content validity.
Our scale may look right and cover the right things, but what other evidence can we bring to the question of validity? One question we can ask is whether our score has the relationships with other variables that we would expect. For example, does an anxiety measure distinguish between psychiatric patients and medical patients? Do we get different anxiety scores from students before and after an examination? Does a measure of depression predict suicide attempts? We call the property of having appropriate relationships with other variables construct validity.
We can also ask whether the items which together compose the scale are related to one another: does the scale have internal consistency? If not, do the items really measure the same thing? On the other hand, if the items are too similar, some may be redundant. Highly correlated items in a scale may make the scale overlong and may lead to some aspects being overemphasised, impairing the content validity. A handy summary measure for this feature is Cronbach's alpha.5
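As an illustration of the internal consistency summary just mentioned, Cronbach's alpha for a scale of k items is usually written as alpha = (k / (k - 1)) × (1 - (sum of the item variances) / (variance of the total score)). The sketch below computes it in Python for a small set of invented responses. The function name `cronbach_alpha`, the variable names, and the data are assumptions made for this example only; they do not come from any study.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Illustrative helper (hypothetical name), using sample variances throughout.
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across respondents.
    item_columns = list(zip(*item_scores))
    item_variances = [variance(column) for column in item_columns]
    # Variance of each respondent's total score.
    totals = [sum(row) for row in item_scores]
    total_variance = variance(totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Invented data: five respondents answering a four-item scale (1 to 5).
scores = [
    [3, 4, 3, 4],
    [1, 2, 1, 1],
    [4, 4, 5, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
]
alpha = cronbach_alpha(scores)
print(f"Cronbach's alpha = {alpha:.3f}")
```

Items that rise and fall together give a value near 1. As noted above, a very high value may also signal that some items are redundant, so the number should be read alongside the content of the items rather than on its own.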
A scale must also be repeatable and be sufficiently objective to give similar results for different observers. If a measurement is repeatable, in that someone who has a high score on one occasion tends to have a high score on another, it must be measuring something. With physical measurements, it is often possible for the same observer (or different observers) to make repeated measurements in quick succession. When there is a subjective element in the measurement the observer can be blinded to their first measurement, and different observers can make simultaneous measurements. In assessing the reliability of a website quality scale, it is easy to get several observers to apply the scale independently. With websites, repeat assessments need to be close in time because their content changes frequently (as does bmj.com). With questionnaires, either self administered or recorded by an observer, repeat measurements need to be far enough apart in time for the earlier responses to be forgotten, yet not so far apart that the underlying quantity being measured might have changed. Such data enable us to evaluate test-retest reliability. If two measures have comparable face, content, and construct validity the more repeatable one may be preferred for the study of a given population.
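One way to look at test-retest data of this kind is sketched below in Python, for two administrations of the same scale to the same people. It reports the correlation between occasions, which reflects whether high scorers on one occasion tend to be high scorers on the other, and the within-subject standard deviation estimated from the paired differences. The choice of these two summaries, the function names `pearson_r` and `within_subject_sd`, and the data are all assumptions made for illustration; the text above does not prescribe a particular statistic.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def within_subject_sd(x, y):
    """Within-subject SD for paired measurements: sqrt(sum(d^2) / (2n))."""
    squared_diffs = [(a - b) ** 2 for a, b in zip(x, y)]
    return sqrt(sum(squared_diffs) / (2 * len(squared_diffs)))


# Invented scores for six people on two occasions.
first = [12, 18, 7, 22, 15, 9]
second = [13, 17, 9, 21, 16, 8]

r = pearson_r(first, second)
s_w = within_subject_sd(first, second)
print(f"correlation between occasions = {r:.3f}")
print(f"within-subject SD = {s_w:.3f}")
```

The two summaries answer different questions. The correlation is unchanged if every second score is shifted by a constant, whereas the within-subject standard deviation is in the units of the scale and grows with any such shift, so it may be useful to report both.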