- D G Altman,
- J M Bland
- Medical Statistics Laboratory, Imperial Cancer Research Fund, London WC2A 3PX
- Department of Public Health Sciences, St George's Hospital Medical School, London SW17 1RE.
We have previously considered diagnosis based on tests that give a yes or no answer.1, 2 Many diagnostic tests, however, are quantitative, notably in clinical chemistry. The same statistical approach can be used only if we can select a cut off point to distinguish “normal” from “abnormal,” which is not a trivial problem. Firstly, we can investigate to what extent the test results differ among people who do or do not have the diagnosis of interest. The receiver operating characteristic (ROC) plot is one way to do this. These plots were developed in the 1950s for evaluating radar signal detection. Only recently have they become commonly used in medicine.
We assume that high values are more likely among those dubbed “abnormal.” Figure 1 shows the values of an index of mixed epidermal cell lymphocyte reactions in bone marrow transplant recipients who did or did not develop graft versus host disease.3 The usefulness of the test for predicting graft versus host disease will clearly relate to the degree of non-overlap between the two distributions.
A receiver operating characteristic plot is obtained by taking every observed data value in turn as the cut off point, calculating the sensitivity and specificity at that cut off, and plotting sensitivity against 1 - specificity, as in Figure 2. A test that perfectly discriminates between the two groups would yield a “curve” that coincided with the left and top sides of the plot. A test that is completely useless would give a straight line from the bottom left corner to the top right corner. In practice there is virtually always some overlap of the values in the two groups, so the curve will lie somewhere between these extremes.
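The construction can be sketched in a few lines of Python. This is a minimal illustration, not the calculation behind Figure 2: the function name `roc_points` and the two small samples are invented for the example, and a subject is counted as test positive when the value is at or above the cut off, in line with the assumption that high values indicate disease.

```python
def roc_points(diseased, healthy):
    """Return (1 - specificity, sensitivity) pairs, one per candidate cut off.

    A subject is test positive when the value is >= the cut off
    (assumption: high values are more likely among the "abnormal").
    """
    # Every observed value is tried as a cut off, from highest to lowest.
    cutoffs = sorted(set(diseased) | set(healthy), reverse=True)
    points = [(0.0, 0.0)]  # cut off above all values: nobody is test positive
    for c in cutoffs:
        sensitivity = sum(v >= c for v in diseased) / len(diseased)
        # 1 - specificity is the proportion of healthy subjects test positive.
        one_minus_specificity = sum(v >= c for v in healthy) / len(healthy)
        points.append((one_minus_specificity, sensitivity))
    return points


# Hypothetical measurements, invented purely for illustration.
diseased = [2.1, 3.4, 3.9, 5.0, 6.2]
healthy = [0.8, 1.5, 2.1, 2.6, 3.0]

points = roc_points(diseased, healthy)
for x, y in points:
    print(f"1 - specificity = {x:.1f}, sensitivity = {y:.1f}")
```

Joining the points in order gives the plot. With these made-up data the curve runs up the left side as far as a sensitivity of 0.8 before moving right, reflecting the limited overlap between the two groups.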
A global assessment of the performance of the test (sometimes called diagnostic accuracy4) is given by the area under the receiver operating characteristic curve. This area is equal to the probability that a random person with the disease has a higher value of the measurement than a random person without the disease. (This probability is a half for an uninformative test - equivalent to tossing a coin.)
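The probability interpretation suggests a direct way to compute the area: compare every person with the disease against every person without it and count how often the diseased value is higher. The sketch below does this for two small invented samples. Counting a tie as half a "win" is an assumption added here (it is the usual convention, and it makes the result agree with the trapezoidal area under the plotted curve); the function name `auc_by_pairs` is hypothetical.

```python
def auc_by_pairs(diseased, healthy):
    """Proportion of (diseased, healthy) pairs in which the diseased
    value is higher; ties count as half (an assumed convention)."""
    wins = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(diseased) * len(healthy))


# Hypothetical measurements, invented purely for illustration.
diseased = [2.1, 3.4, 3.9, 5.0, 6.2]
healthy = [0.8, 1.5, 2.1, 2.6, 3.0]

area = auc_by_pairs(diseased, healthy)
print(area)  # 0.9: 22.5 "wins" out of 25 pairs

# Two identical groups give the coin-tossing value of a half.
uninformative = auc_by_pairs([1, 2, 3], [1, 2, 3])
print(uninformative)  # 0.5
```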
No test will be clinically useful if it cannot discriminate,4 so a global assessment of discriminatory power is an important step. Having determined that a test does provide good discrimination the choice can be made of the best cut off point for clinical use. This requires the choice of a particular point, and is thus a local assessment. The simple approach of minimising “errors” (equivalent to maximising the sum of the sensitivity and specificity) is not necessarily best. Consideration needs to be given to the costs (not just financial) of false negative and false positive diagnoses and to the prevalence of the disease in the subjects being tested.4 For example, when screening the general population for cancer the cut off point would be chosen to ensure that most cases were detected (high sensitivity) at the cost of many false positives (low specificity), who could then be eliminated by a further test.
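The difference between the two approaches to choosing a cut off can be illustrated with a short sketch. It compares the cut off that maximises sensitivity plus specificity with the one that minimises an expected cost per person tested, taken here to be prevalence × (1 - sensitivity) × cost of a false negative + (1 - prevalence) × (1 - specificity) × cost of a false positive. That cost formula, the function names, the data, the prevalence of 1%, and the cost ratio of 500 to 1 are all assumptions made for the example, not values from any study.

```python
def sens_spec(diseased, healthy, cutoff):
    """Sensitivity and specificity when values >= cutoff are test positive."""
    sens = sum(v >= cutoff for v in diseased) / len(diseased)
    spec = sum(v < cutoff for v in healthy) / len(healthy)
    return sens, spec


def best_cutoff_by_sum(diseased, healthy):
    """Cut off maximising sensitivity + specificity (the 'simple approach')."""
    candidates = sorted(set(diseased) | set(healthy))
    return max(candidates, key=lambda c: sum(sens_spec(diseased, healthy, c)))


def best_cutoff_by_cost(diseased, healthy, prevalence, cost_fn, cost_fp):
    """Cut off minimising an assumed expected cost per person tested."""
    candidates = sorted(set(diseased) | set(healthy))

    def expected_cost(c):
        sens, spec = sens_spec(diseased, healthy, c)
        return (prevalence * (1 - sens) * cost_fn
                + (1 - prevalence) * (1 - spec) * cost_fp)

    return min(candidates, key=expected_cost)


# Hypothetical measurements, prevalence, and costs, for illustration only.
diseased = [2.1, 3.4, 3.9, 5.0, 6.2]
healthy = [0.8, 1.5, 2.1, 2.6, 3.0]

cutoff_sum = best_cutoff_by_sum(diseased, healthy)
cutoff_cost = best_cutoff_by_cost(diseased, healthy,
                                  prevalence=0.01, cost_fn=500, cost_fp=1)
print(cutoff_sum, sens_spec(diseased, healthy, cutoff_sum))
print(cutoff_cost, sens_spec(diseased, healthy, cutoff_cost))
```

With these invented figures the cost-based rule picks a lower cut off than the sum-based rule, accepting lower specificity to keep sensitivity high, which is the screening situation described above.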
A receiver operating characteristic plot is particularly useful when comparing two or more measures. A test with a curve that lies wholly above the curve of another will be clearly better. Methods for comparing the areas under two curves for both paired and unpaired data are reviewed by Zweig and Campbell,4 who give a full assessment of this method.