Education and debate: Evidence base of clinical diagnosis

Designing studies to ensure that estimates of test accuracy are transferable

BMJ 2002; 324 doi: (Published 16 March 2002) Cite this as: BMJ 2002;324:669
  1. Les Irwig, professor (lesi{at},
  2. Patrick Bossuyt, professor of clinical epidemiology (b),
  3. Paul Glasziou, professor of evidence based practice (c),
  4. Constantine Gatsonis, professor (d),
  5. Jeroen Lijmer, clinical researcher (b)
  1. (a) Screening and Test Evaluation Program, Department of Public Health and Community Medicine, University of Sydney, NSW 2006, Australia
  2. (b) Department of Clinical Epidemiology and Biostatistics, Academic Medical Centre, PO Box 22700, 1100 DE Amsterdam, Netherlands
  3. (c) School of Population Health, University of Queensland Medical School, Herston, QLD 4006, Australia
  4. (d) Center for Statistical Sciences, Brown University, Providence, RI 02192, USA

    This is the third in a series of five articles

    Measures of test accuracy are often thought of as fixed characteristics that can be determined by research and then applied in practice. Yet even when tests are evaluated in a study of adequate quality (one including such features as consecutive patients, a good reference standard, and independent, blinded assessments of tests and the reference standard [1]), the performance of a diagnostic test in one setting may differ considerably from the results reported elsewhere [2-8]. In this paper, we explore the reasons for this variability and its implications for the design of studies of diagnostic tests.

    Summary points

    Test accuracy may vary considerably from one setting to another

    This may be due to the target condition, the clinical problem, what other tests have been done, or how the test is carried out

    Larger studies than those usually done for diagnostic tests will be needed to assess transferability of results

    These studies should explore the extent to which variation in test accuracy between populations can be explained by patient and test features

    True variability in test accuracy

    Interpreting a test's results in different settings requires an understanding of whether and why the test's accuracy varies. Measures of accuracy fall into two broad categories: measures of discrimination between people who are and who are not diseased, and measures of prediction used to estimate the post-test probability of disease.

    Measures of discrimination

    Global measures of test accuracy assess only the ability of the test to discriminate between people with and without a disease. Common examples are the area under the receiver operating characteristic curve (ROC), and the odds ratio (OR), sometimes also referred to as the diagnostic odds ratio. Such results may suffice for some broad health policy decisions—for example, to decide whether a new test is in general better than an existing test for the target condition.
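
    As an illustration only, the two global measures named above can be computed as in the following minimal sketch. The function names and all counts and scores are hypothetical, and the sketch assumes that higher test values indicate disease and that no cell of the 2x2 table is zero.

```python
def diagnostic_odds_ratio(tp, fp, fn, tn):
    """Diagnostic odds ratio from a 2x2 table of counts.

    Odds of a positive test in the diseased (tp/fn) divided by the
    odds of a positive test in the non-diseased (fp/tn).
    Assumes no cell is zero.
    """
    return (tp * tn) / (fp * fn)


def area_under_roc(diseased_scores, non_diseased_scores):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen diseased person has a higher test value than a
    randomly chosen non-diseased person (ties count as one half)."""
    wins = 0.0
    for d in diseased_scores:
        for n in non_diseased_scores:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(diseased_scores) * len(non_diseased_scores))


# Hypothetical data: 80 true positives, 10 false positives,
# 20 false negatives, 90 true negatives.
print(diagnostic_odds_ratio(tp=80, fp=10, fn=20, tn=90))  # 36.0
print(area_under_roc([3, 4, 5], [1, 2, 3]))
```

    Neither function uses the pretest probability, which is why such measures describe discrimination only and cannot by themselves give a post-test probability.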

    Measures for prediction

    The measures used to estimate the probabilities of the target condition in people who have a particular test result require both discrimination and calibration. The predictive value—the proportion of people with a particular test result who have the disease of interest—is an example. It is clumsy and difficult to estimate disease rates for categories of patients who may have different pretest probabilities of having the disease. Therefore, the estimation is often done indirectly using Bayes's theorem, based on the pretest probability and measures of test characteristics such as sensitivity and specificity or likelihood ratios in specific patients. These measures of test performance require more than discrimination. They require tests to be calibrated.
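
    The indirect estimation described above can be sketched in a few lines, using the odds form of Bayes's theorem (post-test odds = pretest odds x likelihood ratio). The function names and the sensitivity, specificity, and pretest probabilities are hypothetical values chosen for illustration.

```python
def likelihood_ratios(sensitivity, specificity):
    """Likelihood ratios for a positive and a negative result of a
    dichotomous test. Assumes specificity is neither 0 nor 1."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def post_test_probability(pretest_probability, likelihood_ratio):
    """Odds form of Bayes's theorem: convert the pretest probability
    to odds, multiply by the likelihood ratio, convert back."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Hypothetical test with sensitivity 0.90 and specificity 0.80.
lr_pos, lr_neg = likelihood_ratios(0.90, 0.80)

# The same positive result gives different post-test probabilities
# in settings with different pretest probabilities.
for pretest in (0.10, 0.50):
    print(pretest, round(post_test_probability(pretest, lr_pos), 3))
```

    The calculation is valid only if the likelihood ratios estimated in the study setting also hold in the setting where they are applied, which is the calibration requirement discussed in the text.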

    Transferability of test results

    The transferability of measures of test performance from one setting to another depends on which indicator of test performance is used. The figure shows the assumptions involved in transferability. The table indicates the relation between these assumptions and the transferability of the different measures of test performance.


    Distribution of test results in patients with and without the target disease. The numbers refer to assumptions for the transferability of test results (see text and table)

    Assumptions for transferring different test performance characteristics (X=important; x=less important)

    The main assumptions in transferring tests across settings fall into six categories.

    The definition of disease is constant—Many diseases have ambiguous definitions. For example, there are no single reference standards for heart failure, Alzheimer's disease, or diabetes. Reference standards may differ because individual investigators' conceptual frameworks differ, or because it is difficult to apply the same framework in a standardised way.

    The same test is used—Although based on the same principle, tests may differ—for example, over time or if made by different manufacturers.

    The thresholds between categories of test result (for example, positive and negative) are constant—This is possible with a well standardised test that can be calibrated for different settings. However, there may be no accepted means of calibration—for example, different observers of imaging tests may have different thresholds for calling an image “positive.” The effect of different cut-off points is classically studied by use of a receiver operating characteristic curve. In some cases calibration may be improved by using category specific likelihood ratios rather than a single cut-off point.

    The distribution of test results in the disease group is constant in average (location) and spread (shape)—This assumption is not fulfilled if the spectrum of disease changes—if, for example, a screening setting is likely to include earlier disease, for which test results will be closer to those for a group without the disease (hence reducing sensitivity).

    The distribution of test results in the group without disease is constant in average (location) and spread (shape)—This assumption is not fulfilled if the spectrum of non-disease changes—if, for example, the secondary care setting involves additional causes of false positives due to comorbidity not seen in primary care.

    The ratio of disease to non-disease (pretest probability) is constant—If this were the case, we could use the post-test probability (“predictive” values) directly. However, this assumption is often not fulfilled—for example, the pretest probability is likely to be lowest with screening tests and greatest with tests in referred patients. This likely inconstancy is the reason for using Bayes's theorem to “adjust” the post-test probability for the pretest probability of each different setting.

    All measures of test performance need the first two assumptions to be fulfilled. The importance of the last four assumptions is shown in the table, although they may not be necessary in every instance; occasionally the assumptions may not be fulfilled but, because of compensating differences, transferability is still reasonable.
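
    The effect of a change in disease spectrum (the fourth assumption) can be illustrated with a small simulation. This is a sketch under strong assumptions that are not taken from the paper: test results are normally distributed with the hypothetical means and spreads shown, higher values indicate disease, and the positivity threshold is the same in both settings.

```python
from statistics import NormalDist


def accuracy_at_threshold(diseased, non_diseased, threshold):
    """Sensitivity and specificity when results above the threshold
    are called positive."""
    sensitivity = 1 - diseased.cdf(threshold)
    specificity = non_diseased.cdf(threshold)
    return sensitivity, specificity


# Hypothetical distributions of test results.
non_diseased = NormalDist(mu=0.0, sigma=1.0)
referred_disease = NormalDist(mu=3.0, sigma=1.0)   # more advanced disease
screened_disease = NormalDist(mu=1.5, sigma=1.0)   # earlier disease
threshold = 1.5

sens_referred, spec = accuracy_at_threshold(
    referred_disease, non_diseased, threshold)
sens_screened, _ = accuracy_at_threshold(
    screened_disease, non_diseased, threshold)

print(round(sens_referred, 3), round(sens_screened, 3), round(spec, 3))
```

    With the threshold and the non-diseased distribution held constant, specificity is unchanged, but sensitivity is lower in the screening setting because results in early disease lie closer to those of people without disease.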

    Assessing transferability of discrimination and prediction

    How should a study be designed to ensure that its transferability can be determined? We need first to distinguish artefactual variation from real variation in diagnostic performance. Artefactual variation arises when studies differ in design features, such as whether consecutive patients were included or whether the reference standard and index test were read blind to each other. Once such sources of variation have been ruled out, we may explore the potential sources of true variation [9]. The issues to consider are similar to those for assessing interventions, for which we consider patient, intervention, comparator, and outcome (PICO) [10, 11]. To ensure that readers have the necessary information to decide on the transferability of a diagnostic study to their own setting, five components need to be taken into account in the design and presentation of a study.

    Target condition and reference standard

    The target condition and reference standard need to be carefully chosen. For example, in a study of clinically relevant tests to assess stenosis of the carotid artery, it would be sensible to dichotomise angiographic stenosis at the level of angiographic abnormality above which, on currently available evidence, the benefits of treatment outweigh harm, and to use this as the reference standard. Error in the reference standard should be minimised—for example, by better methods or multiple assessments. Any information about the accuracy of the reference standard will help interpretation.

    Discriminative or predictive measures?

    Assessment of the discrimination of a test requires measures such as the area under the receiver operating characteristic curve or diagnostic odds ratio. However, for estimating the probability of disease in individuals, likelihood ratios (or sensitivity and specificity) are needed, with additional information on how the tests were calibrated. Studies should include information about calibration; inclusion of selected example material, such as x rays of lesions, will help to clarify what thresholds have been used.
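
    For tests reported in more than two categories, category specific likelihood ratios can be derived directly from the counts in each category. The following is a minimal sketch; the function name, the category labels, and the counts are hypothetical, and every category is assumed to contain at least one person without the disease.

```python
def category_likelihood_ratios(diseased_counts, non_diseased_counts):
    """Likelihood ratio for each category of test result: the
    proportion of diseased people with that result divided by the
    proportion of non-diseased people with that result.

    Both arguments map category name -> count and must share the same
    keys. A category with no non-diseased people would divide by zero.
    """
    total_diseased = sum(diseased_counts.values())
    total_non_diseased = sum(non_diseased_counts.values())
    return {
        category: (diseased_counts[category] / total_diseased)
        / (non_diseased_counts[category] / total_non_diseased)
        for category in diseased_counts
    }


# Hypothetical study: 100 people with and 100 without the target condition.
diseased = {"low": 10, "intermediate": 30, "high": 60}
non_diseased = {"low": 70, "intermediate": 20, "high": 10}

for category, lr in category_likelihood_ratios(diseased, non_diseased).items():
    print(category, round(lr, 2))
```

    Reporting the ratio for each category avoids committing readers to a single cut-off point, but the category boundaries themselves still need to be described so that they can be reproduced elsewhere.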

    Clinical problem and population

    The clinical problem defines how the initial cohort should be selected for study. For example, a new test for carotid stenosis could be considered for all patients referred to a surgical unit. However, ultrasound quantifies the extent of a stenosis reasonably accurately, so investigators may choose to restrict the study of a more expensive or invasive test to patients in whom the ultrasound result is near the decision threshold for surgery. A useful planning tool is to draw a flow diagram of how patients are selected to make up the population with the clinical problem of interest. This flow diagram shows what clinical information has been gathered, what tests have been done, and how the results of those tests determine entry into the population in which the clinical problem of interest is being studied. A good example is given in a recent paper on the assessment of imaging tests in the diagnosis of appendicitis in children [12].

    Replacement or incremental value of the test?

    A key question is whether a test is being assessed as a replacement (substitution) for an existing test (because it is better, or just as good and cheaper) or whether the test adds value when used in addition to specified existing tests. This decision will also be a major determinant of how the data should be analysed [13-15].

    Reasons for variability

    Between test types or readers: Data should be presented on the variability between different readers or types of test and on tools to help calibration, such as standard radiographs [16, 17] or laboratory quality control measures. Information on the extent to which other factors, such as experience or training, affect reading adequacy is also helpful.

    Between subgroups of the study population: Data on individuals should be available to determine how test performance is influenced by the spectrum of disease and non-disease (for example, by estimating “specificity” within each category of “non-disease”); by the results of other tests, taking account of the logical sequencing of tests (the simplest, least invasive, and cheapest generally come first); and by any other characteristics (for example, age and sex).
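
    Estimating “specificity” within each category of “non-disease” requires only individual level data on people without the target condition. The sketch below assumes a hypothetical record layout, a pair of (category of non-disease, test positive or not), and invented counts.

```python
from collections import defaultdict


def specificity_by_category(records):
    """Specificity within each category of non-disease.

    records: iterable of (category, test_positive) pairs, one per
    person WITHOUT the target condition.
    """
    counts = defaultdict(lambda: [0, 0])  # [negative results, total]
    for category, test_positive in records:
        counts[category][1] += 1
        if not test_positive:
            counts[category][0] += 1
    return {
        category: negatives / total
        for category, (negatives, total) in counts.items()
    }


# Hypothetical data: false positives are more common in people with
# a comorbid condition than in otherwise healthy people.
records = (
    [("healthy", False)] * 95 + [("healthy", True)] * 5
    + [("comorbidity", False)] * 30 + [("comorbidity", True)] * 20
)

print(specificity_by_category(records))
```

    In this invented example the pooled specificity (125/150) hides a large difference between the two categories, so a setting with a different mix of the categories would see a different overall specificity.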

    Between settings: Test performance needs to be compared in several populations or centres, as has been done for the general health questionnaire [18] and predictors of coma [19]. Variability between settings can also be explored across different studies by using meta-analytic techniques [20, 21]. Studies should also explore sources of variability between settings that are not accounted for by the within-setting characteristics outlined above, such as the level of care (primary, secondary, or tertiary), the prevalence of the target condition, the country, and the time period. Residual differences between settings should then be examined to judge how much unexplained variability remains that may limit the test's applicability.
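
    A simple first look at variability between settings is to tabulate sensitivity in each setting with a confidence interval. The sketch below uses the Wilson score interval, which is our choice for illustration and not a method prescribed in this paper; the settings and counts are hypothetical. It does not replace the meta-analytic techniques mentioned above.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly
    95% coverage)."""
    p = successes / n
    denominator = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denominator
    half_width = (
        z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denominator
    )
    return centre - half_width, centre + half_width


# Hypothetical counts per setting:
# (true positives, number of people with the target condition).
settings = {
    "primary care": (30, 50),
    "hospital clinic": (90, 100),
}

for name, (true_positives, diseased) in settings.items():
    low, high = wilson_interval(true_positives, diseased)
    sensitivity = true_positives / diseased
    print(f"{name}: {sensitivity:.2f} ({low:.2f} to {high:.2f})")
```

    Wide or overlapping intervals in such a table are one reason that assessing transferability needs larger studies than those usually done for diagnostic tests.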


    There is merit in studies with heterogeneous study populations. They allow exploration of the extent to which the performance of a diagnostic test depends on prespecified predictors, and how much residual variation exists. The more variation there is in study populations, the greater the potential to know how the test will perform in various settings.


    We thank Petra Macaskill, Clement Loy, Andre Knottnerus, Margaret Pepe, Jonathan Craig, and Anthony Grabs for comments on the book chapter on which this paper is based.



    1. 1.
    2. 2.
    3. 3.
    4. 4.
    5. 5.
    6. 6.
    7. 7.
    8. 8.
    9. 9.
    10. 10.
    11. 11.
    12. 12.
    13. 13.
    14. 14.
    15. 15.
    16. 16.
    17. 17.
    18. 18.
    19. 19.
    20. 20.
    21. 21.