Research Methods & Reporting

Anticipating missing reference standard data when planning diagnostic accuracy studies

BMJ 2016; 352 doi: (Published 09 February 2016) Cite this as: BMJ 2016;352:i402
  1. Christiana A Naaktgeboren, assistant professor1,
  2. Joris A H de Groot, assistant professor1,
  3. Anne W S Rutjes, senior clinical epidemiologist2 3,
  4. Patrick M M Bossuyt, professor4,
  5. Johannes B Reitsma, associate professor1,
  6. Karel G M Moons, professor1
  1. 1Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3584 CG Utrecht, Netherlands
  2. 2CTU Bern, Department of Clinical Research, University of Bern, Switzerland
  3. 3Institute of Social and Preventive Medicine, University of Bern, Switzerland
  4. 4Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, University of Amsterdam, Amsterdam, Netherlands
  1. Correspondence to: J B Reitsma j.b.reitsma-2{at}
  • Accepted 30 December 2015

Results obtained using a reference standard may be missing for some participants in diagnostic accuracy studies. This paper looks at methods for dealing with such missing data when designing or conducting a prospective diagnostic accuracy study.

Summary points

  • Missing reference standard results—that is, missing data on the target disease status—are common in diagnostic accuracy studies

  • Analyses that include only the study participants for whom the target disease status is actually measured may produce biased estimates of accuracy

  • Several statistical methods to reduce this bias are available; however, they all rely on assumptions about the pattern of missing outcomes, which are sometimes unverifiable

  • This paper provides an overview of the different patterns of missing data on the reference standard, the recommended corresponding solutions, and the specific measures that can be taken before and during a prospective diagnostic study to enhance the validity and interpretation of these solutions

The problem: missing reference standard data

Diagnostic studies typically evaluate the accuracy of one or more tests, markers, or models by comparing their results with those of, ideally, a “gold” reference test or standard.1 2 In such studies, the outcome—that is, the presence or absence of the target disease as determined by the chosen reference standard—is often missing in some of the study participants. This is known as partial verification.3 4 When only the participants who received the reference standard are included in the analysis (complete case analysis), estimates of the accuracy of the diagnostic test(s), marker(s), or model(s) under study, such as the sensitivity, specificity, predictive values, likelihood ratios, or C index, can be biased.5 6 7 8

There are many reasons why missing reference standard results may occur in diagnostic studies, as well as various approaches to deal with these missing outcomes in the statistical analysis.3 4 8 9 10 11 12 13 14 15 16 Ideally, how missing outcome data will eventually be dealt with is determined during the design phase of the study as opposed to later during the data analysis phase.

Here, we build on previous research on methods for dealing with missing outcomes during the data analysis phase to look at specific measures that can be taken when designing or conducting a diagnostic accuracy study. This paper focuses on prospective studies in which all included patients suspected of having the disease of interest receive all tests under study as well as the reference standard. It does not cover alternative designs, such as separate sampling of diseased participants and healthy controls, or retrospective studies in which patients who have received both the index test and reference standard are identified in hospital databases.17 Firstly, we discuss the various reasons for missing reference standard results, then we consider the proposed solutions to handle patterns of missing data, and we end with an overview of specific measures that can be considered in the design phase of a prospective diagnostic study to improve the proposed solutions.

How the problem arises

In clinical practice, the diagnostic process begins when signs, symptoms, or test results signal a possible target disease. Patients go through a diagnostic pathway, typically starting with inexpensive, non-invasive tests to rule out the presence of the disease.18 For those in whom the presence of the disease is still suspected, additional tests may follow that are increasingly costly, burdensome, and even risky. For safety and efficiency, not all patients originally suspected of having a disease eventually go on to receive the complete battery of tests.

In prospective diagnostic studies (that is, studies that do not use routine care data such as hospital or primary care records), all study participants ideally receive all tests, markers, or models under study (from now on referred to as index tests) and then the reference standard to assign their final diagnosis. Nevertheless, even in predesigned prospective studies, missing outcomes on the reference standard are likely to occur and in some situations may even be unavoidable. These missing outcomes may occur haphazardly, in a more or less predictable way, or even by design (see table 1 for examples). As in any clinical study, haphazardly missing data may result from, for example, lost blood samples, technical failures, or accidental deviations from the study protocol. In a study on the accuracy of rapid diagnostic tests for malaria, for instance, a few blood samples were lost before they could be examined under the microscope.19 We refer to this as “incidental missing data.” Although this type of missing data leads to a loss of precision, it does not necessarily lead to biased estimates of test accuracy owing to the complete randomness of the missing outcomes.

Table 1

Examples of different mechanisms for missing outcomes in diagnostic accuracy studies


Commonly, though, clear reasons exist why some participants in a study do not undergo the reference standard. It may be specified in the protocol of a prospective accuracy study, for instance, that to reduce study costs or burden to patients only a randomly selected subset of patients in a specific subgroup are to be verified by the preferred reference standard. We refer to this pattern as “data missing by study design.”20 For example, in a study on the diagnostic accuracy of visual inspection with acetic acid for detecting cervical cancer, in which the reference standard was colposcopy with biopsy, only a random subset of participants in whom no abnormalities were seen during visual inspection underwent colposcopy with a series of randomly located biopsies.21

In many diagnostic studies, the intention is to perform the reference standard in all patients, but for a variety of reasons missing outcomes occur. Typically, this is not a completely random process. Missingness may depend on several factors, such as severity of symptoms and other preceding test results, resulting in complicated patterns of missing outcomes that are also related to the results of the index test. We refer to this pattern as “data missing due to clinical practice.” Such selectively missing data are likely to cause biased estimates of the accuracy of the index test in a complete case analysis. An example is a study on the diagnostic accuracy of faecal calprotectin for inflammatory bowel disease; endoscopy combined with biopsy, the invasive reference standard, was limited to patients at high risk, defined as those with at least one predefined red flag symptom.22

In some clinical scenarios, it may be technically impossible to perform the reference standard in a well defined subgroup of participants. We refer to this as “data missing due to infeasibility.” This is common in cancer screening studies in which the reference standard is invasive. A specific example is a study on the diagnostic accuracy of ultrasonography for detecting breast cancer, in which one could not do a biopsy when no lesion was observed.23

How to deal with missing reference standard results

Understanding why missing outcomes occur is necessary for judging whether estimates of diagnostic accuracy are at risk of being biased, as well as whether and how this bias can be corrected for (table 2). In addition to keeping careful track of the reasons for missing reference standards, analytical methods are available to help to distinguish between “incidental missing data” and “data missing due to clinical practice.” A method commonly used to identify the risk of bias due to missing data is to compare the distribution of the patients’ characteristics and results of the index test(s) among the study participants with and without a missing outcome.24 If differences exist, the estimates based only on the participants with observed reference standard results (complete case analysis) are assumed to be at risk of bias, as those participants are not a completely random subset of the initial study population. Another method to judge the potential for bias is to do a sensitivity analysis to explore the range of values for the accuracy estimates of the index test that are consistent with the data. Such a sensitivity analysis quantifies the possible range of sensitivities, specificities, predictive values, or C indices if all participants with a missing outcome were considered as either diseased or non-diseased. A web tool has been developed that plots a so-called test ignorance region. If the accuracy of the index test(s) from the complete case analysis falls outside this test ignorance region, the assumption that the data are missing haphazardly (completely at random) is not reasonable, so accuracy estimates are likely to be biased and should therefore be adjusted.
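The idea behind such a sensitivity analysis can be illustrated with a minimal sketch. It is not the algorithm of the web tool mentioned above; it simply computes worst case and best case bounds for sensitivity and specificity of a dichotomous index test by assigning every unverified participant to the disease status that is least or most favourable to the test. The function name, argument names, and counts are hypothetical.

```python
def accuracy_bounds(tp, fp, fn, tn, unverified_pos, unverified_neg):
    """Bounds on sensitivity and specificity when some outcomes are missing.

    tp, fp, fn, tn: counts among verified participants.
    unverified_pos / unverified_neg: participants with a positive / negative
    index test result and no reference standard result (hypothetical inputs).
    """
    # Lowest sensitivity: every unverified index-negative is diseased
    # (extra false negatives), every unverified index-positive is not.
    sens_low = tp / (tp + fn + unverified_neg)
    # Highest sensitivity: every unverified index-positive is diseased.
    sens_high = (tp + unverified_pos) / (tp + unverified_pos + fn)
    # Lowest specificity: every unverified index-positive is non-diseased
    # (extra false positives).
    spec_low = tn / (tn + fp + unverified_pos)
    # Highest specificity: every unverified index-negative is non-diseased.
    spec_high = (tn + unverified_neg) / (tn + unverified_neg + fp)
    return {
        "sensitivity": (sens_low, sens_high),
        "specificity": (spec_low, spec_high),
    }


# Hypothetical study: 200 verified participants, 100 unverified index-negatives.
bounds = accuracy_bounds(tp=80, fp=20, fn=10, tn=90,
                         unverified_pos=0, unverified_neg=100)
for measure, (low, high) in bounds.items():
    print(f"{measure}: {low:.3f} to {high:.3f}")
```

Wide bounds indicate that the data alone say little about accuracy and that any corrected estimate will lean heavily on assumptions about why outcomes are missing.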

Table 2

Analytical approaches to reduce bias in estimated accuracy of diagnostic test(s), marker(s), or model(s) under study, introduced when preferred reference standard is not performed (that is, outcome is missing) in some study participants


When outcomes are missing haphazardly (the pattern “incidental missing data”)—that is, unrelated to any observed or unobserved patients’ characteristics or test results—and the study is large enough, a complete case analysis that includes only participants who underwent the reference standard will produce estimates similar to those obtained if all original study participants had been included, except that these accuracy estimates will be less precise. In that case, participants with the outcome can be seen as a completely random sample of the original study group, still representing a random sample from the study population defined by the eligibility criteria.

When outcomes are missing selectively (as is the case for all patterns except “incidental missing data”), a complete case analysis will probably produce biased estimates of accuracy. Analytical approaches for reducing the bias introduced by missing outcomes essentially use the available data to reconstruct the missing outcome (see table 2 for an overview of these methods).11 14 15 16 26 These methods either require knowledge of or make assumptions about the pattern of the missing outcomes.

A straightforward correction method was developed by Begg and Greenes, who used inverse probability weighting, a technique also often used in causal research.11 Their approach can provide unbiased accuracy estimates of the index test(s) when the missingness is actually random given the result of the index test(s). For a dichotomous index test, this method is equivalent to inflating the two-by-two table by multiplying each cell by the inverse probability of having undergone the reference standard. The assumption then is that patients with a negative (or positive) index test result who have not been verified would have shown comparable results to those with a negative (or positive) index test result who were verified. This method can be extended to incorporate additional factors that may have led to the missing outcomes. However, when the mechanism of the missing outcome data is not so straightforward and is based on multiple variables rather than only the index test(s), a more advanced method of reconstructing the data, such as multiple imputation, may be recommended instead.15 27
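For a dichotomous index test, the inflation of the two-by-two table described above can be sketched as follows. This is a minimal illustration under the assumption that verification depends only on the index test result; the function name and the counts are hypothetical and do not come from any of the cited studies.

```python
def inverse_probability_weighted_accuracy(tp, fp, fn, tn, n_pos, n_neg):
    """Sensitivity and specificity corrected for partial verification.

    tp, fp: verified participants with a positive index test result.
    fn, tn: verified participants with a negative index test result.
    n_pos, n_neg: ALL participants (verified or not) with a positive or
    negative index test result.
    """
    # Weight = 1 / P(verified | index test result).
    weight_pos = n_pos / (tp + fp)
    weight_neg = n_neg / (fn + tn)

    # Inflate each cell of the two-by-two table by the weight of its row.
    tp_w, fp_w = tp * weight_pos, fp * weight_pos
    fn_w, tn_w = fn * weight_neg, tn * weight_neg

    sensitivity = tp_w / (tp_w + fn_w)
    specificity = tn_w / (tn_w + fp_w)
    return sensitivity, specificity


# Hypothetical study: all 100 index-positives were verified, but only
# 100 of the 200 index-negatives were.
sens, spec = inverse_probability_weighted_accuracy(
    tp=80, fp=20, fn=10, tn=90, n_pos=100, n_neg=200)
print(f"corrected sensitivity {sens:.2f}, specificity {spec:.2f}")
```

In this hypothetical example the complete case analysis would give a sensitivity of 80/90 (0.89) and a specificity of 90/110 (0.82), whereas the weighted table gives 0.80 and 0.90. Under-verification of participants with negative index test results therefore inflates sensitivity and deflates specificity in the complete case analysis.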

Imputation is the substitution of missing data with plausible values to allow for analysis of the entire dataset. Multiple imputation is a statistical procedure that uses all available patients’ data to predict the missing data, in this case the missing outcome.28 These missing data are predicted multiple times, resulting in several complete datasets, often 10 or more, on which standard analyses are then performed.29 The accuracy estimates of the index tests from these datasets are then averaged to provide an overall estimate, with adjusted confidence intervals that reflect the uncertainty resulting from the missing data. The more accurately the available data predict the missing outcomes, the less biased and more precise the accuracy estimates after multiple imputation will be. Even if some of the variables that influenced missingness are not available in the data, multiple imputation will probably still result in less biased results of the accuracy of the index tests than will complete case analysis.30 The challenge to multiple imputation is that it depends on the ability of additional patients’ data to accurately predict the missing reference standard results. Other, less straightforward, analytical methods for complex missing patterns exist, for which we refer to an overview of the literature.12
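The mechanics of multiple imputation can be sketched in a deliberately simplified form. The sketch below assumes that the only predictor of the missing outcome is a dichotomous index test; in practice the imputation model would be a regression model using many patient characteristics. Disease probabilities are drawn from a beta distribution (assuming a uniform prior) so that each imputed dataset reflects uncertainty in the imputation model, and the estimates are pooled with Rubin's rules. All names and counts are hypothetical.

```python
import random
import statistics


def impute_once(rng, verified, unverified):
    """Return one completed dataset as {index_result: (diseased, healthy)}.

    verified: {index_result: (diseased, healthy)} among verified participants.
    unverified: {index_result: number without a reference standard result}.
    """
    completed = {}
    for result, (diseased, healthy) in verified.items():
        # Draw the disease probability in this stratum from its posterior.
        p = rng.betavariate(diseased + 1, healthy + 1)
        n_missing = unverified[result]
        imputed_diseased = sum(rng.random() < p for _ in range(n_missing))
        completed[result] = (diseased + imputed_diseased,
                             healthy + n_missing - imputed_diseased)
    return completed


def sensitivity_and_variance(completed):
    """Sensitivity and its binomial variance in one completed dataset."""
    tp, fn = completed[1][0], completed[0][0]
    n_diseased = tp + fn
    sens = tp / n_diseased
    return sens, sens * (1 - sens) / n_diseased


def pooled_sensitivity(verified, unverified, m=20, seed=1):
    """Pool sensitivity over m imputed datasets using Rubin's rules."""
    rng = random.Random(seed)
    results = [sensitivity_and_variance(impute_once(rng, verified, unverified))
               for _ in range(m)]
    estimates = [est for est, _ in results]
    within = statistics.mean([var for _, var in results])
    between = statistics.variance(estimates)
    total_variance = within + (1 + 1 / m) * between
    return statistics.mean(estimates), total_variance


# Hypothetical study: index test coded 1 (positive) or 0 (negative);
# 100 index-negative participants have no reference standard result.
verified = {1: (80, 20), 0: (10, 90)}
unverified = {1: 0, 0: 100}
estimate, variance = pooled_sensitivity(verified, unverified)
print(f"pooled sensitivity {estimate:.3f} (SE {variance ** 0.5:.3f})")
```

The between-imputation component of the variance is what widens the confidence interval to reflect the uncertainty caused by the missing outcomes.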

Instead of approaching the bias introduced by missing outcomes by using purely analytical correction methods, an alternative approach is to rely on results from a second reference standard to determine the outcome in participants missing the preferred reference standard. The use of different reference standards in different participants is known as differential verification.3 9 31 32 If the alternative reference standard classifies disease status with less accuracy than does the preferred standard, this approach essentially results in misclassification of the outcome.33 As such, it may increase, rather than reduce, the bias in the estimated accuracy of the index test(s). When differential verification is present, one might consider using an empirical bayesian correction method that takes into account the verification pattern as well as bias due to imperfections in the reference standards.14 This model requires specification of the pattern by which participants receive one reference standard or the other. It allows the researchers to incorporate their beliefs about the accuracy of the reference standards with respect to the true disease of interest in the form of prior distributions. Challenges to the bayesian correction method are understanding and specifying a potentially complex verification pattern and the availability of evidence on which to base beliefs about the accuracy of the reference standard. In the particular situation in which the type of reference standard a participant receives is completely dependent on the result of the index test, marker, or model, the predictive values are clinically interpretable. This would happen, for example, if all participants whose (dichotomous) index test result is abnormal receive the preferred reference standard and all others receive an alternative. In that case, one may simply choose to report results stratified by the index test results (that is, predictive values).32

Considerations for study design, analysis, reporting, and interpretation of results

Obviously, missing outcomes in diagnostic accuracy studies should ideally be avoided, as in any clinical study. All solutions for correcting bias introduced by missing outcomes are suboptimal. However, we argue that when missing outcomes are anticipated before the start of a diagnostic study, timely actions can be planned to optimise the validity of the study results. The protocols of prospective diagnostic accuracy studies can be enhanced by including information on the expected pattern of missing outcomes, as well as the chosen design and analytical solutions for reducing the impact of these missing outcomes. In addition to presenting results that have been adjusted for missing outcomes, transparent reporting of the pattern of missing outcomes is important; this can be represented in a flowchart as recommended in the STARD guidelines.34 Such reporting facilitates readers’ judgment of the risk of bias introduced by the missing outcomes and the appropriateness of the analytical solutions used to correct for this bias.

Table 3 contains an overview of the patterns for missing outcomes and the relation of these patterns to possible design, analytical, and reporting considerations. The appendix contains a worked out example for each of these patterns, using the clinical examples in table 1 as inspiration.

Table 3

Anticipating missing results on best available or preferred reference standard (missing outcomes): considerations for design, conduct, analysis, reporting, and interpretation


Incidental missing data

A small amount of completely random missing data is almost inevitable in any study for reasons unrelated to any patients’ characteristics or index test results, such as data entry errors or dropping a blood sample. In an adequately sized study, excluding from the analysis participants for whom the reference standard result is missing completely at random will not bias the results—it will only decrease precision. The percentage of missing outcomes should be reported, as well as the distribution of patients’ characteristics and index test results among those without and with missing outcomes, to allow the reader to judge whether they were missing completely at random and their exclusion thus would not lead to bias.34 Additionally, a sensitivity analysis as described above may provide further insight into the potential impact of the missing outcomes.

Data missing by study design

For efficiency, technical, or ethical reasons, it may be desirable not to perform the reference standard by design in all participants but only in a random sample of, for example, those with “normal” index test results and to adjust for this partial verification in the analysis (“data missing by study design”). This may be an efficient approach in situations in which the prevalence of disease is low—for instance, in screening. Unfortunately, no a priori sample size calculations are available to determine how large such random samples need to be. One must ensure that the random sample that will be verified by the reference standard will contain a sufficient number of participants with and without the target condition.35 Therefore, researchers choosing such a design should provide a rationale for the number and type of participants who will randomly be verified in specific subgroups.
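Although this is not a sample size formula, a simple probability calculation may help to support such a rationale. The sketch below estimates how likely a random verified sample is to contain at least a given number of diseased participants, assuming a guessed disease probability among participants with “normal” index test results and a binomial approximation that ignores the finite size of the subgroup. The function name and all numbers are hypothetical.

```python
from math import comb


def prob_at_least(n_verified, disease_prob, k):
    """P(at least k diseased among n_verified randomly verified participants).

    disease_prob is an assumed probability of disease in the subgroup that is
    sampled (for example, participants with a normal index test result).
    """
    # Binomial probability of observing fewer than k diseased participants.
    prob_fewer = sum(
        comb(n_verified, i)
        * disease_prob ** i
        * (1 - disease_prob) ** (n_verified - i)
        for i in range(k)
    )
    return 1.0 - prob_fewer


# Hypothetical planning question: if 2% of index-negative participants are
# assumed to be diseased, how do verified samples of 200 and 400 compare in
# their chance of containing at least five diseased participants?
for n in (200, 400):
    print(n, round(prob_at_least(n, disease_prob=0.02, k=5), 3))
```

Repeating the calculation over a range of plausible disease probabilities shows how sensitive the planned sampling fraction is to that assumption.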

Data missing due to clinical practice

When the outcome is missing more often in participants with specific characteristics or index test results, such as those with less severe symptoms or normal index test results, a complete case analysis will probably result in biased estimates of the accuracy of the index test. Whether this is the case can be inferred from a comparison of the distributions in participants with and without missing outcomes. If investigators plan to use an analytical method to correct for this bias, such as inverse probability weighting or multiple imputation, they should take appropriate actions for collecting additional information on study participants, such as signs, symptoms, and perhaps even additional test results, that will improve the performance of these methods. When the pattern by which patients receive one or another reference standard is more complex, as is often the case, multiple imputation is preferable to inverse probability weighting, as it makes accounting for more than one factor easier.

Sometimes a secondary reference standard—that is, a test that provides information about the outcome—is available but is less accurate than the preferred reference standard. Instead of using analytical correction methods to correct for partial verification bias, one can use this secondary reference standard to assess the outcome in participants who did not receive the preferred reference standard. A bayesian correction method can then be used to calculate the proper index test accuracy estimates.14 Here, it is important to report the assumptions made, such as the accuracy of the secondary reference standard with respect to the preferred reference standard. We stress that all of these methods to correct for bias due to “data missing due to clinical practice” assume that the pattern of missing reference standard results either is known or can be predicted by observed information.

Data missing due to infeasibility

When performing the reference standard in any of the participants in specific subgroups is explicitly decided against or even impossible—for example, no biopsy of the breast in women without any abnormality on mammography (table 1)—some alternative measure of the target disease should be obtained. In the design phase of a study, the decision can be made to use an alternative reference standard in these participants, a common choice being clinical follow-up. Rather than focusing on how well the index test results correspond to the preferred reference standard, it may be more relevant to focus on whether the index test provides information about clinically relevant outcomes. If so, the clinical relevance of this alternative reference standard should be discussed. One should then focus on the accuracy estimates of the index tests across strata of the index test results—that is, presenting predictive values.3 32

Considerations for study design

Although we have provided guidance for how to handle missing data on the reference standard, we stress that situations exist in which these approaches to deal with missing reference standard data may not be possible or cannot remove the bias, even when researchers anticipate the missing reference standards before the study starts. Additionally, although unbiased estimates of diagnostic test accuracy help to evaluate potential clinical value, cross sectional accuracy studies do not always provide the information needed when forming a conclusion about whether a test improves the care of patients. Hence, in some situations, it may be necessary to go beyond accuracy studies and opt for alternative designs that focus on estimating or comparing the clinical value of tests in terms of their ability to improve actual outcomes for patients.36 37 38 39 40 This may often be the case when missing outcomes are unavoidable or the new index test is hypothesised to outperform the reference standard.


Despite efforts to assess the outcome in all participants in a diagnostic accuracy study, missing reference standard results (that is, missing outcomes) are often inevitable and should be anticipated in any prospective diagnostic accuracy study. Analyses that include only the participants in whom the reference standard was performed are likely to produce biased estimates of the accuracy of the index tests. Several analytical solutions for dealing with missing outcomes are available; however, these solutions require knowledge about the pattern of missing data, and they are no substitute for complete data. Researchers should anticipate the mechanisms that generate missing reference standard results before the start of a study, so that measures and actions can explicitly be taken to reduce the potential for biased estimates of the accuracy of the tests, markers, or models under study, as well as to facilitate correction in the analysis phase. In all cases, researchers should include in their study report how missing data on the index test and reference standard were handled, as invited by the STARD reporting guideline.34


  • Contributors: All authors participated in the conception and design of the article, worked on the drafting of the article and revising it critically for important intellectual content, and have approved the final version to be published.

  • Funding: Netherlands Organization for Scientific Research (project 918.10.615).

  • Competing interests: All authors have read and understood the BMJ policy on declaration of interests and declare the following interests: none.

  • Provenance and peer review: Not commissioned; externally peer reviewed.

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See:

