Guidance for the design and reporting of studies evaluating the clinical performance of tests for present or past SARS-CoV-2 infection
BMJ 2021; 372 doi: https://doi.org/10.1136/bmj.n568 (Published 29 March 2021) Cite this as: BMJ 2021;372:n568
- Jenny A Doust, clinical professorial fellow1,
- Katy J L Bell, associate professor in clinical epidemiology2,
- Mariska M G Leeflang, assistant professor3,
- Jacqueline Dinnes, senior researcher4 5,
- Sally J Lord, associate professor6,
- Sue Mallett, professor in diagnostic and prognostic studies7,
- Janneke H H M van de Wijgert, professor of infectious and immune-mediated disease epidemiology8 9,
- Sverre Sandberg, professor and director10 11,
- Khosrow Adeli, division head and professor12 13,
- Jonathan J Deeks, professor of biostatistics4 5,
- Patrick M Bossuyt, professor of clinical epidemiology3,
- Andrea R Horvath, professor and director2 14 15
- 1Centre for Longitudinal and Life Course Research, School of Public Health, University of Queensland, Herston, QLD 4006, Australia
- 2School of Public Health, University of Sydney, NSW, Australia
- 3Department of Epidemiology and Data Science, Amsterdam University Medical Centres, University of Amsterdam, Amsterdam, Netherlands
- 4Test Evaluation Research Group, Institute of Applied Health Research, University of Birmingham, Birmingham, UK
- 5NIHR Birmingham Biomedical Research Centre, University Hospitals Birmingham NHS Foundation Trust and University of Birmingham, Birmingham, UK
- 6School of Medicine, Sydney, University of Notre Dame, Darlinghurst, NSW, Australia
- 7Centre for Medical Imaging, University College, London, UK
- 8Julius Centre for Health Sciences and Primary Care, University Medical Centre Utrecht, Utrecht University, Utrecht, Netherlands
- 9Institute of Infection, Veterinary, and Ecological Sciences, University of Liverpool, Liverpool, UK
- 10Department of Global Public Health and Primary Care, University of Bergen, Bergen, Norway
- 11Norwegian Quality Improvement of Laboratory Examinations, Haraldsplass Deaconess Hospital, Bergen, Norway
- 12CALIPER Program, Paediatric Laboratory Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- 13Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, ON, Canada
- 14New South Wales Health Pathology, Department of Chemical Pathology, Prince of Wales Hospital, Sydney, NSW, Australia
- 15School of Medical Sciences, University of New South Wales, Sydney, NSW, Australia
- Correspondence to: J Doust
- Accepted 22 February 2021
Testing for infection has a critical role in the response to the pandemic caused by SARS-CoV-2 identified in China in December 2019.1 Tests to identify SARS-CoV-2 infection and the disease caused by it (covid-19) have been developed at an extraordinary pace, moving rapidly from the identification of the viral ribonucleic acid (RNA) sequence on 10 January 20202 to the development of viral nucleic acid tests for the virus using reverse transcription polymerase chain reaction (RT-PCR) shortly thereafter. This development was followed by immunoassays for detecting the presence of viral antigens or antibodies in laboratories and at the point of care.
More than 1400 tests for SARS-CoV-2 are on the market or listed on websites such as the Foundation for Innovative New Diagnostics3 and the European Commission’s Joint Research Centre database,4 and more than 1700 preprints and peer reviewed journal articles evaluating tests for SARS-CoV-2 infection have been published as of January 2021.5 The volume of available evaluations of diagnostic test accuracy is unprecedented and is unlikely to diminish with the implementation of programmes to accelerate the development of new tests, such as the Rapid Acceleration of Diagnostics programme by the National Institutes of Health in the United States.6
A vital part of managing the pandemic is to ensure that evaluations of tests for SARS-CoV-2 infection are rigorous, unbiased, and conducted in the most efficient way possible so that the most accurate tests are rapidly identified and adopted in practice. The evidence standards framework of the United Kingdom’s National Institute for Health and Care Excellence (NICE) has outlined key evaluation concepts to assist with this process.7 However, systematic reviews of diagnostic accuracy studies of tests for SARS-CoV-2 have highlighted many methodological and reporting problems (table 1).9 10 11 12 13 14 15 These problems limit the ability of clinicians and policy makers to apply the results of such studies in diagnostic pathways and public health programmes and have led to poor clinical and public health decisions contributing to ongoing spread of the infection.16
This article aims to outline general principles for studies that evaluate the clinical performance of SARS-CoV-2 tests. Here, we use the term “SARS-CoV-2 tests” to refer to any of the following: viral nucleic acid, antigen, antibody, or other detection tests. The authors have expertise in the evaluation of diagnostic tests including the evaluation of SARS-CoV-2 tests, evidence based medicine, epidemiology, laboratory medicine, and virology. We have based the guidance in this paper on previously published work on diagnostic test evaluations, such as the STARD guideline for reporting of diagnostic accuracy studies,8 and the QUADAS-2 tool for appraising the risk of bias of diagnostic accuracy studies.17 We have also considered the guidance provided in templates issued by the US Food and Drug Administration (FDA) for Emergency Use Authorizations for in vitro diagnostic tests for SARS-CoV-2,18 the NICE evidence standards framework,7 the Medicines and Healthcare products Regulatory Agency (MHRA) and World Health Organization target product profiles,19 20 and the European Commission’s document on recommendations for covid-19 testing strategies21 and related documents.22
The article focuses on clinical performance studies investigating the diagnostic accuracy of SARS-CoV-2 tests in clinical or public health practice. Many of the studies initially undertaken and quoted in reports of test performance can be classified as studies of scientific validity (box 1).25 They are essential in the development of a test, analogous to the findings of phase I clinical trials. Similarly, analytical performance studies are necessary prerequisites before clinical application of a test.22 These studies cannot, however, provide realistic estimates of the diagnostic accuracy of the tests when used in clinical practice, and it is misleading to assume the results from such studies apply in the clinical setting.
Terminology used in this guidance
Clinical performance studies: assess the ability of a test to discriminate those who have the target condition from those who do not have the target condition in clinical or public health practice.8
Scientific validity studies: establish an association between an analyte and a clinical condition or physiological state.20 SARS-CoV-2 tests are often performed on artificial or restricted sample sets, for example, comparing residual samples from individuals admitted to hospital with covid-19 with control samples before 2020.
Analytical performance studies: refer to technical test performance, and can include data to demonstrate accuracy (derived from trueness and precision), analytical sensitivity (eg, limit of detection, limit of quantitation), analytical specificity, linearity, cut-off thresholds, measuring interval, cross contamination, as well as determination of appropriate specimen collection and handling, and endogenous and exogenous interference on assay results.21
Target condition: a particular disease, disease stage, health status, or any other identifiable condition within a patient, such as staging a disease already known to be present, or a health condition that should prompt clinical action, such as the initiation, modification, or termination of treatment.8
Index test: the test being evaluated.8
Reference standard: the best available method for establishing the presence or absence of the target condition related to the intended use of the test.8
Reference method: used in analytical studies to refer to the best analytical method to detect a measurand.
Reverse transcription polymerase chain reaction (RT-PCR): a molecular test using cyclical amplification of DNA to detect whether genetic material consistent with the SARS-CoV-2 virus is present in the sample (through a DNA template produced by reverse transcription of the viral RNA).
Cycle threshold (CT): each cycle of RT-PCR amplifies the number of DNA copies in the sample. The more virus that is present, the less amplification is needed to detect the virus. Laboratories will run samples through machines with a set number of cycles (typically 40-50 cycles), and will establish a threshold for when a sample is determined to be positive, for example, 35 or 40. Samples that test positive after this threshold could be retested.
Antigen testing: immunoassays that detect the presence of a specific viral antigen, which implies current viral infection.23
Lateral flow test: a form of immunoassay performed outside of the laboratory using a sample placed onto a test device, with the presence or absence of the target analyte demonstrated by a colour change. A common example is a pregnancy test. In this context, they are used to detect SARS-CoV-2 antigens or antibodies.
Antibody testing: serological or antibody tests detect resolving or past SARS-CoV-2 virus infection indirectly by measuring the person’s humoral immune response to the virus.24
Our test evaluation guidance is outlined in a series of steps, in the order of the STARD checklist, although the steps might not be sequential in practice. Table 1 outlines the STARD checklist items, noting some key methodological issues in the studies done of SARS-CoV-2 tests to date. The steps described below are illustrated with examples of possible study designs in table 2.
Many evaluations of SARS-CoV-2 tests do not provide accurate estimates of the performance of the tests in the intended clinical setting
Studies need to clarify if they are scientific validity studies or clinical performance studies
The purpose of the test, the study population, the methods for determining the decision thresholds for the test being evaluated, and the reference standard need to be carefully mapped out in the study design
Studies that compare diagnostic tests and diagnostic pathways, preferably by investigators independent of those who developed the tests, are valuable
Step 1: Define the intended use of the test
Many published evaluations of SARS-CoV-2 tests are not able to provide an accurate estimate of the performance of the test in clinical practice because the relation between the purpose of the test, the selection of the study population, and the selection of the reference standard has not been carefully mapped out before the conduct of the study. Before beginning an evaluation of a SARS-CoV-2 test, researchers should define how the test will be used in the clinical or public health pathway. Some possible indications for use of SARS-CoV-2 tests are listed below.26
For viral nucleic acid (such as RT-PCR) and antigen testing:
To diagnose covid-19 in individuals with symptoms suggestive of the disease
To test asymptomatic or presymptomatic individuals, or individuals with mild symptoms, who have known recent exposure to another person with confirmed covid-19 (eg, as part of localised outbreak investigations and test and trace programmes)
To screen individuals at risk of acquisition or transmission of infection (eg, staff or patients in hospital or staff or residents in aged care or education facilities, as part of outbreak prevention programmes)
To evaluate if a person with SARS-CoV-2 infection has cleared the virus
To establish the prevalence of current SARS-CoV-2 infection in a population (eg, for public health decisions, or to estimate pre-test probability for an individual in that population).
For serology (antibody) testing:
To investigate patients presenting late after symptom onset in whom viral nucleic acid testing is negative or where viral nucleic acid testing is not available to confirm whether they were infected with SARS-CoV-2
To determine antibody presence as part of a broader immunological assessment (eg, in intervention studies evaluating SARS-CoV-2 vaccine immunogenicity or the efficacy of convalescent plasma)
To estimate the seroprevalence of past and recent SARS-CoV-2 infection in a population (eg, for public health decisions).
Testing to assess if an individual has immunity to further infection is also of key interest. However, this requires studies that demonstrate that specific immune responses, such as the presence of antibodies (neutralising or non-neutralising), T cell, or other cellular responses, lead to protection from clinically important infection or re-infection. The detection of antibodies in itself is insufficient to demonstrate immunity. As yet, we do not have strong evidence of what immune responses are necessary for immunity to SARS-CoV-2 infection.27 28 29
Defining the clinical (or public health) pathway involves not only describing the test, but also the test population, the role and position of the test (including what tests are conducted before and after the test being studied), how the test results will be used, and their impact on management decisions. Testing strategies also need to consider the availability of test materials and other resources, and the prevalence of infection in the community. Each type of test has different requirements in terms of equipment, expertise of the operator, sample types, sample storage, and turnaround time. Mathematical modelling studies have shown that reducing the time between symptom onset and a positive test result, assuming immediate isolation, is the most important factor for improving the effectiveness of test and trace programmes,30 so in some settings there may be a trade-off between turnaround time and diagnostic accuracy.
False negative test results could lead to infected individuals continuing to come into contact with and potentially infecting other individuals. False positive test results may lead to individuals being told incorrectly that they are infected with SARS-CoV-2, and to unnecessary isolation measures and restrictions of movement and activities for both the individual and the community. The rate of infection in the group (that is, the prevalence in the group) will affect the predictive values of the test (that is, the probability of false positive and false negative test results; fig 1). For example, in settings where there is a very high rate of transmission, the pre-test probability of infection for an individual might be so high that even a negative test result does not safely rule out infection to a level that an individual can be assumed to be non-infectious unless the test has a very high sensitivity.31
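The dependence of predictive values on prevalence can be illustrated with a short calculation using Bayes' theorem. The Python sketch below is illustrative only: the function name and the sensitivity, specificity, and prevalence values are assumptions chosen for the example, not estimates for any particular SARS-CoV-2 test.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given prevalence.

    All arguments are proportions between 0 and 1. The prevalence must be
    strictly between 0 and 1 for both values to be defined.
    """
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)   # P(infected | positive result)
    npv = true_neg / (true_neg + false_neg)   # P(not infected | negative result)
    return ppv, npv


# Hypothetical test: 80% sensitivity, 99% specificity (assumed values)
for prevalence in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.80, 0.99, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

With the same assumed sensitivity and specificity, the positive predictive value rises and the negative predictive value falls as prevalence increases, which is why a negative result is less reassuring in high transmission settings.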
Groups such as the FDA in the US,18 the MHRA in the UK,19 and WHO20 have set acceptable and desirable performance characteristics for SARS-CoV-2 testing (called target product profiles by the MHRA and WHO). The targets set by these agencies show a low tolerance for both false negative and false positive results in the setting of the SARS-CoV-2 pandemic. Acceptable clinical performance characteristics are determined by the values placed on the consequences of testing and are not definitive or intrinsic to the test.
Where clinical pathways are more established, it is generally desirable to establish minimum acceptable clinical performance characteristics before conducting a clinical performance study.32 In the setting of a pandemic, however, where the rate of infection in the community is changing and new tests, treatments, and responses to infection are rapidly becoming available, this is not likely to be feasible. In this context, groups conducting clinical performance studies should make the information from their protocols and reports available to public health and clinical decision makers in a rigorous, transparent, and timely manner.
Studies should also clearly outline existing or alternative clinical pathways, including whether the test being evaluated is intended to replace an existing test or is in addition to existing testing.33 For example, a reverse transcription loop-mediated isothermal amplification test might be used as a replacement diagnostic test for RT-PCR, to reduce the demand for reagents and allow for faster turnaround time. Studies that explicitly compare diagnostic tests in clinical pathways are valuable for clinical and public health decision makers.
Understanding the timing of the viral and immunological responses to SARS-CoV-2 infection is critical in considering the clinical pathway. After exposure to SARS-CoV-2, the virus typically becomes detectable by RT-PCR testing on the third or fourth day after infection (fig 2).34 35 Symptoms typically appear around the fifth day of infection, and both symptoms and viral detection last for several days to weeks, depending on the severity of infection.36 Studies using repeat RT-PCR testing and tracking of transmission rates (including infector-infectee transmission pairs) have shown about 40% of transmissions occur before the development of symptoms,37 and peak infectiousness occurs about one day before until two to three days after symptom onset in typical individuals.34 Antibodies are generally low in the first week after symptom onset (in people with covid-19 confirmed by RT-PCR), with most individuals seroconverting by day 10 to 14, and diagnostic sensitivity for SARS-CoV-2 infection of serology tests only exceeds 90% in the third week after symptom onset,9 10 11 and then begins to decline.38 It is not yet known how long high levels of antibodies to SARS-CoV-2 infection persist, but the observations to date show that the response among individuals varies, influenced by disease severity.28 29 38
Researchers might not be able to predict all aspects of intended uses of the test as well as consequences of the test result. However, researchers should consider the potential clinical pathways a priori and how this will affect the application, timing, and interpretation of the results of the test, and therefore the design of their study.
Step 2: Define the target condition
Building on the first step, researchers must clearly define the target condition of interest—that is, what the test aims to detect. For SARS-CoV-2 tests, potential target conditions include infection with the virus, disease caused by the virus (that is, covid-19), infectiousness, the presence or extent of immune responses to the virus, clearance of the virus, past or recent infection with the virus, and immunity to infection. Explicit consideration of the target condition of interest helps identify further elements that guide study design, such as the population to be tested and acceptable reference standards for defining the presence of the target condition. For most clinical performance studies, the target condition will be SARS-CoV-2 infection (which includes symptomatic, presymptomatic, and asymptomatic infection).
Some settings could require researchers to establish whether someone is infectious rather than whether someone has the infection. For example, if an individual presents in a healthcare setting, knowing whether they are infectious or not influences the need for personal protective equipment and other infection control measures immediately; whereas determining whether they have the infection is less urgent if the individual’s symptoms are mild but SARS-CoV-2 infection cannot be excluded. Testing for infectiousness, rather than infection, has also been suggested as a possible method for screening in other settings, including opening businesses and allowing public gatherings.39 Although such strategies should be investigated, the entire clinical pathway for such strategies needs to be evaluated, including the potential consequences of false positive and negative test results.
Step 3: Define the population in which the test will be evaluated
Poor patient selection and description of study groups have severely limited the ability to establish the diagnostic accuracy of SARS-CoV-2 tests to date. Scientific validity studies, often of a case-control design, cannot provide realistic estimates of the diagnostic accuracy of the tests when used in clinical practice. To establish diagnostic accuracy, clinical performance studies should be conducted in individuals sampled from the population in which the test will be used, as determined by the intended use in step 1. Examples of possible populations for diagnosing current (or prior) infection include: individuals with current (or previous) symptoms suggestive of covid-19; individuals at high risk of exposure (such as close contacts of people with confirmed disease); individuals at high risk of both exposure and transmission (such as healthcare workers or residents of aged care facilities) and patients admitted to hospital with suspected covid-19. Based on the target population, studies should then define the method for enrolling participants into the study, including inclusion and exclusion criteria, aiming to recruit participants representative of the target population. Ideally, where the intended test use is in a healthcare setting, consecutive individuals from the target population would be recruited without previous knowledge of whether the individuals have the target condition or not. For population based studies, where the intended test use is for public health decisions, a representative random sample of the target population could be used. Studies using people with known disease and healthy controls introduce selection bias and effects related to the clinical spectrum of disease.
The diagnostic accuracy observed in studies of patients admitted to hospital with severe covid-19 or recruited from hospital settings might not apply to other settings. For example, although the intended use population for most serology tests is a community setting that includes individuals who have experienced no or mild covid-19 symptoms, most published studies of these tests have recruited patients admitted to hospital with severe infection. Antibody production in this population is likely to be higher than in the wider population of those infected.9
If the purpose of the test is to establish the presence of SARS-CoV-2 infection in a community setting or a clinical population, patients with respiratory symptoms due to respiratory illnesses other than SARS-CoV-2 should not be excluded from the study because these patients will be tested in clinical practice. Careful thought should be given to the presence or absence of symptoms that might be used as eligibility criteria for the study. The presence of, for example, respiratory symptoms, prompts correct selection of the anatomical site for the sample and correct timing (during symptoms). When testing for asymptomatic infection, neither of these helpful prompts is available, meaning that other epidemiological information (eg, risk of exposure, and time since exposure, if known) and more than one sample (anatomical or time point) might need to be tested. Viral nucleic acid typically can be detected on the third day after exposure in nasal, throat, or saliva secretions.34 35 It is unclear whether virus is typically detected in faeces and sputum two days after infection, or if later time points are relevant for these sites of sampling.
In addition to defining the population, researchers should record and report characteristics of study participants during the course of the study, such as the presence of key symptoms (temperature, cough and so forth), time since a high risk contact (defined as contact within a certain distance of a person with confirmed or probable SARS-CoV-2 infection and for a certain amount of time), viral load if known, markers of disease severity, and time since the development and cessation of symptoms. The number of, and reasons for, any exclusions of individuals from the study following recruitment should also be recorded.
The accuracy of all tests depends on their timing, so it is essential to record the time point in the disease course at which the test is done, in relation to time since known exposure and time since onset of symptoms. Owing to differences in healthcare provision and pathways, only recording time since healthcare events (such as admission to hospital, intensive care units, or results from RT-PCR) restricts the ability of study findings to be generalised to other settings.
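One way to report accuracy in relation to timing is to stratify sensitivity by time since symptom onset and give a confidence interval for each stratum. The Python sketch below is a minimal illustration under stated assumptions: the records are invented, the time windows are arbitrary, every participant is assumed to be positive on the reference standard, and the Wilson score interval is one reasonable choice of interval among several.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


def sensitivity_by_period(records, periods):
    """Sensitivity per time window among reference standard positive participants.

    records: (days_since_symptom_onset, index_test_positive) pairs.
    periods: (label, first_day, last_day) windows, both days inclusive.
    """
    summary = {}
    for label, first_day, last_day in periods:
        results = [positive for days, positive in records
                   if first_day <= days <= last_day]
        if not results:
            continue  # no participants were sampled in this window
        detected = sum(results)
        summary[label] = (detected, len(results), detected / len(results),
                          wilson_interval(detected, len(results)))
    return summary


# Hypothetical data for illustration only
records = [(1, False), (3, True), (5, True), (6, False),
           (9, True), (10, True), (12, True), (13, False),
           (16, True), (18, True), (20, True)]
periods = [("days 0-7", 0, 7), ("days 8-14", 8, 14), ("days 15-21", 15, 21)]

summary = sensitivity_by_period(records, periods)
for label, (detected, n, sens, (low, high)) in summary.items():
    print(f"{label}: {detected}/{n} = {sens:.0%} (95% CI {low:.0%} to {high:.0%})")
```

The same structure could be keyed on time since known exposure instead of time since symptom onset. Small strata give wide intervals, which is worth showing explicitly rather than pooling across time points.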
Step 4: Describe the index test
Given the natural history of infection over time, variations in viral load, and the current limitations in test accuracy, combinations of tests, or tests at different time points might be needed to identify all true cases and non-cases. The index test strategy could therefore be one test, the same test repeated at different time points, or a combination of different tests, such as a test with lower specificity followed by a test with higher specificity in those initially positive. Ideally, the entire testing pathway would be evaluated.
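The expected accuracy of a two test strategy can be sketched before a study is designed. The Python example below is a rough illustration, not a substitute for evaluating the whole pathway directly: it assumes that the two tests err independently of each other given true infection status (conditional independence), an assumption that often fails in practice, and all accuracy values are hypothetical.

```python
def serial_confirmation(sens_1, spec_1, sens_2, spec_2):
    """Second test applied only to those positive on the first test.

    The strategy is positive only if BOTH tests are positive.
    Assumes the tests are conditionally independent given infection status.
    """
    sensitivity = sens_1 * sens_2
    specificity = spec_1 + (1 - spec_1) * spec_2
    return sensitivity, specificity


def parallel_either_positive(sens_1, spec_1, sens_2, spec_2):
    """Both tests applied to everyone; positive if EITHER test is positive.

    Same conditional independence assumption as above.
    """
    sensitivity = 1 - (1 - sens_1) * (1 - sens_2)
    specificity = spec_1 * spec_2
    return sensitivity, specificity


# Hypothetical screening test followed by a more specific confirmatory test
sens, spec = serial_confirmation(0.90, 0.95, 0.80, 0.999)
print(f"serial confirmation: sensitivity {sens:.1%}, specificity {spec:.3%}")

sens, spec = parallel_either_positive(0.90, 0.95, 0.80, 0.999)
print(f"either positive: sensitivity {sens:.1%}, specificity {spec:.3%}")
```

Under these assumptions, confirmation of initial positives raises specificity at the cost of sensitivity, whereas treating either positive result as positive does the reverse. If errors on the two tests are correlated, the gains will be smaller than this calculation suggests.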
SARS-CoV-2 tests can be developed commercially or in-house by a laboratory, and need to meet key regulatory or emergency use authorisation requirements for in vitro medical devices.18 19 20 21 22 All pre-analytical, analytical, and postanalytical characteristics of the test should be described, including the items in the list below.
full name of the test and manufacturer, and associated batch numbers allowing clear identification
type of samples suitable for testing (eg, nasopharyngeal swab, sputum, saliva, blood)
method of collection of specimens and how the sample was taken (eg, whether a long swab was used for RT-PCR tests)
who took the sample (eg, their clinical training)
conditions for specimen handling, transport, and storage
actual target of the assay (what is being measured; eg, viral nucleic acid, antigen, or antibody against specific viral proteins)
principles of analytical methods (eg, fluorescence, multiplex fluorescence, or digital RT-PCR; enzyme linked immunoassay or lateral flow assay)
platform used for measurement (how and with what device the target analyte is measured)
where the analysis was done, if relevant (eg, at the point of care or in a reference laboratory)
analytical performance measures of the test (eg, analytical sensitivity or limit of detection, cross reactivity, accuracy, trueness, precision)
decision limits at which the test is considered positive or negative, where applicable.
The study should determine a priori which specimen types will be tested. The results of evaluations on one type of specimen cannot be generalised to other specimen types without further validation. The type of specimen and the methods used to collect and analyse the specimen need to reflect the methods intended to be used in standard clinical practice. For PCR and antigen tests, the anatomical site used for collection of the specimen should be stated; for example, whether the specimen is taken from the upper respiratory tract (nasal or pharyngeal swab – including insertion depth, or saliva), the lower respiratory tract (bronchoalveolar lavage, sputum), or elsewhere (urine, faeces, blood). Samples using viral transport medium spiked with inactivated virus are not appropriate for assessing the test’s clinical performance. For antibody tests, the sample type could be venous whole blood, plasma, serum, or finger prick capillary whole blood. Elution protocols for dried blood spots should be available if used. Tests should be evaluated preferably with samples that are prospectively collected.
The actual targets that the test is measuring must be clearly stated or reference must be given to the actual measurement procedure or vendor’s instructions. For viral nucleic acid tests by RT-PCR, the primer binding site (and for antigen tests, the specific antigen targeted) should be stated and whether the specimens were run with or without extraction, heat inactivation, or pooling. For serology tests, it is important to describe the viral proteins targeted by the antibody (typically the spike protein S1 or S2, which are specific for SARS-CoV-2, or the nucleocapsid protein, which is conserved among all coronaviruses), the type of immunoglobulin(s) detected (that is, IgA, IgG, or IgM), and the immunological method used (eg, enzyme linked immunosorbent assay, chemiluminescence immunoassays, lateral flow immunoassays, and fluorescent immunoassays). Depending on the question being asked as determined in step 1, the authors will also need to determine whether the index test is identifying neutralising or non-neutralising antibodies.
The key analytical performance indicators of the tests used in the evaluation should be known before starting a clinical performance study. These characteristics should be described, if possible, using appropriate reference measurement methods to ensure that they adequately measure the presence or quantities of the virus or antibodies, and will usually be described in the instructions for use documentation. These typically cover the limit of detection, reportable range, imprecision, trueness as compared to a reference method, and the analytical specificity of the tests. Recommended methods for performing these analyses are given in the FDA templates18 and elsewhere.40 41 Quality controls, such as negative and positive controls, and linearity checks that measure spiked samples with increasing concentrations of the virus, antigen, or antibody are also necessary. For RT-PCR, the limit of detection is typically measured by spiking RNA or inactivated virus into an artificial or real clinical matrix, such as bronchoalveolar lavage fluid or sputum. The limit of detection should be reported, for example, as viral copies per millilitre.
Cross reactivity with other viral RNA or antigens or antibodies to previous infections (analytical specificity) also needs to be evaluated to show that the test does not cross react with normal microbiota or other pathogens that might be present in the clinical specimen. High priority organisms for the evaluation of cross reactivity are listed in the FDA templates.18 Potential cross contamination within the laboratory also needs to be minimised, and controlled by good laboratory practice. Contaminated reagents in laboratories have led to false positive test results.42 A proportion of samples within the study should therefore be tested for cross contamination, and this proportion should be stated.
Measures of precision (repeatability and reproducibility) might be important, for example, if different operators will be analysing results in the laboratory or at the point of care. Repeatability reflects closeness of agreement between results of successive measurements carried out under the same laboratory conditions, while reproducibility reflects closeness of agreement between results of measurements performed under changed conditions of measurement (eg, time, operators, calibrators, and reagent lots).43 The lot-to-lot variability of tests should be stated.
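The distinction between repeatability and reproducibility can be made concrete with a simple variance components calculation. The Python sketch below is one possible approach, assuming a balanced design in which the same sample is measured the same number of times under each condition (eg, by each operator or with each reagent lot); the function name and the measurements are hypothetical, and a formal precision study would follow an established protocol.

```python
import math
import statistics


def precision_components(runs):
    """Estimate repeatability and reproducibility standard deviations.

    runs: list of equally sized lists of replicate measurements of the same
    sample, one inner list per condition (eg, operator, day, or reagent lot).
    Uses one-way analysis of variance for a balanced design.
    """
    n = len(runs[0])
    if len(runs) < 2 or n < 2 or any(len(r) != n for r in runs):
        raise ValueError("need at least 2 conditions with equal replicates (>= 2)")
    # Within-condition variance: agreement under unchanged conditions
    ms_within = statistics.fmean([statistics.variance(r) for r in runs])
    # Between-condition variance, truncated at zero if estimated as negative
    condition_means = [statistics.fmean(r) for r in runs]
    ms_between = n * statistics.variance(condition_means)
    var_between = max(0.0, (ms_between - ms_within) / n)
    sd_repeatability = math.sqrt(ms_within)
    sd_reproducibility = math.sqrt(ms_within + var_between)
    return sd_repeatability, sd_reproducibility


# Hypothetical CT values for one control sample, three operators, three replicates
runs = [[24.1, 24.3, 24.2], [24.8, 24.6, 24.9], [24.0, 24.2, 24.1]]
sd_r, sd_R = precision_components(runs)
print(f"repeatability SD {sd_r:.2f}, reproducibility SD {sd_R:.2f}")
```

The reproducibility estimate includes the repeatability component, so it is never smaller. A large gap between the two suggests that operators, lots, or other changed conditions contribute materially to variation in results.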
Postanalytical characteristics—decision limits
Decision limits need to be defined for positive, negative, and indeterminate results. Preferably, these cut-off points are selected a priori, for example, based on the manufacturer’s guidance, or from previous scientific validity studies. If invalid or indeterminate results are repeated, the methods for deciding this process should be described and the number of such repeat tests should be reported. Cut-off points derived from the data collected within the study can bias estimates of test performance.4445 If no prior data exist to determine cut-off points, or when the cut-off point was established in people with symptoms but the test is intended to be used in non-symptomatic individuals or individuals with mild symptoms, then it must be made clear that further external validation of the optimal cut-point is needed in an appropriately selected and representative population.
For RT-PCR tests, considerable attention has been given to the number of amplification cycles used and the cycle threshold (CT) to determine if a test is positive, negative, or indeterminate. Although a strong relation exists between CT and viral load, choosing the CT is not easily generalisable between tests, kits, testing platforms, and laboratories. CT values can be transformed into concentrations using a calibration curve for each testing pathway (test, kit, platform, and laboratory), allowing for direct comparisons between different testing pathways. The CT or concentration cut-off points used in the evaluation should be clearly explained, and the methods for managing an indeterminate test clearly outlined.
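The conversion of CT values into concentrations mentioned above can be sketched as follows. This is a minimal illustration that assumes a log-linear standard curve, CT = slope * log10(concentration) + intercept, fitted by least squares to a dilution series for a single testing pathway. The function names and the dilution values are hypothetical and are not taken from any particular assay.

```python
def fit_standard_curve(log10_concentrations, ct_values):
    """Least squares fit of CT = slope * log10(concentration) + intercept."""
    n = len(ct_values)
    mean_x = sum(log10_concentrations) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_concentrations)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_concentrations, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def ct_to_concentration(ct, slope, intercept):
    """Invert the standard curve to estimate concentration from a CT value."""
    return 10 ** ((ct - intercept) / slope)


# Hypothetical dilution series for one pathway (test, kit, platform, laboratory):
# log10 copies/mL and the CT observed at each dilution.
standards_log10 = [3.0, 4.0, 5.0, 6.0, 7.0]
standards_ct = [33.2, 29.9, 26.6, 23.3, 20.0]

slope, intercept = fit_standard_curve(standards_log10, standards_ct)
estimated = ct_to_concentration(26.6, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, copies/mL={estimated:.0f}")
```

Because each testing pathway has its own curve, the same CT value can map to different concentrations in different laboratories, which is why the fit would need to be repeated for each pathway.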
Step 5: If applicable, describe which tests are compared and why
With the rapid development of so many SARS-CoV-2 tests, decisions need to be made regarding the comparative performance of different tests. The comparison can be between different forms of testing, different tests of the same form, or different testing strategies. Each test included in the study should be described as in step 4.
Comparisons of index tests can involve a comparison of two or more index tests against a common reference standard or compare the agreement of two tests against each other. In the case of the first scenario, both index tests are best performed in the same individuals, using a direct comparison, rather than as an indirect comparison of the index test against the reference standard in two different study groups.
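A direct comparison in the same individuals produces paired data, so the analysis would normally focus on individuals in whom the two index tests disagree. The sketch below compares the sensitivity of two index tests among individuals who are positive on the reference standard and applies an exact McNemar test to the discordant pairs. The McNemar test is one common choice for paired proportions and is an assumption here, not a method prescribed by this guidance; the counts are hypothetical.

```python
from math import comb


def paired_sensitivities(pairs):
    """pairs: (test_a_positive, test_b_positive) for each individual who is
    positive on the reference standard and received both index tests."""
    n = len(pairs)
    a_positive = sum(1 for a, _ in pairs if a)
    b_positive = sum(1 for _, b in pairs if b)
    a_only = sum(1 for a, b in pairs if a and not b)
    b_only = sum(1 for a, b in pairs if b and not a)
    return a_positive / n, b_positive / n, a_only, b_only


def exact_mcnemar_p(a_only, b_only):
    """Two-sided exact McNemar p value based on the discordant pairs only."""
    n = a_only + b_only
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical data: 100 reference standard positive individuals.
pairs = ([(True, True)] * 70 + [(True, False)] * 8
         + [(False, True)] * 2 + [(False, False)] * 20)

sens_a, sens_b, a_only, b_only = paired_sensitivities(pairs)
p_value = exact_mcnemar_p(a_only, b_only)
print(f"sensitivity A={sens_a:.2f}, B={sens_b:.2f}, p={p_value:.4f}")
```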
Studies that make head-to-head comparisons of many tests in the same samples efficiently provide important and useful information about comparative test accuracy. However, the practicalities of obtaining adequate samples to perform all included tests without compromising the generalisability of the study findings must also be considered.
The aim of the comparison should be specified. For example, the aim of the study could be to perform a descriptive analysis of all included index tests or to determine if a new test has higher sensitivity and equivalent specificity, or faster turnaround time and equivalent diagnostic accuracy. Although one characteristic might be specified as the primary outcome (eg, improved sensitivity), other measures of clinical performance will also need to be evaluated, such as the test’s specificity. Note that the comparator test is not the same as the reference standard described in step 6.
Step 6: Define the reference standard
The reference standard needs to clearly separate individuals who have the target condition from those who do not; for example, those who have or have had the infection from those who do not or have not had the infection, or those who are infectious from those who are not infectious. Irrespective of the intended use, in clinical performance studies, the interpretation of the index test (or tests), the comparator test (or tests), and the reference standard test need to be conducted masked to the results of the other test (or tests).
In the systematic reviews of SARS-CoV-2 tests to date, a high proportion of studies have used a reference standard with a high risk of bias, which does not apply to the clinical population of interest.9 10 11 12 13 14 15 Selection of the appropriate reference standard for evaluation of SARS-CoV-2 tests is not simple, and several issues described below need to be considered.46
For studies where the target condition is SARS-CoV-2 infection
SARS-CoV-2 infection includes individuals who do not have symptoms, those who are presymptomatic, and those who have symptoms. WHO has published definitions of suspected, probable, and confirmed covid-19 based on clinical, epidemiological, and laboratory criteria, with recommended associated testing.47 48 According to this advice, a person with confirmed covid-19 is defined as having laboratory confirmation of SARS-CoV-2 infection, irrespective of clinical signs and symptoms. This definition can be confusing, because in most publications covid-19 is the disease caused by the SARS-CoV-2 virus and thus is equivalent to symptomatic infection, not to infection in itself.
WHO defines an individual with probable covid-19 as someone who has symptoms indicative of the disease (fever, cough, general weakness or fatigue, headache, myalgia, sore throat, coryza, dyspnoea, anorexia, nausea or vomiting, diarrhoea, altered mental status); has an epidemiological risk of exposure; and is a contact of a person with probable or confirmed covid-19, has chest imaging findings suggestive of covid-19, has a loss of taste or smell, or death has occurred that is not otherwise explained in an adult with respiratory distress preceding death and was a contact of an individual with probable or confirmed covid-19 or epidemiologically linked to a cluster with at least one person with confirmed covid-19. These WHO definitions are necessary to standardise clinical protocols and reporting but will also misclassify a proportion of cases. Some individuals will be classified as having probable covid-19, but not be infected with SARS-CoV-2. On the other hand, some individuals will have had exposure, have had symptoms and investigations such as imaging that indicate covid-19, but have tested (either by RT-PCR or antibody) negative. These individuals are not classified as having definite covid-19. If the WHO classification is used as a reference standard, a sensitivity analysis of the test’s clinical performance using a reference standard including probable disease should be presented.
Putting aside the confusion caused by terminology, viral nucleic acid testing (specifically RT-PCR) is frequently used as a reference standard for SARS-CoV-2 infection, where the individual has had possible exposure up to two weeks before testing. After this period, viral load decreases in many individuals, reducing the sensitivity of the RT-PCR. Although the specificity of viral nucleic acid testing is thought to be very high, it is not 100%. The probability of false positive test results is difficult to determine, but it is possible that at least some individuals who have tested positive and who remain asymptomatic have never had the virus. Some false positive test results might be due to cross contamination with other samples or clerical error in reporting results. Repeat testing could identify some false positive results, but interpretation of discordant results is complex. For example, a second test, especially if done beyond the typical 14 day test window after exposure, might be negative because the individual no longer has the virus. Repeat testing in individuals with confirmed covid-19 shows that false negative results occur, particularly in the first few days after exposure or late in the course of infection.34 35 36 49 50 Poor sampling technique, samples from the wrong anatomical site, and incorrect transport of specimens can also contribute to false negative results. One negative viral nucleic acid test is inadequate to rule out SARS-CoV-2 infection.
Performance of viral nucleic acid testing as a reference standard could be improved by ensuring appropriate collection, repeat testing for those who initially test negative within an appropriate time window (eg, within five days after symptom onset or on the fourth day after exposure if exposure date is known), or by samples from multiple sites or with multiple genetic targets.51 52 Serology could be used if exposure is thought to have occurred more than 14 days previously. However, serology also has a high false negative rate, and might also have false positive results due to the presence in the specimen of substances such as rheumatoid factor, heterophile antibodies, haemolysis, fibrin, and other types of coronaviruses.53 54 Repeat testing and combinations of tests, however, add a greater layer of complexity in deciding what is considered a true positive and true negative result and will add to the resources needed to conduct an evaluation. If repeat or multiple testing is used as part of the reference standard, the testing strategy needs to be clearly outlined with the same strategy used for all individuals included in the study, not just those samples with a discordant result between the index test and the reference standard.55
For asymptomatic infection, clinical reference standards are not possible because there are no clinical symptoms and because the number of asymptomatic patients detected with other forms of testing, such as lung imaging to detect inflammation, will be low.
For studies where the target condition is covid-19
Covid-19 is the disease caused by SARS-CoV-2 and therefore includes all patients with symptoms. For diagnosing covid-19 disease, the clinical reference standard is likely to be a combination of clinical information, including repeat or multiple RT-PCR tests, other tests (including chest imaging), serological antibody testing, and clinical follow-up. Studies should specify which clinical information is used as part of the clinical reference standard, and attempts should be made to obtain this information for all study participants, for example, using the information included in the WHO definitions for individuals with probable disease. Clinical follow-up and repeat testing of those who develop symptomatic disease or more severe disease will detect at least a proportion of individuals with covid-19 who are initially negative on RT-PCR testing.13 The use of multiple sources of clinical information as a reference standard ensures more complete identification of cases, but it can also lead to either an underestimation of the diagnostic sensitivity of an index test (if individuals defined by the reference standard as having disease are actually true negatives) or an overestimation of the sensitivity of an index test (if the results of the index test are incorporated into the definition of the target condition). A reference standard using all clinical information, while not perfect, is probably the best that can be achieved at present.
For studies where the target condition is previous SARS-CoV-2 infection
If the purpose of the test is to identify previous SARS-CoV-2 infection, for example, to validate use of a serology test for a seroprevalence survey, the reference standard needs to demonstrate clear evidence of the presence or absence of previous infection. Such evidence can be shown through results of a previous RT-PCR test plus clinical information about potential exposure risk and clinical follow-up. Timing of such testing with RT-PCR is difficult, especially in asymptomatic and presymptomatic individuals. Therefore, if the test is intended for seroprevalence surveys, the best study design would involve a large number of randomly selected individuals who are regularly tested with repeat PCR weekly or biweekly as a reference standard for as long as there is a risk of exposure to the virus, and followed up by serology testing 2-3 weeks after the last RT-PCR test. However, such studies, especially in a low prevalence setting, would be costly, and uncomfortable for study participants.
Exclusion of prior infection needs to be established as robustly as the presence of current infection. Many studies evaluating serology tests have used samples from pre-pandemic serum and blood banks, either from health resources or from study sample archives. Such studies can measure scientific validity and analytical sensitivity and specificity, but do not measure clinical performance.
Comparisons of different forms of serology testing can be valuable, but must be made against an appropriate reference standard, and require understanding that the development of an immune response varies between individuals in its timing, its intensity, and the parts of the virus that the antibody responses target. Inclusion of a category for individuals with probable disease might be useful.
For studies where the target condition is infectiousness
Although a positive RT-PCR test result indicates presence of viral RNA, it does not necessarily indicate that the individual is infectious. Infectiousness requires the virus to be present in a bodily secretion that could result in transfer of virus to another individual, and also that the virus particles in secretions remain infectious—that is, are still viable virus particles as opposed to inactive or remnants of virus particles. The ability to use a rapid test that determines whether an individual is infectious could have advantages in some settings, as described above. However, a reference standard for determining viable and non-viable viruses in the patient’s specimen does not currently exist. Assays of virus infectivity in cell culture and viral replication could be a measure of virus viability and infectivity, but are currently not suitable outside a research setting, as the assays are time consuming and methods are still being refined including sampling methods, transportation and culture media. Cell culture assays are problematic as a reference standard because they appear to have suboptimal sensitivity for detecting infectiousness. Early in the course of infection, which we expect to be the most infectious stage, samples from RT-PCR positive individuals with the virus might not grow virus on cell culture.56 While samples that return a positive RT-PCR result at a higher CT could indicate viral remnants at a point when the patient is no longer infectious, they might also indicate an early point in the course of the infection. Using a lower CT for determining infectiousness will reduce the sensitivity of the test to detect all infectious individuals. Similarly, the assumption that only those people with high viral load are infectious will miss individuals who have lower viral loads but who are still capable of passing on the infection.16
For studies where the target condition is SARS-CoV-2 infection clearance
Diagnosis of SARS-CoV-2 clearance (that is, absence of detectable viral particles whether viable virus or not) generally requires at least two negative RT-PCR tests. However, testing at multiple anatomical sites has shown that the virus is cleared from the upper respiratory tract before clearance from the lower respiratory tract.12 Time for clearance from the gastrointestinal tract varies greatly between individuals. It is not known whether presence of the virus in faeces has a role in the spread of infection, although this was a significant route for spreading infection in SARS.
Step 7: Analysis and presentation of results
Poor reporting of studies evaluating SARS-CoV-2 tests has been a common methodological concern in the studies to date. Reports should follow the STARD reporting guidelines for diagnostic accuracy studies.8 Researchers should include the STARD flow diagram to report the number of individuals included in the study, the number of individuals excluded from the study before testing, the number of individuals whose samples were not tested, and the number of individuals who had samples tested but who were not included in the study (eg, who did not receive the reference standard, or had indeterminate or outlier results; fig 3). The diagram might need to be adapted for studies that use repeated testing over time. The prevalence of SARS-CoV-2 in the study group needs to be clearly identified, and where possible, study reports should indicate transmission intensity and co-circulating pathogens at the time of the study.
Sample size and unit of analysis
The sample size should be the number of individuals included in the study, not the number of samples tested. If more than one test from some individuals is included in the study, the repeat test should not be included in the same estimates of sensitivity and specificity. Repeat samples from the same individual can be included, however, for the estimation of sensitivity and specificity at different time points (one repeat at each time point). Such analyses can be helpful in establishing the sensitivity and specificity of a test over time. Where repeat testing occurs, the reason for repeat testing should be reported and the reporting of repeated samples should be clear. If more than one test from all individuals is included in an evaluation of a testing strategy (rather than evaluation of one test), then the sample size is again the number of individuals included in the study.
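The unit of analysis rule can be illustrated with a short sketch that counts individuals rather than samples and keeps at most one sample per individual at each time point. The record fields (`participant_id`, `time_point`, `collected_order`) and the choice to keep the earliest sample are assumptions made for illustration; a study would need to prespecify its own rule.

```python
def sample_size(samples):
    """Sample size is the number of distinct individuals, not of samples."""
    return len({s["participant_id"] for s in samples})


def one_sample_per_individual(samples, time_point):
    """Keep the earliest collected sample per individual at one time point."""
    selected = {}
    for s in samples:
        if s["time_point"] != time_point:
            continue
        pid = s["participant_id"]
        current = selected.get(pid)
        if current is None or s["collected_order"] < current["collected_order"]:
            selected[pid] = s
    return list(selected.values())


# Hypothetical records: participant P1 has a repeat sample in week 1.
samples = [
    {"participant_id": "P1", "time_point": "week1", "collected_order": 1, "index_positive": False},
    {"participant_id": "P1", "time_point": "week1", "collected_order": 2, "index_positive": True},
    {"participant_id": "P1", "time_point": "week2", "collected_order": 3, "index_positive": True},
    {"participant_id": "P2", "time_point": "week1", "collected_order": 1, "index_positive": True},
]

n_individuals = sample_size(samples)
week1 = one_sample_per_individual(samples, "week1")
print(n_individuals, len(samples), len(week1))
```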
Although researchers should evaluate sensitivity and specificity in the same population to estimate clinical test performance, preliminary studies might estimate sensitivity and specificity in separate study groups. Where this occurs, the sample size for each group should be stated separately.
Analysis of data
In presenting the results of the study, a cross tabulation of the index test and the reference standard results is helpful. Use of the same reference standard for all index tests minimises the risk of verification bias. Any missing data or indeterminate results for either the index test or reference standard should be reported according to the final disease status (if known) and not excluded from the results.
Reports can include the results of analytical performance (eg, analytical sensitivity, analytical specificity, imprecision), but these need to be clearly differentiated from clinical performance (diagnostic or clinical sensitivity and specificity) which are the more relevant measures and should be the focus of the report. All estimates require confidence intervals, based on the appropriate sample size using appropriate methods for computation, such as exact binomial or Wilson approximation.57 58
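As an illustration of the cross tabulation and interval estimates described above, the sketch below computes sensitivity and specificity from a 2x2 table and attaches Wilson score intervals. The counts are hypothetical, the function names are illustrative, and the exact binomial method would be an equally acceptable alternative that is not shown here.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, total, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def accuracy_from_table(tp, fp, fn, tn):
    """Sensitivity and specificity, each with a 95% Wilson interval."""
    return {
        "sensitivity": (tp / (tp + fn), wilson_interval(tp, tp + fn)),
        "specificity": (tn / (tn + fp), wilson_interval(tn, tn + fp)),
    }


# Hypothetical cross tabulation of index test against reference standard.
results = accuracy_from_table(tp=90, fp=5, fn=10, tn=895)
for name, (estimate, (low, high)) in results.items():
    print(f"{name}: {estimate:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Note that each interval uses the number of individuals with (for sensitivity) or without (for specificity) the target condition as its denominator, not the total study size.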
For each individual included in the study, the timing of the samplings and the analysis of the test should be recorded. Time from presumed exposure to infection and since the onset of symptoms (if applicable) should also be recorded. In general, the index test and the reference standard should be conducted as close in time as possible. If both the index test and the reference standard include RT-PCR, then the same sample should be used or paired samples should be obtained.
For studies evaluating antibody tests to identify previous infection, the reference standard might include an RT-PCR test or other tests conducted during the symptomatic phase of the illness or post-exposure, with antibody testing conducted at a later date, when the individual is likely to have seroconverted. In these studies, the timing of the serology sampling might be defined as time since RT-PCR evaluation, or better, the time since exposure to a known person with confirmed disease or since onset of symptoms. For studies using a reference standard that includes clinical follow-up or repeat testing, the same follow-up period should be used in all individuals included in the study.
Subgroup analyses of diagnostic performance by factors known to affect the sensitivity and specificity of testing can assist the understanding of the clinical applicability of the results. Most of the identified heterogeneity for SARS-CoV-2 tests seen so far is in the sensitivity of the test. Subgroup analyses by time since exposure, time since symptom onset, disease severity, viral load, or antibody titre in the reference standard and in groups of individuals who are asymptomatic or presymptomatic or those who have symptoms are particularly helpful.
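A subgroup analysis of sensitivity can be sketched as a simple grouping of reference standard positive individuals by a recorded factor. The field names and the strata for days since symptom onset below are hypothetical; in a real analysis each stratum would also need a confidence interval, and small strata will give imprecise estimates.

```python
from collections import defaultdict


def sensitivity_by_subgroup(participants, subgroup_key):
    """Return {subgroup: (true positives, reference positives, sensitivity)}."""
    counts = defaultdict(lambda: [0, 0])  # [true positives, reference positives]
    for p in participants:
        if not p["reference_positive"]:
            continue  # sensitivity only uses individuals with the target condition
        group = p[subgroup_key]
        counts[group][1] += 1
        if p["index_positive"]:
            counts[group][0] += 1
    return {g: (tp, n, tp / n) for g, (tp, n) in counts.items()}


# Hypothetical participants, stratified by days since symptom onset.
participants = (
    [{"reference_positive": True, "index_positive": True, "onset": "0-7 days"}] * 45
    + [{"reference_positive": True, "index_positive": False, "onset": "0-7 days"}] * 5
    + [{"reference_positive": True, "index_positive": True, "onset": "8-14 days"}] * 12
    + [{"reference_positive": True, "index_positive": False, "onset": "8-14 days"}] * 8
    + [{"reference_positive": False, "index_positive": False, "onset": "0-7 days"}] * 100
)

by_onset = sensitivity_by_subgroup(participants, "onset")
for group, (tp, n, sens) in sorted(by_onset.items()):
    print(f"{group}: {tp}/{n} = {sens:.2f}")
```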
As described above, two index tests should ideally be compared within the same study group. Where two index tests are measuring a common property and no reference standard is used, the agreement between tests might be reported in the form of tables showing concordant and discordant results. Further information on the people with discordant results could help to evaluate which test is more accurate using agreement with observations that might be considered as so-called fair umpires but are not a reference standard.59 Such fair umpires could include information on prior exposure risk, concurrent tests (apart from index or comparator test under evaluation—eg, inflammatory markers, chest imaging), response to treatment, and clinical outcomes on follow-up.
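When no reference standard is used, the concordant and discordant results can be tabulated as in the sketch below. Overall agreement follows directly from the table; Cohen's kappa is added here as one commonly used chance-corrected summary and is an assumption, not something this guidance prescribes. The counts are hypothetical.

```python
def agreement_table(pairs):
    """pairs: (test_a_positive, test_b_positive) for each individual."""
    table = {"both_positive": 0, "a_only": 0, "b_only": 0, "both_negative": 0}
    for a, b in pairs:
        if a and b:
            table["both_positive"] += 1
        elif a:
            table["a_only"] += 1
        elif b:
            table["b_only"] += 1
        else:
            table["both_negative"] += 1
    return table


def agreement_statistics(table):
    """Overall agreement and Cohen's kappa from a 2x2 agreement table."""
    a, b = table["both_positive"], table["a_only"]
    c, d = table["b_only"], table["both_negative"]
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical paired results from 100 individuals tested with both index tests.
pairs = ([(True, True)] * 40 + [(True, False)] * 5
         + [(False, True)] * 3 + [(False, False)] * 52)

table = agreement_table(pairs)
observed, kappa = agreement_statistics(table)
print(table, f"agreement={observed:.2f}", f"kappa={kappa:.3f}")
```

Neither statistic indicates which test is correct in the discordant cells, which is where additional information on those individuals becomes useful.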
Clinicians and public health experts require not only the sensitivity and specificity of the test but also an understanding of the positive and negative predictive values of the test. In presenting the results of the study, estimates of these predictive values using several clinically relevant values of prevalence are helpful. We also recommend a graphical display of how the test characteristics will perform in slightly different prevalence settings and use of natural frequencies (eg, the number of people affected in a population of 10 000 people), as shown in figure 1. The FDA website provides a calculator to convert sensitivity, specificity, and prevalence to the positive and negative predictive values of the test that are relevant to the target population.60
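The dependence of predictive values on prevalence follows from Bayes' theorem and can be sketched as below, together with natural frequencies in a population of 10 000 people. The sensitivity, specificity, and prevalence values are hypothetical, and the sketch assumes that sensitivity and specificity do not themselves change with prevalence, which might not hold in practice.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values (assumes 0 < prevalence < 1)."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


def natural_frequencies(sensitivity, specificity, prevalence, population=10_000):
    """Expected counts of each test outcome in a population of a given size."""
    infected = prevalence * population
    not_infected = population - infected
    tp = round(sensitivity * infected)
    fp = round((1 - specificity) * not_infected)
    return {
        "true_positives": tp,
        "false_negatives": round(infected) - tp,
        "false_positives": fp,
        "true_negatives": round(not_infected) - fp,
    }


# Hypothetical test: sensitivity 80%, specificity 99%.
table = {prev: predictive_values(0.80, 0.99, prev)
         for prev in (0.001, 0.01, 0.05, 0.20)}
for prev, (ppv, npv) in table.items():
    print(f"prevalence {prev:.1%}: PPV {ppv:.1%}, NPV {npv:.2%}")

frequencies = natural_frequencies(0.80, 0.99, 0.01)
print(frequencies)
```

In this hypothetical example, at 1% prevalence more than half of the positive results are false positives even though specificity is 99%, which is the kind of consequence that natural frequencies make visible.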
In addition to summarising the results, authors can provide guidance to assist those using the study results (such as clinicians, public health staff, and policy makers) on how the results of the study can be applied in practice and the consequences of false positive and false negative test results. Where possible, advice can be given on how testing strategies and use of the test might need to be refined on the basis of understanding gained from the evaluation of the test.
If a study is done in a reference laboratory with highly experienced staff, the results will represent the best case scenario for the estimates of diagnostic accuracy, and the test is likely to have performance characteristics that are less than this in clinical practice.
If future research is needed, advice on how to store samples and how to assure the stability of samples and what data to record for biobanking purposes can be helpful. Appropriately designed and harmonised sample banks, with detailed information about the population characteristics, should be made available to developers of new tests so that the tests can be rapidly validated, and passed to clinical laboratories for local verification.
Step 8: Prospectively register the study protocol
On completion of the study design, study protocols can be registered before their initiation in a clinical trial registry, such as ClinicalTrials.gov or one of the WHO primary registries, ensuring that existence of the studies can be identified.61 Prospective registration is a sign of quality, providing evidence that the study objectives, test procedures, outcome measures, eligibility criteria, and data to be collected were defined prospectively, and allows transparent reporting of any modifications to study protocols. Trial registration also allows reviewers to identify studies that have been completed but have not yet been reported, supporting the reduction in publication bias in subsequent systematic reviews. Including a registration number in the study report facilitates identification of the trial in the corresponding registry.
Testing and early identification of individuals with SARS-CoV-2 infection is a vital part of controlling the spread of the pandemic, including decisions regarding the need to introduce public health measures such as restrictions on movements and limits on social gatherings. To do this, we need to establish the clinical accuracy of tests in rigorously designed evaluations and in the full range of intended use settings so that the consequences of acting on test results are well understood by clinicians and policy makers. Substandard methods and poor reporting of these studies have limited our ability to understand the clinical performance of tests to date, including having to withdraw tests from the market that have been shown to have poor test accuracy.62 63 Poor communication about the intended roles and diagnostic performance of tests has led to tests being used inappropriately, for example, antibody tests being used to screen or diagnose patients with acute infections64 or using inaccurate rapid testing to screen asymptomatic individuals and falsely reassuring individuals who are infectious.16 The issues regarding determining the clinical performance of antibody tests have been particularly challenging.
Inflated and inappropriate claims for test accuracy have been made for tests during the pandemic.65 66 Most tests have been evaluated by the teams that have developed the tests using convenience samples. More accurate estimates would be derived using prospectively collected samples representing the target population, ideally evaluated by independent teams. The use of convenience samples and retrospectively collected samples has been a particular problem for the evaluation of antibody tests.9 Submissions for emergency use authorisation should be made publicly available to allow critical review, and data should be made available for use in individual patient data meta-analyses. Leading international and national public health organisations, regulatory authorities, and scientific journal editorial boards could assist by harmonising their requirements for test evaluations and developing study templates that can be used across studies and that encourage standardised data collection and reporting and rigorous study design.
We thank Tze Ping Loh, Mary Kathryn Bohn, and Shannon Steele for their helpful assistance and comments on the manuscript, and Ian Doust for providing figure 2.
Contributors: All authors provided a substantial contribution to the design and interpretation of the guidance, as well as writing sections of drafts, revising based on comments received, and approving the final version. ARH initiated the project. JAD wrote the initial draft and is the guarantor for the study. The corresponding author had full access to all the data in the study and had final responsibility for the decision to submit for publication. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.
Funding: No specific funding was provided for this work. KB is supported by Australian NHMRC Investigator grant #1174523. SM is supported by grants from the National Institute for Health Research (NIHR), University College London (UCL) and UCL Hospital (UCLH) Biomedical Research Centre. KA is supported by grants from the Canadian Institutes of Health Research, Abbot Diagnostics, Siemens Healthineers, March of Dimes, and the Heart and Stroke Foundation of Canada. JJD and JD are supported by the NIHR Birmingham Biomedical Research Centre at the University Hospitals Birmingham NHS Foundation Trust and the University of Birmingham. ARH has received a grant from the NSW Health COVID-19 Research Grants Round 1—2020: Improved Confirmatory Diagnosis of SARS-CoV-2 infection using Protein Mass Spectrometry. Funders had no role in study design, data collection, data analysis, data interpretation, or writing of the report. The views expressed are those of the authors and not necessarily those of the NHS, NIHR, or Department of Health and Social Care.
Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.
Provenance and peer review: Not commissioned; externally peer reviewed.
Patient and public involvement: No patients were involved in setting the research question or outcomes of developing this study. No patients were asked to advise on interpretation or writing up of results.
This article is made freely available for use in accordance with BMJ's website terms and conditions for the duration of the covid-19 pandemic or until otherwise determined by BMJ. You may use, download and print the article for any lawful, non-commercial purpose (including text and data mining) provided that all copyright notices and trade marks are retained. https://bmj.com/coronavirus/usage