key: cord-0691734-fxs6bq9l
authors: Korevaar, Daniël A.; Toubiana, Julie; Chalumeau, Martin; McInnes, Matthew D.F.; Cohen, Jérémie F.
title: Evaluating tests for diagnosing COVID-19 in the absence of a reliable reference standard: pitfalls and potential solutions
date: 2021-08-03
journal: J Clin Epidemiol
DOI: 10.1016/j.jclinepi.2021.07.021
sha: b1a1834bb9a3703b57375f51cb96e41895359ebd
doc_id: 691734
cord_uid: fxs6bq9l


Diagnostic tests play a crucial role in the management of the COVID-19 pandemic, helping to contain the spread of SARS-CoV-2 by detecting and isolating cases, enabling contact tracing, and guiding public health decisions about the initiation of lockdowns, thereby protecting people at increased risk of severe disease and our healthcare systems. Likewise, treatment decisions and enrollment in therapeutic trials require diagnostic confirmation. Because testing for SARS-CoV-2 is currently being done on a massive scale worldwide, false-positive and false-negative results may have considerable adverse downstream consequences. [1] Despite progress made in the last decades in our understanding of the complexity of medical test evaluation, assessing the value of diagnostic tests for COVID-19 still poses considerable challenges. [1] [2] [3] [4] [5] [6] One major challenge is that multiple target conditions can be defined. For example, the target condition may be (past or present) SARS-CoV-2 infection, infectiousness, or COVID-19 (i.e., the acute disease caused by SARS-CoV-2). Additionally, COVID-19 has a broad clinical spectrum, which may vary from asymptomatic and mildly symptomatic cases of SARS-CoV-2 infection, to cases of severe pneumonia with or without acute respiratory distress syndrome (ARDS) and multiorgan failure. [7] Furthermore, SARS-CoV-2 may also trigger severe post-infectious processes such as myocarditis, multiorgan failure, and Kawasaki-like illness, notably in children (referred to as 'Paediatric inflammatory multisystem syndrome temporally associated with COVID-19' (PIMS-TS)). [8] It is important that researchers carefully define the target condition before initiating a test accuracy study of a test for COVID-19. [2, 9] However, errors in such studies may still occur if the final diagnosis entirely relies on detecting SARS-CoV-2. 
For example, patients with respiratory symptoms due to pulmonary embolism or community-acquired pneumonia due to Streptococcus pneumoniae may be misclassified as COVID-19 if they are concomitant carriers of SARS-CoV-2. Yet, carriage is still highly relevant to detect as part of contact tracing strategies because it may warrant isolation to contain the spread of SARS-CoV-2.

What makes test accuracy studies of COVID-19 tests particularly challenging is the absence of a reliable clinical reference standard. Because of this, the results of most test accuracy studies of COVID-19 tests are of limited value to clinicians and policymakers and may represent a considerable waste of research resources. [10] In this commentary article, we will discuss our views on the extent of this problem and provide potential solutions, based on our experience as clinicians and researchers, and with examples from published literature.

Tests for diagnosing COVID-19 include clinical signs and symptoms, laboratory tests such as molecular and antigen detection assays, sometimes done at the point of care, imaging tests such as chest CT, electronic noses, and multivariable clinical prediction models (Box 1). The most common way to inform clinicians and policymakers about test performance is to conduct test accuracy studies. In such studies, the results of a test under evaluation (the 'index test') are compared to those of another test that is supposed to distinguish between patients with and without the clinical target condition with a high level of certainty (the 'reference standard'). Typical outcomes in test accuracy studies are estimates of clinical sensitivity and specificity. Hundreds of test accuracy studies on COVID-19 tests have already been conducted, as well as systematic reviews to summarize them. [11] [12] [13] [14] [15] [16] In these test accuracy studies, laboratory-based real-time reverse-transcription polymerase chain reaction (RT-PCR) is widely used as the reference standard for diagnosing COVID-19. However, this test has important shortcomings (Box 2), which threaten the validity of these studies. An imperfect reference standard will, by definition, lead to index test results that are wrongly classified as 'false positives' or 'false negatives', or wrongly classified as 'true positives' or 'true negatives', a problem referred to as 'reference standard error bias'. [17] This will result in either over- or underestimation of the sensitivity or specificity of the index test under investigation.

First, as for other respiratory viruses, the detection of SARS-CoV-2 by RT-PCR is strongly influenced by the quality, site, and timing of sampling. [4] For example, the analytical sensitivity of RT-PCR differs between bronchoalveolar lavage fluid, nasopharyngeal swabs, and throat swabs. [18] In addition, in most cases, the probability that SARS-CoV-2 becomes detectable by RT-PCR in nasopharyngeal samples peaks around the onset of symptoms of COVID-19 and then gradually declines within 2 to 3 weeks. [1, 4, 19] Hence, most experts recommend repeating RT-PCR testing in patients with a sustained intermediate or high clinical suspicion of COVID-19 when the initial result is negative, especially in hospital settings. [20]

Second, there is technical variability among the more than 150 RT-PCR kits for SARS-CoV-2 that have received emergency use authorization (EUA) from the FDA. [5, 21] Notably, these kits rely on various molecular targets encoding structural (e.g., envelope (E), nucleocapsid (N1, N2, N3), and spike (S) genes) and nonstructural proteins (e.g., RNA-dependent RNA polymerase (RdRp) and open reading frame 1 segments 1a and 1b (Orf1) genes). RT-PCR analytical sensitivity is higher for certain combinations of nucleic acid targets than others. For example, in a recent study comparing 6 molecular kits approved for detecting SARS-CoV-2, the limit of detection after 40 thermal cycles ranged from a viral RNA concentration of 484 to 7744 copies/mL, a 16-fold difference. [22]

Third, RT-PCR is not a binary test: most tests provide a quantitative result reflecting the number of amplification cycles after which a signal becomes detectable ('cycle threshold', Ct), but the Ct value that has the highest clinical relevance is still a matter of debate and may vary across commercial kits and between laboratories using the same kits. [23] [24] [25] Low viral loads may reflect noninfectious states (e.g., nonviable viral RNA fragments) and low-risk viral shedding of unclear clinical significance, which may lead to overdiagnosis. Yet, decreasing the positivity threshold for the Ct value may result in missing infectious cases with low viral loads.

Fourth, SARS-CoV-2 is mutating over time, resulting in genetic variations across circulating viral strains. If such mutations occur in the genetic sequences targeted by a given RT-PCR kit, this may negatively affect test accuracy. [26]

There are alternatives to RT-PCR testing. Researchers evaluating COVID-19 tests have used other reference standards, ranging from viral culture to various combinations of epidemiological data, clinical information, and testing results (Box 1). Unfortunately, each of these alternatives has its shortcomings as well, and may further amplify the problem by making evidence syntheses more challenging to conduct and interpret. For example, in a recent Cochrane systematic review of the accuracy of imaging tests for COVID-19, which included 51 studies, 47 used only RT-PCR as the reference standard (single RT-PCR, 2 studies; repeat RT-PCR in all patients with an initial negative test, 11 studies; repeat RT-PCR in at least some patients with an initial negative test, 17 studies; number of RT-PCR tests not reported, 17 studies) and 4 studies used a combination of RT-PCR and other criteria (i.e., clinical symptoms, information about infected household contacts, imaging tests, and laboratory tests). [15] The series of systematic reviews from the Cochrane COVID-19 Diagnostic Test Accuracy Group currently encompasses a total of 248 test accuracy studies and 157,433 participants (Table 1). [11] [12] [13] [14] [15] Details about the reference standard were poorly reported, and risk of bias about the reference standard was deemed high in 48% and unclear in 25%; the reference standard was judged at low risk of bias in only 26% of the included studies.
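The direction and size of reference standard error bias can be illustrated with a small calculation. The sketch below is not taken from any of the cited studies: it assumes that the errors of the index test and of the reference standard are independent given true disease status, and the function name and all numeric inputs are hypothetical.

```python
def apparent_accuracy(prevalence, index_se, index_sp, ref_se, ref_sp):
    """Sensitivity and specificity of an index test as they would be
    measured against an imperfect reference standard.

    Assumption for this illustration: the two tests err independently
    of each other once true disease status is known.
    """
    p = prevalence
    # Probability that both tests are positive / reference is positive
    both_pos = p * index_se * ref_se + (1 - p) * (1 - index_sp) * (1 - ref_sp)
    ref_pos = p * ref_se + (1 - p) * (1 - ref_sp)
    # Probability that both tests are negative / reference is negative
    both_neg = p * (1 - index_se) * (1 - ref_se) + (1 - p) * index_sp * ref_sp
    ref_neg = p * (1 - ref_se) + (1 - p) * ref_sp
    return both_pos / ref_pos, both_neg / ref_neg


# Hypothetical scenario: 20% prevalence, an index test with true
# sensitivity 0.90 and specificity 0.98, and a reference standard that
# misses 20% of cases but never produces false positives.
se_obs, sp_obs = apparent_accuracy(0.20, 0.90, 0.98, 0.80, 1.00)
print(round(se_obs, 3), round(sp_obs, 3))
```

With these made-up inputs, the apparent specificity of the index test drops from 0.98 to about 0.94, because infected patients who are missed by the reference standard but detected by the index test are counted as false positives, while the apparent sensitivity stays at 0.90. Correlated errors between the two tests, which are plausible when both depend on viral load, would change these numbers.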

The challenge of evaluating the diagnostic accuracy of a test in the absence of a reliable reference standard is not new, and there is an array of possible solutions that can be applied to COVID-19, although each has its advantages and limitations (Table 2). Below, we discuss several possible solutions, but others exist as well. [27] [28] [29]

Panel-based reference standard

A first option is to define disease status based on the opinion of a panel of experts who use evidence from a combination of signs, symptoms, tests, and sometimes follow-up in an unstructured way to classify each patient as having COVID-19 or not. An example of such an adjudication committee is reported in a study where five board-certified specialists in respiratory and internal medicine were retrospectively invited to make a final diagnosis for each patient suspected of COVID-19. [30] In a similar study, all patients with a positive chest CT, but a negative RT-PCR, were discussed in a multidisciplinary meeting to arrive at a final diagnosis. [31] Panel-based diagnosis has the advantage of reflecting clinical settings, where it is common practice that difficult cases are examined during multidisciplinary meetings. The disadvantage is that panel-based studies do not provide clear criteria for ruling-in or ruling-out the target condition, which may hamper reproducibility. Also, the level of expertise of the panel members is crucial in such approaches and may vary across settings and over time, as the evidence base increases. The panel may even be worse than the reference standard it is seeking to replace if members of the panel have insufficient expertise in diagnosing the target condition.

A second possibility is to classify patients using a clear and structured classification rule in the form of a list of criteria or score that may include the results of several tests. This procedure is sometimes referred to as a 'composite' reference standard. An example is the case definition of the US Centers for Disease Control and Prevention, which combines clinical, laboratory, and epidemiological information into a classification rule with three levels of likelihood of COVID-19 (i.e., suspect, probable, or confirmed); Box 3. These case definitions of COVID-19 are easy to apply and reproducible. However, they were designed for epidemiological surveillance purposes and may not be well suited for clinical management and test accuracy studies of COVID-19 tests. Another challenge for test evaluations is that many case definitions for COVID-19 have multiple diagnostic categories according to the likelihood of disease, while calculating sensitivity and specificity usually requires a binary reference standard. Also, rule-based studies may be impacted by 'incorporation bias': if the test under evaluation is incorporated as part of the composite reference standard, then sensitivity and specificity may both be overestimated. [17]

Model-based reference standard

A third option is to use latent class models. [32] In such models, none of the tests is considered a reference standard: the sensitivity and specificity of each test are estimated from the analysis of the cross- to COVID-19 diagnostic data. [33] Latent class models were also implemented in a diagnostic meta-analysis of salivary tests for SARS-CoV-2, which made it possible to account for the imperfectness of the reference standard and the potential non-independence between results obtained with salivary and nasopharyngeal nucleic acid amplification tests. [34] A benefit of latent class models is that they take advantage of all the available information and provide sensitivity and specificity estimates for all tests included in the model, allowing comparisons between tests. Limitations are that they are complex methods that may require expert statistical knowledge, that erroneous assumptions regarding dependence between tests in patients with and without the target condition may lead to biased estimates of test accuracy, and that they do not rely on a clinical definition of the target condition, but only a statistical one. Due to these limitations, results of studies applying latent class models may be more difficult to interpret by readers without a statistical background.
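To make the idea of a latent class model more concrete, the sketch below fits a two-class model to the cross-classified results of three binary tests using a simple expectation-maximization (EM) loop. It is only an illustration: it assumes conditional independence between tests given true disease status, the function names and accuracy values are hypothetical, and it is not the Bayesian approach used in the cited studies.

```python
from itertools import product
from math import prod


def expected_counts(n, prev, se, sp):
    """Hypothetical data generator: expected number of patients for each
    pattern of test results under conditional independence."""
    counts = {}
    for pattern in product((0, 1), repeat=len(se)):
        p_pos = prod(s if r else 1 - s for s, r in zip(se, pattern))
        p_neg = prod(1 - s if r else s for s, r in zip(sp, pattern))
        counts[pattern] = n * (prev * p_pos + (1 - prev) * p_neg)
    return counts


def fit_latent_class(counts, n_iter=1000):
    """EM estimation of prevalence, sensitivities, and specificities.

    counts maps a tuple of 0/1 results (one entry per test) to the number
    of patients showing that pattern. No test is treated as the reference.
    """
    k = len(next(iter(counts)))
    total = sum(counts.values())
    prev, se, sp = 0.5, [0.8] * k, [0.9] * k  # arbitrary starting values
    for _ in range(n_iter):
        # E-step: posterior probability of disease for each result pattern
        post = {}
        for pat in counts:
            like_pos = prev * prod(se[j] if r else 1 - se[j]
                                   for j, r in enumerate(pat))
            like_neg = (1 - prev) * prod(1 - sp[j] if r else sp[j]
                                         for j, r in enumerate(pat))
            post[pat] = like_pos / (like_pos + like_neg)
        # M-step: update parameters from the weighted pattern counts
        n_pos = sum(counts[pat] * post[pat] for pat in counts)
        n_neg = total - n_pos
        prev = n_pos / total
        se = [sum(counts[pat] * post[pat] * pat[j] for pat in counts) / n_pos
              for j in range(k)]
        sp = [sum(counts[pat] * (1 - post[pat]) * (1 - pat[j])
                  for pat in counts) / n_neg for j in range(k)]
    return prev, se, sp


# Made-up values for three tests; none is taken from a published study.
counts = expected_counts(10000, 0.30, [0.70, 0.85, 0.90], [0.99, 0.95, 0.90])
prev, se, sp = fit_latent_class(counts)
print(round(prev, 2), [round(x, 2) for x in se], [round(x, 2) for x in sp])
```

Because the example data are generated under the same conditional independence assumption that the model makes, the fit is expected to recover the generating values closely. With real data, correlated errors between tests would violate that assumption and could bias the estimates, which is the limitation described above.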

A fourth option is to compare a new test to an existing one by evaluating the level of agreement between the two tests, rather than reporting accuracy estimates that are unreliable due to the absence of a satisfactory reference standard. This has been done in several studies assessing RT-PCR tests based on salivary samples compared to the same tests using nasopharyngeal samples. One study, for example, found an overall agreement of 98% between RT-PCR done on salivary and nasopharyngeal samples. The authors concluded that "saliva is an acceptable alternative source for detecting SARS-CoV-2 nucleic acids". [35] An advantage of this approach is that classification outcomes do not rely on a (potentially imperfect) reference standard. A downside is that raw agreement does not reveal whether discrepancies are due to errors from one test or the other. Therefore, this option is often only useful if a new test is meant to replace an existing test, for example because it is cheaper or less invasive, and researchers want to illustrate that the tests produce similar results in the majority of patients. Alternatively, such studies could compare detection rates between two tests, which may be a useful statistic if both tests are considered to have a specificity close to 100% (i.e., almost no false-positive results).
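As an illustration of how agreement can be summarized, the sketch below computes raw agreement and Cohen's kappa from a 2x2 cross-tabulation of two tests. Kappa is one common chance-corrected statistic and is used here as an assumption; the commentary does not prescribe a particular measure. The function name is hypothetical and the counts are invented, not those of the cited saliva study.

```python
def agreement_stats(both_pos, only_a_pos, only_b_pos, both_neg):
    """Raw agreement and Cohen's kappa for two binary tests A and B."""
    n = both_pos + only_a_pos + only_b_pos + both_neg
    observed = (both_pos + both_neg) / n
    a_pos = (both_pos + only_a_pos) / n  # positivity rate of test A
    b_pos = (both_pos + only_b_pos) / n  # positivity rate of test B
    # Agreement expected by chance if the two tests were unrelated
    expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Two invented samples of 100 patients, each with 98% raw agreement
high_positivity = agreement_stats(40, 1, 1, 58)
low_positivity = agreement_stats(2, 1, 1, 96)
print(high_positivity, low_positivity)
```

In these made-up tables the raw agreement is 98% in both cases, but kappa is about 0.96 when many patients test positive and about 0.66 when positives are rare, because most of the agreement in the second table comes from shared negative results. Neither statistic indicates which test is wrong when the two disagree.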

Finally, we may consider moving from diagnostic accuracy to clinical effectiveness. [36] In this framework, we would not be interested in classification outcomes such as sensitivity and specificity but would focus on patient-centered or population-centered outcomes such as infection, hospitalization, quality of life, and mortality rates. Here the goal is to develop and implement testing strategies that would prove beneficial for health and society. For example, several authors have argued that rapid point-of-care tests for SARS-CoV-2 might be effective in reducing viral community transmission despite having higher analytical limits of detection than conventional RT-PCR tests based on nasopharyngeal swabs, because of better uptake and shorter turnaround time that allow for repeat testing and timely isolation. [37] In a screening setting, individuals or groups could even be randomized to receive conventional or rapid tests for SARS-CoV-2. Here, we could assess outcomes such as participation, usability, positivity rate, and diagnostic yield, as done, for example, in colorectal cancer screening trials. [38] [39] [40] A drawback is that clinical effectiveness studies (including randomized trials) of medical tests are generally much more time- and resource-consuming than cross-sectional test accuracy studies, although this may be less of a problem in a pandemic setting due to the large number of potential study subjects and available funds.

COVID-19 tests play a central role in managing the disease worldwide and are being done at an unprecedented scale. However, evaluating the diagnostic accuracy of these tests is challenging. Although there are several reasons for this, [2] one major issue is the lack of a high-quality reference standard. In this commentary article, we have argued that, because of this, the evidence provided by many test accuracy studies is difficult to interpret and may not suffice to decide with confidence which test is optimal in which setting. To avoid this waste of resources, research is urgently needed to help clarify which reference standard(s) for COVID-19 we should use in future test accuracy studies. It has become clear that relying on a single RT-PCR test is problematic for detecting SARS-CoV-2 infection in symptomatic individuals, as this reference standard may miss too many cases. A minimal requirement to limit bias could be to require at least two negative RT-PCR results to define COVID-19-negatives. Depending on the target condition (e.g., COVID-19, infectiousness, carriage, immunological responses), and guided by the intended use population (e.g., diagnosis in symptomatic patients, targeted screening in contact tracing programs, mass screening in the general population), different reference standards may be needed. In the absence of an appropriate clinical reference standard, researchers could consider alternative techniques such as panel-, rule-, and model-based methods, or measures of agreement. We need test accuracy studies that provide a more informative description of essential study features and test methods by following, for example, the STARD reporting guideline. [41, 42] We also need studies that, beyond accuracy, evaluate the effectiveness of COVID-19 tests through outcomes that directly matter to patients, policymakers, and society.
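As a rough illustration of why requiring a second negative RT-PCR result may reduce the number of missed cases, the sketch below computes the probability that an infected patient tests negative on every repeat. It assumes that repeat results are independent and that sensitivity is constant across repeats, which is a simplification given the effect of sampling site and timing; the function name and the 80% sensitivity value are hypothetical.

```python
def prob_all_negative(single_test_sensitivity, n_tests):
    """Probability that an infected patient is negative on all n_tests,
    assuming independent repeats with the same sensitivity."""
    return (1 - single_test_sensitivity) ** n_tests


# Hypothetical single-test clinical sensitivity of 80%
for n in (1, 2, 3):
    print(n, round(prob_all_negative(0.80, n), 3))
```

Under these assumptions, the share of infected patients wrongly labeled as COVID-19-negative would fall from 20% with one test to 4% with two. If false negatives are driven by patient-level factors such as a low viral load at the sampling site, repeat results are correlated and the gain would be smaller.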

JFC initiated the project and wrote the initial draft of the manuscript. All authors provided a substantial contribution to the manuscript and approved the final version.

No specific funding was obtained for this work.

None of the authors have any competing interests to disclose.

Box 3

Clinical criteria

In the absence of a more likely diagnosis:

- At least two of the following symptoms: fever (measured or subjective), chills, rigors, myalgia, headache, sore throat, nausea or vomiting, diarrhea, fatigue, congestion or runny nose, OR
- Any one of the following symptoms: cough, shortness of breath, difficulty breathing, new olfactory disorder, new taste disorder, OR
- Severe respiratory illness with at least one of the following: clinical or radiographic evidence of pneumonia, acute respiratory distress syndrome.

Laboratory criteria

Confirmatory laboratory evidence:
- Detection of severe acute respiratory syndrome coronavirus 2 ribonucleic acid (SARS-CoV-2 RNA) in a clinical or autopsy specimen using a molecular amplification test

Presumptive laboratory evidence:
- Detection of SARS-CoV-2 by antigen test in a respiratory specimen

Supportive laboratory evidence:
- Detection of specific antibody in serum, plasma, or whole blood
- Detection of specific antigen by immunocytochemistry in an autopsy specimen

Epidemiologic linkage

One or more of the following exposures in the prior 14 days:
- Close contact with a confirmed or probable case of COVID-19 disease;
- Member of a risk cohort as defined by public health authorities during an outbreak.

Case classification

Suspect
- Meets supportive laboratory evidence with no prior history of being a confirmed or probable case.

Probable
- Meets clinical criteria AND epidemiologic linkage with no confirmatory laboratory testing performed for SARS-CoV-2.
- Meets presumptive laboratory evidence.
- Meets vital records criteria with no confirmatory laboratory evidence for SARS-CoV-2.

Confirmed
- Meets confirmatory laboratory evidence.
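A rule-based reference standard of this kind can be written as an explicit classification function, which is what makes it reproducible. The sketch below is a simplified, hypothetical encoding of the case classification: it assumes that a positive molecular amplification test counts as confirmatory evidence, a positive antigen test as presumptive evidence, and antibody or immunocytochemistry findings as supportive evidence. The class, field, and function names are illustrative and not part of the CDC definition, and the final step shows that a binary reference standard requires an additional choice of cut-off.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    clinical_criteria: bool = False
    epidemiologic_linkage: bool = False
    # None = molecular test not performed, True/False = positive/negative
    molecular_test: Optional[bool] = None
    antigen_positive: bool = False      # assumed presumptive evidence
    supportive_lab: bool = False        # assumed antibody or autopsy antigen
    vital_records_criteria: bool = False
    prior_confirmed_or_probable: bool = False


def classify(p: Patient) -> str:
    """Return the case classification level for one patient."""
    if p.molecular_test is True:
        return "confirmed"
    if p.antigen_positive:
        return "probable"
    if (p.clinical_criteria and p.epidemiologic_linkage
            and p.molecular_test is None):
        return "probable"
    if p.vital_records_criteria:
        return "probable"
    if p.supportive_lab and not p.prior_confirmed_or_probable:
        return "suspect"
    return "not classified"


def as_binary(label: str, positive_levels=("confirmed", "probable")) -> bool:
    """Collapse the levels into a binary reference result.
    The choice of positive_levels is an analyst decision, not a CDC rule."""
    return label in positive_levels


example = Patient(clinical_criteria=True, epidemiologic_linkage=True)
print(classify(example), as_binary(classify(example)))
```

Changing `positive_levels` changes which patients count as reference-standard positives, and therefore the sensitivity and specificity estimated for any index test. If the index test itself feeds one of the fields above, the estimates are exposed to incorporation bias.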

*2020 Interim case definition, approved August 5, 2020

Table 2

Panel-based diagnosis
Limitations: Poor reproducibility.

Rule-based diagnosis
Description: Final diagnosis is made through the formal combination of various pieces of information.
Advantages: Easy to apply and reproducible.
Limitations: May not reflect clinical practice.

Model-based diagnosis
Description: Accuracy estimates are computed by a model, assuming that no single test is able to define disease status.
Advantages: Provides estimates of sensitivity and specificity for multiple tests, which allows comparisons.
Limitations: No clinical definition of the target condition. Complex modeling.

Measures of agreement
Description: Neither of the two tests is accepted as a reference standard, and agreement is measured in a cross-tabulation of test results.
Advantages: Reference standard not needed.
Limitations: Not possible to say whether discrepancies are due to errors from one test or the other.

Clinical effectiveness
Description: The focus is not on sensitivity and specificity but on other outcomes such as infection, hospitalization, and mortality rates, or test uptake and diagnostic yield.
Advantages: Reference standard not needed. Outcomes matter to users.
Limitations: Time- and resource-consuming. Large sample sizes required.

[1] Interpreting a covid-19 test result
[2] Guidance for the design and reporting of studies evaluating the clinical performance of tests for present or past SARS-CoV-2 infection
[3] Testing COVID-19 tests faces methodological challenges
[4] Interpreting Diagnostic Tests for SARS-CoV-2
[5] Considerations for diagnostic COVID-19 tests
[6] Laboratory Diagnosis of COVID-19: Current Issues and Challenges
[7] A Proposed Framework and Timeline of the Spectrum of Disease Due to SARS-CoV-2 Infection: Illness Beyond Acute Infection and Public Health Implications
[8] Kawasaki-like multisystem inflammatory syndrome in children during the covid-19 pandemic in Paris, France: prospective observational study
[9] Targeted test evaluation: a framework for designing diagnostic accuracy studies with clear study hypotheses
[10] Waste in covid-19 research
[11] Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19. The Cochrane database of systematic reviews
[12] Routine laboratory testing to determine if a patient has COVID-19. The Cochrane database of systematic reviews
[13] Rapid, point-of-care antigen and molecular-based tests for diagnosis of SARS-CoV-2 infection. The Cochrane database of systematic reviews
[14] Antibody tests for identification of current and past infection with SARS-CoV-2. The Cochrane database of systematic reviews
[15] Thoracic imaging tests for the diagnosis of COVID-19. The Cochrane database of systematic reviews
[16] Electronic and animal noses for detecting SARS-CoV-2 infection. The Cochrane database of systematic reviews
[17] Sources of variation and bias in studies of diagnostic accuracy: a systematic review
[18] Detection of SARS-CoV-2 in Different Types of Clinical Specimens
[19] Temporal dynamics in viral shedding and transmissibility of COVID-19
[20] Infectious Diseases Society of America Guidelines on the Diagnosis of COVID-19
[21] FDA
[22] Limits of Detection of 6 Approved RT-PCR Kits for the Novel SARS-Coronavirus-2 (SARS-CoV-2)
[23] Duration of infectiousness and correlation with RT-PCR cycle threshold values in cases of COVID-19
[24] Predicting Infectious Severe Acute Respiratory Syndrome Coronavirus 2 From Diagnostic Samples
[25] SARS-CoV-2 shedding and infectivity
[26] Could mutations of SARS-CoV-2 suppress diagnostic detection?
[27] A review of solutions for diagnostic accuracy studies with an imperfect or missing reference standard
[28] Diagnostic test evaluation methodology: A systematic review of methods employed to evaluate diagnostic tests in the absence of gold standard - An update
[29] The estimation of diagnostic accuracy of tests for COVID-19: A scoping review
[30] Unenhanced computed tomography (CT) utility for triage at the emergency department during COVID-19 pandemic
[31] Added value of chest computed tomography in suspected COVID-19: an analysis of 239 patients
[32] Latent class models in diagnostic studies when there is no reference standard - a systematic review
[33] Bayesian latent class models to estimate diagnostic test accuracies of COVID-19 tests
[34] Comparison of Saliva and Nasopharyngeal Swab Nucleic Acid Amplification Testing for Detection of SARS-CoV-2: A Systematic Review and Meta-analysis
[35] Saliva as an Alternate Specimen Source for Detection of SARS-CoV-2 in Symptomatic Patients Using Cepheid Xpert Xpress SARS-CoV-2
[36] Beyond Diagnostic Accuracy: The Clinical Utility of Diagnostic Tests
[37] Rethinking Covid-19 Test Sensitivity - A Strategy for Containment. The New England journal of medicine
[38] A randomised comparison of two faecal immunochemical tests in population-based colorectal cancer screening
[39] Participation in Competing Strategies for Colorectal Cancer Screening: A Randomized Health Services Study (PICCOLINO Study)
[40] Assessment of test performance and adherence in a single round of a population-based screening programme for colorectal cancer
[41] STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies
[42] STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration