key: cord-0837295-ejdunj41
authors: Yang, He S; Hou, Yu; Vasovic, Ljiljana V; Steel, Peter; Chadburn, Amy; Racine-Brzostek, Sabrina E; Velu, Priya; Cushing, Melissa M; Loda, Massimo; Kaushal, Rainu; Zhao, Zhen; Wang, Fei
title: Routine laboratory blood tests predict SARS-CoV-2 infection using machine learning
date: 2020-08-21
journal: Clin Chem
DOI: 10.1093/clinchem/hvaa200
sha: b40cdd374ba79b7c2e170750cf9f34b0f535fa89
doc_id: 837295
cord_uid: ejdunj41

BACKGROUND: Accurate diagnostic strategies to rapidly identify SARS-CoV-2 positive individuals for management of patient care and protection of health care personnel are urgently needed. The predominant diagnostic test is viral RNA detection by RT-PCR from nasopharyngeal swab specimens; however, the results are not promptly obtainable in all patient care locations. Routine laboratory testing, in contrast, is readily available with a turn-around time (TAT) usually within 1-2 hours.

METHOD: We developed a machine learning model incorporating patient demographic features (age, sex, race) with 27 routine laboratory tests to predict an individual’s SARS-CoV-2 infection status. Laboratory test results obtained within two days before the release of the SARS-CoV-2 RT-PCR result were used to train a gradient boosted decision tree (GBDT) model from 3,356 SARS-CoV-2 RT-PCR tested patients (1,402 positive and 1,954 negative) evaluated at a metropolitan hospital.

RESULTS: The model achieved an area under the receiver operating characteristic curve (AUC) of 0.854 (95% CI: 0.829-0.878). Application of this model to an independent patient dataset from a separate hospital resulted in a comparable AUC (0.838), supporting the generalizability of the model. Moreover, our model predicted initial SARS-CoV-2 RT-PCR positivity in 66% of individuals whose RT-PCR result changed from negative to positive within two days.
CONCLUSION: This model employing routine laboratory test results offers opportunities for early and rapid identification of high-risk SARS-CoV-2 infected patients before their RT-PCR results are available. It may play an important role in assisting the identification of SARS-CoV-2 infected patients in areas where RT-PCR testing is not accessible due to financial or supply constraints.

The Coronavirus Disease-2019 (COVID-19) pandemic has rapidly spread worldwide, resulting in over 14 million confirmed cases and more than 603,000 total deaths as of July 20, 2020 (1). The highly contagious nature of SARS-CoV-2 (2), the rapid progression of disease in some infected patients (3), and the subsequent stress on the healthcare system have created an urgent need for rapid and effective diagnostic strategies for the prompt identification and isolation of infected patients. Currently, the diagnosis of COVID-19 relies on SARS-CoV-2 virus-specific real-time reverse-transcriptase polymerase chain reaction (RT-PCR) testing of nasopharyngeal swabs or other upper respiratory tract specimens (4, 5). However, while the TAT of RT-PCR testing is usually within 48 hours (6), it can be substantially longer due to many variables, including the need for repeat testing or the lack of needed supplies. Many smaller hospitals do not yet have access to on-site SARS-CoV-2 RT-PCR testing. These issues can result in delayed hospital admission and bed assignment, inappropriate medical management including quarantining of infected patients, and increased exposure of healthcare personnel and other patient contacts to the virus. Rapid diagnosis and identification of high-risk patients for early intervention is vital for individual patient care and, from a public health perspective, for controlling disease transmission and maintaining the healthcare workforce.
Currently in hospital emergency departments (EDs), the nationally recommended practice when evaluating patients with moderate to high risk for COVID-19 is SARS-CoV-2 RT-PCR testing, a panel of routine laboratory tests, a chest X-ray and symptomatology, whereas chest computed tomography (CT) is not recommended due to cost and TAT considerations (2, 7). Routine laboratory tests are generally available within 1-2 hours and are accessible prior to patient discharge from the ED. Several studies (3, 8-10) have reported laboratory abnormalities in COVID-19 patients on admission and during the disease course, including increases in C-reactive protein (CRP), D-dimer, lactic acid dehydrogenase (LDH), cardiac troponin, procalcitonin (PCT), and creatinine as well as lymphopenia and thrombocytopenia. While no single laboratory test can accurately discriminate SARS-CoV-2 infected from noninfected patients, the combination of the results of these routine laboratory tests may predict the COVID-19 infection status. Recent promising advances in the application of artificial intelligence (AI) in several healthcare areas (11-15) have inspired the development of AI-based algorithms as diagnostic (6) or prognostic tools (16) for complex diseases, such as COVID-19. In this study, we hypothesized that the results of routine laboratory tests performed within a short time frame of the RT-PCR testing, in conjunction with a limited number of previously identified predictive demographic factors (age, gender, race) (17), can predict SARS-CoV-2 infection status. Thus, we aimed to develop a machine learning model integrating age, gender, race and routine laboratory blood tests, which are readily available with a short TAT.

We conducted a retrospective study with 5,893 patients evaluated at the New York Presbyterian Hospital/Weill Cornell Medicine (NYPH/WCM) from March 11 to April 29, 2020.
SARS-CoV-2 RT-PCR results, routine laboratory testing results and patient demographic information were obtained from the laboratory information system (Cerner Millennium, Cerner Corporation). Exclusion criteria included patients < 18 years old, patients who had indeterminate RT-PCR results, and patients who did not have laboratory results within two days prior to the completion of RT-PCR testing (Figure 1). Among a total of 4,207 RT-PCR results from 1,402 RT-PCR positive and 1,954 negative patients in our dataset, 54.1% of RT-PCR tests were ordered from the Emergency Department (ED), 32.4% were ordered on inpatients, including 2.7% from ICU patients, and the rest were ordered from the outpatient surgery department, the outpatient clinics, and the private ambulatory setting. Among the RT-PCR results excluded from the dataset due to no corresponding laboratory results, 50.0% were ordered as "non-patient institutional" tests, including patient samples used for validation, plasma donor specimens, and healthcare workers. An additional 20.0% of excluded RT-PCR tests were ordered on ED patients who were likely discharged, 12% were from the "private ambulatory" setting, 2.8% were ordered for outpatient clinics with no associated hospital admission, and the rest were from the surgery or dental departments to rule out COVID-19 infection. The same criteria were applied to the dataset collected from New York Presbyterian Hospital/Lower Manhattan Hospital (NYPH/LMH) during the same time period. A total of 1,822 RT-PCR tests ordered for 496 RT-PCR positive and 968 negative NYPH/LMH patients were obtained and used for validation. Among them, 60.9% were ordered from the ED and 36.3% were from inpatients. This study was approved by the Institutional Review Board (#20-03021671) of Weill Cornell Medicine. (19, 20). There was no difference in turnaround time for positive or negative RT-PCR results.
RT-PCR was performed using the Roche Cobas SARS-CoV-2 RT-PCR assay and Cepheid Xpert Xpress SARS-CoV-2 RT-PCR at NYPH/LMH. At NYPH/WCM, routine chemistry testing was performed on Siemens ADVIA XPT analyzers and Centaur XP analyzers. Procalcitonin testing was performed on the Roche e411 analyzer. Blood gas analysis was performed on the Instrumentation Laboratory GEM Premier 4000 analyzer. Routine hematological testing was performed on the UniCel DXH 800 analyzer. Coagulation tests were performed on the Instrumentation Laboratory ACL TOP CTS Coagulation System. At NYPH/LMH, routine chemistry testing including procalcitonin was performed on Abbott ARCHITECT® c SYSTEM ci 4100 and ci 8200 analyzers. Blood gas analysis was performed on the Radiometer ABL 820 FLEX analyzer. Routine complete blood count (CBC) testing was performed on the UniCel DXH 800 analyzer. Coagulation tests were performed on the STAGO STA-R® Evolution multiparametric analyzer.

A total of 685 distinct laboratory tests were ordered for patients in the NYPH/WCM dataset. A 685-dimensional vector was generated for each RT-PCR test. If one specific test was ordered multiple times, the average of the values was calculated and used for analysis. Univariate analysis was performed on all laboratory test results with SciPy 1.4.1 (21) to obtain the significance of the association between each laboratory test and the RT-PCR result. Laboratory tests were selected to construct the input feature vectors of the prediction model based on the following criteria: 1) a result was available for at least 30% of the patients within two days before a specific SARS-CoV-2 RT-PCR test, and 2) the test showed a significant difference (P-value, P-value after Bonferroni correction, and P-value after demographics adjustment all less than 0.05) between patients with positive and negative RT-PCR results.
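The two selection criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the specific univariate test is not named in this section, so the two-sided Mann-Whitney U test is an assumption, as are the function name `select_features`, the choice to multiply each P-value by the total number of candidate tests for the Bonferroni correction, and the synthetic data. The demographics-adjusted P-value is omitted here.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def select_features(X, y, min_coverage=0.30, alpha=0.05):
    """Return column indices that pass the coverage and significance filters.

    X: (n_samples, n_tests) float array, NaN where a test has no result.
    y: (n_samples,) array of 0/1 RT-PCR labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_tests = X.shape[1]
    selected = []
    for j in range(n_tests):
        col = X[:, j]
        observed = ~np.isnan(col)
        # Criterion 1: a result is available for at least 30% of samples.
        if observed.mean() < min_coverage:
            continue
        pos = col[observed & (y == 1)]
        neg = col[observed & (y == 0)]
        if len(pos) == 0 or len(neg) == 0:
            continue
        # Criterion 2 (assumed test): two-sided Mann-Whitney U.
        _, p = mannwhitneyu(pos, neg, alternative="two-sided")
        # Bonferroni correction over all candidate tests (an assumption).
        p_bonferroni = min(1.0, p * n_tests)
        if p < alpha and p_bonferroni < alpha:
            selected.append(j)
    return selected


# Synthetic data: column 0 is informative, column 1 is noise,
# column 2 has results for only about 10% of samples.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 3))
X[:, 0] += 2.0 * y
X[rng.random(n) < 0.9, 2] = np.nan
selected = select_features(X, y)
print(selected)
```

In this sketch the sparsely measured column is dropped by the coverage filter before any statistical test is run, which mirrors the order of the two criteria in the text.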
After the feature selection process (details are provided in the online Supplemental Material), a 33-dimensional vector (27 routine lab tests, one age variable, one gender variable, and four race variables: African American, Asian, Caucasian and others) was constructed to represent every RT-PCR test. The value on each dimension was the average result value of the corresponding laboratory test taken within two days before the RT-PCR test, in addition to the patient's age, gender and race. The patient's race and gender variables were encoded with binary values. A missing value of a specific laboratory test in a feature vector was imputed with the median of the available non-missing values of that dimension over all patients. The result of each RT-PCR test was referred to as the label of the test.

Mathematically, let $x_i$ be the 33-dimensional feature vector of the $i$-th RT-PCR test. Let $y_i \in \{0, 1\}$ be its corresponding label. $y_i = 0$ means the result of the $i$-th RT-PCR test is "Not Detected" and we refer to this RT-PCR test as a negative sample, while $y_i = 1$ means the result is "Detected" and we refer to this RT-PCR test as a positive sample. Our goal was to "learn" a classification function $f$ that can accurately map each $x_i$ to its corresponding $y_i$. We considered 4 popular classifiers in this study.

The models were evaluated in two different settings. The first setting was a 5-fold cross validation with the NYPH/WCM data, where all RT-PCR tests were randomly partitioned into 5 equal buckets with the same positive/negative ratio in each bucket as the ratio over all tests. The implementation was based on the scikit-learn package 0.23.1 (22) with the sklearn.model_selection.StratifiedKFold function. The training and testing procedure was then performed 5 times for each of the 4 classifiers. Each time, a specific bucket was used for testing and the remaining 4 buckets for training. In the second setting, all data from NYPH/WCM were used for training, and the data from NYPH/LMH were used for testing.
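The first evaluation setting can be sketched as follows. This is an illustrative outline under stated assumptions rather than the study's implementation: `StratifiedKFold` is the function named in the text, but the use of scikit-learn's `GradientBoostingClassifier` with default hyperparameters as the GBDT model, the helper name `cross_validated_auc`, and the synthetic data are all assumptions. The median here is fitted on the training folds only, a choice of this sketch, whereas the text describes a median taken over all patients.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def cross_validated_auc(X, y, n_splits=5, seed=0):
    """Return the per-fold AUC of a GBDT under stratified k-fold CV."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Each fold keeps the overall positive/negative ratio.
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in skf.split(X, y):
        # Median imputation of missing lab values.
        imputer = SimpleImputer(strategy="median")
        X_train = imputer.fit_transform(X[train_idx])
        X_test = imputer.transform(X[test_idx])
        # Assumed GBDT implementation with default hyperparameters.
        model = GradientBoostingClassifier(random_state=seed)
        model.fit(X_train, y[train_idx])
        scores = model.predict_proba(X_test)[:, 1]
        aucs.append(roc_auc_score(y[test_idx], scores))
    return aucs


# Synthetic stand-in for the feature vectors: 5 features, one informative,
# with roughly 20% of values missing at random.
rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 5))
X[:, 0] += 2.0 * y
X[rng.random(X.shape) < 0.2] = np.nan
aucs = cross_validated_auc(X, y)
print(np.mean(aucs))
```

For the second setting, the same imputer and model would instead be fitted once on all training-site data and scored on the separate validation-site data.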
In both settings, highly suspicious negatives (HSN) were excluded from the training process. Here, an HSN was defined as a negative RT-PCR test in a patient who had a positive RT-PCR result upon re-testing within 2 days. The pipeline of our modeling framework is illustrated in Figure 2.

(Table 1). Among 590 patients who had repeat testing during this 7-week study period, 53 were initially negative but became positive upon repeat testing. Among this subgroup, 32 patients' RT-PCR results changed from negative to positive within a 2-day period.

The performance of the four machine learning models from 5-fold cross validation is summarized in

Furthermore, since 54% of RT-PCR tests were ordered from the ED, performance of the GBDT model was tested on ED patients and achieved an AUC of 0.879, a sensitivity of 0.800, a specificity of 0.825, and an agreement with RT-PCR of 0.815.

As an independent validation, we tested the performance of the trained model on an independent patient dataset (496 positive and 968 negative by RT-PCR,

laboratory testing environments based on the predictability seen using an independent patient dataset. There have been attempts to build SARS-CoV-2 predictive models with routine laboratory tests (10, 29). However, these studies were based on a small set of routine lab tests and the patient cohort sizes were smaller than in our study.

The proposed machine learning model,

1. An interactive web-based dashboard to track COVID-19 in real time
2. Coronavirus disease (COVID-19): A primer for emergency physicians
3. Clinical characteristics of coronavirus disease 2019 in China
4. Interpreting diagnostic tests for SARS-CoV-2
5. Interim guidelines for collecting, handling, and testing clinical specimens for COVID-19
6. Artificial intelligence-enabled rapid diagnosis of patients with COVID-19
7. American College of Emergency Physicians. ACEP COVID-19 Field Guide
8. Laboratory abnormalities in patients with COVID-2019 infection
9. Risk factors associated with acute respiratory distress syndrome and death in patients with coronavirus disease 2019 pneumonia in Wuhan, China
10. Routine blood tests as a potential diagnostic tool for COVID-19
11. Machine learning to predict the likelihood of acute myocardial infarction
12. A clinically applicable approach to continuous prediction of future acute kidney injury
13. Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence
14. Hidden in plain sight: Machine learning in acute kidney injury
15. Machine learning in clinical pathology: Seeing the forest for the trees
16. An interpretable mortality prediction model for COVID-19 patients
17. Positive rate of RT-PCR detection of SARS-CoV-2 infection in 4880 cases from one hospital in
18. Clinical performance of SARS-CoV-2 molecular testing
19. Comparison of two high-throughput reverse transcription-polymerase chain reaction systems for the detection of severe acute respiratory syndrome coronavirus 2
20. Rapid implementation of SARS-CoV-2 emergency use authorization RT-PCR testing and experience at an academic medical institute
21. SciPy 1.0: Fundamental algorithms for scientific computing in Python
22. Scikit-learn: Machine learning in Python
23. Classification and regression trees. 1st edition
24. Random forests
25. Greedy function approximation: A gradient boosting machine
26. Estimation of the Youden index and its associated cutoff point
27. Construction of confidence regions in the ROC space after the estimation of the optimal Youden index-based cut-off point
28. A unified approach to interpreting model predictions
29. A predictive tool for identification of SARS-CoV-2 PCR-negative emergency department patients using routine test results
30. Hematological findings and complications of COVID-19
31. Hematologic parameters in patients with COVID-19 infection
32. COVID-19: Immunopathology and its implications for therapy
33. Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD)