key: cord-353499-os328w9o authors: Yang, H. S.; Vasovic, L. V.; Steel, P.; Chadburn, A.; Hou, Y.; Racine-Brzostek, S. E.; Cushing, M.; Loda, M.; Kaushal, R.; Zhao, Z.; Wang, F. title: Routine laboratory blood tests predict SARS-CoV-2 infection using machine learning date: 2020-06-19 journal: nan DOI: 10.1101/2020.06.17.20133892 sha: doc_id: 353499 cord_uid: os328w9o
Accurate diagnostic strategies to rapidly identify SARS-CoV-2 positive individuals for management of patient care and protection of health care personnel are urgently needed. The predominant diagnostic test is viral RNA detection by RT-PCR from nasopharyngeal swab specimens; however, the results of this test are not promptly obtainable in all patient care locations. Routine laboratory testing, in contrast, is readily available with a turn-around time (TAT) usually within 1-2 hours. Here we present a machine learning model incorporating patient demographic features (age, sex, race) with 27 routine laboratory tests to predict an individual's SARS-CoV-2 infection status. Laboratory test results obtained within two days before the release of the SARS-CoV-2 RT-PCR result were used to train a gradient boosted decision tree (GBDT) model from 3,346 SARS-CoV-2 RT-PCR tested patients (1,394 positive and 1,952 negative) evaluated at a large metropolitan hospital. The model achieved an area under the receiver operating characteristic curve (AUC) of 0.853 (95% CI: 0.829-0.878). Application of this model to an independent patient dataset from a separate hospital resulted in a comparable AUC (0.838), validating its generalizability. Moreover, our model predicted initial SARS-CoV-2 RT-PCR positivity in 66% of individuals whose RT-PCR result changed from negative to positive within two days. Overall, this model employing routine laboratory test results offers opportunities for early and rapid identification of high-risk SARS-CoV-2 infected patients before their RT-PCR results are available.
This may facilitate patient care and quarantine, indicate who requires retesting, and direct personal protective equipment use while awaiting definitive RT-PCR results.
medRxiv preprint doi: https://doi.org/10.1101/2020.06.17.20133892; this version posted June 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Introduction: The Coronavirus Disease-2019 (COVID-19) pandemic has rapidly spread worldwide, resulting in over 7 million confirmed cases and more than 368,000 total deaths as of June 9, 2020 (1). The highly contagious nature of SARS-CoV-2 (2), the rapid progression of disease in some infected patients (3), and the subsequent stress on the healthcare system have created an urgent need for rapid and effective diagnostic strategies for the prompt identification and isolation of infected patients. Currently, the diagnosis of COVID-19 relies on SARS-CoV-2 virus-specific real-time reverse-transcriptase polymerase chain reaction (RT-PCR) testing of nasopharyngeal swabs or other upper respiratory tract specimens (4, 5). However, while the TAT of RT-PCR testing is usually within 48 hours (6), it can be significantly longer due to many variables, including the need for repeat testing or the lack of needed supplies. Many emergency departments (EDs) in smaller hospitals do not yet have access to on-site SARS-CoV-2 RT-PCR testing. These issues can result in delayed hospital admission and bed assignment, inappropriate medical management (including quarantining) of infected patients, and increased exposure of healthcare personnel and other patient contacts to the virus.
Rapid diagnosis and identification of high-risk patients for early intervention is vital for individual patient care and, from a public health perspective, for controlling disease transmission and maintaining the healthcare workforce. Currently in hospital EDs, the nationally recommended practice when evaluating patients with moderate to high risk for COVID-19 is SARS-CoV-2 RT-PCR testing, a panel of routine laboratory tests, a chest X-ray and symptomatology, whereas chest computed tomography (CT) is not recommended due to cost and TAT considerations (2, 7). Routine laboratory tests are generally available within 1-2 hours and are accessible prior to patient discharge from the ED. Several studies (3, 8-10) have reported laboratory abnormalities in COVID-19 patients on admission and during the disease course, including elevations of C-reactive protein (CRP), D-dimer, lactic acid dehydrogenase (LDH), cardiac troponin, procalcitonin (PCT), and creatinine, as well as lymphopenia, thrombocytopenia, and leukopenia. While no single laboratory test can accurately discriminate SARS-CoV-2 infected from non-infected patients, the combination of the results of these routine laboratory tests may predict COVID-19 status. Recent promising advances in the application of artificial intelligence (AI) in several healthcare areas (11-13) have inspired the development of AI-based algorithms as diagnostic (14) or prognostic (15) tools for complex diseases, such as COVID-19.
In this study, we hypothesized that the results of routine laboratory tests performed within the same short time frame as the RT-PCR testing, in conjunction with a limited number of previously identified predictive demographic factors (age, gender, race) (16), can predict SARS-CoV-2 infection status. Thus, we aimed to develop a machine learning model integrating age, gender, race and routine laboratory blood tests, which are readily available with a short TAT.
SARS-CoV-2 RT-PCR testing was performed using the RealStar SARS-CoV-2 RT-PCR kit 1.0 (Altona Diagnostics) reagent system, the Cobas SARS-CoV-2 RT-PCR assay (Roche Diagnostics), the Panther Fusion SARS-CoV-2 RT-PCR assay (Hologic), and the Xpert Xpress SARS-CoV-2 RT-PCR (Cepheid) at NYPH/WCM. RT-PCR was performed using the Xpert Xpress SARS-CoV-2 RT-PCR (Cepheid) at NYPH/LMH.
At LMH, routine chemistry testing, including procalcitonin, was performed on ABBOTT® ARCHITECT c SYSTEM ci 4100 and ci 8200 analyzers (Chicago, IL, USA). Blood gas analysis was performed on the Radiometer ABL 820 FLEX analyzer (Copenhagen, Denmark). Routine complete blood count (CBC) testing was performed on the UniCel DXH 800 analyzer. Coagulation tests were performed on the STAGO STA-R® Evolution multiparametric analyzer (Parsippany, New Jersey, USA).
selected to construct the input feature vectors of the prediction model based on the following criteria: 1) a result available for at least 30% of the patients in the two days before a specific SARS-CoV-2 RT-PCR test, and 2) showing a significant difference (p-value, p-value after Bonferroni correction, and p-value after demographics adjustment all less than 0.05) between patients with positive and negative RT-PCR results. Laboratory tests carrying redundant clinical information were eliminated. For example, of neutrophil count and neutrophil percentage, only the absolute neutrophil count was retained for analysis. After this procedure, twenty-seven routine laboratory tests were selected for further analysis. A 33-dimensional vector (27 routine lab tests, 1 age, 1 gender, and 4 race variables corresponding to White, Black, Asian and others) was constructed to represent every RT-PCR test. For each laboratory test, the value of the corresponding dimension was the average of the results obtained in the two days before the RT-PCR test; the remaining dimensions held the patient's age, gender and race. The patient's race and gender variables were encoded
• Gradient boosted decision tree (GBDT), where f is a gradient boosting machine with decision trees as base learners (19). Our implementation is based on scikit-
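The feature construction described above (laboratory results averaged over the two days before an RT-PCR test, followed by age, gender, and four race indicators) can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: the function name `build_feature_vector`, the tuple layout of `lab_results`, the 0/1 encoding of gender and race, and the use of `None` for tests with no result in the window are all assumptions, and the three lab names in the usage example are placeholders, not the 27 selected tests.

```python
from datetime import datetime, timedelta
from statistics import mean

# Four race indicator variables, in the order given in the text.
RACES = ["White", "Black", "Asian", "Other"]


def build_feature_vector(lab_results, lab_names, pcr_time, age, is_male, race,
                         window_days=2):
    """Build one feature vector for a single RT-PCR test (hypothetical helper).

    lab_results: list of (lab_name, timestamp, value) tuples for the patient.
    lab_names:   ordered list of laboratory tests used as features.
    """
    window_start = pcr_time - timedelta(days=window_days)
    features = []
    for name in lab_names:
        # Keep only results of this test that fall inside the look-back window.
        values = [value for lab, when, value in lab_results
                  if lab == name and window_start <= when <= pcr_time]
        # Average repeated measurements; None marks a missing test (assumption).
        features.append(mean(values) if values else None)
    features.append(age)
    features.append(1 if is_male else 0)                    # assumed 0/1 encoding
    features.extend(1 if race == r else 0 for r in RACES)   # assumed indicators
    return features


# Usage with three placeholder lab names.
pcr_time = datetime(2020, 4, 1, 12, 0)
labs = [
    ("CRP", datetime(2020, 3, 31, 9, 0), 10.0),
    ("CRP", datetime(2020, 4, 1, 8, 0), 14.0),
    ("CRP", datetime(2020, 3, 28, 9, 0), 99.0),   # outside the two-day window
    ("LDH", datetime(2020, 3, 31, 9, 0), 300.0),
]
vec = build_feature_vector(labs, ["CRP", "LDH", "D-dimer"], pcr_time,
                           age=70, is_male=True, race="Asian")
print(vec)  # [12.0, 300.0, None, 70, 1, 0, 0, 1, 0]
```

With 27 laboratory test names, the same function yields 27 + 1 + 1 + 4 = 33 dimensions, matching the vector size stated in the text.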
The models were evaluated in two different settings. The first setting was a 5-fold cross validation with the NYPH/WCM data, where all RT-PCR tests were randomly partitioned into 5 equal buckets with the same positive/negative ratio in each bucket as the ratio over all tests. The implementation was based on the scikit-learn package 0.23.1 with the sklearn.model_selection.StratifiedKFold function. The training and testing procedure was then performed 5 times for each of the 4 classifiers. Each time, a specific bucket (fold) was used for testing and the remaining 4 buckets (folds) for training. In the second setting, all data from NYPH/WCM were used for training, and the data from LMH were used for testing. In both settings, highly suspicious negatives (HSN) were excluded from the training process. Here an HSN was defined as a negative RT-PCR test in a patient who had a positive RT-PCR result upon re-testing within 2 days.
The pipeline of our modeling framework is illustrated in Figure 1. A summary of statistics of the 27 routine laboratory tests used to construct the input feature vectors of the prediction model is shown in Table 1. The models were trained and tested on a retrospective dataset collected from 3,346 SARS-CoV-2 RT-PCR tested adult patients who had routine laboratory testing performed within two days prior to the release of the RT-PCR result, between March 11 and April 29, 2020, at NYPH/WCM. This dataset included 1,394 SARS-CoV-2 RT-PCR positive and 1,952 negative patients who ranged in age from 18 to 101 years (mean 56.4 years; demographic information in Table 2).
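The first evaluation setting (stratified 5-fold partitioning, with HSN tests dropped from training) can be illustrated with a short sketch. The paper used scikit-learn's `StratifiedKFold`; the version below reimplements the partitioning in plain Python only to make the logic visible, and it is not the authors' code. The function names, the toy label counts, and the choice to keep HSN tests in the test folds are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign sample indices to folds so each fold keeps the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


def cv_splits(labels, is_hsn, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs; HSN tests are removed from training only."""
    folds = stratified_folds(labels, n_splits, seed)
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i
            for j, fold in enumerate(folds) if j != k
            for i in fold if not is_hsn[i]
        )
        yield train_idx, test_idx


# Toy data: 40 positive and 60 negative tests, three negatives flagged as HSN.
labels = [1] * 40 + [0] * 60
is_hsn = [False] * 100
for i in (40, 41, 42):
    is_hsn[i] = True

for train_idx, test_idx in cv_splits(labels, is_hsn):
    positives = sum(labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), positives)
```

Each test fold holds 20 tests with 8 positives, the same 40% positive rate as the toy dataset, and no HSN test ever appears in a training set. A classifier would then be fit on `train_idx` and scored on `test_idx` in each of the five iterations.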
Among 590 patients who had repeat testing during this 7-week study period, 53 were initially negative but became positive upon repeat testing. In 60% of this subgroup (32 of 53 patients), the RT-PCR result changed from negative to positive within a 2-day period. Among these 32 patients, whose SARS-CoV-2 RT-PCR results were initially negative but positive upon repeat testing within two days, our model predicted the initial RT-PCR result as positive for 21 patients (66%). For example, a Hispanic male patient in his 70s underwent RT-PCR testing, which showed a negative result. Since the second RT-PCR test taken on the next day was positive, the initial RT-PCR
could potentially need isolation at home or in the hospital while awaiting confirmatory RT-PCR results.
We also varied the length of the window for collecting routine lab tests using four additional settings: one day before RT-PCR, one day after RT-PCR, one day before and one day after RT-PCR, and two days before and one day after RT-PCR. The ROC curves are plotted in Supplemental Figure 2. In this analysis, longer time windows around the RT-PCR test, which capture more information characterizing patient infection status, yielded slightly higher prediction performance in terms of AUC. However, these performances did not significantly differ from each other or from the chosen setting.
There are three potential limitations to the use of this model.
First, the model was trained on a dataset generated from a patient cohort who were in the hospital for moderate to life-threatening presentations of COVID-19. Thus, this model may not be applicable to mild COVID-19 cases. Second, the model was developed with a "control
We trained and tested a machine learning model incorporating 27 routine laboratory tests to provide a rapid and objective assessment of patient SARS-CoV-2 infection status. The robust performance of our model was confirmed in an independent testing set. Our results illustrate the potential role of this model as a complementary diagnostic tool to identify high-risk SARS-CoV-2 infected patients before their RT-PCR results are available, and to risk-stratify patients in the ED. As such, use of our model could result in earlier appropriate treatment and isolation, thereby promoting the health of the patients while protecting the health of the public. Furthermore, our model may play an important role in assisting in the identification of SARS-CoV-2 infected patients in areas where RT-PCR testing is not possible due to financial or supply constraints.
We want to thank Richard Fedeli for organizing the datasets of laboratory testing results.
HSY: conceptualization, investigation, data analysis, writing, reviewing and editing of the manuscript, and visualization. LVV: organizing LMH data. PS: editing the manuscript and providing ED information. AC: editing the manuscript. YH: performing data analysis. SER, MMC, ML, RK: editing the manuscript. ZZ: conceptualization and editing the manuscript. FW: conceptualization, investigation, data analysis, visualization, editing the manuscript, and supervision of the project.
None of the authors has a conflict of interest in this project.
Machine learning to predict the likelihood of acute myocardial infarction
A clinically applicable approach to continuous prediction of future acute kidney injury
Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence
Artificial intelligence-enabled rapid diagnosis of patients with COVID-19
An interpretable mortality prediction model for COVID-19 patients
Positive rate of RT-PCR detection of SARS-CoV-2 infection in 4880 cases from one hospital in Wuhan, China
Classification and regression trees