key: cord-0750635-710vp3t9 authors: Zoabi, Y.; Shomron, N. title: COVID-19 diagnosis prediction by symptoms of tested individuals: a machine learning approach date: 2020-05-11 journal: nan DOI: 10.1101/2020.05.07.20093948 sha: c49b814c63258f8bab2be7ff4462cb6fe2601139 doc_id: 750635 cord_uid: 710vp3t9 Effective screening of SARS-CoV-2 enables quick and efficient diagnosis of COVID-19 and can mitigate the burden on healthcare systems. Prediction models that combine several features to estimate the risk of infection have been developed in hopes of assisting medical staff worldwide in triaging patients when allocating limited healthcare resources. We established a machine learning approach that trained on records from 51,831 tested individuals (of whom 4,769 were confirmed COVID-19 cases) while the test set contained data from the following week (47,401 tested individuals of whom 3,624 were confirmed COVID-19 cases). Our model predicts COVID-19 test results with high accuracy using only 8 features: gender, whether the age is above 60, information about close contact with an infected individual, and 5 initial clinical symptoms. Overall, using nationwide data representing the general population, we developed a model that enables screening suspected COVID-19 patients according to simple features accessed by asking them basic questions. Our model can be used, among other considerations, to prioritize testing for COVID-19 when allocating limited testing resources. The novel coronavirus disease 2019 pandemic caused by the newly emerged SARS-CoV-2 is a critical and urgent threat to global health. The outbreak in early December 2019 in the Hubei province of the People's Republic of China has spread worldwide. 
As of May 2020, the overall number of patients confirmed to have the disease has exceeded 3,580,000 in more than 180 countries, although the number of people infected is probably much higher, and more than 250,000 people have died from COVID-19. This pandemic continues to challenge medical systems worldwide in many aspects, including sharp increases in demand for hospital beds and critical shortages of medical equipment, while many healthcare workers have themselves been infected. Thus, the capacity for immediate clinical decisions and effective usage of healthcare resources is crucial. The most validated diagnostic test for COVID-19, using reverse transcriptase polymerase chain reaction (RT-PCR), is currently in short supply in developing countries. This contributes to increased infection rates and delays critical preventive measures.

In Israel, all diagnostic laboratory tests for COVID-19 are performed according to criteria determined by the Israeli Ministry of Health. While subject to change, these currently include the presence and severity of clinical symptoms, possible exposure to confirmed patients, geographical area, the risk of complications if infected, and other factors 2 . The Israeli Ministry of Health recently publicly released data of individuals who were tested for SARS-CoV-2 via RT-PCR assay of a nasopharyngeal swab 3 . The dataset contains initial records, on a daily basis, for all citizens tested for COVID-19 nationwide. In addition to the test date and result, various information is available, including clinical symptoms, gender and a binary indication as to whether the tested individual is above 60 years of age.

Effective screening enables quick and efficient diagnosis of COVID-19 and can mitigate the burden on healthcare systems. Prediction models that combine several features to estimate the risk of infection have been developed in hopes of assisting medical staff worldwide in triaging patients when allocating limited healthcare resources. 
These models use features such as computed tomography (CT) scans 4-7 , information available at hospital admission including clinical symptoms 8 , and laboratory tests 9 .

We developed a model that predicts COVID-19 test results with high accuracy using only 8 features: gender, whether the age is above 60, information about close contact with an infected individual, and 5 initial clinical symptoms (Supplementary Table 1). The results for a prospective test set were 0.90 auROC (area under the receiver operating characteristic curve) with 95% CI: 0.892-0.905 (Figure 1a). Possible working points are: 87.3% sensitivity and 72% specificity, or 85.7% sensitivity and 79% specificity. The training set consisted of records from 51,831 tested individuals (of whom 4,769 were confirmed COVID-19 cases, Supplementary Table 1), from the period March 22nd, 2020 through March 31st, 2020. The test set contained data from the following week, April 1st through April 7th (47,401 tested individuals, of whom 3,624 were confirmed COVID-19 cases).

Our framework provides a ranking of the most important features that drove the model's decisions (Figure 1b). Fever and cough at presentation were key features in predicting contraction of the disease. As expected, close contact with a confirmed COVID-19 individual was also an important feature, thus corroborating the disease's high transmissibility 10 . In addition, 'male' gender was revealed as a predictor of a positive result by the model, concurring with the observed gender bias 11-13 .

medRxiv preprint, this version posted May 11, 2020. doi: https://doi.org/10.1101/2020.05.07.20093948. The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Predictions were generated using a gradient-boosting machine model built with decision-tree base-learners 14 . 
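The evaluation quantities reported above (auROC, its 95% confidence interval, and sensitivity/specificity working points) can be illustrated with a minimal sketch. The labels and scores below are invented toy values, not study data, and the percentile bootstrap is an assumption: the text does not state how the confidence interval was obtained. The function names are hypothetical.

```python
import random


def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Fraction of positive/negative pairs ranked correctly; ties count half.
    # Quadratic in the sample size, which is fine for a small illustration.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when score >= threshold is called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the auROC (assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacked one class; draw again
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data: 1 = positive test, 0 = negative test; scores are model outputs.
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

print("auROC:", auroc(labels, scores))
print("95% CI:", bootstrap_ci(labels, scores, n_boot=200))
print("sens/spec at 0.5:", sensitivity_specificity(labels, scores, 0.5))
```

Each choice of threshold yields one working point, so lowering the threshold trades specificity for sensitivity, as in the two working points quoted above.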
Gradient boosting is widely considered state of the art for prediction on tabular data 15 and is used by many successful algorithms in the field of machine learning 16 . As suggested by previous studies 17 , missing values were inherently handled by the gradient-boosting predictor 18 . We used the gradient-boosting predictor trained with the LightGBM 19 Python package.

To identify the principal features driving model prediction, SHAP (SHapley Additive exPlanations) values 20 were calculated. These values are suited for complex models such as artificial neural networks and gradient-boosting machines 21 . Originating in game theory, SHAP values partition the prediction result of every sample into the contribution of each constituent feature value. This is done by estimating the difference in output between models with and without each feature, across subsets of the feature space. By averaging across samples, SHAP values estimate the contribution of each feature to overall model predictions.

Overall, using nationwide data representing the general population, we developed a model that enables screening suspected COVID-19 patients according to simple features accessed by asking them eight basic questions. Our model can be used, among other considerations, to prioritize testing for COVID-19 when allocating limited testing resources.

(This preprint was not certified by peer review.)
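The game-theoretic idea behind SHAP values described above can be made concrete with a brute-force sketch of exact Shapley values. This is an illustration only: the three features, the scoring function `toy_risk`, and all of its numbers are hypothetical and are not taken from the study, and practical SHAP implementations for tree ensembles use far more efficient algorithms than this enumeration of every feature subset.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all feature subsets.

    value_fn(subset) returns the model output when only the features in
    `subset` (a frozenset of feature indices) are present.
    """
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical feature indices for a single individual.
FEVER, COUGH, CONTACT = 0, 1, 2


def toy_risk(present):
    """Invented risk score: additive effects plus a fever-cough interaction."""
    score = 0.05  # baseline output with no features present
    if FEVER in present:
        score += 0.20
    if COUGH in present:
        score += 0.10
    if CONTACT in present:
        score += 0.30
    if FEVER in present and COUGH in present:
        score += 0.10  # interaction term, shared between the two features
    return score


phi = shapley_values(toy_risk, 3)
baseline = toy_risk(frozenset())
full = toy_risk(frozenset({FEVER, COUGH, CONTACT}))
print("contributions:", phi)
print("baseline + sum of contributions:", baseline + sum(phi), "vs full:", full)
```

The contributions add up to the difference between the full prediction and the baseline, which is the partition property referred to in the text. Averaging the absolute per-sample contributions over many individuals would then give a feature ranking of the kind shown in Figure 1b.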
1. An interactive web-based dashboard to track COVID-19 in real time
2. The Novel Coronavirus - Israel Ministry of Health
3. COVID-19 - Government Data
4. Rapid AI Development Cycle for the Coronavirus (COVID-19) Pandemic: Initial Results for Automated Detection & Patient Monitoring using Deep Learning CT Image Analysis
5. Deep learning Enables Accurate Diagnosis of Novel Coronavirus (COVID-19) with CT images
6. A deep learning algorithm using CT images to screen for Corona Virus Disease
7. Development and Evaluation of an AI System for
8. Strong associations and moderate predictive value of early symptoms for SARS-CoV-2 test positivity among healthcare workers, the Netherlands
9. A Novel Triage Tool of Artificial Intelligence Assisted Diagnosis Aid System for Suspected COVID-19 pneumonia In Fever Clinics
10. The reproductive number of COVID-19 is higher compared to SARS coronavirus
11. Gender Differences in Patients With COVID-19: Focus on Severity and Mortality. Front
12. Sex, gender and COVID-19: Disaggregated data and health disparities
13. The Epidemiological Characteristics of an Outbreak of 2019 Novel Coronavirus Diseases (COVID-19) - China
14. Boosting and Additive Trees
15. Classifiers to Solve Real World Classification Problems?
16. XGBoost and LGBM for Porto Seguro's Kaggle challenge: A comparison. Semester Project
17. On the consistency of supervised learning with missing values
18. XGBoost: A Scalable Tree Boosting System. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
19. LightGBM: A Highly Efficient Gradient Boosting Decision Tree
20. A Unified Approach to Interpreting Model Predictions
21. Explainable machine-learning predictions for the prevention of