key: cord-0782939-vznb3puk
authors: Zeng, Wenhuan; Gautam, Anupam; Huson, Daniel H
title: Enhanced COVID-19 data for improved prediction of survival
date: 2020-07-08
journal: bioRxiv
DOI: 10.1101/2020.07.08.193144
sha: a7914094ecc6e1748a2f3f90c9da61a2ac5d798c
doc_id: 782939
cord_uid: vznb3puk

The current COVID-19 pandemic, caused by the rapid world-wide spread of the SARS-CoV-2 virus, is having severe consequences for human health and the world economy. The virus affects individuals quite differently, with many infected patients showing only mild symptoms and others becoming critically ill. To lessen the impact of the pandemic, one important question is which factors predict the death of a patient. Here, we construct an enhanced COVID-19 dataset by processing two existing databases (from Kaggle and WHO) and using natural language processing methods to enhance the data by adding local weather conditions and research sentiment.

Author summary

In this study, we contribute an enhanced COVID-19 dataset, which contains 183 samples and 43 features. Application of Extreme Gradient Boosting (XGBoost) to the enhanced dataset achieves 95% accuracy in predicting patient survival, with country-wise research sentiment, followed by age and local weather, showing the highest importance. All data and source code are available at http://ab.inf.uni-tuebingen.de/publications/papers/COVID-19.

The current COVID-19 pandemic, caused by the rapid world-wide spread of the SARS-CoV-2 virus, is affecting many aspects of society, in particular human health, but also social issues [1, 2], mental health and the economy [3]. Medical researchers, and researchers from different scientific fields, including immunology, genetics and bioinformatics, are studying the pandemic to find ways to slow its progression. Machine learning approaches are being utilized to understand aspects of the problem.

To date, most machine learning research on COVID-19 has used supervised learning methods or deep learning [4, 5] to investigate which features might be important for predicting a predefined outcome. Running such approaches on the publicly available datasets is difficult because features are collected according to the needs of the data provider, which can be a source of bias. In particular, features that have high predictive value for the outcome of an infected patient might be missing. Generally speaking, the presence or absence of features will impact the accuracy of a model.

July 3, 2020

The currently available COVID-19 data is missing features. We explore the effect of this by adding a number of features that might be important and determining how this affects the accuracy of the model.

We used data on patients that tested positive for the virus and added new features based on (1) how different countries responded to the pandemic in terms of research sentiment, measured as a weighted average polarity score of research abstracts per country, and (2) the local weather conditions when the patient was probably infected. Using only the initial data, without these additional features, we found that age is one of the most important factors. However, after the addition of the two new features, country-specific research sentiment, followed by local weather and age, came out as the most important features. Recent publications suggest that the weather, as represented by the variables temperature and humidity, plays a role in COVID-19 [6] and SARS [7].

To summarize, our main contributions are as follows:
• We demonstrate how to construct an enhanced set of COVID-19 features using additional available information.
• Using this enhanced dataset, we show that the Extreme Gradient Boosting (XGBoost) method achieves 95% accuracy in predicting a patient's survival.
• We show that country-specific research sentiment, followed by age and local weather, are the most important features.

We first compiled an initial dataset by combining data from two sources. [...] processing was carried out on this dataset (S1 File).

WHO COVID-19 database

We downloaded a database of literature on COVID-19 from the World Health Organization (WHO) web site (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov) on April 13, 2020. Of the 5,354 downloaded entries, we kept only those whose "Journal Name" and "DOI" fields were not blank, which resulted in 4,683 publications in 590 journals. We then analyzed these publications to determine the authors' institute and country (S2 File).

In this paper, we present an enhanced COVID-19 dataset, which is based on the initial database described above. The data is enhanced by adding features that reflect the local weather and the research sentiment in the country of the infected person, as described in the following (S3 File).

Additional feature construction

Database construction was performed as outlined in Fig 1. It has been demonstrated that there is a link between environmental factors and the development of COVID-19 [8]. Indeed, it seems reasonable to suspect that the weather conditions play a role.

For a given country, we assume that the researchers' attitude toward COVID-19 will reflect the response capacity of the country, to some extent. For journal publications obtained from the WHO database, we extracted the authors' institution with the help of the paper's DOI. We then applied sentiment analysis to each abstract to obtain a polarity score for every abstract, and then calculated a weighted average polarity score for each country.
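
The per-country aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weighting scheme is not specified in this section, so the per-abstract weights, the function name `weighted_country_polarity` and the example records are all hypothetical. Polarity scores are assumed to be precomputed and to lie in [-1, 1].

```python
from collections import defaultdict


def weighted_country_polarity(records):
    """Return a weighted average polarity score per country.

    `records` is an iterable of (country, polarity, weight) tuples.
    The weight is a placeholder: how abstracts are weighted is an
    assumption of this sketch, not something taken from the paper.
    """
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for country, polarity, weight in records:
        weighted_sum[country] += polarity * weight
        weight_total[country] += weight
    # Skip countries whose weights sum to zero to avoid division by zero.
    return {
        country: weighted_sum[country] / weight_total[country]
        for country in weighted_sum
        if weight_total[country] > 0
    }


# Hypothetical example data (not taken from the WHO database).
abstracts = [
    ("Germany", 0.20, 2.0),
    ("Germany", -0.10, 1.0),
    ("China", 0.05, 1.0),
]
scores = weighted_country_polarity(abstracts)
```

Each patient record would then receive the score of the patient's country as one additional numeric feature.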
This feature was added to the enhanced dataset.

Fig 1. The Kaggle dataset is filtered for patients for whom the outcome has been recorded, and then, for these items, the weather is determined using the https://www.wunderground.com website. The WHO COVID-19 literature database is filtered for items for which both a journal name and a DOI are provided, and these are post-processed to obtain a country-wise research sentiment polarity score. XGBoost is then trained and run on both the initial and the enhanced data, and the accuracy of survival prediction is shown to be 85% and 95%, respectively.

[...] variables sex, age, the time intervals between the patient's onset date, confirmed infection date and admission date, symptom description, infection reason and outcome. We will refer to this as the initial dataset.

We added local weather variables (temperature, humidity, climate description) and the weighted polarity score of the country's research attitude. We will refer to the result of this as the enhanced dataset.

To prepare for analysis with XGBoost (as discussed below), we tokenized all multi-value text features, such as symptom description or climate description, into three-dimensional embedding vectors, and used label encoding on categorical variables such as infection reason, as shown in Table 1. We assigned the constant -999 to all missing values. After filtering for samples that have a valid outcome value, we obtained 183 samples.

[...] processing on data obtained from various social media [10-12]. Along these lines, we performed sentiment analysis on the abstracts of research papers associated with COVID-19 using the Python package TextBlob (https://github.com/sloria/TextBlob), which analyzes text content and assigns sentiment values to words based on matches to a built-in dictionary.

Our aim is to predict whether the patient will survive the infection, based on either the initial dataset or the enhanced dataset.
We use the Extreme Gradient Boosting (XGBoost) [13] method to address this. XGBoost is a powerful member of the gradient boosting family, which is designed to perform well on sparse features and is known to perform well on Kaggle tasks. This approach reduces overfitting using its built-in L1 and L2 regularization on the target function:

$$\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega(f_k) \qquad (1)$$

where $l$ is the loss between the true outcome $y_i$ and the prediction $\hat{y}_i$ for sample $i$, and $\Omega(f_k)$ is the regularization term of the $k$-th base model.

As an additive model, XGBoost consists of multiple base models, and in most cases we choose the tree model as its base model. Suppose, for the $k$-th of $t$ iterations, that we train the tree model $f_k(x)$, then

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i)$$

is the estimated result for sample $i$ after $t$ iterations. During the construction of each tree, XGBoost minimizes the objective function with the regularization term introduced in Eq (1) in the split phase of each node. In each tree, we calculate the gain of each candidate feature and split the leaf node on the feature with the largest gain. For the L2-regularized objective of [13], the gain of a split is

$$\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$

where $G_L$, $H_L$ ($G_R$, $H_R$) are the sums of the first- and second-order derivatives of the loss over the samples in the left (right) child, $\lambda$ is the L2 regularization weight and $\gamma$ is the penalty for adding a leaf.

Implementation

In this study, we ran the XGBoost algorithm on two different datasets, namely the initial dataset and the enhanced dataset, the latter containing additional features representing local weather and research sentiment, as illustrated in Fig 1.

To obtain the model with the best capacity for prediction, we used a grid search for model tuning. Each subtree in our model is a simple tree whose maximum depth is 3. The learning rate was 0.01. During the training step, we randomly sampled the columns for each tree with a ratio of 0.5.

We evaluated the algorithm's performance by calculating each model's classification accuracy. The accuracy of the model created using the initial dataset (no added features) is 85%, whereas the accuracy of the model created using the enhanced dataset (with added features) is 95%.

The method we chose to evaluate the importance score of a feature is based on counting the number of times that the feature occurs in a tree.
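
The split-counting idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code used in the study: trees are represented here as hypothetical nested dictionaries, and the function name `split_count_importance` and the example trees are invented for the sketch. In the XGBoost library, this kind of score appears to correspond to the "weight" importance type.

```python
from collections import Counter


def split_count_importance(trees):
    """Count how often each feature is used as a split across all trees.

    Each tree is a nested dict (a hypothetical representation):
    internal nodes have the keys "feature", "left" and "right";
    leaves have only the key "leaf".
    """
    counts = Counter()

    def visit(node):
        # Leaves (and missing children) contribute no split.
        if node is None or "feature" not in node:
            return
        counts[node["feature"]] += 1
        visit(node.get("left"))
        visit(node.get("right"))

    for tree in trees:
        visit(tree)
    return dict(counts)


# Two hypothetical trees of depth 2 (values are made up).
trees = [
    {
        "feature": "sentiment",
        "left": {"feature": "age", "left": {"leaf": -0.2}, "right": {"leaf": 0.1}},
        "right": {"leaf": 0.3},
    },
    {
        "feature": "sentiment",
        "left": {"leaf": -0.1},
        "right": {"feature": "temperature", "left": {"leaf": 0.0}, "right": {"leaf": 0.2}},
    },
]
importance = split_count_importance(trees)
```

A count-based score of this kind reflects how often a feature is used for splitting, not how much each split improves the objective, so it can rank features differently from a gain-based score.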
The feature importance for both datasets is shown in Fig 3. For the initial dataset, age plays a more important role than the other features. For the model based on the enhanced dataset, the weighted average research sentiment polarity score is more important than age, whereas the importance of weather is similar to that of age.

The performance of machine learning methods depends on the amount and quality of available features. For our analysis, we can say that the current publicly available data is poor: the data is quite sparse and there are too few features. Here we see that by enhancing the dataset, the accuracy of survival prediction can be increased by 10 percentage points (from 85% to 95%). Our study shows how one might enhance a dataset by adding informative features if they are not available in the original dataset. Here we demonstrated this for country-wise research sentiment and local weather. Local weather conditions have been implicated as an important feature in existing research.

Our analysis confirms the observation that age is an important factor for survival of COVID-19. However, in the data considered here, 8 patients above age 60 died and 16 survived or were still alive, while in the age group 40-60 there were 2 deaths and 36 patients who survived or were still alive. Hence, linking mortality to a particular age group is not appropriate based on the current result. While this analysis suggests that the elderly have a higher risk of death, which has already been observed [14, 15], the statement that mortality is associated with old age is probably true for any infectious disease. Age is one of the confounding factors that could be responsible for an enhanced COVID-19 mortality rate, so more emphasis should be placed on care for the elderly [16, 17].
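
To make the age-group counts quoted above concrete, the observed fraction of deaths in each group can be computed directly. The snippet below only illustrates the arithmetic; the helper name `death_fraction` is hypothetical, and the fractions are computed among patients with a recorded status, so they should not be read as case fatality rates.

```python
def death_fraction(deaths, alive):
    """Fraction of deaths among patients with a recorded status."""
    return deaths / (deaths + alive)


# Counts as stated in the text: (deaths, survived or still alive).
groups = {"above 60": (8, 16), "40-60": (2, 36)}
fractions = {name: death_fraction(d, a) for name, (d, a) in groups.items()}

for name, fraction in fractions.items():
    total = sum(groups[name])
    print(f"{name}: {fraction:.1%} of {total} patients")
```

This gives roughly 33.3% of 24 patients above age 60 and 5.3% of 38 patients aged 40-60. With group sizes this small, these fractions carry considerable uncertainty.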

For the model based on the enhanced dataset, the weighted average of research sentiment, followed by weather and age, appear as the most important features and account for the increase in the accuracy of the model. This supports the observation that environmental conditions play a role. It also suggests that research sentiment might reflect a country's ability to tackle the disease.

Finally, this analysis suggests that enhancing a dataset, rather than just analyzing the originally given features, might lead to a better prediction of the particular outcome.

Supporting information

S1 File. Initial COVID-19 dataset (183 cases).
S2 File. Processed WHO publication data.
S3 File. Enhanced COVID-19 dataset.

Funding

This work was supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A537B, 031A533A, 031A538A, 031A533B, 031A535A, 031A537C, 031A534A, 031A532B). We also acknowledge support by the Open Access Publishing Fund of University of Tübingen.

References

1. The outbreak of COVID-19 coronavirus and its impact on global mental health.
2. COVID-19 and its impact on society.
3. Multidisciplinary research priorities for the COVID-19 pandemic: a call for action for mental health science. The Lancet Psychiatry.
4. Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks.
5. Potential neutralizing antibodies discovered for novel corona virus using machine learning.
6. Temperature and latitude analysis to predict potential spread and seasonality for COVID-19. Available at SSRN 3550308.
7. Environmental factors on the SARS epidemic: air temperature, passage of time and multiplicative effect of hospital infection.
8. Evidence that higher temperatures are associated with lower incidence of COVID-19 in pandemic state, cumulative cases reported up to
9. SMOTE: synthetic minority over-sampling technique.
10. A review of influenza detection and prediction through social networking sites.
11. Forecasting influenza levels using real-time social media streams.
12. Regional influenza prediction with sampling twitter data and PDE model. International Journal of Environmental Research and Public Health.
13. XGBoost: A scalable tree boosting system.
14. Estimates of the severity of coronavirus disease 2019: a model-based analysis. The Lancet Infectious Diseases.
15. Protecting workers aged 60-69 years from COVID-19. The Lancet Infectious Diseases.
16. Dementia care during COVID-19. The Lancet.
17. COVID-19 and the consequences of isolating the elderly. The Lancet Public Health.