key: cord-0682881-ltog7ptd
authors: Kang, Jianhong; Chen, Ting; Luo, Honghe; Li, Lijian; Yang, Mia Jiming
title: Machine learning predictive model for severe COVID-19
date: 2021-01-28
journal: Infect Genet Evol
DOI: 10.1016/j.meegid.2021.104737
sha: cf2cf654679f59e8577ee3f8b9e51dea6a05f1f6
doc_id: 682881
cord_uid: ltog7ptd

To develop a modified predictive model for severe COVID-19 in people infected with SARS-CoV-2. We developed the predictive model for severe COVID-19 based on clinical data from the Tumor Center of Union Hospital affiliated with Tongji Medical College, China. A total of 151 cases from Jan. 26 to Mar. 20, 2020, were included. We then built and evaluated the model in the following steps: data preprocessing, data splitting, feature selection, model building, prevention of overfitting, and evaluation, using an artificial neural network algorithm. Results are reported for each step. In feature selection, ALB showed a strong negative correlation (r = −0.771, P < 0.001), whereas GLB (r = 0.661, P < 0.001) and BUN (r = 0.714, P < 0.001) showed strong positive correlations with the severity of COVID-19. TensorFlow was subsequently used to develop a neural network model. The model achieved good prediction performance, with an area under the curve of 0.953 (0.889–0.982). GLB and BUN may be two risk factors for severe COVID-19. Our findings could benefit the future treatment of patients with COVID-19 and help to improve the quality of care in the long term. This model could help to rationalize early clinical interventions and improve the cure rate.

In 2019, an outbreak of highly contagious pneumonia began in Wuhan, China. The disease and the virus causing it were named coronavirus disease 2019 (COVID-19) and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), respectively.
Mild COVID-19 has a self-limiting course with a low mortality rate, and patients with mild symptoms are reported to recover after one week. Severe cases, on the other hand, are reported to experience progressive respiratory failure due to alveolar damage from the virus, which may lead to death. Proinflammatory responses play a role in the pathogenesis of severe COVID-19. In vitro experiments have shown that delayed release of cytokines and chemokines occurs in respiratory epithelial cells, dendritic cells, and macrophages during the early stages of SARS-CoV-2 infection. Later, the cells secrete low levels of antiviral factors, such as interferons, and high levels of proinflammatory cytokines, such as interleukin (IL)-1β, IL-6, and tumor necrosis factor, and chemokines, such as C-C motif chemokine ligand (CCL)-2, CCL-3, and CCL-5 [1-3]. The rapid increase in cytokines and chemokines attracts inflammatory cells, such as neutrophils and monocytes, resulting in excessive infiltration of inflammatory cells into the lung tissue and, in turn, lung injury. Serum cytokine and chemokine levels are significantly higher in patients with severe COVID-19 than in those with mild and moderate COVID-19 [4]. Elevated serum cytokine and chemokine levels in patients are associated with a high number of neutrophils and monocytes in the lung tissues and peripheral blood, suggesting that these cells may play a role in lung pathology [4].

The onset of COVID-19 is usually insidious, and severe COVID-19 is difficult to predict even though its causes are known. While studies have reported ways of predicting COVID-19 severity, these have mostly explored the linear relationship between each feature and severity to identify independent risk factors. However, some of the factors related to the severity of COVID-19 are nonlinear. Classical linear prediction methods do not take nonlinear phenomena into account.
Therefore, a heuristic methodology for epidemic forecasting could help to resolve this problem, and we set our sights on the artificial neural network (ANN). An artificial neural network can integrate the linear and nonlinear relationships of each feature to obtain prediction results, adding to the credibility of the forecast. Artificial neural networks have been widely applied in medical studies for years. Because the hidden neurons need not be linear, the input and output are not required to be linearly related either. This makes the prediction more flexible [5]. If the selected input variables are sufficient and representative, and are not closely correlated with one another, the network can reveal their complex relationship and shows advantages in extrapolation. This view has been supported by Schonenberger et al. [6]. Furthermore, this methodology was applied to the prediction of the SARS epidemic by Bai and Jin (2005) [7]. A notable characteristic of the neural network is training; that study demonstrated its strong associative and reasoning ability under large and nonlinear conditions, both in theory and in practice. Because its capacity can be controlled in four ways, the network can prevent overfitting, which is one of the typical problems faced by linear regression models [8]. We therefore adopted this method in this study.

Our study aimed to introduce a neural network predictive model to predict the severity of COVID-19 using the results from routine examinations.

A neural network is a simplified model of how the human brain processes information. There are typically three parts in a neural network: an input layer, with units representing the input fields; one or more hidden layers; and an output layer, with a unit or units representing the target field(s). The units are connected with varying connection strengths (or weights).
Input data are presented to the first layer, and values are propagated from each neuron to every neuron in the next layer. Eventually, a result is delivered from the output layer.

Data were collected from the Tumor Center of Union Hospital affiliated with Tongji Medical College of Huazhong University of Science and Technology, Hubei, China. All participants gave verbal consent to take part in the study. Data from consecutive patients with COVID-19 were collected between January 26, 2020, and March 20, 2020. Data were obtained at admission.

① Scikit-learn: Scikit-learn is a machine learning library for the Python programming language. It has powerful data preprocessing functions (https://scikit-learn.org/stable/).
② TensorFlow: TensorFlow is a framework for dataflow-oriented programming, which is widely used in machine learning (https://github.com/tensorflow/tensorflow).
③ Scipy.stats: Scipy.stats contains a large number of probability distributions as well as a growing library of statistical functions (https://docs.scipy.org/doc/scipy/reference/stats.html).

The study consisted of the following phases:

Cases with missing or invalid values were deleted. Next, data were normalized using median normalization, and the qualitative variables were coded as dummy variables to eliminate the effect of their coding on the model (using the scikit-learn package in Python).

The data set was randomly split into three parts: a training set, a validation set, and a test set. For the training set and validation set, the split ratio was 9:1 (ten-fold cross-validation).

The Pearson correlation coefficient was used to analyze correlations of quantitative data, and the Kendall correlation coefficient was used to analyze correlations of qualitative data. Statistically significant (P < 0.05) features were extracted as the input for the neural network model (using the Scipy.stats package in Python).
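The feature-selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `select_features`, the feature names, and the synthetic data are hypothetical and are not the study data or the authors' code. Only `scipy.stats.pearsonr` and `scipy.stats.kendalltau` come from the Scipy.stats package named in the text.

```python
import numpy as np
from scipy import stats


def select_features(quantitative, qualitative, y, alpha=0.05):
    """Keep features whose correlation with y is significant (P < alpha).

    quantitative / qualitative: dicts mapping feature name -> 1-D array.
    Returns a dict mapping feature name -> (coefficient, p_value).
    """
    selected = {}
    for name, values in quantitative.items():
        r, p = stats.pearsonr(values, y)      # Pearson for quantitative data
        if p < alpha:
            selected[name] = (float(r), float(p))
    for name, values in qualitative.items():
        tau, p = stats.kendalltau(values, y)  # Kendall for qualitative data
        if p < alpha:
            selected[name] = (float(tau), float(p))
    return selected


# Synthetic stand-in data (NOT the study data): 151 cases, binary severity label.
rng = np.random.default_rng(0)
n_cases = 151
severe = rng.integers(0, 2, size=n_cases)
quantitative = {
    # Hypothetical albumin-like value that is lower in severe cases.
    "ALB": 40.0 - 8.0 * severe + rng.normal(0.0, 2.0, n_cases),
    # Feature generated independently of the label.
    "noise": rng.normal(0.0, 1.0, n_cases),
}
qualitative = {
    # Hypothetical binary history flag that agrees with the label ~80% of the time.
    "lung_disease": np.where(rng.random(n_cases) < 0.8, severe, 1 - severe),
}

selected = select_features(quantitative, qualitative, severe)
for name, (coef, p) in selected.items():
    print(f"{name}: coefficient={coef:.3f}, P={p:.3g}")
```

The returned dictionary keeps the sign of each coefficient, so a negatively correlated feature such as the synthetic `ALB` column remains distinguishable from positively correlated ones.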
The training set was used for training and tuning the parameters, the validation set was used to prevent overfitting, and the test set was used to evaluate performance (using the TensorFlow package in Python). Four parameters need to be set in modeling: the learning rate, the number of epochs, the number of nodes in each hidden layer, and the number of hidden layers. We describe how these parameters were adjusted in detail below.

① Learning rate: The learning rate is a hyperparameter that determines how much the model should change in response to the error each time the model parameters are updated. It is important to tune the learning rate properly because a learning rate that is too small, as shown in Fig. 1(a), may result in a very long and slow training process that may get stuck, whereas a learning rate that is too large, as shown in Fig. 1(c), may cause the model to diverge from the optimal point rather than converge towards it. However, there is currently no algorithm for obtaining the optimal value of the learning rate. The learning rate can be determined through experiments [10]. Experiments have shown that starting the learning rate from 0.1 gives relatively good performance; we used the same method in this study. If setting the learning rate to 0.

② Epochs: The residual error decreases as the number of epochs increases and eventually stabilizes, but more epochs require a much longer training time. To find the optimal number of epochs, we recorded the residual error (cross-entropy) of each epoch. The number of epochs at which the residual error stabilized was taken as optimal.

③ The number of nodes and layers in the hidden layer: Kolmogorov's theorem states that any continuous function defined on an n-dimensional cube can be represented by sums and superpositions of continuous functions of one variable.
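In symbols, one common statement of this theorem reads as follows. The notation ($f$, $\Phi_q$, $\phi_{q,p}$) is introduced here for illustration and does not come from the paper: every continuous function $f$ on $[0,1]^n$ can be written as

```latex
f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
```

where each $\Phi_q$ and $\phi_{q,p}$ is a continuous function of one variable. The outer sum has $2n+1$ terms, which is the origin of the $2n+1$ count for hidden nodes.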
Hecht-Nielsen later brought this theorem into neurocomputing by proving that any continuous function can be represented by a neural network that has only one hidden layer with exactly 2n+1 nodes, where n is the number of input nodes [11]. However, Hecht-Nielsen noted that the 2n+1 rule does not hold for all classes of activation functions. Kurkova therefore suggested that two hidden layers should be used to compensate for the lost efficiency when using regular activation functions [12]. We therefore used two hidden layers (each consisting of 2n+1 nodes) in this study.

Ten-fold internal cross-validation was used to prevent overfitting. In 10-fold cross-validation, the original sample is randomly partitioned into 10 equal-sized subsamples. Of the 10 subsamples, a single subsample is retained as the validation data for testing the model, and the remaining nine subsamples are used as training data. The cross-validation process is then repeated 10 times, with each of the 10 subsamples used exactly once as the validation data. The 10 results are then averaged to produce a single estimate [13]. The 10-fold cross-validation tested the model's ability to predict new data that were not used in estimation, in order to flag problems such as overfitting. This method was effective in preventing overfitting.

The performance of the model was evaluated using receiver operating characteristic (ROC) curve analysis [14] and the area under the curve (AUC).

A total of 166 cases were collected, of which 15 were excluded because of missing data. The remaining 151 COVID-19 patients comprised 59 males and 92 females with a mean age of 62.4 ± 16.12 (range 18-96) years.
There were 58 mild and moderate cases, 88 severe cases, and five critical cases (aged 84, 84, 69, 65, and 34 years). Comorbidities included one case of chronic kidney disease, 21 cases of diabetes (10 with complications), 11 patients with a history of cancer, and 20 with a history of lung disease; no patient was infected with human immunodeficiency virus (Fig. 2).

The feature set of the present study consisted of 33 features. After feature selection, six eligible features were used for modeling: a history of lung disease, age, hemoglobin (Hb), albumin (ALB), globulin (GLB), and blood urea nitrogen (BUN) (Table 3). The data distribution of these features is illustrated in Fig. 3. For the correlation analysis, a correlation coefficient |r| ≥ 0.8 was considered a very strong correlation and |r| = 0.60-0.79 a strong correlation.

The data were divided into a training set (99 cases), a validation set (11 cases), and a test set (41 cases).

① The number of nodes in the hidden layer: According to the results of data preprocessing, the number of nodes in the input layer is six (n = 6), so the number of nodes in each hidden layer is 2n+1 = 13.
② Learning rate: In this study, the optimal learning rate determined through experiments was 0.001.
③ Epochs: The residual error stabilized at around 200 epochs, as shown in Fig. 4, so the number of epochs was set to 200.

After parameter adjustment, the final predictive model comprised an input layer (six units), two hidden layers (13 units each), and an output layer (one unit: severe COVID-19 or non-severe COVID-19). The hidden layer nodes use the ReLU (rectified linear unit) activation function (Eq. a), the output node uses the sigmoid activation function (Eq. b), and the cost function was minimized using the adaptive moment estimation (Adam) method (Fig. 5).

The specificity and sensitivity of this model at the selected operating point were 85.7% and 100%, respectively (Fig. 6). These results indicate good predictive performance.

The present study included a total of 151 cases.
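For reference, the layer layout reported in the Results (six inputs, two hidden layers of 2n+1 = 13 ReLU units, one sigmoid output) can be sketched as a plain NumPy forward pass. This is a minimal illustration, not the authors' TensorFlow model: the weights are random and untrained, and the function names, initialization scale, and seeds are assumptions. The `relu` and `sigmoid` functions are the standard definitions, max(0, x) and 1 / (1 + e^(-x)).

```python
import numpy as np


def relu(x):
    """Rectified linear unit: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


def init_params(n_inputs=6, seed=0):
    """Random (untrained) weights for an n -> 2n+1 -> 2n+1 -> 1 network."""
    hidden = 2 * n_inputs + 1              # 13 hidden units when n_inputs is 6
    sizes = [n_inputs, hidden, hidden, 1]
    rng = np.random.default_rng(seed)      # seed and scale are arbitrary choices
    return [
        (rng.normal(0.0, 0.1, size=(fan_in, fan_out)), np.zeros(fan_out))
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(params, X):
    """Return one sigmoid output in (0, 1) per row of X."""
    h = X
    for W, b in params[:-1]:
        h = relu(h @ W + b)                # hidden layers use ReLU
    W, b = params[-1]
    return sigmoid(h @ W + b).ravel()      # output layer uses sigmoid


params = init_params()
X = np.random.default_rng(1).normal(size=(4, 6))  # four synthetic cases, six features
probs = forward(params, X)
print(probs.shape)
```

In a trained model, the output would be thresholded to give the binary label (severe versus non-severe); with random weights the outputs carry no clinical meaning.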
At the data preprocessing stage, 33 features among all cases were subjected to correlation analyses, and six features were

needs to observe dynamic changes of imaging manifestations (Fig. 7) rather than a single imaging manifestation [27, 28].

Artificial neural network technology, which is widely implemented in various fields of science, was used to establish our model [29-32]. This model has good accuracy as long as its parameters are suitably adjusted. Thus, the neural network model is extremely

Hongyi Zhang et al. reported that patients with severe COVID-19 had a significant reduction in granulocytes compared with patients with mild COVID-19 [34]. We did not collect information on these factors. Third, verification of our model using prospective testing with a larger sample size is warranted. Our sample size was relatively small, and we are currently collecting recent data from a larger sample to further validate and improve the current model. Fourth, the operational process of the artificial neural network model is complicated, like the neural activity of the human brain. Therefore, there is currently no quantitative indicator that can express the relevance between predictors and forecast results in the artificial neural network model. This is a limitation of this study, as well as a general difficulty of machine learning [33, 34].

Author contributions Writing-Reviewing and Editing,

The authors declared that they have no conflicts of interest in this work.

2. We find that low albumin, high globulin, and high blood urea nitrogen may be potential risk factors for severe COVID-19.
3. The results of our study will prove helpful in the prevention of severe COVID-19.
Chemokine up-regulation in SARS-coronavirus-infected, monocyte-derived human dendritic cells
Cytokine responses in severe acute respiratory syndrome coronavirus-infected macrophages in vitro: possible relevance to pathogenesis
Delayed induction of proinflammatory cytokines and suppression of innate antiviral response by the novel Middle East respiratory syndrome coronavirus: implications for pathogenesis and treatment
Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study
Diagnosing thyroid disorders: Comparison of logistic regression and neural network models
Classification of Mammographic Breast Microcalcifications Using a Deep Convolutional Neural Network: A BI-RADS-Based Approach
Prediction of SARS epidemic by BP neural networks with online prediction strategy
Artificial Neural Network: a tool for approximating complex functions
Diagnosis and Treatment Protocol for Novel Coronavirus Pneumonia
Comparison of Various Learning Rate Scheduling Techniques on Convolutional Neural Network
Kolmogorov's mapping neural network existence theorem
Kolmogorov's theorem and multilayer neural networks
Over-Fitting and Error Detection for Online Role Mining
Quantitative CT Extent of Lung Damage in COVID-19 Pneumonia Is an Independent Risk Factor for Inpatient Mortality in a Population of Cancer Patients: A Prospective Study
Granulocyte-colony stimulating factor in COVID-19: Is it stimulating more than just the bone marrow?
Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study
The impact of the glycocalyx on microcirculatory oxygen distribution in critical illness
Comparison of computed tomography hepatic steatosis criteria for identification of abnormal liver function and clinical risk factors, in incidentally noted fatty liver
Albumin is a major serum survival factor for renal tubular cells and macrophages through scavenging of ROS
Covid-19: An Urgent Need For A Psychoneuroendocrine Perspective
COVID-19 and the Correctional Environment: The American Prison as a Focal Point for Public Health
Medical Advisory Board of the International FA. Managing FPIES during the COVID-19 pandemic-expert recommendations
The Clinical Observation of a CVID Patient Infected with COVID-19
The Impact of Coronavirus Disease 2019 (COVID-19) on the Practice of Hand Surgery in Singapore
Digital Ischemia in COVID-19 Patients: Case Report
CT Findings of Coronavirus Disease (COVID-19) Severe Pneumonia
COVID-19 pneumonia: what has CT taught us?
Predicting Splicing from Primary Sequence with Deep Learning
Deep Learning Reveals Cancer Metastasis and Therapeutic Antibody Targeting in the Entire Body
A Deep Learning Approach to Antibiotic Discovery
Segmentation of Masses on Mammograms Using Data Augmentation and Deep Learning
Diagnostic Utility of Clinical Laboratory Data Determinations for Patients with the Severe COVID-19
Elevated exhaustion levels and reduced functional diversity of T cells in peripheral blood may predict severe progression in COVID-19 patients