key: cord-0878882-f16puyls
authors: Yaşar, Şeyma; Çolak, Cemil; Yoloğlu, Saim
title: Artificial Intelligence-Based Prediction of Covid-19 Severity on the Results of Protein Profiling
date: 2021-02-15
journal: Comput Methods Programs Biomed
DOI: 10.1016/j.cmpb.2021.105996
sha: 1045c36d51973e9c2bf286aa45844d7f32f99817
doc_id: 878882
cord_uid: f16puyls

BACKGROUND: COVID-19 progresses slowly and negatively affects many people. However, mild to moderate symptoms develop in most infected people, who recover without hospitalization. Therefore, the development of early diagnosis and treatment strategies is essential. One of these methods is proteomic technology based on the blood protein profiling technique. This study aims to classify three COVID-19 positive patient groups (mild, severe, and critical) and a control group based on blood protein profiling using deep learning (DL), random forest (RF), and gradient boosted trees (GBTs). METHODS: The dataset consists of 93 samples (60 COVID-19 patients, 33 control) and 370 variables obtained from an open-source website. The dataset contains age, gender, and 368 proteins, which are used to predict the relationship between disease severity and proteins using DL and machine learning approaches (RF, GBTs). An evolutionary algorithm tunes the hyperparameters of the models, and the predictions are assessed through accuracy, sensitivity, specificity, precision, F1 score, classification error, and kappa performance metrics. RESULTS: The accuracy of RF (96.21%) was higher than that of DL (94.73%). However, the ensemble classifier GBTs produced the highest accuracy (96.98%). ITGB1BP2 in the cardiovascular II panel and MILR1 in the inflammation panel were the two most important proteins associated with disease severity. CONCLUSIONS: The proposed model (GBTs) achieved the best prediction of disease severity based on the proteins compared to the other algorithms. The results point out that changes in blood proteins associated with the severity of COVID-19 may be used in monitoring and early diagnosis/treatment of the disease.

The novel coronavirus disease has spread rapidly across the globe, affecting billions of people's everyday lives. The disease can lead to serious pneumonia, which can lead to death [1]. The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) outbreak that occurred in December 2019 manifests differently across infected patients. While it appears as a mild respiratory infection in some infected patients, it may progress to severe pneumonia and acute respiratory distress syndrome (ARDS), resulting in multiple organ failure or even death in others. In some patients, the disease progresses without symptoms. Hence, it is very difficult to determine the proportion of people at each level of COVID-19 severity. However, according to the World Health Organization (WHO), it is estimated that 80% of infections are asymptomatic or mild, 15% are serious infections requiring oxygen support, 5% are critical infections requiring ventilation, and 3% are fatal [2]. Patients classified as clinically severe COVID-19 cases are diagnosed based on clinical features such as respiratory rate and mean oxygen saturation. However, by the time these clinical signs appear, patients have already reached a clinically serious stage. As a result, patients are either taken to intensive care or can die quickly. Considering all these difficulties, it is very important to detect early which cases may become clinically serious and to develop new approaches to prevent deaths from COVID-19. Many studies have therefore been conducted on early detection and diagnosis. While most of these studies concern the clinical and epidemiological features of COVID-19, one of the most frequently used approaches recently is proteomics technology. Proteomics technology is the study of all proteins in a biological system and is increasingly being used by clinical researchers to identify biological markers of disease [3]. Therefore, the detection of molecularly changed proteins in the blood of an infected individual with proteomics technology and the discovery of biomarkers are thought to play an active role in the development of the diagnosis and treatment of COVID-19.

In recent years, machine learning and deep learning-based studies have gained importance as decision support mechanisms for clinicians in the early detection and diagnosis of diseases. Therefore, interest in studies that combine machine learning algorithms with data obtained from methods used in the early detection and diagnosis of COVID-19 has increased considerably. Machine learning is a sub-branch of artificial intelligence, consisting of models and algorithms that make inferences from existing data using mathematical and statistical methods and use these inferences to make predictions about the unknown [4]. It can also be described as an area at the intersection of statistics, artificial intelligence, computer science, predictive analysis, and statistical learning that is concerned with obtaining information from available data [5]. Deep learning, a sub-branch of machine learning, refers to methods built on multi-layered neural networks [6]. The main factor that distinguishes deep learning from a conventional artificial neural network (ANN) is that deep neural networks consist of many more layers. The number of hidden layers is increased to obtain more features from the data to be processed and to improve learning [7].

The relationships between COVID-19 severity and protein profiling data can be modeled with deep learning and artificial intelligence modeling approaches to discover important proteins associated with the pandemic. Therefore, this study aims to classify three COVID-19 positive patient groups (mild, severe, and critical) and a control group based on blood protein profiling using deep learning and machine learning models (i.e., Random Forest and Gradient Boosted Tree).

The data set used in this study includes age, gender, and 368 proteins obtained from blood protein profiling of 93 subjects: 59 positive COVID-19 cases [mild (n = 26; group 1), severe (n = 9; group 2), critical (n = 24; group 3)] and 28 controls (control group). The protein profiles of 87 samples were examined with the OLINK proteomics cardiovascular, immune, inflammation, and neurology panels, each containing 92 proteins, resulting in 368 protein measurements per subject. One of the reasons for the data set's missing values was that a sample of the mild symptom group failed the analysis of the immune and neurology panels. However, it was not excluded from the analyses involving the cardiovascular and inflammation panels. Another reason for the missing values in the data set was that, in more than 50% of the samples in all four disease groups, thirteen proteins had missing Normalized Protein eXpression (NPX) values or NPX values below the protein-specific limit of detection (LOD). Therefore, these 13 proteins were excluded, and 355 proteins remained, of which 344 were unique because some proteins were replicated across the four panels [8].

Missing values in the data set often complicate the statistical analysis of multivariate data. Because missing values were expected to negatively affect the model training process, they were imputed with the multiple imputation method using Fully Conditional Specification (FCS) [9]. In this method, each variable containing missing values is imputed using its own separate model [10].

Similarly, since the number of observations in each group is unbalanced in the data set, all the classes have been balanced with the "Sample (balance)" operator in RapidMiner Studio. This operator functions the same way as the (absolute) combination of multiplying and sampling works [11]. The Random Forest-Recursive Feature Elimination (RF-RFE) algorithm is used as the variable selection method.
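The class-balancing step described above can be illustrated with a minimal sketch. It does not reproduce the RapidMiner "Sample (balance)" operator; it only assumes that balancing means drawing an absolute number of rows per class, downsampling larger classes and topping up smaller classes by sampling with replacement. The function name `balance_classes`, the toy rows, and the use of the group sizes and the target of 33 per class as example inputs are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict


def balance_classes(rows, labels, per_class, seed=0):
    """Resample so that every class ends up with exactly `per_class` rows.

    Assumption (not taken from the RapidMiner documentation): classes that
    are too large are downsampled without replacement, and classes that are
    too small keep all original rows and are topped up with replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    out_rows, out_labels = [], []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) >= per_class:
            chosen = rng.sample(members, per_class)
        else:
            extra = rng.choices(members, k=per_class - len(members))
            chosen = members + extra
        out_rows.extend(chosen)
        out_labels.extend([label] * per_class)
    return out_rows, out_labels


# Toy example: one dummy feature per subject, unbalanced group sizes.
toy_labels = ["mild"] * 26 + ["severe"] * 9 + ["critical"] * 24 + ["control"] * 28
toy_rows = [[i] for i in range(len(toy_labels))]
balanced_rows, balanced_labels = balance_classes(toy_rows, toy_labels, per_class=33)
print(Counter(balanced_labels))
```

Sampling with replacement duplicates observations, so any duplicated row that lands in both the training and the test fold of a later cross-validation can inflate the estimated performance; the sketch makes no attempt to guard against this.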

In this algorithm, the data set is first used to train a machine learning algorithm, such as Random Forest (RF), logistic regression, or Support Vector Machines (SVM), in which the variables receive certain weights. Then, the variable with the smallest weight is removed, and the model is retrained with the remaining variables. This process continues until all features are eliminated, and the variable subset giving the best result is selected. RFE was originally proposed to enable support vector machines to perform feature selection by iteratively training a model, ranking features, and then removing the lowest-ranked features [12].

However, recently this method has been similarly applied to Random forest (RF), and it is useful in the presence of related features [13] [14] [15] .
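The elimination loop of RFE can be sketched as follows. This is a simplified illustration, not the procedure used in the study: the importance function `toy_importance` is a hypothetical stand-in (absolute Pearson correlation with the label) for the random forest importance, and the final step of choosing the best-performing subset is omitted, so the sketch only returns a ranking.

```python
def pearson_abs(xs, ys):
    """Absolute Pearson correlation; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def toy_importance(X, y, features):
    """Hypothetical stand-in for a model-based importance score."""
    return {f: pearson_abs([row[f] for row in X], y) for f in features}


def rfe_ranking(X, y, importance_fn):
    """Repeatedly rescore the remaining features and drop the weakest one.

    Returns feature indices ordered from most to least important.
    """
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        scores = importance_fn(X, y, remaining)   # "retrain" on what is left
        worst = min(remaining, key=lambda f: scores[f])
        remaining.remove(worst)
        eliminated.append(worst)
    return eliminated[::-1]


# Toy data: feature 0 equals the label, feature 1 is weakly related,
# feature 2 is constant and therefore uninformative.
y = [0, 0, 1, 1, 0, 1]
X = [[0, 1, 7], [0, 2, 7], [1, 1, 7], [1, 2, 7], [0, 1, 7], [1, 2, 7]]
print(rfe_ranking(X, y, toy_importance))
```

With a correlation-based score the importances do not change between rounds, so the rescoring is redundant here; with a random forest the importances of correlated features can shift once one of them is removed, which is the reason for refitting after every elimination.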

Deep Learning is a machine learning technique developed for machine feature extraction, perception, and learning. It performs its operations using multiple consecutive layers. Each consecutive layer receives the output formed in the previous layer as input [16] .

Although the deep learning algorithm also performs data-based learning, its learning process relies on calculations over network diagrams expressed as a neural network rather than on a single mathematical model, as in standard machine learning algorithms. The deep learning architecture in the study was a multi-layer feed-forward neural network trained with stochastic gradient descent using the backpropagation approach. Epsilon, rho, L1 regularization, L2 regularization, max w2, and dropout hyperparameters were tuned using the Evolutionary optimization algorithm to increase model performance.

The random forest (RF) algorithm builds multiple decision trees on samples of the data, and one estimate is obtained from each tree. Afterward, if the problem is a regression problem, the results obtained are averaged, and if the problem is a classification problem, the prediction with the highest number of votes is selected. RF can reduce overfitting, one of the biggest problems of machine learning algorithms. The algorithm also performs well on large data sets. The RF provides an advantage by not ignoring outlier observations [17]. The hyperparameters of maximal depth, minimal leaf size, minimal size for the split, and the number of prepruning alternatives were tuned using the Evolutionary optimization algorithm to increase model performance.
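The two ingredients described above, trees built on resampled data and majority voting for classification, can be sketched in a few lines. The example is deliberately reduced and is not the model used in the study: each "tree" is a depth-1 stump on a single numeric feature, and the random feature selection at each split that distinguishes a random forest from plain bagging is omitted. All function names and the toy data are illustrative assumptions.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(xs, ys):
    """Depth-1 tree on one numeric feature: pick the threshold with fewest errors."""
    values = sorted(set(xs))
    fallback = majority(ys)
    best = (len(ys) + 1, None, fallback, fallback)
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < thr]
        right = [y for x, y in zip(xs, ys) if x >= thr]
        l_lab, r_lab = majority(left), majority(right)
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if errors < best[0]:
            best = (errors, thr, l_lab, r_lab)
    return best[1:]


def predict_stump(stump, x):
    thr, l_lab, r_lab = stump
    if thr is None:            # the bootstrap sample had a single feature value
        return l_lab
    return l_lab if x < thr else r_lab


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Classification: the label with the most votes across trees wins."""
    return majority([predict_stump(stump, x) for stump in forest])


# Toy data: class "a" for x < 5, class "b" otherwise.
xs = [i * 0.5 for i in range(20)]
ys = ["a" if x < 5 else "b" for x in xs]
forest = fit_forest(xs, ys)
print(predict_forest(forest, 1.0), predict_forest(forest, 9.0))
```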

Gradient Boosting is a type of ensemble method used in machine learning. The foundations of this algorithm are based on the studies of Friedman et al. [18] [19] [20]. The main disadvantage of decision trees is that simple trees have a large bias and complex trees have a large variance. "Bootstrap" is a method of drawing random samples from a data set. It is mostly used to reduce the variance of the tree. In this algorithm, an ensemble of learners, usually decision trees, is built so that each compensates for the weaknesses of the others, producing a stronger learner.

In the Gradient Boosted Trees (GBTs) algorithm, a prediction function is constructed in the first iteration. The differences between its estimates and the observations are calculated, and a loss function is obtained from these differences. In the second iteration, a new learner is fitted to these differences and combined with the previous prediction function, and the differences between the updated estimates and the observations are calculated again. Thus, the prediction function is improved by repeatedly adding new learners to it, until the difference between the predictions and the observations is as close to zero as possible [21]. The maximum depth and learning rate hyperparameters were tuned using the Evolutionary optimization algorithm to increase the model performance.
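The iterative scheme described above can be made concrete with a small sketch. It is a simplified regression example with squared-error loss, where each round fits a depth-1 tree to the current residuals (observation minus prediction) and adds a shrunken copy of it to the prediction function. The classification loss, the tree depth, and the RapidMiner implementation used in the study are not reproduced; the function names, the learning rate of 0.3, and the toy data are illustrative assumptions.

```python
def fit_regression_stump(xs, residuals):
    """Depth-1 regression tree: threshold minimizing the squared error."""
    values = sorted(set(xs))
    mean_all = sum(residuals) / len(residuals)
    best = (float("inf"), None, mean_all, mean_all)
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x < thr]
        right = [r for x, r in zip(xs, residuals) if x >= thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]


def stump_value(stump, x):
    thr, left_mean, right_mean = stump
    if thr is None or x < thr:
        return left_mean
    return right_mean


def fit_gbt(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_value(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_gbt(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_value(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict_gbt(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data: y = x^2 on ten points; the training error shrinks with more rounds.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
for rounds in (1, 10, 50):
    print(rounds, mse(fit_gbt(xs, ys, n_rounds=rounds), xs, ys))
```

A smaller learning rate makes each correction more cautious and typically requires more rounds, which is one reason the learning rate is a natural hyperparameter to tune together with the tree depth.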

A 10-fold cross-validation method was used to validate the models. In 10-fold cross-validation, all data are divided into ten equal parts. One part is used as the test set and the remaining nine parts as the training set, and this process is repeated so that each part serves once as the test set. The overall accuracy of the model is determined by averaging the accuracy values across the folds. Performance of all models is reported with accuracy, sensitivity, specificity, precision, F1 score, classification error, and kappa statistics.
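The fold construction and the reported metrics can be sketched as follows. This is a minimal illustration rather than the RapidMiner validation used in the study: the folds are not stratified, the per-class metrics are computed one-vs-rest (an assumption about how multi-class sensitivity, specificity, precision, and F1 would be derived), and all function names are hypothetical.

```python
import random


def kfold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each index is in exactly one test fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    classes = set(y_true) | set(y_pred)
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes)
    return ratio(p_o - p_e, 1 - p_e)


def one_vs_rest_metrics(y_true, y_pred, positive):
    """Sensitivity, specificity, precision, and F1 for one class against the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Usage: fold sizes for 132 samples, and metrics for a tiny hand-made example.
print([len(test) for _, test in kfold_indices(132, k=10)])
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
print(accuracy(y_true, y_pred), 1 - accuracy(y_true, y_pred))  # accuracy, error
print(cohen_kappa(y_true, y_pred))
print(one_vs_rest_metrics(y_true, y_pred, positive="a"))
```

In a full evaluation a model would be fitted on each `train` list and scored on the matching `test` list, and the per-fold accuracies would then be averaged as described above.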

The compliance of quantitative variables with the normal distribution was checked with the Shapiro-Wilk test. Quantitative variables fulfilling the normal distribution assumption were summarized with mean and standard deviation, and quantitative variables that did not show normal distribution were summarized with median and min-max. In statistical analysis, the Kruskal-Wallis test was used for variables that did not show normal distribution, and the Conover test was used for pairwise comparisons of variables with differences (p<0.05). The one-way analysis of variance (one-way ANOVA) test was used for variables with normal distribution, with the Tukey test for multiple comparisons when variances were homogeneous and the Tamhane T2 test when variances were not homogeneous. In this study, in addition to artificial intelligence modeling and basic comparisons, the effect size was calculated to evaluate the effects of each protein on COVID-19 severity and control groups. The effect size is defined as the magnitude of the difference between groups [22]. Generally, the interpretation values reported in the literature are a small effect between 0.01-0.06, a moderate effect between 0.06-0.14, and a large effect above 0.14 [23]. P <0.05 was considered statistically significant. "Statistical Analysis Software" [24], RStudio Version 3.6.2 [25], and RapidMiner Studio Version 9.8 [26] were used in data analysis.
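As an illustration of the effect-size interpretation above, the following sketch computes eta squared, the ratio of the between-group sum of squares to the total sum of squares. The text does not name the effect-size statistic that was used; eta squared is assumed here only because the quoted cut-offs (0.01, 0.06, 0.14) match its conventional interpretation. The function names, the "negligible" label for values below 0.01, and the toy values are likewise assumptions.

```python
def eta_squared(groups):
    """Between-group sum of squares divided by the total sum of squares."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    return ss_between / ss_total if ss_total else 0.0


def interpret_effect(effect):
    """Map an effect size to the categories quoted in the text."""
    if effect > 0.14:
        return "large"
    if effect >= 0.06:
        return "moderate"
    if effect >= 0.01:
        return "small"
    return "negligible"  # label assumed; the text gives no name below 0.01


# Toy example: three well-separated groups of one hypothetical protein.
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
effect = eta_squared(toy_groups)
print(effect, interpret_effect(effect))
```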

The study includes 93 subjects, of whom 34 (36.6%) are female and 59 (63.4%) are male, with a mean age of 58.6 (±15.3). Descriptive statistics regarding the COVID-19 positive and control groups based on the age and gender variables of the data set are given in Table 1. Descriptive statistics of the 92 proteins in the cardiovascular II panel according to the COVID-19 positive and control groups are given in Table 2. Additionally, the effect sizes for each protein are given in Table 2. According to Table 2, [...] Table 3. Considering Table 3, except for CD28, FCRL3, JUN, CLEC7A, IL5, CD83, TPSAB1, ITM2A, [...] Table 4. When Table 4 [...] Table 5. According to Table 5, the difference between groups in terms of the proteins in the neurology panel, except for UNC5C, VWC2, CRTAM, NEP, GM-CSF-R-alpha, Dkk-4, LAIR-2, NCAN, CDH3, TNFRSF21, and IL12, is statistically significant (p<0.05). When the effect sizes are considered, the two proteins that most markedly affect the COVID-19 severity and control groups in the neurology panel are PLXNB3 (0.73) and GAL-8 (0.70).

In the data set, a total of 7 missing values were found in variables other than gender and age. These missing values were imputed using the Fully Conditional Specification method, which completed the data set. The class imbalance problem was then solved by applying the "Sample (balance)" operator to the data set without missing values. As a result, the data set became balanced, with 33 people in each group. After feature selection with the Recursive Feature Elimination method to increase the model performance, the number of variables in the data set decreased to 138 proteins.

In this study, artificial intelligence models (deep learning (DL), Random Forest (RF), and Gradient Boosted Trees (GBTs)) are constructed to classify three COVID-19 positive patient groups (mild, severe, and critical) and a control group based on blood protein profiling. Figure 1 displays the pseudo-code of the GBTs algorithm, which produces the best prediction in classifying the severity of COVID-19 based on the proteomics data. Figure 1 depicts the importance levels of the top ten proteins in COVID-19 positive and control individuals on the severity of the disease in the GBTs modeling. [27].

During the severe acute respiratory syndrome coronavirus 2 pandemic, the insufficiency of laboratory diagnostic tools and the long time they take led clinicians to seek more rapid diagnostic methods. Although COVID-19 can be effectively diagnosed at an early stage by approaches based on proteomic analysis, detecting serious COVID-19 patients before the manifestation of severe symptoms to reduce mortality is equally important. In this study, three positive (mild, severe, and critical) COVID-19 patient groups and a control group could be separated using deep learning and machine learning models (i.e., Random Forest, Gradient Boosted Tree) applied to blood protein profiling. According to the experimental results from the current study, it can be concluded that the models based on blood proteins generate promising prediction results in classifying COVID-19 (mild/severe/critical) severity levels and the control group. When the prediction results of the algorithms are compared according to the performance metrics (i.e., accuracy, sensitivity, specificity, precision, F1, kappa, and classification error), the GBTs algorithm slightly outperforms the deep learning and random forest techniques on the classification problem in question. The top ten proteins, ITGB1BP2, MILR1, MATN3, ROBO2, REN, CLEC4C, IL6, ZBTB16, PLXNB3, and LILRB4, calculated from the best-performing GBTs algorithm, can be used as biomarkers in the COVID-19 severity classification. A similar paper has reported that six proteins (IL6, CKAP4, Gal-9, IL-1ra, LILRB4, and PD-L1) are associated with the severity of the COVID-19 disease, and that complex variations in blood proteins associated with the severity of the disease may be used as early biomarkers to screen the severity of the disease in COVID-19 and act as future therapeutic targets [8]. Besides, the findings of the proposed GBTs model indicate that two of these proteins (i.e., IL6 and LILRB4) are significantly related to COVID-19 severity, as reported by the previous work [8].

Different studies on the proteomics profiling of the COVID-19 pandemic have been reported to identify the differences in proteins. A recent paper [28] performed RNA-seq and high-resolution mass spectrometry on 128 blood samples from COVID-19-positive and COVID-19-negative patients with various disease severities and outcomes and mapped 219 molecular features with high significance for the status and severity of COVID-19. The same study presents a web-based platform for interactive exploration and demonstrates COVID-19 severity prediction with a machine learning approach (ExtraTrees classifier) [28]. Another study profiled host responses to COVID-19 by studying plasma proteomics in a population of patients with COVID-19, including non-survivors and survivors recovering from moderate or severe symptoms, and revealed numerous plasma protein alterations associated with COVID-19. Its authors developed a machine learning pipeline (penalized logistic regression) that identified 11 proteins as biomarkers, together with a range of biomarker combinations, which were validated in an independent cohort and accurately differentiated and predicted COVID-19 outcomes [29]. A recent study described the development of a proteomic risk score (PRS) based on 20 blood proteomic biomarkers linked to progression to severe COVID-19 and established, using a machine learning model (Light Gradient Boosting Machine), that a core group of gut microbiota could reliably predict the blood proteomic biomarkers of COVID-19 [30]. The study conducted by Gomila et al. [31] used matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) to analyze the mass spectra profiles of the sera from 80 COVID-19 patients, clinically classified as mild (33), severe (26), and critical (21), and 20 healthy controls, and found a clear variability of the serum peptidome profile depending on COVID-19 severity. In the analysis of the resulting peak-intensity matrix, two support vector machines discriminated severe (severe and critical) from non-severe (mild) patients with 90% precision and correctly predicted the non-negative outcome of the severe patients in 85% of the cases and the negative outcome in 38% of the cases. In contrast, the current work additionally includes a deep learning approach in the proteomics analysis of COVID-19 severity, which is an important difference from the studies given earlier.

To sum up, the proposed model (Gradient Boosted Tree) achieved the best prediction of disease severity based on the proteins compared to the other algorithms. The results point out that changes in blood proteins associated with the severity of the disease may be used in monitoring the severity of COVID-19 disease and in early diagnosis and treatment.

Multi-task deep learning based CT imaging analysis for COVID-19 pneumonia: Classification and segmentation

COVID-19): World Health Organization

Application of Proteomics to Medical Diagnostics

The elements of statistical learning: data mining, inference, and prediction

Introduction to machine learning with Python: a guide for data scientists

Deep learning

Scaling learning algorithms towards AI, Large-scale kernel machines

Proteomic blood profiling in mild, severe and critical COVID-19 patients, medRxiv

Data mining and the impact of missing data

Fully conditional specification in multivariate imputation

RapidMiner 5, Operator Reference. www.rapid-i.com

Gene selection for cancer classification using support vector machines

Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules

Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes

Correlation and variable importance in random forests

Deep learning, nature

A random forest guided tour

Reitz Lecture

Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors)

A decision-theoretic generalization of on-line learning and an application to boosting

Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500

Using effect size-or why the P value is not enough

Statistical power analysis for the behavioral sciences

A Developed Interactive Web Application for Statistical Analysis: Statistical Analysis Software

Getting started with RStudio

RapidMiner: Data mining use cases and business analytics applications

COVID-19 detection in radiological text reports integrating entity recognition

Large-scale Multi-omic Analysis of COVID-19 Severity

Plasma Proteomics Identify Biomarkers and Pathogenesis of COVID-19

Gut Microbiota May Underlie the Predisposition of Healthy Individuals to COVID-19-Sensitive Proteomic Biomarkers

Rapid classification and prediction of COVID-19 severity by MALDI-TOF mass spectrometry analysis of serum peptidome, medRxiv

This study does not require ethical approval because an open-source data set is used. Also, there is no conflict of interest among the authors. No institution or organization financially supported this research.