key: cord-0917604-v7y7fwlm
authors: Liu, Yushuang; Jin, Shuping; Gao, Hongli; Wang, Xue; Wang, Congjing; Zhou, Weifeng; Yu, Bin
title: Predicting the multi-label protein subcellular localization through multi-information fusion and MLSI dimensionality reduction based on MLFE classifier
date: 2021-12-02
journal: Bioinformatics
DOI: 10.1093/bioinformatics/btab811
sha: b932638082bab72f39ac0e40c9cba3b66b9f17d2
doc_id: 917604
cord_uid: v7y7fwlm

MOTIVATION: Multi-label (ML) protein subcellular localization (SCL) is an indispensable way to study protein function. It can locate a certain protein (such as the human transmembrane protein that promotes the invasion of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)) or expression product at a specific location in a cell, which can provide a reference for clinical treatment of diseases such as coronavirus disease 2019 (COVID-19). RESULTS: This article proposes a novel method named ML-locMLFE. First, six feature extraction methods are adopted to obtain effective protein information: pseudo amino acid composition, encoding based on grouped weight, gene ontology, multi-scale continuous and discontinuous, residue probing transformation and evolutionary distance transformation. Next, the ML information latent semantic index method is used to reduce the interference of redundant information. Finally, ML learning with feature-induced labeling information enrichment is adopted to predict the ML protein SCL. The Gram-positive bacteria dataset is chosen as the training set, while the Gram-negative bacteria dataset, virus dataset, newPlant dataset and SARS-CoV-2 dataset serve as the test sets. Under leave-one-out cross validation, the overall actual accuracies on the Gram-positive bacteria, Gram-negative bacteria, virus and newPlant datasets are 99.23%, 93.82%, 93.24% and 96.72%, respectively. The overall actual accuracy of our predictor on the SARS-CoV-2 dataset is 72.73%. The results indicate that the ML-locMLFE method has clear advantages in predicting the SCL of ML proteins, which provides new ideas for further research on the SCL of ML proteins. AVAILABILITY AND IMPLEMENTATION: The source codes and datasets are publicly available at https://github.com/QUST-AIBBDRC/ML-locMLFE/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

The structures and functions of proteins are diverse, but proteins can only play their role in the right subcellular localization (SCL) (Chu et al., 2020; Costa et al., 2018). Changes in protein structure can cause diseases, such as kidney disease (Ivanova et al., 2008), myocarditis (Jang et al., 1988), diabetes (Brownlee, 1995), dermatomyositis (Brownlee, 1995) and muscle atrophy (Sneddon et al., 2000). With the continuous increase of data and the expansion of research directions (Wan et al., 2015, 2017), traditional machine learning methods cannot achieve good prediction results (Marilyn et al., 2020; Zhang et al., 2020c). First, traditional machine learning methods are time-consuming and labor-intensive. Second, a protein may exist not only in one subcellular location, but also in two or more. Prediction methods designed for a single protein site ignore the case of two or more subcellular locations (Du et al., 2020). Finally, the high-dimensional space formed by multi-information fusion increases the interference of redundant information with the prediction results (Yu et al., 2018). Therefore, this article mainly optimizes the feature extraction, feature selection and classifier to improve the prediction accuracy.

© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com

Since protein sequences cannot be directly used for calculation, they must be transformed into digital information for further study (Yu et al., 2020a). Zhang et al. (2020c) utilized dipeptide composition (DC), pseudo position-specific scoring matrix (PsePSSM), pseudo amino acid composition (PseAAC), gene ontology (GO) and encoding based on grouped weight (EBGW) to extract protein information from relevant datasets. Wu et al. (2012) adopted GO and evolutionary information to develop a new predictor, iLoc-Gpos, on the Gram-positive bacteria dataset. Wan et al. (2014) used the relationship between GO terms to predict the SCL of a plant dataset. Zhang et al. (2021b) used position-specific scoring matrix-transition probability composition (PSSM-TPC), DC, GO, PseAAC and PsePSSM, together with a differential evolution algorithm, to assign weights to the five single feature vectors.

The feature fusion method can combine multiple kinds of information from protein sequences (Yu et al., 2021). However, as the dimension increases, the interference of redundant information with the prediction results gradually increases (Fan et al., 2021; Shi et al., 2019). To eliminate the useless features in the original space, researchers have proposed a variety of dimensionality reduction methods. Zhang et al. (2019) put forward a manifold regularized discriminant feature selection (MDFS) algorithm, which improves performance by optimizing the feature selection framework and considering label correlation. Zhang and Zhou (2010) suggested a multi-label (ML) dimension reduction method based on dependency maximization (MDDM), which maximizes the dependency between the original features and the related category labels to make the dimension reduction process more efficient. Xu et al. (2016) came up with the ML feature extraction algorithm via feature variance and feature-label dependence (MVMD), which integrates two least squares formulas and uses the maximum feature variance and the feature-label correlation to select the best feature vector. Zhang et al. (2020a) presented the global relevance and redundancy optimization (GRRO) method, composed of feature relevance, label relevance and feature redundancy, which greatly improved computing efficiency.

Choosing a suitable classifier is crucial for predicting the SCL of proteins. Wan et al. (Wan and Mak, 2018; Wang et al., 2021) proposed an adaptive decision-making scheme for support vector machines (AD-SVM), which obtained an overall actual accuracy (OAA) of 93.24% and an overall location accuracy (OLA) of 96.03% on the virus dataset. Wang et al. (2015) used the ensemble multiple classifier chain (ECC) to predict the protein SCL of the Gram-negative bacteria dataset, obtaining an OAA of 94.03% and an OLA of 94.46%. Shen et al. (2019) used a multi-kernel SVM classifier to predict the ML protein SCL of two human datasets; the average precision reached 70.65% and 68.89%, respectively, which was the best result among the compared methods.

To improve the accuracy of prediction, we propose a new model called ML-locMLFE to predict the SCL of ML proteins. First, six feature extraction methods are used to transform protein sequences into digital information, and the six types of feature information are fused. Then we use the ML information latent semantic index (MLSI) to select the most effective information from the many features. Finally, ML learning with feature-induced labeling information enrichment (MLFE) is utilized to predict the SCL of ML proteins. Compared with other methods, ML-locMLFE is superior in predicting the SCL of ML proteins.

Five datasets are used to verify the effectiveness of the model. The Gram-positive bacteria dataset (Dehzangi et al., 2015) is the training set, while the Gram-negative bacteria dataset (Dehzangi et al., 2015), the virus dataset, the SARS-CoV-2 dataset (Zhang et al., 2020b) and the newPlant dataset (Wan et al., 2012) together form the test sets. The Gram-positive bacteria, Gram-negative bacteria and virus datasets come from the Swiss-Prot database, and the breakdown of each dataset is shown in Supplementary Tables S1-S3. We obtained data from the UniProt database covering the past 3 years to construct a new plant dataset (named the newPlant dataset). The detailed breakdown is given in Supplementary Table S4. As a newly mutated coronavirus, SARS-CoV-2 can cause great harm to human health. Therefore, accurate identification of the SCL of SARS-CoV-2 proteins helps to analyze the pathogenic mechanism of the virus. The SARS-CoV-2 dataset is constructed from the UniProt database, and the detailed breakdown is shown in Supplementary Table S5. The sequence homology of the five datasets is below 25%.

The quality of features has a crucial impact on the predictive ability of the model. Therefore, a suitable feature extraction method is an extremely critical step in predicting the SCL of ML protein. Six methods, namely PseAAC, EBGW, GO, residue probing transformation (RPT), evolutionary distance transformation (EDT) and multiscale continuous and discontinuous (MCD), are adopted here.

PseAAC: PseAAC is a commonly used feature extraction method to predict SCL. According to Chou (2001) , PseAAC mainly reflects the protein sequence information (Bahar et al., 1997; Sahu et al., 2020; Zhang et al., 2021a) . The algorithm can be expressed by

where $\eta_k$ represents the $k$th-level sequence correlation factor, $f_v$ represents the frequency of the $v$th amino acid in the protein, and $\omega$ is the weighting factor, set to 0.05 in this article (Chou, 2001). Because $\beta$ is the characteristic parameter, a $(20 + \beta)$-dimensional feature vector is finally formed.

EBGW: Physicochemical properties are among the important properties of proteins. Zhang et al. (2006) proposed EBGW, which divides amino acids into four categories, as shown in Table 1.

Three disjoint combinations can be obtained from Table 1 . According to formulas (3), (4) and (5), the protein sequences are converted into three binary sequences.

The length of each of the three binary sequences is L. Each sequence is divided into multiple subsequences of progressively increasing length. Each binary sequence forms an L-dimensional feature vector, so the three binary sequences form a $3 \times L$-dimensional vector.
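The two sequence-based encodings described above (PseAAC and EBGW) can be illustrated with a minimal sketch. Several details are assumptions made only for illustration and are not taken from this article: the Kyte-Doolittle hydropathy scale stands in for the physicochemical properties used in PseAAC, the four residue classes follow a grouping commonly quoted for EBGW (Table 1 is not reproduced in this text), each EBGW subsequence is taken to be a prefix of length floor(k·L/n), and the function names `pseaac` and `ebgw` are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used only as a stand-in property scale
# (assumption: the article does not state which properties it uses).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(HYDROPATHY)

# Four residue classes commonly quoted for EBGW (assumed here).
C1 = set("GAVLIMPFW")  # neutral, non-polar
C2 = set("QNSTYC")     # neutral, polar
C3 = set("DE")         # acidic
C4 = set("HKR")        # basic
# Three ways to split the four classes into two disjoint pairs.
PARTITIONS = [C1 | C2, C1 | C3, C1 | C4]


def normalized_scale(scale):
    """Standardize a property scale to zero mean and unit variance."""
    values = list(scale.values())
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {aa: (v - mean) / std for aa, v in scale.items()}


def pseaac(sequence, lam, weight=0.05):
    """Return a (20 + lam)-dimensional PseAAC-style vector."""
    if lam >= len(sequence):
        raise ValueError("lam must be smaller than the sequence length")
    prop = normalized_scale(HYDROPATHY)
    n = len(sequence)
    freqs = [sequence.count(aa) / n for aa in AMINO_ACIDS]
    # k-th correlation factor: mean squared property difference at gap k.
    factors = []
    for k in range(1, lam + 1):
        diffs = [(prop[sequence[i]] - prop[sequence[i + k]]) ** 2
                 for i in range(n - k)]
        factors.append(sum(diffs) / (n - k))
    denom = sum(freqs) + weight * sum(factors)
    return [f / denom for f in freqs] + [weight * t / denom for t in factors]


def ebgw(sequence, n_sub):
    """Return a 3 * n_sub vector of grouped weights."""
    length = len(sequence)
    features = []
    for ones in PARTITIONS:
        binary = [1 if aa in ones else 0 for aa in sequence]
        for k in range(1, n_sub + 1):
            # Subsequence k is the prefix of length floor(k * L / n_sub).
            sub_len = (k * length) // n_sub
            features.append(sum(binary[:sub_len]) / sub_len if sub_len else 0.0)
    return features
```

Under these assumptions, `pseaac(seq, 25)` returns 45 values and `ebgw(seq, 40)` returns 120 values, which would match the feature dimensions reported later in the article.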

GO: Extracting the GO information of each protein sequence is usually divided into two steps (Huang et al., 2008; Shen et al., 2020): retrieving the GO terms and constructing the GO vector. BLASTP is used to search the Swiss-Prot database and retain homologous proteins (denoted as $Y_i$) with a similarity ≥60% to protein $P_i$. The parameter E is set to 0.001 (Zhang et al., 2018a) in this step. In the GOA database, we search for the accession numbers (ACs) of each protein in $Y_i$, which are obtained from the Swiss-Prot database. The corresponding GO terms are then obtained (Xiao et al., 2011), and the GO feature vector is constructed as:

RPT: RPT is a feature extraction method that reflects the evolutionary information of protein sequences (Jeong et al., 2010). In the PSSM, domains with similar conservation are grouped according to their conservation scores (Wang et al., 2019; Zhang et al., 2021c). Each column of the PSSM corresponds to one of the 20 standard amino acids, and the rows of the PSSM are separated into 20 groups according to the amino acid at each position. The PSSM values of each group are then summed column by column to form a $20 \times 20$ matrix, which is the RPT matrix. The matrix is expressed as follows:

Therefore, the matrix can be transformed into a 400-dimensional feature vector.
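The RPT construction can be sketched as follows, assuming the PSSM is given as a list of rows (one per sequence position) with 20 scores each. The column order in `AMINO_ACIDS` and the function name `rpt_features` are assumptions for illustration; published variants of RPT may also divide the sums by the sequence length, which is omitted here.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # a common PSSM column order (assumed)


def rpt_features(sequence, pssm):
    """Sum the PSSM rows of each residue type and flatten to 400 values.

    `pssm` is a list of L rows, each holding 20 scores in AMINO_ACIDS order.
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        group = index[residue]  # the row group is chosen by the residue here
        for col, score in enumerate(row):
            matrix[group][col] += score
    # Row-major flattening of the 20 x 20 matrix.
    return [value for row in matrix for value in row]
```

For a toy input such as the sequence `"AAR"` with constant PSSM rows, the first 20 values collect the two rows at the A positions and the next 20 values collect the row at the R position.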

EDT: EDT is an effective method to calculate the non-occurrence probability of two amino acids (Jeong et al., 2010). The two amino acids are separated by an interval $d$ ($d = 1, 2, \cdots, L_{\min} - 1$), where $L_{\min}$ is the shortest sequence length in the dataset. The feature vector of EDT is denoted as:

Here $f(A_x, A_y)$ is the non-occurrence probability of two amino acids separated by interval $d$. It is calculated by formula (10):

where $A_{i,x}$ and $A_{i+d,y}$ are elements of the PSSM, $A_x$ and $A_y$ are any 2 of the 20 amino acids, and $D$ is the maximum value of $d$.

MCD: To account for the influence of continuous and discontinuous fragments in protein sequences, You et al. (2014) proposed the MCD feature extraction method. The method converts a protein sequence into digital information by a binary method. For example, a protein sequence 'AVDCALSK' is randomly selected and transformed into the digital model '11321476' via the MCD calculation. Then, the sequence is divided into 10 regions, and composition (C), transition (T) and distribution (D) descriptors are calculated to represent the protein characteristics of each region. Finally, a 630-dimensional feature vector is formed by all descriptors from the 10 regions.
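The MCD encoding step and the per-region descriptors can be sketched as below. The seven residue groups are chosen so that the sketch reproduces the example 'AVDCALSK' to '11321476' given in the text; the exact definition of the distribution descriptor and the function names `encode` and `ctd` are assumptions. The definition of the 10 regions is not given in this text, so the sketch only handles a single, non-empty region.

```python
from itertools import combinations

# Seven residue groups; this ordering reproduces 'AVDCALSK' -> '11321476'.
GROUPS = ["AGV", "C", "DE", "FILP", "HNQW", "KR", "MSTY"]
CODE = {aa: str(i + 1) for i, members in enumerate(GROUPS) for aa in members}
SYMBOLS = "1234567"


def encode(sequence):
    """Map each residue to the digit of its group."""
    return "".join(CODE[aa] for aa in sequence)


def ctd(region):
    """Return the 63 C/T/D values of one non-empty encoded region."""
    n = len(region)
    # Composition: fraction of each group (7 values).
    comp = [region.count(s) / n for s in SYMBOLS]
    # Transition: frequency of adjacent pairs from two different groups
    # (7 * 6 / 2 = 21 values).
    pairs = list(zip(region, region[1:]))
    trans = [
        sum(1 for p in pairs if set(p) == {a, b}) / max(len(pairs), 1)
        for a, b in combinations(SYMBOLS, 2)
    ]
    # Distribution: relative position of the first, 25%, 50%, 75% and last
    # occurrence of each group (7 * 5 = 35 values); zeros if a group is absent.
    dist = []
    for s in SYMBOLS:
        pos = [i + 1 for i, c in enumerate(region) if c == s]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if pos:
                idx = max(int(round(frac * len(pos))) - 1, 0)
                dist.append(pos[idx] / n)
            else:
                dist.append(0.0)
    return comp + trans + dist
```

Each region yields 7 + 21 + 35 = 63 descriptors, so applying `ctd` to 10 regions gives 10 × 63 = 630 values, consistent with the dimension stated above.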

Assume that the feature space contains $N$ samples, each an $M$-dimensional feature vector, to be reduced to $L$ dimensions. MLSI (Yu et al., 2005) defines the input matrix $X = [x_1, x_2, \cdots, x_i, \cdots, x_N] \in \mathbb{R}^{N \times M}$, where $x_i$ is an $M$-dimensional feature vector. The output matrix is $Y = [y_1, y_2, \cdots, y_i, \cdots, y_N] \in \mathbb{R}^{N \times L}$, where $y_i$ is an $L$-dimensional feature vector. The kernel function $k_x(\cdot, \cdot)$ represents the inner product as:

A similar kernel function $k_y(\cdot, \cdot)$ is expressed as equation (12), and the kernel matrix $K_y = YY^T$ is obtained.

The kernel calculation matrix C is as (13):

Then, for generalized eigenvalue problems,

where the coefficient $\alpha$ is required to satisfy $\alpha^T K_x^2 \alpha = 1$.

By formula (14), the generalized eigenvectors $\alpha_1, \cdots, \alpha_N$ are calculated and the first $K$ of them are used as mappings. The $i$th mapping function can be obtained by scaling the eigenvectors:

Then, let $\lambda = 1/\tilde{\lambda}$, and formula (15) is rewritten as:

Finally, the $K$-dimensional vector corresponding to the largest eigenvalues is selected.
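A rough sketch of an MLSI-style projection with linear kernels is given below. It is not the authors' implementation: it assumes the combined kernel takes the form $(1-\beta)K_x + \beta K_y$ and that the mappings solve a generalized symmetric eigenproblem, which is how we read Yu et al. (2005). The trade-off `beta`, the ridge term `reg` and the function name `mlsi_projection` are all assumptions, and NumPy is used for the linear algebra.

```python
import numpy as np


def mlsi_projection(X, Y, n_components, beta=0.5, reg=1e-6):
    """Return a projection W (M x n_components) informed by features and labels.

    X: (N, M) feature matrix; Y: (N, L) binary label matrix.
    beta trades off feature and label information (assumed parameter);
    reg is a small ridge term that keeps the problem well posed.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Kx = X @ X.T                       # linear kernel on the features
    Ky = Y @ Y.T                       # linear kernel on the labels
    C = (1.0 - beta) * Kx + beta * Ky  # combined kernel (assumed form)
    A = X.T @ C @ X                    # left-hand side of A w = lambda B w
    B = X.T @ X + reg * np.eye(X.shape[1])
    # Reduce the generalized problem to a standard symmetric one via Cholesky.
    Lc = np.linalg.cholesky(B)
    S = np.linalg.solve(Lc, np.linalg.solve(Lc, A).T).T  # Lc^-1 A Lc^-T
    S = (S + S.T) / 2.0                # remove round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Map back with w = Lc^-T v, so that W^T B W is the identity.
    return np.linalg.solve(Lc.T, eigvecs[:, order])
```

The reduced features are then `X @ W`. In the article, the number of retained dimensions is chosen by comparing prediction results (80 dimensions on the Gram-positive bacteria dataset).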

If a training sample is denoted as $(x_i, Y_i)$, $p$ is the number of training samples, and the enriched labeling information $U$ is given, the original training set can be transformed into $D = \{(x_i, u_i) \mid 1 \le i \le p\}$. The response variables $u_i$ can be fitted by a multi-output regression technique. The MLFE algorithm (Zhang et al., 2018b) obtains the regression model by minimizing the objective function:

where $\Theta = [\theta_1, \theta_2, \cdots, \theta_q]$ and $b = [b_1, b_2, \cdots, b_q]^T$ represent the weight matrix and the bias vector of the regression model, respectively, and $q$ is the number of class labels.

To optimize the objective function, the iteratively reweighted least squares (IRWLS) method (Sánchez-Fernández et al., 2004; Tsoumakas et al., 2011) is used. In the iterative process, the descent direction of the model optimization is determined by solving a linear system of equations. Let $\{\Theta^{(k)}, b^{(k)}\}$ represent the current model after the $k$th iteration; equation (18) is then obtained from the first-order Taylor expansion.

where $e$ can be calculated under the current model $\{\Theta^{(k)}, b^{(k)}\}$. To identify the analytical solution of the descent direction, it is necessary to construct the quadratic approximation of $d\Omega_1(u)/u$:

where ! is a constant term.

The cross-validation method can avoid over-fitting to some extent. Commonly used validation methods include K-fold cross-validation (Jia et al., 2018), leave-one-out cross validation (LOOCV) (Yu et al., 2020b), the self-consistency method (Bringi et al., 2001) and the independent sample test (Heeren and D'Agostino, 1987). Compared with other cross-validation methods, LOOCV is deterministic and has high sample utilization (Cheng et al., 2017). Therefore, the LOOCV test is used in this article to evaluate the effectiveness of the model, with the OAA, OLA, Hamming loss (HL), coverage (CV), ranking loss (RL) and average precision (AP) as indicators. The six evaluation indicators are defined as follows.

where $W$ is the number of training samples, and $Y_i(U_i)$ and $Y_0(U_i)$ represent the predicted labels and the real labels, respectively.

where G is the number of labels.
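The three set-based indicators (OAA, OLA and HL) can be illustrated with a short sketch. The definitions below follow the forms commonly used in the multi-label SCL literature and are an assumption here; the function names are hypothetical, and each protein's labels are represented as a Python set of location indices.

```python
def overall_actual_accuracy(predicted, actual):
    """Fraction of proteins whose predicted label set matches the true set exactly."""
    hits = sum(1 for p, a in zip(predicted, actual) if set(p) == set(a))
    return hits / len(actual)


def overall_location_accuracy(predicted, actual):
    """Correctly predicted locations divided by the total number of true locations."""
    correct = sum(len(set(p) & set(a)) for p, a in zip(predicted, actual))
    total = sum(len(set(a)) for a in actual)
    return correct / total


def hamming_loss(predicted, actual, n_labels):
    """Average fraction of labels that are wrongly assigned per protein."""
    mismatches = sum(len(set(p) ^ set(a)) for p, a in zip(predicted, actual))
    return mismatches / (len(actual) * n_labels)


# Toy example with three proteins and four possible locations.
actual = [{0}, {1, 2}, {3}]
predicted = [{0}, {1}, {2, 3}]
oaa = overall_actual_accuracy(predicted, actual)
ola = overall_location_accuracy(predicted, actual)
hl = hamming_loss(predicted, actual, n_labels=4)
```

In the toy example only the first protein is predicted exactly, so the OAA is stricter than the OLA, which also credits the partially correct second and third predictions.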

where $\mathrm{rank}(f(X_i, y)) - 1$ is obtained by ranking all labels in descending order and taking the corresponding rank.

where $\mathrm{RAL}(X_i) = \{(y_j, y_k) \mid f(X_i, y_j) \le f(X_i, y_k),\ (y_j, y_k) \in Y_i \times \bar{Y}_i\}$ and $y_i \in Y$. $f(X_i, y_j)$ is the output for label $y_j$ of $X_i$, and $\bar{Y}_i$ is the complement of $Y_i$.
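The two ranking-based indicators described so far (CV and RL) can be sketched in the same spirit. The code assumes the standard multi-label definitions, with one list of real-valued label scores per protein; the function names are hypothetical.

```python
def _ranks(scores):
    """Rank labels by descending score; the top-scoring label gets rank 1."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    return {label: r + 1 for r, label in enumerate(order)}


def coverage(score_rows, true_sets):
    """Average number of steps down the ranked list needed to cover all true labels."""
    total = 0
    for scores, truth in zip(score_rows, true_sets):
        ranks = _ranks(scores)
        total += max(ranks[y] for y in truth) - 1
    return total / len(true_sets)


def ranking_loss(score_rows, true_sets):
    """Average fraction of (relevant, irrelevant) label pairs ordered wrongly."""
    total = 0.0
    for scores, truth in zip(score_rows, true_sets):
        others = [k for k in range(len(scores)) if k not in truth]
        if not truth or not others:
            continue  # no pairs to compare for this sample
        bad = sum(1 for j in truth for k in others if scores[k] >= scores[j])
        total += bad / (len(truth) * len(others))
    return total / len(true_sets)


# Toy example: two proteins, three candidate locations.
scores = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]]
truth = [{0, 2}, {0}]
cv = coverage(scores, truth)
rl = ranking_loss(scores, truth)
```

The first protein has both true locations at the top of the ranking, while the second has its only true location ranked last, which is what drives both indicators up in this example.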

where $\mathrm{AVP}(T_i) = \ldots$

This study proposes a new method, ML-locMLFE, for predicting the SCL of ML proteins, and the detailed process is displayed in Figure 1.

PseAAC and EBGW have different characteristic information by setting different parameters. Since the minimum length of all protein sequences of the Gram-positive bacteria dataset is 55, the parameter of PseAAC is set from 5 to 54 and the parameter of EBGW is set from 5 to 55. Through the LOOCV test, the characteristic information obtained from each parameter is put into the classifier MLFE, and the specific evaluation index values of the different parameter results are listed in the Supplementary Tables S6 and S7. The optimal OAA obtained from PseAAC and EBGW are 62.44% and 

This article uses a total of six feature extraction methods. Among them, the PseAAC method not only considers the sequence information of the protein but also includes the position information of the amino acids in the sequence. The EBGW method is based on the physical and chemical properties of amino acids to effectively extract the physical and chemical information of proteins. The MCD method uses multiple regions as features to extract the physical and chemical information of protein sequences. The GO method extracts the annotation information of the protein, which can essentially analyze the properties of genes and gene products. Because the EDT method considers the evolutionary information of the protein, it can reflect the probability of two different amino acids. The RPT method obtains the evolutionary information of the protein by grouping the evolution scores in PSSM. Therefore, the six feature extraction methods obtain effective information from the different characteristics of the protein, which greatly improves the prediction performance of the model. Through the LOOCV test, six single feature vectors are put into MLFE, and GO has the largest contribution rate among all single features, and its OAA and OLA reach 91.91% and 93.31%, respectively. However, single feature information cannot represent all important information. Therefore, the six feature extraction results need to be fused. We extract 912-dimensional feature vectors from GO, 45-dimensional feature vectors from PseAAC, 120-dimensional feature vectors from EBGW, 400-dimensional feature vectors from RPT and EDT, respectively, and 630-dimensional feature vectors from MCD. After the final fusion, 2507-dimensional feature vectors are obtained. Through the LOOCV test, the comparison results of single and fusion features are given in Figure 3 .
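The fusion step amounts to concatenating the six per-method vectors. A minimal sketch that also checks the dimension bookkeeping is shown below; the fusion order and the function name `fuse` are assumptions, while the dimensions are those stated above.

```python
FEATURE_DIMS = {"GO": 912, "PseAAC": 45, "EBGW": 120,
                "RPT": 400, "EDT": 400, "MCD": 630}


def fuse(feature_blocks):
    """Concatenate per-method feature vectors in a fixed order, checking lengths."""
    fused = []
    for name, dim in FEATURE_DIMS.items():
        block = feature_blocks[name]
        if len(block) != dim:
            raise ValueError(f"{name}: expected {dim} values, got {len(block)}")
        fused.extend(block)
    return fused


# Placeholder vectors of the right sizes, standing in for real features.
blocks = {name: [0.0] * dim for name, dim in FEATURE_DIMS.items()}
fused = fuse(blocks)
```

The fused vector has 912 + 45 + 120 + 400 + 400 + 630 = 2507 entries, matching the dimension reported in the text.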

Feature selection can reduce the spatial dimension and decrease model training time. Therefore, this article uses principal component analysis (PCA) (Abdi and Williams, 2010), GRRO (Zhang et al., 2020a), MDFS (Zhang et al., 2019), MDDM (Zhang and Zhou, 2010), MVMD (Xu et al., 2016) and MLSI (Yu et al., 2005) to eliminate irrelevant features. Through the LOOCV test, the feature subset obtained by each method is put into MLFE. The OAA of MLSI reaches 99.23% and the OLA reaches 99.81%, which are both optimal. The algorithm not only retains the original input features, but also captures the correlation of output dimensions, which greatly improves the performance of model prediction. On the Gram-positive bacteria dataset, the prediction results obtained by MLSI with different dimensions are shown in Supplementary Table S8, and the comparison results of different methods can be found in Figure 4.

To verify the effectiveness of MLFE, we take five classifiers for comparison: ML k-nearest neighbor (ML-KNN) (Gonzalez-Lopez et al., 2018), ML radial basis function (ML-RBF) (Zhang, 2009), ML learning with label-specific features (LIFT) (Zhang and Wu, 2015), ranking SVM (Rank-SVM) (Tayal et al., 2018) and ML learning by instance differentiation (INSDIF) (Zhang et al., 2007). The optimal feature subset obtained by MLSI is put into the six classifiers. Through the LOOCV test, the OAA and OLA obtained from the MLFE classifier are 99.23% and 99.81%, respectively. The algorithm uses the sparse reconstruction information between the training samples as features, and the reconstruction information is passed into the label space to enrich the original labels into numerical labels, thereby enhancing the effectiveness of the label information. The comparison results of different methods are shown in Figure 5, and the corresponding receiver operating characteristic (ROC) and precision-recall (PR) curves are shown in Figure 6. The specific parameter values obtained through different algorithms are shown in Supplementary Table S9.

Fig. 3. Comparison of results based on seven different methods for Gram-positive bacteria. ALL: PseAAC+EBGW+EDT+RPT+GO+MCD. Among the six single feature extraction methods, the GO method has the greatest contribution rate to the model. For the fusion features, the OAA and OLA are lower than those of GO due to the increase of redundant information in the fusion feature space. Compared with the other five single features, however, the OAA of the fusion features is 26.40-39.89% higher and the OLA is 25.81-40.15% higher. Therefore, the fusion features can represent the overall characteristics of the protein and improve the accuracy of the model prediction.

Fig. 4. Comparison results based on different dimension reduction methods. When the 80-dimensional feature subset is obtained by MLSI, the OAA and OLA both reach their highest values. This method uses the linear correlation of input information and output information to select the feature subset, which greatly improves the ability of prediction. MLSI increases the OAA by 7.52-37.77% and the OLA by 6.69-35.95% compared with the other methods. Therefore, MLSI is chosen as the feature selection method.

When the MLFE classifier is used to predict the multi-label protein SCL, the OAA and OLA are both the highest and above 99%. The algorithm uses sparse reconstruction of the training samples to represent the bottom layer of the feature space. The OAA of MLFE is 5.01-11.56% higher than that of the other five classifiers, and the OLA is 4.78-10.14% higher. In summary, MLFE can effectively link feature information with label information, which improves the prediction performance of the model.

With the continuous development of ML protein SCL research, many researchers use machine learning methods for prediction. To demonstrate the superiority of ML-locMLFE, we compare its results on four datasets with those of other methods. On the Gram-positive dataset, the results of this article are compared with those of iLoc-Gpos (Wu et al., 2012), Gpos-ECC-mPLoc (Wang et al., 2015) and Gram-LocEN (Wan et al., 2017). The results of the different methods are listed in Figure 7. On the Gram-negative dataset, the results of this article are compared with those of iLoc-Gneg (Chou and Shen, 2006), Gneg-ECC-mPLoc (Wang et al., 2015) and Gram-LocEN (Wan et al., 2017). On the virus dataset, the results of this article are compared with those of mGOASVM (Wan et al., 2012), AD-SVM (Wan and Mak, 2018) and mPLR-Loc (Wan et al., 2015). On the newPlant dataset, the results of this article are compared with those of Plant-mPLoc, mPLR-Loc (Wan et al., 2015) and HybridGO-Loc (Wan et al., 2014). The comparison on the newPlant dataset is shown in Supplementary Table S10, and the results of the different methods on the other datasets are listed in Supplementary Figures S1 and S2.

Since SARS-CoV-2 has had a great impact, it is important to locate the subcellular location of SARS-CoV-2 proteins accurately and quickly. Many researchers have looked for ways to treat COVID-19 by analyzing the pathogenesis of SARS-CoV-2. Hoffmann et al. (2020) found that SARS-CoV-2 entry into cells depends on transmembrane protease serine 2 (TMPRSS2), and that protease inhibitors of TMPRSS2 can block this entry. Xu et al. (2020) predicted, by studying the protein properties and 3D structure of TMPRSS2, that TMPRSS2 can bind to some monomeric compounds independently. These studies indicate that TMPRSS2 is a serine protease anchored on the cell membrane at the amino-terminal transmembrane region, and that its inhibitors can be used for the treatment of COVID-19. The SARS-CoV-2 protein information is obtained by PseAAC, EBGW, GO, RPT, EDT and MCD to form the original feature space. As the dimensionality increases, the interference of redundant information with the result gradually becomes significant. Thus, we use MLSI to obtain the optimal feature subset. Using the MLFE algorithm, the OAA is 72.73% and the OLA is 69.23%. The specific result comparison is shown in Table 2. Table 2 shows that the SARS-CoV-2 proteins mainly exist in the plasma membrane, nucleus, Golgi apparatus and other subcellular locations. We obtained 26 proteins from the UniProt database and found that the ninth protein is TMPRSS2, which is located on the plasma membrane and is accurately predicted by ML-locMLFE. Because the SARS-CoV-2 dataset is too small to optimize the model, the stability of the model is relatively low. Therefore, the prediction results for cytoskeleton, endoplasmic reticulum, endosome, Golgi apparatus and lysosome are not ideal, although the OAA and OLA of the ML-locMLFE method still reach 72.73% and 69.23%.
The model presented in this article can not only predict the SCL of important protein in the SARS-CoV-2 quickly and accurately, but also provide a theoretical basis for the treatment of SARS-CoV-2 pneumonia and drug research.

It is significant to understand the structure and function of protein by using machine learning methods to predict the SCL of ML protein.

First, PseAAC, EBGW, GO, RPT, EDT and MCD are used to extract important information about various properties of proteins. Among them, the GO method has the highest prediction accuracy and the largest contribution rate compared with the other five methods. The annotation information of genes and gene products extracted using the GO method can provide important evidence for the study of protein functions. Second, this is the first time that MLSI is used for feature selection in the prediction of ML protein SCL. This method maps input features to a new feature space, which not only preserves the input information, but also captures the correlation between multiple outputs, so that it can select the best feature subset more effectively. Finally, we feed the optimal feature subset into MLFE. For the first time, the MLFE algorithm is used to enrich the original labels of the training samples into numerical labels to enhance the effectiveness of ML information, which further improves the performance of the model. Through the LOOCV test, the OAA of the Gram-positive bacteria dataset, the Gram-negative bacteria dataset, the virus dataset, the SARS-CoV-2 dataset and the newPlant dataset are 99.23%, 93.82%, 93.24%, 72.73% and 96.72%, and the OLA are 99.81%, 96.50%, 99.21%, 69.23% and 96.25%, respectively. Therefore, the ML-locMLFE proposed in this article can predict the ML protein SCL more accurately. In addition, the ML-locMLFE model can be extended to other research fields such as ML protein post-translational modification, ML mRNA SCL and identification of drug-target interactions. More importantly, the model can accurately predict the SCL of the SARS-CoV-2 protein, and thus help clarify the pathogenic mechanism of the virus. We hope that our method can provide some insights and help in the clinical treatment of various diseases, including COVID-19. In the next step, we will construct larger and more diverse datasets to study the SCL of ML proteins.

The PR curves of the Gram-positive bacteria dataset corresponding to the six classifiers. The ROC and PR curves are usually used to evaluate the quality of the model. The closer the ROC curve is to the upper left corner, the higher the accuracy of the model. Similarly, the closer the PR curve is to the upper right corner, the higher the accuracy of the model. The area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPR) of MLFE are both optimal. The AUC of MLFE is 99.75%, which is 3.09-7.15% higher than the AUC of the other five classifiers, and the AUPR is 98.13%, which is 1.29-10.30% higher than that of the other five classifiers.

Fig. 7. On the Gram-positive bacteria dataset, ML-locMLFE is compared with other methods by the LOOCV test. The OAA and the OLA obtained by MLFE are 99.23% and 99.81%, which are 2.93-6.33% and 3.01-6.71% higher than those of the other methods, respectively. In addition, the prediction results of this method for the four types of subcellular locations are 99.42%, 100.00%, 100.00% and 100.00%, which are 1.72-3.42%, 5.60-33.33%, 2.84-4.80% and 4.88-10.57% higher than those of the other methods, respectively. Therefore, ML-locMLFE is superior to other methods using the same dataset.

Principal component analysis

Understanding the recognition of protein structural classes by amino acid composition

Correcting C-band radar reflectivity and differential reflectivity data for rain attenuation: a self-consistent method with constraints

Advanced protein glycosylation in diabetes and aging

iATC-mISF: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals

Prediction of protein cellular attributes using pseudo amino acid composition

Large-scale predictions of gram-negative bacterial protein subcellular locations

Plant-mPLoc: a top-down strategy to augment the power for predicting plant protein subcellular localization

DTI-MLCD: predicting drug-target interactions using multi-label learning with community detection method

Defining the physiological role of SRP in protein-targeting efficiency and specificity

Gram-positive and Gram-negative protein subcellular localization by incorporating evolutionary based descriptors into Chou's general PseAAC

Using evolutionary information and multi-label linear discriminant analysis to predict the subcellular location of multi-site bacterial proteins via Chou's 5-steps rule

Distributed nearest neighbor classification for large-scale multi-label data on spark

Robustness of the two independent samples t-test when applied to ordinal scaled data

SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor

ProLoc-GO: utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular localization

Mesenchymal transition in kidney collecting duct epithelial cells

A segment of the 5' nontranslated region of encephalomyocarditis virus RNA directs internal entry of ribosomes during in vitro translation

On position-specific scoring matrix for protein function prediction

O-GlcNAcPRED-II: an integrated classification algorithm for identifying O-GlcNAcylation sites based on fuzzy undersampling and a K-means PCA oversampling technique

Deep neural network to extract high-level features and labels in multi-label classification problems

Plant-mSubP: a computational framework for the prediction of single- and multi-target protein subcellular localization using integrated machine-learning approaches

SVM multiregression for nonlinear channel estimation in multiple-input multiple-output systems

Virus-mPLoc: a fusion classifier for viral protein subcellular location prediction by incorporating multiple sites

Critical evaluation of web-based prediction tools for human protein subcellular localization

Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou's general PseAAC

Predicting drug-target interactions using Lasso with random forest based on evolutionary information and chemical structure

Amelioration of denervation-induced atrophy by clenbuterol is associated with increased PKC-α activity

Bounding the difference between RankRC and RankSVM and application to multi-level rare class kernel ranking

Random k-labelsets for multi-label classification

Predicting subcellular localization of multi-location proteins by improving support vector machines with an adaptive-decision scheme

mGOASVM: multi-label protein subcellular localization based on gene ontology and support vector machines

HybridGO-Loc: mining hybrid features on gene ontology for predicting subcellular localization of multi-location proteins

mPLR-Loc: an adaptive decision multi-label classifier based on penalized logistic regression for protein subcellular localization prediction

Gram-LocEN: interpretable prediction of subcellular multi-localization of Gram-positive and Gram-negative bacterial proteins

Active k-labelsets ensemble for multilabel classification

Multi-location gram-positive and gram-negative bacterial protein subcellular localization using gene ontology and multi-label classifier ensemble

Protein-protein interaction sites prediction by ensemble random forests with synthetic minority oversampling technique

iLoc-Gpos: a multi-layer classifier for predicting the subcellular localization of singleplex and multiplex gram-positive bacterial proteins

A multi-label classifier for predicting the subcellular localization of gram-negative bacterial proteins with both single and multiple sites

A multi-label feature extraction algorithm via maximizing feature variance and feature-label dependence simultaneously

Potential monomer compounds for treatment of corona virus disease 2019 (COVID-19) by transmembrane serine proteinase 2 (TMPRSS2)

Prediction of protein-protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set

Prediction of subcellular location of apoptosis proteins by incorporating PsePSSM and DCCA coefficient based on LFDA dimensionality reduction

Prediction of protein-protein interactions based on L1-regularized logistic regression and gradient tree boosting

SubMito-XGBoost: predicting protein submitochondrial localization by fusing multiple feature information and eXtreme gradient boosting

Prediction of protein-protein interactions based on elastic net and deep forest

Multi-label informed latent semantic indexing

MetaGO: predicting Gene Ontology of non-homologous proteins through low-resolution protein structure prediction and protein-protein network mapping

Manifold regularized discriminative feature selection for multi-label learning

Multi-label feature selection via global relevance and redundancy optimization

A systemic and molecular study of subcellular localization of SARS-CoV-2 proteins

ML-RBF: RBF neural networks for multi-label learning

LIFT: multi-label learning with label-specific features

Multi-label learning by instance differentiation

DMLDA-LocLIFT: identification of multi-label protein subcellular localization using DMLDA dimensionality reduction and LIFT classifier

MpsLDA-ProSVM: predicting multi-label protein subcellular localization by wMLDAe dimensionality reduction and ProSVM classifier

Accurate prediction of multi-label protein subcellular localization through multi-view feature learning with RBRL classifier

StackPDB: predicting DNA-binding proteins based on XGB-RFE feature optimization and stacking ensemble classifier

Feature-induced labeling information enrichment for multi-label learning

Multilabel dimensionality reduction via dependency maximization

A novel method for apoptosis protein subcellular localization prediction combining encoding based on grouped weight and support vector machine

The authors thank anonymous reviewers for valuable suggestions and comments.

This work was supported by the National Natural Science Foundation of China [62172248].

Conflict of Interest: none declared.