key: cord-0959991-zalfsi2o
authors: nan
title: Application of Tensor Decomposition to Gene Expression of Infection of Mouse Hepatitis Virus Can Identify Critical Human Genes and Efffective Drugs for SARS-CoV-2 Infection
date: 2021-02-23
journal: IEEE J Sel Top Signal Process
DOI: 10.1109/jstsp.2021.3061251
sha: 78372ad71c4e29cb63c3144c3f546243728ba9a3
doc_id: 959991
cord_uid: zalfsi2o

To better understand the genes with altered expression caused by infection with the novel coronavirus strain SARS-CoV-2 causing COVID-19 infectious disease, a tensor decomposition (TD)-based unsupervised feature extraction (FE) approach was applied to a gene expression profile dataset of the mouse liver and spleen with experimental infection of mouse hepatitis virus, which is regarded as a suitable model of human coronavirus infection. TD-based unsupervised FE selected 134 altered genes, which were enriched in protein-protein interactions with orf1ab, polyprotein, and 3C-like protease that are well known to play critical roles in coronavirus infection, suggesting that these 134 genes can represent the coronavirus infectious process. We then selected compounds targeting the expression of the 134 selected genes based on a public domain database. The identified drug compounds were mainly related to known antiviral drugs, several of which were also included in those previously screened with an in silico method to identify candidate drugs for treating COVID-19.

for SARS-CoV-2 Infection [5] . This strategy can be also applicable to COVID-19. Recently, Pfaender et al. [6] demonstrated that host lymphocyte antigen 6 (LY6E) complex impairs coronavirus fusion and confers immune control of viral disease. The authors also used a transcriptome approach to evaluate the effect of infection of mouse hepatitis virus (MHV), a natural mouse pathogen that causes hepatitis and encephalomyelitis, which is a well-studied model of coronavirus infection. Although they found many pathways that were disturbed after MHV infection, they did not perform a detailed analysis of the genes with altered expression in response to MHV infection. In this study, we applied tensor decomposition (TD)-based unsupervised feature extraction (FE) [7] to identify genes with altered expression by MHV infection as a model of coronavirus. We further performed functional enrichment on the selected genes to determine their potential associations with coronavirus infection processes, and screened candidate drug compounds targeting these genes. Overall, this work expands TD formalism by exploring the interpretation of six-dimensional tensors in an infectious disease context. Moreover, we demonstrate a novel application of TD to facilitate the drug discovery process, which can offer a valuable resource for researchers to obtain mechanistic insight for identifying effective drugs for infectious diseases such as COVID- 19 .

The reason why TD was primary employed to be applied to the present data set that we investigated is because the data set was formatted as a six-mode tensor. Since TD is the most famous method to be used to attack the data set formatted as tensor, it is natural to apply TD to it. Although any other methods might be applicable, they will be tried only when TD fails (and as can be seen in the below, it is not the case in this study; TD can work quite well). In addition to this, our proposed approach, TD based unsupervised FE [7] , was known to be applicable to wide range of genomics study. Thus, the purpose of this study is to propose a different tensor formulation dealing with such different tensor data representation and to estimate how well the existing method can work to fulfill a new requirement; Using the experiments of mice infected by virus that is related to SARS-CoV-2, but is not SARS-CoV-S itself, identifying critical human genes that play important roles when SARS-CoV-2 infects human lung.

The key point is if we can outperform the previous study where SARS-CoV-2 infected human lung cell line [8] . If we can derive the better results than those using MHV infected mouse data This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0 /  TABLE I  NUMBER OF BIOLOGICAL REPLICATES. TWO TECHNICAL REPLICATES ARE  AVAILABLE FOR EACH BIOLOGICAL REPLICATE   TABLE II  NUMBER OF BIOLOGICAL REPLICATES OF SARS-COV INFECTION TOWARD  MOUSE LUNG GENE EXPRESSION PROFILES set, it is considered a remarkable achievement (and we could do this as can be seen in the below). It is worth noting that our TD employs "samples + genotypes + tissues + treatments + biological replicates + technical replicates" structure that requires a six-axes tensor representation to identify critical human genes and effective drugs for SARS-CoV-2 infection. Reported results via enrichment analysis show the superiority of our TD, attributed to taking into account the relationships among samples, genotypes, tissues, treatments, biological replicates, and technical replicates at once"

The gene expression profile was downloaded from the Gene Expression Omnibus (GEO) dataset GSE146074. This dataset comprises the gene expression profiles of the liver and spleen from female mice experimentally infected with MHV or injected with phosphate-buffered saline (PBS) as a control group for comparison. This experiment was performed with mice of two genetic backgrounds, including wild-type (WT) mice and an textit Ly6e-knockout (KO) mutant strain. The number of replicates for each group are listed in Table I . Seventy-two files whose names start with "GSM" (processed file) were used for the analyses.

In order to validate the suitability of MHV as model SARS-CoV-2 infectious process, two additional gene expression profiles of mouse lung SARS-CoV infectious processes were used (Table II) . For GSE33266 and GSE50000, we downloaded two files: GSE33266_series_matrix.txt.gz and GSE50000_series_matrix.txt.gz. Although GSE50000 also includes files for SARS-CoV-MA15, they were omitted for reasons detailed in the Discussion section. For cases with less than five biological replicates, we used some replicates more than once, in order to have five biological replicates for individual cases.

Although the TD-based unsupervised FE was fully described in the recent book [7] , we briefly outline the analysis flow. At first, TD is applied to tensor and singular value vectors are obtained. Individual singular value vectors are attributed to either various experiments or genes. By investigating singular value vectors attributed to experiments, we identify which ones are associated with properties of interest. Then among singular value vectors attributed to genes, those coincident with identified singular value vectors attributed to experiments are selected. Finally, genes having larger contributions toward selected singular value vectors attributed to genes are selected as those associated with the properties of interest. Fig. 1 shows the flowchart of TD-based unsupervised FE. The gene expression profile dataset was formatted as a tensor, x ijkmnp ∈ R N ×2×2×3×3×2 , which represents the expression level of the ith gene of the jth genotype (j = 1:KO, j = 2:WT) of the kth tissue (k = 1:liver, k = 2:spleen) of the mth treatment group (m = 1:PBS day 5, m = 2:MHV day 3, m = 3:MHV day 5) for the nth biological replicate (1 ≤ n ≤ 3) and pth technical replicate (1 ≤ p ≤ 2).

The TD is therefore expressed as 

where G( 1 2 3 4 5 6 ) ∈ R 2×2×3×3×2×N is a core tensor, and u 1 j ∈ R 2×2 , u 2 k ∈ R 2×2 , u 3 m ∈ R 3×3 , u 4 n ∈ R 3×3 , u 5 p ∈ R 2×2 , and u 6 i ∈ R N ×N are singular values vectors, which can be obtained via the higher-order singular value decomposition (HOSVD) algorithm [7] .

To select u 6 i , attributed to selected genes, we need to select u 1 j attributed to the genotype, u 2 k attributed to the tissue, u 3 m attributed to the treatment, u 4 n attributed to the biological replicate, and u 5 p attributed to the technical replicate, associated with desired properties.

For this study, we sought to identify genes whose expression is independent of the mouse genotype, tissue type, and replicate. Thus, u 1 j , u 2 k , u 4 n , and u 5 p should be independent of j, k, n, and p.

By contrast, we require u 3 m to be dependent on u 3 1 < u 3 2 < u 3 3 or vice versa. This is because m = 2 (3 days after MHV infection) must be between m = 1 (5 days after PBS injection as the control) and m = 3 (5 days after MHV infection).

After selecting 1 , 2 , 3 , 4 , and 5 based on the above considerations, we selected 6 associated with G( 1 2 3 4 5 6 ) as the largest absolute value, with fixed 1 , 2 , 3 , 4 , and 5 values. Using the selected u 6 i , The P -values, P i s, were attributed to gene expression levels as

where P χ 2 [> x] is the cumulative probability distribution of the χ 2 distribution when the argument is larger than x and σ 6 is the standard deviation of u 6 i . P i s were adjusted by the Benjamini and Hochberg (BH) criterion [7] , and only genes associated with an adjusted P i less than 0.01 were selected for further analysis.

We employed almost the same procedures, apart from different tensor fromats for the two additional gene expression profiles of mouse lung SARS-CoV infectious processes. The tensors formatted and TDs are for GSE33266

which represents the ith gene expression profile of jth experiments (j = 1:Mock, j = 2:10 2 pfu, j = 3:10 3 pfu, j = 4:10 4 pfu, j = 5:10 5 pfu) at the kth day after infection (k = 1:D1, day 1, k = 2:D2, day 2, k = 3:D4, day 4, k = 4:D7, day 7) of the nth biological replicate (1 ≤ n ≤ 5). G( 1 2 3 4 ) ∈ R 5×4×5×N is a core tensor that represents the weight of products of the singular value matrices u 1 j ∈ R 5×5 , u 2 k ∈ R 4×4 , u 3 n ∈ R 5×5 , and u 4 i ∈ R N ×N , which are all orthogonal matrices. Those for GSE50000 are

which represents the ith gene expression profile of the jth experiments (j = 1:BatSRBD, j = 2:icSARS, j = 3:Mock) at the kth day after infection (k = 1:d1, day 1, k = 2:d2, day 2, k = 3:d4, day 4, k = 4:d7, day 7) of the nth biological replicate (1 ≤ n ≤ 5). G( 1 2 3 4 ) ∈ R 3×4×5×N is a core tensor that represents the weight of products of singular value matrices, u 1 j ∈ R 3×3 , u 2 k ∈ R 4×4 , u 3 n ∈ R 5×5 , and u 4 i ∈ R N ×N which are all orthogonal matrices. The criteria for the selection of singular value vectors were as follows. For GSE33266, u 1 j should be a monotonic function of j, since it represents the strength of infection, u 2 k should also be a monotonic function of k, since it represents time development, and u 3 n should be constant, since biological replicates should not differ from one another. For GSE50000, u 1 j should be distinct between k = 3 and k = 1, 2, since it represents the distinction between mock and real infection, u 2 k also should be a monotonic function of k, since it represents time development, and u 3 n should be constant, since biological replicates should not differ from one another.

Downstream procedures for gene selection after identifying singular value vectors are the same as those for MHV. The core tensors, G( 1 2 3 4 )s, were investigated in order to see which 4 associated with G( 1 2 3 4 ) having larger absolute values given 1 , 2 , 3 . u 4 i s selected are used for attributing P -values, P i , to genes. Genes associated with adjusted P -values less than 0.01 were selected.

Gene symbols of genes selected by TD-based unsupervised FE with significantly altered expression due to MHV infection were uploaded to Enricher [9] and Metascape [10] , which are popular enrichment analysis servers that evaluate the biological properties of genes based on enrichment analysis. There are some explanations about individual categories of Enrichr.

1) Virus-Host PPI P-HIPSTer2020: This is a list of human proteins known to interact with various virus proteins. By comparing uploaded genes with genes in the list, we can estimate the ratio of genes interacting with virus proteins.

2) Virus Perturbations From GEO up/down: From GEO, gene expression profiles of virus infections are retrieved. Then genes associated with altered expression are identified. By comparing uploaded genes with genes in the list, we can estimate the ratio of genes whose expression is altered by virus infection.

3) Drugmatrix: In DrugMatrix data base, gene expression profiles of rat tissues treated with various drugs are recorded. By comparing uploaded genes with genes in the list, we can estimate the ratio of genes whose expression is altered by drug treatment.

From GEO, gene expression profiles of drug treatments are retrieved. Then genes associated with altered expression are identified. By comparing uploaded genes with genes in the list, we can estimate the ratio of genes whose expression is altered by drug treatment.

In order to demonstrate how effective the tensor is, we employed synthetic data sets composed of multiway and multiclass labels, among which partial collections having distinct values between js as well as ks must be selected, where S replicates are available for individual combinations of j and k. A typical example is that x ijks represents the expression of the ith gene in the kth tissue of patients who belong to the jth group, that consists of S patients. We then need to identify which genes are expressed distinctly in tissue (k) -specific as well as patient-group (j) specific ways.

The most popular approach to this problem is linear regression,

where a j and b k are pre-defined variables that represent some properties of the jth patient group and kth tissue, respectively. α i , β i , and γ i are regression coefficients that are selected such that the discrepancy between both sides of equation is minimized. Then the is associated with significant P -values are selected as those whose expression levels are different between distinct js and ks. It is not guaranteed that liner regression will correctly represent the dependency of x ijks upon j and k. We generated synthetic data that follows

where ξ ijk are drawn from N (1, 0) for every combination of i, j, k, so that x ijks s associated with distinct ss share the same ξ ijk , and ε ijks are drawn from N (1, 0) for every combination of i, j, k, s. N (μ, σ) represents a normal distribution that has mean of μ and standard deviation of σ. For i > N 0 , ξ ijk is taken to be zero; this means that x ijks for i > N 0 is simply random variables. The task is to correctly select N 0 variables associated with the dependence upon j and k.

As an alternative regression analysis to linear regression, we employed

which reflects the multiplicative nature when x ijk is generated with eq. (8) although it cannot represent x ijk completely since ξ ijk is replaced with α i and therefore dependence upon j or k is not assumed. In addition to the above two regression analyses, we applied categorical regression, which is equivalent to analysis of variance (ANOVA):

where α ijk and γ i are regression coefficients. Eq. (10) can fully reproduce the x ijk generated by eq. (8), when α ijk is taken to be ξ ijk a j b k excluding randomness introduced by ε ijks , which can be regarded as residuals. Finally, we also applied TD based unsupervised FE to x ijks . TD is computed as

Then 1 and 2 , that have the largest absolute values of the correlation coefficient between u 1 j and a j or u 2 k and b k were selected. The top two 1 that have the largest absolute G( 1 2 1 4 )s were selected. The reason why 3 is fixed to be 1 is because u 1 s always represent the u 3 s that lacks s dependency, i.e., is a constant independent of s. Since x ijks should take the same values between replicates, u 3 s that do not have any s dependence are selected. Then P -values are attributed to i as

where summation is taken over two selected 4 only.

No matter which form of linear regression, eq.(7), eq. (9), categorical regression, eq.(10), or TD, eq.(11), was used to compute P i , P i was corrected using the BH criterion [7] and is having corrected P i less than threshold P -values were selected as those associated with dependency upon j and k, respectively.

For simplicity, we employed a j = j and b k = k when generating x ijks using eq. (8) as well as eq. (9) for regression analysis. Specifically, we chose N = 1000, N 0 = 100, M = K = 3, S = 5. Generation of x ijks and selection of is were repeated 100 times with two distinct threshold P -values, 0.01 or 0.1, for each trial. Table III shows the performance averaged over 100 trials. Although categorical regression, eq. (10), could outperform the other three approaches, since it is able to completely reproduce eq. (8) as shown above, it has one weak point: it cannot explicitly consider the dependence of a j and b k upon j and k, since ξ ijk a j b k s are estimated as one parameter, α ijk . This limitation might prevent us from selecting is that are specifically associated with the dependence upon j and k that a j and b k represent. Even if some is are selected, it might be because of dependence upon j and k that a j and b k do not represent. There is no way for us to check this point. Taking this problem into account, TD based unsupervised FE, which is the second best approach, is more useful than categorical regression, since TD based unsupervised FE can consider a j and b k when selecting u 1 j and u 2 k correlated with a j and b k .

Another advantage of TD based unsupervised FE is the short CPU time required for execution (Table III) . The CPU time required for TD based unsupervised FE is approximately one tenth that of other three methods, which must repeat the regression analysis N times. This difference might be important when dealing with massive data sets.

TD based unsupervised FE can consider the dependence upon j and k of a j and b k , although other methods cannot because of the random nature of ξ ijk in eq. (8) . When an individual i is considered, there are no ways to take into account the dependence upon j and k that a j and b k have. However, when x ijks s are averaged over multiple is, there is a possibility that the dependence upon j and k of a j and b k can appear, since the randomness of ξ ijk can be smeared out because of averaging. In practice, u 1 j and u 2 k can be such variables that can appear only after averaging, and can represent the dependence upon j and k of a j and b k . This is why TD based unsupervised FE can outperform eq. (9), which explicitly considers the multiplicative nature when x ijks s are generated, and can consider the dependence upon j and k of a j and b k , which categorical regression, eq.(10), cannot.

We applied TD based unsupervised FE to the gene expression profiles introduced in the Materials and Methods section. Then other methods were applied to the synthetic data set to form a basis for the comparisons discussed in the Discussions and Conclusions section.

We selected 1 = 1, 2 = 1, 3 = 2, 4 = 1, and 5 = 1 based on the criteria described above (Fig. 3) , and the associated G (1, 1, 2, 1, 1, 6 ) values are listed in Table IV , demonstrating the largest value for G (1, 1, 2, 1, 1, 3) . The associated P i values were computed using u 3i as shown in eq. (2), resulting in (Table V) . Although some mouse-specific genes were included in this list (e.g., genes with symbols starting with "mt"), since there were still several gene symbols that are common between human and mice, we decided to evaluate the potential association of all 134 genes with the infection process of coronavirus.

In order to see if we could correctly select differentially expressed genes between infected mice and control, we applied t test to expression of 134 genes between infected and control mouse. Then we have found that the null hypothesis that gene expression averaged over selected 134 genes are equal between infected and control mouse was rejected by the P -values of 0.004. Thus we could successfully identify expressed genes between infected mice and control. 

We first evaluated whether the 134 selected genes could reflect the process of coronavirus infection using the Enrichr server for functional enrichment analysis. Several of the genes were enriched in the category "Virus-Host PPI P-HIPSTer 2020," which is related to SARS-CoV (Table VI, see the supplementary  materials for the full list) .

These genes were also related to ORF1ab, polyprotein, and 3C-like protease. Interestingly, Woo et al. [11] suggested that ORF1ab, which encodes a replicase polyprotein of CoV-HKU1, is cleaved by papain-like proteases and 3C-like proteinase. Thus, it is reasonable that ORF1ab, polyprotein, and 3C-like protease would be affected during MHV infection VI. Other PPIs detected that are not listed in Table VI (see supplementary materials) were also mainly associated with ORF1ab and polyproteins, suggesting that our strategy has clear capability to elucidate the basic infectious process at the molecular level that is common among various coronaviruses.

We next evaluated whether genes with known altered expression by virus perturbation overlapped with the 134 genes selected by our TD-based unsupervised FE approach (Table VII; see the supplementary material for the full list).

Among these, we detected the overlap of many genes that are pertubated in response to either SARS-CoV or SARS-like bat CoV, which are the genetically closest coronaviruses to the new SARS-CoV-2 strain. This further suggests that our results could have high similarity to the genes perturbated in SARS-CoV-2 infection.

For further functional enrichment analysis, we uploaded the 134 selected genes to Metascape to identify non-redundant biological terms (Fig. 4) . Among the terms identified, "R-HSA-2 173 782: Binding and Uptake of Ligands by Scavenger Receptors" was the third most significantly enriched term. Although it was initially surprising that a scavenger receptor might be related to the response to coronavirus infection, a search of the related literature revealed that the scavenger receptor TMPRSS2 plays a critical role in SARS-CoV-2 infection as well as SARS-CoV infection [12] . Isolation of SARS-CoV-2 was also reported to be enhanced by TMPRSS2-expressing cells [13] . Moreover, [14] , suggesting that TMPRSS2 activity would be related to detection of scavenger receptor activity. This finding further demonstrates the outstanding capability of our strategy to detect factors related to the SARS-CoV-2 infectious process. Moreover, this analysis suggests that research on the SARS-CoV infection process could be informative for understanding the SARS-CoV-2 infection process when it is not possible to directly investigate SARS-CoV-2 infection.

We previously demonstrated that genes selected by TD-based unsupervised FE are useful to screen for drugs that are effective in treating disease or those that may cause adverse effects [15] . Therefore, we used this approach to screen for candidate drugs to treat coronavirus infections based on the individual terms that emerged from the Enrichr analysis.

1) Drug Matrix: In the Enrichr category "DrugMatrix," the top-ranked drug was related to virus infection (Table VIII; see the supplementary materials for the full list). Most of these viruses are enveloped, single-stranded RNA viruses. Coronaviruses, including SARS-CoV-2, are positive-sense, enveloped, singlestranded RNA viruses, whereas influenza virus is a negativesense, enveloped, single-stranded RNA virus.

Primaquine is known to inhibit the replication of Newcastle disease virus [16] , which is in the family of paramyxoviruses that are enveloped, non-segmented, negative-sense single-stranded RNA viruses. Meloxicam is known to have cytotoxic and antiproliferative activity on virus-transformed tumor cells [17] , including myelocytomatosis virus and Rous sarcoma virus. Myelocytomatosis virus is a retrovirus, which is an enveloped, negative-sense, single-stranded RNA virus, whereas Rous sarcoma virus is an enveloped, positive-sense, single-stranded RNA virus. Although there are no studies showing that cytarabine is effective against infection of an RNA virus, one report demonstrated that cytarabine can affect DNA virus infection [18] . Pyrogallol was reported to have anti-virus effects on human influenza virus strain A/Udorn/72, avian influenza virus A/swan/Shimane/499/83, herpes simplex virus-1, vesicular stomatitis virus, and retrovirus [19] . As mentioned above, influenza virus is a negative-sense, enveloped, singlestranded RNA virus; herpesvirus is a DNA virus; vesicular stomatitis virus is an enveloped, single-stranded, negative-sense RNA virus; and retroviruses are enveloped, negative-sense, single-stranded RNA viruses. This suggests that a single drug can effectively inhibit a wide range of viruses from DNA viruses to both negative-and positive-sense RNA viruses. The structuredependent antiviral activity of catechol derivatives in pyroligneous acid against encephalomyocarditis virus was reported, which is a non-enveloped single-stranded RNA virus [20] . To our knowledge, there are no reports that neomycin is effective against RNA viruses; however, one study showed that it could inhibit infection of fibroblasts with human cytomegalovirus [21] , which is a DNA virus.

Although not all viruses identified to be related to the 134 genes selected by TD-based unsupervised FE are enveloped, positive-sense, single-stranded RNA viruses similar to SARS-CoV-2, since drugs shown to be effective against other viruses (e.g., DNA viruses) are also often effective against RNA viruses (including pyrogallol that was screened by our strategy), drugs in Table VIII warrant being tested as potential treatments for SARS-CoV-2 infection.

2) Drug Perturbations From GEO: Several promising drug compound candidates were also screened from the GEO "Drug Perturbations from GEO up" and "Drug Perturbations from GEO down" categories, along with available evidence for possible adverse effects (Table IX; see the supplementary material for  the full list) .

Drugs associated with upregulated genes that overlapped with the 134 genes selected by TD-based unsupervised FE are considered to be more likely to cause adverse effects, since they will enhance the expression of genes altered by SARS-CoV infection. Captoprilis is an angiotensin-converting enzyme (ACE) inhibitor, which is known to activate ACE2 that is the receptor that SARS-CoV-2 uses to infect human cells [22] , suggesting that this drug might have negative effects for COVID-19 therapy. Coenzyme Q10, which frequently emerged in Table IX , has been reported to accelerate virus infection [23] , which could therefore also have negative effects for COVID-19 therapy. Fenretinide is known to effectively inhibit HIV infection [24] , and therefore might be a promising drug candidate for SARS-CoV-2 even though it was listed in the "Drug Perturbations from GEO up" category.

In contrast to the drugs in the above list, those associated with downregulated genes that overlapped with the 134 genes selected by TD-based unsupervised FE are considered to be able to effectively suppress SARS-CoV-2 infection, since they will inhibit the expression of genes altered by SARS-CoV infection. Pioglitazone was also included in the list of candidate compounds for SARS-CoV-2 screened by an in silico method [25] . Quercetin was reported to inhibit the cell entry of SARS-CoV-2 [26] , and was also included in the list of candidate compounds for SARS-CoV-2 screened by an in silico method [27] . Fenretinide was also included in the drugs identified as effective compounds in the "Drug perturbations from GEO up" category as described above. Decitabine is one of the drugs used in HIV combination therapy [28] . Troglitazone impedes the oligomerization of sodium taurocholate co-transporting polypeptide and entry of hepatitis B virus into hepatocytes [29] , which is a partially double-stranded DNA virus. Finally, motexafin gadolinium was reported to selectively induce apoptosis in HIV-1-infected CD4+ T helper cells [30] .

Based on these observations, our strategy appears to be useful to identify potential drug compounds for SARS-CoV-2.

Finally, we compared the drugs screened out using our approach from the "Drug perturbations from GEO up/down" lists with those screened from two in silico drug discovery studies [25] , [27] 1) Comparison With Wu et al. [25] : We found multiple hits, which are summarized in Table X .The main drugs identified included doxycycline, ascorbic acid, isotretinoin, pioglitazone, cortisone, and tibolone.

Wu et al. [25] identified 29 potential PLpro inhibitors, 27 potential 3CLpro inhibitors, and 20 potential RdRp inhibitors from the ZINC drug database, and identified 13 potential PLpro inhibitors, 26 potential 3Clpro inhibitors, and 20 Potential RdRp inhibitors from their in-house natural product database. Doxycycline was among both the potential PLpro and 3CLpro inhibitors; ascorbic acid and isotretinoin were among the potential PLpro inhibitors; pioglitazone was among the potential 3CLpro inhibitors; and cortisone and tibolone were included in the potential RdRp inhibitors from the ZINC drug database. These multiple hits also further support the suitability of our strategy.

2) Comparison With Ubani et al. [27] : Ubani et al. [27] screened a library of 22 phytochemicals with antiviral activity obtained from the PubChem database for activity against the spike envelope glycoprotein and main protease of SARS-CoV-2. Among these, we found only one hit that overlapped with our screened out drugs, which was quercetin (Table XI) .

In this paper, we present a novel evaluation method to identify drugs that could be used to effectively treat COVID-19. We applied a TD-based unsupervised FE method to select genes with altered expression caused by MHV infection in mice. Although the dataset analyzed for this study was not based on SARS-CoV-2 infection, the 134 genes selected by TD-based unsupervised FE can still be considered useful for gaining a better understanding of the infectious mechanism of SARS-CoV-2 for several reasons. First, the 134 genes selected were enriched in general RNA virus proteins that play important roles during infectious processes. This suggests that the infectious mechanism represented by the 134 genes in the mouse model is also applicable to SARS-CoV-2 infection. In fact, these genes were also enriched in processes related to scavenger receptor activity, which might reflect the critical role of TMPRSS2 activity in SARS-CoV-2 replication, suggesting a potential therapeutic target.

Following these achievements, we tried to identify potential drug candidate compounds that could influence the 134 selected genes. Among these, we screened out several candidate compounds that are known antiviral drugs, including those that were screened out as drug candidate compounds for SARS-CoV-2 using in silico methods. (Table II) . U1:u 2j , U2:u 2 k , and U3:u 1n . See Materials and Methods for the meanings of j, k and n.

The question arises whether MHV is a suitable model system for SARS-CoV-2 infection, since there are more datasets available for SARS-CoV infection in mouse lung. In order to evaluate the suitability of MHV as a model of the SARS-CoV-2 infectious process, we also performed additional analyses using two SARS-CoV infectious processes in mouse lung (see Materials and Methods). As can be seen in Figs. 5 and 6, 1 = 2 was selected as singular value vector u 1 j with monotonic dependence upon j, 2 = 2 was selected as a singular value vector u 2 k with monotonic dependence upon k, and 3 = 1 was selected as a singular value vector u 3 n with constant values Fig. 6 . Singular value vectors obtained by the HOSVD algorithm applied to GSE50000 (Table II) . U1:u 2j , U2:u 3 k , and U3:u 1n . See Materials and Methods for the meanings of j, k and n.

regardless of n for GSE33266, while 1 = 2 was selected as a singular value vector u 1 j with distinct values between mock (j = 3) and infectious samples (j = 1, 2), 2 = 3 was selected as a singular value vector u 2 k with monotonic dependence upon k, and 3 = 1 was selected as a singular value vector u 3 n with constant values regardless of n for GSE50000. Then 4 = 2, 3 and 4 = 1 were selected for GSE33266 and GSE50000, respectively, as those associated with the larger absolute values of G( 1 2 3 4 ) given 1 , 2 , 3 . P -values, P i , were attributed to gene i using a cumulative χ 2 distribution, P χ 2 [> x], as described in the Materials and Methods; 569 gene symbols associated with selected genes were selected for GSE33266 and 475 were selected for GSE50000 (see Supplementary Material). Fig. 7 shows the Venn diagram of these two sets of genes and 134 genes selected by TD based unsupervised FE using GSE146074. Although there are some overlaps, the majority of genes are not shared among these three gene sets.

These two sets of genes were uploaded to Enrichr and checked for significant overlap with coronavirus PPI genes (Table XII) .Two sets of selected genes significantly overlap with genes that are believed to interact with SARS-CoV proteins, in spite of the fact that these two gene sets are not highly coincident with the 134 genes selected by the TD based unsupervised FE using GSE146074 (Fig. 7) . TD based unsupervised FE therefore has the ability to predict PPI using gene expression profiles, no matter which data sets among GSE146074, GSE33266, and GSE50000 are used. These findings also suggest that the results shown in Table VI are unlikely to be accidental, but provide evidence that the genes selected by TD based unsupervised FE are those interacting with SARS-CoV proteins during infectious processes.

Another possible concern is that the studies are not based upon direct investigation of SARS-CoV-2, but are based upon a closely related virus. This limitation suggests that these results might not be applicable to SARS-CoV-2. In order to address this point, we compared the 134 genes (Table V) with genes reported to interact with SARS-CoV-2 proteins [31] (Table XIII) .One hundred and thirty-four genes significantly overlap with human genes reported to interact with SARS-CoV-2 proteins. TD based unsupervised FE therefore appears to have the ability to predict PPI, even when only gene expression profiles from a related virus are available. Especially, it is remarkable that the present study could outperform the inference in the previous study [8] (asterisked ones in Table XIII) where SARS-CoV-2 infected human lung cell lines are investigated. This is possibly because of the superiority of in vivo study toward in vitro study in spite of the usage of different species. Our proposed method, TD based unsupervised FE, can make use of the data set taken from different species infected by related but not exactly same virus whereas other methods could not (see below). Since it is unrealistic to intentionally infect human subject SARS-CoV-2, it is important to have the methodology that can make use of data set taken from other species than human. Finally, we compared identified drugs ("Drug Pert GEO up/down" and "DrugMarix" in the Supplementary Material) with those reported as possible drugs against SARS-CoV-2 [32] . Among the 142 drugs identified by Zhou et al. [32] , as many as 25 drugs were found to significantly affect 134 genes in at least one experiment within either DrugMatrix, or GEO, using Enrichr with adjusted P -values of less than 0.05 (Table XIV) . Thus, our suggestions for drug repositioning are also supported.

Since it is unlikely that this level of agreement is purely accidental, the drugs identified in the present study can be useful candidates for further evaluation for COVID-19 therapy. This work therefore provides a foundation for further research pertaining to utilizing advanced learning concepts to analyze COVID-19 infectious disease.

Final concerns to be addressed might be the comparison with methods other than TD based unsupervised FE applied to synthetic data sets. When considering the synthetic data set, categorical regression, eq. (10), outperformed TD based unsupervised FE. Although eq. (9) cannot be better than TD based unsupervised FE (Table III) , its performance is still comparable. If these two more easily understood methods are better than or comparable to TD based unsupervised FE, TD based unsupervised FE, which is more difficult to interpret, is useless. In order to check this point, we applied categorical regression [32] and eq. (9), which were modified as

and

where a i = i, b k = k, c m = m, to the present set (GSE146074). Then is associated with adjusted P -values less than 0.01 were selected. There are too many genes that passed this screening to evaluate: 25 609 and 20 217, respectively. This observation suggests that these two methods lack the ability to screen for a limited number of genes that are likely to interact with SARS-CoV-2 proteins, since only a limited number of human genes will interact with SARS-CoV-2 proteins (2 × 10 4 are as many as all human protein coding genes). Although this finding is enough reason to reject the use of these two methods in favor of TD based unsupervised FE if only the top ranked genes are selected, it might be possible to identify a limited number of genes that significantly overlap with genes reported to interact with SARS-CoV-2 proteins. We selected the 134 top ranked genes using the P -values produced by these two methods and uploaded them to Enrichr. There were no SARS-CoV-2 proteins that significantly interact with these 134 genes. Since a significant interaction with SARS-CoV-2 proteins is the primary requirement for genes to be used to screen candidate drug compounds, these two methods appear to be inadequate for the present purpose. If we employ more complicated and sophisticated methods to select genes, it might be possible to identify a limited number of genes that significantly interact with SARS-CoV-2 proteins. However, TD based unsupervised FE is simple and rapid (see the CPU time in Table III ) enough to achieve the present purpose, and does so with acceptable accuracy.

Finally, we would like to discuss why we did not use SARS-CoV-MA15 data. This is simply because we could not get any significant overlaps with human proteins supposed to be interact with SARS-CoV proteins. This is possibly because SARS-CoV-MA15 is believed to be adapted to mouse infection processes while we checked human proteins. Thus, even if organism used is not human but mouse, MHV is suitable model for human SARS-CoV-2 infection and our method has ability to distinguish between MHV and SARS-CoV-15 where the former remains an effective model of human SARS-CoV-2 infectious process as Pfaender et al. [6] correctly assumed while the latter is not.

One might wonder why we have employed one specific algorithm, HOSVD, among those developed to apply TD to tenors. Although it was fully described in the recently published book [7] , we outline it here very briefly.

r Although CP decomposition [7] is more popular implementation, it cannot give us suitable solution when the number of features are much larger than that of samples as the problems dealt in this study because of its heavily dependence upon initial values [7] .

r There are other implementations to derive Tucker decomposition than HOSVD, other methods cannot give us suitable solution because other methods require the optimization within the limited number of given singular value vectors. Since HOSVD does not perform optimization and singular value vectors can be obtained independent of the number of singular value vectors to be computed, HOSVD does not destroy important singular value vectors with small contribution; since the number of genes considers is as small as a few hundreds, which is less than 1 % of total number of genes 10 4 , we cannot ignore singular value vectors with small contributions that might reflect the property of a few hundred genes.

r Not all TD cannot give us weight to evaluate the relations between singular value vectors attributed to distinct instances. Although HOSVD can give us G that can evaluate coincidence between u t , 1 ≤ t ≤ 6, for example tensor train decomposition [7] cannot give us something that corresponds to G. Thus we cannot know which singular value vectors attributed to gene should be used for selecting genes based upon the evaluation other singular value vectors attributed to, e.g., samples. Because of these reasons, we have specifically employed HOSVD.

In the book [7] , we carried out decompositions of data modeled as three-mode tensor and five-mode tensor in addition to others, which were different from this study as we consider a decomposition of different application and data modeled as six-mode tensor (x ijkmnp ∈ R N ×2×2×3×3×2 ).

In silico drug repurposing for COVID-19: Targeting SARS-CoV-2 proteins through docking and consensus ranking

Drug repurposing for coronavirus (COVID-19): In silico screening of known drugs against coronavirus 3CL hydrolase and protease enzymes

Drug repurposing for coronavirus (COVID-19): In silico screening of known drugs against the SARS-CoV-2 spike protein bound to angiotensin converting enzyme 2 (ACE2) (6M0J)

Identification of candidate drugs using tensordecomposition-based unsupervised feature extraction in integrated analysis of gene expression between diseases and drug matrix datasets

Drug candidate identification based on gene expression of treated cells using tensor decomposition-based unsupervised feature extraction for large-scale data

LY6E impairs coronavirus fusion and confers immune control of viral disease

Unsupervised Feature Extraction Applied to Bioinformatics: PCA and TD Based Approach

A new advanced in silico drug discovery method for novel coronavirus with tensor decomposition-based unsupervised feature extraction

Enrichr: A comprehensive gene set enrichment analysis web server 2016 update

Metascape provides a biologist-oriented resource for the analysis of systems-level datasets

In silico analysis of ORF1ab in coronavirus HKU1 genome reveals a unique putative cleavage site of coronavirus HKU1 3C-like protease

SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor

Enhanced isolation of SARS-CoV-2 by TMPRSS2-expressing cells

Type II transmembrane serine proteases

Neurological disorder drug discovery from gene expression with tensor decomposition

Primaquine diphosphate: Inhibition of newcastle disease virus replication

Cu(II) and Zn(II) complexes with Meloxicam on virus -transformed tumor cells daniela

Antiviral activity of cytarabine in herpesvirusinfected rats

Inactivation of pathogenic viruses by plant-derived tannins: Strong effects of extracts from persimmon (diospyros kaki) on a broad range of viruses

Structure-dependent antiviral activity of catechol derivatives in pyroligneous acid against the encephalomycarditis virus

Evidence that neomycin inhibits human cytomegalovirus infection of fibroblasts

ACE-2/Ang1-7/Mas cascade mediates ace inhibitor, captopril, protective effects in estrogen-deficient osteoporotic rats

Coenzyme Q plays opposing roles on bacteria/fungi and viruses in drosophila innate immunity

Fenretinide inhibits HIV infection by promoting viral endocytosis

Analysis of therapeutic targets for SARS-CoV-2 and discovery of potential drugs by computational methods

Small molecules blocking the entry of severe acute respiratory syndrome coronavirus into host cells

Molecular docking analysis of some phytochemicals on two SARS-CoV-2 targets

Exploiting drug repositioning for discovery of a novel HIV combination therapy

Troglitazone impedes the oligomerization of sodium taurocholate cotransporting polypeptide and entry of hepatitis B virus into hepatocytes

Motexafin gadolinium (Gd-Tex) selectively induces apoptosis in HIV-1 infected CD4 + T helper cells

A SARS-CoV-2-human protein-protein interaction map reveals drug targets and potential drug-repurposing

Networkbased drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2