key: cord-0281410-9kkj5j5e
authors: Taguchi, Y-h.; Turki, Turki
title: Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools
date: 2020-10-04
journal: bioRxiv
DOI: 10.1101/2020.10.02.324616
sha: 2007b1669eddfa3f55e92c6b323621ab40be6cfc
doc_id: 281410
cord_uid: 9kkj5j5e

Abstract: Identifying differentially expressed genes is difficult because of the small number of available samples compared with the large number of genes. Conventional gene selection methods employing statistical tests have the critical problem that P-values depend heavily on sample size. Although the recently proposed principal component analysis (PCA)- and tensor decomposition (TD)-based unsupervised feature extraction (FE) have often outperformed these statistical test-based methods, the reason why they work so well has been unclear. In this study, we aim to understand this reason in the context of projection pursuit, which was proposed long ago to address the problem of high dimensionality: the space spanned by singular value vectors can be related to the space spanned by the optimal cluster centroids obtained from K-means. Thus, the success of PCA- and TD-based unsupervised FE can be understood through this equivalence. In addition, an empirical threshold of adjusted P-values of 0.01, under the null hypothesis that the singular value vectors attributed to genes obey a Gaussian distribution, corresponds to a threshold of adjusted P-values of 0.1 when the null distribution is generated by gene order shuffling. These findings rationalize the success of PCA- and TD-based unsupervised FE for the first time.

In genomic sciences, selecting a limited number of differentially expressed genes (DEGs) among as many as several tens of thousands of genes is a critical problem. Unfortunately, this is a very difficult task, as the number of genes, N, is usually much larger than the number of available samples, M.
However, as this is not a mathematically solved problem, it has most frequently been tackled empirically using statistical test-based feature selection strategies 1, 2 . Despite huge efforts in this direction, these statistical test-based feature selection strategies cannot be said to work well. Selection of biologically informative genes, including DEGs, is essentially performed as follows (for simplicity, we assume Σ_i x_ij = 0, Σ_i x_ij^2 = N, and that the M samples are composed of multiple classes having an equal number of samples). Suppose that we have properties y ∈ R^M attributed to the M samples. We would like to relate a matrix form of some omics data, e.g., gene expression profiles, X ∈ R^{N×M}, to y. The overall purpose is to derive a b ∈ R^N whose absolute values, |b_i|, represent the importance of the ith gene. The first and most popular strategy outside genomic sciences is a regression strategy that requires minimization of

|y − bX|^2.    (1)

The regression approach, eq. (1), is less popular in genomic sciences than in other scientific fields, possibly because N ≫ M, which always results in exactly |y − bX|^2 = 0 for infinitely many b. Thus, it is useless for selecting a limited number of important features among the total N features. Although adding an L2-norm regularization term to eq. (1), as in ridge regression,

|y − bX|^2 + λ|b|^2,

does help, because |y − bX|^2 = 0 is no longer satisfied, it is not an ideal solution; nor is the solution obtained using the Moore-Penrose pseudo-inverse. Lasso regression, which instead adds an L1-norm term,

|y − bX|^2 + λ|b|_1,    (6)

can yield at most M non-zero variables, which is not effective when N ≫ M, because more than M variables might be biologically informative and should not be neglected. Moreover, the addition of an L1 norm is known to be a poor strategy when the rows of X are not composed of independent vectors, which is very common in genomic science.
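The degeneracy described above is easy to see numerically: with N ≫ M, ordinary least squares fits any y exactly, so the resulting |b_i| carry no selection signal. A toy sketch with synthetic data (all sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 10          # N genes >> M samples
X = rng.normal(size=(N, M))
y = rng.normal(size=M)   # arbitrary sample properties

# Solve min_b |y - bX|^2 with b in R^N (row-vector convention of the text).
# For an underdetermined system, lstsq returns the minimum-norm solution
# among the infinitely many exact fits.
b, *_ = np.linalg.lstsq(X.T, y, rcond=None)
residual = np.linalg.norm(y - b @ X)
print(residual)  # ~0: any y is fit exactly, so |b_i| cannot rank genes
```

Since even pure noise is fit perfectly, no thresholding of |b_i| can distinguish informative from uninformative genes in this regime.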
The second strategy is a projection strategy,

b = yX^T,    (7)

which is equivalent to the maximization of

bXy^T    (8)

(under fixed |b|) and is employed in PCA- and TD-based unsupervised FE (see below). Through the concept of projection pursuit 5 (PP), seeking the projection vector b is understood as maximizing interestingness, which is eq. (8) in this study. As H(bX) is a function of b, it is also denoted I(b), which is called the projection index. I(b) can be any other function, but it should be chosen such that the biologically most meaningful results are obtained. Upon obtaining the b that maximizes I(b), we can select the i having larger absolute b_i, as mentioned above. In the framework of PP, in a high-dimensional system, almost all b have finite projections 6 . Thus, the only question is whether a projection is accidental or biologically meaningful. In genomic science, the projection strategy, eq. (7), is also unpopular. Although the reason for this unpopularity is unclear, it may be explained by the neglect of the contribution perpendicular to y, |(x_i · ŷ)ŷ − x_i|, where ŷ is a unit vector parallel to y, defined as y/|y|. Nevertheless, in contrast to the regression strategy, which requires the computation of (XX^T)^{−1}, eq. (7) is always computable even if N ≫ M, which is a great advantage of the projection strategy over the regression strategy. Instead of these two strategies, feature selection based on statistical tests 1, 2 is more popular in genomic sciences, as mentioned above. Such methods try to identify genes whose expression is significantly distinct between classes. Despite its popularity, feature selection based on statistical tests has critical problems; in particular, significance is heavily dependent on the sample size, M.
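The projection strategy of eq. (7) is a single matrix-vector product. A toy sketch with synthetic data and one planted differentially expressed gene (the class labels and sizes are illustrative) shows that the gene with the largest |b_i| is recovered without ever inverting XX^T:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 1000, 10
X = rng.normal(size=(N, M))
y = np.array([1.0] * 5 + [-1.0] * 5)         # two-class sample labels
X[0] += 5.0 * y                              # plant one gene that tracks the classes

# Projection strategy, eq. (7): b = y X^T, i.e. one inner product per gene;
# computable for any N >> M, unlike the regression normal equations.
b = X @ y                                    # b_i = sum_j x_ij y_j
top = np.argsort(-np.abs(b))[:10]            # genes with the largest |b_i|
```

Here `top[0]` is the planted gene 0, since its projection onto y is far larger than the noise-level projections of the remaining genes.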
Even in the case of a small distinction, more significant results are obtained when more samples are considered; this is not biologically sensible, because whether gene expression differs significantly between classes should not be a function of sample size. To compensate for this heavy sample-size dependence of significance, other criteria, such as fold change between classes, are often employed. Thus, feature selection based on statistical tests is, at best, the best among the worst approaches. If better strategies can be employed, there will be no reason to employ strategies based on statistical tests. Despite the unpopularity of the projection strategy, it has sometimes been evaluated as more effective 7, 8 than the standard feature selection strategy based on statistical tests. Thus, it is a candidate strategy that could replace feature selection based on statistical tests. In this paper, we try to understand why PCA-based unsupervised FE and TD-based unsupervised FE 3 are effective as feature selection based on the projection strategy. We consider the SARS-CoV-2 infection problem 9 as well as biomarker identification of kidney cancer 10 ; in these studies, despite unsuccessful results obtained by conventional feature selection based on statistical tests, TD-based unsupervised FE identified biologically reasonable genes (for more details about how PCA- and TD-based unsupervised FE are superior to statistical test-based feature selection tools in these specific examples, see these previous studies 9, 10 ). Figure 1 shows the workflow of this study. Before starting to rationalize PCA- and TD-based unsupervised FE, we briefly summarize how they work. The purpose of PCA- and TD-based unsupervised FE is to select biologically sound features (typically genes) based on given omics data, such as gene expression profiles, in an unsupervised manner.
In this subsection, we introduce PCA-based unsupervised FE; TD-based unsupervised FE is an advanced version of PCA-based unsupervised FE and will be introduced in the next subsection. Suppose that we have gene expression data in matrix form, X ∈ R^{N×M}, for N genes measured across M samples. First, we standardize X such that Σ_i x_ij = 0 and Σ_i x_ij^2 = N, as we will attribute principal component (PC) scores to genes, whereas PC loadings will be attributed to samples. The ℓth PC score attributed to the ith gene, u_ℓi, can be obtained as the ith component of the ℓth eigenvector, u_ℓ ∈ R^N, of the Gram matrix XX^T ∈ R^{N×N}, where X^T is the transpose of X:

XX^T u_ℓ = λ_ℓ u_ℓ,    (10)

where λ_ℓ is the ℓth eigenvalue. Further, the ℓth PC loading attributed to the jth sample, v_ℓj, can be obtained as the jth component of v_ℓ = X^T u_ℓ (up to normalization). Notably, v_ℓ is also an eigenvector of the covariance matrix X^T X ∈ R^{M×M}, because

X^T X v_ℓ = X^T X X^T u_ℓ = X^T (XX^T u_ℓ) = λ_ℓ X^T u_ℓ = λ_ℓ v_ℓ.

PCA-based unsupervised FE works as follows. First, we need to identify the v_ℓ of interest. The v_ℓ of interest depends on the problem: it might be the one coincident with the sample clusters, or the one with a monotonic dependence on some external parameter such as time. After identifying the v_ℓ of interest, we attribute P-values to genes assuming that the components of the corresponding u_ℓ follow a normal distribution:

P_i = P_χ2[> (u_ℓi/σ_ℓ)^2],

where P_χ2[> x] is the cumulative probability that a χ^2-distributed variable is larger than x, and σ_ℓ is the standard deviation of the u_ℓi. Computed P-values are adjusted based on the BH criterion 3 , and features associated with adjusted P-values less than a specified threshold can be selected. The reason why such a simple procedure works properly is explained later. Finally, we would like to emphasize the equivalence between singular value decomposition (SVD) and PCA. Suppose we have the SVD of X:

X = Σ_ℓ λ_ℓ^{1/2} u_ℓ v_ℓ^T.

It is straightforward to show that

XX^T u_ℓ = λ_ℓ u_ℓ,  X^T X v_ℓ = λ_ℓ v_ℓ.

Thus, SVD and PCA are mathematically equivalent problems.
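The procedure above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it uses `scipy` for the χ^2 tail, a hand-rolled Benjamini-Hochberg step, and leaves the choice of the PC of interest (`ell`) to the caller, who would inspect the returned loadings:

```python
import numpy as np
from scipy.stats import chi2

def pca_unsupervised_fe(X, ell, alpha=0.01):
    """Sketch of PCA-based unsupervised FE: select genes whose ell-th PC
    score is an outlier under a Gaussian null (ell is chosen by inspecting
    the PC loadings attributed to samples)."""
    N = X.shape[0]
    # standardize each sample (column): sum_i x_ij = 0, sum_i x_ij^2 = N
    X = X - X.mean(axis=0)
    X = X / np.sqrt((X ** 2).sum(axis=0) / N)
    # SVD: columns of U are the PC scores attributed to genes
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    u = U[:, ell]
    # null hypothesis: u_i Gaussian, so (u_i/sigma)^2 follows chi2 with 1 df
    P = chi2.sf((u / u.std()) ** 2, df=1)
    # Benjamini-Hochberg adjustment
    order = np.argsort(P)
    adj = np.empty(N)
    adj[order] = np.minimum.accumulate((P[order] * N / np.arange(1, N + 1))[::-1])[::-1]
    return np.where(adj < alpha)[0], Vt  # selected gene indices, PC loadings
```

Genes whose PC scores dominate the chosen component receive tiny adjusted P-values and are selected, while the bulk of genes, whose scores look Gaussian, are not.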
TD-based unsupervised FE works quite similarly to PCA-based unsupervised FE. Instead of PCA, we apply TD to a tensor x_ijk ∈ R^{N×M×K} that represents, for example, the expression of the ith gene measured in the kth tissue of the jth person (even though we consider a three-mode tensor here, the extension to higher-mode tensors is straightforward). To obtain a TD, we employ the higher-order singular value decomposition 3 (HOSVD),

x_ijk = Σ_ℓ1 Σ_ℓ2 Σ_ℓ3 G(ℓ1, ℓ2, ℓ3) u_ℓ1j u_ℓ2k u_ℓ3i,

where G ∈ R^{M×K×N} is a core tensor and u_ℓ1j ∈ R^{M×M}, u_ℓ2k ∈ R^{K×K}, and u_ℓ3i ∈ R^{N×N} are singular value matrices. After identifying the u_ℓ1j and u_ℓ2k of interest, for instance, those reflecting the distinction between healthy controls and patients as well as tissue-specific expression, we seek the ℓ3 associated with the G(ℓ1, ℓ2, ℓ3) having the largest absolute value for the given ℓ1 and ℓ2. Then, using the identified ℓ3, we attribute P-values to the ith feature as in the case of PCA-based unsupervised FE,

P_i = P_χ2[> (u_ℓ3i/σ_ℓ3)^2],

where σ_ℓ3 is the standard deviation of the u_ℓ3i. Computed P-values are adjusted based on the BH criterion, and features associated with adjusted P-values less than a specified threshold can be selected. The reason why such a simple procedure works properly is explained later. To explain why PCA- and TD-based unsupervised FE work rather well, we consider two recent works 9, 10 , in which the superiority of PCA- and/or TD-based unsupervised FE over conventional statistical methods was shown; in these studies, conventional statistical test-based methods failed to select a reasonable number of genes, whereas TD-based unsupervised FE successfully selected a biologically reasonable, restricted number of genes. In the first study 10 , two independent sets of data including the mRNA and miRNA expression of kidney cancer and normal kidney were analyzed in an integrated manner using PCA- as well as TD-based unsupervised FE. The first data set comprised M = 324 samples, including 253 kidney tumors and 71 normal kidney tissues.
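HOSVD itself reduces to mode-wise SVDs of the tensor unfoldings, plus projections to obtain the core tensor. A minimal NumPy sketch (the function name and mode ordering are ours, not the paper's):

```python
import numpy as np

def hosvd(x):
    """Higher-order SVD of a tensor x: x = G x_1 U1 x_2 U2 x_3 U3,
    where U_n holds the left singular vectors of the mode-n unfolding
    (a sketch for an arbitrary number of modes)."""
    Us = []
    for n in range(x.ndim):
        # mode-n unfolding: mode n becomes the rows, all others the columns
        unfold = np.moveaxis(x, n, 0).reshape(x.shape[n], -1)
        U, _, _ = np.linalg.svd(unfold, full_matrices=False)
        Us.append(U)
    # core tensor: contract each mode of x with the corresponding U^T
    G = x
    for n, U in enumerate(Us):
        G = np.moveaxis(np.tensordot(U.T, np.moveaxis(G, n, 0), axes=1), 0, n)
    return G, Us
```

Contracting the core tensor back with the factor matrices reconstructs the original tensor, which is a convenient sanity check for the mode bookkeeping.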
The expression of N mRNAs and K miRNAs was formatted as matrices x_ij ∈ R^{N×M} and x_kj ∈ R^{K×M}, respectively. The three-mode tensor x_ijk ∈ R^{N×M×K} was generated as

x_ijk = x_ij x_kj.

As the data were too large to be loaded into the memory available on a standard stand-alone server, it was impossible to obtain the TD of x_ijk directly. Instead, we generated

x_ik = Σ_j x_ij x_kj,

and SVD was applied to x_ik as

x_ik = Σ_ℓ λ_ℓ u_ℓi u_ℓk

to obtain u_ℓ1i and u_ℓ3k approximately. The missing singular value vectors attributed to the mRNA and miRNA samples were approximately recovered using

u^mRNA_ℓj = Σ_i u_ℓi x_ij,  u^miRNA_ℓj = Σ_k u_ℓk x_kj,

respectively. Although we do not intend to insist that these approximations are precise, we decided to employ them since they turned out to work well empirically. After investigating the obtained u^mRNA_ℓ1j and u^miRNA_ℓ3j, we realized that ℓ1 = ℓ3 = 2 is coincident with the distinction between tumors and normal tissues; therefore, we attributed P-values to mRNAs and miRNAs using u_2i and u_2k, respectively, with

P_i = P_χ2[> (u_2i/σ_2)^2],  P_k = P_χ2[> (u_2k/σ_2)^2].

These P-values were corrected by the BH criterion, and we selected 72 mRNAs and 11 miRNAs associated with adjusted P-values less than 0.01. The second data set comprised M = 34 samples, including 17 kidney tumors and 17 normal kidney tissues. The same procedures applied to the first data set were applied to the second data set, and we selected 209 mRNAs and 3 miRNAs associated with adjusted P-values less than 0.01. Although various biological evaluations were performed for the mRNAs and miRNAs selected using the first data set, the most remarkable achievement was that all three miRNAs selected using the second data set were included in the 11 miRNAs selected using the first data set, and there were as many as 11 common mRNAs selected between the first and second data sets. Considering that several hundred miRNAs and a few tens of thousands of mRNAs are available, these overlaps are a great achievement, as the two data sets are completely independent of each other.
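The memory-saving device above — replacing the N×M×K tensor by the much smaller N×K matrix x_ik = Σ_j x_ij x_kj, applying SVD to it, and recovering the sample-mode vectors by projecting the data matrices onto the gene- and miRNA-mode vectors — can be sketched as follows (toy sizes, illustrative names; the recovery formula is our reconstruction of the approximation described in the text):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, M = 2000, 300, 50             # mRNAs, miRNAs, shared samples (toy sizes)
X_mrna = rng.normal(size=(N, M))    # x_ij
X_mirna = rng.normal(size=(K, M))   # x_kj

# Instead of storing x_ijk = x_ij * x_kj (N*M*K entries), form the
# N x K matrix x_ik = sum_j x_ij x_kj and decompose that.
x_ik = X_mrna @ X_mirna.T
U, s, Vt = np.linalg.svd(x_ik, full_matrices=False)
u_i, u_k = U[:, 0], Vt[0, :]        # gene-mode and miRNA-mode vectors

# recover the missing sample-mode vectors approximately:
# u_j ~ sum_i u_i x_ij and u_j ~ sum_k u_k x_kj
u_mrna_j = X_mrna.T @ u_i
u_mirna_j = X_mirna.T @ u_k
```

The matrix x_ik has N·K entries instead of N·M·K, a factor-of-M saving, while the SVD still yields one vector per mode.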
In order to understand why such simple procedures can work well in the framework of PP, we replaced the singular value vectors attributed to samples with projections. For this purpose, we applied PP as follows. We defined

y_j = 1/M_N (jth sample normal),  y_j = −1/M_T (jth sample tumor),    (27)

where M_N and M_T are the numbers of normal tissue and cancer samples, respectively, and M_N + M_T = M. Then we applied PP as

b_i = Σ_j x_ij y_j,  b_k = Σ_j x_kj y_j.

We used the absolute values of b_i and b_k to select mRNAs and miRNAs that are presumably coincident with the distinction between tumors and normal tissues. P-values were attributed to mRNAs and miRNAs as

P_i = P_χ2[> (b_i/σ)^2],    (30)
P_k = P_χ2[> (b_k/σ)^2].    (31)

These P-values were corrected by the BH criterion, and we selected 78 mRNAs and 13 miRNAs for the first data set and 194 mRNAs and 3 miRNAs for the second data set, associated with adjusted P-values less than 0.01. Tables 1, 2, 3, and 4 show the comparisons of genes selected by TD-based unsupervised FE and by PP, eqs. (30) or (31), and demonstrate a high coincidence between them. Fig. 2 shows the comparisons of P_i and P_k between TD-based unsupervised FE and PP, eqs. (30) or (31). It is obvious that the smaller P-values used for gene selection, as well as the overall distributions of P-values, are coincident between TD-based unsupervised FE and PP, eqs. (30) or (31). To understand these excellent and unexpected coincidences between TD-based unsupervised FE and PP, we first considered the relationship between PCA and PP and later related it to TD. PCA is known to be equivalent to K-means 3 ; the space spanned by the centroids of optimal sample clusters can be reproduced by the PC scores attributed to the features. Suppose that we have x_ij ∈ R^{N×M}, the value of the ith feature of the jth sample, and that the M samples are to be clustered into S clusters. The centroid of the sth cluster, m_s ∈ R^N, is defined as

m_s = (1/n_s) Σ_{j∈C_s} x_j,

where x_j = (x_1j, x_2j, ..., x_Nj)^T ∈ R^N, C_s is the set of js that belong to the sth cluster, and n_s is the size of the sth cluster.
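Eqs. (27)-(31) amount to one projection per feature followed by a χ^2 test. A sketch with synthetic data (the exact form of y_j in eq. (27) is our reconstruction, and the helper name is illustrative):

```python
import numpy as np
from scipy.stats import chi2

def pp_select(X, y, alpha=0.01):
    """PP-based selection in the style of eqs. (30)/(31): project each
    feature onto the label vector y and test b_i against a Gaussian null."""
    b = X @ y                                   # b_i = sum_j x_ij y_j
    P = chi2.sf((b / b.std()) ** 2, df=1)
    N = len(P)                                  # Benjamini-Hochberg adjustment
    order = np.argsort(P)
    adj = np.empty(N)
    adj[order] = np.minimum.accumulate((P[order] * N / np.arange(1, N + 1))[::-1])[::-1]
    return np.where(adj < alpha)[0]

# y encodes the tumor/normal distinction, as in eq. (27), with M_T = M_N = 10
M_T, M_N = 10, 10
y = np.array([1.0 / M_T] * M_T + [-1.0 / M_N] * M_N)
```

With this y, b_i is simply the difference between the ith feature's class means, so features that genuinely separate the classes acquire large |b_i| and small adjusted P-values.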
Here we consider the projection of any vector x ∈ R^N onto the centroid subspace, i.e., the space spanned by the m_s, through the between-cluster matrix

S_b = Σ_s n_s m_s ⊗ m_s^T,

where ⊗ is the Kronecker product. S_b is also known to be represented as

S_b = Σ_{s=1}^{S} X h_s ⊗ h_s^T X^T,    (35)

where the cluster indicator vectors h_s ∈ R^M, with components h_js = 1/√n_s, take non-zero values only when the jth sample belongs to the sth cluster. K-means is an algorithm to find clusters that minimize

J_k = Σ_s Σ_{j∈C_s} |x_j − m_s|^2.

Minimization of J_k is known to be equivalent to the maximization of Tr S_b. It is known that

max Tr S_b = Tr Σ_{ℓ=1}^{S−1} λ_ℓ u_ℓ ⊗ u_ℓ^T,    (38)

where u_ℓ ∈ R^N is the vector whose components are the ℓth PC scores attributed to the features, i.e., the eigenvector of the Gram matrix,

XX^T u_ℓ = λ_ℓ u_ℓ.

If we compare eq. (35) with eq. (38), we notice that Σ_{s=1}^{S} X h_s ⊗ h_s^T X^T corresponds to Σ_{ℓ=1}^{S−1} λ_ℓ u_ℓ ⊗ u_ℓ^T, and PCA can give us an optimal centroid subspace, S_b, even without realizing the clusters by K-means, i.e., in a fully unsupervised manner. In other words, when the clusters are the solution of K-means, the centroid subspace can be represented by the PC scores, which can also be expressed by the X h_s. Moreover, h_s is clearly coincident with the y_j defined in eq. (27). This means that PP employing u_ℓ as b should result in a projection onto the centroid subspace when y_j is coincident with the clusters. Here we defined y_j, eq. (27), such that it represents the distinction between tumors and normal tissues, which should be detected by K-means. This explains why TD-based unsupervised FE works well and why PP can be replaced with TD-based unsupervised FE. To our knowledge, this is the first rationalization of why TD- and PCA-based unsupervised FE work well. One might wonder whether the above explanation is applicable to PCA, given that TD was what was applied to the first and second data sets. This gap can be explained as follows. The tensor x_ijk was generated as the product of x_ij and x_kj. Suppose these two are decomposed as

x_ij = Σ_ℓ λ_ℓ u_ℓi v_ℓj,  x_kj = Σ_ℓ' λ_ℓ' u_ℓ'k v_ℓ'j.

If v_ℓj = v_ℓ'j, then

x_ik = Σ_j x_ij x_kj = Σ_ℓ Σ_ℓ' λ_ℓ λ_ℓ' u_ℓi u_ℓ'k Σ_j v_ℓj v_ℓ'j = Σ_ℓ λ_ℓ^2 u_ℓi u_ℓk,

using the orthonormality of the v_ℓ. This means that if v_ℓj = v_ℓ'j, the SVD of x_ik gives the u_ℓi and u_ℓk that would be obtained when SVD is applied to x_ij and x_kj separately. Here, u^mRNA_2j is indeed highly correlated with u^miRNA_2j 10 .
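The claimed equivalence — the leading PC score lying (almost) in the span of the centroid directions X h_s — can be checked numerically on synthetic two-cluster data. This is a sketch: the cluster separation, sizes, and the residual threshold are arbitrary choices, and the cluster labels stand in for a K-means solution:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, S = 100, 60, 2
X = rng.normal(size=(N, M))
X[:, 30:] += rng.normal(size=(N, 1)) * 5.0    # shift the second sample cluster

labels = np.array([0] * 30 + [1] * 30)         # assume K-means recovers these
Xc = X - X.mean(axis=1, keepdims=True)

# centroid-subspace directions X h_s, with h_js = 1/sqrt(n_s) for j in C_s
H = np.stack([(labels == s) / np.sqrt((labels == s).sum()) for s in range(S)],
             axis=1)
C = Xc @ H                                     # columns span the centroid subspace

# leading PC score attributed to features (eigenvector of the Gram matrix)
U, s, _ = np.linalg.svd(Xc, full_matrices=False)
u1 = U[:, 0]

# residual of u1 after projecting onto the centroid subspace
coef, *_ = np.linalg.lstsq(C, u1, rcond=None)
residual = np.linalg.norm(u1 - C @ coef)       # small when clusters dominate
```

When the between-cluster signal dominates the noise, the residual is a small fraction of |u1| = 1, i.e., the unsupervised PC score essentially reproduces the centroid subspace without ever running K-means.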
This is coincident with the requirement v_ℓj = v_ℓ'j. As SVD is equivalent to PCA, this might explain why TD-based unsupervised FE works well even though the above rationalization applies only to PCA. The third data set Next, we would like to extend the above discussion to TD. Therefore, we consider a third data set analyzed in another study 9 , where we performed in silico drug discovery for SARS-CoV-2 by applying TD-based unsupervised FE to the gene expression profiles of human cell lines infected with SARS-CoV-2. The third data set comprises five cell lines infected with either mock (control) or SARS-CoV-2, including three biological replicates. It is formatted as a tensor, x_ijkm ∈ R^{N×5×2×3}, that represents the expression of the ith gene in the jth cell line from the infected (k = 1) or control (k = 2) group in the mth biological replicate. HOSVD was applied to x_ijkm, and we obtained

x_ijkm = Σ_ℓ4 Σ_ℓ1 Σ_ℓ2 Σ_ℓ3 G(ℓ4, ℓ1, ℓ2, ℓ3) u_ℓ1j u_ℓ2k u_ℓ3m u_ℓ4i.

In this study, we selected ℓ1 = 1, ℓ2 = 2, ℓ3 = 1 based on biological discussions. We then realized that G(5, 1, 2, 1) has the largest absolute value given ℓ1 = 1, ℓ2 = 2, ℓ3 = 1. Thus, u_5i was used to attribute P-values to gene i using

P_i = P_χ2[> (u_5i/σ_5)^2],

and the obtained P-values were corrected using the BH criterion; 163 genes associated with adjusted P-values less than 0.01 were selected. We now relate TD to the above discussion about PCA. Because of the HOSVD algorithm, u_ℓ4i can also be obtained by applying SVD to the unfolded matrix, X ∈ R^{N×30}, whose 30 columns correspond to the 30 combinations of j, k, and m. We selected ℓ1 = 1, ℓ2 = 2, ℓ3 = 1 so that the gene expression is independent of the cell lines and biological replicates and has opposite signs between the control and infected cells. Thus, two clusters are expected, each of which corresponds to either the control or the infected cell lines. The reason why ℓ4 = 5 is selected is simply that u_5i spans the centroid subspace coincident with these two clusters.
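The selection of the gene-mode index from the core tensor, given fixed sample-mode indices, can be sketched as follows (a toy random tensor with our mode ordering; in the actual study the maximizing gene-mode index turned out to be 5, which need not hold for random data):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=(40, 5, 2, 3))  # toy x_ijkm: gene, cell line, infection, replicate

# HOSVD via mode-wise SVDs of the unfoldings; as noted in the text, the
# gene-mode vectors u_{l4 i} are the left singular vectors of the N x 30 unfolding
Us = []
for n in range(x.ndim):
    unfold = np.moveaxis(x, n, 0).reshape(x.shape[n], -1)
    Us.append(np.linalg.svd(unfold, full_matrices=False)[0])
G = x
for n, U in enumerate(Us):
    G = np.moveaxis(np.tensordot(U.T, np.moveaxis(G, n, 0), axes=1), 0, n)

# fix the sample-mode indices of interest, (l1, l2, l3) = (1, 2, 1) in the
# paper's 1-based notation (0, 1, 0 here), and pick the gene-mode index
# maximizing |G|; the selected column plays the role of u_{5i} in the text
l_gene = np.argmax(np.abs(G[:, 0, 1, 0]))
u_gene = Us[0][:, l_gene]            # used to attribute P-values to genes
```

The core-tensor entry measures how strongly a gene-mode vector couples to the chosen sample-mode pattern, which is why its argmax identifies the vector used for feature selection.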
Thus, in this sense, the above discussion about PCA can be directly applied to this result. To confirm this, y_jkm was taken such that it represented the distinction between k = 1 and k = 2 (i.e., between infected and control cell lines),

y_jkm = α_j β_k γ_m,

where y_jkm ∈ R^{5×2×3}, α_j ∈ R^5, β_k ∈ R^2, γ_m ∈ R^3. Then PP was performed as

b_i = Σ_jkm x_ijkm y_jkm.

P-values were attributed to genes as

P_i = P_χ2[> (b_i/σ)^2],

and 155 genes associated with corrected P-values less than 0.01 were selected. Table 5 shows the high coincidence of selected genes between TD-based unsupervised FE and PP. Fig. 3 shows the overall coincidence of the distributions of P-values between TD-based unsupervised FE and PP. Thus, why TD-based unsupervised FE can work well is explained by the ability of singular value vectors to generate a centroid subspace of clusters coincident with the control and infected cell lines. One might wonder why TD is needed if u_ℓ4i can be computed by applying SVD to the unfolded matrix. To understand this, we compared v_5(jkm), obtained by applying SVD to the unfolded matrix and corresponding to u_5i, as well as u_1j u_2k u_1m, with y_jkm. While u_1j u_2k u_1m is well coincident with y_jkm, v_5(jkm) is not (Fig. 8) . Thus, we need to apply TD to x_ijkm to obtain singular value vectors attributed to samples that are coincident with the two clusters; these cannot be obtained when SVD is applied to an unfolded matrix. As we have successfully shown that TD as well as PCA is equivalent to PP, which aims to maximize the projection onto the centroid subspace of clusters coincident with the desired distinction (cancer vs. normal tissue or control vs. infected cell lines), we would next like to rationalize the P-values computed by the χ^2 distribution and the threshold value of P = 0.01, which have long been employed to select DEGs with PCA- and TD-based unsupervised FE. Because the distribution of projections in the infinite-sample limit is proven to be always Gaussian 6 , this null hypothesis might seem reasonable.
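The tensor version of PP is a single contraction against a rank-1 label tensor. A sketch with one planted gene (the uniform α_j and γ_m are our assumption; only β_k carries the infected-vs-control distinction):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 1000
x = rng.normal(size=(N, 5, 2, 3))            # toy x_ijkm
x[0, :, 0, :] += 3.0                         # gene 0 up in infected (k = 1) ...
x[0, :, 1, :] -= 3.0                         # ... and down in control (k = 2)

# rank-1 label tensor y_jkm = alpha_j beta_k gamma_m encoding only the
# infected-vs-control distinction
alpha, beta, gamma = np.ones(5), np.array([1.0, -1.0]), np.ones(3)
y = np.einsum('j,k,m->jkm', alpha, beta, gamma)

# PP: b_i = sum_{jkm} x_ijkm y_jkm
b = np.tensordot(x, y, axes=([1, 2, 3], [0, 1, 2]))
top = np.argsort(-np.abs(b))[:10]
```

The planted gene ends up with |b_0| far above the noise level, so it ranks first; this is the projection that the fifth gene-mode singular value vector reproduces in the text.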
Nonetheless, the individual distribution of gene expression is far from Gaussian and is rather close to a negative binomial distribution, and when the number of samples is not large enough, the distribution of projections does not converge to a Gaussian distribution at all. Thus, a more straightforward rationalization is needed. Therefore, we generated a null distribution by shuffling i in each sample and recomputing the singular value vectors attributed to mRNAs and miRNAs (u_2i and u_2k in the first and second data sets) and to genes (u_5i in the third data set). Then P-values were recomputed using the generated null distribution and were corrected using the BH criterion in order to obtain genes associated with significant adjusted P-values. Figure 4 (A) shows the histogram of raw P-values computed using the null distribution generated by shuffling one hundred times when the miRNAs in the first data set were considered. As there are obviously too many P-values near 1, we excluded some miRNAs with low expression to obtain a P-value distribution more coincident with the null distribution. Figure 4 (B) shows the histogram of raw P-values restricted to the top 500 most highly expressed miRNAs; this seems more coincident with the null distribution. We then found that twelve miRNAs are associated with adjusted P-values less than 0.1. Table 6 shows the comparison of selected miRNAs between TD-based unsupervised FE and the null distribution generated by shuffling. Although the threshold P-values differ between the two, the selected miRNAs are quite coincident. A threshold P-value of 0.01 was empirically employed for PCA- and TD-based unsupervised FE, as it often gave us biologically reasonable results. Thus, P = 0.01 under the Gaussian null hypothesis corresponds to P = 0.1 when the null distribution is generated by shuffling.
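The shuffling-based null distribution can be sketched as follows. This is a minimal sketch: for brevity it recomputes only the leading singular vector per shuffle, whereas the text recomputes the singular value vectors of interest, and the function name is illustrative:

```python
import numpy as np

def shuffle_null_pvalues(X, u, n_shuffle=100, seed=0):
    """Empirical P-values for the components of a singular value vector u
    attributed to genes: the null is generated by shuffling the gene order
    within each sample and recomputing the leading singular vector."""
    rng = np.random.default_rng(seed)
    null = []
    for _ in range(n_shuffle):
        Xs = np.apply_along_axis(rng.permutation, 0, X)  # shuffle genes per sample
        Us, _, _ = np.linalg.svd(Xs, full_matrices=False)
        null.append(np.abs(Us[:, 0]))
    null = np.concatenate(null)
    # one-sided empirical P-value for each |u_i| against the pooled null
    return np.array([(null >= abs(ui)).mean() for ui in u])
```

Shuffling destroys any gene-aligned structure while preserving each sample's marginal distribution, so components of u that survive as outliers against this null reflect genuine cross-sample coherence rather than heavy-tailed expression values.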
Although this discrepancy must be resolved in the future, we conclude that their performances are quite similar. Figure 5 (A) shows the histogram of raw P-values computed using the null distribution generated by shuffling one hundred times when the mRNAs in the first data set were considered. As there are obviously too many P-values near 1, we excluded some mRNAs with low expression in order to obtain a P-value distribution more coincident with the null distribution. Figure 5 (B) shows the histogram of raw P-values restricted to the top 3000 most highly expressed mRNAs; this seems more coincident with the null distribution. We then found that 69 mRNAs are associated with adjusted P-values less than 0.1. Table 7 shows the comparison of selected mRNAs between TD-based unsupervised FE and the null distribution generated by shuffling. Although the threshold P-values differ between the two, the selected mRNAs are quite coincident. A threshold P-value of 0.01 was empirically employed for PCA- and TD-based unsupervised FE, as it often gave us biologically reasonable results. Again, P = 0.01 under the Gaussian null hypothesis corresponds to P = 0.1 when the null distribution is generated by shuffling. Although this discrepancy must be resolved in the future, we conclude that their performances are quite similar. Figure 6 (A) shows the histogram of raw P-values computed using the null distribution generated by shuffling one hundred times when the miRNAs in the second data set were considered. As significant P-values were unlikely to be obtained, we did not select miRNAs associated with significant P-values. Figure 6 (B) shows the histogram of raw P-values computed for the mRNAs in the second data set; there are no peaks around P = 1. We then found that 262 mRNAs are associated with adjusted P-values less than 0.1. Table 8 shows the comparison of selected mRNAs between TD-based unsupervised FE and the null distribution generated by shuffling.
Although the threshold P-values differ between the two, the selected mRNAs are well coincident. A threshold P-value of 0.01 was empirically employed for PCA- and TD-based unsupervised FE, as it often gave us biologically reasonable results. Again, P = 0.01 under the Gaussian null hypothesis corresponds to P = 0.1 when the null distribution is generated by shuffling. Although this discrepancy must be resolved in the future, we conclude that their performances are quite similar. Figure 7 (A) shows the histogram of raw P-values computed using the null distribution generated by shuffling one hundred times when considering the genes in the third data set. As there were too many P-values less than 0.2, we excluded some mRNAs with low expression to obtain a P-value distribution more coincident with the null distribution. Figure 7 (B) shows the histogram of raw P-values restricted to the top 2780 most highly expressed mRNAs; this seems more coincident with the null distribution. We then found that 48 mRNAs are associated with adjusted P-values less than 0.1. Table 9 shows the comparison of selected mRNAs between TD-based unsupervised FE and the null distribution generated by shuffling. Although the threshold P-values differ between the two, the selected mRNAs are well coincident. A threshold P-value of 0.01 was empirically employed for PCA- and TD-based unsupervised FE, as it often gave us biologically reasonable results. Once more, P = 0.01 under the Gaussian null hypothesis corresponds to P = 0.1 when the null distribution is generated by shuffling. Although this discrepancy must be resolved in the future, we conclude that their performances are quite similar.
In the previous section, we explained why PCA- and TD-based unsupervised FE work well (because singular value vectors correspond to the projection onto the centroid subspace obtained by K-means) and showed how the criterion of selecting genes with adjusted P-values less than 0.01, computed under the null hypothesis that singular value vectors obey a Gaussian distribution, empirically coincides with the criterion of selecting genes with adjusted P-values less than 0.1, computed from the null distribution generated by shuffling. There are many points to be discussed. In the above examples, we only dealt with cases wherein only two clusters could be distinguished in a one-dimensional space (i.e., by only one singular value vector). Considering cases with more clusters might be challenging, as projections onto subspace centroids do not have a one-to-one correspondence with singular value vectors; the coincidence between the projection onto the centroid subspace and the singular value vectors holds only between the spaces spanned by them, not between the individual vectors. Despite this, TD- and PCA-based unsupervised FE applied to more than two classes is known to work about as well as in the case with only two clusters. Moreover, although we could only discuss cases with a finite number of clusters, PCA- and TD-based unsupervised FE are also known to work in detecting parameter dependence, e.g., time development 11, 12 . Extending the discussion here to regression analysis without any clusters will be the next step. One might also wonder whether we need TD if the singular value vectors attributed to genes are common between TD and PCA. First, in the integrated analysis of mRNA and miRNA, TD-based unsupervised FE outperformed PCA-based unsupervised FE 10 . Similarly, TD-based unsupervised FE outperformed PCA-based unsupervised FE in the integrated analysis of gene expression and DNA methylation 13 .
Thus, TD-based unsupervised FE is required when integrated analysis is targeted. Even when no integrated analysis was targeted, TD-based unsupervised FE can give singular value vectors that are more coincident with biological clusters (Fig. 8) . Thus, despite the apparent equality of the singular value vectors attributed to genes between TD and PCA, TD-based unsupervised FE is a more useful strategy than PCA-based unsupervised FE. Although we did not state this clearly, conventional gene selection strategies based on statistical tests are known to fail when applied to the first, second, and third data sets 9, 10 ; they always selected too many or too few genes, mRNAs, and miRNAs, in contrast to TD-based unsupervised FE, which could always select a restricted number of genes, from tens to hundreds. One might also wonder why we did not employ the null distribution generated by shuffling instead of the unjustified Gaussian distribution with PCA- and TD-based unsupervised FE. As seen above, employing the null distribution generated by shuffling is not straightforward; in some cases, e.g., the first and third data sets mentioned above, we needed to exclude lowly expressed genes manually, whereas this was not required for the second data set. No miRNAs expressed significantly distinctly between controls and cancers in the second data set were detected with the null distribution generated by shuffling. In addition, the number of lowly expressed genes to be removed cannot be decided uniquely. On the contrary, the criterion of selecting genes associated with adjusted P-values less than 0.01, assuming the null hypothesis that singular value vectors obey a Gaussian distribution, is more robust; it often gives a restricted number of genes without excluding lowly expressed genes. Although why this works so well must be explored in the future, it is an empirically more useful strategy than null distributions generated by shuffling.
One may also wonder why we did not employ the centroid subspace, S_b, instead of singular value vectors, given that the two are equivalent for optimal clusters and the meaning of the centroid subspace is easier to understand. First, we would need to apply K-means, which often fails on unbalanced data sets composed of clusters with very distinct numbers of samples. Second, K-means always identifies the primary clusters. In the case of SARS-CoV-2 (the third data set), however, the distinction between infected and control cell lines was detected using the fifth singular value vector, whose contribution would probably be neglected by K-means because it is too small. In addition, singular value vectors can be computed in a fully unsupervised manner that does not require any labeling. Considering these advantages, it is reasonable to use singular value vectors instead of the centroid subspace, despite the latter's apparent usefulness. Further, as the y_j used to compute the projection b is decided manually, b can be computed even if the biological features that y_j assumes, such as clusters, do not exist; this might result in wrong conclusions. However, if there are no clusters at all, no corresponding singular value vectors attributed to samples and coincident with y_j are obtained, so we have an opportunity to notice the misunderstanding. Thus, the use of singular value vectors rather than the projection b might be advantageous. One might also wonder why other, more frequently used TDs, such as CP decomposition 3 , were not employed instead of HOSVD. This might be understood as follows. In the above description, we could relate the singular value vectors obtained by HOSVD to the centroid subspace because the singular value vectors attributed to genes are common between HOSVD and PCA. This equivalence is broken if HOSVD is replaced with other TDs.
When we invented TD-based unsupervised FE, we also tested other TDs3; HOSVD always outperformed them when used for feature selection. The equivalence of HOSVD and PCA might explain why HOSVD could outperform other popular TDs as a feature selection tool.

Table 4. Confusion matrix of selected miRNAs between TD-based unsupervised FE and PP in the second data set

                                  PP: adjusted P_k > 0.01   PP: adjusted P_k < 0.01
TD-based FE: adjusted P_k > 0.01            316                        0
TD-based FE: adjusted P_k < 0.01              0                        3

Table 5. Confusion matrix of selected genes between TD-based unsupervised FE and PP in the third data set

                                  PP: adjusted P_i > 0.01   PP: adjusted P_i < 0.01
TD-based FE: adjusted P_i > 0.01          21582                       52
TD-based FE: adjusted P_i < 0.01             60                      103

Table 6. Confusion matrix of selected miRNAs between TD-based unsupervised FE and shuffling in the first data set

                                  shuffling: adjusted P_k > 0.1   shuffling: adjusted P_k < 0.1
TD-based FE: adjusted P_k > 0.01               488                             1
TD-based FE: adjusted P_k < 0.01                 0                            11

Table 7. Confusion matrix of selected mRNAs between TD-based unsupervised FE and shuffling in the first data set

                                  shuffling: adjusted P_i > 0.1   shuffling: adjusted P_i < 0.1
TD-based FE: adjusted P_i > 0.01              2928                             0
TD-based FE: adjusted P_i < 0.01                 3                            69

Table 8. Confusion matrix of selected mRNAs between TD-based unsupervised FE and shuffling in the second data set

                                  shuffling: adjusted P_i > 0.1   shuffling: adjusted P_i < 0.1
TD-based FE: adjusted P_i > 0.01             33736                            53
TD-based FE: adjusted P_i < 0.01                 0                           209
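The confusion matrices in Tables 4-8 simply tabulate, for each feature, whether it passes each of the two selection criteria being compared. A minimal sketch of this tabulation (the helper name and toy feature sets are hypothetical, not from the original analysis):

```python
def confusion_matrix(selected_a, selected_b, all_features):
    """2x2 confusion matrix comparing two feature-selection results.
    Keys are (in_a, in_b) membership pairs; values are feature counts."""
    a, b = set(selected_a), set(selected_b)
    table = {(False, False): 0, (False, True): 0,
             (True, False): 0, (True, True): 0}
    for f in all_features:
        table[(f in a, f in b)] += 1
    return table

# hypothetical example: 10 features, two partially overlapping selections
table = confusion_matrix({1, 2, 3}, {2, 3, 4}, range(10))
```

A nearly diagonal table (large counts in the (False, False) and (True, True) cells), as in Tables 4-8 above, indicates that the two criteria select essentially the same features.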
References

1. Statistical methods for identifying differentially expressed genes in RNA-seq experiments
2. Selection of differentially expressed genes in microarray data analysis
3. Unsupervised Feature Extraction Applied to Bioinformatics
4. Regression shrinkage and selection via the lasso
5. Projection pursuit
6. Projection pursuit in high dimensions
7. Identification of differentially expressed genes in microarray data in a principal component space
8. The characteristic direction: a geometrical approach to identify differentially expressed genes
9. A new advanced in silico drug discovery method for novel coronavirus (SARS-CoV-2) with tensor decomposition-based unsupervised feature extraction
10. Identification of miRNA signatures for kidney renal clear cell carcinoma using the tensor-decomposition method
11. Principal component analysis based unsupervised feature extraction applied to budding yeast temporally periodic gene expression
12. Tensor decomposition-based unsupervised feature extraction applied to single-cell gene expression analysis
13. Tensor decomposition-based and principal-component-analysis-based unsupervised feature extraction applied to the gene expression and methylation profiles in the brains of social insects with multiple castes
14. The Concise Encyclopedia of Statistics
15. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing