key: cord-0795830-kwmennqk
authors: Rodrigues, Pedro; Costa, Rafael S.; Henriques, Rui
title: Enrichment analysis on regulatory subspaces: A novel direction for the superior description of cellular responses to SARS-CoV-2
date: 2022-04-25
journal: Comput Biol Med
DOI: 10.1016/j.compbiomed.2022.105443
sha: ea2dd8ca5a18e4b165584cdc79520db41cb24791
doc_id: 795830
cord_uid: kwmennqk

STATEMENT: Enrichment analysis of cell transcriptional responses to SARS-CoV-2 infection from biclustering solutions yields broader coverage and superior enrichment of GO terms and KEGG pathways against alternative state-of-the-art machine learning solutions, thus aiding knowledge extraction. MOTIVATION AND METHODS: The comprehensive understanding of the impacts of SARS-CoV-2 virus on infected cells is still incomplete. This work aims at comparing the role of state-of-the-art machine learning approaches in the study of cell regulatory processes affected and induced by the SARS-CoV-2 virus using transcriptomic data from both infectable cell lines available in public databases and in vivo samples. In particular, we assess the relevance of clustering, biclustering and predictive modeling methods for functional enrichment. Statistical principles to handle scarcity of observations, high data dimensionality, and complex gene interactions are further discussed. In particular, and without loos of generalization ability, the proposed methods are applied to study the differential regulatory response of lung cell lines to SARS-CoV-2 (α-variant) against RSV, IAV (H1N1), and HPIV3 viruses. RESULTS: Gathered results show that, although clustering and predictive algorithms aid classic stances to functional enrichment analysis, more recent pattern-based biclustering algorithms significantly improve the number and quality of enriched GO terms and KEGG pathways with controlled false positive risks. Additionally, a comparative analysis of these results is performed to identify potential pathophysiological characteristics of COVID-19. These are further compared to those identified by other authors for the same virus as well as related ones such as SARS-CoV-1. The findings are particularly relevant given the lack of other works utilizing more complex machine learning algorithms within this context.

The infection of humans by Severe Acute Respiratory Syndrome Corona Virus 2 (SARS-CoV-2) represents a major global health concern, with deaths having surpassed 5.8 million according to the World Health Organization 1 . Worldwide initiatives to publicly share data related to the virus provides an opportunity to draw novel insights on the infection by the family of coronaviruses, enabling continuous breakthroughs on the understanding of how the virus can enter and use the cellular machinery to replicate. The knowledge of these mechanisms has been propelled by a generic understanding of the process of viral replication, the transcriptomic properties of the virus, and the study of differentially expressed genes after infection [1, 2] . This later line of research has been primarily assisted by the sequencing of RNA transcripts in infectable cell lines, chosen according to the level of permissivity to infection, as well as cells collected from organisms susceptible to infection, including humans and ferrets [1] . Along this line of research, comparisons of infection in different cells and tissues have been pursued, as well as between different viral strains and families of virus [1] .

Despite the ongoing breakthroughs, the regulatory responses to SARS-CoV-2 infection are still not comprehensively known [3] . For instance, the role played by genes with moderate differential expression in response to infection, or how interactions between multiple genes support or prevent viral replication, are still being actively updated [4] . In addition, most works in this field do not explore the role of machine learning approaches, such as clustering, predictive modeling and biclustering, to aid in the identification of differentially expressed regulatory modules [5] .

This work aims to address these challenges by assessing the extent to which clustering, predictive modeling and biclustering methods aid the identification of elicited biological functions and pathways from transcriptomic data in response to infection. The aforementioned machine learning approaches are placed to model differential regulatory responses after infection from both infectable lung cell lines and in vivo tissue samples. In addition, a comprehensive analysis of the enriched processes and pathways using these methods is undertaken to better understand the viral life-cycle and interactions with the cell, as well as the defence mechanisms employed by the cell against the virus. In particular, SARS-CoV-2 infection is assessed against non-infected cells and infection by other respiratory viruses such as the RSV, IAV (H1N1), and HPIV3. SARS-CoV-2 (α-variant) is targeted in this study, yet the underlying methodological principles extensible to other variants and viruses.

Four major contributions are provided. First, the role of different machine learning approaches to produce relevant gene sets for enrichment analysis is experimentally compared. Second, state-of-the-art biclustering algorithms are assessed and compared against alternative descriptive and predictive stances. The gathered results reveal that biclustering significantly assists the knowledge acquisition process. In particular, the recent class of pattern-based biclustering approaches [6, 7] show distinctive ability to produce a comprehensive set of superiorly enriched biological annotations in well-established knowledge bases without an increase on false positive discoveries. Third, grounded on the previous findings, a novel methodology is provided for a robust and comprehensive analysis of putative regulatory modules associated with virus infection. Fourth, under this methodology, we further identify and highlight the putative role of less-studied biological processes associated with SARS-CoV-2 infection, some consistent with literature on other coronaviruses. In particular, cell-and virus-specific regulatory differences are further identified.

The manuscript is organized as follows: section 2 covers related contributions; section 3 explores the datasets; section 4 presents the proposed methodology; section 5 experimentally compares the role of state-of-the-art machine learning approaches to aid enrichment analysis, together with the description of the identified biological processes. Finally, major concluding remarks are drawn.

with increased or decreased resistance to SARS-CoV-2-induced cell death. The gene with the strongest pro-viral effect was ACE2, the protein facilitating viral entry into the cell. TMPRSS2, another gene with an established role in the SARS-CoV-2 entry, was not identified significantly as pro or anti-viral, whereas the CTSL gene, which encodes the Cathepsin L protease with an identified role in viral entry, was identified as pro-viral.

Due to thrombotic complications being common among COVID-19 patients, the functional and transcriptional changes elicited by SARS-CoV-2 infection in platelets have been further explored [13] . The conducted analysis shows that SARS-CoV-2 infection does indeed alter the platelet transcriptional activity [13] . To assess the significance of differential changes, paired t-Student and Mann-Whitney tests are considered. The authors observed that COVID-19 further induces functional and pathological changes to platelets, including thrombocytopenia (abnormally low numbers of platelets), despite the platelets not presenting detectable levels of ACE2. This may be a contributing factor to the pathophysiology of COVID- 19 .

In [14] , the authors tested the pathogenesis of the SARS-CoV-2 virus on transgenic mice presenting the human ACE2 gene. The infection of these mice by SARS-CoV-2 resulted in high mortality rates, especially in male mice. The transcriptional analysis of the lungs of infected animals revealed increases in transcripts involved in lung injury and inflammatory cytokines, in agreement with findings in humans.

Though there are multiple authors applying machine learning and complex statistical models to COVID-19 patient biometric data, these approaches have been more scarcely applied to transcriptomic data. The objective of this work is to fill this gap, addressing the question of whether the application of these approaches to this data can assist functional enrichment analysis, yielding novel insights into the disease.

In order to assess the proposed methods, we use the transcriptomic data (RNA-Seq) collected by Blanco et al. (Gene Expression Omnibus, GEO accession GSE147507 2 ) [1] . A schematic of its structure is presented in Figure 1 . The samples are divided into different series (a subset of samples), each comprising the behavior of a single cell line among different sets of experimental conditions. These also correspond to particular experiments being run, with each experiment containing multiple replicas of each experimental condition being tested.

Three major cell lines are considered: NHBE (normal human bronchial epithelial), A549 (adenocarcinomic human alveolar basal epithelial) and Calu3 (generated from a bronchial adenocarcinoma) cells. Considering NHBE cells, there is a total of 7 samples of healthy cells (spanned along three series), [1] notes that A549 cells show low viral counts, a fact posited, in agreement with other studies, to be due to the low expression of ACE2 in these cells. Thus, data of A549 cells with added ACE2 (A549-ACE2) is also available. In particular, 6 samples of healthy cells (series 6 and 16), 6 samples of cells infected by SARS-CoV-2 α-variant (series 6 and 16) and, finally, 3 samples of cells after treatment with Ruxolitinib (series 16). For Calu3 cells, there are 3 samples of healthy cells and 3 samples of cells infected by SARS-CoV-2 α-variant (all belonging to series 7). Complementarily to infectable lung cell lines, there is an additional set of 2 samples from a lung biopsy of two healthy human donors (one male, one female), and 2 samples from a single deceased male patient of COVID-19.

Since the distribution of transcript counts is understandably significantly skewed, gene expression was adjusted by a log2-transform for all subsequent analysis. Figure 2 depicts the distribution of expression levels among in-vitro cell lines and lung biopsies. The standard deviation of gene expression within healthy and infected cells was subsquently computed to preliminarily verify if there are significant differences between healthy and infected cells ( Figure 3 ).

In order to select an appropriate statistical test to identify differentially expressed genes, a number of assumptions needs to be assessed. Firstly, we performed a median-based Levene's test [15] to assess the equality of variances for each pair of conditions (Table 1) . For these pairs, out of 19967 genes with non-null expression levels, 18990 had unequal variance for at least one pair of conditions, with p < 0.01. Additionally, we applied the Shapiro-Wilk test [16] to assess whether these genes follow a normal distribution, applied 

J o u r n a l P r e -p r o o f to healthy and infected lines for each cell type with p < 0.05. Overall, 32.8%, 46.1% and 27.4% of all genes in NHBE, A549 and Calu3 cells, respectively, are non-normally distributed.

The results of Levene's test suggest that an assumption of equal variance cannot be made. As such, either an unequal variance (Welch) t-test or its non-parametric alternative, the Mann-Whitney U test [17] , are more suitable for variable selection. As results reveal a significant percentage of non-normally distributed genes, Mann-Whitney U tests are selected as a baseline to identify differentially expressed genes.

To comprehensively unravel the biological processes involved in the cell response to SARS-CoV-2 infection, we explore the role of state-of-the-art machine learning approaches to find discriminative transcriptional modules. In this context, we propose a methodology for the selection and discovery of correlated groups of differentially expressed genes (DEG) composed of three major steps. First, preprocessing and preliminary gene selection are undertaken. Second, we proceed with pattern recognition techniques, namely clustering, predictive modeling and biclustering. For each of these techniques, we apply functional enrichment to the obtained groups of genes in order to identify putative biological functions. Finally, we analyse and interpret the identified functions, relating them to known characteristics of the disease as well as work by other authors. These steps are summarized in 

Given the skewed distribution of gene expression (with a vast majority of genes having low expression levels), we first apply a log2 transform. Then, from the high-dimensional set of over 20 000 genes, we select a subset of DEGs. Due to the non-normal nature of data and unequal variance between control and test groups (as seen in section 3), Mann-Whitney U test is applied with p < 0.05 and p < 0.01. By default a p < 0.01 is used, however for certain cell types p < 0.05 is suggested to guarantee a better coverage of gene candidates to the subsequent learning stage. The null hypothesis of Mann Whitney U test is that the two assessed populations are equal. Therefore, this test can only be applied for pairs of conditions. In particular, we consider the following settings:

• Paired setting: single pairs of conditions, e.g. healthy and SARS-CoV-2 infected NHBE cells or healthy and IAV infected A549 cells;

• Multi-condition setting: genes are selected if they show differential expression in one of the various pairs of conditions presented in Table 1 .

For each set of pairs, a Mann-Whitney U test is applied for each gene. Genes satisfying p < 0.01 or p < 0.05 are selected.

Additionally, ANOVA test [18] can be optionally applied to further ensure the discriminative power of the resulting set of differentially expressed genes. This is suggested if the subsequent learning step benefits from a reduced dimensionality by requiring candidate genes to satisfy two distinct statistical criteria.

The usage of complete data with a simple statistical pre-selection of genes yields results which, depending on the chosen level of statistical significance, can surpass 1000 genes. Applying functional enrichment to these results delivers none or very few enriched processes, which, when present, tend to be very generic cell functions. In this context, the pursue of putative transcriptional modules given by smaller sets of DEG is attempted to obtain more specific biological processes, as well as better statistical significance for each one found. To achieve this goal, three major approaches are applied: clustering, predictive modeling and biclustering.

Given a set of genes G and sample X, the clustering task aims to find groups (clusters) of genes, {J 1 , .., J k } where J i ⊆ G, or conditions, {I 1 , .., I k } where I i ⊆ X, maximizing intra-cluster similarity and inter-cluster dissimilarity. As introduced, the notion of a cluster can assume two distinct forms -a subset of correlated genes along a given set of samples or a subset of correlated samples along a given set of genes. The former is suggested to identify sets of co-expressed genes, which further satisfy delineate discriminative criteria satisfied in the precedent feature selection step (section 4.1). Agglomerative clustering is considered in this work with Euclidean affinity and Ward linkage for two main reasons: the easy visualization of gene proximity using dendrogram, which can also help with the selection of the number of clusters; and inherent algorithmic flexibility, allowing for parameters to be adjusted according to the provided data.

Despite the relevance of clustering for enrichment analysis [19, 20] , it has considerable limitations. Namely, similarity between genes is assessed across all samples. If multiple conditions are used simultaneously, such information will not be taken into account, imposing similarity across all conditions and biasing the detected patterns. To ameliorate this effect, clusters can be found on subsets of conditions or individual conditions. As a result, multiple clustering solutions can be acquired, providing a wide diversity of sets of correlated genes with potential biological relevance. However, such disaggregation prevents a direct comparison between different conditions.

Consider the target multivariate data, described by a set of samples (observations), X, with expression measured along a high-dimensional set of genes, G, and each sample annotated with the corresponding condition (e.g. IAV infection), c ∈ C. Given a set of annotated data samples, the classification task aims at learning a mapping between samples and conditions, X → C, on the basis of the underlying transcriptional activity.

As we seek to better understand potential signaling pathways and gene ontologies involved in the infection by SARS-CoV-2, we mainly focus on which genes are chosen to classify each of the samples by inspecting the learned predictive model. For this reason, we focus on the family associative classifiers given their easier explainability, namely decision trees [21] , random forests [22] , and XGBoost [23] in Python. While not directly interpretable, both random forests and XGBoost provide a metric of the relevance of each gene, which can be used to obtain the set of genes with the highest difference in expression level. In both cases, this metric corresponds to the impuritybased feature importance [24] , which is calculated using the Gini criterion and then averaged across all trees within the model.

Given a set of observations (samples), X={x 1 , .., x n }, genes G={g 1 , .., g m }, a bicluster, B=(I, J), is a subspace defined by a subset genes, J ⊆ G, coexpressed on a subset of conditions, I ⊆ X. The biclustering task aims at identifying a set of biclusters, B, such that each bicluster, B k =(I k , J k ), satisfies specific criteria of homogeneity, dissimilarity and statistical significance.

Homogeneity criteria are commonly guaranteed through the use of a merit function, such as the variance of the values in a bicluster [25] . Merit functions are typically applied to guide the formation of biclusters in greedy and exhaustive searches. In stochastic approaches, a set of parameters that describe the biclustering solution are learned by optimizing a merit (likelihood) function. The pursued homogeneity determines the coherence (co-expression patterning), quality (noise tolerance) and structure (number, size and positioning) of the subspaces in the biclustering solution [7] . A putative regulatory module is in this context given by a subspace of co-expressed genes, i.e. expression pattern on observations (Fig.5) . A co-expressed subspace, B=(I, J), can be described by an order-preserving coherence when the ordering of a subset of genes according to their expression values, π J , is preserved for each sample in I. In alternative, co-expression can be defined by constant coherence where expression values in a bicluster, a ij ∈ B, are described by a ij =c j +η ij , where c j is the expected value of gene g j and η ij is the noise factor, generally a bounded deviation from expectations, η ij ∈ [−δ/2, δ/2]. 

In addition to homogeneity criteria, dissimilarity criteria can be further placed to guarantee the discovery of non-redundant biclusters [6] . Finally, statistical significance criteria guarantee that the probability of a bicluster's occurrence (against a null data model) deviates from expectations [26] .

With biclustering algorithms, we can detect gene sets co-expressed on particular subsets of conditions, allowing for a more comprehensive modular view of regulatory responses to infection by SARS-CoV-2 and other viruses. In particular, when compared to the other proposed methods, biclustering allows for the detection of more specific patterns, such as gene groups with higher or lower expression levels for a particular set of conditions, which are in turn easier to interpret and provide better results with functional enrichment.

To this end, we tested several biclustering algorithms to assess differences between solutions, namely the Cheng and Church [27] , plaid [28] and xMotifs [29] algorithms. In recent years, a clearer understanding of the synergies between biclustering and pattern mining paved the rise of a new class of biclustering algorithms, generally referred to as pattern-based biclustering [7] . Pattern-based biclustering algorithms are inherently prepared to efficiently find exhaustive solutions of biclusters and offer the unprecedented possibility to affect their structure, coherency and quality [30] . This behavior explains why this class of biclustering algorithms are receiving an increasing attention in recent years [7] . In this context, we additionally assess the role of Bic-PAMS (Biclustering based on PAttern Mining Software), which consistently combines state-of-the-art contributions on pattern-based biclustering [6] .

To study the biological processes associated with the gene groups found with the aforementioned methods, we used the EnrichR tool [31, 32] . To assess enrichment of terms in the target knowledge bases, we focus on three major criteria: p-value from Fisher's exact test; the q-value, which adjusts the p-value to control the False Discovery Rate; and the z -score, which takes into account that Fisher's exact method to calculate the p-value produces lower values for longer lists even if they are random. Furthermore, the zscore and p-value are combined as follows: c = ln(p) × z. We prioritize both the adjusted p-value and the combined score to compare the results of the enrichment analysis.

Additionally, EnrichR web tool provides access to multiple knowledge bases 3 . For our analysis, we prioritize Gene Ontology (GO) Biological Process knowledge base [33, 34] (ver. 2021) as it comprehensively characterizes a large amount of genes (14937) against 6036 terms, including recently augmented biological processes on viral infection and immune responses. Additionally, we use the Kyoto Encyclopedia of Genes and Genomes (KEGG) [35] to analyse enriched pathways and diseases. The identified terms are then analysed and compared to known characteristics of the infectious disease and other previous studies, in order to identify potential new insights into the effects of the virus and verify existing ones.

The code used to obtain the results can be obtained in the following GitHub repository: https://github.com/PRodrigues98/Analysis-of-regulatoryresponse-to-SARS-CoV-2-infection. Dependencies: Python version 3.8, NumPy , pandas, scikit-learn and matplotlib libraries.

To address the challenges of classic functional enrichment analyses, we introduced three approaches for identifying DEGs associated with modular regulatory views (section 4): clustering, predictive modeling and biclustering. In the present section, we present the key findings resulting from the application of these methods to study regulatory responses to viral infection, Table 2 : Top 15 GO biological processes ordered by combined score after selecting differentially expressed genes using the Multi-Condition Setting (p < 0.01). GO IDs are linked to descriptions of the biological processes in the QuickGO web browser. as well as an analysis of the identified biological processes within the context of SARS-CoV-2 infection.

To assess the effectiveness of the methods, we begin by presenting, in Table 2 , the functional enrichment on the baseline set of genes obtained directly through gene selection using the Multi-Condition Setting (section 4), p < 0.01. As we can see in Table 2 , there is a considerable number of processes with low p-value and c-scores are significantly lower when compared to the gene sets formed using machine learning methods (Tables 3-8 ). This is likely due to the higher number of genes being analysed together, an observation that supports the need for complementary methods, such as clustering, predictive modeling and biclustering, better suited to identify smaller subgroups of co-expressed genes. Additionally, terms such as negative regulation of bone remodeling (GO:0046851) and negative regulation of bone resorption (GO:0045779), less related to viral infection appear in this analysis, yet do not seem to reoccur within the terms found when using machine learning methods. Subsequent sections 5.1 to 5.3 assess the role of clustering, classification and biclustering searches to functional enrichment analysis, establishing a comparative appraisal.

Considering the application of agglomerative clustering (Ward linkage, Pearson correlation affinity) over differentially expressed genes obtained using the Multi-Condition Setting (p<0.01), seven clusters of co-expressed genes were produced under the Elbow method, and all clusters subsequently subjected to functional enrichment analysis using the EnrichR API [32, 31] and ordered by combined score. The gathered results from GO term enrichment, in Table 3 , show that a high percentage of the top enriched processes are related to response to viral infection, as well as complementary immunerelated responses. The annotation cytoplasmic pattern recognition receptor (PRR) signaling pathway in response to virus, GO:0039528 (directly related to the annotations GO:0140546 and GO:0051607, also within the top enriched processes) corresponds to a set of molecular signals associated with the detection of a virus (binding of viral RNA molecules to certain cytoplasmic receptors). In particular, the detection seems to be performed by the RIG-I PRR, responsible for the detection of RNA synthesized during the process of viral replication, since there are three child processes (GO:0039529 with p = 2.91×10 −3 and c = 905.29; GO:0039535 with p = 7.67×10 −4 and c = 526.08; GO:0039526 with p = 5.26×10 −3 and c = 513.51) associated with this receptor which are statistically relevant. This receptor, along with others, has been identified as part of the inflammatory response to SARS-CoV-2 as well as other coronaviruses [36] . Additionally, the signaling cascade resulting from the detection of viral proteins is associated with the production of Type I interferons and pro-inflammatory cytokines [37] , which can also be observed within the top enriched processes (for instance, terms GO:0060337, GO:0071357 and GO:0060333).

In order to assess cell-specific transcriptional responses, clustering was also performed separately per cell line after gene selection using the Paired Setting (section 5). Appendix tables 12, 13, 14 and 15 list the top enriched GO terms from healthy versus infected expression for NHBE cells, A549 cells, A549 cells with added ACE2, and Calu3 cells, respectively. In particular, type I interferon signaling pathway (GO:0060337), which has several related terms also present within the top enriched processes (for instance, type I interferon signaling pathway, GO:0060337 and cytokine-mediated signaling pathway, GO:0019221, both direct ancestors), appear to be strongly associ- ated to the process of viral infection, further bolstered by the presence of terms response to interferon-beta (GO:0035456) and response to interferonalpha (GO:0035455). It is also interesting to note the presence of the term negative regulation of type I interferon-mediated signaling pathway (GO:0060339) as well as negative regulation of chemokine production (GO:0032682). Chemokines are involved in inflammation and the control of viral infections, and they and their receptors are sometimes mimicked by viruses in order to evade host antiviral immune responses [38] . The presence of these is noteworthy mainly due to directly opposing the other processes related to the activation of an immune response.

Considering normal bronchial epithelial (NHBE) cells (Table 12) , there are multiple additional processes directly related to cellular response to viruses, namely defense response to symbiont (GO:0140546), defense response to virus (GO:0051607), negative regulation of viral genome replication (GO:0045071, also associated with GO:0045069), antiviral innate immune response (GO:0140374), negative regulation of viral process (GO:0048525) and cellular response to virus (GO:0098586). These indicate that NHBE cells were able to identify that they had been infected by a virus and induce an immune response.

For adenocarcinomic alveolar basal (A549) cells (Table 13) , the genes composing all detected processes show higher expression levels for infected cells than for control. Similarly to NHBE cells, there seems to be a prevalence of type I interferon and cytokine related terms. Multiple processes, such as cellular response to type I interferon (GO:0071357), type I interferon signaling pathway (GO:0060337), response to interferon-beta (GO:0035456) reoccur, with most of the common processes linked to interferon and general cytokine responses as well as general responses to viral infection. Interestingly, and also similarly to the NHBE cells, the process negative regulation of type I interferon production (GO:0032480) seems to suggest a potential attempt to reduce immune response. However, the opposite term, positive regulation of type I interferon production (GO:0032481) is also within the top terms (though with higher p-value and lower c-score). This may be due to both pathways being active simultaneously, although it may also reveal overlap in the genes that produce each process (2 out of 5 genes in common between the two processes).

The terms STAT cascade (GO:0097696), positive regulation of JAK-STAT cascade (GO:0046427) and JAK-STAT cascade (GO:0007259) are also considerably enriched in A549 cells, although not significantly enriched in NHBE cells. These terms are related to the JAK-STAT signaling pathway, mediated by a wide variety of cytokines. Not triggering signaling or not regulating it properly, can lead to inflammatory disease [39] , among other issues.

The addition of ACE2 to A549 cell cultures (Table 14) seems to increase the number of processes not directly related with viral infection. Nevertheless, top terms, including positive regulation of heat generation (GO:0031652), regulation of fever generation (GO:0031620) and positive regulation of fever generation (GO:0031622), all associated with acute inflammatory response (the term GO:0002526, which is an ancestor), underlie common physiological symptoms of COVID-19 (α-variant) and mediate immune responses.

Predominantly interferon and cytokine related processes are observed for Calu3 cells (Table 15 ), similarly to NHBE and A549 cells. The term regulation of ribonuclease activity (GO:0060700), also identified for NHBE cells (Table 12) , is noteworthy as ribonuclease (RNase) is an enzyme that catalyzes the decomposition of RNA into smaller components. Particularly, RNase L is associated with innate immune response, and certain viruses have been shown to block this pathway for preventing viral RNA degradation [40] . Finally, in Table 4 , we present the top enriched pathways from the KEGG knowledge base (ver. 2021). The results, similarly to the GO database, include multiple virus related pathways. These are all composed by genes with higher expression values for infected cells than control. Within the top identified terms, there is a prevalence of virus-related pathways. Coronavirus disease (map05171), the sixth enriched term, confirms the association to the target viral infection. The KEGG pathways antigen processing and presentation (map04612), JAK-STAT signaling pathway (map04630), the principal signaling mechanism for a variety of cytokines, IL-17 signaling pathway (map04657), a subset of cytokines with various roles related to inflammatory responses and defence against external pathogens, and NF-kappa B signaling pathway (map04064), a signaling pathway which is activated by the afore-mentioned cytokines and is related to immune responses, all support the processes identified previously in the role played by inflammatory cytokines and related signaling pathways in the infection by SARS-CoV-2. Additionally, the term RIG-I-like receptor signaling pathway (map04622), which is related to the previously mentioned RIG-I receptor, solidifies the relevance of its putative involvement with the anti-viral immune response. Figure 6 presents a decision tree produced for the Multi-Condition Setting (Mann-Whitney U test with p<0.01) using the Gini criterion. We notice the presence of selected genes related to immune and inflammatory re- sponses, namely IL1A, Interleukin 1 Alpha, a protein-encoding gene associated with cytokine activity and inflammatory response; MX1, which is a protein-encoding gene associated with antiviral activity against a variety of RNA viruses; IL31RA, a type I cytokine receptor. We observe that a few genes appear to be sufficient to discriminate conditions, including infection by different viruses. While in a classification problem this can be desirable, it does not support the comprehensive discovery of putative regulatory modules from transcriptomic data. To address this limitation, we now proceed with ensemble algorithms. Random Forest and xGBoost algorithms are selected as ensemble predictive models. As with clustering, we first begin by presenting the processes identified when using the complete data, with multiple cell types and viruses. Tables 5 and 6 gather the top enriched terms for Random Forests and XG-Boost, respectively. Using impurity-based feature importance, 94 genes are identified with xGBoost and 356 genes with Random Forest. These algorithms have 69 genes in common. There are multiple terms present in both models, mostly related to immune system activity. However, there are several processes uniquely identified by each algorithm. ISG15-protein conjugation (GO:0032020), a term identified only within XGBoost selected genes, is related to the cellular protein modification process of ISG15. This protein has an important role in host antiviral response, with several different actions depending on the infecting virus. Most significantly among these actions is the inhibition of viral replication in addition to the modulation of the damage and repair as well as the immune responses [41] .

Amongst the terms identified by XGBoost, multiple ones are related to chemotaxis, the movement of a cell or organism towards a higher or lower concentration of a given substance, and migration of various types of immune cells. In particular, macrophages [42, 43] (GO:0048246 and GO:1905517), natural killer cells [44] (GO:2000501)), eosinophils [45] (GO:0072677 and GO:0048245), neutrophils [46] (GO:0030593 and GO:1990266), which are all types of white blood cells involved with the innate immune response to viral infection.

Additionally, there are multiple terms in both cases associated with cytokine production and related signaling pathways, as well as response to different types of interferons. In addition to these, terms such as regulation of fever generation (GO:0031620), negative regulation of viral process (GO:0048525), inflammatory response (GO:0006954) and negative regulation of viral genome replication (GO:0045071) are also associated with immune response. Together with the previously mentioned signaling of white blood cells, these results show the significance, both innate and adaptive, immune responses by cells infected by this virus.

NHBE cell-specific terms are highlighted in Tables 16 and 17 in appendix for Random Forest and XGBoost, respectively. The p-values for these processes are significantly higher than those obtained using clustering on the same data. Among the top processes in both tables is chronic inflammatory response (GO:0002544). Similarly to what was mentioned for the combined data, there are multiple terms related to the recruitment of certain types of white blood cells. In particular, positive regulation of monocyte chemotactic protein-1 production (GO:0071639), the top term for the Random Forest, is associated to a protein with a pivotal role in the migration of monocytes [47] .

It is also important to note that multiple terms associated with the apoptotic process are present, namely positive regulation of intrinsic apoptotic signaling pathway (GO:2001244), regulation of intrinsic apoptotic signaling pathway (GO:2001242) and positive regulation of apoptotic signaling pathway (GO:2001235). This process, responsible for cell death programming, may indicate that the cell was able to detect the infection by SARS-CoV-2. This hypothesis is further supported by the presence of the term pattern recognition receptor signaling pathway (GO:0002221). These receptors, as previously explained for the related term present in Table 3 , have been associated with the inflammatory response to SARS-CoV-2 [36] .

Considering A549 cells (Tables 18 and 19) , we observe the presence of several terms related with the host response to the virus. In particular, positive regulation of defense response to virus by host (GO:0002230), regulation of defense response to virus by host (GO:0050691), defense response to symbiont (GO:0140546) and defense response to virus (GO:0051607) within the selected DEG by the Random Forest. It is also worth noting once again the abundance of interferon related processes, as well as some cytokine related terms. Among these, negative regulation of cytokine production (GO:0001818) and positive regulation of cytokine production (GO:0001819), which are contradicting, may indicate an attempt to modulate the immune response by the cell or potentially a mechanism of the virus to defend itself from the immune response.

The terms RIG-I signaling pathway (GO:0039529) and cytoplasmic pattern recognition receptor signaling pathway in response to virus (GO:0039528), also enriched for NHBE cells, are once more identified. These receptors play crucial roles in the detection of viruses by cells and the resulting signaling cascade, which in turn leads to the production of Type I interferons and pro-inflammatory cytokines [37] .

With the aim of modeling more complex regulatory patterns to acquire novel knowledge, biclustering is now applied. In particular, biclustering, unlike clustering, can identify regulatory co-expression profiles spanning a subset of overall conditions. In addition, we can go beyond classic correlation assumptions, and accommodate less-trivial (yet relevant) forms of subspace coherence, such as additive and order-preserving expression [6] .

We begin in Table 7 by presenting multiple statistics per algorithm when considering different preprocessing options. BicPAMS and Cheng and Church algorithms present the highest average number of biclusters, while Plaid and xMotifs algorithms provide significantly less for most preprocessing conditions. These differences are driven by the varying coherence, positioning constraints, and underlying searches (greedy in Cheng and Church and xMotifs, stochastic in Plaid, and exhaustive in BicPAMS). It is also important to note that BicPAMS presents delineatedly higher enrichments, and selects a larger amount of genes and lower number of conditions per putative regulatory module. Comparatively, this behavior is particularly relevant since having too many conditions can lead to the identification of more generic genes, while having too few genes can lead to the identification of less significant processes.

The observed differences are further hypothesized to be driven by four unique properties of the pattern-based biclustering searches implemented in BicPAMS. First, the exhaustive nature of the searches combined with the possibility to mask regions of the data space with greater likelihood in an attempt to find a more diversified set of non-redundant biclusters [6] . Second, the ability to consider varying levels of coherence strength and quality, allowing the discovery of regulatory modules with varying degrees of homogeneity [30] . Third, the ability to statistically test biclusters with varying coherence assumptions, ensuring deviations from expectations and therefore minimizing false positive discoveries [26] . Finally, the absence of structural constraints, enabling the discovery of an arbitrarily-high number of putative regulatory modules with flexible positioning [7] , including overlapping genes and conditions. Using these methods, we obtain a set of biclusters per algorithm, where each bicluster consists of a subset of genes and a subset of conditions. By performing functional enrichment on these genes, a set of biological processes is then produced (Tables 8-10 ). In order to analyze these results and obtain a more generic view of how often certain processes occur for each condition, a Table 8 : GO biological processes with highest joint ranks for SARS-CoV-2 conditions. Counts correspond to the normalized number of occurrences of each process within each condition. count is performed for each process identified. This allows the identification of the most commonly occurring processes, and thus provides a better view of which processes are most closely related with a certain condition, while also potentially reducing the amount of more generic biological processes. In addition to this, it provides a direct element of comparison between different cell types for the same condition, or between the same cell type and different viruses. In addition to the number of occurrences of each process, the best c-score and p-value are also provided, in order to compare the statistical relevance of different processes.

We now proceed to a comparative analysis of the biological processes associated with SARS-CoV-2 for all cell types using biclustering (Table 8 ). In order to provide an ordering for the processes taking into account all cell types, each enriched term is first ranked by the number of occurrences it has related to a given condition. Then a fused rank is computed by multiplying the resulting ranks. The multiplication allows for a higher penalization of terms which contain a single very low rank but high ranks for other cell types.

Some of the identified processes have been analogously retrieved with clustering and predictive models. In particular, terms related to cytokine activity, for instance cytokine-mediated signaling pathway (GO:0019221), showing a high number of occurrences for A549 (1.00), NHBE (0.75) and Calu3 (1.00) cells and a lower count for Biopsy cells (0.60). It is interesting to note a seeming tendency for the normalized number of occurrences for Biopsy cells to be lower for most processes, with more generic DNA related processes, such as DNA metabolic process (GO:0006259), DNA repair (GO:0006281) and cellular response to DNA damage stimulus (GO:0006974), possessing higher values. This may be due to biopsy results possibly containing multiple cell types as well as given the very low number of samples of this type of cell (2 healthy and 2 infected).

Other cytokine associated processes include cellular response to cytokine stimulus (GO:0071345), chemokine-mediated signaling pathway (GO:0070098) followed also by cellular response to chemokine (GO:1990869). Chemokines in particular play an important role in multiple processes related with host immune response against viral infection [48, 49] , namely the attraction of leukocytes to the infected tissue. The presence of the terms neutrophil mediated immunity (GO:0002446), neutrophil activation involved in immune response (GO:0002283) and neutrophil degranulation (GO:0043312), further supports this hypothesis. Neutrophils are leukocytes which are the first responders to sites of infection, and have also been identified as the main in- Table 9 : GO biological processes with highest joint ranks for all viruses for the A549 cell type. Counts correspond to the normalized number of occurrences of each process within each condition. filtrating cell population in IAV infection [46] . Despite containing somewhat lower counts than other processes, this set of enriched terms still possess p-values and c-scores well within the range of statistical significance. Another set of previously identified processes is interferon related terms. Interferons are a potent type of cytokines which are associated with antiviral response, with most viruses having developed adaptations to at least partially avoiding this mechanism [50] . In particular, cellular response to interferon-gamma (GO:0071346) and interferon-gamma-mediated signaling pathway (GO:0060333).

We now proceed to a comparative analysis of the processes associated with SARS-CoV-2, RSV, HPIV3 and IAV viruses. In Table 9 we present the results from A549 cells, and in Table 10 from NHBE cells. We observe a considerable number of processes in common with the analysis provided in Table 8 , which is to be expected, since most identified processes are related to immune response.

Cellular response to interferon-gamma (GO:0071346) has somewhat fewer occurrences when compared to the other viruses (0.66 vs 0.85 for RSV, 0.92 for HPIV3 and 0.86 for IAV). Cytokine-mediated signaling pathway (GO:0019221) has a somewhat higher number of occurrences for SARS-CoV-2 and HPIV than others (1.00 and 1.00 vs 0.74 for RSV and 0.92 for IAV). Inflammatory response (GO:0006954) is somewhat muted for SARS-CoV-2 when compared to the other viruses, for both A549 (0.45 vs 0.97 for RSV, 0.92 for HPIV3, 1.00 for IAV) and NHBE cells (0.75 vs 0.91 for IAV, 1.00 for IAVdNS1). These differences are consistent with those found by Blanco-Melo et al. [1] , who found SARS-CoV-2 to induce a limited interferon response when compared with the other viruses but a strong production of cytokines and resulting processes. Overall, there seems to be a tendency for the other viruses to have comparatively higher counts, especially IAV. Table 11 offers a compilation of the number of GO biological processes detected for each of the applied machine learning approaches. As we can see, biclustering provides, by a considerable margin, a highest amount of biological processes, followed by clustering. The predictive models generally provided worse results, with Random Forests providing somewhat better results amongst predictors for the Multi-Condition Setting as well as for NHBE cells. Overall, these results provide initial empirical evidence in favor of pattern-based algorithms to promote the coverage and statistical significance of functional enrichment analysis, offering a way of unraveling less-trivial yet relevant regulatory behavior in knowledge bases. 

This work assesses the impact of different modular views on regulation for gene set enrichment analysis using transcriptional responses to SARS-CoV-2 infection, and further presents non-trivial biological processes associated with virus infection. Amongst state-of-the-art machine learning approaches, particular focus is placed on the pattern-centric views given the observed role of subspace clustering methods to improve the coverage and quality of enriched terms from knowledge bases.

A novel methodology is proposed, combining different computational approaches, which when consolidated provide a more robust view of the putative processes associated with virus infection. To guarantee the discriminative power of the pursued regulatory modules, the complete gene set is initially filtered using a Mann-Whitney U Test, which allows for the selection of genes with statistically relevant differences in expression between healthy and infected cells, as well as between cells infected by different viruses. Other authors perform feature enrichment directly on the set of genes obtained using simplistic statistical tests. However, this stance results in a smaller amount of biological processes detected, as well as a decrease in their quality (measured using Fisher's Exact Test and the combined c-score). So a three-fold, pattern-centric approach -composed by clustering, associative predictive modeling and biclustering algorithms -is suggested to identify DEGs with correlated expression.

Under this methodology, we were able to validate and identify potentially novel biological processes associated with SARS-CoV-2 infection. Among the various enriched terms, the high cytokine induction, Type I interferon related terms, as well as signaling pathways related to these were reoccurring in all analysis performed. In particular, SARS-CoV-2 was found to induce a limited interferon response when compared with other viruses but a strong production of cytokines and associated processes (namely interferon induction and response to these stimuli). These findings are consistent with previous studies [1] . Additionally, we found in multiple analysis the involvement of Pattern Recognition Receptors (with particular emphasis on RIG-I) in the process of infection. This was not identified in previous studies, however it is consistent with other literature on coronaviruses, and further supports the hypothesis that a pattern-centric view of the gene enrichment process can result in novel information.

As directions for future work, we aim at: i) extending the conducted experimental analysis towards other transcriptomic data sources, in particular SARS-CoV-2 related sources, to cross-validate, expand and improve the robustness of the provided findings; ii) assessing the validity of the methodological contributions over proteomic and metabolomic data; iii) addressing the issue of sample interdependence by testing the underlying relationships and designing pattern-centric approaches tailored to the presence of replicates; and iv) explicitly combining available background knowledge [51] to guide the enrichment analysis towards novel findings. In particular, and beyond virus infections, the proposed methodology can be straightforwardly extended towards the study of other pathologies as long as distinct phenotypes or morphological features of interest are present. Paradigmatic examples are cancer and cardiovascular disease analysis. As such, we believe that the proposed computational pipeline will prove to be a useful workflow for generating hypotheses across biomedical domains. 

Top statistically relevant GO biological processes ordered by combined score, for A549 cells (XGBoost)

GO:1900025) 1.49E-02 3304.67 negative regulation of cell morphogenesis involved in differentiation (GO:0010771) 1.49E-02 3304.67 protein localization to vacuole (GO:0072665) 1.49E-02 3012.37 regulation of lymphocyte activation (GO:0051249) 1.49E-02 2764.28 negative regulation of T cell receptor signaling pathway (GO:0050860) 1.49E-02 2062.22 regulation of protein localization to cell periphery (GO:1904375) 1.49E-02 1935.63 negative regulation of protein localization to plasma membrane (GO:1903077) 1.49E-02 1822.54 negative regulation of protein localization to cell periphery

GO:0050848) 1.49E-02 1278.76 regulation of B cell activation (GO:0050864) 1.49E-02 1278.76 regulation of protein localization to membrane (GO:1905475) 1.50E-02 1174.65 regulation of T cell receptor signaling pathway (GO:0050856) 1.60E-02 971.56 regulation of sodium ion transport (GO:0002028) 1.60E-02 938.42 negative regulation of cell-substrate adhesion (GO:0010812) 1.68E-02 824.08 cellular response to tumor necrosis factor (GO:0071356) 1.49E-02 773.33 regulation of interleukin-2 production (GO:0032663) 1.85E-02 657.83 negative regulation of ERK1 and ERK2 cascade (GO:0070373) 1.85E-02 625.39 regulation of substrate adhesion-dependent cell spreading (GO:1900024) 1.85E-02 610.23 interferon-gamma

The architecture of sars-cov-2 transcriptome

Type-i interferon signatures in sars-cov-2 infected huh7 cells

Identification of biomarkers and pathways for the sars-cov-2 infections that make complexities in pulmonary arterial hypertension patients

Identifying transcriptomic signatures and rules for sars-cov-2 infection

Bicpams: software for biological data analysis with pattern-based biclustering

A structured view on pattern mining-based biclustering

Mechanisms of severe acute respiratory syndrome pathogenesis and innate immunomodulation

Type i interferon susceptibility distinguishes sars-cov-2 from sars-cov

Consensus transcriptional regulatory networks of coronavirus-infected human cells

Bulk and single-cell gene expression profiling of sars-cov-2 infected human cell lines identifies molecular targets for therapeutic intervention

Genome-wide crispr screen reveals host genes that regulate sars-cov-2 infection

Platelet gene expression and function in patients with covid-19

Human angiotensin-converting enzyme 2 transgenic mice infected with sars-cov-2 develop severe and fatal respiratory disease

Robust tests for the equality of variances

An analysis of variance test for normality (complete samples)

On a test of whether one of two random variables is stochastically larger than the other, The annals of mathematical statistics

Comparing individual means in the analysis of variance

Clean: Clustering enrichment analysis

Improved functional enrichment analysis of biological networks using scalable modularity based clustering

Induction of decision trees

Random forests: from early developments to recent advancements

Xgboost: A scalable tree boosting system

Trees, forests, and impurity-based variable importance

Biclustering algorithms for biological data analysis: A survey

Bsig: evaluating the statistical significance of biclustering solutions

Biclustering of expression data

Plaid models for gene expression data

Extracting conserved gene expression motifs from gene expression data

Bicpam: Pattern-based biclustering for biomedical data analysis

Enrichr: interactive and collaborative html5 gene list enrichment analysis tool

Enrichr: a comprehensive gene set enrichment analysis web server 2016 update

Gene ontology: tool for the unification of biology

The gene ontology resource: enriching a gold mine

Kegg: kyoto encyclopedia of genes and genomes

Highlight of immune pathogenic response and hematopathologic effect in sars-cov, mers-cov, and sars-cov-2 infection

Sars and mers: recent insights into emerging coronaviruses

Expression and function of chemokines during viral infections: from molecular mechanisms to in vivo function

The jak/stat signaling pathway

Antagonism of the interferon-induced oas-rnase l pathway by murine coronavirus ns2 protein is required for virus replication and liver pathology

Isg15 in antiviral immunity and beyond

Alveolar macrophages are a major determinant of early responses to viral lung infection but do not influence subsequent disease development

Alveolar macrophages are essential for protection from respiratory failure and associated morbidity following influenza virus infection

Natural killer cells and viral infections

Eosinophils and their interactions with respiratory virus pathogens

Neutrophils in viral infections: current concepts and caveats

Monocyte chemoattractant protein-1 (mcp-1): an overview

Airway epithelial derived cytokines and chemokines and their role in the immune response to respiratory syncytial virus infection

Chemokines and chemokine receptors during covid-19 infection

Viruses and interferons

Bic2pam: constraint-guided biclustering for biological data analysis with domain knowledge