key: cord-0307264-sz6k4p3h authors: Muralidharan, Sachin; Ali, Sarah; Yang, Lilin; Badshah, Joshua; Zahir, Farah; Ali, Rubbiya; Chandra, Janin; Frazer, Ian; Thomas, Ranjeny; Mehdi, Ahmed M. title: E.PAGE: A curated database and enrichment tool to predict gene modules associated with gene-environment interactions date: 2022-01-04 journal: bioRxiv DOI: 10.1101/2022.01.03.474848 sha: ec9c4be963da837c68769656dda04a693c1561ae doc_id: 307264 cord_uid: sz6k4p3h Background The purpose of this study was to manually and semi-automatically curate a database and develop an R package that will provide a comprehensive resource to uncover associations between biological processes and environmental factors in health and disease. We followed a two-step process to achieve the objectives of this study. First, we conducted a systematic review of existing gene expression datasets to identify those with integrated genomic and environmental factors. This enabled us to curate a comprehensive genomic-environmental database for four key environmental factors (smoking, diet, infections and toxic chemicals) associated with various autoimmune and chronic conditions. Second, we developed a statistical analysis package that allows users to interrogate the relationships between differentially expressed genes and environmental factors under different disease conditions. Results The initial database search run on the Gene Expression Omnibus (GEO) and the Molecular Signature Database (MSigDB) retrieved a total of 90,018 articles. After title and abstract screening against pre-set criteria, a total of 186 studies were selected. From those, 243 individual sets of genes, or “gene modules”, were obtained. We then curated a database containing four environmental factors, namely cigarette smoking, diet, infections and toxic chemicals, along with a total of 25789 genes that had an association with one or more of these gene modules. In six case studies, the database and statistical analysis package were then tested with lists of differentially expressed genes obtained from the published literature related to type 1 diabetes, rheumatoid arthritis, small cell lung cancer, COVID-19, cobalt exposure and smoking. On testing, we uncovered statistically enriched biological processes, which revealed pathways associated with environmental factors and the genes. Conclusions A novel curated database and software tool is provided as an R Package. Users can enter a list of genes to discover associated environmental factors under various disease conditions. Organisms are constantly being exposed to a wide range of environmental triggers that influence gene expression, resulting in several diseases. Environmental factors, such as drugs, toxic chemicals, smoke, temperature, dietary components and infections are considered modifiable causes of disease through their effects on biological processes, and in response, the expression of many genes is altered (1) . It is estimated that environmental factors account for approximately 70% percent of all autoimmune diseases and 80% of all chronic diseases (2) . These large proportions indicate that environmental exposures are an important contributor to disease, and there is ample evidence to support complex interrelationships between various environmental and genomic factors for disease causation (3) . Manipulation of environmental triggers and the host immune system during the clinical and preclinical stages of a disease will offer significant insight and guide early intervention for many disorders (4) . In the era of Big Data technologies, several genomic databases exist to explore differential expression of genes under various clinical conditions (5, 6) . However, to our knowledge there is currently no computational tool that can use information from existing large-scale databases to predict gene-environment relations. Therefore, in this study we formulated an integrated and comprehensive database that will provide insights of how environmental factors are associated to gene expression and disease, and leading to the identification of potential therapeutic strategies for the prevention and control of diseases attributable to both environmental and genetic factors. We followed a two-step approach to conduct this study. First, we conducted a systematic review using a standard approach to identify all studies that used integrated datasets containing comprehensive information about environmental and genetic risk factors for various diseases. Second, we curated a database and developed a statistical analysis package to enable the user to understand the relationships between differentially expressed genes and select environmental factors. Step 1: Systematic review: The aim of this step was to identify the relevant published literature from where we could obtain existing data pertinent to gene expression changes in response to an environmental factor. In detail the systematic review was conducted as follows: We undertook a comprehensive literature and database search using PubMed, Gene expression omnibus (GEO), and Gene set enrichment analysis (GSEA) databases (7) . All databases were searched from their inception until 16th October 2020. The reference lists of all the retrieved studies were examined to identify additional studies. The search terms and their synonyms related to environmental factors and gene expression. The keywords used included medical subject headings (MeSH) terms, e.g., ( Pre-set inclusion criterion for studies to be considered eligible were: • Only articles written in English • Participants of any age group and both genders. • Since most of the experimental trials involving environmental factors were carried out in humans or mice, we included hits for Homo sapiens and Mus musculus. • Four specific environmental factors were chosen, based on the previous published evidence for major contribution as an environmental factor affecting gene expression (8) . Specifically, o Cigarette smoking -Includes data related to the practice of tobacco smoking and inhalation of tobacco smoke. o Diet -Includes data on the various types and quantities of food consumed by a person. o Infections -Includes data on infections caused by pathogenic organisms such as viruses, bacteria, fungi, protozoa and parasites. o Toxic chemicals -Includes data on substances such as metals or other chemical agents that are hazardous to human health if inhaled, ingested or absorbed. • We included published data from datasets, series and platforms. Samples were excluded if they consisted of unpublished data. We did not limit the search specific for any disease. We did not include any dataset relating to mRNA, protein, CDS or small non-coding RNAs like miRNA or siRNA. Two reviewers screened the abstracts and citations independently at the same date and time and using the same search parameters. We identified articles that met the inclusion criteria. After title and abstract screening, studies were selected for full-text review. After the full length article review, those studies that met the inclusion criteria were selected for data extraction (7) . Two reviewers independently extracted data. The specific features extracted from each article were: (1) Differential gene expression data; (2) specific description of the type of data collected; (3) specific keywords related to the differentially expressed genes for each dataset, including disease, sample condition and pathways. These were manually searched in the abstract, demographics and result sections of each publication. Data were extracted and coded in a spreadsheet to collate information from each study. The data were combined and any anomalies between reviewers were resolved by a third reviewer. The methodological quality was checked before including the data, using the Q-Genie tool (9) . We recorded whether the study used a standard microarray procedure and descriptions of the sample data, causes of up-and downregulation of genes and any other specific changes in the gene expression. Step 2: Software generation The statistical analysis package E.PAGE (Environmental Pathways Affecting Gene Expression) (github.com/AhmedMehdiLab/E.PAGE) was written in R version 4.0.3 (www.Rproject.org/) and developed using RStudio (rstudio.com). Using publicly available packages (tidyverse www.tidyverse.org, Seurat satijalab.org/seurat/) as dependencies, the package performs enrichment analysis as previously described by Mehdi and colleagues (10) . For the enrichment analysis of gene modules, we followed standard methods to perform gene set enrichment analysis. We compared the number of genes that had a specific gene modules against those that did not. A hypergeometric distribution was used to determine a p-value, which was corrected using false discovery rate (FDR) for multiple hypothesis testing using the Benjamini and Hochberg correction method (Hochberg, 2018). The results are filtered based on the adjusted P value, and those with P ≤ 0.05 are displayed to the user. Fold enrichment was calculated by taking the ratio of a set of genes containing a specific gene modules, and the total set of genes was obtained by taking the union of all the collected gene modules (10) . Adjusted fold enrichment was measured as a ratio of the fold enrichment value to the negative log of adjusted P value. This log fold change value was used to determine the significance of the enriched genes specific for an gene module (10) . An odds ratio then was measured to determine the probability of finding the set of enriched genes specific to an gene module (Szumilas, 2010). Step 3: Case studies: We used six case studies to test our enrichment tool, these studies were not used in database curation. Case study 1 involves gene expression data in peripheral blood mononuclear cells (PBMC) in children with type 1 diabetes (11) . Gene expression changes were identified using microarray analysis from 43 patients with new onset T1D compared with 24 healthy controls. The gene expression data set in case study 2 is taken from the GEO database (microarray datasets; GSE12021, GSE55457, GSE55584 and GSE55235) that includes samples from 45 patients with rheumatoid arthritis, compared with 29 healthy control samples (12) . Case study 3 includes gene expression data from 23 small cell lung cancer samples and 42 healthy lung tissues (13) . The gene expression data from the case study 4 was taken from cobalt-exposed rat liver derived cells (14) . The final two case studies used differentially expressed genes extracted from single-cell expression data. Case study 5 was based on single-cell RNAseq data from COVID-19 patients, comparing severe and healthy cases in peripheral immune environments (15) , while case study 6 was based on a single-cell RNAseq-based atlas of epithelial cellspecific responses to smoking (16) . For single-cell RNA seq data, E.PAGE used a Seurat object (with clustering performed) as an input and performs differential expression analyses between the clusters to uncover lists of genes to compute related enriched gene modules. The initial electronic search of GEO and MSigDB database identified a total of 90,018 studies ( Figure 1 ). Title and abstract screening of retrieved studies resulted in a total of 3547 studies which had potential data related to environmental factors. After full text examination of 3547 studies, 3008 studies were excluded since they did not provide any differential gene expression data associated with any of the four environmental factors. A total of 243 datasets were obtained from 186 studies and the gene expression data were retrieved and collated to form a database. Figure 1 illustrates a flow chart of all the steps taken to obtain the data that satisfy the required parameters. The overall structure of E.PAGE is shown in Figure 2 We first tested whether query signatures associated with T1D and RA could recover common pathways associated with these autoimmune disease. We used 291 DE genes uncovered from 43 patients with new-onset T1D as compared to 24 healthy controls (8) ( Table 2 ) and 229 DE genes from 45 samples from patients with RA, compared with 29 healthy control samples (12) ( Table 3 ). The statistical enrichment using E.PAGE identified that the genes in both datasets are involved in Immune response. Other significant gene modules that were common to both diseases include Interferons, IL-12 and Transcription regulation. These processes are all well known to be involved in RA and T1D (17) . Insulin resistance and Xenobiotic metabolism, which are both believed to be associated with T1D, were uncovered using E.PAGE and validate the utility of the platform ( We next studied gene modules associated with small cell lung cancer. The query signature containing 71 DE genes was derived from 23 clinical small cell lung cancer samples and 42 healthy control tissues (13) . We found that several lungs cancer associated gene modules were infections were was the most common environmental factor associated with the DE genes statistically significant ( (Table 4 ). Other interesting gene modules that are known to be involved in lung cancer were also identified, including Lung cancer, Cigarette smoking, Airway epithelium and Immune response. Case study 4: Genotoxicity associated with cobalt exposed gene expression We next used E.PAGE to understand the gene expression pathways involved in cobalt exposure. We used 27 DE genes uncovered by measuring the effect of cobalt exposure on gene expression in two rat liver derived cell lines using microarray analysis (14). Cobalt exposed DE genes were associated with chemical induced gene expression. Other significant gene modules include genotoxicity, carcinogen, non-genotoxic, hepatocarcinogens, and liver-based in vitro models (Table 5) . From a single-cell RNA sequencing dataset (15), we first conducted a standard Seurat pipeline to determine the graph based clusters (18) . We then analysed enrichment of gene modules based on DE genes in Seurat clusters in COVID-19 and healthy cases. As expected, we identified COVID-19, SARS-COV2 modules. Significant enrichment was also observed for the Cigarette smoking amongst the top modules that were previously shown to be COVID-19related (15, 19, 20) (Table 6 ). Case study 6: Single-cell Smoking dataset: As a final case study, we attempted to identify enriched gene modules related to smoking using a single cell RNA sequencing dataset which contained data of smokers vs non-smokers (16) . After processing the data using the Seurat pipeline and analyzing the single-cell expression data, gene set enrichment identified Epithelial gene expression, Cigarette smoking, Airway epithelium, and Chronic obstructive pulmonary disease as the top gene modules with highly significant p-values, confirming that smokingrelated pathways were correctly predicted using E.PAGE (Table 7) . Furthermore, smoking associated with gene signatures of lung-associated diseases such as Lung cancer, Cystic fibrosis, as well as with Carcinogen and respiratory infections such as Influenza and COVID-19. Environmental factors are known to influence the development of disease, with or without combination with genetic factors, however there is currently no curated database and enrichment tool to identify the genes and the corresponding biological processes associated with these environmental conditions. We developed E.PAGE, a database and enrichment tool to understand the gene-environment relationship. Our database was developed based on experimental evidence obtained from the published literature to establish a relationship between environmental factors, differentially expressed genes and specific biological processes associated with the genes. To set up the database, we used cigarette smoking, infections, toxic chemicals and diet, as they constitute the primary environmental factors influencing disease outcomes (4). We made every effort to ensure completeness, accuracy and currency of the database. The current database has 243 datasets which consists of 25789 genes in total. The largest number of datasets relate to diet and infections due to the long research history of these two environmental factors and disease. We manually curated each dataset using specific keywords and a brief description, abstract published with these datasets. We then developed an enrichment tool that uncovers modules associated with genes of interest using the methods we previously published (10) . In six case studies, we tested E.PAGE with sets of DE genes available from the literature. Specifically, we tested two gene lists associated with autoimmunity -T1D and RA -along with those related to small cell lung cancer, COVID-19 and smoking subjects. To confirm the effect of toxic chemicals on differential gene expression, we also used gene expression data from a study on cobalt exposure. On testing T1D and RA associated DE genes, we found a large number of gene modules related to immune responses, which supports previous studies on how malfunction in the adaptive immune response results in activation of self-reactive T cells. We also obtained a substantial number of environmental modules associated with viral and bacterial infections, which supports recent findings on how bacterial and viral infections are implicated in immune response signaling in autoimmune disease pathogenesis. The T1D and RA associated DE genes were found to be primarily enriched in infection-associated gene modules and less in gene modules associated with the environmental factors diet, cigarette smoking or toxic chemicals. This information supports the hypotheses that infection-associated immune responses are major contributors to the development of T1D and RA (21) (22) (23) . A substantial number of genes involved in the central nervous system were also related to RA, consistent with other evidence (24) . When small cell lung cancer genes were tested, we found a large number of environmental modules for DE genes to be related to lung cancer, as expected. We also found an expected link to cell cycle, since cell cycle checkpoints are disrupted leading to tumour development and cancer progression. Genes relating to cytoprotective function, mitotic spindle formation are also generally dysregulated in cancer. Recent studies that show a high incidence of retrovirus in lung small cell cancer suggest a possible direct link between infections and small cell cancer (25) . To further assess associations between environmental factors with toxic chemicals, we tested genes differentially expressed due to cobalt exposure against the E.PAGE database. On testing, we found the modules Genotoxicity and Carcinogen to be enriched. We also obtained a substantial number of genes differentially expressed due to toxic chemicals as environmental factors, supporting the validity of the tool to identify potential involvement of toxic chemicals on DE genes involved in critical functions in a relevant datasets. On testing gene expression data sourced from patients with COVID-19, we found that genes differentially expressed in severe cases were linked to gene modules common between bronchoalveolar and peripheral immune environments (15, 20) . This finding shows how the E.PAGE database can be used to find commonalities between two sets of differentially expressed genes, even if they may not have many genes in common. On testing the single-cell gene expression data for smoking we found gene modules for Cigarette smoking, Airway epithelium, Epithelial gene expression, and Chronic obstructive pulmonary disease. Additional pathways that are well known to be altered by cigarette smoking were identified. Therefore, E.PAGE was able to find relevant significantly enriched gene modules. From the above case studies, we found that our database is highly reliable and has the potential to establish a link between environmental factors and important biological processes. In the case studies, we generally obtained a higher number of DE genes related to infection as an environmental factor. Though this link with infection may be valid, there is a possibility of dataset bias due to limited type of input data such as gene list, similarities between infection and tissue damage -associated immune responses. Additionally, our study is limited to four types of environmental variables, therefore to increase usage towards wider community more environmental datasets need to be integrated. A key benefit of this research is to predict gene-environment interactions to identify novel associations between environmental factors and disease, and to inform hypothesis synthesis and target selection. Thereby, it allows scientists and epidemiologists to dissect which genes may be influenced by environmental exposures in different disease conditions. We illustrate this by using examples from type-1 diabetes, rheumatoid arthritis, small cell lung cancer and COVID-19 datasets. The current study lends itself to future extension to additional environmental variables such as alcohol, physical activities, life-style factors, which could facilitate developing disease risk prediction models. The authors declare that they have no competing interests Environmental epigenomics and disease susceptibility Environmental triggers and autoimmunity Discovering environmental causes of disease A Potential Link between Environmental Triggers and Autoimmunity Big Data Analytics for Genomic Medicine Big data analytics in healthcare: promise and potential A Systematic Review of Interventions to Improve Diabetes Care in Socially Definition, diagnosis and classification of diabetes mellitus and its complications. Part 1: diagnosis and classification of diabetes mellitus provisional report of a WHO consultation Assessing the quality of published genetic association studies in meta-analyses: the quality of genetic studies (Q-Genie) tool A peripheral blood transcriptomic signature predicts autoantibody development in infants at risk of type 1 diabetes Gene Expression in Peripheral Blood Mononuclear Cells from Children with Diabetes Identification of Key Genes and Pathways in Rheumatoid Arthritis Gene Expression Profile by Bioinformatics PRC2 overexpression and PRC2-target gene repression relating to poorer prognosis in small cell lung cancer Exposure to Cobalt Causes Transcriptomic and Proteomic Changes in Two Rat Liver Derived Cell Lines A single-cell atlas of the peripheral immune response in patients with severe COVID-19 Dissecting the cellular specificity of smoking effects and reconstructing lineages in the human airway epithelium Host and Environmental Factors Influencing Individual Human Cytokine Responses Integrating single-cell transcriptomic data across different conditions, technologies, and species Current smoking and COVID-19 risk: results from a population symptom app in over 2.4 million people Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19 Role of Infections in the Pathogenesis of Rheumatoid Arthritis Latent gammaherpesvirus exacerbates arthritis through modification of age-associated B cells The role of innate immune pathways in type 1 diabetes pathogenesis Central nervous system involvement in rheumatoid arthritis : possible role of chronic inflammation and tnf blocker therapy Molecular evidence of viral DNA in non-small cell lung cancer and non-neoplastic lung