key: cord-0325086-zjk0jsak authors: Lv, Y.; Huang, Y.; Xu, X.; Wang, Z.; Ma, Y.; Wu, M. title: Integrated multi-omics data analysis identifies a novel genetics-risk gene of IRF4 associated with prognosis of oral cavity cancer date: 2021-11-21 journal: nan DOI: 10.1101/2021.11.17.21266500 sha: 8ec3fe42b8c644977f722d9cdb31f8e2b8c61825 doc_id: 325086 cord_uid: zjk0jsak Oral cavity cancer (OCC) is one of the most common carcinoma diseases. Recent genome-wide association studies (GWAS) have reported numerous genetic variants associated with OCC susceptibility. However, the regulatory mechanisms of these genetic variants underlying OCC remain largely unclear. By combining GWAS summary statistics (N = 4,151) with expression quantitative trait loci (eQTL) across 49 different tissues from the GTEx database, we performed an integrative genomics analysis to uncover novel risk genes associated with OCC. By leveraging various computational methods based on multi-omics data, risk genes were prioritized as promising candidate genes for drug repurposing in OCC.Using two independent computational algorithms, we found that 14 risk genes whose genetics-modulated expressions showed a notable association with OCC. Among them, nine genes were newly identified, such as IRF4 (P = 2.5x10-9 and P = 1.06x10-4), TNS3 (P = 1.44x10-6 and P = 4.45x10-3), ZFP90 (P = 2.37x10-6 and P = 2.93x10-4), and DRD2 (P = 2.0x10-5 and P = 6.12x10-3). These 14 genes were significantly overrepresented in several cancer-related terms, and 10 of 14 genes were enriched in 10 potential druggable gene categories. Based on differential gene expression analysis, the majority of these genes (71.43%) showed remarkable differential expressions between OCC patients and paracancerous controls. Integration of multi-omics-based evidence from genetics, eQTL, and gene expression, we identified that the novel risk gene of IRF4 exhibited the highest ranked risk score for OCC. Survival analysis showed that dysregulation of IRF4 expression was significantly associated with cancer patients outcomes (P = 8.1x10-5). In summary, we prioritized 14 OCC-associated genes with nine novel risk genes, especially the IRF4 gene, which provides a drug repurposing resource to develop therapeutic drugs for oral cancer. especially the IRF4 gene, which provides a drug repurposing resource to develop therapeutic drugs for oral cancer. Oral cavity cancer (OCC) is one of the most common malignancy diseases [1, 2] . There are approximately 405,000 new OCC patients anticipated each year worldwide [3] . Although the rapid development of cancer diagnoses and therapies, the 5-year survival rate of OCC patients has remained at a dismal 50% [4] . Genetic components have been reported to largely influence OCC, especially variants located in alcohol-relevant genes, including alcohol dehydrogenase 7 (ADH7) and alcohol dehydrogenase 1B (ADH1B) [5, 6] . Hence, understanding the genetic underpinnings of OCC is of considerable interests, which may promote the advancement of individualized treatment strategies to maximize oncologic control and minimize the adverse influence of therapy. With the development of high-throughput microarray and sequencing technologies, genome-wide association study (GWAS) as an effective method is widely applied to simultaneously examine the associations of millions of single nucleotide polymorphisms (SNPs) with complex diseases [7, 8] . The GWAS approach leveraged large-scale sample sizes to enhance the statistical power for discovering disease-associated risk SNPs and genes, such as the UK-Biobank project [9, intractable question that how to effectively pinpoint genuine variants from current existing GWAS summary statistics. In addition, the vast majority of GWAS-reported SNPs were mapped within non-coding genomic regions [20, 21] , indicating these variants potentially have cis-/trans-regulatory effects on modulating the expression level of a given gene rather than changing the function of its protein. Mounting genomics analyses [22] [23] [24] [25] have been carried out to examine the regulatory mechanisms of GWAS-derived genetic variants, and determine whether GWAS-nominated genes whose aberrant alterations of expression contributing to disease pathogenesis due to genetic pleiotropy. For example, He et al. [22] reported a Sherlock integrative genomics tool based on Bayesian inference algorithm, which could incorporate genetic statistical values from GWAS summary data and expression quantitative trait loci (eQTL) data to systematically uncover the regulatory effects of genetic variants on gene expression for complex diseases [23, [26] [27] [28] . Recently, Barbeira and coworkers [29] introduced an efficient GWAS summary result-based extension statistical method termed S-MultiXcan, which could utilize the substantial sharing of eQTL across multiple tissues to enhance the capability to pinpoint potential target genes. The primary goal of the current comprehensive genomics study is to identify genuine risk genes for OCC susceptibility and extend our understanding of genetic architecture implicated in the etiology of OCC. We first combined the converging effects of SNPs around a given gene, and then performed an integrative genomics analysis by incorporating GWAS summary statistics with eQTL to identify novel OCC-relevant genetic variants and susceptible genes. Moreover, by leveraging various computational methods based on multi-omics data, we prioritized risk genes as promising candidates for drug repurposing in OCC. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint To yield genetic statistical information concerning OCC, we downloaded a GWAS summary statistics on oral cavity cancer from the MRC Integrative Epidemiology Unit (IEU) database Written informed consent was signed for all subjects, and the ethical approval was approved by the respective institutional review boards. The PLINK 1.934 [30] was used to carry out systematic quality controls on genotype calling. After stringent quality control, a total of 7,510,833 common SNPs were included in the current analysis. More detailed clinical information was described in the original paper [16] . The web-access tool of LocusZoom (http://locuszoom.sph.umich.edu/) was used to visualize the regional genetic association signals for SNPs within risk genes. cis-eQTL and cis-sQTL analysis was conducted based on the GTEx online tool (https://gtexportal.org/home/). To identify risk genes whose genetically-regulated expression, we applied a comprehensive integrative genomics analysis to integrate GWAS summary statistics on OCC with expression quantitative trait loci (eQTL) data for 49 tissues from the GTEx database (vserion 8) [31] . First, based on the MASHR statistical method, we leveraged a linear regression model in S-PrediXcan to . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint evaluate gene expression weights based on samples who having both genotypes and gene expression data. Through combining the variance and co-variance information of SNPs based on the reference panel of the 1,000 Genomes Project European Phase 3 [32], these estimated expression weights were assessed with a log odds ratio (OR) and standard errors from the GWAS summary data to calculate the levels of gene expression. For effective utilizing the information of significant genes whose expression has similar functions across different tissues, we further used the S-MultiXcan method to meta-analyze these MASHR results from S-PrediXcan to improve the statistical power. By integrating all available genes with convergent evidence across 49 GTEx tissues, the S-MultiXcan tool conducted a multivariate regression model for 22,331 genes, and FDR < 0.05 was considered to be of significance. To pinpoint risk genes associated with OCC, we conducted a gene-level genetic association analysis by incorporating the converging genetic signals around a given gene based on the Multi-marker Analysis of GenoMic Annotation (MAGMA) [7, 33] . We first extracted the SNP information including P values and chromosome position from GWAS summary statistics, and defined an extended genomic region for a specific gene containing the analyzed SNPs located within downstream or upstream 20kb of the gene. To calculate the linkage disequilibrium (LD) between SNPs, we used the 1,000 Genome Project Phase3 European Panel for reference [32] . To adjust the raw P values for multiple testing correction, we leveraged the widely-used method of Benjamini-Hochberg false discovery rate (FDR) [34] , and FDR < 0.05 was considered to be of significance. . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint To understand the biological functions for these identified risk genes, we conducted a pathway-based enrichment analysis according to the authorized KEGG pathway resource [35] through utilizing the online-access tool of WEB-based Gene SeT AnaLysis Toolkit (WebGestalt; http://www.webgestalt.org) [36] . There were three well-documented methods used for enrichment analysis, including gene set enrichment analysis, network topology-based analysis, and over-representation analysis. By using the over-representation approach, we inputted these identified significant or suggestive gene lists for establishing the functional relationships of these genes with biological pathways. In addition, we also performed a gene-ontology (GO) enrichment analysis using the WebGestalt tool based on three widely-used functional categories: cellular component (CC), molecular function (MF), and biological process (BP). The number of genes in each pathway or GO-term were limited to a range between 5 and 2,000. The hypergeometric test was adopted to assess the significance level of each enrichment analysis, and the Benjamini-Hochberg FDR method was used for multiple testing correction. To determine whether there exist prominently consistent results between genes from MAGMA analysis (N = 1,324, P < 0.05) and genes from S-MultiXcan analysis (N = 1,366, P < 0.05), we used a permutation method [11] to carry out a in silico permutation analysis with 100,000 times of random selections. In the first step, we calculated the overlapped genes between MAGMA and S-MultiXcan (N observation = 472 genes). In the second step, we leveraged all examined genes from S-MultiXcan . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint analysis as background genes (N background =22,331 genes). By randomly selecting the same number of genes as those identified from S-MultiXcan analysis from background genes to overlap with genes identified from MAGMA for 100,000 times, we counted the number of overlapped genes in each random time (N random). The empirically permuted P value = ≤ 100,000 , and P value < 0.05 is of significance. To uncover the functional interaction patterns of these identified OCC-risk genes, we performed There were 14 risk genes as an input list for searching interaction relationships based on current existing proteomics and genomics data, including genetic interactions, co-expression links, physical interactions, pathway links, and co-localizations. The Cytoscape platform [39] was adopted for visualizing the subnetworks. To find effective gene-targeted drugs for OCC, we submitted these 14 significant OCC-risk genes into the widely-applied Drug Gene Interaction Database (DGIdb v.3.0.2, http://www.dgidb.org/) based on 20 databases with 51 known interaction types. Based on the druggable categories in the DGIdb database, we attempted to search genes with potential drug abilities for these 14 OCC-risk genes. Furthermore, based on the GLAD4U database [40], we applied the hypergeometric test to conduct a drug-based enrichment analysis of these identified common genes between MAGMA . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint analysis and S-MultiXcan analysis (Supplemental Table S5 ) for obtaining a drug repurposing resource. To further validate the functionality of these 14 identified OCC-risk genes, we downloaded two independent RNA expression datasets from NCBI GEO database (Accession Nos. GSE160042 and GSE139869). The first analyzed dataset, GSE160042, contained 10 fresh human oral squamous cell carcinoma and their paired adjacent non-cancerous counterparts. All these enrolled patients received radical surgery without any form of pre-surgical adjuvant therapy. The RNA expression profiles were detected by the Agilent-045142 Human LncRNA v4 4X180K Microarray. The second RNA expression dataset (i.e., GSE139869) contained five paired tongue squamous cell carcinoma and paracancerous tissues, detected the transcriptomes with the use of the Agilent-062918 OE Human lncRNA Microarray. The paired Student's t-test was used to calculate the significance between cancer patients and controls, and the Corrplot R package was used to visualizing the co-expression patterns among these 14 risk genes. Based on the expression dataset on head and neck squamous cell carcinoma (N = 518) from Cancer Genome Atlas (TCGA) database, we applied the GEPIA v2 online tool (http://gepia2.cancer-pku.cn/#survival) to perform survival analysis for identified genes and plot Kaplan-Meier curves. Current study is a two-stage designed integrative genomics analysis to identify novel risk is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint variants and genes for OCC in the discovery stage, and replicate these results in the validation stage ( Figure 1 ). In the discovery stage, we collected a GWAS summary dataset on OCC with 1,223 patients and 2,928 matched controls [16] , and collected eQTL data across 49 tissues from the GTEx database [31]. Among these eQTL datasets, RNA sequencing expression samples with < 10 million mapped reads were excluded, and normalized across samples using the edgeR R package [41] . There were 838 donors having RNA sequencing data and genotype data. Combining the GWAS summary data with eQTL data, we highlighted risk genes whose genetically-regulated abnormal expression associated with OCC. In the validation stage, we further conducted subsequent comprehensive computational analyses by using multiple bioinformatics tools for replication. In view of the original study [16] did not perform a gene-level genetic association analysis, we conducted a MAGMA gene-based association analysis to incorporate the converging effects of SNPs in a given gene. There were 14 genes showing notable associations with OCC susceptibility (FDR < 0.05, Figure 2 and Supplemental Table S1); including IRF4 (P = 2.5×10 -9 ), TNS3 (P = 1.44×10 -6 ), NOTCH1 (P = 2.13×10 -6 ), ZFP90 (P = 2.37×10 -6 ), ATRN (P = 3.28×10 -6 ), HLA-DQA1 (P = 6.45×10 -6 ), DRD2 (P = 2.0×10 -5 ), EPHX2 (P = 2.03×10 -5 ), CLPTM1L (P = 2.13×10 -5 ), SCL6A2 (P = 2.76×10 -5 ), FGF7 (P = 2.78×10 -5 ), GTF2IRD1 (P = 3.15×10 -5 ), NLRP12 (P = 3.26×10 -5 ), and CDH16 (P = 3.46×10 -5 ). Quantile-quantile plot exhibited a low-level genomic inflation (λ = 1.02) in the gene-based association analysis (Supplemental Figure S1 ). There were two loci of 5p15.33 and 9q34 mapped by significant genes of CLPTM1L and NOTCH1 were consistent with the previous study [16] . is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint Among these 14 genes with multiple converging genetic association signals, there were 15 index SNPs showing notable associations with OCC (Supplemental Figures S2-S16 ), such as rs10775305 in ZFP90, rs6499103 in CDH16, rs4586205 in DRD2, rs10283378 in EPHX2, and rs12916839 in FGF7. For the most significantly MAGMA-identified gene of IRF4, the top lead SNP rs7773324 (P = 7.8×10 -7 ) has high CADD score and regulomeDB score with strong activating chromatin state ( Figure 3 and Supplemental Figure S17 ). In addition, both top lead SNPs of rs3125011 in NOTCH1 (P = 7.9×10 -7 ) and rs12916839 (P = 6.28×10 -7 ) in FGF7 have high CADD score and regulomeDB score with strong activating chromatin state (Supplemental Figures S18-S19). We observed that rs12916839 exhibited prominently allelic specific associations with FGF7 among multiple tissues, and the T allele of rs1296839 showed down-regulated effects on the expression of FGF7 (Supplemental Figure S20 ). Furthermore, the rs12916839 represents a splicing quantitative trait locus (sQTL) for FGF7 (Supplemental Figures S21-S22 ). Moreover, a number of 1,310 genes showed suggestive associations with OCC susceptibility from MAGMA analysis (P < 0.05, Supplemental Table S1 ). Among these suggestive genes, 191 genes have been documented to be implicated in other cancers in the GWAS Catalog database (Supplemental Table S1 ), and 11 genes of HLA-DQB1, ZNF512, CCDC121, C2orf16, GPN1, GALNT14, AIF1, CTLA4, MICB, FAM175A, and ADH7 have been reported to be associated with OCC in previous studies (Supplemental Table S1 ). To examine their biological functions, we performed a pathway-based enrichment analysis of these suggestive genes, and found that six biological pathways were significantly associated with OCC (FDR < 0.05, Figure 4A -B, Supplemental Table S2 ). For example, staphylococcus aureus infection (P = 8.04×10 -6 ), antigen processing and presentation (P = 9.18×10 -5 ), graft-versus-host disease (P = 2.01×10 -4 ), and cell is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint adhesion molecules (P = 8.40×10 -4 ). By performing a GO enrichment analysis, we found that there were 4 cellular component (CC) terms and 11 molecular function (MF) terms significantly enriched, including MHC protein complex, synaptic membrane, endonuclease complex, and GABA receptor activity ( Figure 4C -D, and Supplemental Tables S3-S4). Overall, these results suggest that these observed genes may be a good resource for pinpointing causal risk genes for OCC. By integrating GWAS summary statistics with eQTL data across 49 GTEx tissues using the S-MultiXcan method, we identified that 1,366 genes whose aberrant gene expressions were significantly or suggestively associated with OCC (P < 0.05, Supplemental Table S5 ). Such as ATRN Table S5 ). Based on 100,000 times of in silico permutation analysis, we found that there existed a prominent concordance of results from between S-MultiXcan and MAGMA analysis (Empirical P < 1.0×10 -5 , Figure 5B ), suggesting that these findings were relatively reliable and could be used for subsequent analyses. Furthermore, we performed two pathway enrichment analyses for these 472 common genes . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint based on two independent pathways resources of KEGG and Reactome. We observed that 18 biological pathways were significantly overrepresented (FDR < 0.05, Figure 5C -D and Supplemental Tables S6-S7). For example, staphylococcus aureus infection (P = 5.09×10 -8 ), allograft rejection (P = 2.35×10 -5 ), autoimmune thyroid disease (P = 2.79×10 -5 ), graft-versus-host disease (P = 3.94×10 -5 ), type I diabetes mellitus (P = 5.43×10 -5 ), and activation of C3 and C5 (P = 2.78×10 -5 ), reminiscing that these pathways showed significant enrichments in aforementioned pathway analysis. To explore the druggable actions of these 472 common genes, we carried out a drug-based enrichment analysis based on the GLAD4U database. We found that seven drug-terms were significantly enriched (FDR < 0.05, Figure 5E and Supplemental Table S8 ), including mumps vaccines (P = 5.73×10 -6 ), oxcarbazepine (P = 1.67×10 -5 ), trichloroethylene (P = 3.16 ×10 -5 ), rubella vaccines (P = 4.32×10 -5 ), desflurane (P = 9.79×10 -5 ), and lumiracoxib (P = 9.79×10 -5 ). In addition, there were 45 suggestively-enriched drug-terms by these OCC-related common genes (P < 0.05, Supplemental Table S8 ). These observed drug terms and risk genes may provide a repurposing resource for searching drug targets for OCC. Among the 14 MAGMA-identified significant genes (FDR < 0.05), 13 genes (13/14=92.86%) were significantly replicated by using S-MultiXcan integrative analysis ( Figure 6A ). There exist a moderate correlation of these top-ranked significant genes from two independent tools (r = 0.386, Figure 6B ). Five genes of HLA-DQA1, NOTCH1, FGF7, SLC6A2, and CLPTM1L have been reported to be associated with oral and pharynx cancer [15] [16] [17] [18] [19] . By performing the S-PrediXcan integrative analysis, we found that several genes showed significant associations with oral cancer for . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint Figure S27 and Table S9 ). To explore whether there exist collective functions of these 14 genes on OCC, we conducted a PPI network enrichment analysis based on two databases of STRING and GeneMAINA, and these genes were remarkably enriched in a functional subnetwork (P < 0.05, Figure 6C ), suggesting that these risks have highly biological interactions implicated in oral cancer. Compared with patients with non-oral squamous cell carcinoma or other squamous cell carcinomas, mutations in NOTCH1 have been frequently identified in patients with oral squamous cell carcinomas [17] . The genes of IRF4, GTF2IRD1, FGF7, and SLC6A2 showed functional connections with NOTCH1. An earlier study [45] showed that there existed a genetic interaction of NOTCH1 with SLC6A2 and GTF2IRD1. By performing an interlaboratory comparability study, Dobbin et al. [46] reported that the cancer-related gene of ARTN was highly co-expressed with EPHX2. Based on a drug-gene interaction analysis, we observed that 10 of 14 OCC-associated genes (71.43%) were overrepresented in 10 potential "druggable" gene categories, such as druggable genome, cell surface, drug resistance, transcription factor, growth factor, and G-protein coupled receptor (Supplemental Figure S28 ). There were five genes, namely NOTCH1, SLC6A2, DRD2, EPHX2, and HLA-DQA1, found to be targeted at least four drugs ( Figure 6D ). For example, the gene of NOTCH1 is targeted by 10 drugs, including brontictuzumab, nirogacestat, crenigacestat, wnt-974, rg-4733, asparaginase, ribociclib, mercaptopurine, temozolomide, and prednisolone. DRD2 gene shows molecular interactions with 10 drugs, including nemonapride, pipotiazine, bupropion, is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint pramipexole, ropinirole, pridopidine, bromperidol, acetophenazine, and aripiprazole. These results suggest that these 14 risk genes predisposed to have collective functions on OCC. By using the expression dataset of GSE139869, we conducted a DGE analysis of these 14 risk genes. Three genes not expressed in this dataset. We found that six of 11 genes were prominently expressed between OCC patients and paracancerous controls ( Figure 7A ). ATRN (P = 0.027), CLPTM1L (P = 0.007), IFR4 (P = 0.0054), and TNS3 (P = 7.6×10 -4 ) showed significantly higher expressions among OCC patients than that among controls, and EPHX2 (P = 4.0×10 -5 ) and GTF2IRD1 (P = 0.017) exhibited significantly lower expressions among OCC patients than that among controls (P < 0.05, Figure 7A ). Consistently, by analyzing the expression dataset of GSE160042, we observed that 10 of 14 genes were significantly differentially expressed between OCC patients and matched controls ( Figure 7B ). ATRN (P = 0.014), CLPTM1L (P = 8.7×10 -5 ), HLA-DQA1 (P = 0.003), IRF4 (P = 4.6×10 -4 ), TNS3 (P = 1.2×10 -6 ), and ZFP90 (P = 0.043) were prominently highly expressed in OCC patients, and CDH16 (P = 9.6×10 -5 ), EPHX2 (P = 2.1×10 -5 ), GTF2IRD1 (P = 3.3×10 -4 ), NOTCH1 (P = 4.6×10 -4 ) were remarkably down-expressed in OCC patients ( Figure 7B ). Furthermore, to determine whether the co-expression patterns among these 14 genetics-risk genes were changed by OCC disease status, we conducted a Pearson correlation analysis for the expression data from three datasets. Interestingly, we observed that an obvious change of the co-expression patterns among 14 genes categorized by OCC status (Supplemental Figures S29-S30 ). For example, the correlation coefficient between ATRN and CLPTM1L was fundamentally increased is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint from -0.356 in controls to 0.779 in OCC patients among the GSE160042 dataset. Integrating multi-omic evidence across different datasets, we calculated the ranked scores of these 14 risk genes by summing five pieces of evidence from genetic, eQTL, and gene expression, and found that the novel risk gene of IRF4 exhibited the highest rank score ( Figure 7C ). Based on a large-scale TCGA dataset on head and neck squamous cell carcinoma (N = 518), we found the expression level of IRF4 gene was significantly associated with patients' survive (P = 8.1×10 -5 , Figure 7D ). Previous genetic studies have indicated that there existed a crucial role of genetic susceptibility is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint cancer-related genes from the GWAS Catalog database, seven genes that are IRF4, TNS3, HLA-DQA1, EPHX2, CLPTM1L, FGF7, and NLRP12 have been documented to be associated with the susceptibility of other cancer types [51-58], and five genes that are NOTCH1, FGF7, SLC6A2, HLA-DQA1, and CLPTM1L have been reported to be associated with oral and pharynx cancer [15] [16] [17] [18] [19] . Based on the DisGeNET method, we observed that 14 genes were significantly enriched in cancer-related terms, including malignant carcinoid syndrome, small lymphocytic lymphoma, basal cell neoplasm, and basal cell cancer. These results suggest the 14 identified risk genes may have important functions in OCC. Additionally, we also found that 71.43% of these genes were significantly enriched in 10 druggable gene categories, and five genes that are NOTCH1, SLC6A2, DRD2, HLA-DQA1, and EPHX2 were targeted by at least one known drug, indicating that these identified genes have druggable functions and may as good candidates for drug repurposing in OCC. The novel risk gene IRF4, which is a tumor suppressor gene, encodes a protein that belongs to the interferon regulatory factor family of transcription factors, which have important roles in modulating interferon-inducible genes and activating the innate and adaptive immune systems. The , non-small-cell lung cancer [62] , breast cancer [63] , and large B-cell lymphomas [64] . By performing a large-scale bisulfite genomic sequencing analysis, Vernier et al. [63] reported that reversal of promoter hyper-methylation and consequently de-repression of is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint IRF4 gene is a factor contributing to the suppression of breast cancer. The NOTCH1 gene, which encodes a member of the NOTCH family of proteins, has an important role in the development of numerous cells and tissue types. Mutations in the NOTCH1 gene have been reported to be associated with chronic lymphocytic leukemia [65] [66] [67] [68] , pancreatic cancer metastasis [69] , head and neck squamous cell carcinoma [70] [71] [72] [73] , T-cell acute lymphoblastic The druggable gene of SLC6A2, which encodes a member of the sodium neurotransmitter symporter family that is responsible for the reuptake of norepinephrine into presynaptic nerve terminals, was identified to be targeted by ten drugs, including reboxetine, dexmethylphenidate, and is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint reported that the Taq1A A1/* genotypes were remarkably associated with smoking cessation under both the fixed-and random-effects models. Based on a GWAS of self-reported alcohol consumption is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint It should be noted that the sample size used in the GWAS summary statistics was not very large. Furthermore, due to the heterogeneity across different datasets, we applied distinct correction methods for multiple testing for each omics dataset. The current study was based on the European population, and we did not validate these OCC-risk genes in other ethnicities. Further studies are needed to evaluate the inference by using genotype and gene expression data from other ancestries. In summary, our integrative genomics analysis provides an effective method for identifying risk genes that convey susceptibility to OCC, and highlights 14 important OCC-risk genes with 9 novel risk genes, such as IRF4, TNS3, and DRD2. These identified genes targeted with drugs may be promising candidates for drug repurposing in the future pharmaceutical studies. Furthermore, we connected risk variants with susceptible genes and functional processes, providing a good clue for explaining the effects of genetic variants on OCC. More relevant studies are warranted to study the molecular functions and therapeutic efficacy of these identified risk genes. . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. All authors have reviewed the current version, and approved the finial manuscript. Not applicable. The authors declare no conflicts of interest. . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint datasets. A) Boxplots showing the differential expression patterns of 11 genes between OCC patients and paracancerous controls among the GSE139869 dataset. There were three genes that are is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint rs9271378 in each gene with OCC. The color illustrates LD information with the given SNP, as shown in the color legend. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint rs7773324 in IRF4. There were three distinct data used for assessing the variant functionality, including CADD score, regulomeDB, and chromatin state. This plot was generated by using the FUMA online tool (https://fuma.ctglab.nl/). Figure S18 . Multiple layers of evidence supporting the functional role of rs3125011 in NOTCH1. There were three distinct data used for assessing the variant functionality, including CADD score, regulomeDB, and chromatin state. This plot was generated by using the FUMA online tool (https://fuma.ctglab.nl/). Figure S19 . Multiple layers of evidence supporting the functional role of rs12916839 in FGF7. There were three distinct data used for assessing the variant functionality, including CADD score, regulomeDB, and chromatin state. This plot was generated by using the FUMA online tool (https://fuma.ctglab.nl/). is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 21, 2021. ; https://doi.org/10.1101/2021.11.17.21266500 doi: medRxiv preprint Global cancer statistics Cancer statistics Cancer of the oral cavity IRF4 in multiple myeloma-Biology, disease and therapeutic target Human lung tumor FOXP3+ Tregs upregulate four "Treg-locking" transcription factors Inhibition of DNMT1 and ERRα crosstalk suppresses breast cancer via derepression of IRF4 Distinct molecular profile of IRF4-rearranged large B-cell lymphoma Genomic alterations in high-risk chronic lymphocytic leukemia frequently affect cell cycle key regulators and NOTCH1-regulated transcription Bidirectional linkage between the B-cell receptor and NOTCH1 in chronic lymphocytic leukemia and in Richter's syndrome: therapeutic implications Clinical impact of clonal and subclonal TP53, SF3B1, BIRC3, NOTCH1, and ATM mutations in chronic lymphocytic leukemia A Notch-Dependent Inflammatory Feedback Circuit between Macrophages and Cancer Cells Regulates Pancreatic Cancer Metastasis The mutational landscape of head and neck squamous cell carcinoma