key: cord-0027177-mqjf5gls authors: Islam, Md Khairul; Rahman, Md. Habibur; Islam, Md Rakibul; Islam, Md Zahidul; Mamun, Md Mainul Islam; Azad, A.K.M.; Moni, Mohammad Ali title: Network based systems biology approach to identify diseasome and comorbidity associations of Systemic Sclerosis with cancers date: 2022-02-08 journal: Heliyon DOI: 10.1016/j.heliyon.2022.e08892 sha: 62f764097c4043ba7ba891081e40fed7b6ad2383 doc_id: 27177 cord_uid: mqjf5gls Systemic Sclerosis (SSc) is an autoimmune disease associated with changes in the skin's structure in which the immune system attacks the body. A recent meta-analysis has reported a high incidence of cancer prognosis including lung cancer (LC), leukemia (LK), and lymphoma (LP) in patients with SSc as comorbidity but its underlying mechanistic details are yet to be revealed. To address this research gap, bioinformatics methodologies were developed to explore the comorbidity interactions between a pair of diseases. Firstly, appropriate gene expression datasets from different repositories on SSc and its comorbidities were collected. Then the interconnection between SSc and its cancer comorbidities was identified by applying the developed pipelines. The pipeline was designed as a generic workflow to demonstrate a premise comorbid condition that integrate regarding gene expression data, tissue/organ meta-data, Gene Ontology (GO), Molecular pathways, and other online resources, and analyze them with Gene Set Enrichment Analysis (GSEA), Pathway enrichment and Semantic Similarity (SS). The pipeline was implemented in R and can be accessed through our Github repository: https://github.com/hiddenntreasure/comorbidity. Our result suggests that SSc and its cancer comorbidities share differentially expressed genes, functional terms (gene ontology), and pathways. The findings have led to a better understanding of disease pathways and our developed methodologies may be applied to any set of diseases for finding any association between them. This research may be used by physicians, researchers, biologists, and others. Scleroderma, or Systemic Sclerosis (SSc), is an autoimmune rheumatic sickness portrayed by extreme creation and amassing of collagen, called fibrosis, in the skin, inner organs as well as injuries to small arteries. SSc is very heterogeneous and generally affects patients' lungs, heart, and kidney that cause most of the deaths. There are many studies claimed that there might be a possibility of developing cancer in SSc patients [1] . There are three autoantibodies associated with SSc namely: anti-centromere, anti-RNA polymerase SSc patients' first year. Moreover, immuno-suppressants therapies are used to cure this autoimmune disorder may also contribute to the establishment of cancer [6] [7] . Interestingly, the presence and extent of SSc diffuse lung disease (scarring of lung tissue in internal organs like Interstitial Lung Disease) have indeed been reported as a potential cause for Lung cancer in previous research [8] . It is suspected that SSc patients mostly diagnosed with lung disease [9] [10] at a young age as well as women [11] have a greater chance of developing rapidly growing lung malignancy. In an Italian study, Lung cancer was found approximately in 5 percent of SSc patients (16 out of 318 patients). Furthermore, SSc patients have a 4.2-times chance of growing Lung cancer relative to ordinary people [12] . In another study, Lung cancer and Lymphoma have the largest rates of occurrence [13] in SSc patients. Lymphoma incidence was observed in SSc patients, with a standardized incidence ratio (SIR) of 2.1 [14] in Sweden. Over 56 years old SSc patients (130 people) have been observed, while 66 patients diagnosed with Lymphoma (mostly in B-cell tumors) and 28 patients diagnosed with Leukaemia (in Lymphocytes) [15] [16] . The distinctive evidence of specific co-occurring disorders in a person referred to as a comorbid condition-has become a typical practice because of its extraordinary alert on the forthcoming diseases and also reduce medical services cost [17] . Therefore, the accessibility of adaptable, simple to utilize programming structures, called bioinformatics, are fundamental to assist the investigation of comorbid condition of a pair of disease, over a enormous transcriptomic datasets of the patient's different organs or cells [18] . In recent bioinformatics approach, researcher tried to establish mechanisms to find shared genes, molecular pathways, protein-protein interactions, diseasome network, cluster analysis and semantic similarity in terms of DEGs and GO pathways, that help in determining a comorbid condition between a pair of disease [19] [20] . Previously researchers applied computational modelling to predict a comorbid condition and clinical bioinformatics approaches to validate comorbidities [21] . Since we have discovered common significant genes, molecular pathways, semantic similarity and other risk factors between SSc and cancer (pair like SSc vs Lung, SSc vs Lymphoma, and SSc vs Leukaemia), as well as SSc is a heterogeneous disease, so it is much more difficult to understand common molecular mechanisms and therapeutic drug targets for SSc with coexisting diseases. To avoid such complexity in research area, diagnosing diseases, or medicine discovering, we have developed a bioinformatics pipeline for SSc and its cancer comorbidities. The approach demonstrate the shared genes, molecular pathways of SSc and a cancer (Lung, Lymphoma, and Leukaemia). As there is no usual way of determining comorbid condition, we have followed previous research and implement few additional bioinformatics mechanisms together to develope a pipeline. Our proposed pipeline, combined bioinformatics approaches like analysing transcriptomics, pathways, protein-protein interactions and identification of hub proteins along with semantic similarity, performed much better than previous clinical data based bioinformatics approaches as well as the recent discovered individual mechanism. In this regard, we have collected 7 different transcriptomic datasets of the selected diseases. Then we have performed gene expression profiling of the transcriptomic data that used to identify significant biomarker genes and regarding gene ontology, and molecular pathways. Identified significant genes also utilised to conduct gene set enrichment analysis, semantic similarity, and cluster network. The findings assisted to prove that SSc established the risk of being diagnosed with cancer disease [22] . To summarise, our proposed methodology proved that SSc is highly associated with Lung Cancer than Leukaemia or Lymphoma. In fact, We may apply that methodology to any omics (transcriptomic) datasets including RNA-seq or microarray data. Anyone with a simple understanding of programming and biological knowledge may use the proposed pipelines to validate comorbid conditions between pairs of diseases. In this study, RNA-seq (high throughput sequencing) datasets were collected from the public repository, freely accessible, including the National Biotechnology Information Center (NCBI) [23] and Bio-Jupies [24] . Atleast 6 samples were considered from the raw datasets to maximize the full strength of this analysis. The factors considered in choosing the datasets for such analysis are as follows. 1. In our analysis, we have excluded duplicate samples (based on GSM IDs) that are present across different datasets; 2. SSc and cancers have many datasets but we consider only those datasets which have both control and case samples; 3. Just human data was included for the study while non-human dataset was avoided; 4. We have counted the number of significant genes from Differentially Expressed Genes (DEGs) considering two conditions: absolute log fold change value is greater than or equal to 1 as well as the adjusted p-value ≤0.05. For the quest of the highly expressed genes in SSc patients, RNAseq datasets were collected from GEO repositories and included for our analysis: GSE102864, a review of 9 control samples and 9 SSc subjects from Dermal fibroblast (skin) tissue [25] ; GSE104174, a review of 15 control samples and 15 SSc subjects from monocytes, lymphocytes (blood) tissue [26] . To prove our hypothesis we also collected selected cancer's RNA-seq dataset from GEO repositories. For lung cancer our included datasets: GSE60052, a review of 7 control samples and 7 small cell lung cancer subjects from lung tissue [27] ; GSE99531, a review of 4 control samples and 4 non small cell lung cancer subjects from T cell (blood) tissue [28] . For leukaemia cancer our included datasets: GSE94453, a review of 5 control samples and 4 leukaemia cancer subjects from Acute Myeloid Leukemia (AML) cell (blood) tissue [29] ; GSE107071, a review of 3 control samples and 3 leukaemia cancer subjects from AML cell (blood) tissue [29] . For Lymphoma cancer our included dataset: GSE106092, a review of 2 control samples and 4 Mantle Cell Lymphoma (MCL) cancer subjects from mantle cell line (blood) tissue [30] . MCL is a type of non-Hodgkin lymphoma. Gene Set Enrichment Analysis (GSEA) is used for interpreting gene expression data as well as functionally enriched GO terms on the different conditions or disease states. This approach works with multiple genes that share a similar biological process, chromosome position, or modulation, and thereby reveals several similar biological pathways, which facilitates the biological interpretations of the findings [31] . The GSEA approach analysis information at the standard of gene sets that utilized prior biological knowledge, such as gene pathways, gene expression profiles [32] . Our identified genes were associated with different genotypes (phenotypes) [33] . While GSEA takes nearly all the genes into account in the analysis, not just those above a randomly chosen threshold in terms of log2 fold change. We have implemented the GSEA using package: 'topGO' [34] . Firstly, we have got the most specific GOs corresponding to the set of genes. By implementing 'topGO' package, we have got the total number of GO and their interactions among them for our regarding significant genes of each dataset. At last, we have got top genes annotated to the GO terms. On the found data, we have applied Fisher's Exact Test to find gene counts accountable for enriching particular GO terms and Kolmogorov-Smirnov (KS) tests to compare continuous distributions. Here we have selected disease datasets from NCBI Gene expression Omnibus. Then, we have developed our case-control pairs of the selecting samples to identify the Differentially Expressed Genes (DEGs). Then significant genes were determined based on two criteria from DEGs: adjusted p-value ≤0.05 and | | ≥ 1. GO, KEGG pathways, and Semantic Similarity analysis were performed on the DEGs to reveal SSc and its cancer comorbidities. In many bioinformatics approaches, it is very important to measure the significance of the overlap between a given set of genes of interest and previously annotated sets of genes to identify functional pathways that are associated with these genes set [35] . The advent of high-throughput experimental technologies such as microarrays and RNA sequencing have revolutionized molecular biology's knowledge [36] [37] [33] . Analysing the most significant genes against a database of well-annotated gene sets, such as molecular pathways, allows us to see that a gene set is functionally important or not [38] [39] [40] . We have used KEGG repositories to identify DEGenriched molecular pathways for inferring molecular pathways relevant to SSc that also interact with the cancer diseases [41] . KEGG is an interactive database resource for researching the associations of the biological systems with cells and organisms using molecular-level information which make the way in developing the disease in a patient. The networks of KO nodes, the KEGG pathway maps, KEGG modules, and BRITE hierarchies are formed, which reflecting high-level of cells and organisms functions [42] . Gene Ontology (GO) has an enormous community public database that gives a lot of controlled vocabularies (biological or biochemical terms), represents gene products depending on their features in the cell. [43] . It is a community-oriented database of gene ontologies to help organic annotation of significant genes [44] [45] . In the hierarchical ontology graph, terminology placed closely together (i.e. with a few intermediate terminologies between them) appear to be ontologically more equivalent than any others far apart [46] . Various methods were used to quickly access all aspects of the data accommodated for the information and to allow valuable comprehension of exploratory knowledge using the GO, for example by enrichment analysis [44] [45] . Semantic similarity methods have been used to provide a framework for their pragmatic analysis using GO [47] . Ontologies were presented as directed acyclic graphs (DAGs) in which the term is represented as nodes and relations as edges. Semantic associations with each of its predecessor terms were identified by the DAG (a graph of ontology's subset). To summarise semantic similarity tries to find similarities between two terms (genes, GO) [48] . A GO term R can generally be defined as DAG R = (R,T R ,E R ) where T R is the DAG R set of GO term (such as R and each of its predecessor terms in the GO graph) and E R is the DAG R set of all edges (semantic relationships). The semantic value of R is: where w e is the edge e's ( ∈ R ) semantic contribution multiplier, the standard term t with its child term ′ . According to the form of connection, the semantic contribution is allocated between 0 and 1. For R, the global semantic meaning is determined as: and the semantic similarity between two terms is: Given two sets of terms The length of the very first set of terms is n, and the length of the second set of terms is m. For two given sets, we used the best-match average (BMA) to measure the semantic similarity as follows: with i, j indices on R, K terms. The schematic diagram of the proposed workflow is showed in Fig. 1 . The developed bioinformatics pipeline was implemented using the R language, available for public access at: https://github .com / hiddenntreasure /comorbidity. At first, we have considered transcriptomic datasets from Gene expression omnibu's (GEO) repositories for the selected diseases. Both SSc and cancer's high throughput sequencing (RNA-seq) samples (GSM records) were collected considering the fact: diseases-affected samples and healthy controls samples. Before selecting GSM records of a particular dataset, the case-control pairs (Like: SSc vs Lung cancer) were designed according to the research aim. The designed case-control pairs for SSc, were SSc-control vs SSc-affected and for Cancers (Lymphoma, Leukaemia, and Lung cancer)-affected vs control. GSM samples were investigated for better understanding of control and case samples. 'BioJupies' [24] , a online platform, is used for GSM records selection as well as differential expression analysis of genes. In some cases, BioJupies can automatically select healthy-control and case samples. If not then we select manually reviewing the details of GSM records. A Bayesian approach is applied for filtering candidate genes according to the designed casecontrol pair using two parameters, they are adjusted P-value (using Benjamini & Hochberg's FDR correction method and the cut-off p-value is 0.05) and log2 fold change (logFC) [49] . We used 'topGOdata', a R packages, on the significant genes to determine the GO term along with the annotation for mapping. Then conduct Fisher's Exact Test to filter GO terminology and find relationships within genes [34] . We calculated semantic equivalence for all immunological pathologies after obtaining the enriched GO terms corresponding to the SSc and cancer datasets by using mgoSim function available in GoSemSim package [50] . We used the KEGG pathways database to understand the common molecular pathways between SSc and cancer diseases, a manually created mapping which is accessible by anyone interested in diseases relationships [41] . Shared significant genes have been compared against the KEGG database to identify and demonstrate molecular mechanisms among the diseases. At the end of our methodology we have got the statistical summary of the datasets, genes-GO term, semantic similarity matrix (and a tree diagram) using genes and GO terminology, DAG, and KEGG molecular pathways. Alongside the most common pathway associated with the chosen pathologies, we have constructed a genomic network using common DEGs that were highly expressed in SSc and any of the cancer. The pipeline was constructed using the following packages of Bioconductor, 'genefilter' [51] : is a method for extracting genes from previous research with high-throughput data; 'topGO' [34] : this package offers metrics to evaluate GO terms when estimating for the GO graph topology. It may also possible to introduce and apply various statistical tests and different approaches to remove local correlations and incompatibilities between suggested GO terms.; 'GOSemSim' [50] [52]: computation of semantic similarities among GO, groups of GO, regu-latory genes and gene categories; 'clusterProfiler' [53] : for the KEGG pathways simulation analysis. After identifying DEGs, we have used the DEGs for protein-protein interactions. To show the protein-protein interaction which reveals the hub proteins. We used STRING to create protein interaction networks for the DEGs found in our enrichment study. Since the number of DEGs was low, we used a medium trust score (500) to build the PPI to identify hub proteins. Table 1 shows the SSc and its cancer comorbiditie's statistical summary based on our methodological needs. Here in this study, we have collected two datasets for each disease from GEO database except only one dataset considered for Lymphoma. Table 1 includes GEO accession numbers for each dataset, cell descriptions, the number of case and control samples, raw genes, significant gene, raw GSEA and Fisher GSEA. At first, we have conducted differential expression analysis on the count data of each dataset considering the case-vs-control samples. Thus we got candidate genes for each datasets which shown as raw genes in the Table 1 . Then two conditions were applied on the candidate genes to identify significant/biomarker genes. This helped us to achieve dimensionally reduced downstream analyses via focusing only on the active part of the biological system. To that extent, we have considered 2 transformed expression fold changes (FC), where a threshold value of 1 for the absolute 2 was considered. As well as, we have conducted empirical Bayes procedure, where the significance cut-off p-value < 0.05 was considered. Thus acquired significant or biomarker genes in downstream analyses showed in the Table 1 as seventh feature. The initial step of the GO enrichment analysis is to discover raw GSEA. Hence, we performed gene-ontology mapping using DEGs and GO term from the biologic process (BP). Feature 8th showed in Table 1, depict the number of nodes/annotated genes to the Gene Ontology term. Fisher's Exact Test is a measurable test depends mainly on the possibility tables to contemplate the criticalness of the relationship between two sets of characterizations. This is also used to classify statistically important and enriched biological roles. For example, the functional annotations of a group of genes can be compared with the rest of the genome, or the functional profile, since two experimental conditions can be compared against each other. The 9th features in Table 1 show the number of important GO terminology enriched in the Fisher Exact test. Taking the top 150 DEGs, Table 2 shows the number of up-regulated and down-regulated genes for all the datasets. Then we compare SSc We also use the Genemania, a online tool, to build a cluster network with similar genes between SSc and any of the selected cancer [54] . Fig. 2A, 2B and 2C show the cluster network, based on the shared genes, delivered from Genemania that represents the enriched pathways between SSc and Lung Cancer, SSc and Leukaemia, SSc and Lymphoma, respectively. Firstly, the most prominent pathways between SSc and LC were associated with the selected pathologies and their network percentile coverage: Peptide ligand-binding receptors (0.74%), Class A1 . This network helps to check whether co-morbidities genes are in the vicinity of SSc genes. The network illustrates that there were 8 genes shared between SSc and Lung Cancer, 3 genes shared between SSc and Leukaemia and 2 genes shared between SSc and Lymphoma. Fig. 4 . A dot plot illustrates to demonstrate the pathway Enrichment of the differentially expressed genes using KEGG pathway database. Here the dot corresponds to the significantly enriched pathways, colour and size of each dot refers to the significance of the enrichment test (adjusted p-value < 0.05) and the ratio of overlapping DEGs with that of a particular KEGG pathway, respectively. KEGG's most commonly enriched pathways between SSc and considered cancer were : Cytokine-cytokine receptor interaction, Rheumatoid arthritis, Pl3K-Akt signaling pathway, Hematopoietic cell lineage, AGE-RAGE signaling pathway in diabetic complications, Chemokine signaling pathway, ECM-receptor interaction, Malaria, Amoebiasis, Graft-versus-host disease, Autoimmune thyroid disease, Bladder cancer, IL-17 signaling pathway, and protein digestion and absorption. (0.06%), Amine Oxidase reactions (0.0028%), Spermine (0.13%) and V$AR 03 (0.063%) etc. In Fig. 3 , shows the significant associations among these cancers with the effects of Systemic Sclerosis, disease-genes association-ship networks were constructed for DEGs using Cytoscape plugins41, centered on SSc. 'JDP2', 'EDNRB', 'CD93', 'ALDH2', 'C1QB', 'SMTN', and 'CXCL2' were common between SSC and LC. 'RNPC3' and 'SMOX' were common between SSc and LP. 'COL8A2', 'PKIB', and 'NAV3' were common between SSc vs Lk. Pathway enrichment test is a tool for observing the functional relevance of a group of genes/proteins/molecules by measuring the significance of their overlap with an annotated group of genes/proteins/molecules known apriori for their particular biological function, namely pathways analysis. We have hypothesized that such pathway enrichment test would reveal shared biological functions between SSc and its comorbidities with cancer disease via observing commonly enriched pathways against DEGs. KEGG pathways is an known database that mapped pathway with annotation. An overrepresentation statistical test of the DEGs, namely Fisher's exact test, has been conducted with adjusted p-values < 0.05 to obtain significantly enriched pathways. As shown in Fig. 4 , we have found that there were several commonly enriched KEGG pathways between SSc (at least one) and considered cancer diseases including Cytokine-cytokine receptor interaction, Rheumatoid arthritis, Pl3K-Akt signaling pathway, Hematopoietic cell lineage, AGE-RAGE signaling pathway in diabetic complications, Chemokine signaling pathway, ECM-receptor interaction, Malaria, Amoebiasis, Graft-versus-host disease, Autoimmune thyroid disease, Bladder cancer, IL-17 signaling pathway, and protein digestion and absorption. We've found many DEGs from RNA-seq data using two conditions (p-value and logFC), yet their biological interpretation hard to discover as well as perplexing. Therefore, the enrichment analysis of DEGs was conducted to identify functional GO terms that are strongly enriched using known gene set annotations. The 'topGO', a R package, was used to perform GO enrichment analysis using Fisher's Exact test as well as Kolmogorov-Smirnov on each dataset. And also generate hierarchical Direct Acyclic Graphs (DAG) using the most enriched GO terms corresponding to the datasets. Fig. 5 shows the DAG of significant GO terms regarding the GSE104174-dataset, where GO terminology rep-resents the spectrum of the GO mappings with the five most important GO terminology. DAG is created using the elim algorithm in 'topGO' R package [34] . As shown in Fig. 5 , the top 5 GO terms for the GSE104174 dataset includes: a multicellular organismal process (GO:0032501), response to the external stimulus (GO:0009605), regulation of response to the stimulus (GO:0048583), defense response (GO:0006952), and inflammatory response (GO:0006954). Fig. 5 is used to show the important interrelations (top 5 in this case) of GO terms throughout the hierarchical structure. However, it also implies that all GO terminology is not essential to acknowledge the biological meaning of a disease. The following list of GO were common terms found between SSc and Lung cancer [34] : The following list of GO were common terms found between SSc and Lymphoma: • GO:0032501: multicellular organismal process • GO:0048856: anatomical structure development • GO:0007275: multicellular organism development • GO:0032502: developmental process To reveal SSc and its cancer's comorbid condition, we had focused on the shared significant genes, molecular pathways, enrichment analysis. Which disclosed biological functions and their rationale in developing cancer in SSc patients. We have also conducted a semantic similarity experiment with enriched GO terms across different datasets. Hence, the semantic similarity matrix specifies the likeliness of co-occurring two disease, though we have identified the association of cancer disease in SSc patients. Using mgoSim function in 'GoSemSim' R package [see Methods], it has been observed that SSc shares moderately its biological functions with only Lung cancer when considered all the enriched GO terms, as shown in the bottom two rows in Fig. 6 . But, when we had considered only the top 5 GO terms in each of the datasets, it was observed that SSc shared biological functions with Leukemia as well [ Fig. 7 ]. In this experimental analysis, there has not associations between SSc and Lymphoma, probably that is because there is only one function (Cytokine-cytokine receptor interaction) shared between them which was revealed via pathway enrichment test. This also shows the importance of conducting both pathway enrichment and semantic similarities as they may reveal complementing evidence in deciphering the comorbid condition between SSc and cancer diseases. To identify hub proteins, at first, protein-protein interactions (PPI) network was constructed by retrieving the interaction of the common DEGs from STRING database as shown in Fig. 8 . The PPI network consists of 88 nodes and 87 edges. This PPI analysis revealed four hub proteins, namely JDP2, EDNRB, ALDH2, and SMTN. To evaluate the validity of our work, we conducted an investigation of many literatures regarding the identified genes that have been clinically used as biomarkers for any of the three cancers. Table 3 shows the verified potential targets that confirm the validity of our study. From Table 3 , we observe that Fonseca et al. [55] identified ED-NRB genes associated with Systemic Sclerosis, Wei et al. [56] identified EDNRB genes associated with Lung cancer and MacLean et al. [57] identified EDNRB genes associated with Lymphoma, Hsiao et al. [58] identified EDNRB genes associated with Leukemia. Tan et al. [59] identified ALDH2 genes associated with Systemic Sclerosis. Li et al. [60] Yang et al. [61] identified ALDH2 genes associated with lung cancer. Liang Fig. 7 . The semantic similarity matrix is based on the top 5 terms of GO. The matrix demonstrated using legend (depicting the score of semantic similarity) and accession number of the disorder. It reveals that Lung Cancer highly associated with SSc (accession no: GSE102864) with score 0.92 (accession no: GSE99531 and GSE60052). Also discovered that SSc (accession no: GSE102864) associated with Leukaemia with a score of 0.61 (accession no: GSE94453), additionally SSc (accession no: GSE104174) with a score of 0.62 (accession no: GSE107071). [67] identified C1QB genes associated with lung cancer. Yanaba et al. [68] identified CD93 genes associated with Systemic Sclerosis. Liu et al. [69] identified CD93 genes associated with lung cancer. Iwasaki et al. [70] identified CD93 genes associated with Leukemia. Avraham et al. [71] and Luo et al. [72] identified JDP2 genes associated with lung cancer. Huang et al. [73] identified JDP2 genes associated with Lymphoma. Mansour et al. [74] identified JDP2 genes associated with Leukemia. Dufour et al. [75] identified COL8A2 genes associated with Systemic Sclerosis. Mansour et al. [74] identified NAV3 genes associated with Leukemia. Dou et al. [76] identified PKIB genes associated with lung cancer. Xu et al. [77] identified RNPC3 genes associated with Systemic Sclerosis. Chen et al. [78] identified RNPC3 genes associated with Lymphoma. Chen et al. [78] identified RNPC3 genes associated with Leukemia. The overall purpose of this research is to develop a bioinformatics and systems biology-based pipeline to retrieve novel knowledge from publicly available data repositories in terms of deciphering disease co- morbidities. Our proposed pipeline, which is readily reproducible, may further be used for omics datasets related to any set diseases in order to find similarities between them. In this study, we have collected the high throughput sequencing data like: RNA-seq datasets from Gene Expression Omnibus (GEO) repository: (https://www .ncbi .nlm .nih .gov /geo) where we have identified comorbidities between SSc and any cancer, includes: Lung, Leukaemia, and Lymphoma. In our proposed pipeline it has been hypothesized that SSc shows its comorbidities with cancer diseases for which relevant datasets were collected following our data-curation principle [see Methods] that includes both manual and automatic (using BioJupies) sample filtration based on GSM IDs. Although differential expression analysis of GEO datasets using linear models (i.e. limma R package), is assumed to have enough statistical power if the number of samples is equal or more than 3 for each group, but low number of samples added limitation to our pipeline in case of the outcomes at the end. Then we have applied GSEA that performed experiments with genome-wide expression profiles based on our designed case-control pair, and facilitated downstream analyses of SSc, particularly in terms of DEGs. Which discovered biological processes, molecular mechanisms, chromosomal position to portray the relations with other diseases. Then, the significance of the semantic similarity approach is observed, since it measured the similarity of separate disease (including different subtypes of same disease) depending on the chosen ontological terms yielded by other analysis, such as enrichment analysis of pathway and GO terms. The semantic similarity methodology was used to calculate the relationship using identified pathologies involving important biomarkers and enriched GO term sets. The purpose of this study is to evaluate the associations between two diseases, thus the use of semantic similarities admittedly strengthened the detection and classification capability of unique biological processes engaged in each disorder/disease. To decipher and characterize the latent pathological details of comorbid condition between SSc and Lung cancer/Leukemia/Lymphoma, we have identified shared significant DEGs, molecular pathways, Gene Ontology, protein-protein interactions using enrichment or conditions based process. However, the semantic similarities among enriched GO terms revealed that SSc is highly correlated with Lung Cancer rather than Leukaemia and Lymphoma that also proved in other methods applied in our pipeline. Since our pipeline is data-driven, the study type and sample characteristics of the selected data played key roles in generating and validating our hypotheses. Although we have used two datasets for each of the diseases, we may claimed that various number of datasets collecting from different tissues or organs with at least three samples can be used to apply our pipeline for detecting comorbid condition between a pair of diseases. Both biologically and technically different data like RNA-seq or micro-array data can be used to reduce data-bias and offer robust inference. Since, this pipeline generates data-driven hypotheses that are biologically plausible (based on the experimental validations within the pipeline), experimenting with two or more diseases that are seemingly unrelated, would supposedly generate results accordingly. However, we argue that even experimenting with those 'seemingly unrelated' diseases using our data-driven pipeline paves the way to produce potential novel hypotheses -some of which may provide direction to further research. Many of the previous approaches were designed for the disease comorbidity analysis by examining either single omics or clinical datasets. In [79] Elixhauser et al. [80] , the medical comorbidity index, called the Charlson comorbidity file, was modified for using with clinical documents as well as analysed the International Classification of Diseases (ICD-9-CM) codes. Then the Charlson comorbidity index has been converted into a large number of ICD-9-CM codes. The Charlson Index Summation is thought to be relevant with this article, since the ICD-9-CM regulatory information used to assess the comorbidity relationship. In Hidalgo et al. [81] , demonstrate a central phenotypic database namely Phenotypic Disease Network (PDN) has outlined the connections gained from previous experience of diagnosis of more than 30 million patients. The number of connections with a disease specified the number of comorbid diseases that acted as a risk factor for the patient diagnosed with the disease. Comorbidity4j introduced by Ronzano et al. [82] , is an open-source Java platform for comprehensive comorbidity analysis. While an integrated Web interface is used to collect clinical input data and to customise the comorbid tests, then the results of those tests may be sorted against comorbidity indices and condition names, and analyzed through heat maps and network charts. In our previous work [83] , an R method was applied to identify the disease comorbidity using initial diagnosis, genetic and clinical data of a patient and also forecast the association between diseases which has developed using different packages (i.e. pcalg, qtlnet, etc). We [83] have also developed a framework to predict survival probability of a patient using survival analysis tools (e.g. Net-Cox, rbsurv etc). In this work, we have developed a R tool called "POGO" which integrates the multi-omics data, ontology term, and phenotypic information for more robust comorbidity prediction. But it doesn't count the genetic effect on diseases. In [84] , a tool has been developed called CytoCom for Cytoscape. CytoCom is capable of clustering the Disease Comorbidity Network (DCN) based on the group of ICD9 disease codes that reveals the meaning of pathogenesis. Furthermore, contribute to the ultimately improvement of diagnosis and treatment analysis. Many kinds of research have been conducted on molecular networkbased approaches in order to help in drug development [85] [86] [87] . In earlier approaches, multi-omics, ontology, phenotypic information, and clinical data were used along with genetic data, but genetic effects on the diseases were not counted to demonstrate the relations. But two of the work used genetic effects by analyzing gene expression, molecular pathway, and Gene Ontology. A bioinformatics pipeline in R was developed [18] [88] to evaluate gene expression, gene ontology (GO), and molecular pathways data by integrating Gene Set Enrichment Analysis and Semantic Similarity. Our proposed approach to certify the comorbid conditions between SSc and cancer may be utilized by anyone, it has two basic applications that are discovered the potential pathways of SSc-related events that cause cancer disease progression and classify tentatively relevant health conditions by using omics and molecular evidence. With the advancement of such bioinformatics analysis, it may offer new opportunities for physicians to make a decision, such as potential danger evaluations, cancer detection, and subtyping, drug treatment, and dosage selection, which is a step towards the development of genuinely regenerative medicine [89] [90] . The approach will then offer profound fresh perspectives into disease mechanisms, and those defined disease mechanisms may be helpful to define promising pharmacological strategies for further studies. The aim of this manuscript was to identify pathways of SSc patients that play a vital role in developing different cancer, and also draw potential therapeutic drug targets (biomarkers) by utilising and analysing transcriptomics data, molecular pathways, protein-protein interactions and identification of hub proteins along with semantic similarity in terms of DEGs and GO pathways. Hence the methodology has identified required evidence to verify the linkage between Systemic Sclerosis and selected cancers. In addition to cancer diseases, the proposed strategy could be generalized mostly as a comorbidity chart by adding certain disease details. In terms of basic biological functions, pathways as well as omics info, such as GO, we have found that certain cancer conditions were strongly related to SSc. Our results suggest that by utilizing bioinformatics techniques, the evolution of emerging diseases may be detected and analyzed as it provides the opportunity to develop an understanding of various diseases functions. There is a growing interest among research communities in understanding comorbidity associations since it might uncover new knowledge concerning disease-causing facts as well as potential therapeutic strategic objectives. In exposing potential disease associations and prospects for drug development, this research illustrates the importance of an advanced bioinformatics and system biology-based approach. We have also assumed that such a form of methodology would be useful for providing evidence-based decisions on comorbid conditions. By combining other disease information alongside various types of cancer, our suggested approach may be applied as an association between various conditions map. It may be also used by researchers and healthcare professionals as an essential method to discover the specified fundamental disease's processes that underpin the nature and pathology of comorbid conditions and to develop more reliable and efficient therapies, possibly in a highly personalized and customized pharmacotherapic framework. Md Khairul Islam: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper. Md Rakibul Islam: Analyzed and interpreted the data; Wrote the paper. Md Zahidul Islam: Conceived and designed the experiments; Performed the experiments; Wrote the paper. Md. Habibur Rahman: Performed the experiments; Analyzed and interpreted the data; Wrote the paper. Md Mainul Islam Mamun: Contributed reagents, materials, analysis tools or data. AKM Azad: Contributed reagents, materials, analysis tools or data; Wrote the paper. Mohammad Ali Moni: Conceived and designed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. Data associated with this study has been deposited at https://www . ncbi .nlm .nih .gov /gds The authors declare no conflict of interest. No additional information is available for this paper. Cancer incidence in systemic sclerosis: meta-analysis of population-based cohort studies Associations of breast cancer development in patients with systemic sclerosis: an exploratory study Impact of radiation therapy on scleroderma and cancer outcomes in scleroderma patients with breast cancer, Arthritis Care Res Mechanistic and clinical insights at the sclerodermacancer interface Causes and risk factors for death in systemic sclerosis: a study from the EULAR scleroderma trials and research (EUSTAR) database Bladder and kidney cancer following cyclophosphamide therapy for non-Hodgkin's lymphoma Cyclophosphamideinduced cystitis and bladder cancer in patients with Wegener granulomatosis Intriguing relationships between cancer and systemic sclerosis: role of the immune system and other contributors Incidence of lung cancer in systemic sclerosis Cancer and systemic sclerosis Incidence of cancer among patients with systemic sclerosis in Korea: results from a single centre Lung cancer in scleroderma: results from an Italian rheumatologic center and review of the literature Systemic sclerosis and the risk of cancer: a nationwide population-based cohort study Autoimmune diseases associated with non-Hodgkin lymphoma: a nationwide cohort study Haematological malignancies in systemic sclerosis patients: case reports and review of the world literature, Case Rep Systemic sclerosis and chronic lymphocytic leukaemia Defining comorbidity: implications for understanding health and health services Bioinformatics methodologies to identify interactions between type 2 diabetes and neurological comorbidities Network-based identification of genetic factors in ageing, lifestyle and type 2 diabetes that influence to the progression of Alzheimer's disease Covid-19 patient transcriptomic and genomic profiling reveals comorbidity interactions with psychiatric disorders Comorbidity effects of mitochondrial dysfunction to the progression of neurological disorders: insights from a systems biomedicine perspective Discovering biomarkers and pathways shared by Alzheimer's disease and Parkinson's disease to identify novel therapeutic targets Ncbi geo: archive for functional genomics data sets-update Biojupies: automated generation of interactive notebooks for RNA-seq data analysis in the cloud Ptp4a1 promotes tgf signaling and fibrosis in systemic sclerosis Changes in macrophage transcriptome associate with systemic sclerosis and mediate GSDMA contribution to disease risk Genomic landscape survey identifies SRSF1 as a key oncodriver in small cell lung cancer A transcriptionally and functionally distinct pd-1+ cd8 + t cell pool with predictive potential in non-small-cell lung cancer treated with pd-1 blockade Mef2c phosphorylation is required for chemotherapy resistance in acute myeloid leukemia NR4A1 inhibition synergizes with ibrutinib in killing mantle cell lymphoma cells Genetic effect of type 2 diabetes to the progression of neurological diseases A system biological approach to investigate the genetic profiling and comorbidities of type 2 diabetes Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles Gene set enrichment analysis with topGO, Bioconductor Improv A network-based bioinformatics approach to identify molecular biomarkers for type 2 diabetes that are linked to the progression of neurological diseases Pathway enrichment analysis with networks Identification of key pathways and genes in nasopharyngeal carcinoma using bioinformatics analysis Pathways database system: an integrated system for biological pathways Preeclampsia: a bioinformatics approach through protein-protein interaction networks analysis Kegg as a reference resource for gene and protein annotation Gene ontology and KEGG pathway enrichment analysis of a drug target-based classification system Bioinformatics and machine learning methodologies to identify the effects of central nervous system disorders on glioblastoma progression The gene ontology project in 2008 The gene ontology and the meaning of biological function Primer on the gene ontology Metrics for GO Based Protein Semantic Similarity: a Systematic Evaluation A new method to measure the semantic similarity of go terms limma powers differential expression analyses for RNA-sequencing and microarray studies GOSemSim: an R package for measuring semantic similarity among GO terms and gene products Genefilter: methods for filtering genes from high-throughput experiments, R package version 1 Gene ontology semantic similarity analysis using gosemsim He, clusterProfiler: an R package for comparing biological themes among gene clusters The genemania prediction server: biological network integration for gene prioritization and predicting gene function Endothelin axis polymorphisms in patients with scleroderma Role of endothelin receptor type B (EDNRB) in lung adenocarcinoma Downregulation of endothelin receptor B contributes to defective B cell lymphopoiesis in trisomy 21 pluripotent stem cells Promoter methylation of p16 and EDNRB gene in leukemia patients in Taiwan Classification analysis of the transcriptosome of nonlesional cultured dermal fibroblasts from systemic sclerosis patients with early disease ALDH2 repression promotes lung tumor progression via accumulated acetaldehyde and DNA damage Methylation-induced silencing of ALDH2 facilitates lung adenocarcinoma bone metastasis by activating the MAPK pathway A modified murine model of systemic sclerosis: bleomycin given by pump infusion induced skin and pulmonary inflammation and fibrosis Depletion of CXCR2 inhibits tumor growth and angiogenesis in a murine model of lung cancer Role of chemokines in non-small cell lung cancer: angiogenesis and inflammation The complement system and systemic sclerosis The imbalance in the complement system and its possible physiological mechanisms in patients with lung cancer Is the complement protein C1q a proor anti-tumorigenic factor? bioinformatics analysis involving human carcinomas Augmented production of soluble CD93 in patients with systemic sclerosis and clinical association with severity of skin sclerosis Radioimmunoimaging of 125i-labeled anti-CD93 monoclonal antibodies in a xenograft model of non-small cell lung cancer Cd93 marks a non-quiescent human leukemia stem cell population and is required for development of MLLrearranged acute myeloid leukemia ATF3 and JDP2 deficiency in cancer associated fibroblasts promotes tumor growth via SDF-1 transcription Identification of four differentially methylated genes as prognostic signatures for stage I lung adenocarcinoma Histone chaperone Jun dimerization protein 2 (JDP2): role in cellular senescence and aging JDP2: an oncogenic bZIP transcription factor in T cell acute lymphoblastic leukemia IL-17A dissociates inflammation from fibrogenesis in systemic sclerosis PKIB promotes cell proliferation and the invasion-metastasis cascade through the PI3K/Akt pathway in NSCLC cells Systematic autoantigen analysis identifies a distinct subtype of scleroderma with coincident cancer Identification of RNPC3 as a novel JAK2 fusion partner gene in B-acute lymphoblastic leukemia refractory to combination therapy including ruxolitinib Adapting a clinical comorbidity index for use with ICD-9-CM administrative databases Comorbidity measures for use with administrative data A dynamic network approach for the study of human phenotypes Comorbidity4j: a tool for interactive analysis of disease comorbidities over large patient datasets comoR: a software for disease comorbidity risk assessment Cytocom: a cytoscape app to visualize, query and analyse disease comorbidity networks Molecular networks in network medicine: development and applications Identifying novel associations between small molecules and mirnas based on integrated molecular networks Understanding biological functions through molecular networks Bioinformatics methodologies for coeliac disease and its comorbidities Emerging landscape of genomics in the electronic health record for personalized medicine The CLIPMERGE PGx program: clinical implementation of personalized medicine through electronic health records and genomics-pharmacogenomics