HLA-SPREAD: A comprehensive resource for HLA associated diseases, drug reactions and SNPs across populations

Dhwani Dholakia1,2*#, Ankit Kalra3#, Uma Kanga4, Mitali Mukerji1,2*

1. Institute of Genomics and Integrative Biology-Council of Scientific and Industrial Research, New Delhi-110025, India.
2. Academy of Scientific and Innovative Research, Ghaziabad-201002, India.
3. Netaji Subhas University of Technology, New Delhi-110078, India.
4. All India Institute of Medical Sciences, New Delhi-110029, India.

* Correspondence: Mitali Mukerji; Email: mitali@igib.res.in
Dhwani Dholakia; Email: dhwani.dholakia@igib.in
#Equal Contribution

Keywords: HLA associations, Natural Language Processing, Adverse Drug Reactions, HLA Biomarker, Transplantation, HLA alleles

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.05.425409; this version posted January 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

ABSTRACT

Extreme complexity in the HLA system and its nomenclature makes it difficult to interpret and integrate relevant information on HLA associations with diseases, Adverse Drug Reactions (ADRs) and transplantation. A PubMed search returns ~110,000 studies on Human Leukocyte Antigens (HLA), reported from diverse locations and on multiple populations, and the IPD-IMGT/HLA database houses data on 28,320 HLA alleles to date. We developed an automated pipeline with a unified graphical user interface, HLA-SPREAD, that provides structured information on SNPs, Populations, REsources, ADRs and Diseases. Information on HLA was extracted from ~24 million PubMed abstracts using Natural Language Processing (NLP).
Python scripts were used to mine and curate information on diseases, filter false positives and categorize diseases into 24 hierarchical tree groups; Named Entity Recognition (NER) algorithms and semantic analysis were used to infer HLA association(s). This resource, covering 116 countries and 47 ethnic groups, provides interesting insights on markers associated with allelic/haplotypic association in autoimmune, cancer, viral and skin diseases, transplantation outcome and ADRs for hypersensitivity. Summary information on clinically relevant biomarkers related to HLA disease associations, with mapped susceptible/risk alleles, is readily retrievable from HLA-SPREAD. This resource is the first of its kind and can help uncover novel patterns in HLA gene-disease associations.

INTRODUCTION

The Human Leukocyte Antigen (HLA) locus consists of six classical genes (HLA-A, -B, -C, -DP, -DQ and -DR) that play an important role in eliciting immune response against pathogens (1) and three non-classical genes (HLA-E, -F and -G) that interact with Natural Killer cells to regulate virus-infected and malignant cells (2). HLA genes harbour a large number of mutations. As of September 2020, there are 28,320 HLA alleles reported in the IPD-IMGT/HLA database. These variations mostly arise to generate defensive mechanisms against pathogens. However, some variations also confer risk of autoimmune diseases such as rheumatoid arthritis, multiple sclerosis, Type 1 diabetes and Graves’ disease. More than 100 different autoimmune diseases, infectious diseases and adverse drug reactions have been reported to be associated with HLA genes (3–5).
These alleles have clinical utility as diagnostic markers, for example in rheumatoid arthritis and ankylosing spondylitis (6–8). They are also used in genetic screening, e.g. HLA-B*57:01 in the Caucasian population for abacavir hypersensitivity, and HLA-B*15:02 in Chinese and Asians for carbamazepine-induced life-threatening conditions like Stevens-Johnson syndrome (SJS) and toxic epidermal necrolysis (TEN), and also for SJS due to carbamazepine and other drug combinations (9, 10). In the context of transplantation, mismatch of HLA alleles between donor and recipient impacts the outcomes of solid organ and hematopoietic stem cell transplantation (11). In addition, mismatching for certain HLA loci is also reported to provide benefit in terms of the Graft versus Leukemia effect (12). Each of the reported studies is unique in itself, as they describe the molecular basis of disease associations, HLA matching and anti-HLA antibody formation that are relevant for transplantation. Besides, studies also report some relevant and associated clinical information, e.g. different HLA-B27 subtypes are reported to be associated with clinical categories under spondyloarthropathies (13). There are other studies that implicate HLA allele association with the composition of the gut microbiome and diseases (14–16). The expanse of this information is immense, as there is wide genetic variability and heterogeneity among populations (17). Although advancements in HLA typing technologies have been beneficial in identifying novel HLA sequences (18), they have also led to the same HLA allelic variant being reported under different HLA nomenclature. With the rapid increase in biomedical data, HLA alleles and their associations in multiple diseases, it becomes imperative to create a platform with structured information to query and retrieve relevant information. Current knowledge about HLA is limited to individual papers that can be searched through PubMed, or to reviews where a subset of studies has been summarised.
Hitherto, there exists no database that compiles the existing HLA related information in an organised framework. In the absence of such a repository, and with gaps in meta-information, resource sharing among researchers and clinicians becomes a big challenge. The integration of computer sciences with biomedical research has accelerated progress, both in terms of novel discoveries and data structuring. Natural Language Processing (NLP) is a method to extract relevant information from unstructured data (19). A simple NLP pipeline contains four components: data assembly, pre-processing and normalization, Named Entity Recognition (NER) and Relation Extraction (RE). The output of NLP algorithms, i.e. a structured dataset, can be used to generate insights via direct interpretation or through downstream analyses. In recent times, NLP methods have started gaining popularity in biological sciences. For instance, Rakhi et al. (20) reported a text mining pipeline to study spice-disease associations and link phytochemicals from different spices/herbs to diseases. Another report, by Lee et al., highlights BioBERT, a pre-trained biomedical language representation model that can be used for various text mining tasks like Named Entity Recognition (NER), Relation Extraction (RE) and question answering, specifically on biomedical datasets. Similarly, PubTator Central (21) is an open access tool available via NCBI that uses text mining algorithms for assisted biocuration of entities in literature. The tool uses NER to identify and highlight six bio-entities, viz. Gene, Disease, Chemical, Mutation, Cell Line and Species, in abstracts and open access articles available on PubMed.
Another interesting report, by Kuleshov et al. (22), presents a machine-compiled database for studying genotype-phenotype associations, generated by applying text mining to genome-wide association studies (GWAS). All these resources work on similar text mining algorithms, but each has a different set of applications and tasks to perform. Using these resources as such for HLA research often overlooks the extent of variability of the HLA complex and the parameters involved in this domain. For instance, PubTator Central is able to mine gene names from literature, but would not pick up HLA allele information, e.g. HLA-DRB1*01:01, when HLA-DRB1 is the search query. Conventional processes to individually mine the large amount of unstructured literature available on HLA research require both manpower and resources. To understand and integrate the observations from HLA studies, we require knowledge of genomic datasets, i.e. diseases, SNPs, drugs, populations and ethnic groups, along with an understanding of the relationships between them. NLP based text mining is an ideal approach to handle this complexity and create structured information. We provide HLA-SPREAD (Figure 1) as a platform for integrated HLA resources that has been developed using NLP to understand the complexity of this locus. The resource provides a platform to summarize HLA related genomics knowledge as well as to design and develop new hypotheses. In this study, we have used ~24 million publicly available peer-reviewed abstracts. We extracted biomedical entities including HLA alleles, diseases, SNPs, drugs and geographical locations. We also attempted to assign positive and negative relationships between diseases and alleles. This HLA connectivity was then used to address biologically and clinically relevant objectives like HLA biomarkers and risk and protective alleles for various diseases.

MATERIAL AND METHODS

Data Retrieval

MEDLINE, which comprises more than 24 million peer-reviewed articles from over 5600 scholarly journals, was used as the source of biomedical literature. Bulk data was downloaded from the FTP server in XML format. HLA alleles with nomenclature were downloaded from the IPD-IMGT/HLA database (23). To maintain uniformity in disease names and their IDs, we used MeSH keywords from UMLS (Unified Medical Language System). Drugs associated with side effects were obtained from SIDER 4.1 and the Allele Frequency Net Database (AFND) (24, 25). Allele frequencies of HLA alleles were also taken from AFND. Extensive pre-processing was done on all the datasets before they were used in the pipeline.

Pre-processing and Keywords Dictionary

PubMed parsing: A modified version of PubMed parser was used to extract PMID, title, abstract, publication date, journal, article type and authors’ information from the MEDLINE biomedical literature dataset (26). Only records with the above information were considered for further analysis and stored in a tabular format. All the subheadings in the abstract, viz. background, introduction, objective, method, experimental design, result, discussion, importance, setting, design, study objective, patients, participants and conclusion, were removed.

Disease Dictionary: Mentions of disease keywords were identified using a dictionary created from UMLS 2019 MRCONSO.RRF (27). UMLS is a set of biomedical vocabularies that includes data from OMIM, Gene Ontology, clinical repositories, Medical Subject Headings (MeSH) and NCBI taxonomy. In this study, we used MeSH descriptors including Entry Term (ET), Main Heading (MH), Preferred Entry Term (PEP), Descriptor Sort Version (DSV) and Machine Permutation (PM).
Descriptor Entry Version (DEV) was excluded as keywords belonging to this category were incomplete, e.g. abdominal injury was reported as abdominal inj. These descriptors are assigned a unique MeSH ID, which is stored in a hierarchical format with 24 head categories along with a unique Descriptor ID. We termed the root form of the disease as level-zero and top-level diseases as level-one for our analysis. Multiple forms of a disease like diabetes insipidus, diabetes mellitus, type 1 diabetes, juvenile-onset diabetes and others are assigned the same MeSH ID. This dataset was also supplemented with keyword variants such as plural and lemmatised forms to increase the search space.

HLA Dictionary: Keywords for HLA alleles and their nomenclature were fetched from the centralized repository of the international ImMunoGeneTics project (IMGT) database. IMGT is updated quarterly with submission or deletion of alleles and their nomenclature, and currently houses 28,320 alleles. Many reports do not follow the conventional HLA allele nomenclature, which makes mapping a strenuous task. To capture as many HLA alleles as possible, we created a dataset comprising all possible keywords, including forms with special characters removed, whenever required. We have also attempted mapping all the old nomenclature to the current allele names. This dictionary also includes a few generic HLA keywords like HLA class I, HLA class II, HLA linked and HLA associated. A few alleles based on old nomenclature belong to more than one antigenic group, hence they were put under the “broad antigen” category.
A few haplotypes that were a combination of more than one HLA allele were grouped in the “haplotype” category.

Named Entity Recognition

Keyword Matching across Abstracts

A Python-based NER pipeline was implemented to filter abstracts based on a dictionary matching approach using parallel multiprocessing. Disease and HLA allele keyword dictionaries were used for initial screening. Abstracts were converted to lower case with special characters removed, and if a match was found in either title or text, the abstract was sentence tokenized using the sentence tokenizer of the Python Natural Language Toolkit (NLTK). We encountered a great extent of variability in the names of disease keywords. Most of it arose from special characters like (-) and (‘) in the keyword, or from plural and singular forms. To deal with the former, we also kept instances of sentences where special characters were not removed; this increased the search space and enabled capturing of keywords such as Stevens-Johnson syndrome (Stevens-Johnson syndrome), Graves' disease (Graves disease). To tackle the latter, our disease dictionary was already enriched with plural and lemmatized forms of keywords. For HLA allele keywords, word boundary-based regex matching was implemented to search alleles in the sentences. Sentences with at least a single mention of both HLA allele and disease keywords were considered for further steps.

Identification of Tags: Populations, Drugs and SNPs

Populations: The filtered abstracts were processed using the spaCy NLP tagging algorithm (model: en_core_web_md) to search for mentions of populations in text. Of the two output tags, i.e. GPE (Geo-Political Entities) and NORP (Nationalities Or Religious Groups), we selected the keywords carrying the latter, because the GPE tag often reported scientific names of organisms as populations when applied to biomedical data, e.g. scientific names such as Chlamydia spp. and Chlamydomonas spp. were reported under GPE tags.
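The dictionary-matching step described under "Keyword Matching across Abstracts" can be illustrated with a minimal sketch. This is not the pipeline's actual code: the toy dictionaries, the function names and the naive sentence splitter (standing in for the NLTK sentence tokenizer) are assumptions made for illustration only.

```python
import re

# Hypothetical toy dictionaries; the real ones come from IPD-IMGT/HLA and UMLS/MeSH.
HLA_KEYWORDS = ["HLA-B*15:02", "HLA-B*1502", "HLA-B27"]
DISEASE_KEYWORDS = ["stevens-johnson syndrome", "ankylosing spondylitis"]


def build_pattern(keywords):
    """Compile one case-insensitive regex with boundaries around each keyword."""
    # Longest keywords first so longer variants win over shorter prefixes.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    # (?<!\w) and (?!\w) behave like word boundaries but also work next to '*' or ':'.
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


HLA_RE = build_pattern(HLA_KEYWORDS)
DISEASE_RE = build_pattern(DISEASE_KEYWORDS)


def split_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace (stand-in for NLTK)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_cooccurring_sentences(abstract):
    """Return (sentence, alleles, diseases) for sentences mentioning both entity types."""
    hits = []
    for sentence in split_sentences(abstract):
        alleles = HLA_RE.findall(sentence)
        diseases = DISEASE_RE.findall(sentence)
        if alleles and diseases:
            hits.append((sentence, alleles, diseases))
    return hits
```

Only sentences containing at least one allele keyword and one disease keyword are returned, mirroring the filtering criterion stated in the text; the boundary checks keep `HLA-B27` from matching inside a longer token such as `HLA-B2705`.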
The output was classified into countries and ethnic groups for further analysis with the help of an expert anthropologist. Manual curation of the obtained list was also done to remove plural and inappropriate entries.

Drugs: The information on drugs with side effects was taken from the SIDER database (SIDER 4.1). We also added 16 drugs from AFND whose information was missing in SIDER. The list of drugs was mapped across the dataset to check for occurrences in the selected HLA related abstracts. There were many instances where drug names were a subpart of disease keywords, e.g. “insulin” was obtained as a false match wherever it was present as a part of the disease name “insulin dependent diabetes mellitus”. A small Python snippet was written to remove such false positives.

SNPs: SNP IDs were mapped across abstracts of the HLA dataset using the RegEx module of Python. The algorithm iteratively searched for all instances of RSIDs using the regular expression “[rR][sS][0-9]{2,}”.

All the tags captured in the various sentences of abstracts were stored as lists of strings along with their respective PMIDs to facilitate future access.

Semantic Assessment

N-GRAM Evaluation and Manual Labelling

An N-gram is a contiguous sequence of n items (syllables, letters or words) in a text, used to determine the context of said items in a sentence or paragraph. We used the NLTK functions WordNetLemmatizer, WordPunctTokenizer and CollocationFinder to create a corpus of N-grams (n=1, 2 and 3) from the abstract dataset.
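The drug and SNP tagging steps described above can be sketched as follows. The regular expression is the one quoted in the text; everything else (function names, the added `\b` anchors, the span-overlap rule for discarding drug names embedded in disease names, and the PMIDs in the usage example) is an assumption for illustration and not the authors' snippet.

```python
import re

# Pattern quoted in the text; the \b anchors are an added assumption so that
# the pattern does not fire inside longer alphanumeric tokens.
RSID_RE = re.compile(r"\b[rR][sS][0-9]{2,}\b")


def extract_rsids(sentences_by_pmid):
    """Map each PMID to the sorted, lower-cased rsIDs found in its sentences."""
    found = {}
    for pmid, sentences in sentences_by_pmid.items():
        ids = {m.lower() for s in sentences for m in RSID_RE.findall(s)}
        if ids:
            found[pmid] = sorted(ids)
    return found


def drug_mentions(sentence, drugs, diseases):
    """Return drugs mentioned at least once outside any disease keyword."""
    lowered = sentence.lower()
    # Character spans covered by disease keywords in this sentence.
    disease_spans = [m.span() for d in diseases
                     for m in re.finditer(re.escape(d.lower()), lowered)]
    kept = []
    for drug in drugs:
        pattern = r"\b" + re.escape(drug.lower()) + r"\b"
        for m in re.finditer(pattern, lowered):
            inside = any(s <= m.start() and m.end() <= e for s, e in disease_spans)
            if not inside and drug not in kept:
                kept.append(drug)
    return kept
```

With this rule, "insulin" is dropped when it occurs only inside "insulin dependent diabetes mellitus" but kept when the same sentence also mentions insulin as a drug.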
After removal of stop words, which do not add significant meaning to the context, a subset consisting of all reported verb/adverb (n=1) and adverb-verb (n=2, 3) combinations above a frequency cut-off was filtered out using Part of Speech (POS) tags of the tokenised words. We observed that N-grams for negative labels often gave misleading information, e.g. “HLA-B27 negative” refers to the absence of the allele rather than a negative association between entities. Hence, we used very stringent criteria for choosing negative labels. Manual annotation of positive and negative labels was then carried out on this dataset and a total of 1128 labels (Supplementary Table 1) were categorised (1108 positive and 20 negative) for labelling the sentences. We assert a positive label where the HLA allele is positively associated with disease, so that its presence makes individuals susceptible to the disease, whereas in negative statements the HLA allele is negatively associated with disease and is hence protective. We also considered negation words like “not, none, no” which, if present, can reverse the actual meaning of the sentences. Instances of the above mentioned three keyword sets (positive, negative and negation) were iteratively searched in all the sentences. Further, a coding scheme was constructed using a binary layout to label sentences as positive, negative, complex or ambiguous. Sentences having no match from any of the categories were labelled as others.

Root-Verb and Associated Adverbs using Dependency Parsing

Dependency parsing refers to the formation of a tree layout based on the semantics of a sentence, where the root node is represented by a verb that relates different entities of that sentence. The allele and disease keywords present in each sentence were replaced with @GENE and @DISEASE tags and a parse tree was generated using the StanfordCoreNLP Python module (Stanford-corenlp-full-2018-10-05 package).
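The binary coding scheme for N-gram labels described earlier in this section can be sketched as below. The keyword sets are hypothetical and heavily abbreviated (the study curated 1128 labels), and the exact bit-to-label mapping is an assumption: in particular, treating any label combined with a negation word as ambiguous, and a negation word on its own as "others", is one plausible reading rather than the documented rule.

```python
import re

# Hypothetical, abbreviated keyword sets for illustration only.
POSITIVE = {"associated with", "susceptibility", "risk"}
NEGATIVE = {"protective", "protection against"}
NEGATION = {"not", "no", "none"}

# Bit layout: positive = 4, negative = 2, negation = 1 (assumed mapping).
LABELS = {
    0b000: "others",
    0b001: "others",
    0b100: "positive",
    0b010: "negative",
    0b110: "complex",
    0b101: "ambiguous",
    0b011: "ambiguous",
    0b111: "ambiguous",
}


def contains_any(sentence, phrases):
    """True if any phrase occurs in the sentence as whole words."""
    return any(re.search(r"\b" + re.escape(p) + r"\b", sentence, re.IGNORECASE)
               for p in phrases)


def label_sentence(sentence):
    """Encode the three keyword hits as bits and look up the sentence label."""
    code = (contains_any(sentence, POSITIVE) << 2
            | contains_any(sentence, NEGATIVE) << 1
            | contains_any(sentence, NEGATION))
    return LABELS[code]
```

In the hybrid approach described in the text, a label of this kind would still need to agree with the manually curated root verb from the dependency parse before a sentence is finally annotated; that second check is not shown here.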
The list of verbs obtained from the root nodes of all the sentences in the dataset was manually curated under positive and negative labels. We also added a category “Studied/Investigatory” for sentences that do not convey any positive or negative context but mention both entities together, e.g. “To investigate the association of HLA-A, B, and DRB1 alleles with leukaemia in the Han population in Hunan province”.

Sentence Annotation

We termed our approach for labelling sentences a “hybrid approach”, where annotation was done using both N-gram labels and the type of root verb. If a sentence had a positive N-gram label and a positive root verb, which inferred the relationship between entities as associated or linked, then the sentence was labelled as positive. The same approach was used for negative labelling. Finally, the labelled sentences were grouped into different categories: 1) Positive, 2) Negative, 3) Both positive and negative, referred to as Complex sentences, 4) Positive+negation, referred to as the Ambiguous group, and 5) Investigatory.

Database and web server

The HLA-SPREAD database is built for quick and easy retrieval of information related to HLA genes. The web interface was coded in HTML5, CSS3, Bootstrap and ES6. We used D3.js for data visualization and jQuery DataTables for table integration. The server was hosted using Apache HTTP server. The database uses a flat file system with data stored in an Excel file. JavaScript handles the search queries and data visualizations.

RESULTS

Mining Medline literature for HLA association

NLP based text mining of 24 million publicly available biomedical abstracts provided 41,845 abstracts with one or more sentences that describe the relationship between HLA alleles and diseases. To understand the distribution of the various kinds of articles among the filtered abstracts, we studied the article type per year trend from 1975 to 2019 (Figure 2). We found research journal articles, comparative studies and review articles to be the most numerous every year. In addition, there were papers corresponding to phase I, II, III and IV clinical trials and observational studies, highlighting the importance of this locus in translational studies.

HLA genes, alleles and its distribution

There are 28,320 alleles, and we hypothesize that not all of them would be associated with a disease or pathological condition. While collating and analysing data on HLA alleles, we observed a great extent of variability in the names within articles, e.g. HLA-B*13:01, a risk factor for dapsone hypersensitivity syndrome in multiple populations, was written as HLA-B*13:01, HLA-B*1301, B*1301, B(*)1301 and B1301 in different papers. In such instances, a user searching for an allele and its related information must be aware of all possible formats of writing the allele, encompassing its current and previous nomenclature. Based on this, we converted all existing HLA keywords to a standard allele name. We identified only ~1% of the total alleles to be associated with conditions like diseases, graft survival or drug reactions. To represent these alleles in the form of a graph, we collapsed the nomenclature to the two-digit level (Figure 3). The majority of the studies were on the HLA-DRB1 locus, followed by HLA-B and HLA-A, while fewer studies were on the HLA-C locus.
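The conversion of variant spellings to a standard allele name, and the collapse to the two-digit level, can be illustrated with a minimal sketch. It assumes a short hard-coded gene list and handles only one- and two-field names such as the HLA-B*13:01 variants quoted above; the function names are hypothetical, and the actual resource relied on a dictionary built from IPD-IMGT/HLA rather than a single regular expression.

```python
import re

# Assumed subset of HLA gene symbols; longer symbols are listed first so that
# "DRB1" is tried before a single-letter gene.
GENES = ["DRB1", "DRB3", "DRB4", "DRB5", "DQA1", "DQB1", "DPA1", "DPB1",
         "A", "B", "C", "E", "F", "G"]

ALLELE_RE = re.compile(
    r"^(?:HLA-)?(" + "|".join(GENES) + r")"  # optional 'HLA-' prefix, then gene
    r"(?:\(\*\)|\*)?"                         # '*', '(*)' or no separator
    r"(\d{2})(?::?(\d{2}))?$"                 # first field, optional second field
)


def normalise_allele(raw):
    """Return the standard name (e.g. 'HLA-B*13:01'), or None if unrecognised."""
    match = ALLELE_RE.match(raw.strip().upper())
    if match is None:
        return None
    gene, field1, field2 = match.groups()
    name = f"HLA-{gene}*{field1}"
    return f"{name}:{field2}" if field2 else name


def collapse_to_two_digit(standard_name):
    """Keep only the gene and the first field, e.g. 'HLA-B*13:01' -> 'HLA-B*13'."""
    return standard_name.split(":")[0]
```

Names with three or more fields (for example HLA-B*38:02:01) are deliberately left unrecognised by this sketch and would need an extended pattern or a dictionary lookup.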
Each HLA allele, collapsed to its two-digit form, is linked to the AFND server, which reports its allele frequency. The focus of our present study was also to understand the semantics between alleles and diseases, wherein we noted that some alleles were reported as protective and some as risk alleles, e.g. some reports indicated that HLA-DRB1*15 was protective for HIV and diabetes, whereas some studies reported it as a risk allele for pulmonary tuberculosis. We were also interested in exploring the effects of multiple alleles individually on a single disease. To address this, we listed 45 articles (Supplementary Table 2) highlighting the fact that, for a single disease, different alleles can have contrasting effects, e.g. HLA-DQA1*02:01 and HLA-DQB1*06:02 can be protective in Artemisia pollen-induced allergic rhinitis while HLA-DQA1*03:02 can be a risk factor (28).

Exploring diseases, its associated categories and other relevant information

The HLA studies were divided into four broad categories, Diseases, Transplantations, Sign and Symptoms, and Therapeutics/ADRs, to study the information systematically. This grouping was done based on the MeSH keywords identified in the abstracts. There is a total of 24 categories for diseases in MeSH, ranging from C1 to C26, and transplantation procedures are listed under E04. Keywords falling under C23 were grouped as “Sign and Symptoms”, and C20.452 (GVHD) and E04 were grouped as “Transplantations”. For “Therapeutics/ADRs”, we selected only those sentences that mentioned drug keywords, allele names and disease names together.
We then retained them if they satisfied any of three conditions: 1) the disease belonged to the Drug adverse reactions category, 2) the sentence mentioned keywords such as reactions or -induced (carbamazepine-induced), or 3) the disease keyword contained -induced (Drug-induced liver injury). The remaining sentences were grouped as “Diseases”. Table 1 shows the number of articles under each category. To study the association with diseases, we analysed data from both the “Diseases” and “Transplantation” categories. Inconsistency in writing disease names increases the effort of searching for a specific query. To reduce this variability, the MeSH ID was used to summarise the obtained information, e.g. diseases like tumour, cancer, malignancy and neoplasm (malignant and benign) were mapped to a single entity, malignancy (D009369). Collapsing a large number of similar keywords to a single ID reduces the complexity of searching for articles related to particular diseases. We observed a total of 3615 different disease terms mapping to 1869 unique MeSH IDs. Figure 4 represents a snapshot of common HLA associated diseases. To examine the disease associations, we mapped them to level-one (level-zero) terms. Diabetes Mellitus Type 1, Rheumatoid Arthritis, Multiple Sclerosis (Autoimmune Disease), Melanoma and Leukemia (Neoplasms by Histologic Type), Psoriasis (Skin disease) and Celiac Disease (Metabolic) were the topmost HLA associated diseases. In the analysed abstracts, the list of HLA associated diseases/conditions indicates that some diseases were very frequently reported, whereas other diseases like Down syndrome, Guillain-Barre Syndrome and Polymyalgia Rheumatica were infrequently or rarely reported. Supplementary Table 3 represents the distribution of both common and less explored HLA associated diseases. To get an overall perspective of genes and diseases, we considered the diseases at level-one along with the HLA gene.
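The four-way grouping rules described earlier in this section can be expressed as a small decision function. This is a sketch under stated assumptions: the function name and parameters are hypothetical, the MeSH tree-number prefixes are those given in the text, the order in which the rules are tested is an assumption, and membership of the Drug adverse reactions category is passed in as a flag rather than looked up.

```python
def categorise_mention(tree_numbers, sentence, disease_keyword,
                       has_drug=False, in_adr_category=False):
    """Assign one of the four study categories to an allele-disease sentence."""
    sentence_l = sentence.lower()
    # Therapeutics/ADRs: a drug must co-occur with allele and disease,
    # and at least one of the three conditions in the text must hold.
    if has_drug and (in_adr_category
                     or "reaction" in sentence_l
                     or "-induced" in sentence_l
                     or "-induced" in disease_keyword.lower()):
        return "Therapeutics/ADRs"
    # C20.452 (GVHD) and E04 (transplantation procedures).
    if any(t.startswith(("C20.452", "E04")) for t in tree_numbers):
        return "Transplantations"
    # C23 holds signs and symptoms.
    if any(t.startswith("C23") for t in tree_numbers):
        return "Sign and Symptoms"
    return "Diseases"
```

Testing the drug-related rule first means that a sentence about a drug-induced post-transplant complication lands in "Therapeutics/ADRs"; a different precedence would be equally defensible.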
We observed the majority of reported associations with HLA-DRB1, followed by HLA-B and HLA-A (Figure 5). We also listed details of individual allele-disease pairs for more information (Supplementary Table 4). HLA-DRB1 was reported to be linked with disease conditions like rheumatoid arthritis, type 1 diabetes, multiple sclerosis, melanoma and 1184 other diseases. HLA-B association was reported with spondylitis, infections, hypersensitivities, psoriasis, drug allergies and 928 other diseases, and HLA-A was reported to be associated with melanoma, leukemia, influenza, haemochromatosis and 778 other diseases. The analysis also takes into consideration the diseases which require transplantation and includes the complications associated with it, both pre- and post-transplantation. As anticipated, we observed that individuals suffering from beta thalassemia and sickle cell anaemia (genetic and congenital disorders), multiple myeloma (an immunoproliferative disorder) and liver injury underwent transplantations of bone marrow, hematopoietic stem cells and renal tissue. However, there were other additional details included with the transplantation data, such as disease history of patients before undergoing transplantation, e.g. psoriasis, Graves’ disease and diabetic neuropathy, and post-transplantation complications, e.g. ischemia, necrosis, fibrosis and haemorrhage. Such collated information under one platform may be of interest to a clinician for designing therapy modules. Supplementary Table 5 represents details of transplantation related studies.
SNPs and HLA diseases

HLA loci have a repertoire of genetic variations, a large number of which have been linked to multiple diseases via genome-wide association studies (GWAS). Though GWAS list information about SNPs in or associated with HLA genes, a number of genetic variation studies go unnoticed, either because they are small cohort analyses or because they are not compiled in a single resource for systematic study. Thus, to include the overlooked studies and missing information, this analysis reports information from all kinds of studies and includes abstracts mainly from journal articles, reviews, meta-analyses, letters and clinical trials. To acquire robust data, we retained only those HLA variations that are present in the sentences along with the disease and allele keywords. We identified 313 unique SNP mentions, whose details are compiled in Supplementary Table 6. The majority of SNPs mapped to intronic variants, followed by missense and intergenic variants. Figure 6 represents the genomic distribution of the mapped SNPs. A substantial number of variations also mapped to genes other than HLA, indicating that they may be in Linkage Disequilibrium (LD) or frequently occur in conditions such as transplantation success or ADRs. We observed top hits of SNPs mapping to infectious diseases like HIV and hepatitis, inflammatory conditions like psoriasis, complex diseases like asthma and diabetes, and hypersensitivity largely attributed to drug ADRs. SNP association studies are also based on a proxy SNP, which can be in LD with the causal variant, and the LD values vary from one population to another. To address this, we also added population information of the studies whenever available in the abstract. The most studied SNP, rs9277535, associated with hepatitis B virus, has been studied across a large number of populations from Asian and central Asian countries like China, Japan, Asia, Turkey, Korea and Indonesia.
Geographical Spread of HLA literature across various ethnic groups and populations

Genetic differences in HLA genes across populations and their link with biological conditions make it imperative to consider geographical information while studying HLA association with a particular condition. We assumed that the population/ethnic group name might not be present in the same sentence that mentions HLA and disease, so we used a flexible approach here and fetched the names of geographical locations present anywhere in the abstracts. In total, we reported 7696 NORP tags, mapping to 174 unique geographical entities. These unique tags were binned into 112 country-based populations and 62 ethnic groups. Figure 7 represents the frequency distribution of these matched populations belonging to the countries and ethnic groups. Japan, China, USA, India and Italy are the major countries where HLA gene-disease association studies have been reported with disease groups, as shown in Supplementary Table 7. Along with this, the European subcontinent has been extensively studied (1102 unique reports) as a major ethnic group. Apart from frequently studied areas, we also observed locations like New Zealand, Armenia and Sri Lanka that have a low number of reported studies. This type of analysis can help researchers understand not only the extent of allele-disease associations among populations in the context of these immune players, but also the scope of research in their selected geographical location while planning their hypotheses.
Response to therapeutics

HLA genes are known to be associated with various hypersensitivities and drug reactions, a few of which, like Stevens-Johnson syndrome, can be life-threatening. Because alleles differ at the individual and population levels, these hypersensitivities vary, and thus studying these pharmacogenetic markers together with population information becomes important. For instance, we observed from our data that HLA-A*31:01 is associated with carbamazepine-induced Stevens-Johnson syndrome in the European population, while HLA-B*15:02 is associated with it in Chinese and Indian populations. A meta-resource like HLA-SPREAD can help in understanding such population-wise differences, which complicate the design of therapies for ADRs/hypersensitivities. To be more specific, this analysis focuses on drugs that are present in sentences along with the disease and allele keywords. We observed a total of 1755 abstracts mentioning 252 unique drugs, of which 78 mapped to the ADR category. Details of the drugs and related information are listed in Supplementary Table 8. We also validated our results against AFND, a manually curated database that has information about ADRs. Of the 42 drugs present in AFND, 33 were also found in our analysis. One of the drugs mentioned in AFND, "valproic acid", was not present in the actual cited article. The remaining drugs could not be captured because of the stringent criterion for drug mapping, i.e. the drug name must be present in the same sentence as the disease and allele keywords. Figure 8 shows the frequency distribution of the top 20 drugs fetched from our analysis. Interestingly, we also observed 19 drugs that are not mentioned in the AFND database; for example, the HLA-B*38:02:01 allele was found to predict carbimazole/methimazole-induced agranulocytosis, and HLA-DRB1 was associated with azathioprine-induced pancreatitis in IBD patients. This analysis highlights how information can be missed in manual curation, in addition to its time- and manpower-intensive nature.
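A comparison of this kind between a mined drug list and a curated one reduces to set operations after name normalisation. The sketch below is only an illustration: the function names are invented for the example, and the two short lists are toy data, not the actual contents of HLA-SPREAD or AFND.

```python
def normalise(name):
    """Lower-case a drug name and collapse whitespace so spelling variants match."""
    return " ".join(name.lower().split())


def compare_drug_lists(mined, curated):
    """Split drug names into shared, curated-only and mined-only groups."""
    mined_set = {normalise(drug) for drug in mined}
    curated_set = {normalise(drug) for drug in curated}
    return {
        "common": sorted(mined_set & curated_set),
        "missed": sorted(curated_set - mined_set),  # in the curated list only
        "novel": sorted(mined_set - curated_set),   # in the mined list only
    }


# Toy lists for demonstration; they do not reproduce either resource.
EXAMPLE_COMPARISON = compare_drug_lists(
    mined=["Carbamazepine", "abacavir", "Methimazole"],
    curated=["carbamazepine", "Abacavir ", "allopurinol"],
)
```

Entries in the "missed" group would then be checked by hand, which is how a discrepancy such as a drug absent from the cited article could come to light.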
Insights from HLA-SPREAD: Biomarker analysis

We demonstrate the usability of the database in addressing clinically relevant queries. Multiple questions on the identification of HLA alleles and diseases linked with hypersensitivity, allergy, genetic markers, prognosis and diagnosis can be addressed using HLA-SPREAD. As an example, we present an analysis to identify biomarkers in HLA studies. To address this question, we used an n-gram based approach to identify the keywords most frequently occurring with "marker" in the sentences. Supplementary Table 9 lists the most common keywords identified. We checked the details of such sentences and compiled the information (Supplementary Table 10). A few of them, like abacavir hypersensitivity and Stevens-Johnson syndrome (SJS), were present in multiple papers. HLA-G and HLA-E were also reported to be markers for conditions like tumour, transplantation and heart diseases.

DISCUSSION

HLA alleles are known to be associated with a large number of diseases. There is no existing repository that summarises this information in a systematic manner. Manual curation is a cumbersome process in which a lot of important information can be missed. The need for a user-friendly platform increases significantly since HLA alleles have been found to be clinically associated with a large number of conditions. NLP-based text mining offers a way to fetch this information programmatically. NLP is instrumental in extracting information from unstructured data, and this method has started assuming immense importance in the biomedical domain.
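As a concrete illustration of such text mining, the n-gram step used in the biomarker analysis could be approximated by counting bigrams that contain the keyword. This is a minimal sketch under stated assumptions: the tokeniser, the function name `marker_bigrams` and the example sentences are invented for the illustration and are not taken from the HLA-SPREAD code or data.

```python
import re
from collections import Counter


def marker_bigrams(sentences, keyword="marker"):
    """Count adjacent word pairs in which either word starts with the keyword."""
    counts = Counter()
    for sentence in sentences:
        # Simple tokeniser that keeps allele names such as hla-b*57:01 intact.
        tokens = re.findall(r"[a-z0-9*:-]+", sentence.lower())
        for left, right in zip(tokens, tokens[1:]):
            if left.startswith(keyword) or right.startswith(keyword):
                counts[(left, right)] += 1
    return counts


# Made-up sentences for demonstration only.
EXAMPLE_SENTENCES = [
    "HLA-B*57:01 is a genetic marker for abacavir hypersensitivity.",
    "HLA-G is a prognostic marker in tumour patients.",
    "A genetic marker was not identified.",
]
EXAMPLE_BIGRAMS = marker_bigrams(EXAMPLE_SENTENCES)
```

Ranking the resulting counts with `EXAMPLE_BIGRAMS.most_common()` surfaces the words that most often accompany "marker", which can then be used to pull out the sentences worth manual inspection.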
A few works, such as GWASkb and SNP literature mining, have used it to extract information such as SNPs and related knowledge from biomedical data, whereas the Monarch Initiative has used it to study phenotype information (29). Extracting information from HLA-related literature is very difficult owing to the large number of studies and the complex nomenclature. This project is an attempt to consolidate all the HLA-relevant information, such as SNPs, populations studied, ADRs and associated diseases, into a structured database. This resource is also handy for user-specific advanced HLA searches, like looking for biomarkers for toxicity-based studies and disease progression.

There were a few drawbacks of this analysis worth highlighting, primarily arising from the different formats of various journals. The initial tokenised data used in the analysis was based on English stop words. However, we observed that in a small set of papers the authors omitted full stops or spaces, which led to the fusion of two sentences. The subheadings were present in different cases and were often followed by different special characters, which complicated their removal. Also, prefixes of keywords like SETTINGS, STUDY DESIGN, etc. were observed in a few sentences, as those papers did not follow standard headings. Apart from these, a few other factors, like abbreviations at the end of sentences, the presence of roman letters in sentences, and different bracket and quote styles in titles, caused errors during the tokenisation process. Similarly, when abstracts were updated in new releases, the previous incorrect entries were not removed, which led to duplicated information.

Since HLA-SPREAD has catalogued information from diverse resources, in many instances it provides information that is more informative and exhaustive.
For instance, besides information retrieved from databases like DisGeNET and OMIM (Mendelian), which report information on a limited number of diseases, we also used MeSH, which is more comprehensive as it houses 139264 variant disease terms mapping to 4674 diseases. We also reduced the high variability in the way disease names are mentioned in various articles. On average, a disease has around 30 names with one ID, showing the wide spectrum of the disease dictionary required to capture all possible disease terms.

In order to capture HLA and ADRs, we selected a list of drugs from SIDER 4.1. However, not all drugs present in a side effect database will be associated with ADRs. To get a more specific answer, we selected drugs from categories such as adverse drug reactions, hypersensitivity and toxicity. We were able to fetch a large number of studies and observed that the AFND database has missed a number of drugs in the ADR analysis. We thus added information from both AFND and SIDER to obtain more complete information for a set of different drugs.

There were a few unique aspects that we could capture because of our approach. For instance, in transplantation studies, in addition to listing the different kinds of transplantation, we also observed the most common diseases that required transplantation and the drugs given during the process, along with a few side effects. Another unique aspect we added was a category called signs and symptoms, to simplify user searches. For instance, some users may also be interested in knowing the context of HLA alleles with conditions like inflammation, relapse, hypoxia, septic shock, diarrhoea, etc.
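The disease-name normalisation discussed in this section, where many variant names resolve to one identifier, can be sketched as a synonym lookup. The sketch below is illustrative only: the identifiers are placeholders rather than real MeSH IDs, the name lists are abbreviated toy examples, and the function names are assumptions made for the example.

```python
# Placeholder identifiers, NOT real MeSH IDs; each list holds a preferred
# name first, followed by a few variant names (abbreviated toy examples).
DISEASE_NAMES = {
    "ID_0001": ["Arthritis, Rheumatoid", "rheumatoid arthritis"],
    "ID_0002": [
        "Diabetes Mellitus, Type 1",
        "type 1 diabetes",
        "insulin-dependent diabetes mellitus",
    ],
}


def _canonical(text):
    """Lower-case and collapse whitespace so that spelling variants collide."""
    return " ".join(text.lower().split())


# Invert the dictionary: every variant name points to (ID, preferred name).
LOOKUP = {
    _canonical(name): (disease_id, names[0])
    for disease_id, names in DISEASE_NAMES.items()
    for name in names
}


def normalise_disease(mention):
    """Return (ID, preferred name) for a disease mention, or None if unknown."""
    return LOOKUP.get(_canonical(mention))
```

With roughly 30 names per disease on average, as reported above, an inverted lookup of this kind keeps matching cheap while still collapsing all variants to a single ID.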
We aim to add a few features in future updates, for example mapping the variants reported in dbSNP, OMIM and ClinVar to the HLA alleles. This would help in seamless integration of high-throughput variation data with the wealth of HLA information in the literature and the HLA alleles reported in the IMGT database. To summarise, this is a one-of-its-kind effort to integrate the diversity of HLA information into a structured format for ease of query and analysis. It could also provide an informative resource for non-HLA specialists initiating new studies in populations and diseases.

Acknowledgements

The authors acknowledge COE M/o AYUSH grant MLP-901 to MM and DD, an SRF fellowship to DD from the Department of Biotechnology (DBT), and Dr. Yatender Kumar (NSIT) for permitting AK to work on this project. We also acknowledge Mr Praveen Sinha for designing and developing the webpage of HLA-SPREAD, Dr. Debasis Dash, CSIR-IGIB, for critical review of the work, Dr. Ganesh Bagler and Rudransh Tunwani from IIITD for NLP discussions, Dr. Ganganath Jha from Hazaribagh University for QC of population curation and Malika Seth for QC of semantic annotations. The authors would also like to acknowledge Mr. Raghunandanan MV and Mr. Amit Khulve at CSIR-IGIB for IT support.

Authors Contributions

MM and DD designed the study and co-wrote the manuscript. DD and AK executed the entire work. UK helped in HLA analysis, interpretation and manuscript writing.

List of Figures

Figure 1. Workflow of HLA-SPREAD: An automated pipeline developed to extract information from ~110,000 HLA-related studies retrieved from over 24 million abstracts. Structured information from these abstracts was created using Natural Language Processing methods and developed into a database, HLA-SPREAD. The various resources used at each step are indicated.

Figure 2. Nature and trends of HLA-related publications in PubMed annually from 1975 onwards: Stacked bar plot showing the distribution of PubMed articles in different categories. a) Diverse studies including clinical trials are reported, with maximum numbers represented in the "journal article" category. b) A subplot of (a) after removing the most frequent "journal article" type to visualise the trends in other categories.

Figure 3. The topmost reported HLA alleles associated with diseases: All the HLA alleles indicated have been grouped to their second digit and represented in the pie chart. HLA-A, HLA-B and HLA-DRB1 are the most studied amongst the HLA genes.

Figure 4. Diseases/conditions associated with HLA genes: Graph representing the three-level hierarchy of diseases. Each colour represents a level. There are 24 major categories, represented in green, which are further divided into subcategories. Each disease name is matched to its MeSH ID and a normalised MeSH keyword. Autoimmune, Neoplasms and Joint disease are the topmost associated diseases. As anticipated, significant numbers of studies related to transplantation are also observed.
Figure 5. Heatmap of HLA disease associations: The gradient heat map represents the number of diseases associated with HLA genes. The first column represents generic "HLA" studies where specific gene information is not mentioned. A large number of associations were also observed with non-classical (HLA-E, -F, -G) genes.

Figure 6. Genomic distribution of SNPs: Pie chart representing the number of variations in genic regions, with the majority of them mapping to introns.

Figure 7. Geographical Spread of HLA studies: Identified geographical locations are binned to the nearest a) country, b) ethnic group. The colour gradient represents the count of various HLA alleles with respect to disease or ADR studies. China, Japan and the USA report the maximum number of studies, and European, Asian and African are the most studied ethnic groups.

Figure 8. Statistics of drug-related HLA studies: This bar plot includes the top 20 most common drugs associated with ADRs identified using HLA-SPREAD.

List of Tables

Table 1: Number of articles in broad categories

Categories            Number of PubMed abstracts
Diseases              29713
Transplantation       9258
Signs and Symptoms    6050
ADR                   317

Supplementary tables: https://doi.org/10.5281/zenodo.4276878

References:

1. Mosaad,Y.M. (2015) Clinical Role of Human Leukocyte Antigen in Health and Disease. Scand J Immunol, 82, 283–306.
2. Niehrs,A. and Altfeld,M. (2020) Regulation of NK-Cell Function by HLA Class II. Front. Cell. Infect. Microbiol., 10, 55.
3. Shiina,T., Hosomichi,K., Inoko,H. and Kulski,J.K. (2009) The HLA genomic loci map: expression, interaction, diversity and disease. J Hum Genet, 54, 15–39.
4. Blackwell,J.M., Jamieson,S.E. and Burgner,D. (2009) HLA and Infectious Diseases. CMR, 22, 370–385.
5. Fricke-Galindo,I., LLerena,A. and López-López,M. (2017) An update on HLA alleles associated with adverse drug reactions. Drug Metabolism and Personalized Therapy, 32.
6. Klimenta,B., Nefic,H., Prodanovic,N., Jadric,R. and Hukic,F. (2019) Association of biomarkers of inflammation and HLA-DRB1 gene locus with risk of developing rheumatoid arthritis in females. Rheumatol Int, 39, 2147–2157.
7. Khan,M.A., Mathieu,A., Sorrentino,R. and Akkoc,N. (2007) The pathogenetic role of HLA-B27 and its subtypes. Autoimmunity Reviews, 6, 183–189.
8. Khan,M.A. (2008) HLA-B27 and Its Pathogenic Role. JCR: Journal of Clinical Rheumatology, 14, 50–52.
9. Ferrell,P.B. and McLeod,H.L. (2008) Carbamazepine, HLA-B*1502 and risk of Stevens–Johnson syndrome and toxic epidermal necrolysis: US FDA recommendations. Pharmacogenomics, 9, 1543–1546.
10. Sawal,N., Kanga,U., Shukla,G., Goyal,V. and Srivastava,A.K. (2020) Stevens-Johnson syndrome triggered by Levetiracetam-Caution for use with Carbamazepine. Seizure, 80, 63–64.
11. Ayuk,F., Beelen,D.W., Bornhäuser,M., Stelljes,M., Zabelina,T., Finke,J., Kobbe,G., Wolff,D., Wagner,E.-M., Christopeit,M., et al. (2018) Relative Impact of HLA Matching and Non-HLA Donor Characteristics on Outcomes of Allogeneic Stem Cell Transplantation for Acute Myeloid Leukemia and Myelodysplastic Syndrome. Biology of Blood and Marrow Transplantation, 24, 2558–2567.
12. Petersdorf,E.W. (2017) Which factors influence the development of GVHD in HLA-matched or mismatched transplants? Best Practice & Research Clinical Haematology, 30, 333–335.
13. Kanga,U., Mehra,N.K., Larrea,C.L., Lardy,N.M., Kumar,A. and Feltkamp,T.E.W. (1996) Seronegative Spondyloarthropathies and HLA-B27 Subtypes: A Study in Asian Indians. Clin Rheumatol, 15, 13–18.
14. Xu,H. and Yin,J. (2019) HLA risk alleles and gut microbiome in ankylosing spondylitis and rheumatoid arthritis. Best Practice & Research Clinical Rheumatology, 33, 101499.
15. Andeweg,S.P., Keşmir,C. and Dutilh,B.E. (2020) Quantifying the impact of Human Leukocyte Antigen on the human gut microbiome. Bioinformatics.
16. Gomez,A., Luckey,D., Yeoman,C.J., Marietta,E.V., Berg Miller,M.E., Murray,J.A., White,B.A. and Taneja,V. (2012) Loss of Sex and Age Driven Differences in the Gut Microbiome Characterize Arthritis-Susceptible *0401 Mice but Not Arthritis-Resistant *0402 Mice. PLoS ONE, 7, e36095.
17. Buhler,S. and Sanchez-Mazas,A. (2011) HLA DNA Sequence Variation among Human Populations: Molecular Signatures of Demographic and Selective Events. PLoS ONE, 6, e14643.
18. Saxena,A., Suzuki,S., Mourya,M., Shiina,T. and Kanga,U. (2020) Novel and extended HLA class I and II alleles encountered in Kashmiri Brahmin population from North India. HLA, 96, 487–489.
19. Sfakianaki,P., Koumakis,L., Sfakianakis,S., Iatraki,G., Zacharioudakis,G., Graf,N., Marias,K. and Tsiknakis,M. (2015) Semantic biomedical resource discovery: a Natural Language Processing framework. BMC Med Inform Decis Mak, 15, 77.
20. Rakhi,N.K., Tuwani,R., Mukherjee,J. and Bagler,G. (2018) Data-driven analysis of biomedical literature suggests broad-spectrum benefits of culinary herbs and spices. PLoS ONE, 13, e0198030.
21. Wei,C.-H., Allot,A., Leaman,R. and Lu,Z. (2019) PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Research, 47, W587–W593.
22. Kuleshov,V., Ding,J., Vo,C., Hancock,B., Ratner,A., Li,Y., Ré,C., Batzoglou,S. and Snyder,M. (2019) A machine-compiled database of genome-wide association studies. Nat Commun, 10, 3341.
23. Giudicelli,V., Chaume,D., Bodmer,J., Muller,W., Busin,C., Marsh,S., Bontrop,R., Marc,L., Malik,A. and Lefranc,M.-P. (1997) IMGT, the international ImMunoGeneTics database. Nucleic Acids Research, 25, 206–211.
24. Kuhn,M., Letunic,I., Jensen,L.J. and Bork,P. (2016) The SIDER database of drugs and side effects. Nucleic Acids Res, 44, D1075–D1079.
25. Ghattaoraya,G.S., Dundar,Y., González-Galarza,F.F., Maia,M.H.T., Santos,E.J.M., da Silva,A.L.S., McCabe,A., Middleton,D., Alfirevic,A., Dickson,R., et al. (2016) A web resource for mining HLA associations with adverse drug reactions: HLA-ADR. Database, 2016, baw069.
26. Achakulvisut,T., Acuna,D. and Kording,K. (2020) Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset. JOSS, 5, 1979.
27. Bodenreider,O. (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32, D267–D270.
28. Wang,M., Xing,Z.-M., Yu,D.-L., Yan,Z. and Yu,L.-S. (2004) Association between HLA class II locus and the susceptibility to Artemisia pollen-induced allergic rhinitis in Chinese population. Otolaryngol Head Neck Surg, 130, 192–196.
29. Shefchek,K.A., Harris,N.L., Gargano,M., Matentzoglu,N., Unni,D., Brush,M., Keith,D., Conlin,T., Vasilevsky,N., Zhang,X.A., et al. (2020) The Monarch Initiative in 2019: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Research, 48, D704–D715.