BioNerFlair: biomedical named entity recognition using flair embedding and sequence tagger
Harsh Patel
2020-11-03

Motivation: The proliferation of biomedical research articles has made the task of information retrieval more important than ever. Scientists and researchers are having difficulty finding articles that contain information relevant to them. Proper extraction of biomedical entities like disease, drug/chemical, species, and gene/protein can considerably improve the filtering of articles, resulting in better extraction of relevant information. Performance on BioNer benchmarks has progressively improved because of progression in transformer-based models like BERT, XLNet, and OpenAI GPT-2. These models give excellent results; however, they are computationally expensive, and we can achieve better scores for domain-specific tasks using other contextual string-based models and an LSTM-CRF based sequence tagger.

Results: We introduce BioNerFlair, a method to train models for biomedical named entity recognition using Flair plus GloVe embeddings and a bidirectional LSTM-CRF based sequence tagger. With almost the same generic architecture widely used for named entity recognition, BioNerFlair outperforms previous state-of-the-art models. I performed experiments on eight benchmark datasets for biomedical named entity recognition. Compared to current state-of-the-art models, BioNerFlair achieves the best F1-score of 90.17 (previous best 84.72) on the BioCreative II gene mention (BC2GM) corpus, 94.03 (previous best 92.36) on the BioCreative IV chemical and drug (BC4CHEMD) corpus, 88.73 (previous best 78.58) on the JNLPBA corpus, 91.1 (previous best 89.71) on the NCBI disease corpus, and 85.48 (previous best 78.98) on the Species-800 corpus, while near-best results were observed on the BC5CDR-chem, BC5CDR-disease, and LINNAEUS corpora.

1. Introduction

There has been a sharp increase in the number of research papers in the biomedical domain since the pandemic arrived. Scientists around the world are conducting experiments and clinical trials to learn more about the effects of the pandemic on global health and the economy. Because of this, journals around the world are flooded with biomedical literature, and it is getting difficult to find articles that are relevant, robust, and credible. According to different reports, over 100,000 papers have already been published on COVID-19 alone, and PubMed alone comprises over 30 million citations for biomedical literature. As reports on new discoveries and insights are added to the already overwhelming amount of literature, the need for advanced computational tools for text mining and information extraction is more important than ever. Recent progress of deep learning techniques in natural language processing (NLP) has led to significant advancements on a wide range of tasks and applications, and the domain of biomedical text mining has likewise seen improvement. Performance in biomedical named entity recognition (BioNer), which automatically extracts entities such as diseases, genes/proteins, chemicals, and species, has substantially improved [1, 2]. BioNer can be used to build biomedical knowledge graphs, on which other NLP tasks such as relation extraction and question answering (QA) depend. Thus, improved performance of BioNer can lead to better performance on other, more complex NLP tasks.
Named entities in biomedical literature have several characteristics that make their extraction from text particularly challenging [3], including descriptive naming conventions (e.g. 'normal thymic epithelial cells'), abbreviations (e.g. 'IL2' for 'Interleukin 2'), non-standardized naming conventions (e.g. 'Nacetylcysteine', 'N-acetyl-cysteine', 'NAcetylCysteine', etc.), and conjunction and disjunction (e.g. '91 and 84 kDa proteins' comprises two entities, '91 kDa proteins' and '84 kDa proteins'). Traditionally, NER models for biomedical literature performed efficaciously using feature engineering, i.e. carefully selecting features from the text. These features can be linguistic, orthographic, morphological, or contextual [4]. Selecting the right features that properly represent target entities requires expert knowledge and many trial-and-error experiments, is often time-consuming, and leads to highly specialized models that only work in specialized domains. Models based on convolutional neural networks were proposed to tackle sequence tagging problems [5]. This kind of neural network architecture and learning algorithm reduced the need for domain-specific feature engineering. However, such networks could not connect to previous information that could improve performance for named entity recognition. RNNs can capture earlier information through backpropagation, but they suffer from the vanishing and exploding gradient problems and do not handle long-term dependencies well. The gradients carry the information for parameter updates, and text sequences for NER are generally long; for longer sequences, gradients become vanishingly small, resulting in no updates of the weights [6]. These problems are addressed by a special RNN architecture, Long Short-Term Memory (LSTM), capable of handling long-term dependencies [7]. The BiLSTM-CRF neural architecture produces state-of-the-art performance for NER tasks. This architecture comprises two components: a BiLSTM that predicts labels by capturing information from the text in both directions, and a CRF that computes transition compatibility between all possible pairs of labels on neighboring tokens. This neural architecture is now considered standard for sequence labeling problems [8]. Such architectures generally use vector representations of words (word embeddings) as input to the LSTMs. Word2Vec [9] and GloVe [10] are popular context-independent vector representations of words. Often, character-level features of the text are incorporated into the word embeddings layer to improve the performance of NER models [11]. The use of BiLSTM-CRFs along with such word embeddings led to significant improvements in the performance of NER models, and researchers started experimenting with this architecture for biomedical named entity recognition. Some models used character-level embeddings along with word embeddings pre-trained on a large entity-independent corpus (PubMed abstracts); these models outperformed earlier state-of-the-art models for BioNer [12, 13, 14]. All the word embeddings used until then were context independent and cannot address the polysemous, context-dependent nature of words. The introduction of contextualized string embeddings such as flair embeddings [15] and ELMo [16] solved this problem. These context-dependent word embeddings, when used with BiLSTM-CRFs, outperformed all previous models in named entity recognition.
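To make the contrast concrete, the following minimal sketch uses the open-source Flair library to show that a classic embedding assigns one vector per word type regardless of context, while a contextualized string embedding does not. The sentences and the general-domain 'news-forward' model are illustrative assumptions, not part of the experiments in this paper:

    import torch
    from flair.data import Sentence
    from flair.embeddings import WordEmbeddings, FlairEmbeddings

    def token_vector(embedder, text, index):
        # Embed a sentence and return the vector of the token at `index`.
        sentence = Sentence(text)
        embedder.embed(sentence)
        return sentence.tokens[index].embedding.clone()

    glove = WordEmbeddings('glove')              # classic, context-independent
    flair_fwd = FlairEmbeddings('news-forward')  # contextualized, character-level

    # The word "cell" in a biological vs. a non-biological context.
    v1 = token_vector(glove, 'The epithelial cell divided rapidly', 2)
    v2 = token_vector(glove, 'The prisoner slept in his cell', 5)
    print('GloVe vectors identical:', torch.allclose(v1, v2))   # True

    v1 = token_vector(flair_fwd, 'The epithelial cell divided rapidly', 2)
    v2 = token_vector(flair_fwd, 'The prisoner slept in his cell', 5)
    print('Flair vectors identical:', torch.allclose(v1, v2))   # False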
Transformer-based [17] language representation models like BERT [18] have also achieved state-of-the-art performance in NER. However, applying these NLP methodologies to biomedical literature has limitations because of the different word distributions of general and biomedical corpora. Since recent language representation models are mostly trained on general-domain text, they often face problems on biomedical corpora. Most recent state-of-the-art solutions have shown that using a language representation model pre-trained on biomedical corpora (like PubMed abstracts and PMC full-text articles) gives the best results for biomedical named entity recognition [1, 2]. This paper presents BioNerFlair, a novel architecture for biomedical named entity recognition. BioNerFlair uses contextualized flair string embeddings (pre-trained on the biomedical domain) along with GloVe embeddings at the token embeddings layer; a sequence tagger based on BiLSTM-CRFs is then used to extract named entities from biomedical literature. I evaluate the performance of BioNerFlair on eight benchmark datasets. BioNerFlair outperforms earlier state-of-the-art models on five datasets and shows performance close to previous models on the other three. The following sections present a description of the corpora used for evaluation, followed by a technical description of the architecture along with details of the evaluation metrics.

2. Materials and methods

The statistics of the biomedical named entity recognition datasets are listed in Table 1. BioNerFlair is evaluated on eight standard corpora of disease, gene/protein, drug/chemical, and species entities for biomedical NER: the NCBI [19] and BC5CDR [20] corpora for disease, the BC5CDR [20] and BC4CHEMD [21] corpora for drug/chemical, the BC2GM [22] and JNLPBA [23] corpora for gene/protein, and the LINNAEUS [24] and Species-800 [25] corpora for species. These datasets are widely used by biomedical NLP researchers for testing BioNer models. All the datasets are tagged with the IOB tagging scheme. For proper comparison with other state-of-the-art techniques, the same data split for training, validation, and testing from earlier works [2, 26] is adopted.

[Table 1: statistics of the biomedical NER datasets; the number of annotations is taken from [12], [27], and [2].]

BioNerFlair comprises three layers: a token embeddings layer giving a contextualized vector representation of the input sequence, followed by the BiLSTM and CRF layers of a vanilla BiLSTM-CRF sequence labeler, as depicted in Figure 2, giving state-of-the-art results on BioNer tasks. The token embeddings layer takes as input a sequence of N tokens (x_1, x_2, ..., x_N) and outputs a fixed-dimensional vector representation of each token (e_1, e_2, ..., e_N). The output here is the concatenation of precomputed GloVe embeddings [10] and contextualized flair embeddings [15] pre-trained on roughly 3 million full texts and about 25 million abstracts from PubMed:

e_i = [e_i^GloVe ; e_i^Flair]   (1)

Analysis by [15] shows that combining flair embeddings with classic word embeddings improves the performance of NER models; in BioNerFlair, GloVe embeddings are combined with flair embeddings. Flair embedding is a contextualized character-level word embedding that combines the best attributes of different kinds of embeddings. As recent studies [2, 28] show that pre-training models on biomedical corpora significantly improves the performance of BioNer models, this study uses a flair embedding model pre-trained on biomedical data, which seems to capture latent syntactic and semantic similarities.
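For concreteness, the token embeddings layer and sequence tagger described above can be assembled with the Flair library roughly as follows. This is a minimal sketch, not the exact training script: the data folder, file names, and hidden size are assumptions, and 'pubmed-forward'/'pubmed-backward' name the PubMed-trained flair embeddings that the library ships:

    from flair.datasets import ColumnCorpus
    from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
    from flair.models import SequenceTagger

    # Load an IOB-tagged BioNer corpus (token in column 0, NER tag in column 1).
    corpus = ColumnCorpus(
        'data/NCBI-disease',                  # assumed folder layout
        column_format={0: 'text', 1: 'ner'},
        train_file='train.tsv', dev_file='devel.tsv', test_file='test.tsv',
    )

    # Token embeddings layer (Equation 1): GloVe concatenated with
    # forward/backward flair embeddings pre-trained on PubMed.
    embeddings = StackedEmbeddings([
        WordEmbeddings('glove'),
        FlairEmbeddings('pubmed-forward'),
        FlairEmbeddings('pubmed-backward'),
    ])

    # Vanilla BiLSTM-CRF sequence labeler on top of the stacked embeddings.
    tagger = SequenceTagger(
        hidden_size=256,                      # assumption; Flair's default
        embeddings=embeddings,
        tag_dictionary=corpus.make_tag_dictionary(tag_type='ner'),
        tag_type='ner',
        use_crf=True,
    )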
Flair embeddings produce vector representations from hidden states that are computed not only over the characters of the word but also over the characters of the surrounding context, as illustrated in Figure 1. Since the flair embedding is pre-trained on biomedical corpora and extracts context from linguistic features at the character level, it handles the rare, misspelled, and inconsistently named words that frequently occur in biomedical literature very well. A Long Short-Term Memory network (LSTM) is a special kind of RNN introduced by [7], explicitly designed to avoid the long-term dependency problem. LSTMs do not suffer from the vanishing and exploding gradient problems and, unlike plain RNNs, can therefore remember information for long periods of time. LSTMs are equipped with memory cells along with an adaptive gating mechanism that regulates the information added to or removed from the memory cells. There are three gates in a typical LSTM: a sigmoid layer that decides what information to remove (forget gate), a combination of sigmoid and tanh layers that decides what new information to add (input gate), and another sigmoid layer that decides the output (output gate). The LSTM memory cell is implemented using the following equations:

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
C̃_t = tanh(W_C x_t + U_C h_{t-1} + b_C)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
h_t = o_t ⊙ tanh(C_t)

In the above equations, σ denotes the logistic sigmoid function, ⊙ denotes element-wise multiplication, and i, f, o, and C are the input gate, forget gate, output gate, and cell vectors. In BioNerFlair, the final word embeddings are passed into a BiLSTM network, as it seems to capture past and future features efficiently for a specific time frame [8, 29, 30]. The bidirectional LSTM network is trained using backpropagation through time [31]. Conditional Random Fields (CRFs) [32] are a probabilistic discriminative sequence modeling framework that brings in all the advantages of MEMM models [33, 34] while also solving the label bias problem. Given a training dataset D = {(x_1, y_1), ..., (x_N, y_N)} of N data sequences x_i and their corresponding label sequences y_i, CRFs maximize the log-likelihood of the conditional probability of the label sequences given their data sequences, that is:

L(θ) = Σ_{i=1}^{N} log p_θ(y_i | x_i)

The performance of BioNerFlair is evaluated by training models for each dataset. I used the pre-processed versions of the BioNer datasets provided by [2], and the same data split is used for training and testing the models. Models are evaluated using precision (P), recall (R), and F1 score on the test corpora. A predicted entity is considered correct if and only if both the entity type and boundary exactly match the annotations in the test data:

P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2 * P * R / (P + R)

3. Results and discussion

All the models are trained using the Flair NLP library, a simple framework for state-of-the-art NLP tasks built directly upon PyTorch. I used a GPU (12 GB) provided for free by Google Colab to train the models. The maximum sequence length was set to 512 to get the best training speed without running out of GPU memory, while the mini-batch size for all experiments was set to 32. Model training starts with an initial learning rate of 0.1, patience of 3, and an annealing factor of 0.5. A high learning rate of 0.1 works well at the start when using the stochastic gradient descent (SGD) optimizer and is gradually reduced as the model converges. The flair embeddings dropout is set to 0.5. These hyper-parameters are the same for all the models. Because of the smaller size of the training data and the fast GPU, the training time of most of the models was less than an hour.
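With the tagger and corpus from the previous sketch, training under the hyper-parameters reported above would look roughly like the following. Again a sketch under assumptions: the output path and max_epochs are not stated in the paper and are illustrative:

    from flair.trainers import ModelTrainer

    # SGD training with the reported hyper-parameters: initial learning rate
    # 0.1, halved (anneal_factor=0.5) whenever the dev score has not improved
    # for `patience`=3 epochs, mini-batch size 32.
    trainer = ModelTrainer(tagger, corpus)
    trainer.train(
        'models/bionerflair-ncbi',  # assumed output directory
        learning_rate=0.1,
        mini_batch_size=32,
        anneal_factor=0.5,
        patience=3,
        max_epochs=100,             # assumption: train until annealing stops
    )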
For the BC4CHEMD dataset, however, the model could not fit into GPU memory, because of which the training time increased to around 5 hours. The Flair NLP library also comes with HunFlair [35], a NER tagger for biomedical text. HunFlair ships models for genes/proteins, chemicals, diseases, species, and cell lines. HunFlair models are trained on multiple datasets at the same time, due to which they outperform tools like SciSpacy [36] on unseen text but do not give state-of-the-art results on gold-standard datasets. In BioNerFlair, I trained models from scratch for each dataset, giving the results mentioned above. For these experiments, I also tried to fine-tune HunFlair models on the target corpus, but the model does not fit within 12 GB of GPU memory. Results of the BioNerFlair method for the different datasets are shown in Table 2. The performance of BioNerFlair is compared with other recent state-of-the-art methods. BioNerFlair outperformed state-of-the-art methods on five out of eight datasets while showing near-best performance on the remaining three. The biggest improvement is in the gene/protein category: BioNerFlair achieves the best F1 score of 90.17 (previous best 84.72) on the BC2GM corpus and an F1 score of 88.73 (previous best 78.58) on the JNLPBA corpus. For the species category, BioNerFlair achieves the best F1 score of 85.48 (previous best 74.98) on the Species-800 corpus, while getting the second-best score on the LINNAEUS corpus. The same holds for the disease and drug/chemical categories, where BioNerFlair achieves state-of-the-art results on one dataset while getting near-best scores on the other. Even though BioNerFlair does not get the best results on the BC5CDR corpus for disease and chemical, the results are still competitive with other recent methods, and significant improvements can be seen on the other datasets. In BioNerFlair, I use GloVe embeddings and flair embeddings at the token embeddings layer. The Flair NLP library provides the option of stacked embeddings, which allows us to combine different embeddings together. Flair supports classic word embeddings, character embeddings, contextualized word embeddings, and pre-trained transformer embeddings, so we can experiment with different combinations of embeddings for sequence labeling tasks. The initial plan for this experiment was to use the concatenation of XLNet [42], GloVe embeddings, and the pooled variant of flair embeddings [43] (a sketch of this planned combination is shown below). However, this combination of embeddings requires a lot of GPU memory, because of which I used the combination of embeddings mentioned above. If more resources are available, we can possibly further improve the performance of BioNer models.

4. Conclusion

In conclusion, this article presents BioNerFlair, a method to train models for biomedical named entity recognition using Flair plus GloVe embeddings and a sequence tagger. This paper shows that using contextualized word embeddings pre-trained on biomedical corpora significantly improves the results of BioNer models. I evaluated the performance of BioNerFlair on eight datasets; BioNerFlair achieves state-of-the-art results on five of them. For future study, I plan to experiment with different contextualized and transformer-based word embeddings to further improve the performance of biomedical named entity recognition models.
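For reference, the planned stacked-embedding combination mentioned above can be expressed with the Flair library as follows. This is only a sketch: the 'xlnet-base-cased' model name is an assumption, and, as noted above, this stack is memory-hungry and may not fit on a 12 GB GPU:

    from flair.embeddings import (
        WordEmbeddings,
        PooledFlairEmbeddings,
        TransformerWordEmbeddings,
        StackedEmbeddings,
    )

    # The originally planned combination: XLNet + GloVe + pooled flair
    # embeddings, stacked into a single token representation.
    planned_embeddings = StackedEmbeddings([
        TransformerWordEmbeddings('xlnet-base-cased'),
        WordEmbeddings('glove'),
        PooledFlairEmbeddings('pubmed-forward'),
        PooledFlairEmbeddings('pubmed-backward'),
    ])

This planned_embeddings object could replace the stacked embeddings in the earlier tagger sketch unchanged; only the GPU memory requirement differs.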
References

[1] DTranNER: biomedical named entity recognition with deep learning-based label-label transition model.
[2] BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
[3] Recognizing names in biomedical texts: a machine learning approach.
[4] Biomedical named entity recognition: a survey of machine-learning tools.
[5] Natural language processing (almost) from scratch.
[6] Learning long-term dependencies with gradient descent is difficult.
[7] Long short-term memory.
[8] Neural architectures for named entity recognition.
[9] Distributed representations of words and phrases and their compositionality.
[10] GloVe: global vectors for word representation.
[11] Character-aware neural language models.
[12] Deep learning with word embeddings improves biomedical named entity recognition.
[13] An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition.
[14] Character-word LSTM language models.
[15] Contextual string embeddings for sequence labeling.
[16] Deep contextualized word representations.
[17] Attention is all you need.
[18] BERT: pre-training of deep bidirectional transformers for language understanding.
[19] NCBI disease corpus: a resource for disease name recognition and concept normalization.
[20] BioCreative V CDR task corpus: a resource for chemical disease relation extraction.
[21] The CHEMDNER corpus of chemicals and drugs and its annotation principles.
[22] Overview of BioCreative II gene mention recognition.
[23] Introduction to the bio-entity recognition task at JNLPBA.
[24] LINNAEUS: a species name identification system for biomedical literature.
[25] The SPECIES and ORGANISMS resources for fast and accurate identification of taxonomic names in text.
[26] Cross-type biomedical named entity recognition with deep multi-task learning.
[27] Clinical concept extraction with contextual word embedding.
[28] D3NER: biomedical named entity recognition using CRF-BiLSTM improved with fine-tuned embeddings of various linguistic information.
[29] Speech recognition with deep recurrent neural networks, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
[30] Bidirectional LSTM-CRF models for sequence tagging.
[31] A guide to recurrent neural networks and backpropagation, the Dallas project.
[32] Conditional random fields: probabilistic models for segmenting and labeling sequence data.
[33] A maximum entropy model for part-of-speech tagging.
[34] Maximum entropy Markov models for information extraction and segmentation.
[35] HunFlair: an easy-to-use tool for state-of-the-art biomedical named entity recognition.
[36] ScispaCy: fast and robust models for biomedical natural language processing.
[37] Document-level attention-based BiLSTM-CRF incorporating disease dictionary for disease named entity recognition.
[38] Effective use of bidirectional language modeling for transfer learning in biomedical named entity recognition.
[39] A transition-based joint model for disease named entity recognition and normalization.
[40] CollaboNet: collaboration of deep neural networks for biomedical named entity recognition.
[41] Transfer learning for biomedical named entity recognition with neural networks.
[42] XLNet: generalized autoregressive pretraining for language understanding.
[43] Pooled contextualized embeddings for named entity recognition.

Acknowledgments

I would like to thank the Department of Computer Science and Engineering, Medi-Caps University, for the support. I also thank the anonymous reviewers for their comments and suggestions. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. Source code and data are available at https://github.com/harshpatel1014/BioNerFlair. Declarations of interest: none.