key: cord-0286735-zpo2d0kz authors: Yeung, Cheng S.; Beck, Tim; Posma, Joram M. title: MetaboListem and TABoLiSTM: Two Deep Learning Algorithms for Metabolite Named Entity Recognition date: 2022-02-23 journal: bioRxiv DOI: 10.1101/2022.02.22.481457 sha: 8dec610f8e0f40ab596a926c19d6d07bdaa7cbd9 doc_id: 286735 cord_uid: zpo2d0kz Reviewing the metabolomics literature is becoming increasingly difficult because of the rapid expansion of relevant journal literature. Text-mining technologies are therefore needed to facilitate more efficient literature review. Here we contribute a standardised corpus of full-text publications from metabolomics studies and describe the development of two new metabolite named entity recognition (NER) methods. We introduce two deep learning methods for metabolite NER based on Bidirectional Long Short-Term Memory (BiLSTM) networks incorporating different transfer learning techniques. Our first model (MetaboListem) follows prior methodology using GloVe word embeddings. Our second model exploits BERT and BioBERT for embedding and is named TABoLiSTM (Transformer-Affixed BiLSTM). The methods are trained on a novel corpus annotated using rule-based methods, and evaluated on manually annotated metabolomics articles. MetaboListem (F1 score 0.890, precision 0.892, recall 0.888) and TABoLiSTM (BioBERT version: F1 score 0.909, precision 0.926, recall 0.893) have achieved state-of-the-art performance on metabolite NER. A corpus with >1,200 full-text Open Access metabolomics publications and >116,000 annotated metabolites was created. This work demonstrates that deep learning algorithms are capable of identifying metabolite names accurately and efficiently in text. The proposed corpus and NER algorithms can be used for metabolomics text-mining tasks such as information retrieval, document classification and literature-based discovery. 
Availability The corpus and NER algorithms are freely available with detailed instructions from GitHub at https://github.com/omicsNLP/MetaboliteNER. Since the late 1990s, metabolomics has contributed to a better understanding of the roles that small molecules play in health and disease (1). The ongoing development of high-throughput data acquisition ensures that the technology is accessible to more researchers (2), evidenced by the year-on-year increase in the number of studies in any organism as well as those done only in humans (Figure 1). With the growth of the number of publications comes the difficulty of literature review, and performing a literature review of all papers relating to a specific phenotype has become an almost impossible task for researchers (3). To facilitate more efficient review processes, computational literature mining tools are needed. Natural Language Processing (NLP) is a branch of computer science that aims to understand natural human language computationally; in the biomedical field, NLP often refers to the technologies that allow computational tools to interpret scientific texts written by humans. In recent years, NLP has been successfully applied in many biomedical text mining tasks, including information retrieval, document classification and literature-based discovery (4-7). By identifying biologically important entities in articles and computationally inferring the connections between them, NLP facilitates understanding of biological relationships from textual information. Such efficient and automatic knowledge mining systems can provide a comprehensive viewpoint to researchers by analysing a vast number of articles. Some advances have been made in metabolomics in using computer algorithms to read scientific text (8-11); however, these suffer from drawbacks such as reliance on rule-based annotation methods and dictionary matching.
While these methods can indeed correctly capture a vast number of metabolite names, they are dependent on the library of compound names used, and no library is exhaustive. The latest version of the Human Metabolome Database (HMDB) (12) reports that 18,557 metabolites have been detected and quantified in the human body, and 91,822 more are expected or predicted. Moreover, these numbers are still increasing, from 2,180 (HMDB 1.0 in 2007) to 114,100 (HMDB 4.0 in 2017) in a decade (13), and are predicted to become an order of magnitude higher (1). Hence there is a need for generalisable approaches for automated metabolite recognition. Within NLP, Named Entity Recognition (NER) is a field concerned with automatically detecting specific entities in texts. To facilitate biomedical text-mining, and NER in particular, BioCreative (Critical Assessment of Information Extraction in Biology) (14) has proposed multiple tasks/challenges and provided corpora for chemical entity recognition, such as the CHEMDNER (chemical compound and drug name recognition, BioCreative V4) (15) and CEMP (Chemical Entity Mention in Patents, BioCreative V5) (16) challenges. The use of machine learning techniques such as Conditional Random Fields (CRF) has shown promising results, with F1-scores >0.87 for tmChem (17) in the V4 challenge; however, the CRF methods were surpassed in the V5 challenge by a deep learning (DL) system using Bidirectional Long Short-Term Memory (BiLSTM) recurrent neural networks called ChemListem (18) (F1-score of 0.90). Adding a word embedding layer to NER DL models is well known to improve performance (19), and ChemListem used context-free word embeddings/representations using global vectors (GloVe) (20) for this purpose. However, in recent years substantial improvements have been made in NLP by using contextual embeddings with transformers such as BERT (Bidirectional Encoder Representations from Transformers) (21).
BERT was trained on a corpus containing 3.3 billion words, mainly from the English Wikipedia and BooksCorpus. Unlike context-free embedding techniques, BERT's contextual embedding makes the vector representation of a word dependent on its context. Therefore, for biomedical text mining the BioBERT model (22) was pretrained on PubMed abstracts and PubMed Central (PMC) articles (23), and has been used to obtain F1-scores on the BioCreative V4 and V5 corpora of 0.93 and 0.94, respectively, and F1-scores of 0.78-0.85 for recognising genes and proteins (24). There are three main issues with using these methods for metabolite NER. First, as mentioned, with these methods the context matters, and therefore training algorithms to recognise chemical entities in abstracts does not necessarily mean they work as well on full-text articles. Second, while a metabolite is a chemical entity, a chemical entity is not necessarily a metabolite; these existing algorithms may therefore produce many false positives when used in metabolomics. Last, each metabolite can have multiple (sometimes ten or even a hundred) different synonyms (over 1.2 million synonyms are reported in HMDB), adding to the complexity of using a dictionary of synonyms as new names can be coined by different authors. Hence, new algorithms are needed that focus specifically on recognising metabolites and that are not constrained to abstracts but work on full-text paragraphs. Here, we describe the development of a standardised, machine-readable metabolomics corpus of full-text Open Access (OA) PMC articles by using the Auto-CORPus (Automated and Consistent Outputs from Research Publications) package (25) for text standardisation, in conjunction with semi-automatic annotation of metabolites in full texts using a combination of dictionary searching (using HMDB to stay in line with prior work (10)), regular expression matching and rule-based approaches.
We then use this corpus to train two DL-based algorithms to perform metabolite NER, with the aim of obtaining a generalisable model that can be used to speed up metabolomics literature review. Dataset. The data used in this study consist of Open Access metabolomics publications from PMC (n=1,218). The metabolomics corpus consists of human-related articles in 18 categories; 8 of which were selected by trait and 10 by journal (see Table 2). The following search terms were used for filtering the metabolomics corpus: The publications were extracted on the following dates: cancer (27 Nov 2020), gastrointestinal (15 Apr 2020), liver disease (4 Jun 2021), metabolic syndrome (27 Nov 2020), neurodegenerative, psychiatric, and brain illnesses (27 Nov 2020), respiratory (16 Apr 2020), sepsis (8 Apr 2020), smoking (4 Jun 2021); Analytica Chimica Acta (14 Apr 2020), Analytical and Bioanalytical Chemistry (15 Apr 2020), Analytical Chemistry (22 Apr 2020), Journal of Chromatography A (14 Apr 2020), Journal of Proteome Research (8 Apr 2020), Metabolites (11 Sep 2020), Metabolomics (15 Apr 2020), PLOS One (27 Nov 2020), Proceedings of the National Academy of Sciences of the United States of America (14 Apr 2020), Scientific Reports (15 Apr 2020). The search terms above are formatted in segments, and the segments filter PMC articles by: type of metabolomics study, type of sample (biofluid), technology/assay, human study, exclusion of non-human and non-metabolomics studies, and publication date, respectively. After we filtered for metabolomics articles, a list of search terms was used to search by phenotype/journal.
Journals were filtered by adding a term such as '(Metabolites[Journal])', whereas the phenotypes were filtered by adding the following: Neurodegenerative, psychiatric, and brain illnesses: The metabolomics articles were stored in HyperText Markup Language (HTML) format, then processed with the Auto-CORPus package (25) and standardised into machine-readable JavaScript Object Notation (JSON) documents. Auto-CORPus outputs three JSON files based on the input of an HTML file of an article: 'maintext', 'table' and 'abbreviation'. In the maintext file, the textual content of the HTML is split into subsections, and each subsection has five attributes: 'section_heading', 'subsection_heading', 'body', 'IAO_term' and 'IAO_ID', where IAO stands for Information Artifact Ontology (26). Here we focused on four section types based on standardised IAO terms, namely the textual abstract ('A'), methods ('M'), results ('R') and discussion ('D') sections. We did not use the introduction sections as these do not typically contain information pertaining to the study itself or (how) new results (were acquired). Rule-based annotation pipeline. The metabolomics corpus was annotated using a semi-automated pipeline to deal with the large volume of text and prepare it for supervised learning. The pipeline takes the Auto-CORPus-processed articles as JSON input, and outputs the location (character number) of metabolites in each text (an example can be found in Figure 4 in the Results section). The annotation pipeline comprises three steps: pre-processing, identifying metabolites and locating recognised entities. Pre-processing. (Sub)sections are split into sentences using sentence tokenisation with the BioBERT (22) model (version biobert-base-cased-v1.2) using the Huggingface Transformers package (27). Each sentence is uniquely identifiable by two attributes: the PMC identifier (PMCID) of the article that the sentence belongs to, and a sentence identifier.
A sentence identifier is a string that contains a letter and five digits. Metabolite identification. The next step is to identify any metabolite mentioned in each tokenised sentence using both dictionary searching and regular expression matching. The dictionary used here is based on the list of metabolites (and their synonyms) that are classified in HMDB (12) as 'quantified' or 'detected'. A list of all metabolites and synonyms was first downloaded on 25 June 2020, and subsequently refreshed on 16 August 2021, after which it was filtered for the 'quantified' or 'detected' metabolites. Short abbreviations (up to 5 characters) that are not exclusive to metabolites were removed (e.g., 'PC' can refer to the metabolite 'phosphatidylcholine', but is also a common abbreviation for 'principal component' and 'pancreatic cancer'), and the list was then further cleaned of other non-specific words such as 'result' (a synonym of 'omeprazole'). To complement dictionary searching, the regular expressions are designed to provide partial matches for metabolite entities (Table 1). It is expected that more metabolites can be captured in this manner compared to complete matching. Entities with a partial match are then expanded to include the entire word, which comprises either a full metabolite name or part of a metabolite name; the latter scenario is treated further in post-processing. To craft the regular expressions, the HMDB dictionary was used as a reference. From the dictionary, the entities with one or more spaces and those with symbols such as the hyphen ('-') and colon (':') were selected to form a 'reference set'. The reference set is meant to include metabolite names that are less trivial (and thereby more regular). Trivial terms such as singletons are usually more difficult to capture with regular expressions without drastically increasing the number of false positives by including common words.
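The combination of dictionary searching and regular expression matching can be illustrated with a minimal sketch. The dictionary entries and the pattern below are invented stand-ins for the HMDB terms and the expressions in Table 1, not the pipeline's actual resources:

```python
import re

# Hypothetical mini-dictionary; the real pipeline uses HMDB 'quantified'
# and 'detected' metabolites plus their synonyms.
dictionary = {"glucose", "glutamine", "citric acid"}

# Illustrative pattern in the spirit of Table 1: words ending in common
# metabolite suffixes. The paper's actual expressions are more elaborate.
pattern = re.compile(r"\b\w+(?:ose|ine|ate|ol)\b", re.IGNORECASE)

def find_metabolites(sentence):
    """Return sorted (start, end, text) spans from both match strategies."""
    spans = set()
    lower = sentence.lower()
    # Dictionary searching: every occurrence of every known term.
    for term in dictionary:
        i = lower.find(term)
        while i != -1:
            spans.add((i, i + len(term), sentence[i:i + len(term)]))
            i = lower.find(term, i + 1)
    # Regular expression matching: partial/pattern-based recognition.
    for m in pattern.finditer(sentence):
        spans.add((m.start(), m.end(), m.group()))
    return sorted(spans)

print(find_metabolites("Serum glucose and lactate were elevated."))
# → [(6, 13, 'glucose'), (18, 25, 'lactate')]
```

Note how 'lactate' is caught by the regex alone even though it is absent from the toy dictionary, which is the motivation for combining the two strategies.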
The regular expression approach is used to complement the HMDB dictionary; trivial terms are therefore omitted from the reference set as they are assumed to be listed in HMDB already. This was used to create a set of regular expressions such that most, if not all, of the reference set can be (at least partially) matched by at least one regular pattern in the set, while avoiding matching unwanted terms. The regular expressions were created in a recursive manner: while not all terms in the reference set are matched and there remain some observable patterns in the remaining terms, either a new expression was created or an existing expression augmented in order to match the remaining regular terms. Examples of regular expressions and the terms that they are expected to match are shown in Table 1. Post-processing. Post-processing of the matches was used to integrate the results of dictionary searching and regular expression matching, including those from partial matching. This step comprises three components: combining adjacent words, balancing the number of brackets, and merging overlapping entities. For each metabolite entity that has been recognised, its adjacent words are evaluated. The term before the entity is prepended to the annotation only if certain rules are satisfied: the preceding word ends with a hyphen ('-'), ends with a number, the recognised entity itself starts with a comma and number (e.g. ',3'), or the current word starts with a hyphen. If the preceding word is a stop word, or if none of the rules can be met, then the procedure is complete and no term is prepended to the entity. The rules for the next word are the same as described above, except that now the entity becomes the 'preceding' word in the above scenario and one more rule is evaluated (whether the next word is one of 'acid', 'isomer', 'ester' or 'ether'). This process is recursive and continues until no adjacent words satisfy any of the rules.
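The adjacent-word rules above can be sketched as follows; the stop-word list is an illustrative subset and the helpers are simplified relative to the pipeline's actual implementation:

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "in"}   # illustrative subset
SUFFIX_WORDS = {"acid", "isomer", "ester", "ether"}  # whitelist for next word

def should_prepend(prev_word, entity):
    """Decide whether the word preceding a recognised entity is absorbed."""
    if prev_word.lower() in STOP_WORDS:
        return False
    return (prev_word.endswith("-")                            # e.g. 'beta-'
            or prev_word[-1:].isdigit()                        # e.g. '1'
            or (entity.startswith(",") and entity[1:2].isdigit())  # ',3...'
            or entity.startswith("-"))                         # '-phosphate'

def should_append(next_word):
    """The next word is absorbed if it is a whitelisted suffix word or
    itself starts with a hyphen (simplified sketch of the mirrored rules)."""
    return next_word.lower() in SUFFIX_WORDS or next_word.startswith("-")
```

In the pipeline these checks are applied recursively, so a match such as 'glutamic' followed by 'acid' grows word by word until no rule fires.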
For all entities the number of brackets is evaluated to ensure all parentheses, square brackets and curly brackets have a starting and ending bracket in the entity. If an opening or closing bracket is missing, then the entity is enclosed by the corresponding missing bracket. All entities that overlap or are separated only by a (white)space character are merged into a single entity. The entire post-processing step is executed recursively, i.e. until there is no further change in the list of identified metabolite entities. Once this has finished, two files are created that are used to train the DL models. The first file contains all sentences that have at least one recognised metabolite, along with their sentence identifiers (see Figure 2) and the PMCIDs of the articles the sentences originate from. The second file is a table with five fields: PMCID, sentence identifier, position of the start character of the entity, position of the end character of the entity, and the entity itself. These two files are exemplified in the middle panel of Figure 4. Training, validation and test set generation and evaluation. The 1,218 metabolomics publications were split at random into a training, validation and test set in a 75:10:15 ratio, corresponding to 913 training publications, 122 publications for validation and 183 publications for the test set. The publications in the training and validation sets were processed using the rule-based annotation pipeline described above, creating a semi-automatically annotated corpus. The publications in the test set were manually annotated (including entities in the full text as well as in tables) to avoid bias and to provide a ground truth against which to evaluate the algorithms.
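The bracket-balancing and merging steps described above can be sketched as follows; this is a minimal illustration, not the pipeline's actual implementation:

```python
def balance_brackets(entity):
    """Enclose an entity with any missing opening/closing bracket."""
    for opener, closer in [("(", ")"), ("[", "]"), ("{", "}")]:
        if entity.count(opener) > entity.count(closer):
            entity = entity + closer      # missing closing bracket
        elif entity.count(closer) > entity.count(opener):
            entity = opener + entity      # missing opening bracket
    return entity

def merge_spans(spans, text):
    """Merge (start, end) spans that overlap or are separated only by
    whitespace; end positions are exclusive."""
    if not spans:
        return []
    spans = sorted(spans)
    merged = [list(spans[0])]
    for start, end in spans[1:]:
        prev = merged[-1]
        gap = text[prev[1]:start]
        if start <= prev[1] or (gap != "" and gap.isspace()):
            prev[1] = max(prev[1], end)   # extend the previous span
        else:
            merged.append([start, end])
    return [tuple(s) for s in merged]

print(balance_brackets("1,3-bisphosphoglycerate)"))  # → (1,3-bisphosphoglycerate)
print(merge_spans([(0, 6), (7, 11)], "citric acid"))  # → [(0, 11)]
```

As in the pipeline, these helpers would be applied repeatedly until the entity list no longer changes.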
Although metabolite abbreviations and entities in tables were not deliberately annotated in the training set with the annotation pipeline, they were manually annotated in the test set for completeness; these types of metabolite names were considered separately when employing the models and evaluating performance (detailed in Post-processing). The performance of the algorithms was assessed by evaluating the number of true positives (TP), false positives (FP) and false negatives (FN) and calculating the precision (TP/(TP+FP)), recall (TP/(TP+FN)), F1-score (2×precision×recall/(precision+recall)) and F* measure (28) (precision×recall/(precision+recall-precision×recall)) on the test set. Metabolite NER using LSTM. A BiLSTM model architecture that was previously used on a chemical NER task, achieving F1-scores >0.9 (18), was used here for metabolite NER. We first evaluated the original ChemListem model on the corpus; we then developed two new methods (MetaboListem, TABoLiSTM), based on the prior methodology and described below, to improve performance. Our new models were trained for 50 epochs, and the epoch that achieved the highest F1-score on the validation set was used for evaluation on the test set. All F1-scores reported here are for the test set only. Pre-processing. Following prior work (18), a tokeniser designed for chemical text-mining (OSCAR4) (29) was employed to tokenise words in the metabolomics corpus for MetaboListem. For the TABoLiSTM model, we used BioBERT as tokeniser (27) because TABoLiSTM uses the BioBERT token embeddings. For both word tokenisation systems, the BIOES tag scheme was employed to label token sequences. The BIOES tag scheme is a standard sequence labelling technique that classifies tokens into five categories: 'B' labels the tokens at the beginning of entities, 'I' labels internal tokens of entities, 'E' labels the ending tokens of entities, 'S' labels singletons (i.e.
tokens that are a complete entity on their own), and 'O' labels tokens that do not belong to an entity. ChemListem used a pre-classifier random forest to predict the probability of a token being (part of) a chemical entity ('B', 'I', 'E' or 'S') based on a set of ChEBI-derived (30) chemicals and chemical elements. For MetaboListem and TABoLiSTM we implemented the same pre-classifier approach using the set of metabolites from HMDB (12) rather than chemicals from ChEBI, where tokens were segmented using OSCAR4 and BioBERT for MetaboListem and TABoLiSTM, respectively. The pre-classifier produces name-internal features that were used in the model training. ChemListem and MetaboListem. For ChemListem (18), name-internal features for each token were computed in the pre-processing stage using the pre-classifier and treated as one of the two inputs to the model (Figure 3). The features go through a convolutional layer with a width of 3 and are concatenated with the numeric representation of words (i.e. a unique number assigned to each word) in the GloVe (20) vector space. Each token now has 256+300=556 dimensions, where 256 is the number of output neurons of the convolutional layer, and 300 the GloVe vector representation. The concatenated tensor is then fed into the BiLSTM layer, before outputting a 5-dimensional result (the number of tags used in BIOES tagging) via a TimeDistributed Dense layer. Our MetaboListem algorithm is based on the same neural network architecture (and GloVe embedding) as ChemListem, hence we named the algorithm in the same manner. It differs from ChemListem only in the training data (the metabolomics corpus as opposed to BioCreative V5 (16)), pre-classifier data (HMDB instead of ChEBI) and TensorFlow implementation (version 2.0 versus 1.3). The TABoLiSTM model differs from MetaboListem mainly in the use of pre-trained BERT (21) (bert-base-cased) and BioBERT (22) (biobert-base-cased-v1.2) embeddings.
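The BIOES scheme described above can be illustrated with a short sketch; the token-index spans here are hypothetical gold annotations:

```python
def bioes_tags(tokens, entity_spans):
    """Label tokens with BIOES tags; entity_spans are (start, end) token
    indices with end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        if end - start == 1:
            tags[start] = "S"              # singleton entity
        else:
            tags[start] = "B"              # beginning token
            for i in range(start + 1, end - 1):
                tags[i] = "I"              # internal token
            tags[end - 1] = "E"            # ending token
    return tags

print(bioes_tags(["citric", "acid", "and", "glucose"], [(0, 2), (3, 4)]))
# → ['B', 'E', 'O', 'S']
```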
The name-internal features and the numeric representation of words were all re-computed using (Bio)BERT tokens instead of words, and the last hidden layer of the (Bio)BERT model replaces the GloVeEmbeddingLayer in Figure 3. The (Bio)BERT layer provides contextualised word embeddings, and the embeddings were concatenated with the convolution output of name-internal features as previously, except with larger tensors of 256+768=1024 dimensions. A SpatialDropout1D layer with dropout rate 0.25 is applied on top of the (Bio)BERT layer to prevent overfitting and improve model performance via embedding dropout (31). Post-processing. Post-processing of our DL models involves the same procedure as that of the annotation pipeline to fix pathological terms such as incomplete entities: checking the neighbouring words, fixing unbalanced parentheses where possible, and merging overlapping positions. At this stage further rule-based approaches could be employed to improve the models; for MetaboListem and TABoLiSTM these were not used, but they represent potential further improvements that could be made to the post-processing. The DL algorithms were trained on sentences with full metabolite names only, and were not trained on abbreviations or data in tables. In the post-processing step, the algorithms were applied to the Auto-CORPus JSON output of the list of abbreviations (abbreviations and definitions) and any recognised entity (definition) was then linked to its abbreviation, with all occurrences in the full text marked as a recognised entity. This was only done when the entire definition was recognised as a metabolite entity, i.e. the rules to fix incomplete entities are not applied here. A direct approach of replacing all abbreviations in the text with their definitions was investigated but discarded, since the current implementation of the abbreviation recognition does not detect all types of abbreviations.
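A minimal Keras sketch of the TABoLiSTM architecture described above, assuming precomputed (Bio)BERT embeddings as one input; the sequence length, feature dimension and LSTM width are illustrative, not the trained model's actual hyperparameters:

```python
from tensorflow.keras import layers, Model

# Illustrative sizes: 64-token sentences, 768-d (Bio)BERT embeddings,
# 32-d name-internal features from the pre-classifier.
SEQ_LEN, EMB_DIM, FEAT_DIM = 64, 768, 32

# Input 1: contextual token embeddings (last hidden layer of (Bio)BERT,
# assumed precomputed here), regularised with SpatialDropout1D(0.25).
emb_in = layers.Input(shape=(SEQ_LEN, EMB_DIM))
emb = layers.SpatialDropout1D(0.25)(emb_in)

# Input 2: name-internal features, passed through a width-3 convolution
# with 256 output neurons.
feat_in = layers.Input(shape=(SEQ_LEN, FEAT_DIM))
conv = layers.Conv1D(256, 3, padding="same", activation="relu")(feat_in)

# Concatenate to 768+256=1024 dimensions per token, then a BiLSTM and a
# per-token softmax over the 5 BIOES tags.
x = layers.Concatenate()([emb, conv])
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
out = layers.TimeDistributed(layers.Dense(5, activation="softmax"))(x)

model = Model([emb_in, feat_in], out)  # output shape: (batch, 64, 5)
```

MetaboListem follows the same wiring, with the (Bio)BERT input replaced by a GloVe embedding lookup (256+300=556 dimensions per token).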
The same approach was undertaken on the Auto-CORPus JSON output of all tabular data, where each cell was evaluated by the DL models as a sentence. Any recognised entity was annotated and evaluated against the manually annotated entities in the test set. Metabolomics corpus. The corpus generated by means of the annotation pipeline (see Materials and Methods) is of similar size to the BioCreative V5 dataset. We extracted 54,493 sentences that contain metabolites from the 1,218 metabolomics articles, and in these identified a total of 116,393 metabolites. For comparison, the BioCreative V5 dataset contains 21,000 abstracts and 99,632 chemical entities. Table 2 details the metabolomics corpus and the number of annotated unique entities, illustrating that most metabolomics articles report between 20 and 30 unique metabolites. While metabolomics articles on cancer are the largest single group within the corpus, the average number of metabolites reported in cancer articles is similar to that of other disease areas. Model performance. We first evaluated the application of the rule-based annotation pipeline on the 183 manually annotated test set articles. Our pipeline yields an F1-score of 0.8893, with precision and recall of 0.8850 and 0.8937 respectively (see Table 3). Next, we evaluated the performance of ChemListem (18), trained on the BioCreative V5 dataset, on the metabolomics corpus, where it achieves an F1-score of 0.7669. Our first DL model uses the same architecture as ChemListem but was trained on the metabolomics corpus; this model was therefore named MetaboListem, and it achieved an F1-score of 0.8900 (precision of 0.8923 and recall of 0.8877). Our second DL model uses a different embedding layer (BERT or BioBERT) to MetaboListem, but was trained on the same dataset.
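The scores above follow the definitions given in Materials and Methods; a minimal helper for computing them from raw counts:

```python
def ner_metrics(tp, fp, fn):
    """Precision, recall, F1 and F* from TP/FP/FN counts, following the
    definitions in Materials and Methods."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    f_star = precision * recall / (precision + recall - precision * recall)
    return precision, recall, f1, f_star
```

As a sanity check on the F1 formula, the reported precision (0.926) and recall (0.893) for TABoLiSTM with BioBERT give 2×0.926×0.893/(0.926+0.893) ≈ 0.909, consistent with Table 3.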
The TABoLiSTM (Transformer-Affixed Bidirectional LSTM) model with BERT embeddings achieved an F1-score of 0.9004, with precision and recall rates of 0.9187 and 0.8829 respectively; the same model with BioBERT embeddings results in a 0.85% improvement of the F1-score. The workflow, including the structure of the metabolomics corpus, the annotation and the result of applying TABoLiSTM to the abstract of a publication, is shown in Figure 4. We exemplify the differences between the output of the TABoLiSTM model and our annotation pipeline using two snippets of text in Figure 5 that illustrate typical examples of how the DL model predictions differ from the rule-based annotations. These examples include spelling errors, metabolites not included in the dictionary and metabolite abbreviations. Table 3. Summary of model performances. Models are evaluated in terms of F1-score, precision and recall rate, and F*-score using the metabolomics corpus test set (183 manually annotated full-text articles). The best result for each metric is highlighted in bold. Beyond prediction accuracy, another aspect to evaluate is model size. Despite the performance advantages, TABoLiSTM is much larger than both ChemListem and MetaboListem; the sizes of the latter two models are around 26MB, whereas the size of the TABoLiSTM weight file exceeds 412MB. Furthermore, the pretrained BERT/BioBERT models (cased versions) are required to employ TABoLiSTM, which adds an extra 415MB. The performance of each of the 5 models was evaluated for separate sections (abstract, materials/methods, results, results&discussion, discussion) in the test set (Figure 6). Across all models, the best performance was achieved on separate results and discussion sections, with the lowest performance for materials/methods and results&discussion sections. Automated literature review of the metabolomics corpus.
We applied TABoLiSTM to the abstract, results (including table data) and discussion sections of each of the 8 traits (see Table 2) in the metabolomics corpus to demonstrate the application of the algorithm to literature review. Figure 7 visualises the results for the 492 cancer articles in the corpus, where only metabolites found in at least 3 different articles are shown. The size of the text of each metabolite is proportional to the number of articles that report the metabolite, and the position is based on how frequently metabolites are found together. In total, 664 metabolites are recognised in at least three of the 492 articles on cancer (compared to 12,662 unique metabolite names across all 492 articles). The five most frequent metabolites amongst these are glucose, glutamine, lactate, alanine and glutamate, which are mentioned in 22-29% of articles (see Table 4). In the cancer summary graph, no apparent patterns or clusters are observable based on the node colours, due to the three-article threshold applied to the graph. This is in contrast to other networks, such as the one created for the smoking trait, where the restriction of 3 articles is not applied and all 1,360 metabolites reported in 124 articles are visualised (see Figure 8). This reflects the low number of common metabolites for this trait in the corpus (Table 4). Glucose is the only metabolite that appears in the top 10 for all traits, with it being reported in 62% of articles on metabolic syndrome. We have described three new DL models, using different word embedding layers, that have all achieved state-of-the-art performance (F1-score >0.89) for metabolite NER on a manually annotated dataset of OA metabolomics articles. Our DL models (F1-scores of 0.91, 0.90 and 0.89 for TABoLiSTM (BioBERT), TABoLiSTM (BERT) and MetaboListem, respectively) surpass by a considerable margin previous methods (using CRFs) for metabolite NER, which achieved maximum F1-scores of 0.78 (8) and 0.87 (9).
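The node sizes and edges of the co-occurrence graphs described above (Figures 7 and 8) can be derived by simple counting; the per-article metabolite sets below are invented examples of what the NER output provides:

```python
from itertools import combinations
from collections import Counter

# Hypothetical per-article metabolite mentions, as extracted from the
# abstract, results and discussion sections.
articles = {
    "PMC000001": {"glucose", "lactate", "glutamine"},
    "PMC000002": {"glucose", "alanine"},
    "PMC000003": {"glucose", "lactate"},
}

# Node size ∝ number of articles reporting the metabolite.
node_size = Counter(m for mets in articles.values() for m in mets)

# An edge exists if two metabolites co-occur in at least one article;
# the counter weight is the number of co-occurring articles.
edges = Counter()
for mets in articles.values():
    for a, b in combinations(sorted(mets), 2):
        edges[(a, b)] += 1

print(node_size["glucose"], edges[("glucose", "lactate")])  # → 3 2
```

From such counts, a threshold (e.g. metabolites in at least three articles, as in Figure 7) can be applied before handing the graph to a layout tool such as Gephi.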
Our methods, compared to similar methods for chemical entity recognition (17, 18, 24) and metabolite NER (8, 9), have the benefit of having been trained not only on abstracts and titles, but also on the full-text paragraphs that are relevant for information retrieval from studies (results and discussion sections). The corpora these algorithms were trained and evaluated on are made available for future re-use and algorithm development, and to the best of our knowledge are the first of their kind for metabolomics. In addition, we developed a semi-automated annotation pipeline combining regular expression and rule-based approaches with dictionary searching, both for preparing the training data for the DL methods and as an alternative to them; this pipeline also achieved outstanding performance (F1-score >0.88), surpassing prior DL algorithms (8, 9). Annotation pipeline. Recent work that used NLP methods via dictionary searching for text mining in metabolomics highlighted the considerable user effort (1-4 hours as per (10)) required to run these types of analyses. Our annotation pipeline includes a dictionary of metabolite names and synonyms (from HMDB) which can easily be extended by including data from other databases such as ChemSpider (34) and ChEBI (30). However, any entity added to the dictionary must undergo some form of quality control to ensure it is a metabolite. Here, we have limited ourselves to only including metabolites listed in HMDB that were classified as either 'quantified' or 'detected' and have not included those that are 'expected' or 'predicted'. The annotation pipeline also includes regular expression and rule-based methods used to recognise metabolites in text, and these add additional advantages: they are time-efficient, easily customisable and explainable.
While it takes minutes or even hours for a skilled expert to annotate paragraphs and entire full-text articles, the time to process one full article using these methods is less than 15 seconds on average. Once experts establish additional rules for recognising naming conventions, these can easily be added to the pipeline. Moreover, for each identified entity it is immediately obvious which rule(s) resulted in it being predicted as a metabolite entity, making the pipeline more interpretable than a DL model. For this reason, any false positives or false negatives can be analysed and new sets of rules established to correctly annotate the metabolites in text. This also allows for developing a user-feedback pipeline in which new rules can be shared to obtain better results. We have used this semi-automated annotation pipeline to prepare the training corpus for the DL methods; any improvement in the annotation rules may therefore also improve the DL models. Another application of the pipeline is to run it in parallel with the DL-based methods and combine the results as a post-processing step. The main limitation of the annotation system is that it assumes the dictionary of metabolite names contains all known metabolites with irregular names.

Fig. 7. Graphical representation of metabolites mentioned in the abstract, results and discussion sections of cancer-related metabolomics articles. Each node represents a metabolite and is displayed as the (lower-cased) metabolite name. Two nodes are connected by an edge if the corresponding metabolites appear in the same article at least once. The displayed names are sized proportionally to the number of articles that report the metabolite, and are coloured according to the article that mentions the metabolite most (if such an article is not unique, the one with the smaller PMCID number is selected). The graph layout is generated using ForceAtlas2 (32) implemented in Gephi (33). Because of the large number of metabolites (see Table 2), only the metabolites that appear in three or more articles are used for the graph construction. Edges are hidden for the sake of visibility.

We used the HMDB as dictionary because it is the most comprehensive database of human metabolites thus far; however, our annotation approach may still miss metabolites. One example is endoxifen, which is an active metabolite of the breast cancer drug tamoxifen. Although its synonym 4-hydroxy-N-desmethyltamoxifen can be recognised using regular expressions (for example because of the prefix), the word endoxifen itself is not recognised by the pipeline because of its irregularity and its absence from our dictionary (it exists in HMDB as 'expected'). Augmenting the dictionary with the metabolites listed as 'expected'/'predicted' may appear an obvious solution; however, adding all of these terms would generate a dictionary four times larger than the current one and therefore drastically increase the runtime of the rule-based annotation. Also, >90% of the 'expected' metabolites are lipids (12), and these names are usually recognisable using regular expressions. Overall, the annotation pipeline is dependent on its data and rules; it can therefore not detect metabolites that have been misspelled, nor will it flag certain terms as false positives if they appear in the dictionary as synonyms for a metabolite. For example, common words or words with multiple meanings (depending on the context) are recognised by the system, such as 'result' (a synonym of 'omeprazole') and 'retinal' (both a vitamin derivative and a medical term for 'eye'). HMDB also includes chemical entities such as 'ammonium acetate' and 'silica', which are commonly mentioned in the methods sections of metabolomics articles.
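The endoxifen case above can be illustrated with a small sketch: a hypothetical locant-plus-substituent prefix rule (the pattern below is invented for illustration, not the pipeline's actual rule set) captures the systematic synonym 4-hydroxy-N-desmethyltamoxifen but has no handle on the irregular trivial name endoxifen.

```python
import re

# Hypothetical prefix rule: a numeric locant followed by a common substituent
# prefix (e.g. '4-hydroxy-') marks a candidate metabolite name even when the
# full name is absent from the dictionary.
PREFIX_RULE = re.compile(
    r"\b\d+-(?:hydroxy|methyl|amino|oxo|keto|acetyl)[\w-]+",
    re.IGNORECASE,
)

def prefix_hits(text):
    """Return substrings matched by the locant/substituent prefix rule."""
    return [m.group() for m in PREFIX_RULE.finditer(text)]
```

Applied to a sentence mentioning both surface forms, only the systematic name is flagged; the trivial name would have to come from the dictionary.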
As the HMDB includes more than 18,000 'detected' and 'quantified' metabolites and their synonyms, it was not feasible to examine all entities during pre-processing in this study; only those with five or fewer characters were manually checked. Further cleaning of the dictionary would therefore likely improve the precision of the annotation pipeline.

Table 4. Top ten metabolites reported for each trait in the metabolomics corpus. The number of articles that report the metabolite is given, including the percentage of the total. The total number of articles in each category can be found in Table 2.

Fig. 8. Graphical representation of metabolites mentioned in the smoking-related metabolomics articles. Each node represents a metabolite and is displayed as the (lower-cased) metabolite name. Two nodes are connected by an edge if the corresponding metabolites appear in the same article at least once. The displayed names are sized proportionally to the number of articles that report the metabolite, and are coloured based on the article that reports the metabolite most (if such an article is not unique, the one with the lower PMCID number is selected). The graph is generated using ForceAtlas2 (32) implemented in Gephi (33). Only the metabolites in the abstract, results (including those in tables) and discussion sections are plotted here. Unlike Figure 7, all metabolites (including those that appear in fewer than three articles) are used in the graph construction. Edges are hidden for the sake of visibility.

These results indicate that the combination of regular expressions, dictionary searching and rules can indeed correctly capture a vast number of metabolite names, as also reported by Majumder et al (10). To the best of our knowledge, our metabolomics corpus, which consists of the abstract, method, result and discussion sections of 183 manually annotated and 835 semi-automatically annotated PMC articles, is the first available corpus that is designed for human metabolomics research.
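The graph construction described in the captions of Figures 7 and 8 (nodes weighted by article count, edges from within-article co-occurrence, with Figure 7 additionally keeping only metabolites found in three or more articles) can be sketched with the standard library alone. The article-to-metabolite mapping below is invented for illustration; in practice it comes from the NER output over each article's abstract, results and discussion sections.

```python
from collections import Counter
from itertools import combinations

# Toy article -> metabolites mapping (hypothetical PMCIDs and contents).
articles = {
    "PMC0001": {"glucose", "lactate", "alanine"},
    "PMC0002": {"glucose", "lactate"},
    "PMC0003": {"glucose", "citrate"},
}

# Node weight: number of articles that report the metabolite.
node_weight = Counter(m for mets in articles.values() for m in mets)

# Edge: the two metabolites co-occur in at least one article.
edges = set()
for mets in articles.values():
    edges.update(combinations(sorted(mets), 2))

# Figure 7 additionally keeps only metabolites found in >= 3 articles.
kept = {m for m, n in node_weight.items() if n >= 3}
```

The resulting node weights and edge set can then be handed to a layout engine such as ForceAtlas2 in Gephi, as done for the figures.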
While some databases include references to articles where metabolites are mentioned, these databases do not contain enough information (e.g. annotated texts) for NLP model development. An advantage of our metabolomics corpus, compared to common chemical NER corpora (15, 16), is that it contains not only abstracts but also sentences from the method, result and discussion sections, which are relevant for context-dependent NER. While abstracts indeed provide rich context, the structure of their sentences often differs from that of the main text; training and employing contextual NLP models only on abstracts could therefore result in overlooking valuable information when mining the full text of articles. A limitation of the current metabolomics corpus is that it was designed only to label metabolites that appear in full in sentences; that is, if the recognition of a metabolite requires semantic interpretation then it would not be correctly detected. For example, if the string '1-and 3-methylhistidine' is encountered by the pipeline, then only '3-methylhistidine' is annotated, whereas the metabolite '1-methylhistidine' is also in the string. To overcome this limitation of the corpus, it needs to be processed further using semantic analysis techniques such as parse trees (35, 36). We are currently working on addressing a limitation of the Auto-CORPus package (25) that we used to process the full text. In investigating our results, we found that any abbreviations that cannot be mapped to full names are not annotated; this includes, for example, cases where Greek letters are abbreviated by letters of the Latin alphabet (α as 'a'). We also encountered instances where metabolites are abbreviated without the abbreviation (PC, PE, SM, and others) being defined, because these abbreviations are commonly used in the (lipidomics) community.
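The '1- and 3-methylhistidine' case could be handled by a small coordination-expansion step before or after annotation. The regular expression below is a hypothetical sketch of such a step, not the corpus pipeline itself, and covers only the simple two-locant pattern discussed here.

```python
import re

# Hypothetical coordinated-locant pattern: '1-and 3-methylhistidine' names
# two metabolites, but only the full surface form '3-methylhistidine' is
# annotated by the corpus pipeline.
COORD = re.compile(r"(\d+)-\s*and\s+(\d+)-(\w+)")

def expand(text):
    """Return the full metabolite names implied by a coordinated-locant phrase."""
    m = COORD.search(text)
    if not m:
        return []
    first, second, stem = m.groups()
    return [f"{first}-{stem}", f"{second}-{stem}"]
```

A full solution would need parse-tree-based semantic analysis, as noted above, since coordination in chemical names takes many more forms than this single pattern.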
These abbreviations were also not annotated in the (semi-automatically annotated) metabolomics corpus, but could easily be added as a set of rules to the algorithm. Lastly, superscripts and subscripts in the corpus are encoded differently from normal text by Auto-CORPus (25), and this negatively impacts the annotation algorithms. We therefore anticipate periodically updating the metabolomics corpus based on user feedback and further research. Deep learning models. Four LSTM-based models were evaluated on the same manually annotated test set as the annotation pipeline. ChemListem (18), which achieved an F1-score >0.90 on the BioCreative V5 (16) challenge data, achieved a lower F1-score of 0.77 when evaluated on the metabolomics corpus test set. This is in contrast with our MetaboListem model, which used the same neural network architecture as ChemListem but was trained on the metabolomics corpus, achieving an F1-score of ∼0.89. The lower precision and recall rates of ChemListem when applied to our corpus are the results of higher false positives and false negatives, respectively. The false positives are expected given that ChemListem was trained to recognise chemical entities and not metabolites specifically. The higher false negative rate is indicative of limited power for detecting metabolites, despite these also being chemical entities; the recall rate of ChemListem evaluated on the CEMP corpus is 0.89 (18), notably higher than on the metabolomics corpus. TABoLiSTM improves on all metrics compared to MetaboListem. This improvement is directly attributable to using contextualised token embeddings, enhancing the contextual sensitivity, and to using smaller token sequences that sit in between character- and word-level embeddings. Smaller token sequences are known to be more suitable for NER tasks where segmenting words is difficult (38), for example with metabolites.
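The subword ('smaller token') representation can be illustrated with a toy WordPiece-style greedy longest-match tokeniser. The vocabulary below is invented for illustration, whereas BERT and BioBERT learn theirs from large corpora; the point is only that a name unseen as a whole word still decomposes into familiar pieces.

```python
# Toy WordPiece-style greedy longest-match tokenisation, showing how subword
# units sit between character- and word-level representations. The vocabulary
# is hypothetical; real models learn it from data.
VOCAB = {"tamoxifen", "endo", "##xifen"}

def wordpiece(word, vocab=VOCAB):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation marker for non-initial pieces
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary piece matches this position
    return pieces
```

Under this toy vocabulary, 'tamoxifen' survives as one token while 'endoxifen' splits into pieces shared with known names, which is one intuition for why the transformer-based model handles irregular metabolite names.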
While the annotation pipeline has the highest recall rate, we observed many instances where metabolite names with spelling errors were recognised by the MetaboListem and TABoLiSTM models but missed by the rule-based annotation pipeline. This includes misspelled entities such as 'slanine' (typo of 'alanine'), 'acetoace acid' (typo of 'acetoacetic acid') and 'palmi acid' (typo of 'palmitic acid'), indicating that the DL models are capable of generalising and identifying unseen metabolite names, including misspelled ones, by recognising token structures and contexts. In light of the earlier discussion on the annotation pipeline not recognising endoxifen, the DL models do correctly recognise endoxifen as a metabolite, likely because the token structure of endoxifen is similar to that of tamoxifen, in addition to the similar context embedding of metabolites. MetaboListem outperforms the precision of the annotation pipeline by 0.7%, which may be explained by the ambiguous synonyms falsely identified by the annotation pipeline. For the TABoLiSTM model (whose BioBERT embedding achieves 4% higher precision than the annotation pipeline), this may be due to learning contexts rather than the rules and regular structures built into the annotation pipeline. The algorithms were (trained and) evaluated on the full-text output from Auto-CORPus (25); however, Auto-CORPus also provides separate JSON output files for table data and abbreviations, and these files mostly contain single terms without context. Empirically, we found that although the DL models are context-sensitive by construction (BiLSTM network and BioBERT embedding), they detect entities in tables and abbreviation lists with high accuracy, comparable to the full-text results.
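The DL models learn this robustness implicitly; as a rough standalone illustration (not the models' actual mechanism) of why character-level structure tolerates typos, padded character-trigram overlap still places 'slanine' far closer to 'alanine' than to an unrelated name.

```python
def trigrams(word):
    """Character trigrams of a word, padded so word edges also contribute."""
    padded = f"#{word}#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def jaccard(a, b):
    """Jaccard similarity of the two words' trigram sets (1.0 = identical)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)
```

A one-character typo disturbs only the trigrams touching the edited position, so most of the name's character structure, like most of its subword tokens, is preserved.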
Overall, the semi-automatic annotation pipeline is ideally used in parallel with the DL methods, as combining these can provide wider coverage of features that are not yet learned by the DL algorithms due to limited training examples. Likewise, this is also relevant for addressing some limitations of the DL methods identified from their false positives. We found that while most bacteria are not recognised, a very small number of bacterial species are picked up as 'metabolites' by the DL methods. These terms all include tokens that can also be found in metabolites, such as 'lact' (Lactobacillus, Lactococcus), 'butyr' (Butyricimonas, Pseudobutyrivibrio, Butyrivibrio), 'chloro' (Chloroflexi), 'succini' (Succinivibrio), 'sulfo' (Desulfovibrio), and 'acid' and 'amino' (Acidaminobacter). As part of a wider effort to provide NLP tools to the omics community, we are developing an NER algorithm for microbiome studies which, when complete, can also be used to filter out false positives from our DL models. Other metabolites are identified (by both the DL and rule-based methods) that are in fact part of an enzyme or pathway name. In our DL models we have not included negative training examples of terms in these contexts; however, more elaborate rules can also be developed to filter these false positives out in a post-processing step. For example, we observed in the training set that the words 'process' and 'success' are falsely recognised as metabolites in certain contexts. Based on the dictionary list, no known metabolite name/synonym ends with 'ss', therefore all terms that match the regular pattern 'ss$' could be excluded. This type of rule-based post-processing was not used for our models because of the risk of introducing bias in model evaluation; note that an n-gram of 'success' is found in metabolite names such as succinic acid/succinate. In terms of usability of the DL algorithms, the model size is an important factor.
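Such a post-processing filter is straightforward to sketch; the fragment below implements the 'ss$' exclusion discussed above (deliberately not applied in our evaluation) on a list of predicted entities.

```python
import re

# Post-filter sketch: drop predicted entities matching a pattern that no
# dictionary name/synonym exhibits, here terms ending in 'ss'.
ENDS_SS = re.compile(r"ss$", re.IGNORECASE)

def postfilter(predictions):
    """Remove predicted entities that end in 'ss'."""
    return [p for p in predictions if not ENDS_SS.search(p)]
```

Because the pattern is anchored at the end of the term, names that merely contain the bigram internally (e.g. names built on the 'succ' stem, such as succinate) pass through unchanged.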
The TABoLiSTM models require 827 MB of storage, whereas ChemListem and MetaboListem only need 26 MB. TABoLiSTM may therefore be too heavy to be leveraged in web services and would ideally be used on local systems; employing the lighter MetaboListem model in an online environment is more feasible, despite the cost of 1.8% in F1-score. On average, a full-text article can be annotated by MetaboListem in under 10 seconds, whereas for TABoLiSTM this is close to 40 seconds. We have exemplified how the algorithms can be used to speed up literature review by providing a summary of all identified metabolites in the full text. The most commonly reported metabolites appear to be those that can be measured using virtually all major assays. Glucose is reported most frequently for 4 of the traits, and is one of the top 5 metabolites for the other traits. For cancer studies, the high co-occurrence of glucose and lactate is indicative of the Warburg effect (39), where glucose is converted to lactate for cancer cell energy metabolism. However, the remaining top reported metabolites are mostly amino acids, which is also observed for other traits. At the same time, some trait-specific metabolites are found, such as bilirubin and cholesterol for liver diseases, cholesterol and triglycerides for metabolic syndrome, cotinine and nicotine for smoking, and tryptophan, kynurenine and serotonin for neurological/psychiatric/brain diseases. The latter is reflective of the serotonin hypothesis (40), which has resulted in antidepressant drugs such as kynurenine uptake inhibitors. The proximities between nodes in the network graphs are known to express community structures (41); hence it is not surprising to find so many amino acids and other common metabolites near the centre of the graphs. In summary, this work contributed the first annotated metabolomics corpus to facilitate future NER development for metabolomics research.
Our four novel NER algorithms, including a rule-based annotation algorithm and three DL NER systems, are implemented in Python. All data generated, including the annotated metabolomics corpus (manually annotated test set and semi-automatically annotated training and validation sets), code to generate models, the MetaboListem model and code for post-processing, is available via https://github.com/omicsNLP/MetaboliteNER. These algorithms have all shown state-of-the-art performance (F1 score >0.88) for metabolite NER tasks (with the best model achieving an F1-score of 0.91), thereby facilitating future text-mining of metabolomics literature.

References

1. Metabolomics for investigating physiological and pathophysiological processes
2. Novel technologies for metabolomics: More for less
3. Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references
4. Natural language processing to extract symptoms of severe mental illness from clinical text: the clinical record interactive search comprehensive data extraction (CRIS-CODE) project
5. Natural language processing of clinical notes on chronic diseases: Systematic review
6. A framework for information extraction from tables in biomedical literature
7. A machine-compiled database of genome-wide association studies
8. Mining metabolites: extracting the yeast metabolome from the literature
9. Metabolite named entity recognition: A hybrid approach
10. Cognitive analysis of metabolomics data for systems biology
11. Is current practice adhering to guidelines proposed for metabolite identification in LC-MS untargeted metabolomics? A meta-analysis of the literature. Claudine Manach, and Augustin Scalbert.
12. HMDB 4.0: the human metabolome database
13. HMDB: a knowledgebase for the human metabolome
14. Overview of BioCreAtIvE: critical assessment of information extraction for biology
15. CHEMDNER: The drugs and chemical names extraction challenge
16. Overview of the interactive task in BioCreative V. Database
17. tmChem: a high performance approach for chemical named entity recognition and normalization
18. Chemlistem: chemical named entity recognition using recurrent neural networks
19. Deep learning with word embeddings improves biomedical named entity recognition
20. GloVe: Global Vectors for Word Representation
21. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
22. BioBERT: a pre-trained biomedical language representation model for biomedical text mining
23. Report from the Field: PubMed Central, an XML-based Archive of
24. Biomedical named entity recognition using BERT in the machine reading comprehension framework
25. Auto-CORPus: A Natural Language Processing Tool for Standardising and Reusing Biomedical Literature
26. An information artifact ontology perspective on data collections and associated representational artifacts
27. Transformers: State-of-the-Art Natural Language Processing
28. F*: an interpretable transformation of the f-measure
29. OSCAR4: a flexible architecture for chemical text-mining
30. Chemical Entities of Biological Interest: an update
31. A theoretically grounded application of dropout in recurrent neural networks
32. ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software
33. Gephi: An Open Source Software for Exploring and Manipulating Networks
34. ChemSpider: A Platform for Crowdsourced Collaboration to Curate Data Derived From Public Compound Databases
35. RelEx: Relation extraction using dependency parse trees
36. Tree kernel-based relation extraction with context-sensitive structured parse tree information
37. Status of text-mining techniques applied to biomedical text
38. Character-level neural network for biomedical named entity recognition
39. Understanding the Warburg Effect: The Metabolic Requirements of Cell Proliferation
40. Intensification of the central serotoninergic processes as a possible determinant of the thymoleptic effect
41. Modularity clustering is force-directed layout

This work has been supported by Health Data Research (HDR) UK and the Medical Research Council via a UKRI Innovation Fellowship to TB (MR/S003703/1) and a Rutherford Fund Fellowship to JMP (MR/S004033/1). The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results. ORCID: 0000-0002-4971-9003 (JMP).