title: Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit
authors: Kraljevic, Zeljko; Searle, Thomas; Shek, Anthony; Roguski, Lukasz; Noor, Kawsar; Bean, Daniel; Mascio, Aurelie; Zhu, Leilei; Folarin, Amos A; Roberts, Angus; Bendayan, Rebecca; Richardson, Mark P; Stewart, Robert; Shah, Anoop D; Wong, Wai Keong; Ibrahim, Zina; Teo, James T; Dobson, Richard JB
date: 2020-10-02

Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of Information Extraction (IE) technologies to enable clinical analysis. We present the open source Medical Concept Annotation Toolkit (MedCAT) that provides: a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary, including UMLS/SNOMED-CT; b) a feature-rich annotation interface for customizing and training IE models; and c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1 0.467-0.791 vs 0.384-0.691). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals, with self-supervised training over ~8.8B words from ~17M clinical records and further fine-tuning with ~6K clinician-annotated examples. We show strong transferability (F1>0.94) between hospitals, datasets and concept types, indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.

Electronic Health Records (EHR) are large repositories of clinical and operational data that have a variety of use cases, from population health, clinical decision support and risk factor stratification to clinical research. However, health record systems store large portions of clinical information in unstructured format or proprietary structured formats, resulting in data that is hard to manipulate, extract and analyse. There is a need for a platform to accurately extract information from free-form health text in a scalable manner that is agnostic to underlying health informatics architectures. We present the Medical Concept Annotation Toolkit (MedCAT): an open-source Named Entity Recognition + Linking (NER+L) and contextualization library, an annotation tool and online learning training interface, and an integration service for the broader CogStack 1 ecosystem for easy deployment into health systems. The MedCAT library can learn to extract concepts (e.g. diseases, symptoms, medications) from free text and link them to any biomedical ontology such as SNOMED-CT 2 and UMLS 3. MedCATtrainer 4, the annotation tool, enables clinicians to inspect, improve and customize the extracted concepts via a web interface built for training MedCAT information extraction pipelines. This work outlines the technical contributions of MedCAT and compares the effectiveness of these technologies with existing biomedical NER+L tools. We further present real clinical usage of our work in the analysis of multiple EHRs across various NHS hospital sites, including running the system over ~20 years of collected data pre-dating even the usage of modern EHRs at one site. MedCAT has been deployed and has contributed to clinical research findings in multiple NHS trusts throughout England 5,6.
Recently NER models based on Deep Learning (DL), notably Transformers 7 and Long-Short Term Memory Networks 8 have achieved considerable improvements in accuracy 9 . However, both approaches require explicit supervised training. In the case of biomedical concept extraction, there is little publically available labelled data due to the personal and sensitive nature of the text. Building such a corpus can be onerous and expensive due to the need for direct EHR access and domain expert annotators. In addition, medical vocabularies can contain millions of different named entities with overlaps (see Figure. 1). Extracted entities will also often require further classification to ensure they are contextually relevant; for example extracted concepts may need to be ignored if they occurred in the past or are negated. We denote this further classification as meta-annotation. Overall, using data-intensive methods such as DL can be extremely challenging in real clinical settings. This work is positioned to improve on current tools such as the Open Biomedical Annotator (OBA) service 10 that have been used in tools such as DeepPatient 11 and ConvAE 12 to structure and infer clinically meaningful outputs from EHRs. MedCAT allows for continual improvement of annotated concepts through a novel self-supervised machine learning algorithm, customisation of concept vocabularies, and downstream contextualisation of extracted concepts. All of which are either partially or not addressed by current tools. Figure 1 . A fictitious example of biomedical NER+L with nested entities and further 'meta-annotations'; a further classification of an already extracted concept e.g. 'time current' indicates extracted concepts are mentioned in a temporally present context. Each one of the detected boxes (nested) has multiple candidates in the Unified Medical Language System (UMLS). The goal is to detect the entity and annotate it with the most appropriate concept ID, e.g. for the span Status , we have at least three candidates in UMLS, namely C0449438 , C1444752 , C1546481. Due to the limited availability of training data in biomedical NER+L, existing tools often employ a dictionary-based approach. This involves the usage of a vocabulary of all possible terms of interest and the associated linked concept as specified in the clinical database e.g. UMLS or SNOMED-CT. This approach allows the detection of concepts without providing manual annotations. However, it poses several challenges that occur frequently in EHR text. These include: spelling mistakes, form variability (e.g. kidney failure vs failure of kidneys), recognition and disambiguation (e.g. does 'hr' refer to the concept for 'hour' or 'heart rate' or neither). We compare prior NER+L tools for biomedical documents that are capable of handling extremely large concept databases (completely and not a small subset). MetaMap 13 was developed to map biomedical text to the UMLS Metathesaurus. MetaMap cannot handle spelling mistakes and has limited capabilities to handle ambiguous concepts. It offers an opaque additional 'Word-Sense-Disambiguation' system that attempts to disambiguate candidate concepts that consequently slows extraction. Bio-YODIE 14 improves upon the speed of extraction compared to MetaMap and includes improved disambiguation capabilities, but requires an annotated corpus or supervised training. SemEHR 15 builds upon Bio-YODIE to somewhat address these shortcomings by applying manual rules to the output of Bio-YODIE to improve the results. 
Manual rules can be labour-intensive, brittle and time-consuming, but they can produce good results 16. cTAKES 17 builds on existing open-source technologies: the Unstructured Information Management Architecture 18 framework and the OpenNLP 19 natural language processing toolkit. The core cTAKES library does not handle any of the previously mentioned challenges without additional plugins. MetaMap, Bio-YODIE, SemEHR and cTAKES only support extraction of UMLS concepts. BioPortal 20 offers a web-hosted annotation API for 880 distinct ontologies. This is important for use cases that are not well supported by only the UMLS concept vocabulary 21 or are better suited to alternative terminologies 22. However, transmitting sensitive hospital data to an externally hosted annotation web API may be prohibited under data protection legislation 23. The BioPortal annotator is a 'fixed' algorithm, so it does not allow customisation or improvement through machine learning, or support for non-English language corpora 24. Aside from limited capabilities in SemEHR, none of the reviewed tools support further contextualisation of extracted concepts; this is ostensibly treated as a downstream task, although it is often required before extracted concepts can be used in clinical research. Supplementary Figure 1 provides further examples of NER+L challenges and prior tool comparisons. Overall, MedCAT addresses shortcomings of prior tools to support modern information extraction requirements for biomedical text.

We firstly present our concept recognition and linking results, comparing performance across the previously described tools in Section 1.1.2, using the UMLS concept database and the openly available datasets presented in Section 5.3. We then present a qualitative analysis of learnt concept embeddings, demonstrating the captured semantics of MedCAT concepts. Finally, we show real-world clinical usage of the deployed platform to extract, link and contextualise SNOMED-CT concepts across multiple NHS hospital trusts in the UK. Table 1 presents our results for self-supervised training of MedCAT and NER+L performance compared with prior tools using openly available datasets. Our results show MedCAT improves performance compared to all prior tools across all tested metrics. We observe that the best performance across all tools is achieved in the MedMentions (Diseases) dataset. However, MedCAT still improves F1 performance by 10 percentage points over the next best system. We also observe a performance improvement across all metrics when self-supervised training is carried out initially on MIMIC-III and then further on MedMentions. We note the simpler Word2Vec embedding (U/MI) on average performs better than the more expressive Bio_ClinicalBERT (U/MI/B) embeddings. For concept disambiguation the MedCAT core library learns vector embeddings from the contexts in which a concept appears. This is similar to prior work 26, although we also present a novel self-supervised training algorithm, annotation system and wider workflow. Using our learnt concept embeddings we perform a qualitative analysis by inspecting concept similarities, with the expectation that similar concepts have similar embeddings. Figure 2 shows that the learnt context embeddings capture medical knowledge, including relations between diseases, medications and symptoms. The MedCAT platform was used in a number of clinical use cases, providing evidence for its applicability to answer relevant, data-intensive research questions.
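To make the qualitative inspection of concept similarities concrete, the following is a minimal sketch of ranking concepts by the cosine similarity of their embeddings. The concept identifiers, names and vectors are illustrative placeholders rather than MedCAT's stored embeddings, which would normally be loaded from a trained concept database.

```python
import numpy as np

# Illustrative placeholder embeddings; in practice these would be the V_concept
# vectors learnt by MedCAT during self-supervised or supervised training.
rng = np.random.default_rng(0)
concept_vectors = {
    "C0011849 | diabetes mellitus": rng.random(300),
    "C0020538 | hypertension": rng.random(300),
    "C0018801 | heart failure": rng.random(300),
    "C0027051 | myocardial infarction": rng.random(300),
}

def most_similar(query: str, k: int = 3):
    """Rank the other concepts by cosine similarity to the query concept's embedding."""
    q = concept_vectors[query]
    scores = {}
    for name, vec in concept_vectors.items():
        if name == query:
            continue
        scores[name] = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(most_similar("C0018801 | heart failure"))
```

With real learnt embeddings, the nearest neighbours of a disorder concept are expected to be related disorders, medications and symptoms, as illustrated in Figure 2.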
For example, the platform was used to extract relevant comorbid health conditions in individuals with severe mental illness and in patients hospitalized after Covid-19 infection 5,6,27. These use cases analysed data sources from two acute secondary/tertiary care services, King's College Hospital (KCH) and University College London Hospitals (UCLH), and the mental health care services of the South London and Maudsley (SLaM) NHS Foundation Trust, in London, UK. The following results focus on providing an aggregate view of MedCAT performance over real NER+L clinical use cases, meta-annotation classification tasks and model transferability across clinical domains (physical health vs mental health), EHR systems and concepts. Contextualisation of extracted and linked concepts is, by design, bespoke per project. Due to this, reporting and comparing results across studies and sites is difficult, as the definitions of tasks and the concepts collected differ and the resulting trained models are therefore bespoke. Figure 3a shows aggregate performance at each site, and Figures 3b-c show further experiments for cross-site and cross-concept model transferability. We achieve strong weighted (0.892-0.977) / macro (0.841-0.860) F1 performance across all tasks and sites, with a breakdown of each metric per site/task available in Supplementary Tables 2-4. We report average macro and weighted F1 scores, demonstrating the variation in performance due to unbalanced datasets across most tasks. For cross-concept transferability, Figure 3b shows a decrease in performance when stratifying by concept. However, we still observe relatively high scores of 0.82-0.85, suggesting the model is capable of learning disorder-independent representations that distinguish the classification boundary for the 'Diagnosis' task, not just disorder-specific contexts. Our cross-site transferability results, Figure 3c, suggest the 'Status' meta-annotation model that is trained on cross-site (KCH) data and then fine-tuned on site-specific data performs better (+0.08 macro / +0.09 weighted F1) compared with training on the SLaM site-specific data only. We export and aggregate our MedCATtrainer-collected annotations and find our system consistently shows an increase in the ratio of annotations marked 'correct' vs 'incorrect', even as we continue to see more concepts or forms of concepts (i.e. abbreviations, synonyms and alternative names). Detailed analysis over a range of KCH-collected annotations is available in Supplementary Note: MedCATtrainer Annotation Analysis.

3. Discussion

Our evaluation of MedCAT's NER+L method using self-supervised training was benchmarked against existing tools that are able to work with large biomedical databases and are not use-case specific. Our datasets and methods are publicly available, making the experiments transparent, replicable and extendable. With the MedMentions dataset, using only self-supervised learning, our results in Section 2.1 demonstrate an improvement on the prior tools for both disease detection (F1=0.791 vs 0.691) and general concept detection (F1=0.467 vs 0.384). We observe all tools perform best with the MedMentions (Diseases) dataset. We suggest this is broadly due to the lack of ambiguity in the set of available disease concepts, allowing alternative systems to also perform reasonably well. The general concept detection task with MedMentions is difficult due to the larger number of entities to be extracted, the rarity of certain concepts and the often highly context-dependent nature of some occurrences.
Recent work 28 highlights examples of ambiguous texts within the MedMentions dataset, such as 'probe' with 7 possible labels ('medical device', 'indicator reagent or diagnostic aid', etc.). Further work 28 also showed a deep learning approach (BioBERT+) that achieved F1=0.56. When MedCAT is provided with the same supervised training data we achieve F1=0.71. We find our improved performance is due to the long tail of entities in MedMentions that lack sufficient training data for methods such as BioBERT to perform well. Our qualitative inspection of the learnt concept embeddings, Section 2.2, indicates learnt semantics of the target medical domain. This result mirrors similar findings reported in fields such as materials science 29. Recent work has suggested an approach to quantify the effectiveness of learnt embeddings 26 in representing the source ontology. However, this relies on concept relationships being curated before assessment, requiring clinical guidance that may be subjective in the clinical domain. We leave a full quantitative assessment of the learnt embeddings to future work for this reason. Finally, as more concepts are extracted the likelihood of concepts requiring disambiguation increases, particularly in biomedical text 30. Estimating the number of training samples needed for successful disambiguation is difficult, but based on our experiments we need at least 30 occurrences of a concept in the free text to perform disambiguation (see Supplementary Note: Estimating Example Counts for Sufficient F1 Score). MedCAT models and annotated training data have been implemented to be easily shared and reused, facilitating a federated learning approach to model improvement and specialisation with models brought to sensitive data silos. Our results in Section 2.3 demonstrate that we are able to directly apply models trained at one hospital site (KCH) to multiple other sites and clinical domains (physical vs mental health datasets) with only a small drop in average F1 (0.044 at UCLH, 0.062 at SLaM), and after a small amount of additional site-specific training, we observe comparable performance (-0.021 at UCLH, -0.002 at SLaM). We also highlight that separate teams were able to deploy, extract and analyse real clinical data using the tools as-is by following the provided examples, documentation and integrations with the wider CogStack ecosystem. Academic engineering projects are often built to support a single research project; however, MedCAT and the CogStack ecosystem are scalable, fit-for-purpose, locally-tunable solutions for teams to derive value from their data instead of being stalled by poor quality code or lack of documentation. Each hospital site and clinical team freely defined the set of meta-annotation tasks and associated values for each task. On aggregate our results show performance is consistently strong across all sites and tasks (macro F1: 0.841-0.860, weighted F1: 0.892-0.977). With many of the tasks the annotated datasets are highly unbalanced. For example, in the 'Presence' task at KCH, disorders are often only mentioned in the EHR if they are affirmed (e.g. "...pmhx: TIA..."), and only rarely are hypothetical (e.g. "...patient had possible TIA...") or negated (e.g. "...no sign of TIA…") terms encountered. This explains the differences in performance when reporting macro vs weighted average F1 scores. We would expect generalization performance to lie between these reported metrics.
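The effect of such class imbalance on the two reported averages can be illustrated with a toy example. The labels below are invented for illustration only, loosely mimicking a 'Presence'-style task dominated by affirmed mentions; they are not real annotation data.

```python
from sklearn.metrics import f1_score

# Toy, deliberately unbalanced labels: most mentions are affirmed,
# only a handful are hypothetical or negated.
y_true = ["affirmed"] * 90 + ["hypothetical"] * 6 + ["negated"] * 4
y_pred = (["affirmed"] * 88 + ["negated"] * 2          # 2 affirmed mentions mislabelled
          + ["affirmed"] * 4 + ["hypothetical"] * 2    # the rare hypothetical class is handled poorly
          + ["negated"] * 4)                           # negated mentions all recovered

print("macro F1   :", round(f1_score(y_true, y_pred, average="macro"), 3))
print("weighted F1:", round(f1_score(y_true, y_pred, average="weighted"), 3))
# The weighted average stays high because the dominant class is easy to classify;
# the macro average is pulled down by the rare classes, mirroring the gap reported above.
```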
MedCAT is able to employ a self-supervised training method because the initial pass of the algorithm uses a given unique name to learn and improve an initial concept embedding. However, if the input vocabulary linked to the concepts inadequately specifies possible names, or the given names of a concept rarely appear in the text, then improvements can only occur during standard supervised learning. Large biomedical concept databases (e.g. UMLS), however, have a well specified vocabulary offering many synonyms, acronyms and differing forms of a given concept. A limitation of our concept embedding approach is that if different concepts appear in similar contexts, disambiguation and linking to the correct concept can be difficult. For example, 'OD' can link to 'overdose' or 'once daily', both referring to medications with very different implications. We have, however, rarely seen this problem during real-world usage. Our approach can also struggle if concepts appear in many varying contexts that are rarely seen or annotated for. With each new context updating the underlying concept embedding, this may decrease the performance of the embedding. Supervised learning requires training data to be consistently labelled. This is a problem in the clinical domain, which consists of specialised language that can be open to interpretation. We recommend using detailed annotation guidelines that enumerate ambiguous scenarios for annotators. MedCAT uses a vocabulary-based approach to detect entity candidates. Future work could investigate expanding this approach with a supervised learning model such as BERT 31. The supervised learning model would then be used for the detection of entity candidates that have enough training data, and to overcome the challenge of detecting new, unseen forms of concept names. The vocabulary-based approach would cover cases with insufficient annotated training data or concepts that have few different names (forms). The linking process for both approaches would remain the same self-supervised procedure. Our self-supervised training over the ~20-year KCH EHR, as described in Section 5.3, took over two weeks to complete. Future work could improve the training speed by parallelizing this process, since concepts in a CDB are mostly independent of one another. Further work could address effective model sharing, allowing subsequent users/sites to benefit from prior work, where only model validation and fine-tuning is required instead of training from scratch. Finally, ongoing work aims to extend the MedCAT library to address relation identification and extraction; for example, linking the extracted drug dosage/frequency with the associated drug concept, or identifying relations between administered procedures and subsequent clinical events.

This paper presents MedCAT, a multi-domain clinical natural language processing toolkit within a wider ecosystem of open-source technologies, namely CogStack. The biomedical community is unique in that considerable efforts have produced comprehensive concept databases such as UMLS and SNOMED-CT, amongst many others. MedCAT flexibly leverages these efforts in the extraction of relevant data from a corpus of biomedical documents (e.g. EHRs). Each concept can have one or more equivalent names, such as abbreviations or synonyms. Many of these names are ambiguous between concepts. The MedCAT library is based upon a simple idea: at least one of the names for each concept is unique, and given a large enough corpus that name will be used in a number of contexts.
As the context is learned from the unique name, when an ambiguous name is later detected its context is compared to the learnt context, allowing us to find the correct concept to link. By comparing the context similarity we can also calculate confidence scores for a provided linked concept. MedCAT presents a set of decoupled technologies for developing IE pipelines for varied health informatics use cases. Figure 4 shows a typical MedCAT workflow within a wider typical CogStack deployment. This section presents the MedCAT platform technologies and its method for learning to extract and contextualise biomedical concepts through self-supervised and supervised learning. Integrations with the broader CogStack ecosystem are presented in Supplementary Note: Wider CogStack Ecosystem Integration. Finally, we present our experimental methodology for assessing MedCAT in real clinical scenarios.

We now outline the technical details of the NER+L algorithm, the self-supervised and supervised training procedures and the methods for flexibly contextualising linked entities. MedCAT NER+L relies on two core components:
• Vocabulary (VCB): the list of all possible words that can appear in the documents to be annotated. It is primarily used for the spell-checking features of the algorithm. We have compiled our own VCB by scraping Wikipedia and enriching it with words from UMLS. Only the Wikipedia VCB is made public, but the full VCB can be built with scripts provided in the MedCAT repository ( http://shorturl.at/enQ17 ). The scripts require access to the UMLS Metathesaurus ( https://www.nlm.nih.gov/research/umls ).
• Concept Database (CDB): a table representing a biomedical concept dictionary (e.g. UMLS, SNOMED-CT). Each new concept added to the CDB is represented by an ID and a Name. A concept ID can be referred to using multiple names, such as heart failure, myocardial failure, weak heart and cardiac failure.
With a prepared CDB and VCB, we perform a first-pass NER+L pipeline and then run a trainable disambiguation algorithm. The initial NER+L pipeline starts with cleaning and spell-checking the input text. We employ a fast and lightweight spell checker ( http://www.norvig.com/spell-correct.html ) that uses word frequency and the edit distance between misspelled and correct words to fix mistakes. We use the following rules:
• A word is spell-checked against the VCB, but corrected only against the CDB.
• The spelling is never corrected in the case of abbreviations.
• An increase in word length corresponds to an increase in the character correction allowance.
Next, the document is tokenized and lemmatized to ensure broader coverage of all the different forms of a concept name. We used SciSpaCy 32, a tool tuned for these tasks in the biomedical domain. Finally, to detect entity candidates we use a dictionary-based approach with a moving expanding window:
1. Given a document d_1.
2. Set window_length = 1 and word_position = 0.
3. There are three possible cases:
a. The text in the current window is a concept in our CDB (the concept dictionary): mark it and go to 4.
b. The text is a substring of a longer concept name: go to 4.
c. Otherwise, reset window_length to 1, increase word_position by 1 and repeat step 3.
4. Expand the window size by 1 and repeat step 3.
Steps 3 and 4 help us solve the problem of overlapping entities shown in Figure 1. For concept recognition and disambiguation we use context similarity. Initially, we find and annotate mentions of concepts that are unambiguous (e.g. step 3a of the expanding window algorithm above), and then we learn the context of the marked text spans.
For new documents, when a concept candidate is detected and is ambiguous, its context is compared to the currently learnt one; if the similarity is above a threshold, the candidate is annotated and linked. The similarity between the context embeddings also serves as a confidence score for the annotation and can later be used for filtering and further analysis (a simplified illustrative sketch of this training and disambiguation loop is given below). The self-supervised training procedure is defined as follows:
1. Given a corpus of biomedical documents and a CDB.
2. For each concept in the CDB, ignore all names that are not unique (i.e. are ambiguous) or that are known abbreviations.
3. Iterate over the documents and annotate all of the concepts using the approach described earlier. The filtering applied in the previous steps guarantees the entity can be annotated.
4. For each annotated entity, calculate the context embedding V_cntx.
5. Update the concept embedding V_concept with the context embedding V_cntx.
The self-supervised training relies upon one of the names assigned to each concept being unique in the CDB. The unique name is a reference point for training to learn the concept context, so that when an ambiguous name appears (a name that is used for more than one concept in the CDB) it can be disambiguated. For example, the UMLS concept ID C0024117 has the unique name Chronic Obstructive Airway Disease. This name is unique in UMLS. If we find a text span with this name we can use the surrounding text of this span for training, because it uniquely links to C0024117. ~95% of the concepts in UMLS have at least one unique name. The context of a concept is represented by vector embeddings. Given a document d_1 where C_x is a detected concept candidate (Equation 1), we calculate the context embedding: a vector representation of the context for that concept candidate (Equation 2). This includes a pre-set number (s) of words to the left and right of the concept candidate words. Importantly, the concept candidate words are also included in the context embedding calculation, as the model is assisted by knowing what words the surrounding context words relate to. The terms are:
• d_1 - an example document
• w_1..n - words/tokens in the document
• C_x - detected concept candidate that matches the words w_k and w_k+1
• V_cntx - calculated context embedding
• V_w_k - word embedding of w_k
• s - number of words from the left and right that are included in the context of a detected concept candidate; typically in MedCAT s is set to 9 for the long context and 2 for the short context.
To calculate context embeddings we use the word embedding method Word2Vec 33. Contextualised embedding approaches such as BERT 31 were also tested, alongside fastText 34 embeddings. Once a correct annotation is found (a word uniquely links to a CDB name), a context embedding V_cntx is calculated and the corresponding V_concept is updated using an update formula with the following terms:
• C_concept - number of times this concept appeared during training
• sim - similarity between V_concept and V_cntx
• lr - learning rate
To prevent the context embedding for each concept being dominated by the most frequent words, we used negative sampling as explained in 33. Whenever we update V_concept with V_cntx we also generate a negative context by randomly choosing K words from the vocabulary consisting of all words in our dataset. Here K is equal to 2s, i.e. twice the window size for the context (s is the context size on one side of the detected concept, meaning that in the positive cycle we have s words from the left and s words from the right).
The probability of choosing each word and the update function for the vector embeddings are defined in terms of:
• n - size of the vocabulary
• P(w_i) - probability of choosing the word w_i
• K - number of randomly chosen words for the negative context
• V_ncntx - negative context embedding
The supervised training process is similar to the self-supervised process, but given the correct concept for the extracted term we update V_concept using the calculated V_cntx as defined above. This no longer relies upon the self-supervised constraint that at least one name in the set of possible names for a concept is unique, as the correct term is provided by human annotators. Once a span of text is recognised and linked to a concept, further contextualisation or meta-annotation is often required. For example, a simple task of identifying all patients with a fever can entail classifying whether the located fever text spans are current mentions (e.g. the patient reports a fever vs the patient reported a fever but ...), are positive mentions (e.g. patient has a high fever vs patient has no sign of fever), are actual mentions (e.g. patient is feverish vs monitoring needed if fever reappears), or are experienced by the patient (e.g. pts family all had high fevers). We treat each of these contextualization tasks as distinct binary or multiclass classification tasks. The MedCAT library provides a 'MetaCAT' component that wraps a Bidirectional Long Short-Term Memory (Bi-LSTM) model trainable directly from MedCATtrainer project exports. Bi-LSTM models have consistently demonstrated strong performance in biomedical text classification tasks [37][38][39], and our own recent work 40 demonstrated that a Bi-LSTM-based model outperforms all other assessed approaches, including Transformer models. MetaCAT models replace the specific concept of interest, for example 'diabetes mellitus', with a generic parent term '[concept]'. The forward/backward pass of the model then learns a concept-agnostic context representation, allowing MetaCAT models to be used across concepts, as demonstrated in Section 2.3.2. The MetaCAT API follows standard neural network training methods, but these are abstracted away from end users whilst still maintaining enough visibility for users to understand when MetaCAT models have been trained effectively. Each training epoch displays training and test set loss and metrics such as precision, recall and F1. An open-source tutorial showcasing the MetaCAT features is available as part of the series of wider MedCAT tutorials ( https://colab.research.google.com/drive/1zzV3XzFJ9ihhCJ680DaQV2QZ5XnHa06X ). Once trained, MetaCAT models can be exported and reused outside of the initial classification tasks, similarly to the MedCAT NER+L models. MedCATtrainer allows domain experts to inspect, modify and improve a configured MedCAT NER+L model. The tool either actively trains the underlying model (facilitating live model improvements as feedback is provided by human users) or simply collects and validates concepts extracted by a static MedCAT model. Version 0.1 4 presented a proof-of-concept annotation tool that has since been rewritten and tightly integrated with the MedCAT library, whilst providing a wealth of new features supporting clinical informatics workflows.
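Before describing the annotation workflow further, the following is a deliberately simplified sketch of the self-supervised context learning and disambiguation loop described above. It uses a plain averaged context window in place of Equation 2 and a simple count-based running average in place of the concept update formula; negative sampling and the long/short context combination are omitted, and this is not the MedCAT library implementation.

```python
import numpy as np

EMB_DIM, S, SIM_THRESHOLD = 300, 9, 0.3   # context size s and the 0.3 linking threshold from the text

word_vectors = {}      # word -> Word2Vec-style embedding (random placeholders here)
concept_vectors = {}   # concept id -> learnt context embedding V_concept
concept_counts = {}    # concept id -> number of training updates (C_concept)
rng = np.random.default_rng(0)

def wv(word):
    return word_vectors.setdefault(word, rng.random(EMB_DIM))

def context_embedding(tokens, start, end, s=S):
    """Average the embeddings of the candidate words plus s words either side (cf. Equation 2)."""
    window = tokens[max(0, start - s): min(len(tokens), end + s)]
    return np.mean([wv(w) for w in window], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def self_supervised_update(concept_id, v_cntx):
    """Count-based running average of V_concept -- an illustrative stand-in for the update formula."""
    count = concept_counts.get(concept_id, 0)
    lr = 1.0 / (count + 1)                                  # learning rate decays with C_concept
    if concept_id not in concept_vectors:
        concept_vectors[concept_id] = v_cntx.copy()
    else:
        concept_vectors[concept_id] = (1 - lr) * concept_vectors[concept_id] + lr * v_cntx
    concept_counts[concept_id] = count + 1

def link_ambiguous(tokens, start, end, candidate_ids):
    """Pick the candidate whose learnt context embedding best matches the current context."""
    v_cntx = context_embedding(tokens, start, end)
    scored = [(cosine(v_cntx, concept_vectors[c]), c) for c in candidate_ids if c in concept_vectors]
    if not scored:
        return None
    best_sim, best_id = max(scored)
    return best_id if best_sim > SIM_THRESHOLD else None

# Toy usage with placeholder concept ids: learn from an unambiguous mention, then disambiguate.
doc = "the patients heart rate was elevated on admission".split()
self_supervised_update("HEART_RATE", context_embedding(doc, 2, 4))        # span 'heart rate'
print(link_ambiguous("hr was elevated overnight".split(), 0, 1, ["HEART_RATE", "HOUR"]))
```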
We also provide extensive documentation ( https://github.com/CogStack/MedCATtrainer/blob/master/README.md ) and pre-built containers ( https://hub.docker.com/r/cogstacksystems/medcat-trainer ) updated with each new release, facilitating easy setup by informatics teams. MedCATtrainer contains two interfaces. Firstly, the primary user interface allows annotators to log in, select their permissioned projects, and view documents and the concepts identified and linked by the underlying configured MedCAT model. Feedback can be recorded per concept with a 'correct' / 'incorrect' label provided by the user. Configured meta-annotation values can be selected for concepts that are marked as 'correct'. Text spans that have not been identified by the configured MedCAT model can be highlighted and linked to a concept directly within the interface. Secondly, project administrators use a separate interface to create and configure projects, MedCAT models and annotators, and to manage the outputs of projects. Concept vocabularies such as UMLS and SNOMED-CT can contain millions of concepts, although users often only require a limited subset. MedCATtrainer projects can be configured to filter extracted concepts to train and validate the underlying models only for the concepts of interest. An annotation exercise serves two purposes: to validate a provided MedCAT model's outputs, especially as clinical data can vary between sites, specialisms and patient cohorts, and to provide training data for model improvement via fine-tuning. MedCATtrainer project outputs are the feedback (e.g. correct/incorrect) provided by human annotators, the extra annotations that were not initially extracted and linked by the MedCAT model, and any meta-annotations selected. Annotations can be downloaded from MedCATtrainer in a simple JSON schema, to be shared between MedCAT models, offering improvements or specialisms in subsets of concepts.

MedCAT concept recognition and linking was validated on the following publicly available datasets:
1) MedMentions 41 - consists of 4,392 titles and abstracts randomly selected from papers released on PubMed in 2016 in the biomedical field, published in the English language, and with both a title and abstract. The text was manually annotated for UMLS concepts, resulting in 352,496 mentions. We calculate that ~40% of concepts in MedMentions require disambiguation, meaning a detected span of text can be linked to multiple UMLS concepts if only the span of text is considered.
2) ShARe/CLEF 2014 Task 2 42 - we used the development set containing 300 documents of 4 types: discharge summaries, radiology, electrocardiograms and echocardiograms. We used the UMLS annotations and ignored the attribute annotations.
3) MIMIC-III 36 - consists of ~58,000 de-identified EHRs from critical care patients collected between 2001 and 2012. MIMIC-III includes demographic, vital sign and laboratory test data alongside unstructured free-text notes.
We attempted to use the SemEval 2019 shared task for the evaluation of the NER+L task ( https://competitions.codalab.org/competitions/19350 ), but dataset access is currently under review for all requests to i2b2. Our further experiments used real-world EHR data from the UK NHS Trusts introduced earlier: King's College Hospital (KCH), University College London Hospitals (UCLH) and South London and Maudsley (SLaM). An annotation by MedCAT is considered correct only if the exact text value was found and the annotation was linked to the correct concept in the CDB. We contrast our performance with the performance of the tools presented in Section 1.1.2.
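The strict matching criterion used here (exact text span and correct linked concept) can be expressed as a small scoring routine. The annotation structure below is a simplified illustration for this sketch, not the output format of MedCAT or of the compared tools.

```python
from typing import NamedTuple, Set

class Annotation(NamedTuple):
    start: int   # character offset where the span begins
    end: int     # character offset where the span ends
    cui: str     # linked concept identifier

def strict_scores(predicted: Set[Annotation], gold: Set[Annotation]):
    """A prediction counts as correct only if the span and the linked concept both match exactly."""
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {Annotation(0, 12, "C0020538"), Annotation(20, 27, "C0011849")}
pred = {Annotation(0, 12, "C0020538"), Annotation(20, 27, "C0027051")}  # wrong concept on the second span
print(strict_scores(pred, gold))  # -> (0.5, 0.5, 0.5)
```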
All compared tools were used with their default models and parameters or existing container images. We do not perform any fine-tuning or post-processing. Supplementary Note: MedCAT Core Library Training Configuration provides the self-supervised training configuration details. For our clinical use cases we extracted SNOMED-CT terms, the official terminology across primary and secondary care for the UK National Health Service, as this was preferred by our clinical teams over UMLS. Site-specific models (M3, M5, M7) are loaded into deployed instances of MedCATtrainer and configured with annotation projects to collect SNOMED-CT annotations for a range of site-specific disorders, findings, symptoms, procedures and medications that our clinical teams are interested in for further research (e.g. already published work on Covid-19 5,6). These included chronic (e.g. diabetes mellitus, ischemic heart disease, heart failure) and acute (e.g. cerebrovascular accident, transient ischemic attack) disorders. For comparison between sites we find 14 common extracted concept groups (Supplementary Table 1), calculate the F1 score for each concept group and report the average, standard deviation (SD) and interquartile range (IQR). Models M1-M8 record the provenance of the NER+L clinical use case results between datasets and sites, showing the MedCAT model instances, the data and method of training and the base model used across all sites. We shared fine-tuned MedCAT models between KCH and two NHS partner Trusts, UCLH and SLaM. This was a collaborative effort, with each hospital team only having access to their respective hospital EHR / CogStack instance. Each site collected annotated data using MedCATtrainer and tested the original base model, a model trained with self-supervision only, and a final model trained with supervision on the MedCATtrainer-collected annotations. From ongoing and published work 5,6 we configured and collected meta-annotation training examples and trained contextualisation models as shown in Supplementary Tables 2-4. We firstly include all annotations collected at each site across all SNOMED-CT terms of interest as defined by clinicians. We do not compare these results across sites as they include different sets of extracted concepts. Our further experiments test the effectiveness of our meta-annotation modelling approach to flexibly learn contextual cues by assessing cross-disorder and cross-site transferability. To assess cross-disorder transferability, for each of the 11 disorder groups (Supplementary Table 1) we hold the group out for testing and train a meta-annotation model on the annotations of the remaining groups. We run this procedure 11 times so that each disorder group is tested once. We average all scores of each fold and report the results. To demonstrate cross-site transferability we derive an equivalent meta-annotation dataset from the 'Presence' (KCH) and 'Status' (SLaM) datasets, as they are semantically equivalent despite having different possible annotation values. We merge 'Presence' annotations from Affirmed/Hypothetical/False to Affirmed/Other to match the classes available in SLaM. We then train and test new meta-annotation models between sites and datasets, and report average results.

KCH: specific work on natural language processing for clinical coding was reviewed with expert patient input on the KERRI committee, with Caldicott Guardian oversight. Direct access to patient-level data is not possible due to the risk of re-identification, but aggregated de-identified data may be available subject to legal permissions.
UCLH: UCLH is deploying CogStack within its records management infrastructure and is growing its capacity to annotate its clinical records as part of wider work for routine curation. The work at UCLH described here is a service evaluation that represents MedCAT's annotation of the records. Access to the medical records will not be possible given their confidential nature.

SLaM: This project was approved by the CRIS Oversight Committee, which is responsible for ensuring all research applications comply with ethical and legal guidelines. The CRIS system enables access to anonymised electronic patient records from SLaM for secondary analysis and has full ethical approvals. CRIS was developed with extensive involvement from service users and adheres to strict governance frameworks managed by service users. It has passed a robust ethics approval process acutely attentive to the use of patient data. Specifically, this system was approved as a dataset for secondary data analysis on this basis by Oxfordshire Research Ethics Committee C (08/H06060/71). The data is de-identified and used in a data-secure format, and all patients have the choice to opt out of their anonymized data being used. Approval for data access can only be provided by the CRIS Oversight Committee at SLaM.

Supplementary Figure 1. Example NER+L problems found in a typical biomedical sentence: two examples of biomedical text used to showcase disambiguation, spelling and resistance to form variability. E1 requires disambiguation and should be detected as Heart Rate; E2 is misspelled and should be detected as Patient; E3 again requires disambiguation (Hour); and finally E4 is another form of the concept Kidney Failure.

References
CogStack - experiences of deploying integrated information retrieval and extraction services in a large National Health Service Foundation Trust hospital
SNOMED clinical terms: overview of the development process and project status
The Unified Medical Language System (UMLS): integrating biomedical terminology
A Biomedical Free Text Annotation Interface with Active Learning and Research Use Case Specific Customisation
ACE-inhibitors and Angiotensin-2 Receptor Blockers are not associated with severe SARS-COVID19 infection in a multi-site UK acute Hospital Trust
Evaluation and Improvement of the National Early Warning Score (NEWS2) for COVID-19: a multi-hospital study
Advances in Neural Information Processing Systems
Long short-term memory
Universal Language Model Fine-tuning for Text Classification
The open biomedical annotator
Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records
Deep representation learning of electronic health records to unlock patient stratification at scale
An overview of MetaMap: historical perspective and recent advances
A Named Entity Linking System for Biomedical Text
SemEHR: A general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research
Named Entity Recognition for Electronic Health Records: A Comparison of Rule-based and Machine Learning Approaches
Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications
UIMA: an architectural approach to unstructured information processing in the corporate research environment
Opennlp: A java-based nlp toolkit
BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications
Consumer Health Concepts That Do Not Map to the UMLS: Where Do They Fit?
Standard Lexicons, Coding Systems and Ontologies for Interoperability and Semantic Computation in Imaging
Data protection and information governance
Fostering Multilinguality in the UMLS: A Computational Approach to Terminology Expansion for Multiple Languages
Publicly Available Clinical BERT Embeddings
A case-control and cohort study to determine the relationship between ethnic background and severe COVID-19. Public and Global Health
Extracting UMLS Concepts from Medical Text Using General and Domain-Specific Deep Learning Models
Unsupervised word embeddings capture latent knowledge from materials science literature
Term identification in the biomedical literature
Pre-training of Deep Bidirectional Transformers for Language Understanding
Fast and Robust Models for Biomedical Natural Language Processing
Advances in Neural Information Processing Systems
Enriching Word Vectors with Subword Information
Glove: Global vectors for word representation
An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition
Cross-type biomedical named entity recognition with deep multi-task learning
Leveraging Biomedical Resources in Bi-LSTM for Drug-Drug Interaction Extraction
Comparative Analysis of Text Classification Approaches in Electronic Health Records
MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts
Task 2: ShARe/CLEF eHealth evaluation lab

Supplementary Table 1. Concept groups used for comparison between sites. Each group is manually defined with clinical guidance, i.e. Diabetes mellitus (disorder) includes both the Diabetes mellitus type 1 (disorder) and Diabetes mellitus type 2 (disorder) concepts.
• Diabetes mellitus: S-44054006 Diabetes mellitus type 2 (disorder); S-46635009 Diabetes mellitus type 1 (disorder); S-422088007 Disorder of nervous system co-occurrent and due to diabetes mellitus (disorder); S-25093002 Disorder of eye co-occurrent and due to diabetes mellitus (disorder); S-73211009 Diabetes mellitus (disorder)
• Heart failure: S-84114007 Heart failure (disorder); S-128404006 Right heart failure (disorder); S-48447003 Chronic heart failure (disorder); S-56675007 Acute heart failure (disorder); S-85232009 Left heart failure (disorder); S-42343007 Congestive heart failure (disorder)
• Ischemic heart disease: S-414545008 Ischemic heart disease (disorder); S-413439005 Acute ischemic heart disease (disorder); S-413838009 Chronic ischemic heart disease (disorder); S-194828000 Angina (disorder); S-22298006 Myocardial infarction (disorder)
• Hypertensive disorder, systemic arterial: S-38341003 Hypertensive disorder, systemic arterial (disorder); S-31992008 Secondary hypertension (disorder); S-48146000 Diastolic hypertension (disorder); S-56218007 Systolic hypertension (disorder); S-59621000 Essential hypertension (disorder)
• Chronic obstructive lung disease: S-13645005 Chronic obstructive lung disease (disorder); S-195951007 Acute exacerbation of chronic obstructive airways disease (disorder); S-87433001 Pulmonary emphysema (disorder)
• Asthma: S-195967001 Asthma (disorder)
• Chronic kidney disease: S-709044004 Chronic kidney disease (disorder); S-723190009 Chronic renal insufficiency (disorder)
• Cerebrovascular accident: S-230690007 Cerebrovascular accident (disorder); S-25133001 Completed stroke (disorder); S-371040005 Thrombotic stroke (disorder); S-371041009 Embolic stroke (disorder); S-413102000 Infarction of basal ganglia (disorder); S-422504002 Ischemic stroke (disorder); S-723082006 Silent cerebral infarct (disorder); S-1078001000000105 Haemorrhagic stroke (disorder)
• Transient ischemic attack: S-266257000 Transient ischemic attack (disorder)
• Epilepsy: S-84757009 Epilepsy (disorder); S-352818000 Tonic-clonic epilepsy (disorder); S-19598007 Generalized epilepsy (disorder); S-230456007 Status epilepticus (disorder); S-509341000000107 Petit-mal epilepsy (disorder)
• Atrial fibrillation: S-49436004 Atrial fibrillation (disorder)
• Dyspnea: S-267036007 Dyspnea (finding)
• Pulmonary embolism: S-59282003 Pulmonary embolism (disorder)
• Chest pain: S-29857009 Chest pain (finding)

We would like to thank all the clinicians who provided annotation training for MedCAT; this includes Rosita Zakeri.

We provide details to build both UMLS and SNOMED-CT concept databases. In both cases, once the CSV files are obtained we can use the scripts available in the MedCAT repository to build a CDB ( https://github.com/CogStack/MedCAT/blob/master/medcat/prepare_cdb.py ). The UMLS can be downloaded from https://www.nlm.nih.gov/research/umls/index.html ; once done, it is available in the Rich Release Format (RRF). To make subsetting and filtering easier we import the UMLS RRF into a PostgreSQL database (scripts available at https://github.com/w-is-h/umls ). Once the data is in the database we can use the following SQL script to download the CSV files containing all concepts that will form our CDB:
# Selecting concepts for all the Ontologies that are used
SELECT DISTINCT umls.mrconso.cui, str, mrconso.sab, mrconso.tty, tui, sty, def
FROM umls.mrconso
We use the SNOMED-CT data provided by the NHS TRUD service ( https://isd.digital.nhs.uk/trud3/user/guest/group/0/pack/26 ). This release combines the International and UK-specific concepts into a set of assets that can be parsed and loaded into a MedCAT CDB. We provide scripts for parsing the various release files and loading them into a MedCAT CDB instance. We provide further scripts to load the accompanying SNOMED-CT Drug extension and clinical coding data (ICD / OPCS terminologies), also from the NHS TRUD service. Scripts are available at: https://github.com/tomolopolis/SNOMED-CT_Analysis

Supplementary Note: MedCAT Core Library Training Configuration
MedCAT was configured for self-supervised training across the experiments presented in Section 2.1 as follows:
• Misspelled words were fixed only when 1 change away from the correct word for words under 6 characters, and 2 changes away for words above 6 characters.
• For each concept we calculate long and short embeddings and take the average of both. The long embedding takes into account s = 9 words from the left and right (as shown in Equation 2). The short embedding takes into account s = 2 words from the left and right. The exact numbers for s were calculated by testing the performance of all possible combinations for s in the range [0, 10].
• The context similarity threshold used for recognition is 0.3 unless otherwise specified. This means that for a given concept candidate, or sequence of words, to be recognised and linked to the given concept, the context similarity must be greater than 0.3.
We train MedCAT self-supervised over MIMIC-III using the entirety of UMLS: 3.82 million concepts from 207 separate vocabularies.
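The spelling-correction policy described in the configuration above (a correction allowance of one edit for words under 6 characters and two edits for longer words, never correcting abbreviations, and checking words against the VCB while correcting against the CDB) can be sketched as follows. This is an illustrative approximation only: the word-frequency weighting used by the Norvig-style checker is omitted, and treating all-caps tokens as abbreviations is an assumption of this sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def correct_token(token: str, vcb: set, cdb_words: set, abbreviations: set) -> str:
    """Apply the correction policy described above; word-frequency tie-breaking is omitted."""
    if token in vcb or token.isupper() or token in abbreviations:
        return token                              # known word or abbreviation: never corrected
    allowance = 1 if len(token) < 6 else 2        # longer words get a larger correction allowance
    candidates = [w for w in cdb_words if abs(len(w) - len(token)) <= allowance]
    scored = [(edit_distance(token, w), w) for w in candidates]
    scored = [(d, w) for d, w in scored if d <= allowance]
    return min(scored)[1] if scored else token    # corrected only against CDB names

vcb = {"patient", "kidney", "failure", "heart"}
cdb_words = {"kidney", "failure", "heart"}
print(correct_token("kidnye", vcb, cdb_words, abbreviations={"HR", "OD"}))  # -> 'kidney'
```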
For this MIMIC-III run we use ~2.4M clinical notes (nursing notes, notes by clinicians, discharge reports, etc.) on a small one-core server, taking approximately 30 hours to complete.

The plots below show multiple projects that initially use the KCH self-supervised trained SNOMED-CT model to annotate a range of concepts. We observe how the total number of unique concept forms evolves throughout the annotation projects, and how the ratio of correct (blue area) to incorrect (orange area) annotations changes as our clinicians annotate documents. We have no control over how many forms appear within a single document, so we plot the ratio of correct / incorrect annotations per document relative to the number of unique forms that have been seen during the annotation session (marked by the black line in each curve). Visually, this is the area under the cumulative unique forms line in each plot taken by either the correct or incorrect ratios. We observe that in the Covid_COPD project we annotated 5 distinct concepts and 12 distinct textual forms of those concepts. This converges to 100% correctness after ~60 documents, even with further forms added on two separate occasions. However, we also observe larger annotation projects, such as Covid_CTPA_Reports, that saw over 400 unique forms of 194 concepts, where the model was still converging to an optimal solution. We observe the ratio of correct to incorrect annotations dip (orange area vs blue area under the black unique forms line) where MedCATtrainer is presented with a high volume of new unique concept forms. This performance drop is quickly rectified (the blue area increases) by subsequent training. Model performance should be understood by examining the progressive increase in the ratio of correct/incorrect annotations, rather than the absolute number of incorrect annotations. So, for example, in Covid_CTPA_Reports the incorrect ratio area (orange) looks largely flat; however, performance is slowly improving, as correct annotation ratios per document are improving (the blue area is larger than the orange). Top left to bottom right: MedCATtrainer annotation projects with the respective numbers of unique concepts seen throughout annotation and the number of configured concepts: Covid_COPD (…).

We perform error analysis to find the causes of the performance drops shown (the sharp rises in the incorrect orange area vs the blue correct area under the unique forms line) at various times. For example, in the Covid_Gastro project our clinical annotators marked "ileal Crohn's" incorrect for Crohn's disease, despite it being a sub-type of the more general Crohn's disease concept, and marked "Previous medical history: UC" as incorrect for ulcerative colitis. Both of these could arguably be marked correct. In our Diabetes_Covid project we see annotators mark examples such as "episode in diabetic clinic" and "referred to medics by diabetic reg", where the condition is being used as an adjective, so it was likely marked incorrect as it is not directly describing a condition experienced by the patient. These are confusing for the MedCAT model, and should actually be marked as correct and left to a meta-annotation to determine patient experience.

An example CogStack ecosystem deployment: an example CogStack deployment using the MedCATservice to integrate trained MedCAT models, allowing for continuous extraction and indexing of concepts as records flow from the source DB.

The CogStack ecosystem comprises multiple components, selection of which depends on a particular deployment.
It is usually deployed as a Platform-as-a-Service where each service runs in a container (see the figure above). The ecosystem can be split into three key cross-functional areas: (i) information extraction from underlying EHR systems, (ii) exploratory data analysis with data monitoring, and (iii) natural language processing.

(i) EHR free-text data is often stored in different database systems and in many binary file formats (such as MS Word documents, PDFs or scanned images). CogStack relies upon open-source technologies to extract text from any such documents, such as Apache Tika ( https://tika.apache.org/ ), with data pipelines handled in Apache NiFi ( https://nifi.apache.org/ ) before indexing into the Elasticsearch datastore.

(ii) Upon indexing, a user can perform exploratory data analysis in the Kibana ( https://www.elastic.co/kibana ) user interface, including running queries over unstructured free-text notes and structured tabular data. This step is especially useful to narrow down datasets to a set of patients, time periods or set of notes to be used during the development and application of IE pipelines such as those made possible using MedCAT. Moreover, use-case specific dashboards can be created in Kibana, with monitoring and alerting enabled for designated user groups.

(iii) Finally, having identified a corpus of notes, the user can load these into MedCATtrainer to tailor MedCAT models according to specific research questions. After running supervised training, a model can be deployed into the information extraction data pipeline (as in (i)). This can be done using the MedCATservice, which exposes the model functionality behind a RESTful API. Alternatively, the model can be saved as a file and used in custom user applications.
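As an illustration of how step (iii) can feed back into the pipeline from step (i), the sketch below posts a document to an NLP annotation service and indexes the response into Elasticsearch. The endpoint path, payload shape and index name are placeholders for this sketch and do not represent the actual MedCATservice API contract or any particular deployment.

```python
import requests
from elasticsearch import Elasticsearch

# Placeholder endpoint and index names; a real deployment would use the URLs and
# schemas configured for its own MedCATservice and Elasticsearch instances.
NLP_URL = "http://localhost:5000/api/process"
es = Elasticsearch("http://localhost:9200")

def annotate_and_index(doc_id: str, text: str) -> None:
    """Send a document to the annotation service and store the result next to the source text."""
    response = requests.post(NLP_URL, json={"text": text}, timeout=30)
    response.raise_for_status()
    annotations = response.json()
    es.index(index="medcat_annotations", id=doc_id,
             document={"text": text, "annotations": annotations})

# Requires the two services above to be running:
# annotate_and_index("note-0001", "Patient has a history of heart failure and CKD.")
```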