title: Trove: Ontology-driven weak supervision for medical entity classification
authors: Fries, Jason A.; Steinberg, Ethan; Khattar, Saelig; Fleming, Scott L.; Posada, Jose; Callahan, Alison; Shah, Nigam H.
date: 2020-08-05

Motivation: Recognizing named entities (NER) and their associated attributes like negation are core tasks in natural language processing. However, manually labeling data for entity tasks is time consuming and expensive, creating barriers to using machine learning in new medical applications. Weakly supervised learning, which automatically builds imperfect training sets from low-cost, less accurate labeling rules, offers a potential solution. Medical ontologies are compelling sources for generating labels; however, combining multiple ontologies without ground truth data creates challenges due to label noise introduced by conflicting entity definitions. Key questions remain about the extent to which weakly supervised entity classification can be automated using ontologies, and how much additional task-specific rule engineering is required for state-of-the-art performance. Also unclear is how pre-trained language models, such as BioBERT, improve the ability to generalize from imperfectly labeled data.

Results: We present Trove, a framework for weakly supervised entity classification using medical ontologies. We report state-of-the-art, weakly supervised performance on two NER benchmark datasets and establish new baselines for two entity classification tasks in clinical text. We perform within an average of 3.5 F1 points (4.2%) of NER classifiers trained with hand-labeled data. Automatically learning label source accuracies to correct for label noise provided an average improvement of 3.9 F1 points. BioBERT provided an average improvement of 0.9 F1 points. We measure the impact of combining large numbers of ontologies and present a case study on rapidly building classifiers for COVID-19 clinical tasks. Our framework demonstrates how a wide range of medical entity classifiers can be quickly constructed using weak supervision and without requiring manually labeled training data.

Analyzing text to identify concepts such as disease names and their associated attributes like negation is a foundational task in medical natural language processing (NLP). Traditionally, training classifiers for named entity recognition (NER) and cue-based entity classification has relied on hand-labeled training data. However, annotating medical corpora requires considerable domain expertise and money, creating barriers to using machine learning in critical applications [1, 2]. Moreover, hand-labeled datasets are static artifacts that are expensive to change. The recent COVID-19 pandemic highlights the need for machine learning tools that enable faster, more flexible analysis of clinical and scientific documents in response to rapidly unfolding events [3].

To address the scarcity of hand-labeled training data, machine learning practitioners increasingly turn to lower cost, less accurate label sources to rapidly build classifiers. Instead of requiring hand-labeled training data, weakly supervised learning relies on task-specific rules and other imperfect labeling strategies to programmatically generate training data.
This approach combines the benefits of rule-based systems, which are easily shared, inspected, and modified, with machine learning, which typically improves performance and generalization. Weakly supervised methods have demonstrated success across a range of NLP and other settings [4, 5, 6, 7, 8].

Knowledge bases and ontologies provide a compelling foundation for building weakly supervised entity classifiers. Ontologies codify a vast amount of medical knowledge via taxonomies and example instances for millions of medical concepts. However, repurposing ontologies for weak supervision creates challenges when combining label information from multiple sources without access to ground truth labels. The hundreds of terminologies found in the Unified Medical Language System (UMLS) Metathesaurus [9] and other sources [10] typify the highly redundant, conflicting, and imperfect entity definitions found across medical ontologies. Naively combining such conflicting label assignments can cause substantial performance drops in weakly supervised classification [11]; therefore, a key challenge is correcting for labeling errors made by individual ontologies when combining label information.

In this work, we explore how ontology-driven weak supervision can be used to train medical entity classifiers without hand-labeled training data. Prior research on weakly supervised medical NER has required complex preprocessing to identify possible entity spans [12], generated labels from a single source rather than combining multiple sources [13], or relied on ad hoc rule engineering [14]. High-impact application areas, such as clinical NER using weak supervision, are largely unstudied. Key questions remain about the extent to which we can automate weak supervision using existing medical ontologies and how much additional task-specific rule engineering is required for state-of-the-art performance. It is also unclear whether, and by how much, pre-trained language models such as BioBERT [15] improve the ability to generalize from weakly labeled data and reduce the need for task-specific labeling rules.

We present Trove, a framework for training weakly supervised medical entity classifiers using off-the-shelf ontologies. The overall pipeline is shown in Figure 1. We focus on the challenge of building classifiers without hand-labeled training data by unifying: (1) imperfect labels generated by multiple ontologies and (2) task-specific rules. Our main hypothesis is that ontology-only weak supervision, coupled with recent pre-trained language models such as BioBERT, substantially reduces the engineering cost of creating entity classifiers while matching the performance of prior, more expensive weakly supervised methods. The central intuition of this work is that individual ontologies and task-specific rules each make systematic labeling errors. By observing the rates of agreement and disagreement across labeling rules, and without requiring ground truth labels, we can learn each source's accuracy and correct for label noise to generate "denoised", probabilistic training data [16, 17]. These data are then used to train deep learning models that generalize beyond the concepts found in ontologies alone. We conduct experiments on six benchmark tasks for clinical and scientific text, reporting state-of-the-art weakly supervised performance (i.e., using no hand-labeled training data) on NER datasets for chemical/disease and drug tagging.
We further present new weakly supervised baselines for two tasks in clinical text: disorder tagging and event temporality classification. Our study includes ablation analyses exploring the performance trade-offs of training models with labels generated from easily automated, ontology-based weak supervision vs. more expensive, task-specific rules. Finally, we present a case study deploying Trove for COVID-19 symptom tagging and risk factor monitoring using a near-real-time feed of Stanford Health Care emergency department notes.

Rule-based systems for NER [18] and cue detection [19, 20] are common in clinical text processing, where labeled corpora are difficult to share due to privacy concerns. Generating imperfect training labels from indirect sources (e.g., patient notes) is often used in analyzing medical images [21, 22, 23, 24, 25, 26] and text processing [27]. Recent work has explored learning the accuracies of sources to correct for label noise in rule-based systems for text classification [28, 29, 4, 17]. However, these approaches focus on sentence or document classification via task-specific labeling rules and do not explore NER or automating labeling via multiple ontologies.

Figure 1: Trove pipeline for ontology-driven weak supervision for medical entity classification; dotted boxes/lines indicate optional steps. Users specify: A) a mapping of an ontology's class taxonomy to entity classes; B) a set of label sources (e.g., ontologies, task-specific rules) for weak supervision; and C) a collection of unlabeled document sentences with which to build a training set. Ontologies instantiate labeling function templates which are applied to sentences to generate a label matrix. This matrix is used to train the label model, which learns source accuracies and corrects for label noise to predict a consensus probability per word. Consensus labels are transformed into the probabilistic sequence label dataset which is used as training data for an end model (e.g., BioBERT). Alternatively, the label model can also be used as the final classifier.

Weakly supervised learning is an umbrella term referring to methods for training classifiers using imperfect, indirect, or limited labeled data, and includes techniques such as distant supervision [30, 31], co-training [32], and others [33]. Prior approaches for weakly supervised NER such as co-training use a small set of labeled seed examples [34] which are iteratively expanded through bootstrapping or self-training [35]. Semi-supervised methods also use some amount of labeled training data and incorporate unlabeled data by imposing constraints on properties such as expected label distributions [36]. Distant supervision requires no labeled training data, but typically focuses on a single source for labels [13], rather than unifying labels assigned using heterogeneous sources of unknown quality. Crowdsourcing methods combine labels from multiple human annotators with unknown accuracy [37]. However, compared to human labelers, programmatic label assignment has different correlation and scaling properties which create technical challenges when combining sources. Data programming [16, 11, 17] formalizes theory for combining multiple label sources with different coverage, unknown accuracy, and correlation structure to correct for labeling errors. This approach is used in SwellShark [12], where a generative model is trained using labels from multiple dictionary and rule-based sources.
However, this approach required task-specific preprocessing to identify candidate entities a priori in order to achieve competitive performance. Safranchik et al. [14] presented WISER, a linked hidden Markov model where weak supervision is defined separately over tags and tag transitions, using linking rules derived from language models, n-gram statistics, mined phrases, and custom heuristics to train a BiLSTM-CRF. Our work advances these prior approaches by: (1) eliminating the requirement to identify probable entity spans a priori by combining word-level weak supervision with contextualized word embeddings; (2) using ontology-only supervision; and (3) quantifying the relative contributions to task performance of different sources of label assignment, such as pre-existing ontologies from the UMLS (low cost) and task-specific rule engineering (high cost).

We analyze two categories of medical tasks using six datasets: (1) NER and (2) span classification, where entities are identified a priori and classified for cue-driven attributes such as negation or document relative time, i.e., the order of an event entity relative to the parent document's timestamp. Both categories of tasks are formalized as token classification problems, either tagging all words in a sequence (NER) or just the head words for an entity set (span classification). Documents were preprocessed using a spaCy [38] pipeline optimized for medical tokenization and sentence boundary detection [29].

We used 99 label sources covering a broad range of medical ontologies. We used the 2018AA release of the UMLS Metathesaurus, removing non-English and zoonotic source terminologies as well as sources containing fewer than 500 terms, resulting in 92 sources. Additional sources included the 2019 SPECIALIST abbreviations [43]; Disease Ontology [44]; Chemical Entities of Biological Interest (ChEBI) [45]; the Comparative Toxicogenomics Database (CTD) [46]; the seed vocabulary used in AutoNER [13]; the ADAM abbreviations database [47]; and the word sense abbreviation dictionaries used by the clinical abbreviation system CARD [48]. We applied minimal preprocessing to all source ontologies, filtering out English stopwords [49] and applying a letter case normalization heuristic to preserve abbreviations.

We assume a sequence labeling problem formulation, where we are given a dataset D = (X 1 , ..., X n ) of sequences X i = (x i,1 , ..., x i,t ) consisting of words x from a fixed vocabulary. Each sequence is mapped to a corresponding sequence of latent class variables Y i = (y i,1 , ..., y i,t ), where y ∈ {0, ..., k} for k tag classes. Since Y is not observable, our primary technical challenge is estimating Y from multiple, potentially conflicting label sources of unknown quality to construct a probabilistically labeled dataset D̂. This dataset can then be used for training classification models such as deep neural networks. Such a labeling regimen is typically low-cost, but less accurate than the hand-curated labels used in traditional supervised learning, hence this paradigm is referred to as weakly supervised learning.

Combining labels assigned via term matching using multiple ontologies and task-specific rules is challenging because the different sources have unknown, task-dependent accuracies and can disagree on the correct (unobserved) label, introducing noise into the labeling process. To correct for such label noise, we use data programming [16] to estimate the accuracy of each source and ensemble the sources via a label model which assigns a consensus probabilistic label per word.
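As a minimal illustration of this denoising step (the label model is described formally below), the following sketch uses the Snorkel LabelModel API, consistent with the paper's reported use of Snorkel v0.9.5; the toy word-level label matrix and hyperparameter values are purely illustrative and are not taken from Trove.

```python
import numpy as np
from snorkel.labeling.model import LabelModel

# Toy word-level label matrix for one 5-word sentence and 4 ontology-based
# label sources (rows = words, columns = sources).
# -1 = ABSTAIN, 0 = not an entity, 1 = entity (binary IO tagging).
L = np.array([
    [ 1,  1, -1,  1],   # "diabetes"
    [ 1, -1,  0,  1],   # "type"
    [ 1, -1,  0,  1],   # "2"
    [ 0,  0,  0, -1],   # "was"
    [ 0,  0,  0,  0],   # "controlled"
])

# Learn per-source accuracies from agreement/disagreement patterns (no gold
# labels) and emit a consensus probability for each word.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L, n_epochs=100, lr=0.01, l2=0.01, seed=123)  # illustrative hyperparameters
word_probs = label_model.predict_proba(L)                      # shape: (num_words, 2)
print(word_probs[:, 1])  # P(y = 1) per word, used as probabilistic training labels
```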
To learn the label model, m different label sources are parameterized as labeling functions λ 1 , ..., λ m . Labeling functions assign a label given an input instance (e.g., a document or entity span) and an underlying heuristic such as matching strings against a dictionary. The output of a labeling function is in the domain {−1, 0, ..., k}, where −1 denotes ABSTAIN, i.e., not assigning any class label. The vector of m labeling functions applied to n instances forms the label matrix Λ ∈ {−1, 0, ..., k}^(m×n). A key finding of data programming is that we can use Λ to recover the latent class-conditional accuracy of each label source, without ground truth labels, by observing the rates of agreement and disagreement across all pairs of labeling functions λ i , λ j [16]. We use the weak supervision framework Snorkel [11] to train a probabilistic label model which captures the relationship between the true label and the label sources, P(Y, Λ). Here the training input is only the label matrix Λ, generated by applying the labeling functions λ 1 , ..., λ m to the unlabeled dataset D. Formally, P(Y, Λ) can be encoded as a factor graph-based model with m accuracy factors θ^Acc between λ 1 , ..., λ m and our true (unobserved) label y. Snorkel implements a matrix completion formulation of data programming which enables faster estimation of the model parameters θ using stochastic gradient descent, rather than relying on Gibbs sampling-based approaches [17]. The label model estimates P(Y | Λ) to provide "denoised" consensus label predictions Ŷ and generates our probabilistically labeled dataset D̂.

In this work, a labeling function λ j accepts an unlabeled sequence X i as input and emits a vector of predicted labels Ỹ i,j = (ỹ j,1 , ..., ỹ j,t ), i.e., a label ỹ j ∈ {−1, 0, ..., k} for each word in X i . A typical labeling function serves as a wrapper for an underlying, potentially task-specific labeling heuristic such as pattern matching with a regular expression. Since these labeling functions are not easily automated and require hand coding, we refer to them as task-specific labeling functions. In contrast, medical ontologies are easily transformed into labeling functions by defining reusable labeling function templates. Templates only require specifying a target entity taxonomy and providing a collection of terminologies mapped to that taxonomy. These mappings are common in knowledge bases such as the UMLS Metathesaurus, where the UMLS Semantic Network [50] provides a shared taxonomy for over a hundred medical terminologies. We utilize two ontology-based labeling functions in this work.

Taxonomy labeling functions require a set of terms (single or multi-word entities) t ∈ T mapped to a taxonomy, where a term may be mapped to multiple entity classes. This mapping is converted to a k-dimensional probability vector t i → [p 1 , ..., p k ], where k is the number of entity classes. Given an input sequence X i , we use string matching to find all longest term matches (in token length) and assign each match to its most probable entity class y = max(t i ), abstaining on ties. Using the longest match is a heuristic which helps disambiguate nested terms ("lung" as anatomy vs. "lung cancer" as disease). Matching optionally includes a set of slot-filled patterns to capture simple compositional mentions (e.g., "{*} ({*})" → "Tylenol (Acetaminophen)").
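The following is a simplified sketch of this taxonomy labeling function template, assuming a toy term-to-class-probability dictionary; the helper names and example probabilities are illustrative and do not reproduce Trove's actual implementation.

```python
import numpy as np

ABSTAIN, OTHER, ENTITY = -1, 0, 1

# Illustrative term -> class-probability vectors for a disease tagging task
# (in Trove these vectors are derived from ontology class assignments).
term_probs = {
    ("lung",): np.array([0.9, 0.1]),              # mostly anatomy -> OTHER
    ("lung", "cancer"): np.array([0.05, 0.95]),   # disease -> ENTITY
    ("diabetes", "type", "2"): np.array([0.0, 1.0]),
}
max_len = max(len(t) for t in term_probs)

def taxonomy_lf(words):
    """Label each word via the longest ontology term match; abstain on no match or tie."""
    labels = [ABSTAIN] * len(words)
    i = 0
    while i < len(words):
        match = None
        for n in range(min(max_len, len(words) - i), 0, -1):  # longest match first
            probs = term_probs.get(tuple(w.lower() for w in words[i:i + n]))
            if probs is not None:
                match = (n, probs)
                break
        if match:
            n, probs = match
            ties = np.flatnonzero(probs == probs.max())
            y = int(np.argmax(probs)) if len(ties) == 1 else ABSTAIN
            labels[i:i + n] = [y] * n
            i += n
        else:
            i += 1
    return labels

print(taxonomy_lf("She has diabetes type 2 and lung cancer".split()))
# [-1, -1, 1, 1, 1, -1, 1, 1]
```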
Synonym (synset) labeling functions require synsets (collections of synonymous terms) {t 1 , ..., t n } ∈ T and terms T mapped to a taxonomy. Given an input sequence X i and its parent context (e.g., document), we search for >1 unique synonym matches from a target synset and label all matches ỹ = max(t i ). This is useful for disambiguating abbreviations (e.g., "Duchenne muscular dystrophy" → "DMD"), where a long form of an abbreviated term appears elsewhere in a document. Matches can be unconstrained, e.g., any tuple found anywhere in a context, or subject to matching rules, e.g., using Schwartz-Hearst abbreviation disambiguation [51] to identify out-of-dictionary abbreviations.

Our labeling functions generate word-level labels. Figure 2 shows how this provides a principled way to synthesize a label when there is disagreement across label sources about what constitutes an entity span. Here the disease mention "diabetes type 2" is not found in Metathesaurus Names (MTH) or SNOMED Clinical Terms (SNOMEDCT) [52], which leads to disagreement and label errors. Using a soft majority vote of labeling functions misses the complete entity span, while the label model learns to account for systematic errors made by each ontology to generate a more accurate consensus label prediction. The output of the label model is a set of probabilistically labeled words, which we transform back into sequences to form D̂.

Figure 2: Four ontology labeling functions are used to label a sequence of words X i containing the entity "diabetes type 2". Soft majority vote estimates Y i as a word-level sum of positive class labels. The label model learns a latent class-conditional accuracy parameter for each ontology, which is used to reweight the original labels and generate a more accurate consensus prediction of Y i .

While probabilistic labels may be used directly for classification, this suffers from a key limitation: the label model cannot generalize beyond the direct output of labeling functions. Rules alone can miss common error cases such as out-of-dictionary synonyms or misspellings. Therefore, to improve coverage we train a discriminative end model, in this case a deep neural network, to transform the output of labeling functions into learned feature representations. Doing so leverages the inductive bias of pre-trained language models [53] and provides additional opportunities for injecting domain knowledge via data augmentation [54] and multi-task learning [55] to improve classification performance.

We use the transformer-based BioBERT [15], a language model fine-tuned on medical text. We also evaluated ClinicalBERT [56] for clinical tasks and found its performance to be the same as BioBERT's. BioBERT is trained as a token-level classifier with a max sequence length of 512 tokens. We follow Devlin et al. [53] for the sequence labeling formulation, using the last BERT layer of each word's head wordpiece token as the contextualized embedding. Since sequence labels may be incomplete (i.e., cases where all labeling functions abstain on a word), we mask all abstained tokens when computing the loss during training. We modified BioBERT to support a noise-aware binary cross entropy loss function [16] which minimizes the expected loss with respect to Ŷ to take advantage of the more informative probabilistic labels.
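A minimal sketch of the abstain masking and noise-aware (expected) cross-entropy loss described above, written as a standalone PyTorch function; the tensor shapes and values are illustrative, and this is not the exact modification made to BioBERT.

```python
import torch
import torch.nn.functional as F

def noise_aware_loss(logits, probs, mask):
    """Noise-aware (soft) cross-entropy for token classification.

    logits: (batch, seq_len, num_classes) raw scores from the token classifier
    probs:  (batch, seq_len, num_classes) probabilistic labels from the label model
    mask:   (batch, seq_len) 1.0 for labeled tokens, 0.0 for abstained/padding tokens
    Minimizes the expected negative log-likelihood under the probabilistic labels,
    ignoring tokens where every labeling function abstained.
    """
    log_p = F.log_softmax(logits, dim=-1)
    token_loss = -(probs * log_p).sum(dim=-1)     # expected NLL per token
    token_loss = token_loss * mask                # drop abstained tokens from the loss
    return token_loss.sum() / mask.sum().clamp(min=1)

# Illustrative shapes: 2 sentences, 4 tokens each, binary tags.
logits = torch.randn(2, 4, 2, requires_grad=True)
probs = torch.tensor([[[0.1, 0.9], [0.2, 0.8], [1.0, 0.0], [0.5, 0.5]],
                      [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5], [0.5, 0.5]]])
mask = torch.tensor([[1., 1., 1., 0.],
                     [1., 1., 0., 0.]])           # 0 = all labeling functions abstained / padding
loss = noise_aware_loss(logits, probs, mask)
loss.backward()
```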
All models were trained using weakly labeled versions of the original training splits, i.e., no hand-labeled instances. We used hand-labeled validation and test sets for hyperparameter tuning and model evaluation, respectively. Result metrics are reported on the test set. The label model was tuned for learning rate, training epochs, L2 regularization, and a uniform accuracy prior used to initialize labeling function accuracies. BioBERT weights were fine-tuned, and end models were tuned for learning rate and training epochs. We used a linear decay learning rate schedule with a 10% warmup period.

We report precision, recall, and F1 score for all tasks. DocRelaTime is reported using micro-averaging. NER metrics are computed using exact span matching [57]. Each NER task is trained separately as a binary classifier using IO (inside, outside) tagging to simplify labeling function design, with predicted tags converted to BIO (beginning, inside, outside) to properly count errors in detecting head words. Span task metrics are calculated assuming access to gold test set spans, as per the evaluation protocol of the original challenges. Label model and BioBERT scores are reported as the mean and standard deviation of five runs with different random seeds.

Trove is written in Python v3.6. Snorkel v0.9.5 was used for training the label model. BioBERT-Base v1.1, Transformers v2.8 [58], and PyTorch v1.1.0 were used to train all discriminative models. All code is open source and publicly available at https://github.com/som-shahlab/trove.

After quantifying the performance of ontology-driven weak supervision on all of our tasks, we performed four experiments. First, we examined performance differences using label source ablations, which compared ontology-based labeling functions against those incorporating task-specific rules. Second, we compared Trove to existing weakly supervised tagging methods. Third, we examined learning source accuracies for UMLS terminologies. Finally, we report on a case study that used Trove to monitor emergency department notes for symptoms and risk factors associated with patients tested for COVID-19. Expanded experimental details, tuning experiments, and performance measures are provided in the supplemental materials.

Table 2: F1 scores for ontology and task-specific rule-based weak supervision categories. Models are soft majority vote (SMV), label model (LM), weakly supervised BioBERT (WS), and fully supervised BioBERT (FS). LFs denote labeling function counts or total added task-specific rules. Bold indicates the best weakly supervised score for each approach and task. Scores are the mean and ±1 SD of five random weight initializations.

For NER tasks, we examined five ablation tiers, ordered by increasing cost of labeling effort:
1. A dictionary of all positive and negative examples explicitly provided in the annotation guidelines, including dictionaries for punctuation, numbers, and English stopwords.
2. + UMLS: All terminologies available in the UMLS.
3. + Other: Additional ontologies or existing dictionaries not included in the UMLS.
4. + Task-specific rules: Regular expressions, small dictionaries, and other heuristics.
5. Hand-labeled: Supervised learning using the expert-labeled training split.

Ontology-based Labeling Functions: We used the UMLS Semantic Network as our entity taxonomy and defined a mapping of semantic types (STYs) to target class labels y ∈ {0, 1}. Non-UMLS ontologies that did not provide semantic type assignments (e.g., ChEBI) were mapped to a single class label.
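As a rough illustration of how such a semantic type mapping can be turned into the per-term probability vectors consumed by taxonomy labeling function templates, the sketch below uses a few real UMLS semantic type names but a hypothetical mapping and toy term rows; it is not Trove's actual code.

```python
from collections import defaultdict
import numpy as np

# Hypothetical mapping of UMLS semantic types (STYs) to binary class labels
# for a disorder tagging task; the real mapping covers the full Semantic Network.
STY_TO_CLASS = {
    "Disease or Syndrome": 1,
    "Sign or Symptom": 1,
    "Neoplastic Process": 1,
    "Body Part, Organ, or Organ Component": 0,
    "Pharmacologic Substance": 0,
}

def build_term_class_probs(term_sty_rows):
    """term_sty_rows: iterable of (term, sty) pairs, e.g. parsed from a UMLS export.
    Returns term -> k-dimensional class probability vector, the input expected by
    a taxonomy labeling function template."""
    counts = defaultdict(lambda: np.zeros(2))
    for term, sty in term_sty_rows:
        label = STY_TO_CLASS.get(sty)
        if label is not None:
            counts[term.lower()][label] += 1
    return {t: c / c.sum() for t, c in counts.items() if c.sum() > 0}

# Toy rows standing in for a terminology export.
rows = [("lung cancer", "Neoplastic Process"),
        ("lung", "Body Part, Organ, or Organ Component"),
        ("cold", "Disease or Syndrome"),
        ("cold", "Sign or Symptom"),
        ("cold", "Pharmacologic Substance")]   # an ambiguous term with multiple senses
print(build_term_class_probs(rows))
# lung cancer -> [0, 1], lung -> [1, 0], cold -> [~0.33, ~0.67]
```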
All UMLS terminologies v were ranked by term coverage on the unlabeled training set, defined as each term's document frequency summed by terminology, and the top s terminologies were used to initialize templates, where s was tuned with a validation set. The remaining (v s+1 , ..., v 92 ) UMLS terminologies were merged into a single labeling function to ensure that all terms in the UMLS were included. UMLS synsets were constructed using concept unique identifiers (CUIs), and templates were initialized with the union of all terminologies and fixed across all NER tasks.

Task-specific Labeling Functions: All task-specific labeling functions were developed by inspecting unlabeled training set documents. For NER, we used three general rule types to label concepts: regular expressions to detect out-of-ontology mentions; small dictionaries of related terms (e.g., illegal drugs); and bigram word co-occurrence graphs built from ontologies to support fuzzy span matching. For negation, we built on NegEx [19], which uses regular expressions to search left and right context windows for negation cues. For DocRelaTime we used a heuristic based on the nearest explicit datetime mention (in token distance) to an event mention [59]. Additional regular expression-based rules were added to detect other cues of event temporality.

Weakly Supervised BioBERT: Adding non-UMLS ontologies did not always help soft majority vote; in one case its performance dropped due to adding the ChEBI ontology, which increased recall but only had 65% word-level precision. Soft majority vote cannot learn or utilize this information, so naively adding ChEBI labels hurt performance. However, the label model learned ChEBI's accuracy and took advantage of the noisier, but higher coverage, signal; thus WS BioBERT with UMLS+Other (red bar) outperformed UMLS (green bar) by 2.4 F1 points (88.2 vs. 85.8).

We compared Trove to three existing weakly supervised methods for NER and sequence labeling: SwellShark [12], AutoNER [13], and WISER [14]. We compared performance on BC5CDR (the combination of the disease and chemical tasks) against all methods, and on the i2b2 drug task against SwellShark.

Table 3: Precision (P), recall (R), and F1 scores for the BC5CDR task using state-of-the-art weakly supervised NER methods. Underlined numbers indicate the best weakly supervised score using only dictionaries/ontologies and bold indicates the best score using custom rules.

Estimating accuracies with the label model requires observing agreement and disagreement among multiple label sources. However, it is non-obvious how to partition the UMLS, which contains many terminologies, into labeling functions. The naive extremes are to either create a single labeling function from the union of all terminologies or to include all terminologies as individual labeling functions. To explore how partitioning choices impact label model performance, we held all non-UMLS labeling functions fixed across all ablation tiers and computed performance across s = (1, ..., 92) partitions of the UMLS by terminology. All scores were normalized to the best global soft majority vote (SMV) score per tier to assess the impact of correcting for label noise.

Figure 4 shows the impact of partitioning the UMLS into s different labeling functions. Modeling source accuracy consistently outperformed SMV across all tiers, in some cases by 2-8 F1 points. The best performing partition size s ranged from 1-10 by task. The two naive baseline approaches, collapsing the UMLS into a single labeling function or treating all terminologies as individual labeling functions, generally did not perform best overall.
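A small sketch of the coverage-based ranking and partitioning described earlier in this section, under the assumption that per-terminology term document frequencies have already been computed on the unlabeled training set; the function and variable names are hypothetical.

```python
from collections import Counter

def partition_terminologies(term_doc_freq, s):
    """term_doc_freq: {terminology: {term: document frequency on the unlabeled training set}}.
    Returns the top-s terminologies (each becomes its own labeling function) and the
    merged set of terms from all remaining terminologies (one labeling function)."""
    coverage = Counter({name: sum(freqs.values()) for name, freqs in term_doc_freq.items()})
    ranked = [name for name, _ in coverage.most_common()]
    top_s, tail = ranked[:s], ranked[s:]
    merged_tail_terms = set()
    for name in tail:
        merged_tail_terms.update(term_doc_freq[name])
    return top_s, merged_tail_terms

# Toy coverage counts standing in for the 92 UMLS terminologies.
term_doc_freq = {
    "SNOMEDCT_US": {"diabetes mellitus": 120, "chest pain": 300},
    "MTH":         {"diabetes": 200, "pain": 500},
    "CHV":         {"sugar diabetes": 10},
    "MDR":         {"dyspnoea": 40},
}
top, tail_terms = partition_terminologies(term_doc_freq, s=2)
print(top)         # ['MTH', 'SNOMEDCT_US']  (ranked by summed document frequency)
print(tail_terms)  # {'sugar diabetes', 'dyspnoea'} (order may vary)
```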
We deployed Trove to monitor emergency departments for patients undergoing COVID-19 testing, analyzing clinical notes for presenting symptoms and risk factors [60]. This required identifying disorders and defining a novel classification task for exposure to a confirmed COVID-19 positive individual, a risk factor informing patient contact tracing. The dataset consisted of hourly dumps of emergency department notes from Stanford Health Care (SHC), beginning in March 2020. We manually annotated a gold test set of 20 notes for all mentions of disorders and 776 notes for mentions of a positive COVID exposure. Two clinical experts generated gold annotations, which were adjudicated for disagreements by authors AC and JF. As a baseline for disorder tagging, we used the fully supervised ShARe/CLEF disorder tagger. This reflects a readily available, but out-of-distribution, training set (MIMIC-II [61] vs. SHC). We used the same disorder labeling function set as in our prior experiments, adding one additional dictionary of COVID terms [62]. BioBERT was trained using 2482 weakly labeled documents. Custom labeling functions were written for the exposure task, and models were trained on 14k sentences.

Table 4: COVID-19 presenting symptoms and risk factors evaluated on Stanford Health Care emergency department notes. Bold and underlined scores indicate the best score in symptom/disorder tagging and COVID exposure classification, respectively.

Our experiments demonstrate the effectiveness of using weakly supervised methods to train entity classifiers using off-the-shelf ontologies and without requiring hand-labeled training data. Medical ontologies are freely available sources of weak supervision for NLP applications [63], and in several NER tasks our ontology-only weakly supervised models matched or outperformed more complex weak supervision methods in the literature. Our work also highlights how domain-aware language models, such as BioBERT, can be combined with weak supervision to build low-cost and highly performant medical NLP classifiers.

Rule-based approaches are common tools in scientific literature analysis and clinical text processing [64, 65, 66, 67]. Our results suggest that engineering task-specific rules in addition to the labels provided by ontologies yields strong performance for several NER tasks, in some cases approaching the performance of systems built using hand-labeled data. We further demonstrated how leveraging the structure inherent in knowledge bases such as the UMLS to estimate source accuracies and correct for label noise provides substantial performance benefits. We find that the classification performance of the label model alone is strong, with BioBERT providing modest gains of 0.9 F1 points on average. Since the label model is orders of magnitude more computationally efficient to train than BERT-based models, in many settings (e.g., limited access to high-end GPU hardware) the label model alone may suffice.

Our tasks reflect a wide range of difficulty. Clinical tasks required more task-specific rules to address the increased complexity of entity definitions and other non-grammatical, sub-language phenomena [68]. Here custom rules improved clinical tasks by an average of 9.1 F1 points vs. 2.3 points for scientific literature. Moreover, adding non-UMLS ontologies to PubMed tasks consistently improved overall performance while providing little-to-no benefit for our clinical tasks. Annotation guidelines for our clinical tasks also increased complexity. The i2b2 drug task combines several underlying classification problems (e.g., filtering out negated medications, patient allergies, and historical medications) into a single tagging formulation.
This extends beyond entity typing and requires more complex, cue-driven rule design.

Manually labeling training data is time consuming and expensive, creating barriers to using machine learning for new medical classification tasks. Sometimes there is a critical need to rapidly analyze both scientific literature and unstructured electronic health record data, as in the case of the COVID-19 pandemic, when we need to understand the full repertoire of symptoms, outcomes, and risk factors at short notice [60, 69, 70]. However, sharing patient notes and constructing labeled training sets present logistical challenges, both in terms of patient privacy and in developing infrastructure to aggregate patient records [71]. In contrast, labeling functions can be easily shared, edited, and applied to data across sites in a privacy-preserving manner to rapidly construct classifiers for symptom tagging and risk factor monitoring.

This work has several limitations. Our task-specific labeling functions were not exhaustive and only reflect low-cost rules easily generated by domain experts. Additional rule development could lead to improved performance. In addition, we did not explore data augmentation or multi-task learning in the BioBERT model, which may further mitigate the need to engineer task-specific rules. There is considerable prior work developing machine learning models for tagging disease, drug, and chemical entities [72, 13] that could be incorporated as labeling functions. However, our goal was to explore performance trade-offs in settings where existing machine learning models are not available. Our framework leverages the wide range of medical ontologies available for English language settings, which provides considerable advantages for weakly supervised methods. Additional work is needed to characterize the extent to which the framework can benefit tasks in non-English settings [73].

Combining labels from multiple ontology sources violates an independence assumption of data programming as used in this work, because any pair of source ontologies may have correlated noise. This restriction applies to all label sources, but is more prevalent in cases with extremely similar label sources, as can occur with ontologies. In our experiments the impact was minor for a small number of sources; however, performance tended to decrease after including more than 20 ontologies. Additional research into unsupervised methods for structure learning [74, 75], i.e., learning dependencies among sources from unlabeled data, could further improve performance or mitigate the need to limit the number of included ontologies.

Identifying named entities and attributes such as negation are critical tasks in medical natural language processing. Manually labeling training data for these tasks is time consuming and expensive, creating a barrier to building classifiers for new tasks. The Trove framework provides ontology-driven weak supervision for medical entity classification and achieves state-of-the-art weakly supervised performance on the NER tasks of recognizing chemicals, diseases, and drugs. We further establish weakly supervised baselines for disorder tagging and for classifying the temporal order of an event entity relative to its document timestamp. The weakly supervised NER classifiers perform within 1.4-4.8 F1 points of classifiers trained with hand-labeled data.
Modeling the accuracies of individual ontologies and rules to correct for label noise improved performance in all of our entity classification tasks. Combining pre-trained language models such as BioBERT with weak supervision yields an additional improvement in most tasks. The Trove framework demonstrates how classifiers for a wide range of medical NLP tasks can be quickly constructed by leveraging medical ontologies and weak supervision, without requiring manually labeled training data. Weakly supervised learning provides a mechanism for combining the generalization capabilities of state-of-the-art machine learning with the flexibility and inspectability of rule-based approaches.

Taxonomy and synset labeling functions do not require any manual inspection of data, only that users specify a target taxonomy which maps to the entity labels used to train the target machine learning model. These examples initialize labeling functions for a simple definition of "drug" using the SNOMEDCT US terminology from the UMLS.

References:
Deep learning for health informatics
A guide to deep learning in healthcare
CORD-19: The COVID-19 open research dataset
A machine-compiled database of genome-wide association studies
Weakly supervised classification of aortic valve malformations using unlabeled cardiac MRI sequences
Multi-frame weak supervision to label wearable sensor data
Multi-resolution weak supervision for sequential data
Cross-modal data programming enables rapid medical machine learning
The Unified Medical Language System (UMLS): integrating biomedical terminology
The Open Biomedical Annotator
Snorkel: Rapid training data creation with weak supervision
SwellShark: A generative model for biomedical named entity recognition without labeled data
Learning named entity tagger using domain-specific dictionary
Weakly supervised sequence tagging from noisy rules
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Data programming: Creating large training sets, quickly
Training complex models with multi-task weak supervision
An information extraction framework for cohort identification using electronic health records
A simple algorithm for identifying negated findings and diseases in discharge summaries
NegBio: a high-performance tool for negation and uncertainty detection in radiology reports
ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
Unsupervised joint mining of deep features and image labels for large-scale radiology image categorization and scene recognition
Detecting hip fractures with radiologist-level performance using deep neural networks
Advanced machine learning in action: identification of intracranial hemorrhage on computed tomography scans of the head with clinical workflow integration
Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists
Machine-learning-based multiple abnormality prediction with large-scale chest computed tomography volumes
Assessment of NER solutions against the first and second CALBC silver standard corpus
A clinical text classification paradigm using weak supervision and deep representation
Medical device surveillance with electronic health records
Constructing biological knowledge bases by extracting information from text sources
Distant supervision for relation extraction without labeled data
Combining labeled and unlabeled data with co-training
Label embedding for zero-shot fine-grained named entity typing
Unsupervised models for named entity classification
Weakly supervised learning for hedge classification in scientific literature
Generalized expectation criteria for semi-supervised learning with weakly labeled data
Learning from noisy singly-labeled data
spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing
Overview of the BioCreative V chemical disease relation (CDR) task
Task 2: ShARe/CLEF eHealth evaluation lab 2014
Extracting medication information from clinical text
SemEval-2016 task 12: Clinical TempEval
Disease Ontology: a backbone for disease semantic integration
ChEBI: a database and ontology for chemical entities of biological interest
The Comparative Toxicogenomics Database's 10th year anniversary: update 2015
ADAM: another database of abbreviations in MEDLINE
A long journey to short abbreviations: developing an open-source framework for clinical abbreviation recognition and disambiguation (CARD)
Natural Language Processing with Python
Aggregating UMLS semantic types for reducing conceptual complexity
A simple algorithm for identifying abbreviation definitions in biomedical text
SNOMED-CT: The advanced terminology and coding system for eHealth
BERT: pre-training of deep bidirectional transformers for language understanding
EDA: Easy data augmentation techniques for boosting performance on text classification tasks
A survey on multi-task learning
Publicly available clinical BERT embeddings
Introduction to the CoNLL-2000 shared task: Chunking
HuggingFace's Transformers: State-of-the-art natural language processing
Brundlefly at SemEval-2016 task 12: Recurrent neural networks vs. joint inference for clinical temporal information extraction
Estimating the efficacy of symptom-based screening for COVID-19
Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): a public-access intensive care unit database
COVID-19 synonyms
Biomedical ontologies: a functional perspective
High accuracy information extraction of medication information from clinical notes: 2009 i2b2 medication extraction challenge
Feature engineering combined with machine learning and rule-based methods for structured information extraction from narrative clinical discharge summaries
Literature mining and database annotation of protein phosphorylation using a rule-based system
Extraction of regulatory gene/protein networks from Medline
Two biomedical sublanguages: a description based on the theories of Zellig Harris
Augmented curation of unstructured clinical notes from a massive EHR system reveals specific phenotypic signature of impending COVID-19 diagnosis
COVID-19 SignSym: A fast adaptation of general clinical NLP tools to identify and normalize COVID-19 signs and symptoms to OMOP common data model
National COVID Cohort Collaborative (N3C)
The CHEMDNER corpus of chemicals and drugs and its annotation principles
The #BenderRule: On naming the languages we study and why it matters
Learning the structure of generative models without labeled data
Learning dependency structures for weak supervision models

This task predicts the order of a clinical event relative to the parent document's creation timestamp, binned into four classes {BEFORE

• Chemical
  - Positive [antidepressive agent, cAMP, carbidopa, estrogen, estrogen receptor agonist, estrogenic agent, estrogenic compound, estrogenic effect, ethanolic extract of daucus carota seed, fatty acid, glucose, grape seed proanthocyanidin extract, levodopa, low-dose oral contraceptive,
    nitric oxide, oral contraceptive, phasic oral contraceptive, polyethylene glycol, saturated fatty acid, steroid, sucrose, thymoanaleptics, thymoleptics]
  - Negative [DNA, adrenergic, anti-HIV agent, anticholinesterase drug, anticoagulant, anticonvulsant, antipsychotic, atom, cellulose, collagen, glucagon, glucocorticoid, glycogen, gold standard, insulin, ion, juice, lipid, lipopolysaccharide, mRNA, molecular, muscarinic, nucleic acid polymer, oligosaccharide, opiate, opioid, opioid alkaloids, opium poppy plant, papaver somniferum, polypeptide, polysaccharide, prolactin, protein, purinergic, saline, starch, water]
• Disease
  - Positive [akathisis, auditory toxicity, bone marrow oedema, cancer, cardiac toxicity, death, dyskinesia, erythroblastocytopenia, hepatitis, hypertension, hypertensive, liver toxicity, ototoxicity, ovarian and peritoneal cancer, pain, partial seizures

Shared Task: Guidelines for the Annotation of Disorders in Clinical Notes
• Positive [bowel obstruction, chest pain, chronic gingivitis, colon cancer, crohn, facial droop, lower extremity DVT, lupus, numbness, pain, rash, schizophrenia, severe pre-eclampsia

TYLENOL ( ACETAMINOPHEN ), acetaminophen, asa, aspirin, atenolol, avapro, bb, caltrate plus D, caltrate plus D, novolog, diuretic, diuretics, fasting lipids sent, fluocinonide 0.5% cream, furosemide, glucophage, lasix, lasix, lasix, long acting nitrate, nephrotoxic meds, plavix, red blood cells, saline, saline solution, this medication