title: AI- and HPC-enabled Lead Generation for SARS-CoV-2: Models and Processes to Extract Druglike Molecules Contained in Natural Language Text
authors: Hong, Zhi; Pauloski, J. Gregory; Ward, Logan; Chard, Kyle; Blaiszik, Ben; Foster, Ian
date: 2021-01-12

Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be drug-like in the context of coronavirus research. We report here on a project that leverages both human and artificial intelligence to detect references to drug-like molecules in free text. We engage non-expert humans to create a corpus of labeled text, use this labeled corpus to train a named entity recognition model, and employ the trained model to extract 10 912 drug-like molecules from the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198 875 papers. Performance analyses show that our automated extraction model can achieve performance on par with that of non-expert humans.

The Coronavirus Disease (COVID-19) pandemic, caused by transmissible infection of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has resulted in tens of millions of diagnosed cases and over 1 450 000 deaths worldwide [1], straining healthcare systems and disrupting key aspects of society and the wider economy. It is thus important to identify effective treatments rapidly via discovery of new drugs and repurposing of existing drugs. Here, we leverage advances in natural language processing to enable automatic identification of drug candidates being studied in the scientific literature.

The magnitude of the pandemic has resulted in an enormous number of academic publications related to COVID-19 research since early 2020. Many of these articles are collated in the COVID-19 Open Research Dataset Challenge (CORD-19) collection [2, 3]. With 198 875 articles at the time of writing, that collection is far too large for humans to read. Thus, tools are needed to automate the extraction of relevant data, such as drug names, testing protocols, and protein targets. Such tools can save domain experts significant time and effort.

Towards this goal, we describe here how we have tackled two important problems: creating labeled training data via judicious use of scarce human expertise, and applying a named entity recognition (NER) model to automatically identify drug-like molecules in text. In the absence of rich labeled data for the growing COVID-19 literature, we employ an iterative model-in-the-loop collection process inspired by our previous work [4, 5]. We first assemble a small bootstrap set of human-verified examples to train a model for identifying similar examples. We then iteratively apply the model, use human reviewers to verify the predictions for which the model is least confident, and retrain the model, until the improvement in performance falls below a threshold. (The human reviewers were administrative staff without scientific backgrounds, with time available for this task due to the pandemic.)
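In outline, this iterative process can be sketched as follows. This is a minimal sketch, not the paper's code: the train_model, predict_confidences, and human_review callables are hypothetical placeholders for the NER training step, per-word confidence scoring, and the human review session, and the default parameter values anticipate those reported later in Section 3.2.

```python
# High-level sketch of the model-in-the-loop labeling process described above.
# train_model, predict_confidences, and human_review are hypothetical placeholders.
def model_in_the_loop(bootstrap_gold, paragraphs, test_set, train_model,
                      predict_confidences, human_review,
                      batch_size=120, conf_min=0.45, conf_max=0.55, eps=0.0):
    gold = list(bootstrap_gold)        # human-verified paragraphs collected so far
    paragraphs = iter(paragraphs)      # stream of unlabeled paragraphs
    prev_f1 = 0.0
    while True:
        model, f1 = train_model(gold, test_set)   # retrain on all verified data
        if f1 - prev_f1 <= eps:
            return gold, model         # stop once the F-1 score no longer improves
        prev_f1 = f1

        # Select paragraphs that contain at least one word the model is unsure
        # about, i.e., scored within the [conf_min, conf_max] confidence band.
        batch = []
        while len(batch) < batch_size:
            paragraph = next(paragraphs)
            scores = predict_confidences(model, paragraph)  # per-word confidences
            if any(conf_min <= s <= conf_max for s in scores):
                batch.append(paragraph)

        # Human reviewers correct the model's proposed labels, producing
        # additional verified paragraphs.
        gold.extend(human_review(model, batch))
```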
Having collected adequate training data via this model-guided human annotation process, we then use the resulting labeled data to re-train an NER model originally developed to identify polymer names in materials science publications [6] and apply this trained model to the full CORD-19 corpus. We show that the labeled data produced by our approach are of sufficiently high quality that, when used to train NER models, they yield a best F-1 score of 80.5%, roughly equivalent to that achieved by non-expert humans. The labeled data, model, and model results are all available online, as described in Section 7.

We aim to develop and apply new computational methods to mine the scientific literature to identify small molecules that have been investigated or found useful as antiviral therapeutics. For example, processing the following sentence should allow us to determine that the drug sofosbuvir has been found effective against the Zika virus: "Sofosbuvir, an FDA-approved nucleotide polymerase inhibitor, can efficiently inhibit replication and infection of several ZIKV strains, including African and American isolates." [7]. This problem of identifying drug-like molecules in text can be divided into two linked problems: 1) identifying references to small therapeutic molecules ("drugs") and 2) determining what the text says about those molecules. In this work, we consider potential solutions to the first problem.

A simple way to identify entities in text that belong to a specialized class (e.g., drug-like molecules) is to refer to a curated list of valid names, if one is available. In the case of drugs, we might think to use DrugBank [8] or the FDA Drug Database [9], both of which in fact list sofosbuvir. However, such databases are not in themselves an adequate solution to our problem, for at least two reasons. First, they are rarely complete: the tens of thousands of entity names in DrugBank and the FDA Drug Database together are just a tiny fraction of the billions of molecules that could potentially be used as drugs. Second, such databases may be overly general: DrugBank, for example, includes the terms "rabbit" and "calcium," neither of which has value as an antiviral therapeutic. In general, the use of any such list to identify entities will lead to both false negatives and false positives. We need instead to employ the approach that a human reader might follow in this situation, namely to scan text for words that appear in contexts in which a drug name is likely to appear. In the following, we explain how we use natural language processing (NLP) techniques for this purpose.

Finding strings in text that refer to drug-like molecules is an example of named entity recognition (NER) [10], an important NLP task. Both grammatical and statistical (e.g., neural network-based) methods have been applied to NER; the former can be more accurate, but require much effort from trained linguists to develop. Statistical methods use supervised training on labeled examples to learn the contexts in which entities of interest (e.g., drug-like molecules) are likely to occur, and then classify previously unseen words as such entities if they appear in similar contexts. For instance, a training set may contain the sentence "Ribavirin was administered once daily by the i.p. route" [11], with ribavirin labeled as Drug. With sufficient training data, the model may learn to assign the label Drug to arbidol in the sentence "Arbidol was administered once daily per os using a stomach probe" [11].
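To make this supervised setup concrete, the sketch below shows one way such labeled examples might be represented for an NER toolkit, using character-offset entity spans in the style accepted by spaCy's training API. The DRUG label name and the exact format are illustrative assumptions, not the paper's released data format.

```python
# Minimal sketch of supervised NER training data in a spaCy-style format:
# (text, {"entities": [(start, end, label)]}); offsets are character positions.
# The label name "DRUG" is an illustrative assumption.
TRAIN_DATA = [
    (
        "Ribavirin was administered once daily by the i.p. route",
        {"entities": [(0, 9, "DRUG")]},   # "Ribavirin" spans characters 0-9
    ),
    (
        "Arbidol was administered once daily per os using a stomach probe",
        {"entities": [(0, 7, "DRUG")]},   # "Arbidol" spans characters 0-7
    ),
]

# During training, an NER model learns that words appearing in contexts such as
# "<X> was administered once daily ..." are likely drug names, and can then
# label previously unseen molecules that occur in similar contexts.
```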
This learning approach can lead to general models capable of finding previously unseen candidate molecules in natural language text. The development of effective statistical NER models is complicated by the many contexts in which names can occur. For example, while the contexts just given for ribavirin and arbidol are similar, both are quite different from that quoted for sofosbuvir earlier. Furthermore, authors may use different wordings and sentence structures: e.g., "given by i.p. injection once daily" rather than "administered once daily by the i.p. route." Thus, statistical NER methods need to do more than learn template word sequences: they need to learn more abstract representations of the context(s) in which words appear. Modern NLP and NER systems do just that [12].

We consider two NER models in this paper: SpaCy and a Keras long short-term memory (LSTM) model. Both models are publicly available on DLHub [13] and GitHub, as described in Section 7. SpaCy is an open source NLP library that provides a pre-trained entity recognizer able to recognize 18 types of entities, including PERSON, ORGANIZATION, LOCATION, and PRODUCT. Its model calculates, for each word, a probability distribution over the entity types, and outputs the type with the highest probability as the predicted type for that word. When pre-trained on the OntoNotes 5 dataset of over 1.5 million labeled words [14], the SpaCy entity recognizer can identify supported entities with 85.85% accuracy. However, it does not include drug names as a supported entity class, and thus we would need to retrain the SpaCy model on a drug-specific training corpus. Unfortunately, there is no publicly available corpus of labeled text for drug-like molecules in context. Thus, we need to use other methods to retrain this model (or other NER models), as we describe in Section 4.

While SpaCy is easy to use, it lacks flexibility: its end-to-end encapsulation does not expose many tunable parameters. Thus we also explore the use of a Keras-LSTM model that we developed in previous work for identification of polymers in the materials science literature [6]. This model is based on a bidirectional LSTM (Bi-LSTM) network with a conditional random field (CRF) layer added on top (a simplified sketch of such a tagger appears at the end of this section). It takes training data labeled according to the "IOB" schema: the first word in an entity is given the label "B" (Beginning), the following words in the same entity are labeled "I" (Inside), and non-entity words are labeled "O" (Outside). During prediction, the Bi-LSTM network assigns one of the I, O, and B labels to each word in the input sentence, but it has no awareness of the validity of the label sequence. The CRF layer on top of the Bi-LSTM lowers the probability of invalid label sequences (e.g., "OIO"). We compare the performance of the SpaCy and Keras-LSTM models under various conditions in Section 4.

We address the lack of labeled training data by using Algorithm 1 (see also Figure 1) to assemble a set of human- and machine-labeled data from CORD-19 [3]. In describing this process, we refer to paragraphs labeled automatically via a heuristic or model as silver, and to silver paragraphs whose labels have been corrected by human reviewers as gold. We use the Prodigy machine learning annotation tool to manage the review process: reviewers are presented with a silver paragraph, with putative drug entities highlighted; they click on false negative and false positive words to add or remove highlights and thus produce a gold paragraph.
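The sketch referenced above illustrates the general shape of a Bi-LSTM IOB tagger in Keras. It is a simplified stand-in for the paper's model: the vocabulary size, embedding size, and sequence length are assumed values, and the CRF layer is omitted here (in the paper's model a CRF layer sits on top of the Bi-LSTM to discourage invalid label sequences).

```python
# Minimal sketch of a Bi-LSTM IOB tagger in Keras (TensorFlow 2.x assumed).
# Sizes are illustrative; the paper's model also adds a CRF layer on top.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # assumed size of the word-embedding vocabulary
MAX_LEN = 100        # assumed maximum sentence length, in tokens
NUM_TAGS = 3         # I, O, B

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 128, input_length=MAX_LEN, mask_zero=True),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    # Per-token softmax over {I, O, B}; a CRF layer would replace this to score
    # whole label sequences rather than classifying tokens independently.
    layers.TimeDistributed(layers.Dense(NUM_TAGS, activation="softmax")),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```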
Prodigy saves the corrected labels in standard NER training data format. Our algorithm involves three main phases, as follows.

In the first bootstrap phase, we assemble an initial test set of gold paragraphs for use in subsequent data acquisition. We create a first set of silver paragraphs by using a simple heuristic: we select N_0 paragraphs from CORD-19 that contain one or more words in DrugBank with an Anatomical Therapeutic Chemical Classification System (ATC) code, label those words as drugs, and ask human reviewers to correct both false positives and false negatives in our silver paragraphs, creating gold paragraphs.

In the subsequent build test set phase, we repeatedly use all gold paragraphs obtained so far to train an NER model, use that model to identify and label additional silver paragraphs, and engage human reviewers to correct false positives and false negatives, creating additional gold paragraphs. We repeat this process until we have N_t initial gold paragraphs.

In the third build labeled set phase, we repeatedly use an NER model trained on all human-validated labels obtained to date, with the N_t gold paragraphs from the bootstrap phase used as a test set, to identify and label promising paragraphs in CORD-19 for additional human review. To maximize the utility of this human effort, we present the reviewers only with paragraphs that contain one or more uncertain words, i.e., words that the NER model identifies as drug/non-drug with a confidence in the range [min, max]. We continue this process of model retraining, paragraph selection and labeling, and human review until the F-1 score improves by less than ε.

The behavior of this algorithm is influenced by six parameters: N_0, N, N_t, ε, min, and max. N_0 and N are the numbers of paragraphs that are assigned to human reviewers in the first and subsequent steps, respectively. N_t is the number of examples in the test set. ε is a threshold that determines when to stop collecting data. min and max determine the confidence range from which words are selected for human review. In the experimental studies described below, we used N_0 = 278, N = 120, N_t = 500, ε = 0, min = 0.45, and max = 0.55.

The NER model used in the model-in-the-loop annotation workflow to score words might also be viewed as a parameter. In the work reported here, we use SpaCy exclusively for that purpose, as it integrates natively with the Prodigy annotation tool and trains more rapidly. However, as we show below, the Keras-LSTM model is ultimately somewhat more accurate when trained on all of the labeled data generated, and thus is preferred when processing the entire CORD-19 dataset: see Section 5.1 and Section 6.

This semi-automated method saves time and effort for human reviewers because they are asked to verify only those labels that our model has identified as uncertain, and thus worth reviewing. Furthermore, as we show below, we do not need to engage biomedical professionals to label drugs in text: untrained people, armed with contextual information (and online search engines), can spot drug names in text with accuracy comparable to that of experts.

We provide further details on the three phases of the algorithm in the following, with numbers referring to line numbers in Algorithm 1.

A) Bootstrap

1: We start with the 2020-03-20 release of the CORD-19 corpus, which contains 44 220 papers [3]. We create C, a random permutation of its paragraphs, from which we will repeatedly fetch paragraphs via next(C).
C) Build labeled set

13: We assemble a training set G, using the test set T assembled in the previous phases for testing. This process continues until the F-1 score stops improving (see Section 4).

14-17: Same as Steps 6-9, except that we train on G and test on T, and add new gold paragraphs to G instead of T.

As noted in Section 3.2, our model-in-the-loop annotation workflow requires repeated retraining of a SpaCy model. Thus we conducted experiments to understand how SpaCy prediction performance is influenced by model size, quantity of training data, and amount of training performed. As the training data produced by the model-in-the-loop annotation workflow are to be used to train an NER model that we will apply to the entire CORD-19 dataset, we also evaluate the Keras-LSTM model from the perspectives of big data accuracy and training time.

We first need to decide which SpaCy model to use for model-in-the-loop annotation. Model size is a primary factor that affects training time and prediction performance. In general, larger models tend to perform better, but require both more data and more time to train effectively. As our model-in-the-loop annotation strategy requires frequent model retraining, and furthermore will (initially at least) have little data, we hypothesize that a smaller model may be adequate for our purposes. To explore this hypothesis, we study the performance achieved by the SpaCy medium and large models on our initial training set of 278 labeled paragraphs. We show in Figure 2 the performance achieved by the two models as a function of the number of training epochs. Focusing on the F-1 score, the harmonic mean of precision and recall (a good single measure of a model's ability to find true entities without producing false positives), we see that the two models achieve similar prediction performance, with the largest difference in F-1 score being around 2%. As the large model takes over eight times longer to train per epoch, we select the medium model for model-in-the-loop data collection.

As data labeling is expensive in both human time and model training time, it is valuable to explore the tradeoff between time spent collecting data and prediction performance. To this end, we manually labeled a set of 500 paragraphs selected at random from CORD-19 [3] as a test set. Then, we used that test set to evaluate the results of training the SpaCy and Keras-LSTM models of Section 3.1 on increasing numbers of the paragraphs produced by our human-in-the-loop annotation process. Figure 3 shows the results.

Prediction performance is also influenced by the number of epochs spent in training. The cost of training is particularly important in a model-in-the-loop setup, as human reviewers cannot work while a model is offline for training. Figure 4 shows the progression of the loss, precision, recall, and F-1 values of the SpaCy model during 100 epochs of training with the initial 278 examples. We can see that the best F-1 score is achieved within 10 to 20 epochs; increasing the number of epochs does not result in any further improvement. Of course, the F-1 score does not tell us everything about the model's performance. Training for more epochs can sometimes lead to lower loss values even when other metrics (such as precision, recall, or F-1) no longer improve. That would still be desirable, because it means the model has become more "confident," in a sense, about its predictions. However, that is not the case here.
As shown in Figure 4, after around 40 epochs the loss begins to oscillate instead of continuing downwards, suggesting that in this case training for 100 epochs does not produce a better model than training for 20 epochs.

For the Keras-LSTM model, the training and validation accuracy curves (Figure 5(a)) diverge: the training accuracy continues to increase while the validation accuracy plateaus. This trend is suggestive of overfitting, which is corroborated by Figure 5(b): after about 50 epochs, the validation loss curve turns upwards. Hence we choose to limit the training epochs to 64. After each epoch, if a lower validation loss is achieved, the current model state is saved; after 64 epochs, we test the model with the lowest validation loss on the withheld test set.

We conducted experiments to compare the performance of the SpaCy and Keras-LSTM NER models; compare the performance of the models against humans; determine how training data influences model performance; and analyze human and model errors.

We used the collected data of Section 3.2 to train both the SpaCy and Keras-LSTM NER models of Section 3.1 to recognize and extract drug-like molecules in text. We find that the trained en_core_web_md SpaCy model achieved a best F-1 score of 77.3%, while the trained Keras-LSTM model achieved a best F-1 score of 80.5%, somewhat outperforming SpaCy. As shown in Figure 3, the SpaCy model performs better than the Keras-LSTM model when trained with small amounts of training data, perhaps because of the different mechanisms employed by the two methods to generate numerical representations for words. SpaCy's built-in language model, pre-trained on a general corpus of blog posts, news, comments, and similar text, gives it some knowledge about commonly used English words, which are likely also to appear in a scientific corpus. The Keras-LSTM model, on the other hand, uses custom word embeddings trained solely on the input corpus, which gives it a better understanding of multi-sense words, especially those that have quite different meanings in a scientific corpus. However, without enough raw data from which to draw contextual information, custom word embeddings cannot accurately reflect the meanings of words.

Recognizing drug-like molecules is a difficult task even for humans, especially non-medical professionals (such as our non-expert annotators). To assess the accuracy of the annotators, we asked three people to examine 96 paragraphs, with their associated labels, selected at random from the labeled examples. Two of these reviewers had been involved in creating the labeled dataset; the third had not. For each paragraph, each reviewer decided independently whether each drug molecule entity was labeled correctly (a true positive), was labeled as a drug when it was not (a false positive), or was missed (a false negative). If all three reviewers agreed on a paragraph (the case for 88 of the 96 paragraphs), we accepted their opinion; if they disagreed (the case for eight paragraphs), we engaged an expert. This process revealed a total of 257 drug molecule entities in the 96 paragraphs, of which the annotators labeled 201 correctly (true positives), labeled 49 incorrectly (false positives), and missed 34 (false negatives). The numbers of true positives and false negatives do not sum to the total number of drug molecules because in some cases an annotator labeled not the drug entity itself but the entity plus an extra preceding or succeeding word or punctuation mark
"sofosbuvir," instead of "sofosbuvir") and we count such occurrences as false positives rather than false negatives. In this evaluation, the non-expert annotators achieved an F-1 score of 82.9%, which is comparable to the 80.5% achieved by our automated models, as shown in Figure 3 . In other words, our models have performance on par with that of non-expert humans. We described in the previous section how review of 96 paragraphs labeled by the non-expert annotators revealed an error rate of about 20%. This raises the question of whether model performance could be improved with better training data. To examine this question, we compare the performance of our models when trained on original vs. corrected data. As we only have 96 corrected paragraphs, we restrict our training sets to those 96 paragraphs in each case. We sorted the 96 paragraphs in both datasets so that they are considered in the same order. Then, we split each dataset into five subsets for K-fold cross validation (K =5), with the first four subsets having 19 paragraphs each and the last subset having 20. Since K is set to five, the SpaCy and Keras models are trained five times. In the i -th round, each model is trained on four subsets (excluding the i -th) of each dataset. The i -th subset of the corrected dataset is used as the test set. The i -th subset of the original dataset is not used in the i -th round. We present the K-fold cross validation results in Tables 1 and 2. The models performed reasonably well when trained on the original dataset, with an average F-1 score only 2% less than that achieved with the corrected labels. Given that the expert input required for validation is hard to come by, we believe that using non-expert reviewers is an acceptable tradeoff and probably the only practical way to gather large amounts of training data. Finally, we explore the contexts in which human reviewers and models make mistakes. Specifically, we study the tokens that appear most frequently near to incorrectly labeled entities. To investigate the effects of immediate and long-distance context, we control, as window size, the maximum distance between a token and a entity for that token to be considered as "context" for that entity. One difficulty with this analysis is that the most frequent tokens identified in this way were mostly stop words or punctuation marks. For instance, when the window size is set to three, the 10 most frequent tokens around mislabeled words are, in descending order, "comma(,)," "and," "mg," "period(.)," "right parenthesis())," "with," "of," "left parenthesis(()," "is," and "or." Only "mg" is neither a stop word nor punctuation mark. Those tokens provide little insight as to why human reviewers might have made mistakes, and furthermore are unlikely to have influenced reviewer decisions. Thus we exclude stopwords and punctuation marks when providing, in Table 3 , lists of the 10 most frequent tokens within varying window sizes of words that were incorrectly identified as molecules by human reviewers. We see that there are indeed several deceptive contextual words. With a window size of one, the 10 most frequent tokens include "oral," "dose," and "intravenous." It is understandable that an untrained reviewer might label as drugs words that immediately precede or follow such context words. Similar patterns can be seen for window sizes of three and five. Without background knowledge to draw from, non-experts are more likely to rely on their experience gained from labeling previous paragraphs. 
One may hypothesize that after reviewers have seen a few dozen to a few hundred paragraphs, these deceptive contextual words leave a deep impression, so that when the words reappear, reviewers are likely to label an unfamiliar word near them as a drug. To investigate this hypothesis, we also examined the most frequent words around drug entities that were correctly labeled by human reviewers: see Table 4. Interestingly, we found overlaps between the lists in Tables 3 and 4: in all, three, four, and two overlaps for window sizes of one, three, and five, respectively, when treating all numerical values as identical. This finding supports our hypothesis that frequent words around real drug entities may confuse human reviewers when they appear around non-drug entities.

Table 4: The 10 most frequent tokens (with counts) around correctly labeled drug entities, for window sizes of one, three, and five.

Rank  Window size 1            Window size 3         Window size 5
      Token             Count  Token          Count  Token          Count
 1    resistance        176    Tetracycline   230    Tetracycline   230
 2    treatment         9      resistance     177    resistance     178
 3    mM                4      Trimethoprim   118    Trimethoprim   118
 4    oral              3      treatment      11     treatment      14
 5    after             3      20∼            7      20∼            8
 6    analogue          3      Figure         5      placebo        7
 7    responses         3      concentration  5      effects        6
 8    antibiotics       2      compared       4      Figure         6
 9    exposure          2      100            4      KLK5           6
10    pharmacokinetics  2      mM             4      matriptase     6

We repeat this comparison of context words around human and model errors while retaining stop words and punctuation marks. Tables 5 and 6 show the 20 most frequent tokens in each case. We see that 20-25% of the tokens in Table 5, but only 5-10% of those in Table 6, are not stop words or punctuation marks. Because the model learns its word embeddings only from the input text, if a token frequently co-occurs with drug entities in the training corpus, the model will treat its presence as an indication of nearby drug entities, regardless of whether it is a stop word. This apparently leads the model to make incorrect inferences. Humans, on the other hand, are unlikely to regard a stop word such as "the" as indicative of drug entities, no matter how frequently the two appear together.

After training the models with the labeled examples, we applied them to the entire CORD-19 corpus (the 2020-10-04 version, with 198 875 articles) to identify potential drug-like molecules. Processing a single article takes only a few seconds; we adapted our models to use data parallelism to enable rapid processing of these many articles. We ran the SpaCy model on two Intel Skylake 6148 processors with a total of 40 CPU cores; this run took around 80 core-hours and extracted 38 472 entities. We ran the Keras model on four NVIDIA Tesla V100 GPUs; this run took around 40 GPU-hours and extracted 121 680 entities. We recorded, for each entity, the number of times that it was recognized by each model, and used those counts as a voting mechanism to determine which entities are most likely to be actual drugs. In our experiments, "balanced" entities (i.e., those whose detection counts from the two models are within a factor of 10 of each other) are the most likely to appear in the DrugBank list. As shown in Figure 6, when we sorted all extracted entities in descending order of their total number of detections by both models and compared the top 100 entities to DrugBank, only 77% were exact matches to drug names or aliases, or 86% if we included partial matches (i.e., the extracted entity is a word within a multi-word drug name or alias in DrugBank). In comparison, among the top 100 "balanced" entities, 88% were exact matches to DrugBank, or 91% with partial matches.
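A minimal sketch of this two-model voting scheme, under the assumption that each model's output has already been reduced to an entity-to-detection-count mapping, is shown below; the function name, the dictionary layout, and the parameterization of the factor-of-10 threshold are illustrative.

```python
# Sketch of the two-model voting scheme: an entity is "balanced" when its
# detection counts from the two models are within a factor of 10 of each other.
# spacy_counts / keras_counts map entity string -> number of detections.
def rank_entities(spacy_counts, keras_counts, balance_factor=10):
    balanced, imbalanced = {}, {}
    for entity in set(spacy_counts) | set(keras_counts):
        a = spacy_counts.get(entity, 0)
        b = keras_counts.get(entity, 0)
        if a > 0 and b > 0 and max(a, b) <= balance_factor * min(a, b):
            balanced[entity] = a + b
        else:
            imbalanced[entity] = a + b
    # Sort each group by total number of detections, most frequent first.
    order = lambda d: sorted(d.items(), key=lambda kv: kv[1], reverse=True)
    return order(balanced), order(imbalanced)

# Illustrative usage with made-up counts:
balanced, imbalanced = rank_entities({"remdesivir": 150, "silver": 2},
                                     {"remdesivir": 90, "silver": 400})
print(balanced[:1], imbalanced[:1])   # remdesivir is balanced; "silver" is not
```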
Although DrugBank provides a reference against which to evaluate our results, it is not an exhaustive ontology. For instance, remdesivir, a drug that has been proposed as a potential COVID-19 treatment, is not in DrugBank. We manually checked the top 50 "balanced" and top 50 "imbalanced" entities not matched to DrugBank, and found that 70% of those in the "balanced" list are actual drugs, but only 26% of those in the "imbalanced" list. Looking at the false positives in these top-50 lists, the "balanced" false positives are often understandable. For example, in the sentence "ELISA plate was coated with . . . and then treated for 1h at 37.8C with dithiothreitol . . . ", the model mistook the redox reagent dithiothreitol for a drug entity, probably because of its context "treated with." On the other hand, we found no such plausible explanations for the false positives in the "imbalanced" list, where most are chemical elements (e.g., silver, sodium), amino acids (e.g., cysteine, glutamine), or proteins (e.g., lactoferrin, cystatin).

Finally, we compared our extraction results to the drugs being used in clinical trials, as listed on the U.S. National Library of Medicine website [15]. We queried the website with "covid" as the keyword, manually screened the returned entries in the "Interventions" column to remove non-drug words (e.g., tablet, injection, capsule) and dosage information (e.g., 2.5mg, 2.5%), and kept only the drug names. Then we compared the 50 most frequently appearing drugs to the drugs extracted automatically from the literature. The "balanced" entities extracted by both models matched 64% of the top 50 drugs in clinical trials, whereas the "imbalanced" entities matched only 6%. The results discussed here are available in the repository described in Section 7.

We have made our annotated training data, trained models, and the results of applying the models to the CORD-19 corpus publicly available online [16]. To facilitate training of various models, we publish the training data in two formats: an unsegmented version in line-delimited JSON (JSONL) format, and a segmented version in comma-separated value (CSV) format. The JSONL format contains the most comprehensive information that we have collected on the paragraphs in the dataset. We chose JSONL rather than a single JSON list because it allows objects to be retrieved without parsing the entire file. A JSON object in the JSONL file has the following structure:

• text: The original paragraph, stored as a string without any modification.
• tokens: The list of tokens obtained from text after tokenization; each token includes text, the text of the token as a string.

Another commonly adopted labeling scheme for NER datasets is the "IOB" scheme, in which the original text is first tokenized and each token is assigned the label "I," "O," or "B." The label "B" (Beginning) means the corresponding token is the first in a named entity; the label "I" (Inside) is given to every other token in a named entity; and all remaining tokens get the label "O" (Outside), meaning that they are not part of any named entity. The aforementioned JSONL data are converted according to the IOB scheme and stored in CSV files with one training example per line. Each line consists of two columns: the first contains the tokens that make up the original text, and the second the corresponding IOB labels for those tokens.
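The sketch below illustrates one way such a conversion from character-offset entity spans to per-token IOB labels might be done. The whitespace tokenizer and data layout are simplifying assumptions, not the released conversion code, which would use a proper NLP tokenizer.

```python
# Sketch: convert a character-offset entity annotation into per-token IOB labels.
# A simple whitespace tokenizer is assumed for illustration.
def to_iob(text, entities):
    """entities is a list of (start, end, label) character spans."""
    tokens, labels = [], []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)     # character offset of this token
        end = start + len(token)
        pos = end
        label = "O"                        # default: not part of any entity
        for (ent_start, ent_end, _) in entities:
            if start >= ent_start and end <= ent_end:
                # first token of the entity gets "B", later tokens get "I"
                label = "B" if start == ent_start else "I"
                break
        tokens.append(token)
        labels.append(label)
    return tokens, labels

tokens, labels = to_iob(
    "Ribavirin was administered once daily by the i.p. route",
    [(0, 9, "DRUG")],
)
print(list(zip(tokens, labels)))
# [('Ribavirin', 'B'), ('was', 'O'), ('administered', 'O'), ...]
```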
In addition to using a different labeling scheme, the samples in the CSV files are segmented: each sentence, rather than an entire paragraph, is treated as a training sample. This structure aligns with that used in standard NER training sets such as CoNLL-2003 [17]. The trained SpaCy and Keras models, and the results of applying them to the 198 875 articles in the CORD-19 corpus, are also available in the same GitHub repository. Additionally, the pre-trained SpaCy model is provided as a cloud service via DLHub [13, 18]. (The Keras model could not be hosted there due to compatibility issues with DLHub.) This cloud service allows researchers to apply the model to any texts they provide, with as few as four lines of code.

We have presented a human-machine hybrid pipeline for collecting training data for named entity recognition models, and have applied this pipeline to create an automated model for identifying drug-like molecules in COVID-19-related research papers. Our pipeline makes efficient use of valuable human resources by presenting human labelers only with samples that are most likely to confuse our model. We explored various trade-offs, including model size, number of training samples, and training epochs, to find the right balance between model performance and time-to-result. In total, human reviewers working with our pipeline validated labels for 278 bootstrap samples, 1000 training samples, and 500 test samples. As this work was performed in conjunction with other tasks, we cannot accurately quantify the total effort required to collect and annotate these training and test samples, but it was likely around 100 person-hours. NER models trained with these data achieved a best F-1 score of 80.5% when evaluated on our collected test set, and correctly identified 64% of the top 50 drugs in clinical trials for COVID-19. The models were applied to 198 875 articles in the CORD-19 collection, from which we identified 10 912 molecules with potential therapeutic effects against the SARS-CoV-2 coronavirus. The extracted molecule list was subsequently given to scientists at Argonne National Laboratory for use in computational screening pipelines. The code, models, and extraction results are publicly available. In the future, we hope to further improve NER model performance by integrating our models with more advanced language models.

The datasets analyzed in this study can be found in the Kaggle dataset: https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge. The models used in this work and the datasets generated for this study can be found on GitHub at https://github.com/globus-labs/covid-nlp/tree/master/drug-ner. The models are also available on DLHub [18].

While every effort has been made to produce valid data, by using this data, User acknowledges that neither the Government nor UChicago Argonne LLC makes any warranty, express or implied, of either the accuracy or completeness of this information, or assumes any liability or responsibility for the use of this information. Additionally, this information is provided solely for research purposes and is not provided for purposes of offering medical advice. Accordingly, the U.S. Government and UChicago Argonne LLC are not liable to any user for any loss or damage, whether in contract, tort (including negligence), breach of statutory duty, or otherwise, even if foreseeable, arising under or in connection with use of or reliance on the content displayed on this site.
References

An interactive web-based dashboard to track COVID-19 in real time.
CORD-19: The COVID-19 Open Research Dataset. Preprint.
Active learning yields better training data for scientific named entity recognition.
Creating training data for scientific named entity recognition with minimal human effort.
Extracting Named Entities from Scientific Literature. In: International Conference on Computational Science.
The FDA-approved drug Sofosbuvir inhibits Zika virus infection.
DrugBank 5.0: A major update to the DrugBank database.
A survey of named entity recognition and classification.
Evaluation of antiviral efficacy of ribavirin, arbidol, and T-705 (Favipiravir) in a mouse model for Crimean-Congo hemorrhagic fever.
Named entity recognition with bidirectional LSTM-CNNs.
DLHub: Simplifying publication, discovery, and use of machine learning models in science.
OntoNotes release 5.0. Linguistic Data Consortium, Philadelphia.
Lit - A Collection of Literature Extracted Small Molecules to Speed Identification of COVID-19 Therapeutics.
Materials Data Facility.
Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In: 7th Conference on Natural Language Learning.
A Model to Extract Drug-like Molecules from Natural Language Text. Data and Learning Hub for Science.

Acknowledgments

The data generated have been prepared as part of the nCov-Group Collaboration, a group of over 200 researchers working to use computational techniques to address various challenges associated with COVID-19. This research was supported by the DOE Office of Science through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on response to COVID-19, with funding provided by the Coronavirus CARES Act. This research used resources of the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. This work was also supported by financial assistance award 70NANB19H005 from the U.S. Department of Commerce, National Institute of Standards and Technology, as part of the Center for Hierarchical Materials Design (CHiMaD). We are grateful to our human reviewers: India S. Gordon, Linda Novak, Kasia Salim, Susan Sarvey, Julie Smagacz, and Monica Orozco White.