key: cord-0580203-npe3jvms
title: Using Text Analytics for Health to Get Meaningful Insights from a Corpus of COVID Scientific Papers
authors: Soshnikov, Dmitry; Soshnikova, Vickie
date: 2021-10-28
cord_uid: npe3jvms

Since the beginning of the COVID pandemic, around 700,000 scientific papers have been published on the subject. A human researcher cannot possibly get acquainted with such a huge text corpus, so AI-based tools that help navigate this corpus and derive useful insights from it are badly needed. In this paper, we use the pre-trained Text Analytics for Health service together with several cloud tools to extract knowledge from scientific papers, gain insights, and build a tool that helps researchers navigate the paper collection in a meaningful way.

Figure: Architecture of a system to extract knowledge insights from a corpus of scientific papers.

Note that this architecture is built on top of the platform components of Microsoft Azure, which allows us to delegate many complex issues (such as scalability) to the cloud provider.

The idea of applying NLP methods to scientific literature seems quite natural and has been proposed in many different works [1, 2, 3]. Scientific texts are already well structured: they contain keywords, an abstract, and well-defined terms. At the very beginning of the COVID pandemic, a research challenge was launched on Kaggle to analyze scientific papers on the subject. The dataset behind this competition is called CORD-19 [4], and it contains a constantly updated corpus of everything published on topics related to COVID. This dataset consists of the following parts:

• The metadata file metadata.csv contains the most important information for all publications in one place. Each paper in this table has a unique identifier cord_uid (which in practice turns out not to be completely unique once you start working with the dataset). The information includes: title of publication, journal, authors, abstract, date of publication, and DOI.
• Full-text papers in the document_parses directory contain structured text in JSON format, which greatly simplifies the analysis.
• Pre-built document embeddings map cord_uid values to float vectors that reflect the overall semantics of each paper.

In this paper, we focus on paper abstracts, because they contain the most important information from the paper. However, for a full analysis of the dataset, it makes sense to apply the same approach to the full texts as well.

In recent years, there has been huge progress in the field of Natural Language Processing, and very powerful neural language models have been trained. In the area of NLP, the following tasks are typically considered:

• Text classification / intent recognition. We need to classify a piece of text into a number of categories; this is a typical classification task.
• Sentiment analysis. We need to return a number that shows how positive or negative the text is; this is a typical regression task.
• Named Entity Recognition (NER). We need to extract named entities from text and determine their type; for example, we may be looking for names of medicines, or diagnoses. A task similar to NER is keyword extraction.
• Text summarization. We want to produce a short version of the original text, or to select the most important pieces of text.
• Question answering. Given a piece of text and a question, our goal is to find the exact answer to this question in the text.
• Open-domain question answering (ODQA). The main difference from the previous task is that we are given a large corpus of text, and we need to find the answer to our question somewhere in the whole corpus.

In [5], we described how the ODQA approach can be used to automatically find answers to specific COVID questions. However, this approach does not provide insights into the text corpus as a whole. To draw insights from text, NER seems to be the most promising technique. If we can find the specific entities present in a text, we can then perform semantically rich searches that answer specific questions, as well as collect data on the co-occurrence of different entities, spotting specific scenarios that interest us.

To train a NER model, as with any other neural language model, we need a reasonably large dataset that is properly marked up. Finding such datasets is often not an easy task, and producing them for a new problem domain requires initial human effort for markup. Luckily, modern transformer language models can be trained in a semi-supervised manner using transfer learning: the base language model (for example, BERT [6]) is first trained on a large corpus of text and then specialized to a specific task, such as classification or NER, on a smaller dataset. This transfer learning process can also contain an additional step: further training of the generic pre-trained model on a domain-specific dataset. For example, in the area of medical science, Microsoft Research has pre-trained a model called PubMedBERT [7] using texts from the PubMed repository. This model can then be adapted to different specific tasks, provided we have specialized datasets available.

However, training a model requires a lot of skill and computational power, in addition to a dataset. Microsoft (as well as some other large cloud vendors) also makes some pre-trained models available through a REST API. These services are called Cognitive Services, and one of the services for working with text is called Text Analytics [8]. It can do the following:

• Keyword extraction and NER for some common entity types, such as people, organizations, and dates/times.
• Sentiment analysis.
• Language detection.
• Entity linking: automatically adding internet links to the most common entities. This also performs disambiguation; for example, Mars can refer to either the planet or a chocolate bar, and the correct link is used depending on the context.

When we analyze one medical paper abstract with Text Analytics, however, some specific entities (for example, HCQ, which is short for hydroxychloroquine) are not recognized at all. Recently, a special version of the service called Text Analytics for Health [9] was released, which exposes a pre-trained PubMedBERT model with some additional capabilities and extracts considerably more entities from the same piece of text. Text Analytics is a REST service that can be called using the Text Analytics Python SDK.
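Below is a minimal sketch of such a call. It assumes the azure-ai-textanalytics Python package (version 5.1 or later, where healthcare analysis is exposed as begin_analyze_healthcare_entities); the endpoint, key, and sample document are placeholders, and field names may differ slightly between SDK versions:

from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

# Placeholder endpoint and key of the Cognitive Services resource
endpoint = "https://<your-resource>.cognitiveservices.azure.com/"
client = TextAnalyticsClient(endpoint, AzureKeyCredential("<your-key>"))

documents = ["The patient was treated with 400 mg of hydroxychloroquine."]

# Healthcare analysis is a long-running operation: submit it, then wait for the result
poller = client.begin_analyze_healthcare_entities(documents)
for doc in poller.result():
    if doc.is_error:
        continue
    for entity in doc.entities:
        # Each entity carries its category, confidence score, and ontology links
        print(entity.text, entity.category, entity.confidence_score)
    for relation in doc.entity_relations:
        print(relation.relation_type,
              [role.entity.text for role in relation.roles])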
In addition to the list of entities, we also get the following:

• Entity mapping of entities to standard medical ontologies, such as UMLS [10].
• Relations between entities inside the text, such as TimeOfCondition.
• Negation, which indicates that an entity was used in a negative context, for example "COVID-19 diagnosis did not occur".

In addition to using the Python SDK, we can also call Text Analytics through its REST API directly. This is useful if you are using a programming language that does not have a corresponding SDK, or if you prefer to receive the Text Analytics result in JSON format for further storage or processing. In Python, this can easily be done using the requests library.
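A sketch of such a call with requests is shown below. It assumes the v3.1 REST API, where health-entity extraction is an asynchronous job: the POST returns an operation-location header that is polled until the job completes. The endpoint, key, and document text are placeholders:

import time
import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com"
headers = {"Ocp-Apim-Subscription-Key": "<your-key>",
           "Content-Type": "application/json"}
body = {"documents": [{"id": "jk62qn0z", "language": "en",
                       "text": "..."}]}  # the abstract to analyze

# Submit the analysis job to the asynchronous health-entities endpoint
resp = requests.post(endpoint + "/text/analytics/v3.1/entities/health/jobs",
                     headers=headers, json=body)
resp.raise_for_status()
job_url = resp.headers["operation-location"]  # URL to poll for the result

# Poll until the job succeeds or fails
while True:
    result = requests.get(job_url, headers=headers).json()
    if result["status"] in ("succeeded", "failed"):
        break
    time.sleep(1)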
The resulting JSON will look like this:

{"id": "jk62qn0z",
 "entities": [
   {"offset": 24, "length": 28, "text": "coronavirus disease pandemic",
    "category": "Diagnosis", "confidenceScore": 0.98, "isNegated": false},
   {"offset": 54, "length": 8, "text": "COVID-19",
    "category": "Diagnosis", "confidenceScore": 1.0, "isNegated": false,
    "links": [
      {"dataSource": "UMLS", "id": "C5203670"},
      {"dataSource": "ICD10CM", "id": "U07.1"},
      ... ]},
   ... ],
 "relations": [
   {"relationType": "Abbreviation", "bidirectional": true,
    "source": "#/results/documents/2/entities/6",
    "target": "#/results/documents/2/entities/7"},
   ... ]
}

In production code, one may want to incorporate a mechanism that retries the operation when an error is returned by the service.

Since the dataset currently contains ~700K paper abstracts, processing them sequentially through Text Analytics would be quite time-consuming. To run this code in parallel, we can use technologies such as Azure Batch or Azure Machine Learning [11]. Both allow us to create a cluster of identical virtual machines and have the same code run in parallel on all of them.

Azure Machine Learning is a service intended to satisfy all the needs of a data scientist. It is typically used for training and deploying models and ML pipelines; however, we can also use it to run a parallel sweep job across a compute cluster. To do that, we submit a sweep_job experiment; our experiment is described in more detail in [13].

In our experiment, we created an Azure Machine Learning workspace with a cluster of 8 low-performance VMs. We then defined an environment to run the experiment on, based on the mcr.microsoft.com/azureml/base:intelmpi2018.3-ubuntu16.04 container image, with conda configuration files to install the required Python SDKs. Finally, we defined a sweep job to run on the cluster. The job starts the process.py Python script on each node of the cluster, passing the experiment number, the dataset, and the total number of nodes as command-line parameters. The processing logic encoded in the Python script is roughly the following:

import argparse
import pandas as pd
import azure.cosmos

## Process command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument("--data")              # path to metadata.csv
parser.add_argument("--number", type=int)  # number of this node
parser.add_argument("--nodes", type=int)   # total number of nodes
args = parser.parse_args()

df = pd.read_csv(args.data)  # get metadata.csv into a Pandas DataFrame

## Connect to the database
coscli = azure.cosmos.CosmosClient(cosmos_uri, credential=cosmoskey)
cosdb = coscli.get_database_client("CORD")
cospapers = cosdb.get_container_client("Papers")

## Process papers
for i, (idx, row) in enumerate(df.iterrows()):
    if i % args.nodes == args.number:  # process only this node's portion of records
        ## Process the record using a REST call (see the code above)
        ## Store the JSON result in the database
        cospapers.upsert_item(result_json)

For simplicity, we do not show the complete configuration files and script here; we refer you to the complete blog post [13], and to the project's GitHub repository.

Using the code above, we have obtained a collection of papers, each with a number of entities and corresponding relations. This structure is inherently hierarchical, so the best way to store and process it is a NoSQL approach to data storage. We use the Cosmos DB database to store and query semi-structured data like our JSON collection; the code above shows how JSON documents can be stored directly in a Cosmos DB database from our processing scripts running in parallel.

Using SQL, we can formulate some very specific queries. Suppose a medical specialist wants to find all proposed dosages of a specific medication (say, hydroxychloroquine) and see all papers that mention those dosages. This can be done using a query along the following lines.
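Here is a sketch of such a Cosmos DB SQL query. It assumes that the processing script stores, inside each paper document, a relations array in which the source and target entity text has been denormalized into each relation record; the exact property names depend on how the Text Analytics response is flattened before storage:

SELECT papers.title,
       r.source.text AS dosage,
       r.target.text AS medication
FROM papers
JOIN r IN papers.relations
WHERE r.relationType = "DosageOfMedication"
  AND CONTAINS(LOWER(r.target.text), "hydroxychloroquine")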
A more difficult task is to select all entities together with their corresponding ontology IDs. This would be extremely useful, because eventually we want to be able to refer to a specific entity (hydroxychloroquine) regardless of the way it was mentioned in the paper (for example, HCQ refers to the same medication). We use UMLS as our main ontology.

While the ability to use a SQL query to obtain an answer to a specific question, like medication dosages, is very useful, it is not convenient for non-IT professionals who do not have a high level of SQL mastery. To make the collection of metadata accessible to medical professionals, we can use the Power BI tool to create an interactive dashboard for entity/relation exploration. From this tool, we can make queries similar to the SQL query above to determine dosages of a specific medication: we select the DosageOfMedication relation type in the left table, and then filter the right table by the medication we want. It is also possible to create further drill-down tables to display the specific papers that mention the selected dosages, making this tool a useful research instrument for medical scientists.

The most interesting part of the story, however, is automatically drawing visual insights from the text, such as the change in medical treatment strategy over time. To do this, we need to write some more Python code for data analysis and visualization. The most convenient way to do that is to use the notebooks embedded into Cosmos DB. These notebooks support embedded SQL queries; thus, we can execute a SQL query and load the results into a Pandas DataFrame, which is the Python-native way to explore data. We end up with a meds DataFrame containing the names of medicines, together with the corresponding paper titles and publication dates. We can further group by ontology ID to get frequencies of mentions for different medications.

Another interesting insight is to observe which terms frequently occur together. To visualize such dependencies, there are two types of diagrams:

• A Sankey diagram allows us to investigate relations between two types of terms, e.g. diagnoses and treatments.
• A chord diagram helps to visualize co-occurrence of terms of the same type (e.g. which medications are mentioned together).

To plot both diagrams, we need to compute the co-occurrence matrix, which in row i and column j contains the number of co-occurrences of terms i and j in the same abstract (one can notice that this matrix is symmetric). For the visualization to be clearer, we select a relatively small number of terms from our ontology, grouping some terms together if needed. To plot the Sankey diagram, we use the Plotly graphics library. For visualizing co-occurrences of entities of the same type, e.g. different medications, we plot a chord diagram using a library called Chord, with the same function populating the co-occurrence matrix, passing the same ontology twice. A sketch of this computation is shown below.
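In this sketch, the term list and the per-abstract term sets (paper_terms) are hypothetical placeholders; in practice they would be built from the entity/ontology data queried out of Cosmos DB:

import numpy as np

# Hypothetical inputs: the terms of interest, and the set of
# ontology terms mentioned in each abstract
terms = ["hydroxychloroquine", "azithromycin", "chloroquine", "lopinavir"]
paper_terms = [{"hydroxychloroquine", "azithromycin"},
               {"chloroquine", "lopinavir"}]

def cooccurrence_matrix(row_terms, col_terms, papers):
    # m[i, j] = number of abstracts that mention both row_terms[i]
    # and col_terms[j]; symmetric when the same ontology is passed twice
    m = np.zeros((len(row_terms), len(col_terms)))
    for mentioned in papers:
        for i, ti in enumerate(row_terms):
            for j, tj in enumerate(col_terms):
                if ti != tj and ti in mentioned and tj in mentioned:
                    m[i, j] += 1
    return m

matrix = cooccurrence_matrix(terms, terms, paper_terms)

The resulting matrix can then be handed to Plotly for the Sankey diagram (with two different term lists) or to the Chord library (with the same list passed twice).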
Figure: Co-occurrence chord diagrams for treatments (left) and medications (right).

Note that in the diagram of medication co-occurrences we can clearly see some well-known combinations of medications, such as hydroxychloroquine + azithromycin, which were included in a standard treatment strategy [14]. We can also see that chloroquine and lopinavir are frequently mentioned together, but that does not necessarily mean they are used together (for a counter-example, see [15]). This demonstrates that deeper text analysis is needed to understand the nature of the co-occurrence of different terms in an abstract.

In this paper, we have described the architecture of a proof-of-concept system for knowledge extraction from large corpora of medical texts. We use Text Analytics for Health to perform the main task of extracting entities and relations from text, and then a number of cloud services together to build a query tool for medical scientists and to extract some visual insights. For further research, it would be interesting to switch to processing full-text articles in addition to abstracts, in which case we need to consider slightly different criteria for co-occurrence of terms (e.g. in the same paragraph vs. in the same paper).

The same approach can be applied in other scientific areas, but we would need to be prepared to train a custom neural network model to perform entity extraction: we might need both to fine-tune a BERT model on texts from the problem domain, and to train a NER model on top of the fine-tuned BERT feature extractor, which would require a relatively large dataset of labeled entities. We hope that the proposed approach will nevertheless be transferable to different problem domains, and that the codebase we provide can serve as a starting point for further research in the area of using natural language processing machinery to gain insights from large text corpora.

References

[1] Automated knowledge extraction from polymer literature using natural language processing.
[2] Using natural language processing to extract clinically useful information from Chinese electronic medical records.
[3] Extracting medications and associated adverse drug events using a natural language processing system combining knowledge base and deep learning.
[4] CORD-19: The COVID-19 Open Research Dataset.
[5] "Keras" for Natural Language Processing answers COVID Questions. TowardsDataScience.
[6] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[7] Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing.
[8] Text Analytics (Azure Cognitive Services).
[9] Text Analytics for Health (Azure Cognitive Services).
[10] The Unified Medical Language System (UMLS): integrating biomedical terminology.
[11] Azure Machine Learning.
[12] Performing large science experiments on Azure: Pitfalls and solutions.
[13] Analyzing COVID Medical Papers with Azure Machine Learning and Text Analytics for Health.
[14] Treatment with hydroxychloroquine, azithromycin, and combination in patients hospitalized with COVID-19.
[15] Efficacy of chloroquine versus lopinavir/ritonavir in mild/general COVID-19 infection: a prospective, open-label, multicenter, randomized controlled clinical study.