key: cord-1013015-cf558jr6
authors: Yang, Heyoung; Sohn, Eunsoo
title: Expanding Our Understanding of COVID-19 from Biomedical Literature Using Word Embedding
date: 2021-03-15
journal: Int J Environ Res Public Health
DOI: 10.3390/ijerph18063005
sha: fb252f2e46b00623a5c9b740f7eeb79898f68f20
doc_id: 1013015
cord_uid: cf558jr6

A better understanding of the clinical characteristics of coronavirus disease 2019 (COVID-19) is urgently required to address this health crisis. Numerous researchers and pharmaceutical companies are working on developing vaccines and treatments; however, a clear solution has yet to be found. The current study proposes the use of artificial intelligence methods to comprehend biomedical knowledge and infer the characteristics of COVID-19. A biomedical knowledge base was established via FastText, a word embedding technique, using PubMed literature from the past decade. Subsequently, a new knowledge base was created using recently published COVID-19 articles. Using this newly constructed knowledge base from the word embedding model, a list of anti-infective drugs and proteins of either human or coronavirus origin were inferred to be related, because they are located close to COVID-19 on the knowledge base. This study attempted to form a method to quickly infer related information about COVID-19 using the existing knowledge base, before sufficient knowledge about COVID-19 is accumulated. With COVID-19 not completely overcome, machine learning-based research in the PubMed literature will provide a broad guideline for researchers and pharmaceutical companies working on treatments for COVID-19.

Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was first identified in Wuhan, China, in December 2019 [1]. The rapid spread of COVID-19 has caused a severe health crisis worldwide and gravely impacted human life and society [2]. The urgent need to develop effective therapeutics and vaccines against COVID-19 is driving numerous clinical studies worldwide. Efforts by several scientists have led to the design of effective antiviral agents based on an understanding of SARS-CoV-2's [3, 4] viral genome structure and pathogenicity [5, 6], as well as the body's host response and its protein-protein interactions [7] [8] [9]. Currently, a few vaccines have been developed. Still, their safety remains in doubt and the supply is insufficient. A therapeutic agent showing a definite effect has not been developed [10].

In addition to clinical-based novel drug development studies, such as antibody therapeutics and plasma therapy, drug repurposing is receiving considerable attention as an alternative for developing COVID-19 treatments [11] [12] [13] . Several computational drug repurposing studies, including network-based or machine learning-based studies, were conducted to predict drug-target interactions by understanding or utilizing the structural properties of SARS-CoV-2, such as in silico docking and analysis [14] , network proximity analysis of drug targets and coronavirus-host interactions in the human interactome [15] , and therapeutic target-based virtual ligand screening [16] .

Bibliometrics has played a large role as a tool for knowledge discovery. Although traditional bibliometric techniques based on statistics and citation analysis are still widely used for measuring and visualizing the impact of knowledge from the scientific literature [17] , new techniques are being developed that have a better effect in inferring knowledge. With the confluence of recently advanced deep learning technologies, bibliometrics has been reborn as a new data mining technology with enhanced inferring ability to discover new knowledge from a latent knowledge base.

The knowledge graph, a graph-based machine-readable data structure, was originally developed to describe interactions between entities and has recently been used as a network-based knowledge discovery tool for understanding COVID-19 and finding a therapy for the disease [18] [19] [20] [21].

Most existing studies extract the structure of the knowledge contained in accumulated databases. Therefore, for their results to become accurate, a significant quantity of data has to be accumulated. In this study, we try to determine a way to infer the characteristics of COVID-19 using the biomedical knowledge base accumulated so far without waiting for further knowledge to be significantly accumulated.

Word embedding, a machine learning technique, can extract knowledge by processing text and keywords, or suggest new knowledge through relational reasoning and inference between keywords. This is possible because word embedding projects keywords onto a vector space and expresses them as vectors [22]. Inference and analogy between keywords, such as France − Paris = Korea − Seoul, or France − Paris + Seoul = Korea, therefore become mathematically possible. If we have information on France, Paris, and Seoul, it becomes possible to find Korea via word embedding. Because of these characteristics, word embedding is being studied in many fields. Word embedding is also widely used to understand biomedical entities [23].
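The analogy arithmetic described above can be illustrated with a minimal sketch. The two-dimensional vectors below are invented for illustration and do not come from any trained model; the helper names `cosine` and `analogy` are likewise hypothetical.

```python
import math

# Toy 2-D vectors, invented for illustration only (not from a trained model).
# Axis 0 loosely encodes "which country"; axis 1 encodes "country vs. capital".
vectors = {
    "France": (1.0, 1.0),
    "Paris": (1.0, 0.0),
    "Korea": (3.0, 1.0),
    "Seoul": (3.0, 0.0),
    "Tokyo": (5.0, 0.0),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


# France - Paris + Seoul should land nearest to Korea in this toy space.
result = analogy("France", "Paris", "Seoul", vectors)
print(result)
```

In a real embedding the vectors have hundreds of dimensions and the nearest neighbour is only approximately the expected word, so the result is usually reported as an accuracy over many such analogies.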

When COVID-19 was first discovered, little was known about it, but studies on similar viruses, such as other coronaviruses and RNA viruses, had already accumulated. Using this knowledge to infer the characteristics of COVID-19 may accelerate the discovery of solutions for COVID-19. In this study, we use word embedding and PubMed literature as the knowledge base (Figure 1). Over the past decades, a huge number of studies on viruses, drugs, proteins, and biological entities have accumulated in PubMed. We apply word embedding inference to the PubMed knowledge base to interpret the characteristics of COVID-19, even while knowledge of COVID-19 is still insufficient. For this, we strive to establish a knowledge base that fully represents the biomedical knowledge of the 2010s, that is, a balanced knowledge base that is not biased towards a specific area. Then, a modified knowledge base is built by adding a small initial collection of early COVID-19-related articles (the "new thing"). The knowledge base and the modified knowledge base are built into the pretrained model and the final model, respectively, through the word embedding technique. If the pretrained model expresses the knowledge base well, the modified knowledge base, which infers the relationship between the new term and pre-existing words, will be meaningful for understanding the characteristics of the new thing, i.e., COVID-19. To infer characteristics of COVID-19, we analyze the relationship between COVID-19 and two kinds of biomedical entities, namely drugs (chemicals) and proteins interacting with COVID-19. Because only limited studies on COVID-19 and SARS-CoV-2 have been reported, we attempt to enhance our understanding of the virus using the existing knowledge stock on coronaviruses based on a modified knowledge base. This study aims to examine the potential of drug repurposing by applying word embedding to the PubMed literature.
The relationship between COVID-19 and drugs as well as COVID-19 and proteins can then be deduced by the trained model. 

SARS-CoV-2 is a novel coronavirus, and the detrimental impact of the disease is too great to wait until enough research has been conducted to find a solution. Our aim is to determine information on SARS-CoV-2 using accumulated knowledge on known viruses, particularly coronaviruses, on PubMed, which is the largest and most updated literature database for research in the fields of biomedical and life sciences. The PubMed literature has a plethora of information on various subjects, which can be identified using the Medical Subject Headings (MeSH) and Substance Name (SN) of Unique Ingredient Identifiers (UNII), and the Chemical Abstracts Service (CAS) fields. MeSH and SN provide information sources that can be analyzed by extracting the subject keywords of publications. Using all the sentences included in the abstract of a given publication for the construction of the word embedding model, it would be possible to extract more keywords and relationships between keywords in text contents. However, if so, significant noise removal and keyword refinement are required, and this will take a long time. It is a matter of choice as to whether to secure a richer keyword dictionary or a refined keyword dictionary without noise, and we chose the latter for accurate inferring. To block data contamination fundamentally and to efficiently process and analyze data, we only used a controlled subject vocabulary from MeSH and SN. Regarding the PubMed literature pertaining to these two fields, we attempted to identify associations between COVID-19 and drugs as well as COVID-19 and proteins.

The analyzed dataset included 7,804,687 articles from PubMed published between 2010 and 2019; these articles were tagged with MeSH and SN terms. To infer the characteristics of COVID-19, all COVID-19-related articles published before 18 March 2020 were downloaded from PubMed. COVID-19-related articles that were not tagged with MeSH or SN terms were included using the Other Term (OT) field, which refers to the author keyword field. Unlike MeSH or SN terms, the OT category does not have a controlled vocabulary; thus, we further cleaned the terms. Keywords referring to COVID-19, such as "SARS-COV-19," "2019 Novel Coronavirus," and "Corona Virus disease 2019," were all combined as COVID-19. The rest of the Other Terms were also appropriately refined. A total of 539 COVID-19-related articles were included in the analysis using OT.

The MeSH vocabulary is organized into 16 top-level categories: (a) Anatomy; (b) Organisms; (c) Diseases; (d) Chemicals and Drugs; (e) Analytical, Diagnostic and Therapeutic Techniques, and Equipment; (f) Psychiatry and Psychology; (g) Phenomena and Processes; (h) Disciplines and Occupations; (i) Anthropology, Education, Sociology, and Social Phenomena; (j) Technology, Industry, and Agriculture; (k) Humanities; (l) Information Science; (m) Named Groups; (n) Health Care; (o) Publication Characteristics; and (p) Geographicals. Each article can have as few as one or as many as 40 or more MeSH tags. If two MeSH terms are tagged in the same article, the two MeSH terms are defined as being associated with one another. MeSH terms with a known pharmacological action are indexed as Pharmacological Action terms in the MeSH vocabulary system.

When an article on PubMed literature mentions substances registered in the Unique Ingredient Identifier (UNII) and the Chemical Abstracts Service (CAS), the substance name becomes tagged in the Registry Number/EC Number and Substance Name fields. Registry Number and EC Number are codes registered in UNII and CAS, respectively, whereas Substance Name refers to the identification of the substance. Each article may have more than 20 substances tagged. If two substances are tagged in the same article, they are assumed to be associated with one another. Substance Names sometimes overlap with MeSH, but this rarely occurs. Of note, protein names are listed as broad terms in the MeSH vocabulary system, whereas they are listed in detail, along with the source, such as human, mouse, rat, or virus, in the Substance Name system.

In the present study, a knowledge base was established using literature from PubMed. COVID-19-related articles were used to extract the relationships between COVID-19 and drugs and between COVID-19 and proteins. The MeSH and SN terms, which efficiently express the subject of the article with little noise, were used to structure the knowledge base. For each article, the MeSH and SN terms were merged to create the combined vocabulary. The word-embedding model, a machine-learning technique, was then generated using co-occurrence relation information. To build the final model from the COVID-19-related article set, the OT terms of the COVID-19-related articles were added to expand the vocabulary further.
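The merging of MeSH and SN terms per article, and the co-occurrence relation it implies, can be sketched as follows. The record layout (the `mesh` and `sn` keys) and the two sample records are assumptions made for illustration; they are not the authors' data format.

```python
from collections import Counter
from itertools import combinations

# Hypothetical article records; the keys "mesh" and "sn" are illustrative names.
articles = [
    {"mesh": ["Coronavirus Infections", "Pneumonia, Viral"], "sn": ["Antiviral Agents"]},
    {"mesh": ["Coronavirus Infections"], "sn": ["Antiviral Agents", "Chloroquine"]},
]


def combined_terms(article):
    """Merge MeSH and SN tags into one de-duplicated, order-preserving list."""
    return list(dict.fromkeys(article.get("mesh", []) + article.get("sn", [])))


def cooccurrence_counts(articles):
    """Count how often each unordered pair of terms is tagged in the same article."""
    counts = Counter()
    for article in articles:
        for a, b in combinations(sorted(combined_terms(article)), 2):
            counts[(a, b)] += 1
    return counts


# One "sentence" of terms per article: the input shape a word-embedding trainer expects.
corpus = [combined_terms(a) for a in articles]
pairs = cooccurrence_counts(articles)
print(pairs[("Antiviral Agents", "Coronavirus Infections")])
```

Treating each article's tag list as one sentence means every term is in the context of every other term of that article whenever the context window is at least as long as the tag list.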

To broaden our understanding of COVID-19 and to infer new information about this disease, a new knowledge base needs to be established using existing knowledge bases. This study aims to produce a word-embedding model using an already established knowledge base, and to create a new knowledge base that allows the effective comparison and inference of the relationship between newly added information and the existing information. Knowledge base refers to the stock of knowledge that has been accumulated by researchers over the years. As COVID-19 is a novel issue, we aimed to build a knowledge base using the PubMed literature from the past ten years. Using the word-embedding model, every term within the knowledge base can be expressed as a vector; consequently, the relationship between terms can be calculated by vector computation.

Word embedding converts the sparse matrix that expresses relationships among numerous keywords (where the number of dimensions equals the number of keywords) into a dense matrix with far fewer dimensions (e.g., 100-200 dimensions). This allows the expression of keyword characteristics as vectors. All keywords within the vocabulary are expressed as vectors with appropriate dimensions, enabling the analysis of relationships among keywords using vector algebra. In addition, keyword analogy becomes possible, allowing a more efficient display of keyword relationships.

Common word-embedding models include Word2Vec [24] and FastText [25] for word-level embedding, and BERT (bidirectional encoder representations from transformers) [26] for sentence-level embedding. In this study, word-level embedding was used to embed biomedical terminology tagged in articles, such as MeSH and SN terms, with minimal noise and without requiring natural language processing or named entity recognition for sentences. We also built our own pretrained and final models, so that an organic relationship could form between the two knowledge bases. Word2Vec and FastText employ very similar embedding methods; however, FastText was selected for this study because it has superior subword-level analysis and out-of-vocabulary capabilities. Moreover, FastText is available both through the package released by Facebook and through the Python-based Gensim library. For this study, the Gensim package for FastText was used.

FastText can use continuous-bag-of-words and skipgram models to infer relationships between words; in this study, the latter was used. MeSH and SN terms tagged in PubMed literature between 2010 and 2019 were used as data for FastText. The vocabulary consisted of 53,216 terms.

The three hyperparameters that have major impacts on the model characteristics in FastText (vector size, window size, and number of epochs) were tested for model optimization, whereas default values were used for other parameters. Vector size, which refers to the dimension of a word vector, was tested at 100, 150, and 200. Window size describes the size of the context window used in measuring word pair relationships when building the word-embedding model. Because the number of MeSH and SN terms per article can go beyond 60, window size was tested at 40, 50, and 60. The number of epochs was tested at 10, 15, and 20.
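A minimal sketch of this search grid is given below. It assumes the Gensim 4.x keyword names (`vector_size`, `window`, `epochs`, and `sg=1` for skip-gram); the function names `hyperparameter_grid` and `train_fasttext` are hypothetical, and the text does not state how the authors organized their runs.

```python
from itertools import product

# Settings reported in the text.
VECTOR_SIZES = (100, 150, 200)
WINDOW_SIZES = (40, 50, 60)
EPOCHS = (10, 15, 20)


def hyperparameter_grid():
    """Enumerate every combination of the three tested hyperparameters."""
    return [
        {"vector_size": v, "window": w, "epochs": e}
        for v, w, e in product(VECTOR_SIZES, WINDOW_SIZES, EPOCHS)
    ]


def train_fasttext(corpus, params):
    """Train one skip-gram FastText model on a list of term lists.

    Assumption: Gensim 4.x is installed. The import is done lazily so the
    grid itself can be inspected without Gensim. All other parameters keep
    their library defaults.
    """
    from gensim.models import FastText

    return FastText(sentences=corpus, sg=1, **params)


grid = hyperparameter_grid()
print(len(grid))
```

Each of the resulting configurations would be trained once and scored with the evaluation methods described next; the full grid is small enough to search exhaustively.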

As the FastText model in the present study was built through unsupervised training, the following evaluation methods were applied for the model optimization test. First, the evaluate_word_pairs method provided by the Gensim package for FastText was used to validate the plausibility of the medical term relations in the model. This method is similar to the one used by the National Center for Biotechnology Information (NCBI) of the US National Library of Medicine. According to [27], NCBI builds the word embedding model of PubMed and MeSH data using FastText; model evaluation is performed by measuring word pair similarity using Medical Resident Relatedness Set (UMNSRS) medical term pairs [28] from the University of Minnesota Pharmacy Informatics Lab. UMNSRS was developed by experts who manually evaluated the relatedness of 588 medical concept pairs. Out of these, the authors selected 145 pairs that were MeSH terms and used them for pretrained model evaluations. The evaluate_word_pairs method from Gensim calculates the Pearson correlation coefficient and the Spearman correlation coefficient between the similarities given by the FastText model and the ratings in the list of UMNSRS medical term pairs. The model by [27] at NCBI showed a similarity of 0.660 to UMNSRS medical term pairs. In this study, the similarity to UMNSRS medical term pairs had a Pearson correlation coefficient above 0.667 and a Spearman correlation coefficient above 0.663, as summarized in Table 1. Second, the country-capital pair list from Google's questions-words.txt, a widely used list for evaluating word embeddings, was also assessed, because these common words appear in the PubMed literature. This method utilizes the analogy between word vectors in the word-embedding model and measures the agreement accuracy of the country-capital analogy relationship. As summarized in Table 1, the accuracy was above 0.785.
Based on the two evaluations, the authors determined a vector size of 200, a window size of 50, and number of epochs of 10 as the optimal settings for the pretrained word-embedding model. A different model that exhibited higher accuracy in the second evaluation was considered; however, the results from the first evaluation were considered to be more relevant, as this is an embedding model for biomedical terms, and the Q-A accuracy of the model was found to be high (above 0.928). 
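The first evaluation reduces to correlating model similarities with human relatedness ratings. The sketch below implements the two coefficients in plain Python on invented numbers; it is not Gensim's implementation, and the `human` and `model` lists are placeholders rather than UMNSRS data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Placeholder data: human relatedness ratings vs. model cosine similarities.
human = [1.0, 2.0, 3.0, 4.0]
model = [0.1, 0.2, 0.4, 0.8]
print(round(pearson(human, model), 3), round(spearman(human, model), 3))
```

Spearman only rewards getting the ordering of pairs right, whereas Pearson also penalizes a non-linear relation between ratings and similarities, which is why both are usually reported together.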

The pretrained model was a word-embedding model using MeSH and SN terms from PubMed literature between 2010 and 2019. The final model was built by adding to the pretrained model the set of articles on COVID-19 published in 2020. As the number of COVID-19 articles tagged with MeSH and SN terms is not large, OT terms were used instead. The final model is a modified model, in which a new thing, namely COVID-19, was added to the pretrained model; the root of this model was the same as that of the pretrained model. Therefore, of the three hyperparameters from the pretrained model, vector size and window size were applied as fixed parameters in the final model. For the evaluation of the final model, only the number of epochs was used as a variable. Further, the final model evaluation requires a different method from the one used in the evaluation of the pretrained model, because the objectives of the two models are different. The pretrained model aims to build a knowledge base from the 2010s, whereas the final model aims to infer the characteristics of COVID-19. Word pair evaluation was applied to the pretrained model to structure the biomedical knowledge base using biomedical terms. In contrast, the final model needed to be evaluated on how accurately it predicts the characteristics of COVID-19 using the pretrained model. However, in the early stage of research on COVID-19, there were not many publications on COVID-19; hence, a model that overfits only a very small part of what humanity has learned about COVID-19 would not be adequate. One solid piece of basic knowledge about COVID-19 is that it is caused by an RNA virus. Therefore, we selected the most effective model based on the measured similarity of the COVID-19 term to RNA virus terms. As summarized in Table 2, the number of epochs ranged from 10 to 150, and the similarity between COVID-19 terms and RNA virus terms was measured.
As the number of epochs increases, the learning is repeated, building a word embedding model that well describes the data of COVID-19 added to the final model, but at some point overfitting may occur, which hinders inferring about COVID-19. Therefore, we have to determine the appropriate number of epochs according to the evaluation method and build a final model. The average similarity increases as the number of epochs increases, reaches a maximum value at 110, and then tends to saturate somewhat. The highest average similarity was found with 110 epochs. Therefore, the final model used 110 epochs, and its vocabulary ultimately consisted of 53,316 terms, owing to the addition of the OT terms extracted from the COVID-19 article set to the pretrained model's vocabulary. 
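The selection rule described above (choose the epoch count with the highest average similarity to RNA virus terms) can be sketched as follows. The similarity values are placeholders invented for illustration and are not the values in Table 2; `best_epoch` is a hypothetical helper name.

```python
from statistics import mean


def best_epoch(similarity_by_epoch):
    """Return the epoch count with the highest mean similarity, plus all means.

    `similarity_by_epoch` maps an epoch count to the list of similarities
    between the COVID-19 term and each reference (RNA virus) term.
    """
    averages = {epochs: mean(sims) for epochs, sims in similarity_by_epoch.items()}
    chosen = max(averages, key=averages.get)
    return chosen, averages


# Placeholder similarities (made up), shaped to rise and then flatten.
toy = {
    10: [0.20, 0.30],
    50: [0.40, 0.50],
    110: [0.60, 0.70],
    150: [0.58, 0.70],
}
chosen, averages = best_epoch(toy)
print(chosen)
```

Using an external fact (here, relatedness to RNA virus terms) as the selection criterion is a way to stop training before the model merely memorizes the small COVID-19 article set.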

The following COVID-19-related drugs and proteins were extracted from the final model. From the list of drugs available, the authors focused on anti-infective drugs. For MeSH terms, Pharmacological Actions keywords are provided along with the drugs. The authors selected the following Pharmacological Action drugs to filter for anti-infective drugs: Anti-Bacterial Agents; Antibiotics, Antifungal; Antibiotics, Antineoplastic; Antibiotics, Antitubercular; Anti-Infective Agents; Anti-Infective Agents, Local; Anti-Infective Agents, Urinary; Antimalarials; Antiprotozoal Agents; Antitubercular Agents; Anti-HIV Agents; Antiviral Agents; HIV Fusion Inhibitors; HIV Integrase Inhibitors; and HIV Protease Inhibitors. Using these terms, a total of 401 anti-infective drugs emerged. Within the final model, the similarity between anti-infective drugs and COVID-19 was measured to assess for any relationship. Table 3 lists the top 100 out of the 401 anti-infective drugs that were related to the COVID-19 vaccine or to the treatment drugs currently being developed. The drugs in Table 3 that are highlighted in gray represent those that showed low relevance to COVID-19, compared with the top 100 drugs; however, these are currently being studied as potential vaccines or treatments. Excelra [29] , the ReDO Project [30] , and DrugBank [31] summarize the drugs that are being repurposed as potential COVID-19 vaccines or treatments. The authors compared these three drug repurposing databases and the final model results from the current study, and the comparison results are listed in Table 3 , Reference column. Out of the 401 anti-infective drugs the authors selected, 64 drugs were identified to be in current development as COVID-19 vaccines or treatments. Based on the relevance to COVID-19, 33 repositioning candidate drugs were identified in the top 100 drugs. 
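The ranking step can be sketched as a cosine-similarity sort restricted to the filtered drug list. The vectors and term names below (`drug_a`, `protein_x`, and so on) are invented placeholders; with a trained Gensim model the per-pair score could instead be obtained from `model.wv.similarity`, which is assumed here rather than confirmed by the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_by_similarity(query, candidates, vectors, top_n=100):
    """Rank candidate terms by cosine similarity to the query term."""
    q = vectors[query]
    scored = [(term, cosine(q, vectors[term])) for term in candidates if term in vectors]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy embedding, invented for illustration.
vectors = {
    "COVID-19": (1.0, 0.0),
    "drug_a": (0.9, 0.1),
    "drug_b": (0.0, 1.0),
    "drug_c": (0.5, 0.5),
    "protein_x": (1.0, 0.0),  # not a drug, so it is never scored
}
# Stand-in for the terms that passed the Pharmacological Action filter.
anti_infectives = ["drug_a", "drug_b", "drug_c"]

top = rank_by_similarity("COVID-19", anti_infectives, vectors, top_n=2)
print([term for term, _ in top])
```

Filtering before ranking keeps unrelated but nearby terms, such as proteins, out of the drug list; the same function can be reused for the protein ranking by passing a different candidate list.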
The imipenem and cilastatin drug combination (under the brand name Primaxin), which revealed the highest similarity, is a treatment for severe infections affecting the heart, lungs, bladder, kidney, skin, blood, bones, stomach, and the female reproductive organs. With the spread of COVID-19, the U.S. FDA approved the antibiotic combination of imipenem-cilastatin and relebactam (Recarbrio) for the treatment of hospital-acquired bacterial pneumonia and ventilator-associated bacterial pneumonia. Oseltamivir and chloroquine, the two drugs that were most frequently mentioned in the media in the first half of 2020, also showed a very high relevance to COVID-19. The amoxicillin and clavulanate potassium combination, more commonly known under the trade name Augmentin, is an antibiotic that is widely used for sinusitis, bronchitis, pneumonia, ear infections, and urinary tract and skin infections. Currently, clinical trials utilizing amoxicillin/clavulanate alone or azithromycin in combination with amoxicillin/clavulanate are ongoing. The trimethoprim-sulfamethoxazole drug combination (Bactrim), which has excellent antibacterial activity against gram-negative bacteria and staphylococcus, is also an antibiotic used for the treatment of ear infections, urinary tract infections, bronchitis, traveler's diarrhea, shigellosis, and Pneumocystis jirovecii pneumonia. The drug is currently in clinical trials for its use with Anakinra, an IL-1 receptor antagonist indicated for the treatment of COVID-19-induced hyperimmune respiratory failure (also known as cytokine storm). Most of the potential drugs with the highest relevance (top 100) to COVID-19 were drugs for bacterial infections (antibiotics). Several drugs for viral infections were also on the list. Various anti-retrovirals (used in HIV/AIDS) and anti-malarial drugs were also shown to have high relevance to COVID-19.

To indirectly confirm the robustness of our final model, we compared the drug list of 10 models with different numbers of epochs. The top relevance drug list barely changed, and only the bottom relevance (about 10%) drug list showed small changes, indicating that our final model is robust and the list of potential drugs with the highest relevance (top 100) is a stable result.
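One simple way to quantify this kind of stability is the overlap between the top-k lists of two models. The text does not say which measure the authors used, so the function below is only an assumed illustration, and the drug identifiers are placeholders.

```python
def topk_overlap(ranking_a, ranking_b, k):
    """Fraction of the top-k items of ranking_a that are also in the top-k of ranking_b.

    Assumes both rankings contain at least k items.
    """
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(top_a & top_b) / k


# Placeholder rankings from two hypothetical models trained with different epochs.
reference = ["d1", "d2", "d3", "d4", "d5"]
other = ["d2", "d1", "d3", "d5", "d4"]

print(topk_overlap(reference, other, 3), topk_overlap(reference, other, 4))
```

An overlap close to 1 for the chosen k across all compared models would support the claim that the top of the list does not depend strongly on the number of epochs.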

Using a similar method, protein terms with high relevance to COVID-19 were extracted from the final model. Only the 5366 proteins that are of either human or coronavirus origin were extracted, and their relevance to COVID-19 was then analyzed. Table 4 lists the top 100 proteins relevant to COVID-19. The proteins highlighted in gray in Table 4 indicate those that showed low relevance to COVID-19 but are known to be human proteins that interact with COVID-19. Information on the human proteins that are known to interact with COVID-19 and on known proteins of COVID-19 can be found in [32] and in the study by [7]. The protein description, gene name, and COVID-19 bait columns in Table 4 also list the COVID-19 interacting proteins. In particular, COVID-19 viral proteins were identified as proteins with high relevance to COVID-19 in the final model, along with angiotensin converting enzyme 2, which is known as the COVID-19 entry receptor. Among the top 100 highly relevant proteins, the following were identified: six SARS-CoV-2 viral proteins listed in The Human Protein Atlas (M protein, coronavirus; nsp1 protein, SARS coronavirus; nsp14 protein, SARS coronavirus; 3C-like proteinase, coronavirus; nonstructural protein 3, SARS coronavirus; Nsp16 protein, SARS virus) and three human proteins (angiotensin converting enzyme 2; NARS2 protein, human; ALG8 protein, human). The drugs and proteins listed in Tables 3 and 4 are the COVID-19-related terms extracted from the word embedding model based on PubMed MeSH and SN terms. When comparing these results with the latest references reflective of current research trends, some were consistent, while others highlighted information not being investigated in the current research.

This research aimed to understand the characteristics of COVID-19, which is a novel disease that humanity is currently facing, using the PubMed database, a knowledge base that has been established over a long duration. To accomplish this, information from PubMed literature pertaining to coronaviruses from the past decade was structured in a word embedding model, and subsequently, the relationships between COVID-19 terms and other biomedical terms were inferred. With the result of this study, proteins and drugs with high relevance to COVID-19 were deduced.

The word embedding technique used in this study upgrades the field of knowledge discovery from the biomedical literature, previously dealt with in bibliometrics, enabling inference on the demand for knowledge with many uncertainties, such as that on COVID-19. This helps to understand and discover new knowledge. The vector calculation and mathematical modeling techniques of word embedding can play a role in advancing drug development, which is time-consuming and costly, by adding inferencing capabilities to the insufficient medical literature knowledge.

The result of this study is highly comparable to the biomedical demands of research and development efforts to overcome the COVID-19 crisis. We expect that this list of drugs and proteins, and their relevance to COVID-19, will help in identifying potential vaccine or treatment candidates. This word embedding research model also provided an in-silico drug design method for drug repurposing that can drastically reduce the time and cost of drug development. With the urgent need for identifying drug candidates for COVID-19, various data, tools and methods for drug repurposing are being introduced and analyzed. The results of this study also provide a computational method to predict potential drug-target interactions (DTIs).

This study has three limitations. First, it only used MeSH and SN terms for word embedding, which has both advantages and limitations. The advantage is that these terms are controlled vocabularies, and only technical terms were used to establish the model, which virtually eliminates all noise. However, this might have excluded new terms that exist in plain texts. If plain texts such as abstracts were included, natural language processing and named entity recognition would be required. In this case, the BERT model can be considered. Second, for drug repositioning, a broader range of pharmacological actions beyond anti-infective drugs should have been considered. Recently, there have been cases of drugs being used for an entirely different indication; for example, anti-tumor drugs and anti-parasitic drugs are also being studied as potential COVID-19 treatments. As this study aims to expand our knowledge of COVID-19, it may also be necessary to examine relevance to COVID-19 more broadly. Third, adding more databases beyond PubMed can provide more information. In particular, adding clinical trials databases could be helpful in enriching the information by including data on the latest commercial drugs.

Follow-up research is needed to overcome these limitations. Future research should include the entire list of drug substance terms, not only anti-infective drugs, in order to produce helpful results for drug repositioning for COVID-19. This is because, as in past cases in which new indications were added for drugs with completely different indications, it is not possible to rule out that a drug that appears to be irrelevant will emerge as a therapeutic candidate for COVID-19. Furthermore, a word embedding model using clinical trial databases, in addition to PubMed literature, needs to be established. With the addition of pharmacokinetic prediction, the list of potential vaccine or treatment candidates could become more meaningful and more useful.

Studies to understand the interaction between drugs and proteins by applying a clustering technique to the drug list and protein list related to COVID-19, or studies applying the BERT model, are also meaningful as follow-up studies. If we approach the pandemic from the perspective of an X-event like a major accident [33] , machine learning-based modeling studies of complex systems for the spread of infectious disease will also help broaden our understanding of COVID-19 and new infectious diseases caused by a novel virus [34] . These efforts will contribute to availing more accurate information pertaining to COVID-19 rapidly, which will help overcome new pandemics. 

The authors declare no conflict of interest.

A novel coronavirus from patients with pneumonia in China

WHO Coronavirus Disease (COVID-19) Dashboard. Available online

Molecular investigation of SARS-CoV-2 proteins and their interactions with antiviral drugs

Drug targets for corona virus: A systematic review

Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2): An overview of viral structure and host response

Emerging coronaviruses: Genome structure, replication, and pathogenesis

A SARS-CoV-2 protein interaction map reveals targets for drug repurposing

Understanding human-virus protein-protein interactions using a human protein complex-based analysis framework

The proteins of severe acute respiratory syndrome coronavirus-2 (SARS CoV-2 or n-COV19), the cause of COVID-19

A review of SARS-CoV-2 and the ongoing clinical trials

Repurposing antivirals as potential treatments for SARS-CoV-2: From SARS to COVID-19

Rapid repurposing of drugs for COVID-19

Drug repositioning is an alternative for the treatment of coronavirus COVID-19

Potential covalent drugs targeting the main protease of the SARS-CoV-2 coronavirus

Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2

Analysis of therapeutic targets for SARS-CoV-2 and discovery of potential drugs by computational methods

Network graph representation of COVID-19 scientific publications to aid knowledge discovery

COVID-19 Knowledge Graph: A computable, multi-modal, cause-and-effect knowledge model of COVID-19 pathophysiology

Coronavirus knowledge graph: A case study. arXiv 2020

KG-COVID-19: A framework to produce customized knowledge graphs for COVID-19 response

COVID-19 literature knowledge graph construction and drug repurposing report generation

Speech Language Processing

Deep learning with word embeddings improves biomedical named entity recognition

Distributed representations of words and phrases and their compositionality. arXiv 2013

Advances in pre-training distributed word representations. arXiv 2017

Pre-training of deep bidirectional transformers for language understanding. arXiv 2018

Semantic similarity and relatedness between clinical terms: An experimental study

COVID-19 Drug Repurposing Database

ReDO Project, Covid19_DB. Available online

Covid-19 Information

The Human Protein Atlas. Available online

Modelling a Safety Management System Using System Dynamics at the Bhopal Incident

Causal thinking and complex system approaches in epidemiology