key: cord-343072-3wuh6k6g
authors: Dong, Mengying; Cao, Xiaojun; Liang, Mingbiao; Li, Lijuan; Liu, Guangjian; Liang, Huiying
title: Understand Research Hotspots Surrounding COVID-19 and Other Coronavirus Infections Using Topic Modeling
date: 2020-03-30
journal: nan
DOI: 10.1101/2020.03.26.20044164
sha: 
doc_id: 343072
cord_uid: 3wuh6k6g

Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a virus that causes severe respiratory illness in humans, which eventually results in the current outbreak of novel coronavirus disease (COVID-19) around the world. The research community is interested to know what are the hotspots in coronavirus (CoV) research and how much is known about COVID-19. This study aimed to evaluate the characteristics of publications involving coronaviruses as well as COVID-19 by using a topic modeling analysis. Methods: We extracted all abstracts and retained the most informative words from the COVID-19 Open Research Dataset, which contains all the 35,092 pieces of coronavirus related literature published up to March 20, 2020. Using Latent Dirichlet Allocation modeling, we trained an eight-topic model from the corpus. We then analyzed the semantic relationships between topics and compared the topic distribution between COVID-19 and other CoV infections. Results: Eight topics emerged overall: clinical characterization, pathogenesis research, therapeutics research, epidemiological study, virus transmission, vaccines research, virus diagnostics, and viral genomics. It was observed that COVID-19 research puts more emphasis on clinical characterization, epidemiological study, and virus transmission at present. In contrast, topics about diagnostics, therapeutics, vaccines, genomics and pathogenesis only accounted for less than 10% or even 4% of all the COVID-19 publications, much lower than those of other CoV infections. Conclusions: These results identified knowledge gaps in the area of COVID-19 and offered directions for future research. Keywords: COVID-19, coronavirus, topic modeling, hotspots, text mining

Coronaviruses (CoVs) are enveloped, positive single-stranded large RNA viruses that cause respiratory and intestinal infections in animals and humans (1) . To date, there have seven human coronaviruses (HCoVs) been identified. The alpha-CoVs HCoVs-NL63 and HCoVs-229E and the beta-CoVs HCoVs-OC43, HCoVs-HKU1 only cause mild respiratory illness (2, 3) . However, in the past two decades, two novel coronaviruses, severe acute respiratory syndrome CoV (SARS-CoV) and Middle East respiratory syndrome CoV (MERS-CoV), caused severe human diseases in local countries and regions, representing fatality rates of around 10% and 35%, respectively (4) (5) (6) . Now, the seventh HCoV, SARS-CoV-2 (formerly 2019-nCoV), is causing an outbreak of coronavirus disease all over the world.

Compared to the earlier epidemics by CoVs, COVID-19 is highly contagious and has been reported with more than 1,000,000 confirmed cases as on April 4, 2020 (7) . Given the spread of the new CoV and its impacts on human health, governments are facing increasing pressure to stop the COVID-19 pandemic. The White House and a coalition of leading research groups have prepared a COVID-19 Open Research Dataset (CORD- 19) and issued a call to action to the world's artificial intelligence experts to support the ongoing fight against this infectious disease.

The research community has also responded rapidly and substantial literature about this epidemic has emerged.

Understanding of COVID-19 is evolving rapidly. It is essential to understand the emerging scientific knowledge to coordinate COVID-19 research globally. Recently, the WHO has published a Global Research Roadmap with immediate, mid-term and longer-term priorities to enable the implementation of priority research (8) . Bonilla-Aldana DK et al. (9) and Md Mahbub Hossain MBBS (10) have performed bibliometric analysis to evaluate the scientific literature on coronavirus infections as well as COVID-19, basing on indicators such as the number of articles, the productivity of authors, geographic distribution of articles and prominent keywords. However, to the best of our knowledge, there is still no quantitative thematic analysis available to date that focuses on CoVs.

Addressing this knowledge gap requires not only elaboration at the text level but also an understanding of semantic structures. Top modeling is the most notable approach for analyzing . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.03.26.20044164 doi: medRxiv preprint 4 text to identify semantic content (11) . Its objective is to find latent semantic topics within vast and unstructured collections of literature. Among several topic modeling algorithms, Latent Dirichlet Allocation (LDA) is the most fundamental one and has been shown to be effective at discovering semantic structures and distinct topics from a corpus (12) (13) (14) (15) (16) (17) (18) . This kind of data-driven analysis can generally shift the information overload burden from researchers to the computer via algorithmic and statistical methods.

The purpose of this work was to conduct LDA modeling for semantic and quantitative evaluations of the current status of literature on CoV infections as well as COVID-19, identify broad research topics and how these topics interact with one another. More importantly, this work can benefit COVID-19 research coordination by recognizing high priority scientific topics. We found that topics of clinical characterization, epidemiology, and virus transmission are hotspots for COVID-19 at present, while research on pathogenesis, therapeutics, virus diagnostics, vaccines and viral genomics are urgently needed.

Data used in this study were obtained from CORD-19, which was prepared by the White House and a coalition of leading research groups in response to the COVID-19 pandemic (19) . contains all research about COVID-19, SARS-CoV-2, and other CoVs (e.g. SARS, MERS, etc.) up to March 20, 2020, including over 44,000 scholarly articles, from the following sources: 1)

PubMed's PMC open access corpus; 2) a corpus maintained by the WHO; and 3) bioRxiv and medRxiv pre-prints.

Abstracts were firstly exported from CORD-19 and merged into a local repository for further processing. Next, 8,420 records with an empty or non-English abstract (i.e., not provided in the database) were removed from the corpus. Then, 708 duplicates were removed based on the title (stripped of all special characters and spaces) or the digital object identifier (DOI, if available).

The final corpus contained 35,092 abstracts of all coronaviruses-related articles. We further retrieved COVID-19 data by administering the query, "COVID" OR "COVID-19" OR "2019-nCoV" OR "SARS-CoV-2" OR "Novel coronavirus", and limiting the publication time to 2020. In . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.03.26.20044164 doi: medRxiv preprint 5 total, 1,482 articles were identified as COVID-19-related research.

This study is a retrospective analysis of accepted and publicly disclosed research, which does not meet the definition of human research under the US Department of Health and Human Services protection of human subject regulations. Data were not deidentified and are publicly available. There was no intervention, interaction with researchers, or use of private information to perform the analyses in this study.

We applied a standard approach for tokenization, lemmatization and part-of-speech tagging to preprocess the biomedical texts. We also removed the following words to get the clean corpus: 1) punctuation; 2) digits; 3) special characters; 4) words that appeared fewer than 5 times in the corpus; 5) useless words from text collection, including coordinating conjunctions, cardinal numbers, prepositions, pronouns and so on; 6) common terms in scientific articles, e.g., 'method'

and 'introduction', as well as 'virus', which is thought to appear in most abstracts.

We built a topic model using LDA (20, 21) from the article abstracts within the corpus. LDA assumes individual document as random mixtures over latent topics, with topics in turn being probability distributions over multiple vocabularies, and topics being uncorrelated. The Gibbs sampling algorithm is used to estimate the topic distribution parameters of the document and the multi-distribution of the vocabulary on each topic. The evaluation index in the statistical language model, coherence score, was used to find the optimal number of topics in the corpus. The coherence score is calculated by the co-occurrence frequency of the words in the sliding window, which increases with the increase of sentence similarity, meaning the higher, the better. Although the coherence score can help to decide the most appropriate topic number, such number sometimes can be still large for manually understanding. Under this circumstance, manual examination should be performed to evaluate models with different numbers of topics with the aid of expert knowledge to thoroughly understand the corpus. For each model, 15 top words per topic were examined to assess scientific coherence of the words as a set, overlap in topic words across topics, and human understandability. The selected model was used for all subsequent analyses.

LDA modeling is performed and visualized on the entire corpus via the Python language.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint . https://doi.org/10.1101/2020.03.26.20044164 doi: medRxiv preprint

Before profiling topics in the corpus, a co-occurrence map of the 50 most frequent words in the abstracts were generated to describe the research hotspots of coronaviruses from a macro-level viewpoint (Fig. 1) . The top ten co-occurring words were 'infection' (n = 49753), 'cell' (n = 36689), 'protein' (n = 26593), 'diseases' (n = 19643), 'patient' (n = 19307), 'human' (n = 13773), 'respiratory' (n = 12572), 'response' (n = 12473), 'gene' (n = 11138), and 'identify' (n = 11100). It is obvious that most of these words take 'infection' as the central node, depicting the virus attack event from clinical, macroscopic and microscopic viewpoints. Certainly, such map treats each word as a topic and gets the relationships between words simply basing on co-occurrence statistics.

Its granularity is too small to capture the semantic information of CoV research. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint . https://doi.org/10.1101/2020.03.26.20044164 doi: medRxiv preprint 7 model. Increasing the number of topics would make each individual topic more specific and might increase overlap between topics. Decreasing the number of topics would result in topics to be more high-level abstract. We assigned a potential theme to each topic by manual examinations based on semantics analysis of representative words in each topic. Table 1 shows the 15 most frequent words for each of the eight topics. In most cases, topics were easily recognizable representing specific subjects about the viruses, or the disease, or the public health and so on. The first most dominant topic was enriched for the clinical characterization, with words such as 'infection', 'cause', 'disease', 'severe', 'respiratory', 'acute', 'child' and 'symptom'. Representative words of topic 2 include 'cell', 'protein', 'expression', 'bind', 'replication', 'activity' and 'membrane', which usually are . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. 

It's common that one article discusses more than one topic, i.e., the topics are not isolated but semantic-related. To identify the semantic relationships between topics, we assigned each article two topics with the highest probabilistic proportion, measured the relationships between two topics by using their co-occurrence statistics, and showed them vividly in a direct chord diagram.

As shown in Fig. 3 , the topic 'clinical characterization' has strong relationships with all the other . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint . https://doi.org/10.1101/2020.03.26.20044164 doi: medRxiv preprint number of relationships with that same co-occurrence as the dominant and second topics.

To evaluate the current state of knowledge on COVID-19, we compared the topic distributions of 1,482 COVID-19 related articles with that of 33,610 other CoVs infections related articles. Fig. 4 shows 

Here we applied LDA to the abstracts of 35,092 CoVs-related research articles up to March 20, 2020 . Results of this exploration present a comprehensive overview and an intellectual structure of the coronavirus research. We identified eight major research topics, analyzed their semantic relationships, and compared the topic distributions between the evolving COVID-19 research and . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Normally, for viruses that broke out earlier, certain topics can be attractive and widely discussed.

In the case of topics with high proportions, we consider that they have been elaborated and specified by research community, such as the topics 'clinical characterization', 'pathogenesis research', and 'therapeutics research' in other CoVs research. While as for the new virus SARS-CoV-2, the research topic distribution is in a state of dynamic change. Therefore, comparison between the 'old' and 'new' viruses could clearly highlight the knowledge gaps between them.

This work suggests that research needs to be focused on pathogenesis, therapeutics, virus diagnostics, vaccines and viral genomics for COVID-19, which could then necessitate nationallevel support for ongoing research, if required. Pathogenesis research is usually important but time-consuming. We believe such studies will be available in the future. The more important and urgent topic is therapeutics research, which accounts for only 9.68% in COVID-19 research, much lower than 18.88% in other CoV infections research. With the number of infection cases increasing rapidly, therapeutics research will be more and more urgently needed. In addition, virus diagnostics, vaccines and viral genomics show small proportions in all CoV infections research, but even smaller in the field of COVID-19. This shows how little we know about CoVs. It is time . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint . https://doi.org /10.1101 /10. /2020 to translate research findings into effective measures such as a diagnostics, vaccine or effective therapeutic options, as with other priority diseases (22) .

Similar to the study of Md Mahbub Hossain MBBS (10), we did not identify any topic about social, economic, political, psychological, or cultural consequences of COVID-19 as well as other CoV infections. The WHO roadmap has noticed this and taken 'social sciences in the outbreak response' as a priority. The social science research community should act to assess the impact of a major public health crisis like COVID-19 on the complex network of modern societies (23, 24) . This present work has several limitations that must be acknowledged to apprehend the findings. First, as a computer model, topic modeling has difficulties with understanding nuances and subtext. We identified eight major topics with frequent appearances, whereas some specific topics such as virus natural history and human-animal interface were buried due to low proportions. Future work will attempt to develop an approach to detect and visualize the conceptual sub-domains. Second, it is also necessary to be aware that the granularity of terms used to label topics may vary a little for different topics. We have tried our best to keep them at the same level through carefully curating the top words in each topic, as well as reviewing text intention of the corresponding publications. Third, the LDA model only provides a cross-sectional profile of the coronavirus research, which is methodologically different from academic disciplines . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint . https://doi.org/10.1101/2020.03.26.20044164 doi: medRxiv preprint 13 such as evidence-based medicine and clinical assessment that synthesizing evidence centered on a specific question. But this study may facilitate these disciplines by informing what is already available and what is urgently required for the COVID-19 research. Future research could also study the occurrence of topics over time and analyze links to historic events and virus characteristics, in order to better understand possible temporal patterns.

Notwithstanding its limitations, this work is the first to thoroughly assess research output of coronavirus research in statistical perspective. The present study provided machine-extracted research topics in various contexts based on massive number of articles over a long period. Such a bottom-up approach can complement the existing manual content analysis techniques. It can also help to find the latent semantic relationships between topics and enrich the research landscape.

In conclusion, this study specifically shows a holistic picture of the current research in response to the outbreak of COVID-19. We conducted a topic-based bibliometric study to address two questions "What is the CoVs research interested in?" and "Where are we with our research about COVID-19?". An LDA-based model was used to identify the hotspots and their difference between COVID-19 and other CoV infections in a quantitative way. Generally, eight main research areas are identified, i.e., clinical characterization, pathogenesis research, therapeutics research, epidemiological study, virus transmission, vaccines research, virus diagnostics, and viral genomics. The results also indicate that studies in diagnostics, therapeutics, vaccines, viral genomics, and pathogenesis are urgently needed to minimize the impact of the COVID-19 outbreak. As COVID-19 can be considered a recent emerged disease, we characterize a 'snapshot' of this field at an early stage in its development. Our findings can potentially aid governments and the research community in understanding recent development of the research field, optimizing research topic decision, and monitoring new scientific or technological activities. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.03.26.20044164 doi: medRxiv preprint

Advances in virus research

Coronaviruses -drug discovery and therapeutic options

Origin and evolution of pathogenic coronaviruses

Severe acute respiratory syndrome coronavirus as an agent of emerging and reemerging infection

Middle East respiratory syndrome coronavirus: another zoonotic betacoronavirus causing SARS-like disease

Molecular pathology of emerging coronavirus infections

MERS-CoV and now the 2019-novel CoV: Have we investigated enough about coronaviruses? -A bibliometric analysis

Current Status of Global Research on Novel Coronavirus Disease (COVID-19): A Bibliometric Analysis and Knowledge Mapping

Engineering E. Topic discovery and evolution in scientific literature based on content and citations

International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not peer-reviewed) The copyright holder for this preprint

A hotspots analysis-relation discovery representation model for revealing diabetes mellitus and obesity

Years of Sepsis: Using Topic Modeling to Understand Historical Themes Surrounding Sepsis

Analyzing research trends in personal information privacy using topic modeling

Detecting and predicting the topic change of Knowledge-based Systems: A topic-based bibliometric analysis from 1991 to

Measuring author research relatedness: A comparison of word-based, topic-based, and author cocitation approaches

Clustering scientific documents with topic modeling

CORD-19). 2020. Version 2020-03-20

Distance metric learning with application to clustering with side-information. Advances in neural information processing systems

Latent dirichlet allocation

Severe fever with thrombocytopenia syndrome-A bibliometric analysis of an emerging priority disease

Virus will have huge socioeconomic impact in China. (oxan-db)

Costly lessons from the 2015 Middle East respiratory syndrome coronavirus outbreak in Korea

The authors declare that they have no competing interests. It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.(which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.03.26.20044164 doi: medRxiv preprint