key: cord-0162339-o2vbgzji
authors: Grotheer, Rachel; Huang, Yihuan; Li, Pengyu; Rebrova, Elizaveta; Needell, Deanna; Huang, Longxiu; Kryshchenko, Alona; Li, Xia; Ha, Kyung; Kryshchenko, Oleksandr
title: COVID-19 Literature Topic-Based Search via Hierarchical NMF
date: 2020-09-07
journal: nan
DOI: nan
sha: 4010bdde857d0c902801c9338f74c8343546bcad
doc_id: 162339
cord_uid: o2vbgzji

A dataset of COVID-19-related scientific literature is compiled, combining the articles from several online libraries and selecting those with open access and full text available. Then, hierarchical nonnegative matrix factorization is used to organize literature related to the novel coronavirus into a tree structure that allows researchers to search for relevant literature based on detected topics. We discover eight major latent topics and 52 granular subtopics in the body of literature, related to vaccines, genetic structure and modeling of the disease and patient studies, as well as related diseases and virology. In order that our tool may help current researchers, an interactive website is created that organizes available literature using this hierarchical structure.

The appearance of the novel SARS-CoV-2 virus on the global scale has generated demand for rapid research into the virus and the disease it causes, COVID-19. However, the literature about coronaviruses such as SARS-CoV-2 is vast and difficult to sift through. This paper describes an attempt to organize existing literature on coronaviruses, other pandemics, and early research on the current COVID-19 outbreak in response to the call to action issued by the White House Office of Science and Technology Policy (Science and Policy, 2020) and posted on the Semantic Scholar (Scholar, 2020) and Kaggle (2020) websites. The original dataset posted on that site is augmented by adding articles drawn from other databases in order to make the final interactive organizational structure more robust for researchers.

Our primary goal is to create a framework for a topic-based search of papers within this dataset that is helpful to those investigating the novel coronavirus, SARS-CoV-2, and the global COVID-19 pandemic. In order to discover the latent topics present in the collection of scholarly articles, as well as to organize them into a hierarchical tree structure that allows for an interactive search, we use a modified hierarchical nonnegative matrix factorization (HNMF) approach. A website that allows users to walk through the topic tree based on the top keywords associated with each topic is created using this hierarchical organization of the papers.

Our methods help make sense of a vast and rapidly growing body of COVID-19 related literature. The main contributions of this paper are as follows:

• A diverse dataset of COVID-19 related scientific literature is compiled, consisting of articles with full-text available drawn from several online collections.

• A tree-like soft cluster structure of all the papers in the dataset is created, based on the inherent relation between their topics, using hierarchical NMF.

• The best number of topics for each layer is defined as the number that produces the most consistent clustering of the dataset under random initializations of the NMF algorithm. A variance analysis method is used to identify the best number of topics on each layer.

• The effectiveness of the method is measured by exploring the coherence of each topic and dissimilarity between the topics.

• The discovered topics and distribution of articles into each of the topics are discussed, revealing the major areas of interest and research in the early months of the pandemic, as well as how existing epidemic literature can be effectively organized to allow efficient comparison to COVID-19 related research.

• The theoretical results are complemented with an interactive website.

Some relevant works that motivate our approach are briefly reviewed. NMF was first proposed for document clustering by Xu et al. (2003), and since then many variants of the NMF algorithm have been proposed and applied to help organize various types of data (Lee and Seung, 1999; Buciu, 2008; Kuang et al., 2015). In particular, there exist several recent papers that use NMF to find a hierarchy of topics in a set of documents. For example, Kuang and Park (2013) apply a rank-2 NMF to the recursive splitting of a text corpus and also provide an efficient on-the-fly stopping criterion. Gao et al. (2019) discuss a different version of HNMF, in which the hierarchy of topics is generated by aggregation of the topics (rather than splitting). The first application of NMF produces the initial set of the most refined topics, and the subsequent NMF iterations find supertopics in which the previous set of topics can be summarized. This approach is referred to as a bottom-to-top viewpoint, and the former as top-to-bottom. Approaches that utilize tools from neural networks such as back propagation to improve the topic representations have also been developed recently (Trigeorgis et al., 2016; Le Roux et al., 2015; Sun et al., 2017; Gao et al., 2019). Tu et al. (2018) propose a hierarchical online non-negative matrix factorization method (HONMF) to generate topic hierarchies from data streams. The proposed method can dynamically adjust the topic hierarchy to adapt to the emerging, evolving and fading process of the topics. This work most closely aligns with what we present here, and although we do not consider the online setting, our method can easily be adapted to it.

Finally, several authors have sought to address the issue of interpretability of topics discovered by NMF, especially in datasets comprised of text documents. For example, Ailem et al. (2017) apply NMF to the documents using a word embedding model, Word2Vec (Mikolov et al., 2013b), that focuses on the semantic relationship between words. We make use of this embedding to analyze the usefulness of the topics generated by examining their semantic similarity.

The dataset used is compiled from four different databases that contain scholarly articles related to COVID-19, various coronavirus diseases, other infectious diseases, and epidemiology (Semantic Scholar, 2020; Centers for Disease Control and Prevention, 2020; National Center for Biotechnology Information, 2020; bioRxiv, 2020). From each of these databases, only articles written in English that have a complete abstract and text body available are included. Punctuation, stop words, and words deemed to be irrelevant such as "copyright" or "et al" are removed from the text body and abstract of each article, and the articles are lemmatized. Further, each word in the text body and abstract is represented by a TF-IDF embedding (Salton and Buckley, 1988). After processing and cleaning, the final dataset contains 25,663 articles. Most of these databases are regularly updated, and one important future direction of this work is to develop a dynamic tree structure that pulls new articles from these databases weekly.
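The TF-IDF representation described above can be illustrated with a minimal sketch. The weighting variant used here (relative term frequency times log(n / document frequency)), the whitespace tokenizer, the function name `tfidf_matrix`, and the three toy documents are all assumptions made for illustration; the text does not specify which TF-IDF variant or preprocessing library was used.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, X), where X[i][j] is the TF-IDF weight of word i in document j."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    n = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    X = [[0.0] * n for _ in vocab]  # d x n layout: vocabulary rows, document columns
    for j, toks in enumerate(tokenized):
        if not toks:
            continue  # an empty document contributes an all-zero column
        for w, count in Counter(toks).items():
            tf = count / len(toks)
            idf = math.log(n / df[w])  # assumed variant: unsmoothed log(n / df)
            X[index[w]][j] = tf * idf
    return vocab, X

# Toy corpus (hypothetical), already cleaned and lemmatized.
docs = [
    "virus protein cell infection",
    "patient hospital clinical infection",
    "virus vaccine protein",
]
vocab, X = tfidf_matrix(docs)
for word, row in zip(vocab, X):
    print(f"{word:10s}", [round(v, 3) for v in row])
```

With this variant, a word that occurs in every document receives weight zero, which is one reason corpus-wide filler terms carry little weight in the resulting matrix.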

In a vector space model, a corpus can be represented by a d × n matrix X, where d is the size of the vocabulary, and n is the number of documents. The underlying assumptions in topic modeling (Blei et al., 2007) are that a latent topic can be represented as a distribution over the words, and that every document is a mixture of topics, i.e. comprises a statistical distribution of topics that can be obtained by "adding up" all of the distributions of all the topics covered. In this section, we will introduce how to apply Hierarchical NMF for topic detection and creation of the hierarchical tree structure. As a preliminary step, a brief introduction to using NMF for topic detection is given.

In NMF, the corpus matrix $X \in \mathbb{R}^{d \times n}_{\geq 0}$ is decomposed into a pair of low-rank nonnegative matrices $W \in \mathbb{R}^{d \times k}$, also known as the dictionary matrix, and $H \in \mathbb{R}^{k \times n}$, also known as the coding matrix, by solving the following optimization problem

$$\min_{W \geq 0,\; H \geq 0} \|X - WH\|_F^2,$$

where $\|A\|_F^2 = \sum_{i,j} A_{ij}^2$ denotes the squared matrix Frobenius norm.

NMF, essentially an iterative optimization algorithm, has a drawback: the objective function is usually non-convex and has multiple local minima. Therefore a different random initialization of the NMF procedure will generally result in a different matrix factorization. More importantly, this changes the interpretation of the results, including the topic vector representations (W) as well as the relevance between articles and topics (H). Another possible source of variability in the algorithm is the choice of the number of topics, k. Different combinations of initializations of W, H, and k yield different topics, leading to different article clustering results. See Section 5.1 for more discussion and implementation details in this vein.
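To make the dependence on initialization concrete, the following is a minimal sketch of NMF using the classical multiplicative update rules of Lee and Seung for the Frobenius objective. The choice of solver is an assumption, since the text does not state which NMF algorithm was used; the function names, iteration count, and the random toy matrix are likewise illustrative only.

```python
import numpy as np

def nmf(X, k, seed=0, n_iter=200, eps=1e-10):
    """Approximate nonnegative X (d x n) as W @ H with W (d x k) and H (k x n)."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    W = rng.random((d, k))  # random nonnegative initialization (seed-dependent)
    H = rng.random((k, n))
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative; eps avoids division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def frobenius_error(X, W, H):
    """Squared Frobenius reconstruction error ||X - WH||_F^2."""
    return float(np.linalg.norm(X - W @ H, "fro") ** 2)

# Toy nonnegative matrix standing in for the d x n corpus matrix (hypothetical data).
rng = np.random.default_rng(42)
X = rng.random((20, 12))

W_a, H_a = nmf(X, k=3, seed=0)
W_b, H_b = nmf(X, k=3, seed=1)
err_a = frobenius_error(X, W_a, H_a)
err_b = frobenius_error(X, W_b, H_b)
print("error with seed 0:", err_a)
print("error with seed 1:", err_b)
```

Running the factorization with two seeds and comparing the columns of `W_a` and `W_b` is a simple way to observe that the recovered topics depend on the starting point.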

The traditional NMF method treats the detected topics as a flat structure, which limits its representational power. A hierarchical structure, such as a tree, generally provides a more comprehensive description of the data. Given the complex nature of the coronavirus literature corpus, such a hierarchical approach is appealing.

In this work, a hierarchical NMF (HNMF) framework is applied which is able to detect supertopics, subtopics, and the relationship between them, creating a tree structure. The proposed HNMF algorithm is summarized in Algorithm 1.

In HNMF, NMF is first applied to the original corpus matrix $X$ to obtain the dictionary matrix $W$ and coding matrix $H$. The documents are then sorted into matrices $X_1, X_2, \dots, X_k$, each representing a different topic, according to the coding matrix $H$, or into the matrix $X_e$ that temporarily holds unassigned articles. Whether the leaves need to be further divided depends on the number of documents in each topic matrix (leaf). If the number of documents sorted into a topic is greater than a pre-specified value $m$, then a further division is needed. The above process is repeated until the number of documents in each leaf is less than $m$. For more details on the implementation of the HNMF algorithm, the reader can refer to Section 5.2.

Algorithm 1: Hierarchical NMF (HNMF)
Input: Corpus matrix $X$.
$[W, H] = \mathrm{NMF}(X, k^*)$, where the topic number $k^*$ is chosen by Algorithm 2;
assign articles to the related topics $X_1, \dots, X_{k^*}$ according to the threshold $\alpha$ in $H$, and any remaining articles to the "Extra Document" matrix $X_e$;
repeat
    while the number of articles assigned to a topic $i$ is greater than $m$ do
        determine the number of subtopics $k_i^*$ of topic $i$ in $X_i$ by Algorithm 2;
        $[W_i, H_i] = \mathrm{NMF}(X_i, k_i^*)$;
        assign the documents to the subtopics according to the threshold $\alpha$ in $H_i$;
        assign the rest to $X_e$;
    end
    for article $x_i$ in $X_e$ do
        calculate the cosine similarity between $x_i$ and the leaves, and assign the article to the most related leaf;
    end
until the number of articles assigned to each topic is less than $m$.

This section begins with a discussion and visualization of the hierarchical tree structure obtained using Algorithm 1. Then in Sections 4.3 and 4.4 quantitative evidence is provided that the discovered topics are reasonable. In doing this, we seek to measure both the rationality of a given topic and the similarity between topics to evaluate whether the topics differ enough to be useful for a user.

Implementation of Algorithm 1 on the dataset results in a hierarchical clustering of the articles into eight supertopics, each with five to six subtopics. Two of these subtopics, the first and fourth subtopics of supertopic 7, are further decomposed into a third layer of subtopics, as the number of articles assigned to each of them is larger than the selected m in Algorithm 1. The full hierarchical tree structure is visualized in the diagram in Figure 1. Each color represents one of the eight supertopics and the size of each slice is proportional to the number of articles that are clustered into that topic. It is important to note that only the top three words associated with each topic are shown due to space constraints, but in some cases extending the list of highly related words is necessary to clarify the difference between the subtopics. For reference, the top ten keywords associated with each topic and subtopic can be found in Appendix 7. Additionally, the five most probable words associated with each topic are displayed on the associated website to help users choose the topics of personal interest more effectively.

In order to examine the structure in more depth, Figure 2 displays a branch of the resulting tree represented by word clouds, generated from the top five words associated with each topic. The size of each word in a word cloud cell is proportional to its weight in the corresponding W matrix, and thus to the probability that it is associated with that topic. In particular, the figure follows one path down the tree structure, focusing on Topic 7 and its associated subtopics, and then continuing to the subtopics of Topic 7-1. When moving to deeper layers in the tree, the general "health" and "model" topic further differentiates into subtopics ranging from public health, to animal-to-human transmission of diseases, to data modeling. Finally, the public health subtopic leads to clusters of articles specifically related to China or hospital care, for example.

Perhaps not surprisingly, the topic to which the highest number of articles are assigned, Topic 7, is about the general study of the disease (with the most highly associated words being "health, model, disease, case, epidemic, outbreak, public, country, population, transmission"), further split into two additional layers of subtopics. This is the only topic that was split into a third layer, allowing a more effective differentiation between articles covering a similar topic.

Also unsurprisingly, much of the literature, which was compiled early on during the pandemic, is clustered around the study of other coronavirus-caused diseases. Topic 8, for example, focuses on vaccine development through the lens of the Porcine Epidemic Diarrhea Virus (PEDV). Although this is a coronavirus found only in pigs, several vaccines have been developed, especially within the last seven years, when PEDV was first discovered in North America (Gerdts and Zakhartchouk, 2017). Hence, it is reasonable that this topic would be of interest to current researchers looking to develop a vaccine for SARS-CoV-2. Similarly, Topic 1 focuses on coronaviruses known to infect humans, such as SARS-CoV and MERS-CoV. Topic 4 also contains a couple of subtopics that look specifically at the genetic structure of SARS-CoV.

Figure 2: Part of the topics from HNMF and related topic coherence. The first row shows the keywords for the topics in the first layer, the second row shows the subtopics of Topic 7, and the third row shows the subtopics of Topic 7-1. The corresponding topic coherence score (see Section 4.3 for more details) is underneath each word cloud.

Other topics of interest focus on articles about diseases with related symptoms, although they may be caused by a different type of virus. For example, both Topics 5 and 6 examine literature related to respiratory illnesses such as influenza, though Topic 5 clusters articles more related to laboratory study and Topic 6 clusters articles more related to hospital studies and patient care.

Other major topics focus more on microbiology, including the genomic structure of the virus, cellular infection and immune response, and cell-protein interaction. Thus, the hierarchical tree structure separates papers into macro (public health) and micro (biological) studies of the virus, and into papers that study related viruses. This creates a clear delineation of topics for those investigating papers, and gives insight into areas of interest for early researchers of SARS-CoV-2. This organizational structure appears to be more robust and high-level than, for example, a keyword-based search or organization.

One measure of effectiveness of the topics discovered by HNMF is topic coherence. Topic coherence is a quantitative measure of how well the keywords that define a topic make sense as a whole to a human observer. Following Mimno et al. (2011), the coherence score $C_i$ for topic $i$, $i = 1, \dots, k$, is computed from the set $V^{(i)} = \{v^{(i)}_1, \dots, v^{(i)}_P\}$ of the $P$ most probable words in that topic. The topic coherence scores for each of the topics in the first layer, as well as the scores for the subtopics in levels 2 and 3 for Topic 7, can be seen in the word cloud display in Figure 2. A positive coherence score indicates that the keywords are in a grouping that would be recognizable to a human expert. A negative coherence score indicates that a topic is less meaningful, which may occur, for example, if the associated keywords fall into two unrelated groups, or if the keywords are seemingly random and have no obvious connection. All of our identified subtopics have large positive coherence scores, suggesting that by this metric they are understandable and useful to human users.
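As an illustration of a co-occurrence-based coherence score, the sketch below uses the form commonly attributed to Mimno et al. (2011): the sum, over ordered pairs of top words, of log((D(v_m, v_l) + 1) / D(v_l)), where D(v) is the number of documents containing v and D(v, v') the number containing both. Whether this is exactly the expression and normalization used for the scores reported here is an assumption; the function name and the four toy documents are hypothetical.

```python
import math

def coherence(top_words, docs):
    """Co-occurrence coherence of an ordered list of top words over tokenized documents."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # a word absent from the corpus contributes nothing
            d_ml = doc_freq(top_words[m], top_words[l])
            score += math.log((d_ml + 1) / d_l)
    return score

# Toy tokenized corpus (hypothetical).
docs = [
    ["virus", "protein", "cell"],
    ["virus", "protein", "vaccine"],
    ["patient", "hospital", "care"],
    ["patient", "hospital", "clinical"],
]
related = coherence(["virus", "protein"], docs)
unrelated = coherence(["virus", "hospital"], docs)
print("related words:  ", related)
print("unrelated words:", unrelated)
```

Words that tend to appear in the same documents score higher than words that never co-occur, which is the behavior the coherence metric is meant to capture.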

Another test of the usefulness of the hierarchical structure generated is to evaluate whether the topics are different enough to allow for an informative choice between them. To evaluate this, we quantify topic similarity using a metric known as the Word Mover's Distance (WMD). WMD is a popular tool for measuring distances between documents (Kusner et al., 2015). WMD utilizes Word2Vec (Mikolov et al., 2013a), a word embedding technique, and treats each document as a set of vectors in the embedded vector space. This embedding allows the WMD metric to consider the semantic meaning of a given word, rather than just its spelling. Thus, for example, it allows synonyms to be identified as having the same meaning in a given context despite being different words, which makes it preferable to traditional metrics such as cosine similarity or Euclidean distance. The distance between two documents A and B is defined as the minimum cumulative distance that words from document A need to travel to match exactly the words of document B. The topic similarity across the layers and within each layer is evaluated by computing the WMD between a topic and its associated subtopics and between the subtopics themselves, where each topic is represented by its 100 most related words. The similarities between all topics in the hierarchical structure obtained from HNMF are visualized in the heat map in Figure 3. As indicated by the overall dark colors, in general each topic in the tree is dissimilar from the others.
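The exact WMD requires solving an optimal transport problem. As a lighter illustration of the idea, the sketch below computes the relaxed lower bound on WMD described by Kusner et al. (2015), in which every word moves all of its mass to its nearest word in the other word set. This is not the exact WMD used for the reported similarities, and the uniform word weights, the function names, and the two-dimensional toy embedding (standing in for Word2Vec vectors) are assumptions for illustration.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(words_a, words_b, embedding):
    """Relaxed lower bound on WMD between two word lists with uniform word weights."""
    def one_sided(src, dst):
        weight = 1.0 / len(src)
        # Each source word travels to its closest word in the destination set.
        return sum(
            weight * min(euclidean(embedding[w], embedding[v]) for v in dst)
            for w in src
        )
    # The larger of the two one-sided relaxations is the tighter bound.
    return max(one_sided(words_a, words_b), one_sided(words_b, words_a))

# Hypothetical 2-D embedding; real word vectors would come from a trained model.
embedding = {
    "influenza": (1.0, 0.0), "flu": (0.9, 0.1),
    "patient": (0.0, 1.0), "hospital": (0.1, 0.9),
    "genome": (-1.0, 0.0), "sequence": (-0.9, -0.1),
}
topic_a = ["influenza", "patient"]
topic_b = ["flu", "hospital"]
topic_c = ["genome", "sequence"]
print("a vs b:", relaxed_wmd(topic_a, topic_b, embedding))
print("a vs c:", relaxed_wmd(topic_a, topic_c, embedding))
```

Because distances are measured in the embedding space, the word lists `topic_a` and `topic_b` are close even though they share no words, which is the property that motivates WMD over purely lexical comparisons.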

When examining the similarities between a topic and its subtopics, results show that for a given topic, its subtopics are less correlated with each other than with their parent topic. For example, in Figure 4, for Topic 7, the similarity scores between its subtopics are much lower than the scores between the subtopics and their parent Topic 7. Similar conclusions can be drawn for Topic 7-1 and its subtopics, as shown in Figure 5. However, there are some high similarity scores between subtopics that belong to different topics, for example the light off-diagonal spot in Figure 3 showing the similarity between Topics 6-3 and 5-3. Examining the top ten keywords associated with each topic, we find that both topics are associated with the words "influenza", "virus", and "study", indicating that both topics deal with studies related to the influenza virus.

Figure 4: Topic similarity between Topic 7 and its subtopics measured by WMD: Topic 7 has high topic similarity with its five subtopics (7-1, 7-2, 7-3, 7-4, 7-5) and the five subtopics have low similarity between themselves.

Figure 5: Topic similarity between Topic 7-1 and its subtopics measured by WMD: Topic 7-1 has high topic similarity with its four subtopics (7-1-1, 7-1-2, 7-1-3, 7-1-4) and the four subtopics have low similarity between themselves.

The insight into the difference between the two subtopics comes from examining supertopics 5 and 6 and the keywords associated with each subtopic that do not overlap. Looking at words such as "detection" and "assay" associated with Topic 5 and "surveillance", "case", "season", and "year" associated with Topic 5-3, it appears that Topic 5-3 is more associated with detecting and monitoring the prevalence of cases of influenza in the general populace in a given flu season. On the other hand, given the presence of the keywords "patient", "hospital", "clinical", and "study" associated with the parent topic, Topic 6, as well as "patient", "child", and "respiratory" associated with Topic 6-3, it seems that Topic 6-3, while also related to influenza studies, deals more specifically with cases in a hospital setting, perhaps specifically related to children, and with the relationship to respiratory illness in general.

A study of similar subtopics such as these shows the effectiveness of the tree in separating related topics into more dissimilar supertopics to make navigation to articles of interest clear. However, Algorithm 1 allows an article to be assigned to more than one subtopic, acknowledging that a single article may be of equal interest to researchers investigating different but related topics.

In this section, we discuss the details of the implementation of HNMF and the construction of the hierarchical structure.

As previously discussed, the latent topics discovered by NMF are sensitive to the initial state of the algorithm, leading to different dictionaries for each topic. In order to reduce this sensitivity, we seek to find an appropriate number of topics, $k^*$, in each layer such that if a $k^*$-topic NMF is initialized using any two random seeds, the content in the topics discovered should be similar, as measured by cosine similarity. We define this as a consistent number of topics. Algorithm 2 summarizes the process to find the "best" number of topics, as defined in this manner, for a corpus matrix $X$.

In Algorithm 2, first the increment in the proportion of variance explained by adding one more cluster to split the corpus matrix $X$ is plotted. This is calculated by looking at the singular values of $X$. By examining this plot (Figure 6), a range $[k_1, k_2] = [7, 11]$ in which a potential optimal number of topics $k^*$ can be found is obtained by noting where the proportion of variance explained starts to level off.
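One plausible way to compute such a curve from the singular values is sketched below: the squared singular values are normalized to sum to one, so that the i-th entry is the increment in the proportion of variance attributed to the i-th component. This normalization, the absence of centering, the function name, and the random toy matrix are assumptions, since the text only states that the singular values of X are used.

```python
import numpy as np

def marginal_variance_explained(X):
    """Proportion of variance attributed to each successive singular component of X."""
    singular_values = np.linalg.svd(X, compute_uv=False)  # returned in decreasing order
    energy = singular_values ** 2
    return energy / energy.sum()

# Toy nonnegative matrix standing in for the d x n corpus matrix (hypothetical data).
rng = np.random.default_rng(0)
X = rng.random((30, 15))

increments = marginal_variance_explained(X)
cumulative = np.cumsum(increments)
for k, (inc, cum) in enumerate(zip(increments[:5], cumulative[:5]), start=1):
    print(f"k={k}: increment={inc:.3f}, cumulative={cum:.3f}")
```

Plotting `increments` against the component index and looking for where the curve flattens gives a candidate range for the number of topics, in the spirit of Figure 6.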

To determine the value of $k^*$ in this range, first, $q + 1$ random seeds are selected, where $q$ is a sufficiently large number. In this case, $q = 30$ was used. For each number of topics $k \in [k_1, k_2]$, topic sets $\{T_j\}_{j=1}^{q+1}$ are generated using each of the $q + 1$ random seeds for initializing NMF.

Algorithm 2: Determine optimal number of topics
Input: integer $q$, corpus matrix $X$.
Determine a range $[k_1, k_2]$ for the potential topic number by plotting the increment in variance explained by adding one more cluster to $X$;
randomly select $q + 1$ seeds for initialization;
for integer $k$ in $[k_1, k_2]$ do
    generate topic sets $\{T_j\}_{j=1}^{q+1}$, where $T_j$ is obtained from NMF initialized by random seed $j$;
    generate $S_{kj}$ for $j = 1, 2, \dots, q$, where $S_{kj}$ is the cosine similarity matrix between the topics in $T_j$ and $T_{j+1}$;
    $LSS_k = \emptyset$;
    for $S_{kj}$, $j = 1, 2, \dots, q$ do
        add $lss = \min\left(\max(s_{a\cdot}), \max(s_{\cdot b})\right)$, with the minimum taken over all rows $a$ and columns $b$, to $LSS_k$, where $s_{ab}$ is the $(a, b)$th entry of the matrix $S_{kj}$;
    end
end
return $k^* = \arg\max_k \left(\mathrm{median}(LSS_k)\right)$.

Then, the cosine similarity is calculated between each of the $k$ topics for every consecutive pair of $T_j$'s. The similarity scores between the topics for each pair $(T_j, T_{j+1})$ are stored in a matrix $S_{kj} \in \mathbb{R}^{k \times k}$. Therefore, $q$ such matrices are generated for each $k \in [k_1, k_2]$. For a fixed $k$, the minimum of all maximum entries from each column and row of a similarity matrix $S_{kj}$ is defined to be the least seed similarity (lss) score of that matrix. The set containing the $q$ lss scores for a given number of topics $k$ is denoted $LSS_k$. A consistent number of topics should have an overall high similarity between the topics generated for each seed. Therefore, we choose $k^*$ in $[k_1, k_2]$ to be the "best" number of topics if the median of all its lss scores is the highest.
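The lss computation and the selection of the number of topics can be sketched as follows. Each topic set is represented as a list of topic vectors (the columns of a dictionary matrix W); the function names and the small hand-written topic sets are hypothetical stand-ins for the output of NMF runs with different seeds.

```python
import math
from statistics import median

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def lss_score(topics_a, topics_b):
    """Least seed similarity: minimum over all row maxima and column maxima of S."""
    S = [[cosine(u, v) for v in topics_b] for u in topics_a]
    row_max = [max(row) for row in S]
    col_max = [max(col) for col in zip(*S)]
    return min(row_max + col_max)

def median_lss(topic_sets):
    """Median lss score over consecutive pairs (T_j, T_{j+1}) of topic sets."""
    return median(lss_score(a, b) for a, b in zip(topic_sets, topic_sets[1:]))

# Toy topic sets (hypothetical). For k = 2 the seeds agree up to a permutation;
# for k = 3 one seed produces a topic with no counterpart in the other.
sets_k2 = [
    [[1, 0, 0], [0, 1, 0]],
    [[0, 1, 0], [1, 0, 0]],
    [[1, 0, 0], [0, 1, 0]],
]
sets_k3 = [
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    [[1, 0, 0], [0, 1, 0], [1, 1, 0]],
]
topic_sets_by_k = {2: sets_k2, 3: sets_k3}
best_k = max(topic_sets_by_k, key=lambda k: median_lss(topic_sets_by_k[k]))
print("median lss per k:", {k: median_lss(v) for k, v in topic_sets_by_k.items()})
print("selected k:", best_k)
```

Taking the minimum over both row and column maxima means a single topic without a close match in the other run is enough to pull the score down, so a high median indicates that every topic reappears across seeds.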

The boxplot in Figure 7 shows the distribution of the lss scores for $k$ in $[7, 11]$. In this case, 8 is chosen as the "best" number of topics since it results in the highest median lss score.

A hierarchical NMF (see Algorithm 1) is applied to cluster the articles, where the number of topics in each layer is determined by Algorithm 2. The hierarchical tree structure is established from top to bottom and consists of three layers on this data set (see Figure 1 ).

Figure 6: Plot of marginal increment in proportion of variance explained by adding another cluster to split $X$. It is determined that the ideal number of clusters/topics likely lies in the range [7, 11], as this is where the plot starts to level off.

Figure 7: Box plot of $LSS_k$: Topic number 8 is the "best" as it has the highest median lss (least seed similarity) score and should be expected to yield consistent results with random seeds.

To generate topics in the first layer, NMF is applied to the matrix $X$ containing all the vectorized articles, resulting in a factorization with 8 topics, as determined by Algorithm 2. Next, a threshold $\alpha$ (in this case, $\alpha = 0.05$) is chosen, and the articles in $X$ are assigned to a topic class $X_1, \dots, X_8$ if their corresponding document-topic correlation in the $H$ matrix is greater than $\alpha$. Note that by this definition, one article could be assigned to one or more topic classes. After this, any articles not classified into one of the 8 topics are assigned to the "Extra Document" corpus, $X_e$. Now, the second layer of the tree consists of text corpora $X_1, \dots, X_8$. For each $X_i$, $i = 1, 2, \dots, 8$, in the second layer, the topic is further subdivided into a third layer if the number of articles assigned to topic class $i$ is more than some $m$ (in this analysis, we chose $m = 1400$). If it is determined that text corpus $X_i$ needs to be divided further using NMF, the number of subtopics is chosen by Algorithm 2 and again, articles from $X_i$ are assigned to each subtopic based on the threshold $\alpha$. As before, any articles that do not receive a classification are assigned to $X_e$. This process is continued for each level in the tree until each leaf contains no more than $m$ articles.
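The threshold-based soft assignment described above can be sketched in a few lines. The coding matrix is represented as a list of k rows with one entry per article; the function names and the toy values are hypothetical, and whether the entries of H are normalized before thresholding is not stated in the text and is not assumed here.

```python
def assign_by_threshold(H, alpha):
    """Soft-assign articles to topics: article j joins topic i when H[i][j] > alpha."""
    k, n = len(H), len(H[0])
    topics = {i: [j for j in range(n) if H[i][j] > alpha] for i in range(k)}
    assigned = {j for members in topics.values() for j in members}
    # Articles below the threshold for every topic go to the "Extra Document" set.
    extra = [j for j in range(n) if j not in assigned]
    return topics, extra

def needs_split(topics, m):
    """Topics holding more than m articles are candidates for further subdivision."""
    return [i for i, members in topics.items() if len(members) > m]

# Toy coding matrix with 2 topics and 4 articles (hypothetical values).
H = [
    [0.90, 0.02, 0.50, 0.01],
    [0.03, 0.80, 0.40, 0.04],
]
topics, extra = assign_by_threshold(H, alpha=0.05)
print("topic members:", topics)          # article 2 belongs to both topics
print("extra documents:", extra)
print("topics to split (m=1):", needs_split(topics, m=1))
```

Because membership is decided independently per topic, an article with weight above the threshold for several topics appears in each of them, which is what makes the resulting clustering soft.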

Finally, the cosine similarity between each article in X e and the dictionary associated to each leaf (topic in the lowest layer in a given branch) is calculated. Note that the dictionary of a leaf is a column of the W matrix of its parent topic. Then the articles in X e are assigned to the leaf with the highest cosine similarity. After this reassignment, the number of articles associated with each leaf is calculated again, and any leaves containing more than m articles are further subdivided.
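A minimal sketch of this reassignment step follows. Each leaf is represented by its dictionary vector (a column of the parent's W matrix) and each unassigned article by its column of the corpus matrix; the function names, the leaf labels, and the three-word toy vocabulary are hypothetical and chosen only to mirror the naming used in the tree.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign_extras(extra_articles, leaf_dictionaries):
    """Assign each unassigned article to the leaf whose dictionary vector is most similar."""
    assignments = []
    for article in extra_articles:
        best_leaf = max(
            leaf_dictionaries,
            key=lambda name: cosine(article, leaf_dictionaries[name]),
        )
        assignments.append(best_leaf)
    return assignments

# Toy leaves over a 3-word vocabulary (hypothetical labels and weights).
leaf_dictionaries = {
    "7-1-1": [1.0, 0.0, 0.0],
    "7-1-2": [0.0, 1.0, 0.0],
}
extra_articles = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
]
print(assign_extras(extra_articles, leaf_dictionaries))
```

After this step the leaf sizes would be recounted, and any leaf that now exceeds m articles would be passed back through the splitting procedure.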

HNMF is used to organize existing literature on coronaviruses and pandemics, and early literature on COVID-19 into an interactive structure easily searchable by researchers and available to use through a corresponding website. The topics discovered by HNMF reveal that early research of interest to the COVID-19 research community divides into diverse areas such as research related to other coronaviruses, research related to other respiratory diseases, virology and genetic research, as well as research relating to the public health response. A topic coherence metric reveals that the topics discovered are consistent and semantically meaningful, while a topic similarity metric reveals that the topics differ sufficiently from one another to allow for a diversity of choice and areas of interest on the part of the user.

In the future, we hope to regularly update the hierarchical structure as well as the associated website as new research papers are added, both by adding new papers and by adding and deleting classifications as new research topics emerge. We hope to do this using an online version of the HNMF algorithm such as the one in Tu et al. (2018).

generated by HNMF. These keywords are visible to website users to enable them to make choices to navigate through the tree. Note that for the first layer we gave suggested topic titles. As we are not experts in the field, these are only suggestions to give an idea of the types of research someone may be looking for within that topic.

Table 11: The top 10 keywords associated with each of the subtopics of Topic 7-4 in the 3rd layer of the tree

rat, cell, group, animal, cat, used, using, study, protein
7-4-3: model, data, case, outbreak, surveillance, disease, epidemic, transmission, influenza, time
7-4-4: air, particle, concentration, wind, velocity, ventilation, flow, airflow, temperature, room
7-4-5: calf, diarrhea, farm, colostrum, milk, fecal, cow, dairy, herd, day

Non-negative matrix factorization meets word embedding

A correlated topic model of science

Non-negative matrix factorization, a new tool for feature extraction: Theory and applications

Centers for Disease Control and Prevention. 2020. Covid-19 research articles downloadable database

Neural nonnegative matrix factorization for hierarchical multilayer topic modeling

Vaccines for porcine epidemic diarrhea virus and other swine coronaviruses

Covid-19 open research dataset challenge (cord-19)

Nonnegative matrix factorization for interactive topic modeling and document clustering

Fast rank-2 nonnegative matrix factorization for hierarchical document clustering

From word embeddings to document distances

Deep nmf for speech separation

Learning the parts of objects by non-negative matrix factorization

Efficient estimation of word representations in vector space

Distributed representations of words and phrases and their compositionality

Optimizing semantic coherence in topic models

Term-weighting approaches in automatic text retrieval. Information Processing & Management

Semantic Scholar. 2020. Cord-19: Covid-19 open research dataset

Call to action to the tech community on new machine readable covid-19 dataset

Supervised multilayer sparse coding networks for image classification

A deep matrix factorization method for learning attribute representations

Hierarchical online nmf for detecting and tracking topic hierarchies in a text stream

Document clustering based on non-negative matrix factorization