key: cord-294304-9w6zt778 authors: Doanvo, Anhvinh; Qian, Xiaolu; Ramjee, Divya; Piontkivska, Helen; Desai, Angel; Majumder, Maimuna title: Machine Learning Maps Research Needs in COVID-19 Literature date: 2020-09-16 journal: Patterns (N Y) DOI: 10.1016/j.patter.2020.100123 sha: doc_id: 294304 cord_uid: 9w6zt778

As of August 2020, thousands of COVID-19 (coronavirus disease 2019) publications have been produced. Manual assessment of their scope is an overwhelming task, and shortcuts through metadata analysis (e.g., keywords) assume that studies are properly tagged. However, machine learning approaches can rapidly survey the actual text of publication abstracts to identify research overlap between COVID-19 and other coronaviruses, research hotspots, and areas warranting exploration. We propose a fast, scalable, and reusable framework to parse novel disease literature. When applied to the COVID-19 Open Research Dataset (CORD-19), dimensionality reduction suggests that COVID-19 studies to date are primarily clinical-, modeling-, or field-based, in contrast to the vast quantity of laboratory-driven research for other (non-COVID-19) coronavirus diseases. Furthermore, topic modeling indicates that COVID-19 publications have focused on public health, outbreak reporting, clinical care, and testing for coronaviruses, as opposed to the more limited number focused on basic microbiology, including pathogenesis and transmission.

publication authors themselves, rather than manually tagged keywords 3, since such metadata may not be reliable or fully reflect latent issues discussed by the investigators who conducted the research. Third, defining primary topics in COVID-19 research solely by a select group of influential studies, or by narrow correlations between a few metadata keywords at a time, is insufficient because (1) topics may be broader than one or several highly influential studies and (2) topics may be comprised of complex correlations mapped between hundreds of different keywords. While a manual review might be desirable to capture this nuance 5, it does not scale effectively over the tens of thousands of articles available.

Our methods address these issues by combining three techniques commonly used in natural language processing: document-term matrices, dimensionality reduction, and topic modeling. Though these techniques are not methodologically novel, our specific application of them is; namely, we use them to analyze where there appears to be less COVID-19 research in comparison with existing research on other coronaviruses. Our document-term matrices allow us to draw on the full text of publication abstracts (as opposed to relying solely on keyword metadata). Our subsequent use of two machine learning (ML) techniques, dimensionality reduction and topic modeling, allows us to analyze complex information at scale without any a priori knowledge of topics, leveraging semantic trends across the tens of thousands of articles available to identify latent concepts and topics. This allows us to explore how the focus of COVID-19 studies differs from research on other coronaviruses by comparing the characteristics of COVID-19 articles, identified through machine learning, with those pertaining to non-SARS-CoV-2 coronaviruses.
These differences can then lend insight into possible gaps in research efforts for SARS-CoV-2.

In the projection space, SARS-CoV-2 abstracts appeared to share a space in common with both MERS-CoV and SARS-CoV abstracts, likely reflecting some shared terminology and possible ongoing attempts to leverage existing knowledge of the other two viruses to learn about SARS-CoV-2. However, SARS-CoV-2 abstracts are much more concentrated among lower projection values. Notably, MERS-CoV and SARS-CoV abstracts were spread more evenly along the second PC, reflecting greater breadth and variation along these PCs that can be attributed to a broader range of studies focused on these pathogens as compared with SARS-CoV-2. This may be due in part to the much longer time that has been spent studying these viruses to date.

To identify terms associated with differences between COVID-19 and non-COVID-19 abstracts on PC2, we examined patterns of lemmatized terms from the respective abstracts (Figure 3). The projection values of COVID-19 abstracts on PC2 were lower and associated with emergent COVID-19 clinical-, modeling-, or field-based (CMF) research, such as observational, clinical, and epidemiological studies, exemplified by stem terms "patient", "pandem", "estim", and "case". Words in the opposite direction on PC2, such as "protein", "cell", "bind", and "express", can be associated with viral biology and basic disease processes studied in biomolecular laboratories. COVID-19 abstracts were thus mostly associated with research conducted outside of laboratories, e.g., in hospitals, likely reflecting the pandemic reality of data collection alongside (and often secondary to) clinical care.

The high-level abstraction reflected by PC2 informed our designation of the extent to which COVID-19 research included studies with any CMF design (ranging from epidemiological studies to retrospective reviews of clinical outcomes, case studies, and randomized clinical trials) or laboratory-driven research (including observational microscopy, experimentation with antiviral compounds, derivation of protein structures, and studies of animal or cell culture models). Overall, COVID-19 abstracts appeared more likely to have terms associated with CMF research rather than laboratory studies, based on comparisons of distributions for key terms in the COVID-19 and non-COVID-19 abstracts (Figure 4; examples in Supplemental Information 3). This partition along research design for non-COVID-19 and COVID-19 abstracts was also evident in the abstract texts: 90% of the abstracts in the bottom 1% of projection values along the second PC were related to COVID-19; conversely, only 1% of the abstracts in the top 1% were related to COVID-19. In the future, we can implement PCA again to observe time-varying trends in CMF-based and laboratory-driven research. If COVID-19 research continues to focus significantly more on CMF-based study than on laboratory-driven research, we would expect this to be reflected in the separation between COVID-19 and non-COVID-19 research along a new PC that separates these two categories of research.

When we reran dimensionality reduction and topic modeling on new data through July 31, 2020, we found that the body of CMF research has continued to grow far more quickly than laboratory-based research (figures available in Supplemental Information 4). Our PCA results showed that PC2 strongly differentiated between abstracts related to SARS-CoV-2 and abstracts that mentioned other coronaviruses. SARS-CoV-2 abstracts tended to have lower PC2 projection values, which were associated with CMF-related terms, such as "hospit", "case", and "risk". Conversely, non-SARS-CoV-2 studies tended to have higher projection values, which were associated with laboratory-based research, including "antibodi", "cell", and "protein".
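This interpretation step, identifying the highest-magnitude term loadings on a PC and comparing projection values between groups of abstracts, can be illustrated with a minimal sketch. It assumes a fitted scikit-learn TruncatedSVD model (svd), the sparse document-term matrix (X), the vocabulary list (terms), and a boolean NumPy mask flagging COVID-19 abstracts (is_covid); these names are illustrative stand-ins rather than the authors' published code.

```python
import numpy as np

def top_terms_on_pc(svd, terms, pc=1, n_top=10):
    """Return the terms with the largest-magnitude loadings on one PC.

    svd   : fitted sklearn TruncatedSVD model (PCA via SVD on the DTM)
    terms : vocabulary, one string per column of the document-term matrix
    """
    loadings = svd.components_[pc]                      # term loadings on this PC
    order = np.argsort(np.abs(loadings))[::-1][:n_top]  # largest magnitudes first
    return [(terms[i], float(loadings[i])) for i in order]

def compare_pc_projections(svd, X, is_covid, pc=1):
    """Summarize how COVID-19 vs. non-COVID-19 abstracts project onto one PC."""
    proj = svd.transform(X)[:, pc]                      # one projection value per abstract
    return {"covid_median": float(np.median(proj[is_covid])),
            "non_covid_median": float(np.median(proj[~is_covid]))}
```

In this setup, terms with strongly negative loadings characterize abstracts at the low end of the component (here, CMF-related research), while strongly positive loadings characterize its laboratory-oriented end.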
Topic modeling helped characterize differences between research topics discussed in COVID-19 and non-COVID-19 abstracts. Results from the latent Dirichlet allocation (LDA) model suggested that, similar to the pattern observed in Figure 5, there was clear differentiation between COVID-19 and non-COVID-19 abstracts.

Our findings demonstrate the utility of our novel NLP-driven approach for determining potential areas of underrepresentation in current research efforts for COVID-19. By applying unsupervised ML methods to CORD-19, we identified overarching key research topics in existing coronavirus and COVID-19-specific abstracts, as well as the distribution of abstracts among topics and over time. Our results support a prior bibliometric study that also found more frequent appearances of epidemiological keywords in COVID-19 research compared with research on other coronaviruses 3. However, our study presents the unique finding that laboratory-based COVID-19 studies, including those on genetic and biomolecular topics, are underrepresented relative to studies of epidemiological and clinical issues, particularly when compared with the distribution of previous research on other coronaviruses. We continued to observe this trend when we updated our May 30, 2020 analysis with data through July 31, 2020. In particular, the pace of basic microbiological study has lagged behind that of research in other areas (e.g., topic families derived from LDA including clinical issues, societal impacts and policies, general reporting, and transmission modeling), all of which are CMF-based.

Furthermore, we developed a framework that improves upon existing bibliometric studies in three key ways; namely, our approach (1) maps connections between publications by relying directly on the abstracts rather than on the narrower information available from metadata, as in other bibliometric analyses, including those from other fields 9,10; (2) uses ML to explore latent semantic information of vast scale and complexity to identify hidden trends; and (3) does not rely on any a priori knowledge of what topics we expect coronavirus literature to cover, but rather highlights them without preconceived assumptions. We believe this methodology can be reused to rapidly explore possible research gaps during future epidemics and pandemics. More specifically, natural language processing and ML could serve as a way to identify the major concepts and topics covered by past research in comparison with present efforts; if certain topics identified in earlier research are not well represented in more recent studies on the emerging pathogen, they could be interpreted as potential research gaps.
The distribution of COVID-19 and non-COVID-19 abstracts from our PCA results suggests that, at the time of writing (CORD-19 dataset release on July 31, 2020), the breadth of published research for COVID-19 is relatively narrow compared with that of published non-COVID-19 studies (Figures 1 and 2). As shown in our results, keywords associated with biomolecular processes (e.g., viral structure, pathogenesis, and host cell interactions) appeared more frequently in non-COVID-19 abstracts than in COVID-19 abstracts. This finding reflects the emergent nature of SARS-CoV-2. Nonetheless, the availability of laboratory studies for other coronaviruses represents an opportunity for generating hypothesis-driven research questions grounded in empirical research.

This underrepresentation of studies on biomolecular processes could also be attributed to the rapid worldwide spread of SARS-CoV-2 that occurred within mere months of its emergence, necessitating an unprecedented response from healthcare and public health infrastructures globally. Our PCA results reflect an overwhelming concern regarding the exponential spread of the virus and the associated transmission risks, indicated by the more frequent appearance of stem terms such as "pandem", "outbreak", "estim", "countri", "number", and "risk" in COVID-19 abstracts. This was also supported by our topic modeling results, which indicated that 58% of COVID-19 abstracts fell into just five of 30 topics, generally related to healthcare services, the pandemic's public health issues, and testing for coronaviruses (Figures 7a and 7b). The more rapid growth of CMF research, relative to laboratory-driven research, mirrors the current response to the pandemic in the United States, where the initial focus on pressing epidemiological and clinical concerns is now followed by interest in experimental investigations, including those of structural mechanisms for host cell entry and possible therapeutic targets.

Overall, our findings reflect a clear divide between COVID-19 and non-COVID-19 abstracts based upon research design; unlike CMF research, laboratory-driven SARS-CoV-2 research is either still underway or has only just been initiated. This can be attributed in part to the fact that laboratory research is often a labor-intensive process within a federally regulated infrastructure that depends on the availability of timely, project-based funding as well as longer-term funding. Our findings also suggest that the pace of research on SARS-CoV-2 biomolecular processes is potentially insufficient given the global threat posed by the virus (Figures 7a and 7b).

We recognize that the number of abstracts in each of these topics does not necessarily represent scientific progress made in these areas, but it does reflect the pace of research and the potential availability of public knowledge. This indicates either a mismatch between the level of effort devoted to these issues and the urgency of the work, or time lags inherent to these fields that constrain the responsiveness of the scientific community. Increased and consistent funding of emerging pathogens research, including support of basic research even when there is no immediate threat of an outbreak, would allow us to maintain a proactive posture in accumulating available knowledge rather than over-relying on reactivity.

These conclusions are subject to several limitations.
The proliferation of pre-print services reduces the lag between the discovery of knowledge and the availability of an abstract to ingest in our data pipeline. Third, the number of publications in each area may imply a relative difference in research productivity for different topics, and thus may still serve as a proxy for such progress or for the attention given to specific issues. And finally, our ML-based method offers the chance to quickly review large quantities of text at scale and to highlight underlying trends. Both speed and scale are crucial to informing time-sensitive decisions on policy and priorities to facilitate the most impactful research.

Our ML-based study offers insights into potential areas for research opportunities to tackle key gaps in our knowledge regarding SARS-CoV-2 and COVID-19. Our findings showcase the need for institutions to support laboratory-driven research on an ongoing basis, not only during a crisis, to enable a proactive preparedness posture. While we would prefer future pandemics to be prevented through comprehensive surveillance and mitigation of new pathogens, if a crisis emerges in the future, the urgency to understand knowledge gaps will remain. Our approach can be reused in such scenarios to rapidly explore potential research gaps and to inform future efforts for other emergent pathogens. By using prior research or studies focused on related pathogens as a baseline, the trends and gaps in knowledge regarding an emergent pathogen can be monitored to ensure that key areas in research are not under-resourced in the middle of a crisis.

Lead contact: Further information and requests for resources should be directed to and will be fulfilled by the Lead Contact, Anhvinh Doanvo (adoanvo(at)gmail.com).

Materials availability: This study did not generate any unique reagents.

Data and code availability: The CORD-19 data are available to download here. We used the data released on May 30, 2020 for our initial analysis; for our update, we used the data released on July 31, 2020. All of our code is available for download from GitHub here.

Without using any pre-existing knowledge about the abstracts' topics, we employed unsupervised ML to determine differences between COVID-19 and non-COVID-19 abstracts in our corpus of documents. A dimensionality reduction approach was used to identify principal patterns of variation in the abstracts' text, followed by topic modeling to extract high-level topics discussed in the abstracts 20. Our data pipeline is available on GitHub 1, and the specific software packages we used are described in Supplemental Information 8.

We obtained research abstracts from CORD-19 on May 28, 2020. Generated by the Allen Institute for AI in partnership with other research groups, CORD-19 is updated daily with coronavirus-related literature. Peer-reviewed studies from PubMed/PubMed Central, as well as pre-prints from bioRxiv and medRxiv, are retrieved using specific coronavirus-related keywords ("COVID-19" OR "Coronavirus" OR "Corona virus" OR "2019-nCoV" OR "SARS-CoV" OR "MERS-CoV" OR "Severe Acute Respiratory Syndrome" OR "Middle East Respiratory Syndrome"). The dataset includes both full text and metadata for all coronavirus research articles, with ~40% of the dataset classified as virology-related 6. We focused our analysis on the abstracts of articles in CORD-19.
As some of the CORD-19 abstracts were relevant neither to SARS-CoV-2 nor to other coronaviruses, we first filtered the CORD-19 data to isolate coronavirus-specific abstracts by searching for abstracts that mentioned relevant terms. These abstracts served as our "documents" associated with the sparse document-term matrices (DTMs) in our natural language processing (NLP) pipeline (DTMs and sparse matrices are described in more detail in Supplemental Information 9). We also identified abstracts for only COVID-19-related studies by filtering for COVID-19-related keywords within this subset (Supplemental Information 10).

We used two machine learning techniques to identify key trends in coronavirus literature: dimensionality reduction and topic modeling, discussed below. For software packages and additional details behind the data pipeline, see Supplemental Information 8.

Dimensionality Reduction

PCA is a dimensionality reduction algorithm that summarizes data by determining linear correlations between variables 21. PCA identifies individual patterns of variance, or PCs, in DTMs that differentiate documents from one another, highlighting key trends in the data. For example, in a simple corpus with two mutually exclusive topics, like machine learning and health infrastructure, the terms "machine" and "learning" would be correlated with one another. PCA would recognize these terms as an important source of variation, providing a way to differentiate documents about either topic ("machine learning" vs. "health infrastructure") by the frequency of these terms.

When PCA is applied to DTMs, PCs represent patterns differentiating documents from one another, typically ordered by their prominence. This means that earlier PCs almost always capture more variance than later PCs. However, in some cases, PC1 may capture less variance than PC2 if certain pre-computation processing, such as centering, is not conducted (see Supplemental Information 1 for more details). Each detected pattern reflects both the contextual links between words and their level of importance within the texts. Words with component values of the greatest magnitude on each PC most strongly drive the pattern that the individual PC recognizes. For example, if "machine" and "healthcare" respectively have highly negative and highly positive values on a particular PC, then that PC detects the pattern that when "machine" appears in a text, "healthcare" appears less often. Another PC may detect a different pattern of variance, such as when some documents mention "deep learning" more often than others.

The projection values of the text corpus onto the PCs suggest what concept each document discusses and to what extent, relative to the average document within the corpus. Following the previous example, strongly negative projection values on the first PC, which would capture the data's most prominent patterns, indicate that the document mentions "machine" more often than the average and thus is more likely to focus on machine learning. In addition, projection values on the second PC could distinguish between machine learning documents by their focus, or lack thereof, on deep learning or other techniques. This approach enables us to delineate between different groups of abstracts by visualizing differences in their projections on the top PCs. After applying PCA to the DTMs of our abstracts, we identified which PCs successfully separated COVID-19 and non-COVID-19 abstracts.
We then used the component values with the largest magnitude on these PCs to interpret them.

Applying PCA to DTMs can be computationally expensive and sometimes infeasible because of their extremely high dimensionality (i.e., many different words are being counted). Furthermore, traditional implementations of PCA that rely on calculating covariance matrices 21 cannot be used on sparse matrices, and thus would not be applicable to our sparse DTMs, where instances in which words do not appear in specific documents are implied (Supplemental Information 9) but not recorded. For information on the modified procedure we used to mitigate these limitations, see Supplemental Information 1.

Topic Modeling

After establishing high-level trends using PCA, we used LDA, a topic modeling method, to add nuance to observed differences between COVID-19 and non-COVID-19 literature and to examine potential topics of interest. LDA is an unsupervised probabilistic algorithm that extracts hidden topics from large volumes of text 22. Once trained to discover words that separate documents into a predetermined number of topics, LDA can estimate the "mixture" of topics associated with each document. These mixtures suggest the dominant topic for a document, which is then used to assign the document to an overarching topic category. For example, LDA may separate documents into two topics, one on "machine learning" and another on "healthcare"; if a particular document's mixture is 60% "machine learning" and 40% "healthcare", it would assign that document to a "machine learning" topic category.

The predetermined number of topics is the most important hyperparameter in an LDA model, as models with a sub-optimal number of topics fail to summarize data efficiently 22,23. The number of topics can be determined by (1) identifying a model that has a low perplexity score and high coherence value when applied to an unseen dataset or (2) conducting a principled, manual assessment of the topics that arise. Perplexity is a statistical measure of how imperfectly the topic model fits a dataset, and a low perplexity score is generally considered to indicate better results 23. Similarly, topic models with high coherence values are considered to offer meaningful, interpretable topics 24,25. Thus, a model with a low perplexity score and a high coherence value is more desirable when choosing the optimal number of topics. Our initial implementation of LDA showed no optimal value for the number of topics, even as it approached ~100, potentially reflecting a relatively shallow yet broad pool of COVID-19 publications. We ultimately settled on 30 topics via manual review of topic models with different numbers of topics, identifying the model that satisfied two criteria: (1) topics that were relatively specific, focusing on a single subject matter, and (2) topics that were typically non-redundant with one another.
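As a concrete illustration of the workflow described above, the sketch below trains a small LDA model with gensim, computes perplexity and coherence, and assigns each abstract to its dominant topic. The toy token lists and the two-topic setting exist only so the example runs; the study itself used 30 topics and gensim's LdaMulticore, which shares this interface but parallelizes training.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Toy stand-in for the stemmed, tokenized abstracts (illustrative only).
tokenized_abstracts = [["patient", "hospit", "case", "pandem", "estim"],
                       ["protein", "cell", "bind", "express", "antibodi"],
                       ["case", "estim", "model", "transmiss", "pandem"],
                       ["cell", "protein", "structur", "bind", "viral"]]

dictionary = Dictionary(tokenized_abstracts)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_abstracts]

# Fit LDA with a fixed number of topics (the study settled on 30; 2 is used
# here only so the toy corpus can support the model).
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

# gensim reports a per-word likelihood bound; perplexity = 2 ** (-bound),
# and lower perplexity generally indicates a better fit.
perplexity = 2 ** (-lda.log_perplexity(corpus))

# Higher coherence values generally indicate more interpretable topics.
coherence = CoherenceModel(model=lda, texts=tokenized_abstracts,
                           dictionary=dictionary, coherence="c_v").get_coherence()

# Assign each abstract to the dominant topic in its estimated topic mixture.
dominant_topic = [max(lda.get_document_topics(bow), key=lambda t: t[1])[0]
                  for bow in corpus]
```

In practice, perplexity and coherence would be computed for candidate topic counts and combined with the manual review described above before settling on a final model.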
Our framework can be reused to identify literature gaps for other fields, including emerging pathogens. This would require researchers to preprocess their data to create a document-term matrix for literature on the emergent pathogen and on related but previously observed pathogens. Investigators can then conduct PCA and LDA to identify (1) a PC that separates abstracts in the two bodies of literature, (2) the terms that enable them to interpret the meaning of that PC, and (3) the distribution of literature in each of the two categories across several topics. PCA and LDA together can quickly identify concepts and topics that separate the two bodies of literature by tapping correlations between numerous different words at once across all the literature considered.

Principal components analysis (PCA) is typically completed in three key steps:
1. We center the data. In other words, we subtract the mean of each column from the original data, yielding a matrix where the mean of each column is zero.
2. We calculate the covariance matrix of this centered data. This represents all of the correlations between every column of data.
3. We perform an eigendecomposition of this covariance matrix. This yields what we consider to be the final products of dimensionality reduction: the principal components (eigenvectors) that represent key patterns in the data, along with information on how important they are (eigenvalues, which represent how much variance they capture).

The second and third steps of this process primarily rely on matrix multiplication, which has been highly optimized in most scientific computing packages, including Python's scipy and numpy, as well as sklearn's implementation of covariance-based PCA. But the first step, centering the data, relies on matrix subtraction, which has not had nearly as much technical development aimed at its optimization. This process thus typically requires a dense matrix, where every value, even if it is zero, is explicitly delineated. However, this is not computationally feasible with our DTMs, where we had tens of thousands of rows and columns, resulting in billions of elements that we could not store entirely in memory. We instead stored DTMs as sparse matrices, which are efficient because most elements are zero and only the values of nonzero elements are stored, but which are incompatible with column-wise addition and subtraction operations. Therefore, we calculated PCA instead by performing singular value decomposition (SVD) on the original sparse and uncentered DTM, which is possible with the sklearn Python package (Supplemental Information 8). While the SVD operation is equivalent to PCA when SVD is performed on centered data, it is worth noting that when data are uncentered, the first principal component (PC) outputted may capture less variance than the second PC because the first PC captures the mean of the data 1.

The complexity of PCA through SVD scales on the order of O(max(m, n) * min(m, n)^2), or O(m * n^2) when m is large, where m and n are the number of observations (documents) and features (unique words), respectively. With nearly 10,000 articles mentioning coronavirus-related terms in their abstracts and tens of thousands of unique words, SVD computations can take some time. We accelerated this step by using randomized SVD, which has an order of complexity of just O(m * n * log(k)), where k is the number of PCs computed. Indeed, this enabled our SVD calculations to proceed almost instantaneously. And while there is some randomness associated with results from randomized SVD, existing literature indicates that its output converges super-exponentially to the true output of SVD with additional iterations 2.
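The sketch below shows the kind of PCA-via-SVD computation described above, applied to a sparse, uncentered TF-IDF matrix with scikit-learn's TruncatedSVD and its randomized solver. The toy abstracts and parameter choices are illustrative assumptions, not the authors' exact pipeline.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the corpus of abstracts (illustrative only).
abstracts = [
    "patient outcomes during the pandemic were estimated from hospital case data",
    "the spike protein binds the host cell receptor and is expressed in vitro",
    "case counts were modeled to estimate transmission risk across countries",
    "viral protein expression and antibody binding were characterized in cell culture",
]

# Sparse TF-IDF document-term matrix; no dense centering step is required.
X_tfidf = TfidfVectorizer().fit_transform(abstracts)

# PCA via randomized SVD on the uncentered sparse matrix. Because the data are
# uncentered, the first component largely tracks the mean, so the substantive
# separation between document groups may appear on a later component.
svd = TruncatedSVD(n_components=3, algorithm="randomized", random_state=0)
projections = svd.fit_transform(X_tfidf)   # one row of PC projections per abstract
loadings = svd.components_                 # one row of term loadings per PC
explained = svd.explained_variance_ratio_  # share of variance captured by each PC
```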
In our analysis of CORD-19 data up through July 31, 2020, we found results very similar to our analysis conducted on data up through May 28, 2020: COVID-19 research has continued to focus heavily on CMF-based study, much more so than on laboratory-based study, especially when compared with research done on other coronaviruses. We found that PC2 strongly differentiates between COVID-19 and non-COVID-19 coronavirus abstracts. COVID-19 abstracts tend to have lower projection values on PC2, where lower projection values are associated with CMF-related terms, such as "hospit" and "case", while higher projection values are often associated with laboratory-based study. When compared with non-COVID-19 abstracts mentioning other coronaviruses, COVID-19 abstracts are much more likely to mention CMF-related terms and much less likely to mention terms related to laboratory-based study. In particular, we detected five major topic families: (1) clinical issues, including testing and diagnostics; (2) societies and outbreaks, covering responses to mitigate outbreaks and diseases' impact on society; (3) basic microbiological study; (4) general outbreak reporting; and (5) modeling of disease transmission. Basic microbiological research on SARS-CoV-2 continues to lag relative to CMF-related study. In both our PCA and LDA analyses, COVID-19 research has tended to focus more on CMF-based than on laboratory-based study. However, some trends in the LDA analysis are particularly noteworthy: general outbreak reporting has slowed relative to research on clinical issues. Research on the societal impact of outbreaks and the modeling of disease transmission has also rapidly accelerated relative to basic microbiological research. Study of public health responses to mitigate the pandemic continues to dominate the field.

The impact of the COVID-19 pandemic has led scientists to produce a vast quantity of research aimed at understanding, monitoring, and containing the disease; however, it remains unclear whether the research that has been produced to date sufficiently addresses existing knowledge gaps. We employ artificial intelligence (AI)/machine learning techniques to analyze this massive amount of information at scale. We find key discrepancies between the literature about COVID-19 and what we would expect based on research on other coronaviruses. These discrepancies, namely the lack of basic microbiological research, which is often expensive and time-consuming, may negatively impact efforts to mitigate the pandemic and raise questions regarding the research community's ability to quickly respond to future crises. Continually measuring what is being produced, both now and in the future, is key to making better resource allocation and goal prioritization decisions as a society moving forward.
• AI/machine learning techniques can analyze coronavirus research at massive scale
• COVID-19 research has so far focused on non-lab-based (e.g., observational) research
• COVID-19 lab-based / basic microbiological research is less prevalent than expected

DSML 4, Production: The algorithms used for dimensionality reduction and topic modeling have been applied to many domains, but the research community continues to rapidly develop alternatives to them.

An artificial intelligence/machine learning-based approach can be used to rapidly analyze COVID-19 literature and evaluate whether the research being produced at present addresses existing knowledge gaps. We observe that COVID-19 research has been primarily clinical-, modeling-, or field-based, and we observe significantly less laboratory-based research than expected when compared against other coronavirus (non-COVID-19) diseases. Our approach can be used to identify knowledge gaps and inform resource allocation decisions for research during future crises.

Prior to computing the DTMs, we removed all punctuation and numbers from the text and lemmatized the remaining words so that words with the same stem are consolidated. This reduced the noise in the dataset and enhanced the consistency between machine-derived metrics and their semantic meaning. We also leveraged existing natural-language-processing packages (Gensim) to identify potentially useful word pairs, or "bigrams", as terms to feed into the DTMs. However, these DTMs are ultimately extremely large. With 35,281 coronavirus abstracts and 69,667 unique words or bigrams identified, there are billions of elements in our matrices.

We wrote a package that simplifies the preprocessing of the data, which is available on the Github repository 1. It uses the nltk package's (version 3.4.5) Snowball stemmer to lemmatize words and the gensim package (version 3.8.0) to preprocess text, including the removal of punctuation, identification of bigrams, and creation of term frequency-inverse document frequency matrices (Supplemental Information 9). All cited software packages in this supplement are written in Python. To implement dimensionality reduction in our pipeline, we used the scikit-learn package, version 0.23.1. We specifically used its TruncatedSVD functionality, which enables the use of dimensionality reduction on sparse matrices like term frequency-inverse document frequency matrices (Supplemental Information 9) and is analogous to principal components analysis. We used the gensim package (version 3.8.0) to conduct topic modeling with its LdaMulticore functionality. All plots were created using matplotlib version 3.1.3.
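The preprocessing just described can be sketched roughly as follows, using nltk's Snowball stemmer, gensim's Phrases/Phraser for bigram detection, and gensim's TfidfModel for the term frequency-inverse document frequency weighting. The function and variable names here are illustrative, not the interface of the authors' package.

```python
import re

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models.phrases import Phrases, Phraser
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def preprocess(abstract):
    """Strip punctuation and numbers, lowercase, and stem each remaining word."""
    text = re.sub(r"[^a-z\s]", " ", abstract.lower())
    return [stemmer.stem(token) for token in text.split()]

abstracts = ["Patients were hospitalized during the pandemic.",   # toy inputs
             "The spike protein binds host cell receptors."]
tokens = [preprocess(a) for a in abstracts]

# Detect frequent word pairs ("bigrams") and merge them into single terms.
bigrams = Phraser(Phrases(tokens, min_count=1, threshold=1))
tokens = [bigrams[doc] for doc in tokens]

# Build the sparse term frequency-inverse document frequency representation.
dictionary = Dictionary(tokens)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokens]
tfidf = TfidfModel(bow_corpus)
dtm_tfidf = [tfidf[doc] for doc in bow_corpus]   # sparse (term id, weight) pairs per abstract
```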
Search terms (values) used to filter abstracts, with description and rationale:

Coronavirus-related abstracts
• Case sensitive: "MERS"; not case sensitive: "covid-19", "coronavirus", "corona virus", "2019-ncov", "sars-cov", "mers-cov", "severe acute respiratory syndrome", "middle east respiratory syndrome"
• The presence of these search terms in an abstract indicated that the abstract was relevant to the study; mentioning these terms in an abstract made it more likely that a coronavirus was central to the research

COVID-19-related abstracts
• Case sensitive: "COVID-19", "COVID", "2019-nCoV", "SARS-CoV-2"
• The presence of these terms in an abstract indicated that it was relevant to COVID-19

MERS-CoV-related abstracts
• Case sensitive: "MERS"; not case sensitive: "middle east respiratory"
• The presence of these terms in an abstract indicated that it was relevant to MERS-CoV

SARS-CoV-related abstracts
• Case sensitive: "SARS"; not case sensitive: "severe acute respiratory syndrome"
• The presence of these terms in an abstract indicated that it was relevant to SARS-CoV
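A rough sketch of how the filtering rules in the table above could be applied to a table of abstracts is shown below, using pandas; the toy abstracts, column name, and helper function are assumptions for illustration rather than the authors' implementation.

```python
import pandas as pd

# Toy stand-in for the CORD-19 abstract table (column name is illustrative).
df = pd.DataFrame({"abstract": [
    "Clinical outcomes of COVID-19 patients in a designated hospital.",
    "Structural analysis of the MERS-CoV spike glycoprotein.",
    "A review of influenza vaccination programs.",
]})

# Terms from the table above; general coronavirus terms are matched
# case-insensitively, while "MERS" and the COVID-19 terms are case sensitive.
CORONAVIRUS_TERMS = ["covid-19", "coronavirus", "corona virus", "2019-ncov",
                     "sars-cov", "mers-cov", "severe acute respiratory syndrome",
                     "middle east respiratory syndrome"]
COVID_TERMS = ["COVID-19", "COVID", "2019-nCoV", "SARS-CoV-2"]

def mentions_any(text, terms, case_sensitive=False):
    """True if the abstract mentions any of the given search terms."""
    haystack = text if case_sensitive else text.lower()
    return any(term in haystack for term in terms)

is_corona = df["abstract"].apply(
    lambda t: mentions_any(t, CORONAVIRUS_TERMS) or mentions_any(t, ["MERS"], True))
corona_df = df[is_corona]                      # coronavirus-specific abstracts
is_covid = corona_df["abstract"].apply(
    lambda t: mentions_any(t, COVID_TERMS, case_sensitive=True))
covid_df = corona_df[is_covid]                 # COVID-19-related subset
```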