key: cord-1047389-ovv34qxm
authors: Zhao, Ying; Zhou, Charles C.
title: Applying Lexical Link Analysis to Discover Insights from Public Information on COVID-19
date: 2020-05-06
journal: bioRxiv
DOI: 10.1101/2020.05.06.079798
sha: f74c00976f7b3d2500f57a0bacf3c4e127c718c6
doc_id: 1047389
cord_uid: ovv34qxm

SARS-CoV-2, a deadly novel virus, has caused a worldwide pandemic and a drastic loss of human lives and economic activity. An open data set called the COVID-19 Open Research Dataset, or CORD-19, contains a large set of full-text scientific literature on SARS-CoV-2. The Next Strain database contains SARS-CoV-2 viral genomes collected since 12/3/2019. We applied a unique information mining method named lexical link analysis (LLA) to answer the call to action and help the science community answer high-priority scientific questions related to SARS-CoV-2. We first text-mined CORD-19. We then data-mined the Next Strain database. Finally, we linked the two databases. The linked databases and information can be used to discover insights and help the research community address high-priority questions related to the genetics, tests, and prevention of SARS-CoV-2.

Significance Statement

In this paper, we show how to apply a unique information mining method, lexical link analysis (LLA), to link unstructured (CORD-19) and structured (Next Strain) data sets to relevant publications and to integrate text and data mining into a single platform, so that insights can be discovered, visualized, and validated to answer the high-priority questions of genetics, incubation, treatment, symptoms, and prevention of COVID-19.

priority questions of genetics, incubation, treatment, symptoms, and prevention of COVID-19?

2. How can we extract valuable information, such as information with authority, insight, and innovation, for combating SARS-CoV-2? What information is authoritative, and what information is insightful?

3.
What are the timelines of themes and topics across all the research literature?

We show that LLA integrates text and data mining into a single platform so that insights from data can be visualized and validated to answer the high-priority questions.

applications. Emerging and anomalous information is important when looking for insights and innovation. In this paper, we show how to apply the game-theoretic framework of lexical link analysis (LLA) to discover and rank high-value information from unstructured and structured data on SARS-CoV-2.

In LLA, a complex system can be expressed as a list of attributes or word features with specific vocabularies or lexicon terms that describe its characteristics. LLA is first of all a data-driven text analysis method. Fig. 1 shows an example of extracting and learning word

Bi-gram also allows LLA to be extended to structured data (15), including meta-data such as that of CORD-19, where a word is an attribute combined with its possible values. LLA is related

However, LLA is unique in that we consider that anomalous information (word features) might be more interesting. Newman (11, 12) illustrated community detection algorithms in terms of a quality function, the "modularity" measure for a community (cluster), optimized using a dendrogram-like greedy

America, and 15 cases in Australia.

• Among the four cases of Clade A2, A2a in China, three were collected from 1/28/2020 to 2/6/2020; they were submitted

based on these themes. Fig.
8 shows a popular theme of word pairs that appeared in both data sources. Fig. 9, Fig. 10, and Fig. 11 show examples of emerging themes that appeared in both data sources. Fig. 12 shows an example of an anomalous theme that appeared in both data sources. Emerging themes are interesting topics for researchers to drill down into and discover information in CORD-19 pertinent to the high-priority questions on SARS-CoV-2 genetics (Group 423(E)), tests (Group 50(E)), and prevention (Group 42(E)).

Conclusion and Challenge to Future. We applied a unique information mining method, lexical link analysis, to conduct a preliminary study in response to the call to action and help the science

Finding community structure in networks using the eigenvectors of matrices
WordNet: a lexical database for English
Latent Dirichlet allocation
System and method for knowledge pattern search from networked agents
Using latent semantic analysis to improve information retrieval
Probabilistic latent semantic analysis
On spectral clustering: analysis and an algorithm
Theory and use case of game-theoretic lexical link analysis
A method for evaluating modern systems of automatic text summarization
Organizing information: Principles of data base and retrieval systems. Orlando
Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews
Enriching the knowledge sources used in a maximum entropy part-of-speech tagger
New models in probabilistic information retrieval. London: British Library
Term-weighting approaches in automatic text retrieval
Biological named entity recognition using n-grams and classification methods
Using Bigrams in Text Categorization
SciBERT: A Pretrained Language Model for Scientific Text
BERT: Pre-training of deep bidirectional transformers for language understanding
Value Information from Big Data