key: cord-0613311-ns628u21
authors: Ye, Yanfang; Hou, Shifu; Fan, Yujie; Qian, Yiyue; Zhang, Yiming; Sun, Shiyu; Peng, Qian; Laparo, Kenneth
title: $alpha$-Satellite: An AI-driven System and Benchmark Datasets for Hierarchical Community-level Risk Assessment to Help Combat COVID-19
date: 2020-03-27
journal: nan
DOI: nan
sha: e4b143a8854323d466dfd9ca56ad8542b9e0104f
doc_id: 613311
cord_uid: ns628u21

The novel coronavirus and its deadly outbreak have posed grand challenges to human society: as of March 26, 2020, there have been 85,377 confirmed cases and 1,293 reported deaths in the United States; and the World Health Organization (WHO) characterized coronavirus disease (COVID-19), which has infected more than 531,000 people with more than 24,000 deaths in at least 171 countries, as a global pandemic. A growing number of areas reporting local sub-national community transmission would represent a significant turn for the worse in the battle against the novel coronavirus, which points to an urgent need for expanded surveillance so we can better understand the spread of COVID-19 and thus better respond with actionable strategies for community mitigation. By advancing capabilities of artificial intelligence (AI) and leveraging the large-scale and real-time data generated from heterogeneous sources (e.g., disease related data from official public health organizations, demographic data, mobility data, and user generated data from social media), in this work, we propose and develop an AI-driven system (named $alpha$-Satellite), as an initial offering, to provide hierarchical community-level risk assessment to assist with the development of strategies for combating the fast evolving COVID-19 pandemic. More specifically, given a specific location (either user input or automatic positioning), the developed system will automatically provide risk indexes associated with it in a hierarchical manner (e.g., state, county, city, specific location) to enable individuals to select appropriate actions for protection while minimizing disruptions to daily life to the extent possible. The developed system and the generated benchmark datasets have been made publicly accessible through our website. The system description and disclaimer are also available on our website.

Coronavirus disease (COVID-19) [34] is an infectious disease caused by a new virus that had not been previously identified in humans; this respiratory illness (with symptoms such as cough, fever and pneumonia) was first identified during an investigation into an outbreak in Wuhan, China in December 2019 and is now rapidly spreading in the U.S. and globally. The novel coronavirus and its deadly outbreak have posed grand challenges to human society. As of March 26, 2020, there have been 85,377 confirmed cases and 1,293 reported deaths in the U.S. (Figure 1); and the WHO characterized COVID-19, which has infected more than 531,000 people with more than 24,000 deaths in at least 171 countries, as a global pandemic. It is believed that the novel virus which causes COVID-19 emerged from an animal source, but it is now rapidly spreading from person to person through various forms of contact. According to the Centers for Disease Control and Prevention (CDC) [4], the coronavirus seems to be spreading easily and sustainably in the community, i.e., community transmission, which means people have been infected with the virus in an area, including some who are not sure how or where they became infected. An example of community transmission that caused the outbreak of COVID-19 in King County, Washington State (WA) is shown in Figure 2. The challenge with community transmission is that carriers are often asymptomatic and unaware that they are infected, and through their movements within the community they spread the disease. According to the CDC, before a vaccine or drug becomes widely available (which is the case for COVID-19 so far), community mitigation, a set of actions that persons and communities can take to help slow the spread of respiratory virus infections, is the most readily available intervention to help slow transmission of the virus in communities [5].
A growing number of areas reporting local sub-national community transmission would represent a significant turn for the worse in the battle against the novel coronavirus, which points to an urgent need for expanded surveillance so we can better understand the spread of COVID-19 and thus better respond with actionable strategies for community mitigation. Unlike the 1918 influenza pandemic [2], where the global scope and devastating impacts were only determined well after the fact, COVID-19 history is being written daily, if not hourly, and if the right types of data can be acquired and analyzed there is the potential to improve self-awareness of the risk to the population and develop proactive (rather than reactive) interventions that can halt the exponential growth in the disease that is currently being observed. With the opportunity of real-time surveillance comes a challenge: the available data are uncertain and incomplete, while we need to provide mitigation strategies objectively, with caution and rigor (i.e., enable people to select appropriate actions to protect themselves from the increased risk of COVID-19 while minimizing disruptions to daily life to the extent possible).

To address the above challenge, leveraging our long-term and successful experiences in combating and mitigating widespread malware attacks using AI-driven techniques [7, 8, 10, 11, 15, 16, 20, 37-45], in this work, we propose to design and develop an AI-driven system to provide hierarchical community-level risk assessment, as a first attempt to help combat the fast evolving COVID-19 pandemic, by using the large-scale and real-time data generated from heterogeneous sources. In our developed system (named α-Satellite), we first develop a set of tools to collect and preprocess the large-scale and real-time pandemic related data from multiple sources, including disease related data from official public health organizations, demographic data, mobility data, and user generated data from social media; and then we devise advanced AI-driven techniques to provide hierarchical community-level risk assessment to enable actionable strategies for community mitigation. More specifically, given a specific location (either user input or automatic positioning), the developed system will automatically provide risk indexes associated with it in a hierarchical manner (e.g., state, county, city, specific location) to enable people to select appropriate actions for protection while minimizing disruptions to daily life.

The framework of our proposed and developed system is shown in Figure 3. In α-Satellite, (1) we first construct an attributed heterogeneous information network (AHIN) to model the collected large-scale and real-time pandemic related data in a comprehensive way; (2) based on the constructed AHIN, to address the challenge of limited data that might be available for learning (e.g., social media data to learn public perceptions towards COVID-19 in a given area might not be sufficient), we then exploit conditional generative adversarial nets (cGANs) to learn the public perceptions towards COVID-19 in each given area; and finally (3) we utilize meta-path based schemes to model both vertical and horizontal information associated with a given area, and devise a novel heterogeneous graph auto-encoder (GAE) to aggregate information from its neighborhood areas to estimate the risk of the given area in a hierarchical manner. The developed system α-Satellite and the generated benchmark datasets have been made publicly accessible through our website.

There have been several works on using AI and machine learning techniques to help combat COVID-19: in the biomedical domain, [6, 24, 28, 32, 35] use deep learning methods for COVID-19 pneumonia diagnosis and genome study, while [26, 36] develop learning-based models to predict severity and survival for patients. Another research direction is to utilize publicly accessible data to help estimate infection cases or forecast the COVID-19 outbreak [14, 17, 18, 22, 25, 27, 46]. However, most of these existing works mainly focus on Wuhan, China; studies using computational models to combat COVID-19 in the U.S. are scarce, and so far there has been no work on community-level risk assessment to assist with community mitigation. To meet this urgent need and to bridge the research gap, in this work, by advancing capabilities of AI and leveraging the large-scale and real-time data generated from heterogeneous sources, we propose and develop an AI-driven system, named α-Satellite, to provide hierarchical community-level risk assessment as a first attempt to help combat the deadly and fast evolving COVID-19 pandemic.

In this section, we will introduce our proposed method integrated in the system of α-Satellite to automatically provide hierarchical community-level risk assessment related to COVID-19 in detail.

Realizing the true potential of real-time surveillance requires identifying the proper data sources, based on which we can devise models to extract meaningful and actionable information for community mitigation. Since relying on a single data source for estimation and prediction often results in unsatisfactory performance, we develop a set of crawling tools and preprocessing methods to collect and parse the large-scale and real-time pandemic related data from multiple sources, which include the following.

Figure 3: System architecture of α-Satellite (i.e., an AI-driven system for hierarchical community-level risk assessment). In α-Satellite, (a) we first construct an AHIN to model the collected large-scale and real-time pandemic related data in a comprehensive way; (b) based on the constructed AHIN, we then exploit the cGANs to learn the public perception towards COVID-19 in a given area; (c) we finally utilize meta-path based schemes to model both vertical and horizontal information associated with a given area, and devise a heterogeneous GAE to aggregate information from its neighborhood areas to estimate the risk of the given area in a hierarchical manner.

• Disease related data. We collect the up-to-date county-based coronavirus related data, including the numbers of confirmed cases, new cases, deaths and the fatality rate, from i) official public health organizations such as WHO, CDC, and county government websites, and ii) digital media with real-time updates of COVID-19 (e.g., 1point3acres). The collected up-to-date county-based COVID-19 related statistical data can be an important element for risk assessment of an associated area.
• Demographic data. The United States Census Bureau provides demographic data including basic population, business, and geography statistics for all states and counties, and for cities and towns with more than 5,000 people. The demographic information will contribute to the risk assessment of an associated area: for example, as older adults may be at higher risk for more serious complications from COVID-19 [3, 30], the age distribution of a given area can be considered as an important input. In this work, given a specific area, we mainly consider the associated demographic data including the estimated population, population density (e.g., number of people per square mile), and age and gender distributions.
• Mobility data. Given a specific area (either user input or automatic positioning), a mobility measure that estimates how busy the area is in terms of traffic density will be obtained from location service providers (i.e., Google Maps).
• User generated data from social media. As users in social media are likely to discuss and share their experiences of COVID-19, the data from social media may contribute complementary knowledge such as public perceptions towards COVID-19 in the area they associate with. In this work, we initialize our efforts with the focus on Reddit, as it provides the platform for scientific discussion of dynamic policies, announcements, symptoms and events of COVID-19. In particular, we consider i) three subreddits with general discussion (i.e., r/Coronavirus, r/COVID19 and r/CoronavirusUS); ii) four region-based subreddits, which are r/CoronavirusMidwest, r/CoronavirusSouth, r/CoronavirusSouthEast and r/CoronavirusWest; and iii) 48 state-based subreddits (i.e., Washington, D.C. and 47 states).
To analyze public perceptions towards COVID-19 for a given area (note that all users are anonymized for analysis using hash values of usernames), we first exploit the Stanford Named Entity Recognizer [12] to extract the location-based information (e.g., county, city), and then utilize tools such as NLTK [1] to conduct sentiment analysis (i.e., positive, neutral or negative). More specifically, positive denotes being well aware of COVID-19, while negative indicates being less aware of COVID-19. For example, from the analysis of a post by a user (with hash value of "CF***6") in the subreddit r/CoronaVirusPA on March 14, 2020: "I live in Montgomery County, PA and everyone here is acting like there's nothing going on.", the location-related information of Montgomery County and the state of Pennsylvania (i.e., PA) can be extracted, and the user's perception towards COVID-19 in Montgomery County, PA can be learned (i.e., negative, indicating less aware of COVID-19). Such automatically extracted knowledge will be incorporated into the risk assessment of the related area; meanwhile, it can also provide important information to help inform and educate about the science of coronavirus transmission and prevention.
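The anonymization step mentioned above (hash values of usernames) can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual code: the choice of SHA-256, the optional salt, the example username, and the `masked` display helper that mimics the "CF***6" style are all hypothetical.

```python
import hashlib


def anonymize_username(username: str, salt: str = "") -> str:
    """Map a username to an uppercase SHA-256 hex digest (assumed hash choice)."""
    return hashlib.sha256((salt + username).encode("utf-8")).hexdigest().upper()


def masked(digest: str) -> str:
    """Shorten a digest for display in the style of 'CF***6' (assumed format)."""
    return f"{digest[:2]}***{digest[-1]}"


# Hypothetical post record: only the hashed user id is kept for analysis.
post = {
    "user": anonymize_username("example_user"),
    "subreddit": "r/CoronaVirusPA",
    "text": "I live in Montgomery County, PA and everyone here is acting "
            "like there's nothing going on.",
}
print(masked(post["user"]))
```

Hashing keeps posts by the same user linkable for analysis without storing the username itself; a secret salt would make the mapping harder to reverse by guessing usernames.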

To comprehensively describe a given area for its risk assessment related to COVID-19, based on the data collected from multiple sources above, we consider and extract higher-level semantics as well as social and behavioral information within the communities.

Attributed Features. Based on the collected data above, we further extract the corresponding attributed features.

• A1: disease related feature. For a given area, its related COVID-19 pandemic data will be extracted, including the numbers of confirmed cases, new cases, deaths and the fatality rate, which are represented by a numeric feature vector $a_1$. For example, as of March 22, 2020, Cuyahoga County in Ohio (OH) has had 125 confirmed cases, 33 new cases, 1 death and a 0.8% fatality rate, which can be represented as $a_1 = \langle 125, 33, 1, 0.008 \rangle$.
• A2: demographic feature. Given a specific area, we obtain its associated city's (or town's) demographic data from the United States Census Bureau, including the estimated population, population density (i.e., number of people per square mile), age distribution (i.e., percentage of people over 65 years old) and gender distribution (i.e., percentage of females). For example, to assist with the risk assessment of the area of Euclid Ave in Cleveland, OH, the obtained demographic data associated with it are: Cleveland with a population of 383793, a population density of 5107, 13.5% of people over 65 years old, and 51.8% females, which will be represented as $a_2 = \langle 383793, 5107, 0.135, 0.518 \rangle$.
• A3: mobility feature. Given a specific area, a mobility measure that estimates how busy the area is in terms of traffic density will be obtained from Google Maps, which will be represented by five degree levels (i.e., [1, 5]; the larger the number, the busier).
• A4: representation of public perception. After performing the automatic sentiment analysis based on the collected posts associated with a given area from Reddit, the public perceptions towards COVID-19 in this area will be represented by a normalized value (i.e., in [0,1]) indicating the awareness of COVID-19 (i.e., the larger the value, the more aware). For the previous example of the Reddit post "I live in Montgomery County, PA and everyone here is acting like there's nothing going on.", a related perception towards COVID-19 in Montgomery County, PA will be formulated as a numeric value of 0.220, denoting that people in this area were less aware of COVID-19 on March 14, 2020.

After extracting the above features, we concatenate them as a normalized attributed feature vector $A$ attached to each given area for representation, i.e., $A = A_1 \oplus A_2 \oplus A_3 \oplus A_4$. Note that we zero-pad the elements for which the data are not available.

Relation-based Features. Besides the above extracted attributed features, we also consider the rich relations among different areas.

• R1: administrative affiliation. According to the severity of COVID-19, available resources and impacts on residents, different states may have different policies, actionable strategies and orders in response to COVID-19. Therefore, given an area, we extract its administrative affiliation in a hierarchical manner. In particular, we acquire the state-include-county and county-include-city relations from City-to-County Finder.
• R2: geospatial relation. We also consider the geospatial relations between a given area and its neighborhood areas. More specifically, given an area, we retain its k-nearest neighbors at the same hierarchical level by calculating the Euclidean distances based on their global positioning system (GPS) coordinates obtained from Google Maps and Wikipedia.

Definition 1. AHIN. An AHIN is defined as a graph $G = (\mathcal{V}, \mathcal{E}, \mathcal{A})$ with an entity type mapping $\phi: \mathcal{V} \to \mathcal{T}$ and a relation type mapping $\psi: \mathcal{E} \to \mathcal{R}$, where $\mathcal{V} = \bigcup_{i=1}^{m} X_i$ denotes the entity set and $\mathcal{E}$ is the relation set, $\mathcal{T}$ denotes the entity type set and $\mathcal{R}$ is the relation type set, $\mathcal{A} = \bigcup_{i=1}^{m} A_i$, and $|\mathcal{T}| + |\mathcal{R}| > 2$. Network Schema [21]: The network schema of an AHIN $G$ is a meta-template for $G$, denoted as a directed graph $\mathcal{T}_G = (\mathcal{T}, \mathcal{R})$ with nodes as entity types from $\mathcal{T}$ and edges as relation types from $\mathcal{R}$.
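The two relation types can be illustrated with a small typed edge list. The sketch below is only an assumption-laden illustration: the coordinates are approximate values chosen for the example, the edge labels `"R1"` and `"R2"` and all function names are hypothetical, and Euclidean distance is applied directly to latitude and longitude as the text describes.

```python
import math

# Approximate (latitude, longitude) pairs, for illustration only.
CITY_COORDS = {
    "Cleveland": (41.50, -81.69),
    "Akron": (41.08, -81.52),
    "Columbus": (39.96, -83.00),
    "Cincinnati": (39.10, -84.51),
}

# R1: state-include-county and county-include-city relations.
STATE_INCLUDES = {"OH": ["Cuyahoga", "Summit", "Franklin", "Hamilton"]}
COUNTY_INCLUDES = {
    "Cuyahoga": ["Cleveland"],
    "Summit": ["Akron"],
    "Franklin": ["Columbus"],
    "Hamilton": ["Cincinnati"],
}


def k_nearest(name, coords, k):
    """Return the k areas closest to `name` by Euclidean distance on coordinates."""
    origin = coords[name]
    ranked = sorted(
        (math.dist(origin, xy), other)
        for other, xy in coords.items()
        if other != name
    )
    return [other for _, other in ranked[:k]]


def build_edges(k=2):
    """Collect typed edges (source, relation, target) for R1 and R2."""
    edges = []
    for state, counties in STATE_INCLUDES.items():
        edges += [(state, "R1", county) for county in counties]
    for county, cities in COUNTY_INCLUDES.items():
        edges += [(county, "R1", city) for city in cities]
    for city in CITY_COORDS:
        edges += [(city, "R2", near) for near in k_nearest(city, CITY_COORDS, k)]
    return edges


edges = build_edges(k=2)
print(k_nearest("Cleveland", CITY_COORDS, 2))
```

Note that the k-nearest-neighbor relation is not symmetric in general: one city can list another as a neighbor without the reverse holding, so the R2 edges are stored as directed pairs here.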

In this work, we have four types of entities (i.e., nation, state, county and city, $|\mathcal{T}| = 4$), two types of relations (i.e., R1 and R2, $|\mathcal{R}| = 2$), and each entity is attached with an attributed feature vector as described above. Based on the definitions, the network schema of the AHIN in our case is shown in Figure 4.
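The attributed feature vector attached to each entity can be sketched as a concatenation of the four feature blocks with zero-padding for unavailable data. This is a hedged illustration: the text does not state how the vector is normalized, so the column-wise min-max scaling below is an assumption, as are the function names, the mobility level used for Cleveland, and the second (entirely hypothetical) area.

```python
# Number of elements in each feature block A1..A4 as described in the text.
BLOCK_SIZES = {"a1": 4, "a2": 4, "a3": 1, "a4": 1}


def concat_features(blocks):
    """Concatenate A1..A4, zero-padding any block whose data are unavailable."""
    vector = []
    for name, size in BLOCK_SIZES.items():
        values = blocks.get(name)
        if values is None:
            values = [0.0] * size
        if len(values) != size:
            raise ValueError(f"{name} must have {size} elements")
        vector.extend(float(x) for x in values)
    return vector


def min_max_normalize(vectors):
    """Scale each element to [0, 1] across areas (assumed normalization)."""
    columns = list(zip(*vectors))
    bounds = [(min(col), max(col)) for col in columns]
    return [
        [0.0 if hi == lo else (x - lo) / (hi - lo) for x, (lo, hi) in zip(vec, bounds)]
        for vec in vectors
    ]


# a1 and a2 use the values quoted in the text (Cuyahoga County, Cleveland).
cleveland = concat_features({
    "a1": [125, 33, 1, 0.008],
    "a2": [383793, 5107, 0.135, 0.518],
    "a3": [3],    # illustrative mobility level
    "a4": None,   # no Reddit data available: zero-padded
})
# A second, entirely hypothetical area so that min-max scaling has a range.
other = concat_features({
    "a1": [25, 3, 0, 0.0],
    "a2": [100000, 2000, 0.2, 0.5],
    "a3": [1],
    "a4": [0.4],
})
normalized = min_max_normalize([cleveland, other])
print(cleveland)
```

Zero-padding keeps every entity's vector the same length, which is what allows a single relation matrix per relation type to be applied to any pair of nodes later on.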

Although the constructed AHIN can model the complex and rich relations among different entities attached with attributed features, a challenge remains: some attributed features attached to the entities in the AHIN may have missing values because only limited data are available for learning. More specifically, given an area, there may not be sufficient social media data (i.e., Reddit data in this work) to learn the public perceptions towards COVID-19 in this area. For example, for the state of Montana, as of March 22, 2020, in its corresponding subreddit r/CoronavirusMontana, there have been only 12 posts by seven users discussing the virus. To address this issue, we propose to exploit cGANs [23] for synthetic (virtual) social media user data generation for public perception learning to enrich the AHIN.

Different from traditional GANs [13], a cGAN is a conditional model extended from GANs, where both the generator and discriminator are conditioned on some extra information. In our case, we propose to exploit a cGAN to generate synthetic posts for those areas where the data are not available. In our designed cGAN, given an area where Reddit data are not available, the condition consists of three parts: the disease related feature vector of this area $a_1$, its related demographic feature vector $a_2$, and its GPS coordinate denoted as $o$. As shown in Figure 5, the generator in the devised cGAN takes the prior noise $p_z(z)$ together with the conditions $a_1$, $a_2$ and $o$ as inputs to generate synthetic posts represented by latent vectors; in the discriminator, real post representations obtained by using doc2vec [19] or generated synthetic post latent vectors, along with $a_1$, $a_2$ and $o$, are fed to a discriminative function. Both the generator and the discriminator can be non-linear mapping functions, such as a multi-layer perceptron (MLP). The generator and discriminator play an adversarial game formulated as the following minimax problem:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x \mid a_1, a_2, o)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z \mid a_1, a_2, o))\right)\right]. \quad (1)$$

The generator and discriminator are trained simultaneously: the parameters of the generator are adjusted to minimize $\log(1 - D(G(z \mid a_1, a_2, o)))$, while the parameters of the discriminator are adjusted to maximize the probability of assigning the correct labels to both training examples and generated samples. After applying the cGAN for synthetic post latent vector generation, we further exploit a deep neural network (DNN) to learn the public perceptions towards COVID-19 in this area. More specifically, we first use doc2vec to obtain the representations of real posts collected from Reddit and feed them to train the DNN model; then, given a generated synthetic post latent vector, we use the trained model to obtain its related perception (i.e., awareness of COVID-19).
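To make the adversarial objective concrete, the sketch below assembles the condition vector and evaluates a mini-batch estimate of the cGAN value function from given discriminator outputs. It is an illustration under stated assumptions, not the authors' implementation: no networks are trained, the function names and the noise dimension are hypothetical, and the condition values reuse numbers quoted earlier in the text purely as an example.

```python
import math
import random


def condition(a1, a2, o):
    """Concatenate the three condition parts: a1, a2 and GPS coordinate o."""
    return list(a1) + list(a2) + list(o)


def generator_input(noise_dim, cond, rng):
    """Draw prior noise z ~ N(0, 1) and append the condition vector."""
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + cond


def cgan_value(d_real, d_fake):
    """Mini-batch estimate of V(D, G).

    d_real: discriminator outputs D(x | a1, a2, o) on real post embeddings.
    d_fake: discriminator outputs D(G(z | a1, a2, o)) on generated vectors.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Illustrative condition: a1, a2 from the text, approximate coordinates for o.
cond = condition([125, 33, 1, 0.008], [383793, 5107, 0.135, 0.518], [41.50, -81.69])
g_in = generator_input(noise_dim=8, cond=cond, rng=random.Random(0))

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided = cgan_value([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates the two well attains a larger value.
confident = cgan_value([0.9, 0.8], [0.1, 0.2])
print(undecided, confident)
```

The discriminator pushes this value up while the generator pushes the second term down; when the discriminator outputs 0.5 on every input the value equals $-\log 4$.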

Meta-path Expression. To assist with the risk assessment of a given area related to the fast evolving COVID-19, considering only its vertical information (e.g., its related city, county or state's responses, strategies and policies) might not be sufficient; the horizontal information (i.e., information from its neighborhood areas) is also an important input. To comprehensively integrate both vertical and horizontal information, we propose to exploit the concept of meta-path [29] to formulate the relatedness among different areas in the constructed AHIN.

Definition 2. Meta-path. A meta-path $\mathcal{P}$ is a path defined on the network schema $\mathcal{T}_G = (\mathcal{T}, \mathcal{R})$, and is denoted in the form of $T_1 \xrightarrow{R_1} T_2 \xrightarrow{R_2} \cdots \xrightarrow{R_L} T_{L+1}$, which defines a composite relation $R = R_1 \cdot R_2 \cdot \ldots \cdot R_L$ between types $T_1$ and $T_{L+1}$, where $\cdot$ denotes the relation composition operator, and $L$ is the length of $\mathcal{P}$.

For example, the city-level meta-path scheme P1 denotes that, to assess the risk of a specific city, we consider not only the city itself, but also the information from its related county and nearby cities.

Heterogeneous Graph Auto-encoder. Given a node (i.e., area) in the constructed AHIN, guided by its corresponding meta-path scheme (i.e., city level guided by P1, county level guided by P2, and state level guided by P3), we propose a heterogeneous graph auto-encoder (GAE) model to aggregate the information propagated from its neighborhood nodes. The designed heterogeneous GAE model consists of an encoder and a decoder: the encoder aims at encoding meta-path based propagation to a latent representation, and the decoder reconstructs the topological information from the representation.

Encoder. We here exploit an attentive mechanism [9, 31, 33] to devise the encoder: it first searches the meta-path based neighbors $N(v)$ for each node $v$, and then each node attentively aggregates information from its neighbors. To learn the importance of the information from neighborhood nodes, we first represent each relation type $r \in \mathcal{R}$ in the constructed AHIN by a matrix $R_r \in \mathbb{R}^{d_a \times d_a}$, where $d_a$ denotes the dimension of the attributed feature vector; the attentive weight $\beta$ of node $u$ (a neighbor of $v$) then indicates the relevance of these two nodes measured in the space of $R_r$, that is,

where $a_v$ and $a_u$ are the attributed feature vectors attached to nodes $v$ and $u$. We further normalize the weights across all the neighbors of $v$ by applying the softmax function:

Then, the neighbors' representations can be formulated as the linear combination:

where the weight $\beta_r(v, u)$ indicates the information propagated from $u$ to $v$ in terms of relation $r$. Finally, we aggregate $v$'s representation $a_v$ and its neighbors' representations $a_{N(v)}$ by:
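The encoder steps described in prose (score each neighbor in the space of $R_r$, normalize with softmax, take the weighted linear combination) can be sketched as follows. The exact equations are not reproduced in this text, so the bilinear score $a_v^\top R_r a_u$ used here is an assumption, as are the function names and the toy vectors; the sketch also handles a single relation type and stops before the final step that combines $a_v$ with $a_{N(v)}$.

```python
import math


def bilinear_score(a_v, R_r, a_u):
    """Unnormalized relevance of u to v: a_v^T R_r a_u (assumed form)."""
    return sum(
        a_v[i] * R_r[i][j] * a_u[j]
        for i in range(len(a_v))
        for j in range(len(a_u))
    )


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_neighbors(a_v, neighbors, R_r):
    """Return attention weights and the weighted sum of neighbor vectors."""
    scores = [bilinear_score(a_v, R_r, a_u) for a_u in neighbors]
    betas = softmax(scores)
    a_neigh = [
        sum(beta * a_u[k] for beta, a_u in zip(betas, neighbors))
        for k in range(len(a_v))
    ]
    return betas, a_neigh


# Toy example: two-dimensional features and an identity relation matrix.
a_v = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
betas, a_neigh = aggregate_neighbors(a_v, neighbors, identity)
print(betas, a_neigh)
```

With an identity relation matrix the score reduces to a dot product, so the neighbor whose features align with those of $v$ receives the larger weight.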

Decoder. The decoder is used to reconstruct the network topological structure. More specifically, based on the latent representations generated from the encoder, the decoder is trained to predict whether there is a link between two nodes in the constructed AHIN. To this end, leveraging latent representations learned from the heterogeneous GAE, the risk index of a given area is calculated as:

where $\gamma_i$ is the adjustable parameter that can be specified by human experts, indicating the importance of the $i$-th element in $a_v$ (e.g., the number of confirmed cases, population density, age distribution, mobility measure, etc.) in the rapidly changing situation.
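The role of the expert-specified weights $\gamma_i$ can be illustrated with a simple weighted combination. The risk index formula itself is not reproduced in this text, so the form below is an assumption: a weighted average of elements already scaled to [0, 1], which keeps the result in [0, 1] as the reported risk indexes are. The function name and the example values are hypothetical.

```python
def risk_index(a_v, gamma):
    """Weighted average of normalized elements (assumed form of the index).

    a_v:   elements of the area's representation, each already in [0, 1]
           and oriented so that larger means riskier.
    gamma: non-negative importance weights chosen by domain experts.
    """
    if len(a_v) != len(gamma):
        raise ValueError("a_v and gamma must have the same length")
    if any(g < 0 for g in gamma) or sum(gamma) == 0:
        raise ValueError("gamma must be non-negative and not all zero")
    if any(not 0.0 <= x <= 1.0 for x in a_v):
        raise ValueError("elements of a_v must lie in [0, 1]")
    return sum(g * x for g, x in zip(gamma, a_v)) / sum(gamma)


# Hypothetical values: three normalized elements, the first weighted twice as much.
print(risk_index([0.8, 0.5, 0.6], [2.0, 1.0, 1.0]))
```

Dividing by the sum of the weights means experts can change the relative importance of an element without having to rescale the others.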

Because of the critical need to act promptly and deliberately in this rapidly changing situation, we have deployed our developed system α-Satellite (i.e., an AI-driven system to automatically provide hierarchical community-level risk assessment related to COVID-19) for public testing. Given a specific location (either user input or automatic positioning), the developed system will automatically provide risk indexes associated with it in a hierarchical manner (e.g., state, county, city, specific location) to enable people to select appropriate actions for protection while minimizing disruptions to daily life. The system is available at https://COVID-19.yes-lab.org, which also includes a brief description and disclaimer of the system as well as the following benchmark datasets.
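The hierarchical output described above can be pictured as a chain of areas linked to their parents. The sketch below is a hypothetical data structure, not the system's API: the class and function names are assumptions, and the numbers are the example values reported for Euclid Ave, Cuyahoga County and OH in the paper's first case study.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AreaRisk:
    name: str
    level: str
    risk_index: float
    public_perception: float
    parent: Optional["AreaRisk"] = None


def hierarchical_report(area: AreaRisk) -> List[AreaRisk]:
    """Walk up the parent links and return the chain from state down to `area`."""
    chain = []
    node: Optional[AreaRisk] = area
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain[::-1]


ohio = AreaRisk("OH", "state", 0.554, 0.557)
cuyahoga = AreaRisk("Cuyahoga County", "county", 0.665, 0.585, parent=ohio)
location = AreaRisk(
    "Euclid Ave, Cleveland, OH 44106", "location", 0.662, 0.529, parent=cuyahoga
)

for entry in hierarchical_report(location):
    print(f"{entry.level:>8}  {entry.name:<32} "
          f"risk={entry.risk_index:.3f} perception={entry.public_perception:.3f}")
```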

Data Collection and Preprocessing. We have developed a set of crawling and preprocessing tools to collect and parse the large-scale and real-time pandemic related data from multiple sources, including disease related data from official public health organizations and digital media, demographic data, mobility data, and user generated data from social media (i.e., Reddit). We have made our collected and preprocessed data available for public use through the above link. We describe each publicly accessible benchmark dataset (i.e., DB1-DB4) in detail below.

DB1: disease related dataset. According to simplemaps, the U.S. includes 50 states, Washington, D.C. and Puerto Rico, as well as 3,203 counties and 28,889 cities. We have collected the up-to-date county-based coronavirus related data, including the numbers of confirmed cases, new cases, deaths and the fatality rate, from official public health organizations (e.g., WHO, CDC, and county government websites) and digital media with real-time updates of COVID-19 (e.g., 1point3acres). To date, we have collected these data from 1,531 counties and 52 state-level entities (the 50 states, Washington, D.C. and Puerto Rico) on a daily basis from Feb. 28, 2020 to date (i.e., March 25, 2020).

DB2: demographic and mobility dataset. We parse the demographic data collected from the United States Census Bureau (data updated on July 1, 2019) in a hierarchical manner: for each city, county or state in the U.S., the dataset includes its estimated population, population density (e.g., number of people per square mile), and age and gender distributions. To date, we make the demographic and mobility dataset available for public use, including the estimated population, population density, and GPS coordinates for 28,889 cities, 3,203 counties and 52 state-level entities (the 50 states, Washington, D.C. and Puerto Rico).

DB3: social media data from Reddit. In this work, we initialize our efforts on social media data with the focus of public perception analysis on Reddit, as it provides the platform for scientific discussion of dynamic policies, announcements, symptoms and events of COVID-19. In particular, we have collected and analyzed 48 state-based subreddits (i.e., Washington, D.C. and 47 states).

In this section, we evaluate the practical utility of the developed system α-Satellite for hierarchical community-level risk assessment related to COVID-19 through a set of case studies.

Case study 1: real-time risk index of a given area. Given a specific location (either user input or automatic positioning by Google Maps), the developed system will automatically provide its related risk index (ranging over [0,1]; the larger the number, the higher the risk) associated with the public perceptions (i.e., awareness) towards COVID-19 in this area (ranging over [0,1]; the larger the number, the more aware), demographic density (i.e., the number of people per square mile in its related county), and traffic status (ranging over [1, 5]; the larger the number, the more traffic). Figure 7(a) shows an example: given the location of Euclid Ave, Cleveland, OH 44106, the risk index provided by the system was 0.662 (with public perception of 0.529, demographic density of 1,389, and traffic status of 3) at 3:58pm EDT on March 24, 2020. At the same time, the risk indexes and public perceptions of the corresponding county (i.e., Cuyahoga County with risk index of 0.665 and public perception of 0.585) and state (i.e., OH with risk index of 0.554 and public perception of 0.557) are also shown in a hierarchical manner to enable people to select appropriate actions for protection while minimizing disruptions to daily life.

Case study 2: comparisons of risk indexes on different dates. In this study, given the same area, we examine how the generated risk indexes change over time. Using the same location as above, Figure 7(b) shows the comparison results on different dates at the time of 3:58pm EDT, from which we have the following observations: 1) in general, its risk indexes increased over days from March 8, 2020

Case study 3. In this study, given the same time, we examine how the generated risk indexes change across areas. When a user inputs the areas he/she is interested in (e.g., grocery stores near me) in the search bar, the system will display the nearby grocery stores using the Google Maps application programming interface (API) and automatically provide the associated indexes. For example, using the same time as in the first study (i.e., 3:58pm EDT on March 24, 2020), Figure 8 shows the "grocery stores near me" (i.e., near the location of Euclid Ave, Cleveland, OH 44106) and their related indexes. From Figure 8, we can observe that the indexes of nearby areas might vary due to different public perceptions towards COVID-19 and different traffic statuses in specific areas. As shown in the right part of Figure 8, the system also provides related Reddit posts to users.

Case study 4: comparisons of different counties and states. In this study, we compare the indexes of different counties and different states at the same time. Using the time in the first study (i.e., 3:58pm EDT on March 24, 2020), Figure 9(a) shows an example of comparisons. More specifically, at the county level, using OH as an example, we choose the five counties with the largest numbers of confirmed cases on March 24 for comparison: Cuyahoga (167), Franklin (75), Hamilton (38), Summit (36) and Lorain (30). Figure 9(b) illustrates the risk indexes associated with multiple factors versus the numbers of confirmed cases in these counties. For the comparison of different states, we also choose five states: the two most severely affected states (New York (NY) with 26,376 confirmed cases and 271 deaths, California (CA) with 2,628 confirmed cases and 54 deaths), two moderately affected states (OH with 564 confirmed cases and 8 deaths, Virginia (VA) with 304 confirmed cases and 9 deaths) and one of the least affected states (West Virginia (WV) with 39 confirmed cases and 0 deaths). Figure 9(c) shows the risk indexes versus the numbers of confirmed cases in these states, from which we can see that there is a positive correlation between the numbers of confirmed cases and the risk indexes.

To track the emerging dynamics of the COVID-19 pandemic in the U.S., in this work, we propose to collect and model heterogeneous data from a variety of different sources, devise algorithms to use these data to train and update the models to estimate the spread of COVID-19 and predict the risks at community levels, and thus help provide actionable information to users for community mitigation. In sum, leveraging the large-scale and real-time data generated from heterogeneous sources, we have developed the prototype of an AI-driven system (named α-Satellite) to help combat the deadly COVID-19 pandemic. The developed system and generated benchmark datasets have been made publicly accessible through our website.

In future work, we plan to continue expanding the data collection and enhancing the system to help combat the fast-evolving COVID-19 pandemic. We will continue to release our generated data and system updates to support researchers and practitioners working to combat the COVID-19 pandemic, and to help people at increased risk of COVID-19 select appropriate actions to protect themselves while minimizing disruptions to daily life to the extent possible.
