key: cord-318879-4ual2ssa
authors: Kaveh-Yazdy, Fatemeh; Zarifzadeh, Sajjad
title: Track Iran's National COVID-19 Response Committee’s Major Concerns using Two-stage Unsupervised Topic Modeling
date: 2020-11-04
journal: Int J Med Inform
DOI: 10.1016/j.ijmedinf.2020.104309
sha: 
doc_id: 318879
cord_uid: 4ual2ssa

BACKGROUND: Since the World Health Organization (WHO) declared COVID-19 a Public Health Emergency of International Concern (PHEIC) on January 31, 2020, governments have faced the challenge of responding to the crisis in a timely manner. The efficacy of these responses depends directly on the social behavior of the target society. People react to these actions according to the information they receive from different channels, such as news and social networks. Thus, analyzing the news gives a brief view of the information users received during the outbreak. METHODS: The raw data used in this study were collected from the official Telegram channels of news wires and agencies and exceed 2,400,000 posts. The posts that quote NCRC’s members are collected, cleaned, and divided into sentences. Topic modeling and tracking are used in a two-stage framework, customized for this problem, to separate miscellaneous sentences from those presenting concerns. The first stage is fed with embedding vectors of the sentences, which are grouped by the Mapper algorithm. Sentences belonging to singleton nodes are labeled as miscellaneous. In the second stage, the remaining sentences are vectorized with the Tf-IDF weighting scheme and topically modeled by the LDA method. Finally, relevant topics are aligned to the list of policies and actions set up by the NCRC, named topic themes. RESULTS: Our results show that the major concerns, presented in about half of the sentences, are (1) PCR lab tests, diagnosis, and screening, (2) closure of the education system, and (3) awareness actions about hand washing and face mask usage. Among the eight themes, intra-provincial travel and traffic restrictions, as well as briefings on the national and provincial status, are under-represented. The timeline of concerns, annotated with the preventive actions, illustrates the changes in the concerns addressed by the NCRC. 
This timeline shows that although the announcements and public responses do not lag behind the events, they cannot be considered timely. Furthermore, the fluctuating series of concerns reveals that the NCRC does not have a long-term response map and that its members react to the closest announced policy or act. CONCLUSION: The results of our study can be used as a quantitative indicator for evaluating the availability of an on-time public response of Iran’s NCRC during the first three months of the outbreak. Moreover, they can be used in comparative studies to investigate the differences between awareness acts in various countries. Results of our custom-designed framework showed that about one-third of the discussions of the NCRC’s members cover miscellaneous topics that must be removed from the data.

and the vice deputy of the NCRC, Dr. Iraj Harirchi, MD, have held press conferences and interview sessions to address and brief the public on the NCRC's actions and decisions.

In this article, we present the results of our research aimed at analyzing the concerns and policies of the NCRC. While our targeted issues have been addressed by different sources and members of the NCRC, we decided to collect the required information from news posts. We then adopt text mining methods to extract, select, group, and analyze the underlying knowledge.

The use of text mining methods to analyze disseminated information about a disease dates back to early 2008.

Disease-related text mining research can be divided, with respect to its application, into four primary groups:
1. Outbreak monitoring and prediction
2. Infodemic and misinformation detection
3. Social/public concern detection
4. Analysis of the responses of disease control centers

The first two groups target the extraction of disease-related information, while the remaining groups mine the socio-political information that reflects the responses of governments and responsible institutes and society's reactions to these policies. In the following sub-sections, we introduce these groups and briefly review recent research in each area.

Outbreak monitoring and prediction are studied under a research field called digital epidemiology [6], which goes back to the time when Google released its research project Google Flu Trends (GFT) to predict flu prevalence in 25 countries. The GFT service predicted the H1N1 pandemic in 2009. The pandemic, a non-seasonal flu outbreak, started in summer and was the first critical outbreak after the release of the service. Investigations by Cook et al. [7] showed that GFT's predicted series were highly correlated with the prevalence registered by the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet). However, Nature reported that the predictions of GFT in the 2012-2013 flu season were twice the CDC's ILINet numbers [8]. In addition, Salathé [9] indicated that [...] By early 2020, the focus of digital epidemiology research shifted from other diseases to COVID-19 and to outbreak detection using search engine indices and tweets to predict disease trends. Li et al. [17] used the Baidu search index and Weibo search queries to predict the raw number of lab-confirmed and suspected COVID-19 cases.

Their results show that the search index series are highly correlated with the numbers of confirmed and suspected cases, with lags of 8-10 and 5-7 days, respectively. Kaveh-Yazdy et al. [18] used query logs of a Persian search engine to study the feasibility of monitoring disease dynamics in Iran. Qin et al. [19] analyzed the social media search index (SMSI) to predict the suspected cases of COVID-19. Their analysis reveals that among the listed keywords for COVID-19, such as dry cough, fever, chest distress, coronavirus, and pneumonia, the numbers of searches for fever and pneumonia are highly correlated with the number of suspected cases. Ayyoubzadeh et al. [20] collected the Google search index of nine keywords and the number of cases of the previous day in Iran to generate a multi-feature time series. They used a linear regression model as well as a long short-term memory (LSTM) neural network to predict the incidence of the next day. Their results show that the root mean square error (RMSE) of the linear regression model is lower than that of the LSTM model; however, the LSTM model captures the up-down trends better, which results from overfitting.

Their results warn that the LSTM is not an appropriate model for short and sparse time series, while it could be reliable for data collected over longer periods.
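The lagged correlations reported in the studies above (for example, the 8-10 and 5-7 day lags between search indices and case counts) can be illustrated with a small sketch. The code below is not taken from any of the cited papers; the helper names and the toy series are made up for illustration, and it simply scans candidate lags for the highest Pearson correlation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def lagged_correlation(search, cases, lag):
    """Correlate search[t] with cases[t + lag] (search leads by `lag` days)."""
    if lag == 0:
        return pearson(search, cases)
    return pearson(search[:-lag], cases[lag:])

def best_lag(search, cases, max_lag):
    """Return the lag in 0..max_lag with the highest correlation."""
    return max(range(max_lag + 1),
               key=lambda k: lagged_correlation(search, cases, k))

# Toy, made-up series: `cases` is `search` delayed by 3 days.
search = [1, 3, 2, 6, 9, 4, 8, 12, 7, 15, 11, 5]
cases = [0, 0, 0] + search[:-3]
print(best_lag(search, cases, max_lag=5))
```

On real data the two series would first need to be aligned by date, and a high correlation at some lag would not by itself establish predictive value.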

During the 2003 SARS outbreak, the term "Infodemic" was coined from parts of the words "Information" and "Epidemic" [21]. Infodemic refers to the massive amount of misinformation about a disease circulating in the media and on the web. The SARS [22], MERS [23], Zika [23], and Ebola [24] outbreaks drove waves of misinformation and rumors on the web; however, their infodemics are not comparable to that of COVID-19. The enormous number of posts including misinformation published in social media and newspapers makes public health experts worry about the quality and validity of the information accessible to the public. A short while after the WHO's PHEIC report, the WHO's risk communication team launched a new portal, called WHO Information Network for Epidemics (EPI-WIN), to publish valid and evidence-based information about COVID-19 [25].

We should note that misinformation circulating in social media, such as Facebook, Twitter, TikTok, Pinterest, and YouTube, can increase the levels of fear and anxiety in society. Furthermore, rumors can affect people's behavior and their reactions to disease control policies. For example, CNN's anticipation of the Lombardy region lockdown, which was published before the official announcement by the government of Italy, led to overcrowded trains and airports. People escaping from the Lombardy region spread the disease to other regions and increased contagion [26]. Two important steps in fighting infodemics are finding the sources and communicating faster than the rumors.

Cinelli et al. [26] collected COVID-19-related posts from Twitter, Instagram, YouTube, Reddit, and Gab for 45 days using the top keywords of Google Trends. They extracted links to news outlets and analyzed their contents to classify them into two main groups, namely Reliable and Questionable. The Questionable group includes Conspiracy-Pseudoscience, Pro-Science, and Questionable topics. Their results show that Gab is the environment most susceptible to misinformation dissemination. Hua and Shaw [27] collected news posts from Sina Weibo's hot search list, the coronavirus timeline in China compiled by the users of five social networks, and the web usage data log of the Mob-Tech research institute. They defined five phases from January 31, 2020, to February 29, 2020, according to the government's responses to the outbreak and analyzed society's reactions to official policies during these phases. Their results suggest that the initial delay was justified by restrictive regulation and successful communication, big data adoption, and digital technologies. Directives by China's Supreme Court on fake news publishing and Tencent's website named "Rumors exposed website" are examples of country-level responses to fight rumors and misinformation.

Erku et al. [28] addressed the role of pharmacists in the COVID-19 infodemic, namely providing up-to-date and reliable information to their community via social media platforms. Furthermore, they indicated that pharmacists must ensure education and qualified homecare for individuals, suspected patients, and family members during the lockdown period, even through referrals to psychological consultants. Kouzy et al. [29] followed 14 trending English hashtags on Twitter to extract tweets related to COVID-19. They validated the information in the tweets against peer-reviewed medical and health sources and studied the dynamics of misinformation spreading on Twitter. According to their investigation, some Twitter accounts are more associated with unverifiable information. Furthermore, the proportion of tweets containing misinformation relative to trusted ones is growing. These phenomena signal the need for early intervention by health agencies, physicians, medical associations, and even scientific journals.

Public concerns directly affect the attitudes and behavior of a society facing a disease. Thus, analyzing the concerns, anxiety, and fears of society reveals the hidden issues that can affect disease control responses. Nelson et al. [30] studied the psychological and epidemiological concerns of society through the lens of surveys in social media.

They found that the major concerns of society are hand washing, staying at home, and practicing social distancing. Their study showed that the level of concern depends on the age group. Issues regarding buying hand sanitizer and food, child care, and the economic consequences of the COVID-19 outbreak are the most common difficulties. Wang et al. [31] investigated the impacts of the COVID-19 outbreak on psychological health factors in Chinese society.

Their survey showed that the psychological impact of the outbreak was rated as moderate to severe by 53.8% of respondents. Moreover, 16.5% of respondents reported moderate to severe depressive symptoms, 28.8% reported moderate to severe anxiety symptoms, and 8.1% reported moderate to severe stress levels. The findings of Wang et al. [32] showed that high-quality health information was associated with better mental health outcomes during the outbreak.

Van der Vegt and Kleinberg [33] investigated 5000 pieces of text (half of them short texts, and the rest long). In this experiment, male and female participants were invited to express their emotions about the SARS-COV-2 virus. They found that the crucial sources of concern and anxiety for women are family and health issues. Moreover, women are more worried, anxious, sad, and angry than men. On the other hand, the main source of anxiety for men is socio-economic issues. Deng et al. [34] used selected keywords to collect Weibo posts related to the 2013 H7N9 bird flu outbreak in China. The collected posts were partitioned into sentences, and their Tf-IDF vector representations were clustered using the k-means algorithm. In the end, some of the related clusters covering different aspects of the same topic were aggregated. Tracking the topics that evolved in social media reveals that the first reaction to the first case in Shanghai was shock, and after the flu spread to Beijing, sleeping issues became the next primary concern. The third trend was about "Treatment and Precaution," and then the frightened society started to post about the 2003 "SARS" outbreak and the correlation between the new flu and the increasing number of deaths in pigs. Finally, traditional Chinese medicine attracted attention.

Lazard et al. [35] analyzed the tweets of American users during the 2014 Ebola outbreak to identify anxiety and public fears. In addition, they collected tweets and retweets of the offices of the Centers for Disease Control (CDC) at the federal, state, and local levels. Comparing the CDC responses to the concerns raised by Twitter users shows that the CDC could not make up for the lack of certain reliable information about issues such as the pathogen, its spread mechanism, and the fear of air travel. Lazard et al. concluded that CDC accounts must communicate with users to present guidelines with respect to their priorities. Several research groups began to analyze negative tweets to extract public concerns regarding outbreaks, because concerns seem more likely to be expressed in negative tweets. For example, the results of Mamidi et al. [36] showed that Zika abnormalities, neural defects, and symptoms are mostly expressed in negative tweets. Ji et al. [37] analyzed negative personal tweets to study mental concerns and to design a disease tracking framework. Their proposed system, named Epidemic Sentiment Monitoring System (ESMOS), can visualize and group user concerns and track how they evolve during the outbreak. This framework can guide disease control and prevention agencies in setting their communication priorities on time. [...] governments. They indicate that the OxCGRT, as a stringency index that bridges the gap, can be used to evaluate the efficacy of decisions that have been made or will be made in the future.

Telegram is the most popular social medium among Iranians, with more than 50 million Iranian Telegram users (more than 56% of Telegram users) [42]. People use Telegram channels for education and reading news, its public groups for social discussion, and its encrypted private chat for messaging. Thus, we decided to extract news posts from the official channels of the most popular news agencies, as well as popular private news re-publishing [...] Posts that quote at least one sentence from speeches and press conferences delivered by the head, the vice deputy, and the spokesman of Iran's NCRC are selected. In the next step, the posts are split into sentences to generate the sentence dataset. The dataset contains 45,209 sentences that are likely to express concerns, actions, and decisions of the NCRC.

The prerequisite of extracting government or public concerns is topic modeling and tracking. It can be defined as the process of extracting the abstract topics underlying the contents of documents. The first generation of topic modeling methods included statistical techniques such as Latent Semantic Indexing (LSI) [43], Probabilistic Latent Semantic Analysis (PLSA) [44], and Latent Dirichlet Allocation (LDA) [44]. The parameter set and structure of LDA make it well suited to different optimizations and improvements. There are many variations of the LDA model, such as the Correlated Topic Model (CTM) [45], Hierarchical LDA (HLDA) [45], supervised LDA (sLDA) [46], relational Topic Model (rTM) [47], and Markov Topic Model (MTM) [48]. While word embedding models have advanced language processing, LDA leverages the rich representation provided by such models. Word embedding-based LDA with a Gaussian mixture model [49], LDA2Vec [50], and word embedding augmented LDA [51] are examples of word embedding-based topic modeling techniques. Their successful records motivated researchers to adopt them in health concern tracking as well. Lazard et al. [35] and Kim et al. [52] used LDA-based public concern tracking in the 2014-2015 Ebola outbreak, and Glowacki et al. [53] analyzed the CDC's communications in response to public concerns after the spread of Zika in Florida. The rapid spread of COVID-19 alerts the global community to respond promptly to medical and mental concerns in order to accelerate disease control [39]. To this end, Dong et al. [54], Stokes et al. [55], and Liu et al. [40] employed topic modeling techniques to detect public concerns and health communications.

1 The list of news channels and the number of posts are covered in Appendix A.

In this article, a two-stage framework is devised to extract the major concerns of the NCRC. This framework receives the pre-processed and cleaned dataset and analyzes it in two stages. Each row of the input data includes a news agency, a date, and a sentence that probably expresses the NCRC's concerns. In the first stage (i.e., stage A), sentences are vectorized by a sentence embedding, and a group of similar sentences covering the NCRC's concerns is selected to be re-processed in the next stage. The second stage (i.e., stage B) has the same structure: a data vectorization method generates sentence representations, and then the underlying topics are extracted using LDA. The schematic view of the proposed framework is illustrated in Figure 1.

Figure 1: The schematic view of the concern extraction framework.

Although the selected posts include quotes made by the members of the NCRC, additional processing is required to extract the sentences that really indicate concerns of the NCRC. During the press conferences, the spokesman and other members of the committee address different issues related to the COVID-19 pandemic, such as expressing condolences to other countries and the medical aid the NCRC received from the health department of the Ministry of Defense. Because sentences expressing concerns and those related to peripheral issues share similar terms, we design a two-stage framework. The first stage divides the domain into two sets: one extensive set of concern-addressing sentences, and the remaining sentences. Sentences of the first set are re-vectorized in a way that expresses their underlying topics with associated words. The second stage generates human-understandable topics.

Finally, the topics are reviewed by a human expert and aggregated with respect to the list of concerns. Topic aggregation is a common way to align the detected topics with the listed response concerns (similar to [40], [55]).

Here we explain the order of the methods used in our framework. To shed light on the topic modeling issues that occur in this problem, we present an example found during our investigations. Suppose that sentences are vectorized in parallel using two different schemes, i.e., Tf-IDF and sentence embedding. Then, the sentences addressing issues related to "Hospitals and Medical Centers" and "Closure of Schools/Universities" are labeled. The vectors are embedded in a 2D space using a non-parametric dimension reduction method called t-Distributed Stochastic Neighbor Embedding (t-SNE) [56]. Subplots A and B of Figure 2 show the distribution of the classes in the two vector spaces, sentence embedding and Tf-IDF, respectively. As can be seen in Figure 2, the subtopics (i.e., the various concerns under one general topic) are better separated in the sentence-embedding space. In summary, the sentence embedding vector space model leverages the semantically rich representation of the sentences, which makes it capable of separating sentences with common terms better than the Tf-IDF representation. 

Stage A includes two steps: (1) sentence embedding, and (2) sentence grouping. All pre-processed news posts are used for learning the word representation. Among various word embedding packages, we selected Facebook's FastText package [57], which is fast and accurate for sparse datasets. Although BERT embedding [58] has been shown to be very successful, training it on our data, which is a small, sparse Persian corpus, does not show promising results. Instead, FastText, trained in less than an hour, shows reliable results. FastText is fed with the whole corpus to generate embedding vectors. The 150-dimensional word embedding vectors, as well as the selected sentences, are given to FastText to build a sentence embedding representation. Embedded sentences are represented by 150-dimensional vectors as well.
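As a rough illustration of how word vectors can be pooled into a sentence vector, the sketch below averages unit-length word vectors. This is an assumption for illustration only: the text does not state the exact pooling used, the code does not call the FastText package, the toy vectors have 3 dimensions instead of 150, and skipping unknown tokens is a simplification (FastText itself can build vectors for unseen words from subword units).

```python
from math import sqrt

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def sentence_vector(tokens, word_vectors, dim):
    """Mean of unit-length word vectors; unknown tokens are skipped."""
    known = [l2_normalize(word_vectors[t]) for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

# Hypothetical toy word vectors, made up for this example.
word_vectors = {
    "schools": [0.9, 0.1, 0.0],
    "universities": [0.8, 0.2, 0.1],
    "closed": [0.5, 0.5, 0.0],
    "hospitals": [0.0, 0.2, 0.9],
    "crowded": [0.1, 0.4, 0.8],
}
s1 = sentence_vector(["schools", "closed"], word_vectors, 3)
s2 = sentence_vector(["universities", "closed"], word_vectors, 3)
s3 = sentence_vector(["hospitals", "crowded"], word_vectors, 3)

# Sentences about the same concern should end up closer to each other.
print(cosine_distance(s1, s2) < cosine_distance(s1, s3))
```

The cosine distance defined here is the same kind of metric that the grouping step operates on.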

In the next step, sentences must be clustered or grouped based on their embedded semantics. Sentences quoted from the NCRC's members express similar topics centered on COVID-19 and the SARS-COV-2 virus. We know that a major proportion of the sentences express the NCRC's concerns, but the exact definition of these groups of sentences and their distribution are unknown. Furthermore, the target group of sentences and the peripheral group may overlap, because long sentences can express a concern and a peripheral topic at the same time. One algorithm that supports overlapping grouping of data points with respect to their structure is the Mapper algorithm introduced by Singh et al. [59], which reduces high-dimensional datasets to simplicial complexes. The low-dimensional representation is not exact; however, it preserves the topological shape of the data and is less sensitive to the selected distance metric. The Mapper algorithm has been applied to various problems such as 3D object recognition [59], bioinformatics [60], data visualization [61], and cancer detection [62]. The Mapper algorithm is applied to the 150-dimensional vectors to group them. There exist different packages for the Mapper algorithm, listed in Table 1. We use the KeplerMapper package [63], which adopts a clustering algorithm to group the data objects. In this research, the selected clustering algorithm is DBSCAN with the cosine distance metric and an epsilon of 0.05.

The number of hypercubes of the Mapper is set to 10. The generated graph has 708 nodes and 293 edges. This graph includes several connected components; however, there are 11799 singleton data points that are not clustered with other samples. Figure 3 shows the Mapper graph, including one major connected component and several smaller and single-node components. We should note that the single nodes in this graph are clusters of data samples. The singleton samples mentioned above, which are not grouped under any of the clusters, are not shown in this figure.

Even though the Mapper graph uncovers the topological structure of the dataset, it cannot be used to extract topic models. The result of this stage tells us which sentences cannot be grouped with any other sentences. Since the set of concerns addressed by the NCRC is limited, the sentences that do not share meaning with any cluster do not express a concern. Consequently, we remove the 11799 singleton sentences from the dataset before passing the remaining sentences to stage B.
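The grouping idea of stage A can be sketched in a few lines of plain Python. This is a simplified illustration and not the KeplerMapper implementation used in the study: the lens (filter) function and the overlap between intervals are not given in the text and are assumptions here, and the per-interval clustering is a stand-in for DBSCAN that takes connected components of the graph linking points whose cosine distance is at most eps and treats components smaller than `min_size` as noise.

```python
from itertools import combinations
from math import cos, radians, sin, sqrt

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def cover(lo, hi, n_intervals, overlap):
    """Split [lo, hi] into overlapping intervals (overlap as a fraction of width)."""
    width = (hi - lo) / n_intervals
    pad = width * overlap
    return [(lo + i * width - pad, lo + (i + 1) * width + pad)
            for i in range(n_intervals)]

def eps_components(ids, points, eps, min_size):
    """Connected components of the eps-neighborhood graph; small ones are noise."""
    unvisited, clusters = set(ids), []
    while unvisited:
        stack, comp = [unvisited.pop()], set()
        while stack:
            i = stack.pop()
            comp.add(i)
            near = {j for j in unvisited
                    if cosine_distance(points[i], points[j]) <= eps}
            unvisited -= near
            stack.extend(near)
        if len(comp) >= min_size:
            clusters.append(frozenset(comp))
    return clusters

def mapper(points, lens, n_intervals=10, overlap=0.25, eps=0.05, min_size=2):
    """Return (nodes, edges, unclustered point indices)."""
    nodes = []
    for lo, hi in cover(min(lens), max(lens), n_intervals, overlap):
        ids = [i for i, v in enumerate(lens) if lo <= v <= hi]
        nodes.extend(eps_components(ids, points, eps, min_size))
    # Two nodes are linked when their clusters share at least one point.
    edges = [(a, b) for a, b in combinations(range(len(nodes)), 2)
             if nodes[a] & nodes[b]]
    clustered = set().union(*nodes) if nodes else set()
    unclustered = [i for i in range(len(points)) if i not in clustered]
    return nodes, edges, unclustered

# Toy data: unit vectors at 0, 10, ..., 80 degrees plus one far-away point.
angles = list(range(0, 90, 10)) + [170]
points = [[cos(radians(a)), sin(radians(a))] for a in angles]
lens = [float(a) for a in angles]  # assumed lens: the angle itself
nodes, edges, unclustered = mapper(points, lens, n_intervals=3)
print(len(nodes), edges, unclustered)
```

In this toy run the chain of nearby points is covered by two overlapping clusters joined by an edge, while the isolated point ends up in no cluster, which mirrors the role of the singleton sentences removed before stage B.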


Word embedding-based topic models benefit from the rich semantic representation provided by word embeddings, while probabilistic and matrix factorization methods, such as LDA, PLSA, and NMF, are more human-readable [64]. In our research, we need to align the extracted topics with the list of the NCRC's concerns; thus, we have to adopt a human-understandable topic modeling technique to obtain the topics. In this work, we use LDA, one of the well-implemented probabilistic topic modeling methods. In the first step of stage B, a Tf-IDF vectorizer computes the Tf-IDF weights of the sentences produced by stage A. Tf-IDF is a vector-based weighting scheme used in information retrieval and text mining and is defined as the product of two elements, the term frequency ($\mathrm{tf}$) and the inverse document frequency ($\mathrm{idf}$) [65]. More precisely, the Tf-IDF of term $t$ in document $d$ is computed as follows

$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$$

where $\mathrm{tf}_{t,d}$ is the number of times term $t$ occurs in document $d$, and $\mathrm{idf}_t$ is the logarithm of the inverse document frequency. The simplest form of $\mathrm{idf}_t$ is defined as

$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$$

in which $\mathrm{df}_t$ is the number of documents that include $t$ in a corpus of $N$ documents. Manning et al. [65] listed several variations of Tf-IDF; however, the above-mentioned version is the most popular one. The LDA model is fed with the Tf-IDF weighted vectors. Since Petz et al. [66] noted the efficacy of cleaned text in opinion mining, we removed a predefined list of Persian stop-words, as well as terms that occur in more than 80% of documents, from the vectorizer's input. The LDA model is a hierarchical Bayesian model that uses a bag-of-words representation of documents to represent them in a semantic space with fewer dimensions. This model is fed with the vector representation of documents to calculate the probability of classes for each document. In this study, we use Gensim's implementation of LDA with the Tf-IDF model [67]. In the next step, the number of topics for the LDA model must be determined.
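A small worked example may help make the weighting concrete. The sketch below is not the Gensim pipeline used in the study; it computes the simplest variant (raw term count times the logarithm of the number of documents divided by the document frequency) on made-up English tokens, and the function name, the `max_df` parameter, and the toy stop-word list are illustrative assumptions.

```python
from collections import Counter
from math import log

def tfidf(docs, stop_words=frozenset(), max_df=0.8):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    docs = [[t for t in d if t not in stop_words] for d in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    # Drop terms that occur in more than max_df of the documents.
    keep = {t for t, c in df.items() if c / n_docs <= max_df}
    weights = []
    for d in docs:
        tf = Counter(t for t in d if t in keep)
        weights.append({t: tf[t] * log(n_docs / df[t]) for t in tf})
    return weights

# Made-up toy sentences, already tokenized.
docs = [
    ["the", "virus", "test", "test"],
    ["the", "virus", "school", "closure"],
    ["the", "virus", "mask"],
    ["the", "virus", "test", "lab"],
    ["the", "virus", "hospital", "bed"],
]
weights = tfidf(docs, stop_words={"the"})
print(weights[0])
```

Here "virus" occurs in every document and is dropped by the 80% rule, while "test" occurs twice in the first sentence and in two of the five sentences, so its weight there is 2 × log(5/2).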

Finding the appropriate number of topics is studied under model coherence analysis. Topic coherence measures quantify the understandability of the topics. Roder et al. [68] analyzed seven coherence measures and compared their ability and run-time. Among these measures, we select the one that outperforms the others. The best number of topics is determined with respect to the coherence of the underlying topics of the LDA model. Figure 4 shows the coherence scores for different numbers of topics ($K$). The best score occurs at $K = 47$. Thus, we use LDA with $K = 47$ to extract understandable topics. In this model, each topic is a group of documents represented in a high-dimensional space of the same size as the vectors. The 2D view of the inter-topic space of the trained LDA model is shown in Figure 5, which is visualized using the LDAvis package [69]. 
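The model selection step can be illustrated as follows. The specific coherence measure chosen in the study is not named in this text, so the sketch assumes a UMass-style co-occurrence measure purely for illustration; the candidate topic lists stand in for the top words of trained LDA models and are made up, as are all function names.

```python
from itertools import combinations
from math import log

def umass_coherence(top_words, docs):
    """UMass-style coherence of one topic given its ranked top words.

    Every top word is assumed to occur in at least one document.
    """
    doc_sets = [set(d) for d in docs]

    def df(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = [log((df(later, earlier) + 1) / df(earlier))
              for earlier, later in combinations(top_words, 2)]
    return sum(scores) / len(scores)

def model_coherence(topics, docs):
    """Mean coherence over all topics of one model."""
    return sum(umass_coherence(t, docs) for t in topics) / len(topics)

def best_num_topics(candidates, docs):
    """candidates: {K: list of top-word lists}. Returns the K with max coherence."""
    return max(candidates, key=lambda k: model_coherence(candidates[k], docs))

# Made-up corpus and made-up top words for two hypothetical models.
docs = [
    ["test", "lab", "screening"],
    ["test", "lab"],
    ["school", "closure", "exam"],
    ["school", "closure"],
    ["mask", "hands"],
    ["mask", "hands", "test"],
]
candidates = {
    2: [["test", "school", "mask"], ["lab", "closure", "hands"]],
    3: [["test", "lab", "screening"], ["school", "closure", "exam"],
        ["mask", "hands"]],
}
print(best_num_topics(candidates, docs))
```

In practice each candidate would come from an LDA model trained with that number of topics, and the scan would cover the full range shown in the coherence plot.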

Abdi [70] lists a limited number of expressed concerns and established actions. Consequently, we have to aggregate the topics that cover different aspects of the same action or concern. Our list of concerns and actions includes eight themes. The first seven themes indicate the concerns or responses of the NCRC, and the last theme covers other issues, such as thanking medical care personnel. We reviewed the top 20 words of each topic to assign it to the closest theme. Table 2 shows the number of topics and sentences merged to form the themes.
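Once an expert has assigned each topic to a theme, the per-theme counts of topics and sentences follow from a simple aggregation. The sketch below is only an illustration: the topic ids, the topic-to-theme mapping, and the theme labels are hypothetical and do not reproduce the study's assignments.

```python
from collections import Counter

def aggregate(topic_of_sentence, theme_of_topic):
    """Count topics and sentences per theme.

    topic_of_sentence: list with the dominant topic id of each sentence.
    theme_of_topic: manual mapping from topic id to theme label.
    """
    topics_per_theme = Counter(theme_of_topic.values())
    sentences_per_theme = Counter(theme_of_topic[t] for t in topic_of_sentence)
    return topics_per_theme, sentences_per_theme

# Hypothetical expert mapping of four topics onto three themes.
theme_of_topic = {
    0: "test and screening",
    1: "test and screening",
    2: "education closure",
    3: "miscellaneous",
}
topic_of_sentence = [0, 1, 1, 2, 3, 0, 2]
topics_per_theme, sentences_per_theme = aggregate(topic_of_sentence, theme_of_topic)
print(topics_per_theme["test and screening"], sentences_per_theme["test and screening"])
```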


The largest number of topics merged into a theme belongs to the miscellaneous theme, which results from the higher diversity of the miscellaneous topics addressed by the NCRC's members. Among the COVID-related themes, the test and screening theme attracts the most attention. The Ministry of Health and Medical Education (MHME) of Iran collaborated with the Ministry of ICT to deploy an online mobile screening application just ten days after the first confirmed case. This application plays the role of an in-home triage, giving recommendations to users with respect to their body temperature and symptoms. It helps the MHME to manage the mental stress of the public and to control the number of medical care requests in order to prevent overcrowding in hospitals. Encouraging the population to use this application and presenting the results of analyses of the app's data form a major part of the NCRC's press conferences. The second most important concern is the closure of the education system. The period from the last weeks of winter to the middle weeks of spring is important in Iran's educational year because the undergraduate and graduate entrance exams are held in it. Due to the COVID-19 crisis, all kindergartens, schools, academic institutes, and universities were closed province by province. The closure decisions were made in a period from February 22 to March 1 with respect to the status of each province. Increasing awareness by addressing handwashing protocols, alcohol-based hand sanitizers, and face masks is the third most discussed concern in public press conferences and interviews.

We show the number of sentences expressed for each theme from February 1 to May 10 using an area graph [...] system was shut down on March 1, 2020. The widest area (equal to 20.6%) in the related theme (i.e., theme 7) is spotted between February 22 and March 1, which means that in this period the most important concern of the NCRC was decreasing the rate of spread by avoiding contacts in the educational system. 
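The data behind such a timeline can be prepared by counting sentences per date and theme. The sketch below additionally converts the counts to daily percentage shares, which is an assumption about how the areas are scaled; the rows and theme labels are made up.

```python
from collections import Counter, defaultdict

def daily_theme_share(rows):
    """rows: iterable of (date, theme). Returns {date: {theme: percent}}."""
    counts = defaultdict(Counter)
    for date, theme in rows:
        counts[date][theme] += 1
    return {
        date: {theme: 100.0 * c / sum(cnt.values()) for theme, c in cnt.items()}
        for date, cnt in sorted(counts.items())
    }

# Made-up labeled sentences: (ISO date, theme).
rows = [
    ("2020-02-22", "education closure"),
    ("2020-02-22", "education closure"),
    ("2020-02-22", "test and screening"),
    ("2020-02-23", "awareness"),
]
shares = daily_theme_share(rows)
print(shares["2020-02-23"])
```

ISO-formatted date strings sort chronologically, so the resulting dictionary can be passed to a plotting library in order.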

The behavior of governments and centers for disease control has been analyzed during previous outbreaks, such as SARS [22], Zika [53], and Ebola [35]. These studies indicate that responding on time and promptly answering the questions raised on the web decrease the probability of misinformation dissemination in social media.

Furthermore, governments and disease control centers can fight rumors and conspiracy theory posts by actively monitoring social media [71]. The use of text mining techniques to extract health-related information for disease surveillance, misinformation detection, and public concern detection has been widely studied; however, the concerns reflected in the news have not been well studied. De Coninck et al. [72] addressed the forgotten role of the news media as a powerful legacy medium that makes it possible to influence people's behavior. Given the reliability of news sources for society, we aimed to extract and track the changes in the concerns of Iran's NCRC in this research.

We collect news posts that include quotes made by members of the NCRC and then group them to select the major part of the sentences covering similar topics. This stage helps us to remove sentences about peripheral issues that do not reflect the NCRC's concerns or acts. The selected sentences are clustered using a topic modeling method with maximum coherence, and then the assigned topics are aligned to the list of concerns and actions given by Abdi [70]. In the last step, the number of news items labeled with each theme is visualized. Our findings show that NCRC members make an official statement about actions on the day they are set up. Even though the NCRC does not lag behind, this is not considered an on-time reaction. According to the recommendations of Liu et al. [40], governments have to react ahead of policy setup to maximize their impact. The themes' time series fluctuate strongly, which means that the NCRC's members express their concerns regarding different issues under the influence of the closest action or concern. One of the questionable practices found in this research is the late reaction of the NCRC to test and tracking issues. Although screening started in the earliest days of the outbreak in Iran, the test and tracking cycle was considered late.

Le et al. [66] studied the information demanded by diverse socioeconomic groups in Vietnam during the COVID-19 lockdown. The results of their surveys show that "updated news about pandemics", "disease's symptoms", "updated news about the outbreak", and "notices on how to prevent the disease" are the most requested topics. The least demanded information is "notices on travel". Our findings show that the top four topics covered in Iran's NCRC press conferences are "lab test, screening, and diagnosis", "closure of schools and educational institutes", "preventive awareness", and "pressure on the health system". Similarly, the topic with the lowest interest level is "information on travel restriction". Although the lists of demanded topics in Vietnam and Iran differ, both indicate that the major information required to stay safe consists of notices on preventive practices and symptoms/diagnosis.

In this article, we used a two-stage framework to group, select, and cluster the sentences expressing the concerns of Iran's NCRC. Our framework leverages the ability of the Mapper algorithm to discriminate the sentences covering peripheral topics from the target sentences. The target sentences are clustered by adopting a latent Dirichlet allocation (LDA) topic model. Topics addressing different aspects of one topical theme are aggregated together. The results reveal the fast pace of NCRC reactions. Even so, disease control trustees would be expected to react ahead of the actions and policies. Early government responses addressed in the news media prepare society to behave in a manner that mitigates the risk of disease. We plan to expand our research to analyze the impacts of the expressed concerns on people's behaviors through the lens of their social media posts and tweets.

Naming the coronavirus disease (COVID-19) and the virus that causes it

The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application

Clinical Characteristics of Coronavirus Disease 2019 in China

Characteristics of and Important Lessons From the Coronavirus Disease 2019 (COVID-19) Outbreak in China: Summary of a Report of 72 314 Cases From the Chinese Center for Disease Control and Prevention

Assessment of Deaths From COVID-19 and From Seasonal Influenza

Digital Epidemiology

Assessing Google Flu Trends Performance in the United States during the 2009 Influenza Virus A (H1N1) Pandemic

When google got flu wrong

Digital epidemiology: what is it, and where is it going?

Using Google Trends for Influenza Surveillance in South China

Reappraising the utility of Google Flu Trends

Correlation between Google Trends on dengue fever and national surveillance report in Indonesia

Dynamic Forecasting of Zika Epidemics Using Google Trends

Flu Outbreak Prediction Using Twitter Posts Classification and Linear Regression With Historical Centers for Disease Control and Prevention Reports: Prediction Framework Study

Use of Twitter data to improve Zika virus surveillance in the United States during the 2016 epidemic

The Assessment of Twitter's Potential for Outbreak Detection: Avian Influenza Case Study

Retrospective analysis of the possibility of predicting the COVID-19 outbreak from Internet searches and social media data

Search Engines, News Wires and Digital Epidemiology: Presumptions and Facts

Prediction of Number of Cases of 2019 Novel Coronavirus (COVID-19) Using Social Media Search Index

Predicting COVID-19 Incidence Through Analysis of Google Trends Data in Iran: Data Mining and Deep Learning Pilot Study

'Infodemic': When Unreliable Information Spreads Far and Wide

Learning From SARS: Preparing for the Next Disease Outbreak

MERS, Rumors Spread in South Korea

Ebola, Twitter, and misinformation: a dangerous combination?

How to fight an infodemic

The COVID-19 Social Media Infodemic

Corona Virus (COVID-19) 'Infodemic' and Emerging Issues through a Data Lens: The Case of China

When fear and misinformation go viral: Pharmacists' role in deterring medication misinformation during the 'infodemic' surrounding COVID-19

Coronavirus Goes Viral: Quantifying the COVID-19 Misinformation Epidemic on Twitter

Rapid Assessment of Psychological and Epidemiological Correlates of COVID-19 Concern, Financial Strain, and Health-Related Behavior Change in a Large Online Sample

A longitudinal study on the mental health of general population during the COVID-19 epidemic in China

Immediate psychological responses and associated factors during the initial stage of the 2019 coronavirus disease (COVID-19) epidemic among the general population in China

Women worry about family, men about the economy: Gender differences in emotional responses to COVID-19

Tracking the evolution of public concerns in social media

Detecting themes of public concern: a text mining analysis of the Centers for Disease Control and Prevention's Ebola live Twitter chat

Twitter sentiment classification for measuring public health concerns

Monitoring public health concerns using twitter sentiment classifications

Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset

Unpacking the black box: How to promote citizen engagement through government social media during the COVID-19 crisis

Health Communication Through News Media During the Early Stage of the COVID-19 Outbreak in China: Digital Topic Modeling Approach

Variation in government responses to COVID-19

Latent Semantic Indexing: A Probabilistic Analysis

Probabilistic Latent Semantic Indexing

Correlated Topic Models

Supervised Topic Models

Relational Topic Models for Document Networks

Markov Topic Models

Gaussian LDA for Topic Models with Word Embeddings

Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec

WE-LDA: A Word Embeddings Augmented LDA Model for Web Services Clustering

Topic-based content and sentiment analysis of Ebola virus on Twitter and in the news

Identifying the public's concerns and the Centers for Disease Control and Prevention's reactions during a health crisis: An analysis of a Zika live Twitter chat

Understand Research Hotspots Surrounding COVID-19 and Other Coronavirus Infections Using Topic Modeling

Public Priorities and Concerns Regarding COVID-19 in an Online Discussion Forum: Longitudinal Topic Modeling

Visualizing Data using t-SNE

Enriching Word Vectors with Subword Information

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition

Knowledge Discovery and interactive Data Mining in Bioinformatics - State-of-the-Art, future challenges and research directions

Visualizing High-Dimensional Data

KeplerMapper

Introducing our Hybrid lda2vec Algorithm

Introduction to Information Retrieval

Demand for Health Information on COVID-19 among Vietnamese

Software Framework for Topic Modelling with Large Corpora

Exploring the Space of Topic Coherence Measures

LDAvis: A method for visualizing and interpreting topics

Coronavirus disease 2019 (COVID-19) outbreak in Iran: Actions and problems

Using Social Media to Mine and Analyze Public Opinion Related to COVID-19 in China

Forgotten key players in public health: news media as agents of information and persuasion during the COVID-19 pandemic