key: cord-0774383-nm0i0rtz
authors: Choi, Sungwoon; Lee, Jangho; Kang, Min-Gyu; Min, Hyeyoung; Chang, Yoon-Seok; Yoon, Sungroh
title: Large-scale machine learning of media outlets for understanding public reactions to nation-wide viral infection outbreaks
date: 2017-10-01
journal: Methods
DOI: 10.1016/j.ymeth.2017.07.027
sha: f6683efffe61175a79583ab200b048aa295ccac1
doc_id: 774383
cord_uid: nm0i0rtz

From May to July 2015, there was a nation-wide outbreak of Middle East respiratory syndrome (MERS) in Korea. MERS is caused by MERS-CoV, an enveloped, positive-sense, single-stranded RNA virus belonging to the family Coronaviridae. Despite expert opinions that the danger of MERS might be exaggerated, there was an overreaction by the public according to the Korean mass media, which led to a noticeable reduction in social and economic activities during the outbreak. To explain this phenomenon, we presumed that machine learning-based analysis of media outlets would be helpful and collected a number of Korean mass media articles and short-text comments produced during the 10-week outbreak. To process and analyze the collected data (over 86 million words in total) effectively, we created a methodology composed of machine-learning and information-theoretic approaches. Our proposal included techniques for extracting emotions from emoticons and Internet slang, which allowed us to significantly (approximately 73%) increase the number of emotion-bearing texts needed for robust sentiment analysis of social media. As a result, we discovered a plausible explanation for the public overreaction to MERS in terms of the interplay between the disease, mass media, and public emotions.

Middle East respiratory syndrome (MERS) is an infectious disease caused by the MERS-coronavirus (MERS-CoV) [1, 2] . A large outbreak of MERS occurred in Korea from May to July 2015 [3, 4] . Although the country had advanced medical systems with reliable public health monitoring capabilities, the outbreak, which started with a single case, caused massive public fear, affecting various aspects of civil life. Inappropriate initial responses and insufficient information were believed to cause unnecessary social chaos [5] [6] [7] . Because of worries about infection, the majority of the public refrained from performing normal social and economic activities [8] [9] [10] . For example, owing to a sharp decrease in Chinese tourists, the Korea Tourism Organization (KTO) reported that the number of tourists in June decreased by 41% compared with the previous year [11] . As a result, the country had to experience lower economic growth than originally estimated before the outbreak, even though the Korean government declared a de facto end to the MERS outbreak on July 28, 2015, approximately 10 weeks after the first confirmed case [12] .

An overreaction to a moderate infectious disease by the public can cause various unnecessary complications. By contrast, negligence with regard to dangerous infections can result in a widespread pandemic that could have been controlled by sufficient public attention. For instance, although the incidence and mortality rate of tuberculosis in Korea are the highest among Organization for Economic Co-operation and Development (OECD) countries, public attention has not been drawn to this airborne disease as vividly as MERS.

Evidently, it would require significant time and resources to monitor public thoughts of and reactions to infectious diseases in a traditional way, which makes it inappropriate for the purpose of controlling infectious diseases requiring prompt public and government responses. Instead, utilizing social media can provide a rapid and effective means for monitoring public health on a large scale at low cost. A well-known example is Google Flu Trends, which provides query-based estimates of influenza activities for multiple countries [13] .

To investigate what triggered public overreaction to MERS in Korea, we presumed that machine learning-based analysis of media outlets could provide a plausible explanation. From the Internet, we collected articles reported by 153 news media outlets in Korea and comments associated with these articles from day 1 (the first confirmed case on May 20, 2015) to day 70 (the de facto end declared by the government on July 28, 2015). In Korea, in addition to Twitter and Facebook (two widely used social networks worldwide), short-text comments on news articles are extremely popular and often provide a common medium for expressing personal emotions and thoughts about social phenomena. The machine learning challenges in sentiment analysis using Twitter and Facebook data (such as short text lengths and semantic heterogeneity) would remain the same for mining the short-text comments we collected.

Based on the collected data (which consisted of 86,324,566 words from 490,749 articles and 3,901,985 comments), we performed thorough text mining and comparative analysis, focusing on the interplay between the disease, social/mass media, and public emotions. We developed a machine-learning engine for sentiment analysis of a large population. Our approach utilized information-theoretic and machine-learning techniques (such as topic modeling and word embedding) and included an effective method for extracting emotions from texts that hold sentiments (such as emoticons and so-called Internet slang). For comparative analysis, we additionally collected and analyzed articles and comments data for the H1N1 influenza epidemic in Korea in 2009 and the Ebola hemorrhagic fever reports in Korea in 2014.

Through our analysis results, we discovered a loop of information transfers [14] between the media and the public. We believe that this discovery may provide a reasonable explanation of the mechanism that triggered the overreaction to MERS in Korea. In addition, we report various analysis results that should be helpful for alleviating the excessive fear and overreaction of the public regarding nation-wide infectious diseases occurring in the future.

MERS is caused by MERS-CoV, which is an enveloped, positive-sense, single-stranded RNA virus belonging to lineage C of the genus Betacoronavirus (βCoV) in the family Coronaviridae [15]. It was first isolated from the sputum of a 60-year-old man with pneumonia in Saudi Arabia in 2012, and has since spread to the Middle East, Africa, Europe, the United States, and Asia including Korea [1, 16]. As of January 10, 2017, the World Health Organization (WHO) has reported 1,879 laboratory-confirmed cases of MERS-CoV and 659 deaths associated with MERS-CoV [17]. Twenty-seven countries have been affected by an outbreak of MERS-CoV, but the majority of cases (>85%) have been reported from Saudi Arabia [17]. Dromedary camels (Camelus dromedarius) are known to be the natural source of infection in humans, and consumption of contaminated milk, urine, or meat as well as direct contact with infected camels is the suspected transmission route [16]. In addition, human-to-human transmission through close contact of an infected individual with family members and health care workers was also confirmed [18, 19]. For example, Chen X. et al. [20] reported that 94.1% of MERS-CoV cases in the 2015 outbreaks in Korea had a history of contact in hospital facilities, and six cases (3.2%) were infected with MERS-CoV through community contacts.

Clinical features of MERS-CoV infection in humans range from asymptomatic or mild infection to severe acute respiratory disease, renal failure, and multi-organ failure leading to death [15]. Typical MERS symptoms are fever, shortness of breath, and cough, commonly but not always accompanied by pneumonia [17]. In addition, gastrointestinal symptoms, including diarrhea, nausea, and vomiting, have also been observed [21]. The global mortality rate was about 35.7% as of July 29, 2015, while it was 19.4% in Korea as of Aug 1, 2015 [3, 4]. The high-risk group is males above the age of 60 with underlying conditions such as cancer, lung disease, and diabetes [17, 22].

Currently, there is no specific therapeutic agent or approved vaccine against MERS-CoV. The broad-spectrum antiviral ribavirin, in combination with interferon, has been found to control MERS-CoV, but the clinical usage of these agents is limited due to toxicities [23] [24] [25]. Various attempts have been made to develop vaccines against MERS-CoV, and they are based on inactivated or attenuated viruses, viral vectors, virus-like particles, DNAs, or recombinant viral proteins [26]. In particular, subunit vaccines containing the receptor-binding domain (RBD) of the viral S protein have been shown to elicit strong neutralizing antibody responses in mice, representing a great potential for effective MERS-CoV vaccine development [27, 28]. In addition, the RBD is an attractive therapeutic target of anti-MERS-CoV drugs [26]. The RBD binds to CD26, also known as dipeptidyl peptidase 4 (DPP4), expressed on epithelial cells and initiates infection of the host cells [29]. Furthermore, viral proteases such as PLpro and 3CLpro, and viral accessory proteins are also potential targets for antiviral agents [30] [31] [32].

In text mining, latent Dirichlet allocation (LDA) is a generative, probabilistic model for discrete data [33] and is widely used in natural language processing (NLP) for modeling corpora and discovering topics therein. LDA considers a document as a mixture of topics, whose distribution is assumed to have a Dirichlet prior. The applications of LDA include topic modeling, document classification, and collaborative filtering.

For representing words in a text for analysis, we utilize the Word2Vec method, an NLP algorithm that takes a corpus and returns vector representations of the words in the corpus [34]. Word2Vec builds a vocabulary from training data and then learns word representations by, for instance, either the continuous bag-of-words (CBOW) method or the continuous skip-gram method. These representations allow us to add and subtract concepts as if they were ordinary vectors. For instance, we can evaluate the query "queen − woman + man" and obtain the result "king." According to Mikolov et al. [34], the CBOW model tends to be more efficient than the skip-gram model in training time and has slightly better accuracy for handling frequent words. In contrast, the skip-gram model is known to be better for limited training data with rare words or phrases.
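The vector arithmetic described above can be illustrated with a minimal sketch. The two-dimensional embeddings and the helper names (`cosine`, `analogy`, `toy_vectors`) are hypothetical and chosen only for illustration; real Word2Vec vectors are learned from a corpus and have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative).

    Words that appear in the query itself are excluded from the candidates.
    """
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# Hypothetical toy embeddings: axis 0 ~ "royalty", axis 1 ~ "maleness".
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

# queen - woman + man
result = analogy(toy_vectors, positive=["queen", "man"], negative=["woman"])
```

With these toy vectors the query vector is (1, 1), which coincides with the vector assigned to "king".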

In this study, we use the skip-gram model because of the need to handle infrequently occurring Internet slang and its limited training data. In the skip-gram model, we associate each word $w \in W$ with a vector $v_w \in \mathbb{R}^d$, where $W$ is the vocabulary set and $d$ is the embedding dimensionality. Let us suppose that the training corpus contains a sequence of $2n + 1$ words: $w_{i-n}, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_{i+n}$. The objective function of the skip-gram model is represented by the sum of the log probabilities of the words surrounding the target word $w_i$ [34]:

$$J_\theta = \sum_{-n \le j \le n,\ j \ne 0} \log p(w_{i+j} \mid w_i) \qquad (1)$$

and $J_\theta$ is maximized with regard to the model parameters $\theta$ by training the model.

In information theory, transfer entropy (TE) is a measure to quantify directed transfers of information between two random processes [14, 35]. To define the TE from one random process to another, let $X$ and $Y$ denote two stationary Markov processes of order $p$ and let $t$ indicate their time indices. Then, the TE from $X$ to $Y$ is defined as

$$T_{X \to Y} = H(Y[t] \mid Y[t-1:t-p]) - H(Y[t] \mid Y[t-1:t-p], X[t-1:t-p]) \qquad (2)$$

where $H(X)$ represents the Shannon entropy of the process $X$. Eq. 2 suggests that the TE from $X$ to $Y$ measures the reduction in uncertainty about the future values of $Y$, given the past values of $Y$, that is obtained by also knowing the past values of $X$.
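As an illustration of the TE definition, the following is a minimal sketch of a plug-in (frequency-count) estimator for discrete-valued series with history length p = 1. It is not the estimator used in this study, which relied on the JIDT toolkit; the function name `transfer_entropy` and the toy series are assumptions made for the example.

```python
from collections import Counter
from math import log2


def transfer_entropy(x, y):
    """Plug-in estimate (in bits) of the TE from x to y with history p = 1.

    Uses TE = sum p(y_t, y_{t-1}, x_{t-1})
                  * log2[ p(y_t | y_{t-1}, x_{t-1}) / p(y_t | y_{t-1}) ],
    which equals H(Y_t | Y_{t-1}) - H(Y_t | Y_{t-1}, X_{t-1}).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    triples = Counter()   # counts of (y_t, y_{t-1}, x_{t-1})
    pairs_yy = Counter()  # counts of (y_t, y_{t-1})
    pairs_yx = Counter()  # counts of (y_{t-1}, x_{t-1})
    singles = Counter()   # counts of y_{t-1}
    n = len(y) - 1
    for t in range(1, len(y)):
        triples[(y[t], y[t - 1], x[t - 1])] += 1
        pairs_yy[(y[t], y[t - 1])] += 1
        pairs_yx[(y[t - 1], x[t - 1])] += 1
        singles[y[t - 1]] += 1
    te = 0.0
    for (y_now, y_prev, x_prev), count in triples.items():
        p_joint = count / n
        p_full = count / pairs_yx[(y_prev, x_prev)]
        p_self = pairs_yy[(y_now, y_prev)] / singles[y_prev]
        te += p_joint * log2(p_full / p_self)
    return te


# Toy example: y copies x with a one-step delay, so x informs y.
x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
y = [0] + x[:-1]
te_xy = transfer_entropy(x, y)

# A constant source carries no information about y.
te_const = transfer_entropy([0] * len(y), y)
```

In the toy example the past of x determines y exactly, so the estimate is positive, whereas a constant source yields zero. For short, real-valued series such as daily counts, a plug-in estimator of this kind is biased, which is one reason to test significance by resampling.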

There have been approaches that utilize social media for public health surveillance. Paul and Dredze [36] and Signorini et al. [37] traced the trends of disease activity during outbreaks and analyzed the correlation between these trends and public responses. Their studies confirmed that social media can be used as a means to measure public interest by tracking disease-related texts on social media. Corley et al. [38] reported that there was high correlation between the prevalence of influenza that occurred in autumn 2008 in the United States and the quantity of influenza-related personal blogs. Aramaki et al. [39] proposed a support vector machine-based method to classify whether a Twitter user was infected by influenza or not based on the tweets of the user. Towers et al. [40] analyzed the potential influence between Ebola-related news and Ebola-related searches or tweets using a mathematical model of contagion. In our prior work [41], we performed proof-of-concept experiments to test the effectiveness of social media-based analysis of public reactions to widespread infectious diseases such as MERS.

In text mining, there exist approaches that perform sentiment analysis for social media data. Agarwal et al. [42] proposed a tweet mining method that performed part-of-speech (POS) tagging, tree kernel-based feature extraction, and three-class sentiment classification (positive, neutral, and negative). Yu and Hatzivassiloglou [43] reported that the naïve Bayes classifier is effective for sentiment analysis of news articles in terms of precision and recall.

The application of sentiment analysis techniques to social media has led to approaches for public health surveillance using social media. Salathe and Khandelwal [44] performed sentiment analysis of H1N1 vaccination related tweets in 2010 in the United States. They utilized the location information embedded in Tweets to reveal a network. Ji et al. [45] defined a metric called the degree of concern to classify public emotions and proposed an epidemic sentiment monitoring system. Greaves et al. [46] conducted sentiment analysis using comments on hospitals on the English National Health Service website. Based on the result, they proposed a method for recommending hospitals to other patients.

In this paper, by significantly extending our prior work [41], we analyzed the interaction among the disease, the media, and the public. Unlike previous research, which mostly dealt with non-anonymous data (such as Twitter or Facebook), our method uses comments that preserve the anonymity of the public. This attribute of the data helps reflect public emotion as it is. Compared with our prior work [41], this paper proposes more advanced methodologies that can process a larger-scale dataset with more delicate detection of signals in texts, along with a more extensive set of experiments that can verify the effectiveness of our approach for real-world media outlet datasets.

Our methodology consisted of four parts: dictionary generation, media article analysis, short-text comment analysis, and TE analysis. Fig. 1 shows the first three parts.

We collected the data (news articles and short-text comments) listed in Table 1 from 153 news media in Korea through Naver (http://www.naver.com), which is the most popular Internet portal in Korea.

After crawling the news articles and comments that contained related key words (such as MERS, Ebola, and H1N1), we first converted the multimedia articles from broadcasting companies to texts. From each article and short-text comment, we then extracted relevant information (such as the title, time, contents, and reply counts) and stored the information in the JavaScript Object Notation (JSON) format, a lightweight data-interchange format ( http://json.org). We also collected MERS epidemic data from the Korean Ministry of Health and Welfare [4] website and the WHO website [3] .

We generated three types of dictionaries to capture explicit and implicit emotions: translated and expanded emotion lexicon (TEEL), Internet slang-emotion dictionary (ISED), and emoticon-emotion dictionary (EED). A key for TEEL is one of the representative emotion words (sorrow, anger, fear, and hate) and the value for each key is a set of additional emotion words (e.g., fear → {angst, panic, terror}). A key for ISED is an Internet slang word, and its value is a pair of the meaning and the emotions embedded in the word (e.g., Baracklight → ("the aura Barack Obama sheds on things with his presence and vision", trust/joy/powerful)). A key for EED is an emoticon and its value is the matching emotion (e.g., :-( → unhappy).

To generate TEEL, we first translated the NRC emotion lexicon [47] into Korean in step A0. We then collected random Korean corpora (including Twitter) by crawling (about 107 GB in size) in addition to the collection of news articles and comments mentioned previously. In step A1, we separated sentences by whitespace delimiters and removed non-alphanumeric characters (except for well-known emoticons, such as :-) and ^^), converting the remaining characters into lower case. In step A2, we filtered out irrelevant words (such as stop words and common verb forms) using the part-of-speech (POS) tagging information and generated the vocabulary set $W$ mentioned in Section 2.2. In step A3, we converted each word $w \in W$ into its $d$-dimensional vector representation $v \in \mathbb{R}^d$ using the skip-gram model. In step A4, we expanded the list of emotion words (denoted by $E = \{\text{fear}, \text{anger}, \text{sorrow}, \text{hate}\}$) in the translated NRC emotion lexicon by inserting similar words from the corpora into the lexicon (the similarity was measured in the $d$-dimensional vector space created by Word2Vec).
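Step A4 can be sketched as a nearest-neighbor search in the embedding space. The example below is a minimal illustration under stated assumptions: cosine similarity as the similarity measure, a fixed threshold of 0.8, and hypothetical two-dimensional toy vectors. The source specifies only that similarity was measured in the Word2Vec vector space, so the measure and the cut-off here are not taken from the study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_lexicon(seeds, vectors, threshold=0.8):
    """Map each seed emotion word to the corpus words similar to it.

    A word is added under a seed when its cosine similarity to the seed
    vector is at least `threshold` (an assumed cut-off). Seed words
    themselves are never added as expansions.
    """
    expanded = {}
    for seed in seeds:
        similar = [
            word
            for word, vec in vectors.items()
            if word not in seeds and cosine(vectors[seed], vec) >= threshold
        ]
        expanded[seed] = sorted(similar)
    return expanded


# Hypothetical toy embeddings; real vectors come from the skip-gram model.
toy_vectors = {
    "fear": [1.0, 0.0],
    "anger": [0.0, 1.0],
    "panic": [0.9, 0.1],
    "terror": [0.95, 0.05],
    "rage": [0.1, 0.9],
    "table": [-0.7, -0.7],
}

teel = expand_lexicon(["fear", "anger"], toy_vectors)
```

The result has the same key-to-set shape as the TEEL example given earlier (fear → {panic, terror}), with "table" excluded because it is dissimilar to both seeds.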

In addition to TEEL, we also generated the ISED and EED by comparing each Internet slang word and emoticon with emotion words in the vector space. When constructing the ISED and EED, the emotional word belonging to the TEEL was mapped after the TEEL emotional expansion process. The TEEL and EED generation methods were almost identical except for the different mapping terms. In addition to mapping the closest emotional word in the case of the ISED, the original meaning of the word was manually matched.

The input for this step was the collection of 490,749 articles collected and processed as explained previously. The output was the daily changes of the ratios of article topics.

In step B1 (preprocessing), we tokenized the text in each article and removed unnecessary tokens. In step B2 (POS tagging), we performed POS tagging to identify nouns and adjectives using Komoran (version 2.4), an open-source Korean POS tagger (http://shineware.tistory.com/entry/KOMORAN-ver-24). In step B3 (topic modeling), we utilized the lda package included in the R environment (https://www.r-project.org/) to discover three main topics. To provide a more detailed explanation, we present an example of the use of an LDA model on MERS article data. Our data contains 229,448 documents. After removing stop words, we deleted the words that occurred fewer than 20 times. Finally, we obtained a total of 18,589 unique words. Then, we used the EM algorithm to determine the Dirichlet and conditional multinomial parameters for a 3-topic LDA model. The top words from some of the resulting multinomial distributions are illustrated in Fig. 6(b). In LDA, these distributions are expected to capture underlying topics in the corpus. We grouped the words in the corpus according to the resulting distributions of individual words, and then we named each group based on the words in the group (disease, government/politics, and economy). In step B4, we computed the daily fraction of each topic to draw Fig. 6(a) and Fig. 10(a).
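Two of the bookkeeping steps above, the removal of rare words before topic modeling and the daily topic fractions of step B4, can be sketched as follows. This is a minimal illustration, not the R code used in the study: the function names are hypothetical, and the sketch assumes that each article has already been assigned one dominant topic by the LDA model.

```python
from collections import Counter, defaultdict


def prune_vocabulary(documents, min_count=20):
    """Drop tokens that occur fewer than `min_count` times in the corpus."""
    totals = Counter(token for doc in documents for token in doc)
    return [[tok for tok in doc if totals[tok] >= min_count] for doc in documents]


def daily_topic_fractions(labelled_articles):
    """Turn (day, topic) pairs into {day: {topic: fraction of that day's articles}}."""
    counts = defaultdict(Counter)
    for day, topic in labelled_articles:
        counts[day][topic] += 1
    fractions = {}
    for day, topic_counts in counts.items():
        total = sum(topic_counts.values())
        fractions[day] = {t: c / total for t, c in topic_counts.items()}
    return fractions


# Toy corpus: with min_count=2 only "mers" survives.
docs = [["mers", "hospital"], ["mers", "economy"], ["mers"]]
pruned = prune_vocabulary(docs, min_count=2)

# Toy labels: four articles on day 1, two on day 2.
labels = [
    (1, "disease"), (1, "disease"), (1, "economy"), (1, "government/politics"),
    (2, "government/politics"), (2, "government/politics"),
]
fractions = daily_topic_fractions(labels)
```

Plotting `fractions` over days for each topic gives a curve of the kind described for Fig. 6(a).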

The input for this step was the set of 3,901,985 short-text comments. The output was the daily trend of each of the four types of public emotions (fear, sorrow, hate, and anger) regarding each of the three major topics discovered in Section 3.3.

In step C1 (preprocessing), we tokenized the text and removed unnecessary components while preserving emoticons and Internet slang. In step C2, we replaced Internet slang and emoticons by the words corresponding to their emotions using the ISED and EED, respectively. Step C3 (POS tagging) extracted nouns and adjectives. In step C4 (word embedding), we applied Word2Vec to the large corpus of preprocessed short-text comments to produce a 200-dimensional vector space, in which each unique word in the corpus was assigned a corresponding vector. To implement Word2Vec, we used the library provided by Mikolov et al. [34]. Using the resulting vector representations, in step C5, we computed the proximity of each emotion word to each of the three topics appearing in the articles, producing daily time series of emotion trends.
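Step C2 amounts to a dictionary lookup over tokens. The sketch below is a minimal illustration: the function name is hypothetical, the ISED and EED entries for "baracklight" and ":-(" follow the examples given in the dictionary description, and the "^^" entry is an assumed mapping added only for the example.

```python
def replace_emotion_tokens(tokens, ised, eed):
    """Replace emoticons and Internet slang with their emotion words.

    `eed` maps an emoticon to one emotion word; `ised` maps a lower-cased
    slang word to a (meaning, list of emotions) pair. Other tokens are kept.
    """
    replaced = []
    for token in tokens:
        if token in eed:
            replaced.append(eed[token])
        elif token.lower() in ised:
            _meaning, emotions = ised[token.lower()]
            replaced.extend(emotions)
        else:
            replaced.append(token)
    return replaced


# Toy dictionaries; the "^^" mapping is an assumption for illustration.
eed = {":-(": "unhappy", "^^": "joy"}
ised = {
    "baracklight": (
        "the aura Barack Obama sheds on things with his presence and vision",
        ["trust", "joy", "powerful"],
    ),
}

comment = ["MERS", "news", ":-("]
converted = replace_emotion_tokens(comment, ised, eed)
slang_converted = replace_emotion_tokens(["Baracklight", "today"], ised, eed)
```

Because a comment with only an emoticon would otherwise contain no emotion word at all, this substitution is what increases the number of emotion-bearing tokens available to the later embedding step.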

To quantify the public emotion on a given day as a score, we defined the emotional proximity function (EPF) over the vocabulary $W_E$ of emotion words $E$ and a topic word $t$ as follows:

where $E$ is the list of emotion words defined in Section 3.2, $t$ represents a topic word (such as MERS, government, and economy), corr denotes the correlation coefficient, and $v_e$ and $v_t$ represent the word-embedding representations of the emotion word $e \in E$ and the topic word $t$, respectively. For the set of short-text comments made on each day, we then measured the emotional proximity to the topic words using the EPF scores defined above, thus monitoring the trend of emotional changes of the public. Of note is the effectiveness of the emotion separation we proposed. As stated in the final paragraph of Section 5, we could substantially boost the number of emotion-bearing words using our approach. For sentiment analysis of social media, it is often challenging to obtain a sufficient number of emotion words because of the limited text lengths. Our technique will thus be helpful for other types of sentiment analysis of social media.
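A correlation-based proximity score of this kind can be sketched as follows. The sketch is an assumption-laden illustration rather than the EPF itself: it takes the Pearson correlation between the components of each emotion-word vector and the topic-word vector, and averages over the emotion words. Both the use of Pearson correlation and the averaging are assumptions, and the three-dimensional vectors are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_u * var_v)


def emotional_proximity(emotion_words, topic_word, vectors):
    """Mean correlation between the topic vector and each emotion-word vector.

    Averaging over the emotion words is an assumption of this sketch.
    """
    scores = [
        pearson(vectors[e], vectors[topic_word])
        for e in emotion_words
        if e in vectors
    ]
    if not scores:
        raise ValueError("no emotion word has an embedding")
    return sum(scores) / len(scores)


# Hypothetical 3-d embeddings for one day's comments.
toy_vectors = {
    "mers": [1.0, 2.0, 3.0],
    "fear": [2.0, 4.0, 6.0],     # perfectly correlated with "mers"
    "terror": [3.0, 2.0, 1.0],   # perfectly anti-correlated with "mers"
}

score_fear = emotional_proximity(["fear"], "mers", toy_vectors)
score_mixed = emotional_proximity(["fear", "terror"], "mers", toy_vectors)
```

A score built this way always lies between −1 and +1, which matches the proximity range reported for the emotion plots.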

As shown in Fig. 3, we derived the time series for the MERS epidemic, mass media, and public emotion variables from the articles and comments processed as above. Before the TE can be computed, surveillance data must be transformed from non-stationary to stationary data. We therefore did not use the raw time series shown in Fig. 4(a) for the analysis. Instead, we applied first-order differencing to obtain stationary time series, as done in many other studies [48] [49] [50] that focused on infectious disease surveillance data. To check stationarity, Fig. 2 shows autocorrelation function (ACF) and partial autocorrelation function (PACF) plots, which are among the most widely used tools in time-series analysis. We then computed the TE values from these time series using the JIDT toolkit [51] to quantify the information transfers between the variables, generating the results shown in Fig. 4. We tested the statistical significance of the TE values by block bootstrapping [35].
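First-order differencing replaces each value with its change from the previous time step. A minimal sketch, with a hypothetical function name and toy counts that are not taken from the study's data:

```python
def first_difference(series):
    """Return the first-order differences s[t] - s[t-1] for t = 1..len-1."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Toy cumulative counts with an upward trend (not stationary).
cumulative_cases = [1, 3, 6, 10, 15]
daily_change = first_difference(cumulative_cases)
```

The differenced series is one element shorter than the input, so any two series compared afterwards should be differenced in the same way to stay aligned in time.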

The graph in Fig. 3 represents the three variables in our analysis (MERS epidemic, mass media, and public emotion) by vertices and their interactions by edges. The objective of our experiment was to reveal the interactions between these variables. Thus, we collected the time-series of the numbers of confirmed MERS cases and people under quarantine (for the MERS variable), the online articles (for the mass media variable), and the short-text comments on the articles (for the public emotion variable) for 70 days of the MERS outbreak in Korea. For comparative analysis, we also collected articles and comments on H1N1 influenza and Ebola hemorrhagic fever. Table 1 summarizes the statistics of the data used in our study.

We aimed at understanding the interplay between the disease, media, and emotion variables. Fig. 4(a) shows the time series associated with each of these variables. Note that the three time series represent the normalized numbers of confirmed MERS cases, MERS-related articles, and public fear (see Section 3 for more details of the normalization and emotion measurements we propose).

In order to quantify the flow of influence from one series to another, we measured the TE (see Section 2.2) values between each pair of time series. Fig. 4(b) shows the direction and magnitude of the TE values between variables and we show only those values that are greater than 0.5 and statistically significant. The clearest flow of information is from the disease to public fear, and there is a loop of information transfers between the media and emotion variables. This observation is compatible with reality: the outbreak of MERS would create public fear; the mass media would cover the events people fear, and reciprocally, people would feel fear about the articles that cause fear. The influence between the MERS epidemic and mass media proved to be weak, providing TE values of less than 0.5.

For finer-grained analysis of information transfers, we divided each of the three time series into five 14-day segments and measured TE between two segments for each fortnight. Fig. 4(c) shows the result as a heat map, in which rows and columns correspond to variable pairs and fortnights, respectively. This result shows that the interactions between the three variables tended to increase until the third fortnight arrived and then decreased after that.
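The segmentation into fortnights can be sketched as below; the helper name is hypothetical, and the pairwise TE would then be computed within each window.

```python
def split_into_windows(series, window=14):
    """Split a series into consecutive, non-overlapping windows of equal size."""
    if len(series) % window != 0:
        raise ValueError("series length must be a multiple of the window size")
    return [series[i:i + window] for i in range(0, len(series), window)]


# Day indices 1..70 of the outbreak period give five 14-day segments.
days = list(range(1, 71))
fortnights = split_into_windows(days)
```

Each 14-day window contains few observations, so per-window estimates are noisier than the estimate over all 70 days.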

We analyzed how the mass media reacted to the MERS epidemic. Fig. 5(a) shows the time series of the numbers of confirmed MERS cases, people under quarantine, and the MERS-related articles, measured each day. The cross-correlation analysis shown in Fig. 5(b) reveals that these time series have high correlation with one- to eight-day delays. In particular, the delay information allows an interesting interpretation of MERS-media interactions. First, only a one-day delay was observed between the confirmed MERS cases and the articles, which implies immediate news coverage of confirmed cases. Second, there was approximately a two-day delay from the news article series to the quarantined people series, whereas there was a longer eight-day delay from the confirmed case series to the quarantined people series. Monitoring media articles thus provided a rapid indicator of the trend in the number of quarantined people.
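The delay between two series can be estimated by correlating one series with a shifted copy of the other and picking the shift with the highest correlation. The following is a minimal sketch with hypothetical function names and toy data; it is not the code used to produce Fig. 5(b).

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_u * var_v)


def lagged_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag] for a non-negative lag."""
    if lag < 0 or lag > len(x) - 2:
        raise ValueError("lag must satisfy 0 <= lag <= len(x) - 2")
    return pearson(x[:len(x) - lag], y[lag:])


def best_lag(x, y, max_lag):
    """Lag in 0..max_lag at which y follows x with the highest correlation."""
    return max(range(max_lag + 1), key=lambda k: lagged_correlation(x, y, k))


# Toy data: y repeats x two days later.
x = [0, 1, 4, 2, 7, 3, 5, 1, 0, 2]
y = [0, 0] + x[:-2]
delay = best_lag(x, y, max_lag=4)
```

A high lagged correlation indicates that one series tends to move ahead of the other; it does not by itself establish the direction of influence, which is what the TE analysis addresses.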

Using the LDA technique, we analyzed what kinds of topics were covered by the mass media, as shown in Fig. 6 . Through this analysis, we discovered three topics (MERS, economy, and government/politics) dominating the MERS-related articles. Fig. 6(a) shows the change of the ratio of each of these three topics over the outbreak period. The list in Fig. 6(b) shows the top five words found to be relevant to each topic.

According to our LDA analysis, the articles covering MERS itself were dominant early in the outbreak, gradually decreasing toward the end. In contrast, articles that mentioned government/politics were negligible in quantity initially but continued to increase, exceeding the MERS articles at day 41 and eventually dominating toward the end. The economy-related articles increased from start to end but remained more moderate compared to the government/politics-related articles. These trends in the fraction of major topics covered in the mass media seem related to the trends in public emotions presented in Section 4.3.

We analyzed how the mass media and public emotion variables interacted. Fig. 7 shows the change of the four types of emotions (anger, sorrow, fear, and hate) regarding the articles on the Korean government/politics (Fig. 7a) and MERS (Fig. 7b) during the outbreak. The y-axis of each plot indicates the proximity of each of the four emotion types to the topic (the government/politics or MERS itself). A proximity value is in the range $[-1, +1]$, and $+1$ ($-1$) indicates a perfect match (mismatch) between the feeling and the topic word (refer to Section 3 for details on quantifying the proximity). For each plot, we also annotated some of the key incidents.

For the government/politics-related articles (Fig. 7a) , the timing of the first two peaks of anger matched two press conferences: one by the mayor of Seoul (day 16) and the other by the owner of the hospital (day 35) where the first confirmed MERS case was reported. This result suggests that both of the press conferences caused negative feelings in the public regarding the government and politicians. All of the four types of emotion disappeared or significantly reduced at the time when the government declared the de facto end to the outbreak.

For the MERS-related articles (Fig. 7b) , the feeling of hate grew noticeably at first, but soon the feeling of fear began to dominate. To further interpret the trend of emotional changes shown here, we present our analysis results on the disease-emotion interaction in the next subsection.

To understand how the lethal MERS cases affected public emotions, Fig. 8 overlays the number of people who died of MERS and the level of each of the four types of public emotion during the outbreak. Note that the emotion level curves are the same as those depicted in Fig. 7(b) . Public fear increased initially, reaching a peak on day 36. There were no deaths for about a week between days 42 and 48, and fear decreased significantly. However, three deaths were reported serially from day 49 to day 52, and fear increased again. Fear vanished eventually because no more deaths were reported from day 53 to the end of the outbreak.

Finally, we conducted comparative analysis of three infectious diseases that either occurred in Korea (MERS and H1N1 influenza) or had news coverage in the Korean mass media (Ebola hemorrhagic fever) to compare public responses to these diseases. Fig. 9(a) shows the change in public emotions regarding H1N1 influenza and Ebola hemorrhagic fever. Both of these two plots show similar trends to those depicted in Fig. 7(b) for MERS: fear tended to be the dominant emotion during the outbreak. No sign of odd emotions from the public toward MERS could be found in this result. By contrast, Fig. 9 (b) shows the number of media articles for the three days after the first occurrence of each disease and the number of media articles after the first death from each disease. The number of articles after the first occurrence was similar for the three diseases, but there were substantially more articles about MERS deaths. As will be discussed in Section 5, this excessive number of articles may have triggered the overreaction to MERS by Korean people, along with the somewhat overestimated death rate of MERS in Saudi Arabia.

We also detected a peculiar phenomenon in Fig. 9(c), which compares public emotions (hate, fear, sorrow, and anger) regarding the three diseases. The distribution of public emotions for the two high-mortality diseases (Ebola and MERS) differed from that for the low-mortality disease (H1N1). In the case of MERS, the death of a patient triggered a strong reaction from the media, whereas no such reaction occurred for Ebola. We therefore expect that an overreaction similar to that for MERS would have occurred if a death had been reported in Korea during the Ebola outbreak.

For further comparison, we conducted LDA analysis of media articles on H1N1 influenza and Ebola hemorrhagic fever, as shown in Fig. 10. Comparing these results with those shown in Fig. 6(a) for MERS, we can observe the following: First, the trends in the changes of topics have some periodicity for the H1N1 influenza case. The start of each period is related to the occurrence of a death. Second, the trends in topic changes observed in each period of the H1N1 case (Fig. 10a) or in the Ebola case (Fig. 10c) are very similar to those observed for the MERS case in Fig. 6(a). Fig. 10(b) and (d) show the top five words relevant to each topic for the H1N1 and Ebola cases, respectively.

Before this study, it was already conjectured that Korean people showed unnecessarily high levels of fear and worries about MERS, which then put the nation in danger of short-term economic recession caused by reduced social and economic activities. By mining MERS-related social media, we could verify this conjecture rapidly and effectively, reporting analysis results only a few days after the de facto end of the MERS outbreak declared by the government. Relying on traditional approaches would require substantially more time and resources.

We then asked what caused this overreaction and how public emotions might be controlled in the next outbreak of an infectious disease. Compared to the H1N1 and Ebola cases, the MERS outbreak did not show any peculiar patterns in terms of the evolution of public emotions (Fig. 7b versus Fig. 9a) or the change of topics in the media (Fig. 6 versus Fig. 10). By contrast, there was a significant difference between MERS and the other two diseases in the number of articles after the occurrence of the first death, as observed in Fig. 9(b). To link this observation to the explanation of the overreaction to MERS, it is helpful to scrutinize the influence relationships shown in Fig. 4(b). As expected, the outbreak of an infectious disease affects public emotions, as indicated by the arrow with a TE of 0.79. Rather, we need to focus on the interplay between the mass media and public emotion variables, as represented by the reciprocal arrows with TE values of 0.52 and 0.53.
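To make the notion of directed information transfer more concrete, the following is a minimal sketch of a plug-in transfer entropy (TE) estimate for two discrete time series with a history length of one step. It is an illustration under our own assumptions and not the estimator used to obtain the values in Fig. 4(b); the function name `transfer_entropy` and the toy series are hypothetical, and the result is in bits, so it is not directly comparable to the TE values quoted above.

```python
from collections import Counter
from math import log2

def transfer_entropy(source, target):
    """Plug-in estimate of TE(source -> target) in bits, history length 1.

    TE = sum over (y1, y0, x0) of p(y1, y0, x0) * log2( p(y1 | y0, x0) / p(y1 | y0) ),
    where y1 is the next target value, y0 the current target value,
    and x0 the current source value.
    """
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    triples = list(zip(target[1:], target[:-1], source[:-1]))
    n = len(triples)
    c_y1y0x0 = Counter(triples)
    c_y0x0 = Counter((y0, x0) for _, y0, x0 in triples)
    c_y1y0 = Counter((y1, y0) for y1, y0, _ in triples)
    c_y0 = Counter(y0 for _, y0, _ in triples)

    te = 0.0
    for (y1, y0, x0), count in c_y1y0x0.items():
        p_joint = count / n
        p_given_both = count / c_y0x0[(y0, x0)]     # p(y1 | y0, x0)
        p_given_hist = c_y1y0[(y1, y0)] / c_y0[y0]  # p(y1 | y0)
        te += p_joint * log2(p_given_both / p_given_hist)
    return te

# Hypothetical toy series: `follower` copies `driver` with a one-step delay.
driver = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1]
follower = [0] + driver[:-1]

te_forward = transfer_entropy(driver, follower)
te_constant = transfer_entropy([0] * len(follower), follower)
print(te_forward, te_constant)
```

In this toy setting the estimate from `driver` to `follower` is positive because the driver's current value determines the follower's next value, while a constant source carries no information and yields zero. Reciprocal positive values in both directions, as reported for the mass media and public emotion variables, are what would be expected of two series that each help predict the other.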

Based on this interplay revealed by TE analysis, the substantial number of articles on MERS deaths, and a somewhat overestimated death rate of MERS (35.7% globally but 19.4% in Korea) [3, 4], we can take a step closer to a clear explanation of the public overreaction in Korea. Owing to the high reported death rate, the mass media reacted excessively (Fig. 9b), which then triggered an unnecessarily high level of fear among the public (as indicated by the edge with a TE of 0.53 in Fig. 4b). Such a strong reaction to news articles was likely to make reporters write more and more MERS-related articles (as predicted by the edge with a TE of 0.52 in Fig. 4b). Accordingly, a positive feedback loop was created between the mass media and public emotion variables, producing an overreaction to MERS. Understanding these interactions between the three variables analyzed may provide a helpful means of preventing similar reactions to nation-wide infectious diseases in the future.

From a methodological point of view, we confirmed the effectiveness of our text mining and comparative analysis techniques by applying them to the MERS case study, although we had to omit further details about the performance advantages of our approach because of space limits. In essence, the following points are worth mentioning: First, the Word2Vec approach, which was originally developed for Western languages, could be applied successfully to an East Asian language. Second, our approach for extracting emotion from a text word was effective in increasing the number of emotion-bearing words in social media corpora, which typically consist of short texts that make sentiment analysis difficult. For instance, using the proposed emotion separation technique, we could increase the number of emotion-bearing words from 582,448 to 1,010,945 (out of a total of 25,934,307 words in the short-text comment data), yielding a 73.6% increase.
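The 73.6% figure follows directly from the two counts reported above, as the short check below shows. The variable names are ours and are used only for illustration.

```python
# Counts of emotion-bearing words reported in the text.
before = 582_448       # without the emotion separation technique
after = 1_010_945      # with the emotion separation technique
total_words = 25_934_307  # all words in the short-text comment data

# Relative increase in emotion-bearing words, in percent.
pct_increase = (after - before) / before * 100
print(f"{pct_increase:.1f}%")  # 73.6%

# Share of all comment words that are emotion-bearing, before and after.
share_before = before / total_words * 100
share_after = after / total_words * 100
print(f"{share_before:.1f}% -> {share_after:.1f}%")
```

Even after the increase, emotion-bearing words remain a small fraction of the corpus, which is consistent with the difficulty of sentiment analysis on short texts noted above.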

In this paper, we have proposed a machine learning-based computational method for monitoring and understanding the emotional responses of the public to a widespread outbreak of an infectious disease. Our methodology is based on analyzing the massive media outlet data collected during the nation-wide outbreak of MERS in Korea in 2015. To provide a plausible explanation of the public overreaction to MERS observed in Korea, we focused on the interplay between the disease, mass media, and public emotions, discovering an intriguing loop of information transfers between the media and the public. Moreover, using our methodology, we reported various in-depth analysis results that may help alleviate unnecessary fear and overreaction among the public during future infectious disease outbreaks. In this regard, we anticipate that our approach will provide an efficient way of rapidly monitoring the reaction of the public to a nation-scale infectious disease, revealing useful information for timely control of the disease.

Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia

Middle East respiratory syndrome

Middle East respiratory syndrome risk perception among students at a university in South Korea

The communication of risk in disease outbreaks is too often neglected; that must change

Communication gaps fuel MERS worries in Korea

BBC

Reuters

Detecting influenza epidemics using search engine query data

Measuring information transfer

Middle East respiratory syndrome coronavirus (MERS-CoV): announcement of the Coronavirus Study Group

A more detailed picture of the epidemiology of Middle East respiratory syndrome coronavirus

WHO MERS-CoV Global Summary and risk assessment

Hospital outbreak of Middle East respiratory syndrome coronavirus

WHO: MERS-CoV Fact Sheet

Comparative epidemiology of Middle East respiratory syndrome coronavirus (MERS-CoV) in Saudi Arabia and South Korea

State of knowledge and data gaps of Middle East respiratory syndrome coronavirus (MERS-CoV) in humans

Development and evaluation of novel real-time RT-PCR assays with locked nucleic acid probes targeting the leader sequences of human pathogenic coronaviruses

IFN-α2a or IFN-β1a in combination with ribavirin to treat Middle East respiratory syndrome coronavirus pneumonia: a retrospective study

Case report ribavirin and interferon-α2b as primary and preventive treatment for Middle East respiratory syndrome coronavirus: a preliminary report of two cases

Ribavirin and interferon alfa-2a for severe Middle East respiratory syndrome coronavirus infection: a retrospective cohort study

Current advancements and potential strategies in the development of MERS-CoV vaccines

A safe and convenient pseudovirus-based inhibition assay to detect neutralizing antibodies and screen for viral entry inhibitors against the novel human coronavirus MERS-CoV

A truncated receptor-binding domain of MERS-CoV spike protein potently inhibits MERS-CoV infection and induces strong neutralizing antibody responses: implication for developing therapeutics and vaccines

A specific antidote for reversal of anticoagulation by direct and indirect inhibitors of coagulation factor Xa

Middle East respiratory syndrome coronavirus: transmission, virology and therapeutic targeting to aid in outbreak control

Coronaviruses post-SARS: update on replication and pathogenesis

Accessory proteins of SARS-CoV and other coronaviruses

Latent Dirichlet allocation

Efficient estimation of word representations in vector space

Entropy-based analysis and bioinformatics-inspired integration of global economic information transfer

You are what you tweet: analyzing Twitter for public health

The use of Twitter to track levels of disease activity and public concern in the US during the influenza A H1N1 pandemic

Monitoring influenza trends through mining social media

Twitter catches the flu: detecting influenza epidemics using Twitter

Mass media and the contagion of fear: the case of Ebola in America

Mining internet media for monitoring changes of public emotions about infectious diseases

Sentiment analysis of Twitter data

Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences

Assessing vaccination sentiments with online social media: implications for infectious disease dynamics and control

Monitoring public health concerns using Twitter sentiment classifications

Use of sentiment analysis for capturing patient experience from free-text comments posted online

NRC emotion lexicon

Making sense of antimicrobial use and resistance surveillance data: application of ARIMA and transfer function models

Use of time-series analysis in infectious disease surveillance

Monitoring epidemiologic surveillance data using hidden Markov models

JIDT: an information-theoretic toolkit for studying the dynamics of complex systems

This work was supported in part by the Seoul National Univer