key: cord-0628661-195ptkl0
authors: Tran, Vu; Matsui, Tomoko
title: Tweet Analysis for Enhancement of COVID-19 Epidemic Simulation: A Case Study in Japan
date: 2021-10-29
journal: nan
DOI: nan
sha: dcedbce9f6cfe209f0d2bd9e0490ea15a55689b8
doc_id: 628661
cord_uid: 195ptkl0
The COVID-19 pandemic, which began in December 2019, progressed in a complicated manner and thus caused problems worldwide. Seeking clues to the reasons for the complicated progression is necessary but challenging in the fight against the pandemic. We sought clues by investigating the relationship between reactions on social media and the COVID-19 epidemic in Japan. Twitter was selected as the social media platform for study because it has a large user base in Japan and because it quickly propagates short topic-focused messages ("tweets"). Analysis using Japanese Twitter data suggests that reactions on social media and the progression of the COVID-19 pandemic may have a close relationship. Experiments to evaluate the potential of using tweets to support the prediction of how an epidemic will progress demonstrated the value of using epidemic-related social media data. Our findings provide insights into the relationship between user reactions on social media, particularly Twitter, and epidemic progression, which can be used to fight pandemics.
We investigated the potential of using data from social media to enhance the prediction and simulation of an epidemic's progression. A case study was carried out using Twitter data related to the COVID-19 epidemic in Japan.
The COVID-19 pandemic has been causing global problems that have affected everyone for a lengthy period, and the end is not in sight. During the pandemic, people are seeking information or clues for use in deciding their next actions through a variety of channels: newspapers, TV, and especially social media. Studies have shown that social media greatly affects society.
Twitter is one of the largest social media platforms worldwide and greatly affects several aspects of society (daily life conversations, news reports, event advertisements, etc.) in various domains (health, entertainment, economics, research, politics, etc.). During the COVID-19 pandemic, a large volume of information on Twitter regarding the infection situation, symptoms, treatment, vaccinations, restrictions, and so on is being continuously shared and discussed. Users can share their emotions and opinions regarding the information instantaneously without geographical limitations. The effects of these emotions and opinions can thus spread rapidly.
Research on predicting the progression of the COVID-19 pandemic has received much attention worldwide. Early prediction is important for implementing countermeasures against its spread. Epidemiological models, e.g., the susceptible-exposed-infected-recovered (SEIR) model, are commonly used for such prediction. The parameters are obtained from observed data or set on the basis of predefined scenarios. Complex problems, e.g., the emergence of new variants, diverging government policies, and diverging public perceptions, have arisen as the pandemic has lasted longer and longer. Many countries, including Japan, have already experienced more than four waves of the pandemic.
To tackle the complicated progression of the COVID-19 pandemic and to deal with the challenge of obtaining parameters reflecting reality as conditions continue to change, recent research has focused on utilizing extra information to enhance the prediction model. One way to obtain such information is to monitor social media: Twitter, Facebook, Reddit, etc. Social networking services, which were initially simply playgrounds for small communities of computer users, have evolved into large social media platforms connecting both online and offline social networks.
Twitter, one of the largest social media platforms, has been targeted in numerous studies aimed at identifying the personality traits of social media users (Wald et al., 2012; Sumner et al., 2012). Monitoring social media is an attractive approach to gathering data for use in various types of research (Azzaoui et al., 2021; Yoneoka et al., 2020; Alessa et al., 2019; Yoo and Choi, 2020). Several epidemic-related behaviors can be observed on social media, for instance, health information seeking. A heavy reliance on social media has been observed during the COVID-19 pandemic (Neely et al., 2021; Dadaczynski et al., 2021; Skarpa and Garoufallou, 2021). Several studies of the formation of pandemic waves have revealed an association between non-pharmaceutical interventions and social behaviors (Cacciapaglia et al., 2021; Kupferschmidt, 2020; Ravi et al., 2021). Previous work on using Twitter data to support the prediction of COVID-19 epidemic progression has used tweet counts (with relevant keywords) (Yousefinaghani et al., 2021) and tweet full-text analysis (Azzaoui et al., 2021).
In research regarding social media affecting social behavior, emotion is a critical aspect (Settanni and Marengo, 2015; Park et al., 2012; Wald et al., 2012). Van Bavel et al. observed that, especially in the current COVID-19 pandemic, "Social networks can amplify the spread of behaviors that are both harmful and beneficial during an epidemic, and these effects may spread through the network to friends, friends' friends and even friends' friends' friends" (Van Bavel et al., 2020). The social network created by a popular social media platform such as Twitter is huge and offers instant connectivity without geographical limitations. This means that popular social media platforms can amplify the spread of behaviors to a magnitude much greater than offline social networks (e.g., neighborhoods) can.
Several studies have revealed the emotions of social media users towards COVID-19 progression (Wheaton et al., 2021; Arora et al., 2021; Toriumi et al., 2020; Dyer and Kolic, 2020; Mathur et al., 2020; Kaur et al., 2020). We investigated the use of emoji on Twitter to capture changes in the emotions of social media users for use in enhancing epidemiological models. Several studies have focused on capturing emotion from texts including posts on Twitter ("tweets"). However, accurately understanding emotional tweets by using full-text analysis is a challenging task. Emoji analysis is an attractive approach because social media users tend to express emotions using non-verbal communication, and they share a common understanding of many emoji. Several studies have shown that emoji are used on social media as non-verbal communication cues to assist communication (Suntwal et al., 2021; Elder, 2018; Cheng, 2017; Lo, 2008). Emoji are digital images depicting simple illustrations including facial expressions (smiley face, crying face, fearful face, scary face, etc.), as illustrated in Figure 1. Emotional messages can be directly expressed through emoji. Because social media users share a common understanding of many emoji, emotions can be effectively and conveniently communicated through emoji.
One crucial point when using social media data, particularly Twitter data, is that social media users may become less engaged, i.e., performing fewer actions such as "liking," "commenting," and "sharing," as the pandemic lasts longer and longer (Yousefinaghani et al., 2021). When engagement drops to a certain level, social media data becomes less representative of behavioral changes. The results of a study using Twitter data from the U.S. and Canada by Yousefinaghani et al. (2021) suggest that there will be less engagement through social media due to a feeling of exhaustion as waves of the pandemic continue.
In this study, we also took into consideration the results of previous studies using Japanese Twitter data.
The data consisted of tweet counts and COVID-19 infection data from Japan. The tweet count data were collected using the Twitter API (version 2) with academic research access. Several settings were considered, from the general COVID-19 related tweet count to more fine-grained target subsets of keywords. Three sets of keywords were used: the COVID-19 related set, the COVID-19 symptom related set, and the COVID-19 infection reporting related set. For each set, the collections were further filtered to retain only tweets containing emoji. The COVID-19 related set was the primary set used. The other sets were used for an ablation study and analysis of the characteristics of the tweets. The details of the settings are shown in Table 1. The collected data show that the number of COVID-19 related tweets has been correlated to some degree with the COVID-19 epidemic progression since the beginning of the epidemic (Figure 2).
The COVID-19 infection reporting data for Japan were obtained from JX Press. The dataset contains daily infection reports for all prefectures in Japan. It was used for training or calibrating two core models used by the epidemic simulation system described in Subsections 2.2 and 2.3.
As shown in Figures 2 and 3, the trend in reported infections or cases was similar to the trend in the reaction level on social media. This suggests a non-negligible correlation between the two signals. Predicting the trend of changes in the epidemic progression helps to set up appropriate scenarios for simulating the future epidemic state, which in turn supports policy makers. In this sense, given the suggestion of a potential relationship between the trends of the two signals, additional information from social media reactions may further support predicting changes in the epidemic progression.
Figure 4. COVID-19 epidemic simulation system ($t$ marks the end timing of observable data).
Here, the trend representation $s_t$ was estimated from the ratio of the signal values $o_t$ and $o_{t-7}$ for days $t$ and $t-7$, which fall on the same day of the week. The signal $o_t$ is either the reaction on Twitter, measured by tweet count, or the epidemic state, estimated from the reported number of new infections on day $t$, and $s_t$ represents the trend measured as the 7-day change. This transformation absorbs the weekly effect observed in the Japanese data. The transformed series was further smoothed with a 15-day moving average.
To model the relationship between the trend in social media reactions and the trend in epidemic progression, we utilized a long short-term memory (LSTM) neural network (Hochreiter and Schmidhuber, 1997), a well-known and successful neural network architecture in time-series modeling, and the multivariate time-series of the two trends. LSTM neural networks have been used in various domains for modeling time-series and have achieved practical results. In previous studies of COVID-19 epidemic prediction systems, LSTM models were used as the core models (Chimmula and Zhang, 2020; Shahid et al., 2020; Kırbaş et al., 2020). To cope with the unknown complexity of the relationship between the two time-series, we used an ensemble of multi-layer LSTM models with various hyperparameter settings (number of layers, number of neurons) and parameter initializations. The LSTM system is optimized by minimizing the mean squared error:
$$\mathrm{MSE} = \frac{1}{(t-1)\,d} \sum_{k=2}^{t} \lVert s_k - s^*_k \rVert_2^2 \qquad (2)$$
where $t$ marks the end of the observable or training data, $d = 2$ is the number of time-series (the trend of reactions on Twitter and the trend of the epidemic progression), and $s$, $s^*$ are the observed data and the corresponding predictions.
The inference procedure has two phases.
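The trend transformation described above (a 7-day ratio followed by a 15-day moving average) could be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `weekly_trend`, the use of a log ratio (chosen here so that the trend is positive when the signal is rising and negative when it is falling), and the trailing rather than centered averaging window are all assumptions, and the input counts are made up.

```python
import math

def weekly_trend(signal, lag=7, window=15):
    """Smoothed week-over-week trend of a daily, strictly positive signal."""
    # ASSUMPTION: the trend is the log of the ratio o_t / o_{t-lag}, so a value
    # above zero means growth relative to the same weekday one week earlier.
    raw = [math.log(signal[t] / signal[t - lag]) for t in range(lag, len(signal))]
    # ASSUMPTION: trailing moving average; the text only states "15-day".
    return [sum(raw[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(raw))]

# Usage with made-up daily counts that grow by 10% per day.
counts = [100 * 1.1 ** day for day in range(40)]
trend = weekly_trend(counts)
```

Comparing each day with the same weekday of the previous week is what removes the weekly reporting pattern; the moving average then suppresses the remaining day-to-day noise at the cost of shortening the series by `lag + window - 1` days.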
In the first phase, the LSTM ensemble system receives observed data $\{s_k \mid k \in [1, t]\}$ up to time $t$ and uses them to create memory state $c_{t+1}$ and prediction $s^*_{t+1}$ (Equation 3). In the second phase, from input time-step $t+1$, the prediction of the previous time-step is used as the input to predict the next time-step (Equation 4). The inference procedure is illustrated in the "LSTM" box at the top-left of Figure 4. In the training or optimization process, only the first phase is invoked, and predictions $s^*_{2:t} = \{s^*_k \mid k \in [2, t]\}$ are used for the aforementioned optimization.
$$s^*_{k+1},\ c_{k+1} = \mathrm{LSTM}(s_k,\ c_k), \quad k \in [1, t] \qquad (3)$$
$$s^*_{k+1},\ c_{k+1} = \mathrm{LSTM}(s^*_k,\ c_k), \quad k \in [t+1, t+T-1] \qquad (4)$$
where $k$ is the input time-step, $t$ marks the end of the observable data, $T$ is the length of the prediction period, $c$ is the memory state of the LSTM, and $s$, $s^*$ are the observed data and corresponding predictions. The outputs of the change prediction model are used for setting up the COVID-19 simulation system described in the next subsection.
The outputs of the change prediction model are processed to identify the timings when the predicted values change sign (illustrated in Figure 3):
• From positive to negative: the signal progression changes from increasing (up-trend) to decreasing (down-trend).
• From negative to positive: the signal progression changes from decreasing (down-trend) to increasing (up-trend).
The COVID-19 epidemic simulation system consists of two stages: 1) change prediction and 2) simulation. The change prediction is executed as described in Subsection 2.3. The simulation is executed using SEIR, a common epidemic model. The overall flow of the system illustrated in Figure 4 is as follows.
1. Data collection: collect tweet count and COVID-19 epidemic state;
2. Data transformation: estimate trend representations for tweet count and COVID-19 epidemic progression;
3. Change prediction: predict trends and identify change timings;
4. SEIR model parameter setup: set SEIR model parameters in accordance with the identified change timings;
5. Simulation: perform epidemic simulation.
We used the simulation system proposed by Lemaitre et al. (2021) with a stochastic SEIR model to model the disease dynamics. The system supports multi-location epidemic modeling to estimate the force of infection (the rate at which susceptible individuals are infected) by using inter-location mobility. The formulation of the SEIR model is described in the Appendix. We performed a prefecture-wide multi-location setup.
The SEIR model uses the following parameters: the latent period $1/\sigma$, which is the time interval between when an individual becomes infected and when he or she becomes infectious; the infectious period $1/\gamma$, which is the time interval during which an individual is infectious; and the effective reproduction number $R_i(t)$ for each location $i$ at time $t$, which is the average number of new cases generated by one case in the current state of a population. While the latent period $1/\sigma$ and infectious period $1/\gamma$ depend on the COVID-19 variant, the effective reproduction number $R_i(t)$ depends not only on the variant but also on the contact rate in the community, which changes as the behaviors of the community members change. During one wave of the COVID-19 epidemic, the change in $R_i(t)$ was greatly affected by behavioral changes due to perceived events, e.g., surging of cases and policy changes (emergency declarations), resulting in up-trends and down-trends in the epidemic progression. Hence, determining $R_i(t)$ is the key to effective simulation.
A set $R_i = \{R_i(t)\}$ was obtained using the calibration method used by Lemaitre et al. (2021) for the period from 2020/12/24 to 2021/01/21 (the 3rd wave in Japan) using the observed epidemic data. Two subsets of $R_i(t)$ were established: up-trend set $R^u_i$ (2020/12/24 to 2021/01/06) and down-trend set $R^d_i$ (2021/01/07 to 2021/01/21).
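To make the role of these parameters concrete, the following is a minimal single-location sketch of one stochastic SEIR time step. It is not the multi-location system of Lemaitre et al. (2021): the S to E draw follows the binomial form given in the Appendix, whereas the analogous binomial forms for E to I and I to R, the well-mixed force of infection `r_t * gamma * i / n`, the helper names, and all numeric values are assumptions made for illustration.

```python
import math
import random

def binomial(n, p, rng):
    """Draw from Binom(n, p) by summing Bernoulli trials (adequate for small n)."""
    return sum(1 for _ in range(n) if rng.random() < p)

def seir_step(state, r_t, sigma, gamma, dt, rng):
    """Advance the compartments (S, E, I, R) by one step of length dt (days)."""
    s, e, i, r = state
    n = s + e + i + r
    # ASSUMPTION: one well-mixed location with transmission rate R(t) * gamma.
    foi = r_t * gamma * i / n
    new_e = binomial(s, 1 - math.exp(-dt * foi), rng)    # S -> E
    new_i = binomial(e, 1 - math.exp(-dt * sigma), rng)  # E -> I (assumed form)
    new_r = binomial(i, 1 - math.exp(-dt * gamma), rng)  # I -> R (assumed form)
    return (s - new_e, e + new_e - new_i, i + new_i - new_r, r + new_r)

# Usage with made-up values: 1000 people, 5.2-day latent and 6-day infectious period.
rng = random.Random(0)
state = (990, 5, 5, 0)
for _ in range(30):
    state = seir_step(state, r_t=1.5, sigma=1 / 5.2, gamma=1 / 6, dt=1.0, rng=rng)
```

Because each transition is drawn from the compartment it leaves, the total population is conserved and no compartment can become negative, whatever value of $R(t)$ is supplied for the step.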
In the simulation period from 2021/04/23 to 2021/06/30, for each trend (up or down) time span $[t_s, t_e]$, a set of $\{R_i(t)\}$ for each location $i$ was drawn from a uniform distribution.
For evaluation, we measured the errors in the change prediction and simulation stages against the observed data for the period from 2021/04/23 (in the up-trend of the 4th wave) to 2021/06/30 (the ending of the 4th wave). We used data from 2020/12/24 to 2021/01/21 (in the 3rd wave) to obtain the SEIR model parameters and data from 2020/11/15 to 2021/04/22 (the end timing of observable data) for training the change prediction model. Two observed timings of trend changes were used for evaluation: $t_a$ = 2021/05/15 and $t_b$ = 2021/06/25, where $t_a$ marks the change from up-trend to down-trend, and $t_b$ marks the change from down-trend to up-trend in the epidemic progression as observed in the infection reports.
The evaluation metric for change prediction was the difference in days $\Delta\mathrm{days}[t]$ between the predicted date $\hat{t}$ and the actual date $t$ of the trend change (Equation 6). The evaluation metric for simulation was the root-mean-square error (RMSE).
Table 2 shows the results for change prediction and simulation. Two baselines were used for reference.
• Baseline 1: $R_i(t)$ was set for the entire simulation period using $R_i$ in the up-trend and down-trend periods of the 3rd wave. $R_i(t)$ were sampled for both the up-trend and down-trend periods without knowing the exact timing of the trend change.
• Baseline 2: $R_i(t)$ was set for the entire simulation period using $R^u_i$ in the up-trend period of the 3rd wave. $R_i(t)$ were sampled for only the up-trend period.
For our approach, we used three system settings:
• +change prediction w/o using tweet data: the epidemic simulation system was set up with change prediction using only the epidemic state data, not the tweet data.
• +change prediction using T.R.T. COVID-19 (g): the epidemic simulation system was set up with change prediction using both the epidemic state data and the COVID-19 related tweet count data.
• +change prediction using T.R.T. COVID-19 (e): similar to the setting for (g) except that tweets were filtered to remove ones not containing emoji.
The additional use of the COVID-19 related tweet count (g) resulted in better prediction of the epidemic progression trend changes than without using the count: prediction was improved by 8.5 days for $t_a$ and 6.3 days for $t_b$. This led to a reduction of 42.8% in the RMSE. Although the daily count of COVID-19 related tweets filtered for emoji (e) was 92.9% smaller than the more general count (g), the results were similar: the difference in change prediction was 0.2 days for $t_a$ and 2.4 days for $t_b$, and the RMSE was 5.5% worse. In all results, the predicted trend changes preceded the observed changes. The baseline results show that without estimating the trend change, the RMSE was 7.6 to 18.5 times higher.
The relationship between user reactions on social media and the COVID-19 epidemic progression remains close for the long term. Social media engagements related to COVID-19 have remained fairly steady over the five waves of COVID-19 epidemic surges in Japan. They reached their highest level in the first wave, dropped a bit in the second wave, and then picked up in the following waves. The engagements peaked at around the peak of each wave. This demonstrates the value of using epidemic-related social media data, particularly Twitter data.
The 3rd and 4th waves in the period from 2020/11/15 to 2021/06/25 exhibited similar characteristics: the wave shapes were similar (Figure 2) and the vaccination rates were similar. Despite the similar wave shapes, the reactions to non-pharmaceutical interventions and emergency declarations differed between the two waves.
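The identification of change timings from sign changes in the predicted trend, and the two evaluation metrics reported in Table 2, can be sketched as below. This is an illustrative reading rather than the authors' code: the function names are hypothetical, days on which the trend is exactly zero are simply skipped, and the day difference is taken as an absolute value, which is an assumption since the source does not state whether the difference is signed.

```python
import math
from datetime import date, timedelta

def sign_changes(trend, start):
    """Return (date, kind) for each day on which the trend changes sign."""
    changes = []
    for k in range(1, len(trend)):
        if trend[k - 1] > 0 and trend[k] < 0:
            changes.append((start + timedelta(days=k), "up-to-down"))
        elif trend[k - 1] < 0 and trend[k] > 0:
            changes.append((start + timedelta(days=k), "down-to-up"))
    return changes

def delta_days(predicted, actual):
    """Difference in days between predicted and actual change dates."""
    # ASSUMPTION: absolute difference.
    return abs((predicted - actual).days)

def rmse(observed, simulated):
    """Root-mean-square error between two equally long series."""
    squared = [(o - s) ** 2 for o, s in zip(observed, simulated)]
    return math.sqrt(sum(squared) / len(squared))

# Usage with a made-up predicted trend starting on the first simulated day.
changes = sign_changes([0.3, 0.1, -0.2, -0.1, 0.2], start=date(2021, 4, 23))
error_days = delta_days(date(2021, 5, 10), date(2021, 5, 15))
```

With the made-up trend above, the first change is detected two days after the start date (up-trend to down-trend) and the second four days after it (down-trend to up-trend).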
In the 3rd wave, an emergency declaration was issued on 2021/01/07, and a change in the epidemic progression trend (from increasing to decreasing) was observed on 2021/01/17 (10 days later). In contrast, in the 4th wave, an emergency declaration was issued on 2021/04/25, and a change in the epidemic progression trend was observed on 2021/05/15 (20 days later). The 10-day longer delay in the 4th wave is attributed to reluctance to comply or exhaustion after already being subjected to two previous emergency declarations. The reluctance or exhaustion level is somewhat correlated with the reaction on social media, which was partially captured by the change prediction model and resulted in more accurate prediction of the change in the epidemic progression trend.
To improve the simulation results further, the method for setting the SEIR model parameters needs to be improved, especially the setting of $R_i(t)$. In this study, the distribution from which the set of $\{R_i(t)\}$ for each location $i$ was drawn was assumed to be uniform, and the up- and down-trend parameter sets were manually established. The setting of the SEIR model parameters would be more challenging in periods in which the epidemic conditions greatly differed, e.g., the 5th wave in Japan, in which the delta variant was dominant. Viable options include selecting values from the most recent wave with adjustment for the infectiousness of newer variants and selecting from the period with the most similar social media reactions, although measuring similarity would be a challenging task. Furthermore, it is necessary to consider the emergence of new COVID-19 variants and how they would affect the parameters as well as the social media reactions. These challenges will be addressed in future work.
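The uniform sampling of $R_i(t)$ per trend span mentioned above could look like the following for a single location. This is a hedged sketch only: the source does not state the bounds of the uniform distribution, so taking them as the minimum and maximum of the calibrated up-trend or down-trend set is an assumption, as are the function names and the calibrated values, which are invented for illustration. The span lengths correspond to the observed change dates $t_a$ and $t_b$ within the simulation period.

```python
import random

def sample_r(calibrated, days, rng):
    """Draw one R value per day, uniformly between the calibrated extremes."""
    # ASSUMPTION: bounds are the min and max of the calibrated set.
    lo, hi = min(calibrated), max(calibrated)
    return [rng.uniform(lo, hi) for _ in range(days)]

def build_schedule(spans, r_up, r_down, rng):
    """Concatenate sampled R values over (trend, n_days) spans."""
    schedule = []
    for trend, n_days in spans:
        source = r_up if trend == "up" else r_down
        schedule.extend(sample_r(source, n_days, rng))
    return schedule

# Made-up calibrated sets standing in for the 3rd-wave up- and down-trend values.
r_up = [1.2, 1.4, 1.3]
r_down = [0.7, 0.9, 0.8]
# 2021/04/23 to 05/14 (up), 05/15 to 06/24 (down), 06/25 to 06/30 (up).
spans = [("up", 22), ("down", 41), ("up", 6)]
schedule = build_schedule(spans, r_up, r_down, random.Random(0))
```

Replacing `sample_r` with a draw from a fitted, non-uniform distribution, or adjusting the calibrated sets for a newer variant, would be one place to implement the improvements discussed above without touching the rest of the schedule logic.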
As preparation for future work, we performed experiments on training the change prediction model using different fine-grained tweet counts. The tweet counts are listed in Table 1, and the results of the additional experiments are shown in Table 3. Compared with using the general-topic COVID-19 related tweet counts, using more specific-topic tweet counts did not show improvement: the RMSE was 34.7%-82.2% worse for the simulation period. This suggests that the relationship between reactions on social media and epidemic progression is complex. The general count, covering a broad range of topics, exhibited greater predictive power than the more specific counts. Manual topic design thus may not be an efficient approach. The development of automatic topic discovery techniques for finding relevant topics discussed on social media that can support epidemic progression prediction could be promising.
The results for tweet counts with emoji filtering (e) compared with the general tweet counts (g) showed that the emoji settings have a representative value similar to that of the general settings: the RMSE difference was only 3.6%-5.8% even with 87.5%-96.4% fewer tweets. One advantage of using emoji settings is the ability to perform fine-grained analysis on specific emotions (fear, anger, etc.) represented by various emoji. Further studies on the specific emotions expressed by social media users for typical topics could help in discovering topics where changes in emotion could affect epidemic progression. This could be done by analyzing social media contents (emoji vs. topics) to identify emotions trending on topics relevant to epidemic progression. This is left for future work.
In this study, we used the simulation system proposed by Lemaitre et al. (2021) with a stochastic SEIR model to model the disease dynamics. This system supports multi-location epidemic modeling to estimate the force of infection using inter-location mobility.
For Japan, we performed a prefecture-wide multi-location setup. Given the parameters, including the reproduction numbers $R_i(t)$, latent period $1/\sigma$, and infectious period $1/\gamma$, the transitions between the compartments Susceptible, Exposed, Infected, and Recovered for each location $i$ are
$$N_{S_i \to E_i}(t) = \mathrm{Binom}\left(S_i,\ 1 - \exp\left(-\Delta t \cdot \mathrm{FOI}_i(t)\right)\right) \qquad (7)$$
where $\mathrm{FOI}_i(t)$ is the force of infection for location $i$, $M_{i,j}$ represents the daily mobility from location $i$ to location $j$, $H_i$ is the population of location $i$, $p_a$ is the proportion of time that moving individuals spend away, and $\alpha$ is the mixing coefficient.
The data analyzed in this study were obtained from Twitter (tweet counts), JX Press (COVID-19 epidemic state), and ZENRIN DataCom (mobility data) and used in accordance with the licenses and restrictions of Twitter's "Developer Agreement and Policy," JX Press' "License for Research Purposes," and ZENRIN DataCom's "License for Research Purposes." Requests to access these datasets should be directed respectively to https://twitter.com/, https://jxpress.net/, and https://www.zenrin-datacom.net/.
Preliminary flu outbreak prediction using Twitter posts classification and linear regression with historical Centers for Disease Control and Prevention reports: Prediction framework study
Role of emotion in excessive use of Twitter during COVID-19 imposed lockdown in India
SNS big data analysis framework for COVID-19 outbreak prediction in smart healthy city
Multiwave pandemic dynamics explained: How to tame the next wave of infectious diseases
Do I mean what I say and say what I mean? A cross cultural approach to the use of emoticons & emojis in CMC messages
Time series forecasting of COVID-19 transmission in Canada using LSTM networks
Digital health literacy and web-based information-seeking behaviors of university students in Germany during the COVID-19 pandemic: Cross-sectional survey study
Public risk perception and emotion on Twitter during the COVID-19 pandemic
What words can't say: Emoji and other non-verbal elements of technologically-mediated communication
Monitoring the dynamics of emotions during COVID-19 using Twitter data
Comparative analysis and forecasting of COVID-19 cases in various European countries with ARIMA, NARNN and LSTM approaches
Can Europe tame the pandemic's next wave?
A scenario modeling pipeline for COVID-19 emergency planning
The nonverbal communication functions of emoticons in computer-mediated communication
Emotional analysis using Twitter data during pandemic situation: COVID-19
Health information seeking behaviors on social media during the COVID-19 pandemic among American social networking site users: Survey study
Depressive moods of users portrayed in Twitter
How can India be prepared for the third wave
Sharing feelings online: Studying emotional well-being via automated text analysis of Facebook posts
Predictions for COVID-19 with deep learning models of LSTM, GRU and Bi-LSTM
Information seeking behavior and COVID-19 pandemic: A snapshot of young, middle aged and senior individuals in Greece
Predicting dark triad personality traits from Twitter usage and a linguistic analysis of tweets
Pictographs, ideograms, and emojis (PIE): A framework for empirical research using non-verbal cues
Social emotions under the spread of COVID-19 using social media
Using social and behavioural science to support COVID-19 pandemic response
Using Twitter content to predict psychopathy
Is fear of COVID-19 contagious? The effects of emotion contagion and social media use on anxiety in response to the coronavirus pandemic
Early SNS-based monitoring system for the COVID-19 outbreak in Japan: A population-level observational study
Predictors of expressing and receiving information on social networking sites during MERS-CoV outbreak in South Korea
Prediction of COVID-19 waves using social media and Google search: A case study of the US and Canada
We are grateful to the members of the COVID-19 Project at our institute for their valuable discussions in frequent meetings.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
VT and TM contributed to the conception and design of the study and to the data collection. VT implemented the system, performed data curation, conducted the experiments, and wrote the first draft of the manuscript. TM validated the progress and results of the study via daily discussion with VT. Both authors contributed to manuscript revision and read and approved the submitted version.
This work was supported with funding from the COVID-19 Program and the Future Investment Program of the Research Organization of Information and Systems, Japan.