title: Net-TF-SW: Event Popularity Quantification with Network Structure
authors: Nagaya, Hiroshi; Hayashi, Teruaki; Ohsawa, Yukio; Toriumi, Fujio; Torii, Hiroyuki A.; Uno, Kazuko
date: 2020-12-31
journal: Procedia Computer Science
DOI: 10.1016/j.procs.2020.09.194

Abstract: Event popularity quantification is essential for determining current trends in events on social media and the internet. It is particularly important during a crisis, to ensure appropriate information transmission and to prevent the diffusion of false rumors. Here, we propose Net-TF-SW, a noise-robust and explainable topic popularity analysis method. The method is applied to tweets related to COVID-19 and the Fukushima Daiichi Nuclear Disaster, two significant crises that have caused considerable anxiety and confusion among Japanese citizens. The proposed method is compared to existing methods and is verified to be more robust to noise.

Social media is an effective tool for transmitting and collecting information. In particular, during social phenomena that attract significant public attention, such as disasters, the number of relevant posts tends to increase rapidly, and associated information is actively exchanged [1]. To ensure appropriate information dissemination and adequate measures against hoaxes regarding such events, it is important to evaluate when and to what degree users on these platforms pay attention to such incidents. However, data collected from social media posts based on specific keywords may contain noise that is unrelated to the event or incident of interest. Likewise, such data may contain a large proportion of tweets from spam accounts. These issues make it difficult to properly follow transitions in the degree of public attention to a specific event (event popularity). In this paper, we propose a noise-robust event popularity quantification method that utilizes the characteristics of social media and the Web. In recent years, interpretability and explanation in data analysis techniques have been extensively researched [2]. In the method proposed in this paper, the leading information senders (influencers) corresponding to each period are identified by incorporating information on user interactions, which improves the interpretability of the results.
In this paper, we focus on COVID-19 and the Fukushima Daiichi Nuclear Disaster, two significant crises that have caused considerable anxiety and confusion among Japanese citizens. On 11th March, 2011, the Great East Japan Earthquake and the subsequent accident at the Fukushima Daiichi Nuclear Power Plant resulted in radioactive contamination and radiation exposure of the general populace [3]. Residents in the surrounding area have since been exposed to radiation over a long period, and the fear of spreading contamination has caused social unrest throughout Japan [4]. In December 2019, an outbreak of the novel coronavirus (COVID-19) [5] was reported in Wuhan, China. As of 19th March, 2020, the time of writing, the total number of confirmed cases had been reported to exceed 200,000 worldwide. While the infection took over three months to reach its first 100,000 confirmed cases, it took only 12 days to reach the next 100,000 [6]. As in the rest of the world, Japan has also experienced an outbreak of domestic infections [7].

In this section, we introduce existing research related to this paper from two perspectives: "Social Media Analysis for Crisis Situations" and "Event Popularity Quantification". We also remark on the relevance of this paper to these two topics.

Various forms of social media, including Twitter, are useful communication tools during crises [8], and several studies have analyzed the behavior of users on such platforms during various crises. Some research has explored the online behaviors of Twitter users and their interactions with each other during the Fukushima Daiichi nuclear disaster and the Great East Japan Earthquake [9][10]. However, even though such investigations are essential for ensuring appropriate information transmission and preventing false-rumor diffusion during crises, very few studies have analyzed transitions in the degree of attention paid to events based on the keywords used by users, especially over long periods following disasters. To address this shortcoming, we analyze tweet data related to the Fukushima Daiichi nuclear disaster and COVID-19, two representative crises that have caused significant anxiety and confusion among Japanese citizens.

Event popularity quantification is crucial to understanding burst detection and information diffusion on social media and search engines, and many studies have been conducted on this topic [11][12][13]. In particular, TF-SW (a semantic-aware popularity quantification model) [14] has been proposed as a noise-robust method that incorporates semantic information into event popularity analysis; our research builds on this method. TF-SW has also been applied in another method [15] that additionally considers emotional information via an emotion prediction model based on pictograms, and its performance has been evaluated. However, these text-based methods use only the information obtained from the text corpus and do not consider the relationships and interactions between speakers. For this reason, they sometimes fail to adequately eliminate noise. The method proposed in this paper, Net-TF-SW, utilizes network information composed of interactions between users, thereby improving both robustness to noise and interpretability.
In this study, we propose a new method, Net-TF-SW, which incorporates network information composed of user interactions into the existing TF-SW method. Fig. 2 depicts the algorithm flow of the proposed method.

In the base method, TF-SW, the period whose transition is to be followed is divided into sub-periods E_i (i = 1, ..., n), and the event popularity corresponding to each period is calculated from the information obtained from the text. The method outputs time-series data representing the transition of event popularity, called EPTs (Event Popularity Time Series). It consists of procedures A, B, C, and D, depicted on the right-hand side of Fig. 1.

First, expressions that can be judged to be irrelevant based only on their constituent words and phrases are excluded. The frequency distribution of words in a natural-language corpus is known to obey a power law according to Zipf's law [16] and can be expressed as f(r) = H · r^(-α). Here, r denotes the rank of each word when the appearance frequencies in the corpus are arranged in descending order, and H and α are parameters unique to the dataset, obtained via linear regression. A threshold on r is then derived by substituting the median rank r_{1/2} into this fitted distribution. Words whose cumulative appearance rate in the word frequency distribution is lower than this threshold are excluded from the dataset. However, if the distribution of the corpus is biased, it cannot be approximated by a power law, and the threshold cannot be calculated correctly. For this reason, we instead obtain the threshold that separates the original frequency distribution directly.

In the following step, for each word w_k, word embedding vectors [17] are generated: w_k^R from the original text and w_k^D from the Wikipedia corpus. The semantic similarity between two words is calculated from these vectors and denoted sem(w_j, w_k), where β is a weight parameter controlling the priority of the two embedding spaces. Further, str(w_j, w_k), the degree of matching of the character strings themselves, is calculated. The final inter-word similarity sim(w_j, w_k) combines the two, with γ as another weight parameter controlling the priority of sem(w_j, w_k) and str(w_j, w_k). In this study, we set β = 0.5 and γ = 0.5.

Based on the similarity sim(w_j, w_k) between the words w_j and w_k in the screened dataset, calculated in procedure B, we construct a network using words as nodes, calculate TextRank [18] for each word w_k, and denote it TR(w_k). TextRank is a graph-based ranking model for keyword and sentence extraction, based on Google's PageRank [19]. Based on TR(w_k) and the appearance frequency fre(w_k^i) of the word w_k in E_i, a score pop(w_k^i) is assigned to each word.

A network is then constructed from user interactions in order to incorporate network information into the TF-SW method, which utilizes only text information. When the tweet of a particular user is retweeted or responded to by another user, this is regarded as a connection between those two users during that period. We use a network comprising these connections as links and users as nodes. For each user u_n, a reliability value is calculated based on PageRank. N_i denotes the set of users included in the network, and minPR indicates the minimum PageRank value among those users. Incorporating the calculated reliability values into formulae (5) and (6) yields a method that considers the reliability of each speaker, where U_i denotes the set of all relevant users in the period E_i. In this way, the weights assigned to words in the tweets of users who respond to topics relevant to the subject of interest are increased, and the influence of tweets on unrelated topics is relatively reduced. This is expected to render the estimation of event popularity more robust to noise.

Finally, pop(w_k^i) is calculated for each word w_k^i in each period E_i, and these values are summed to obtain pop(E_i). Each pop(E_i) is then normalized by dividing it by the sum over the entire period E_1, ..., E_n. The EPTs is the resulting sequence <pop(E_1), ..., pop(E_n)>, and the degree of attention paid to an event or keyword during a period is considered to be proportional to this value.
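To make procedures A and B concrete, the following is a minimal Python sketch of the word screening and inter-word similarity steps. It assumes pre-computed embedding vectors for each word (one set trained on the raw tweet corpus, one on Wikipedia) supplied as dictionaries of numpy arrays; the cosine measure, the difflib-based character matching, and the simple cumulative-frequency cutoff are illustrative assumptions rather than the authors' exact implementation.

```python
# Illustrative sketch of TF-SW procedures A (screening) and B (similarity).
# Assumptions: embeddings are given as dicts of numpy vectors; character-string
# matching uses difflib; the paper's exact formulas may differ.
from collections import Counter
from difflib import SequenceMatcher
import numpy as np

def screen_words(tokens, keep_ratio=0.5):
    """Procedure A (simplified): keep the most frequent words that together
    account for keep_ratio of all occurrences, taken directly from the
    empirical frequency distribution (the fallback described for biased corpora)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    kept, cum = set(), 0
    for word, freq in counts.most_common():
        kept.add(word)
        cum += freq
        if cum / total >= keep_ratio:
            break
    return kept

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def sem(w_j, w_k, emb_raw, emb_wiki, beta=0.5):
    """Semantic similarity: a beta-weighted mix of the similarities measured in
    the corpus-trained and Wikipedia-trained embedding spaces (assumed form)."""
    return beta * cosine(emb_raw[w_j], emb_raw[w_k]) + \
           (1 - beta) * cosine(emb_wiki[w_j], emb_wiki[w_k])

def str_sim(w_j, w_k):
    """Character-string similarity (illustrative choice: difflib ratio)."""
    return SequenceMatcher(None, w_j, w_k).ratio()

def sim(w_j, w_k, emb_raw, emb_wiki, beta=0.5, gamma=0.5):
    """Final inter-word similarity: gamma-weighted combination of sem and str."""
    return gamma * sem(w_j, w_k, emb_raw, emb_wiki, beta) + \
           (1 - gamma) * str_sim(w_j, w_k)
```

With β = γ = 0.5 as in the paper, the two embedding spaces and the two similarity components contribute equally.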
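Procedure C can be sketched in the same spirit, using networkx's PageRank over a similarity-weighted word graph as a stand-in for TextRank. The per-period word score pop(w_k^i) = TR(w_k) · fre(w_k^i) used below is an assumed form, since the corresponding formula is not reproduced here.

```python
# Illustrative sketch of procedure C: TextRank-style scores over a word graph
# built from pairwise similarities, followed by a per-period word score.
# The product TR(w_k) * fre(w_k, E_i) is an assumption about the elided formula.
from collections import Counter
from itertools import combinations
import networkx as nx

def textrank_scores(words, sim_fn, threshold=0.3):
    """Build an undirected graph whose edges carry sim(w_j, w_k) above a
    threshold, then run PageRank on it (TextRank uses the same iteration)."""
    g = nx.Graph()
    g.add_nodes_from(words)
    for w_j, w_k in combinations(words, 2):
        s = sim_fn(w_j, w_k)
        if s > threshold:
            g.add_edge(w_j, w_k, weight=s)
    return nx.pagerank(g, weight="weight")

def word_popularity(period_tokens, tr):
    """Assumed per-period word score: TextRank value times frequency in E_i."""
    fre = Counter(period_tokens)
    return {w: tr.get(w, 0.0) * f for w, f in fre.items()}
```

Here sim_fn would wrap the sim() function from the previous sketch, and the similarity threshold is a hypothetical tuning parameter.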
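Finally, a hedged sketch of the Net-TF-SW network step and the EPTs aggregation: retweet/reply pairs are assumed to be given as (source, target) edges per period, user reliability is taken as PageRank rescaled by its minimum value minPR, and each user's reliability multiplies the scores of the words in that user's tweets. Because formulae (5) through (7) are not reproduced here, this particular combination is illustrative only.

```python
# Illustrative sketch of Net-TF-SW's user-network step and EPTs aggregation.
# Assumptions: interactions[i] lists (retweeting_user, original_user) pairs for
# period E_i, user_tokens[i] maps each user to their tokens in E_i, and tr holds
# TextRank scores from procedure C. Reliability weighting (PageRank / minPR
# multiplying each word score) is an assumed stand-in for formulae (5)-(7).
from collections import Counter
import networkx as nx

def user_reliability(edges):
    """PageRank over the user-interaction network, normalized by minPR so the
    least central user receives weight 1.0."""
    g = nx.DiGraph()
    g.add_edges_from(edges)
    pr = nx.pagerank(g) if g.number_of_nodes() else {}
    if not pr:
        return {}
    min_pr = min(pr.values())
    return {u: v / min_pr for u, v in pr.items()}

def period_popularity(user_tokens_i, tr, reliability):
    """pop(E_i): sum of reliability-weighted word scores over all users in U_i."""
    total = 0.0
    for user, tokens in user_tokens_i.items():
        w_user = reliability.get(user, 1.0)
        fre = Counter(tokens)
        total += w_user * sum(tr.get(w, 0.0) * f for w, f in fre.items())
    return total

def epts(interactions, user_tokens, tr):
    """Normalized Event Popularity Time Series <pop(E_1), ..., pop(E_n)>."""
    raw = [period_popularity(user_tokens[i], tr, user_reliability(interactions[i]))
           for i in range(len(user_tokens))]
    norm = sum(raw) or 1.0
    return [p / norm for p in raw]
```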
For the following experiments, we prepared two datasets related to the two crises in Japan under consideration: the Fukushima Dataset and the COVID-19 Dataset. The Fukushima Dataset used in this study was an 8% sampled dataset purchased from the NTT DATA Corporation. During preprocessing, only nouns and adjectives were extracted via morphological analysis, and Twitter-specific symbols such as "@user id:" and "RT" were removed.

To verify the noise robustness of the proposed method and to compare its performance with that of the existing method, we conducted the following experiments. First, we divided the data from the Fukushima dataset and the COVID-19 dataset into two classes: normal data and noise data. We then interspersed noise data into the normal data in stages and compared the values calculated by each model. Figs. 3 and 4 depict the comparisons with TF (term frequency) and TF-IDF, in addition to the existing method, and Table 1 summarizes the results.

We followed the transition of event popularity in the Fukushima dataset and the COVID-19 dataset by applying Net-TF-SW to them. The calculated EPTs are plotted in Figs. 5 and 6. A black circle marks each peak point, and the three users with the highest scores calculated using (7) are indicated by temporary user IDs. Figs. 7 and 8 depict popular words associated with the respective disasters, where the size of each character corresponds to the value calculated via (12). Further, for the top-ranked user IDs, the attributes read from their profiles and the characteristics of their remarks within each period are listed in Tables 3 and 4 (excerpted entries: @user E′, anonymous account; Ⅳ, @user K′, anonymous account; Ⅴ, @user M′, president of a country; Ⅵ, @user Q′, anonymous account).

Based on Figs. 5 and 6, the EPTs tend to exhibit sudden surges, although more gradual trends are also observed. These results can be interpreted as reflecting characteristic social media behavior, such as users reacting to related news or the online flaming of a particular user.
Based on the data presented in Tables 3 and 4, it was confirmed that a list of prospective influencers with respect to the Fukushima disaster and COVID-19 could be successfully compiled from the user attributes detected in their profiles and tweets. This indicates that the network constructed from the interactions between users, and the associated scoring based on (7), are functioning correctly. Interpretability is also improved, since high-frequency speakers and popular keywords can be identified.

If the method is sufficiently robust to noise, the values of the EPTs should remain largely unaffected by the injected noise. Based on Figs. 3 and 4, the EPTs calculated via the proposed method, Net-TF-SW, are more stable than those obtained via the existing method. This demonstrates that noise robustness is improved in our method.

Event popularity quantification is an important step in detecting the degree of public attention paid to an event. It is particularly useful for disseminating information and preventing hoaxes during disasters such as earthquakes. In this paper, we proposed Net-TF-SW, a method that incorporates network information comprising user interactions into an existing event popularity quantification method that uses only text information. The proposed method was applied to two datasets comprising tweets containing keywords related to the Fukushima nuclear accident and COVID-19. Its effectiveness was evaluated by verifying its improvement in noise robustness and interpretability compared to the existing method. Future topics of research include improving the definitions of the scoring functions and the network construction method, and applying them to other tasks such as burst detection and cross-media analysis.

References
[1] Twitter as an instrument for crisis response: The Typhoon Haiyan case study
[2] Peeking inside the black-box: A survey on Explainable Artificial Intelligence (XAI)
[3] UNSCEAR 2013 Report. Volume I: Report to the General Assembly, Annex A: Levels and effects of radiation exposure due to the nuclear accident after the 2011 great east-Japan earthquake and tsunami
[4] Epidemic of fear
[5] A new coronavirus associated with human respiratory disease in China
[6] Coronavirus disease 2019 (COVID-19) Situation Report - 59
[7] About Coronavirus Disease 2019 (COVID-19)
[8] Information sharing on Twitter during the 2011 catastrophic earthquake
[9] Regional analysis of user interactions on social media in times of disaster
[10] Twitter use in scientific communication revealed by visualization of information spreading by influencers within half a year after the Fukushima Daiichi nuclear power plant accident
[11] KeySee: Supporting keyword search on evolving events in social streams
[12] A model-free approach to infer the diffusion network from event cascade
[13] ESAP: A novel approach for cross-platform event dissemination trend analysis between social network and search engine
[14] DancingLines: An analytical scheme to depict cross-platform event popularity
[15] SENTI2POP: Sentiment-aware topic popularity prediction on social media
[16] Power laws, Pareto distributions and Zipf's law
[17] Distributed representations of words and phrases and their compositionality
[18] TextRank: Bringing order into text
[19] The anatomy of a large-scale hypertextual web search engine

This work was partially supported by JSPS KAKENHI Grant Number JP16H01836 and by the Research on the Health Effects of Radiation initiative organized by the Ministry of the Environment, Japan.