key: cord-0908322-27rr7oca
authors: Amen, Bakhtiar; Faiz, Syahirul; Do, Thanh-Toan
title: Big Data Directed Acyclic Graph Model for Real-time COVID-19 Twitter Stream Detection
date: 2021-10-26
journal: Pattern Recognit
DOI: 10.1016/j.patcog.2021.108404
sha: 7da20c3d520b5462f0d2df041cdc15aaf45d5b72
doc_id: 908322
cord_uid: 27rr7oca

Every day, large-scale data are continuously generated as streams on social media platforms such as Twitter, informing us about events around the world in real-time. Notably, Twitter has been one of the most effective platforms for updating country leaders and scientists during the coronavirus (COVID-19) pandemic. Other people have also used this platform to post their concerns about the spread of this virus and the rapid increase of death cases globally. The aim of this work is to detect anomalous events associated with COVID-19 from Twitter. To this end, we propose a distributed Directed Acyclic Graph topology framework to aggregate and process large-scale real-time tweets related to COVID-19. The core of our system is a novel lightweight algorithm that can automatically detect anomalous events. In addition, our system can also identify, cluster, and visualize important keywords in tweets. On 18 August 2020, our model detected the highest anomaly, since many tweets mentioned the casualties' updates and the debates on the pandemic that day. We obtained the three most commonly listed terms on Twitter: "covid", "death", and "Trump" (21566, 11779, and 4761 occurrences, respectively), with the highest TF-IDF scores for the terms "people" (0.63637), "school" (0.5921407), and "virus" (0.57385). From our clustering result, the words "death", "corona", and "case" are grouped into one cluster, whereas the words "pandemic", "school", and "president" are grouped into another cluster. These terms were located near each other in vector space, so they were clustered together, indicating the topics people were most concerned about on Twitter.

In December 2019, the global pandemic of COVID-19 hit the world, and people began to worry about the rapid spread of this virus. Researchers, especially from the computer science field, have started to propose many innovative solutions to this pandemic crisis and to preventing the spread of the virus [1, 2]. Some have used AI and machine learning methods to detect patterns from large-scale real-time video, image, and text data. For video-based cases, researchers have utilised computer vision to identify people's body temperature or to detect whether or not people wear masks [3].
As for the text-based cases, we contribute by detecting the virus outbreak and infection cases from event-stream tweets through a novel big data stream analytic method. Every day, over 500 million tweets are posted on Twitter [4], and from early 2020 in particular, people started to post information related to COVID-19 on Twitter. The platform has been utilised as a real-time communication medium between world leaders and their citizens, scientists, and healthcare organisations. According to [5], the amount of real-time tweet content associated with COVID-19 increased from the early weeks of the outbreak, when it reached about 1 million tweets per day; by April 2020, it had reached approximately 10.5 million tweets per day. This evidence shows that the use of social media, in particular Twitter, increased during the pandemic. For this purpose, we investigate the adoption of incremental real-time pattern detection from large-scale Twitter events. Hence, we define the keyword "death" as the event that could lead to an anomaly, and we investigate the source and cause of the anomaly in this research.

The aim of this paper is to detect anomalous events associated with COVID-19 from Twitter. We identify the following objectives for this research:
1. We build a distributed Directed Acyclic Graph topology model to aggregate large-scale real-time tweets related to COVID-19.
2. We propose a novel algorithm that uses a predictive statistical analysis technique (i.e., the "PESCAD" algorithm) to detect anomalous events.
3. We examine the frequency and the importance of keywords to figure out what people are thinking about on Twitter during this pandemic period.
We discuss the previous related works in Section 2 and break down the detailed methodology of our approach in Section 3. We illustrate the experiments with our algorithm in Section 4. Finally, we outline the discussion of our results and findings in Section 5.

In this research, we acknowledge the fundamental concept of anomaly or event detection from large-scale data and its applicability to real-world problems from [6]. The theoretical concepts of large-scale anomaly detection for both batch and streaming data, along with their constraints and limitations, were discussed in [7]. Meanwhile, the authors in [8] implemented collective anomaly detection on sensor data streams, where their algorithm's accuracy outperformed the Adaptive Stream Projected Outlier Detector (A-SPOT) algorithm. A survey of anomaly detection techniques and the various big data technologies that support them is given in [9], including the performance of anomaly detection algorithms such as Bayesian Networks, Neural Networks (NN), and Support Vector Machines (SVM). We can also learn about anomaly detection using the Isolation Forest algorithm from [10]. Meng Li et al. in [11] proposed a k-Nearest Neighbour algorithm implementation to detect anomalies using blockchain and sensor networks. According to [12], a big data framework (e.g., Apache Storm) has been improved and performs well in detecting large-scale abnormal events in real-time. In [13], Patel et al. introduced sentiment-based classification to detect anomalies on Twitter, whereas [14] discussed a sentiment analysis technique using a tree-learning algorithm on the Apache Storm framework. Toshniwal et al.
in [15] discussed the comprehensive Twitter monitoring utility function, and the throughput and performance of the aforementioned framework are evaluated in [16]. Meanwhile, in [17], we can observe an example implementation of Apache Storm for anomaly detection in a real-time network. In [18], Twitter was utilised to monitor incidents such as earthquakes, but without using any big data middleware. Gupta et al. in [19] discussed how to identify hybrid hashtags for Twitter classification with several machine learning classification algorithms (i.e., Naïve Bayes, k-Nearest Neighbour, and SVM), and [20] extends the experiment to the big data domain with Apache Storm. Both [21] and [22] demonstrated the Poisson distribution's implementation for detecting anomalies. However, in [22], the method is only used in a non-distributed environment to track log-in accounts. Meanwhile, Turcotte et al. in [23] implemented Poisson factorisation to find anomalies in user credentials on a corporate network. Keval et al. [24] also briefly discussed anomaly detection using the Poisson probability and machine learning, but again not in the distributed problem domain.

To support our understanding of the basic concept of Term Frequency-Inverse Document Frequency (TF-IDF), we acquire the fundamental concept from [25], as well as the method to extract keywords using semantic association in [26], and we learn about keyword relevance using TF-IDF in [27]. From [28], we also learn of a proposed technique for automatic dengue disease monitoring, based on analysing Twitter statuses alone to decide whether people are infected, including for controlling the spread of the dengue virus. According to [29], there are several ready-to-use libraries for machine learning, such as Deeplearning4j; we adopt this library since it provides the functionality we require (i.e., word embedding, clustering, and Principal Component Analysis). Meanwhile, the scalability of this library on GPU-based distributed computing is discussed in [30]. In [31], Doshi et al. demonstrated the implementation of the leaflet.js library for locating users' tweet coordinates on a world map and the chart.js library for the visualisation.

As we mentioned earlier, we acknowledge and are inspired by the research from [22], which discussed the Poisson distribution's implementation for detecting anomalies in a company network. That research explicitly applies point-based anomaly detection, whereas our method applies collective anomaly detection. Apart from that, their anomaly detection method is only used in a non-distributed environment to monitor log-in accounts, whereas ours runs on distributed computing. One interesting novelty of the previous research is how they introduce the elbow function as the anomaly detection threshold. The elbow function is one way to find the curvature at the optimum/minimum value of a function (the 'elbow point'). Therefore, we initially planned to design our PESCAD algorithm using the elbow function. For the computation of this threshold, the 'death' keyword from the last intervals would be counted, and a lower bound would be defined and initialised as the upper limit of the 'elbow function' calculations. The method would need to calculate a second-order central difference derivative with this 'elbow function', which represents the curvature of discrete data (compatible with the Poisson discrete random variable) [32]. Hence, it is expected to obtain a minimum value of the threshold using the elbow function.
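For illustration, the following is a minimal sketch (our reconstruction, not the cited authors' code) of the second-order central-difference curvature scan that such an elbow search requires; the interval-counts array and the maximum-curvature selection rule are assumptions made for the example:

```java
public final class ElbowThresholdSketch {
    /**
     * counts[i] = number of 'death' events observed in interval i.
     * The second-order central difference f(i-1) - 2*f(i) + f(i+1)
     * approximates the discrete curvature at interval i.
     */
    static int elbowIndex(double[] counts) {
        double maxCurvature = Double.NEGATIVE_INFINITY;
        int elbow = -1;
        // A full pass over the interval history is needed on every
        // evaluation, which is what makes this approach costly on a
        // continuous stream.
        for (int i = 1; i < counts.length - 1; i++) {
            double curvature = counts[i - 1] - 2 * counts[i] + counts[i + 1];
            if (curvature > maxCurvature) {
                maxCurvature = curvature;
                elbow = i;
            }
        }
        return elbow; // index of the 'elbow point' interval
    }

    public static void main(String[] args) {
        double[] counts = {1, 2, 2, 3, 9, 10, 11}; // toy interval counts
        System.out.println(elbowIndex(counts));    // prints 3 (the bend)
    }
}
```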
We did not adopt this method/function because iterating over the curvature array slowed the system down and consumed enormous computing resources (memory). The previous algorithm consists of two outer for-loops, so we analyse that it has an asymptotic order of magnitude (i.e., Θ-class) of Θ(n²). On the contrary, our proposed algorithm (Algorithm 1) has only one outer loop (implicitly, as incoming data streams are fed continuously from Twitter); therefore, it has an asymptotic order of magnitude of Θ(n). Hence, since Θ(n²) > Θ(n), we consider our algorithm more lightweight than the previous research.

In summary, the previous related works only focused on particular study cases of abnormal behaviour, or on overly specific types of anomaly detection. On the contrary, in this research, we point out our novelty and our research contribution by proposing a lightweight solution for anomaly detection in a real-time Twitter data stream, implementing the Directed Acyclic Graph model and the Poisson distribution.

The following is an overview of the methodology framework of our research. We state what tools/software we use to implement our theory in System Environment (subsection 3.6). In Data Collection (subsection 3.7), we discuss how we collect our data. After collecting data, we describe how we pre-process it in the Pre-processing Phase (subsection 3.8) and subsequently break down the parallel data processing into the TF-IDF Phase (subsection 3.9), the Clustering Phase (subsection 3.10), and the Anomaly Detection Phase (subsection 3.11). Overall, we illustrate our implementation elements, as explained above, in Fig. 1. After that, the cluster divides the topology instances according to the server configuration (either standalone or distributed). During runtime, the user receives a message containing the group of anomalous events in the idStatusList array from the system.

There are challenges/dimensions of big data known as 'the 4 V's of Big Data': Volume, Velocity, Variety, and Veracity. Meanwhile, there are two types of big data processing: batch-based processing and stream-based processing [33]. In batch-based processing, each data block is processed sequentially, one by one, over a period of time; this processing type mainly addresses the 'volume' challenge of big data, and the best-known framework for it is Apache Hadoop. On the other hand, stream-based processing is associated with the 'velocity' challenge, where real-time processing of fast-growing data is needed (such as the Twitter data stream). For stream processing, Apache Storm is the forefront framework, designed to answer the challenges of the velocity aspect. Beyond velocity, Apache Storm can also be deployed on larger organisational clusters (i.e., it is scalable) for online decision making. Therefore, we use Apache Storm, since it is scalable and can process a million tuples per second in real-time from Twitter [34].
We utilise the distributed Directed Acyclic Graph topology model in our system and implement it using Apache Storm, which consists of spout(s) and bolt(s) (Fig. 3). The spout is the source of the event stream (i.e., tuples), while the bolts process the data tuples related to our keywords, such as "death", "covid", and "corona".

To pass a tuple (data) from the spout to a bolt, or from one bolt to the next, we need to define and implement a type of grouping. Grouping is a fundamental concept in big data processing for large-scale, distributed real-time data analytics [15]. Global grouping is a grouping type where all the tuples go to one of the bolt's workers; this type is appropriate when we have to run a computational process over the tuple values, but its downside is network and memory overhead. Shuffle grouping is a grouping type where each worker in a bolt is guaranteed to receive the same amount of tuples; its advantage is load balancing and avoiding overhead, as workers are allocated to the process and the tuples are partitioned in parallel. Field grouping is a grouping type where the tuples are partitioned by an "id" field defined by the programmer/user; we implemented this grouping type in our research to send collective tuples with identical values to a given worker in a bolt. An example illustration of these grouping types is shown in the accompanying figure.

The Twitter stream we collect comprises an unbounded information sequence (events) known as tuples, where a tuple is a multi-field, key-value pair data structure [6]. For our three separate primary bolts, there are three event stream grouping equations. Eq. 1 illustrates the grouping formula in our PESCADBolt (i.e., Poisson Event Stream Collective Anomaly Detection Bolt), as the system performs field grouping on both id_status and is_event for each event. We assign is_event = 1 if a tweet contains a "death" keyword. Then we sum is_event and compare it to the predicted rate, which is calculated using the Poisson probability; if the sum of is_event is greater than the predicted rate, we mark all id_status as collective anomalies. As shown in Eq. 2, we also group word and topic in the TFIDFBolt, where we calculate the TF-IDF score of each keyword relative to all the tweets collected in our system. We then apply field grouping on id_status and word in the ClusteringBolt, as given in Eq. 3, which transforms each keyword into a vector with the word embedding technique and groups it into a graphical 2-D representation in its associated cluster.
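As a concrete illustration of the topology and grouping definitions above, the following is a minimal sketch of how such a Directed Acyclic Graph could be wired with the Apache Storm API. It uses Storm's built-in TestWordSpout in place of our Twitter spout, and the bolt logic shown is illustrative only, not our exact PreprocessingBolt/PESCADBolt:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class DagTopologySketch {

    // Illustrative bolt: tags each incoming word with is_event = 1 if it is "death".
    public static class EventTagBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String word = input.getString(0);
            int isEvent = "death".equals(word) ? 1 : 0;
            collector.emit(new Values(word, isEvent));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "is_event"));
        }
    }

    // Illustrative sink bolt: prints every tuple it receives.
    public static class PrintBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println(input.getValues());
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new TestWordSpout(), 1);
        // Shuffle grouping: load-balances tuples evenly across the bolt's workers.
        builder.setBolt("tag-bolt", new EventTagBolt(), 3).shuffleGrouping("spout");
        // Field grouping (cf. Eq. 1): tuples with identical field values land
        // on the same worker, enabling collective aggregation.
        builder.setBolt("sink-bolt", new PrintBolt(), 2)
               .fieldsGrouping("tag-bolt", new Fields("word", "is_event"));

        Config conf = new Config();
        conf.setNumWorkers(3);          // JVM worker processes
        conf.setNumAckers(1);           // tuple-acknowledgement threads
        conf.setMaxSpoutPending(8000);  // in-flight (unacked) spout tuples

        LocalCluster cluster = new LocalCluster();   // standalone mode
        cluster.submitTopology("covid-dag", conf, builder.createTopology());
        Thread.sleep(60_000);                        // run for one minute
        cluster.shutdown();              // a real cluster would use StormSubmitter
    }
}
```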
For aggregating the incoming rapid real-time tweets, we require a stream processing computation framework, Apache Storm (at least version 1.2.x) [16]. We also require Java JDK 11 and Ubuntu 18.04 LTS (for the latest package updates and the convenience of configuration). We need the Twitter4j library (at least version 4.0) configured with Twitter API keys to access several Twitter entities. To compile the project and manage the required dependencies, we use Maven (at least version 3.x). We launch our experiments by creating and configuring two nodes in this research: a nimbus node and a supervisor node. In the nimbus node, we configured Apache Storm.

We collect the data using the Twitter API, which allows us to retrieve essential object information (such as accounts, hashtags, and tweets), known as 'entities'. In our case, we use the 'status' and 'user' entities. From the status entity, we extract the following fields: 'created_at', 'geolocation', 'place', and 'status'. From the user entity, we obtain the 'location' and 'screen_name' fields. Since we need to detect the origin/source of an anomalous tweet, both geolocation and location are essential information; however, in reality, not all user accounts disclose these fields.

After acquiring a tuple from the spout ("start" mark in Fig. 1), we retrieve the hashtag and user-mention entities, tokenise the tweet sentence, and delete the stopwords in the PreprocessingBolt (orange rectangles). Then we perform lemmatisation to obtain standardised keywords that match real-world dictionary entries.

We compute the TF-IDF scores and return the keyword with the highest score; we define this keyword as the 'important' keyword of the tweet. To measure the Term Frequency (TF), we count how many times a keyword occurs in a tweet and divide by the total number of words in that tweet sentence (to obtain a normalised value). Let |{w ∈ T}| be the number of times the keyword w appears in tweet T, and let |T| be the total number of words in the tweet:

tf(w, T) = |{w ∈ T}| / |T|   (4)

For the Inverse Document Frequency (IDF), we take the logarithm of the total number of tweets in our tweet database divided by the number of tweets in which the specific keyword occurs. Let CT be the collection of tweets in our database; then the IDF formula is:

idf(w, CT) = log(|CT| / |{T ∈ CT : w ∈ T}|)   (5)

Therefore, the whole TF-IDF formula is:

tfidf(w, T, CT) = tf(w, T) × idf(w, CT)   (6)

The following is an example TF-IDF calculation. Given a tweet sentence T containing 200 words in which the word w = 'covid' appears 5 times, the term frequency (tf) for 'covid' is 5/200 = 0.025. Subsequently, assume we have 1000 tweets in the document collection CT and the word 'covid' appears in 10 of these 1000 tweets. The inverse document frequency (idf), using a base-10 logarithm, is log(1000/10) = 2. Thus, the tf-idf weight of the word 'covid' is the product of these two: 0.025 × 2 = 0.5.
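A minimal sketch of this calculation (our illustration, independent of the actual TFIDFBolt implementation, using the base-10 logarithm from the worked example):

```java
import java.util.List;

public class TfIdfSketch {
    // Eq. 4: tf(w, T) = |{w in T}| / |T|
    static double tf(List<String> tweet, String word) {
        long occurrences = tweet.stream().filter(word::equals).count();
        return (double) occurrences / tweet.size();
    }

    // Eq. 5: idf(w, CT) = log10(|CT| / |{T in CT : w in T}|)
    static double idf(List<List<String>> collection, String word) {
        long containing = collection.stream().filter(t -> t.contains(word)).count();
        return Math.log10((double) collection.size() / containing);
    }

    public static void main(String[] args) {
        // Reproduce the worked example's arithmetic directly:
        double tf = 5.0 / 200.0;                // 'covid' 5 times in a 200-word tweet
        double idf = Math.log10(1000.0 / 10.0); // 'covid' in 10 of 1000 tweets
        System.out.println(tf * idf);           // prints 0.5
    }
}
```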
In the ClusteringBolt (red/lower-right rectangles in Fig. 1), we use word embedding to represent each keyword as a vector of 100 dimensions. We group the keyword vectors using clustering (k-means clustering) and then project them with Principal Component Analysis (PCA) to plot the keyword clusters in a 2D visual image.

Word2vec is a neural network model with a hidden layer that transforms a word into a vector of real numbers (i.e., a word embedding). This vector represents a coordinate in a high-dimensional vector space, such that keywords with high similarity are located next to each other [35]. We chose this strategy to represent each keyword as a vector and map it in a visual representation. Our word embedding approach uses the Skip-gram algorithm [36], which employs a set of keywords extracted from the tweets (i.e., a corpus); the model then loops over the words and uses the current keyword to infer or predict its neighbours (i.e., its context).

In this research, we implement Word2Vec with the help of the deeplearning4j library [29]. With its built-in functions, the string collection (CollectionSentenceIterator) is tokenised (DefaultTokenizerFactory), and the model iterates through the tokens and deletes the stopwords. After this pre-processing step, the library collects the tokenised keywords and selects unique words, one by one, until they form a vocabulary of 2000 unique words. Each token is then supplied to the Word2Vec neural network (using Word2Vec.Builder()). In this project, the size of our vocabulary is 2000 words, and we use a windowSize parameter of 5. Meanwhile, we set the hidden layer size of our neural network (i.e., layerSize) to 100, i.e., every word in the vocabulary is represented by a 100-dimensional vector.
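A minimal sketch of this embedding step with the deeplearning4j builder named above; the toy two-tweet corpus and the minWordFrequency setting are our assumptions for illustration, not the paper's exact configuration:

```java
import java.util.List;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.CollectionSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class EmbeddingSketch {
    public static void main(String[] args) {
        // Pre-processed tweet sentences would be collected by the topology;
        // a two-element list stands in for them here.
        List<String> tweets = List.of(
                "covid death case corona pandemic",
                "school president pandemic covid");

        SentenceIterator iter = new CollectionSentenceIterator(tweets);
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();

        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(1)   // keep rare words in this toy corpus
                .layerSize(100)        // 100-dimensional word vectors
                .windowSize(5)         // Skip-gram context window
                .iterate(iter)
                .tokenizerFactory(tokenizer)
                .build();
        vec.fit();

        // Each keyword now maps to a 100-dimensional vector.
        double[] covid = vec.getWordVector("covid");
        System.out.println("dimensions: " + covid.length);
    }
}
```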
Once transformed into 100-dimensional vectors, we use the deeplearning4j library [29] to perform k-means clustering. K-means clustering is a type of unsupervised learning that clusters a finite set of n data instances with d-dimensional real vectors into k clusters by minimising the distances between the data instances and a set of cluster centres/centroids. The data instances, in this case, are the keywords, each consisting of a vector of real numbers. The distance metric we use in this k-means clustering is the Euclidean distance.

In this research, we use Principal Component Analysis (PCA) from the deeplearning4j library [29]. PCA is a dimensionality-reduction technique that projects the essential information from a higher dimension onto a smaller vector subspace while maximising the projected data variance. The word embeddings are projected to a two-dimensional space using PCA, which allows us to visualise the word clusters.

The rate of events (i.e., occurrences of the keyword "death") is computed from the actual events, and we forecast the predicted number of events, which we approximate using the Poisson distribution. If the number of actual events in an interval is greater than the predicted number of events, we mark them as a collective anomaly. Apache Storm adopts the scalable Directed Acyclic Graph topology design [37], and our PESCADBolt can scale according to the topology definition in our code, along with the number of nodes configured in our cluster (either standalone or distributed). Intuitively, with more nodes in the cluster, it can detect multiple anomalies at the same time.

In the previous related works, there are three forms of anomaly detection: point, collective, and contextual [38]. A point anomaly is a single data point that is anomalous compared with the rest of the data, e.g., when tracking a user's network intrusion. Meanwhile, a contextual anomaly is an abnormal occurrence in a particular/specific circumstance (context), such as an obscure network intrusion late at night. Lastly, collective anomaly detection finds a group of abnormal occurrences over a period of time. We therefore use collective anomaly detection, because we attempt to detect a collection of 'death' keywords/events in a specific time interval.

Given the average event rate (λ) and Euler's number (e ≈ 2.71828), we use the Poisson probability P(x, λ) as the statistical method to calculate the probability of x occurrences of the event over a specific period:

P(x, λ) = (λ^x · e^(−λ)) / x!   (7)

We obtain λ by dividing the total number of occurrences of events by the total number of tweets:

λ = sum_total_event / sum_total_tweet   (8)

We specify the time interval as one minute because, over the six weeks of our observation, this death event could occur at a monitoring interval of at least one minute.

This algorithm, PESCAD, is our highlight, since event detection is the motivation of our research (Algorithm 1). The main principle is to use the Poisson distribution to estimate the expected occurrences for the current interval and compare them to the actual occurrences to decide whether an anomaly has occurred. Firstly, we extract the id_status and is_event fields from the tuple and update sumTotalTweet. From the is_event counts, we count sumTotalEvent (also called "ActualOccurrences") and sum_current_interval_event (SCIE) independently. We then use two parameters (λ and sum_current_interval_event (SCIE)) to calculate the Poisson probability. After that, by multiplying the Poisson probability by sumTotalTweet, we calculate the PredictedOccurrences (PO).
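The following is a minimal, self-contained sketch of this detection rule as we understand it from the description above. The variable names mirror the text, but the update logic is simplified to a single rolling interval, so treat it as an illustration rather than the exact Algorithm 1:

```java
import java.util.ArrayList;
import java.util.List;

public class PescadSketch {
    private long sumTotalTweet = 0;  // all tweets seen so far
    private long sumTotalEvent = 0;  // all 'death' events so far (ActualOccurrences)
    private long scie = 0;           // sum_current_interval_event
    private final List<Long> idStatusList = new ArrayList<>(); // ids in current interval

    // Eq. 7: P(x, lambda) = lambda^x * e^(-lambda) / x!
    static double poisson(long x, double lambda) {
        double p = Math.exp(-lambda);
        for (long i = 1; i <= x; i++) {
            p *= lambda / i; // accumulate lambda^x / x! without overflow
        }
        return p;
    }

    // Called once per incoming tuple (id_status, is_event).
    void onTuple(long idStatus, int isEvent) {
        sumTotalTweet++;
        sumTotalEvent += isEvent;
        if (isEvent == 1) {
            scie++;
            idStatusList.add(idStatus);
        }
    }

    // Called at the end of each one-minute interval; returns the anomalous ids.
    List<Long> closeInterval() {
        double lambda = (double) sumTotalEvent / sumTotalTweet;           // Eq. 8
        double predictedOccurrences = poisson(scie, lambda) * sumTotalTweet;
        List<Long> anomalies = new ArrayList<>();
        if (scie > predictedOccurrences) {
            // Mark the whole group of this interval's events as a collective anomaly.
            anomalies.addAll(idStatusList);
        }
        scie = 0;
        idStatusList.clear();
        return anomalies;
    }
}
```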
After designing our methodology and implementing our algorithm, we then tested our system. Since we undertake research related to outbreak or anomaly detection monitoring, there is a hypothesis and primary aim that we need to test: our system should identify abnormal events/abnormal rates occurring in a specific interval, which lead to an incident, and the system should be able to detect the incident's source at the same time. Section 4 discusses and analyses only a subset of our findings, from our monitoring on 14 August 2020; we discuss the complete results of our research and monitoring during 1-30 August 2020 in Section 5. We have also designed a web app¹ with interactive visualisation charts and a map chart for the convenience of analysing the anomalies, using chart.js² and leaflet.js³.
¹ The homepage of the GitHub project: PESCAD Storm.
² The homepage of the library: https://www.chartjs.org/ [39].
³ The homepage of the library: https://leafletjs.com/ [40].

We analysed two essential aspects, the Apache Storm parameters and the topology components of our framework, and observed whether these two aspects affect how many tweets we obtain during a specific monitoring time. Table 1 covers the Apache Storm parameters. The number of workers denotes how many worker instances (Java Virtual Machine processes) Storm creates for the topology. The number of ackers is the number of threads for processing tuple acknowledgements. The maximum task parallelism defines the maximum number of threads generated for the spout and bolts. The maximum spout pending specifies how many data tuples emitted from the spout may remain pending (i.e., not yet acknowledged by the ackers). We performed five monitoring time durations (i.e., 1, 5, 10, 20, or 60 minutes) with the above parameters, using similar tuning values to those demonstrated in [42]. For example, when we set the number of workers to 3 and monitored for 1 minute, we obtained 3254 tweets. The same applies when the number of workers equals 6, 9, and 12 (also in 1 minute): we obtained fairly similar amounts, 3045, 3629, and 3390 tweets, respectively. When we experimented with the other three parameters, we kept obtaining about 3000 tweets in 1 minute regardless of the values we assigned. However, we collected significantly more tweets than in the previous duration whenever we increased the monitoring duration (i.e., from 5 to 60 minutes); for instance, when we set the maximum spout pending to 8000, we collected 2988 tweets in one minute, but considerably more over the longer durations.

Meanwhile, Table 2 shows how the topology components affect the number of tweets collected. We used 60 minutes of monitoring time as our baseline, based on the previous Table 1 experiment. In this experiment, we designed a simple topology with three components: spout, bolt1, and bolt2. A number of spouts collect the tweets, and a number of bolts process them: bolt1 receives the tweets from the spout and then forwards them to bolt2. Therefore, we tuned only the spout and bolt1. We set bolt2 = 1 because this bolt's task is to accumulate the number of tweets collected; if we set bolt2 to 2 or 3, the instances would work independently, so we would not be able to record the total number of tweets and would have to add them manually. As shown in Table 2, when we set spout = 3, bolt1 = 2, and bolt2 = 1, we collected 141545 tweets. Likewise, when we increased both the spout and bolt parallelism, we obtained fluctuating amounts of tweets ranging from 142899 to 148695. We can conclude that we kept obtaining about 140000 tweets during 60 minutes of monitoring regardless of how many topology components we added. However, we will show in Section 5 that if we add more computer nodes (in distributed mode), we obtain more tweets.

If we execute the system in standalone mode (particularly with a suitable Integrated Development Environment (IDE) that displays output consoles), we can see our PESCAD algorithm in detail. Fig. 5 is an excerpt of the example console messages during our standalone test run. As we can see, we observe the following variables:
1. How many total tweets we captured during the whole monitoring period (sumTotalTweet).
2. The average rate of events during the whole monitoring period (lambda).
3. The sum of all events detected during the whole monitoring period (ActualOccurrences).

Apart from that, we also show the list of anomalous tweets in our system, as shown in Fig. 7. If we click one of the listed tweets, we are redirected automatically to the source tweet page (Fig. 9). Alternatively, as described in the following subsection, we can also locate the anomalous tweet's source using the world map. When we click on the located pin and its popup on the map, we are redirected to the exact web page of the anomalous tweet, which, for example, explains that the casualties update on 14 August 2020 reached 46,707 in the United Kingdom (Fig. 9: the source of the tweet status). This demonstrates that the collective anomaly was accurately detected by our system on 14 August 2020 at 21:41, because three 'death' events in tweets associated with COVID-19 occurred, and our system correctly determined and pinpointed the source of the incident (i.e., it accomplished the second objective of our project).

As opposed to Section 4, which only discusses a specific finding on 14 August 2020, Section 5 discusses our findings accumulated over the monitoring during 1-30 August 2020. We conducted 15 tests comparing single versus dual machines. The highest anomaly was detected on 18 August 2020 (134 anomalies from the total of 136 events). When we analyse the tweet sources on that date, multiple tweets are associated with the COVID-19 casualties update, and most of the tweets contain debates about whether comorbidity causes death. This caused a spike in the event rate (i.e., many tweets contained the keyword "death"), and subsequently our PESCAD algorithm successfully detected the anomaly (i.e., it accomplished the second objective of the project).

Fig. 12 shows the most mentioned words on Twitter during our monitoring, e.g., "covid" (21566 occurrences), "death" (11779 occurrences), and "trump" (4761 occurrences). Meanwhile, Fig. 13 shows the words with the highest TF-IDF scores, e.g., "people" (0.63637), "school" (0.5921407), and "virus" (0.57385). When we observed Twitter in August 2020, we found that the United States was having trouble with COVID-19, leading people to direct their appeals to their president at that time (Trump). In Fig. 14, we can find the keywords "school", "pandemic", and "president" grouped into one cluster, while the keywords "death", "case", and "corona" are in another, separate cluster. We can summarise the relationship between Fig. 12, Fig. 13, and Fig. 14 as follows: Fig. 12 illustrates the keyword occurrences (term frequency); in Fig. 13, we use that information to weight each keyword's importance with TF-IDF; and Fig. 14 groups the related keywords into clusters.

In conclusion, we have obtained more tweets from distributed computing than from a single machine over 1 to 30 August 2020. On 18 August 2020, we received the highest number of anomalous tweets, discussing the pandemic casualties' updates and the COVID-19 debates.
During our monitoring time, we obtained the three most frequently appearing words on Twitter: "covid", "death", and "Trump", which illustrate the most frequently mentioned keywords. Meanwhile, the keywords "people", "school", and "virus" have the highest TF-IDF scores and reflect the most important keywords on Twitter. The keywords "death", "corona", and "case" are grouped in one cluster, whereas "pandemic", "school", and "president" are grouped in another, different cluster. These results indicate what people were concerned about during this pandemic period. We have shown that our distributed Directed Acyclic Graph model framework collected more tweets than the standalone machine. Our system also successfully detected anomalous events from Twitter, with their source locations and accounts, in real-time.

The weakness of our work is that we would require extra testing hours to gain more precise insight into our conclusions, which may then differ from our current results. We also developed our project with minimal resources; hence, we hope larger organisations (e.g., public health organisations) can benefit from this work by adopting our concept on larger computing clusters. The strength of our work is that we have successfully designed the Directed Acyclic Graph model combined with the Poisson distribution to detect anomalies (i.e., the PESCAD algorithm), so that others can benefit from the idea of the algorithm in their research. Also, although we built our research with small computer nodes, we achieved our research's objectives. For future work, we plan to use more computing resources to improve our performance. We also plan to implement our algorithm and apply our method in other event detection scenarios, such as disaster monitoring for earthquakes, fires, or storms.
References
Review of Big Data Analytics, Artificial Intelligence and Nature-Inspired Computing Models towards Accurate Detection of COVID-19 Pandemic Cases and Contact Tracing
Significant Applications of Big Data in COVID-19 Pandemic
A Survey on how computer vision can response to urgent need to contribute in COVID-19 pandemics
Proceedings of the Recommender Systems Challenge 2020, RecSysChallenge '20, Association for Computing Machinery
GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information
Distributed Contextual Anomaly Event Stream Detection Using Directed Acyclic Graph Model
A Theoretical Study of Anomaly Detection
Collective Anomaly Detection Using Big Data Distributed Stream Analytics
Real-time big data processing for anomaly detection: a survey
An Anomaly Detection Approach Based on Isolation Forest Algorithm for Streaming Data Using Sliding Window
Blockchain-based anomaly detection of electricity consumption in smart grids
Sketch of Big Data Real-Time Analytics Model
Real-Time Sentiment-Based Anomaly Detection in Twitter Data Streams
Distributed real-time sentiment analysis for big data social streams
Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD '14
Scalable distributed event detection for Twitter
Real-time network anomaly detection system using machine learning
Earthquake Shakes Twitter Users: Real-Time Event Detection by Social Sensors
Harnessing the Power of Hashtags in Tweet Analytics
Unleashing the Power of Hashtags in Tweet Analytics with Distributed Framework on Apache Storm
Statistical techniques in business & economics
Poisson-based anomaly detection for identifying malicious user behaviour
Programmable Networking
Poisson factorization for peer-based anomaly detection
Online anomaly detection in surveillance videos with asymptotic bound on false alarm rate
Introduction to Information Retrieval
Text Features Extraction based on TF-IDF Associating Semantic
Text Mining: Use of TF-IDF to Examine the Relevance of Words to Documents
Recurrent Neural Networks With TF-IDF Embedding Technique for Detection and Classification in Tweets of Dengue Disease
A State-of-the-Art Survey on Deep Learning Theory and Architectures
Scaling Word2Vec on Big Corpus
2017 International Conference on Computing, Communication, Control and Automation (ICCUBEA)
Knee/Elbow Estimation Based on First Derivative Threshold
Big Data Processing: Batch-based processing and stream-based processing, Conference On Intelligent Computing in Data Sciences (ICDS)
Research in Intelligent and Computing in Engineering
Efficient Estimation of Word Representations in Vector Space, Workshop Track Proceedings
Distributed Representations of Words and Phrases and their Compositionality
Sigma: a Scalable High Performance Big Data, 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)
Open source HTML5 Charts for your website
Leaflet - a JavaScript library for interactive maps
A Survey on Automatic Parameter Tuning for Big Data Processing Systems
Towards Automatic Parameter Tuning of Stream Processing Systems

Acknowledgements
The research was undertaken by Syahirul Faiz, sponsored by the Indonesia
Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.