key: cord-032684-muh5rwla authors: Madichetty, Sreenivasulu; M., Sridevi title: A stacked convolutional neural network for detecting the resource tweets during a disaster date: 2020-09-25 journal: Multimed Tools Appl DOI: 10.1007/s11042-020-09873-8 sha: doc_id: 32684 cord_uid: muh5rwla

Social media platforms like Twitter are among the primary sources for sharing real-time information during events such as disasters, political events, etc. Detecting resource tweets during a disaster is an essential task because tweets carry different types of information, such as infrastructure damage, resources, and opinions and sympathies about the event. Tweets related to the Need and Availability of Resources (NAR) are posted by humanitarian organizations and victims. Hence, reliable methodologies are required for detecting NAR tweets during a disaster. Existing works do not focus well on NAR tweet detection and also show poor performance; this paper therefore focuses on the detection of NAR tweets during a disaster. Existing works often rely on hand-crafted features with appropriate machine learning algorithms for several Natural Language Processing (NLP) tasks. Recently, Convolutional Neural Networks (CNN) have been widely used in text classification problems. However, they require a large amount of manually labeled data, and no such large labeled dataset is available for NAR tweets during a disaster. To overcome this problem, stacking of a Convolutional Neural Network with traditional feature-based classifiers is proposed for detecting NAR tweets. In our approach, several informative features such as aid, need, food, packets, earthquake, etc. are proposed and used in the classifier alongside the CNN. The learned features (the outputs of the CNN and of the classifier with the informative features) are fed to another classifier (meta-classifier) for the detection of NAR tweets. Classifiers such as SVM, KNN, Decision tree and Naive Bayes are explored in the proposed model. From the experiments, we found that using KNN (base classifier) and SVM (meta-classifier) in combination with the CNN in the proposed model outperforms the other combinations. This paper uses the 2015 Nepal and 2016 Italy earthquake datasets for experimentation. The experimental results show that the proposed model achieves the best accuracy compared to baseline methods.

Micro-blogging [10, 14, 36, 40] sites like Twitter, Facebook, Instagram, etc. are helpful for collecting situational information [13] during a disaster such as an earthquake, flood or disease outbreak [25]. During these events, only a small fraction of the posted tweets are relevant to specific classes such as infrastructure damage, resources [6, 33], service requests [24], etc.; spam tweets, communal tweets and emotional content are also posted [8, 16, 17, 19, 31, 38]. Therefore, powerful methodologies are required for detecting specific-class tweets (such as Need and Availability of resources), so that relevant tweets can be automatically identified from the large set of tweets. The detection of specific-class tweets [1, 11, 21, 35] has received much attention in the last two years, and it is likely to become even more important for social media in the next few years. In particular, detecting the two types of tweets containing information related to the Need and Availability of resources is a challenging task. During a disaster, victims post tweets describing where essential resources such as food, water, medical aid, shelter, etc. are needed.
Similarly, humanitarian organizations post tweets describing where specific resources such as medical resources, food, water packets, etc. are available in the affected area. Examples of Need and Availability of Resource tweets are shown in Table 1. The first four tweets represent the need for resources such as mobile hospitals, password-free Wi-Fi, blood and ambulances. The next four tweets reflect the availability of resources, such as the Italian Army providing services to earthquake victims, and the availability of shelter tents, money and ambulances. Detecting Need and Availability of Resource tweets is therefore very beneficial for both humanitarian organizations and victims during a disaster. The main objective of this work is to assist victims and humanitarian organizations in the event of a disaster by designing a method for the automatic identification of Need and Availability of Resource (NAR) tweets from Twitter. The problem of detecting NAR tweets can be treated as a multi-class classification problem with the classes (i) need of resources, (ii) availability of resources and (iii) none of the two.

Only a few existing works [1, 3, 11] have focused on extracting the need and availability of resource tweets during a disaster. Among them, most works used information-retrieval methodologies such as word2vec, a combination of word embeddings and character embeddings, etc. Specifically, the authors in [3] used both information-retrieval methodologies and classification methodologies (CNN with crisis word embeddings) to extract Need and Availability of Resource tweets during a disaster. The main drawback of CNN with crisis embeddings is that it does not work well when the number of training tweets is small, and in the case of information-retrieval methodologies, keywords must be given manually to identify the need and availability of resource tweets. To overcome the above-mentioned issues, a novel method based on the stacking mechanism [44] is proposed to identify NAR tweets during a disaster. The stacking mechanism uses two levels of classifiers: the first level uses multiple classifiers whose outputs are used as input to the second level, and the second level uses a single classifier. The stacking method does not produce improved results if the models used in it are stable (not diverse). Therefore, different models, namely a CNN and a KNN classifier with domain-specific features, are used in this work. The CNN captures the semantic similarity between words, even when the vocabulary differs in the testing phase. To overcome the problem of a small number of training tweets, new features are proposed and used in the KNN classifier to detect NAR tweets. The two models (the CNN and the KNN classifier with the proposed features) capture different aspects of the tweets. The outputs of these two models are given as input to the SVM (second-level) classifier, which is trained to learn the relationship between the outputs of the CNN and KNN models. It gives the final prediction of whether a tweet expresses a resource need, a resource availability or neither. The efficacy of the final prediction depends on the classifiers used in level-1 and level-2.
The reasons for selecting the KNN and SVM classifiers as the first- and second-level classifiers are explained in Sections 4.4.2 and 4.5.2. The main contributions are summarized as follows.

This paper is organized as follows. The second section examines the related work. The proposed approach for the detection of NAR tweets during a disaster is described in the third section. Experimental results and analysis are discussed in the fourth section. The last section concludes the paper.

Many studies [22, 28, 32, 41] have focused on the detection of tweets related to a disaster. Preliminary work [41] focused mainly on extracting features such as unigram and bigram frequency, Parts-Of-Speech (POS) tags, objective or subjective, personal or impersonal and formal or informal cues from tweets, and used classifiers to classify the tweets based on their relevancy. Classifiers such as Naive Bayes and maximum entropy were used for detecting situational tweets related to a disaster; the authors noted that their approach depends on the vocabulary of a specific event. In [32], the authors investigated and developed an application for detecting earthquakes based on features such as context words, keyword position, content words and the length of the tweets; it is applicable only to Japanese tweets. To overcome the problem of domain dependence, the authors in [28] proposed a novel framework for classifying situational and non-situational information based on low-level lexical and syntactic features. After classification, the tweets are summarized based on content words, and the authors concluded that the framework works across domains (domain-independent). However, all of these methods focus only on situational tweets related to a disaster and fail to address specific-class tweets. In recent years, more researchers have focused on the detection of user-defined class tweets during a disaster. Several studies, for instance [2, 11, 21, 29], have addressed different specific classes. The authors in [21] suggest that a decision tree with context and content features gives the best recall and F1-measure among classifiers such as SVM, AdaBoost and random forest; however, it does not focus on NAR tweets. In recent literature, the authors in [35] developed a method that extracts features based on the maximum frequency of words in tweets to detect resource tweets during a disaster. Resources here include both the availability and the need of resources, so that work does not focus solely on the availability and need of resource tweets. The authors in [9] designed the Artificial Intelligence Disaster Response (AIDR) system for classifying disaster-related tweets into user-defined categories. In AIDR, unigram and bigram features are used for detecting tweets belonging to the user-defined categories, and these features can be applied to any user-defined classes during a disaster. In [2], the authors manually analyzed WhatsApp messages for the requirement of medical, human and infrastructural resources during a disaster, considering the 2015 Nepal earthquake as a case study; however, they did not propose an automatic method for identifying the resources. In [11], the authors found that neural network retrieval models that integrate character-level and word-level embeddings with pattern recognition techniques perform better than state-of-the-art models.
The authors applied information-retrieval techniques for detecting NAR tweets. In [7], the authors used a novel vector training approach for clustering tweets about emergency situations and compared their method with Bag-Of-Words (BOW), word2vec-sum and doc2vec. They described that clustering tweets can further help identify the different aspects of a topic in emergency situations; however, they did not propose a method for identifying NAR tweets during a disaster.

The problem can be defined as follows: given N tweets X = {x_1, x_2, x_3, ..., x_N}, identify the tweets that belong to the three classes (1) need of a resource, (2) availability of a resource and (3) none of the above.

This section describes the stacked convolutional neural network for identifying NAR tweets during a crisis. The overview of the proposed stacked convolutional neural network is shown in Fig. 1. The stacking mechanism [44] combines the predictions of diverse classifiers by learning the relationship between the models. Different classifiers make different prediction errors on the data: some classifiers mispredict a sample while others predict the same sample correctly. Stacking increases the generalization ability of the model and reduces the misclassification rate, bias and variance. Stacking-based classifiers give higher performance than individual classifier models due to this generalization ability [42]. However, most resource-detection systems focus on individual classifier models rather than ensemble methods (combinations of diverse classifiers). In this work, a stacked convolutional neural network is proposed for detecting resource tweets from social media during a disaster. It consists of two levels of classifiers. In the first level, the Convolutional Neural Network and the KNN classifier are used and referred to as base-level classifiers. The SVM classifier is used as the meta-level classifier in the second level. Before the tweets are given as inputs to the base-level classifiers, the following pre-processing steps are performed:
-All tweets are converted to lower case to avoid multiple copies of the same words.
-The tweets are split into words, referred to as tokens.
-User mentions (@user), hash-tags (#) and URLs are removed from the tweets.
-Similarly, stop-words, numbers and unknown symbols are removed from the tweets.
For each tweet, two types of feature representation are generated using the following techniques. First, we use pre-trained crisis word embeddings to represent each word in a tweet as a 300-dimensional vector. These embeddings are trained on 52 million crisis-related tweets collected during 19 crisis events using the word2vec tool, with the Continuous Bag-Of-Words (CBOW) architecture and negative sampling. Second, the χ² statistic feature selection algorithm [45] is used to extract the most informative words from the tweets, because it has been shown to be one of the most efficient feature selection algorithms for text categorization. The SVM classifier is used together with the χ² statistic feature selection algorithm because the authors in [20] concluded that SVM with χ² statistic feature selection performs better than other traditional methods. The extracted domain-specific features are shown in Table 2; the first, second and third columns of the table give the serial number, the feature and the information category, respectively.
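To make the pre-processing and feature-selection steps concrete, the following is a minimal Python sketch using NLTK and scikit-learn. It assumes unigram counts, an English stop-word list and a simple regular-expression tokenizer; the function names and the value of k are illustrative and not taken from the original implementation.

```python
import re
from nltk.corpus import stopwords               # requires nltk.download("stopwords")
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

STOP_WORDS = set(stopwords.words("english"))     # assumed English stop-word list

def preprocess(tweet):
    """Lower-case, strip mentions/hashtags/URLs, drop stop-words and non-alphabetic tokens."""
    tweet = tweet.lower()
    tweet = re.sub(r"(@\w+|#\w+|https?://\S+)", " ", tweet)   # user mentions, hash-tags, URLs
    tokens = re.findall(r"[a-z]+", tweet)                      # keep alphabetic tokens only
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def top_informative_words(tweets, labels, k=100):
    """Rank unigrams by the chi-squared statistic against the NAR class labels."""
    cleaned = [preprocess(t) for t in tweets]
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(cleaned)
    scores, _ = chi2(counts, labels)
    vocab = vectorizer.get_feature_names_out()
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:k]]       # candidate domain-specific features (cf. Table 2)
```

In this sketch the highest-scoring words would serve as candidate domain-specific features, to be curated before use in the KNN base classifier.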
The above two methods provide two feature-vector representations for each tweet, which are given as input to the base-level classifiers, namely the CNN and KNN classifiers.

CNN is well suited to eliciting local and deep features from natural language. The authors of [12] have shown that CNN achieves strong results in sentence classification. The authors in [34] extended a convolutional-recursive deep model for 3D object classification that combines Convolutional and Recursive Neural Networks (RNN): the CNN layer discovers low-level, translation-invariant features that are fed into multiple fixed-tree RNNs to form higher-order features. In [27], the authors showed that CNN outperforms many traditional methods in biomedical text classification.

Embedding layer: this is the first layer of the CNN. It takes a fixed number of words of a tweet as input and converts each word into the corresponding 300-dimensional crisis word vector. The resulting tweet representation is passed through a series of convolution and pooling operations to learn high-level features. In the convolution layer, a new feature F_j is generated by applying a convolution kernel U ∈ R^{g×d} to a window of g words (the filter size), as shown in (1):

F_j = f(U · x_{j:j+g-1} + b),    (1)

where x_{j:j+g-1} is the concatenation of the input vectors (x_j, x_{j+1}, ..., x_{j+g-1}), b is a bias term and f is a non-linear activation function such as the sigmoid or tanh. The filter is applied to every window of g words to obtain the feature map F ∈ R^{n-g+1}, where n is the number of words in the tweet, as shown in (2):

F = [F_1, F_2, ..., F_{n-g+1}].    (2)

Different values of g (3, 4 and 5) are used to capture different n-gram features from the tweet. This process is repeated with 100 filters per filter size to produce 100 feature maps that learn complementary features. After the feature maps are obtained, max pooling is applied to each feature map, as shown in (3):

F̂_i = μ_q(F_i),    (3)

where μ_q(F_i) denotes the max-pooling operation [4] applied to each window of q features in the feature map F_i. Max pooling reduces the output dimension while keeping the most important features of each feature map. After the max-pooling operation, the feature vectors generated by the convolution layers with filter sizes 3, 4 and 5 are concatenated into a single block. A dense layer is used on top of the pooling layer to combine the features generated by pooling, as shown in (4):

z = e(W · h + b_e),    (4)

where W is a weight matrix, b_e is a bias vector, e is a non-linear activation function and h is the concatenated pooled feature vector. The dense layer maps its (possibly variable-length) input to a fixed-size output z, which is given as input to classification. The output layer defines a probability distribution using the softmax function; the probability of output label t is given by (5):

P(y = t | z) = softmax(W_t · z),    (5)

where W_t denotes the weights associated with class t in the output layer.
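As an illustration of this architecture, the following is a minimal Keras sketch of the CNN base classifier, assuming a pre-computed crisis-embedding matrix and a fixed maximum tweet length; the function name, the sequence length and the tanh activation are assumptions rather than details taken from the paper.

```python
from tensorflow.keras import layers, models

def build_cnn(embedding_matrix, max_len=30, num_classes=3):
    """CNN base classifier: crisis word embeddings -> parallel conv/max-pool -> softmax."""
    vocab_size, emb_dim = embedding_matrix.shape          # emb_dim = 300 for crisis embeddings
    inputs = layers.Input(shape=(max_len,), dtype="int32")
    x = layers.Embedding(vocab_size, emb_dim,
                         weights=[embedding_matrix],
                         trainable=False)(inputs)
    pooled = []
    for g in (3, 4, 5):                                    # filter sizes, 100 feature maps each
        conv = layers.Conv1D(filters=100, kernel_size=g, activation="tanh")(x)
        pooled.append(layers.GlobalMaxPooling1D()(conv))   # max pooling over each feature map
    merged = layers.Concatenate()(pooled)                  # concatenate the pooled feature blocks
    merged = layers.Dropout(0.5)(merged)                   # dropout of 0.5 to reduce over-fitting
    outputs = layers.Dense(num_classes, activation="softmax")(merged)  # 3-class softmax output
    return models.Model(inputs, outputs)
```

The model returns a three-dimensional probability vector per tweet, which is the quantity later passed to the meta-level classifier.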
We adopt the K-Nearest Neighbour (KNN) classifier as a base-level classifier in the proposed model to provide a feature vector of each tweet to the meta-level (second-level) classifier. It is chosen as a first-level classifier because it gives better performance than the other classifiers considered (Decision tree, Naive Bayes); a detailed explanation is given in Sections 4.4 and 4.5.2. It accepts the domain-specific features such as aid, need, etc. as the input feature vector of a tweet. The KNN classifier scores the neighbours of a tweet among the training tweets and uses the class labels of the k most similar neighbours to predict the probability vector of the tweet. We use the Euclidean distance E(Tw, Tw¹) to measure the similarity between tweets Tw and Tw¹, as shown in (6):

E(Tw, Tw¹) = sqrt( Σ_{i=1}^{N} (Tw_i − Tw¹_i)² ),    (6)

where N is the dimension of the tweet vectors Tw and Tw¹. The classes of these neighbours are weighted by the similarity of each neighbour to the test tweet Tw, as shown in (7):

score(Tw, C_i) = Σ_{Tw_j ∈ KNN(Tw)} sim(Tw, Tw_j) · δ(Tw_j, C_i),    (7)

where KNN(Tw) denotes the set of k nearest neighbours of tweet Tw, δ(Tw_j, C_i) represents the probability of Tw_j with respect to class C_i, and i = 1, ..., 3 because there are three classes: need of resources, availability of resources and none of the two. Finally, the KNN classifier produces a three-dimensional probability vector for each tweet in the testing data. Results indicate that the KNN classifier plays a significant role in the proposed model for detecting NAR tweets.

In this work, we adopt the SVM classifier [39], one of the traditional machine learning algorithms, as the meta-level classifier; it gives better performance than the other classifiers considered (Decision tree, Naive Bayes), and a detailed explanation is given in Sections 4.4 and 4.5.2. It accepts the concatenation of the predicted outputs of the CNN and KNN classifiers as input features, so the input vector is six-dimensional. We use the Radial Basis Function (RBF) kernel in the SVM classifier to transform the data into a higher-dimensional feature space. Given a set of testing tweets, the base-level classifiers produce six-dimensional output vectors that are sent as input features to the meta-level classifier (the SVM). The output of the SVM (second-level classifier) is the final prediction for the tweet. The learned model is then used to detect NAR tweets during a disaster. The main advantage of the proposed stacked convolutional neural network for detecting NAR tweets during a disaster is that it works effectively even for small datasets, due to the use of domain-specific features, and it copes with different vocabularies in the training and testing tweets through the CNN model.

The proposed method is summarized in Algorithm 1 (CNN and KNN with the proposed features). The class labels are: 1 for a tweet related to the availability of resources, 2 for a tweet related to the need of resources and 0 for a tweet related to neither. The first step of the algorithm pre-processes the tweets by removing stop-words, numbers and unknown symbols and converting the text to lower case.
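A minimal sketch of this two-level stacking scheme with scikit-learn and the CNN defined above is given below; the parameter names, the value of k, the validation split and the early-stopping patience are illustrative assumptions. For simplicity the meta-classifier is fitted on the base-level predictions over the training data, whereas a more careful implementation would use held-out or out-of-fold base-level predictions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from tensorflow.keras.callbacks import EarlyStopping

def train_stack(X_feat, X_seq, y, embedding_matrix, k=5):
    """Two-level stack: KNN + CNN base classifiers, RBF-kernel SVM meta-classifier.

    X_feat : domain-specific feature vectors (cf. Table 2) for the KNN base classifier
    X_seq  : padded word-index sequences for the CNN base classifier (see build_cnn above)
    y      : labels (0 = none, 1 = availability of resources, 2 = need of resources)
    """
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    knn.fit(X_feat, y)

    cnn = build_cnn(embedding_matrix)
    cnn.compile(optimizer="adadelta", loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
    cnn.fit(X_seq, y, batch_size=64, epochs=50, validation_split=0.1,
            callbacks=[EarlyStopping(monitor="val_loss", patience=3,
                                     restore_best_weights=True)],   # patience is an assumption
            verbose=0)

    # concatenate the two 3-dimensional probability vectors into a 6-dimensional meta input
    meta_features = np.hstack([knn.predict_proba(X_feat), cnn.predict(X_seq)])
    meta = SVC(kernel="rbf")
    meta.fit(meta_features, y)
    return knn, cnn, meta

def predict_stack(knn, cnn, meta, X_feat, X_seq):
    """Final NAR prediction from the meta-level SVM on the concatenated base-level outputs."""
    return meta.predict(np.hstack([knn.predict_proba(X_feat), cnn.predict(X_seq)]))
```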
This section first introduces the datasets, the parameter details of the model and the metrics used for performance evaluation. Subsequently, the experimental results are presented, including the preliminary experiments, the classifier-selection experiments for the proposed model and the ablation experiments. Furthermore, a comparison is made between the proposed approach and existing approaches.

The data are collected from the Nepal and Italy earthquakes that occurred in 2015 and 2016, respectively. Tweets are crawled through the Twitter API from tweet IDs obtained from the authors of [11]. Out of the total tweets, 80% and 20% are used for training and testing the proposed model, respectively. The details of the disaster datasets are given in Table 3. The code is made available to the public 1.

The CNN model is trained by minimizing the sparse cross-entropy loss corresponding to (5) using the ADADELTA [46] algorithm. The maximum number of epochs is set to 50. Mini-batch sizes of 32, 64 and 128 are tried; a mini-batch size of 64 gives better results than the other batch sizes, as tabulated in Table 6, and filter sizes of 3, 4 and 5 are used. To avoid over-fitting, a dropout of 0.5 [37] and early stopping based on the validation loss are used. All experiments are performed in Python with the scikit-learn [23] package. Table 4 lists the notation of the various methods; its first, second and third columns indicate the serial number, the method name and the abbreviation, respectively. In an abbreviation, the methods before the '+' symbol are the base-level (first-level) classifiers, '+' indicates the concatenation of the predicted outputs of the base-level classifiers, and the '→' symbol indicates that these outputs flow as input to the meta-classifier; the method after '→' is the meta-level (second-level) classifier. The performance of the proposed models is assessed with the standard measures accuracy, precision, recall and F1-score, calculated using Eqs. 8 to 11, respectively, where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives.

This section reports the results of the preliminary experiments, the classifier-selection experiments for the proposed model, and the ablation experiments. Initially, an experiment is performed with the SVM classifier using the proposed domain-specific features for the identification of NAR tweets and compared to the BoW model; the results are shown in Table 5. They highlight the impact of the proposed domain-specific features compared with the BoW model, which is beneficial for identifying tweets, especially on smaller datasets. Later, various experiments are performed with the CNN model to determine the best batch size. Batch sizes of 32, 64 and 128 are used, and the accuracy of the CNN model for the different batch sizes is shown in Table 6. The results show that the CNN model provides the best outcome for a batch size of 64 compared to 32 and 128; therefore, a batch size of 64 is used in the remaining experiments. Note that the values reported in all tables are averages over the Need and Availability of resource classes.
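Since Eqs. 8 to 11 are the standard definitions of these measures, the averaging over the Need and Availability classes could be computed with scikit-learn roughly as follows; the label encoding (0 = none, 1 = availability, 2 = need) and the use of macro averaging over the two NAR classes are assumptions.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def nar_scores(y_true, y_pred, nar_labels=(1, 2)):
    """Accuracy plus precision/recall/F1 averaged over the Need and Availability classes."""
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=list(nar_labels), average="macro", zero_division=0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```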
The following four experiments are performed to choose the most appropriate base-level and meta-level classifiers for the proposed method. 1. In the first experiment, the outputs of CNN and SVM (base-level classifiers) are given as features to the meta-level classifier. By varying the meta-level classifier (SVM, KNN, Decision tree and Naive Bayes), the results reported in Table 7 are obtained. KNN gives the best performance among these classifiers for the Nepal earthquake dataset, whereas SVM gives the best performance for the Italy earthquake dataset. 2. In the second experiment, the outputs of CNN and the decision tree (base-level classifiers) are given as features to the meta-level classifier. The models obtained by varying the meta-level classifier are CDS, CDK, CDNB and CDD, and the results are reported in Table 8. Among these models, CDK gives the best accuracy for both the Nepal and Italy earthquake datasets; CDNB provides the same accuracy as CDK on the Italy earthquake dataset. 3. In the third experiment, the outputs of the CNN and Naive Bayes classifiers (base-level classifiers) are given as features to the meta-level classifier. The models obtained by varying the meta-level classifier are CNBS, CNBK, CNBNB and CNBD, and the results are reported in Table 9. CNBNB has the best accuracy among these models for both disaster datasets; CNBS gives the same accuracy as CNBNB on the Italy earthquake dataset. 4. Finally, in the fourth experiment, the outputs of the CNN and KNN classifiers (base-level classifiers) are given as input to the meta-classifier. The models obtained by varying the meta-classifier are CKS, CKK, CKNB and CKD, and the results are tabulated in Table 10. CKS achieves the highest accuracy among these models for both disaster datasets.

After performing the four experiments, the models achieving the best F1-score, namely CDK, CKS / CKK, CNBS and CSK, are selected from the four sets of models for both disaster datasets. In the same way, the models achieving the highest precision, namely CKNB, CDNB, CNBNB / CNBD and CSNB, are selected for the Nepal earthquake dataset; similarly, CSNB, CDS, CNBNB and CKS achieve the best precision for the Italy earthquake dataset. In terms of execution time, CDS runs fastest on average over both disaster datasets, but it does not give the best results compared to the other models. The CSK model achieves the best F1-score for the Nepal earthquake dataset, and in terms of accuracy it gives the best performance for the Nepal earthquake dataset but not for the Italy earthquake dataset. Comparing all the models overall, CKS performs better than the other models on both disaster datasets; therefore, CKS is selected to identify NAR tweets during a disaster.

Various experiments are conducted to assess the effectiveness of the individual components of the proposed model (CKS) on the two datasets, the Nepal and Italy earthquakes. The proposed model is first evaluated as a whole, and the results for the two datasets are tabulated in Table 11. Then, experiments are performed by excluding the informative (domain-specific) features and the CNN individually from the proposed model, and the results are also reported in Table 11. The informative features play a crucial role for the Italy earthquake dataset, where removing them reduces the accuracy of the proposed model by almost 5.31%; for the Nepal earthquake dataset, the accuracy is reduced by approximately 0.90%. Removing the CNN model drastically reduces the performance on both datasets, by almost 25% and 15% for the Nepal and Italy earthquake datasets, respectively, which indicates that the CNN plays a significant role for both disaster datasets. Removing both the CNN and SVM classifiers from the proposed model reduces the performance by the same amount as removing the CNN alone.
This indicates that the SVM classifier alone does not have much impact on the performance of the model. However, the proposed method (CKS) provides better accuracy than any of its individual components for identifying NAR tweets during a disaster. This is also confirmed by the statistical validation given in Section 4.5.2.

This section briefly describes the methods that are compared with the proposed model. It is divided into two subsections: 1. Classification methodologies. 2. Statistical validation of the classifier models.

This subsection compares the proposed model with the existing classification methodologies [9, 12, 30, 35]. In [9], the authors presented the AIDR platform for the automatic classification of tweets into user-defined categories using unigram and bigram features; accordingly, an SVM classifier with unigram and bigram features is used as a baseline in our experiments. In [35], the authors used features such as location, infrastructure damage, communication, etc. for identifying resources during a disaster, with an SVM classifier for classification. The authors of [12] used a CNN for sentence classification with hyper-parameter tuning; a similar CNN is also evaluated and compared with the proposed model. In [30], the authors used low-level lexical and syntactic features for identifying situational information during a disaster. The proposed CKS model achieves the best accuracy compared to these existing methods on the Nepal and Italy earthquake datasets, and the results are reported in Table 12; the proposed model thus outperforms the existing methods on both datasets for identifying NAR tweets. The better accuracy of the proposed model is due to the use of informative features together with traditional classifiers, which enhances the diversity of the model for identifying NAR tweets; in general, stacking models give better accuracy than individual models when the base models are diverse. It is also observed from Table 12 that the improvement is larger on the Italy earthquake dataset than on the Nepal earthquake dataset, because the former is a smaller dataset. In terms of execution time, the Rudra model [30] runs fastest and the BoW model [9] runs slowest among the compared models; however, the Rudra model does not give the best results for detecting NAR tweets during a disaster.

In this subsection, we investigate the statistical significance of the different classification models. The authors in [5] suggest using the McNemar statistical test for deep learning models; therefore, we use the McNemar test [5] to assess the statistical significance of the classification methods. The contingency table of the McNemar test is shown in Table 13. Here, N01 represents the number of tweets correctly detected by both Model A and Model B, N02 the number of tweets correctly detected by Model B but wrongly detected by Model A, N11 the number of tweets correctly detected by Model A but wrongly detected by Model B, and N12 the number of tweets wrongly detected by both Model A and Model B. The chi-squared statistic (χ²) is defined over the discordant counts as

χ² = (|N02 − N11| − 1)² / (N02 + N11).

The hypotheses are: 1. Null hypothesis (N0): there is no significant difference between the performances of the classifier models.
2. Alternate hypothesis (N1): there is a significant difference between the performances of the classifier models. N0 is accepted when the probability (p) value is greater than 0.05, and N1 is accepted when the p value is less than 0.05. Tables 14 and 15 show the results of the McNemar statistical test for the various proposed methods and the comparison with the existing methods. In the tables, '↑↑' indicates strong evidence that the proposed method is statistically significantly better than the other method, with a probability value less than 0.01 (p < 0.01), corresponding to a 99% confidence level. '↑' indicates weak evidence that the proposed method is statistically significantly better than the other method, with a probability value between 0.01 and 0.05 (0.01 < p < 0.05).
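As an illustration of this test, a small sketch using statsmodels is given below; the helper name and the continuity-corrected chi-squared form are assumptions consistent with the contingency-table notation above.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def compare_models(y_true, pred_a, pred_b, alpha=0.05):
    """McNemar test on the discordant predictions of two models (Table 13 notation)."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    a_correct = pred_a == y_true
    b_correct = pred_b == y_true
    n01 = np.sum(a_correct & b_correct)      # both models correct
    n02 = np.sum(~a_correct & b_correct)     # only Model B correct
    n11 = np.sum(a_correct & ~b_correct)     # only Model A correct
    n12 = np.sum(~a_correct & ~b_correct)    # both models wrong
    table = [[n01, n11],
             [n02, n12]]
    # chi-squared variant with continuity correction over the discordant cells n02 and n11
    result = mcnemar(table, exact=False, correction=True)
    return result.statistic, result.pvalue, result.pvalue < alpha
```

A p value below 0.05 from this helper would correspond to the '↑' evidence level above, and a p value below 0.01 to '↑↑'.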