key: cord-0122562-4hiaiq8r
authors: Zahera, Hamada M.; Jalota, Rricha; Sherif, Mohamed A.; Ngomo, Axel N.
title: I-AID: Identifying Actionable Information from Disaster-related Tweets
date: 2020-08-05
journal: nan
DOI: nan
sha: 164ad73c904d3acc26ed75d2de2d13a30d944e6b
doc_id: 122562
cord_uid: 4hiaiq8r

Social media plays a significant role in disaster management by providing valuable data about affected people, donations and help requests. Recent studies highlight the need to filter information on social media into fine-grained content labels. However, identifying useful information from massive amounts of social media posts during a crisis is a challenging task. In this paper, we propose I-AID, a multimodel approach to automatically categorize tweets into multi-label information types and filter critical information from the enormous volume of social media data. I-AID incorporates three main components: i) a BERT-based encoder to capture the semantics of a tweet and represent it as a low-dimensional vector, ii) a graph attention network (GAT) to capture correlations between tweets' words/entities and the corresponding information types, and iii) a Relation Network as a learnable distance metric to compute the similarity between tweets and their corresponding information types in a supervised way. We conducted several experiments on two real, publicly available datasets. Our results indicate that I-AID outperforms state-of-the-art approaches in terms of weighted average F1 score by +6% and +4% on the TREC-IS dataset and COVID-19 Tweets, respectively.

Social media has become a key medium for sharing information during emergencies [1]. The major difference between social media and traditional news sources is the possibility of receiving feedback from affected people in real time. Relief organizations can benefit from this two-way communication channel to inform people and gain insights from situational updates received from affected people.
Hence, extracting crisis information from posts on social media (e.g., tweets) can substantially improve situational awareness and result in faster responses. Most previous works [2], [3] addressed information extraction from social media as a binary text classification problem (e.g., with the labels Relevant and Irrelevant). However, there is a lack of efficient systems that can map relevant posts to more fine-grained labels as, for example, defined in [4] (see Figure 1). Such fine-grained labels are particularly valuable for crisis responders as they filter critical information to deliver disaster responses quickly. In particular, labeling disaster-related tweets using multiple labels allows the rapid detection of tweets with actionable information. Table 1 shows the list of information types (which we use as labels) defined by [4]. We adopt the definition of actionable tweets as formalized in [1]. Actionable tweets are defined as the ones that would generate an immediate alert for individuals (i.e., stakeholders) responsible for the information type with which they are labeled (e.g., SearchAndRescue, MovePeople). This stands in contrast to non-actionable tweets that are labeled with labels such as Hashtags or FirstPartyObservation (see Table 1).

On the other hand, categorizing tweets is known to be a challenging short text Natural Language Processing (NLP) task [5]. This is because tweets i) do not possess sufficient contextual information, and ii) are inherently noisy (e.g., they contain misspellings, acronyms, emojis, etc.). Moreover, in the multi-label case, the classification task becomes even more challenging because a tweet can belong to one or more labels simultaneously. In this paper, we aim to i) label disaster-related tweets with fine-grained information types so as to ii) identify actionable or critical tweets that might be relevant for disaster relief and support disaster mitigation.
Our approach contains three components: First, we use BERT as a sentence encoder to capture the semantics of tweets and to represent them as low-dimensional vectors. Second, we employ a graph attention network (GAT) to capture correlations between the words and entities in tweets and the labels of said tweets. Finally, we use a Relation Network [6] as a learnable distance metric to compute the similarity between the vector representation of tweets (obtained from the BERT encoder) and the vector representation of labels (obtained from the GAT) in a supervised way. By these means, our system integrates a contextualized representation of tweets with correlations between tweets and their labels. The main contributions of this paper can be summarized as follows:
• We propose a multimodel approach (dubbed I-AID) to categorize disaster-related tweets into multiple information types.
• Our approach leverages a contextualized representation from a pretrained language model (BERT) to capture the semantics of tweets. In addition, our approach employs a GAT component to capture the structural information between the words and entities in tweets and their labels.
• We employ a learnable distance metric, in a supervised way, to learn the similarity between a tweet's vector and the labels' vectors.
• We conduct several experiments to evaluate the performance of our approach and state-of-the-art baselines in multi-label text classification.

The rest of this paper is structured as follows: In Section II, we discuss previous work on the classification of crisis information on social media. In Section III, we describe the preliminaries and architecture of our proposed approach. Finally, we discuss the experimental results in Section IV and conclude the paper in Section V.

The objective of this work is to categorize disaster-related tweets into multiple information types.
Therefore, we relate our work to the extraction of disaster-related information from social media, multi-label text classification, and meta learning. In the following, we briefly discuss the state of the art in each of these areas.

Several studies demonstrate the role of social media as a primary source of information during disasters [7]. While some works [8] focused on filtering relevant information from tweets, others (e.g., [9], [10]) proposed annotation schemes to classify tweets into fine-grained labels that consider the attitude, information source and decision-making behavior of people tweeting before, during and after disasters. To advance the state of social media crisis monitoring solutions, initiatives like [11] have been rolled out in recent years. One of them is the Incident Streams (TREC-IS) track [10] of the Text REtrieval Conference, which commenced in 2018. The track aims to categorize disaster-related tweets into multiple information types. In this work, we study the TREC-IS dataset and adopt the definition of actionable information from the authors of the TREC-IS challenge. In addition, we employ their performance metric (called Accumulated Alert Worth [12]) to evaluate our system in identifying actionable information in tweets.

Earlier works in text classification [13] consider feature engineering and model training as different subtasks. With the advent of end-to-end deep learning approaches [14] and the attention mechanism [15], there has been a significant advancement in the field of multi-label text classification. Pretrained language models (e.g., BERT [16]) are becoming increasingly popular for text classification [14]. However, since BERT only captures the local contextual information, the BERT embeddings do not sufficiently capture the global information about the lexicon of a language [17]. To circumvent this and capture the global relations among words in a vocabulary, graph-based approaches such as the graph convolution network (GCN) [18] and the graph attention network (GAT) [19] have shown promise. Recent studies [17], [20] have exploited the advantages of combining BERT and graph networks. In VGCN-BERT [17], a GCN is used to capture the correlation between words at the vocabulary level (i.e., global information). For instance, given a vocabulary, the GCN would relate the meaning of "new" to "innovation" and "exciting", similar to context-independent word embeddings like word2vec [21]. For an input sentence, the local contextual information is captured using BERT embeddings, while the global information pertaining to words in a sentence is extracted using graph embeddings and subsequently concatenated with the BERT embeddings. The two representations of BERT and GCN then interact via the self-attention mechanism to perform the classification task.

TABLE 1. Information types (partial).
Information type | Description
[label missing] | The post is reporting an event that has occurred
ContinuingNews | The user is providing/linking to a continuous event
Advice | Provide some advice to the public
Sentiment | The post is expressing some sentiments about an event
Discussion | Users are discussing an event
Irrelevant | The post is irrelevant

In a similar work, Pal et al. [20] combine BERT embeddings and GAT to learn feature representations for text in a multi-label classification task. Their proposed approach (dubbed MAGNET) employs two components: First, a BiLSTM network with BERT embeddings is used to encode the text into an embedding vector. In the second component, the authors use GAT to learn a feature vector for labels. In particular, their GAT models the correlation between words and the corresponding labels, then averages the labels' vectors into a single output vector. Finally, the authors use a dot-product function to compute the similarity between the input's vector from the BiLSTM and the label's vector.
In contrast, our approach differs from both MAGNET and VGCN-BERT in computing the similarity between a tweet's representation and its labels' vectors. We employ a GAT model to explicitly infuse the correlation information of the entities and labels of a tweet with the tweet's contextualized BERT representation. While MAGNET and VGCN-BERT use either a fixed and linear distance metric (dot-product function) or self-attention to measure the similarities, our approach benefits from a deeper end-to-end neural architecture to learn this distance function. In particular, we employ meta-learning to learn the mapping between the input features and multi-label output in a supervised way. Meta learning (also called learning-to-learn paradigm) refers to the process of improving a learning algorithm over multiple learning episodes. In contrast to conventional machine learning approaches, which improve model prediction over multiple data instances, the meta-learning framework treats tasks as training examples to solve a new task [22] . In our study, we employ a specific branch of meta learning called metric learning. Metric learning learns a distance function between data samples so that the test instances get classified by comparing them to the labeled examples. The distance function consists of i) an embedding function, which encodes all instances into a vector space, and ii) a similarity metric, such as cosine similarity or Euclidean distance, to calculate how close two instances are in the space [23] . Recently, many approaches have been developed to perform this task, such as Siamese [24] , Matching [25] , Prototypical [26] , and Relation Network [6] . While the embedding function in all of these approaches is a deep neural network, they differ in terms of the similarity function. 
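As a minimal sketch of the fixed-metric setting described above, the following example classifies a query embedding by comparing it with labeled example embeddings under a hand-picked similarity (cosine). The embedding values and the helper names (`cosine_similarity`, `euclidean_distance`, `nearest_label`) are hypothetical and chosen only for illustration; in the approaches cited above, the embeddings would come from a deep neural network rather than being written by hand.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_label(query, labeled_embeddings):
    """Return the label whose embedding is most cosine-similar to the query."""
    return max(
        labeled_embeddings,
        key=lambda name: cosine_similarity(query, labeled_embeddings[name]),
    )


# Hypothetical 3-dimensional embeddings of two labeled examples.
labeled_embeddings = {
    "SearchAndRescue": [0.9, 0.1, 0.0],
    "News": [0.1, 0.8, 0.3],
}
query = [0.8, 0.2, 0.1]  # hypothetical embedding of an unseen tweet
predicted = nearest_label(query, labeled_embeddings)
print(predicted)
```

The metric here is chosen by hand and has no trainable parameters, so swapping cosine similarity for Euclidean distance can change which example counts as closest.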
Unlike its predecessors, which rely on a fixed similarity metric (such as cosine or Euclidean distance), Relation Network employs a flexible function approximator to learn the similarity metric in a supervised way. The use of function approximators eliminates the need to manually choose the right metric (e.g., Euclidean, cosine, Manhattan). By jointly learning the embedding and a nonlinear similarity metric, Relation Network can better identify matching/mismatching pairs [27]. For this reason, we use the Relation Network in our work to learn the similarity metric.

We begin this section by giving a formal specification of the multi-label tweet classification problem. Afterward, we discuss the details of each component of our approach in Section III-B. Figure 2 gives an overview of our approach and how its components work together.

FIGURE 2. The I-AID architecture: BERT-Encoder embeds tweet $t^{(i)}$ into a feature vector $\tau^{(i)}$. TextGAT builds a graph $G$ from our dataset, employs graph attention layers, and outputs the label vectors $\iota$. Relation Network learns a distance metric between $\tau^{(i)}$ and $\iota$, then outputs the predicted labels $\hat{y}^{(i)}$ for $t^{(i)}$.

TABLE 2. Notation.
Notation | Description
$S$ | Number of tweets in the dataset
$w$ | Tweet tokens (e.g., word or entity)
$y^{(i)}$ | Ground-truth multi-label assigned to tweet $i$
$\hat{y}^{(i)}$ | Predicted multi-label assigned to tweet $i$
$\lambda_i$ | A single label/information type for a tweet
$V$ | Nodes of a graph
$E$ | Edges between nodes in a graph
$A$ | Adjacency matrix of a graph
$\tau^{(i)}$ | Embedding vector of tweet $i$ learned by BERT
$Z$ | The concatenated vector of $\tau^{(i)}$ and $\iota$
$\mathcal{L}$ | Binary cross-entropy loss function
$\alpha_{ij}$ | Attention score between nodes $v^{(i)}$ and $v^{(j)}$
$hPW(t)$ | Scoring function for high priority tweets
$lPW(t)$ | Scoring function for low priority tweets

Let $T$ be a set of tweets and $\Lambda = \{\lambda_1, \lambda_2, \cdots, \lambda_k\}$ be a set of $k$ predefined labels (also called information types, see Table 1).
We formulate the problem of identifying crisis information from tweets as a multi-label classification task, where a tweet $t$ can be assigned one or more labels from $\Lambda$ simultaneously. Our task is to learn a multi-label classifier $M : T \rightarrow \{0, 1\}^k$ that maps tweets $T$ to relevant labels from $\Lambda$. We assume a supervised learning setting where a training set $D = \{(t^{(i)}, y^{(i)})\}_{i=1}^{S}$ with $y^{(i)} \in \{0, 1\}^k$ is given: $y^{(i)}_j = 1$ means that $t^{(i)}$ belongs to the class $\lambda_j$, and $y^{(i)}_j = 0$ means that $t^{(i)}$ does not belong to the class $\lambda_j$.

The goal of our approach is to learn the function $M$ by using three neural networks. First, we transform tweet $t^{(i)}$ into an embedding vector $\tau^{(i)}$ using a pretrained BERT model. In parallel, our approach learns the labels' embeddings $\iota$ using a graph attention network (GAT). These are then concatenated with the tweet embedding $\tau^{(i)}$. Finally, the concatenated vectors are fed to our last component (Relation Network) to identify relevant labels for $t^{(i)}$.

1) BERT-Encoder
This is the first component in our system. It transforms an input tweet into a vector representation $\tau$ of its contextual meaning. As shown in Figure 2, the BERT-Encoder takes a tweet $t^{(i)}$ and outputs the embedding vector $\tau^{(i)}$. We employ a BERT-base architecture with 12 encoder blocks, 768 hidden dimensions, and 12 attention heads. We refer readers to the original BERT paper [16] for a detailed description of its architecture and input representation. Furthermore, the BERT input requires special preprocessing. A [CLS] token is prepended to the tweet and a [SEP] token is inserted after each sentence as an indicator of the sentence boundary. Each token $w^{(i)}$ is assigned three kinds of embeddings (token, segmentation, and position). These three embeddings are summed to form the input representation of each token, from which the encoder produces a single output vector $\tau^{(i)}$ that captures the meaning of the input tweet.

2) Text-Graph Neural Network (TextGAT)
Traditional methods (e.g., word2vec [21]) can properly capture features from a text. However, these methods ignore the structural information and relationships between words in a text corpus [28].
The recently proposed graph networks [19] aim to tackle this challenge by modeling text as a graph where words are nodes and relations between them are edges. In our work, we build a graph $G = (V, E)$ from the dataset $D$, where $V$ and $E$ represent the set of nodes and their edges, respectively. Each node $v^{(i)} \in V$ can be a word, a named entity or a label (tweet's class or information type). We represent nodes using a feature matrix $H = \{h^{(1)}, h^{(2)}, \cdots, h^{(N)}\}$, where $h^{(i)} \in \mathbb{R}^{F}$ is the feature vector of node $v^{(i)}$ with dimension $F$ and $N$ is the number of nodes. First, we initialize the nodes' representation $H$ with pretrained GloVe embeddings [29]. Further, relations between nodes are modeled using an adjacency matrix $A \in \mathbb{R}^{N \times N}$. As shown in Figure 2, the TextGAT component has two graph attention layers. Each layer takes the nodes' features $H$ as input and performs an attention operation [30] to learn a new feature $\hat{H} = \{\hat{h}^{(1)}, \hat{h}^{(2)}, \cdots, \hat{h}^{(N)}\}$ for each node based on its neighbours' importance (i.e., attention from its neighbours). Hence, we employ the shared attention mechanism $att : \mathbb{R}^{F} \times \mathbb{R}^{F} \rightarrow \mathbb{R}$ over all nodes. The graph attention operated on the node representation can be written as:

$$e_{ij} = att\left(W h^{(i)}, W h^{(j)}\right) \quad (1)$$

where $att$ is a single-layer feedforward network, parametrized by a weight matrix $W \in \mathbb{R}^{F \times F}$ which is applied to every node. Finally, we use a softmax function to normalize the attention scores as shown in Eq. 2:

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{m \in \mathcal{N}_i} \exp(e_{im})} \quad (2)$$

where $\mathcal{N}_i$ denotes the neighbours of node $v^{(i)}$ in $G$. In this way, TextGAT learns the structural information between nodes based on the relative importance of neighbours. The learned representations of labels are then extracted and concatenated with the tweet's vector as input for the last component, as shown in Figure 2.

3) Relation Network
In this component, we aim to learn a similarity metric in a supervised way (also called learning-to-learn or meta learning) between the tweet's vector $\tau^{(i)}$ and the label vectors $\iota$.
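To make the idea of a learnable similarity concrete, here is a minimal sketch of a relation module that scores one (tweet vector, label vector) pair at a time. Everything in it is an assumption made for illustration, not the architecture of the paper: the two-layer network, the ReLU activation, the random untrained weights, the 0.5 decision threshold, the 4-dimensional toy vectors, and the function names `init_relation_params`, `relation_score` and `predict_labels`.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def init_relation_params(input_dim, hidden_dim, seed=0):
    """Randomly initialise a small two-layer scoring network (untrained)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)]
          for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
    b2 = 0.0
    return w1, b1, w2, b2


def relation_score(tweet_vec, label_vec, params):
    """Score a (tweet, label) pair with a nonlinear, parametrised function."""
    w1, b1, w2, b2 = params
    pair = list(tweet_vec) + list(label_vec)  # concatenate both vectors
    hidden = [max(0.0, sum(w * x for w, x in zip(row, pair)) + b)  # ReLU layer
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return sigmoid(logit)  # independent probability for this label


def predict_labels(tweet_vec, label_vecs, params, threshold=0.5):
    """Score every label independently and keep those above the threshold."""
    scores = {name: relation_score(tweet_vec, vec, params)
              for name, vec in label_vecs.items()}
    selected = [name for name, score in scores.items() if score >= threshold]
    return scores, selected


# Hypothetical 4-dimensional tweet and label vectors.
tweet_vec = [0.2, -0.1, 0.7, 0.4]
label_vecs = {
    "SearchAndRescue": [0.3, 0.0, 0.6, 0.5],
    "News": [-0.4, 0.9, 0.1, 0.0],
    "Advice": [0.1, 0.2, -0.3, 0.8],
}
params = init_relation_params(input_dim=8, hidden_dim=6)
scores, selected = predict_labels(tweet_vec, label_vecs, params)
print(scores, selected)
```

Because the weights are untrained, the scores carry no meaning yet. The point of the sketch is that the comparison function has parameters that a loss can adjust, unlike a fixed cosine or dot-product similarity.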
Furthermore, we employ a neural network as a learnable, nonlinear distance function that learns how to match similarity (i.e., relation) between the tweet's vector and each label. Relation Network takes as input the concatenated matrix $Z = \tau^{(i)} \otimes \iota$ of the BERT-Encoder output with the labels' vectors. We use a sigmoid function in the output layer to compute the probability of each label independently over all possible labels ($\Lambda$), in contrast to a softmax function, which only considers the label with the highest probability. Since our task is multi-label classification, we use the binary cross-entropy as the loss function (Eq. 3). Finally, a set of relevant labels is returned as the final output of our approach.

$$\mathcal{L} = -\frac{1}{S} \sum_{i=1}^{S} \sum_{j=1}^{k} \left[ y^{(i)}_j \log \hat{y}^{(i)}_j + \left(1 - y^{(i)}_j\right) \log\left(1 - \hat{y}^{(i)}_j\right) \right] \quad (3)$$

where $y^{(i)}$ and $\hat{y}^{(i)}$ are the ground-truth and predicted labels of tweet $i$, respectively, and $S$ is the number of tweets in the training dataset.

In this section, we report the evaluation results of our approach and baseline methods. We aim to answer the following research questions:
Q1: How does our approach perform compared with state-of-the-art multi-label models in short text (e.g., tweets) classification?
Q2: How effective is our approach in identifying tweets with actionable information?
Q3: How does each component in our approach affect the overall performance (i.e., ablation study)?

We conducted a set of experiments on two public datasets provided by TREC [10]. Table 3 gives an overview of each dataset: the number of tweets used in training (# Train), validating (# Valid), and testing (# Test) our approach and baselines, in addition to the total number of classes (# Classes). In particular, we split each dataset with an 80%/20% ratio, where we use 80% of tweets for training and 20% for testing. During the training phase, we use 20% of the training data to validate the model.
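The splitting scheme just described can be sketched as follows. This is only an illustration under assumptions: the random shuffling, the seed, the rounding, and the function name `split_dataset` are not taken from the paper, and the resulting counts are simply what the ratios imply for 7,590 items, not the figures reported in Table 3.

```python
import random


def split_dataset(items, test_ratio=0.2, valid_ratio=0.2, seed=42):
    """Hold out a test set, then carve a validation set out of the rest."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumed: plain random shuffle
    n_test = round(len(shuffled) * test_ratio)
    test = shuffled[:n_test]
    train_full = shuffled[n_test:]
    # The validation share is taken from the training portion only.
    n_valid = round(len(train_full) * valid_ratio)
    valid = train_full[:n_valid]
    train = train_full[n_valid:]
    return train, valid, test


# Stand-in for 7,590 tweets (integer ids instead of real text).
train, valid, test = split_dataset(range(7590))
print(len(train), len(valid), len(test))
```

Under these assumptions, the effective proportions are roughly 64% training, 16% validation and 20% test.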
We briefly summarize each dataset as follows:
• TREC-IS: This dataset contains approximately 35K tweets collected during 33 different disasters between 2012 and 2019 (e.g., wildfires, earthquakes, hurricanes, bombings, and floods). The tweets are labeled with 25 information types by human experts and volunteers.
• COVID-19 Tweets: This dataset contains a collection of tweets about the COVID-19 outbreak in different affected regions. In total, the data has 7,590 tweets labeled with one or more of the full set of 12 information type labels (the same as for the TREC-IS dataset).

VOLUME 4, 2021

Figure 3 shows the distribution of tweets per information type in both datasets. The datasets are highly imbalanced w.r.t. the tweets' distribution across information types. For example, in the TREC-IS dataset, there are more than 6,000 tweets that are categorized into the information types Hashtags, News, MultimediaShare, and Location. In contrast, the information types CleanUP, InformationWanted, and MovePeople have significantly fewer tweets. Similarly, in COVID-19 Tweets, the tweets' distribution is extremely imbalanced: most tweets are categorized into Irrelevant, ContextualInformation, Advice, or News. This skewed distribution renders multi-label classification more challenging.

We consider a set of state-of-the-art approaches in multi-label classification as baselines in our evaluation. We briefly describe each baseline as follows:
• MAGNET [20]: uses BERT embeddings to represent tweets and GAT for label classifiers. It then uses a dot-product function to compute similarities between tweet vectors and labels' vectors.

We use the open-source implementations for the TextCNN, HAN, and BiLSTM models provided by the corresponding authors in their GitHub repositories. Furthermore, we implemented the code for the MAGNET model ourselves, as it has not been open-sourced to date. In our approach, we use the implementation of the BERT-Encoder from the Huggingface library.
Hyperparameters in the baselines are set to the same values as mentioned in their original papers. In our model, we tune hyperparameters via grid search to find the values that yield the best performance. Specifically, our model achieves its best performance with the following values: 200 training epochs, a batch size of 128, and the Adam optimizer [34] with a learning rate of 2e-5. To avoid overfitting, we add a dropout layer with a rate of 0.25 and apply early stopping during the model's training. The implementation of the I-AID model is open-sourced and available on the project website.

Given that the evaluation datasets consist of tweets, we perform ad hoc preprocessing steps to capture the tweets' semantics. In particular, we perform the following preprocessing steps: (1) We use NLTK's TweetTokenizer API to tokenize tweets and retain the text content. (2) We remove stop-words, URLs, usernames, and Unicode characters. (3) We remove extra white spaces, repeated full stops, question marks, and exclamation marks. (4) We convert emojis to text using the emoji Python library. Finally, (5) we use the spaCy library to extract named entities from tweets.

We consider standard evaluation metrics for a multi-label classification task. In particular, we use the weighted average F1 score, Hamming loss and Jaccard index to evaluate the system's performance:
• Weighted average F1 score: The F1 score is the harmonic mean of the precision and recall scores. We use a weighted average that calculates the F1 score for each label independently, then adds them together using a weight relative to the number of tweets in each label:

$$F1_{weighted} = \sum_{i=1}^{k} \frac{|T_{\lambda_i}|}{|T|} \cdot \frac{2 \cdot Precision_{\lambda_i} \cdot Recall_{\lambda_i}}{Precision_{\lambda_i} + Recall_{\lambda_i}} \quad (4)$$

where $|T_{\lambda_i}|$ denotes the number of tweets with label $\lambda_i$ and $|T|$ is the total number of tweets. $Precision_{\lambda_i}$ and $Recall_{\lambda_i}$ are the values of precision and recall for $\lambda_i$.
• Hamming Loss: To estimate the error rate in classification, we use the Hamming loss function [35], which computes the fraction of incorrectly predicted labels out of all labels. Hence, the smaller the value, the better the performance.

$$HL = \frac{1}{S \cdot k} \sum_{i=1}^{S} \sum_{j=1}^{k} y^{(i)}_j \oplus \hat{y}^{(i)}_j \quad (5)$$

where $S$ is the dataset size, $k$ is the total number of labels (i.e., $|\Lambda|$), $\oplus$ denotes the XOR operator, and $y^{(i)}$ and $\hat{y}^{(i)}$ are the ground-truth and predicted labels, respectively, of tweet $i$.
• Jaccard Index: To assess the system's accuracy, we use the Jaccard index to evaluate the similarity between the predicted labels $\hat{y}^{(i)}$ and the ground-truth labels $y^{(i)}$. The Jaccard index computes the proportion of labels that the two sets have in common out of all labels in either set:

$$J = \frac{1}{S} \sum_{i=1}^{S} \frac{|y^{(i)} \cap \hat{y}^{(i)}|}{|y^{(i)} \cup \hat{y}^{(i)}|} \quad (6)$$

where $y^{(i)}$ and $\hat{y}^{(i)}$ are the ground-truth and predicted labels for tweet $i$, and $\cap$ and $\cup$ denote the set intersection and union operations, respectively.

We aim to evaluate the efficacy of our system in identifying tweets with actionable information, i.e., the system should trigger an alert if an input tweet includes actionable information (e.g., requests for search and rescue or reports of emerging threats). For this purpose, TREC-IS [36] introduces a new evaluation metric called Accumulated Alert Worth (AAW) to evaluate systems in detecting actionable information during a crisis. The AAW score ranges from −1 to +1, where a positive value indicates highly critical information in a tweet while a negative score indicates it is less critical. More details about the AAW metric can be found in [12]. Here, we summarize the AAW metric as follows:

$$AAW = \frac{1}{2} \left( \frac{1}{|T_h|} \sum_{t \in T_h} hPW(t) + \frac{1}{|T_l|} \sum_{t \in T_l} lPW(t) \right) \quad (7)$$

where $T_h$ and $T_l$ denote the sets of tweets with high and low priorities, respectively. $hPW(t)$ is a scoring function for tweets that should generate alerts and $lPW(t)$ is a scoring function for tweets that should not generate alerts. Formally, both scoring functions are defined in terms of $p^s_t$, the priority score assigned to a tweet by the system, and $\phi(t)$ and $\bar{\phi}(t)$, the actionable and non-actionable scores, respectively, for tweet $t$.
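The three classification metrics defined above (weighted average F1, Hamming loss and Jaccard index) can be computed on multi-hot label vectors as in the sketch below. The toy labels and the function names are hypothetical. One assumption is worth flagging: the F1 weights here are normalized by the total label support so that they sum to one, which coincides with dividing by the number of tweets only when every tweet carries exactly one label. AAW is not covered, since its scoring functions are specified in [12].

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label positions where prediction and truth disagree."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t ^ p
                for t_row, p_row in zip(y_true, y_pred)
                for t, p in zip(t_row, p_row))
    return wrong / total


def jaccard_index(y_true, y_pred):
    """Mean over tweets of |intersection| / |union| of the two label sets."""
    scores = []
    for t_row, p_row in zip(y_true, y_pred):
        inter = sum(t & p for t, p in zip(t_row, p_row))
        union = sum(t | p for t, p in zip(t_row, p_row))
        scores.append(inter / union if union else 1.0)  # both empty: match
    return sum(scores) / len(scores)


def weighted_f1(y_true, y_pred):
    """Per-label F1 averaged with weights proportional to label support."""
    num_labels = len(y_true[0])
    weighted_sum, total_support = 0.0, 0
    for j in range(num_labels):
        pairs = [(t_row[j], p_row[j]) for t_row, p_row in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        support = tp + fn  # number of tweets that truly carry label j
        weighted_sum += support * f1
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


# Four hypothetical tweets, three labels, multi-hot encoded.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

print(hamming_loss(y_true, y_pred))   # 0.25
print(jaccard_index(y_true, y_pred))  # 0.625
print(weighted_f1(y_true, y_pred))
```

For real experiments, scikit-learn offers `hamming_loss`, `jaccard_score` and `f1_score` in `sklearn.metrics`; the hand-written versions above only serve to make each definition explicit.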
We use different multi-label classification metrics to evaluate the performance of I-AID and the baseline methods. To ensure a fair evaluation, we use the same training dataset for training all models and the same test dataset for evaluation. Table 4 reports our evaluation results for each model on both datasets (TREC-IS and COVID-19 Tweets). We consider the weighted average F1 score as the primary metric to compare and rank systems. Weighted average F1 takes into account the average performance of each system across all information types. Overall, our approach (I-AID) achieves superior results to the other baselines under several metrics. In particular, our approach outperforms the weighted average F1 score of MAGNET, the state-of-the-art baseline in multi-label tweet classification, by +6% on TREC-IS and +4% on COVID-19 Tweets. We employ the Jaccard index and Hamming loss in further analysis to evaluate accuracy and error rate. Using the Jaccard index, our approach outperforms all baseline methods on both datasets. In particular, I-AID achieves a 43% Jaccard index on both datasets compared with MAGNET's score of 38% on the TREC-IS dataset and 40% on COVID-19 Tweets. On the other hand, the results for Hamming loss are mixed. On the TREC-IS dataset, I-AID achieves the best performance with a rate of 0.07%, while on COVID-19 Tweets it achieves the second-best score with 0.08% compared with the HAN model's score of 0.04%.

Our experiments demonstrate that I-AID performs fairly well when categorizing disaster-related tweets into multiple information types. This is due to three facts: i) We constructed a multimodel framework that leverages contextualized embeddings from the BERT model to capture contextual information in tweets. ii) Our approach enriches the semantics of the tweet representation by injecting label information and integrating additional structural information between tweets' tokens and labels using GAT.
iii) Finally, we employ a Relation Network to automatically learn similarities between tweets and labels. By using a learnable distance function, we learn an efficient metric in a supervised way to facilitate the mapping between a tweet and the multi-label output.

To answer Q2, we use the AAW metric proposed by TREC (Eq. 7) to evaluate I-AID's ability to identify tweets with critical information. There are two ways to define an actionable tweet [1]: i) in terms of high priority information, commonly marked as critical by human assessors, and ii) in terms of information type; for instance, a tweet with the labels MovePeople or CleanUP is considered more actionable than one labeled News or MultimediaShare. In our evaluation, we consider the second definition of actionable posts. The evaluation results of the AAW metric are presented in Table 5, where the top 6 rows show the evaluation results for the baseline approaches in multi-label classification. The rest of Table 5 shows the AAW results of the best approaches from the TREC-IS challenge (2019 edition [10] RUN B). The result of our approach (I-AID) is presented at the bottom of Table 5. Our approach (I-AID) substantially outperforms all baseline approaches. In particular, in high priority AAW, I-AID achieves an absolute improvement of +26% compared to the MAGNET model and +32% compared to nyu-smap (the best-achieved result in TREC-IS 2019). Furthermore, I-AID outperforms the median score of TREC-IS participants by +28% in high priority and by +30% in overall AAW. Remarkably, our approach is the first to achieve a positive AAW score on high priority tweets. Although we outperform the state of the art in both classification and AAW, the results of our evaluation suggest that a significant amount of research is still necessary to spot high priority tweets in a satisfactory manner.
As discussed in Section III, our approach employs two main components (namely, BERT-Encoder and TextGAT) for representing input tweets. We perform an ablation study to evaluate the performance of each component individually. To do so, we implement two more versions of I-AID: I-AID-BERT and I-AID-TGAT. In I-AID-BERT, we deploy our system with only the BERT-Encoder to classify tweets into multiple information types. In the same manner, we implement I-AID-TGAT with only the TextGAT component. Table 6 shows the evaluation results for each component in our ablation study. The BERT-Encoder-based implementation of I-AID achieves better performance than the TextGAT version. In particular, on the TREC-IS dataset, I-AID-BERT reaches a 50% F1 score compared with 26% for I-AID-TGAT. These results indicate that I-AID-BERT learns richer feature representations from short text than I-AID-TGAT. Moreover, we demonstrate that leveraging BERT and GAT together in a multimodel framework improves the overall performance. On the TREC-IS dataset, the I-AID model achieves superior results by +9% in F1 score compared with the BERT-based model and by +33% compared with the TGAT-based model. Our experiment on COVID-19 Tweets leads to a similar conclusion: our approach outperforms the BERT-based and GAT-based systems in F1 score by +8% and +19%, respectively. On the other hand, we observe that the GAT-based model achieves better performance with fewer output labels. On COVID-19 Tweets with 12 labels, I-AID-TGAT improves its F1 score by +10% compared with its performance on TREC-IS with 25 labels.

In this paper, we propose I-AID, a multimodel approach for multi-label tweet classification. Our system combines three components: BERT-Encoder, TextGAT, and Relation Network. The BERT-Encoder is used to obtain locality information, while the TextGAT component aims to find correlations between tweets' tokens and their corresponding labels.
Finally, we use a Relation Network as the last component to learn the relevance of each label w.r.t. the tweet content. Our main findings are as follows: i) Combining local information captured by the BERT-Encoder and global information captured by TextGAT is beneficial for a rich representation of short text and significantly advances multi-label classification. ii) Leveraging transfer learning from pretrained language models can efficiently handle sparsity and noise in social media data. iii) Benchmarking multi-label classification is a challenging task that requires proper evaluation metrics for fine-grained evaluation. I-AID achieves its best weighted average F-score of 0.59 on the TREC-IS dataset. This result indicates that our approach is sensitive to class imbalance in the dataset. Dealing with unbalanced classes remains a future extension to our approach. We plan to use data augmentation and natural language generation to address this problem.

DR. MOHAMED AHMED SHERIF is a postdoctoral researcher at the Data Science chair (DICE) at the University of Paderborn. Mohamed's research interests revolve around knowledge graphs and semantic web technologies, especially (explainable) machine learning for data integration. Mohamed has developed a number of algorithms for link specification learning, data repair, load balancing and relation discovery. Currently, he is leading the data integration tasks of many research projects.

PROF. DR. AXEL-CYRILLE NGONGA NGOMO holds the Data Science chair (DICE) at the Computer Science department at the University of Paderborn. His research interests revolve around knowledge graphs and semantic web technologies, especially link discovery, federated queries, machine learning and natural-language processing. Axel has (co-)authored more than 200 reviewed publications and has developed several widely used frameworks.
[1] From situational awareness to actionability: Towards improving the utility of social media data for crisis response
[2] On identifying disaster-related tweets: Matching-based or learning-based?
[3] Improving classification of Twitter behavior during hurricane events
[4] TREC incident streams: Finding actionable information on social media
[5] Short text classification: A survey
[6] Learning to compare: Relation network for few-shot learning
[7] Crisis mapping during natural disasters via text analysis of social media messages
[8] Weakly supervised and online learning of word models for classification to detect disaster reporting tweets
[9] What to expect when the unexpected happens: Social media communications across crises
[10] TREC incident streams: Finding actionable information on social media
[11] Twitter as a lifeline: Human-annotated Twitter corpora for NLP of crisis-related messages
[12] Accumulated alert worth for evaluating actionable information
[13] Short text classification in Twitter to improve information filtering
[14] Label embedding using hierarchical structure of labels for Twitter classification
[15] Attention is all you need
[16] BERT: Pre-training of deep bidirectional transformers for language understanding
[17] VGCN-BERT: Augmenting BERT with graph embedding for text classification
[18] Semi-supervised classification with graph convolutional networks
[19] Graph convolutional networks for text classification
[20] MAGNET: Multi-label text classification using attention-based graph neural network
[21] Efficient estimation of word representations in vector space
[22] Meta-learning in neural networks: A survey
[23] Meta-learning for few-shot natural language processing: A survey
[24] Siamese neural networks for one-shot image recognition
[25] Matching networks for one shot learning
[26] Prototypical networks for few-shot learning
[27] Few-shot learning with graph neural networks
[28] Large-scale hierarchical text classification with recursively regularized deep graph-CNN
[29] GloVe: Global vectors for word representation
[30] Graph attention networks
[31] Convolutional neural networks for sentence classification
[32] Hierarchical attention networks for document classification
[33] Attention-based bidirectional long short-term memory networks for relation classification
[34] Adam: A method for stochastic optimization
[35] Improved boosting algorithms using confidence-rated predictions
[36] Incident streams 2019: Actionable insights and how to find them