key: cord-0136371-0fkznkx7 authors: Wang, Congcong; Nulty, Paul; Lillis, David title: Crisis Domain Adaptation Using Sequence-to-sequence Transformers date: 2021-10-15 journal: nan DOI: nan sha: 7caf35ef1f72fc71fecca11cf1646562b75eb9df doc_id: 136371 cord_uid: 0fkznkx7 User-generated content (UGC) on social media can act as a key source of information for emergency responders in crisis situations. However, due to the volume concerned, computational techniques are needed to effectively filter and prioritise this content as it arises during emerging events. In the literature, these techniques are trained using annotated content from previous crises. In this paper, we investigate how this prior knowledge can be best leveraged for new crises by examining the extent to which crisis events of a similar type are more suitable for adaptation to new events (cross-domain adaptation). Given the recent successes of transformers in various language processing tasks, we propose CAST: an approach for Crisis domain Adaptation leveraging Sequence-to-sequence Transformers. We evaluate CAST using two major crisis-related message classification datasets. Our experiments show that our CAST-based best run without using any target data achieves the state of the art performance in both in-domain and cross-domain contexts. Moreover, CAST is particularly effective in one-to-one cross-domain adaptation when trained with a larger language model. In many-to-one adaptation where multiple crises are jointly used as the source domain, CAST further improves its performance. In addition, we find that more similar events are more likely to bring better adaptation performance whereas fine-tuning using dissimilar events does not help for adaptation. To aid reproducibility, we open source our code to the community. As evidenced by a number of previous research works (Imran, Mitra, et al. 2016; Alam et al. 2018; McCreadie et al. 
2019), exploring computational techniques for finding useful information in user-generated content (UGC) on social media during crises remains an important research question. This is the case for two reasons. Firstly, it is not easy to manually filter useful information, since UGC is usually voluminous and noisy (Imran, Castillo, et al. 2015), thus motivating the development of automatic filtering techniques. Secondly, since UGC contains a good deal of actionable information (e.g. a need for rescue following an earthquake), it has the potential to be utilised by emergency response agencies to aid affected people in a timely manner (McCreadie et al. 2019).

Most approaches to crisis message classification require UGC from prior crises for training purposes (e.g. to fine-tune language models). The nature of such training data is the focus of this work. Crisis events of different types may feature similar characteristics. For example, a bombing or an earthquake may both result in a person requiring rescue from a fallen building. Similarly, flooded conditions may arise from weather events such as hurricanes, or for other reasons such as a dam breach. As a new crisis unfolds, it is important to consider the training data that is used. Using training data from one crisis in order to classify posts from another is a form of cross-domain adaptation.

Two primary research questions are addressed in this study. Firstly, to what extent does the similarity of crisis event types affect the quality of cross-domain adaptation in this context (one-to-one adaptation)? Secondly, although more training data is generally considered to be a positive, does this presumption hold when the training data merges different crisis events with different characteristics (many-to-one adaptation)? In this work, we propose CAST, which explores sequence-to-sequence (seq2seq) transformers for domain adaptation between different crises for message classification tasks.
Unlike the standard method of fine-tuning seq2seq transformers for a downstream task (Raffel et al. 2020; Lewis et al. 2020), CAST is simple and is trained by adding a task description and/or event description to each example, as illustrated in Figure 1. In comparison to similar work on crisis domain adaptation (Alam et al. 2018; Liu et al. 2020), CAST uses only labeled source data without any unlabeled target data. This makes CAST better suited to real-world use cases, because a crisis usually focuses on a specific "topic" at a certain stage. For example, an earthquake is normally more about "Emerging Threats" than "Donation" in its early stages. Hence, to obtain a good-quality distribution of the target event, unlabeled target data would need to be collected as the target crisis unfolds.

To test the effectiveness of CAST, we conducted comparative experiments in both one-to-one (a single source event to a single target event) and many-to-one (several source events to a single target event) adaptation settings. In one-to-one adaptation, CAST outperforms the state of the art in both in-domain and cross-domain adaptation when not using any target data. Compared with the standard approach, CAST is more effective in cross-domain adaptation when trained with a larger model. In many-to-one adaptation, CAST's advantage over the standard approach in cross-domain adaptation becomes more obvious even with a small model. Moreover, based on our experimental figures, there is evidence to suggest that using multiple similar source events as the source domain improves adaptation to a target event. Our main contributions are summarised as follows:

• We propose CAST: a seq2seq transformer-based approach for crisis domain adaptation. Our approach makes the model event-aware by taking into account a task descriptor and an event descriptor in the input construction.
In addition, it does not rely on any target data in model training, which makes it more suitable for real-world use cases. • In the one-to-one adaptation setting, experimental results show that CAST achieves competitive performance with existing work that uses unlabeled target data for self-supervision, and it outperforms the state of the art when not using any target data. Moreover, we found CAST to be particularly effective in cross-domain adaptation when trained with a larger model or when combining multiple events as the source domain. • Using CAST, we study the effectiveness of adaptation from similar events to a target event. The experimental results suggest that, to achieve better adaptation performance, similar events should be combined as the source domain, whereas adding dissimilar events does not help adaptation.

Our work aims to tackle the crisis domain adaptation problem by training computational models for crisis-related message classification. Hence, we survey related work from two perspectives: (i) crisis domain adaptation and (ii) crisis message classification. To overcome the scarcity of training data for a new crisis, the problem of crisis domain adaptation has been widely studied in the literature (H. Li, Guevara, et al. 2015; Imran, Mitra, et al. 2016; H. Li, X. Li, et al. 2018; Alam et al. 2018; X. Li and D. Caragea 2020). This line of work can be broadly divided into two categories: target-data independent and target-data dependent. The former is a supervised approach in which no unlabeled target data is used in model training. For example, Imran, Mitra, et al. (2016) investigated domain adaptation between different combinations of past disasters across different languages. They found that similar events of the same type (e.g. earthquakes) tend to be useful for adaptation, and that even cross-language domain adaptation is useful when the source and target events are in similar languages. Another target-data independent work is by H.
Li, X. Li, et al. (2016), who explored a wide range of both word embeddings and sentence encodings with traditional machine learning (ML) algorithms for crisis adaptation classification tasks. They found that general pre-trained GloVe embeddings overall outperform other embeddings, and that GloVe embeddings trained on crisis data bring better results on more specific crisis tweet classification tasks. The other category is target-data dependent, where classifiers are trained with labeled source data as well as unlabeled target data. Existing work in this category has consistently found that adaptation performance can be improved by additionally using unlabeled target data in model training (known as semi-supervised learning or self-training) (H. Li, Guevara, et al. 2015; Alam et al. 2018; X. Li and D. Caragea 2020). For example, H. Li, Guevara, et al. (2015) trained Naïve Bayes classifiers for crisis domain adaptation with unlabeled target data taken into account via a self-training strategy. They compared these classifiers with corresponding supervised classifiers learned only from labeled source data, showing that the classifiers trained with extra unlabeled target data achieve better adaptation performance. Moreover, they selected eleven event pairs for a cross-domain adaptation study, presenting evidence that adaptation between similar event pairs is likely to bring better performance. With the recent success of deep learning approaches based on neural networks (NNs) in short-message processing tasks (Y. Kim 2014), some work has applied NN methods to the crisis adaptation problem. The representative work in this direction is from Alam et al. (2018). They applied a convolutional NN (CNN) with adversarial training and graph embeddings for domain adaptation between two crisis events in both supervised and semi-supervised (with unlabeled target data) settings, showing that the semi-supervised setting outperforms the supervised one. Another recent target-data dependent work is by X.
Li and D. Caragea (2020). Instead of a CNN, this work applied a recurrent neural network (RNN) based seq2seq model that is trained jointly on a classification task with labeled source data and a sequence reconstruction task with unlabeled target data. It was found that the reconstruction task can benefit adaptation performance as compared to the classification task alone. Considering that even unlabeled target data is not readily available for a new crisis, CAST is an NN-based target-data independent approach, aiming to explore the limits of adaptation performance without knowing any target data.

To achieve the objective of finding useful information among UGC from social media, the literature has seen several works on classifying messages by various "information types" (C. Caragea et al. 2011; Nguyen et al. 2017; Liu et al. 2020; Wang and Lillis 2020a). Information types exist in a wide range of forms. They can simply be binary, indicating whether a message is relevant or informative with respect to a certain disaster, or can be more fine-grained, indicating different information nuggets such as requesting search and rescue, reporting infrastructural damage, etc. (Olteanu, Vieweg, et al. 2015; McCreadie et al. 2019). Given the importance of such classification tasks in emergency response, many computational techniques have been proposed for this purpose, varying from traditional ML algorithms to NN algorithms (C. Caragea et al. 2011; Nguyen et al. 2017). In particular, since the attention-based transformer NN architecture was introduced (Vaswani et al. 2017), recent years have witnessed great success of its variants (Devlin et al. 2019; Raffel et al. 2020) fine-tuned on downstream tasks. Two broad categories of these variants are encoder-based (e.g., BERT) and seq2seq-based (e.g., T5). Related work in both categories is found in the literature. For example, Liu et al.
(2020) applied BERT to two crisis message classification tasks, leading to state-of-the-art performance. Wang and Lillis (2020b) leveraged the seq2seq T5 model for finding useful information in messages relating to the COVID-19 pandemic by treating a slot-filling classification task as a question-answering task. Our work is directly inspired by this work in constructing the input sequence, with the addition of an event description. Since the core idea behind CAST is crisis domain adaptation, our work additionally takes into account the event type embedding (i.e., the event description) in the input construction. To the best of our knowledge, our work is the first to systematically study the problem of domain adaptation between different disasters by leveraging seq2seq transformer models for crisis message classification.

In this section, we first describe the background of fine-tuning seq2seq transformers for general downstream tasks and then introduce CAST for crisis domain adaptation based on this background. At a high level, a seq2seq model consists of an encoder and a decoder. The encoder learns to encode an input example into a vector that represents the contextualised linguistic features of the example. Conditioned on the input representation, the decoder then learns to generate the predicted words iteratively. Mathematically, given a source sequence $X = \{x_1, x_2, ..., x_n\}$, the seq2seq model generates predictions, denoted as the target sequence $Y = \{y_1, y_2, ..., y_m\}$, through a parameterised estimation of the conditional probability distribution as follows:

$$p(Y \mid X) = \prod_{i=1}^{m} p_{\theta_d}(y_i \mid Y_{0:i-1}, f_{\theta_e}(X_{1:n})) \quad (1)$$

where $f_{\theta_e}(\cdot)$ is the mapping function from the source sequence $X$ to its contextualised representation, learnt by the encoder with tunable parameters $\theta_e$. Likewise, $\theta_d$ denotes the tunable parameters of the decoder, which learns the conditional generation function $p_{\theta_d}(\cdot)$. In order to optimise $\theta_e$ and $\theta_d$, the model is trained with the objective function defined as follows:
$$\arg\min_{\theta_e, \theta_d} \; -\sum_{i=1}^{m} \log p_{\theta_d}(y_i \mid Y_{0:i-1}, f_{\theta_e}(X_{1:n})) \quad (2)$$

As can be seen, $\theta_e$ and $\theta_d$ are tuned with the objective of minimising the cross-entropy loss between the ground truth targets $Y = \{y_1, y_2, ..., y_m\}$ and the predicted targets $\hat{Y} = \{\hat{y}_1, \hat{y}_2, ..., \hat{y}_m\}$. In fine-tuning for a downstream classification task, $\theta_e$ and $\theta_d$ are first initialised from their corresponding pre-trained parameters and then tuned on the same objective function, where $Y = \{y_1, y_2, ..., y_m\}$ refers to the ground truth labels of the task.

The problem of domain adaptation between crisis events can be described as follows. Given a task $\mathcal{T}$, for a set of source events $S$ with dataset $S_d$, a model trained on $S_d$ is directly tested on the test set $T_d$ of a set of target events $T$ within the same task $\mathcal{T}$. Using the aforementioned seq2seq model, this means that $\theta_e$ and $\theta_d$ are first learned to fit the source dataset $S_d$ through fine-tuning, and the model then performs inference directly on the target dataset $T_d$ without further training. To differentiate in-domain and cross-domain adaptation, the former refers to the case where $S$ equals $T$, while the latter is the case where $S$ and $T$ are disjoint, which is the focus of this study. To put cross-domain adaptation in the context of crisis response, the source events set $S$ refers to past crises whose datasets are available, and the target events set $T$ usually contains one element referring to an emerging new crisis.

For the standard method of fine-tuning a seq2seq model for a downstream task, as seen in (Raffel et al. 2020; Lewis et al. 2020), the input example $X = \{x_1, x_2, ..., x_n\}$ simply consists of the textual content itself (see Figure 1). CAST is specifically proposed for cross-domain adaptation, taking into account both a task description $T_{desc}$ and an event description $E_{desc}$, leading to the new input $\hat{X} = \{\hat{x}_1, \hat{x}_2, ..., \hat{x}_k\}$, formulated as follows:
$$\hat{X} = \zeta(X, T_{desc}, E_{desc}) \quad (3)$$

where $\zeta$ is the transformation function that concatenates $X$ with $T_{desc}$ and $E_{desc}$ in a natural language form. For example, in Figure 1, $\hat{X}$ is constructed as a question-answering sequence. Following this, Equation 1 now becomes:

$$p(Y \mid \hat{X}) = \prod_{i=1}^{m} p_{\theta_d}(y_i \mid Y_{0:i-1}, f_{\theta_e}(\hat{X}_{1:k})) \quad (4)$$

As described, CAST differs from the standard approach in two main aspects. First, it considers $T_{desc}$, which is inspired by prior work (Wang and Lillis 2020b) using a task description in the input example for a COVID-related event extraction task. In addition, CAST considers $E_{desc}$, making the model domain-aware when tested on different events.

In this section, we describe experimental details and report and discuss the results from different dimensions, aiming to comprehensively test the effectiveness of CAST and to share the insights from what we have learned. Since CAST is proposed for cross-domain adaptation between crises, we use datasets containing examples from different crisis events. The datasets used in our experiments are nepal_queensland and CrisisT6, described briefly as follows.

• nepal_queensland is a well-known benchmark dataset for cross-domain adaptation in this field, consisting of two crisis events, the Nepal Earthquake and the Queensland Floods, whose word distributions are depicted in the word clouds of Figures 2a and 2b. The samples of this dataset are tweets related to the two events, and each tweet is annotated with one of two well-balanced classes: relevant, implying the tweet is relevant to the event, and not_relevant, implying the opposite. We use the standard train, validation and test splits from Alam et al. (2018) in our experiments for parallel comparison.
• CrisisT6 was originally released by Olteanu, Castillo, et al. (2014). It is a collection of approximately 60,000 tweets posted during six crisis events, with approximately 10,000 per event. The six events are presented in Figure 2c.
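The fine-tuning objective in Equation 2 reduces to token-level cross entropy under teacher forcing: at each decoding step the loss is the negative log-probability the decoder assigns to the ground-truth token. A minimal numeric sketch (pure Python, toy probabilities; all names and values are illustrative, not the paper's implementation):

```python
import math

def seq2seq_nll(step_distributions, target_ids):
    """Negative log-likelihood of a target sequence under teacher forcing.

    step_distributions: one dict per decoding step i, mapping token id to
    p(y_i | Y_{0:i-1}, f(X)) as produced by the decoder (Equation 1).
    target_ids: the ground-truth target token ids y_1..y_m.
    """
    assert len(step_distributions) == len(target_ids)
    # Equation 2: sum over steps of -log p(y_i | ...)
    return -sum(math.log(dist[y])
                for dist, y in zip(step_distributions, target_ids))

# Toy example: a two-token target, e.g. "yes </s>", with ids 1 and 2.
dists = [
    {1: 0.9, 2: 0.05, 3: 0.05},  # step 1: model is confident in token 1
    {1: 0.1, 2: 0.8, 3: 0.1},    # step 2: model predicts the end token
]
loss = seq2seq_nll(dists, [1, 2])  # -(log 0.9 + log 0.8) ~= 0.3285
```

Minimising this quantity over a training set is exactly the objective used both for the standard fine-tuning setup and for CAST; only the input sequence construction differs.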
Considering that both datasets relate to a similar task (i.e., $\mathcal{T}$ is defined as a binary classification task), we unify their labels to the same target labels. In nepal_queensland, relevant is changed to yes and not_relevant is changed to no, and likewise for CrisisT6. Following this unification, the task description $T_{desc}$ becomes the same for the two datasets. Using these datasets, our experiments include two scenarios for training, as outlined in Figure 1. Before settling on postQ, we also experimented with different variants of CAST on CrisisT6. These are summarised as follows:

• Variant 1. This is similar to postQ except that we remove the {location_name} from $E_{desc}$, making the input location-agnostic to the model. The final input sequence is constructed as: "Content: {tweet_text}. Question: Is this message relevant to {crisis_name}?".
• Variant 2. This is similar to postQ except that we re-arrange {location_name} and {crisis_name} such that the input becomes: "Content: {tweet_text}. Question: Is this message relevant to a {crisis_name} event that occurred in {location_name}?".
• Variant 3. This variant constructs the input sequence by setting $T_{desc}$ to be empty, so that the input uses only the location and crisis name without the question text.

The experimentation on these variants and postQ did not present any noticeable difference in performance. It is interesting to note that there is no performance difference between Variant 3 and postQ, where Variant 3 simply uses the location and crisis name without including $T_{desc}$ in the extended text. This is because, given a specific classification task, $T_{desc}$ is the same for all training examples, thus making no difference to the model. However, we ultimately chose the variant with $T_{desc}$ (i.e., postQ) in our subsequent experiments, mainly because we expect to expand our approach to multi-task learning settings as future work, where $T_{desc}$ becomes different for different tasks.
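The input construction (the function $\zeta$ in Equation 3) and label unification described above can be sketched as follows. The template string here follows Variant 2's wording; the exact postQ template appears in Figure 1 and may differ slightly, so treat it as an assumption:

```python
# Unified target labels: relevant -> yes, not_relevant -> no (both datasets).
LABEL_MAP = {"relevant": "yes", "not_relevant": "no"}

def build_input(tweet_text, crisis_name, location_name, use_task_desc=True):
    """Construct a CAST-style input sequence (zeta in Equation 3).

    With use_task_desc=True this follows Variant 2's template; the exact
    postQ wording is shown in Figure 1 of the paper. With False it mimics
    Variant 3 (empty T_desc), whose exact template is not given, so this
    rendering is an assumption.
    """
    if use_task_desc:
        return (f"Content: {tweet_text}. Question: Is this message relevant "
                f"to a {crisis_name} event that occurred in {location_name}?")
    return f"Content: {tweet_text}. {crisis_name} {location_name}"

src = build_input("Floodwaters rising near the bridge, need evacuation",
                  "flood", "Queensland")
tgt = LABEL_MAP["relevant"]  # "yes"
```

Each training example then consists of the constructed source string and a one-word target ("yes" or "no"), fed to the seq2seq model exactly as in standard text-to-text fine-tuning.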
Our experiments are conducted to examine the performance of the above two scenarios in domain adaptation through fine-tuning seq2seq transformers on the two benchmark datasets. Among the existing seq2seq models (Wolf et al. 2020), we select T5 (Raffel et al. 2020) as the target model in our study due to the availability of multiple pre-trained weights and its strong performance in various downstream language tasks. Specifically, the off-the-shelf t5-small and t5-base weights, representing different model sizes, are used in our study, which we abbreviate to small and base [3]. In fine-tuning, we configure most of the hyper-parameters in keeping with related work (Wang and Lillis 2020b). We fine-tuned both the small and base models for 12 training epochs, as we saw no further improvement when training with more epochs [4]. The learning rate is set to 5e-05 using the Adam optimizer (Kingma and Ba 2015), updated by a linear decay scheduler with a warmup ratio of 10% of the total training steps. All our experimental runs are accelerated by a 6GB RTX2060 GPU, so we adopt memory-saving strategies including Mixed Precision Training (FP32 and FP16) (Micikevicius et al. 2018) and gradient accumulation (up to 4 steps) to ensure the effective training batch size is always 16. In addition, we set the maximum source and target sequence lengths to 128 and 10 respectively, since we are processing short crisis messages that do not exceed these lengths.

Having conducted extensive experiments with the two selected benchmark datasets, we report the results regarding both in-domain and cross-domain adaptation. To align with the metrics used in the literature, weighted F1 scores are reported for nepal_queensland (Alam et al. 2018) and accuracy scores are reported for CrisisT6. We report and discuss the experimental results from the following two perspectives, inspired by the intuition behind real-world crisis domain adaptation:
• One-to-one adaptation is when both the source events set $S$ and the target events set $T$ relate to a single crisis event. This helps answer a question like: "Among a number of source events whose training sets are available, which one is most suitable to be adapted to a new emerging target event for which training data is not yet available?".
• Many-to-one adaptation is similar to one-to-one adaptation except that the source events set $S$ can contain multiple events. It helps answer a question like: "Which combination of available source events is most suited to be adapted to an emerging target event?".

One-to-one Adaptation
Tables 1 and 2 present the in-domain and cross-domain performance respectively on the nepal_queensland dataset across different runs. Regarding in-domain performance, we compare our runs with the CNN run (Alam et al. 2018). It is found that our runs substantially outperform this run on both the NE and QF events. For cross-domain adaptation, we include CNN+DA+GE (Alam et al. 2018) and RNN+AE (X. Li and D. Caragea 2020), which use unlabeled target data in training, i.e., are target-data dependent (TDD). Apart from our target-data independent (TDI) runs, we also report CNN+DA and RNN, which, like our runs, do not use any target data. Comparing postQ with the standard approach (Table 1), we found that postQ performs essentially the same as the standard. This makes sense, since postQ differs from the standard only in appending an extra task and event description. For in-domain adaptation, the appended text is the same for all training examples during training and at inference time, thus making no difference to the model whether the extra text exists or not, which explains why they have the same level of performance. Interestingly, for cross-domain adaptation, the standard method can achieve comparable performance to our postQ when using a small model.
When using a bigger model (i.e., postQ-base), the standard does not maintain comparable performance and our postQ performs much better (see Table 2). This finding is further verified in our subsequent experiments on the CrisisT6 dataset. Based on the nepal_queensland dataset, we have identified some evidence of the effectiveness of our CAST-based runs (particularly for TDI cross-domain adaptation) as compared to the SOTA. However, one limitation of this benchmark dataset is that it only offers two crisis events, which limits its value in fully evaluating the effectiveness of our approach. Hence, we extend our experiments to the CrisisT6 dataset, which consists of six different crisis events representing a wide range of domains.

[3] t5-small and t5-base have around 60M and 220M parameters respectively. There are larger versions originally released by the authors, such as t5-large, t5-3B and t5-11B. We did not include these since they are too large to be handled by our training resources.
[4] Except that we fine-tuned base on nepal_queensland with 6 epochs.

In the adaptation matrices, the rows represent the source events while the columns stand for the target events. In addition to the adaptation matrices, we also add a correlation matrix to the right side of each adaptation matrix. The correlation matrices calculate the Pearson correlation between the rows of source events, which indicates how correlated two source events are in terms of their applicability to other target events. Examining the matrices on the left, we find that both the in-domain and cross-domain performance is consistent with nepal_queensland. For example, in in-domain adaptation, postQ achieves similar performance to the standard across different model sizes [5]. As for cross-domain adaptation, postQ slightly outperforms the standard when using the small model (see Figures 3a and 3c).
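The row-wise Pearson correlations described above can be computed as follows; the accuracy values here are toy numbers for illustration, not the paper's actual figures:

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length score vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

# Rows of a (toy) adaptation matrix: accuracy of each source event over the
# same ordered list of target events. Illustrative values only.
row_AF = [95.0, 90.1, 70.2, 80.5]
row_QF = [94.3, 89.3, 68.9, 79.8]

r = pearson(row_AF, row_QF)  # high r => the two sources adapt similarly
```

A correlation near 1 between two source rows means the two events are close substitutes as source domains, which is the sense in which AF and QF are described as highly correlated.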
For the base model, postQ substantially outperforms the standard in cross-domain adaptation (see Figures 3e and 3g). Since H. Li, Guevara, et al. (2015) conducted similar work on this dataset, we compare our postQ-base with their NB-S and NB-ST runs on 11 event pairs, as presented in Table 3. It shows that postQ-base substantially outperforms the target-data independent NB-S, which is consistent with the results we report for nepal_queensland. Moreover, postQ-base achieves strong overall performance across the 11 adaptation pairs even as compared to the target-data dependent NB-ST (90.02 versus 86.06 in average accuracy). Regarding cross-domain adaptation, we also noticed some interesting points. The results of H. Li, Guevara, et al. (2015) present some evidence that similar event pairs such as QF→AF and BB→WTE are more likely to bring better scores than dissimilar pairs like QF→BB and BB→AF (see Table 3). This evidence is reinforced in our study. We note that the Alberta Floods (AF) and Queensland Floods (QF) relate to the same type of crisis (flooding), albeit in different locations at different times. It is interesting that in our four runs these two events are reciprocal, indicating that either of them as the source event is well-suited to being adapted to the other as the target event. For example, the AF→QF adaptation always achieves accuracy around 95, and QF→AF adaptation achieves accuracy of 89.28 at worst (Figure 3c). Examining their correlation scores on the right, we find that they are not only reciprocal but also highly correlated, ranging from 0.91 to 0.96 (Figures 3h and 3b). This lends credence to the idea that similar event types have similar characteristics in terms of their applicability to cross-domain adaptation and could potentially be used interchangeably for a novel target event. Another event pair with some similar characteristics is the Boston Bombing (BB) and the West Texas Explosion (WTE).
These are perhaps less similar than the floods above, in that one was an intentionally planted explosive device whereas the other was a factory fire that later resulted in an explosion. In this situation, it can be seen that whereas BB can be successfully adapted to WTE, the reverse is not the case. This may be related to the observation that BB is itself a difficult target event to adapt to, as evidenced by the fact that cross-domain adaptation tends to be poorest in general when BB is the target. Surprisingly, we find that the Sandy Hurricane (SH) and Oklahoma Tornado (OT) datasets can not only be well adapted to AF and QF in most cases but also adapt well to WTE. This implies that there may be a certain degree of common linguistic features shared between the tornado/hurricane events and the explosion event. These findings indicate that the more similar a source event is to a target event, the more likely it is to exhibit better adaptation performance for that target event.

This naturally raises a question: does combining multiple similar source events add further benefit (many-to-one adaptation)? Considering the variety of crisis events and training efficiency, our many-to-one experimental runs are based on the CrisisT6 dataset trained with the small model. The first experiment we conduct is leave-one-out cross-domain adaptation, where we choose one crisis as the target domain and the union of the others as the source domain. Table 4 presents the results of fine-tuning using both the standard and postQ approaches. Compared with the work by H. Li, X. Li, et al. (2018), our CAST-based postQ run achieves 91.03 versus their 89.6 in average accuracy. In addition, this table shows that there is no substantial difference between standard and postQ when AF, QF and OT are left out (they already achieve high accuracy). However, when leaving out SH and BB, standard performance is substantially lower than postQ.
This experiment indicates that, even with the small model, postQ outperforms the standard approach when considering multiple events as the source domain. This is further justified by our next experiment, which tests the effect of enriching the source domain by combining multiple events. For this purpose, we select two event pairs: (QF, AF) and (BB, WTE). As indicated above, each pair contains two similar event types, while the two pairs themselves are dissimilar. The decision to choose these two pairs as similar event pairs is guided by existing work (H. Li, Guevara, et al. 2015) and the correlation scores discussed above. Table 5 presents the results of combinations of multiple source events adapted to the two pairs. First, we see from the figures that postQ overall outperforms the standard in most situations (accuracy is much higher when BB and WTE are the target events and is at least at the same level for QF and AF). However, the more interesting observation from this experiment is that simply increasing the number of source events does not guarantee benefits to the adaptation performance; it depends on which source events are added. For example, when QF is the target event, we see only a trivial difference between AF-to-QF and leaving QF out (thus combining all other crises as the source domain). A similar pattern is observed when AF is the target event. Table 5a also indicates that adding BB+WTE seems not to add any benefit to the performance (indeed this reduces performance when compared with a source domain of SH+OT), whereas SH+OT can help to a degree (adding SH+OT to QF results in improved cross-domain performance when AF is the target). Table 5b demonstrates a similar outcome. When BB and WTE are the target events, AF+QF contributes little to the adaptation performance. Surprisingly, SH+OT not only helps QF and AF but also helps BB and WTE, which coincides with the one-to-one results reported in the previous section.
Hence, as a recommendation to maximise adaptation performance for a target event (e.g., AF), it is good to combine its similar events (e.g., QF+SH+OT → AF) as the source domain and exclude dissimilar events (e.g., BB+WTE) [6].

In this work, we propose CAST, a sequence-to-sequence (seq2seq) transformer-based approach for domain adaptation between crisis events. To test the effectiveness of CAST, we conduct extensive experiments on two benchmark crisis-related message classification datasets. In the one-to-one adaptation setting, CAST is demonstrated to be effective in cross-domain adaptation, outperforming the state of the art without using any target data, and its advantage over the standard approach is more pronounced when training with a bigger model. In the many-to-one adaptation setting with a small model, CAST adds substantial improvements to cross-domain adaptation performance compared to the standard method. Interestingly, our results indicate that for cross-domain adaptation there is merit in choosing a source domain with similar characteristics (i.e. fine-tuning based on a similar type of crisis). If multiple similar events are available, these can be combined to form a larger source dataset to improve adaptation performance. Dissimilar events may harm classification performance, however.

Regarding future work, so far our method has only been tested on binary relevance classification tasks for crisis messages. We intend to test our method on a wider range of tasks such as eyewitness identification (Zahra et al. 2020) and more fine-grained information type classification (Olteanu, Vieweg, et al. 2015; McCreadie et al. 2019). In addition, as our method includes a task description for model training, we also intend to extend our approach to domain adaptation in a multi-task learning setting, extending our work in Wang and Lillis (2021).
[6] When dissimilar events exist (see the last two rows of Tables 5a and 5b), it is suggested to exclude them, which can help reduce the training size and thus improve training efficiency.

References
• Domain Adaptation with Adversarial Training and Graph Embeddings
• Classifying text messages for the Haiti earthquake
• BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
• Processing social media messages in mass emergency: A survey
• Cross-language domain adaptation for classifying crisis-related short messages
• Convolutional Neural Networks for Sentence Classification
• Adam: A Method for Stochastic Optimization
• Disaster response aided by tweet classification with a domain adaptation approach
• Twitter Mining for Disaster Response: A Domain Adaptation Approach
• Comparison of word embeddings and sentence encodings as generalized representations for crisis tweet classification tasks
• Domain Adaptation with Reconstruction for Disaster Tweet Classification
• CrisisBERT: Robust Transformer for Crisis Classification and Contextual Crisis Embedding
• TREC incident streams: Finding actionable information on social media
• Mixed Precision Training
• Robust classification of crisis-related data on social networks using convolutional neural networks
• CrisisLex: A lexicon for collecting and filtering microblogged communications in crises
• What to expect when the unexpected happens: Social media communications across crises
• Exploring the limits of transfer learning with a unified text-to-text transformer
• Attention is all you need
• Classification for Crisis-Related Tweets Leveraging Word Embeddings and Data Augmentation
• UCD-CS at W-NUT 2020 Shared Task-3: A Text to Text Approach for COVID-19 Event Extraction on Social Media
• Multi-Task Transfer Learning for Finding Actionable Information from Crisis-Related Messages on Social Media
• Transformers: State-of-the-Art Natural Language Processing
• Automatic identification of eyewitness messages on twitter during disasters

CoRe Paper - Social Media for Disaster Response and Resilience, Proceedings of the 18th ISCRAM Conference. This work was supported by the Enterprise Ireland and European Union Career-FIT programme under the Marie Sklodowska-Curie grant agreement No. 713654.