title: On detecting urgency in short crisis messages using minimal supervision and transfer learning
authors: Kejriwal, Mayank; Zhou, Peilin
date: 2020-07-08
journal: Soc Netw Anal Min
DOI: 10.1007/s13278-020-00670-7

Abstract Humanitarian disasters have been on the rise in recent years due to the effects of climate change and socio-political situations such as the refugee crisis. Technology can be used to best mobilize resources such as food and water in the event of a natural disaster, by semi-automatically flagging tweets and short messages as indicating an urgent need. The problem is challenging not just because of the sparseness of data in the immediate aftermath of a disaster, but also because of the varying characteristics of disasters in developing countries (making it difficult to train just one system) and the noise and quirks of social media. In this paper, we present a robust, low-supervision social media urgency system that adapts to arbitrary crises by leveraging both labeled and unlabeled data in an ensemble setting. The system is also able to adapt to new crises for which an unlabeled background corpus may not yet be available, by utilizing a simple and effective transfer learning methodology. Experimentally, our transfer learning and low-supervision approaches are found to outperform viable baselines with high significance on myriad disaster datasets.

1 Introduction

The United Nations Office for the Coordination of Humanitarian Affairs (OCHA) reported that in 2018, more than 141 million people were in need of humanitarian assistance, with over 9 billion dollars of unmet requirements. Using technology to address this shortfall, by assisting aid agencies and first responders in mobilizing and sending resources where they are needed the most, is an important problem with the potential for widespread, long-lasting social impact (Palen and Anderson 2016; Sakaki et al. 2010). To achieve this goal, the problem of semi-automatic urgency detection needs to be solved, especially on short message streams like social media that support real-time news feeds and micro-updates from citizens on the ground. We define an urgent message in the crisis context as one that expresses an actionable need that must be resolved in a short time frame.

Urgency detection is related to the problem of detecting relevant or informative tweets from a stream of tweets, not all of which are pertinent to the crisis at hand. Urgency detection may be understood as a very specific version of the relevance detection problem. Similar to the latter, urgency detection falls into a class of information retrieval (IR) problems, which attempt to detect and rank relevant messages and documents. However, there is an added dimension to urgency detection, since (as defined above) an actionable need, possibly implied, must be expressed in the tweet that could potentially be resolved if dealt with in a time-sensitive manner. For example, a message such as 'Roof collapse in building on Main Street; multiple people trapped inside' may be deemed urgent; however, messages such as 'Roof collapse due to storm at midnight; all people successfully evacuated' and 'Avalanche in Nepal caused four deaths' are relevant and may assist in studying the disaster further (or even mobilizing long-term response) but are not particularly urgent, either because they do not require immediate action or because the damage has already occurred.
Informativeness as a broad problem has undergone some study (Olteanu et al. 2014) (see also Sect. 2), but to our knowledge, urgency as a specific IR area has not received the same kind of special attention despite its utility to first responders in times of crisis. Intuitively, solutions to the urgency detection problem can be framed in terms of probabilistic binary classification, a common machine learning paradigm that also underlies related tasks like sentiment analysis (Pang et al. 2008). Although urgency detection has some similarity with sentiment analysis (both are subjective to a degree, since annotators can, and do, sometimes disagree), the core problem is different, since the goal is to flag messages that express urgency, which is almost always a negative or panic-ridden emotion. However, it can be difficult to distinguish urgency-related tweets from merely negative tweets. We provide an illustrative set of real-world examples in Table 1.

In this paper, we present practical approaches for crisis-specific, minimally supervised urgency detection on short message streams such as Twitter. The presented approaches cover two scenarios that often emerge in the real world. In the first scenario, a small amount (a few hundred messages) of training data labeled as urgent or non-urgent is available, along with a copious 'unlabeled' background corpus. In the second scenario, similar data are available for a 'source' domain but not for the target domain (expressing a 'new crisis') for which urgency detection needs to be deployed. In other words, as messages are streaming in for this new domain, investigators label a few samples, but cannot rely on the availability of a background corpus, since urgency needs to be tagged in real time, before the crisis has fully subsided. To accomplish this challenging goal, our approach relies on a simple and robust transfer learning methodology (Pan and Yang 2010). Experimental results on three real-world datasets and several performance metrics validate our methods. To the best of our knowledge, this is the first paper investigating the problem of urgency detection in social media, both algorithmically and empirically, for arbitrary disasters in low-supervision and transfer learning settings.

The rest of this paper is structured as follows. Section 2 describes some related work, Sect. 3 specifies our two research questions, and Sect. 4 describes our approaches in support of answering those questions. Section 5 covers the experiments, and Sect. 6 concludes the paper.

2 Related work

Crisis informatics is emerging as an important field for both data scientists and policy analysts. A good introduction to the field was provided in a recent Science policy forum article (Palen and Anderson 2016). The field draws on interdisciplinary strands of research, especially with respect to collecting, processing and analyzing real-world data. In particular, social media platforms like Twitter have emerged as important channels ('social sensors', Sakaki et al. 2010) for situational awareness in support of crisis informatics. Although situational awareness is a broad notion extending beyond crisis informatics (e.g., military situational awareness), urgency detection is a special kind of situational awareness that tends to arise mainly in the crisis domain. A direct application is to help first responders and aid agencies assess needs in crisis-stricken areas and mobilize resources effectively (i.e., where needs are most urgent).
While the initial primary focus of situational awareness and sensing systems was on earthquakes (Avvenuti et al. 2014; Crooks et al. 2013), the focus has diversified in recent years to disasters as diverse as floods, fires and hurricanes (Arthur et al. 2017; Vieweg et al. 2010). We note that Twitter is by far the most monitored social media platform during crises (Simon et al. 2015), due to the availability of the published data and its real-time nature. Increasingly sophisticated approaches have been presented for data collection, including dynamic lexicons (Olteanu et al. 2014) and analysis tools like TweetTracker (Kumar et al. 2011).

In the last few years, and even just the last few weeks (in the wake of the COVID-19 crisis), a number of important works in network science have addressed crises. We only cite a few recent papers by way of reference. Recently, Purohit et al. (2020) described a method to rank and group social media requests for emergency services, a work that is particularly relevant since the outbreak of COVID-19. Recent work in opinion mining (e.g., see Keyvanpour et al. 2020), especially using lexicons and machine learning in social media, is also relevant to our work. Another extremely relevant work is a recent article that described a lightweight and multilingual framework for crisis information extraction from Twitter data (Interdonato et al. 2019). The research presented in that paper, though not resolving the problem of detecting urgent tweets, is compatible with our own work, since it presents a relatively unsupervised and lightweight paradigm and uses similar metrics. Other papers have looked at specific crises, e.g., the work by Ladner et al. (2019) in analyzing tweets to determine the activeness of the Syrian refugee crisis. Another article performed disaster damage assessment from Twitter data using statistical features and 'informative words,' not dissimilar to our own lexicon-based approach (Madichetty and Sridevi 2019). A last example is the work in (Klein et al. 2012), which describes a project called SABESS that uses social network analysis to identify reliable tweets and applies content analysis to summarize important 'emergency facts.' These examples are a small sample of the many works that have tried to use social media productively to analyze, or provide actionable intelligence during, a crisis situation, attesting to the ongoing importance of the problem.

NLP methods have been widely used in extracting situational awareness from Twitter, e.g., see the work by Verma et al. (2011). Another important line of work analyzes events other than natural disasters (such as mass convergence and disruption events) that are still relevant to crisis informatics. For example, Starbird et al. presented a collaborative filtering system for identifying on-the-ground 'Twitterers' during mass disruptions (Starbird et al. 2012). Similar techniques could be employed to supplement the work in this paper. In a similar vein, the CrisisTracker system (Rogstadius et al. 2013) is another example of a system that uses crowdsourced social media curation for disaster awareness. The system does not specifically address the urgency detection problem, however. AIDR is a system that is more closely aligned with the goal of using AI for better disaster response (Imran et al. 2014), but its goal is to classify messages into a set of user-defined categories of information such as 'needs' and 'damages.'
In contrast, we consider needs at a higher level of classification; namely, is a message urgent or non-urgent? The outputs of AIDR are compatible with our own, since both systems provide actionable information to first responders. Another important crowdsourcing tool that has been especially useful in working with SMS messages is Ushahidi, a grassroots project that started in Kenya and was used initially to encourage Kenyans to report incidents (especially acts of violence) that they had witnessed. The website was very successful, and the model has since been replicated in other countries. Just like the other systems considered in this section, we believe Ushahidi's goals and technology are compatible with the capabilities presented herein.

More generally, projects like CrisisLex, Crisis Computing and EPIC (Empowering the Public with Information in Crisis) have emerged as major efforts in the crisis informatics space for two reasons: first, the abundance and fine granularity of social media data imply that mining such data during crises can lead to robust, real-time responses; second, there is a recognition that any technology thus developed must also address the inherent challenges (including problems of noise, scale and irrelevance) of working with such datasets. CrisisLex provides a repository of crisis-related social media data and tools, including collections of crisis data and lexicons of crisis terms (Olteanu et al. 2014). It also includes tools to help users create their own collections and lexicons. In contrast, Project EPIC, launched in 2009 and supported by a US National Science Foundation grant, is a multi-disciplinary effort involving several universities and languages, with the goal of utilizing behavioral and technical knowledge of computer-mediated communication for better crisis study and emergency response. Since its founding, Project EPIC has led to several advances in the crisis informatics space (see, for example, Palen et al. 2015; Kogan et al. 2015; Anderson et al. 2013; Soden et al. 2014). The work presented in this article is intended to be compatible with these efforts, although we are addressing a specific problem that was not addressed by any of the works cited above. We have released our model openly, and this released model could potentially be integrated into some of the platforms described above. Crowdsourcing could be used in lieu of (or even in addition to) the active learning framework presented as one of the solutions to the low-supervision challenge described later in this article. It could also be used to provide more confidence in the annotations, since there is an inherent element of subjectivity when one is labeling a tweet as 'urgent.' Note that most labeling problems in machine learning involve some subjectivity, and inter-annotator agreement has been found to be a concern in some cases. Whether such concerns arise in the case of urgency detection does not fall within the scope of the presented work, but could be a valuable issue to address in future research.

Other lines of work relevant to this paper involve minimally supervised machine learning, representation learning and transfer learning. Concerning minimally supervised machine learning (ML), in general, ML techniques where there are few, or in the case of zero-shot learning (Palatucci et al. 2009; Romera-Paredes and Torr 2015) no, observed instances for a label have been a popular research agenda for many years (Uszkoreit et al.
2009; Aggarwal and Zhai 2012). In addition to weak supervision approaches (Aggarwal and Zhai 2012), both semi-supervised learning and active learning have been studied in great depth, with surveys provided by (Zhu 2005; Settles 2010). However, to the best of our knowledge, a successful systems-level conjunction of various minimally supervised ML techniques has not been achieved for the task of short-text urgency detection. Such an empirical assessment is an important goal of this paper.

With the current renaissance of neural networks (Sahlgren 2005), embedding and representation learning methods have become more popular owing to the advent of fast and effective models like skip-gram. Recent work has used such embeddings in numerous NLP and graph-theoretic applications (Collobert et al. 2011), including information extraction (Kejriwal and Szekely 2017), named entity recognition (Nadeau and Sekine 2007) and entity linking (Moro et al. 2014). The most well-known example is word2vec (for words) (Mikolov et al. 2013), followed by similar models like paragraph2vec (for multi-word text) and fastText (Dai et al. 2015; Joulin et al. 2016), the last two being most relevant for the work in this paper. For a recent evaluation study on representation learning for text, including potential problems, we refer the reader to (Faruqui et al. 2016).

Finally, transfer learning is a central agenda in this paper; an excellent survey of dominant techniques may be found in (Pan and Yang 2010). More recent work has addressed domain adaptation, with the work in (Pedrood and Purohit 2018) applied specifically to the disaster response problem; Pedrood and Purohit (2018) applied transfer learning to the problem of mining help intent on Twitter. Other relevant work in crisis informatics, both in terms of defining 'actionable information' problems like urgency and need mining, and of providing multimodal Twitter datasets from natural disasters, may be found in (He et al. 2017; Purohit et al. 2018). Caragea et al., for example, present an approach to identifying informative messages in crises by using CNNs (Caragea et al. 2016). Other similar works include Burel et al. (2017), Burel and Alani (2018) and Nguyen et al. (2016, 2017). An important difference between the class of papers cited and our own work is that we are not seeking to detect events in crisis situations, but are instead trying to assign urgency scores to sub-events that are happening in the aftermath of a disaster. The two problems are related in that better accuracy on event detection (for which these deep learning systems could be used to great effect) would lead to better identification of urgent events. However, urgency detection is a difficult problem in and of itself, beyond the broader problem of isolating informative events related to the disaster from an incoming stream of messages. Some of the work above is multimodal (e.g., the paper by Nguyen et al. (2017)), which would be an interesting direction for future research on urgency detection (from images and videos, rather than just text). An alternate way of looking at the problem is as an 'event detection' problem; e.g., Zheng et al. (2017) study semi-supervised event-related tweet identification, which also tries to identify urgent tweets related to earthquakes and floods. These works are complementary to the minimally supervised, low-resource setting in this paper.
Finally, we note that there has been some very recent work on few-shot models that use little to no training data and are similar to this paper in that regard (e.g., Kruspe et al. 2019). However, there are significant differences from our own work. For example, while one such work considers a specific disaster situation (flood risk assessment in a particular city in India), Kruspe et al. (2019) consider the earlier problem of detecting tweets that are relevant to the crisis itself, rather than the problem of assigning an urgency score to events that are detected. In general, we are not aware of a few-shot or minimally supervised technique that tackles urgency detection for the purposes of triage.

3 Research questions

We briefly enumerate below the research questions under consideration in this paper. While the first question captures the classical low-supervision setting, the second question introduces an element of transfer learning.

1. Low-supervision Urgency Detection: How do we build an urgency detection system for a specific crisis when given as training input both a small number of manually labeled tweets and a large number of unlabeled tweets (background corpus) for that crisis?

2. Low-supervision Transfer Learning for Urgency Detection: How do we build an urgency detection system for a specific crisis when given as training input a small number of manually labeled tweets for that crisis, as well as 'auxiliary' training input of (a small number of) manually labeled tweets and unlabeled background tweets from a different crisis?

Unlike the first scenario, the second scenario applies to a very short period (hours, or even minutes) after the crisis has struck; this is why a background corpus is not available (yet) for that crisis. Instead, only the few manually labeled messages that have been acquired up to that point are available.

4 Approach

The approach for addressing the first research question is schematized in Fig. 1.

Fig. 1 Training workflow for urgency detection

The first step in the workflow involves data preprocessing of the corpus. We follow a standard set of preprocessing steps. First, we apply a tokenizer to split the sentences into lists of words and delete words with special prefixes (including @ and RT, which are particularly prevalent on Twitter) and special suffixes. We also remove non-alphanumeric characters and convert the entire sentence to lowercase. Next, similar to traditional machine learning pipelines, we extract a set of manual features for expressing prior human knowledge about urgency detection. Our manual features are thus called because they are primarily keyword-based and binary, with keywords selected based on data exploration and domain knowledge. We consider ten such keywords, namely hit, help, kill, injure, strand, miss, urgent, die, need, food. If any of these keywords is present, the corresponding feature is set to 1. Note that these keywords are associated with situations that are generally urgent, like people who have been attacked or affected by a crisis and need urgent help, but some are noisier than others. Additionally, we utilize an eleventh feature that checks whether any numeric digits are present in the message. The rationale behind this feature is that numbers are often present in more urgent tweets, e.g., '15 climbers are currently trapped on Everest due to the avalanche.' In the Experiments section, we show that the manual features alone are not adequate for addressing low-supervision urgency detection.
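To make the preprocessing and manual feature extraction concrete, the following is a minimal Python sketch of the 11-dimensional manual feature vector described above. It is an illustrative sketch rather than the released implementation; the helper names and the use of prefix (stem) matching for keywords like 'injure' and 'strand' are assumptions.

```python
import re

URGENCY_KEYWORDS = ["hit", "help", "kill", "injure", "strand",
                    "miss", "urgent", "die", "need", "food"]

def preprocess(message):
    """Tokenize, drop tokens with special prefixes such as @ and RT,
    strip non-alphanumeric characters and lowercase."""
    tokens = message.split()
    tokens = [t for t in tokens if not (t.startswith("@") or t == "RT")]
    tokens = [re.sub(r"[^0-9a-zA-Z]", "", t).lower() for t in tokens]
    return [t for t in tokens if t]

def manual_features(tokens):
    """Ten binary keyword features plus one binary 'contains a digit' feature."""
    keyword_feats = [int(any(t.startswith(k) for t in tokens)) for k in URGENCY_KEYWORDS]
    digit_feat = int(any(ch.isdigit() for t in tokens for ch in t))
    return keyword_feats + [digit_feat]

# Example:
# manual_features(preprocess("15 climbers are currently trapped on Everest due to the avalanche"))
```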
In addition, it is prudent to utilize the large number of unlabeled tweets (background corpus) if it serves a useful purpose in improving performance. To that end, we train a skip-gram-based word embedding model built on the 'bag of tricks' model released by researchers from Facebook in a package called fastText (Joulin et al. 2016). The reason for using fastText, as opposed to alternate word embedding models like GloVe and word2vec (Mikolov et al. 2013), is threefold. First, fastText is very fast, easy to execute and well maintained. Second, preliminary analyses showed that it does quite well on social media tasks, and because of the bag-of-tricks methodology (which uses character and sub-word embeddings to deal gracefully with out-of-vocabulary words and misspellings), it is able to generalize much better. Finally, fastText's APIs include a way to get sentence embeddings directly after training the word embedding model. By training fastText on the background corpus, we are able to obtain a robust embedding model. In both the training and test phases, we use this model to get feature vectors for our messages, in addition to the 11-dimensional manual feature vector described earlier. However, given that the background corpus might not be as extensive or representative as a 'general' corpus like Wikipedia, we try to smooth the feature space by also using a publicly available pre-trained embedding model trained over the English Wikipedia corpus. The vectors obtained from this model have 300 dimensions and were trained using skip-gram with default parameters.

As Fig. 1 illustrates, we use all of these feature sets to build an ensemble by combining local embedding features, manual features and Wikipedia pre-trained word embedding features. The final score of the ensemble model is obtained by weighting the scores of the three linear regression models (one for each feature set), with the weights summing to 1. The weights are set using a held-out validation set. When the urgency of a new 'test' message needs to be determined, we preprocess the message, extract all three feature sets and get the weighted score from the three regression models. If the score falls above a pre-determined threshold (again, determined through validation), the message is flagged as urgent; otherwise, it is not.
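A minimal sketch of this three-model weighted ensemble is given below, assuming feature matrices for the local fastText embeddings, manual features and Wikipedia embeddings have already been extracted. scikit-learn's LogisticRegression is used here only as a stand-in linear scorer that exposes a probability in [0, 1] (the paper refers to the per-feature-set models as linear regression models), and the example weights are placeholders to be set on a held-out validation split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ensemble(X_local, X_manual, X_wiki, y):
    """Train one linear classifier per feature set."""
    return [LogisticRegression(max_iter=1000).fit(X, y)
            for X in (X_local, X_manual, X_wiki)]

def ensemble_score(models, x_local, x_manual, x_wiki, weights=(0.4, 0.3, 0.3)):
    """Weighted score of the three models; the weights sum to 1."""
    feats = (x_local, x_manual, x_wiki)
    scores = [m.predict_proba(np.asarray(x).reshape(1, -1))[0, 1]
              for m, x in zip(models, feats)]
    return float(np.dot(weights, scores))

# A message is flagged as urgent if ensemble_score(...) exceeds a threshold
# that is itself chosen on the validation split.
```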
We now describe our approach for 'urgency detection transfer,' whereby a source dataset is given (similar to RQ1, where both an unlabeled background corpus and a small manually labeled training set are available) along with a target dataset (only a small manually labeled training set and no background corpus), representing the crisis under investigation. Our approach for urgency transfer is captured in Algorithm 1. Many of the steps are similar to those for RQ1, including preprocessing, but there are some important differences; hence, we use pseudocode to express the workflow more precisely. For example, while the Wiki embedding model remains the same as earlier, the manual features are obviously extracted over the target domain (since they do not require a background corpus) and, importantly, the 'local' embedding model is now trained over the source domain corpus, since there is no target domain unlabeled background corpus available. To 'sync' the source and target domains, we consider a simple, but empirically effective, approach. Rather than use just the labeled target domain data for training the three linear regression models, we combine the labeled training data from both the source and target domains, but the target training data are up-sampled to allow their properties to emerge more concretely in the training. The up-sampling margin is a parameter in Algorithm 1; in practice, a factor of 6 (meaning the target labeled dataset is up-sampled by 6x) has been found to work well. To maximize training dataset utility, we do not use a validation set for classifier weight optimization, but take the average of all three classifiers' scores as the final score.

Algorithm 1: Urgency detection transfer
1. Train word embedding model W_s on the text in D_su ∪ D_sl;
2. Up-sample D_t by factor u and 'mix' with D_sl to get the expanded training set D_train = D_tu ∪ D_sl;
3. Extract the manual feature set F_m, the source embedding feature set F_s (using W_s), and the Wiki feature set F_w (using W_w) from each message in D_train;
4. Train linear regression models C_s, C_m and C_w on F_s, F_m and F_w, respectively;
5. Return the final classifier model C = avg_score(C_s, C_m, C_w).
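The following is a minimal Python sketch of Algorithm 1, not the released implementation. The feature extractors (extract_manual, extract_source_emb, extract_wiki_emb) and the up-sampling factor u = 6 follow the description above; LogisticRegression again stands in for the per-feature-set linear models, and the final score is the unweighted average of the three classifiers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_transfer(source_msgs, source_labels, target_msgs, target_labels,
                   extract_manual, extract_source_emb, extract_wiki_emb, u=6):
    # Step 2: up-sample the labeled target data by factor u and mix with the source data.
    msgs = list(source_msgs) + list(target_msgs) * u
    labels = list(source_labels) + list(target_labels) * u

    # Step 3: extract the three feature sets for every training message.
    F_m = np.array([extract_manual(m) for m in msgs])
    F_s = np.array([extract_source_emb(m) for m in msgs])  # uses W_s (source corpus)
    F_w = np.array([extract_wiki_emb(m) for m in msgs])    # uses W_w (Wikipedia)

    # Step 4: one linear classifier per feature set.
    C_m = LogisticRegression(max_iter=1000).fit(F_m, labels)
    C_s = LogisticRegression(max_iter=1000).fit(F_s, labels)
    C_w = LogisticRegression(max_iter=1000).fit(F_w, labels)

    # Step 5: the final score averages the three classifiers' positive-class probabilities.
    def score(message):
        parts = [
            C_m.predict_proba(np.asarray(extract_manual(message)).reshape(1, -1))[0, 1],
            C_s.predict_proba(np.asarray(extract_source_emb(message)).reshape(1, -1))[0, 1],
            C_w.predict_proba(np.asarray(extract_wiki_emb(message)).reshape(1, -1))[0, 1],
        ]
        return float(np.mean(parts))

    return score
```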
5 Experiments

For evaluating the approaches laid out in Sect. 4, we consider three real-world datasets described in Table 2. Two of the datasets (Nepal and Macedonia) were made available to us through the DARPA LORELEI program, under which this project is funded. The Nepal dataset comprises a collection of tweets collected in the aftermath of the 2015 Nepal earthquake (also called the Gorkha earthquake), while Macedonia was not an actual disaster but a realistic live-action simulation (of a disaster) conducted in Macedonia toward the end of 2018. Macedonia does not have much noise and is 'information-dense,' but small. As such, it provides a good test of the transfer learning abilities of the approach presented. Kerala describes tweets in the aftermath of the Kerala floods in South India in 2018 and is the largest dataset, with many relevant and irrelevant tweets. We note that these datasets were collected independently by an external participant in the program and made available to all performers in the program for research.

Originally, all the raw messages for the datasets described in Table 2 were unlabeled, in that their urgency status was unknown. Since the Macedonia dataset only contains 205 messages and is a small but information-dense dataset, we labeled all messages in Macedonia as urgent or non-urgent (hence, there are no unlabeled messages in Macedonia as given in Table 2). For the two other Twitter-based datasets, we used active learning to compose a labeled set that would contain challenging examples. The basic process was to do data preprocessing as described in Sect. 4, followed by training the local fastText-based word embedding model on all messages in the corpus. Next, we randomly labeled 50 urgent and non-urgent tweets and fed them into a classifier. The classifier was applied to the rest of the unlabeled data to obtain 'ambiguous' examples (those where the classifier's probability of the positive label was closest to 50%). We labeled another 100 samples this way and continued to re-train and apply the classifier for two more iterations, until we obtained a total of 400 labeled points. Note that the final labeled dataset may not be balanced in terms of urgent and non-urgent messages. Table 2 shows that Nepal is roughly balanced, while Kerala is imbalanced. We therefore used stratified sampling to split the labeled pool into training and testing datasets for evaluating the two research questions. We used 90% for training and 10% for testing. While the data cannot be made publicly available due to privacy concerns, we have released both the trained models and instructions for how to re-train the model on novel datasets as a Docker container.

We consider four standard metrics, namely accuracy, precision, recall and F-measure. Accuracy is simply the ratio of correctly labeled messages to the size of the test set, precision is the ratio of true positives to the sum of true positives and false positives, recall is the ratio of true positives to the sum of true positives and false negatives, and finally, F-measure is the harmonic mean of precision and recall and captures their trade-off.

Datasets for investigating RQ1 include Nepal and Kerala, since Macedonia does not have a large unlabeled corpus available, which is an assumption made per RQ1. Recall that we used stratified random sampling to split the labeled data for each dataset into training (90%) and test (10%) sets. Of the 90% training set, a further split was done, with 90% kept for 'training' and 10% for setting optimal weights for the three linear classifiers trained in Sect. 4. To account for the effects of randomness, each experiment was conducted across ten trials, with averages reported on all four metrics described previously for all baselines described below and for our approach. Among the different machine learning classifiers tested from the sklearn package, linear regression was found to work well and was used as the classifier of choice where applicable. We use six baselines to evaluate the approach for RQ1 described in Sect. 4. Note that statistical significance is tested using the one-sided Student's paired t-test by comparing the best system (on each metric) against the Local baseline, which is a reasonable choice since, in a high-supervision (or even normal-supervision) setting, this baseline has been found to perform quite well. Significance at the 90% level is indicated with a *, at the 95% level with a ** and at the 99% level with a *** (Table 3).

The protocol for investigating RQ2 is similar to the one for RQ1. We consider three baselines besides our own approach:

Target-only Local (Target Local): This baseline is essentially the Wiki-Manual baseline described in the previous section, trained on the target dataset (i.e., no transfer learning is used, and no source is assumed). This baseline is used to illustrate the benefits of transfer learning, since it sets the minimum benchmark that has to be bested by a transfer learning baseline.

Locally Supervised with Source Embedding (Embedding Transform): Similar to our approach on RQ1, manual features, source embeddings and pre-trained Wikipedia embeddings are used to train three classifiers (but on the labeled target domain), whose probabilities are averaged as the final result. While the local embeddings are trained on the source domain (since unlabeled data are not available for the target domain), all classifier training is always done on the target.

Locally Supervised with Up-sampling and Source Embedding (Upsample): This baseline is the same as Embedding Transform, except that, to boost the power of the baseline, we up-sample the labeled data (in the target dataset) by 6x. Thus, this baseline tries to mitigate source bias and concept drift by giving more importance to the transfer domain. This baseline is also more appropriate for the case where the target training data are extremely limited.
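Before turning to the results, the evaluation protocol above can be summarized in a short sketch: a stratified 90/10 split, the four metrics, and a one-sided paired t-test across trials. The split ratios and metric definitions follow the paper; the helper names and the use of scipy/scikit-learn here are illustrative assumptions.

```python
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def stratified_split(X, y, seed=0):
    # 90% training / 10% testing, stratified on the urgency label.
    return train_test_split(X, y, test_size=0.1, stratify=y, random_state=seed)

def evaluate(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f_measure": f1_score(y_true, y_pred),
    }

def one_sided_paired_ttest(system_scores, baseline_scores):
    # scipy's ttest_rel is two-sided; halve the p-value when the difference is in
    # the hypothesized (system > baseline) direction.
    t, p = stats.ttest_rel(system_scores, baseline_scores)
    return t, (p / 2 if t > 0 else 1 - p / 2)
```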
Table 4 illustrates the results for RQ1 on the Nepal and Kerala datasets. The results illustrate the viability of urgency detection in low-supervision settings (with our approach yielding a 69.44% F-measure on Nepal, at 99% significance compared to the Local baseline), with different feature sets contributing differently to the four metrics. While the local embedding model can reduce precision, for example, it can help the system improve accuracy and recall. Similarly, manual features reduce recall, but help the system improve accuracy and precision (sometimes considerably). To truly address the urgency problem, therefore, a multipronged ensemble approach is justified, as also argued intuitively in Sect. 4. We also note that the pre-trained Wikipedia embedding model proved to be an important tool in improving the generalization ability of the model while not requiring any labeled or unlabeled data; in essence, it serves as a free resource that can help regularize and stabilize models that would otherwise be uncertain in low-supervision settings.

Concerning the transfer learning experiments (RQ2), we note that the source domain embedding model can improve the performance of the target model, and up-sampling has a generally positive effect (Tables 5, 6, 7, 8). As expected, transfer learning performance (RQ2) is generally lower compared to low-supervision urgency detection on a single dataset (RQ1). Note that at least one of the transfer learning methods always bests the Local baseline on all metrics (except precision in Table 7, a result not found to be significant even at the 90% level). Our approach shows a slight improvement over the up-sampling baseline on two of the four scenarios (Tables 5, 7), by 2-2.7% on the F-measure metric, which shows the diminishing returns from mixing source and target labeled training data. Further improving performance by high margins will require a radically new approach, which we leave for future work.

6 Conclusion

This paper presented minimally supervised urgency detection approaches for short texts (such as tweets) in the aftermath of an arbitrary humanitarian crisis such as the 2015 Nepal earthquake. The presented systems covered two scenarios that often emerge in the real world. In the first scenario, a small amount (a few hundred messages) of training data labeled as urgent or non-urgent is available, along with a copious background corpus. In the second scenario, similar data are available for a 'source' domain but not for the target domain (expressing a 'new crisis') for which urgency detection needs to be deployed. As messages are streaming in for this new domain, investigators label a few samples, but cannot rely on the availability of a background corpus, since urgency needs to be tagged in real time, before the crisis has fully subsided. To accomplish this challenging goal, our approach relies on a simple but robust transfer learning methodology. Experimental results on three real-world datasets validate our methods. Some obvious avenues for future work are to improve the existing approach incrementally by (for example) adding more manual features and using a more sophisticated local embedding model, possibly with more advanced tuning of hyperparameters like the learning rate and vector dimensionality. For improving transfer learning, we are considering using a deep learning model with priors to truly leverage the presence of a source, albeit one covering a domain that is different from the target.
Deep learning for transfer learning is still in its infancy in the machine learning community; e.g., a recent survey on deep transfer learning (Tan et al. 2018) notes that most current research 'focuses on supervised learning, how to transfer knowledge in unsupervised or semi-supervised learning by deep neural network may attract more and more attention in the future'. In looking at the references they cite, the effectiveness of deep transfer learning does not seem to have been demonstrated thus far for difficult and irregular social media datasets. However, we believe that this presents an opportunity for further study, especially as new and different crises like COVID-19 continue to threaten our way of life at a global scale.

References
- Flood risk assessment of Srinagar city in Jammu and Kashmir, India
- Domain adaptation with adversarial training and graph embeddings
- Crisismmd: multimodal twitter datasets from natural disasters
- Architectural implications of social media analytics in support of crisis informatics research
- Social sensing of floods in the UK
- Getting the query right: user interface design of analysis platforms for crisis research
- On semantics and deep learning for event detection in crisis situations
- Crisis event extraction service (crees): automatic detection and classification of crisis-related content on social media
- Identifying informative messages in disaster events using convolutional neural networks
- Natural language processing (almost) from scratch
- #earthquake: Twitter as a distributed sensor system
- Document embedding with paragraph vectors
- Problems with evaluation of word embeddings using word similarity tasks
- The signals and noise: actionable information in improvised social media channels during a disaster
- A lightweight and multilingual framework for crisis information extraction from twitter data
- Bag of tricks for efficient text classification
- Information extraction in illicit web domains
- Robust filtering of crisis-related tweets
- Omlml: a helpful opinion mining method based on lexicon and machine learning in social networks
- Detection and extracting of emergency knowledge from twitter streams
- Think local, retweet global: retweeting by the geographically-vulnerable during hurricane sandy
- Detecting event-related tweets by example using few-shot models
- Tweettracker: an analysis tool for humanitarian and disaster relief
- Activeness of Syrian refugee crisis: an analysis of tweets
- Disaster damage assessment from the tweets using the combination of statistical features and informative words
- Distributed representations of words and phrases and their compositionality
- Entity linking meets word sense disambiguation: a unified approach
- A survey of named entity recognition and classification
- Applications of online deep learning for crisis response using social media information
- Automatic image filtering on social networks using deep learning and perceptual hashing during crises
- CrisisLex: a lexicon for collecting and filtering microblogged communications in crises
- Zero-shot learning with semantic output codes
- Crisis informatics: new data for extraordinary times
- Success & scale in a data-producing organization: the socio-technical evolution of openstreetmap in response to humanitarian events
- A survey on transfer learning
- Opinion mining and sentiment analysis
- Mining help intent on twitter during disasters via transfer learning with sparse coding
- Ranking and grouping social media requests for emergency services using serviceability model
- Social-EOC: serviceability model to rank social media requests for emergency operation centers
- Crisistracker: crowdsourced social media curation for disaster awareness
- An embarrassingly simple approach to zero-shot learning
- Earthquake shakes twitter users: real-time event detection by social sensors
- Active learning literature survey
- Socializing in emergencies: a review of the use of social media in emergency situations
- Resilience-building and the crisis informatics agenda: lessons learned from open cities Kathmandu
- Learning from the crowd: collaborative filtering techniques for identifying on-the-ground twitterers during mass disruptions
- Analysis and improvement of minimally supervised machine learning for relation extraction
- Natural language processing to the rescue? extracting "situational awareness" tweets during mass emergency
- Microblogging during two natural hazards events: what twitter may contribute to situational awareness
- Semi-supervised event-related tweet identification with dynamic keyword generation

Acknowledgements The authors gratefully acknowledge the ongoing support and funding of the DARPA LORELEI program and our partner collaborators in providing detailed analysis. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, AFRL, or the US Government.

Conflict of interest The authors declare that they have no conflict of interest.