key: cord-0106764-xxihlncq authors: Levy, Sharon; Wang, William Yang title: Cross-lingual Transfer Learning for COVID-19 Outbreak Alignment date: 2020-06-05 journal: nan DOI: nan sha: 80ce4d36563c9097295c8b4a35bf545307cceed9 doc_id: 106764 cord_uid: xxihlncq

The spread of COVID-19 has become a significant and troubling aspect of society in 2020. With millions of cases reported across countries, new outbreaks have occurred and followed the patterns of previously affected areas. Many disease detection models do not incorporate the wealth of social media data that can be used to model and predict a disease's spread. It is therefore useful to ask: can we use this knowledge from one country to model the outbreak in another? To answer this, we propose the task of cross-lingual transfer learning for epidemiological alignment. Utilizing both macro and micro text features, we train on Italy's early COVID-19 outbreak through Twitter and transfer to several other countries. Our experiments show strong results with up to 0.85 Spearman correlation in cross-country predictions.

During the COVID-19 pandemic, society was brought to a standstill, affecting many aspects of our daily lives. With globalization, it is intuitive that countries have followed earlier affected regions in patterns of outbreaks and in the measures taken to contain them (Cuffe and Jeavans, 2020). A unique form of information that can be used for modeling disease propagation comes from social media. This can provide researchers with access to unfiltered data with clues as to how the pandemic evolves. Current research on the COVID-19 outbreak concerning social media includes word frequency and sentiment analysis of tweets (Rajput et al., 2020) and studies on the spread of misinformation (Kouzy et al., 2020; Singh et al., 2020). Social media has also been utilized with respect to other disease predictions.
Several papers propose models to identify tweets in which the author or a nearby person has the attributed disease (Kanouchi et al., 2015; Aramaki et al., 2011; Lamb et al., 2013; Kitagawa et al., 2015). Iso et al. (2016) and Huang et al. (2016) utilize word frequencies to align tweets to disease rates. A shortcoming of the above models is that they do not consider how one region's outbreak may relate to another. Additionally, many of the proposed models rely on lengthy keyword lists or syntactic features that may not generalize across languages. Sentence embeddings from models such as multilingual BERT (mBERT) (Devlin et al., 2019) and LASER (Artetxe and Schwenk, 2019) can allow us to combine features from hundreds of languages in order to make connections across languages for semantic alignment.

We present an analysis of Twitter usage for cross-lingual COVID-19 outbreak alignment. We utilize millions of tweets in several languages to evaluate how social media can help detect epidemiological outbreaks across countries. In particular, we aim to analyze how one country's tweets align with its own outbreak and whether those same tweets can be used to predict the state of another country. To this end, we show that we can achieve strong results with cross-lingual transfer learning. Our contributions include:

• We formulate the task of cross-lingual transfer learning for epidemiological outbreak alignment across countries.
• We are the first to investigate state-of-the-art cross-lingual sentence embeddings for cross-country epidemiological outbreak alignment. We propose joint macro and micro reading for multilingual prediction.
• We obtain strong correlations in domestic and cross-country predictions, providing us with evidence that social media patterns in relation to COVID-19 transcend countries.

Figure 1: Tweet frequency in the COVID-19 Twitter dataset (Chen et al., 2020), in various languages. The peaks are marked by events relating to the initial outbreak in each language's main country.
2 Twitter and COVID-19

An intriguing question in the scope of epidemiological research is: can atypical data such as social media help us model an outbreak? To study this, we utilize Twitter as our source, since users primarily post textual data and do so in real time. Furthermore, Twitter users span several countries, which is beneficial as COVID-19 is analyzed by researchers and policymakers on a country-by-country basis (Kaplan et al., 2020). Our motivation in this paper is the intuition that social media users can provide us with indicators of an outbreak during the COVID-19 pandemic. In that case, we reformulate our original question: can we align Twitter with a country's COVID-19 outbreak and apply the learned information to other countries?

We utilize the COVID-19 Twitter dataset (Chen et al., 2020), which comprises millions of tweets in several languages. These were collected through Twitter's streaming API and Tweepy by filtering for 22 specific keywords and hashtags related to COVID-19 such as Coronavirus, Wuhanlockdown, stayathome, and Pandemic. We consider tweets from February 1st, 2020 to April 30th, 2020, and filter for tweets written in Italian, Indonesian, Turkish, Japanese, and Thai. Specifically, we filter for languages that are primarily spoken in only one country, as opposed to languages such as English and Spanish that are spoken in several countries. In Table 1

We start by investigating a basic feature in our dataset: tweet frequency. We plot each country's tweet frequency in Figure 1. There is a distinct peak within each country, corresponding to events within that country signaling its initial outbreak, denoted by the vertical lines. This alignment indicates that even a simple feature such as tweet frequency can track a country's outbreak, and that the pattern recurs across several countries. Given this result, we further explore other tweet features for epidemiological alignment.
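The tweet-frequency feature described above can be sketched as a per-day count over tweet timestamps. This is a minimal illustration rather than the pipeline used in the paper: the function names `daily_tweet_frequency` and `peak_day`, the ISO-formatted timestamp strings, and the example data are all assumptions made for the sketch.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def daily_tweet_frequency(timestamps, start, end):
    """Count tweets per calendar day between start and end (inclusive).

    `timestamps` is assumed to be an iterable of ISO 8601 strings such as
    "2020-02-01T08:15:00"; the real dataset may use a different format.
    """
    counts = Counter(datetime.fromisoformat(ts).date() for ts in timestamps)
    n_days = (end - start).days + 1
    series = []
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        # Days without tweets are kept with a count of zero so the
        # resulting series has no gaps.
        series.append((day, counts.get(day, 0)))
    return series


def peak_day(series):
    """Return the day with the highest tweet count (first one on ties)."""
    return max(series, key=lambda pair: pair[1])[0]


# Hypothetical timestamps, for illustration only.
example = [
    "2020-02-01T08:15:00",
    "2020-02-01T21:40:00",
    "2020-02-03T10:00:00",
    "2020-02-03T11:30:00",
    "2020-02-03T23:59:59",
]
series = daily_tweet_frequency(example, date(2020, 2, 1), date(2020, 2, 4))
```

The resulting list of `(day, count)` pairs can be plotted per language, and `peak_day(series)` gives the date that would be compared against the outbreak events marked in Figure 1.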
We determine that it is most helpful for researchers to first study regions with earlier outbreaks in order to make assumptions about later occurrences in other locations. Within the five countries we examine, Italy has the earliest peak in cases. As a result, we analyze various textual features in Italy. When aligning outbreaks from two different countries, we experiment with the transfer learning setting: we train on Italy's data and test on the remaining countries.

We present this as a regression problem in which we map our input text features $x \in \mathbb{R}^n$ to the output $y \in \mathbb{R}$. Our ground-truth output $y$ is presented in two scenarios in our experiments: total cases and daily new cases. The former considers all past and current reported cases, while the latter consists of only the cases reported on a specific day. The predicted output $\hat{y}$ is compared against the ground truth $y$. At both training and test time, we utilize support vector regression. For each day, we concatenate the chosen features as input to our regression model. Following related disease prediction work, we evaluate our predictions with Spearman's correlation (Hogg et al., 2005) to determine how our features align with the officially reported cases.

In the wake of the COVID-19 crisis, society has adopted a new vocabulary to discuss the pandemic (Katella, 2020). Quarantine and lockdown have become standard words in our daily conversations. Therefore, we ask: are there specific features that indicate the state of an outbreak? We create a small COVID-19-related keyword list consisting of lockdown, quarantine, social distancing, epidemic, and outbreak and translate these words into Italian. We also include the English word "lockdown", as it has entered other countries' vocabularies as well. We aim to observe which, if any, of these words align with Italy's outbreak. In addition to word frequencies, we also utilize mBERT and LASER to extract tweet representations for semantic alignment.
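To make the evaluation metric concrete, the following is a minimal pure-Python sketch of Spearman's correlation, computed as the Pearson correlation of the ranks with tied values sharing their average rank. The function names and the example numbers are illustrative assumptions, not values from the paper; in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(predicted, actual):
    """Spearman correlation between predicted and ground-truth case counts."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two equally long series with at least 2 points")
    rx, ry = average_ranks(predicted), average_ranks(actual)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5


# Made-up numbers, for illustration only: five days of reported total
# cases and the corresponding regression outputs.
total_cases = [3, 20, 79, 150, 229]
predictions = [10.0, 15.0, 90.0, 80.0, 300.0]
rho = spearman(predictions, total_cases)
```

Because the metric depends only on rank order, a model can score well in the total-case scenario by getting the ordering of days right, even if the predicted magnitudes are off.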
We further filter Italy's tweets for a balanced representation of tweet embeddings. We remove duplicate tweets, retweets, tweets with hyperlinks, and tweets discussing countries other than Italy. Using bert-as-service (Xiao, 2018), we extract representations for each tweet. We explore two options when utilizing our tweet representations: average-pooling and max-pooling. Our final feature consists of daily tweet frequency after filtering.

Can tweet text align with confirmed cases? We combine subsets of our frequency features with our tweet embeddings and show results in Table 2. Through manual tuning, we find that our strongest model (polynomial kernel) for the total case scenario uses the keyword lockdown (in English) and averaged tweet representations from mBERT. When aligning to new cases, the best model (sigmoid kernel) uses the keyword lockdown (in English) and max-pooled LASER embeddings. While mBERT and LASER differ very little in alignment to total cases, LASER is noticeably stronger in the new case setting, particularly in time II. For the total case setting, our predictions show strong alignment with the ground truth, which is monotonically increasing, in all time settings. When measuring against new daily cases, the correlations are weaker in time II. When investigating this, we find that Italy's new cases form a right-skewed Gaussian curve with a peak in late March, as shown in Figure 2. As a result, there is a distribution shift when training on February data only (the tail of the distribution) and testing on both March and April.

While we can align historical data to future cases within Italy, researchers may not have enough data to train models for each country. Therefore we ask: can we use Italy's outbreak to predict the outbreak of another country? In particular, we determine whether users from two different countries follow similar patterns of tweeting during their respective outbreaks and how well we can align the two.
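The per-day input described above (pooled tweet embeddings concatenated with frequency features) can be sketched as follows. This is an illustrative assumption about how the pieces fit together, not the authors' code: the function names, the feature ordering, and the three-dimensional toy embeddings are hypothetical, and real mBERT or LASER vectors have hundreds of dimensions.

```python
def average_pool(embeddings):
    """Element-wise mean over all tweet embeddings of one day."""
    dim = len(embeddings[0])
    return [sum(vec[d] for vec in embeddings) / len(embeddings) for d in range(dim)]


def max_pool(embeddings):
    """Element-wise maximum over all tweet embeddings of one day."""
    dim = len(embeddings[0])
    return [max(vec[d] for vec in embeddings) for d in range(dim)]


def daily_feature_vector(embeddings, keyword_count, tweet_count, pooling="average"):
    """Concatenate the pooled embedding with the day's frequency features.

    The ordering (embedding, keyword frequency, tweet frequency) is an
    arbitrary choice made for this sketch.
    """
    if not embeddings:
        raise ValueError("at least one tweet embedding is required per day")
    if pooling == "average":
        pooled = average_pool(embeddings)
    elif pooling == "max":
        pooled = max_pool(embeddings)
    else:
        raise ValueError("pooling must be 'average' or 'max'")
    return pooled + [float(keyword_count), float(tweet_count)]


# Toy 3-dimensional embeddings for two tweets posted on the same day.
day_embeddings = [[0.5, -1.0, 2.0], [1.5, 0.0, -2.0]]
avg_features = daily_feature_vector(day_embeddings, keyword_count=3, tweet_count=2)
max_features = daily_feature_vector(day_embeddings, 3, 2, pooling="max")
```

One such vector per day would then form a row of the regression input, with the reported total or new cases for that day as the target. Because mBERT and LASER embed different languages into a shared space, vectors built this way from another country's tweets can be fed to a model trained on Italy's.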
We follow the same tweet preprocessing methodology described in Section 2.5 and the timeline cuts for training and testing defined in Section 2.2. We also add another time setting (V): training on February, March, and April and testing on all three months. This serves as an upper bound for our correlations, indicating how well the general feature trends align between the two countries and their outbreaks.

In Figure 2, we find that Indonesia is the only country that had not yet reached a peak in new daily cases by the end of April, with cases still steadily increasing. Meanwhile, the other countries follow normal distributions like Italy. However, given that we train our model on February and March data, it does not learn information on post-peak trends and cannot generalize well to the post-peak scenarios that occur in April in the other countries.

What can we learn from our results? Overall, transfer learning in the total case setting leads to stronger correlations with case counts. While results show that training on February and testing on March and/or April works best, our results for setting V's upper bound correlation show that weaker correlations can be due to the limited sample sizes we have from the start of the pandemic. Additionally, training on February, March, and April in Italy allows us to model a larger variety of scenarios during the pandemic, with samples from pre-peak, mid-peak, and post-peak periods. Therefore, as we obtain more data every day, we can build stronger models that generalize better to varying distributions of cases and align outbreaks across countries, so that they can fully reach their upper bound correlations and beyond. Doing so is especially important for analyzing Twitter trends and enabling researchers to potentially predict future case surges in other countries.

In this paper, we performed an analysis of cross-lingual transfer learning with Twitter data for COVID-19 outbreak alignment using cross-lingual sentence embeddings and keyword frequencies.
We showed that even with our limited sample sizes, we can utilize knowledge of countries with earlier outbreaks to correlate with cases in other countries. With larger sample sizes and when training on a variety of points during the outbreak, we can obtain stronger correlations to other countries. We hope our analysis can lead to future integration of social media in epidemiological prediction across countries, enhancing outbreak detection systems.

References

Twitter catches the flu: Detecting influenza epidemics using Twitter
Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond
COVID-19: The first public coronavirus Twitter dataset
How the UK's coronavirus epidemic compares to other countries
BERT: Pre-training of deep bidirectional transformers for language understanding
An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases
Introduction to mathematical statistics. Pearson Education
Syndromic surveillance using generic medical entities on Twitter
Forecasting word model: Twitter-based influenza surveillance and prediction
Who caught a cold? - Identifying the subject of a symptom
Countries around the world are reopening - here's our constantly updated list of how they're doing it and who remains under lockdown
Our new COVID-19 vocabulary - what does it all mean?
Disease event detection based on deep modality analysis. Student Research Workshop
Coronavirus goes viral: Quantifying the COVID-19 misinformation epidemic on Twitter
Separating fact from fear: Tracking flu infections on Twitter
Word frequency and sentiment analysis of Twitter messages during coronavirus pandemic
A first look at COVID-19 information and misinformation sharing on Twitter
bert-as-service