key: cord-0556180-9a38gvfu authors: Chen, Ninghan; Zhong, Zhiqiang; Pang, Jun title: An Exploratory Study of COVID-19 Information on Twitter in the Greater Region date: 2020-08-12 journal: nan DOI: nan sha: 9c610f88b872d79fa9137f17e6948af2033274ca doc_id: 556180 cord_uid: 9a38gvfu
The outbreak of the coronavirus disease (COVID-19) has led to an outbreak of pandemic information in major online social networks (OSNs). In a constantly changing situation, OSNs are becoming a critical conduit for people to express opinions and seek up-to-the-minute information. Thus, social behaviour on OSNs may become a predictor or reflection of reality. This paper aims to study the social behaviour of the public in the Greater Region (GR) and related countries based on Twitter information, using machine learning and representation learning methods. We find that tweets volume can be a predictor of outbreaks only in a particular period of the pandemic. Moreover, we map out the evolution of public behaviour in each country from 2020/01/22 to 2020/06/05, identifying the main differences in public behaviour between GR and related countries. Finally, we conclude that the volume of tweets about anti-contagion measures may affect the effectiveness of government policy.
On January 20th 2020, the World Health Organisation (WHO) declared a global health emergency over the COVID-19 outbreak. Later, on March 12th 2020, WHO announced the COVID-19 outbreak as a pandemic. 1 The COVID-19 outbreak has led to an outbreak of pandemic information in major online social networks (OSNs), including Twitter, Facebook, Instagram, and YouTube [1]. In the middle of a massive COVID-19 outbreak and a constantly changing situation, OSNs are becoming a critical conduit for people to seek up-to-the-minute and local information.
Moreover, due to physical isolation and social distancing, people spend much more time on OSNs, expressing opinions, encouraging others, openly lambasting mismanagement and voicing vitriol. On the one hand, social behaviour on OSNs may become a predictor or reflection of reality. On the other hand, the related information diffusion over OSNs can strongly influence people's behaviour, and thus have an impact on the effectiveness of control and protective measures deployed by governments. There is a growing body of research that links OSN activity to COVID-19. Some existing results have already shown that OSN conversations can be a leading indicator of COVID-19 cases [2, 3], that discussions on OSNs can be categorised into multiple specific topics [4, 5, 6, 7], and that OSNs may help to design more efficient pandemic models for social behaviour and to implement more responsive government communication strategies [1, 8, 9]. However, there are three main problems with the existing research. First, research using geographic data relies on rough processing of the location information [2, 10]. Second, current topic modelling studies mostly focus on a relatively long period (weeks or months) [6, 1], which does not provide a precise representation of how topics change from day to day, and the existing studies mainly focus on general characteristics of user behaviour. Third, studies of information shared on OSNs at a global or country level [11, 2, 10] use a coarse geographic division. When analysing COVID-19 information on Twitter by geographic location, it cannot be ignored that COVID-19 behaves in a way which is inescapably spatial [12].
1 https://bit.ly/39EMOtY
Existing studies [13, 14] have shown that, whether for Spanish flu, Ebola or COVID-19, the geographic progression of contagion reveals that economic, logistical and flows-oriented relationality is intrinsically linked to the transmission pattern of infectious diseases [15]. Hence, research results focusing only on politically sovereign states would be biased. To fill this gap, we determine the study area using relational urbanisation, a flourishing idea from human geography which holds that city orientation is influenced by the network of materials, capital, information and culture, and reinforced by financialised capitalism [16]. We choose the Greater Region (GR) as the representative in our case study. GR is a typical product of relational urbanisation, with Luxembourg at its centre and including adjacent regions of Belgium, Germany and France (i.e., Wallonia, Saarland, Lorraine, Rhineland-Palatinate and the German-speaking Community of Belgium). We define the neighbouring countries mentioned above as the related countries of GR. GR has the highest number of cross-border commuters in Europe, with approximately 250,000 commuters per day 2. This makes GR a particular and classic example: the virus spreads with high mobility, while the whole business model in GR requires a large number of cross-border workers to sustain it. With the implementation of a set of policies including border closures, and with the progression of the pandemic, the people living in the GR area are affected in their economy, daily life, travel and other aspects. This study is divided into two parts (Sections 4 and 5) to address the following three main questions: RQ1 Can the Twitter posts volume be a predictor of COVID-19 daily cases during a long-term period in GR and its related countries?
RQ2 Are there certain patterns of social behaviour on OSNs at different periods of the pandemic, and how does GR, as a relational urbanisation product, differ from other countries in terms of social behaviour on OSNs during the pandemic? RQ3 Will more attention to the pandemic and prevention measures in the early stages of the pandemic impact the effectiveness of control and protective measures deployed by governments? To answer these questions, we collected 51,966,639 tweets, posted globally by 15,551,266 Twitter users from 2020-01-22 to 2020-06-05. Among them are 1,643,308 posts by 41,690 users in GR and its related countries. To investigate RQ1, the basic reproductive rate R0 and the effective reproductive rate R(t) from epidemiology [17] are introduced to slice the pandemic into periods, and correlations between tweets volume and daily cases in each period are calculated using Pearson Correlations (PC). To study RQ2 and RQ3, we use a novel topic modelling method combining Bidirectional Encoder Representations from Transformers (BERT) [18] and the Latent Dirichlet Allocation (LDA) topic modelling method [19], and we train a supervised Support Vector Machine (SVM) [20] to classify topics into given categories. The main contributions of this paper are fourfold. (I) We generate a novel Twitter dataset from 2020-01-22 to 2020-06-05 which contains users with locations labelled in GR and the related countries (Luxembourg, France, Germany and Belgium), together with the COVID-19 related tweets and conversations posted by these users. This dataset will be shared with the public to advance related research. (II) A spatio-temporal analysis is carried out to determine how COVID-19 cases are correlated with Twitter posts during a long-term period. We find that tweets volume can be a predictor of outbreaks only during the early period of the pandemic.
Before the R(t) value peaks, there is a spike of public concern about the pandemic, which is the best time to conduct pandemic precaution advocacy [21]. (III) We find that for countries with a long interval (average 27.3 days) between the date of the first case and the date of an outright outbreak, the appearance of the first case did not attract enough public attention to anti-contagion and treatment measures. Furthermore, GR and Luxembourg showed a greater concern for anti-contagion and treatment measures before COVID-19 reached its peak, and exhibited a higher level of interest in policy and daily life before R(t) < 1 than Germany, France and Belgium. (IV) Discussions about anti-contagion and treatment measures on OSNs before COVID-19 reaches its peak may influence government protective policies and shorten the period needed for the policies to take effect. This study sheds light on how the public reacts differently over time in GR and related countries through an interdisciplinary approach. It may, therefore, be useful for understanding changes in social behaviour on OSNs during the pandemic and, in particular, the distinction in social behaviour between a relational urbanisation region with high mobility and politically sovereign states. Identifying when messages work best on the public helps not only to generate policy support but also to ensure the individual actions needed to combat the spread of the virus. Some existing results have already shown that social media conversations can be a leading predictor of new pandemic cases [2, 22, 3], and that in many countries tweets increase in volume before the number of confirmed cases increases.
Studies have shown that anti-contagion policies can significantly and substantially reduce the spread of COVID-19 [23, 24, 25]. The effect of policies on mitigating the spread varies and is influenced by factors including culture, demographics, socio-economic status and national health systems, and changes in public knowledge may affect the impact of the policies. If the public adjusts their behaviour in response to new information not related to the policies, such as information from online sources, this may change the spread of COVID-19 [23]. Research on public behaviour patterns during the pandemic has been conducted based on data from smart devices [8], search indices [26, 27], and COVID-19 related conversations on Twitter. Bento et al. [9] report a spike in searches for basic information about COVID-19 when the first case was announced in each state in the United States, but find that the first case report does not trigger discussions about policy and daily life. Topic modelling, an unsupervised approach that detects latent semantic structure [4], is widely used. Cinelli et al. [1] extract topics with word embedding on a global scale, concluding that social media may help to design more efficient epidemic models for social behaviour and to implement more efficient communication strategies. The LDA model is used by Medford et al. [5] and Ordun et al. [6] to analyse the topics in the early period of the pandemic. Sharma et al. [7] use character embedding [28] and Term Frequency Inverse Document Frequency (TF-IDF) word distribution with manual inspection for topic modelling. However, LDA, a bag-of-words approach which is widely used to identify latent subject information in a large-scale document collection or corpus, has some drawbacks: it needs a large corpus to train, ignores contextual information and performs poorly on short texts [29].
As a result, these studies extract topics over fixed time periods, and the time granularity is too coarse to accurately reflect the trend of the topics. In this section, we briefly describe how we collect COVID-19 tweets and COVID-19 daily case information for GR and the related countries. Twitter, one of the most prominent online social media platforms, has been used extensively during the pandemic. In this study, 51,966,639 tweets posted by more than 15 million Twitter users are retrieved. The data collection consists of the following steps. First, based on Chen et al.'s work [30], we use the Twitter Streaming API to collect posts with COVID-19 related keywords from 2020/01/22 to 2020/06/05, together with the user ID and location of the users who posted them. Second, as the user location information collected so far is user-defined, it is not necessarily a true location, nor machine-parseable, so the fuzzy location text is converted into real location information by leveraging geocoding APIs, Geopy 3 and ArcGIS Geocoding 4. After obtaining a valid geographic location for each user, we select the users located in GR, Luxembourg, France, Germany and Belgium to form the dataset. Table 1 gives an example of the final dataset, and Table 2 summarises the collected tweet data for GR, Luxembourg, France, Germany, Belgium and the global dataset. For the COVID-19 case data, the dataset published by the European Centre for Disease Prevention and Control 5 allows us to obtain COVID-19 data including daily cases, deaths and locations for the countries we selected. There is no official COVID-19 data published for GR, which is composed of Luxembourg, Wallonia in Belgium, Saarland and Rhineland-Palatinate in Germany, and Lorraine in France. When counting daily cases and deaths in GR, we therefore add up the data for the regions mentioned above from the datasets 6 published by the corresponding countries to obtain the final GR data.
It should be noted that in France, the number of daily new cases is not available at the region level, and data on deaths, hospitalisations and hospital departures have been published only since March 18, 2020. The data for Lorraine are therefore counted as zero until March 18, 2020, and the sum of hospitalisations, hospital departures and deaths is taken as the total number of cases on a particular day. To investigate whether tweets volume can be a predictor of daily cases during the pandemic (RQ1), we use the basic reproductive rate R0 and the effective reproductive rate R(t) from epidemiology to slice the study period, as the research covers a long duration. We then conduct a spatio-temporal analysis of the correlations between tweets volume and daily cases in each period using Pearson Correlations (PC).
[Figure: panels for Luxembourg, Belgium, France and Germany]
R0 is the expected number of cases arising directly from a single case in a population where all individuals are susceptible to infection [17], and R(t) represents the average number of new infections caused by an infected person at time t. If R(t) > 1, the number of cases will increase, e.g. at the beginning of an epidemic. When R(t) = 1, the disease is endemic, and when R(t) < 1, the number of cases will decrease. To calculate the real-time R(t), we use a Bayesian approach [31] with Gaussian noise to estimate the time-varying R(t) based on daily new cases, which is also the official method for calculating R(t) in Luxembourg. 7 While the study of R0 for COVID-19 is still ongoing, in this research we use the R0 estimated by WHO 8, which ranges between 1.4 and 2.5. The results of the time-varying R(t) for GR, Luxembourg, Belgium, France and Germany are shown in Figure 2.
Here, we divide the pandemic into four periods: the Pre-peak period, the Free-contagious period, the Measures period and the Decay period. It is necessary to note that during the Free-contagious period, R(t) is based on R0, taking into account the results of the prevention measures [32]. During this period, the measures taken to prevent the spread of the pandemic have not yet shown results or have failed to stop the spread, and COVID-19 still spreads at the R0 reproductive rate. The precise duration of these pandemic periods for each country and region is summarised in Table 3. The exact number of days in each pandemic period is shown in Figure 3 for the countries and GR. The Free-contagious period in Luxembourg and GR is particularly short (4 and 6 days, respectively) compared to the other countries (20-24 days). Existing research indicates that the differences in periods between regions and countries are influenced by a variety of factors including government policy [33, 34], population density [35] and mobility [36, 37]. In this paper, we analyse them from the perspective of social behaviour on OSNs in Section 5 (RQ3). To answer RQ1, we hypothesise that daily tweets volume can be predictive of daily cases, and we measure the relationship between them by PC, where a larger absolute PC value means a stronger relationship. The results are shown in Table 4. A positive lag means that the tweets occur after the cases; Lag = -5 days means that we match the daily cases with the tweets volume from five days earlier, in other words, a 5-day lead. Pre-peak period. As shown in Table 4, there is a clear trend of strong correlation (PC > 0.8, p < 0.05) with lags during the Pre-peak period, reaching its maximum at -5 or -6 days, indicating that tweets volume may be a predictor of outbreaks in the early period of the pandemic.
Although current research on the incubation period of COVID-19 is inconclusive, several studies have suggested that it is on average 5-6 days [38, 39, 40]. In this regard, we propose that the 5-6 day lead may be related to the lag between infection and the point at which symptoms become detectable and confirmed. Free-contagious period. There is no clear trend of correlation with lags except for Luxembourg, indicating that tweets volume cannot be used to predict daily cases in the Free-contagious period. The period lasted only 4 days in Luxembourg, which is too short for PC to reflect the correlation reliably. However, the PC values show a highly negative correlation between tweets volume and daily cases, which reflects a short downward trend in the discussion of the pandemic after it reached its peak, even though the number of cases continued to rise rapidly. Our dataset thus supports the conclusion of Smith et al. [41], who noted that public awareness of a disease declines sharply after the peak, even though infection rates remain high. In other words, the public's interest in the pandemic declines after a period of heightened attention during the Pre-peak period. Measures period. There is a clear trend of correlation with lags: tweets volume begins to level off, with a moderate correlation (0.8 > PC > 0.3) at a lag of 0 or 1 day. Decay period. The correlations between tweets volume and daily cases take two forms here: either the correlation is weak, or there is a correlation but the trend of correlation with lags is not significant. Both demonstrate that it is not possible to predict daily cases from tweets volume during this period.
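The lagged correlation analysis described above can be illustrated with a minimal sketch. It assumes two aligned daily series (tweets volume and new cases) and follows the lag convention stated in the text, where Lag = -5 pairs each day's cases with the tweets volume from five days earlier. The function names and the toy series are hypothetical and are not taken from the study's code or data; significance testing (p-values) is omitted.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)

def lagged_pearson(tweets, cases, lag):
    """Correlate cases[t] with tweets[t + lag].

    lag = -5 pairs each day's cases with the tweets volume from
    five days earlier, i.e. tweets lead cases by five days.
    """
    if lag < 0:
        t, c = tweets[:lag], cases[-lag:]
    elif lag > 0:
        t, c = tweets[lag:], cases[:-lag]
    else:
        t, c = tweets, cases
    return pearson(t, c)

def best_lag(tweets, cases, lags):
    """Return the lag with the largest absolute correlation, plus all scores."""
    scores = {lag: lagged_pearson(tweets, cases, lag) for lag in lags}
    return max(scores, key=lambda k: abs(scores[k])), scores

# Synthetic toy series (NOT data from the study): cases follow tweets
# with a five-day delay, so the strongest correlation should be at lag -5.
tweets = [1, 4, 2, 8, 5, 7, 3, 9, 6, 10, 12, 11, 15, 13, 14, 18, 16, 20, 17, 19]
cases = [0] * 5 + [2 * v + 1 for v in tweets[:-5]]

lag, scores = best_lag(tweets, cases, range(-8, 4))
print(lag, round(scores[lag], 3))
```

In practice the same scan would be run separately for each pandemic period and each region, since the text reports that the lead is only informative during the Pre-peak period.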
In summary, from the spatio-temporal analysis of the correlation between tweets volume and new COVID-19 cases during the four periods of the pandemic, we find that tweets volume can be a predictor of outbreaks only during the Pre-peak period. Regardless of the time at which R(t) peaks, there is a 5-6 day lead between tweets volume and COVID-19 daily cases, which may be related to the lag between infection and the onset of symptoms. Moreover, before a pandemic strikes, there is a high level of public concern about it, and the Pre-peak period is the best time to conduct pandemic precaution advocacy [21], as public concern decreases after entering the Free-contagious period. On the particularity of GR, we find that the Free-contagious period in GR and Luxembourg is exceedingly short, while the Measures period is similarly short in Luxembourg. The reasons for this are explored further in Section 5 from the perspective of social behaviour on OSNs (RQ2). To understand the actual situation and gain further insights into the behaviour of the public in GR and related countries on social media, we analyse the tweets posted by users in GR and related countries with BERT [18] and LDA [19]. We extract the main daily topics of the tweets and categorise the generated topics, to determine whether there are certain patterns of social behaviour on OSNs in the different periods of the pandemic (RQ2). More importantly, we investigate how GR, as a relational urbanisation product, differs from other countries in terms of public concern on OSNs during the pandemic (RQ2), and whether more attention to the pandemic and prevention measures in the early stages of the pandemic will shorten the Free-contagious period (RQ3). The overall workflow of our topic modelling and classification tasks is shown in Figure 4. Text preprocessing. Prior to topic modelling, the tweet data need to be preprocessed.
Particularly, missing delimiters are detected according to the
Topic modelling. To identify the latent topics of the tweets posted by the public in GR and related countries, we adopt the general structure of the contextual topic embedding method (CTE) 10, which lets us extract daily topic data and obtain a more accurate picture of topic trends. CTE mainly consists of two components, LDA and BERT, which extract different information from sentences into an embedding. LDA, a bag-of-words approach which is widely used to identify latent subject information in a large-scale document collection or corpus, has some drawbacks: it needs a large corpus to train, ignores contextual information and performs poorly on short texts [29]. BERT uses bidirectional transformers for pre-training on a large unlabelled text corpus, taking both left and right context into account simultaneously, which compensates for the shortcomings of LDA. BERT can also produce sentence embeddings, so we concatenate the generated tokens of each tweet into an input sentence for BERT to obtain a tweet representation. CTE combines the sentence embedding vector generated by BERT with the probabilistic topic assignment vector generated by LDA, weighted by a hyper-parameter γ. Besides, as our data are multilingual, words in languages other than the predominant English appear less frequently and are easily overlooked in topic modelling. Hence, we adopt the TF-IDF model to determine word relevance in the documents [42], and feed the TF-IDF-weighted corpus to LDA instead of a simple bag-of-words corpus. After obtaining the concatenated vector in high-dimensional space, CTE uses an autoencoder to learn a low-dimensional latent representation of the concatenated vector with more condensed information. Then k-means [43] is used for clustering, with the number of clusters k, that is, the number of topics, kept as a hyper-parameter.
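The combination and clustering steps described above can be sketched as follows. This is a simplified illustration, not the CTE implementation: the vectors are small hand-made stand-ins for real LDA and BERT outputs, the autoencoder step is skipped entirely, and applying γ as a scale factor on the LDA part is an assumption, since the text only says the two vectors are combined with a hyper-parameter γ. All function names are hypothetical.

```python
import random

def combine(lda_vec, bert_vec, gamma):
    """Concatenate a gamma-scaled LDA topic vector with a sentence embedding.

    Scaling the LDA part by gamma is an assumption made for illustration.
    """
    return [gamma * p for p in lda_vec] + list(bert_vec)

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: mean of each cluster (keep old centroid if empty).
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels

# Stand-in vectors (NOT real model outputs): three tweets per topic.
lda_vecs = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15),
            (0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
bert_vecs = [(1.0, 0.0), (0.9, 0.1), (0.95, 0.0),
             (0.0, 1.0), (0.1, 0.9), (0.0, 0.95)]

gamma = 0.5
points = [combine(l, b, gamma) for l, b in zip(lda_vecs, bert_vecs)]
labels = kmeans(points, k=2)
print(labels)
```

With real data, the concatenated vectors would first be compressed by the autoencoder mentioned in the text before clustering, and k would be tuned rather than fixed in advance.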
We compute the word frequencies in each cluster, sort them, and take the top ten words as the representative topic of that cluster. For visualisation, we use Uniform Manifold Approximation and Projection (UMAP) [44], a state-of-the-art visualisation and dimension reduction algorithm, to project the latent space into low dimensions. The average coherence score and the average silhouette score are used as the metrics of CTE. We calculate the average coherence score by computing the topic coherence for each topic individually and averaging the results. The average silhouette score is the mean of the silhouette scores for each day. Topic modelling is conducted on the daily tweets of GR and the related countries. We fine-tune the topic models and arrive at the optimal values n = 7 topics and γ = 0.5, which give the highest average coherence score [45, 46] and average silhouette score [47]. The results are shown in Table 5, and a sample clustering result from UMAP is shown in Figure 5. It can be observed from Table 5 and Figure 5 that the results generated by CTE are coherent and form well-separated clusters. After obtaining 4,763 topics from topic modelling, we randomly selected 2,435 topics and classified them manually into 7 categories. These manually classified topics are used to train a Support Vector Machine (SVM) [20] for supervised classification. In particular, the words of each topic are converted to word frequency vectors with TfidfVectorizer 11, and the country is encoded with LabelEncoder 12. The feature vector is made up of these two elements. Since our manually labelled dataset is class-imbalanced, the Synthetic Minority Oversampling Technique (SMOTE) [48] is used to oversample the dataset and mitigate the imbalance. The dataset is split into 80% for training and 20% for testing.
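To make the oversampling step concrete, the core idea of SMOTE can be sketched in a few lines: each synthetic sample is placed at a random point on the segment between a minority-class sample and one of its nearest minority-class neighbours. This is a simplified illustration under stated assumptions, not the implementation used in the study. The function name, the neighbour count and the toy feature vectors are hypothetical.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic samples by SMOTE-style interpolation.

    For each new sample: pick a random minority sample x, pick one of
    its k nearest minority neighbours nb, and return x + u * (nb - x)
    with u drawn uniformly from [0, 1).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # Nearest neighbours by squared Euclidean distance.
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(neighbours)
        u = rng.random()
        synthetic.append([a + u * (b - a) for a, b in zip(x, nb)])
    return synthetic

# Toy feature vectors (NOT from the study): 3 minority vs 6 majority samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
majority = [[5.0, 5.0], [5.5, 5.0], [5.0, 5.5],
            [6.0, 6.0], [6.5, 6.0], [6.0, 6.5]]

new_samples = smote_like(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_samples
print(len(balanced_minority), len(majority))
```

A common practice, which the text does not specify either way, is to oversample only the training split so that the test split keeps the original class distribution.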
Grid search with 10-fold cross-validation is run on the training dataset to find the optimal hyper-parameters, and the final SVM model is fitted on the entire training set. Table 6 shows the precision, recall, F1 score, support and macro-average F-score of the trained classifier for each topic category. The resulting SVM model is then used to classify the rest of the topics. Table 7 shows the number of topics in each category for each country. The categories with the highest percentages are Wuhan & China and policy and daily life. In general, the share of topics about policy and daily life is much higher in Luxembourg (56.6%) than in other countries (average 33.0%). France, on the other hand, shows a high level of interest in local news (30.2%) relative to other countries (9.4%). The overall data of GR, however, do not show particular differences from other countries. Note that a cluster for a topic may contain no more than two tweets; we treat such topics as invalid and remove them. This leads to a different total number of topics in each country. Next, we introduce dates to plot the changes in categories over time. Figure 6 shows the tweets volume contained in each category as a percentage of the total tweets volume on that day (CR), with darker red representing a higher CR. The interval shaded in white represents the period from 22 January to the Pre-peak period, and the regions shaded in different colours indicate, in order, the Pre-peak period, Free-contagious period, Measures period and Decay period. In this section, we aim to answer RQ2: whether there are certain patterns of social behaviour on OSNs at different periods of the pandemic, and how GR, as a relational urbanisation product, differs from other countries in terms of social behaviour on OSNs during the pandemic. The behavioural pattern obtained from Figure 6 is slightly different in our case from the conclusion of Bento et al. [9].
In France, Germany and Belgium, the appearance of the first case triggered only a small amount of discussion about protective measures, and discussion about them did not start to increase until OD. In other words, the public did not really heed the pandemic until OD, when the virus was already spreading. This may be explained by the long interval (average 27.3 days) between the date of the first case and OD in France, Germany and Belgium. During this interval, sporadic cases may not have attracted enough public attention, and the public's attention was still focused on China-related news. Moreover, the report of the first case did not stimulate discussions about policies and daily life either, and such discussions did not emerge frequently until OD. The early picture in Luxembourg and GR is different. Figure 6 shows that the public in Luxembourg and GR started to discuss measures 1-2 days before the first case appeared. This may be explained by the late occurrence of the first case in Luxembourg and GR: the other three countries had already passed OD, and the outbreak in those countries may have attracted public attention in GR and Luxembourg. Interestingly, in Luxembourg, the discussion about policies and daily life was already present before the first case was announced and increased immediately afterwards. A word cloud of the topics in Luxembourg from 22 January to 1 March (the date of the first case) is depicted in Figure 7; it shows that the topics are mainly travel-related. This may be explained by the fact that the proportion of foreign residents in Luxembourg is 47.4% 13, so residents are more concerned about travel-related policies in Luxembourg and other countries.
As a relational urbanisation product, GR exhibits a high level of interest in policy and daily life, which accounts for 47.1% of the total tweets volume during the Free-contagious and Measures periods, while in Luxembourg, the central region of GR, this percentage is 66.1%. Figure 8(a) shows boxplots of the distribution of the CR of policy and daily life during the Free-contagious and Measures periods. This shows that the public in a region that relies on foreign labour and has high mobility is more responsive to policies than the public in Belgium, France and Germany. Figure 8(b) shows that during the Free-contagious and Measures periods, public interest in local news was similar in all regions except France. Moreover, during the Decay period, while there is a downward trend (p < 0.05) in the total daily tweets volume, there is an upward trend (p < 0.05) in the CR of policy and daily life, except in Luxembourg, where the rate is consistently high. As can be seen from Figure 6 and Figure 3, the discussion volume about measures during the Pre-peak period may affect the length of the Free-contagious period (p < 0.05). Table 8 lists the anti-contagion policies in the different countries; the countries have implemented highly similar policies, including lockdowns, limits on gatherings and school closures. During the Pre-peak period, the CR of measures was much higher in GR (3.41%) and Luxembourg (7.62%) than in France (1.90%), Belgium (1.84%) and Germany (0.0%). It should be noted that discussion of measures is not actually non-existent in Germany, but the tweets volume may have been too small to be recognised as separate topics during topic modelling. From the results of the topic classification, it appears that the public in Germany, France and Belgium may have underestimated the severity of the pandemic during the Pre-peak period.
The underestimation can be interpreted as optimism bias, a cognitive bias that people often exhibit during a pandemic and that leads them to believe that they are less likely to be involved in negative events and that their exposure to disease is low [49, 50]. Our results show that even though the public in Belgium, France and Germany showed sustained and long-term concern on OSNs about COVID-19 occurring in China, optimism bias emerged when COVID-19 appeared locally, causing the public to ignore the emergence of the cases and to show low support for anti-contagion measures and government policies [51]. Thus, tweets about anti-contagion or treatment measures may serve as an indicator of whether the public is experiencing an optimism bias. To some extent, such discussion may shorten the Free-contagious period, and public knowledge may affect the impact of the policies. In this paper, we have studied COVID-19 related content on Twitter, introduced the idea of relational urbanisation, and chosen the Greater Region and its related countries as the research area. Our analyses focused on three research questions: can the Twitter posts volume be a predictor of COVID-19 daily cases during a long-term period (RQ1); are there certain patterns of social behaviour on OSNs at different periods of the pandemic, and how does GR differ from other countries (RQ2); and will more attention to the pandemic and anti-contagion measures in the early stages of the pandemic impact the effectiveness of control and protective measures deployed by governments (RQ3)?
From the spatio-temporal analysis of the correlation between tweets volume and new COVID-19 cases during the four periods of the pandemic, our general answer to RQ1 is that tweets volume can be a predictor of outbreaks only during the Pre-peak period, and that the lead between tweets volume and COVID-19 daily cases may be related to the lag between infection and the onset of symptoms. For RQ2, we have shown a certain pattern in the main categories of public discussion on social media during the pandemic. In the early stage, the public focused on news related to China. If the first case occurs in a country and there is no rapid outbreak, public concern is not diverted to anti-contagion and treatment measures, policies and local news until a complete outbreak, that is, until cases begin to occur daily in that region. GR, as a region with a large number of cross-border workers, has shown high interest in policies since the first case, even though no lockdown policy had been implemented at that time. At the same time, Luxembourg, where foreign residents make up 47.4% of the population, has shown great concern for policies, including travel policies, from the beginning of the pandemic, which is not found in other regions. For RQ3, our answer is that the discussion volume about measures during the Pre-peak period may affect and shorten the Free-contagious period. Our results can be used to understand social behaviour on OSNs during the pandemic, and the differences in social behaviour in relational urbanisation regions when facing the pandemic. We identify the best time to conduct pandemic precaution advocacy, which helps to generate policy support. There are still some limitations to our study. First, in our dataset, we did not take into account bots that post misleading information, which can lead to a possible bias in topic modelling and classification.
For our initial exploration of topic categories, we chose SVM to build a baseline method for topic classification. We will utilise other state-of-the-art text classification methods to refine the classification in further studies. Second, our case study has some statistical limitations, and data from more countries will be included in future studies to ensure the statistical significance of the conclusions. Third, more research can be performed based on our dataset. For example, in the future, we will conduct sentiment analysis on the tweets of different categories in each pandemic period to find out whether there is a certain pattern in the public's sentiment about the pandemic and how it differs between GR and other countries. For RQ3, multi-class sentiment analysis with BERT will be conducted to determine whether and to what extent people are optimistic or pessimistic about being affected by the pandemic during the Pre-peak period.

Finally, during the writing of this article, a second wave of COVID-19 began to appear in Luxembourg and some other countries. In a future study, we will conduct a comparative study focusing on the regions where the second wave occurred. Sentiment analysis and text classification with state-of-the-art methods will be deployed to investigate whether OSN information reflects public attitudes and behaviour. We will attempt to identify social behaviour that may lead to a second wave, such as laxity or resistance to policies and anti-infection measures. Such timely indicators are potentially useful for appropriate policy changes to avoid a new outbreak.
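As an illustration of the lead analysis behind the RQ1 answer above, the following is a minimal sketch of how a lead between daily tweet volume and daily new cases could be estimated. It assumes Pearson correlation as the statistic, which this section does not restate; the function names (`pearson`, `lead_correlations`, `best_lead`) and the toy series are hypothetical and are not taken from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if undefined)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant series has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def lead_correlations(tweets, cases, max_lead):
    """Correlate tweets on day t with cases on day t + lead, for each lead."""
    if len(tweets) != len(cases):
        raise ValueError("both series must cover the same days")
    out = {}
    for lead in range(max_lead + 1):
        paired_tweets = tweets[: len(tweets) - lead]  # drop the last `lead` days
        paired_cases = cases[lead:]                   # drop the first `lead` days
        out[lead] = pearson(paired_tweets, paired_cases)
    return out


def best_lead(tweets, cases, max_lead):
    """Return the lead (in days) with the highest defined correlation."""
    corrs = lead_correlations(tweets, cases, max_lead)
    defined = {k: v for k, v in corrs.items() if not math.isnan(v)}
    return max(defined, key=defined.get)


# Toy, made-up series (NOT data from the study): cases echo tweets 3 days later.
toy_tweets = [2, 5, 9, 14, 30, 22, 17, 11, 8, 6, 5, 4, 3, 3, 2]
toy_cases = [0, 0, 0] + [2 * v for v in toy_tweets[:-3]]
toy_corrs = lead_correlations(toy_tweets, toy_cases, max_lead=7)
toy_best = best_lead(toy_tweets, toy_cases, max_lead=7)
```

In practice, such a computation would be run separately for each country and each pandemic period. A stable positive lead that is close to the reported incubation period would be consistent with, though not proof of, the interpretation given for RQ1.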
The COVID-19 social media infodemic
A first look at COVID-19 information and misinformation sharing on Twitter
Using Twitter and web news mining to predict COVID-19 outbreak
Collaborative topic modelling for recommending scientific articles
Exploratory analysis of COVID-19 tweets using topic modelling, UMAP, and digraphs
"Infodemic": Leveraging high-volume Twitter data to understand public sentiment for the COVID-19 outbreak
COVID-19 on Social Media: Analyzing Misinformation in Twitter Conversations
Tracking public and private response to the COVID-19 epidemic: Evidence from state and local government actions
Evidence from Internet search data shows information-seeking responses to news of local COVID-19 cases
Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset
Retweeting for COVID-19
Mega regions and pandemics
Networked disease: Emerging infections in the global city
Extended urbanisation and the spatialities of infectious disease: Demographic change, infrastructure and governance
Relational cities disrupted: reflections on the particular geographies of COVID-19 For small but global urbanisation in
The role of tax havens and offshore financial centres in shaping corporate geographies: An industry sector perspective
The concept of Ro in epidemic theory
BERT: Pre-training of deep bidirectional transformers for language understanding
Latent Dirichlet allocation
LIBSVM: A library for support vector machines
Crisis communication best practices: Some quibbles and additions
Can Twitter predict disease outbreaks?
The effect of large-scale anti-contagion policies on the COVID-19 pandemic
Strong social distancing measures in the United States reduced the COVID-19 growth rate
Effectiveness of government policies in response to the COVID-19 outbreak
More effective strategies are required to strengthen public awareness of COVID-19: Evidence from Google Trends
Association of the COVID-19 pandemic with Internet search volumes: A Google Trends™ analysis
FastText.zip: Compressing text classification models
A biterm topic model for short texts
COVID-19: The First Public Coronavirus Twitter Dataset
Real time Bayesian estimation of the epidemic potential of emerging infectious diseases
Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures
Variation in government responses to COVID-19, Blavatnik School of Government Working Paper
Transmission dynamics of the COVID-19 outbreak and effectiveness of government interventions: A data-driven analysis
High population densities catalyse the spread of COVID-19
Staying at home: mobility effects of COVID-19
The effect of human mobility and control measures on the COVID-19 epidemic in China
World Health Organization
The incubation period of Coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: Estimation and application
Clinical characteristics and outcomes of patients undergoing surgeries during the incubation period of COVID-19 infection
Towards realtime measurement of public epidemic awareness: Monitoring influenza awareness through Twitter
Using TF-IDF to determine word relevance in document queries
Constrained k-means clustering with background knowledge
UMAP: Uniform manifold approximation and projection for dimension reduction
An analysis of the coherence of descriptors in topic modelling
Automatic evaluation of topic coherence
Clustering categorical data using silhouette coefficient as a relocating measure
SMOTE: synthetic minority over-sampling technique
Using social and behavioural science to support COVID-19 pandemic response
The optimism bias
Public support for government actions during a flu pandemic: lessons learned from a statewide survey

Acknowledgements. This work was partially supported by Luxembourg's Fonds National de la Recherche, via grant COVID-19/2020-1/14700602 (PandemicGR).