title: Whether the Weather Will Help Us Weather the COVID-19 Pandemic: Using Machine Learning to Measure Twitter Users' Perceptions
authors: Gupta, M.; Bansal, A.; Jain, B.; Rochelle, J.; Oak, A.; Jalali, M. S.
date: 2020-08-01
DOI: 10.1101/2020.07.29.20164814

Objective: The potential for weather to affect SARS-CoV-2 transmission has been an area of controversial discussion during the COVID-19 pandemic. Individuals' perceptions of the impact of weather can inform their adherence to public health guidelines; however, there has been no measure of these perceptions. We quantified Twitter users' perceptions of the effect of weather and analyzed how they evolved with respect to real-world events and time. Materials and Methods: We collected 166,005 tweets posted between January 23 and June 22, 2020 and employed machine learning/natural language processing techniques to filter for relevant tweets, classify them by the type of effect they claimed, and identify topics of discussion. Results: We identified 28,555 relevant tweets and estimate that 40.4% indicate uncertainty about weather's impact, 33.5% indicate no effect, and 26.1% indicate some effect. We tracked changes in these proportions over time. Topic modeling revealed major latent areas of discussion. Discussion: There is no consensus among the public about weather's potential impact. Earlier months were characterized by tweets that were uncertain of weather's effect or claimed no effect; later, the portion of tweets claiming some effect of weather increased. Tweets claiming no effect of weather comprised the largest class by June. Major topics of discussion included comparisons to influenza's seasonality, President Trump's comments on weather's effect, and social distancing. Conclusion: There is a major gap between scientific evidence and public opinion of weather's impacts on COVID-19. We provide evidence of the public's misconceptions and topics of discussion, which can inform public health communications.

Since the beginning of the outbreak, one of the major questions has been whether the transmission of SARS-CoV-2 is seasonal, as with influenza, 1 MERS, 2 or SARS. 3 While there was limited research and consensus at the beginning of the pandemic on the impact of weather and seasonality on the transmission of SARS-CoV-2, 4-12 a growing body of evidence has suggested that the effect of weather conditions is modest and that weather alone is not sufficient to quench the pandemic. 13 Despite this (limited) academic consensus, what the public thinks is unknown, which motivated our research.

As COVID-19 has disrupted the global population, many have turned to social media platforms such as Twitter to navigate the pandemic. While Twitter's effectiveness at disseminating information can be leveraged to share public health information for social good, it can also promote misinformation. 14 As the virus continues to spread, chatter online has increased in volume, and one particularly contentious topic of discussion surrounds the myth that heat can effectively kill the virus. 15 While it is not uncommon for public opinion to contradict the scientific literature, the continuous debate, uncertainty, and lack of consensus among experts exacerbated this specific public misconception. 16, 17
As public comments are good predictors of individuals' behaviors, measuring and analyzing the social perception of the weather's impact on COVID-19 may help predict adherence to public health policy and guidelines. This study examined Twitter users' perceptions concerning the weather's effect on the spread of COVID-19 with natural language processing and machine learning techniques. Specifically, the research objectives were to identify: (1) the perceived impact of weather in relevant tweets and classify them accordingly, and (2) if and how these perceptions changed throughout the pandemic. To investigate these, we trained a support vector machine classifier to measure what proportion of tweets claim there is an effect of weather and examined time-series trends for a subset of relevant tweets. To detect perceptions outside of this effect-oriented framework, we employed unsupervised learning to discover unexpected discussion topics. This study is one of many to use machine learning and natural language processing to retrieve information about public perception through social media for public health purposes, 18, 19 but the first to study the perception of the weather's impact on COVID-19. We hope that this work can inform public policy and research as the COVID-19 pandemic response continues.

Using Twitter's Premium application programming interface (API) for historical search, we collected 166,005 tweets from January 23 to June 22, 2020 with the query "(coronavirus OR covid OR covid19) AND weather." This query checked all tweet components for a match, including the tweet's text, the text of any attached articles or media, and any URL text included with the tweet. We collected only original or quoted tweets written in English, not retweets. We also did not limit the data to any specific location. For tweets replying to or quoting another tweet, we fetched the text of the other tweet. For tweets sharing an article, we collected the article headline and description as displayed on Twitter. The tweet text, article data, and any replied-to/quoted tweets were then merged for analysis. Figure 1 presents our research method and the flow of its processes, which are discussed below.

Initially, we cleaned tweets by removing any non-alphanumeric characters (including emojis), mentions of other users, and hashtags at the end of the tweet; we then further standardized the text with lemmatization and stemming (see Supplementary S2 for more details). Following common techniques used for social media analysis in other domains, 20 we employed rule-based filtering to narrow down our corpus and remove noise. The rule-based filtering consisted of three rules applied sequentially. First, we filtered out false positives arising from the sheer popularity of our keywords (e.g., a tweet commenting on pleasant weather and ending with "#coronavirus") and removed tweets where the keywords were split across different parts of the tweet (e.g., "weather" only appearing in the article text, and "covid" only in the tweet itself). Second, we discarded tweets using "weather" as a verb or idiomatically (e.g., "under the weather"). Finally, we restricted the tweets to those posted by individuals, not news organizations, since individual perception was the focus of the study. The effectiveness of these three rules was verified manually (see S3).
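As an illustration of how such sequential rule-based filters could be implemented, the following is a minimal sketch; the keyword lists, the individual/organization heuristics, and the helper names are illustrative assumptions rather than the study's exact code (see S3 and the GitHub repository for the actual rules).

```python
import re

COVID_TERMS = ("coronavirus", "covid", "covid19")          # assumed keyword list
WEATHER_IDIOMS = ("under the weather",)                     # assumed idiom list
NEWS_BIO_WORDS = ("news", "breaking", "headlines")          # assumed organization cues
PERSON_BIO_WORDS = ("reporter", "journalist", "i", "my")    # assumed individual cues

def query_match_after_cleaning(tweet_text: str) -> bool:
    """Rule 1: both a COVID term and 'weather' must survive cleaning in the tweet text itself."""
    text = tweet_text.lower()
    return "weather" in text and any(t in text for t in COVID_TERMS)

def weather_used_as_noun(tweet_text: str) -> bool:
    """Rule 2: drop idiomatic or verbal uses of 'weather' (e.g., 'under the weather')."""
    text = tweet_text.lower()
    if any(idiom in text for idiom in WEATHER_IDIOMS):
        return False
    # crude verbal-use check; a POS tagger would be more robust
    return not re.search(r"\bweather(ing|ed)?\s+(the|this|that)\s+(storm|crisis|pandemic)", text)

def posted_by_individual(user_bio: str) -> bool:
    """Rule 3: keep accounts that look like individuals rather than news organizations."""
    bio = user_bio.lower()
    if any(w in bio.split() for w in PERSON_BIO_WORDS):
        return True
    return not any(w in bio for w in NEWS_BIO_WORDS)

def passes_filters(tweet_text: str, user_bio: str) -> bool:
    """Apply the three rules sequentially; a tweet is kept only if it passes all of them."""
    return (query_match_after_cleaning(tweet_text)
            and weather_used_as_noun(tweet_text)
            and posted_by_individual(user_bio))
```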
We used machine learning to further reduce the corpus to tweets that expressed insightful relationships between weather and the spread of COVID-19. To create training data for the classifier, two annotators (JR, BJ) labeled a set of tweets based on predefined inclusion criteria, which defined a tweet as relevant if it referenced a causal or correlative relation between weather and coronavirus spread, and irrelevant otherwise. Tweets presenting a causal relationship declared the weather to have a direct impact on the spread of COVID-19 (e.g., high temperatures killing the virus), while a correlative relationship declared an indirect impact (e.g., reduced social distancing during pleasant weather). Irrelevant tweets mentioned weather and COVID-19 but did not establish a relationship between them (e.g., extreme weather causing additional strain in hard-hit areas). Annotators marked a shared pilot set of 100 tweets to calibrate on these criteria. After resolving any discrepancies, annotators labeled a full set of training data for our machine learning classifiers.

Text featurization was used to convert tweets into meaningful vectors for machine learning analysis. Three vectorization techniques were used: Bag of Words (BOW), Term Frequency-Inverse Document Frequency (TF-IDF), and Embeddings from Language Models (ELMo), a state-of-the-art technique that utilizes word embeddings. 21 ELMo factors in the surrounding context of each word (i.e., the words around it) for its vectorization, while BOW and TF-IDF do not. 22 For BOW and TF-IDF, we removed stop words (a set of commonly used words that do not contribute to the context of a tweet) as well as words that appeared in 1% of all tweets or fewer.

We tested 11 classification models for performance on relevancy classification: Ridge Classifier, Logistic Regression, k-Nearest Neighbors, Support Vector Machine, Logistic Regression with Gradient Descent, Support Vector Machine with Gradient Descent, Multinomial Naïve Bayes, Complement Naïve Bayes, Bernoulli Naïve Bayes, Random Forest Classifier, and Decision Trees (see S5). We used Scikit-learn's machine learning libraries. 23 We performed five-fold outer cross-validation on our training dataset to select the optimal model, with five-fold inner cross-validation to find the ideal hyperparameters (see S4). For each of our models, we evaluated and reported the Area under the Precision-Recall curve (AUC-PR) and the Area under the Receiver Operating Characteristic curve (AUC-ROC); for definitions, see 24. Both metrics are presented, but we chose to optimize with respect to AUC-PR since it provides a better assessment of model performance for imbalanced datasets, where AUC-ROC can be overly optimistic. [24] [25] [26] We took the best-performing model to be our "Relevancy Classifier," which produced the corpus for analysis, both for the claimed effect of weather and for topics of discussion.
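As an illustration of the nested cross-validation described above, the following is a minimal scikit-learn sketch; the candidate models, hyperparameter grids, and variable names are simplified assumptions, not the study's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# texts: list of merged tweet/article strings; labels: 1 = relevant, 0 = irrelevant (assumed inputs)
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1, 10]}),
    "sgd_svm": (SGDClassifier(loss="hinge"), {"clf__alpha": [1e-5, 1e-4, 1e-3]}),
}

def nested_cv_scores(texts, labels):
    """Estimate AUC-PR for each candidate model with 5-fold outer / 5-fold inner cross-validation."""
    scores = {}
    for name, (model, grid) in candidates.items():
        pipe = Pipeline([
            ("tfidf", TfidfVectorizer(stop_words="english", min_df=0.01)),
            ("clf", model),
        ])
        # inner 5-fold CV tunes hyperparameters; outer 5-fold CV estimates generalization performance
        inner = GridSearchCV(pipe, grid, cv=5, scoring="average_precision")
        outer = cross_val_score(inner, texts, labels, cv=5, scoring="average_precision")
        scores[name] = outer.mean()  # mean AUC-PR (average precision) across outer folds
    return scores
```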
To classify tweets based on the type of effect the user expected the weather to have on the spread of COVID-19, we trained another machine learning classifier. We first annotated a new batch of tweets (distinct from the relevancy annotation set) based on whether they claimed weather to have some effect and used this as training data. After calibrating on a pilot set of 200 tweets, annotators (JR, BJ, MG) first labeled tweets into one of three categories: "effect," where the tweet suggested that weather had an impact on COVID-19; "no effect," where the tweet suggested weather had no impact; and "uncertain," where the tweet was uncertain of the effect or made no clear claim about an effect. Additionally, within the "effect" category, tweets were labeled based on whether the tweet suggested COVID-19 would: i) improve with warmer weather, ii) worsen with warmer weather, iii) improve with cooler weather, or iv) worsen with cooler weather. This class scheme assumed that temperature was the key driver of discussions; we found this to be representative of discussion on Twitter as well as the main focus of academic literature on the weather's impact. 4, 5, 7, 8 The inclusion, for instance, of both "improve with warmer weather" and "worsen with cooler weather" was to avoid any assumption of a linear effect of temperature, given that non-linear effects have been documented. 13 For qualitative analysis, the annotators recorded the mechanisms users reported for the weather's impact on coronavirus spread, such as sunlight destroying the virus. These mechanisms provided insight into the theories of the weather's impact being discussed and are reported in the Discussion. For our Effect Classifier, the same machine learning techniques were used as for our Relevancy Classifier (as described above), with one modification: for the trinary classification, we optimized with respect to balanced accuracy, since AUC-PR and AUC-ROC do not extend to multiclass problems.

To extract unexpected topics of discussion, we performed unsupervised learning to cluster the tweets and determined topics through inspection of the clusters. After removing repeated tweets (not retweets) and attached article data, we used k-means clustering to group tweets into k clusters; other methods, specifically k-medoids and latent Dirichlet allocation, 27 were also explored (see S7). Clustering was performed on the same TF-IDF vectors generated for the effect analysis, and cluster sizes of k = 10, 15, 20, 25, and 30 were tested. Each cluster was associated with an output of the top 20 keywords, based on the highest TF-IDF scores. Outputs from each of the clustering configurations were inspected manually for the cohesiveness of topics. The data pipeline is displayed in Figure 1, with inspiration taken from Ong et al. 28

Overall, rule-based filtering reduced the corpus from 166,005 to 84,201 tweets. For relevancy classification, annotators labeled a random sample of 2,786 tweets, on which the Relevancy Classifier was trained. Then, for effect classification, a random sample of 2,442 relevant tweets (out of 28,555) was annotated per the effect class and annotation scheme introduced earlier, with results shown in Table 1. Our Relevancy Classifier identified 28,555 tweets discussing the weather's impact on COVID-19.

The 2,442 annotated tweets were separated according to their effect label (effect, no effect, uncertain) and plotted in Figure 3. Using the manual annotations for our Effect Classifier, we attempted to predict the perception of a tweet according to the three classes. However, the multiclass scheme proved too difficult for the models to solve (see S5); after collapsing our class scheme to a binary "effect" vs. "no effect/uncertain" (combining those two categories), the performance of the model improved (see Table 2 and S5). We still present these results to show that the model did learn to identify effect to an extent, accomplishing our goal of identifying perception even after limiting our analysis to the coarser class scheme.
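To make the binary collapse concrete, the following is a minimal sketch of mapping the three annotated classes onto "effect" vs. "no effect/uncertain" before retraining; the label strings and the annotation data structure are assumptions.

```python
from collections import Counter

# Assumed annotation format: list of (tweet_text, label) with label in {"effect", "no effect", "uncertain"}.
COLLAPSE = {
    "effect": "effect",
    "no effect": "no effect/uncertain",
    "uncertain": "no effect/uncertain",
}

def collapse_labels(annotations):
    """Map the trinary effect annotations onto the coarser binary scheme and report class balance."""
    texts, binary = [], []
    for text, label in annotations:
        texts.append(text)
        binary.append(COLLAPSE[label])
    counts = Counter(binary)
    total = sum(counts.values())
    # For a random classifier, AUC-PR equals the positive-class prevalence,
    # so the "effect" share here is the baseline the classifier must beat.
    proportions = {cls: counts[cls] / total for cls in counts}
    return texts, binary, proportions
```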
The AUC-PR and AUC-ROC of this binary Effect Classifier are reported in Table 2; for reference, a baseline classifier (one that randomly predicts the class) has an AUC-PR of 0.261 (the proportion of the "effect" class in Table 1) and an AUC-ROC of 0.5.

The optimal configuration for k-means clustering was k = 25, which retrieved clear topics of discussion (see S7). After dropping 4,803 repeated tweets, we clustered 23,752 tweets. Twenty-four of the assigned clusters produced clearly delineated topics, while the remaining cluster was vague and contained general comments about weather and coronavirus. Figure 4 displays a heatmap tracking discussion frequency across ten selected topics over time. Boxes in the heatmap are shaded only for weeks where a topic exceeded its average level of discussion in the corpus, which allows for meaningful interpretation of when a topic is more active than usual.

Figure 4, in which cluster topic frequencies are plotted over time, shows trends in discussions about the weather's impact on the spread of COVID-19. From January to February, there is a high frequency of discussion about cold weather and the flu, as these months exhibit both cold temperatures and the flu season, and the seasonality of COVID-19 was being discussed in reference to these topics. This was followed by an increase in discussion about reports made by scientific experts, from January 30 to March 19, about the weather's impact on the spread of COVID-19, as the virus was just beginning to spread globally and its seasonal behavior was unknown. Simultaneously, there was an increase in discussion about Trump's comments from February 13 to 27, on April 9, and after April 23, following the same pattern seen in Figure 2, where the three illustrated peaks occurred. The high frequency of the Trump cluster shows the impact of the President's statements and their constant relevance throughout the discussion of the weather's impact on the spread of COVID-19.

It is also interesting to note that the social distancing cluster did not appear in Figure 4 until April 2 and increased in frequency from May 7 to 28. This is likely because discussion about social distancing was not prevalent until after the nationwide lockdowns in the United States in late March, and the discussion increased as the weather got warmer and people were more tempted to avoid social distancing guidelines. Similarly, discussion about social distancing peaked the same day discussion about Trump peaked, on April 23, when the White House promoted new evidence about heat possibly slowing the spread of COVID-19. This is curious, as many users claimed that heat would not slow the spread of COVID-19; only social distancing would. Using clustering to reveal these topics helped us understand which conversations were generating the greatest public response, allowing researchers to look into why these particular topics around the weather's impact on COVID-19 stood out.

The clustering analysis revealed a structure to the data beyond the effect class framework that we pursued for the supervised learning. For instance, comparisons to the seasonality of influenza formed a notably large topic, yet sample tweets from that topic exhibited entirely different claims about the effect of weather (see S7). Therefore, our decision to include both our supervised and unsupervised analyses was validated by the different characteristics of the data revealed by each approach, which together enabled us to understand Twitter chatter.
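For concreteness, the following is a minimal sketch of the k-means topic clustering over TF-IDF vectors described in the Methods, with the top keywords of each cluster drawn from the centroid weights; the parameter values and variable names are illustrative assumptions rather than the study's exact code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_tweets(texts, k=25, top_n=20):
    """Group tweets into k clusters and return the top keywords for each cluster."""
    vectorizer = TfidfVectorizer(stop_words="english", min_df=0.01)
    X = vectorizer.fit_transform(texts)

    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(X)

    terms = np.array(vectorizer.get_feature_names_out())
    top_keywords = {}
    for c in range(k):
        # the highest-weighted terms in a cluster centroid approximate that cluster's topic
        order = np.argsort(km.cluster_centers_[c])[::-1][:top_n]
        top_keywords[c] = terms[order].tolist()
    return labels, top_keywords
```

The returned keyword lists are what a human inspector would read to name each cluster (e.g., influenza comparisons, Trump's comments, social distancing).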
During the manual annotation of tweets for effect, annotators recorded users' proposed mechanisms for the impact of weather, which are of interest as they exhibit potential misconceptions or unfounded theories. Some users who expected warm weather to decrease coronavirus spread discussed the following mechanisms: sunlight increasing vitamin D levels and boosting immune response to the virus; hot weather destroying the viral capsid; and higher malaria resistance in populations with warmer climates correlating with resistance to COVID-19. Conversely, some users believed that warm weather could negatively impact the pandemic due to an increased temptation to avoid social distancing guidelines, increased transmission through air conditioning units or higher humidity, and decreased compliance with recommended personal protective equipment. These mechanisms demonstrate that in the absence of consensus among experts, speculative theories can take hold on social media. Understanding the drivers of this information can inform the public health response to the pandemic. From an NLP perspective, automatically detecting causal mechanisms from text could be integrated into opinion mining to summarize perceptions more quickly. 32

This research is subject to limitations. As mentioned, the tri-class problem of "effect," "no effect," and "uncertain" proved too difficult for machine learning. Indeed, part of this arose from annotator difficulty in separating "no effect" and "uncertain" tweets. Several tweets were found to straddle the border of these two categories, partially due to the similarity of words across the "no effect" and "uncertain" tweets. This partly explains why collapsing these two categories into one improved our analysis performance enough to present results, and our adjusted Effect Classifier was able to successfully recognize users who claimed an effect. An additional limitation of the effect annotation scheme was that we did not label the magnitude of the effect. As a result, we lose the nuance of whether tweets are claiming a strong, impactful effect of the weather or a weak, inconsequential one. One solution would be to annotate for a 'weak' or 'strong' effect or to assign a numerical score for the strength of effect; with more ample training data, it is plausible that a model could successfully learn which tweets claim a strong effect.

One significant language pattern that helped train our NLP analysis was the use of certain geographical locations to support a claim. For example, annotators noticed that warm locations, such as Florida and Singapore, were typically mentioned by users as counterexamples to undermine the possibility that warm weather would reduce the spread of COVID-19, and the names of these locations became negative predictors for the "effect" class. Of course, not all mentions of warm locations in the data were part of a counterexample, which exhibits one limitation of our model. Additionally, the Effect Classifier found the mention of "Trump" to be an accurate predictor for the "no effect/uncertain" class; this was largely due to sarcastic responses to Trump's February predictions of the weather's impact. Future directions include improving the performance of the Effect Classifier to detect more nuances of language, such as sarcasm and tone, which confused our models in some instances and are well documented as difficult for machine learning models. 33
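As an illustration of how such predictive tokens can be surfaced, the following is a minimal sketch that lists the most positive and most negative coefficients of a linear (SVM-style) effect classifier trained on TF-IDF features; the pipeline and variable names are assumptions, not the study's exact code.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

def top_predictive_tokens(texts, labels, n=15):
    """Fit a linear classifier and return the strongest positive and negative token predictors."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(texts)
    # labels: 1 = "effect", 0 = "no effect/uncertain" (assumed encoding)
    clf = SGDClassifier(loss="hinge", random_state=0).fit(X, labels)

    terms = np.array(vectorizer.get_feature_names_out())
    order = np.argsort(clf.coef_[0])
    negative = terms[order[:n]]          # tokens pushing toward "no effect/uncertain"
    positive = terms[order[-n:]][::-1]   # tokens pushing toward "effect"
    return positive.tolist(), negative.tolist()
```

Inspecting the negative list is how terms such as warm-location names or "trump" would show up as predictors of the "no effect/uncertain" class.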
Our analyses revealed a surprising variety in conversations discussing potential seasonal impacts on COVID-19. The discussion went beyond the temperature-centered effect framework we chose and revealed broader beliefs about the impact of weather. For instance, the discussion around warm weather tempting the public to violate social distancing guidelines was unexpected, and points to an effect that has not yet been considered by researchers and could furthermore be modeled. Similarly, the presence of alternative facts, such as increased air-conditioning use during warmer months worsening spread or increased transmission through mosquitos, raises the question of how many people subscribe to them. With these results in mind, social media can be used to crowdsource such mechanisms and provide topics for study in order to address public misconceptions. Especially during a pandemic, when everything is novel and unsettling for most, understanding public opinion is crucial for public health. In the future, computational methods could be used to detect the public's opinion in real time from social media to prepare pandemic responses. This study showed not only that detecting public opinion on social media is possible, but also that careful attention should be paid to the individuality of perception and to how misconceptions can be addressed.

No funding was used to conduct this study. MG and BJ conducted pilot testing for data collection, and MG and AB worked jointly on final data collection, preparation, and machine learning classification analyses. JR, BJ, and MG annotated training data and contributed to qualitative analyses of data. MG designed the topic analysis, for which BJ, AO, AB, and MG wrote code and BJ executed.

We tested queries consisting of two parts: a term describing COVID-19 ('coronavirus,' 'covid,' or 'covid19') and a combination of weather keywords ('weather', 'temperature', 'climate', 'humidity'). Queries were directly tested on Twitter's website. The inclusion of 'weather' as a search term was necessary given the topic, and upon manual inspection, we determined that the queries with the keywords 'temperature' and 'climate' returned a relatively low volume of tweets with largely irrelevant results. Results including 'temperature' tended to focus on symptoms associated with coronavirus infection, such as fevers and chills. Results including 'climate' tended to focus on discussion regarding the intersection between COVID-19 and climate change. Results for 'humidity' were often related but had low volumes. We decided not to include any additional terms due to the low volumes and to reduce the amount of filtering required. We made use of the Twitter search operators (see https://developer.twitter.com/en/docs/tweets/search/guides/premium-operators) to return tweets in English, and only original tweets, not retweets. The operators for doing so were "-is:retweet" and "lang:en," which are included in the query. Twitter provides a Counts Endpoint under their premium tier of service to return the number of tweets matching a query over a given time span (see https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search). This number is an upper bound, since the count may include deleted tweets that will not be returned by an actual query. The Counts Endpoint returned a count of 174,987 for the query "(coronavirus OR covid OR covid19) weather -is:retweet lang:en" between January 23, 2020 and June 22, 2020.
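A minimal sketch of how such a counts request could be made is shown below. The endpoint path, date format, and response fields follow Twitter's premium full-archive search documentation as it stood at the time of the study; they are assumptions here and should be verified against the current API documentation, and the bearer token and environment label are placeholders.

```python
import requests

BEARER_TOKEN = "..."          # placeholder credential from a premium developer account
DEV_ENV_LABEL = "research"    # placeholder dev-environment label

# Assumed endpoint path for premium full-archive search counts (check current docs before use).
COUNTS_URL = f"https://api.twitter.com/1.1/tweets/search/fullarchive/{DEV_ENV_LABEL}/counts.json"

payload = {
    "query": "(coronavirus OR covid OR covid19) weather -is:retweet lang:en",
    "fromDate": "202001230000",   # January 23, 2020
    "toDate": "202006220000",     # June 22, 2020
    "bucket": "day",              # daily tweet counts
}

resp = requests.post(COUNTS_URL, json=payload,
                     headers={"Authorization": f"Bearer {BEARER_TOKEN}"})
resp.raise_for_status()
daily_counts = resp.json()["results"]                 # assumed field: list of per-day counts
total = sum(bucket["count"] for bucket in daily_counts)
print(total)  # upper bound on matching tweets (may include since-deleted tweets)
```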
For tweets sharing an article or quoting/replying to another tweet, if available, we gathered the article headline and description and the quoted/replied-to tweet text to factor into the analyses. Twitter provides some of this information in the returned tweet object: the article headline/description and the quoted tweet object are included, as well as the ID of the replied-to tweet. Using Twitter's API, we collected all such replied-to tweets and linked them to their referencing tweet. A small number of fetched tweets contained URLs of news articles but did not have any attached news article data. To ensure we had all possible data, we visited each of these URLs and extracted the headline/description from the website's HTML. Websites containing this information include a tag for Twitter in their HTML, which we searched for to ensure that we collected the article data as presented on Twitter. Code for this collection is available in the GitHub repository.

Before feeding data into any of these steps, we first preprocessed tweets with the following five steps:
1) Removing any HTML, non-ASCII text, and emojis from the tweet text.
2) Removing any references to popular weather channels, which were picked up by our query and were generally false positives.
3) Removing any "mentions," where a user tags another user, from tweets. The exception to this was President Trump's Twitter account: @realDonaldTrump was replaced with "Trump."
4) Removing any trailing hashtags, where trailing hashtags are a chain of hashtags present at the end of a tweet meant to associate the tweet with a topic. Any hashtags within the middle of the tweet were kept, with the hashtag symbol removed.
5) Standardizing tokens to lower-case form and standardizing them further with lemmatization (mapping each word to its root form, e.g., "running" to "run") and stemming (removing suffixes from words, e.g., "chairs" to "chair").
For more details, see the GitHub repository.

The decision to replace Trump's Twitter handle with his name was made to provide additional context for tweets referring to him. As shown in the clustering results, Trump's February comments on weather's impact drove a substantial amount of discussion on Twitter about the weather's potential impacts, with several tweets referring to him by name or by his account handle. Given his status as a public figure and the lingering discussion around his comments, we standardized references to him to one form in order to better track discussion relating to him.

As mentioned in the article, we employed rule-based filtering to narrow down our corpus and remove noise. The rule-based filtering consisted of three rules applied sequentially, each of which discards a tweet if it fails the rule. Figure S1 gives a high-level overview of the mechanics of each rule, and comments on the rules follow.

Figure S1: Logic for the three rule-based filters. Rules are applied sequentially. In this context, "true" means we kept the tweet and "false" means we discarded it. A "cleaned tweet/article" is one that has been processed as described in S2.
Query Match After Cleaning: This rule filtered out false positives due to the sheer popularity of our keywords (e.g., a tweet may comment on pleasant weather and end with "#coronavirus") and removed tweets where the keywords were split across different parts of the tweet (e.g., "weather" appeared in the article text, and "covid" in the tweet itself), as such tweets were generally found to be irrelevant. This rule dropped 32,054 tweets.

Weather Usage: This rule assessed tweet relevancy by limiting the dataset to tweets using the desired form of the keyword "weather" as a noun, not as a verb or in an idiom (e.g., "under the weather"). This rule dropped 42,387 tweets.

Drop Organization: This rule restricted the tweets to those posted by individuals, who were the focus of study, not news organizations. It detected organizations by checking whether the user's bio contained a news-organization keyword, and by checking whether the bio mentioned an individual-occupation keyword (e.g., "reporter") or a first-person pronoun. This rule dropped 7,372 tweets.

To assess our rule-based filters, a sample of 200 tweets discarded at each step was inspected to look for false negatives. Here, false negatives are tweets that were truly relevant but were discarded by the rule. The "Query Match After Cleaning" filter had 16 false negatives (false negative rate = 16/200 = 8%). The "Weather Usage" filter had 13 false negatives (false negative rate = 13/200 = 6.5%). The "Drop Organization" filter had 12 false negatives (false negative rate = 12/200 = 6%). The low false negative rates for all three filters suggest that the filters did not discard a significant number of relevant tweets.

The high performance of all classifiers for the relevancy classification raises the question of whether the problem is too simple for machine learning and could be replaced by simple rule-based logic. To explore this, a rule-based classifier was written after manual inspection of the various keywords and their respective weights produced by the ML algorithm. In our simple rule-based relevancy classifier, a tweet was classified as relevant if it contained, after lemmatizing and stemming, any of the following words: ('kill', 'outbreak', 'away', 'scientific', 'study', 'distance', 'death', 'report', 'help', 'slow', 'curb', 'reduce', 'increase'), and none of the following: ('plan', 'countries', 'state', 'home', 'forecast', 'closed', 'pleasant', 'beautiful', 'nice', 'beaches'), which were generally found to be in unrelated tweets. Table S1 shows a confusion matrix for the attempted rule-based classifier. The overall rule-based accuracy was 0.67, suggesting that our manual rules were not good enough to replace machine learning. Indeed, if there were a simple rule-based method for the problem, a decision tree likely would have discovered it. [1]
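The keyword rule above can be written directly as code; the following is a minimal sketch, where the lemmatization/stemming step is assumed to have already been applied to the input tokens.

```python
RELEVANT_WORDS = {'kill', 'outbreak', 'away', 'scientific', 'study', 'distance', 'death',
                  'report', 'help', 'slow', 'curb', 'reduce', 'increase'}
EXCLUDED_WORDS = {'plan', 'countries', 'state', 'home', 'forecast', 'closed',
                  'pleasant', 'beautiful', 'nice', 'beaches'}

def rule_based_relevant(stemmed_tokens) -> bool:
    """Classify a tweet as relevant if it contains any inclusion keyword and no exclusion keyword."""
    tokens = set(stemmed_tokens)
    return bool(tokens & RELEVANT_WORDS) and not (tokens & EXCLUDED_WORDS)
```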
As mentioned in the article, for each tweet collected, the tweet text, the article headline and description (if any), and the quoted or replied-to tweet (if any) were merged into one text sample to be analyzed by our classifiers. The text in addition to the tweet text, which includes any article text and any quoted or replied-to tweet text, is referred to as our "reference" text, as this additional text provides necessary context for the tweet text. Below we further describe the motivation for this reference text. The reason for including referenced texts in the Relevancy Classifier was to provide additional context for judging whether the tweet is related to our study. The inclusion aided annotators in deciding the class label, and so the same information was supplied to the Relevancy Classifier. The referenced texts were also included for the Effect Classifier to account for tweets that endorsed the opinion stated in the referenced text (such as tweets that share and comment on a news article). One potential issue that was discussed was disagreement between the user's tweet and the referenced text, but during annotation the effect class of the user's text and the referenced text rarely differed. Due to this, and for simplicity (i.e., to avoid having to predict the opinion of each text part and then aggregate them), we predicted the classification of "effect" vs. "no effect/uncertain" based on all text. For the type and direction of effect (e.g., "improve with warmer weather"), the similar issue of disagreement between the user and the referenced text also arose. To handle this, annotators recorded the type and direction of effect based on the user's opinion, which was separated from the referenced text. Because we did not attempt machine learning on the type and direction of effect, the issue of what data to supply the model never arose.

As mentioned in the article, three featurization methods (Bag of Words (BOW), Term Frequency-Inverse Document Frequency (TF-IDF), and Embeddings from Language Models (ELMo)) were used to generate vectorized inputs for the machine learning models. A fourth featurization method, Word2Vec, was briefly explored. Word2Vec is a word-embedding model that maps each word in a text corpus to a vector in some high-dimensional space, where the representation of the word depends on the context surrounding it. [2] Once a representation is built for each word in a text corpus, a document (tweet) can be represented as the average of its words' vectors. While Word2Vec seemed promising due to its ability to factor in context, preliminary results showed that Word2Vec's performance was subpar, potentially due to the short, limited context provided by tweets. Therefore, we did not formally include it in our analysis.

Figure S2 shows a schematic of the training process for the two classifiers used. For our binary classifications, we optimized with respect to average precision (AUC-PR); for the trinary classification ("effect" vs. "no effect" vs. "uncertain"), we optimized with respect to balanced accuracy. After a first round of inner cross-validation, the top features are selected from one of the best initial classifiers, Gradient Descent SVM. Top features were determined by their coefficients for the initial Gradient Descent SVM classifier and were kept if their coefficient was above 1e-5 in absolute value. The machine learning hyperparameters are listed in Table S2.

The Relevancy Classifier metrics are listed in Table S3. The three-way Effect Classifier metrics are listed in Table S4; these are the results of the Effect Classifier with the effect scheme "effect," "no effect," and "uncertain." The two-way Effect Classifier metrics are listed in Table S5; these are the results of the Effect Classifier after grouping together the "uncertain" and "no effect" categories, to have a classifier that categorized tweets as either "effect" or "the rest" (Table S5: Effect Classifier metrics using groups "effect" vs. "the rest").
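A minimal sketch of how the reported metrics (AUC-PR via average precision, AUC-ROC, and balanced accuracy) can be computed with scikit-learn from held-out predictions is shown below; the variable names are placeholders.

```python
from sklearn.metrics import average_precision_score, balanced_accuracy_score, roc_auc_score

def report_metrics(y_true, decision_scores, y_pred):
    """
    y_true: ground-truth binary labels (1 = "effect") for the held-out fold.
    decision_scores: continuous classifier scores (decision_function output or predicted probabilities).
    y_pred: discrete predicted labels; balanced accuracy also extends to the trinary labeling.
    """
    return {
        "AUC-PR": average_precision_score(y_true, decision_scores),
        "AUC-ROC": roc_auc_score(y_true, decision_scores),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    }
```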
Based on the day-wise distribution of the tweets, we found the ten days with the most tweets and reported the most discussed news article from each day. These headlines, with the corresponding days, are shown in Table S6. In Table S6, where the top ten days with the highest volume of tweets were pulled, it is clear that most of the dates with the highest volume of tweets occurred in the period from March 13th to March 22nd, when there was an overall increase in Twitter activity. Therefore, to isolate key events (and not just pull the top ten days by volume) and identify the dates where Twitter activity significantly increased, we identified the top ten local maxima (i.e., peaks) in the tweet volumes. After identifying these local maxima, we pulled the most shared headline from each day where the maximum occurred. Out of the ten local maxima, three key peaks were presented in Figure 2 in the article. These three key peaks were the only maxima that corresponded to real-world events. The top ten local maxima and the corresponding headlines are shown in Table S7.

We tested k = 10, 15, 20, 25, and 30 on three clustering algorithms: k-means, k-medoids, and latent Dirichlet allocation (LDA). We used term frequency-inverse document frequency weighting over ELMo contextual vectors because the former approach retains a mapping from features to words, allowing the clusters to be more interpretable. [3] The decision to choose k = 25 for the clustering algorithm was determined primarily by a qualitative approach. Under k-means clustering, qualitative analysis showed that the topic clusters were not well separable at k < 25 or k > 25. Specifically, with k < 25, every cluster had similar volume and largely consisted of generic tweets about the influence of weather on coronavirus spread. With k > 25, one or two clusters dominated, and these clusters were comprised of generic tweets, while the other clusters, although less generic, maintained a low volume. High k values tended to reflect noise in the data, especially phrases used commonly in tweets that did not provide any common topic linking the clusters together. Thus, we did not test beyond k = 30. At k = 25, we deduced important topics (e.g., social distancing, Trump administration, influenza comparisons) that maintained a significant volume (see Table S8).

The k-medoids algorithm failed to provide separation at any value of k; we speculate that this is because the method anchors a cluster's centroid on one of the tweets, creating several low-volume clusters with repeated tweets and one high-volume cluster with generic tweets. The LDA algorithm also failed to provide separation at any value of k; we speculate that this is because the corpus was already significantly narrowed during the relevancy classification phase, and it may therefore be difficult for LDA to distinguish topics within a corpus discussing relatively similar content. These conclusions are supported by Table S9, which highlights each algorithm's ability to separate topics of discussion into distinct clusters. It is clear that the k-means algorithm appropriately partitions text throughout all clusters, while k-medoids and LDA assign the majority of the text to one cluster.
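As a rough way to quantify the separation behavior summarized in Table S9 (k-means spreading tweets across clusters versus k-medoids and LDA collapsing most tweets into one cluster), the following sketch computes the share of documents assigned to the largest cluster; the assumption is simply that each method's output is available as an array of cluster labels.

```python
import numpy as np

def largest_cluster_share(labels) -> float:
    """Fraction of documents in the single largest cluster (values near 1.0 indicate poor separation)."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / counts.sum()

# Example usage with assumed label arrays from each method:
# largest_cluster_share(kmeans_labels), largest_cluster_share(kmedoids_labels), largest_cluster_share(lda_topic_labels)
```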
Sample tweets from the corpus:
- Since it appears than sunlight and warm humid weather kills COVID, these people were probably safer at the pool and golf course than they were in a cool, dry dark home anyway.
- Will we be saved by summer in the northern hemisphere? Probably not.
- He's so dumb he probably wants to hold out 'til April weather warms up, and then Covid-19 will magically go away.
- Interesting theory about warmer weather slowing down the spread of the Covid19.
- The notion that warm weather will slow the coronavirus will be proven one way or the other in Houston in the next 2 weeks.
- Warm weather won't slow the spread of COVID-19, but closing schools and maintaining physical distancing will, suggests a study led by St.

References:
- Absolute Humidity and Pandemic Versus Epidemic Influenza
- Climate factors and incidence of Middle East respiratory syndrome coronavirus
- A climatologic investigation of the SARS-CoV outbreak in Beijing
- Temperature dependence of COVID-19 transmission. medRxiv
- Climate affects global patterns of COVID-19 early outbreak dynamics. medRxiv
- Analysis of meteorological conditions and prediction of epidemic trend of 2019-nCoV infection in 2020. medRxiv
- Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia
- Seasonality and uncertainty in COVID-19 growth rates. medRxiv
- The role of absolute humidity on transmission rates of the COVID-19 outbreak
- Temperature, humidity, and wind speed are associated with lower Covid-19 incidence. medRxiv
- Role of temperature and humidity in the modulation of the doubling time of COVID-19 cases. medRxiv
- Humidity and Latitude Analysis to Predict Potential Spread and Seasonality for COVID-19
- The Modest Impact of Weather and Air Pollution on COVID-19 Transmission. medRxiv
- Media Use and Communication Inequalities in a Public Health Emergency: A Case Study of 2009-2010 Pandemic Influenza A Virus Subtype H1N1
- A first look at COVID-19 information and misinformation sharing on Twitter
- Will heat kill the coronavirus
- Will Coronavirus Pandemic Diminish by Summer?
- Proceedings of the first workshop on social media analytics
- On the importance of comprehensible classification models for protein function prediction
- Distributed Representations of Words and Phrases and their Compositionality
- Tools and approaches for topic detection from Twitter streams: survey

We thank Yicheng Wang and Elizabeth Mason, who provided feedback and suggestions on earlier versions of this manuscript. We also thank Catherine DiGennaro for her contribution in framing the research project. None declared.