key: cord-1006967-2gzgp2vl authors: Zhang, Xinxuan; Maggioni, Viviana; Houser, Paul; Xue, Yuan; Mei, Yiwen title: The impact of weather condition and social activity on COVID-19 transmission in the United States date: 2021-11-11 journal: J Environ Manage DOI: 10.1016/j.jenvman.2021.114085 sha: a1f2c756b09ff920e2a33a76c383da8dbd06f725 doc_id: 1006967 cord_uid: 2gzgp2vl

The coronavirus disease 2019 (COVID-19) was first reported in December 2019 and rapidly spread worldwide. As with other severe acute respiratory syndromes, whether seasonality affects the spread of COVID-19 is a widely discussed topic. This study presents two approaches to analyse the impact of social activity factors and weather variables on daily COVID-19 cases at the county level over the Continental U.S. (CONUS). The first is a traditional statistical method, the Pearson correlation coefficient; the second is a machine learning algorithm, a random forest regression model. The Pearson correlation provides a first-order test of the relationship between COVID-19 cases and the weather variables or the social activity factor (i.e., the social distance index). The random forest regression model investigates the feasibility of estimating the number of county-level daily confirmed COVID-19 cases from different combinations of eight factors (county population, county population density, county social distance index, air temperature, specific humidity, shortwave radiation, precipitation, and wind speed). Results show that, according to the Pearson correlation method, the number of daily confirmed COVID-19 cases is weakly correlated with the social distance index, air temperature, and specific humidity. The random forest model shows that the estimation of COVID-19 cases becomes more accurate when weather variables are added as input data.
Specifically, the most important factors for estimating daily COVID-19 cases are the population and population density, followed by the social distance index and the five weather variables, with temperature and specific humidity being more critical than shortwave radiation, wind speed, and precipitation. The validation process shows that the correlation coefficients between the daily COVID-19 cases estimated by the random forest model and the observed ones are generally around 0.85.

The coronavirus disease 2019 (COVID-19) was first reported in December 2019 and rapidly spread worldwide. According to the World Health Organization (WHO), more than 174 million COVID-19 cases had been confirmed across 219 countries, areas, or territories as of early June 2021, and the number of COVID-19-related deaths exceeded 3.7 million. The global pandemic has affected our society in many ways. To better understand this challenging situation, a significant number of studies have investigated the dynamics of COVID-19 transmission (Kucharski et al. 2020, Sapkota et al. 2021) and the short- and long-term impacts of COVID-19 on people's lives, health conditions, and social activities (Goodell et al. 2020, Melo-Oliveira et al. 2021, Sonza et al. 2021, Joseph 2021).

The first COVID-19 case in the United States (U.S.) was identified in Washington state in January 2020, and transmission remained relatively slow throughout February of the same year. The daily number of confirmed cases in the U.S. started to increase dramatically in March until hitting its first peak in early April 2020. Then, the spread of COVID-19 in the U.S. slowed down due to the stay-at-home orders issued by most states. However, the daily number of confirmed cases began to rise again in mid-June as the states gradually reopened.
The number of daily confirmed cases started to decline again from August through October 2020, but a new extreme peak followed (November 2020 to early January 2021), with over 200,000 daily confirmed cases during the winter. As a plausible consequence of the effective COVID-19 vaccines that became available towards the end of 2020, the number of daily confirmed cases started to decrease in mid-January 2021. However, the sharply decreasing curve of daily confirmed cases flattened from mid-March 2021, even while the number of administered vaccine doses was increasing rapidly. The pandemic is still ongoing. As of early June 2021, about 63% of adults in the U.S. had received at least one dose of vaccine, but the average number of daily confirmed COVID-19 cases was still around 20,000 and the cumulative number of confirmed cases had already reached 33 million. More than 607,000 deaths had occurred in the U.S., indicating an overall COVID-19 death rate of 1.8%, whereas the death rate of seasonal influenza is usually below 0.1% according to a recent report from the World Health Organization (WHO, 2020).

The COVID-19 case rate (number of cases per million people) varies dramatically across the U.S. Beyond the socioeconomic differences among states and counties, it is speculated that the transmission mechanism of COVID-19 might be related to local meteorological conditions, as is the case for other respiratory viruses. Several studies have investigated the relationship between weather conditions and COVID-19 transmission. For instance, Wu et al. (2020) used a log-linear generalized additive model to explore the effect of temperature and humidity on COVID-19 transmission in 166 countries and found that temperature and relative humidity were negatively related to COVID-19 cases. Chen et al.
(2020) established a statistical model to estimate the number of COVID-19 cases from four weather variables (temperature, relative humidity, wind speed, and visibility) during the January to March 2020 time frame. The model-estimated case counts showed an acceptable correlation with the real counts based on data from 54 countries around the world. Wang et al. (2021) applied a Fama-MacBeth regression and found that the effective reproductive number of COVID-19 decreases with increasing air temperature and relative humidity, based on data from 100 Chinese cities and 1,005 U.S. counties from January to April 2020. Haque et al. (2020) focused on Bangladesh in the March-to-May 2020 period, using a linear regression framework to conclude that high temperature and humidity significantly reduce COVID-19 transmission. Another study in Bangladesh (Mofijur et al., 2020), using the Spearman rank correlation test, showed different results: only minimum and average temperatures had a significant relationship with the number of COVID-19 cases. More recently, He et al. (2021) studied 9 major Asian cities with generalized additive modeling (GAM) and Pearson correlation. The GAM analysis showed the number of daily COVID-19 cases to be positively associated with the weather variables (i.e., temperature and relative humidity), while the Pearson correlation showed that the relationships between COVID-19 cases and the weather variables can be either negative or positive depending on the city.

The results from previous studies all highlight the impact of weather variables such as temperature and humidity on the spread of COVID-19, though the conclusions may vary significantly; e.g., Wu et al. (2020) found a negative relationship between temperature and COVID-19 cases whereas He et al.
(2021) reported that an increase in temperature may yield an

2 Study area and Data

2.1 Study area

The study was carried out in 48 states across CONUS, including 3,142 counties and independent cities. The population density varies widely across counties (Figure 1

The nationwide county-level COVID-19 data were provided by the University of Maryland COVID-19 Impact Analysis Platform (https://data.covid.umd.edu), which was originally developed by Zhang et al. (2020, preprint). The variables used in this study include the number of new COVID-19 cases (NewC) and the social distance index (SDI) at the county level. Both variables were extracted for the period of March 23rd to September 1st, 2020. NewC represents the number of daily confirmed cases, i.e., people who tested positive for the coronavirus. The SDI is an integer that ranges from 0 to 100 and represents the extent to which residents and visitors are practicing social distancing. A value of zero indicates that no social distancing is observed in the community, while 100 indicates that all residents are staying at home and no visitors are entering the county (Zhang et al. 2020, preprint). Specifically, Zhang et al. (2020, preprint) defined the SDI as a combination of six mobility metrics, according to the following equation: where SH stands for Staying Home, which is the percentage of residents staying at home; RAT is the percentage Reduction of All Trips compared to a pre-COVID-19 benchmark; RBT is the Reduction of Business Trips (%); RNT is the Reduction of Non-business Trips (%); RDT is the percentage Reduction of Travel Distance; and ROT is the Reduction of Out-of-county Trips (%).
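To make the structure of the SDI concrete, the following Python sketch combines the six metrics defined above into a single score. Only the 0.8/0.2 split between resident and out-of-county trips is stated in the text; the inner weights (0.1, 0.2, 0.4, 0.3), the way the terms are nested, the clamping to the 0 to 100 range, and the function name are assumptions made for illustration and should not be read as the exact formula of Zhang et al. (2020, preprint).

```python
def social_distance_index(sh, rat, rbt, rnt, rdt, rot,
                          w_resident=0.8, w_out_of_county=0.2,
                          inner_weights=(0.1, 0.2, 0.4, 0.3)):
    """Combine six mobility metrics (each in percent, 0-100) into one score.

    w_resident and w_out_of_county follow the 0.8/0.2 split given in the text.
    inner_weights (for RAT, RBT, RNT, RDT) are ASSUMED values for illustration.
    """
    w_rat, w_rbt, w_rnt, w_rdt = inner_weights
    # Weighted trip reduction among residents who are not staying home
    # (non-business trips weighted above business trips, which are more essential).
    trip_reduction = w_rat * rat + w_rbt * rbt + w_rnt * rnt + w_rdt * rdt
    # ASSUMED nesting: trip reductions only apply to the share not staying home.
    resident_part = sh + 0.01 * (100.0 - sh) * trip_reduction
    sdi = w_resident * resident_part + w_out_of_county * rot
    # Keep the score inside the documented 0-100 range.
    return max(0.0, min(100.0, sdi))


# Hypothetical county-day: 30% staying home and moderate trip reductions.
example_sdi = social_distance_index(sh=30, rat=20, rbt=15, rnt=40, rdt=25, rot=35)
```

Under these assumptions the two end points described in the text are reproduced: all metrics at zero give an SDI of 0, and everyone staying home with no out-of-county trips gives 100.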
The weights are chosen based on the shares of resident and visitor trips (e.g., about 20% of all trips are out-of-county trips, which led to the selection of a weight of 0.8 for resident trips and 0.2 for out-of-county trips); on which trips are considered more essential (e.g., business trips are more essential than non-work-related trips); and on the principle that higher SDI scores should correspond to fewer chances for close-distance human interactions and virus transmission.

The county attributes used in this study, including boundary, area, and population, are collected

The and hourly spatial and temporal resolution, respectively. We processed the gridded hourly NLDAS-2 data from March 23rd to September 1st 2020 and obtained the county-level daily mean values of temperature, specific humidity, wind speed, shortwave radiation, and precipitation over CONUS.

The study is organized in two parts. The first part presents a traditional statistical analysis to explore the impact of each weather variable on COVID-19 transmission. In the second part of the study, we develop a machine learning algorithm. The overarching goal is to investigate which weather variable(s) can explain most of the COVID-19 transmission variability across the U.S.

This analysis assesses the Pearson correlation coefficients of each weather variable and SDI versus the number of COVID-19 cases per 1000 people (NewC1000) at the county level. NewC1000 is used here instead of NewC to eliminate the influence of county population on the results. In addition, to minimize the uncertainty of COVID-19 spreading due

density: i) greater than 10,000 people per square mile, ii) around 1,000 people per square mile, iii) around 500 people per square mile, iv) around 100 people per square mile, and v) around 10 people per square mile.
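As a minimal illustration of this step, the Python sketch below normalizes daily case counts to cases per 1000 people and computes a Pearson correlation coefficient against one weather variable for a single county. The function names and the sample numbers are hypothetical and are not taken from the study's data.

```python
import math


def cases_per_1000(new_cases, population):
    """Convert daily new case counts (NewC) to cases per 1000 people (NewC1000)."""
    return [1000.0 * c / population for c in new_cases]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant series
    return sxy / math.sqrt(sxx * syy)


# Hypothetical one-week record for a single county of 250,000 people.
daily_new_cases = [12, 18, 25, 22, 30, 41, 38]
daily_temperature_c = [14.2, 15.1, 17.8, 16.9, 19.3, 21.0, 20.4]

new_c1000 = cases_per_1000(daily_new_cases, population=250_000)
r_temperature = pearson_r(daily_temperature_c, new_c1000)
```

Repeating the same calculation for each weather variable and SDI, per county group and per period, would give the kind of coefficients discussed in the results.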
Moreover, as COVID-19 transmission might be largely influenced by human activities, policies, and social distancing, the study time frame is divided into three periods

Correlation coefficients used above are valid metrics when investigating linear or slightly nonlinear problems. Given the complicated nature and highly non-linear response of the COVID-

The RF model was trained with three versions of predictor list (

The estimated numbers of new COVID-19 cases from all three models for the validation data set are compared with the observed numbers of cases to determine which model gives the best prediction. As described in section 3.2, more than 151,000 county-level data records are included in the validation procedure. We adopt the normalized root-mean-square error (NRMSE) and the Pearson correlation coefficient (CORR) to evaluate model performance. The NRMSE can be calculated using

$$\mathrm{NRMSE} = \frac{1}{\overline{NewC_{obs}}} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( NewC_{est,i} - NewC_{obs,i} \right)^{2}} \qquad (3)$$

where $N$ is the number of county-level validation data records, $\overline{NewC_{obs}}$ is the mean of the NewC values in the validation dataset, and $NewC_{est}$ denotes the model-estimated values.

The CONUS total NewC values reached two peaks between January and September 2020 (Figure 2): one in late March and the other in mid-July. The spread of COVID-19 slowed mildly after the first peak, mainly due to the mandatory stay-at-home orders issued in late March in many states.

precipitation, e) shortwave radiation, and f) SDI from January 1st to September 1st 2020.

The correlation coefficients between NewC1000 and each of the weather variables and SDI are shown for different groups of counties in three periods at different public health intervention levels (Figure 4). In the first period (January 1st to March 22nd 2020), most of the weather variables show near-zero correlation with NewC1000. This is because COVID-19 had just started to spread across the U.S.
at that time. The number of COVID-19 cases was small and the health departments in most states had difficulty collecting real-time data. In the second period (March 23rd to May 10th 2020), although the stay-at-home orders had taken effect, confirmed COVID-19 cases remained at a relatively high level in many regions (Figure 2). The correlations between the weather variables and NewC1000 became more evident, except for precipitation, whose correlation remained close to zero. In the third period, the correlation between temperature/specific humidity and the NewC1000 became even

Overall, SDI has a more evident impact on COVID-19 transmission in densely populated regions. Among the five weather variables, temperature and specific humidity have a stronger influence on COVID-19 transmission than wind speed and shortwave radiation. Precipitation can be considered non-influential for COVID-19 transmission. These findings are consistent with the results of the RF regression model in section 4.3. Nevertheless, none of the correlation coefficients shown in Figure 4 (even the relatively high values) is high enough to demonstrate a convincing relationship between the weather variables or SDI and the number of COVID-19 cases. None of the graphs provides a strong and clear trend.

Finally, we chose the 50-tree RF model because it is accurate enough and more efficient than the 60-tree or 100-tree model settings.

Table 2 lists the predictor importance for the 8-predictor RF model. The importance of a predictor in the RF algorithm is estimated by how much the prediction error increases when the out-of-bag data for that predictor are permuted while all others are left unchanged. Specifically, for each tree in the forest, the model records the mean square error (MSE) of the prediction on the out-of-bag portion of the data, and then repeats the same procedure after permuting each predictor.
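The idea behind this permutation-based importance can be sketched in a few lines of Python. This is a simplified stand-in, not the RF implementation used in the study: it works on a single fitted predictor (here a hypothetical toy function) and averages the MSE increase over repeated shuffles, rather than over trees and their out-of-bag samples.

```python
import random


def mse(model, rows, targets):
    """Mean square error of model predictions over a set of records."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_repeats=20, seed=0):
    """Average MSE increase when one predictor column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between predictor j and the target
            permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(model, permuted, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical fitted model that uses predictor 0 and ignores predictor 1."""
    return 3.0 * row[0]


# Hypothetical records: (predictor 0, predictor 1) tuples.
rows = [(float(i), float((i * 7) % 5)) for i in range(8)]
targets = [toy_model(r) for r in rows]
importance = permutation_importance(toy_model, rows, targets)
```

In this toy case, shuffling the ignored predictor leaves the error unchanged, so its importance is zero, while shuffling the predictor the model relies on raises the error.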
Finally, the difference between the two MSEs is averaged over all trees and normalized by the standard deviation of the differences. The most important predictors in this RF model for estimating NewC are population and population density, followed by the SDI and the 5 weather variables. Temperature and specific humidity are more important than shortwave radiation, wind speed, and precipitation, with precipitation being the least important predictor in the RF model, which is consistent with the results of the

Considering the low importance of shortwave radiation, wind speed, and precipitation shown in the 8-predictor RF model, a 5-predictor RF model was trained without these 3 predictors. Furthermore, an RF model in which all weather variables were eliminated was trained with only 3 predictors (population, population density, and SDI). Normalized RMSEs for the three RF models (

Moreover, it is worth mentioning that the process of data collection, especially for the daily number of COVID-19 cases, is subject to several uncertainties. For example, the number of reported cases tends to decrease during weekends due to lab working schedules; testing or reporting delays can happen in many circumstances; and the case number can also be affected by the accessibility of testing resources in a region. All these factors introduce uncertainties in the original data set and thus affect the accuracy of the model.

The study presented a traditional statistical approach and a machine learning algorithm to analyse the impact of weather, population, and social activity factors on COVID-19 transmission in terms of daily COVID-19 cases (i.e., NewC) for all the counties over CONUS. Specifically, we considered 8 factors: county population, county population density, county SDI (i.e.
social distance index), and the 5 daily weather variables (air temperature, specific humidity,

The traditional statistical approach (i.e., correlation coefficients between each weather factor and NewC) shows weak correlation coefficients between NewC values and most of the weather factors, while precipitation does not show any correlation with NewC.

The machine learning approach adopts a random forest model to estimate county-level daily NewC values from the 8 factors mentioned above. Three versions of the RF model are tested using different subsets of the 8 factors (Table 1). The validation of the three RF models shows that the correlation coefficient between the daily COVID-19 cases estimated by the random forest model and the observed ones is generally around 0.85. Results also show that the most important predictors in the RF model for the NewC estimation are population and population density, followed by the SDI and the 5 weather variables. Temperature and specific humidity are more important than shortwave radiation, wind speed, and precipitation. Precipitation is the least important predictor in the RF model, which is consistent with the result of the traditional statistical approach.

Most weather variables in the RF models presented in this study are based on their daily maximum value, except for precipitation. The daily mean weather variables were also tested in the models but were found to be less accurate than the daily maximum weather variable-based model (Table 5).

To our current knowledge, this paper is among the few studies presenting COVID-19 transmission at the county level and daily scale across the entire U.S.
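For readers who want to reproduce the validation metric, the following Python sketch computes the NRMSE between estimated and observed NewC values, assuming the RMSE is normalized by the mean of the observed values. The function name and the sample numbers are hypothetical; CORR is the ordinary Pearson correlation between the same two series.

```python
import math


def nrmse(estimated, observed):
    """Root-mean-square error divided by the mean of the observed values."""
    if len(estimated) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    rmse = math.sqrt(sum((e - o) ** 2 for e, o in zip(estimated, observed)) / n)
    mean_observed = sum(observed) / n
    if mean_observed == 0:
        raise ValueError("mean of observed values is zero; NRMSE is undefined")
    return rmse / mean_observed


# Hypothetical validation records (observed vs. model-estimated daily NewC).
observed_new_c = [10, 0, 25, 40, 5]
estimated_new_c = [12, 1, 20, 43, 4]
validation_nrmse = nrmse(estimated_new_c, observed_new_c)
```

Because the error is divided by the mean observed case count, the metric is unitless and can be compared across models trained with different predictor subsets.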
We acknowledge that a single

COVID-19 pandemic and innovation activities in the global airline industry: A review
Random Forests
Classification and regression by randomForest
Predicting the Dynamics of the COVID-19 Pandemic in the United States Using Graph Theory-Based Neural
Pandemic on the Experience of Back Pain Reported on Twitter® in the United States: A Natural Language Processing Approach
COVID-19 and finance: Agendas for future research
Association between temperature, humidity, and COVID-19 outbreaks in Bangladesh
The Influence of Average Temperature and Relative Humidity on New Cases of COVID-19: Time-Series Analysis. JMIR Public Health and Surveillance
Early dynamics of transmission and control of COVID-19: a mathematical modelling study. The Lancet Infectious Diseases
Influenza virus transmission is dependent on relative humidity and temperature
Md Asraful Alam, and Md Uddin, 2020. Relationship between Weather Variables and New Daily COVID-19 Cases in Dhaka
Reported quality of life in countries with cases of COVID19: A systematic review
The Chaotic Behavior of the Spread of Infection during the COVID-19 Pandemic in the United States and Globally
COVID-19 Lockdown and the Behavior Change on Physical Exercise, Pain and Psychological Well-Being: An International Multicentric Study
Impact of temperature and relative humidity on the transmission of COVID-19: a modelling study in China and the United States
Coronavirus disease (COVID-19): Similarities and differences with influenza
Effects of temperature and humidity on the daily new cases and new deaths of COVID-19 in 166 countries
An interactive COVID-19 mobility impact and social distancing analysis platform. medRxiv