key: cord-0610623-nxu3x0d2
authors: She, Zhaowei; Wang, Zilong; Ayer, Turgay; Toumi, Asmae; Chhatwal, Jagpreet
title: Estimating County-Level COVID-19 Exponential Growth Rates Using Generalized Random Forests
date: 2020-10-31
journal: nan
DOI: nan
sha: dbfcdfe954b2a2a7f34eaf352d02e8125c133373
doc_id: 610623
cord_uid: nxu3x0d2

Rapid and accurate detection of community outbreaks is critical to address the threat of resurgent waves of COVID-19. A practical challenge in outbreak detection is balancing accuracy vs. speed. In particular, while estimation accuracy improves with longer fitting windows, speed degrades. This paper presents a machine learning framework to balance this tradeoff using generalized random forests (GRF), and applies it to detect county level COVID-19 outbreaks. This algorithm chooses an adaptive fitting window size for each county based on relevant features affecting the disease spread, such as changes in social distancing policies. Experiment results show that our method outperforms any non-adaptive window size choices in 7-day ahead COVID-19 outbreak case number predictions.

Early and accurate detection of community outbreaks is critical to address the threat of resurgent waves of COVID-19. Specifically, an epidemic outbreak is confirmed when its incident cases are estimated to grow exponentially. Furthermore, the potential impact of an outbreak is also measured by its exponential growth rate, as higher rates indicate more rapid disease spread. Last but not least, for a fixed epidemiology model (e.g. SIR, SEIR), there is an one-to-one correspondence between the exponential growth rate of an epidemic outbreak and its basic reproduction number R 0 , a common measure of intensity of epidemic outbreaks (Lipsitch et al., 2003) . Therefore, the exponential growth rate of an epidemic outbreak's incident cases is the most important "model-free" parameter to estimate for detecting the outbreak (Chowell et al., 2003) .

It remains an epidemiological challenge to obtain accurate exponential growth rate estimates for disease outbreaks (Ma et al., 2014) . Specifically, it is difficult to choose the fitting window size for the exponential growth rate estimation to balance the speed and accuracy of outbreak detection. On one hand, a longer fitting window is preferable as larger sample size would reduce variance of the exponential growth rate estimates of outbreaks.

On the other hand, shorter fitting windows are better at detecting early-stage outbreaks, especially if these outbreaks were driven by recent policy changes such as school reopening. In the current practice, this fitting window size is treated as a hyperparameter that is either directly specified by the user (c.f. The University of Melbourne (2020)) or determined by some cross validation methods (c.f. Chowell et al. (2007) ).

This paper develops a machine learning framework that balances the speed-accuracy tradeoff of outbreak detection via dedicated feature engineering and GRF (c.f. Athey et al. (2019)), and apply it to detect countylevel COVID-19 outbreaks. Specifically, the algorithm chooses an adaptive fitting window size for each county based on a rich set of features that affect the disease spread, such as face mask mandates, social distancing policies, the CDC's Social Vulnerability Index, changes in tests performed and rate of positive tests. Furthermore, for counties with insufficient data to capture the recent policy changes, the algorithm pools together all relevant incident case growth trends across U.S. counties and throughout the COVID-19 pandemic history to adjust for these policy changes.

During an epidemic outbreak, incident case number at county c ∈ C ⊂ N on day t ∈ T ⊂ N, i,e. I t,c , is governed by an exponential growth model, for all counties c ∈ C, while I 0,c captures the initial incident case numbers at the beginning of the exponential case growth at county c. To obtain the most recent exponential growth rate of county-level COVID-19 incident case number, we estimate the instantaneous counterpart of (1),

where the dependent variable is the loglinearized incident case number ln(I t,c ); independent variable is day t. The parameter of interest is the COVID-19 incident case exponential growth rate of county c at day t, r t,c . The intercept term, α t,c , is a parameter capturing the log-linearized initial incident case number of county c at the beginning of the outbreak. ε t,c is the error term. Notably, the exponential growth rate r t,c in (2) varies in both day t and county c. In other words, we are interested in estimating the instantaneous county-level exponential growth rate of COVID-19 incident cases. Specifically, since COVID-19 related regulations are changing every day in the U.S. (c.f. Raifman et al. (2020)), the county-level exponential growth rates, affected by these policies, also change from day to day. Therefore, in order to detect recent outbreaks, we need to estimate the most recent incident cases exponential growth rate. This instantaneous exponential growth rate r t,c , similar to the instantaneous reproduction number commonly used in the epidemiology literature (c.f. Fraser (2007); Cori et al. (2013) ), captures the projected exponential growth rate of incident cases in county c should the future COVID-19 regulations remain the same as those in day t.

The exponential growth rate of COVID-19 incident cases can be affected by many factors ranging from day-to-day changes in COVID-19 regulations to difference in population density and healthcare resources between counties. Hence, in order to estimate the instantaneous county-level exponential growth rate r t,c defined in (2), we need to control for these day-level and county-level heterogeneity. Provided that we have relevant features X t,c ∈ R m that captures these aforementioned factors affecting the instantaneous county-level exponential growth rate r t,c , 2 we can identify the conditional average partial treatment effect r t,c (X t,c ) := E[r t,c |X t,c ] as defined by Wooldridge (2010). When data of these relevant features affecting COVID-19 disease spread is available, we can rewrite (2) as

following the common "redundant" assumption (c.f. Wooldridge (2010)), i.e.

This section discusses how we estimate the exponential growth rate of COVID-19 incident cases r t,c (X t,c ) defined in (3). Specifically, we first formulate the estimation problem in §3.1 and then present our estimation algorithm in §3.2.

First, we need the following "unconfoudness" assumption (c.f. Rosenbaum and Rubin (1983)):

2. Refer to Appendix C for the list of relevant features used in this study.

That is, day t is independent of all unobservable heterogeneity conditional on the feature vector X t,c . Under Assumption 2, we can derive the following moment equations for (3):

However, we note that the moment equations (4) cannot identify the exponential growth rate r t,c (X t,c ). Specifically, the moment equation system (4) is underidentified as the number of unknown parameters {α t,c (X t,c ), r t,c (X t,c )} t∈T,c∈C is exactly two times the number of moment equations. To address this problem, we assume that the exponential growth rate of day t * equal the local average causal response (c.f. Abadie (2003) ; Angrist and Imbens (1995)) from day t * − 1 to day t * , i.e.

Parameters {α t,c (X t,c ), r t,c (X t,c )} t∈T,c∈C are exactly identified by the moment equations (4) under Assumption 3.

Given the above assumptions, the exponential growth function of COVID-19 incident cases (3) can be estimated using the GRF algorithm by Athey et al. (2019). Specifically, Assumption 3 implies that the exponential growth rate of a county c on day t * is identified by a targeted data "block"

Hence, when estimating r t * ,c (X t * ,c ), we can partition the panel dataset {(I t,c , X t,c )} t∈T,c∈C into |C| × t * 2 data blocks, and pool those data blocks "similar" to the targeted block to construct an adaptive window size for this estimation. Particularly, the similarity measure is provided by the GRF algorithm. In Appendix A, Algorithm 1 provides the pseudocode to construct these data blocks. Algorithm 2 explains how we feed these data blocks into the GRF algorithm to obtain the estimates of interest.

To benchmark our method against nonadaptive window size choices, we compare the Mean Absolute Percentage Error (MAPE) of these methods' 7-day ahead predictions. Specifically, we use the NYTimes COVID-19 Dataset (c.f. The New York Times (2020)) as the source of daily reported cases per county. Additional results for performance evaluation are available at Appendix D.

As shown in Figure 1 and Table 1 , our method outperforms methods with fixed 2-, 4-, 8-or 16-day fitting window sizes. Specifically, Figure 1 shows that our method provides a uniformly better performance roughly 100 days after the first recorded COVID-19 case in the NYTimes COVID-19 Dataset, when there were enough historical data for the GRF algorithm to conduct meaningful partition. Furthermore, Table 1 demonstrates that even if the earlyday MAPEs are included in the comparison, our method still has the best median MAPE among the 4 methods. Last but not least, we note that when only comparing the performance of non-adaptive window size choices, there are no obvious best choice. While choosing shorter fitting window sizes could in general lead to lower median MAPEs (c.f. Table 1) , it was still frequently outperformed by longer fitting window size choices (c.f. Figure 1 ). 

In this work, we developed a novel framework that allows GRF to adequately balance the the speed-accuracy tradeoff in COVID-19 outbreak detection. This estimation framework can be readily extended to other epidemiology problems where fitting window size impacts model performance.

Joshua D Angrist and Guido W Imbens. Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American statistical Association, 90 (430) 

3. Assign the outcome variable

4. Assign the treatment variable for day t *

5. Feed (X,Y,W) into the GRF algorithm

For this work, where we wish to provide useful estimates of case trajectories for epidemiologists and decision makers in the government, we ideally wish to measure the growth rate of active case numbers

where Y is the cumulative number of cases so far, D is the cumulative number of deaths, and R is the cumulative number of recovered cases. This intuitively captures the number of still "infectious" cases, as we can assume for COVID-19 that those who have recovered or died are no longer capable of infecting others. Unfortunately, as the number of recovered cases R t are no longer reported, we are unable to directly calculate the number of active cases. While other works (c.f. The University of Melbourne (2020)) approximate the number of recovered cases with

Here the underlying assumption is that those who were infected but not dead after 22 days have recovered. However, upon testing this assumption at the county-level, we find that this assumption does not hold as there will be some days where this newly approximated active case number becomes negative. We hence rely upon a commonly used proxy: the incident case count, I t , defined as

which is the number of new cases within a 22 day time period and is a useful proxy for infectious cases.

All feature data we used, i.e. {X t,c } t∈T,c∈C , are publicly available online. In this section, we briefly describe what these datasets are and how we incorporated them in our work.

The 2019 United States Census Gazetteer Files (c.f. United States Census Bureau (2019)) were used to obtain the geographic locations (latitude-longitude centroids) of each county officially registered by the United States Census Bureau. This provides a spatial feature space for the GRF to further split upon.

The Social Vulnerability Index (SVI) database (c.f. Centers for Disease Control and Prevention (2018)) is a compilation of socio-economic factors such as unemployment rate, poverty rate, education attainment level etc. at the county level. These features are also included in our methodology.

The CUSP database (c.f. Raifman et al.

(2020)) tracks when each state implemented and ended policies such as mask mandates, lockdowns, economic policies in response to the COVID-19 pandemic. As such, it is a vector of features capturing daily policy changes in each state. As these policies is state-wide, we naturally extend them to the respective county level.

From the COVID Tracking Project (c.f. The Atlantic (2018)), we obtained the he daily numbers of PCR, Antibody and Antigen tests performed and their positivity rates at each state. These are used as features in our framework as well.

Appendix D. Additional Results for Performance Evaluation D.1. MAPE Plots 

Semiparametric instrumental variable estimation of treatment response models