key: cord-0889596-yxgp9uur authors: Zhu, Shixiang; Bukharin, Alexander; Xie, Liyan; Yamin, Khurram; Yang, Shihao; Keskinocak, Pinar; Xie, Yao title: Early Detection of COVID-19 Hotspots Using Spatio-Temporal Data date: 2021-05-31 journal: IEEE Journal on Selected Topics in Signal Processing DOI: 10.1109/jstsp.2022.3154972 sha: 8609c88d9377547bfd1e427f40d3fc3cdce18acb doc_id: 889596 cord_uid: yxgp9uur Recently, the Centers for Disease Control and Prevention (CDC) has worked with other federal agencies to identify counties with increasing coronavirus disease 2019 (COVID-19) incidence (hotspots) and to offer support to local health departments to limit the spread of the disease. Understanding the spatio-temporal dynamics of hotspot events is of great importance to support policy decisions and prevent large-scale outbreaks. This paper presents a spatio-temporal Bayesian framework for early detection of COVID-19 hotspots (at the county level) in the United States. We assume both the observed number of cases and the hotspots depend on a class of latent random variables, which encode the underlying spatio-temporal dynamics of the transmission of COVID-19. Such latent variables follow a zero-mean Gaussian process, whose covariance is specified by a non-stationary kernel function. The most salient feature of our kernel function is that deep neural networks are introduced to enhance the model's representative power while still enjoying the interpretability of the kernel. We derive a sparse model and fit the model using a variational learning strategy to circumvent the computational intractability for large data sets. Our model demonstrates better interpretability and superior hotspot-detection performance compared to other baseline methods. (Fig. 1: Examples of COVID-19 hotspots identified by CDC [1]. In general, these hotspots define the onset of a local outbreak of COVID-19.)
The Centers for Disease Control and Prevention (CDC), together with other federal agencies, has identified counties with a significant increase in COVID-19 incidence (hotspots) [1], which offers a unique opportunity to investigate the spatio-temporal dynamics between the identified hotspots. Fig. 1 gives real examples of hotspots in four counties in the state of Georgia. The identified hotspots indicate relative temporal increases in confirmed cases and mark the onset of local outbreaks. In this paper, we propose an effective COVID-19 hotspot detection framework that utilizes the hotspot data and multiple other data sources, including community mobility, to enhance hotspot detection accuracy. We assume the hotspots and the number of cases at the same location depend on common prior factors, represented by a latent spatio-temporal random variable. This latent variable is modeled by a Gaussian process, whose covariance is characterized by an interpretable non-stationary kernel. We note that the non-stationarity of our kernel plays a pivotal role in the success of our model because the spread of the virus shows heterogeneous spatial correlation across different regions. For example, the virus is likely to spread more slowly in a sparsely populated area such as rural Nebraska compared to a densely populated area such as New York City. We formulate our kernel function using carefully crafted feature functions incorporating neural networks, which provide greater flexibility in capturing the complex dynamics of the spread of COVID-19 while still being highly interpretable. To tackle the computational challenge of the Gaussian process with a large-scale data set, we also derive a sparse model and fit the model efficiently via a variational learning strategy. The remainder of the paper is organized as follows. We first discuss the literature relevant to COVID-19 hotspot detection and other related work. We describe the data sets in Section II.
We introduce the proposed hotspot detection framework in Section III. We present an efficient computation strategy and the learning algorithm for our detection framework in Section IV. Finally, we present the interpretation of our model and the numerical results on COVID-19 data in Section V. Related work. Hotspot detection is closely related to the prediction of confirmed cases. Therefore, we first review some prediction methods for completeness. Compartmental models are mathematical models of infectious diseases and have been widely used in epidemiology. In the simple SIR model [4], the population is assigned to compartments with labels S (susceptible), I (infectious), and R (recovered), respectively. The transition rates between compartments are typically modeled using differential equations. Extensions and variants of SIR models include the SIRD model [5], [6], which considers deceased individuals, and the SEIR model [7]-[9], which considers exposed periods, to name a few. Compartmental models work well when applied to large regions/populations, such as a state or a country, because they assume a fixed/closed population. However, populations in different geographic areas, such as counties, may interact with each other, which matters for the desired high-resolution modeling. Therefore, we use a spatio-temporal model that is more flexible and can capture the spread between different counties. Much work has been done on predicting the number of COVID-19 cases and deaths at the national or state level, without considering the spatial correlation across smaller regions [10]-[16]. Machine learning-based approaches have also been considered in [17]. Some work [18] attempts to use neural networks to model the cumulative number of confirmed cases. Recurrent neural network-based methods [19], [20] have been applied to model the temporal dynamics of the COVID-19 outbreak.
Moreover, online COVID-19 forecasting tools include the COVID-19 simulator [21] and the COVID-19 Policy Alliance developed by a group at MIT. In this paper, hotspot detection is a binary classification problem, which differs in nature from the regression investigated by these studies. Besides accurate prediction and detection, understanding the spatial spread underlying the COVID-19 outbreak is also of great importance. An interpretable spatial model can help the government develop efficient public health policies to slow the spread during the early stages of COVID-19. Compared with the literature that focuses on prediction, studies evaluating the spatial spread of the COVID-19 pandemic are still limited [22]. Previously, the spatial spread has been studied for the outbreak of severe acute respiratory syndrome (SARS) in Beijing and mainland China [23]-[27] using only limited or localized data. In [28], the multivariate Hawkes process has been applied to model the conditional intensity of new confirmed COVID-19 cases and deaths in the U.S. at the county level, without considering the influence from the big cities (main transportation hubs) and other important demographic factors. In [29], two types of county-level predictive models are developed, based on exponential and linear models, respectively; that work focuses on modeling the dynamics of cumulative death counts. In [30], graph neural networks are adopted to capture the spatio-temporal dynamics between various features; however, a common disadvantage of neural network-based methods is their lack of interpretability, which hinders further understanding of the mechanism underlying the spread of COVID-19. Few studies have so far investigated COVID-19 hotspots and their early detection. Similar to the CDC's definition, a recent study [31] considers a sudden increase in the number of cases in a specific geographical region.
Unlike hotspot detection, that study focuses on estimating disease prevalence using logistic regression based on both symptoms and swab test results. In [32], hotspots are defined as the spots with the highest incidence rates. That paper adopts statistical and spatial analysis to determine the spatial distribution and spatial clustering patterns of the COVID-19 incidence rate; to identify the COVID-19 hotspots, the Getis-Ord spatial statistic [33] was then applied. In another work [34], topological data analysis was applied to identify hotspots of COVID-19 infections, defined as regions with higher case counts than their surrounding areas. However, no quantitative results, such as the estimation of confirmed cases or the prediction of future hotspots, were provided in [34]. Many studies have used Gaussian processes for COVID-19 case prediction. In [35], a Gaussian process regression model is applied to mortality rate prediction in India. Unlike our model, [35] does not consider any spatial factors in the spread of COVID-19 and instead predicts cases at the national level. Some recent work [36], [37] has used Gaussian process models with a squared exponential kernel to forecast cases in the United States. However, these works provide state- and national-level forecasts, which are less granular than the city- or county-level forecasts produced by our model. In addition, [36], [37] use a stationary kernel, meaning that the kernel cannot adapt to different spatial patterns in different locations, unlike the non-stationary spatial kernel discussed in this paper. The data sets used in our study include the number of cases and deaths, COVID-19 hotspots identified by the Centers for Disease Control and Prevention (CDC), and community mobility provided by Google. The study period is from March 15, 2020, to January 17, 2021, consisting of 50 weeks and 3,144 US counties.
We excluded the data after February 2021, when a large-scale COVID-19 vaccine rollout had been launched across the United States, which significantly shifted the dynamics of the COVID-19 spread. Confirmed cases and deaths. We used the data set from The New York Times (NYT) [38], which includes two parts: (i) confirmed cases are counts of individuals whose coronavirus infections were confirmed by a laboratory test and reported by a federal, state, territorial, or local government agency; (ii) confirmed deaths are individuals who have died and meet the definition for a confirmed COVID-19 case. In practice, we have observed periodic weekly oscillations in daily reported cases and deaths, which could have been caused by testing bias (higher testing rates on certain days of the week). To reduce such bias, we aggregate the number of cases and deaths of each county by week. Hotspots. On May 7, 2020, the CDC and other federal agencies began identifying counties with increasing COVID-19 incidence (hotspots) to better understand transmission dynamics and offer targeted support to health departments in affected communities. The CDC identified hotspots daily, starting on January 22, 2020, among counties in U.S. states and the District of Columbia, by applying standardized criteria developed through a collaborative process involving multiple federal agencies [1], [39]. In general, hotspots were defined based on relative temporal increases in the number of cases. To match the temporal resolution with the number of cases and deaths, we expand the definition of a hotspot from the daily level to the weekly level. A week is identified as a hotspot if it contains at least one hotspot day identified by CDC. The weekly number of counties meeting hotspot criteria peaked in early April, decreased and stabilized during mid-April-early June, then increased again during late June-early July.
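The weekly aggregation described above can be sketched as follows. This is an illustrative helper (not the authors' code): daily counts are summed into weeks, and a week is labeled a hotspot whenever it contains at least one CDC-flagged hotspot day.

```python
# Sketch of the daily-to-weekly aggregation: sum case counts over each
# 7-day window, and mark a week as a hotspot if any day in it is flagged.

def to_weekly(daily_counts, daily_flags):
    """Aggregate daily case counts (sum) and daily hotspot flags (any) by week."""
    weekly_cases, weekly_hotspots = [], []
    for start in range(0, len(daily_counts), 7):
        weekly_cases.append(sum(daily_counts[start:start + 7]))
        weekly_hotspots.append(1 if any(daily_flags[start:start + 7]) else 0)
    return weekly_cases, weekly_hotspots
```

This also mirrors how the weekly labels inherit the CDC daily criteria: a single flagged day suffices.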
The percentage of counties in the South and West Census regions meeting hotspot criteria increased from 10% and 13%, respectively, during March-April to 28% and 22%, respectively, during June-July. Fig. 2 gives snapshots of the identified hotspots at two particular weeks. Community mobility. The COVID-19 Community Mobility Reports [40] record people's movement by county daily, across various categories such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential. The data show how visits to (or time spent in) categorized places change compared to the baseline days (in percentage). A negative percentage means that the level of mobility is lower than the baseline, and a positive percentage represents the opposite. The mobility on a baseline day represents a normal value for that day of the week; this mobility report sets the baseline as the median value over the five weeks from January 3rd to February 6th, 2020. Similar to the two data sets mentioned above, we aggregate each county's mobility data by week. Examples of two categories, transit stations and workplaces, are shown in Fig. 3. This section presents our hotspot detection framework, consisting of two spatio-temporal models: confirmed cases and hotspots. Consider weeks 𝒯 = {t = 1, . . . , T} starting from March 15, 2020 to January 17, 2021 and locations (counties) ℐ = {i = 1, . . . , I}, with latitude and longitude s_i ∈ S ⊂ R^2, i ∈ ℐ, where S represents the space of the geographic coordinate system (GCS). The two models, respectively, focus on the weekly confirmed cases y_it ∈ Z_+ and the identified hotspots h_it ∈ {0, 1} of COVID-19 at location i ∈ ℐ and time t ∈ 𝒯, where h_it = 1 if there is a hotspot at location i and time t, and 0, otherwise.
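The mobility baseline convention above can be made concrete with a small sketch (a hypothetical helper, not Google's code): the baseline for a given day of the week is the median over the January 3 - February 6, 2020 window, and reported values are percent changes from it.

```python
# Sketch of the mobility percent-change convention: a value is the percent
# deviation of a day's visits from the day-of-week baseline median.
import statistics

def mobility_percent_change(visits, baseline_window):
    """Percent change of visits vs the median of the baseline window."""
    baseline = statistics.median(baseline_window)
    return round(100.0 * (visits - baseline) / baseline, 1)
```

For instance, a day with half the baseline visits yields -50.0, matching the "negative means lower than baseline" reading in the report.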
CDC [1] defined the hotspots based on relative temporal increases in the number of cases, i.e., the occurrence of the hotspots depends on the spatio-temporal correlation across different locations and over time, and not on the mean number of cases (see the observation in Fig. 1). Hence, we capture the correlation between y_it and h_it by connecting these two models in the spatio-temporal space (s, t) through a latent spatio-temporal random variable f(s, t), characterized by a Gaussian process (GP) with zero mean and covariance specified by a kernel function k. The goal is to find the optimal pair of these two models that best predicts the hotspots and the cases one week ahead. We refer to the proposed framework as the spatio-temporal Gaussian process (STGP). For notational simplicity, we first denote the spatio-temporal coordinate (s, t) by x ∈ 𝒳, where 𝒳 = S × 𝒯 ⊂ R^3 represents the spatio-temporal space. For any subset X ⊆ 𝒳 of spatio-temporal coordinates, the set of function variables f ≜ {f(x)}_{x∈X} has a joint zero-mean Gaussian distribution f ∼ N(0, K_XX), where K_XX is an n × n matrix whose entries are pairwise evaluations of k(x, x′), ∀x, x′ ∈ X. Case model. We define a spatio-temporal model for the confirmed cases in the following form: y_it = μ_it + f(x_it), where μ_it is the mean number of confirmed cases at time t in location i. For a set of observed spatio-temporal coordinates X, we denote the numbers of confirmed cases and their means as y ≜ {y_x}_{x∈X} and μ ≜ {μ_x}_{x∈X}, respectively. We assume the mean number of confirmed cases at a particular location relates to the covariates in its nearby counties according to an underlying undirected graph G = (ℐ, ℰ), where ℐ is the set of vertices representing all the locations, and ℰ ⊆ {(i, j) ∈ ℐ^2} is a set of undirected edges representing the connections between locations. There is an edge between two vertices whenever the corresponding locations are geographically adjacent.
Let x_it ≜ [x_it,1, . . . , x_it,d]ᵀ ∈ R^d denote the data of these covariates at location i ∈ ℐ and time t ∈ 𝒯, and let β ≜ [β_1, . . . , β_d]ᵀ ∈ R^d denote the parameters of the corresponding covariates; d denotes the number of features. In practice, we use the number of confirmed cases, the number of deaths, and six community mobility variables in the past two weeks as the input covariates, with d = 16. Formally, we define μ_it as μ_it = Σ_{j:(i,j)∈ℰ or j=i} βᵀ x_jt, where x_jt stacks the covariates over the recent history H_t = {t′ : t − τ ≤ t′ < t} with memory depth τ < t. Hotspot model. We express the conditional probability of the hotspots h ≜ {h_x}_{x∈X} for a set of spatio-temporal coordinates X as: p(h | f) = Π_{x∈X} Bern(h_x | σ(f(x))), where Bern(h | σ(f(x))) = σ(f(x))^h (1 − σ(f(x)))^{1−h} is the likelihood of the Bernoulli distribution and σ is a sigmoid function. Learning objective. We aim to detect the hotspots while taking advantage of the information recorded in the number of confirmed cases. To this end, we learn the model by optimizing the following combined objective: max_{θ∈Θ} ℓ_h(θ) + γ ℓ_c(θ), where γ > 0 controls the ratio between the two objectives and θ ∈ Θ is the set of parameters defined in the kernel k. Here ℓ_c(θ) ≜ log p(y) denotes the log marginal likelihood of the observed confirmed cases and ℓ_h(θ) ≜ log p(h) denotes the log marginal likelihood of the observed hotspots. We note that the log marginal likelihood of cases in the second term plays a key role in "regularizing" the model by leveraging the information in the case records, as shown in Appendix A. We also present the 5-fold cross-validation that quantitatively measures the F1 score of the hotspot detection and the mean squared error of the case prediction with different γ in Fig. 15 (Appendix A). The result confirms that an appropriate choice of γ can significantly improve the performance of hotspot detection. We discuss the choice of the kernel function in this subsection. Standard GP models use a stationary covariance, in which the covariance between any two points is a function of their Euclidean distance.
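The combined objective above can be sketched numerically. This is an illustrative stand-in under assumed notation (not the authors' implementation): the hotspot term is the Bernoulli log-likelihood through a sigmoid link, and the case term is stubbed as a Gaussian log-likelihood of the case residuals around the mean plus the latent value; the default gamma = 1e-5 echoes the value reported later for the fitted model.

```python
# Sketch of the combined objective l_h + gamma * l_c with a Bernoulli
# hotspot likelihood sigma(f)^h (1 - sigma(f))^(1 - h) and a Gaussian
# stand-in for the case likelihood.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hotspot_loglik(h, f):
    """Sum of Bernoulli log-likelihoods over points; h in {0, 1}."""
    return sum(hi * math.log(sigmoid(fi)) + (1 - hi) * math.log(1.0 - sigmoid(fi))
               for hi, fi in zip(h, f))

def case_loglik(y, mu, f, noise=1.0):
    """Gaussian log-likelihood of cases around mu + f (illustrative stand-in)."""
    return sum(-0.5 * math.log(2 * math.pi * noise ** 2)
               - (yi - mi - fi) ** 2 / (2 * noise ** 2)
               for yi, mi, fi in zip(y, mu, f))

def combined_objective(h, y, mu, f, gamma=1e-5):
    return hotspot_loglik(h, f) + gamma * case_loglik(y, mu, f)
```

Setting gamma = 0 recovers the hotspot-only likelihood, which is exactly the regularization role the case term plays in the combined objective.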
However, stationary GPs fail to adapt to variable smoothness in the function of interest. This is of particular importance in geophysical and other spatial data sets, in which domain knowledge suggests that the function may vary more quickly in some parts of the input space than in others. For example, COVID-19 is likely to spread more slowly in sparsely populated regions than in densely populated ones. Here, we consider the following non-stationary spatio-temporal kernel: k(x, x′) = ν(t, t′) Σ_{l=1}^{L} w^(l)(s) w^(l)(s′) k^(l)(s, s′), where ν(t, t′) is a stationary kernel that captures the temporal correlation between times t and t′; k^(l)(s, s′) is a component of the non-stationary spatial kernel which evolves over the space, and w^(l)(s) is the corresponding weight satisfying Σ_{l=1}^{L} w^(l)(s) = 1; L is the number of components considered. Likening the relationship between the spatial kernel components to that of the Gaussian components in a Gaussian mixture, we seek to enhance the representative power of our kernel by adding more independent components to the spatial kernel. Stationary temporal kernel. We define the kernel function that characterizes the temporal correlation between t, t′ ∈ 𝒯 as a stationary Gaussian function: ν(t, t′) = exp(−(t − t′)^2 / σ^2), where σ ∈ R_+ is the bandwidth parameter. This kernel function hypothesizes that the virus's transmission is highly related to its recent history and that their correlation decays exponentially over time. Non-stationary spatial kernel. To account for non-stationarity, we now allow the smoothing kernel to depend on the spatial location s. For ease of discussion and simplicity of notation, we omit the superscript l in k^(l)(s, s′) and w^(l)(s), and present the structure of a single non-stationary spatial kernel component. We use K_s(·) to denote a kernel which is centered at the point s and whose shape is a function of the location s. Once K_s(·) is specified for all s ∈ S ⊆ R^2, the correlation between two points s and s′ is then k(s, s′) = ∫ K_s(u) K_{s′}(u) du. Because of this constructive formulation under the moving-average specification, the resulting correlation function k(s, s′) is certain to be positive definite.
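The kernel structure above can be sketched in code. This is an illustrative assembly under assumed notation, not the paper's implementation: a Gaussian temporal kernel multiplies a weighted sum of spatial components, and the spatial component uses a Paciorek-Schervish-type closed form for the moving-average construction with normal kernels (the closed form is an assumption here; the paper derives its own formula in its appendix).

```python
# Sketch of k(x, x') = nu(t, t') * sum_l w_l(s) w_l(s') k_l(s, s'),
# with a closed-form non-stationary spatial component for normal kernels.
import numpy as np

def temporal_kernel(t, tp, sigma=1.0):
    """Stationary Gaussian temporal kernel nu(t, t')."""
    return float(np.exp(-((t - tp) ** 2) / sigma ** 2))

def nonstationary_spatial_kernel(s, sp, Sigma_s, Sigma_sp):
    """Correlation of two normal moving-average kernels with covariances
    Sigma_s and Sigma_sp (Paciorek-Schervish-type formula, assumed)."""
    s, sp = np.asarray(s, float), np.asarray(sp, float)
    avg = (Sigma_s + Sigma_sp) / 2.0
    pref = (np.linalg.det(Sigma_s) ** 0.25 * np.linalg.det(Sigma_sp) ** 0.25
            / np.sqrt(np.linalg.det(avg)))
    quad = (s - sp) @ np.linalg.inv(avg) @ (s - sp)
    return float(pref * np.exp(-quad))

def st_kernel(x, xp, components, sigma=1.0):
    """components: list of (w, k_s) pairs, w mapping s -> weight."""
    s, t = x[:2], x[2]
    sp, tp = xp[:2], xp[2]
    spatial = sum(w(s) * w(sp) * ks(s, sp) for w, ks in components)
    return temporal_kernel(t, tp, sigma) * spatial
```

Note that because each covariance enters symmetrically, the resulting spatial component is symmetric in its two arguments, consistent with the positive-definiteness claim of the moving-average construction.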
We favor working with the kernels K_s(·) rather than directly with the correlation function k(s, s′), since specifying k(s, s′) directly makes it difficult to ensure positive definiteness and symmetry for all s and s′. Following the idea of [41], [42], we define each K_s(·) to be a normal kernel centered at s with a spatially varying covariance matrix Σ_s. (Fig. 4: randomly generated focus points and their corresponding covariances Σ. The horizontal coordinate represents different focus-point realizations; the vertical coordinate represents the value of these realizations. The ellipses portray the shape of the corresponding covariance of the kernel K_s(·) associated with each location s.) In this case, given the parameterized Σ_s and Σ_{s′}, the correlation function is given by an easy-to-compute formula; the derivation of this formula can be found in Appendix B. To ensure that the kernels {K_s(·)} vary smoothly over the space S, we parameterize Σ_s and then allow the parameters to evolve with location. In this paper we focus on a geometrically based specification, which readily extends beyond the Gaussian kernel considered here. There is a one-to-one mapping from a bivariate normal distribution to its one-standard-deviation ellipse, so we define a spatially varying family of ellipses which, in turn, defines the spatial distribution of Σ_s. Let the two focus points in Ψ ⊂ R^2, denoted by a_s ≜ (a_x(s), a_y(s)) ∈ Ψ and −a_s ≜ (−a_x(s), −a_y(s)) ∈ Ψ, define an ellipse centered at s with fixed area A. This then corresponds to the Gaussian kernel with covariance matrix Σ_s defined by (8), where φ = tan^{−1}(a_y(s)/a_x(s)) is the rotation angle determined by the focus points, the axis lengths are determined by the focal distance and the fixed area A, and τ is a scaling parameter that controls the overall intensity of the covariance. Fig. 4 shows a series of randomly generated focus points and their resulting ellipses. This demonstrates how the spatially distributed pairs (a_x(s), a_y(s)) give rise to a spatially distributed covariance matrix Σ_s. The derivation of (8) can be found in Appendix C. Neural network representation for focus points.
Here we represent the mapping ψ : S → Ψ × [0, 1] from the location space S to the joint space of focus points Ψ and weights [0, 1] using a deep neural network. To be specific, the input of the network is the location s, and the output of the network is the concatenation of the corresponding focus points of that location and the weight w(s) defined in (6). The architecture of the neural network is described in Fig. 5 (an illustration of the deep neural network that maps an arbitrary spatial location s to its covariance Σ_s and the corresponding weight w(s)). In Fig. 6, we also demonstrate two specific instances of the resulting spatial kernel given two different ψ (examples of the spatial kernel with two components, evaluated at the same location s and constructed using two different kernels parameterized by randomly generated ψ_1 and ψ_2). This implies that the neural network encodes the non-homogeneous geographical information across the region in spreading the virus. IV. Efficient Computation for Large-Scale Data Sets. There are two major challenges in learning the model and calculating the objective defined in (5). First, the GP approach is notoriously intractable for large data sets, since the computations require the inversion of a matrix of size n × n, which scales as O(n^3) [43]. In this study, the data set includes 3,144 counties and more than 50 weeks extending from March 2020 to January 2021 (n = 3,144 × 50). Second, the inference of the posterior distribution of the hotspots p(f | h) requires the calculation of the integral ∫ p(h | f) p(f) df, which is intractable. To circumvent these two issues, we derive sparse models for both cases and hotspots, similar to [44]-[46], whose log marginal likelihoods are computationally tractable for large data sets and which do not require an analytical expression for inferring the non-Gaussian posterior distribution.
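The focus-point network can be sketched as a small MLP. Everything here is illustrative (layer sizes, initialization, and the placeholder axis-length rule are assumptions, not the paper's specification): the network maps a location to two focus-point coordinates and a sigmoid-squashed weight, and a helper converts the focus points into a rotated covariance matrix.

```python
# Sketch of psi: S -> Psi x [0, 1], a small MLP producing focus points and
# a component weight, plus a focus-points-to-covariance conversion.
import numpy as np

def init_params(sizes, seed=0):
    """Deterministic random weights for a fully connected net (illustrative)."""
    rng = np.random.default_rng(seed)
    return [(0.5 * rng.standard_normal((m, n)), np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def focus_point_net(s, params):
    """Map a location s to (a_x, a_y, weight in [0, 1])."""
    h = np.asarray(s, float)
    for W, b in params[:-1]:
        h = np.tanh(W @ h + b)              # hidden layers
    W, b = params[-1]
    ax, ay, raw_w = W @ h + b               # linear output head
    return float(ax), float(ay), float(1.0 / (1.0 + np.exp(-raw_w)))

def ellipse_to_cov(ax, ay, tau=1.0):
    """Rotate by the focus-point angle phi = atan2(a_y, a_x); the axis
    lengths below are placeholders for the paper's fixed-area rule."""
    phi = np.arctan2(ay, ax)
    c = float(np.hypot(ax, ay))             # focal distance
    lam1, lam2 = 1.0 + c, 1.0               # placeholder semi-axes
    R = np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])
    return tau * R @ np.diag([lam1, lam2]) @ R.T
```

With three hidden layers of 64 units (as the paper uses per component), the same location always maps to the same covariance, which is what lets the kernel vary smoothly over space.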
First, we define a small set of inducing variables that aim to best approximate the training data. Then we adopt a variational learning strategy for such a sparse approximation and jointly infer the inducing inputs and other model parameters by maximizing a lower bound of the true log marginal likelihood [44], [46]. Since the learning strategy can be applied to both of the above models, we use y to represent both cases and hotspots for notational brevity in the following discussion. (Fig. 7: a diagram of our graphical model. The gray and white nodes represent the observed and latent variables, respectively; the black dots represent the input variables; the black boxes represent the prior distributions. The dashed box highlights the joint distribution of f and u defined in (9).) Lastly, the objective is jointly learned by performing stochastic gradient descent. Unlike exact GP approaches that approximate the true covariance by the Nyström approximation [43], we desire a sparse method that directly approximates the posterior GP's mean and covariance function. Now we introduce a small set of auxiliary inducing variables u evaluated at the pseudo-inputs Z ≜ {z ∈ 𝒳}; Z can be a subset of the training inputs or auxiliary pseudo-points [47]. The u are function points drawn from the same GP prior as the training functions f in (1), so the joint distribution can be written as [f; u] ∼ N(0, [[K_XX, K_XZ], [K_ZX, K_ZZ]]), where K_ZZ is formed by evaluating the kernel function pairwise at all pairs of inducing points in Z, and K_XZ is formed by evaluating the kernel function across the data points X and inducing points Z similarly. Fig. 7 presents the diagram of our graphical model, consisting of the observed variables y, h, the latent variable f, and the introduced auxiliary variables u. To obtain computationally efficient inference, we approximate the posterior distribution p(f, u | y) over the random variable vectors f and u by a variational distribution q(f, u).
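The block structure of the joint prior over (f, u) can be sketched directly. This is an illustrative assembly under assumed notation, with a placeholder squared-exponential kernel standing in for the paper's non-stationary kernel k:

```python
# Sketch of the joint zero-mean Gaussian prior over (f, u):
# covariance blocks [[K_XX, K_XZ], [K_ZX, K_ZZ]] built from a kernel.
import numpy as np

def rbf(x, z, bw=1.0):
    """Placeholder squared-exponential kernel over spatio-temporal points."""
    d = np.asarray(x, float) - np.asarray(z, float)
    return float(np.exp(-d @ d / bw ** 2))

def joint_prior_cov(X, Z, kernel=rbf):
    """Return K_XX, K_XZ, K_ZZ and the full joint covariance matrix."""
    K_XX = np.array([[kernel(a, b) for b in X] for a in X])
    K_XZ = np.array([[kernel(a, b) for b in Z] for a in X])
    K_ZZ = np.array([[kernel(a, b) for b in Z] for a in Z])
    full = np.block([[K_XX, K_XZ], [K_XZ.T, K_ZZ]])
    return K_XX, K_XZ, K_ZZ, full
```

Because f and u are drawn from the same GP prior, the stacked covariance is itself a valid (positive semi-definite) Gram matrix over the union of training inputs and inducing inputs.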
We assume this variational distribution can be factorized as q(f, u) = p(f | u) q(u). To jointly determine the variational parameters and model parameters, the variational evidence lower bound (ELBO) substitutes for the marginal likelihoods ℓ_c(θ) and ℓ_h(θ) defined in (5): log p(y) ≥ E_{q(f)}[log p(y | f)] − KL[q(u) || p(u)], where KL[q || p] denotes the Kullback-Leibler (KL) divergence between the two distributions q and p [48]. We have defined q(f) ≜ ∫ p(f | u) q(u) du and assume q(u) ≜ N(m, S), which is the most common way to parameterize the variational distribution of the inducing variables, in terms of a mean vector m and a covariance matrix S. To ensure that the covariance matrix remains positive definite, we represent it in a lower-triangular form S = LLᵀ. This leads to the following analytical form for q(f): q(f) = N(Am, K_XX + A(S − K_ZZ)Aᵀ), where A = K_XZ K_ZZ^{−1}. In classification or regression, we also factorize the likelihood as p(y | f) = Π_{n=1}^{N} p(y_n | f_n) for ease of computation in (10). Therefore, the ELBO objective can be rewritten as Σ_{n=1}^{N} E_{q(f_n)}[log p(y_n | f_n)] − KL[q(u) || p(u)]. In practice, the one-dimensional integrals of the log-likelihood in (11) can be computed by Gauss-Hermite quadrature [49]. In contrast to directly maximizing the marginal log-likelihood defined in (5), computing this objective and its derivatives can be done in O(nm^2) time. The derivation of the ELBO can be found in Appendix D. To make one-week-ahead predictions for the hotspots and the number of confirmed cases, we first need to derive the predictive posterior distribution p(f | y, h) given the past observations. Suppose we have the spatio-temporal coordinates X ≜ {x_it}_{i∈ℐ, t≤T} and their observations y ≜ {y_it}_{i∈ℐ, t≤T}, h ≜ {h_it}_{i∈ℐ, t≤T} until time T, and the optimal inducing points Z. We assume that the unobserved future data come from the same generative process. Therefore, for all the locations at time T + 1, i.e., X* ≜ {x_{i,T+1}}_{i∈ℐ}, we first estimate their means according to (3), denoted by μ* ≜ {μ_{i,T+1}}_{i∈ℐ}; then the distribution of the one-week-ahead prediction f* ≜ {f̂_{i,T+1}}_{i∈ℐ} is given by f* ∼ N(A* m, B* + A* S A*ᵀ), where A* = K_{X*Z} K_ZZ^{−1} and B* = K_{X*X*} − K_{X*Z} K_ZZ^{−1} K_{ZX*}.
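The analytical form of q(f) can be sketched in a few lines. This is an illustrative computation of the sparse variational posterior moments under the notation assumed above (not the authors' code):

```python
# Sketch of q(f) = N(A m, K_XX + A (S - K_ZZ) A^T) with A = K_XZ K_ZZ^{-1}.
import numpy as np

def sparse_posterior(K_XX, K_XZ, K_ZZ, m, S):
    """Mean and covariance of the sparse variational posterior q(f)."""
    A = K_XZ @ np.linalg.inv(K_ZZ)
    mean = A @ m
    cov = K_XX + A @ (S - K_ZZ) @ A.T
    return mean, cov
```

A useful sanity check of the formula: when the inducing points coincide with the training inputs (so K_XZ = K_XX = K_ZZ), A is the identity and q(f) collapses to N(m, S), i.e., the variational distribution itself.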
Here K_{X*Z} denotes an I × m matrix whose entries are pairwise evaluations of k(x*, z), where x* ∈ X* and z ∈ Z. The derivation of the predictive posterior can be found in Appendix E. The predictions for the number of cases and the probability of hotspots can therefore be made by plugging (12) into (2) and (4), respectively. We consider that a detector raises an alarm if the hotspot probability for location i at time t is above a pre-set threshold η. This threshold is chosen for each location by a grid search in [0, 1]; for each location, the threshold with the largest in-sample F1 score is chosen. The above formula reveals that the predictive posterior distribution depends only on the inducing variables u at the learned spatio-temporal coordinates Z and not on the f at the training coordinates. This shows that all the information in the training data has been summarized by the proposed posterior distribution q(u) defined in Section IV-A, so the prediction for future weeks can be carried out efficiently. Now we describe our learning algorithm. The optimal parameters of the proposed model can be found by maximizing the combined objective (5) using gradient-based optimization:

Algorithm 1:
  repeat until convergence:
    Sample a mini-batch X_b, y_b, h_b from X, y, h, respectively;
    Calculate the ELBOs of ℓ_h and ℓ_c based on (11) given the data X_b, y_b, h_b;
    Calculate the gradient of (5) w.r.t. θ;
    Calculate the natural gradient of (5) w.r.t. Z, m, S;
    Ascend the gradients of θ, Z, m, S;
  end

However, the full gradient evaluation can still be expensive to carry out. With a sparse prior (inducing variables), even though we can tackle the computational challenge of inverting a big matrix, evaluating the gradient of the first term in (11) still requires the full data set, which is memory-intensive if the data set is too large.
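The per-location alarm rule can be sketched as follows. This is an illustrative grid search (not the authors' code): for each location, scan thresholds in [0, 1] and keep the one maximizing the in-sample F1 score of the resulting hotspot alarms.

```python
# Sketch of the per-location threshold selection: grid-search eta in [0, 1]
# maximizing the F1 score of (probability > eta) against the true labels.

def f1_score(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs, labels, grid=None):
    """Return the eta in the grid with the largest in-sample F1 score."""
    grid = grid or [i / 100 for i in range(101)]
    return max(grid, key=lambda eta: f1_score(labels, [p > eta for p in probs]))
```

A detector would then raise an alarm at a location whenever its predicted hotspot probability exceeds that location's selected eta.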
To alleviate the problem of expensive gradient evaluation, we adopt a stochastic gradient-based method [45] and only compute the gradient of the objective function with respect to the GP's parameters, denoted by θ ≜ {{ψ^(l)}_{l=1,...,L}, σ}, evaluated on a random subset of the data at each iteration. Additionally, the conventional stochastic gradient descent algorithm assumes that the parameters' loss geometry is Euclidean. This is a non-ideal assumption for the parameters of many distributions, e.g., the Gaussian. Here we follow the idea of [45], [50] and apply an adapted stochastic gradient descent to the variational parameters (Z, m, S) in our GP model by taking steps in the direction of the approximate natural gradient, i.e., the usual gradient re-scaled by the inverse Fisher information. The Kullback-Leibler divergence is used to measure "closeness" in the variational distribution space. Our learning algorithm is summarized in Algorithm 1. This section reports the numerical results of our study. In the following examples, we consider the spatial kernel with L = 4 components and fit the model with m = 500 inducing variables using the COVID-19 data set described in Section II. For each spatial kernel component l, we choose a three-layer neural network with 64 nodes per layer to represent its mapping ψ^(l), selected through cross-validation (Appendix G). We first evaluate the explanatory power of the proposed framework by investigating the learned spatial kernel function using COVID-19 data. We demonstrate the interpretable components of our model and visualize the spatio-temporal correlation across regions discovered by our fitted model. Then we examine the results of the hotspot detection and the case prediction by visualizing the one-week-ahead predictions and their distribution. We emphasize that our model not only generates accurate predictions, but also quantifies the uncertainty about the predictions.
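The natural-gradient rescaling used for the variational parameters can be illustrated in the simplest possible case. This is a one-dimensional teaching example (not the paper's update rule): for a Gaussian N(m, s^2) with fixed s, the Fisher information for the mean m is 1/s^2, so the natural gradient is the ordinary gradient rescaled by s^2.

```python
# Sketch of the natural-gradient idea: rescale the ordinary gradient by the
# inverse Fisher information (1-D Gaussian mean with fixed scale s).

def natural_gradient_mean(grad_m, s):
    """Inverse Fisher information (s^2) times the Euclidean gradient."""
    return (s ** 2) * grad_m

def nat_grad_step(m, s, grad_m, lr=0.1):
    """One ascent step on the mean in the natural-gradient direction."""
    return m + lr * natural_gradient_mean(grad_m, s)
```

The rescaling makes the step size account for the KL geometry of the distribution space rather than the Euclidean geometry of the raw parameters, which is the motivation given above for treating (Z, m, S) differently from θ.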
Lastly, we compare our method with six other commonly used binary classification approaches by evaluating their out-of-sample predictive performance. The inputs to the hotspot prediction model are past county-level case and death records, identified hotspots, and community mobility information. For ease of presentation, we only focus on the counties in the contiguous United States. The proposed framework offers a unique opportunity for understanding the dynamics of the spread of COVID-19 utilizing the carefully crafted kernel design. In this experiment, we fit the model using the entire COVID-19 data set and then visually examine the learned spatial kernel. We first visualize the learned kernel-induced feature K_s^(l) of each spatial kernel component in Fig. 8, which portrays the spatial pattern of the virus's propagation. Recall that, for an arbitrary s, K_s^(l) is a normal kernel centered at s with spatially varying covariance matrix Σ_s^(l), which can be uniquely represented by its focus points. Here, we connect the two focus points with a red line at each location and plot them on the map. The length and rotation angle of the red line at s represent the strength and direction of the influence of location s, respectively, jointly determining the shape of its covariance matrix Σ_s^(l). The color depth indicates the weight w^(l)(s), representing the "significance" of location s in the kernel component l. To intuitively interpret the learned spatial kernel, we also visualize the kernel evaluation given one of its inputs, i.e., Σ_{l=1}^{L} w^(l)(s) w^(l)(·) k^(l)(s, ·). Such a kernel evaluation represents the spatial correlation (or sphere of influence) of a particular location in spreading the virus. Areas in darker red signal that the corresponding counties are more likely to become the next hotspot if an outbreak is detected in the region of interest. In Fig.
9, we observe that these major metropolitan areas have substantially different spatial correlations with their neighboring regions due to the non-stationarity of the spatial kernel. For example, as one of the nation's major economic and transportation hubs, New York has a significant impact on the entire Eastern United States, while Atlanta has only a regional influence in the Southeastern United States. Chicago and Los Angeles, the second and third most populous cities in the United States, extend their influence over the entire north and south of the country, respectively. Such interpretability of our spatial kernel function could be of particular importance for policymakers and local authorities, suggesting that counties with more extensive influence on other regions should receive more attention in epidemic prevention. We also note that increasing the number of spatial kernel components could further enhance the flexibility and interpretability of the model (Appendix F); however, because of the additional parameters required in the neural networks, the computational time increases dramatically when the number of components is 3 or more, with minimal performance improvement.

Fig. 8: Visualizations of the learned kernel-induced feature using the COVID-19 data set. Each panel shows one of the four kernel components, where each line segment is the edge connecting the two focus points of Σ_k(s), indicating the shape and rotation of the kernel at that location; the shaded area shows the intensity of the corresponding weight w_k(s) at location s; the darker the region, the larger the weight.

We evaluate the one-week-ahead in-sample prediction accuracy of our proposed hotspot detection framework at the county level. We first fit the model using the entire data set from March 15, 2020, to January 17, 2021, which contains 3,144 counties and 50 weeks in total. The in-sample prediction for time t is obtained by feeding the data before t into the fitted model and predicting the one-week-ahead hotspots.
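The focus-point parameterization of the spatially varying covariance can be sketched concretely. Assuming (for illustration) that the 1-sigma ellipse of the covariance has a fixed area and mirrored focus points at ±(c1, c2), the semi-axes follow from a1·a2 = area/π and a1² − a2² = c1² + c2², and the covariance is the correspondingly rotated diagonal matrix. The helper below is hypothetical, not the paper's exact construction.

```python
import numpy as np

def cov_from_foci(focus, area):
    """Covariance whose 1-sigma ellipse has the given area and focus
    points at +focus and -focus (hypothetical helper)."""
    c2 = float(focus @ focus)            # squared focal distance c^2
    ab = area / np.pi                    # product of semi-axes a1*a2
    # solve a1^2 - a2^2 = c^2 together with (a1^2)*(a2^2) = ab^2
    a2_sq = (-c2 + np.sqrt(c2 ** 2 + 4.0 * ab ** 2)) / 2.0
    a1_sq = a2_sq + c2
    theta = np.arctan2(focus[1], focus[0])   # major axis points at the foci
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ np.diag([a1_sq, a2_sq]) @ R.T

Sigma = cov_from_foci(np.array([1.0, 0.5]), area=3.0)
```

Longer focal distances stretch the ellipse along the focus direction, which is exactly what the red line segments in Fig. 8 encode: their length sets the anisotropy, their angle the orientation.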
We test our model with different values of the regularization weight in (5) and perform cross-validation to identify the optimal one. In Fig. 10, we report the in-sample prediction results for eight representative locations, including six major metropolitan areas and two sparsely populated counties. The shaded gray area indicates the number of cases reported at each location, and the black stars indicate the times of the identified hotspots. The solid red line represents the corresponding estimated hotspot probability. The hotspot probability produced by our model is considerably high whenever a genuine hotspot occurs and considerably low otherwise, which confirms the effectiveness of our framework. In Fig. 11, we visualize the prediction results on the map to examine the predictive performance from the spatial perspective. We select four particular weeks representing different stages of the COVID-19 pandemic in 2020. The black dots indicate the genuinely identified hotspots, and the color depth indicates the hotspot probability suggested by our fitted model. As we can see, our method captures the spatial occurrence of these hotspots nicely: regions with sparsely distributed hotspots usually have a lower probability, whereas regions with densely distributed hotspots have a higher probability. We emphasize that our hotspot detection framework provides realistic predictions that vary smoothly over time and space, thanks to our GP assumption. This can be extremely useful when making continuous predictions or estimating the likelihood of a hotspot occurring at an arbitrary spatio-temporal coordinate.

Fig. 10: Temporal view of the one-week-ahead, county-wise hotspot probability (h*) suggested by our fitted model (regularization weight 10^-5) using the COVID-19 data. The first six panels represent major metropolitan areas, and the last two panels represent less populated counties in the United States.
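Turning a latent GP value into a hotspot probability at an arbitrary coordinate can be sketched as follows. Assuming (for illustration only) a sigmoid link between the latent value f and the hotspot probability, the predictive probability E[sigmoid(f)] under f ~ N(μ, σ²) can be approximated with Gauss-Hermite quadrature [49]; the helper name and parameter values below are hypothetical.

```python
import numpy as np

def hotspot_prob(mu, var, n=32):
    """Approximate E[sigmoid(f)] for f ~ N(mu, var) with an n-point
    Gauss-Hermite rule (hypothetical helper; the sigmoid link is an
    assumption, not necessarily the paper's exact link function)."""
    x, w = np.polynomial.hermite.hermgauss(n)
    f = mu + np.sqrt(2.0 * var) * x          # change of variables
    return float(np.sum(w / np.sqrt(np.pi) / (1.0 + np.exp(-f))))

p = hotspot_prob(1.2, 0.7)  # predictive hotspot probability in (0, 1)
```

By symmetry of the quadrature nodes, hotspot_prob(0, v) is 0.5 for any variance v, and the probability approaches 0 or 1 as the latent mean moves away from zero relative to its standard deviation.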
Besides hotspot detection, our proposed framework also provides case prediction and uncertainty quantification. In Fig. 12, we present the predicted case numbers y* over time, as well as their confidence intervals, for the same eight locations as before. The black dashed line represents the real reported cases, and the solid blue line represents the prediction y* suggested by our case model. The one-σ and two-σ regions are highlighted by the dark and light blue shaded areas, respectively. As we can see, the prediction captures the general trend of the case records, which confirms that the case model can successfully extract useful information from the case counts; this information is then used to regularize the hotspot model when optimizing (5). We note that the estimated confidence interval reflects the uncertainty level of our prediction for both cases and hotspots, since the confidence interval depends only on the latent spatio-temporal variable f. In Fig. 13, we show the confidence intervals over the map for four different weeks. The color depth indicates the uncertainty level (the length of the confidence interval) at each location. This intuitively tells us in which areas we are confident when making predictions and how this confidence changes over space.

We adopt standard performance metrics, including precision, recall, and the F1 score, because hotspot detection can be viewed as a binary classification problem: we aim to identify a hotspot for a particular location at a particular week in the data. Define the set of all identified hotspots as H and the set of hotspots detected by our method as Ĥ. Then precision and recall are defined as P = |H ∩ Ĥ|/|Ĥ| and R = |H ∩ Ĥ|/|H|, where |·| denotes the number of elements in a set. The F1 score combines precision and recall as F1 = 2PR/(P + R); the higher the F1 score, the better.
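The set-based metrics just defined can be written directly; the county names and weeks in the toy example below are made up for illustration.

```python
def precision_recall_f1(identified, detected):
    """Set-based metrics: precision = |H ∩ Ĥ|/|Ĥ|, recall = |H ∩ Ĥ|/|H|."""
    hit = len(identified & detected)
    precision = hit / len(detected) if detected else 0.0
    recall = hit / len(identified) if identified else 0.0
    f1 = 2.0 * precision * recall / (precision + recall) if hit else 0.0
    return precision, recall, f1

# toy example with made-up (county, week) hotspot coordinates
H = {("Fulton", 12), ("Cobb", 13), ("DeKalb", 14), ("Gwinnett", 15)}
H_hat = {("Fulton", 12), ("Cobb", 13), ("Clayton", 16)}
p, r, f1 = precision_recall_f1(H, H_hat)  # p = 2/3, r = 1/2, f1 = 4/7
```

Keying hotspots by (county, week) pairs makes the set intersection count exactly the detections that match a genuine hotspot at the right place and time.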
Since the number of hotspots in real data is highly sparse (compared to the total number of spatio-temporal coordinates), we do not use the ROC curve (true positive rate versus false positive rate) in our setting. The evaluation procedure is as follows. Given the observed hotspots and other covariates (cases, deaths, and mobility) up to week t, we perform detection for all locations at week t + 1. If a detected hotspot was indeed identified as a genuine hotspot by the CDC, it counts as a success; otherwise, it is a misdetection. In our data, there are 50 × 3,144 = 157,200 spatio-temporal coordinates in total, and 12,000 of them were identified as genuine hotspots. We compare the hotspot detection results of our proposed method and several standard binary classification methods, including the perceptron, logistic regression, linear support vector machine (SVM), k-nearest neighbor (k-NN), kernel SVM with a Gaussian kernel, and the decision tree; see [51] for a detailed review of these machine learning algorithms and Appendix H for the hyperparameter choices. Table I shows the F1 scores for out-of-sample prediction at the county level. The results confirm that our model significantly outperforms the baseline methods.

This paper proposes a Bayesian framework that combines hotspot detection and case prediction cohesively through a latent spatio-temporal random variable. The latent variable is modeled by a Gaussian process, where a flexible non-stationary kernel function characterizes its covariance. The framework has shown immense promise in modeling and predicting the COVID-19 hotspots in the United States. Our numerical study has also shown that the proposed kernel enjoys great representative power while remaining highly interpretable.

This section presents the cross-validation result for the regularization weight in (5) defined in Section III-A. Fig. 14 gives several examples of the predictions using different values of this weight. Fig.
15 summarizes the 5-fold cross-validation that quantitatively measures the F1 score of hotspot detection and the mean squared error of case prediction for different values of the weight. Note that the likelihood of the observed confirmed cases plays a critical role in "regularizing" the hotspot model, and the optimal choice of the weight (10^-5) leads to the best out-of-sample performance in hotspot prediction. We also investigated the model with the weight set to 0 (an alternative model that includes only the hotspot component). The resulting F1 score of out-of-sample prediction is 0.57, which is significantly lower than that of the model with weight 10^-6.

Given two independent Gaussian random variables X and Y with probability density functions f_X and f_Y, the distribution of Z = X + Y equals the convolution of f_X and f_Y. Combining this convolution with the product identity for Gaussian density functions and an appropriate change of variables leads to (7). Since X + Y follows the Gaussian distribution X + Y ~ N(μ_X + μ_Y, Σ_X + Σ_Y), the non-stationary kernel ν(s, s') can be written in closed form after a further change of variables. Assume an ellipse centered at the origin with a given area has focus points (c_1, c_2) and (-c_1, -c_2) in R^2, where c_1, c_2 are real. Denote the semi-major and semi-minor axes of the ellipse by a_1 and a_2, respectively. According to the ellipse formulas, the axes satisfy a_1 a_2 = area/π and a_1^2 - a_2^2 = c_1^2 + c_2^2.

In the ELBO derivation, q(f) is the marginal of f obtained from the joint variational distribution q(f, u) by integrating u out. The inequality holds due to Jensen's inequality, and the equality holds because of the factorization q(f, u) = p(f|u)q(u). To calculate the ELBO, we also need to derive the analytical expressions for q(f) and KL[q(u) || p(u)].
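The Gaussian summation property invoked above (the density of a sum of independent Gaussians is the convolution of their densities, itself Gaussian) can be checked numerically; the means and variances below are arbitrary illustrative choices.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

def gauss(t, mu, var):
    return np.exp(-0.5 * (t - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

mu_x, var_x = 1.0, 1.0      # X ~ N(1, 1)      (illustrative values)
mu_y, var_y = -0.5, 4.0     # Y ~ N(-0.5, 4)

# numerical convolution of the two densities on the grid
conv = np.convolve(gauss(x, mu_x, var_x), gauss(x, mu_y, var_y)) * dx
# the full convolution lives on the grid 2*x[0] .. 2*x[-1]; slice out
# the part aligned with the original grid
mid = (len(x) - 1) // 2
fz_numeric = conv[mid : mid + len(x)]

# closed form: Z = X + Y ~ N(mu_x + mu_y, var_x + var_y)
fz_exact = gauss(x, mu_x + mu_y, var_x + var_y)
err = float(np.max(np.abs(fz_numeric - fz_exact)))  # small discretization error
```

The numerical convolution and the closed-form density N(μ_X + μ_Y, σ_X² + σ_Y²) agree up to the discretization error of the grid.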
First, given the joint distribution defined in (9), we apply the multivariate Gaussian conditional rule and obtain the closed-form expression for the conditional distribution. Similarly, since we assume that the unobserved future data come from the same generative process as in (9), the predictive conditional takes the same form, with A_* = K_{*u} K_{uu}^{-1} and B_* = K_{**} - K_{*u} K_{uu}^{-1} K_{u*}.

This section presents a comparison of our model with different numbers of spatial components in the kernel function. Fig. 16 gives an example of the visualized kernel evaluation centered at Chicago. It shows that the representative power of the kernel can be greatly enhanced by increasing the number of spatial components.

We choose the neural network architecture for each spatial kernel component by 5-fold cross-validation. Specifically, we compare the F1 score of hotspot detection and the corresponding memory usage for 11 different neural network architectures under the same experimental setting. As shown in Fig. 17, our model attains better predictive performance as the number of nodes or layers of the neural network grows. However, for architectures with more than 64 nodes, the performance gain starts to vanish while the GPU memory usage increases dramatically. Therefore, we use a three-layer neural network with 64 nodes per layer for each spatial kernel component, which strikes a balance between predictive performance (an F1 score of 0.621) and memory usage (4,096 bytes per component) in practice.

This section provides a detailed description of the baseline methods used in Section V-D, including the specific hyperparameter choices. The perceptron classifier is an algorithm used for supervised learning, and its main component is a single-layer neural network.
The perceptron classifier takes all the input feature values and computes their weighted sum. The weighted sum is then passed through the sign activation function, which outputs 1 if the weighted sum is greater than zero and -1 otherwise. Logistic regression is a similar method that takes all the input feature values and applies the sigmoid function to their weighted sum, which outputs the probability that the input belongs to class 1; the weights can be found by maximum likelihood estimation through gradient descent. The linear support vector machine (SVM) is another type of binary classifier that aims to find the optimal linear decision boundary (hyperplane) separating the data of the two classes in such a way that the separation is as wide as possible. The kernel SVM is a variant of the linear SVM that allows nonlinear separations; we use the Gaussian kernel with the bandwidth parameter chosen as 0.1. The k-nearest neighbor (k-NN) classifier takes a test sample as input and uses the majority vote of its k nearest neighbors as the output class; we set the number of neighbors to k = 5 and use the Euclidean distance to quantify the pairwise distance between two samples' features. Finally, the decision tree is a white-box type of machine learning algorithm with a flowchart-like tree structure: each internal node represents a feature test, each branch represents the corresponding decision rule, and each leaf node represents an outcome. We set the maximum depth of the tree to four.
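The perceptron just described (weighted sum followed by a sign activation, trained with the classical mistake-driven update) can be sketched as follows; the data here are synthetic, and the margin filter only serves to guarantee linear separability.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = np.where(X @ true_w > 0, 1, -1)
# keep only points with a clear margin so the data are linearly separable
mask = np.abs(X @ true_w) > 0.5
X, y = X[mask], y[mask]

w = np.zeros(2)
b = 0.0
for _ in range(1000):                  # epochs; stops early once separated
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:     # misclassified: sign(w.x + b) != yi
            w += yi * xi               # classical perceptron update
            b += yi
            mistakes += 1
    if mistakes == 0:
        break

acc = float(np.mean(np.where(X @ w + b > 0, 1, -1) == y))
```

On linearly separable data the perceptron convergence theorem guarantees the loop terminates with zero mistakes, so the final training accuracy is exactly 1.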
[1] Trends in number and distribution of COVID-19 hotspot counties, United States.
[2] COVID-19 forecasts: deaths.
[3] A framework for identifying regional outbreak and spread of COVID-19 from one-minute population-wide surveys.
[4] Exact analytical solutions of the susceptible-infected-recovered (SIR) epidemic model and of the SIR model with equal death and birth rates.
[5] Estimating and simulating a SIRD model of COVID-19 for many countries, states, and cities.
[6] Chinese and Italian COVID-19 outbreaks can be correctly described by a modified SIRD model.
[7] The mathematics of infectious diseases.
[8] Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions.
[9] The effectiveness of quarantine of Wuhan city against the corona virus disease 2019 (COVID-19): a well-mixed SEIR model analysis.
[10] LANL COVID-19 cases and deaths forecasts.
[11] The University of Texas COVID-19 Modeling Consortium (2021). COVID-19 mortality projections for US states.
[12] Laboratory for the Modeling of Biological and Socio-technical Systems (2021). COVID-19 modeling.
[13] Institute for Health Metrics and Evaluation (2021). COVID-19 projections for the United States.
[14] High-resolution spatio-temporal model for county-level COVID-19 activity in the US.
[15] The challenges of modeling and forecasting the spread of COVID-19.
[16] COVID-19 in India: statewise analysis and prediction.
[17] COVID-19 prediction and detection using deep learning.
[18] Forecasting of COVID-19 cases based on prediction using artificial neural network curve fitting technique.
[19] Generated time-series prediction data of COVID-19's daily infections in Brazil by using recurrent neural networks.
[20] How well can we forecast the COVID-19 pandemic with curve fitting and recurrent neural networks? medRxiv.
[21] COVID-19 simulator: an interactive tool to inform COVID-19 intervention policy decisions in the United States.
[22] Real-time forecasting of the COVID-19 outbreak in Chinese provinces: machine learning approach using novel digital data and estimates from mechanistic models.
[23] Understanding the spatial diffusion process of severe acute respiratory syndrome in Beijing.
[24] Geographical spread of SARS in mainland China.
[25] Spatial epidemic dynamics of the COVID-19 outbreak in China.
[26] Population flow drives spatio-temporal distribution of COVID-19 in China.
[27] The epidemiological characteristics of an outbreak of 2019 novel coronavirus diseases (COVID-19) in China.
[28] Hawkes process modeling of COVID-19 with mobility leading indicators and spatial covariates.
[29] Curating a COVID-19 data repository and forecasting county-level death counts in the United States.
[30] Examining COVID-19 forecasting using spatio-temporal graph neural networks.
[31] Detecting COVID-19 infection hotspots in England using large-scale self-reported data from a mobile application: a prospective, observational study.
[32] Spatiotemporal analysis and hotspots detection of COVID-19 using geographic information system.
[33] The analysis of spatial association by use of distance statistics.
[34] Topological data analysis of spatial systems.
[35] COVID-19 mortality rate prediction for India using statistical neural networks and Gaussian process regression model.
[36] Forecast and evaluation of COVID-19 spreading in USA with reduced-space Gaussian process regression.
[37] Enhanced Gaussian process regression-based forecasting model for COVID-19 outbreak and significance of IoT for its detection.
[38] We're sharing coronavirus case data for every U.S. county.
[39] Transmission dynamics by age group in COVID-19 hotspot counties, United States.
[40] COVID-19 community mobility reports.
[41] Non-stationary spatial modeling.
[42] Imitation learning of neural spatiotemporal point processes.
[43] Gaussian processes in machine learning.
[44] Variational learning of inducing variables in sparse Gaussian processes.
[45] Gaussian processes for big data.
[46] Scalable variational Gaussian process classification.
[47] Sparse Gaussian processes using pseudo-inputs.
[48] On information and sufficiency.
[49] A note on Gauss-Hermite quadrature.
[50] New insights and perspectives on the natural gradient method.
[51] Understanding machine learning: from theory to algorithms.

Assume the posterior distribution p(f, u|y) over the random variable vectors f and u is approximated by a variational distribution q(f, u). Suppose this variational distribution can be factorized as q(f, u) = p(f|u)q(u). Hence, the ELBO can be derived as follows: