key: cord-0577382-5oyeq06k
authors: Li, Shuang; Wang, Lu; Chen, Xinyun; Fang, Yixiang; Song, Yan
title: Understanding the Spread of COVID-19 Epidemic: A Spatio-Temporal Point Process View
date: 2021-06-24
journal: nan
DOI: nan
sha: a48e1c4fc515049f715074e9e8501c5cf792b0a1
doc_id: 577382
cord_uid: 5oyeq06k

Since the first coronavirus case was identified in the U.S. on Jan. 21, more than 1 million people in the U.S. have confirmed cases of COVID-19. This infectious respiratory disease has spread rapidly across more than 3000 counties and 50 states in the U.S. and has exhibited evolutionary clustering and complex triggering patterns. It is essential to understand the complex, spacetime-intertwined propagation of this disease so that accurate prediction or smart external intervention can be carried out. In this paper, we model the propagation of COVID-19 as a spatio-temporal point process and propose a generative, intensity-free model to track the spread of the disease. We further adopt a generative adversarial imitation learning framework to learn the model parameters. In comparison with traditional likelihood-based learning methods, this imitation learning framework does not need to prespecify an intensity function, which alleviates model misspecification. Moreover, the adversarial learning procedure bypasses the difficult-to-evaluate integral involved in the likelihood evaluation, which makes model inference more scalable with the data and variables. We showcase the dynamic learning performance on the COVID-19 confirmed cases in the U.S. and evaluate the social distancing policy based on the learned generative model.

Since the first coronavirus case was confirmed in Washington state on Jan. 21, more than 1.5 million people in the U.S. had confirmed COVID-19 infections and more than 93,000 people had died from the disease as of May 21. This infectious respiratory disease has spread rapidly across more than 3000 counties and 50 states, with exponential growth of the confirmed case count in March and with all 50 states reporting cases by March 17. In April, the U.S. became the nation with the most confirmed cases and the most deaths globally. On March 15, the Centers for Disease Control and Prevention advised against gatherings of 50 or more people for the next two months, and two of the first U.S. hot spots, Washington state and Illinois, closed all bars and restaurants. On the next day, many cities and states shut down social life and many schools began to close.

The increasing temporal patterns differ significantly from county to county and are influenced by features such as population and location. Three states alone, New York, Connecticut, and New Jersey, have accounted for about 50% of all U.S. confirmed cases since March 20. This paper is motivated by modeling and predicting the spread of COVID-19, which exhibits clustering and triggering patterns in time and space. We propose a generative model that tracks the spread of the disease and directly captures how infections are transmitted. Our model can help explain how one state's outbreak compares with another's, and it provides a simulator for evaluating policy, such as the best time to start and to ease social distancing restrictions.

We treat confirmed COVID-19 cases as discrete events and directly model the transmission of the events by spatio-temporal point processes (STPPs). STPPs model the generative process of discrete events in continuous time and space through an intensity function, without the need to divide space and time into cells [13, 2, 16]. The occurrence intensity of events is a function of space, time, and history, and explicitly characterizes how the events are allocated over time and space. The propagation of contagious diseases such as COVID-19 often exhibits self-exciting patterns [16], in which the occurrence of a previous event boosts the occurrence of new events [11, 8, 15] within a region centered around the current location. Existing spatio-temporal self-exciting models require handcrafting the triggering kernel to capture the propagation patterns. The log-Gaussian Cox process, where the log intensity function is a random realization drawn from a Gaussian process [12, 3], is flexible but requires a prespecified mean and covariance function that encode an accurate prior belief about the spacetime-interleaved correlation. Moreover, this model is difficult to scale to voluminous data such as the COVID-19 cases in the U.S., which makes it unsuitable for this setting.

To alleviate model misspecification, we propose a customized imitation learning framework for spatio-temporal point processes. Our policy-like generative models are intensity-free, with the output events (i.e., confirmed cases in space and time) directly produced by nonlinear transformations of the history embedding. This generative process mimics the self-exciting mechanism, but the neural-based nonlinear transformations add flexibility to the triggering kernel, which can be learned in a data-driven fashion. Furthermore, by incorporating features related to population, lockdown time, and other spatial and temporal covariates, we add flexibility and interpretability to the model. The learned model can be used to evaluate how population and lockdown time impact the spread of the virus.

We adopt an imitation learning framework [1, 14, 18, 7] to learn the generative model (i.e., policy) by minimizing the discrepancy between the generated events and the observed events, where the learning method is an extension of [10] to the spatio-temporal setting; yet our point process generator is intensity-free. We empirically demonstrate the sound performance of our method in generating and forecasting the confirmed COVID-19 cases in the U.S.

We are interested in learning the generating dynamics of events localized randomly in time and space. Each event is recorded as a tuple $e := (t, u)$, where $t \in \mathbb{R}^+$ is the occurrence time and $u \in S$ is the occurrence location of the event. A spatio-temporal point process (STPP) is a random process whose realization consists of an ordered sequence of events, i.e., $\mathcal{H}_t := \{e_1 = (t_1, u_1), \dots, e_i = (t_i, u_i) \mid t_i < t\}$, where $\mathcal{H}_t$, the history up to time $t$, is a σ-algebra.

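Conditional Intensity Function. Let $N(A)$ denote the number of events $e = (t, u)$ falling in a set $A \subset \mathbb{R}^+ \times S$. The dynamics of an STPP can be characterized by a conditional intensity function, defined as

$$\lambda(t, u \mid \mathcal{H}_t) := \lim_{\Delta t, \Delta u \to 0} \frac{\mathbb{E}\left[N\big([t, t + \Delta t] \times B(u, \Delta u)\big) \mid \mathcal{H}_t\right]}{\Delta t \, |B(u, \Delta u)|}, \tag{1}$$

where $B(u, \Delta u)$ is a ball of radius $\Delta u$ centered at $u$. It specifies the mean number of events per unit time and area in an infinitesimal interval and region around $t$ and $u$, conditional on the past. The propagation of contagious diseases often exhibits self-exciting patterns that can be characterized by a conditional intensity function of the form

$$\lambda(t, u \mid \mathcal{H}_t) = \beta_0(u) + \sum_{i: t_i < t} g(u - u_i, t - t_i), \tag{2}$$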

where $\beta_0(u)$ is the exogenous event intensity that models drive from outside the region, the endogenous event intensity $\sum_{i: t_i < t} g(u - u_i, t - t_i)$ models interactions within the region, and $g$ is the triggering kernel. Using the chain rule and the conditional density function, we obtain the joint likelihood of a realization of events $\{e_1 = (t_1, u_1), \dots, e_n = (t_n, u_n) \mid t_n < t\}$ as

$$L = \prod_{i=1}^{n} \lambda(t_i, u_i \mid \mathcal{H}_{t_i}) \, \exp\left(-\int_0^t \int_S \lambda(\tau, v \mid \mathcal{H}_\tau)\, dv\, d\tau\right). \tag{3}$$

Suppose a parametric model $\lambda_\theta(t, u \mid \mathcal{H}_t)$ for the conditional intensity has been specified up to an unknown parameter $\theta$. Under the maximum-likelihood learning paradigm, one obtains an estimate $\hat\theta$ by maximizing the likelihood (3) with respect to $\theta$. In situations where events exhibit complex or time-evolving patterns, it is difficult to design beforehand an intensity function expressive enough to reflect reality. Moreover, evaluating the likelihood in Eq. (3) involves a three-dimensional integral (one time and two spatial dimensions) that usually requires expensive numerical approximation.
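As a concrete illustration of the self-exciting form in Eq. (2), the following minimal sketch evaluates a conditional intensity with a constant background rate and a separable triggering kernel. The kernel shape (exponential decay in time, isotropic Gaussian in space) and the names `beta0`, `alpha`, `omega`, and `sigma` are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def triggering_kernel(du, dt, alpha=0.5, omega=1.0, sigma=0.3):
    """Assumed kernel g(du, dt): exponential decay in time times an isotropic 2-D Gaussian."""
    du = np.asarray(du, dtype=float)
    temporal = alpha * omega * np.exp(-omega * dt)
    spatial = np.exp(-np.sum(du ** 2) / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    return temporal * spatial

def conditional_intensity(t, u, history, beta0=0.1):
    """lambda(t, u | H_t) = beta0 + sum over past events of g(u - u_i, t - t_i)."""
    u = np.asarray(u, dtype=float)
    rate = beta0
    for t_i, u_i in history:
        if t_i < t:  # only strictly earlier events contribute
            rate += triggering_kernel(u - np.asarray(u_i, dtype=float), t - t_i)
    return rate

# Two hypothetical past events as (time, (x, y)) tuples.
history = [(0.2, (0.0, 0.0)), (0.5, (0.1, -0.1))]
near = conditional_intensity(0.6, (0.05, 0.0), history)  # close to past events
far = conditional_intensity(0.6, (2.0, 2.0), history)    # far from past events
```

The intensity is elevated near recent events and falls back toward the background rate far away from them, which is the clustering behavior that a handcrafted kernel has to encode explicitly.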

We propose a policy-like generative model to mimic the self-exciting triggering patterns of the events as in Eq. (2). Our model is intensity-free, and it is easy to perform model inference, simulate new data, and forecast new events. Pseudo events (i.e., actions of the policy) are sequentially generated from the learner policy $\pi_\theta(a \mid S_t)$, where $S_t$ includes all the previously generated actions. Our generative model is intensity-free in the sense that the output events are directly produced via nonlinear transformations of random noise and the history embedding, as illustrated in Fig. 1. We design the policy $\pi_\theta(a \mid S_t)$ to: (i) capture the long-term and nonlinear spacetime-intertwined dependency of events; (ii) mimic the doubly stochastic nature of triggering spatio-temporal point processes (i.e., the occurrence of an event influences the occurrence of future events, which is how the COVID-19 virus spreads) [5, 6]; (iii) explore the continuous time and space domains; and (iv) incorporate features such as population, lockdown time, and other spatial and temporal covariates to add interpretability to the model. To this end, the policy $\pi_\theta(a \mid S_t)$ has the following modules:

(i) A recurrent neural network (RNN) unit learns an abstract representation of the historical events $S_t$,

$$h_i = \psi(V a_i + W h_{i-1} + B), \tag{4}$$

where the input is $a_i = [\tilde t_i, \tilde u_i]$ with $\tilde t_i \in \mathbb{R}^+$ and $\tilde u_i \in \mathbb{R}^2$, the parameters are $V \in \mathbb{R}^{h \times 3}$, $W \in \mathbb{R}^{h \times h}$, and $B \in \mathbb{R}^{h}$, the hidden state $h_i \in \mathbb{R}^{h}$ is an embedding of the history, and $\psi$ is an element-wise nonlinear operator. The generated output event is fed back into the model and serves as the input that triggers (or inhibits) the occurrence of the next event. A random noise vector is injected as part of the inputs to stimulate exploration.

(ii) A multilayer perceptron is applied to noise, hidden state, and static features (i.e., explanatory factors such as population and lockdown time of the city), to generate events, e.g.,

where $h_i$ is the hidden state, $z_i$ is the noise vector, $f_i$ is the static feature vector, $\sigma(\cdot)$ is an element-wise nonlinear operator, and the model parameters are $H^{(1)} \in \mathbb{R}^{h' \times (m+h)}$, $U^{(1)} \in \mathbb{R}^{h'}$, $H^{(2)} \in \mathbb{R}^{3 \times h'}$, and $U^{(2)} \in \mathbb{R}^{3}$. The expressive transformation function can represent a rich class of conditional distributions of discrete events [4], conditional on history and explanatory variables. Injecting noise into the policy encourages exploration over time and space, and it can also be regarded as a "reparameterization trick" for a random variable, where one substitutes a random variable by a deterministic transformation of a simpler random variable [9, 17].
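The two modules above can be sketched as a small roll-out loop. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the choices $\psi = \tanh$ and $\sigma = \mathrm{ReLU}$, the concatenation order of noise, hidden state, and features, the layer sizes, and the random parameter values are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
h_dim, m_dim, f_dim, mid_dim = 8, 4, 2, 16  # illustrative sizes only

# Randomly initialised parameters (hypothetical; a trained model would supply these).
V = rng.normal(scale=0.1, size=(h_dim, 3))
W = rng.normal(scale=0.1, size=(h_dim, h_dim))
B = np.zeros(h_dim)
H1 = rng.normal(scale=0.1, size=(mid_dim, m_dim + h_dim + f_dim))
U1 = np.zeros(mid_dim)
H2 = rng.normal(scale=0.1, size=(3, mid_dim))
U2 = np.zeros(3)

def rnn_step(a, h):
    """History embedding update h_new = psi(V a + W h + B), with psi = tanh (assumed)."""
    return np.tanh(V @ a + W @ h + B)

def emit(h, z, f):
    """Two-layer perceptron on [noise, hidden state, features]; ReLU is an assumed sigma."""
    x = np.concatenate([z, h, f])
    return H2 @ np.maximum(H1 @ x + U1, 0.0) + U2

def rollout(n_events, f):
    """Generate n_events pseudo events, feeding each output back as the next input."""
    h, events = np.zeros(h_dim), []
    for _ in range(n_events):
        z = rng.normal(size=m_dim)  # injected noise encourages exploration
        a = emit(h, z, f)           # a = (time component, 2-D location)
        events.append(a)
        h = rnn_step(a, h)          # the generated event updates the history embedding
    return np.stack(events)

events = rollout(5, f=np.array([0.3, 1.0]))
```

The sketch does not constrain the generated time components to be positive or increasing; a practical implementation would need to enforce that ordering.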

We utilize an imitation learning framework [10] to learn our model parameters. In [10] , the imitation learning framework is designed for temporal point processes with a prespecified intensity function. We extend this learning framework to the spatio-temporal setting and to the intensity-free models. The imitation learning bypasses the difficult-to-evaluate likelihood and alleviates the model-misspecification at the same time.

In a nutshell, we aim to learn a stochastic policy (learner) $\pi_\theta(a \mid S_t)$ to mimic the behavior of the observed events (expert) $\{e_1, e_2, \dots\}$. This is realized by minimizing the discrepancy between the generated events and the observed events, evaluated by functions from a reproducing kernel Hilbert space (RKHS). We summarize this imitation learning formulation in Theorem 1 and provide details in the Appendix on how to empirically evaluate $D(\pi_E, \pi_\theta, \mathcal{F})$ in Eq. (7) from finite samples of event data.

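Theorem 1. Let the family of reward functions be the unit ball in the RKHS $\mathcal{F}$, i.e., $\|r\|_{\mathcal{F}} \le 1$. Then the optimal policy $\pi_\theta(a \mid S_t)$ that mimics the observed events (generated by the expert $\pi_E$) can be obtained by solving

$$\pi_\theta^* = \arg\min_{\pi_\theta} D(\pi_E, \pi_\theta, \mathcal{F}), \tag{6}$$

where $D(\pi_E, \pi_\theta, \mathcal{F})$ is the maximum expected cumulative reward discrepancy between $\pi_E$ and $\pi_\theta$, with the expression

$$D(\pi_E, \pi_\theta, \mathcal{F}) := \max_{\|r\|_{\mathcal{F}} \le 1} \left( \mathbb{E}_{\pi_E}\left[\sum_{i=1}^{N_T} r(e_i)\right] - \mathbb{E}_{\pi_\theta}\left[\sum_{i=1}^{\tilde N_T} r(a_i)\right] \right), \tag{7}$$

where $N_T$ and $\tilde N_T$ are the counts of observed events and generated events within the time horizon $T$.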

This imitation learning formulation accommodates the asynchronous nature of discrete events within a fixed time horizon and a space region. Minimizing $D(\pi_E, \pi_\theta, \mathcal{F})$ is equivalent to minimizing $D^2(\pi_E, \pi_\theta, \mathcal{F})$, and the latter is more convenient because it requires no normalization. Our goal is to learn a policy $\pi_\theta$ for which the discrepancy is close to zero. The policy parameters $\theta$ are learned in an end-to-end fashion.

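The gradient of the objective function $D^2(\pi_E, \pi_\theta, \mathcal{F})$ can be backpropagated through the generative policy network. The policy parameters can be optimized by the (stochastic) gradient descent method, i.e.,

$$\theta_{k+1} = \theta_k - \alpha_k \nabla_\theta D^2(\pi_E, \pi_\theta, \mathcal{F})\big|_{\theta = \theta_k}, \tag{8}$$

where $k$ denotes the iteration and $\alpha_k$ is the learning rate. Using the chain rule,

$$\frac{\partial D^2(\pi_E, \pi_\theta, \mathcal{F})}{\partial \theta} = \frac{\partial D^2(\pi_E, \pi_\theta, \mathcal{F})}{\partial a} \cdot \frac{\partial a}{\partial \theta},$$

where the roll-out samples $a := \{a_i\}$ are obtained by nonlinear transformations parametrized by $\theta$ and therefore carry derivative information with respect to $\theta$.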

Our generative spatio-temporal point process model is intensity-free, yet it is easy to generate new events to predict "when" and "where" the next event will happen, or to evaluate the impact of external variables. This is realized by generating a new action $a$ as the next event from the learned policy $\pi_{\hat\theta}(a \mid \mathcal{H}_t)$. Specifically, to predict future events given historical events, the sequence of observed real events up to time $t$, $\mathcal{H}_t = \{e_1 = (t_1, u_1), \dots, e_i = (t_i, u_i) \mid t_i < t\}$, serves as the input to the trained recurrent neural network, from which we obtain the last hidden state $h_i$, i.e.,

$$h_i = \psi(\hat V e_i + \hat W h_{i-1} + \hat B). \tag{9}$$

We then generate a new event $\hat e_{i+1}$ via nonlinear transformations of noise, the hidden state $h_i$, and the features $f_i$, using the learned multilayer perceptron of module (ii).

To continue, the newly generated event $\hat e_{i+1}$ is fed back into the recurrent neural network to update the last hidden state, $h_{i+1} = \psi(\hat V \hat e_{i+1} + \hat W h_i + \hat B)$, and the previous procedure is repeated.

Since we explicitly incorporate external features f i , such as population and lockdown time of a city, to the generative model, the learned model is ready to evaluate how different population or lockdown time of a city will change the propagation of the events. This is realized by generating new events under the new features we want to evaluate.

We first empirically validated the dynamic learning and prediction performance of our model on four widely used event datasets. We focused on checking whether the learned model can recover the spatial and temporal patterns of the real events. For these real events, we preserved the time and location (i.e., latitude and longitude) information. We focused on understanding the potentially complex spacetime-intertwined dynamics and their daily or yearly patterns. In our experiments, the generator is a one-layer LSTM with hidden state size 64, followed by a two-layer MLP with layer sizes 74-32-3. The noise vector has dimension 10. The generator is optimized by Adam with a learning rate of 1e-3. The kernel bandwidth used in evaluating the discrepancy is set to 1. All code and datasets will be released upon publication of the paper.

Baseline. The baseline model was designed to share a similar architecture with our generator but was trained by maximum likelihood estimation (MLE). Specifically, the baseline model has the same LSTM unit to embed the history as in Eq. (4). In contrast to our generator, its output events are generated from a prespecified conditional probability density. We assume $t \sim \mathrm{Exp}(t \mid \lambda_\theta(h))$ and $u \sim \mathrm{Gaussian}(u \mid \mu_\theta(h), \Sigma_\theta(h))$; that is, the time is generated from an exponential distribution with parameter $\lambda_\theta(h)$, and the location is generated from a Gaussian distribution with mean $\mu_\theta(h)$ and covariance $\Sigma_\theta(h)$. Under these model assumptions, the likelihood of the observed events $\{e_i\}_{i=1,\dots,n}$ can be computed using Eq. (3). Note that the temporal and spatial components of the output events are coupled, because the distribution parameters all depend on the same hidden state $h$. The parameters of the model were learned by maximizing the likelihood.
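To make the baseline's emission assumption concrete, the sketch below evaluates the log-density of a sequence when each event has an exponentially distributed time component and a Gaussian location. It assumes that the time component is the waiting time since the previous event and that the per-event parameters (which would come from the LSTM hidden state) are already given; the function names and the numerical values are hypothetical.

```python
import numpy as np

def event_log_density(dt, u, lam, mu, cov):
    """log p(dt, u | h) with dt ~ Exp(lam) and u ~ N(mu, cov), independent given h."""
    u, mu, cov = np.asarray(u, float), np.asarray(mu, float), np.asarray(cov, float)
    log_time = np.log(lam) - lam * dt
    diff = u - mu
    _, logdet = np.linalg.slogdet(2.0 * np.pi * cov)
    log_space = -0.5 * diff @ np.linalg.solve(cov, diff) - 0.5 * logdet
    return log_time + log_space

def sequence_log_likelihood(dts, locations, params):
    """Sum of per-event log-densities; params[i] = (lam_i, mu_i, cov_i)."""
    return sum(event_log_density(dt, u, *p) for dt, u, p in zip(dts, locations, params))

# Hypothetical per-event parameters, standing in for the outputs of the history embedding.
params = [(2.0, (0.0, 0.0), np.eye(2)), (1.0, (0.5, 0.5), 0.25 * np.eye(2))]
ll = sequence_log_likelihood([0.3, 0.7], [(0.1, -0.2), (0.4, 0.6)], params)
```

A single Gaussian per event can only place mass around one location at a time, which is one way to see why this parametric emission struggles with sparse or multi-cluster spatial patterns.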

Learning and Prediction Performance. We generated new events from the two well-trained models (i.e., the learner policy and the baseline) and evaluated the learning performance by checking the recovery results. All pseudo events were sequentially generated within a predefined time horizon. For the UK Car Accidents, we are interested in the spatio-temporal allocation of accidents at a daily level, so we chunk this dataset by day to construct the sequences of events. For the New York City Taxi Trips, we also split the dataset by day and further selected the trips that happened between 3:00 pm and 4:00 pm each day. For Chicago Crimes, we split the dataset by date to construct the sequences of events and further focused on the crimes that happened between 9:00 pm and 12:00 pm. For Worldwide Earthquakes, where events are sparse, we split the dataset by year to construct the sequences of events, and we aim to model the spatio-temporal patterns of earthquakes at an annual level.

We linearly scaled the event times and locations (the scaled times fall in [0, 2] and the scaled latitudes and longitudes fall in [−2, 2]) so that they lie in a reasonable range for training stability and visualization. We show the spatial distributions and the temporal intensity functions of the real events (expert) and the generated events (learner) in Fig. 2 (left) and Fig. 3. In the figures, we visualize only one batch (i.e., 32) of generated sequences. As shown, our learner policy, which jointly generates the spatial and temporal components of the events, successfully recovered the underlying patterns of the real events and, in particular, captured the sparse and irregular patterns over space.
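The linear scaling described above can be sketched as a min-max affine map. Whether the minimum and maximum are taken per dataset or per sequence is not stated in the text, so the per-array scaling below is an assumption, and the sample values are hypothetical.

```python
import numpy as np

def linear_scale(x, lo, hi):
    """Affinely map the values of x onto [lo, hi] using the observed min and max."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:  # degenerate case: place everything at the midpoint
        return np.full_like(x, (lo + hi) / 2.0)
    return lo + (hi - lo) * (x - x_min) / (x_max - x_min)

times = np.array([0.0, 3.5, 12.0, 24.0])  # hypothetical event times (hours)
lat = np.array([41.6, 41.9, 42.0])        # hypothetical latitudes
scaled_times = linear_scale(times, 0.0, 2.0)
scaled_lat = linear_scale(lat, -2.0, 2.0)
```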

The baseline model, however, failed to capture the fine-grained patterns, which indicates that the conditional density assumption for the emission probabilities is quite restrictive. When we used the baseline model to learn and generate only the temporal component of the events (i.e., ignoring the spatial information), it accurately recovered the temporal patterns (we ran these experiments but do not display the results here). When the baseline model jointly learned and generated the spatial and temporal components, however, the learning performance degraded significantly: the difficulty in capturing the spatial patterns destroys the generative performance for the temporal patterns. To remedy this problem, one may resort to a nonparametric Gaussian mixture model to capture the clustering patterns of the spatial component, but this design is still not generic and works only for some specific patterns. Moreover, sophisticated models lead to a hard-to-evaluate likelihood. The comparison showcases the power of our method in learning the complex spatio-temporal patterns of real events.

We also evaluated the prediction accuracy of the two well-trained models, i.e., predicting when and where the next event will happen given the observed history. Our model and the baseline model were evaluated by conditioning on the same historical real events: the same sequence of observed events was fed to the LSTM as input to obtain the last hidden state, and this hidden state was used as the initial state to generate new events as predictions. The predictions are shown in Fig. 2 (right) and Fig. 4 (the time horizon has been scaled to [0, 2]), where we visualize a batch of the next ten predicted events. The results also demonstrate the superior prediction performance of our model in comparison with the baseline, especially in predicting event locations.

Next, we demonstrate how to use our spatio-temporal point process model to understand the spread of the COVID-19 virus in the U.S. We are especially interested in predicting how the social distancing policy (i.e., city lockdown time) influences the propagation of the disease.

Datasets and Experiment Setup. We considered the dataset from the Center for Systems Science and Engineering at Johns Hopkins University. This dataset provides daily updates of the number of confirmed COVID-19 cases globally. We focused only on the U.S. confirmed cases reported from Jan 21, 2020 to May 24, 2020. In this dataset, each county has its own report, but the fine-grained location of each confirmed case is not provided; only the location of the county is available. We therefore adjust our generative model to this dataset: we use the location of the county as a marker (i.e., static feature) of the events and generate only the occurrence time of each new confirmed case. The external features, population and lockdown time, are fused together with the county location into our generator to capture how these factors influence the event triggering dynamics. Unlike crime or accident events, which usually exhibit repeated daily patterns, the outbreak of a disease has its own evolutionary progression. The trajectories of events are nonstationary over time and exhibit different patterns from county to county. We treat each county's confirmed cases as a sequence and predict when the next confirmed case will occur in that county.

The setup of the models is the same as before. In this context, we construct the dataset as follows. We divide the 3261 counties' sequences by sequence length into five groups: length ≤ 100 (2245 counties); length > 100 and ≤ 1000 (796); length > 1000 and ≤ 5000 (155); length > 5000 and ≤ 10000 (39); and length > 10000 and ≤ 20000 (18). The remaining 8 counties have length > 20000. We train one model for each group, because the counties within the same group share similarities in their virus spread patterns. Short sequences do not have sufficient data to train an individual model, so predictions for them pool the information of other similar counties in the group. For hot spots such as New York City, where the cumulative confirmed count is close to 200,000, we train an individual model. Rather than generating the occurrence time of the next event, we generate the occurrence time at which the new cumulative confirmed cases exceed one hundred.
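The length-based grouping described above amounts to binning each county by its sequence length. A minimal sketch follows; the county names and lengths are hypothetical, and the index 5 is simply a label for sequences longer than 20000.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of the five pooled groups; longer sequences fall outside them.
BOUNDS = [100, 1000, 5000, 10000, 20000]

def assign_group(length):
    """Return 0..4 for the pooled groups, or 5 for length > 20000."""
    # bisect_left keeps a length equal to a bound in the lower group (e.g. 100 -> group 0).
    return bisect_left(BOUNDS, length)

lengths = {"A": 12, "B": 100, "C": 101, "D": 4800, "E": 150000}  # hypothetical counties
groups = {name: assign_group(n) for name, n in lengths.items()}
counts = Counter(groups.values())
```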

Learning and Prediction Performance. The learning performance of our model on all counties in the U.S. is shown in Fig. 5. We generated new events from the learner policy and evaluated the learning performance by checking the recovery results. The results are an aggregation of the events in terms of the empirical intensity, and they demonstrate sound recovery performance. We randomly selected eight counties and show their generated results in Figure 6 (the unit of time on the X-axis is one day). We observe that our models capture the distinct patterns of each county, and that the patterns differ significantly across counties. For hot spots such as Suffolk, the emerging cases grow quickly, with a curve trending sharply upward. As new cases slow, the curve bends toward horizontal, showing that the outbreak may be leveling off. This does not mean that the number of cases has stopped growing; it indicates that the rate of growth has slowed, which could signify that social distancing measures are having an effect. The results show that our model can accurately recover the spread of events across various lengths and trends. Our model performs especially well on hot spots (i.e., Philadelphia, Westchester, and Suffolk), mainly because of the sufficient training data. For short sequences, although we adopt pooling techniques, it is still challenging to capture the distinct patterns of each county.

We are also interested in evaluating how the lockdown time of each county influences the spread of the disease. Given a learned model, we generated events while varying the lockdown time. We predict the trajectories of the events under two alternative lockdown times: one week earlier and one week later than the real lockdown time. The predicted intensity functions are shown in Figure 7 and Table 1.

We observe that, for all four counties, the number of events with a delayed lockdown is slightly larger than with the real lockdown time, and the number of events with an earlier lockdown is slightly smaller. With an earlier lockdown time, Los Angeles and Middlesex show trends similar to the real cases. A possible reason is that the lockdown time does not have a significant effect on these counties in the early stage of the epidemic. For New York and Cook, however, the earlier lockdown time shows a significant reduction in the total number of confirmed cases. The results suggest that New York and Cook could have controlled the disease better with an earlier lockdown.

In this paper, we proposed an intensity-free spatio-temporal point process model and trained it using an imitation learning framework. This learning method bypasses the evaluation of the likelihood function and has the potential to achieve a good balance between model flexibility and computational burden. We empirically showed the superiority of the proposed method in recovering the complex dynamics of real events and in forecasting new events. In particular, we used COVID-19 as a case study and applied our model to understand the spread of this virus.

[17] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278-1286, 2014.

[18] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433-1438. Chicago, IL, USA, 2008.

In the following, we summarize the imitation learning formulation of Theorem 1 and provide details on how to empirically evaluate $D(\pi_E, \pi_\theta, \mathcal{F})$ in Eq. (7) from finite samples of event data.

In a nutshell, we aim to learn a stochastic policy $\pi_\theta(a \mid S_t)$ that mimics the behavior of the observed events. For the learner policy $\pi_\theta(a \mid S_t)$, we let the state

$$S_t := \{a_1, \dots, a_i \mid \tilde t_i < t\}$$

refer to the generated pseudo events up to time $t$, and let the action $a_i = (\tilde t_i, \tilde u_i)$ denote a generated pseudo event. The next pseudo event (i.e., the new action $a_{i+1}$) is taken by the policy $\pi_\theta(a \mid S_t)$ conditional on the current state. The new state is then updated by appending $a_{i+1}$ to the current state.

Under the imitation learning framework, we define $\pi_E$ as the expert's policy, which characterizes the dynamics of the observed events, and we assume that the observed events are generated from $\pi_E$. For spatio-temporal point processes, one can think of $\pi_E := p(e_i \mid \mathcal{H}_{t_i})$ as the conditional density of the next event. We aim to learn a learner policy $\pi_\theta := \pi_\theta(a \mid S_t)$ that imitates $\pi_E$.

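Imitation learning first requires learning an optimal reward function $r^*$ from data by solving

$$r^* = \arg\max_{r} \left( \mathbb{E}_{\pi_E}\left[\sum_{i=1}^{N_T} r(e_i)\right] - \max_{\pi_\theta \in G} \mathbb{E}_{\pi_\theta}\left[\sum_{i=1}^{\tilde N_T} r(a_i)\right] \right), \tag{11}$$

where $G$ is the family of all candidate learner policies $\pi_\theta$, $N_T$ is the total number of observed events up to time $T$, and $\tilde N_T$ is the total number of generated pseudo events up to time $T$.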

Given an unknown reward function and only the expert's sequences of events, the reward function is learned (as in Eq. (11)) under the principle that the expert policy should be uniquely optimal given the reward. In other words, the expert performs better than any other policy $\pi_\theta \in G$ under that reward. This reward learning procedure is also called inverse reinforcement learning. However, the above formulation is challenging to solve: it requires solving a reinforcement learning problem in an inner loop, which is time-consuming and resource intensive. We discuss later how to simplify this formulation and directly obtain an analytical optimal reward function.

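Given the learned optimal reward $r^*$, the optimal learner policy is obtained by maximizing the cumulative reward of the actions over a finite time horizon $T$:

$$\pi_\theta^* = \arg\max_{\pi_\theta \in G} \mathbb{E}_{\pi_\theta}\left[\sum_{i=1}^{\tilde N_T} r^*(a_i)\right]. \tag{12}$$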

In summary, the overall generative adversarial imitation learning framework is illustrated in Fig. 1. Pseudo events are sequentially generated from the policy $\pi_\theta$. The discrepancy $D(\pi_E, \pi_\theta)$ between the generated events and the observed events is evaluated by a reward function. The policy and the reward are jointly learned from data, and the reward function iteratively guides the policy to improve the sample quality until the sampled events and the real events are indistinguishable.

Under the imitation learning framework, the reward function and the policy are jointly learned from the data as in Eq. (11) . The reward function is learned under the principle that expert policy should be uniquely optimal. In this two-player game as shown in Eq. (11) , the reward that quantifies the discrepancy between π E and π θ is updated by considering the worst-case, and the optimal policy π * θ aims to close this gap.

The function class for the reward should be carefully chosen. On the one hand, we want the reward function class to be sufficiently expressive so that it can represent reward functions of various shapes. On the other hand, it should be restrictive enough to be efficiently learned from finite samples. With these competing considerations, we choose the reward function class to be the unit ball in the RKHS $\mathcal{F}$, i.e., $\|r\|_{\mathcal{F}} \le 1$. An immediate benefit of this function class is that the optimal policy can be learned directly via a minimization formulation (provided in Theorem 1) instead of the original minimax formulation in Eq. (11).

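A sketch of the proof is as follows. We start from Eq. (11) and derive the result in Theorem 1. Fixing the time horizon $T$, let $\xi = \{e_1, e_2, \dots, e_{N_T}\}$ denote the sequence of observed events up to $T$, and let $\eta = \{a_1, a_2, \dots, a_{\tilde N_T}\}$ denote the sequence of generated events. For brevity, we denote

$$\phi(\xi) := \int_{[0,T) \times S} k(e, \cdot)\, dN_e \quad \text{(feature mapping from the data space to the RKHS)} \tag{13}$$

and

$$\mu_{\pi_E} := \mathbb{E}_{\xi \sim \pi_E}\left[\phi(\xi)\right] \quad \text{(mean embedding of the intensity function in the RKHS)}, \tag{14}$$

where $dN_e$ denotes the counting process associated with the sample path $\xi$, and $k(e, e')$ is a universal RKHS kernel. Similarly, we define $\phi(\eta)$ and $\mu_{\pi_\theta}$. Then, using the reproducing property, we write the cumulative reward for the learner policy as

$$J(\pi_\theta) := \mathbb{E}_{\eta \sim \pi_\theta}\left[\sum_{i=1}^{\tilde N_T} r(a_i)\right] = \mathbb{E}_{\eta \sim \pi_\theta}\left[\int_{[0,T) \times S} \langle r, k(a, \cdot) \rangle_{\mathcal{F}}\, dN_a\right] = \langle r, \mu_{\pi_\theta} \rangle_{\mathcal{F}}.$$

Similarly, we obtain $J(\pi_E) = \langle r, \mu_{\pi_E} \rangle_{\mathcal{F}}$. From Eq. (11), $r^*$ is obtained by solving

$$\max_{\|r\|_{\mathcal{F}} \le 1} \min_{\pi_\theta \in G} \langle r, \mu_{\pi_E} - \mu_{\pi_\theta} \rangle_{\mathcal{F}} = \min_{\pi_\theta \in G} \max_{\|r\|_{\mathcal{F}} \le 1} \langle r, \mu_{\pi_E} - \mu_{\pi_\theta} \rangle_{\mathcal{F}} = \min_{\pi_\theta \in G} \|\mu_{\pi_E} - \mu_{\pi_\theta}\|_{\mathcal{F}},$$

where the first equality is guaranteed by the minimax theorem, and

$$r^*(\cdot \mid \pi_E, \pi_\theta) = \frac{\mu_{\pi_E} - \mu_{\pi_\theta}}{\|\mu_{\pi_E} - \mu_{\pi_\theta}\|_{\mathcal{F}}} \propto \mu_{\pi_E} - \mu_{\pi_\theta}. \tag{15}$$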

In this way, we convert the original mini-max formulation for solving π * θ to a simple minimization problem, which will be more efficient and stable to solve in practice. We summarize the formulation in Theorem 1.

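The optimal reward has an analytical expression as in Eq. (15), which can be estimated from finite samples. Given $L$ trajectories of expert point processes and $M$ trajectories of events generated by $\pi_\theta$, the mean embeddings $\mu_{\pi_E}$ and $\mu_{\pi_\theta}$ can be estimated by their respective empirical means:

$$\hat\mu_{\pi_E} = \frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{N_T^{(l)}} k(e_i^{(l)}, \cdot), \qquad \hat\mu_{\pi_\theta} = \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{\tilde N_T^{(m)}} k(a_i^{(m)}, \cdot). \tag{16}$$

Then for any $a := (t, u)$ with $t \in [0, T)$ and $u \in S$, the estimated optimal reward evaluated at the point $a$ is (without normalization)

$$\hat r^*(a) \propto \frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{N_T^{(l)}} k(e_i^{(l)}, a) - \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{\tilde N_T^{(m)}} k(a_i^{(m)}, a). \tag{17}$$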

The optimal reward is a three-dimensional function over time and space that recognizes the differences between the generated events and the observed events, as illustrated in Fig. 8. Given the observed events and the currently generated events (we visualize the distribution of the events over the spatial component and the occurrence intensity over the temporal component, respectively), we plot the estimated optimal reward function using Eq. (17). Since the reward is a three-dimensional function, in the figure we fix $t = t_0$ (where $t_0 = 5$) and plot $\hat r^*(x, y, t_0)$ over the XY space, and we fix $u = (x_0, y_0)$ (where $(x_0, y_0) = (0, 0)$) and show $\hat r^*(x_0, y_0, t)$. As illustrated, the computed reward function indicates the differences between the occurrence intensity of the observed events (expert) and that of the generated events (learner): the reward is high where the expert's intensity is greater than the learner's, and low where the expert's intensity is smaller than the learner's. In this way, the reward function guides the learner to mimic the expert during training by obtaining a policy that maximizes the cumulative reward.
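The behavior of the estimated reward can be checked numerically with a small sketch. It computes the unnormalized reward as the difference between the average kernel mass of expert events and that of learner events at a set of query points. The Gaussian kernel parameterization `exp(-||x - y||^2 / (2 * bandwidth^2))`, the event layout `(t, x, y)`, and the toy sequences are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) for all pairs of rows in x and y."""
    sq = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def estimated_reward(points, expert_seqs, learner_seqs, bandwidth=1.0):
    """Unnormalised reward: mean kernel mass of expert events minus that of learner events."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    expert = sum(gaussian_kernel(points, s, bandwidth).sum(axis=1) for s in expert_seqs)
    learner = sum(gaussian_kernel(points, s, bandwidth).sum(axis=1) for s in learner_seqs)
    return expert / len(expert_seqs) - learner / len(learner_seqs)

# Toy data: each row is an event (t, x, y); one expert and one learner trajectory.
expert_seqs = [np.array([[1.0, 0.0, 0.0], [1.1, 0.1, 0.0]])]
learner_seqs = [np.array([[1.0, 2.0, 2.0]])]
query = np.array([[1.0, 0.0, 0.0], [1.0, 2.0, 2.0]])
reward = estimated_reward(query, expert_seqs, learner_seqs)
```

In this toy setting the reward is positive at the query point near the expert events and negative at the point where only the learner has placed an event, so maximizing cumulative reward pushes generated events toward regions the expert occupies.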

[Figure 8: panels titled "Learner's Reward".]

The unit ball in an RKHS is dense and expressive. Fundamentally, our proposed framework and theoretical results are general and can be directly applied to other types of kernels. For example, we can use the Matérn kernel, which generates spaces of differentiable functions known as Sobolev spaces. In our experiments, we used the Gaussian kernel and obtained promising results. As future work, the kernel function can also be learned to maximize its discriminative power.

Minimizing $D(\pi_E, \pi_\theta, \mathcal{F})$ is equivalent to minimizing $D^2(\pi_E, \pi_\theta, \mathcal{F})$, and the latter formulation is more convenient because it requires no normalization. Our goal is to learn a policy $\pi_\theta$ for which the discrepancy is close to zero, i.e.,

$$\min_{\pi_\theta} D^2(\pi_E, \pi_\theta, \mathcal{F}), \tag{18}$$

where the corresponding optimal reward function has the analytical and nonparametric expression $r^* = \mu_{\pi_E} - \mu_{\pi_\theta}$.

The original imitation learning formulation in Eq. (11) is a minimax game, solved by alternating between the minimization subproblem for the policy and the maximization subproblem for the reward. The optimization procedure must be carefully scheduled to escape bad local optima, particularly for a non-convex game. In our framework, we propose a neural-based generative policy for spatio-temporal point processes to gain model expressiveness, and we introduce a nonparametric reward to keep the model parsimonious. Overall, our framework is lightweight.

The policy parameters are learned by solving (18) in an end-to-end fashion. The gradient of the objective function $D^2(\pi_E, \pi_\theta, \mathcal{F})$ can be backpropagated through the generative policy network. The policy parameters can be optimized by the (stochastic) gradient descent method, i.e.,

$$\theta_{k+1} = \theta_k - \alpha_k \nabla_\theta D^2(\pi_E, \pi_\theta, \mathcal{F})\big|_{\theta = \theta_k}, \tag{19}$$

where $\theta_k$ denotes the parameters after $k$ iterations starting from the initial policy parameters $\theta_0$, and $\alpha_k$ denotes the learning rate. For the objective function

$$D^2(\pi_E, \pi_\theta, \mathcal{F}) = \|\mu_{\pi_E} - \mu_{\pi_\theta}\|_{\mathcal{F}}^2 = \langle \mu_{\pi_E}, \mu_{\pi_E} \rangle_{\mathcal{F}} + \langle \mu_{\pi_\theta}, \mu_{\pi_\theta} \rangle_{\mathcal{F}} - 2 \langle \mu_{\pi_E}, \mu_{\pi_\theta} \rangle_{\mathcal{F}},$$

only the last two terms contain gradient information for $\pi_\theta$, and the gradient can be estimated from finite samples.

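Given $L$ trajectories of expert point processes and $M$ trajectories of events generated by $\pi_\theta$, we obtain the finite-sample estimate of $D^2(\pi_E, \pi_\theta, \mathcal{F})$ by plugging in the empirical mean embeddings $\hat\mu_{\pi_E}$ and $\hat\mu_{\pi_\theta}$ from Eq. (16), which gives

$$\hat D^2 = \frac{1}{L^2} \sum_{l, l'} \sum_{i, j} k(e_i^{(l)}, e_j^{(l')}) + \frac{1}{M^2} \sum_{m, m'} \sum_{i, j} k(a_i^{(m)}, a_j^{(m')}) - \frac{2}{LM} \sum_{l, m} \sum_{i, j} k(e_i^{(l)}, a_j^{(m)}).$$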

The roll-out samples $a := \{a_i\}$ are obtained by nonlinear transformations parametrized by $\theta$ and therefore carry derivative information with respect to $\theta$. The gradient of the objective function is obtained as follows: first take the derivative with respect to the actions, and then use the chain rule to take the derivative with respect to the unknown policy parameters, i.e.,

$$\frac{\partial D^2(\pi_E, \pi_\theta, \mathcal{F})}{\partial \theta} = \frac{\partial D^2(\pi_E, \pi_\theta, \mathcal{F})}{\partial a} \cdot \frac{\partial a}{\partial \theta},$$

where $a$ depends on the policy parameters through Eqs. (4) and (5).
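The plug-in estimate of the squared discrepancy can be sketched in a few lines. The code below expands $\|\hat\mu_{\pi_E} - \hat\mu_{\pi_\theta}\|^2$ into its three kernel sums, using a Gaussian kernel with an assumed parameterization and toy single-event trajectories laid out as `(t, x, y)`; it illustrates the estimator only and does not compute gradients.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) for all pairs of rows in x and y."""
    sq = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def cross_term(seqs_a, seqs_b, bandwidth):
    """Inner product of two empirical mean embeddings: averaged sum of pairwise kernels."""
    total = sum(gaussian_kernel(a, b, bandwidth).sum() for a in seqs_a for b in seqs_b)
    return total / (len(seqs_a) * len(seqs_b))

def squared_discrepancy(expert_seqs, learner_seqs, bandwidth=1.0):
    """Plug-in estimate <mu_E, mu_E> + <mu_L, mu_L> - 2 <mu_E, mu_L>."""
    return (cross_term(expert_seqs, expert_seqs, bandwidth)
            + cross_term(learner_seqs, learner_seqs, bandwidth)
            - 2.0 * cross_term(expert_seqs, learner_seqs, bandwidth))

# Toy trajectories with one event each.
expert = [np.array([[1.0, 0.0, 0.0]])]
learner_far = [np.array([[1.0, 2.0, 2.0]])]
learner_near = [np.array([[1.0, 1.0, 1.0]])]
d_far = squared_discrepancy(expert, learner_far)
d_near = squared_discrepancy(expert, learner_near)
```

Moving the generated event closer to the expert event lowers the estimate, which is the signal that gradient descent follows. In an automatic-differentiation framework, only the two terms involving the generated events would carry gradients back to the policy parameters.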

[1] Apprenticeship learning via inverse reinforcement learning.

[2] Spatio-temporal point processes: methods and applications. Monographs on Statistics and Applied Probability.

[3] Spatial and spatio-temporal log-Gaussian Cox processes: extending the geostatistical paradigm.

[4] Generative adversarial nets.

[5] Doubly stochastic Poisson processes.

[6] Spectra of some self-exciting and mutually exciting point processes.

[7] Generative adversarial imitation learning.

[8] A self-correcting point process.

[9] Auto-encoding variational Bayes.

[10] Learning temporal point processes via reinforcement learning.

[11] Self-exciting point process modeling of crime.

[12] Log Gaussian Cox processes.

[13] Statistical inference and simulation for spatial point processes.

[14] Algorithms for inverse reinforcement learning.