A Metric Space for Point Process Excitations
Myrl G. Marmarelis, Greg Ver Steeg, and Aram Galstyan
The Journal of Artificial Intelligence Research, 2020-05-05. DOI: 10.1613/jair.1.13610

A multivariate Hawkes process enables self- and cross-excitations through a triggering matrix that behaves like an asymmetrical covariance structure, characterizing pairwise interactions between the event types. Full-rank estimation of all interactions is often infeasible in empirical settings. Models that specialize in a spatiotemporal application alleviate this obstacle by exploiting spatial locality, allowing the dyadic relationships between events to depend only on separation in time and relative distances in real Euclidean space. Here we generalize this framework to any multivariate Hawkes process, and harness it as a vessel for embedding arbitrary event types in a hidden metric space. Specifically, we propose a Hidden Hawkes Geometry (HHG) model to uncover the hidden geometry between event excitations in a multivariate point process. The low dimensionality of the embedding regularizes the structure of the inferred interactions. We develop a number of estimators and validate the model by conducting several experiments. In particular, we investigate regional infectivity dynamics of COVID-19 in an early South Korean record and recent Los Angeles confirmed cases. By additionally performing synthetic experiments on short records as well as explorations into options markets and the Ebola epidemic, we demonstrate that learning the embedding alongside a point process uncovers salient interactions in a broad range of applications.

Infectious diseases (Gibson, Streftaris, & Thong, 2018), news topics (He, Rekatsinas, Foulds, Getoor, & Liu, 2015), crime patterns (Mohler, Short, Brantingham, Schoenberg, & Tita, 2011), neuronal spike trains (Pillow, Shlens, Paninski, Sher, Litke, Chichilnisky, & Simoncelli, 2008), and market trade-level activity (Morariu-Patrichi & Pakkanen, 2018; Swishchuk & Huffman, 2020) naturally suit the form of a diachronic point process with a network of directed excitations. Understanding their intrinsic dynamics is of immense scientific and strategic value: a particular series of discrete options trades may inform an observer about the fluctuating dispositions of market agents; similarly, temporal news publication patterns may betray an ensuing shift in the public zeitgeist. The spread of a novel pathogen, notably the COVID-19 virus, through disjointed pockets of the globe hints at how it proliferates, and how that might be averted (Drakopoulos, Ozdaglar, & Tsitsiklis, 2017). Simple Hawkes (1971) processes model excitations as a linear combination of responses to past events. In the multivariate setting, a drastic need for data-efficient techniques often arises: estimating the possible excitations between every dyad (pair) of event types, a number that grows quadratically with the number of types, is untenable without a succinct and interpretable parametrization. How can one possibly disentangle the contributions of hundreds of options trades within each minute, across a myriad of different strike prices and expiration dates, to the future frequency of a particular type of trade? Temporal processes like markets are also highly non-stationary, which compounds the need to learn meaningful models from short windows.
In these cases it is necessary to envision a reduced parametric form that is expressive yet cogent. The manifold hypothesis (Fefferman, Mitter, & Narayanan, 2016 , for possible references) supposes that the vast majority of complex, high-dimensional phenomena reside in a state space of significantly lower dimensionality. The present endeavor is an effort to estimate embeddings of manifold-like structures from indirect measurements-in this case, marked events. Spatiotemporal domains (Veen & Schoenberg, 2008; Mohler, 2014; Yuan, Li, Bertozzi, Brantingham, & Porter, 2019 ) exploit physical constraints on interaction locality. In essence, the direct influence one event bears on another is restricted by their separation in space, invariant to absolute positioning. We show that it is feasible to conjure a latent metric representation of the event types by harnessing a spatiotemporal approach outside the spatial context. We tailor the latent space toward the precise interactions between events. Our first contribution is a model that captures multivariate point-process regularities in a compressed metric space, and which eliminates the need to estimate all the pairwise interactions between the event types. As our second contribution, we propose three estimation algorithms founded on the expectation-maximization (EM) technique in identifying Hawkes processes (Yuan et al., 2019; Zhou, Zha, & Song, 2013b; Halpin & Boeck, 2013) . A latent structure introduced in the E-step to attribute the source of each event realization (Figure 1 ) permits tractable solutions in the M-step, which we exploit to optimize the embedding directly. Third, our findings on diverse social datasets demonstrate improvements upon the State of the Art (SotA) models, while also yielding qualitatively intuitive representations (Salehi, Trouleau, Grossglauser, & Thiran, 2019; Xu, Farajtabat, & Zha, 2016; Zhou, Zha, & Song, 2013a) . We successfully learn the regional diffusion process for the 2014-2015 Ebola outbreak as well as the influences between options trades in different stocks. We also optimized an embedding for the South Korean early COVID-19 outbreak (Kim, 2020) and the winter stage of the pandemic in Los Angeles (Welsh, 2020) . One active line of work, paralleling ours, embeds each event based on some conceived heuristic like temporal proximity (Torricelli, Karsai, & Gauvin, 2020; Zhu & Xie, 2019; Zuo, Liu, Lin, Guo, Hu, & Wu, 2018) ; another direction of inquiry entails the estimation of lowrank multivariate Hawkes processes (Nickel & Le, 2020; Lemonnier, Scaman, & Kalogeratos, 2017; Junuthula, Haghdan, Xu, & Devabhaktuni, 2019) . We differ from both in learning an actual metric-space representation vis-à-vis the real Hawkes likelihood. To our knowledge, we are the first to propose a Euclidean embedding scheme driven entirely by an underlying marked Hawkes process. Consider a record of event occurrences ( , ) ∈ H, = 1, 2, . . . , with marked types ∈ {1, 2, . . . } at times ∈ [0, ). We assume that events with certain marks excite future events of either the same or another type. Multiple such interactions may be present, and we desire to identify those that are warranted by the available observations. A compact representation would, in effect, induce a shrinkage prior on the continuum of allowable interactions. Its form should align with our inductive bias. We begin with preliminaries. The multivariate intensity function ( , ), conditional on the events in [0, ), dictates the instantaneous frequency of event ( , ). 
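To ground these preliminaries, the following sketch (in NumPy, with every name and value hypothetical rather than taken from the paper's code) builds a toy marked record and evaluates a multivariate conditional intensity with a single exponential kernel; the parametrization developed below replaces the explicit pairwise coefficients with distances in a latent embedding.

```python
import numpy as np

# A toy record of (time, mark) pairs: mark u_i indexes one of U event types.
times = np.array([0.3, 0.9, 1.4, 2.2, 3.1])
marks = np.array([0, 2, 1, 0, 2])
U = 3

# Hypothetical parameters: background rates mu[u], pairwise jump sizes A[u_from, u_to],
# and a shared exponential decay rate beta (events influence only the future).
mu = np.full(U, 0.1)
A = np.array([[0.2, 0.1, 0.0],
              [0.0, 0.3, 0.1],
              [0.1, 0.0, 0.2]])
beta = 1.5

def intensity(t, u):
    """Conditional intensity of event type u at time t, given all events strictly before t."""
    past = times < t
    dt = t - times[past]
    # Sum of exponentially decaying responses from each past event's mark to type u.
    return mu[u] + np.sum(A[marks[past], u] * beta * np.exp(-beta * dt))

print([round(intensity(3.5, u), 4) for u in range(U)])
```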
Any interval [ , + d ) witnesses Poisson-distributed instances of event type with rate ( , )d . We decompose this intensity (Bacry & Muzy, 2016) into self-and cross-excitations and an intrinsic background rate, not knowing a priori which event triggered which: A variety of approaches exists to estimate or infer the ( × ) separate response functions, ↦ → ℎ( , , ), indexed by the influencing event type and influenced type . Short records and ambiguities in the combinatorial structure of potential causation hinder proper identification of actual interactions. In the presence of competing renditions for the O ( 2 ) coefficients, we elect the structure that follows from the manifold hypothesis. Concretely, we envision a Euclidean geometry that adequately captures the interactions between event types. We estimate distinct embeddings for receiving, ∈ ⊂ R , versus influencing, ∈ ⊂ R , event types because otherwise we inadvertently constrain ourselves to symmetrical interactions. Multivariate Hawkes processes generally admit directed interactions. Accordingly, the finite collections of vectors and constitute the embedding of a putative manifold wherein the multivariate component of the excitations is characterized by − 2 -the influence that some event type exerts on . From hereon, we shall express as and as to ease the notational burden. Each response is the result of a dyadic interaction between an event ( , ) in the past and the potential occurrence of an event with type at the present . The bank of basis functions combines Gaussian spatial proximities and exponential clocks with support forward in time: Eqs. 2 & 3 form the spatial and temporal basis of = 1, 2, . . . kernels comprising the response function, tentatively produced below and expanded with full nuance in Eq. 8. Evidently, our construction enables multiscale responses in space (with spreads ) and time (with decays ). Later we introduce the ingredients ( ) and to relieve a constraint on the magnitude of each response. The proposed model family is framed as a strict superset of the parametric Hawkes process with exponential kernels: setting = 1, as we did in most of our experiments, collapses back to the exponential temporal response function. A generalized Poisson process yields a clean log likelihood function, written as follows: However, direct gradient-based optimization is prone to instability and a painstaking selection of hyperparameters. We turn to an E-M procedure with an augmented objective. Suppose we happened upon the expected branching structure (Halpin & Boeck, 2013; Reinhart, 2018) of the realized point process. In other words, we introduced latent variables [ ] ∈ R × × holding expectation estimates of ∈ {0, 1}, indicating whether it was the event instance that triggered instance , and attributing responsibility to kernel basis . Approximate knowledge of the untenable true line of causation endows us with the so-called complete-data log likelihood termed log (Yuan et al., 2019; Veen & Schoenberg, 2008; Reinhart, 2018) , an expectation of the joint log-probability density of the record and the latent variables [ ] in terms of their probabilities [ ]. log ∑︁ By abuse of notation let ℎ (· · · ) denote the th response kernel. Note that in the above form, what was previously a logarithm of summations (see Eq. 4 and 1) is replaced by a weighted sum of decoupled logarithms. The probability that the event instance was due to the background white Poisson process is ; ∀ , , + = 1. 
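A minimal sketch of the response just assembled, under the assumption of hypothetical embeddings and parameter names: the multivariate part of each excitation depends only on the distance between the influencing point and the receiving point, through a Gaussian of per-kernel spread, the temporal part is an exponential clock with support forward in time, and a per-type, per-kernel magnitude stands in for the exertion coefficient discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
U, d, K = 4, 2, 2                     # event types, embedding dimension, kernels
x = rng.normal(size=(U, d))           # influence points (one per event type)
z = rng.normal(size=(U, d))           # reception points (one per event type)
sigma2 = np.array([0.5, 2.0])         # spatial spreads per kernel
beta = np.array([3.0, 0.5])           # temporal decay rates per kernel
gamma = np.array([[0.3, 0.4],         # hypothetical per-type, per-kernel magnitudes
                  [0.2, 0.1],
                  [0.4, 0.2],
                  [0.3, 0.3]])

def response(u_past, v, dt):
    """h(u_past -> v, dt): sum over kernels of magnitude * spatial proximity * exponential clock."""
    if dt <= 0:
        return 0.0
    sq = np.sum((z[v] - x[u_past]) ** 2)
    spatial = np.exp(-sq / (2.0 * sigma2)) / (2.0 * np.pi * sigma2)   # Gaussian in the embedding
    temporal = beta * np.exp(-beta * dt)                              # exponential, forward in time
    return float(np.sum(gamma[u_past] * spatial * temporal))

print(round(response(0, 2, 0.7), 5))
```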
Concretely, given a model ( , ), our expectation for the particular branch of event cascades where event triggered event via kernel is the ratio of that particular contribution to the overall intensity: Lemma 1. The complete-data log likelihood in Eq. 5 provides a lower bound for the point process log likelihood (Eq. 4): log ≤ log . Proposition 1. The lower bound of Lemma 1 is maintained when the latent variables' expectations [ ] take on any value as long as ∀ , , + = 1. Our proof of Proposition 1 relies on Lemma 1 in the supplementary material. The righthand term in Eq. 5 simplifies vastly if one assumes that ∀ ∈ , ∀ , ∈ ( , ) = 1, and that ∀ ∀ , ∫ ( − )d ≈ 1. The latter approximation is tenable for large enough ; the former is not. Only through certain concessions may we gain confidence that the sum is roughly unit (equal to one). First note that by the Gaussian integral, ∀ ∫ R ( , )d = 1. Veen & Schoenberg (2008) approximated this sum as a Gaussian integral and demonstrated the viability of such alongside later studies. This approximation holds as long as the spatial occurrence of events is distributed uniformly in R . In our embedding scheme, they are not: events are clumped at discrete locations of their types, and the objective function must also be constrained so that it does not drift into regimes that violate the approximation. Thusly, we are left with a coerced normalization of Eq. 2; We rely on the unaltered form in Eq. 2 for the derivation of analytical maximizers during optimization, as is well established in seismology and other spatiotemporal studies (Reinhart, 2018) ; nevertheless, we found empirically that the intervention in Eq. 7 stymies potential drift towards degeneracy. Granular control of the magnitudes is desirable, so the final touch is the introduction of one more kernel parameter, ( ), to directly represent the total after summation. The spatial kernel thus handles only the distribution of influence across its receptors. The full response function is therefore Eq. 8. We eliminate redundancies by post hoc constraining the exertion coefficient −1 =1 ( ) = 1 and scaling the basis coefficients appropriately. Furnished with the expected branching structure in Eq. 6 (the "Expectation" step), we perform projected gradient ascent by setting partial derivatives of the complete-data log-likelihood with respect to each kernel parameter to zero (the "Maximization" step). Eventually, the potential trigger routes are aggregated in certain ways to form coefficient estimates. Omitting the domains of summation over and , as implicitly {1, 2, . . . } and {1, 2, . . . } × {1, 2, . . . } respectively, the solutions unfold as the following: At times, it is necessary to preserve focus on the acceptable time horizons for a particular domain. A Gamma( , ) prior on the decay rate admits the maximization a posteriori in Eq. 9, which trivially becomes uninformative at the assignment (1, 0) that we chose in our upcoming experiments. We included those extra parameters simply in case it becomes desirable to bias the model's time scale in the future. Empirically, we found that one is typically interested in the half-life (log 2/ ), the prior of which is the reciprocal of the aforementioned gamma distribution and characterized by the aptly named inverse-gamma distribution with expectation −1 . Preserving the mean while increasing both parameters strengthens the prior. The influences Φ = [ ( , )] consist of the kernels with time integrated out, i.e. 
( , ) There is evidence that this quantity encodes the causal network structure (Achab, Bacry, Gaiffas, Mastromatteo, & Muzy, 2017; Etesami, Kiyavash, Zhang, & Singhal, 2016) . Pursuant to the above maximization step, one may alternatively estimate all of these 2 degrees of freedom: that is what we will later term our baseline. We present two candidate approaches for estimating optimal Euclidean embeddings of the Hidden Hawkes Geometry (hence HHG) in a multivariate point process. One is based on the gradients of log , and the other on a diffusion-maps heuristic. Finally, we present a full-rank baseline estimator derived from the same EM algorithm. We learn embeddings directly via the EM objective function, log , rather than some heuristic. Our approach updates both the reception and influence embedding, concurrently with the rest of the parameters, during the M-phase. The influence points maximize their EM objective at tractable solutions to a set of decoupled equations. Observe the partial gradient with respect to an influence vector , after expanding the response functions: This expression is difficult to solve analytically. Recall, however, our prior simplifying assumption that ∀ , ∈ ( , ) = 1, also enforced a posteriori by means of Eq. 7. See the discussion above that surrounds this equation. The latter portion of Eq. 14 contains the form ( − ) ( , ), equivalent to taking a quantized "expectation" of a Gaussian variable subtracted by its own mean (see Eq. 2). Hence, as long as the Gaussian sum is assumed to be approximately unit, then the entire second part of Eq. 14 vanishes due to the contribution of the pieces involving . We garner the following-rather intuitive-formula for globally optimal influence points, within the microcosm of the current E-phase: Evidently each influence point ∈ is attracted to the reception points { ∈ } that it appears to excite. First-order optimizer (HHG-A). All parameters are updated simultaneously in each fixedpoint iteration. Solving the optimality conditions leads to a decoupled system of equations comprising functionally distinct blocks of parameters, like reception points, influence points, and kernel decay rates. Unfortunately there is no analogous solution for the reception points that could manifest by a sensible approximation like that of the Gaussian integral, above. The simplest strategy is to submit to regular gradient ascent with learning rate : climbing the average log likelihood We produce the gradient below, in implicit vector notation. To gain intuition on the selection of , we looked into entropic impact as a heuristic. The beautiful findings of McFadden, 1965 allowed us to reason about the contribution of shifting an embedding point to the differential entropy of a doubly stochastic point process: which admitted a simple rule of thumb for adjusting the learning rate of Eq. 16 proportionally to / , with other factors pertaining to domain idiosyncrasies. Maintaining this rule ameliorated convergence in our synthetic experiments of §3.1. Second-order optimizer (HHG-B). Further differentiating Eq. 17 with respect to every reception point leads to a block-diagonal Hessian matrix, , decoupled into ( × ) coordinate blocks indexed by event type, . Assuming each of these is negative definite-more on that below-then it is numerically trivial to invert them. A naïve Newton step could have trouble converging, so we employed Levenberg-Marquardt regularization to introduce a sort of learning rate, 1 (Nesterov & Polyak, 2006) . 
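Before the second-order details that follow, one EM sweep of the kind just described can be sketched end to end. Everything below is an illustrative simplification (a single kernel, the per-kernel spread weighting dropped from the influence update, and the reception-point gradient step omitted), not the reference implementation: the E-step computes responsibilities in the spirit of Eq. 6, and the M-step then pulls each influence point toward the responsibility-weighted mean of the reception points it is credited with exciting, in the spirit of Eq. 15.

```python
import numpy as np

rng = np.random.default_rng(0)
U, d = 3, 2                                   # event types, embedding dimension
times = np.array([0.2, 0.7, 1.1, 1.9, 2.4])
marks = np.array([0, 1, 2, 0, 1])
x = rng.normal(size=(U, d))                   # influence points
z = rng.normal(size=(U, d))                   # reception points
mu, beta, sigma2, gamma = np.full(U, 0.1), 2.0, 1.0, 0.5

def resp_kernel(j, i):
    """Single-kernel response of past event j on event i, using embedding distance."""
    dt = times[i] - times[j]
    sq = np.sum((z[marks[i]] - x[marks[j]]) ** 2)
    return gamma * np.exp(-sq / (2 * sigma2)) * beta * np.exp(-beta * dt)

def e_step():
    """Responsibilities p[i][j] that event j triggered event i; p[i][None] is the background."""
    p = []
    for i in range(len(times)):
        terms = {None: mu[marks[i]]}
        terms.update({j: resp_kernel(j, i) for j in range(i)})
        total = sum(terms.values())
        p.append({j: w / total for j, w in terms.items()})   # each row sums to one
    return p

def m_step_influence(p):
    """Each influence point becomes the responsibility-weighted mean of the reception
    points it appears to excite (a simplified version of the closed form in Eq. 15)."""
    num, den = np.zeros((U, d)), np.zeros(U)
    for i, row in enumerate(p):
        for j, w in row.items():
            if j is None:
                continue
            num[marks[j]] += w * z[marks[i]]
            den[marks[j]] += w
    keep = den > 0
    x_new = x.copy()
    x_new[keep] = num[keep] / den[keep, None]
    return x_new

x = m_step_influence(e_step())
print(np.round(x, 3))
```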
Newton-Raphson optimization may be viewed as maximizing a downwards-facing (convex) local quadratic approximation of the objective function. From that perspective, 1 plays the role of inserting a parabola centered at the current and strengthening local convexity. We additionally introduce a regularization parameter 2 that constrains how far the embedding points may escape the origin. It may be useful in quelling the aimless drift of unconnected event types. Following the previous analogy, 2 superimposes a parabola at the embedding origin, a Gaussian well in the log-likelihood space. High 1 grows jump sizes to mimic a learning rate, whereas 2 shrinks the jumps as a regularizer. In practice, we chose to execute four such optimization steps during each M-phase to approximately converge to the current global optimum. Recall that a sole global optimum exists for each block of parameters per M-phase; however, we are point-wise maximizing an upper bound to the true likelihood. Each EM iteration concludes at a solution that is not guaranteed to be optimal, much less global. Yet it does converge eventually (Hunter & Lange, 2004) . A multivariate function with negative-definite Hessian throughout the domain has one extremum, the global maximum, and is thereby convex. The reception-point Hessians are not negative definite per se. They may, however, be deconstructed into the difference of a diagonal component and an outer product of another matrix with itself: = diag( ) − . Clearly, if the elements of were all negative, then the whole would be negative definite. The vector is where the positive component inside the sum is bounded above by unit, and the negative component is not analogously bounded below. To avoid undesirable cases, we apply a zero ceiling to each element of before employing the additional regularizers via ( 1 , 2 ). Convexity is therefore ensured. Oftentimes, it suffices to only set one of the two-particularly 1 -to a nonzero value. Whenever only 2 is specified, it is implied that 1 → ∞. Could we posit a diffusion process across event types, and estimate coordinates that recreate diffusion distances on an approximate Riemannian manifold? Random-walk manifold embeddings are helpful in deep representations (Kalatzis, Eklund, Arvanitidis, & Hauberg, 2020; Li, Lindenbaum, Cheng, & Cloninger, 2019; Rey, Menkovski, & Portegies, 2019) . Construed as graph affinities, the influences Φ guide a Markovian random walk of which diffusion maps (Soize & Ghanem, 2016; Lian, Talmon, Zaveri, Carin, & Coifman, 2015; Coifman & Lafon, 2006 ) may be approximated via spectral decomposition. We found that asymmetrical diffusion-maps embeddings in the style of Pham & Chen, 2018 serve as an adequate initial condition for ( , ) but are not always conducive to stable learning in conjunction with the rest of our iterative procedure. We term the model learned entirely this way as HHG-DM. We briefly review the technique's application here; the curious reader is encouraged to peruse the theory presented in Coifman and Lafon's seminal publication (2006) . Casting the influence matrix as edge weights in a bipartite graph flowing between influence (col.) ↔ reception (row), we examine the diffusion process upon it (Pham & Chen, 2018) . We first normalize by density to our liking, per our selected value for the parameter 0 ≤ ≤ 1 Consider the row-stochastic version of , named . 
Its singular values multiplied by the left (orthonormal) eigenvectors thereof supply manifold embedding coordinates for the reception points, weighted by significance according to the singular values. Likewise, may be constructed as the row-stochastic transformation of from the same Eq. 19, of which the resultant coordinates grant us the influence points. In each set of coordinates, we preserve only those corresponding to the highest singular values-except for the largest, which is constant by definition. Did the reduction in degrees of freedom lend its hand to a more generalizable model of the point process? In order to motivate the reason for having an embedding at all-besides the gains in interpretability-we pitted the techniques HHG-A/B and HHG-DM against the following: having estimated the full-rank matrix entries (·, ·) directly (Zhou et al., 2013b) . We surmised adequate initial conditions for the EM procedure with a fixed empirical protocol. The surmised influence matrix came by summing up correlations between event types. Notice that it remains unscaled. Initial "coefficient"ˆwas computed as the naive reciprocal inter-arrival time between event typesˆ= ( − 1) −1 −1 =1 ( +1 − ). We justify this construction on basis of Eq. 9, which forms a weighted average over said arrival times to garner an optimal estimate for −1 . We feed the result of Eq. 21 into the diffusion-maps algorithm in order to obtain our initial embeddings (ˆ,ˆ) ∈ˆ×ˆ. ∀ˆ2 is initialized at the mean dyadic squared distance in the embedding;ˆ= (ˆ) −1 , in which variety is injected to nudge the kernels apart;ˆ= −1 and finally ∀ ,ˆ( ) = −1 −1 . First we explore simulated estimations to elucidate the comparative behavior of all four techniques: HHG-A, HHG-B, HHG-DM, and FRB. Then we present a number of findings and challenges regarding the characterization, namely, of a recent Ebola epidemic, the COVID-19 epidemic, and an illustrative portion of the options market. As the Euclidean norm, especially under a Gaussian kernel with exponential decay, suffers from the curse of dimensionality, all of the following experiments set ∈ {2, 3}. It remains the topic of future study to incorporate parsimonious yet more robust metrics. We intended to stress-test the learning algorithm under small record sizes and numerous event types; we thereby contrived a few scenarios with known ground-truth parameters sampled randomly. The underlying models had a single kernel = 1 in the response function and conformed to the formulations from §2.2. Each sampled model was simulated with the thinning algorithm (see e.g. and their supplementary material) in order to generate a time-series record of specified length . Reception and influence points were realized uniformly from a unit square ( = 2). Spatial bandwidths 2 were granted a gamma distribution with shape = 1 / √ and unit scale. Decay rates were standard log-normal as were backgrounds ( ), though scaled by 1 / . Stability (Bacry & Muzy, 2016) was ensured by setting = 1 / √ , constraining the Frobenius norm Φ = 1 that upper-bounds the 2 -induced norm, which itself upper-bounds the spectral radius of the influences (Φ), the real criterion. Outcome columns show, respectively, mean and standard deviation of test and train log likelihoods (cols. "Test" & "Train"), divergence from ground-truth branching structure (col. "Div."), and mean improvement in correlation of the model's embeddings from GloVe's, each with respect to the ground-truth space; t-test significance markers are provided as well (col. 
"Impr."). ***: ≤ 0.01, **: ≤ 0.05, *: ≤ 0.1. The correlation improvements all passed an Anderson-Darling test (Razali & Wah, 2011) for normality with confidence ≥ 0.95. In line with Goodhart's Law (1981) , different facets of the model apparatus were scrutinized. First, we sought to ascertain whether the baseline tends to reach high in-sample likelihoods yet abysmal out-of-sample likelihoods, including extreme outliers. We bootstrapped the mean difference between the train and test log-likelihoods, arriving at confidence intervals for the test-train gap in Figure 2 . This metric is a common indicator of overfitting. Its applicability is more tenuous in the empirical records scrutinized in §3.2, where the process is not guaranteed to be stationary. There are further questions one could ask than the descriptive means and standard deviations of train and test log in Table 1 . Did our models recover the chain of causation between excited events? We opened up the empirical [ ] estimates and computed their Hellinger distance (Campbell & Li, 2019) from those stipulated by the ground truth, as if they were categorical distributions. For each "to be caused" event , the quantities ( , ) ↦ → form empirical histograms that are poorly suited numerically for the more conventional KL-divergence measure, unlike the Hellinger distance. Was the matrix of asymmetric influences ( , ) recovered correctly? We visualized the squared errors between the ground-truth influences and those of the final estimate in Figure 3 . Digging deeper, it is often pertinent to examine residual distributions. Our point-process model considers some events as (probably) excited by the past and others as purely whitenoise background occurrences. These background events innovate the process by possibly triggering new dynamics. Should our model fit the record, then it may be called upon to sample the likely background events from the rest. The interarrival times of that subset ought to follow an exponential distribution with rate parameters ( ). By the Poisson superposition principle (Ogata, 1981) , we may inspect the distribution of all background events against Exp( ( )). Those quantiles are illustrated in Figure 4 . Finally, as the intrinsic dimensionality is unknown in practice, we reevaluated HHG-B under mismatched dimensionalities against the static ( = 2) ground truth. These results were included for the (30, 900)−case on the bottom of Table 1 . ( 1 , 2 ) remained unchanged. Comparison to GloVe. In addition to excitatory point processes, our technique shares commonalities with a dual realm: that of vector embeddings for words and other sequential entities. We sought to draw comparisons with the latest treatments in that field. Our plan consisted of feeding the ordered sequence of event-type occurrences into a typical application of the GloVe scheme (Pennington, Socher, & Manning, 2014) , in an attempt to recover a set of vectors that served as both influence and reception points. We designed a forward-looking window on the three next events to produce an asymmetric co-occurrence matrix, resulting in three embedding dimensions. The final column of Table 1 displays a systematic evaluation against GloVe's embeddings with reference to the ground-truth geometry. Concretely, all pairwise distances in each setting (that of our learned model and the newfound GloVe embeddings) were correlated to those of the ground truth by Kendall's rank-based nonparametric statistic (Newson, 2002) . 
The gap between GloVe's estimated correlation and the model's under scrutiny, each in [−1, 1] and where a positive difference means the model correlated more with the ground truth, was collected in each trial, and means along with a t-test significance (Student, 1908) were reported in the last column of Table 1 . GloVe is nondeterministic, so we obtained the sample mean of ten embedding correlations per trial. Each proposed model shows the best of 500 epochs; the others were trained until convergence. Log likelihoods are averaged over the record size. GEO refers to interactions modeled over actual geographic coordinates. ( = 3, 1 = 10 −2 , 2 = 10 −1 , = 2) Epidemics. Consider an infectious disease within a social apparatus, behaving like a diffusive point process. In 2019, a finely regularized variational approach to learning multivariate Hawkes processes from relatively short records (Salehi et al., 2019) was demonstrated on a dataset of symptom incidences during the ∼2014-2015 Ebola outbreak 1 . We gave the record precisely the same treatment the authors did, and obtained significantly higher commensurate likelihoods than their best case. In turn, they outperformed the cutting-edge approaches MLE-SGLP (Xu et al., 2016) and AHHG-DM4 (Zhou et al., 2013a ) that regularize towards sparsity. We also trained a model with an embedding fixed to the geographic coordinates of the 54 West African districts present in the dataset, requiring it only to learn the appropriate response function. Assessing the value of the real spatial component of this process, we found that only HHG-B and FRB successfully outperformed the spatiotemporal model. See Table 2 . COVID-19. The novel coronavirus brought the world to its knees almost a year ago, as of this writing. The human suffering and socioeconomic disruptions have persisted since. Identifying harbingers for COVID-19's spatiotemporal transmission is of paramount importance for proactive policymaking. The most readily available data consist of confirmed cases tallied at the end of each day; timestamps of higher fidelity are both impractical and of questionable value, since a positive test result indicates that virus transmission occurred some days ago, the precise number of which is ambiguous. Further, the inhomogeneous testing rate per community and over time contributes to the nonstationarity of the process. Inherent nonstationarity stems from changing social practices and other exogenous conditions. Reports Figure 6 : Embeddings of the highest-scoring model for South Korea: our novel HHG-A. Each location is endowed with a pair of an ex and a dot, corresponding to receiving and influencing points respectively. Panels were faceted by qualitative color groups interpreted from Figure 5 . Under = 2, the raw hidden-space coordinates were plotted. We added normal noise with standard deviation of five units to each of the influence points (dots) to reveal the various colors stacked on top of each other. of deaths attributed to COVID-19 are potentially even less reliable as proxy indicators for transmission, due primarily to inconsistent protocols. As before, we only modeled pointprocess excitations. Incorporating a self-limiting aspect-as in the compartmental models (e.g. Bertozzi, Franco, Mohler, Short, & Sledge, 2020)-remains the topic of future study. South Korea. Early on in the pandemic, South Korea experienced a transient surge in cases that it swiftly suppressed by strict social controls. 
We believed we were likely to succeed in interpreting that record as a relatively stable Hawkes process, as opposed to most other countries that did not enforce the same measures. We found the meticulous dataset released by the Korean Centers for Disease Control & Prevention (Kim, 2020) . They detail 3,385 incidences from the early outbreaks across the 155 regions of the country, each with at least one infection occurrence. Diligent testing appears to contribute to the embedding's identifiability. The test set consisted of the last 30 days in the record, containing 631 incidences. The table in Figure 5 displays the resultant score of each model. It is also worth noting that the full-rank baseline FRB registered a significantly higher in-sample likelihood than all the other models did, indicative of excess overfitting. See the full HHG-A embeddings in Figure 6 . The optimal response function (with one kernel, = 1) predicted that each contagious individual infected = 0.714 others on average, with a half life of 3.34 days. Los Angeles is a dense metropolitan area with diverse demographics. We are interested in recovering the landscape of transmission dynamics. It usually differs from a uniform spatiotemporal diffusion process because the physical separation between cities is not commensurate to how much their residents come into contact. Distant cities connected by common routes between home and work may be nearby in a putative "interaction space." In the absence of reliable network information on contact rates-for mobility data introduce their own set of biases 2 -it would be useful to infer this from the infection rates themselves. Test Table 3 : Out-of-sample average log-likelihood for Los Angeles attained by each model. We found a dataset of confirmed cases in Los Angeles County curated by the Los Angeles Times (Welsh, 2020) . The sheer volume of infections per city is reflected in daily counts. To curb some of the nonstationary elements we truncated the record to the latest 50,000 cases as of December 12, 2020. To discretize the time series into events, we interpolated the cumulative daily counts of each city into a continuous signal. Specifically, we extended the curves linearly after a logarithmic (because of exponential growth) transformation. We then located the exact timestamp of consecutive increments of a specific threshold size; in this case study, that was ten infections. The resultant dataset consists of 5,000 events beginning roughly on November 9. Each of those events belongs to one of the 64 most impacted cities of Los Angeles County, depicted on the map in Figure 7 . The test set consisted of the last five days in the record, comprising 1,605 events out of the 5,000. Trying = 2, 3 and a handful of choices for the hyperparameter , we arrived at the results in Table 3 . Our novel parametrization HHG-DM scored the highest, and was visualized in Figure 7 through colorings that reflect topography in the HHG-DM embedding. Its predicted average infection rate was 0.578 and its half life was 0.0934 days: a little troublesome in comparison to the South Korean results, and likely due to the instability of this process. OpenStreetMap. Below are plotted the two principal components of the influence (dots) and reception (exes) embeddings accompanied by the same colorings as above. Options market. The intertwined market activity of options with underlying stocks TSLA, AAPL, and IBM during the unremarkable consecutive trading days of Sept. 15 & 18, 2017 led to 24 distinct event types. 
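As a concrete illustration of the Los Angeles discretization just described, the following sketch (with a hypothetical city's cumulative totals and the threshold of ten infections) interpolates the cumulative curve linearly in logarithmic space and emits one event timestamp per threshold crossing.

```python
import numpy as np

def counts_to_events(days, cumulative, threshold=10):
    """
    Convert cumulative daily case counts for one city into event timestamps:
    interpolate log(cumulative) linearly between days, then locate the exact
    times at which successive multiples of `threshold` are crossed.
    Assumes the cumulative totals are strictly increasing, so the log curve is monotone.
    """
    days = np.asarray(days, dtype=float)
    log_c = np.log(np.maximum(np.asarray(cumulative, dtype=float), 1.0))
    levels = np.arange(threshold, cumulative[-1] + 1, threshold)
    # Invert the piecewise-linear log-count curve at each threshold level.
    return np.interp(np.log(levels), log_c, days)

days = np.arange(6)
cumulative = [3, 7, 15, 28, 52, 90]          # hypothetical city totals
print(np.round(counts_to_events(days, cumulative, threshold=10), 2))
```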
Roughly 100,000 total trades per day were sampled on 120% and 80% fuzzy moneyness levels-as portrayed in Figure 10 -at the expiration dates 01/18/2019 & 04/20/2018 for both puts and calls. The historical data was procured from AlgoSeek with the generosity of Prof. Roger G. Ghanem. The last 45 minutes of trading comprised each day's test set. See Table 4 . An auxiliary accuracy metric in Table 4 is derived from categorical cross-entropy of the predicted event type at the time of an actual occurrence, rendered by the expression exp{E [log ( , ) − log ( , )]}, in other words the geometric average of prediction accuracies. Between events, intensities decrease monotonically and with a homogeneous rate, so the ratio of intensities should stay roughly the same between occurrences-barring relatively outsized background rates. The "naive" score is computed from simply the mean empirical rate of each event type. We visualized the two-dimensional color-coded embeddings in Figure 8 and dsitributional fits for background events in Figure 9 . 25% 50% 75% 100% 125% 150% 175% 200% strike price as percentage of underlying security price, i.e. moneyness 0.0 0.1 sampling probabilities 80% 120% Figure 10 : Trades at discrete strike prices were resampled according to quantized log-Gaussian profiles with reference to moneyness at any given point in time. Standard deviations, in logarithmic space, were half the separation between the two densities' centers. They were kept "loose" for the sake of seamless translation even under abrupt fluctuations in the underlying stock price. Table 4 : Market fits with associated half lives. In HHG-A, the best in-sample was picked out of a handful of candidates. A similar grid search was enacted on powers of 10 for 1 in HHG-B. Outcome of 1,000 epochs depicted. Categorical accuracy is a prediction score ∈ [0, 1] for the next event type, at the point of occurrence of the next event. We expose some empirical details for estimating models and then analyze significant results. Of our three proposals for hidden-embedding point process estimators HHG-B, HHG-A, and HHG-DM, we concede that the iterator's stability tends to decline in that order of enumeration. Observe, in Figure 11 , the epochs of the best models found. HHG-DM and HHG-B converge best in the synthetic experiments. Interestingly, that tendency did not transfer onto real datasets. That phenomenon demonstrates the value of theoretical guarantees for convergence. HHG-DM appears to learn in the same way that a random search with some upward drift does, contrary to our initial expectations. Every now and then the HHG-DM sampler would stumble upon a model with high in-sample likelihood that also garnered the best fit out of sample. On the other hand, HHG-A and even more so HHG-B typically achieved their ideal fits at the end of their learning curves, even though HHG-A underwhelmed in the synthetic cases. In the Korean dataset, for instance, both HHG-A and HHG-B reached their maximum likelihoods at the last epoch. That was not the case for HHG-DM. As a model would converge to its optimal estimate, certain overarching parameters like the response time scale (homogeneous across event-type pairs) would drift. See e.g. Figure 12 . It proved quite difficult to steer the optimizer away from its "destined" time scale. Disentangling time scales. Parsing out longer behaviors out of high-frequency trades is severely difficult, and the gamma prior failed to facilitate it. 
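The categorical accuracy reported in Table 4 can be read as the geometric mean, over held-out events, of the probability mass that the model's intensities assign to the event type that actually occurred. A minimal sketch, assuming a hypothetical array of per-type intensities evaluated at each event time:

```python
import numpy as np

def categorical_accuracy(intensity_rows, observed_types):
    """
    Geometric mean of predicted probabilities for the realized event types.
    intensity_rows[i] holds the intensity of every type at the i-th event time;
    normalizing each row gives the model's categorical prediction for that event.
    """
    lam = np.asarray(intensity_rows, dtype=float)
    probs = lam / lam.sum(axis=1, keepdims=True)
    picked = probs[np.arange(len(observed_types)), observed_types]
    return float(np.exp(np.mean(np.log(picked))))

lam = [[0.2, 0.5, 0.1], [0.4, 0.1, 0.3], [0.1, 0.1, 0.6]]
print(round(categorical_accuracy(lam, np.array([1, 0, 2])), 3))
```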
Whenever we implemented a prior strong enough to influence the learning algorithm, it would overpower the course of training and degenerate the model. Should one desire to investigate longer-term patterns through individual trades, it would be imperative to study higher-order interactions (Marmarelis & Berger, 2005) in the future. We display the epoch with the best training score in all of our experiments. Most notable in Table 1 of synthetic results are the following: HHG-A/B yield a higher test log than FRB in all three rows; HHG-DM's results are a little more mixed. The ability to reveal chains of causality between excited events, as measured by the Hellinger divergence from the ground-truth branching structure, is better in HHG than in FRB. On average, the embeddings mostly mirror the ground-truth space more accurately than GloVe does, but HHG-DM here proved most consistent with significant improvement throughout. We conjecture that the stiffness of a (normalized) spectral decomposition in HHG-DM provides some benefits and some drawbacks: it facilitates topological discovery-hence its correlation to the ground-truth layout-yet clashes with the real point-process objective. The direct optimizers of gradient-based HHG-A and curvature-informed HHG-B appear increasingly more resilient in general. The excessive overfitting of the full-rank Hawkes process (FRB) is illuminated in Figure 2 , as well as the more granular Figure 4 . As is the case for convergence in the synthetic experiments, HHG-B and HHG-DM stand out as most accurately recovering the influence matrix in Figure 3 . Focusing on the first two columns of the aforementioned figures, (15, 900) & (30, 900), we examine the relative performance of HHG-B and FRB as a function of increasing number of event types (the original process dimensionality). The full-rank baseline, FRB, exhibits inferior ability in recovering influences than HHG-B when the event types increase from 15 to 30, as witnessed in Figure 3 . A similar effect occurs to a lesser extent in Figure 4 . The main driving force behind the failure of FRB appears to manifest in the over-determination of possible interactions between event types. Included additionally are the results of model misspecification for HHG-B in the (30, 900) case within the final rows of Table 1 . Standing against the two-dimensional ground truth, models with = 1, 3, 4, 5 showcase the general sensitivity to over-compression ( = 1) yet FRB-like performance following under-compression ( = 3, 4, 5) . A surprising outcome is the divergence and GloVe-correlation improvement of the under-compressed models. All three of those models exhibit consistent lowered divergence metrics. The experiments with = 4, 5 yielded astounding correlation improvements, albeit of high variance. Sometimes-although not always, as is evident in the Ebola investigation below-a few extra degrees of freedom garner the helpful kind of flexibility. Benchmark on the Ebola dataset. Table 2 's results depict superiority in our models to the SotA. It is puzzling that our simple full-rank baseline EM estimator FRB outperforms the more sophisticated formulations, when MLE-SGLP resembles a regularized version thereof. Perhaps its shrinkage priors do not suit this particular dataset, or all ∼17,000 training events sufficiently fleshed out the possible interactions without need for regularization. Sparsity per se did not bear out as an appropriate constraint. 
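For completeness, the branching-structure divergence referenced above (col. "Div." of Table 1) treats each event's attribution over candidate causes as a categorical distribution and averages its Hellinger distance from the ground-truth attribution; a minimal sketch with hypothetical array shapes:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two categorical distributions (each sums to one)."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def branching_divergence(P_est, P_true):
    """
    Mean Hellinger distance over events.  Row i of each matrix holds event i's
    attribution probabilities across candidate causes (past events plus background).
    """
    return float(np.mean([hellinger(p, q) for p, q in zip(P_est, P_true)]))

P_true = np.array([[1.0, 0.0, 0.0], [0.2, 0.8, 0.0], [0.1, 0.3, 0.6]])
P_est  = np.array([[0.9, 0.1, 0.0], [0.3, 0.7, 0.0], [0.2, 0.2, 0.6]])
print(round(branching_divergence(P_est, P_true), 4))
```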
Our hidden-embedding model HHG-B was the only such estimator that outperformed the spatiotemporal model built on geographic "ground truths." This finding suggests that geographic information is recovered in that embedding, at least. Table 2 . It was difficult to transparently compare our methodologies against the SotA from Salehi et al., 2019. Part of the separation of outcomes could have been due to our framing that compels the estimator to learn a response decay rate on its own. The other cited models had those parameters selected via grid search, and remain static during the course of training. As witnessed in Figure 13 , our proposed models roughly converged to similar decay rates as those chosen in Salehi et al., 2019, but the paths followed were rather variable. The half lives of all models are largely similar, aligning with the time scales of the particular Ebola strain under consideration that had an incubation period of about 8-12 days (Kerkhove, Bento, Mills, Ferguson, & Donnelly, 2015) . Intrinsic dimensionality of the Ebola process. One would expect that a model like HHG-B would replicate FRB as the degrees of freedom equilibrate, i.e. → . However, that was not found to be the case. It appears that our fixed-point EM algorithms vastly complicate learning, for little payoff, in comparison to FRB when increases. We deemed it sufficient to demonstrate that the likelihood is not monotonic in embedding dimensionality . To make our case, listed are likelihoods of the most stable HHG-B as a function of : HHG-B(2) = −1.7, HHG-B(3) = −0.46, HHG-B(4) = −0.80. It therefore appears that the optimal dimensionality lies at = 3, hinting at an intrinsic Euclidean geometry of the process. Transmission network implicit in COVID-19 infection rates. The manifold structure of contagion networks has long been studied (Taylor, Klimm, Harrington, Kramár, Mischaikow, Porter, & Mucha, 2015) , although estimation thereof remains elusive. Our point-process models identified embedding candidates based on the multivariate statistics of the infections. For one, we observe that the two urban hubs Seoul and Busan are not as far apart in the hidden influence space ( Figure 6 ) as they are geographically. They still cluster in their respective neighborhoods-Busan to the right, Seoul to the left. The generally rural regions in between seem to spread across this topology. Regarding Los Angeles (Figure 7 ,) we first note that there is visible geographic locality preserved in the colorings. The reception embedding exhibits contiguous streaks of red/orange, of yellow/green, and of blue/violet. The influence embedding has communities of green/yellow scattered about. There are clear and acute differences between the embeddings and geography, albeit some geographic structure is recovered in the embedding. We expected divergences, as previously stated, since physical communities and even mobility networks do not always correspond to the underlying transmission process. All these factors may serve to inform an epidemiological model as proxies, in agglomeration, to the true contact rates between individuals. We provide an additional factor, derived from the measured infection rates themselves. It is distinct from the other datasets that estimate interaction between communities. We also demonstrated its predictive ability. On our results pertaining to the pandemic, we decided against including the auxiliary goodness-of-fit tests on inter-arrival times and classification accuracies. 
Our reasoning is that the COVID-19 infection process is clearly not a Hawkes process; it is unstable, multiphasic, and multifaceted. Our models provide a coarse approximation of its dynamics. Other quantities more specific to a Hawkes process may be misleading. On the other hand, the likelihood is salient on any point process and embeddings should yield qualitative meaning. Our other point-process applications demonstrate the validity of the presented models. Analysis of market embeddings. Events belonging to each stock ticker tend to attract influencing points of the same color in Figure 8 , with deviations probably due to relative perceptions by the market. Efficient estimators are necessary in order to discern a lack of stationarity; in this case, transferring our Sept. 15 HHG-A model onto the Sept. 18 test set yielded an average log improvement from 2.57 to 2.82. Vice versa, transferring from Sept. 18 to Sept. 15 led to a decrease from 2.72 to 2.45. Thus point-process behavior largely persists, but not entirely. Further, aggregate ( )'s confirm our broad intuition that out-of-money trades move markets the most (Yang, Kutan, & Ryu, 2018) . Quantile plots in Figure 9 indicate that the embedding models-all with the HHG-suffixoutperformed the baseline in statistically filtering out white background events according to their estimated probabilities . A constant-intensity Poisson process would witness arrival times distributed exponentially (Bacry & Muzy, 2016) ; therefore, any deviation from that in a model's sampled background events suggests the presence of uncaptured interactions. We observed that the categorical accuracy measured in Table 4 , assessing classification ability towards the next event to come by ignoring temporal granularities, is highest for HHG-B in the two trading days studied. It beats the naive score by at least 2.4% each time. Given the most consistently smooth learning curves under HHG-B during both synthetic experimentation and empirical investigations, we recommend the implementation of HHG-B in future endeavors. The final test-set likelihoods of HHG-B were always remarkably close to the top, if not the actually the highest. There was no empirical dataset scrutinized for which HHG-B failed significantly in comparison to HHG-A, HHG-B, or even FRB. The fundamental notion driving the doubly stochastic process, first attributed to Cox (1955) , manifests in a variety of ways including the Latent Point Process Allocation (Lloyd, Gunter, Osborne, Roberts, & Nickson, 2016) model. In fact, note the meteoric diversity of Bayesian methods permeating serially dependent point processes (e.g. Apostolopoulou, Linderman, Miller, & Dubrawski, 2019; Detommaso, Hoitzing, Cui, & Alamir, 2019) . Zhang et al. (2019) sample the branching structure in order to infer the Gaussian process (GP) that constitutes the influence function. GPs, usually accompanied by inducing points, sometimes directly modulate the intensity (Liu & Hauskrecht, 2019; Lloyd et al., 2016; Aglietti, Bonilla, Damoulas, & Cripps, 2019; Ding, Khan, Sato, & Sugiyama, 2018; Flaxman, Chirico, Pereira, & Loeffler, 2019) . Linderman and Adams (Linderman & Adams, 2015) took an approach that estimated a matrix very similar to our [ ], relying on discretized binning and variational approximation. Salehi et al. (2019) exploited a reparametrization trick akin to those in variational autoencoders in order to efficiently estimate the tensor of basis-function coefficients. 
Recent progress has been made in factorizing interactions with a direct focus on scalability (Nickel & Le, 2020) , improving on prior work in low-rank processes (Lemonnier et al., 2017) . Block models on observed interaction pairs also exist (Junuthula et al., 2019) . While these all achieve compact Hawkes processes, our methodology distinguishes itself by learning a Euclidean embedding with the semantics of a metric space, not a low-rank projection. Deep nonparametric (Poisson or even intensity-free) point process models 3 made massive strides in the past few years. However, not all domains permit such expressive characterization. Our baseline estimator resembles most closely the one described in the work of Zhou et al. (2013b) , whereas the spatiotemporal aspect is inspired from the likes of Schoenberg et al. (see Veen & Schoenberg, 2008; Mohler, 2014; Yuan et al., 2019, ) . Variational substitutes in the EM algorithm have also been explored (Cho, Galstyan, Brantingham, & Tita, 2014) . A concurrent study to ours by Zhu et al. (2020) parametrizes a heterogeneous kernel in real Euclidean space by deep neural networks. We demonstrated the viability of estimating embeddings for events in an interpretable metric space by tying it to a self-exciting point process. The three proposed expectationmaximization algorithms extract parsimonious serial dependencies to different degrees of reliability, partly depending on record length and dimensionality. The optimizer of second order HHG-B converges most consistently to an accurate model. In comparison to learning the full triggering matrix (with FRB,) our embedding models achieve a higher likelihood out of sample. The estimated time scale for Ebola contagion settles close to its natural rate. Our framework paves the way for generalization, extension to more elaborate models, and consequent potential for actionable insights. We seek to demonstrate that log ≤ log always. Recall the likelihood function and its complete-data relative: The intensity may be written as ( To minimize distractions, we focus on the structure of what is found inside the logarithms and what is found on the outside. Accordingly, the above expression takes the abstract form log in terms of positive (for they serve as arguments to a logarithm) weights and contents , over arbitrary indices . The concave logarithm admits the following relation, by Jensen's inequality: log ≤ log . Since = corresponding to expansion terms in the intensity function, Therefore we hold that log ≤ log . We are once again concerned with the behavior of now in the case that = ℎ ( , , − ) ( , ) and the contents of the aforementioned logarithms do not match-hence the apostrophes. Note that the above is equivalent to is minimized). The same cannot be said about maximization with respect to [ ], which depends on log ( , ) as well. The latter task proves useful in learning. Thus when we perform a single learning iteration and vary the model parameters to evaluate log on some [ ] fixed a priori, it will still offer a lower bound to log . Uncovering causality from multivariate hawkes integrated cumulants Structured variational inference in continuous cox process models. 33rd Conference on Neural Information Processing Systems Mutually regressive point processes. 
33rd Conference on Neural Information Processing Systems
First- and second-order statistics characterization of Hawkes processes and non-parametric estimation
The challenges of modeling and forecasting the spread of COVID-19
Universal boosting variational inference. 33rd Conference on Neural Information Processing Systems
Latent self-exciting point process model for spatial-temporal networks
Diffusion maps
Some statistical methods connected with series of events
Stein variational online changepoint detection with applications to Hawkes processes and neural networks
Bayesian nonparametric Poisson-process allocation for time-sequence modeling
When is a network epidemic hard to eliminate
Recurrent marked temporal point processes: Embedding event history to vector
Learning network of multivariate Hawkes processes: A time series approach
Testing the manifold hypothesis
Scalable high-resolution forecasting of sparse spatiotemporal events with kernel methods
Heterogeneities in the case fatality ratio in the West African Ebola outbreak 2013-2016
Comparison and assessment of epidemic models
Inflation, Depression, and Economic Policy in the West, chap. Problems of Monetary Management: The U.K. Experience
Modelling dyadic interaction with Hawkes processes
Spectra of some self-exciting and mutually exciting point processes
HawkesTopic: A joint model for network inference and topic modeling from text-based cascades
A tutorial on MM algorithms
Neural jump stochastic differential equations. 33rd Conference on Neural Information Processing Systems
The block point process model for continuous-time event-based dynamic networks
Variational autoencoders with Riemannian Brownian motion priors
A review of epidemiological parameters from Ebola outbreaks to inform early public health decision-making
DS4C: Data science for COVID-19 in South Korea
Multivariate Hawkes processes for large-scale inference
Diffusion variational autoencoders
Multivariate time-series analysis and diffusion maps
Scalable Bayesian inference for excitatory point process networks
Nonparametric regressive point processes based on conditional Gaussian processes. 33rd Conference on Neural Information Processing Systems
Latent point process allocation
Deep random splines for point process intensity estimation of neural population data. 33rd Conference on Neural Information Processing Systems
Temporal network embedding with micro- and macro-dynamics
General methodology for nonlinear modeling of neural systems with Poisson point-process inputs
The entropy of a point process
The neural Hawkes process: A neurally self-modulating multivariate point process
Self-exciting point process modeling of crime
Marked point process hotspot maps for homicide and gun crime prediction in Chicago
State-dependent Hawkes processes and their application to limit order book modeling
Cubic regularization of Newton method and its global performance
Parameters behind "nonparametric" statistics: Kendall's tau, Somers' d and median differences
Learning multivariate Hawkes processes at scale
On Lewis' simulation method for point processes
Deep mixture point processes: Spatio-temporal event prediction with rich contextual information
Mobile phone data for informing public
Fully neural network based model for general temporal point processes
GloVe: Global vectors for word representation
Large-scale spectral clustering using diffusion coordinates on landmark-based bipartite graphs
Spatio-temporal correlations and visual signaling in a complete neuronal population
Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests
A review of self-exciting spatio-temporal point processes and their applications
Diffusion variational autoencoders
Learning Hawkes processes from a handful of events
Geometric Hawkes processes with graph convolutional recurrent neural networks. Association for the Advancement of Artificial Intelligence
Generative sequential stochastic model for marked point processes
Data-driven probability concentration and sampling on manifold
The probable error of a mean
General compound Hawkes processes in limit order books
Topological data analysis of contagion maps for examining spreading processes on networks
weg2vec: Event embedding for temporal networks
Estimation of space-time branching process models in seismology using an EM-type algorithm
Los Angeles Times
Learning Granger causality for Hawkes processes
Option moneyness and price disagreements
Multivariate spatiotemporal Hawkes processes and network reconstruction
Efficient non-parametric Bayesian Hawkes processes
Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes
Learning triggering kernels for multi-dimensional Hawkes processes
Interpretable generative neural spatio-temporal point processes
Crime event embedding with unsupervised feature selection
Embedding temporal network via neighborhood formation