key: cord-0515733-ifvzq2aj authors: Calderon, Pio; Soen, Alexander; Rizoiu, Marian-Andrei title: Linking Across Data Granularity: Fitting Multivariate Hawkes Processes to Partially Interval-Censored Data date: 2021-11-03 journal: nan DOI: nan sha: 151f8d80ee2ca6787c3d96dea627dabeb5583184 doc_id: 515733 cord_uid: ifvzq2aj This work introduces a novel multivariate temporal point process, the Partial Mean Behavior Poisson (PMBP) process, which can be leveraged to fit the multivariate Hawkes process to partially interval-censored data consisting of a mix of event timestamps on a subset of dimensions and interval-censored event counts on the complementary dimensions. First, we define the PMBP process via its conditional intensity and derive the regularity conditions for subcriticality. We show that both the Hawkes process and the MBP process (Rizoiu et al. (2021)) are special cases of the PMBP process. Second, we provide numerical schemes that enable calculating the conditional intensity and sampling event histories of the PMBP process. Third, we demonstrate the applicability of the PMBP process by empirical testing using synthetic and real-world datasets: We test the capability of the PMBP process to recover multivariate Hawkes parameters given sample event histories of the Hawkes process. Next, we evaluate the PMBP process on the YouTube popularity prediction task and show that it outperforms the current state-of-the-art Hawkes Intensity process (Rizoiu et al. (2017b)). Lastly, on a curated dataset of COVID-19 daily case counts and COVID-19-related news articles for a sample of countries, we show that clustering on the PMBP-fitted parameters enables a categorization of countries with respect to the country-level interaction of cases and news reporting.
The Hawkes process, first introduced by Hawkes (1971), is a temporal point process that exhibits the self-exciting property, i.e., the occurrence of one event increases the likelihood of future events. The Hawkes process is widely applied in both the physical and social sciences. For example, earthquakes are known to be temporally clustered: the mainshock is often the first in a sequence of subsequent aftershocks (Ogata, 1988). In online social media, tweets by influential users typically induce cascades of retweets as the message diffuses over the social network (Rizoiu et al., 2017a). The multivariate Hawkes process (Hawkes, 1971) extends the univariate process by allowing events to occur in multiple parallel timelines, dubbed dimensions. These dimensions interact via cross-excitation, i.e., events in one dimension can spawn events in the other dimensions. Fig. 1 schematically exemplifies the interaction between two social media platforms: YouTube and Twitter. An initial tweet (denoted as A on the figure) spawns a retweet (B) via self-excitation and a view (C) via cross-excitation. The cross-excitation goes both ways: the view C generates the tweet D. Given the event timestamps, we can fit the parameters of the Hawkes process using maximum likelihood estimation (MLE). However, in many practical applications, the event times are not observed and only counts over predefined partitions of time are available. We denote such data as interval-censored.

Figure 1: Example of interaction between view events on YouTube (red lollipops) and tweets on Twitter (blue lollipops). The data is partially interval-censored, as YouTube does not expose individual views, but only the view counts $C_i$ over the predefined intervals $[o_i, o_{i+1})$ (shown as red rectangles). The dashed lines show the latent branching structure between views and tweets. Note that the red lollipops are also dashed and empty, indicating that YouTube views are not observed.
For multivariate data, we denote the case when all dimensions are interval-censored as completely interval-censored. If only a subset of dimensions is interval-censored, we have partially interval-censored data. One reason for interval-censoring is data availability - for epidemic data (Browning et al., 2021), we usually observe the aggregated daily counts of reported cases instead of detailed case information. Another reason could be space limitations - for network traffic data (Shlomovich et al., 2020), it is impractical to store high-resolution event logs, which are instead stored as summaries over bins. A third reason could be data privacy. This is the case for YouTube, as shown in the upper half of Fig. 1, where the individual views are interval-censored, and we only observe aggregated daily counts. This paper tackles three open questions related to using the multivariate Hawkes process with partially interval-censored data. The first question relates to fitting the process to both event time and interval-censored data. When the data is presented as event times, the multivariate Hawkes process can be fitted using MLE (Daley and Vere-Jones, 2003). For completely interval-censored univariate data, Rizoiu et al. (2021) propose the Mean Behavior Poisson (MBP) - an inhomogeneous Poisson process that approximates the mean behavior of the Hawkes process - to estimate its parameters. However, for partially interval-censored data, a model and fitting scheme are still elusive. The question is, can we devise a method to fit the multivariate Hawkes process in the partially interval-censored setting? What are the limits to Hawkes parameter recovery in the partially interval-censored setting? The second question relates to modeling and forecasting online popularity across social media platform boundaries.
Online popularity has been extensively studied within the realm of a single social media platform - see Twitter (Kobayashi and Lambiotte, 2016; Zarezade et al., 2016; Mishra et al., 2016; Zadeh and Sharda, 2022), YouTube (Crane and Sornette, 2008; Ding et al., 2015; Rizoiu et al., 2017b), and Reddit (Medvedev et al., 2018; Krohn and Weninger, 2019) - and self-exciting point processes are a tool of choice for modeling. However, content is often shared across multiple interacting platforms - such as YouTube and Twitter - and we need to account for cross-excitation using multivariate processes. But YouTube only exposes views data as interval-censored, rendering it impossible to use the classical multivariate Hawkes process. The Hawkes Intensity process (HIP) (Rizoiu et al., 2017b) proposes a workaround and treats the tweet and share counts as external stimuli for views. Its shortcoming is that it cannot model the cross-excitation from views to tweets and shares. The question is, can we improve performance on the YouTube popularity prediction task by jointly modeling the views, tweets, and shares and fitting on partially interval-censored data? The third question concerns analyzing interaction patterns across the online and offline environments, enabling us, for example, to determine whether online activity preempts or reacts to events that happen offline. Previous studies (Ortiz-Martínez et al., 2020; Ciaffi et al., 2020; Strzelecki, 2020; Walker et al., 2020) have studied the relationship between Google Trends, a proxy for internet search activity, and COVID-19 incidence. However, most of these studies calculate the correlation between time series, and do not build explanatory models to produce nuanced views of the interactions through interpretable parameters.
The question is, can we apply multivariate Hawkes processes on partially interval-censored data to uncover country-level differences in the interplay between recorded daily case counts of COVID-19 and the publication of COVID-19-related news articles? We address the first question, in Section 3, by introducing the Partial Mean Behavior Poisson (PMBP) process, a novel multivariate temporal point process that can handle partially interval-censored data. On the interval-censored dimensions, PMBP's event intensity is the expected Hawkes intensity over the stochastic history of the said dimensions, conditioned on the history of the event time dimensions. On these latter dimensions, PMBP's intensity equals that of the corresponding Hawkes process. This construction enables fitting PMBP on partially interval-censored data and approximating the corresponding multivariate Hawkes process parameters through parameter equivalence. We identify two sources of information loss when fitting the multivariate Hawkes process using PMBP: (1) the model mismatch between the multivariate Hawkes and PMBP, and (2) the loss attributable to partial interval-censoring, which conceals exact event timings for a subset of dimensions. To differentiate between the two sources of loss, in Section 4.1, we perform extensive synthetic experiments. We identify the loss of type (1) by fitting both PMBP and Hawkes on event time data, and the loss of type (2) by fitting PMBP on both event time and partially interval-censored data. We find a more substantial impact of the loss of type (2), which unsurprisingly increases with the length of the censoring intervals. Notably, while some fitted parameters are consistently over- and under-estimated, the spectral radius - a multidimensional generalization of the branching factor - is surprisingly accurately fitted regardless of the information loss.
Finally, PMBP generalizes both the multivariate Hawkes process and the MBP process - we show that the conditional intensities and regularity conditions of these two processes are special cases of the conditional intensity and regularity conditions of PMBP. We address the second question in Section 4.2 by using PMBP to fit the popularity of YouTube videos jointly on YouTube and Twitter. We show that PMBP not only outperforms the current state-of-the-art HIP (Rizoiu et al., 2017b), but PMBP can also quantify prediction uncertainty, and it produces predictions on all dimensions (HIP could only predict the views dimension). We address the third question in Section 4.3 by using PMBP to uncover the link between COVID-19 case incidence and news coverage. We fit a country-specific PMBP process for 11 countries using a dataset of reported COVID-19 cases (interval-censored) and the publication dates of COVID-19-related news articles during the early stage of the COVID-19 outbreak. We cluster countries based on the fitted PMBP parameters and identify three groupings. In the first group - consisting of the UK, Spain, Germany, and Brazil - we found that news is preemptive: an increase in the news coverage leads to a rise in the number of cases. In the second group (China and France), the news is reactionary and lags behind cases. No significant interaction between news and cases was observed in the third group (US, Italy, Sweden, India, and the Philippines). The main contributions of this work are as follows. C1: We introduce the Partial Mean Behavior Poisson (PMBP(d, e)) process, a novel multivariate temporal point process which approximates the mean behavior of the d-dimensional Hawkes process over the stochastic history of a subset of dimensions, while retaining dependence on the history of the complementary dimensions (Section 3.1).
We develop the theoretical underpinnings of the PMBP process by deriving its conditional intensity function (Theorem 12) and proving regularity conditions (Theorem 21). We develop the computational aspects of the PMBP process by presenting an algorithm that allows numerical approximation of the conditional intensity (Section 3.4) and adapting the thinning algorithm to enable sampling (Section 3.5). C2: We show that the PMBP process approximately recovers the multivariate Hawkes parameters in the partially interval-censored setting (Section 4.1). We show that of the two sources of information loss present in this approach, interval censoring contributes to a higher degree of over- and underestimation than the model mismatch. C3: We show that the PMBP process outperforms HIP (Rizoiu et al., 2017b), the current state-of-the-art in the task of YouTube popularity prediction (Section 4.2). C4: We present a categorization of 11 selected countries into three groups (preemptive, minimal, reactionary) based on the interaction between case counts and news publication dates during the early stage of the COVID-19 outbreak, uncovered by inspecting and clustering the parameters of the fitted PMBP process (Section 4.3). Notation. In this paper, bold font represents either vectors, i.e., $\boldsymbol{\mu}, \boldsymbol{\alpha}$, or functions returning vector values, i.e., $\boldsymbol{\lambda}(\cdot), \boldsymbol{\varphi}(\cdot)$. The symbols $\mathbb{R}$, $\mathbb{R}^+$, $\mathbb{N}$ denote the reals, the nonnegative reals, and the natural numbers, respectively. The Dirac delta function $\delta(\cdot)$ is defined by $\int_a^b \delta(t) f(t)\, dt = f(0)$ when $f$ is continuous at $0$ with $a < 0 < b$. Furthermore, define $\boldsymbol{\delta}(t)$ to be the diagonal matrix $\mathrm{diag}(\delta(t), \ldots, \delta(t))$. We define the Iverson bracket $[\![p]\!] = 1$ if $p$ is true and $[\![p]\!] = 0$ otherwise. Let $L^p(\mathbb{R}^+)$ denote the space of measurable functions $f : \mathbb{R}^+ \to \mathbb{R}$ with finite $L^p$-norm, i.e., $\|f\|_p := \big( \int_0^\infty |f(\tau)|^p\, d\tau \big)^{1/p} < \infty$. The convolution of $f, g \in L^1(\mathbb{R}^+)$ is given by $(f * g)(t) := \int_0^t f(\tau) g(t - \tau)\, d\tau$ - when convenient, we use the notation $f(t) * g(t) := (f * g)(t)$.
For $n$ repeated convolutions of $f$, we set $f^{\otimes n} := \underbrace{f * \cdots * f}_{n \text{ times}}$. Given the $(m \times n)$-dimensional matrix of functions $\mathbf{f}$ and the $(n \times p)$-dimensional matrix of functions $\mathbf{g}$, the matrix convolution of $\mathbf{f}$ and $\mathbf{g}$ is defined to be the $(m \times p)$-dimensional matrix $(\mathbf{f} * \mathbf{g})(t) := \int_0^t \mathbf{f}(\tau) \cdot \mathbf{g}(t - \tau)\, d\tau$, where $\cdot$ is the matrix product. In this section, we review the necessary material from the point process literature that is needed to define the Partial Mean Behavior Poisson (PMBP) process. First, in Section 2.1, we define the multivariate Hawkes process (MHP) and the fundamental functions that complete its description: the conditional intensity, compensator, and likelihood function. Second, in Section 2.2, we reintroduce the Mean Behavior Poisson (MBP) process from Rizoiu et al. (2021), which will be extended in subsequent sections. The univariate Hawkes process is a well-known temporal point process that models a single type of event that displays self-exciting behavior. Informally, the self-exciting property means that the occurrence of an event increases the probability that another event will occur in the near future. This property enables events to be clustered in time, which mirrors the bursty dynamics that we oftentimes see in real-world phenomena - e.g., earthquake modeling (Ogata, 1988), social media popularity, and stock orders in finance (Bacry et al., 2015). In the more general multivariate Hawkes process (MHP), multiple interacting event types can be taken into account. Given $d$ types of events, the corresponding $d$-dimensional Hawkes process is a temporal point process where each dimension tracks the dynamics of one event type. In addition to being self-exciting, the MHP is cross-exciting among event types, i.e., an event of one type increases the probability of an event of any type occurring in the near future. For more details on application areas of MHPs, see Section 5. Temporal point processes can be specified in two ways.
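Before moving on, note that the repeated and matrix convolutions defined above are straightforward to approximate numerically on a uniform time grid. The following is a minimal sketch using a Riemann sum; the helper name and parameter values are ours, chosen for illustration:

```python
import numpy as np

def matrix_convolve(f, g, dt):
    """Approximate (f * g)(t) = integral_0^t f(tau) . g(t - tau) dtau on a
    uniform grid. f has shape (T, m, n), g has shape (T, n, p)."""
    T = f.shape[0]
    out = np.zeros((T, f.shape[1], g.shape[2]))
    for k in range(T):
        # Riemann sum over tau = 0, dt, ..., k*dt; einsum does the matrix products
        out[k] = np.einsum('jab,jbc->ac', f[: k + 1], g[k::-1]) * dt
    return out

# Sanity check in the scalar (1x1) case: for f = g = e^{-t}, (f * g)(t) = t e^{-t}
dt = 0.01
f = np.exp(-np.arange(0, 2, dt))[:, None, None]
out = matrix_convolve(f, f, dt)
```

At $t = 1$, `out[100, 0, 0]` is close to $e^{-1}$, the exact value of $t e^{-t}$ at $t = 1$, up to the $O(dt)$ discretization error of the Riemann sum.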
We can define the associated counting process, $\mathbf{N}(t)$, which represents the number of events that occurred up until time $t$. Alternatively, we can specify the conditional intensity function of the process. Let $\mathcal{H}^j_{t^-}$ be the set of all events that occur in dimension $j$ up until $t$, for $j \in \{1, \ldots, d\}$. The conditional intensity function $\boldsymbol{\lambda}\big(t \mid \bigcup_{j=1}^d \mathcal{H}^j_{t^-}\big)$ is defined component-wise as $$\lambda^j(t) := \lim_{h \downarrow 0} \frac{\mathbb{E}\big[ N^j(t + h) - N^j(t) \mid \bigcup_{k=1}^d \mathcal{H}^k_{t^-} \big]}{h},$$ which gives the instantaneous probability of a dimension-$j$ event occurring in the increment $[t, t + dt)$, conditioned on all events that happen before $t$. As per standard convention (Daley and Vere-Jones, 2003), we denote $\boldsymbol{\lambda}(t) := \boldsymbol{\lambda}\big(t \mid \bigcup_{j=1}^d \mathcal{H}^j_{t^-}\big)$ for brevity. Alternatively, the conditional intensity can be viewed as the mean number of events occurring in an infinitesimal interval, conditioned on the past. A simple multivariate extension of the result in Rasmussen (2018) gives $$\mathbb{E}\Big[ d\mathbf{N}(t) \,\Big|\, \bigcup_{j=1}^d \mathcal{H}^j_{t^-} \Big] = \boldsymbol{\lambda}(t)\, dt. \quad (1)$$ The conditional intensity of the $d$-dimensional Hawkes process is given by $$\boldsymbol{\lambda}(t) = \boldsymbol{\mu}(t) + \int_0^t \boldsymbol{\varphi}(t - \tau) \cdot d\mathbf{N}(\tau) \quad (2)$$ $$= \boldsymbol{\mu}(t) + \sum_{j=1}^{d} \sum_{t^j_k \in \mathcal{H}^j_{t^-}} \boldsymbol{\varphi}^j(t - t^j_k), \quad (3)$$ where the function $\boldsymbol{\mu}(t)$ is called the background intensity, which controls the arrival of external events into the system. The matrix $\boldsymbol{\varphi}(t)$ is called the Hawkes kernel, a $d \times d$ matrix of functions that characterizes the self- and cross-excitation across the event types representing the $d$ dimensions. The diagonal entries $\varphi_{jj}(t)$ and off-diagonal entries $\varphi_{ij}(t)$, $i \neq j$, represent the self- and cross-exciting components of the Hawkes kernel, respectively. The function $\boldsymbol{\varphi}^j(t)$ represents the $j$-th column of the Hawkes kernel. The Hawkes kernel is often specified in a parametrized form, $\varphi_{ij}(t) = \alpha_{ij} f_{ij}(t)$ with $f_{ij}(t)$ a probability density function on $\mathbb{R}^+$ (Definition 1), to facilitate simple interpretability of the model. Let $D$ denote the index set $\{1, \ldots, d\}$. The parameter $\alpha_{ij}$ is called the branching factor from $j$ to $i$, and the matrix $\boldsymbol{\alpha} = (\alpha_{ij}) \in (\mathbb{R}^+)^{d \times d}$ is called the branching matrix. The branching factor $\alpha_{ij}$ gives the expected number of direct offsprings in dimension $i$ spawned by a parent event in dimension $j$.
For the function $f_{ij}(t)$, we consider the commonly used exponential kernel, given by $$f_{ij}(t) = \theta_{ij} e^{-\theta_{ij} t}, \quad t \geq 0, \quad (4)$$ where the $\theta_{ij}$ parameter is called the decay rate, which controls the rate of influence decay from $j$ to $i$. Suppose that the background intensity is assumed to be constant in all dimensions, i.e., $\boldsymbol{\mu}(t) = \boldsymbol{\mu}$. An MHP with an exponential kernel under this case thus has a parameter set $\Theta = \{\boldsymbol{\mu}, \boldsymbol{\alpha}, \boldsymbol{\theta}\}$. By integrating the conditional intensity, we obtain another important measure: the compensator $\boldsymbol{\Lambda}(t)$ of the process. Definition 2 Given a temporal point process with conditional intensity $\boldsymbol{\lambda}(t)$, the compensator $\boldsymbol{\Lambda}(t)$ is defined as $$\boldsymbol{\Lambda}(t) := \int_0^t \boldsymbol{\lambda}(s)\, ds,$$ where $0 \leq s \leq t$. Remark 3 The compensator $\boldsymbol{\Lambda}(t)$ can be interpreted as the expected number of events over $[0, t)$, conditioned on the history of the process. By integrating Eq. (3), we obtain an explicit form for the compensator $\boldsymbol{\Lambda}(t)$ of the $d$-dimensional Hawkes process: $$\boldsymbol{\Lambda}(t) = \mathbf{M}(t) + \sum_{j=1}^{d} \sum_{t^j_k \in \mathcal{H}^j_{t^-}} \boldsymbol{\Phi}^j(t - t^j_k),$$ where $$\mathbf{M}(t) := \int_0^t \boldsymbol{\mu}(s)\, ds \quad (7) \qquad \text{and} \qquad \boldsymbol{\Phi}(t) := \int_0^t \boldsymbol{\varphi}(s)\, ds. \quad (8)$$ Regularity condition. A univariate Hawkes process is subcritical if the expected number of direct and indirect offsprings (i.e., the progeny) spawned by a single parent is finite. In this case, the Hawkes process is expected to die out as $t \to \infty$. The intuition for the multivariate Hawkes process is similar, but in this case we have to consider that any event in one dimension is capable of producing events in any other dimension by cross-excitation. A multivariate Hawkes process is subcritical if the progeny resulting from a single event of dimension $j$ to dimension $i$ is finite for every pair $(i, j) \in D \times D$. Definition 4 Let $\lambda_1, \ldots, \lambda_d$ be the eigenvalues of the branching matrix $\boldsymbol{\alpha}$. The spectral radius $\rho(\boldsymbol{\alpha})$ of $\boldsymbol{\alpha}$ is defined as $$\rho(\boldsymbol{\alpha}) := \max_{1 \leq i \leq d} |\lambda_i|.$$ For a one-dimensional Hawkes process, the spectral radius is exactly the branching factor, the expected number of secondary events triggered by a parent event. If the branching factor is less than one, the Hawkes process is subcritical. If the branching factor is greater than one, the process is supercritical, and the progeny of a single parent event is expected to be infinite as $t \to \infty$.
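To make the definitions above concrete, the following sketch evaluates the conditional intensity of a 2-dimensional Hawkes process with exponential kernels, computes the spectral radius of its branching matrix, and numerically checks the geometric matrix series that yields the expected total progeny. All parameter values are illustrative, not taken from the paper:

```python
import numpy as np

# Illustrative 2-D parameters (chosen for this example only)
mu = np.array([0.2, 0.1])                  # constant background intensities
alpha = np.array([[0.5, 0.3],              # alpha[i, j]: expected direct offspring
                  [0.2, 0.4]])             # in dimension i per parent event in j
theta = np.array([[1.0, 2.0],
                  [1.5, 0.8]])             # decay rates of the exponential kernels

def intensity(t, events):
    """lambda(t) for a 2-D Hawkes process with kernels
    phi_ij(tau) = alpha_ij * theta_ij * exp(-theta_ij * tau)."""
    lam = mu.copy()
    for tk, j in events:                   # events: (timestamp, dimension) pairs
        if tk < t:
            lam += alpha[:, j] * theta[:, j] * np.exp(-theta[:, j] * (t - tk))
    return lam

# Spectral radius of the branching matrix: rho(alpha) < 1 means subcritical
rho = np.max(np.abs(np.linalg.eigvals(alpha)))

# Geometric series of matrices: partial sums of sum_{n>=0} alpha^n approach
# (I - alpha)^{-1} when rho(alpha) < 1; (I - alpha)^{-1} - I is the expected
# total progeny (all generations) per (i, j) pair
S, P = np.eye(2), np.eye(2)
for _ in range(200):
    P = P @ alpha
    S = S + P
progeny = np.linalg.inv(np.eye(2) - alpha) - np.eye(2)
```

For example, `intensity(1.0, [(0.2, 0), (0.7, 1)])` returns a vector strictly above `mu` in both dimensions, and `rho` here equals 0.7, so the example process is subcritical.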
A supercritical Hawkes process is also called explosive. The following proposition is a standard result that characterizes the convergence of a geometric series of matrices. We use it to obtain a closed-form expression of the total progeny produced by events in every dimension of the MHP. Proposition 5 (Hubbard and Hubbard (2002)) If $\rho(\boldsymbol{\alpha}) < 1$, $\sum_{n=0}^{\infty} \boldsymbol{\alpha}^n$ converges and is equal to $(\mathbf{I} - \boldsymbol{\alpha})^{-1}$. The subcriticality condition for a multivariate Hawkes process is given by the following theorem, for which we provide a proof sketch. Theorem 6 A Hawkes process with branching matrix $\boldsymbol{\alpha}$ is subcritical if $\rho(\boldsymbol{\alpha}) < 1$. Proof Let $\rho(\boldsymbol{\alpha}) < 1$. Suppose we have one parent event in dimension $j \in D$, and let us consider the offsprings of this parent event. The expected number of direct (i.e., first-generation) offsprings in dimension $i$ is $\alpha_{ij}$. The expected number of second-generation offsprings in dimension $i$ is $\sum_{k=1}^{d} \alpha_{ik} \alpha_{kj} = (\boldsymbol{\alpha}^2)_{ij}$, which is intuitively the number of dimension-$i$ offsprings of the first-generation offsprings of the dimension-$j$ parent event. By the same argument, the number of $m$-th generation offsprings would then be $(\boldsymbol{\alpha}^m)_{ij}$. Thus, it follows that the $(i, j)$ element of $\sum_{n=1}^{m} \boldsymbol{\alpha}^n$ tracks the total number of dimension-$i$ offsprings up to the $m$-th generation produced by a single parent event in dimension $j$. In the limit $m \to \infty$, this sum converges to $(\mathbf{I} - \boldsymbol{\alpha})^{-1} - \mathbf{I}$ by Proposition 5, so the total progeny is finite and we can conclude that the Hawkes process is subcritical. Furthermore, the expected number of dimension-$i$ offsprings of a dimension-$j$ parent event is given by the $(i, j)$ element of $(\mathbf{I} - \boldsymbol{\alpha})^{-1} - \mathbf{I}$. Parameter estimation. Suppose that we are given a set of observed events $\bigcup_{j=1}^{d} \mathcal{H}^j_{T^-}$ up until some maximum time $T > 0$. Our task is to find the parameter set $\Theta$ that best fits this given set of observations. The standard approach is maximum likelihood estimation (MLE), where we find $\Theta$ that maximizes the probability of observing the data given the point process model.
Equivalently, we can minimize the negative log-likelihood function $\mathcal{L}\big(\Theta; \bigcup_{j=1}^d \mathcal{H}^j_{T^-}\big)$ of the parameter set $\Theta$. For the $d$-dimensional Hawkes process, this is given by $$\mathcal{L}_{\text{PP-LL}}\Big(\Theta; \bigcup_{j=1}^d \mathcal{H}^j_{T^-}\Big) = \sum_{j=1}^{d} \Big[ \Lambda^j(T) - \sum_{t^j_k \in \mathcal{H}^j_{T^-}} \log \lambda^j(t^j_k) \Big].$$ We add the subscript PP-LL (Point-Process Log-Likelihood) to emphasize that the likelihood is evaluated with respect to event timestamps. [Algorithm 1: Thinning algorithm (Ogata, 1981). Input: kernel matrix $\boldsymbol{\varphi}(t)$, background intensity $\boldsymbol{\mu}$, time horizon $T > 0$. Output: a sampled event sequence.] Sampling. Given an MHP, the standard approach to sample event sequences is via the thinning algorithm discussed in Ogata (1981). This technique converts the task of sampling a Hawkes process into the significantly simpler task of sampling a homogeneous Poisson process. The rate of this Poisson process is obtained as an upper bound on the Hawkes conditional intensity and is recomputed every time a new event is accepted. Proposed events from the procedure are 'thinned' out with rejection sampling using the Hawkes conditional intensity. Algorithm 1 shows how to sample event sequences from a $d$-dimensional Hawkes process given a constant background intensity. Consider a univariate Hawkes process with conditional intensity $\lambda(t \mid \mathcal{H}_{t^-})$. The Mean Behavior Poisson (MBP) process introduced by Rizoiu et al. (2021) is defined as the temporal point process with conditional intensity $\xi(t)$ given by $$\xi(t) := \mathbb{E}_{\mathcal{H}_{t^-}}\big[ \lambda(t \mid \mathcal{H}_{t^-}) \big] = \mu(t) + (\varphi * \xi)(t),$$ where $*$ denotes the convolution operator. Evidently, there is a close relation between the univariate Hawkes process and the MBP process. The MBP intensity is in fact the average of the Hawkes conditional intensity over the set of realizations of the Hawkes process. The MBP process is an inhomogeneous Poisson process since it does not depend on its event history. Analogous to Eq. (11), the MBP compensator $\Xi(t)$ is defined as the expectation of the Hawkes compensator $\Lambda(t)$ over the stochastic history of the process. That is, $$\Xi(t) := \mathbb{E}_{\mathcal{H}_{t^-}}\big[ \Lambda(t) \big].$$ MBP as a linear time-invariant system. It was shown by Rizoiu et al. (2021) that the map $\mu(t) \Rightarrow \xi(t)$ in Eq. (13) defines a linear time-invariant (LTI) system (Phillips et al., 2003).
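Returning to the sampling discussion above: for exponential kernels the conditional intensity only decays between events, so the total intensity at the current time is a valid upper bound for the next candidate. The following is an illustrative sketch of this thinning scheme with hypothetical parameters, not a transcription of the paper's Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hawkes(mu, alpha, theta, T):
    """Thinning sampler for a d-dimensional Hawkes process with exponential
    kernels and constant background intensity mu."""
    d = len(mu)
    events = []                            # (time, dimension) pairs

    def lam_at(t):
        lam = mu.astype(float).copy()
        for tk, j in events:
            lam += alpha[:, j] * theta[:, j] * np.exp(-theta[:, j] * (t - tk))
        return lam

    t = 0.0
    while True:
        lam_bar = lam_at(t).sum()          # upper bound on future total intensity
        t += rng.exponential(1.0 / lam_bar)
        if t >= T:
            break
        lam = lam_at(t)                    # actual intensity at the candidate time
        if rng.uniform() * lam_bar <= lam.sum():
            i = rng.choice(d, p=lam / lam.sum())   # pick dimension proportionally
            events.append((t, i))
    return events

events = sample_hawkes(np.array([0.5, 0.5]),
                       np.array([[0.3, 0.1], [0.1, 0.3]]),
                       np.array([[1.0, 1.0], [1.0, 1.0]]), 50.0)
```

Rejected candidates still advance the clock, which is valid here because the bound is recomputed at the new time and the intensity is non-increasing in between.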
In an LTI system, the following properties hold: • Linearity. If $\mu_1(t) \Rightarrow \xi_1(t)$ and $\mu_2(t) \Rightarrow \xi_2(t)$, then $a \mu_1(t) + b \mu_2(t) \Rightarrow a \xi_1(t) + b \xi_2(t)$ for scalars $a, b$. (14) • Time invariance. Let $t_0 > 0$. If $\mu(t) \Rightarrow \xi(t)$, then $\mu(t - t_0) \Rightarrow \xi(t - t_0)$. (15) The response $\xi(t)$ to the input $\mu(t)$ can be obtained by solving for the response of the system to the Dirac impulse $\delta(t)$ and applying the properties of linearity and time invariance. With this approach, $\xi(t)$ can be expressed as $$\xi(t) = \mu(t) + (h * \mu)(t), \quad (16)$$ where $$h(t) := \sum_{n=1}^{\infty} \varphi^{\otimes n}(t) \quad (17)$$ and $\otimes n$ corresponds to $n$-time convolution. Assuming that we use the parameterization for $\varphi$ in Definition 1, a sufficient condition for the subcriticality of the MBP process is $\alpha < 1$. This condition ensures that the infinite sum in Eq. (17) converges and tends to zero as $t \to \infty$. In the sequel, we will consider the multivariate LTI system $\boldsymbol{\mu}(t) \Rightarrow \boldsymbol{\xi}(t)$ (Dudgeon and Mersereau, 1990), where linearity (Eq. (14)) and time invariance (Eq. (15)) must hold in every dimension. The response of the multivariate LTI system can similarly be obtained by solving for the Dirac impulse response on each dimension and then applying the two LTI properties. Parameter estimation in interval-censored settings. Since the MBP process is a Poisson process, its increments are independent, which allows the likelihood function to be expressed as a sum of the likelihoods of disjoint Poisson distributions. This enables the MBP process to be fitted in interval-censored settings via maximum likelihood estimation. Suppose that instead of observing the sequence of events $\mathcal{H}_{T^-}$, we observe interval-censored counts over a given partition of $[0, T)$, which we denote as $\mathcal{P}[0, T)$. Furthermore, assume that the partition is subdivided into $m$ subintervals, so that $[0, T) = \bigcup_{k=1}^{m} [o_{k-1}, o_k)$, and for each subinterval $[o_{k-1}, o_k)$ we are given the count $C_k$ of events that occur. In this setting and given the MBP process, the negative log-likelihood function $\mathcal{L}\big(\Theta; \{C_k\}_{k=1}^m\big)$ can be obtained with the following result.
Proposition 7 (Rizoiu et al. (2021)) Suppose we are given interval-censored counts $\{C_k\}_{k=1}^m$ over the partition $[0, T) = \bigcup_{k=1}^m [o_{k-1}, o_k)$. The negative log-likelihood function of an MBP process with intensity $\xi(t)$ and compensator $\Xi(t)$ is given by $$\mathcal{L}_{\text{IC-LL}}\big(\Theta; \{C_k\}_{k=1}^m\big) = \sum_{k=1}^{m} \Big[ \big( \Xi(o_k) - \Xi(o_{k-1}) \big) - C_k \log\big( \Xi(o_k) - \Xi(o_{k-1}) \big) \Big] + \text{const},$$ where we add the subscript IC-LL (Interval-Censored Log-Likelihood) to make explicit the fact that we are calculating the likelihood with respect to interval-censored counts. Hawkes intensity process. The Hawkes intensity process (HIP), introduced in Rizoiu et al. (2017b), is a Hawkes-based temporal point process that can be fit to interval-censored data. It was used primarily for the task of YouTube popularity prediction, where the views on a particular YouTube video are interval-censored, and external shares and tweets that mention the video act as the exogenous intensity $\mu(t)$. Given the partition $\{o_k\}_{k=0}^{m}$, where $o_0 = 0$ and $o_m = T$, and the associated view counts $\{C_k\}_{k=1}^m$, the HIP model is fitted by finding the parameter set $\Theta$ that minimizes the sum of squares error $$\sum_{k=1}^{m} \big( C_k - \hat{\xi}[k] \big)^2, \quad (19)$$ where $\hat{\xi}[\cdot]$ is a discretized version of the MBP intensity, $$\hat{\xi}[t] = \mu[t] + \sum_{\tau=1}^{t} \varphi[\tau]\, \hat{\xi}[t - \tau].$$ The use of brackets in the previous equations is to emphasize that the involved quantities are discretized over a partition of time. With the observation that the MBP intensity $\xi(t)$ and the HIP intensity $\hat{\xi}[t]$ are evidently similar, it was proven in Rizoiu et al. (2021) that HIP is a discretized approximation of the MBP process, where instead of optimizing the likelihood in Proposition 7, the sum of squares error in Eq. (19) is used. In this section, we introduce the Partial Mean Behavior Poisson (PMBP) process: a multivariate temporal point process that can be used to fit partially interval-censored data. In Section 3.1, we define the PMBP process via its conditional intensity and discuss its relationship with the multivariate Hawkes process and the MBP process. In Section 3.2, we derive the convolutional formula of the PMBP conditional intensity by solving for the impulse response of the associated LTI system.
Next, in Section 3.3, we derive the regularity conditions for PMBP and show that the regularity conditions for the multivariate Hawkes process and the MBP process are special cases of the PMBP regularity conditions. Sections 3.4 and 3.5 discuss two numerical schemes: the former for calculating the PMBP conditional intensity given observed data, and the latter for sampling the PMBP process. Lastly, in Section 3.6, we discuss maximum likelihood estimation for fitting the PMBP process in the partially interval-censored setting. Let $d \in \mathbb{N}$ and $e \in \mathbb{N} \cup \{0\}$ such that $d \geq e$. We introduce the following partition of the index set $D = \{1, \ldots, d\}$: $E := \{1, \ldots, e\}$ and $E^c := \{e + 1, \ldots, d\}$. Definition 8 Consider a $d$-dimensional Hawkes process with conditional intensity $\boldsymbol{\lambda}(t)$ as defined in Eq. (3). The Partial Mean Behavior Poisson process PMBP(d, e) is the stochastic process with conditional intensity $$\boldsymbol{\xi}_E(t) := \mathbb{E}_{\mathcal{H}^j_{t^-},\, j \in E}\Big[ \boldsymbol{\lambda}(t) \,\Big|\, \mathcal{H}^j_{t^-},\, j \in E^c \Big]. \quad (20)$$ The PMBP(d, e) conditional intensity $\boldsymbol{\xi}_E(t)$ is obtained by taking the conditional intensity $\boldsymbol{\lambda}(t)$ of a $d$-dimensional Hawkes process and averaging over the set of event histories $\bigcup_{j \in E} \mathcal{H}^j_{t^-}$ in the $E$ dimensions. Under this construction, the PMBP(d, e) process can be viewed as a collection of $d$ processes with the following properties: the dynamics in the $E$ dimensions follow an $e$-dimensional inhomogeneous Poisson process, while the dynamics in the $E^c$ dimensions are governed by a $(d - e)$-dimensional Hawkes process. In practical settings, the $E$ dimensions coincide with the event types for which exact event times are not accessible and only interval-censored event counts can be obtained. On the other hand, the $E^c$ dimensions coincide with event types with complete event time information. As a specific example, $E$ would be YouTube views and $E^c$ the set of tweets, e.g., Fig. 1. It is clear from Eq. (20) that the Hawkes process and the MBP process are special cases of the PMBP(d, e) process. If we set $e = 0$, then $E = \emptyset$ and Eq.
(20) becomes $$\boldsymbol{\xi}_{\emptyset}(t) = \boldsymbol{\lambda}(t),$$ which implies that the PMBP(d, 0) process is the $d$-dimensional Hawkes process. On the other hand, if $e = d$, then $E = D$ and Eq. (20) becomes $$\boldsymbol{\xi}_D(t) = \mathbb{E}_{\mathcal{H}^j_{t^-},\, j \in D}\big[ \boldsymbol{\lambda}(t) \big] = \boldsymbol{\xi}(t),$$ showing that the PMBP(d, d) process is the $d$-dimensional MBP process. The compensator $\boldsymbol{\Xi}_E(t)$ of the PMBP(d, e) process can be defined in a similar way to the compensator $\boldsymbol{\Lambda}(t)$ of the Hawkes process in Definition 2. Definition 9 The compensator $\boldsymbol{\Xi}_E(t)$ of a PMBP(d, e) process is $$\boldsymbol{\Xi}_E(t) := \int_0^t \boldsymbol{\xi}_E(s)\, ds,$$ where $0 \leq s < t$. The PMBP(d, e) compensator $\boldsymbol{\Xi}_E(t)$ can be interpreted as the expected number of events over $[0, t)$ given the history of the process, similar to the interpretation of the Hawkes compensator $\boldsymbol{\Lambda}(t)$ in Remark 3. Eq. (20) expresses the conditional intensity $\boldsymbol{\xi}_E(t)$ of a PMBP(d, e) process as the conditional expectation of the intensity of a $d$-dimensional Hawkes process over the history of a subset of its dimensions. Evaluating this expression directly is impractical and computationally costly, as it requires the simulation of many samples of the Hawkes process. In this section, we present an explicit formula for $\boldsymbol{\xi}_E(t)$ that can be used as the basis for a numerical scheme to approximate the conditional intensity. Our method is based on expressing $\boldsymbol{\xi}_E(t)$ as the impulse response solution of a Linear Time-Invariant (LTI) system, extending the procedure in Rizoiu et al. (2021) for calculating the conditional intensity of the MBP process to the partially interval-censored setting. Consider the Hawkes kernel $\boldsymbol{\varphi}(t)$, expressed in matrix form as $$\boldsymbol{\varphi}(t) = \big[ \boldsymbol{\varphi}^1(t) \ \cdots \ \boldsymbol{\varphi}^d(t) \big].$$ As we recall from Section 2.1, $\boldsymbol{\varphi}^j(t)$ represents the $j$-th column of $\boldsymbol{\varphi}(t)$, written in bold to emphasize that it is a column vector. Now, define $\boldsymbol{\varphi}_E(t)$ and $\boldsymbol{\varphi}_{E^c}(t)$ to be the matrices $$\boldsymbol{\varphi}_E(t) = \big[ \boldsymbol{\varphi}^1(t) \ \cdots \ \boldsymbol{\varphi}^e(t) \ \mathbf{0} \ \cdots \ \mathbf{0} \big], \qquad \boldsymbol{\varphi}_{E^c}(t) = \big[ \mathbf{0} \ \cdots \ \mathbf{0} \ \boldsymbol{\varphi}^{e+1}(t) \ \cdots \ \boldsymbol{\varphi}^d(t) \big].$$ Simply put, $\boldsymbol{\varphi}_E(t)$ is $\boldsymbol{\varphi}(t)$ with the columns corresponding to the $E^c$ dimensions zeroed out. Similarly, $\boldsymbol{\varphi}_{E^c}(t)$ is $\boldsymbol{\varphi}(t)$ with the columns corresponding to the $E$ dimensions zeroed out.
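The column-zeroing that produces $\boldsymbol{\varphi}_E(t)$ and $\boldsymbol{\varphi}_{E^c}(t)$ is a one-line operation on the kernel matrix evaluated at any fixed $t$. A minimal sketch (the helper name is ours; note the 0-based column indices, so $E$ maps to the first $e$ columns):

```python
import numpy as np

def split_kernel(phi_t, e):
    """Given a d x d kernel matrix evaluated at some fixed t, return
    (phi_E, phi_Ec): phi_E keeps the first e columns (the E dimensions)
    and zeroes the rest; phi_Ec does the opposite."""
    phi_E = phi_t.copy()
    phi_E[:, e:] = 0.0
    phi_Ec = phi_t.copy()
    phi_Ec[:, :e] = 0.0
    return phi_E, phi_Ec

phi_t = np.arange(9.0).reshape(3, 3)       # placeholder values for phi(t)
phi_E, phi_Ec = split_kernel(phi_t, 1)     # d = 3, e = 1
```

By construction, `phi_E + phi_Ec` recovers the original matrix, which is the decomposition used in the next step.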
Observe that $$\boldsymbol{\varphi}(t) = \boldsymbol{\varphi}_E(t) + \boldsymbol{\varphi}_{E^c}(t).$$ We first prove the following preliminary result, which relates the PMBP conditional intensity $\boldsymbol{\xi}_E(t)$ to the counting process $\mathbf{N}(t)$ of the $d$-dimensional Hawkes process. Lemma 10 Let PMBP(d, e) be a partial MBP process with conditional intensity $\boldsymbol{\xi}_E(t)$ and $\mathbf{N}(t)$ be the counting process of the corresponding $d$-dimensional Hawkes process. The following holds: $$\mathbb{E}_{\mathcal{H}^j_{t^-},\, j \in E}\Big[ d\mathbf{N}(t) \,\Big|\, \mathcal{H}^j_{t^-},\, j \in E^c \Big] = \boldsymbol{\xi}_E(t)\, dt. \quad (25)$$ Proof See Appendix A. Lemma 10 extends the result in Eq. (1) for the $d$-dimensional Hawkes process to the partial multivariate case considered by the PMBP(d, e) process. To go from Eq. (1) to Eq. (25), we simply replace the total expectation with the conditional expectation over event histories in the $E$ dimensions. The following theorem provides an expression for the conditional intensity $\boldsymbol{\xi}_E(t)$ of a PMBP(d, e) process in terms of the Hawkes kernel $\boldsymbol{\varphi}(t)$ and the background intensity $\boldsymbol{\mu}(t)$. Theorem 11 Given the Hawkes process with conditional intensity Eq. (3), the conditional intensity of its corresponding PMBP(d, e) process is given by $$\boldsymbol{\xi}_E(t) = \mathbf{s}(t) + (\mathbf{h}_E * \mathbf{s})(t), \quad (27)$$ where $$\mathbf{h}_E(t) := \sum_{n=1}^{\infty} \boldsymbol{\varphi}_E^{\otimes n}(t) \quad (28)$$ and the effective input $\mathbf{s}(t)$ collects the background intensity and the contribution of the observed events in the $E^c$ dimensions, $$\mathbf{s}(t) := \boldsymbol{\mu}(t) + \sum_{j \in E^c} \sum_{t^j_k \in \mathcal{H}^j_{t^-}} \boldsymbol{\varphi}^j(t - t^j_k). \quad (29)$$ Proof (sketch) Starting from Eq. (20), we expand $\boldsymbol{\lambda}(t)$ using Eq. (2) and take the conditional expectation, which shows that the map $\mathbf{s}(t) \Rightarrow \boldsymbol{\xi}_E(t)$ corresponds to an LTI system (see Appendix A). For $i \in D$, let $\mathbf{E}_i(t)$ be the system's response to $\boldsymbol{\delta}_i(t)$, the unit impulse in dimension $i$ given by the $i$-th column of the diagonal matrix $\boldsymbol{\delta}(t)$. Solving for the impulse responses, applying linearity and time invariance, and using the definition of $\mathbf{h}_E$ in Eq. (28) together with the commutativity of convolution, the response to a general input can be written in terms of $\boldsymbol{\delta}(t) + \mathbf{h}_E(t)$. Imposing the assumption $\lim_{n \to \infty} \boldsymbol{\varphi}_E^{\otimes n}(t) = \mathbf{0}$ and applying the specific form of $\mathbf{s}(t)$ in Eq. (29), we obtain the desired formula, Eq. (27); the full derivation is given in Appendix A. The additional assumption $\lim_{n \to \infty} \boldsymbol{\varphi}_E^{\otimes n}(t) = \mathbf{0}$ introduced in Theorem 11 is a necessary condition for the convergence of $\mathbf{h}_E(t)$ in Eq. (28).
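The convolutional structure described here, $\boldsymbol{\xi}_E(t) = \mathbf{s}(t) + (\mathbf{h}_E * \mathbf{s})(t)$ with $\mathbf{h}_E$ truncated after a few terms, can be approximated on a uniform grid. A numerical sketch for a PMBP(2, 1) process with exponential kernels; the parameter values and $E^c$ event times are hypothetical, chosen only for illustration:

```python
import numpy as np

dt = 0.01
t = np.arange(0.0, 5.0, dt)
mu = np.array([0.2, 0.1])                  # hypothetical background intensities
alpha = np.array([[0.4, 0.3], [0.2, 0.3]])
theta = np.array([[1.0, 1.0], [1.0, 1.0]])

# phi(t) on the grid, shape (T, 2, 2); phi_E zeroes the E^c column (dimension 2)
phi = alpha[None] * theta[None] * np.exp(-theta[None] * t[:, None, None])
phi_E = phi.copy()
phi_E[:, :, 1] = 0.0

def mat_conv(a, b):
    # (a * b)[k] ~ sum_j a[j] . b[k - j] * dt, matrix-valued on the grid
    out = np.zeros_like(a)
    for k in range(len(t)):
        out[k] = np.einsum('jab,jbc->ac', a[: k + 1], b[k::-1]) * dt
    return out

# h_E = sum_{n>=1} phi_E^{(*n)}, truncated; its (1,1) entry reduces to the
# univariate sum over phi_11, with closed form a*th*exp(-th*(1-a)*t)
h_E = np.zeros_like(phi)
term = phi_E.copy()
for _ in range(15):
    h_E += term
    term = mat_conv(phi_E, term)
h11_closed = alpha[0, 0] * theta[0, 0] * np.exp(-theta[0, 0] * (1 - alpha[0, 0]) * t)

# Effective input s(t): background plus contributions of observed E^c events
ec_events = [1.005, 3.505]                 # hypothetical event times, off-grid
s = np.tile(mu, (len(t), 1))
for tk in ec_events:
    idx = t > tk
    shifted = ((t[idx] - tk) / dt).astype(int)
    s[idx] += phi[shifted][:, :, 1]        # kernel column of the E^c dimension

# xi_E = s + h_E * s (matrix-vector convolution on the grid)
xi = s.copy()
for k in range(len(t)):
    xi[k] += np.einsum('jab,jb->a', h_E[: k + 1], s[k::-1]) * dt
```

The resulting intensity stays at or above the background level and jumps each time an observed $E^c$ event arrives, as expected from the role of the effective input.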
Intuitively, this entails that the intensity contribution of late-generation offspring has to go to zero asymptotically for the infinite sum to converge. This is discussed further in Appendix D. Remark 13 In general, ξ E (t) does not admit a closed-form solution in terms of the kernel parameters in ϕ because the infinite sum of convolutions in Eq. (28) cannot be written in closed form. However, in the special case of a PMBP(2, 1) process with the exponential kernel (Eq. (4)), we do have a closed-form solution for ξ E (t). Appendix M contains the derivation of this closed-form expression. Remark 14 Under a given effective input s(t), the system s(t) ⇒ ξ E (t) returns the resulting intensity of the PMBP process averaged over the stochastic history of the E dimensions. Given that the effective input s(t) treats the intensity contribution of events in the E c dimensions as exogenous, only events in the E dimensions are self- and cross-exciting in s(t) ⇒ ξ E (t). In other words, the associated branching process for s(t) ⇒ ξ E (t) considers the scenario where only events in the E dimensions produce offspring. Under s(t) = δ j (0), Eq. (36) gives the response of the LTI system for τ > 0. Integrating both sides of Eq. (37) over [0, t) and taking the limit of both sides as t → ∞, we see that if the integral of the right-hand side diverges to ∞, then the expected number of events, in view of Remark 3, explodes as t → ∞. In order to have a finite expected number of events in the branching process over the E dimensions, the integral of h E (t) over [0, ∞) must be finite. Stated differently, we require h E ∈ L 1 (R + ). The PMBP(d, e) compensator Ξ E (t) in Definition 9 satisfies a law similar to Eq. (27): the PMBP(d, e) compensator Ξ E (t) is given by the analogous expression, where M(t) and Φ(t) are the integral of the background intensity µ(t) and the integral of the Hawkes kernel ϕ(t), as defined in Eq. (7) and Eq.
, where we zero out the columns corresponding to the E c dimensions. Proof See Appendix A. In this section, we prove the subcriticality condition for the PMBP(d, e) process, analogous to Theorem 6 for the d-dimensional Hawkes process. We first go over prerequisites and then prove some preliminary results. Lemma 17 allows us to compute an upper bound on the n-fold convolution of the matrix kernel ϕ, and Theorem 20 provides a condition on the branching matrix α under which h E (t), defined in Eq. (28), is convergent. Our main result, the subcriticality conditions for the PMBP(d, e) process, is contained in Theorem 21. Given that the d-dimensional Hawkes process and the MBP process are special cases of the PMBP(d, e) process, we show in Corollary 22 that our derived subcriticality conditions include the corresponding conditions for the Hawkes and MBP processes. Recall that L p (R + ) is the space of functions f : R + → R that have finite p-norm. We require the following result on convergence in L p spaces. Theorem 16 (Folland (1999)) If 1 ≤ p < ∞, every absolutely convergent series in L p (R + ) converges. Our first preliminary result is an upper bound on the 1-norm of each entry of ϕ ⊗n (t), expressed in terms of the n th power of the branching matrix α. We use this upper bound in the proof of Theorem 20, which identifies the condition on the branching matrix α for a convergent h E (t). Lemma 17 Let α = (α ij ) ∈ (R + ) e×e and ϕ(t) = (ϕ ij (t)) ∈ (R + ) e×e be a matrix kernel whose entrywise 1-norms are bounded by the entries of α; then the 1-norms of the entries of ϕ ⊗n (t) are bounded by the entries of α n . Proof See Appendix B. To proceed, we introduce a way of writing the branching matrix α in block form with respect to E and E c . In contrast to the multivariate Hawkes process, where the dimensions play interchangeable roles, there is an asymmetry with regard to the roles of the E and E c dimensions in the PMBP(d, e) process. As we shall see in Theorem 20, only the branching factors in the E dimensions have an influence on the convergence of h E (t).
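To make the block decomposition and the spectral-radius conditions that follow concrete, here is a small numerical sketch; the 3-dimensional branching matrix and the index sets E = {0}, E c = {1, 2} (0-based) are our own hypothetical example:

```python
import numpy as np

# Block submatrices of a hypothetical branching matrix alpha, and the
# spectral-radius checks appearing in the subcriticality conditions.
alpha = np.array([[0.6, 0.1, 0.2],
                  [0.2, 0.5, 0.1],
                  [0.1, 0.2, 0.4]])
E, Ec = [0], [1, 2]

alpha_EE   = alpha[np.ix_(E, E)]    # excitation within the E dimensions
alpha_EcEc = alpha[np.ix_(Ec, Ec)]  # excitation within the E^c dimensions

def rho(M):
    """Spectral radius: largest eigenvalue magnitude."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

print(rho(alpha_EE), rho(alpha_EcEc))  # both below 1 for this alpha
```

Here both blocks have spectral radius below 1, so this hypothetical process satisfies the convergence condition on α EE as well as the Hawkes-type condition on α E c E c .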
Furthermore, in Theorem 21 we shall see that the regularity condition for PMBP(d, e) involves interactions between the E and E c dimensions. Definition 18 Consider a d-dimensional Hawkes process with branching matrix α. Define the submatrices of α obtained by restricting the rows and columns to E and E c . Remark 19 With the submatrices in Definition 18, the branching matrix α in Definition 1 can be expressed in block form. The following theorem presents the condition on the branching matrix α that ensures L 1 -convergence of h E (t) defined in Eq. (28). As discussed in Remark 14, L 1 -convergence of h E (t) guarantees that the dynamics under the branching process on the E dimensions are nonexplosive and that we have a finite expected number of events. Theorem 20 If the branching submatrix α EE satisfies ρ(α EE ) < 1, then h E ∈ L 1 (R + ) d×d . Proof Let ρ(α EE ) < 1. If we are able to show that the series h E = ∞ n=1 ϕ ⊗n E is absolutely convergent in L 1 (R + ) d×d , then Theorem 16 tells us that h E ∈ L 1 (R + ) d×d . Our task is then to prove absolute convergence of every entry in h E . Denote by h k E the k th partial sum of the series h E ; what we need to show is that lim k→∞ h k E exists entrywise in L 1 (R + ). Writing ϕ EE and ϕ E c E for the submatrices of ϕ with respect to E and E c , ϕ E in Eq. (22) may be written in block matrix form. Convolving ϕ E with itself, we obtain the block form of ϕ ⊗2 E , and convolving n times, we arrive at the block form of ϕ ⊗n E . Taking the 1-norm of both sides, we obtain an entrywise bound, where the last inequality is due to Lemma 17. In (a), we reindex i = i − e to start our index at 1. Lines (b), (c), (d) are applications of the definition of the matrix product, Minkowski's inequality, and Young's convolution inequality, respectively. In (e), we apply Lemma 17, and in (f), we apply the definition of ϕ E c E . Substituting Eq. (42) and Eq. (44) into Eq. (40), we see that the partial sums are entrywise bounded, where the bound follows from ρ(α EE ) < 1 and Proposition 5. Inspecting every block of Eq.
(46), it is clear that every entry in h E is absolutely convergent in L 1 (R + ), which completes the proof. Our main result for this section is the following theorem, which presents three conditions that, when satisfied, ensure subcriticality of the PMBP(d, e) process. Theorem 21 The PMBP(d, e) process with branching matrix α is subcritical if the following conditions hold. Proof Consider a PMBP(d, e) process with conditional intensity ξ E (t) given in Eq. (27). If ρ(α EE ) < 1, then ∞ n=1 ϕ ⊗n E is a convergent function by Theorem 20, so ξ E (t) is well-defined. We proceed by noting that the intensity ξ E (t) can be decomposed as the sum of three distinct conditional intensities: (A) an inhomogeneous Poisson process rate; (B) a multivariate Hawkes intensity; and (C) a convolution term involving the Hawkes intensity. We inspect each of these intensities separately. For the process with intensity ξ E (t) to be subcritical, the processes corresponding to each of these three intensities necessarily have to be subcritical. Intensity (A). A Poisson process is independent of the arrival of new events, so the process is subcritical as long as µ(t) is a bounded function and ∞ n=1 ϕ ⊗n E is L 1 -convergent. Intensity (B). The multivariate Hawkes process corresponding to intensity (B) may be viewed as a process with branching matrix α E c . Subcriticality of this process is governed by Theorem 6. Thus, our task is to determine the spectral radius of this branching matrix. Expanding the left-hand side of Eq. (52), we see that the eigenvalues of α E c are precisely 0 and the eigenvalues of α E c E c . By Theorem 6 and given ρ(α E c E c ) < 1, the multivariate Hawkes process with intensity (B) is subcritical. Intensity (C). Here we make use of the matrix upper bound we introduced in Eq. (46). Consider the (i, j) branching factor of the process with intensity (C), which can be expressed as in Eq. (55). Applying (a) Young's convolution inequality, (b) Minkowski's inequality, and (c) the matrix upper bound in Eq. (46), we see that Eq.
(55) is upper bounded as shown. This tells us that the (i, j) branching factor of the process with intensity (C) is upper-bounded by the (i, j) branching factor of a multivariate Hawkes process with branching matrix Ωα E c . Thus, if we are able to show that the multivariate Hawkes process with branching matrix Ωα E c is subcritical, i.e., ρ(Ωα E c ) < 1, the process defined by intensity (C) is necessarily subcritical as well. Ωα E c in block form may be written as in Eq. (56). Expanding the left-hand side of Eq. (56), we see that the eigenvalues of Ωα E c are precisely 0 and the eigenvalues of its E c × E c block. As the three sub-intensities that constitute ξ E (t) all correspond to subcritical processes given the assumed conditions, ξ E (t) is then subcritical. The regularity conditions for PMBP(d, e) in Theorem 21 cover the Hawkes process and the MBP process as special cases. If e = d (MBP), the submatrices involving E c vanish; thus, only the first condition is non-trivially satisfied, yielding ρ(α) < 1. If 0 < e < d (PMBP), all branching submatrices are possibly nonzero and all three conditions are potentially non-trivially satisfied. As discussed in Remark 13, in general, a closed-form solution for the conditional intensity ξ E (t) in Eq. (27) is not guaranteed to exist, since the infinite sum of convolutions h E (t) in Eq. (28) cannot be expressed in closed form. In order to calculate ξ E (t) given a sequence of observations and to sample the PMBP(d, e) process, we must develop numerical techniques to approximate ξ E (t). In this section, we describe two approximations that enable the calculation of ξ E (t). Given the set of observed histories ∪ j∈E c H j t − on [0, t), suppose that we are interested in approximating the intensity ξ E (t). Eq. (27) can be used for this task; however, there are two issues that we first need to address. First, the formula involves taking the convolution of functions. If the functions involved are not too complex, the convolution can be calculated in closed form.
In general, however, this is not the case and we need to approximate it numerically. Second, Eq. (27) contains the infinite sum of convolutions h E (t), which in most cases cannot be written in closed form and has to be approximated as well. To address these two issues, we introduce two approximations: • approximating continuous convolution with numerical convolution; • approximating the infinite series h E (t) with the sum of its first k terms, h k E (t). Let f and g be matrix functions defined over [0, t] such that the number of columns of f (s) and the number of rows of g(s) are equal, i.e., the matrix product f (s) · g(s) can be calculated. Assume that we are given a partition P[0, t] of the interval. We introduce the conv operator, a discrete approximation to continuous function convolution, where the convolution f * g on [0, t] is approximated as a sum of convolution terms over the partition P[0, t]. Within each subinterval, we fix g at the left endpoint and perform the integration on f . The univariate version of this convolution approximation scheme was considered in prior work. Proposition 23 states the resulting approximation, where · is matrix multiplication. Proof See Appendix C. To obtain an approximation for ϕ ⊗n E (t) over P[0, t] for n ≥ 2, Proposition 23 can be applied n times to ϕ E (t). Summing the resulting expressions and applying the infinite series truncation in Section 3.4.2 allows us to obtain the approximation h E [0 : t]. Given h E [0 : t], we see in Eq. (27) that a second set of convolutions has to be performed, where we convolve h E [0 : t] with the background intensity µ[0 : t] and the influence contributions of the events in the E c dimensions. To be able to use Proposition 23 on the convolution terms in Eq. (27) involving h E (t), we require the integral of h E (t), which we denote by H E (t). Proposition 24 writes this as a function of the Hawkes kernel integral Φ E (t) and h E (t). Proof See Appendix C.
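A scalar sketch of this left-endpoint scheme (the matrix version of Proposition 23 applies the same rule entrywise): we pick our own exponential f , whose integral F is available exactly, and compare against the closed-form convolution.

```python
import numpy as np

# conv-style approximation: within each subinterval [t_i, t_i + dt) we fix g
# at the left endpoint and integrate f exactly via its antiderivative F.
# The functions and parameter values here are our own illustration.
theta, dt, T = 1.5, 0.05, 10.0
grid = np.arange(0.0, T + dt, dt)

f = lambda t: theta * np.exp(-theta * t)
F = lambda t: 1.0 - np.exp(-theta * t)   # exact integral of f over [0, t]
g = lambda t: np.exp(-t)

def conv(F, g, grid, dt):
    out = np.zeros_like(grid)
    for k, t in enumerate(grid):
        s = grid[:k]                     # left endpoints of subintervals below t
        out[k] = np.sum(g(s) * (F(t - s) - F(t - s - dt)))
    return out

approx = conv(F, g, grid, dt)
# Closed form: (f * g)(t) = theta * (exp(-t) - exp(-theta*t)) / (theta - 1)
exact = theta * (np.exp(-grid) - np.exp(-theta * grid)) / (theta - 1.0)
print(np.max(np.abs(approx - exact)))    # error shrinks as dt -> 0
```

Since f is integrated exactly within each subinterval, the only error comes from freezing g at the left endpoint, so the approximation error decays linearly in the step size.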
The infinite series h E (t) = ∞ n=1 ϕ ⊗n E (t) is approximated by truncating the series at the k * th term, where k * is selected based on a convergence threshold we discuss below, and replacing the continuous convolution * with the numerical convolution operator conv. The accuracy of this approximation can be specified by a threshold on the max norm. Definition 25 Given a real matrix M = (m ij ), we define its max norm M max as the maximum absolute entry. We truncate the series at the first index k * for which the max norm of the k * th term falls below γ h > 0, a predetermined convergence threshold. The two hyperparameters ∆ P and γ h control the approximation error involved in calculating ξ E (t). The smaller ∆ P and the lower γ h , the better the approximation. This, of course, comes with the tradeoff of a longer computation time. An alternative way to calculate ξ E (t) is to interpret the PMBP(d, e) intensity in Eq. (20) as an expectation of the multivariate Hawkes intensity with respect to the events in the E dimensions, conditioned on the events in the E c dimensions. To compute this expectation, we simply sample Hawkes event histories over the E dimensions using the Hawkes thinning algorithm, calculate the Hawkes intensity given each sample, and then average the resulting intensities. The method is presented in Appendix E. We have presented three approaches to compute ξ E (t): (1) a closed-form solution for the PMBP(2, 1) process with exponential kernel, derived in Appendix M; (2) the approach developed in this section, based on numerical convolution and infinite series truncation; and (3) a sample-based approach developed in Appendix E. A comparison of ξ E (t) obtained from these three approaches for the PMBP(2, 1) process is presented in Appendix F. Given ∪ j∈E c H j t − , Algorithm 2 allows us to calculate the PMBP(d, e) conditional intensity ξ E (t). However, to 'continue' the process, we need to be able to sample events that occur beyond t.
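In the scalar case (e = d = 1) with an exponential kernel ϕ(t) = αθe −θt , the series Σ n≥1 ϕ ⊗n (t) can be summed analytically to αθe −θ(1−α)t (the convolution powers are Gamma densities), which lets us check a minimal truncation loop; the plain left-endpoint discretization and the parameter values are our own:

```python
import numpy as np

# Truncating h = sum_{n>=1} phi^{(x)n} with a max-norm threshold gamma_h,
# in the scalar exponential case where the series has a known closed form.
alpha, theta = 0.5, 1.0
dt, T, gamma_h = 0.005, 20.0, 1e-6
t = np.arange(0.0, T, dt)
phi = alpha * theta * np.exp(-theta * t)

h, term = phi.copy(), phi.copy()
while True:
    term = np.convolve(term, phi)[: len(t)] * dt   # next convolution power
    if np.max(np.abs(term)) < gamma_h:             # stop: increment below threshold
        break
    h += term

closed_form = alpha * theta * np.exp(-theta * (1.0 - alpha) * t)
print(np.max(np.abs(h - closed_form)))             # small truncation/grid error
```

With α = 0.5 the increments shrink geometrically, so only a few dozen convolution powers are needed before the max-norm threshold is hit.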
We can obtain samples from a PMBP(d, e) process using the thinning algorithm presented in Algorithm 3. Here, we consider a specific form of the exogenous rate µ(t), where γ, ν ∈ (R + ) d are learnable parameters. Intuitively, this corresponds to a spike of magnitude γ at t = 0 and a constant rate ν over time. This form of µ(t) follows Rizoiu et al. (2017b). Algorithm 3 is a modified version of Ogata's thinning algorithm (Ogata, 1988) for the multivariate Hawkes process, with two modifications. First, this version of the thinning algorithm uses the appropriate upper bound ξ ub E for PMBP(d, e), considering that the PMBP(d, e) process has a different conditional intensity from the Hawkes process, whose upper bound is not valid for PMBP(d, e) except in the special case E = ∅. We derive this upper bound in Appendix G. Second, given that the conditional intensity ξ E (t) in general cannot be written in closed form, we need to apply the approximations detailed in Section 3.4. In addition, sampling forward in time requires us to compute ξ E (t) at every candidate event time, and due to the convolution term, we have to keep track of the Hawkes intensity contributed by every previous event accepted by the thinning algorithm. We introduce a stepsize parameter ∆ t > 0 that controls the discretization in time when we approximate the Hawkes intensity. Algorithm 3: Simulating a Partial MBP Process on [0, T ) with Thinning. Remark 26 Algorithm 3 has two discretization parameters, ∆ P and ∆ t . The first increment, ∆ P , controls the discretization for h E , while the second increment, ∆ t , controls the discretization for the Hawkes intensity term Σ j∈E c Σ s∈H j T ϕ j E c (t − s). In general, we would like to keep both increments as small as possible, but this comes at a tradeoff of computation time. However, there is a higher priority on keeping ∆ P small, as h E , being an infinite sum, could hit convergence issues if ∆ P is not small enough.
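For intuition, a minimal univariate Ogata-style thinning loop for an exponential-kernel Hawkes process is sketched below (the parameter names are ours). Algorithm 3 follows the same accept/reject pattern, but uses the PMBP-specific upper bound ξ ub E from Appendix G and the numerical intensity approximations of Section 3.4 in place of the closed-form intensity:

```python
import numpy as np

# Univariate exponential-kernel Hawkes thinning: between events the intensity
# decays, so its value just after the current time bounds it until the next event.
def thinning_sample(mu, a, th, T, seed=0):
    rng = np.random.default_rng(seed)
    events = []
    lam = lambda t: mu + sum(a * th * np.exp(-th * (t - s)) for s in events)
    t = 0.0
    while t < T:
        M = lam(t)                       # valid local upper bound on the intensity
        t += rng.exponential(1.0 / M)    # candidate from the bounding Poisson process
        if t < T and rng.uniform() * M < lam(t):
            events.append(t)             # accept with probability lam(t) / M
    return events

sample = thinning_sample(mu=0.5, a=0.4, th=1.0, T=50.0)
```

With branching factor a = 0.4 < 1 the process is subcritical, so the expected event count over [0, T) is finite (roughly µT /(1 − a)).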
Suppose that we are given a d-dimensional dataset over the time interval [0, T ] such that observations in the first q dimensions are interval-censored and observations in the last d − q dimensions are observed as event sequences. To elaborate, let Q := {1, . . . , q} and Q c := {q + 1, . . . , d}, and for j ∈ Q c write the observed event sequence as H j = {t j 1 < . . . < t j n j }. Now suppose we want to fit the PMBP(d, e) process to the dataset described above using maximum likelihood estimation. Without loss of generality, we consider the case e = q (equivalently, E = Q). In other words, we set the dimensions we model with the Hawkes (MBP) process to be the set of dimensions where event sequences (volumes) are available. Theorem 27 The negative log-likelihood of the parameter set Θ under a PMBP(d, e) model, given event sequences ∪ j∈E c H j T − and event volumes {C j k } n j k=1 for j ∈ E, can be written as stated. Proof Given a PMBP(d, e) model, we can factorize the joint probability of the event times and event volumes, where in (a) we used the observation that only the event times in the E c dimensions contribute to the intensity ξ E (t), as dictated by Eq. (27), which means that the event times t i are independent of the counts C i . In (b) we used the independence of observing counts in dimensions 1, . . . , e. Taking − log(·) of both sides, converting the product to a sum over logarithms, and using Proposition 7 and Eq. (10) on the first and second terms, respectively, we arrive at the desired formula. In the general case of e ≠ q, the choice of likelihood for dimension j depends on whether that dimension is observed as event timestamps or as interval-censored counts. Computing the negative log-likelihood L (Θ; T ) of a PMBP(d, e) model requires calculation of the intensity ξ E at the event times t i and the compensator Ξ E at both the event times t i and the observation times o i . We present a numerical scheme to calculate the negative log-likelihood in Appendix H. In order to use gradient-based optimization tools to find the minimizer Θ, we must be able to compute the gradient L Θ (Θ; T ).
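The two terms of the negative log-likelihood can be sketched as follows, with per-dimension compensators passed in as callables; the helper names and the toy homogeneous-rate check are our own, and the actual computation uses the numerical schemes of Appendices H and I:

```python
import numpy as np
from scipy.special import gammaln

def pp_nll(xi, Xi, times, T):
    """Point-process NLL on a timestamped dimension: -sum_i log xi(t_i) + Xi(T)."""
    return -np.sum(np.log([xi(t) for t in times])) + Xi(T)

def ic_nll(Xi, obs_times, counts):
    """Interval-censored NLL: Poisson counts with mean Xi(o_{k+1}) - Xi(o_k)."""
    nll = 0.0
    for k, c in enumerate(counts):
        mean = Xi(obs_times[k + 1]) - Xi(obs_times[k])
        nll += mean - c * np.log(mean) + gammaln(c + 1)
    return nll

# Toy check with a homogeneous rate of 2.0, i.e., Xi(t) = 2t.
xi = lambda t: 2.0
Xi = lambda t: 2.0 * t
total = pp_nll(xi, Xi, [0.5, 1.2, 3.0], T=5.0) + ic_nll(Xi, [0.0, 1.0, 2.0], [3, 1])
```

The full PMBP(d, e) objective sums the point-process term over the E c dimensions and the interval-censored term over the E dimensions.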
In Appendix I we derive an expression for this gradient and discuss the numerical techniques needed to compute it alongside the negative log-likelihood L (Θ; T ). Remark 29 Eq. (62) gives the negative log-likelihood for a single realization of the PMBP(d, e) process. Suppose we are given R realizations of the process and we want to find the joint negative log-likelihood of these R realizations. Let ∪ d j=1 H r,j T − be the set of event sequences corresponding to the r th realization. The joint negative log-likelihood is simply the sum over realizations, where L r (Θ; T ) denotes the negative log-likelihood corresponding to the r th set of event sequences ∪ d j=1 H r,j T − . In this section, we conduct three experiments to investigate the capabilities and empirical behavior of the PMBP(d, e) model: • Section 4.1: Parameter Estimation with Partially Interval-Censored Data. Through synthetic experiments, we test how well PMBP estimates the parameters of the Hawkes process in the partially interval-censored setting. • Section 4.2: YouTube Popularity Prediction. We leverage PMBP to predict the future popularity of online videos, using the real-world dataset ACTIVE (Rizoiu et al., 2017b). The dataset contains the number of daily views, external shares, and tweets relating to a large number of YouTube videos. We compare prediction performance with the state-of-the-art Hawkes Intensity Process (Rizoiu et al., 2017b). • Section 4.3: Interaction Between COVID-19 Cases and News. We examine how PMBP can link online and offline phenomena in the context of COVID-19. We fit PMBP(2, 1) to a dataset of daily COVID-19 case counts and publication dates of COVID-related news articles during the early stage of the COVID-19 outbreak (January to July 2020) for 11 countries. We cluster the fitted country-specific PMBP parameters, and we explore the interaction of news and COVID-19 cases across countries.
In this section, we investigate how to leverage the PMBP(d, e) process to estimate the parameters of a multivariate Hawkes model in the partially interval-censored setting. We approach the task numerically and conduct synthetic experiments to study the quality of the approximation. We sample realizations from a d-dimensional Hawkes process, interval-censor some dimensions using increasingly long interval lengths, and fit the PMBP(d, e) model on the obtained partially interval-censored data. Finally, we compare the fitted PMBP parameters with the original Hawkes parameters. Sources of information loss in fitting. We identify two sources of information loss that occur in the above-described setup: (1) the mismatch between the data-generation model (the d-dimensional Hawkes process) and the fitting model (the PMBP(d, e) process); and (2) the interval-censoring of the timestamped Hawkes data. When we estimate Hawkes parameters using PMBP fit on partially interval-censored data, information losses of both types (1) and (2) occur. We disentangle the two types by also fitting PMBP(d, e) on the timestamp dataset (i.e., the actual realizations sampled from the Hawkes process, see below). Any information loss in this setup is due to the model mismatch, the information loss of type (1). Note that the likelihood function used for fitting PMBP depends on the employed version of the dataset. By comparing the parameter estimates on the two dataset versions, we can quantify the individual effects of model mismatch and interval-censoring. Synthetic Datasets. We construct two synthetic datasets: the timestamp dataset and the partially interval-censored dataset. The latter is identical to the former, except it has one of the dimensions interval-censored. The timestamp dataset consists of samples from a two-dimensional Hawkes process. Table 1 lists the parameters corresponding to three values of the spectral radius: ρ(α) ∈ {0.5, 0.75, 0.9}.
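Constructing the partially interval-censored dataset from the timestamp dataset amounts to replacing the dimension-1 timestamps with counts over a partition; a minimal sketch with stand-in event times of our own:

```python
import numpy as np

# Interval-censor one dimension: replace its timestamps on [0, 60) with
# counts over fixed-length observation windows.
rng = np.random.default_rng(1)
dim1_times = np.sort(rng.uniform(0.0, 60.0, size=200))  # stand-in for H^1_60

interval_len = 5.0
edges = np.arange(0.0, 60.0 + interval_len, interval_len)
counts, _ = np.histogram(dim1_times, bins=edges)

print(len(counts), counts.sum())  # 12 windows; all 200 events preserved as counts
```

Widening `interval_len` coarsens the counts without losing events, which is exactly the type (2) information loss studied here.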
Table 1: Hawkes spectral radii and model parameters considered in the parameter recovery synthetic experiment. We fix the initial impulse parameters γ 0 = γ 1 = 0 in our simulations.
The value ρ(α) = 0.5 indicates a clearly subcritical Hawkes process; ρ(α) = 0.9 corresponds to a Hawkes process approaching the critical regime; and ρ(α) = 0.75 corresponds to an intermediate case between these two. For each parameter set, we sample 5000 event sequences H 1 60 ∪ H 2 60 over the time interval [0, 60) using the thinning algorithm in Algorithm 1. Following a procedure similar to prior literature, we partition the 5000 event sequences into 50 groups, and each group of 100 sequences is used for joint fitting, yielding a single parameter set estimate. In total, we obtain 50 sets of parameter estimates from the sample. We construct the partially interval-censored dataset by interval-censoring H 1 60 , the first dimension of each realization in the timestamp dataset. Given a partition of [0, 60), we count the number of events on dimension 1 that fall in each subinterval. We experiment with three lengths of the observation windows to quantify the information loss of type (2): intuitively, the larger the interval length, the more significant the information loss. We consider three interval lengths. Note that the timestamp and partially interval-censored datasets are identical apart from the interval-censoring in the latter and should yield very similar fitted parameters. PMBP log-likelihood functions. We fit the parameters of PMBP(2, 1) using two different versions of the likelihood function depending on which dataset we use: • timestamp dataset: we use the point-process log-likelihood on both dimensions (PP-PP), defined in Eq. (64); • partially interval-censored dataset: we use the interval-censored log-likelihood on dimension 1 and the point-process log-likelihood on dimension 2 (IC-PP) (see Eq.
(62)): We see that the PP-PP estimates are tight around the generating Hawkes parameters for all the θ ij , α ij and ν i parameters, and for all parameter sets. This indicates that the information loss from the model mismatch (i.e., of type (1)) has a minimal impact on fitting quality. Arguably, we observe a slight overestimation for θ 12 and θ 22 , and a clear overestimation of α 11 and underestimation of α 22 , particularly for the low spectral radius parameter set (ρ(α) = 0.5). The ν i parameters appear tightly recovered by PMBP on the timestamp dataset. On the partially interval-censored dataset, we continue to observe good fits. This indicates that PMBP can successfully recover the generating Hawkes parameters even after interval-censoring. However, we see that the θ ij parameters become increasingly overestimated as the observation window widens, particularly θ 12 and θ 22 . Similarly, α 12 is overestimated and α 11 is underestimated for all parameter sets considered. In addition, for the high spectral radius set (ρ(α) = 0.9), we see underestimation of α 22 and overestimation of α 21 . The approximation quality degrades as the observation window widens, indicating an increasing information loss of type (2). Though the generating parameters are not always correctly recovered, we note that overestimation (resp. underestimation) in one dimension is compensated by underestimation (resp. overestimation) in the complementary dimension. Interestingly, the spectral radius ρ(α) appears to be conserved regardless of the log-likelihood function used and of interval-censoring (shown in Fig. 3b). The recovered spectral radius is estimated close to the original Hawkes spectral radius in all considered cases. This is particularly relevant, as ρ(α) is a meaningful quantity relating to information spread virality (for social media diffusions), disease infectiousness (for epidemiology), or local seismicity (in seismology).
The result also indicates that even when individual parameter fits are inaccurate, the Hawkes process regime is correctly identified. In this section, we evaluate PMBP(d, e)'s performance in predicting the popularity of YouTube videos. We observe the daily counts of views and shares (both interval-censored, i.e., E ↔ {views, shares}), and the tweets that mention the video (timestamped, E c ↔ {tweets}) during the first T train days after the video is posted. We use PMBP(3, 2) to predict the daily counts of views and shares, and the timestamps of the tweets posted during the period [T train , T test ). Interval-censored forecasting with PMBP. To each YouTube video corresponds a partially interval-censored Hawkes realization. A straightforward approach to predict the unfolding of the realization during [T train , T test ) is to sample timestamps from PMBP(3, 2) on each of the three dimensions, conditioned on the data before T train ; we then interval-censor the first two dimensions. In practice, sampling individual views takes considerable computational effort due to their high background rates, which result in huge numbers, sometimes in the millions per day, and at least an order of magnitude larger than shares and tweets. We introduce in Appendix J an efficient scheme that only samples the E c dimensions (i.e., tweets), after which we leverage the compensator Ξ E to predict the expected counts in the other dimensions. Dataset, Experimental Setup and Evaluation. We use a 20% random sample of the ACTIVE dataset (Rizoiu et al., 2017b), which contains 2,834 videos published between 2014-05-29 and 2014-12-26. For each video, we tune hyperparameters and PMBP model parameters using the first 90 days of daily view counts, share counts to external platforms, and the timestamps of tweets that mention the video (T train = 90).
The technical details of the fitting procedure, including an adjusted likelihood function, hyperparameter tuning, and assigning dimension weights, are provided in Appendix K. Days 91−120 are used for evaluation (T test = 120). We measure prediction performance using the Absolute Percentile Error (APE) metric (Rizoiu et al., 2017b), which accounts for the long-tailedness of online popularity (e.g., the impact of an error of 10,000 views is very different for a video getting 20,000 views per day compared to a video getting 2 million views a day). We first compute the percentile scale of the number of views accumulated between days 91 and 120. APE is then defined in terms of N̂ 120 and N 120 , the predicted and observed number of views accumulated between days 91 and 120, where the function P(·) returns the percentile of its argument on the popularity scale. Models and Baseline. We consider two 3-dimensional PMBP models: PMBP(3, 2) and PMBP(3, 3). The former treats the tweets as a Hawkes dimension (see Definition 8) and is thus susceptible to computational explosion for particularly high tweet counts, given the quadratic complexity of computing cross- and self-excitation. The latter is an inhomogeneous Poisson process with no self- or cross-exciting dimension. We, therefore, fit PMBP(3, 2) solely on videos that have fewer than 1000 tweets on days 1−90; we fit PMBP(3, 3) on all videos considered. We use as a baseline the Hawkes Intensity Process (HIP), the state-of-the-art in popularity prediction discussed in Section 2.2. HIP, however, is designed for use in a forecasting setup. That is, HIP requires the actual counts of tweets and shares in the prediction window [T train , T test ) to produce forecasts for the view counts on [T train , T test ). We adapt HIP to the prediction setup (i.e., the tweets and shares are not available at test time) as follows.
We feed HIP, for each of the days 91−120, the weighted average of the daily tweet and share counts on days 1−90, assigning higher weights to more recent data. Results. Fig. 4 shows PMBP(3, 2) and HIP fits for two sample videos from ACTIVE. These graphs illustrate two advantages of PMBP over HIP. First, in addition to our prediction target (i.e., views), PMBP(3, 2) also provides a prediction for future share and tweet counts. As the PMBP process is a point process in each dimension, we are able to model the dynamics and predict counts in each dimension. In contrast, HIP is unable to do this, as the model treats tweets and shares as purely exogenous input. Second, given that we sample the fitted PMBP process to predict future counts (see Appendix J), we can quantify the uncertainty of our predictions by sampling from the fitted process multiple times and calculating the variance of the samples (in Fig. 4, the blue shaded area shows the prediction uncertainty). A comparison of the performance of PMBP(3, 3), PMBP(3, 2), and HIP on the sample of ACTIVE is shown in Fig. 5a. Over the entire sample, it is apparent that PMBP(3, 3) outperforms both PMBP(3, 2) and HIP. However, user activity is known to be long-tailed, and a majority of the videos in ACTIVE do not have much activity apart from the initial popularity spike around the upload time. As a result, it makes sense that the simplest model, PMBP(3, 3), best fits these flat trends, as both PMBP(3, 2) and HIP would have a high propensity to overfit given little variation. Instead of looking at a random sample of ACTIVE, we filter ACTIVE for videos that have sufficiently rich dynamics to take advantage of the PMBP process. We refer to this set as 'dynamic' videos. We base the filtering on activity on days 21−90: we check whether the standard deviations of the view counts, tweet counts, and share counts are higher than the median values. The filtering procedure is detailed in Appendix K.
In total, we retrieve 585 dynamic videos from ACTIVE. A performance comparison of PMBP(3, 3), PMBP(3, 2), and HIP on the set of dynamic videos is shown in Fig. 5b. Interestingly, we see a reversal of the performance ranking on dynamic videos: PMBP(3, 2) performs best on this set of videos, followed by HIP, with PMBP(3, 3) coming last. This result is intuitive, as PMBP(3, 2) is able to capture the dynamic behavior in this set of videos better than the simple PMBP(3, 3) process. The two models we propose, PMBP(3, 3) and PMBP(3, 2), outperform HIP in different scenarios depending on the video's activity level: if the video does not have high activity, the simple PMBP(3, 3) performs best, whereas highly active videos are modeled best by PMBP(3, 2). Our third experiment focuses on the interpretability of the parameters of the PMBP process. Here, we investigate the interaction between COVID-19 daily case counts and the publication dates of COVID-19-related news articles during the early phase of COVID-19, for 11 different countries. Given that the daily case counts are specified as interval-censored counts, while the news publication dates are available as event timestamps, our dataset is in partially interval-censored form, and so the PMBP model can be used for fitting. We utilize the PMBP(2, 1) model for this problem, where we set E = {cases} and E c = {news}. For each country, we fit a separate PMBP(2, 1) model, yielding a set of parameters that summarizes country-level dynamics. Inspecting the interaction terms α 21 and α 12 in the branching matrix enables us to understand the interaction between case incidence and news reporting during the early phase of COVID-19 in our sample of countries. Did news preempt the rise in cases, or was it more reactionary? Dataset. For this experiment, we curate and align datasets from two sources. The first dataset contains COVID-19 daily case counts provided by Johns Hopkins University (Dong et al., 2020).
The dataset is a set of date-indexed spreadsheets, each containing reported COVID-19 case counts split by country and region. To limit our analysis, we focus on the following 11 countries: UK, USA, Brazil, China, France, Germany, India, Italy, Spain, Sweden, and the Philippines. The selection is similar to that of Browning et al. (2021), with the addition of the Philippines. The second dataset contains COVID-19-related news articles, provided by the NLP startup Aylien (Aylien, 2020). This dataset is a dump of COVID-related English news articles from 440 major sources during the period November 2019 to July 2020. For each of the 11 countries in the global sample, we filter the Aylien dataset for news articles that mention the country in the headline. In addition to the country mention, for China we also included several COVID-related keywords to limit the articles retrieved. Lastly, we limit our query to articles that come from sources with an Alexa rank of less than 150, thereby restricting to popular news sites. Our focus is on modeling the early dynamics of COVID-19 cases and news. Since the start of the COVID-19 outbreak is not aligned across countries, we set the initial time t = 0 differently for each country, namely to the first day on which at least 10 cases were recorded. Except for China, which had cases as early as January 2020, the initial time for each country in our sample falls in February or March 2020. Since we are interested in early-stage dynamics, we only consider data up until t = 120, where time is measured in days. Model. For each country, we fit a PMBP(2, 1) model, returning a country-specific parameter set {θ, α, γ}. We perform goodness-of-fit tests for each country's PMBP(2, 1) fit by leveraging the time-rescaling theorem (Brown et al., 2002) on the news timestamp dimension and the Anscombe transform (Anscombe, 1948) on the case count dimension to transform the predictions into samples from a common distribution.
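The two transforms underlying the goodness-of-fit tests can be sketched as follows. The model compensator is passed in as a callable, and the specific test statistics of Appendix L are not reproduced here; this is a generic sketch of the two classical transforms, not the paper's test code.

```python
import numpy as np

def rescaled_interevent_times(timestamps, compensator):
    """Time-rescaling theorem: if the model is correct, differences of the
    compensator evaluated at successive event times are i.i.d. Exp(1)."""
    taus = np.array([compensator(t) for t in timestamps])
    return np.diff(np.concatenate(([0.0], taus)))

def anscombe(counts):
    """Anscombe transform: 2*sqrt(x + 3/8) approximately variance-stabilises
    Poisson counts, so transformed counts are roughly Gaussian with unit
    variance for moderate rates."""
    return 2.0 * np.sqrt(np.asarray(counts, dtype=float) + 3.0 / 8.0)
```

Both outputs can then be compared against their reference distributions (Exp(1) and a unit-variance Gaussian, respectively) via Q-Q plots, as in Appendix L.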
Q-Q plots and metrics for the goodness-of-fit tests we performed are provided in Appendix L. Results. Table 2 contains the parameter estimates {θ, α, γ} for each country, while Fig. 6 shows the daily COVID-19 case counts and daily news article volume of the PMBP(2, 1) model fits for India and Italy. As we can see from Fig. 6, the PMBP(2, 1) captures the dynamics of both countries well. Based on the sample-based fit score we introduce in Appendix L, the actual COVID-19 case counts for India and Italy fall within the model's prediction interval 97% and 61% of the time, respectively. Results for the other countries are presented in Appendix L. The PMBP(2, 1) process was able to achieve a significant fit on the news article publication dates for 9 out of 11 countries, and a significant fit on the COVID-19 case counts for 5 out of 11 countries. For the UK, Italy and Brazil, the model fits on both case counts and publication dates are significant. In terms of the sample-based fit score, India, Brazil and the Philippines rank as the best model fits. We explain the parameters in Table 2, starting with the θ_ij parameters, then the α_ij parameters, and lastly the ν_i parameters. We interpret the parameters as an approximation to the corresponding parameters of a multivariate Hawkes process, similar to what we investigate in Section 4.1. The parameter θ_ij encodes the decay of the influence from dimension j to dimension i. A small value of θ_ij means that j is expected to influence i over a long period of time, whereas a large value implies that the influence of j on i is near-instantaneous. From Table 2, we see that India has the smallest θ_11, which indicates the slow, consistent progression of COVID-19 in India during the early phase. Brazil has the smallest θ_12, which indicates that cases typically did not spike immediately once Brazil made its way into the news, but instead increased steadily over an extended period of time.
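Assuming an exponential kernel ϕ_ij(t) = α_ij θ_ij exp(−θ_ij t) (the form used elsewhere in the paper, up to parameterization), the decay interpretation of θ_ij can be made concrete: the total expected influence is α_ij, and θ_ij alone controls how quickly it is delivered. A small illustrative sketch:

```python
import numpy as np

def influence_profile(alpha, theta, t):
    """For an exponential kernel phi(t) = alpha * theta * exp(-theta * t):
    returns the instantaneous influence at time t and the cumulative fraction
    of the total expected influence alpha delivered by time t."""
    t = np.asarray(t, dtype=float)
    return alpha * theta * np.exp(-theta * t), 1.0 - np.exp(-theta * t)

def half_life(theta):
    """Time by which half of the total expected influence has been delivered."""
    return np.log(2.0) / theta
```

A small θ_ij thus corresponds to a long half-life, i.e., slow, sustained influence of dimension j on dimension i.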
Similar interpretations can be made for θ_21 and θ_22. The parameter α_ij measures the strength of the influence from j to i, as it encodes the expected number of dimension-i events triggered by a single event in dimension j. We can see from Table 2 that Brazil, India and the Philippines all have α_11 > 1, implying that COVID-19 was highly contagious in these countries during the early stage. On the other hand, α_12 can be viewed as a measure of how strongly news preempts an increase in cases. Given that Brazil has the largest α_12 in our sample, news was particularly preemptive of cases there during the early stage. Conversely, α_21 measures the expected number of news articles published after a single case. In countries with high α_21, news serves a reactionary role to an increase in cases. Among the countries considered, China has the highest α_21. The exogenous parameters ν_1 and ν_2 represent the exogenous rates for cases and news, respectively. These parameters encapsulate all factors that are external to the system and cannot be accounted for by the self- and cross-excitation of cases and news in the model. ν_1 is a measure of cases imported from other countries, while ν_2 captures the base level of reporting. India has the highest importation of cases among the countries considered, as it has the highest ν_1, whereas Brazil, given its high ν_2, has the highest base reporting. Given that the parameters individually capture different aspects of the interaction between news and cases, we cluster the parameter sets across countries to identify groups with similar diffusion profiles. To render the parameter sets comparable to one another, we scale the number of cases for each country to 100, fit the PMBP(2, 1) model on this scaled data, and perform K-means clustering on the resulting parameter sets. The clusters we obtained (where we set K = 5) are shown in Table 3. The first cluster, consisting of the UK, Germany and Spain, has high α_12 and low α_11.
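The clustering step can be sketched with a minimal Lloyd's algorithm over the per-country parameter vectors (the fitted {θ, α, ν} flattened per country). The paper does not specify the K-means implementation; initialization and iteration details below are assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm for clustering parameter vectors.

    X: array of shape (n_points, n_features), e.g. one row per country.
    Returns (labels, centers). Random initialization from the data points;
    a real analysis would use multiple restarts.
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign every point to its nearest center (squared Euclidean distance)
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # recompute centers as cluster means; keep old center for empty clusters
        new = np.array([X[labels == j].mean(0) if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

With K = 5 and 11 countries, the resulting labels correspond to the groups reported in Table 3.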
The second cluster, consisting of Brazil alone, also has high α_12 but has a significantly higher α_11. Since α_12 is high in both, the first two clusters consist of countries where news is strongly preemptive of cases. The third cluster, China and France, differs from the first two clusters by having a high α_21, which is indicative of news playing a reactive role to cases. The fourth cluster (US, Italy, Sweden) and fifth cluster (India, Philippines) both have low α_12 and α_21, signifying little interaction between news and cases. The fourth and fifth clusters differ in that COVID-19 infectiousness is significantly worse in the fifth cluster (India, Philippines), given that its centroid α_11 is greater than one and its centroid θ_11 is the lowest across all clusters. To visualize this categorization of countries, we apply t-SNE on the parameter space to collapse it to two dimensions. The resulting map is shown in Fig. 7. We structure our review of related work into three sections. First, we examine previous work that deals with the partially interval-censored setting, which is particularly relevant in neurophysiology and signal processing applications. Second, we focus on fitting the Hawkes process in partially observed settings, namely cases where we encounter interval-censoring or missing data. Lastly, we go over some applications of multivariate Hawkes processes in the physical and social sciences. Most of the studies that deal with data that include both event sequences and interval-censored counts fall under neurophysiology, signal processing, and other specific domains. Here we survey work in each of these areas. Neurophysiology. Experimental data from neurophysiology studies often involve a mix of timestamps of some event of interest, for example neuronal spike trains, and interval-censored data obtained as time-series measurements from an external sensor. In this domain, models that study this mixed form of data are called hybrid processes.
Early studies on hybrid processes focused on understanding the correlation structure of the point process and time series, not necessarily on modeling and prediction. Willie (1982) studied a bivariate process consisting of a point process in one dimension and a time series in the other, and presented a method to estimate the conditional intensity of the point process given the history of the time series. Halliday et al. (1995) provided a review of methods to analyze hybrid datasets, aimed at experimentalists. They developed methods for estimating time-domain and frequency-domain parameters in a hybrid data setup, namely the cross-spectrum and coherence, which can be used as indicators of the coupling strength between the two processes involved. Rigas (1996) presented a parallel approach to estimate the cross-spectrum and coherence of a hybrid process. Recent work on hybrid models in neurophysiology focuses more on modeling the dynamics, rather than simply characterizing the correlation structure.
Figure 7: Labeled t-SNE visualization of the clusters obtained from the fitted PMBP(2, 1) parameters across the 11 countries we consider.
Guha and Biswas (2008) tested various hybrid model approaches to study the relationship between neuronal spike trains (a timestamped dataset) and the local field potential (a time series) in mice. They tried models with different causal structures: the point process driving the time series, the time series driving the point process, and no causal structure (which uses a latent space model), and concluded that there is no single best hybrid model that works for all applications. In a follow-up, Biswas and Guha (2010) formulated an ARMA-type model of the physiological tremors in the distal phalanx of the middle finger (a time series) and motor unit firings in the middle finger portion of a related muscle (a point process). They also introduced a goodness-of-fit measure based on mutual information to compare hybrid models.
However, calculating this goodness-of-fit measure is quite involved, as it requires the joint distribution of the model over the entire range of the hybrid process. Signal processing. Another major area that studies models of combined event sequence and interval-censored data is signal processing. Still inspired by neurophysiology applications, Solo (2005a,b) derived the joint likelihood function and mutual information for a particular formulation of the hybrid model. A follow-up, Solo and Pasha (2012), introduced a test for the independence of a time series and a point process based on Lagrange multipliers of the likelihood introduced in Solo (2005a). Xiao et al. (2017) focused on next-event time and type prediction for a marked point process, using a supplemental time series as an exogenous signal to improve prediction performance. They proposed an attention-based twin-RNN model that directly predicts the time and type of the next event, bypassing calculation of the conditional intensity. A follow-up, Xiao et al. (2019), looked at the same problem but instead proposed the recurrent point process network, which explicitly calculates the conditional intensity. The next-event type and time under the recurrent point process network are obtained by sampling this intensity. Both models are based on an RNN architecture, which grants them high flexibility and a limited set of assumptions on the diffusion dynamics of the point process. However, as these are deep-learning-based models, they suffer from interpretability issues and need a high volume of training data to adequately fit their large set of parameters. Other areas. Some work on mixed data has also been done in the area of incident diagnosis. Luo et al. (2014) introduced a novel algorithm to calculate the correlation between time series and timestamped data, which uses a two-sample test comparing the subseries before or after an event with a random sample of subseries. van Dortmont et al.
(2019) used this correlation measure in ChronoCorrelator, a visualization framework for incident diagnosis. Lastly, some work in clinical studies has looked at mixed event sequence and interval-censored data, but in a different setup. Instead of a hybrid process, where the timestamp-time series divide exists at the dimension level, here the split is at the data level. The data are in the form of questionnaire responses from subjects: some subjects provide detailed event time information, while others provide only summary answers in the form of panel counts, and inference must be performed on this mixed dataset. Zhu et al. (2013, 2014, 2015) and Yu et al. (2017) presented different varieties of regression models, given a set of covariates, that can be fitted in this mixed data setup. A significant portion of the recent literature on the Hawkes process, and on point processes in general, deals with situations where the training data is only partially observed. The task is to fit the parameters of the point process given the partially observed dataset. This problem is not trivial, as standard maximum likelihood techniques require observation of the complete dataset. As such, various modifications of the standard procedure and other alternatives have been presented to deal with this scenario. There are many ways in which partial observability of the dataset can occur, but we focus on two: (1) interval-censoring of event timestamps (where we are only able to observe event counts over a partition of time), and (2) missing event data. Interval-censored data. Shen and Cook (2015) studied the case of data aggregation in a clinical-studies setting, where only disease event counts at periodic assessment times are observed, instead of the actual event times.
The authors developed the likelihood function for a dynamic mover-stayer model, a generalization of the Markov chain, under this interval-censored data, and an EM algorithm to fit the parameters. For the Hawkes process given interval-censored data, a few recent papers have worked on alternatives to the maximum likelihood procedure. Kirchner (2016) showed that a family of point processes based on integer-valued autoregressive time series, INAR(∞), converges to the Hawkes process. Based on this convergence, they concluded that the INAR(∞) process is the discrete-time version of the Hawkes process. In a follow-up, Kirchner (2017) presented an alternative to maximum likelihood estimation, which fits the associated bin-count sequences to an INAR(p) process. As the bin size goes to zero and the order p of the process goes to ∞, the INAR sequence converges to the Hawkes process and the parameter estimates converge to the Hawkes parameters. However, though fitting is performed on count data, convergence only occurs for small bin sizes, which cannot accommodate cases where the bin size is predetermined, or where the interval-censored data is only available at a large bin size. Cheysson and Lang (2020) presented a spectral approach to fitting the Hawkes process given interval-censored data with arbitrary bin size, which solves the problem encountered in Kirchner (2017). Their proposed method is based on minimizing the log-spectral likelihood of the associated bin-count sequence, instead of the usual log-likelihood of the Hawkes process. Under certain assumptions on the kernel of the process, they showed that the optimization converges to the Hawkes parameters. Another alternative approach is presented in Shlomovich et al. (2020), which introduced a Monte Carlo Expectation Maximization (MC-EM) algorithm that uses sampling to obtain proposals for the hidden event times.
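The bin-count sequences that these estimators consume can be produced from raw timestamps in a few lines. This is a generic sketch of interval-censoring, not code from any of the cited papers; the function name and signature are ours.

```python
import numpy as np

def interval_censor(timestamps, t_max, bin_size):
    """Reduce exact event times to interval-censored counts: the number of
    events in each bin [i*bin_size, (i+1)*bin_size) over [0, t_max]."""
    edges = np.arange(0.0, t_max + bin_size, bin_size)
    counts, _ = np.histogram(timestamps, bins=edges)
    return edges, counts
```

The bin size (the partition) is fixed here, which is exactly the regime where small-bin asymptotics such as Kirchner (2017) cannot be applied.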
They showed that their approach recovers parameters more reliably than the INAR(p) estimates from Kirchner (2017) in synthetic experiments. They further extended their approach to the multivariate case in Shlomovich et al. (2021). As their approach relies on sampling, it is more computationally expensive than the other approaches. Missing data. Apart from interval-censored data, one well-studied area concerning partially observed Hawkes processes is the presence of missing data. Several studies deal with different varieties of this data distortion. Though we do not work directly with missing data in this paper, the problem is closely related, and we present work done in this area for completeness. Shelton et al. (2018) dealt with the situation where some dimensions of an unmarked multivariate Hawkes process have missing data over several subintervals of time. This also generalizes to the case where some dimensions are completely unobserved, which can be viewed as similar to latent variables in Bayesian networks. They presented an MCMC sampler to obtain the posterior under this setting, and showed that adding missing variables improves prediction performance and the interpretability of the obtained parameter clusters. Derek Tucker et al. (2019) looked at a similar missing-data setup, but focused on the marked univariate case and used a Bayesian estimation procedure to fit the model. A different missing-data situation is considered in Mark et al. (2019). There, missing data is viewed as corrupted, and a corruption probability p is assigned to each event. Instead of the Hawkes process, they fit the discrete Bernoulli autoregressive model to the data. They derived a likelihood function for the corrupted data that serves as an unbiased estimate of the likelihood of the complete data and used it to recover the parameters of the process. Lastly, Mei et al.
(2019) worked on a method to reconstruct partially observed data, i.e., filling in the missing events, by sampling the posterior distribution through a Monte Carlo-based method called particle smoothing. Their method allows for different missingness mechanisms, i.e., whether events are deleted in a deterministic or stochastic fashion. Self- and cross-exciting processes have found great application in a diverse array of domains. In this section, we review some relevant applications, including the social media popularity and epidemic case studies we tackle in this work. Popularity prediction. Hawkes models are usually applied in Twitter popularity prediction, where the task is to predict the final retweet cascade size, conditioned on some observable initial dynamics. Hassan Zadeh and Sharda (2014) applied a variant of the marked Hawkes process called the ETAS model in a marketing setting to model the popularity of brand posts on Twitter. Another work introduced SEISMIC, which extends the Hawkes process to a doubly stochastic (Cox) process in order to dynamically model infectiousness. An evaluation of the Hawkes model versus the feature-based approach was presented in Mishra et al. (2016), which found that a combination of the Hawkes and feature-based approaches improves prediction performance. To handle cases where there is too little data in a single retweet cascade, Kong et al. (2020) introduced a mixture model of Hawkes processes that allows joint learning from multiple instances of cascades belonging to the same item. Rizoiu et al. (2017b) and Rizoiu et al. (2021), on the other hand, deal with YouTube popularity prediction. The former introduced the Hawkes Intensity Process (HIP), which combines exogenous stimulus from Twitter and endogenous "word-of-mouth" diffusion within YouTube to model view counts. The latter expanded on HIP and put it on solid mathematical ground by introducing the Mean Behavior Poisson (MBP) model.
They showed that HIP is a special case of MBP under a specific discretization scheme and loss function. Latent influence. Given that the multivariate Hawkes process can model the cross-exciting influence among entities, it has found great applicability in a number of domains where we wish to uncover a latent influence structure from a timestamped dataset of entity interactions. In the linguistic space, Goel et al. (2016) used a parametric multivariate Hawkes model to study language change and to determine the social network factors that contribute to influence. The Hawkes model has also been adapted to study the influence of social bots on Twitter discussions related to the U.S. presidential debate (Rizoiu et al., 2018) and the influence of Russian trolls on news dissemination across different online platforms (Zannettou et al., 2019). In crime modeling, a nonparametric version of the spatiotemporal Hawkes model was used in Yuan et al. (2019) to uncover crime-event networks from a dataset of violent crime incidents. A 'lazy' variant of the multivariate Hawkes model was introduced in Nickel and Le (2021), which permits fitting to data orders of magnitude larger than the standard multivariate Hawkes process allows. The authors applied the model to uncover the latent structure of policy diffusion across states in the U.S. Epidemic modeling. A number of studies have used the Hawkes process to model the outbreak of a contagious disease in a population. The latent influence point process, a generalization of the multivariate Hawkes process that incorporates human mobility, was proposed in Kim et al. (2019) to model dengue spread in meta-populations. Another study, Unwin et al. (2021), used the Hawkes process with a Rayleigh kernel to model malaria transmission in near-elimination settings. A number of studies have also focused on different aspects of the COVID-19 pandemic. Garetto et al.
(2021) modified the Hawkes process to have a controllable time-varying branching factor, which was then used to study the impact of mobility-restriction countermeasures on the spread of COVID-19 in Italy. Koyama et al. (2021) developed a state-space method to fit a discrete version of the Hawkes process and used this approach to estimate the time-varying reproduction number of COVID-19 in a sample of countries. Lastly, Browning et al. (2021) used a discrete Hawkes process with two phases determined by a switch point to model the initial rise and fall of daily COVID-19 mortality counts. Crime modeling. Modeling crime patterns is one of the major applications of spatiotemporal Hawkes processes. Inspired by seismological clustering patterns, one of the earliest applications of self-exciting processes, Mohler et al. (2011) applied the same idea to model urban crime, namely burglary and gang violence. This led to subsequent studies, which include modeling civilian deaths in Iraq (Lewis et al., 2012) and the contagiousness of gun violence (Loeffler and Flaxman, 2016) and gang violence (Brantingham et al., 2020) in the U.S. In the area of online crime, the multivariate Hawkes model was used in Bessy-Roland et al. (2021) to model and predict the frequency of different types of cyber attacks. This work introduces the Partial Mean Behavior Poisson (PMBP) process, a generalization of the multivariate Hawkes process in which we take the conditional expectation over the stochastic history of a subset of dimensions. The motivation for our proposed approach comes from the need to fit the multivariate Hawkes process to partially interval-censored data. The multivariate Hawkes process cannot be used directly to fit this type of data, but the PMBP process can be used to approximate the Hawkes parameters via a correspondence of parameters.
In this paper, we derive the conditional intensity function of the PMBP process by considering the impulse response of the associated LTI system. Additionally, we derive the regularity conditions that lead to a subcritical process, which we find generalize the regularity conditions of the multivariate Hawkes process and the previously proposed MBP process. We address the computational problems of fitting a PMBP process, providing an approximation method for the conditional intensity and a method for sampling future events. The maximum likelihood estimation loss function is also derived for the partially interval-censored setting. To test the practicality of our proposed approach, we consider three empirical experiments. First, we test the capability of the PMBP process to recover multivariate Hawkes process parameters in the partially interval-censored setting. Using synthetic data, we investigate the information loss arising from the model mismatch and from the interval-censoring of the timestamped data. We find that the fitted PMBP parameters are close to those of the original multivariate Hawkes processes used to generate the data. Furthermore, we observe that the spectral radius of the fitted PMBP process is also close to that of the original Hawkes process. Second, we demonstrate the predictive capability of the PMBP model by applying it to YouTube popularity prediction. We compare our model to the state-of-the-art baseline and show that the PMBP process outperforms it. Third, to demonstrate the interpretability of the PMBP parameters, we fit the process to a curated dataset of COVID-19 cases and COVID-19-related news articles during the early stage of the outbreak in a sample of countries. By inspecting the country-level parameters, we show that there is a demonstrable clustering of countries based on how news predominantly played its role: whether it was reactionary, preemptive, or neutral with respect to the rising level of cases. Future work.
There are three areas where future work can be explored. First, a theoretical analysis of the approximation error due to the model mismatch (i.e., fitting Hawkes data with the PMBP model) should be performed; in this work, we only perform an empirical evaluation of the information loss. Second, alternative schemes to approximate the conditional intensity should be investigated, as the current solution relies on the computationally heavy discrete convolution approximation. Lastly, given that the PMBP(d, e) process, by construction, is not self- and cross-exciting in the E dimensions, an open research question is whether we can construct a process that retains the self- and cross-exciting properties in all dimensions while also being flexible enough to be used in the partially interval-censored setting. Pulling the infinitesimal ds out of the expectation on the left-hand side, notice that the left-hand side is exactly ξ_E(t), as defined in Eq. (20). For the right-hand side, applying the tower property of conditional expectation, we average out the inner conditioning over ⋃_{j∈E} H^j_{s−} and arrive at the desired result. Define µ'(t) = cµ(t) and ξ'_E(t) = cξ_E(t). If the system µ(t) ⇒ ξ_E(t) were LTI, then linearity µ'(t) ⇒ ξ'_E(t) must hold; that is, Eq. (26) would give Eq. (67). However, if we multiply both sides of Eq. (26) by c, we obtain Eq. (68). Unless c = 1 (the trivial case) or d = e, Eq. (67) and Eq. (68) cannot hold simultaneously, because of the summation over E^c. Thus, µ(t) ⇒ ξ_E(t) is not LTI unless e = d, which is the special case of an MBP process. Now consider the system s(t) ⇒ ξ_E(t) as defined in Eq. (29). The proof that this system is LTI is a straightforward multivariate extension of the corresponding proof for the MBP process, but we present it here for completeness. (Linearity) This follows simply by multiplying both sides of Eq. (26) by a constant c. In the time-invariance derivation, in (a) we used the fact that ϕ_E(s) = 0 if s < 0. Thus s(t) ⇒ ξ_E(t) is LTI. For brevity, set s(t) as in Eq.
(29), and set S(t) = ∫_0^t s(τ) dτ. Now, we focus on the i-th entry of both sides. Expanding the convolution, in (a) we applied the linearity of integration, in (b) we used the definition of convolution, in (c) we reversed the order of integration, and in (d) we made the change of variable u = τ − v. Collecting the results for every dimension i, we arrive at the desired formula. Proof [Proof of Lemma 17] We proceed by induction. Let α and ϕ(t) be as stated, and let (i, j) ∈ E × E. Suppose n = 1. Here, (a) and (c) follow from the definitions of ϕ_ij and f_ij, and (b) follows from α_ij being a constant. Thus, b_n ≤ α^n holds for n = 1. Suppose that the relation holds for n = k ∈ N, that is, b_k ≤ α^k. This means that for every (p, q) ∈ E × E, Eq. (69) holds. Now, we show that the relation holds for n = k + 1. In the corresponding derivation, in (a) the definition of the matrix product was used; in (b) and (c), the Minkowski inequality and Young's convolution inequality were used, respectively; and in (d), our induction hypothesis Eq. (69) was used. By induction, b_n ≤ α^n holds for all n ∈ N. C. Proofs for Section 3.4 The last line follows from the Fubini-Tonelli theorem and the fact that h_E and ϕ_E are L1-integrable. The function h_E(t) defined in Eq. (28) can be expressed term-by-term as in Eq. (70). Given a Hawkes kernel of the form ϕ(t) = α ⊙ f(t), where ⊙ denotes elementwise multiplication, Eq. (70) can be rewritten accordingly. Consider the (i, j) entry of the n-th term of the sum, (α_E^n)_ij (f_E^{⊗n}(t))_ij. This expression can be interpreted as follows: • (α_E^n)_ij is the expected number of n-th generation offspring events of type i produced by a single parent of type j in a d-dimensional branching process where only the E dimensions can produce offspring. • (f_E^{⊗n}(t))_ij is the density of n-th generation type-i offspring events at time t produced by a single parent of type j at time 0.
• The product (α_E^n)_ij (f_E^{⊗n}(t))_ij can be interpreted as the expected intensity contribution at time t from n-th generation type-i offspring events produced by a single parent of type j. Thus h_E^{ij}(t) is the expected type-i intensity at time t from a single parent event of type j, summed over the entire progeny of offspring. Fig. 8 shows the nonzero entries of h_E(t) for a PMBP(3, 2) process with parameter set θ = [1, 0.1, 0, 1, 1, 0, 1, 1, 0], α = [0.2, 0.2, 0, 0.2, 0.2, 0, 0.2, 0.2, 0]. In the plot, we show the contributions of the first to the fifth generation to h_E(t). As we can see, the contributions of succeeding generations become increasingly smaller, as every α_ij < 1. The contributions all go to zero asymptotically as t → ∞. In addition, we see that the mode of each generation's contribution shifts to the right as the generation index increases. Intuitively, a delay exists because offspring events in generation n + 1 are produced by generation-n events; the extent of the delay is controlled by θ. Given that in Eq. (20) we defined the PMBP(d, e) conditional intensity ξ_E(t) as the expectation of the conditional intensity of a d-dimensional Hawkes process given the observed event sequences ⋃_{j∈E^c} H^j_{T−}, a straightforward way to calculate ξ_E(t) is as follows: 1. Use Algorithm 4 to obtain n_samples samples of ⋃_{j∈E} H^j_{T−}. 2. Given the n-th sample {H^j_{T,n} = {t^j_{k,n}} | j ∈ E}, calculate the Hawkes conditional intensity. 3. Calculate the average to get the MBP conditional intensity, Eq. (73). The approximation error of this method depends on the number of samples n_samples: the more samples we take, the lower the standard error of the average in Eq. (73). Algorithm 4 is a modification of Ogata's thinning algorithm in Algorithm 1, where we consider the intensity contributed by the event sequences in ⋃_{j∈E^c} H^j_{T−} as part of the exogenous intensity, which in the usual case is just the constant µ.
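The three-step Monte Carlo scheme above can be sketched in the univariate case: compute the Hawkes conditional intensity for each sampled history, then average. This is an illustrative sketch, not the paper's implementation; it assumes an exponential kernel ϕ(t) = αθ exp(−θt), and the sampled histories are taken as given.

```python
import numpy as np

def hawkes_intensity(t, history, mu, alpha, theta):
    """Conditional intensity of a univariate exponential-kernel Hawkes process."""
    history = np.asarray(history, dtype=float)
    past = history[history < t]
    return mu + np.sum(alpha * theta * np.exp(-theta * (t - past)))

def mean_intensity(t, sampled_histories, mu, alpha, theta):
    """Steps 2-3 of the Monte Carlo scheme: average the Hawkes conditional
    intensity over sampled event histories to estimate the expected intensity;
    also return the standard error of the estimate."""
    vals = [hawkes_intensity(t, h, mu, alpha, theta) for h in sampled_histories]
    return float(np.mean(vals)), float(np.std(vals) / np.sqrt(len(vals)))
```

The standard error shrinks as 1/sqrt(n_samples), matching the remark above about the approximation error.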
Since the exogenous intensity is nonconstant under this modification, Algorithm 1 has to be adjusted, as it assumes a constant exogenous excitation. Specifically, we need to adjust the upper bound λ̄ so that its dominance holds until the next event after t. The i-th component of the conditional intensity λ(t), given the event sequences ⋃_{j∈E^c} H^j_{T−}, can be written as in Eq. (74). Given that every component of ϕ is non-increasing, a natural upper bound on ϕ_ij(t − t^j_k) is ϕ_ij(0). The upper bound λ̄(t) has to hold until the (stochastic) time of the next event. The only non-stochastic upper bound for |H^j_t| until the next event is |H^j_T|. Setting this upper bound, we obtain a λ̄ that is a correct upper bound for this setup. Algorithm 4: Simulating an e-dimensional Hawkes process on [0, T) given the observed event sequences ⋃_{j∈E^c} H^j_T by thinning. Input: kernel matrix ϕ(t), exogenous excitation µ, observed event sequences. As noted in the main text, a closed-form solution for ξ_E(t) of a generic PMBP(d, e) process cannot be written down except in special cases. One of these special cases, as we show in Appendix M, is the PMBP(2, 1) process with the exponential kernel. Here we present a comparison of the two approximation schemes, developed in Section 3.4 and Appendix E, with the closed-form solution for the exponential PMBP(2, 1) process. Fig. 9 shows a comparison of ξ^1_1(t) and ξ^2_1(t) computed through the numerical convolution approximation, as an expectation over Hawkes processes, and via the closed-form solution in the Appendix. We see agreement in all cases, showing the viability of our two approximation methods. In this section, we derive the appropriate upper bounds for the PMBP(d, e) process used in the thinning algorithm. We derive two different upper bounds. The first can be calculated in a simple manner, and we use it in our implementation of Algorithm 4 to generate samples from the PMBP(d, e) process.
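Thinning with a dominating rate that is refreshed after every candidate point (rather than fixed once, as a constant exogenous rate would allow) can be sketched as follows for a univariate exponential-kernel Hawkes process. The function and the univariate simplification are ours, not Algorithm 4 itself; the key property used is that, for a non-increasing kernel, λ(t+) dominates the intensity until the next event.

```python
import numpy as np

def sample_hawkes(mu, alpha, theta, t_max, seed=0):
    """Ogata's thinning for a univariate Hawkes process with kernel
    phi(t) = alpha * theta * exp(-theta * t) and constant exogenous rate mu.
    The dominating rate lam_bar is recomputed after every proposal, since the
    intensity changes whenever an event is accepted."""
    rng = np.random.default_rng(seed)
    events, t = [], 0.0
    while t < t_max:
        # lambda(t+) dominates until the next event: the exponential kernel
        # makes lambda non-increasing on any interval without events
        lam_bar = mu + sum(alpha * theta * np.exp(-theta * (t - tk)) for tk in events)
        t += rng.exponential(1.0 / lam_bar)
        if t >= t_max:
            break
        lam_t = mu + sum(alpha * theta * np.exp(-theta * (t - tk)) for tk in events)
        if rng.uniform() * lam_bar <= lam_t:  # accept with probability lam_t/lam_bar
            events.append(t)
    return np.array(events)
```

For a subcritical process (α < 1), the expected number of events on [0, T) is µT/(1 − α).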
However, this first upper bound has the potential to be large: if max_{t≤T} h_E^{ij}(t) is high for some (i, j) ∈ D × D, the thinning algorithm proposes and subsequently rejects many trial points. We therefore introduce a second upper bound that tracks the intensity more closely, leading to faster sampling. We use this second bound in our popularity prediction use case in Section 4.2. Upper Bound 1. We need to find an upper bound at an arbitrary time t > 0 (holding until the next stochastic event) for each of the four terms of the conditional intensity. Let h̄_E be the matrix whose (i, j) entry is max_{t≤T} h_E^{ij}(t), and let ν̄ = max_{s≤T} ν(s). Then the first term is bounded above by ν̄. For the second term, an upper bound that holds for all t follows from h̄_E and ν̄. For the third term, as in the usual thinning algorithm for Hawkes, an upper bound that holds until the next event is Σ_{j∈E^c} Σ_{t_k^j ≤ t} ϕ_{E^c}^j(t − t_k^j). Lastly, for the fourth term, an upper bound that holds until the next event follows from the chain of inequalities in which (a) makes the change of variable u = s − t_k^j, (b) uses the fact that ϕ^{ij}(t) = 0 for t < 0, (c) uses the fact that, since ϕ^{ij}(t) is non-increasing, Φ^{ij}(t) is non-decreasing and thus Φ^{ij}(t) ≥ Φ^{ij}(t − s) for s ≥ 0, and (d) bounds Φ_{E^c}^j by its maximum value on [0, T]. Combining the four bounds gives an upper bound at t that holds until the next event. Suppose we are given an exponential kernel ϕ^{ij}(t) = κ^{ij} θ^{ij} exp(−θ^{ij} t), with corresponding Φ^{ij}(t) = κ^{ij}(1 − exp(−θ^{ij} t)). We can then simplify the upper bound using Φ^{ij}(t) ≤ κ^{ij}. Upper Bound 2. Suppose the exogenous intensity is given by µ(t) = γ δ(t) + ν. An upper bound for the conditional intensity at t, holding until the next stochastic event, is obtained by upper-bounding each term in Eq. (77). The fourth term, H_E(∞) · ν, is obtained by noting that H_E(t) is a non-decreasing function and hence upper-bounded by H_E(∞). The delicate part is obtaining an expression for the rightmost term. Let t′ ≥ t.
Our goal is to find a function f(t′) that satisfies the dominance condition. We split the rightmost integral into two parts and bound the two terms separately. To proceed, introduce ĥ_E^{ij} and observe that ĥ_E(t) is non-increasing: given t′ ≥ t ≥ s, the value at t′ is dominated by the value at t. This yields an upper bound for the first term. For the second term, observe that, since we work under the assumption of no events between t and t′, the corresponding integral can be bounded in terms of Ĥ_E(t), the integral of ĥ_E(t). This is the upper bound for the second term. Algorithm 2 can be used to calculate the negative log-likelihood of a PMBP(d, e) process given interval-censored counts in the E dimensions and event sequences in the E^c dimensions. However, it only returns the conditional intensity ξ_E(t) at the points of the partition P[0, T] and does not return the compensator Ξ_E(t). These two observations are problematic if we want to compute the negative log-likelihood L(Θ; T), because (1) we need the compensator Ξ_E(t) to compute L(Θ; T), and (2) we have to evaluate ξ_E(t) at every timestamp in ∪_{j∈E^c} H_t^j and the compensator at the left and right endpoints of the observation intervals; these points need not coincide with the points of P[0, T]. Though we could interpolate our points of interest (i.e., the event timestamps and the censor points) onto the partition, interpolation is prone to error. A more accurate approach is to calculate ξ_E(t) and Ξ_E(t) directly at the points of interest. In this section, we introduce Algorithm 5, which allows us to compute L(Θ; T) given a set of event sequences and observed counts. Let T = ∪_{j∈E^c} {t_p^j} be the collection of all observed event timestamps in the E^c dimensions. Let O = ∪_{j∈E} {o_p^j} = ∪_{j∈E} O^j be the collection of all censor points in the E dimensions. Let P = {0 = t_0 < · · · < t_P = T} be a partition of the time interval [0, T] with step size T/P.
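Merging the three collections into a single labeled, time-ordered sequence can be sketched as follows (a minimal illustration with names of our choosing, not the paper's code):

```python
import numpy as np

def points_of_interest(event_times, censor_points, T, n_grid):
    """Union of event timestamps, censor points, and the regular grid P[0, T],
    each tagged with its origin, sorted in time order."""
    grid = np.linspace(0.0, T, n_grid + 1)
    labeled = ([(t, "event") for t in event_times]
               + [(o, "censor") for o in censor_points]
               + [(t, "grid") for t in grid])
    # Iterating over this list in order, the label tells which likelihood
    # term (point-process or interval-censored) each point contributes to.
    return sorted(labeled)
```

The labels let a single forward pass accumulate both likelihood terms without interpolating off the grid.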
Define the points-of-interest set T̃ as the union of these three sets, with each point labeled by its origin. We iterate over T̃ in Algorithm 5, storing the contribution of each point of interest to the interval-censored log-likelihood and the point-process log-likelihood defined in Eq. (63) and Eq. (64), respectively. The overall log-likelihood L(Θ; T) is then the sum of these two. In this section we calculate the gradient of the negative log-likelihood L(Θ; T). Our starting point is Eq. (62). Taking the gradient of both sides with respect to the parameter vector Θ and using linearity of the gradient operator, we decompose the gradient into two parts; taking the gradient of Eq. (63) and Eq. (64) yields the interval-censored and point-process contributions. We now need to calculate the gradients of the conditional intensity ξ_E and the compensator Ξ_E. Recall that, in view of Eq. (61), our parameter vector Θ consists of the Hawkes kernel parameters contained in ϕ and the exogenous parameters γ and ν. Let θ ∈ {γ^i, ν^i}. Taking derivatives of Eq. (27) and Eq. (38) gives the exogenous-parameter gradients. Now let θ be a parameter of the Hawkes kernel ϕ; for example, if we use the exponential kernel in Eq. (4), then θ ∈ {θ^{ij}, α^{ij}}. Again taking derivatives of Eq. (27) and Eq. (38), we obtain the kernel-parameter gradients. The derivatives involving ϕ_{E^c} and Φ_{E^c} are straightforward to calculate given the form of the Hawkes kernel. The derivative ∂_θ h_E(t) can be calculated from ϕ_E(t) and its self-convolutions via a recursive relation, which we leverage to compute ∂_θ h_E(t) efficiently; the method is presented in Algorithm 6 (computing ∂_θ h_E recursively over a partition of [0, T]). Lastly, observe that Eq. (90) and Eq. (91) both contain a term involving convolution with ∂_θ h_E(t). From Proposition 23, computing this convolution requires an expression for ∂_θ H_E(t), the integral of ∂_θ h_E(t).
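The recursion behind this gradient computation can be illustrated in one dimension. This is our own grid-based sketch (not Algorithm 6 itself): h solves the renewal equation h = ϕ + ϕ ∗ h, and differentiating it with respect to a kernel parameter θ gives ∂_θ h = ∂_θ ϕ + ∂_θ ϕ ∗ h + ϕ ∗ ∂_θ h, which we solve by fixed-point iteration with a left-Riemann discrete convolution.

```python
import numpy as np

def conv(f, g, dt):
    """Causal left-Riemann approximation of the convolution (f * g) on a grid."""
    return np.convolve(f, g)[: len(f)] * dt

def h_and_grad(phi_vals, dphi_vals, dt, n_iter=50):
    """Solve h = phi + phi * h and its parameter derivative
    dh = dphi + dphi * h + phi * dh by fixed-point iteration (subcritical case)."""
    h = phi_vals.copy()
    for _ in range(n_iter):
        h = phi_vals + conv(phi_vals, h, dt)
    dh = dphi_vals.copy()
    for _ in range(n_iter):
        dh = dphi_vals + conv(dphi_vals, h, dt) + conv(phi_vals, dh, dt)
    return h, dh
```

For the exponential kernel ϕ(t) = αθe^{−θt}, h(t) = αθe^{(α−1)θt} is known in closed form, which makes the sketch easy to sanity-check against a finite-difference derivative.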
We can obtain this expression from ∂_θ h_E(t) as follows: integrating both sides of the recursion, then taking the derivative of both sides with respect to θ and applying the convolution product rule, yields ∂_θ H_E(t). In every iteration of Algorithm 5, the gradients ∂_θ ξ(t) and ∂_θ Ξ(t) can be computed alongside ξ(t) and Ξ(t) for every θ ∈ Θ. These values are then passed one layer up, to Eq. (88) and Eq. (89), to get the ICLL and PPLL gradients, respectively. Finally, the return values are passed to Eq. (87) to compute the overall gradient ∇_Θ L(Θ; T). Once a PMBP(d, e) model is fitted to a set of event sequences ∪_{j∈E^c} H_{T−}^j and interval-censored counts ∪_{j∈E} {C_k^j}_{k=1}^{n^j}, it can be used to predict the expected count of events in any interval of time past t = T. Suppose that we observe events and counts on [0, T_train) and we wish to predict the expected count of events in every dimension on every subinterval of the partition P[T_train, T_test), where T_test > T_train. The most straightforward way to do this is to continue the PMBP(d, e) process on [T_train, T_test) by using Algorithm 3 to draw sample histories. For each sample, we count the number of events in each dimension on each subinterval of P[T_train, T_test); the expected count on a subinterval is then the average of the counts on that subinterval over the set of sample histories. The problem with this approach is that the sampling is computationally inefficient, as we have to sample all dimensions simultaneously in Algorithm 3; this is especially problematic if some dimensions have a high background intensity. Let P[T_train, T_test) = {T_train = o_1 < · · · < o_P = T_test} be a partition of [T_train, T_test). A computationally efficient scheme to predict expected counts proceeds with the following three-step approach: 1. Sample only the E^c dimensions on [T_train, T_test). 2. Given each sample, compute the compensator Ξ_E(t) at the partition points {o_k}. 3. Average the compensator increments over the samples to obtain the expected count on each subinterval. This approach relies on two properties of the PMBP(d, e) process.
First, ξ_E(t) and Ξ_E(t) depend only on the E^c dimensions, and so we only need to sample these dimensions to calculate the intensity and compensator; both are independent of the events that occur in the E dimensions. Second, similar to what is stated in Remark 3 for the Hawkes process, the compensator Ξ_E(t) of a PMBP(d, e) process can be interpreted as the expected count of events on [0, t) given the event sequences ∪_{j∈E^c} H_{t−}^j. Note that there is a subtle difference between the two approaches. The first approach returns event sequences in each dimension, which we then count and average over to get expected values. The second approach directly estimates the expected counts, as it uses the compensator of the process. Fig. 10 shows a comparison of the two methods over 1000 samples. We assume here that we observe data up until T_train = 10 and we wish to get expected counts on P[10, 20) = {[10, 11), · · · , [19, 20)}. The solid lines show the estimates of the expected count, and it is evident that the two approaches agree very well. The uncertainty clouds around the lines, however, mean different things: since the second approach directly estimates the expected counts, the cloud around the blue line represents the variance of the expected counts, whereas the red cloud represents the variance of the counts across different histories. In this section, we provide additional details on the fitting procedure for the PMBP(3, 3) and PMBP(3, 2) models on the online video popularity task and on the filtering procedure we performed on ACTIVE to identify dynamic videos for performance evaluation. We present here details on fitting the PMBP(3, 3) and PMBP(3, 2) models on the ACTIVE dataset of views/shares/tweets. Specifically, we discuss four points: 1. assigning weights to each dimension's contribution to L(Θ; T), 2. regularizing the exogenous parameter ν, 3. hyperparameter tuning, and 4. the optimization algorithm.
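Before moving to the fitting details, the equivalence underlying the second prediction scheme, that compensator increments equal expected counts per subinterval, can be checked on a toy example. This is our own illustration with an inhomogeneous Poisson process (intensity and names chosen by us), comparing averaged sampled counts against the analytic compensator increments:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = lambda t: 2.0 + np.sin(t)                              # toy intensity, bounded by 3
Lam = lambda a, b: 2.0 * (b - a) + np.cos(a) - np.cos(b)     # compensator increment on [a, b)

def sample_counts(T=5.0, lam_bar=3.0):
    """Thinning for the inhomogeneous Poisson process; counts per unit interval."""
    counts = np.zeros(int(T))
    t = 0.0
    while True:
        t += rng.exponential(1.0 / lam_bar)
        if t >= T:
            return counts
        if rng.uniform() * lam_bar <= lam(t):
            counts[int(t)] += 1

# Sampling-based estimate (first approach) vs compensator increments (second approach).
avg = np.mean([sample_counts() for _ in range(2000)], axis=0)
expected = np.array([Lam(k, k + 1) for k in range(5)])
```

The averaged counts converge to the compensator increments, mirroring the agreement of the two approaches in Fig. 10, but the compensator route needs no sampling of the E dimensions at all.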
Assigning weights to each dimension's contribution to L(Θ; T). For the first point, recall from Eq. (62) that the negative log-likelihood for a PMBP process, given a mix of event sequences and interval-censored data, is defined as the sum of the likelihood contributions of the individual dimensions. A problem we encountered with this formulation is that it treats the dimensions equally, while the scale of each dimension's values can differ greatly. For instance, in the online video popularity case study, view counts are orders of magnitude higher than share counts and tweet counts, which causes the likelihood contribution of the views to dominate the total likelihood, prioritizing fit on the views over the tweets and shares. This is problematic in particular for the PMBP(3, 2) model, as we want to fit the tweet dimension well: it drives the self- and cross-exciting behavior of the process. To solve this, we assign a dimension weight hyperparameter w^j to each dimension j ∈ D, by which we multiply the log-likelihood contribution of dimension j. Instead of L(Θ; T) in Eq. (62), we use the dimension-weighted version L(Θ; T, w), given by the weighted sum of the per-dimension contributions. The corresponding gradients for this dimension-weighted version can be obtained by a trivial adjustment of Appendix I. Regularizing the exogenous parameter ν. For the PMBP models we fit in the online popularity case study in Section 4.2 and the COVID-19 case study in Section 4.3, we optimize the parameter set Θ = {θ, α, ν}. To reduce the size of the parameter space, the initial impulse parameters γ are fixed as hyperparameters. Similar to Bessy-Roland et al. (2021), we found that adding a term that regularizes the exogenous parameter ν to the dimension-weighted likelihood L(Θ; T, w) in Eq. (92) further improves performance. We introduce another hyperparameter w_ν that controls the level of regularization on the L1 norm ‖ν‖_1.
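The dimension-weighted objective is a simple weighted combination; a minimal sketch (function names are ours) of both the objective and its gradient:

```python
import numpy as np

def weighted_nll(per_dim_nll, weights):
    """L(Theta; T, w) = sum_j w_j * L_j(Theta; T): high-volume dimensions
    (e.g. views) are down-weighted so they do not drown out low-volume ones
    (e.g. tweets and shares)."""
    return float(np.dot(np.asarray(weights, float), np.asarray(per_dim_nll, float)))

def weighted_nll_grad(per_dim_grads, weights):
    """The gradient is the same weighted combination of per-dimension gradients,
    which is the 'trivial adjustment' of the unweighted gradient computation."""
    return sum(w * np.asarray(g, float) for w, g in zip(weights, per_dim_grads))
```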
Intuitively, using a higher w_ν is equivalent to biasing the model optimization away from high values of the baseline intensities ν, thereby resulting in larger values in α. The regularized version of Eq. (92), where w = {w^1, · · · , w^d, w_ν}, adds the penalty term w_ν ‖ν‖_1 to the dimension-weighted likelihood. This is the version of the log-likelihood that we use to fit the models in Section 4.2 and Section 4.3. Hyperparameter tuning. The hyperparameters that we tune in our model are shown in Table 4. There are three different types of hyperparameters. The first two are the dimension weights {w^j} and the ν-regularization weight w_ν discussed above. The third is how we fix the γ parameter. We consider two different modes: (1) max, where we fix γ for a particular video to the maximum of the daily view, share, and tweet counts on days 1-10, and (2) start, where we fix γ to the video's initial daily view, share, and tweet counts. The candidate hyperparameters we sweep over are also shown in Table 4. Note that we have different candidate hyperparameters for PMBP(3, 2) and PMBP(3, 3), selected based on heuristics. We use days 1-90 to perform hyperparameter tuning and fitting: specifically, we use days 1-75 as the training set for hyperparameter selection and days 76-90 as the validation set. Once we determine the best-performing hyperparameter set for a video, we refit the model on days 1-90. Lastly, the performance of the tuned model is evaluated on the test set, days 91-120. Optimization algorithm. To optimize L(Θ; T, w) over Θ, we use IPOPT, a nonlinear optimization solver for large-scale problems (Wächter and Biegler, 2006). The solver requires the gradient of the objective and the Hessian of the Lagrangian. We wrap the procedure in Appendix I as a function that computes the gradient iteratively.
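The tuning protocol above (grid over dimension weights, w_ν, and the γ mode; train on days 1-75, validate on 76-90, refit on 1-90, test on 91-120) can be sketched generically. `fit` and `score` stand in for the PMBP fitting and evaluation routines, which we leave abstract:

```python
import itertools

def tune(fit, score, data, grid):
    """grid: dict of hyperparameter name -> list of candidate values."""
    names = list(grid)
    best = None
    for combo in itertools.product(*(grid[n] for n in names)):
        hp = dict(zip(names, combo))
        model = fit(data["train"], hp)            # fit on days 1-75
        err = score(model, data["val"])           # validate on days 76-90
        if best is None or err < best[0]:
            best = (err, hp)
    final = fit(data["train_full"], best[1])      # refit on days 1-90
    return final, best[1], score(final, data["test"])   # evaluate on days 91-120
```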
For the Hessian of the Lagrangian, we use the limited-memory quasi-Newton approximation provided by IPOPT.

In our performance evaluation and baseline comparison in Section 4.2, we filtered the ACTIVE dataset for YouTube videos that have rich dynamics on days 21-90, and we showed that PMBP(3, 2) performs best on these dynamic videos. We present here the filtering procedure we implemented to arrive at the evaluation dataset in Fig. 5b: we filter ACTIVE for videos that have a mean daily tweet count on days 20-90 of at least a fixed threshold.

In this section, we (1) present the fits for the other countries in the global sample and (2) elaborate on the goodness-of-fit tests we perform on the PMBP(2, 1) fits to the COVID-19 daily case count and news article dataset. To check the model fit of PMBP(2, 1) to the cases-news data in Section 4.3, we perform separate goodness-of-fit tests for the news dimension and the case dimension, as we detail below. Observations in the news dimension are in the form of event timestamps t_1^2, · · · , t_{n_2}^2, where n_2 is the number of news articles. The time-rescaling theorem (Brown et al., 2002) says that Ξ^2(t_{j+1}^2) − Ξ^2(t_j^2) ∼ Exp(1) for j ∈ {1, · · · , n_2 − 1} (Eq. (94)). Applying this formula to the observed data gives us samples from Exp(1). Fig. 12 shows the Q-Q plots for {Ξ^2(t_{j+1}^2) − Ξ^2(t_j^2)}, comparing the samples to the theoretical Exp(1) distribution. We can also use the Kolmogorov-Smirnov (KS) test to check the significance of Eq. (94), which we report in Table 5. Observations in the case dimension are in the form of daily case counts C_1^1, · · · , C_{120}^1, corresponding to counts on [0, 1), · · · , [119, 120). Under the PMBP(2, 1) process, the dimension-1 subprocess is a Poisson process, and so C_j^1 ∼ Poi(Ξ^1(j) − Ξ^1(j − 1)) (Eq. (95)). To convert the C_j^1 observations into samples from a single distribution, we use the Anscombe transform (Anscombe, 1948): given x ∼ Poi(m), the transformation A(x) = 2√(x + 3/8) approximately yields Gaussian samples, A(x) ∼ N(2√(m + 3/8), 1), for large m (Eq. (96)). Combining Eq. (95) and Eq. (96) and subtracting the Gaussian mean, we see that 2√(C_j^1 + 3/8) − 2√(Ξ^1(j) − Ξ^1(j − 1) + 3/8) ∼ N(0, 1) (Eq. (97)). Fig. 13 shows the Q-Q plots for these residuals, comparing the samples to the theoretical N(0, 1) distribution. We can then apply the Skew-Kurtosis (SK) test for normality to check the significance of Eq. (97), which we report in Table 5. We also introduce a sampling-based score to measure the quality of fit of PMBP(2, 1) to the daily case counts. For each day j, we sample N_i observations from Poi(Ξ^1(j) − Ξ^1(j − 1)) and check whether the actual case count C_j^1 lies within the [2.5%, 97.5%] band of the distribution of the N_i samples. We do this for each day j ∈ {1, . . . , 120} and average the result: the metric is simply the percentage of days where the actual count falls within the interval predicted by the model. The fit scores are tabulated in Table 5.

M. Closed Form ξ_E(t) for the PMBP(2, 1) Process

Given a PMBP(2, 1) process and a fixed event sequence {t_1^2 < t_2^2 < . . . < t_N^2} for dimension 2 prior to time t, the intensity function ξ_1^1(t | H_t^1) can be interpreted as a univariate MBP process. Assuming an exponential kernel ϕ^{ij}(x) = α^{ij} θ^{ij} exp(−θ^{ij} x), the intensity function can be expressed in closed form using the impulse response function h(t) = α^{11} θ^{11} exp((α^{11} − 1)θ^{11} t) · 𝟙[t ≥ 0]. Setting ŝ(t) = µ^1 + Σ_{t_k^2 < t} ϕ^{12}(t − t_k^2), the closed form follows. (Algorithm 5 listing; output: NLL, the negative log-likelihood of the partial MBP process.)
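The goodness-of-fit battery can be sketched as follows. This is our own minimal version (names are illustrative): time-rescaling maps compensator increments between consecutive events to i.i.d. Exp(1) samples for the KS test, the Anscombe transform A(x) = 2√(x + 3/8) turns Poisson counts into approximately standard-normal residuals, and the coverage score is estimated by Poisson sampling.

```python
import numpy as np
from scipy import stats

def ks_time_rescaling(event_times, compensator):
    """KS test of the rescaled inter-event increments against Exp(1)."""
    z = np.diff([compensator(t) for t in event_times])
    return stats.kstest(z, "expon")

def anscombe_residuals(counts, expected):
    """Approximately N(0, 1) under counts_j ~ Poi(expected_j), for large means."""
    counts, expected = np.asarray(counts, float), np.asarray(expected, float)
    return 2 * np.sqrt(counts + 3 / 8) - 2 * np.sqrt(expected + 3 / 8)

def coverage_score(counts, expected, n_samp=1000, rng=None):
    """Fraction of days whose observed count lies in the central 95% band
    of Poi(expected_j), estimated by sampling."""
    rng = rng or np.random.default_rng(2)
    hits = 0
    for c, m in zip(counts, expected):
        s = rng.poisson(m, n_samp)
        lo, hi = np.quantile(s, [0.025, 0.975])
        hits += lo <= c <= hi
    return hits / len(counts)
```

For a well-specified model, the KS p-value should be large, the Anscombe residuals should look standard normal, and the coverage score should sit near 0.95.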
References

- Hawkes processes in finance. Market Microstructure and Liquidity
- Multivariate Hawkes process for cyber insurance
- Time series analysis of hybrid neurophysiological data and application of mutual information
- Is Gang Violent Crime More Contagious than Non-Gang Violent Crime?
- The Time-Rescaling Theorem and Its Application to Neural Spike Train Data Analysis
- Simple discrete-time self-exciting models can describe complex dynamic processes: A case study of COVID-19
- Strong mixing condition for Hawkes processes and application to Whittle estimation from count data
- Google Trends and COVID-19 in Italy: could we brace for impact? Internal and Emergency Medicine
- Robust dynamic classes revealed by measuring the response function of a social system
- An introduction to the theory of point processes
- Handling missing data in self-exciting point process models
- Video Popularity Prediction by Sentiment Propagation via Implicit Network
- An interactive web-based dashboard to track COVID-19 in real time
- Multidimensional Digital Signal Processing
- Real analysis: Modern techniques and their applications
- A time-modulated Hawkes process to model the spread of COVID-19 and the impact of countermeasures
- The Social Dynamics of Language Change in Online Networks
- An overview of modeling techniques for hybrid brain data
- A framework for the analysis of mixed time series/point process data: Theory and application to the study of physiological tremor, single motor unit discharges and electromyograms
- Modeling brand post popularity dynamics in online social networks
- Spectra of some self-exciting and mutually exciting point processes
- Vector Calculus, Linear Algebra, and Differential Forms: A Unified Approach
- Modeling stochastic processes in disease spread across a heterogeneous social system
- Hawkes and INAR(∞) processes. Stochastic Processes and their Applications
- An estimation procedure for the Hawkes process
- TiDeH: Time-Dependent Hawkes Process for Predicting Retweet Dynamics
- Describing and Predicting Online Items with Reshare Cascades via Dual Mixture Self-exciting Processes
- Estimating the time-varying reproduction number of COVID-19 with a state-space method
- Modelling online comment threads from their start
- Self-exciting point process models of civilian deaths in Iraq
- Is Gun Violence Contagious?
- Correlating events with time series for incident diagnosis
- Estimating Network Structure from Incomplete Event Data
- Modelling structure and predicting dynamics of discussion threads in online boards
- Imputing Missing Events in Continuous-Time Event Streams
- Feature driven and point process approaches for popularity prediction
- Self-Exciting Point Process Modeling of Crime
- Modeling Sparse Information Diffusion at Scale via Lazy Multivariate Hawkes Processes
- On Lewis' simulation method for point processes
- Statistical Models for Earthquake Occurrences and Residual Analysis for Point Processes
- Can Google Trends predict COVID-19 incidence and help preparedness? The situation in Colombia
- Signals, systems, and transforms
- Lecture Notes: Temporal Point Processes and the Conditional Intensity Function
- Estimation of certain parameters of a stationary hybrid process involving a time series and a point process
- A Tutorial on Hawkes Processes for Events in Social Media
- Expecting to be HIP: Hawkes intensity processes for social media popularity
- DEBATENIGHT: The role and influence of socialbots on Twitter during the first 2016 U.S. presidential debate
- Interval-censored Hawkes processes
- Hawkes process inference with missing data
- Analysis of interval-censored recurrent event processes subject to resolution
- A Monte Carlo EM Algorithm for the Parameter Estimation of Aggregated Hawkes Processes
- A Parameter Estimation Method for Multivariate Aggregated Hawkes Processes
- System identification with analog and counting process observations I: Hybrid stochastic intensity and likelihood
- System identification with analog and counting process observations II: Mutual information
- A test for independence between a point process and an analogue signal
- The second worldwide wave of interest in coronavirus since the COVID-19 outbreaks in South Korea, Italy and Iran: A Google Trends study
- Using Hawkes Processes to model imported and local malaria cases in near-elimination settings
- Chronocorrelator: Enriching events with time series
- Use of Google Trends to investigate loss-of-smell-related searches during the COVID-19 outbreak
- Measuring the association of a time series and a point process
- On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming
- Joint modeling of event sequence and time series with attentional twin recurrent neural networks
- Learning time series associated event sequences with recurrent point process networks
- Regression analysis of mixed panel count data with dependent terminal events
- Multivariate Spatiotemporal Hawkes Processes and Network Reconstruction
- How can our tweets go viral? Point-process modelling of brand content
- Disinformation warfare: Understanding state-sponsored trolls on Twitter and their influence on the web
- RedQueen: An Online Algorithm for Smart Broadcasting in Social Networks
- SEISMIC: A Self-Exciting Point Process Model for Predicting Tweet Popularity
- Regression analysis of mixed recurrent-event and panel-count data
- Statistical analysis of mixed recurrent event data with application to cancer survivor study
- Regression analysis of mixed recurrent-event and panel-count data with additive rate models

If θ^{21} ≠ θ^{12}, the compensator admits the closed form

Ξ_1^2(t | H_t^2) = µ^2 t + Σ_{k=1}^N α^{22}(1 − e^{−θ^{22}(t − t_k^2)}) + µ^1 α^{21}[t − (1 − e^{−θ^{21} t})/θ^{21}] + (α^{12}θ^{12} α^{21}θ^{21})/(θ^{21} − θ^{12}) Σ_{k=1}^N [(1 − e^{−θ^{12}(t − t_k^2)})/θ^{12} − (1 − e^{−θ^{21}(t − t_k^2)})/θ^{21}] + µ^1 (α^{11}α^{21}θ^{21})/(α^{11} − 1) [(1/(θ^{21} + (α^{11} − 1)θ^{11})) ((e^{(α^{11} − 1)θ^{11} t} − 1)/((α^{11} − 1)θ^{11}) + (e^{−θ^{21} t} − 1)/θ^{21}) − t/θ^{21} + (e^{−θ^{21} t} − 1)/(θ^{21})^2] + (α^{11}θ^{11} α^{12}θ^{12} α^{21}θ^{21})/((α^{11} − 1)θ^{11} + θ^{12}) Σ_{k=1}^N (e^{(α^{11} − 1)θ^{11}(t − t_k^2)} − 1)/((θ^{21} + (α^{11} − 1)θ^{11})(α^{11} − 1)θ^{11}) + …

N. Convexity Analysis of the PMBP(2, 1) Likelihood

Consider a PMBP(2, 1) process. Suppose the associated kernel can be expressed as ϕ^{ij}(t) = α^{ij} f^{ij}(t; θ^{ij}), where f^{ij}(t; θ^{ij}) is a probability density over [0, ∞) parametrized by θ^{ij}. Given a multiple-impulse exogenous input function, we show that, for a fixed set of {θ^{ij}} parameters and given observed point data Ψ_1 = {s_k^1} and Ψ_2 = {s_k^2} over [0, T) for dimensions 1 and 2, the negative point-process log-likelihood (PP-PP NLL) of the PMBP(2, 1) process is convex in the {α^{ij}} parameters. We first consider a single impulse as the exogenous input, with impulse magnitudes and locations a, b, c, d ≥ 0. Given this general impulse exogenous input, we aim to show that the PP-PP NLL is convex in the {α^{ij}} parameters. Given observed data points Ψ_1 = {s_k^1} and Ψ_2 = {s_k^2} over [0, T) in the two dimensions, the PP-PP NLL is given by Eq. (104). Our goal is to show that Eq. (104) is convex in the {α^{ij}} parameters for fixed {θ^{ij}}. Our strategy is to show that each of the four terms in Eq. (104) is a convex function of {α^{ij}}. First, note that the map l : x → − log x is convex and non-increasing.
As such, for − log ξ_1^1(s_k^1; {α^{ij}}) to be convex, ξ_1^1(·; {α^{ij}}) has to be convex, and for the last two terms of Eq. (104) to be convex, Ξ_1^1(T; {α^{ij}}) and Ξ_1^2(T; {α^{ij}}) have to be convex. I. Convexity of ξ_1^1(·; {α^{ij}}). Let us split Eq. (105) into three terms, I.A, I.B, and I.C. For I.A, note that Σ_{n=2}^∞ n(n−1)(α^{11})^n (f^{11})^{⊗n} is upper-bounded by Σ_{n=2}^∞ n(n−1)(α^{11})^n, which can be shown to be convergent by the integral test. It follows that the set of eigenvalues of Hess_{I.A} is {0, a/(α^{11})^2 Σ_{n=2}^∞ n(n−1)(α^{11})^n (f^{11})^{⊗n}(t − a)}. Since all of the eigenvalues of Hess_{I.A} are non-negative, Hess_{I.A} is positive semidefinite and the term I.A is convex. For the I.B term, since ϕ^{ij} = α^{ij} f^{ij}(·; θ^{ij}) is linear in α^{ij}, the Hessian matrix of ϕ_2^{12}(t − t_k^2) is identically zero, and thus I.B is trivially convex. For the I.C term, since h^{11} and ϕ^{12} are differentiable functions of α^{ij}, the Hessian splits into two terms. Since the Hessian of ϕ_2^{12}(t − t_k^2) is identically zero, the first term of I.C is trivially zero. For the second term, since ∂²h_1^{11}/(∂α^{ij})² = 0 unless i = j = 1, we only need to consider the i = j = 1 case. Using Eq. (115), we obtain an expression that is upper-bounded by α^{12}/(α^{11})^2 |H_t^2| Σ_{n=2}^∞ n(n−1)(α^{11})^n, which is convergent and positive. Thus, I.C is convex. Since I.A, I.B, and I.C are all convex, ξ_1^1 is convex in {α^{ij}}. II. Convexity of ξ_1^2(·; {α^{ij}}). Let us split Eq. (106) analogously; the argument mirrors that for I. III. Convexity of Ξ_1^1(·; {α^{ij}}). For III.A, Σ_{n=2}^∞ n(n−1)(α^{11})^n can again be shown to be convergent by the integral test. Since the eigenvalues of Hess_{III.A}, {0, a/(α^{11})^2 Σ_{n=2}^∞ n(n−1)(α^{11})^n ∫_c^t (f^{11})^{⊗n}(τ − c) dτ}, are non-negative, III.A is convex, and consequently Ξ_1^1 is convex. IV. Convexity of Ξ_1^2(·; {α^{ij}}). Similar to III, IV.B and IV.C have zero Hessians, as Φ_2^{ij} is linear in α^{ij}. For IV.A, from Eq. (119) we get an expression in which α^{21}/(α^{11})^3 Σ_{n=3}^∞ (n−1)(n−2)(α^{11})^n ∫_d^t f^{21} ∗ (f^{11})^{⊗(n−1)}(τ − d) dτ is upper-bounded by α^{21}/(α^{11})^3 Σ_{n=3}^∞ (n−1)(n−2)(α^{11})^n, which is convergent by the integral test.
From this we see that the eigenvalues of Hess_{IV.A} are {0, b α^{21}/(α^{11})^3 Σ_{n=3}^∞ (n−1)(n−2)(α^{11})^n ∫_d^t f^{21} ∗ (f^{11})^{⊗(n−1)}(τ − d) dτ}. As these are non-negative, Hess_{IV.A} is positive semidefinite, and so IV.A is convex. Consequently, Ξ_1^2 is convex. As ξ_1^1, ξ_1^2, Ξ_1^1, and Ξ_1^2 are all convex in {α^{ij}}, it follows that the PP-PP NLL for the given exogenous function µ(t) is convex in {α^{ij}}.