key: cord-0427429-nh1qrgy7 authors: Okawa, Maya; Iwata, Tomoharu; Tanaka, Yusuke; Toda, Hiroyuki; Kurashima, Takeshi; Kashima, Hisashi title: Dynamic Hawkes Processes for Discovering Time-evolving Communities' States behind Diffusion Processes date: 2021-05-24 journal: nan DOI: 10.1145/3447548.3467248 sha: 79c6a97ff1e4aef7a911899ceeb3496493add70a doc_id: 427429 cord_uid: nh1qrgy7

Sequences of events including infectious disease outbreaks, social network activities, and crimes are ubiquitous, and the data on such events carry essential information about the underlying diffusion processes between communities (e.g., regions, online user groups). Modeling diffusion processes and predicting future events are crucial in many applications including epidemic control, viral marketing, and predictive policing. Hawkes processes offer a central tool for modeling diffusion processes, in which the influence from past events is described by the triggering kernel. However, the triggering kernel parameters, which govern how each community is influenced by past events, are assumed to be static over time. In the real world, the diffusion processes depend not only on the influences from the past, but also on the current (time-evolving) states of the communities, e.g., people's awareness of the disease and people's current interests. In this paper, we propose a novel Hawkes process model that is able to capture the underlying dynamics of community states behind the diffusion processes and predict the occurrences of events based on those dynamics. Specifically, we model the latent dynamics function that encodes these hidden dynamics by a mixture of neural networks. Then we design the triggering kernel using the latent dynamics function and its integral. The proposed method, termed DHP (Dynamic Hawkes Processes), offers a flexible way to learn complex representations of the time-evolving communities' states, while at the same time allowing the exact likelihood to be computed, which makes parameter learning tractable. Extensive experiments on four real-world event datasets show that DHP outperforms five widely adopted methods for event prediction.

Various social phenomena can be described by diffusion processes among multiple communities. For example, infectious diseases like COVID-19 are transmitted from one country to another, leading to a worldwide pandemic [35]. Information such as opinions, news, and articles is shared and disseminated among online communities, e.g., user groups in social networks, news websites, and blogs. Such diffusion phenomena are recorded as multiple sequences of events, which indicate when and in which community each event occurred. Understanding the diffusion mechanism and predicting future events are crucial for many practical applications across domains. For example, policymakers would be able to design prompt and appropriate interventions to curb the spread of disease given a better understanding of the mechanisms behind the transmission and more reliable predictions. Temporal point processes provide an elegant mathematical framework for modeling event sequences. In these methods, the probability of event occurrences is determined by the intensity function. The Hawkes process is an important class of point processes for modeling diffusion processes. These models use the triggering kernel to characterize diffusion processes and estimate its parameters via maximum likelihood.
The triggering kernel encodes the magnitude and speed of influence from past events, namely, how likely and how quickly past events in one community (i.e., the "source" community) will affect the occurrence of a particular event in another community (i.e., the "target" community). The Hawkes process and its variants have been applied in diverse areas, from epidemic modeling [16] to social network analysis [12, 30, 41]. However, they have focused on learning the static influence of past events on the current event, thereby largely overlooking the factor of time-evolution. In reality, the diffusion processes depend not only on the influences from the past but also on the current state of the target communities. For example, the outbreaks of infectious diseases in one community (e.g., country) can also be driven by people's awareness of the disease in each community (country) and their preventive behaviors, which can constantly change over time, on top of the past record of disease occurrences. As another example, information diffusion heavily depends on the ongoing interests of people in the target community (e.g., online user group). In particular, the spread of information to a target community (user group) is strengthened when a topic deemed important by the target community emerges in the online space, while it is weakened in accordance with a gradual loss of people's interest in the topic.

A few studies have considered the underlying dynamics of such "states" in communities. For instance, the SIR-Hawkes model [31] redesigned the triggering kernel of the Hawkes process by incorporating the recovered (immune) population dynamics over the course of the pandemic. Kobayashi et al. [18] proposed a time-dependent triggering kernel that varies periodically in time for modeling daily cycles of human activity. However, these approaches rely on handcrafted functions for describing the latent dynamics of states and so demand expert domain knowledge. Moreover, they may not be flexible enough to accommodate the complexity and heterogeneity of the real world. In fact, in many practical applications the complete set of factors is largely unknown and thus difficult to model through restricted parametric forms. Taking information diffusion as an example, the time-evolution of people's interest in a given topic is generally unknown and not directly observable. A potential solution is to directly model the triggering kernel parameters using a flexible function of time (e.g., a neural network). Alas, naively employing this approach makes parameter learning intractable since the log-likelihood of Hawkes processes involves the integral of the triggering kernel. Computing the integral of the triggering kernel in combination with a neural network is generally infeasible.

In this paper, we propose a novel Hawkes process model referred to as DHP (Dynamic Hawkes Process), which automatically learns the underlying dynamics of the communities' states behind the diffusion processes in a manner that allows tractable learning. We introduce a latent dynamics function for each community that represents its hidden dynamic states. Our core idea is to extend the triggering kernel by combining it with the latent dynamics function and its integral. Specifically, we model the magnitude of diffusion by the latent dynamics function and the speed of diffusion by the integral of the latent dynamics function. This design choice offers two benefits.
First, the resulting triggering kernel can be expressed as a product of two components: a composite function whose "inner" function is the integral of the latent dynamics function and whose "outer" function is the basic triggering kernel; and the derivative of that inner function (i.e., the latent dynamics function itself). Hence, by applying the substitution rule for definite integrals (i.e., the chain rule in reverse), we can obtain a closed-form solution for the integral of the triggering kernel involved in the log-likelihood. Second, it allows capturing simultaneous changes in the magnitude and speed of diffusion, as they are related through the latent dynamics function and its integral, which is desirable for many applications. For example, in the context of disease spread, active preventive measures can reduce both the magnitude and the speed of the infection. To model the integral of the latent dynamics function, we utilize and extend a monotonic neural network [6, 33]. This formulation enables DHP to learn flexible representations of the community state dynamics that underlie the diffusion processes. It should be noted that DHP can be easily extended to capture the time-evolving relationships between communities by introducing the latent dynamics function for pairs of communities. In this work, we adopt DHP to demonstrate the hidden state dynamics of individual communities.

Figure 1: Dynamics of community states learned by our DHP for the Reddit dataset. Top: Intensity with observed event sequences and latent dynamics function for two Reddit communities (i.e., subreddits): news and space. Bottom: Learned triggering kernel between 8 selected subreddits at 3 different time points. Nodes denote subreddits; color indicates category. Node size is proportional to the latent dynamics function for each subreddit. Edge width is proportional to the triggering kernel, which indicates the strength of diffusion between pairs of subreddits. We can see that the latent dynamics function increases over time for most subreddits from March to May 2020. It increases rapidly for news and slowly for space following the onset of the COVID-19 lockdown. Our DHP automatically learns how the activities of each community evolve over the course of the pandemic.

The main contributions of this paper are as follows:
• We propose a novel Hawkes process framework, DHP (Dynamic Hawkes Process), for modeling diffusion processes and predicting future events. The proposal, DHP, is able to learn the time-evolving dynamics of community states behind the diffusion processes.
• We introduce the latent dynamics function, which reflects the hidden community dynamics, and design the triggering kernel of the Hawkes process intensity using the latent dynamics function and its integral. The resulting model is computationally tractable and flexible enough to approximate the true evolution of the community states underlying the diffusion processes.
• We carry out extensive experiments using four real-world event datasets: Reddit, News, Protest, and Crime. The results show that DHP outperforms the existing methods. Case studies demonstrate that DHP uncovers the hidden state dynamics of communities which underlie the diffusion processes via the latent dynamics function (see Figure 1).
With the evolution of data collection technology, extensive event sequences with precise timestamps are becoming available in an array of fields such as public health and safety [20, 22, 39], economics and finance [2, 5, 14], communications [12], reliability [3, 34, 36], and seismology [25] [26] [27]. Temporal point processes provide a principled theoretical framework for modeling such event sequences, in which the occurrences of events are determined by the intensity function. Classical examples of temporal point processes include the reinforced Poisson process [29], the self-correcting point process [15], and the Hawkes process [13]. The reinforced Poisson process [29] considers the cumulative count of past events and a time-decreasing trend, and has recently been applied to predicting online popularity [32]. The intensity of the self-correcting point process [15] increases steadily, and this trend is corrected by past observed events. Although these models have been widely used, they are not suitable for modeling diffusion processes between communities as they cannot explicitly model the influence of the past events underlying diffusion processes. The Hawkes process [13] explicitly models the influence of past events and captures triggering patterns between events (i.e., diffusion processes). Hawkes processes have been proven effective for modeling diffusion processes, including earthquakes and aftershocks [22], near-repeat patterns of crimes [22], financial transactions [2, 11, 14], online purchases [8, 10, 38, 40], and information cascades [12, 30, 41].

Recent studies employ neural network architectures to model the point process intensity. In [9], the authors design the intensity using an RNN. Omi et al. [28] extended this work by combining it with a monotonic neural network. Compared to classical point process methods, RNN-based models provide a more flexible way to handle the complex dependencies between events. However, the above methods focus on learning the triggering patterns of diffusion processes, i.e., influences from past events, and disregard the current (time-evolving) states of the communities. In the real world, event occurrences largely depend on the current community states (e.g., people's awareness of the disease, people's ongoing interests), which can evolve over time, as well as on the past. Several studies incorporate the time-variant dynamics of the community states behind diffusion processes into the Hawkes process formulation. For instance, the SIR-Hawkes model [31] considers recovered (immune) population dynamics to enhance the prediction of infectious disease events over the course of a pandemic. Kobayashi et al. [18] proposed a time-dependent Hawkes process that accounts for the circadian and weekly cycles of human activity. Navaroli et al. [23] used nonparametric estimation to learn cyclic human activities underlying digital communications. All of the above methods, unfortunately, rely on a domain expert's knowledge to elucidate the dynamics of the communities' states behind diffusion processes. Such dynamics are often quite complex and remain unexplored in many practical applications. Different from the existing methods, our proposed method incorporates both the temporal dynamics of communities' states and the influences of past events.

This section provides the general framework of point processes on which our work is built, and the formal definition of the event prediction problem studied in this paper.
A point process is a random sequence of events occurring in continuous time $\{t_1, t_2, \cdots\}$, with $t_i \in [0, T)$. Point processes are fully determined by the "intensity" function $\lambda(t)$. Given the history of events $\mathcal{H}(t)$ up to time $t$, the intensity is defined as
$$\lambda(t) = \lim_{\Delta t \to 0} \frac{\mathbb{E}\left[N(t + \Delta t) - N(t) \mid \mathcal{H}(t)\right]}{\Delta t}, \qquad (1)$$
where $N(t)$ is the number of events falling in $[0, t)$, $\Delta t$ is a small time interval, and $\mathbb{E}$ is an expectation. The intensity value $\lambda(t)$ at time $t$ measures the probability that an event occurs in the infinitesimal time interval $[t, t + \Delta t)$ given the past events $\mathcal{H}(t)$.

The Hawkes process [13] is an important class of point processes, and can describe self-exciting phenomena. The intensity of a Hawkes process is defined as
$$\lambda(t) = \mu + \sum_{i: t_i < t} g(t - t_i), \qquad (2)$$
where $\mu \geq 0$ is a background rate and $g(\cdot) \geq 0$ is a triggering kernel encoding the augmenting or attenuating effect of past events on current events. Intuitively, each event at time $t_i$ elevates the occurrence rate of events at time $t$ by the amount $g(t - t_i)$ for $t > t_i$.

The univariate Hawkes process can be extended to the multivariate Hawkes process (MHP) to handle the mutual excitation of events (i.e., diffusion) among different communities (denoted by dimensions). Suppose we have historical observations $\mathcal{D} = \{(t_i, u_i)\}_{i=1}^{n}$ with time $t_i \in [0, T)$ and community $u_i \in \{1, \ldots, U\}$. In our setting, the communities indicate countries, city districts, online user groups, or news websites. For a $U$-dimensional multivariate Hawkes process, the intensity of the $u$-th dimension takes the following form:
$$\lambda_u(t) = \mu_u + \sum_{i: t_i < t} g_{u, u_i}(t - t_i), \qquad (3)$$
where $\mu_u$ is the background rate of dimension $u$ and $g_{u, u'}(\cdot) \geq 0$ is the triggering kernel that captures the impact of an event in community $u'$ on the occurrence of an event in community $u$. The typical choice for the triggering kernel is the exponential memory kernel, which is defined by
$$g_{u, u'}(\Delta t) = \alpha_{u, u'} \exp(-\beta_{u, u'} \Delta t), \qquad (4)$$
where $\Delta t$ represents the time interval $\Delta t = t - t_i$, $\alpha_{u, u'}$ quantifies the magnitude of the influence from community $u'$ on the event occurrence in community $u$, and $\beta_{u, u'}$ controls how quickly its effect decays in time (i.e., the speed of the diffusion). Other candidates include the power-law kernel [26], the Rayleigh kernel [37], and the log-normal distribution [23]. The negative log-likelihood function of a multivariate Hawkes process over the time interval $[0, T]$ is given by:
$$\mathcal{L} = -\sum_{i=1}^{n} \log \lambda_{u_i}(t_i) + \sum_{u=1}^{U} \int_{0}^{T} \lambda_u(t)\, dt. \qquad (5)$$
The key notations used in the paper are listed in Table 3 of Appendix A.

An event is represented by the pair $(t_i, u_i)$, where $t_i$ and $u_i$ denote the time and the community (e.g., country, news website) where the event happened, respectively. An event sequence is defined as the set
$$\mathcal{D} = \{(t_i, u_i)\}_{i=1}^{n}. \qquad (6)$$

In this section, we present DHP (Dynamic Hawkes Process), a novel multivariate Hawkes process framework for event prediction; it can learn the time-evolution of the communities underlying the diffusion processes. Figure 2 illustrates DHP. We design the triggering kernel of the DHP intensity (panel A in Figure 2) as the product of two components: the triggering kernel with the input of time-rescaled events (panel B in Figure 2), which learns the decaying influence of past events; and the latent dynamics function (panel C), which adjusts the magnitude of the influence from past events. The latent dynamics function describes the time-evolving states of the communities (indicated by dimensions). In the context of disease spread, the latent dynamics function represents the dynamics of people's awareness of the disease in each country. For information diffusion, it characterizes the temporal evolution of readers' interests in news websites. We elaborate on the formulation of DHP in § 4.1, followed by parameter learning (§ 4.2). The prediction procedure is described in Appendix C.
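To make the preliminaries concrete, the following minimal sketch evaluates the multivariate Hawkes intensity of Equation 3 with the exponential kernel of Equation 4 and accumulates the negative log-likelihood of Equation 5. The array names (`mu`, `alpha`, `beta`) and the brute-force loops are illustrative assumptions only, not the implementation used in this paper.

```python
import numpy as np

def mhp_intensity(t, u, events, mu, alpha, beta):
    """Multivariate Hawkes intensity lambda_u(t) with an exponential kernel
    (Eqs. 3-4). `events` is a time-ordered list of (t_i, u_i) pairs;
    mu[u] >= 0, alpha[u, u'] >= 0 and beta[u, u'] > 0 are parameter arrays."""
    lam = mu[u]
    for t_i, u_i in events:
        if t_i < t:
            lam += alpha[u, u_i] * np.exp(-beta[u, u_i] * (t - t_i))
    return lam

def mhp_negative_log_likelihood(events, T, mu, alpha, beta, U):
    """Negative log-likelihood over [0, T] (Eq. 5): minus the log-intensity at
    each observed event, plus each dimension's compensator, which is available
    in closed form for the exponential kernel."""
    nll = 0.0
    for i, (t_i, u_i) in enumerate(events):
        lam = mhp_intensity(t_i, u_i, events[:i], mu, alpha, beta)
        nll -= np.log(lam + 1e-12)  # small constant guards against log(0)
    for u in range(U):
        compensator = mu[u] * T
        for t_i, u_i in events:
            compensator += alpha[u, u_i] / beta[u, u_i] * (1.0 - np.exp(-beta[u, u_i] * (T - t_i)))
        nll += compensator
    return nll
```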
The proposed model specifies the intensity of the Hawkes process for dimension $u$ as
$$\lambda_u(t) = \mu_u + \sum_{i: t_i < t} \eta_u(t)\, g_{u, u_i}(\widetilde{\Delta t}_i), \qquad (7)$$
where $\mu_u$ is the background rate for the $u$-th dimension (i.e., community), $g_{u, u_i}(\cdot)$ is any chosen triggering kernel between dimension $u$ and dimension $u_i$, such as the exponential memory kernel or the log-normal distribution, and $\widetilde{\Delta t}_i$ is the time-rescaled, or transformed, time interval between the current time $t$ and the time $t_i$ of the $i$-th event. $\eta_u(t) \geq 0$ is the latent dynamics function; it represents the dynamics of the $u$-th community underlying the diffusion processes at time $t$, which controls the magnitude of diffusion. The transformed time interval $\widetilde{\Delta t}_i$ is defined by the integral of the latent dynamics function between $t_i$ and $t$ as follows:
$$\widetilde{\Delta t}_i = \int_{t_i}^{t} \eta_u(s)\, ds = \Lambda_u(t) - \Lambda_u(t_i), \qquad (8)$$
where $\Lambda_u(t) = \int_0^t \eta_u(s)\, ds$ denotes the integral function of the continuous-time latent dynamics function $\eta_u(t)$. The above formulation can be understood by considering an analogy drawn from the time-rescaling theorem [4]. Intuitively, this transformation adjusts the influence of each event by stretching or shrinking time based on the value of the latent dynamics function $\eta_u(t)$: when $\eta_u(t) < 1$, the transformed interval is shorter than the actual elapsed time, so the influence of past events decays more slowly; when $\eta_u(t) > 1$, it decays more quickly. This formulation assumes that the speed of diffusion varies according to the temporal dynamics of the target community $u$, which is captured by $\eta_u(t)$. This assumption is realistic; for instance, disease spread is controlled by people's awareness of the disease in each country and their preventive behaviors. Information diffusion is largely influenced by the readers' interest in each news website. It is worth mentioning that the latent dynamics function can be easily extended to consider the dynamics of pairwise interactions between dimensions, by redefining the latent dynamics function as $\eta_{u, u'}(t)$. The following discussion holds even under this extension. Our formulation allows considering the latent state of each dimension at the current time as well as the influence from past events. Also, it can capture simultaneous changes in diffusion magnitude and speed, which is desirable for many applications (as discussed in the following paragraph). Most importantly, it enables us to compute the analytic integral of the intensity, which is required for evaluating the log-likelihood (further discussion can be found in § 4.2) and predicting the number of future events (see Appendix C).

Triggering Kernel. The triggering kernel can have many forms. For example, we can assume the exponential memory kernel for $g_{u, u'}(\cdot)$, i.e.,
$$g_{u, u'}(\widetilde{\Delta t}_i) = \alpha_{u, u'} \exp(-\beta_{u, u'} \widetilde{\Delta t}_i), \qquad (9)$$
where $\alpha_{u, u'}$ encompasses the magnitude of the static interaction between the $u$-th and $u'$-th dimensions, and $\beta_{u, u'}$ weights the decay of the influence over time. Notice that the above formulation relies on the implicit assumption that the magnitude and speed of diffusion are related through the latent dynamics function $\eta_u(t)$, which controls the magnitude of diffusion; its integral $\Lambda_u(t)$ governs the speed of diffusion. For example, when $\eta_u(t) = 2$ for every $t$, each summand in the second term of Equation 7 becomes $2\alpha_{u, u_i} \exp(-2\beta_{u, u_i} \Delta t)$, where $\Delta t = t - t_i$. When $\eta_u(t) = 0.5$ for every $t$, it is $0.5\alpha_{u, u_i} \exp(-0.5\beta_{u, u_i} \Delta t)$. This assumption is reasonable since the magnitude and speed of diffusion vary simultaneously in many cases. Taking disease transmission as an example, active prevention measures can reduce both the magnitude and the speed of the infection. The influence of the latent dynamics function on the magnitude of diffusion is tuned by $\alpha_{u, u'}$, and the influence of its integral on the speed of diffusion is tuned by $\beta_{u, u'}$.
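Before turning to the concrete design of $\eta_u(\cdot)$, here is a minimal sketch of how the DHP intensity of Equation 7 with the exponential kernel of Equation 9 could be evaluated once the latent dynamics function and its integral are available. The callables `eta` and `Lambda` and the toy dynamics at the end are illustrative assumptions only; in the proposed model they are produced by the monotonic neural networks described next.

```python
import numpy as np

def dhp_intensity(t, u, events, mu, alpha, beta, eta, Lambda):
    """Sketch of the DHP intensity lambda_u(t) (Eq. 7) with the exponential
    kernel (Eq. 9). `events` is a list of (t_i, u_i) pairs; `eta(u, t)` and
    `Lambda(u, t)` stand in for the latent dynamics function and its integral."""
    lam = mu[u]
    for t_i, u_i in events:
        if t_i < t:
            # Time-rescaled interval of Eq. 8: stretch or shrink elapsed time
            # according to the community's latent state.
            dt_rescaled = Lambda(u, t) - Lambda(u, t_i)
            lam += eta(u, t) * alpha[u, u_i] * np.exp(-beta[u, u_i] * dt_rescaled)
    return lam

# Toy latent dynamics (ignores u): a community whose state decays over time.
eta = lambda u, t: np.exp(-0.1 * t)
Lambda = lambda u, t: (1.0 - np.exp(-0.1 * t)) / 0.1  # integral of eta on [0, t]
```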
Latent dynamics function. The design of $\eta_u(\cdot)$ is flexible, so any non-negative function can be used. Inspired by [28], we utilize and extend a monotonic neural network [6, 33], which learns a strictly monotonic function, to design the latent dynamics function. Concretely, we model the integral function $\Lambda_u(t)$ using the monotonic neural network. This guarantees that its derivative (i.e., the latent dynamics function $\eta_u(t)$) is strictly non-negative, so the intensity $\lambda_u(t)$ results in a non-negative function. In describing the integral function, we propose to further enhance the expressiveness of the monotonic neural network by using a mixture of monotonic neural networks. Formally,
$$\Lambda_u(t) = \sum_{m=1}^{M} w_m \Phi_m(t) + b_0, \qquad (10)$$
where $M$ is the number of mixture components, $\Phi_m(\cdot)$ is the $m$-th monotonic neural network, $w_m$ is the mixture weight of the $m$-th component, and $b_0$ is a bias parameter for the output layer. To preserve the monotonicity of the integral of the latent dynamics function $\Lambda_u(t)$, we impose non-negativity constraints on the mixture weights $\{w_1, \ldots, w_M\}$ and the parameter $b_0$. For each dimension, we construct fully connected neural layers with monotonic activation functions. Whenever the context is clear, we simplify the notation $\Phi_m(\cdot)$ to $\Phi(\cdot)$. At each layer $l \in \{1, 2, \ldots, L\}$ of the monotonic neural network, the hidden-state vector $\mathbf{h}^{(l)}$ is given by
$$\mathbf{h}^{(l)} = \sigma\bigl(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\bigr), \qquad (11)$$
where $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ are the parameter matrix and vector to learn for the $l$-th layer, respectively. The input of the first layer is time $t$: $\mathbf{h}^{(0)} = t$. $\sigma(\cdot)$ is a monotonic non-linear function. Following the previous work [6, 33], we use the tanh activation for hidden layers and the softplus for the last layer. The output of the monotonic neural network is $\Phi(t) = \mathbf{B}\mathbf{h}^{(L)}$, where $\mathbf{B}$ is a learnable weight matrix. The weight parameter matrices $\mathbf{W}^{(l)}$ and $\mathbf{B}$ are imposed to be non-negative. The latent dynamics function $\eta_u(t)$, which is the derivative of $\Lambda_u(t)$, takes the following form,
$$\eta_u(t) = \frac{d\Lambda_u(t)}{dt} = \sum_{m=1}^{M} w_m \phi_m(t), \qquad (12)$$
where $\phi_m(t)$ is the gradient of the monotonic neural network $\Phi_m(t)$ with respect to time $t$, namely $\phi_m(t) = d\Phi_m(t)/dt$. The gradient $\phi_m(t)$ can be obtained by applying the automatic differentiation implemented in deep learning frameworks such as TensorFlow [1]. As we place no restriction on the parametric form of the community dynamics underlying the diffusion processes, our model can fit various complex dynamics of each community's state. This design choice enables us to automatically learn unknown complex dynamics of the communities' states behind the diffusion processes, while at the same time allowing us to compute the exact log-likelihood for training, as described in § 4.2.

To learn the model parameters, we minimize the negative log-likelihood of Equation 5 with the DHP intensity of Equation 7. The problem here is to obtain the integral of the intensity in the last term, which reduces to
$$\int_{0}^{T} \lambda_u(t)\, dt = \mu_u T + \sum_{i: t_i < T} \int_{t_i}^{T} \eta_u(s)\, g_{u, u_i}\bigl(\Lambda_u(s) - \Lambda_u(t_i)\bigr)\, ds. \qquad (13)$$
Notice that each integrand of the above integral can be regarded as the product of the composite function $g_{u, u_i}\bigl(\Lambda_u(\cdot) - \Lambda_u(t_i)\bigr)$, whose "inner" function is $\Lambda_u$, and the derivative $\eta_u(s)$ of that inner function. Hence, by the substitution $\tau = \Lambda_u(s)$, each integral in Equation 13 can be solved in closed form:
$$\int_{t_i}^{T} \eta_u(s)\, g_{u, u_i}\bigl(\Lambda_u(s) - \Lambda_u(t_i)\bigr)\, ds = G_{u, u_i}\bigl(\Lambda_u(T) - \Lambda_u(t_i)\bigr) - G_{u, u_i}(0), \qquad (14)$$
where $G_{u, u_i}$ denotes the antiderivative of $g_{u, u_i}$ (given for each kernel in Table 4 of Appendix B). Given the exact log-likelihood, we back-propagate the gradients of the loss function $\mathcal{L}$. In the experiments, we employ mini-batch optimization.

We start by setting up the qualitative and quantitative experiments, and then report their results. We used four real-world event datasets from different domains.
• Reddit: We crawled the official Reddit API to gather timestamped hyperlinks between Reddit communities (i.e., subreddits) over the 6 months from March 1 to August 31, 2020. Following the work of [24], we use the list of hyperlinks to each target subreddit as a separate sequence and consider target subreddits as communities (i.e., dimensions).
• News: The News dataset, provided by the GDELT project [21] through its API, consists of roughly 20,000 news articles related to COVID-19 dated from January 20 to March 24, 2020. We selected 40 news websites and used them as communities.
• Protest: The Protest dataset, which was gathered by ACLED, contains over 20,000 demonstration events in 35 countries during the 9 months from March 1 to November 21, 2020.
• Crime: The Crime dataset is publicly available from the City of Chicago Data Portal; it includes about 30,000 reported crimes from 13 community areas of Chicago from March 1 to December 19, 2020. We treat community areas as communities.
All the datasets are publicly available. The statistics of these datasets are given in Table 1. The procedure of data preprocessing is provided in Appendix D.1.

We compare DHP against five widely used point process methods that incorporate the influence of past events:
• HPP (Homogeneous Poisson Process): It is the simplest point process, where the intensity is assumed to be constant over time.
• RPP (Reinforced Poisson Processes) [29, 32]: RPP accounts for the aging effect and the cumulative count of past events.
• SelfCorrecting (Self-correcting Point Process) [15]: Its intensity is assumed to increase linearly over time, and this tendency is corrected by the historical events.
• Hawkes (Hawkes Process): Its intensity is parameterized by Equation 3, which explicitly models the influence of past events by using the static triggering kernel.
• RMTPP (Recurrent Marked Temporal Point Processes) [9]: It embeds the event history into a vector with a recurrent neural network and models the intensity based on this embedding.

For the experiments, we divided each dataset into train, validation, and test sets in chronological order with the ratios of 70%, 10%, and 20%. The model parameters were trained using the ADAM optimizer [17]. We tuned all the models using early stopping based on the log-likelihood performance on the validation set, with a maximum of 100 epochs for the Reddit and News datasets and 30 epochs for the Protest and Crime datasets. The batch size is set to 128. The hyperparameters of each model are optimized via grid search. For the neural network-based models (i.e., RMTPP and DHP), we choose the number of layers from {1, 2, 3, 4, 5}. For the Hawkes process methods (i.e., Hawkes and DHP), the kernel function is selected from three commonly used kernels: the exponential memory, power-law, and Rayleigh kernels. These are mathematically defined in Appendix B. For DHP, we search the number of mixtures over {1, 2, 3, 4, 5}. The chosen hyperparameters are presented in Appendix D.2.

Our experiments use the following two metrics in evaluating all models. For both metrics, lower values indicate better performance.
• NLL (Negative Log-Likelihood) is used to assess the likelihood of the occurrence of the events over the test period.
• MAPE (Mean Absolute Percentage Error) evaluates the accuracy of the predicted number of events. We first predict the number of events in each small time interval of the test period following the procedure of Appendix C. Then, we measure the average normalized difference between the predicted and observed number of events across all time intervals as follows:
$$\mathrm{MAPE} = \frac{1}{UK} \sum_{u=1}^{U} \sum_{k=1}^{K} \frac{\bigl|\hat{N}_u\bigl((\tau_k, \tau_{k+1}]\bigr) - N_u\bigl((\tau_k, \tau_{k+1}]\bigr)\bigr|}{N_u\bigl((\tau_k, \tau_{k+1}]\bigr)},$$
where $\hat{N}_u\bigl((\tau_k, \tau_{k+1}]\bigr)$ is the predicted number of events in the small time interval $(\tau_k, \tau_{k+1}]$ and $N_u(\cdot)$ is the ground truth for the $k$-th time interval and the $u$-th dimension.

In this section, we first compare DHP with existing methods on event prediction. Table 2 presents the negative log-likelihood (NLL) of the test data and the Mean Absolute Percentage Error (MAPE) for different methods on the real-world event datasets. In this table, we omit the NLL result of RMTPP since its log-likelihood function differs from those used in the other methods (it is defined for the whole event sequence from all communities, not for the separate sequences of the individual communities), which precludes a fair comparison. As shown in the table, our proposal, DHP, outperforms the four existing methods across all the datasets in terms of NLL.
HPP has the worst NLL in most cases since it does not capture the temporal variation of the event occurrences. RPP and SelfCorrecting cannot achieve good results as they encode strong assumptions on the functional forms of the intensity, which limits the expressivity of the model. Hawkes, which explicitly models the dependencies between past and current events, surpasses HPP, RPP, and SelfCorrecting. However, it still falls short of modeling the dynamic changes of the community states in the diffusion process. Our DHP achieves even better NLL than Hawkes. This verifies that incorporating latent community dynamics is essential for event prediction and that DHP can learn effective representations of the time-evolving dynamics of community states. DHP achieves the best MAPE for all datasets. RMTPP performs the second best in terms of MAPE for the Reddit and News datasets, which is probably because RMTPP exploits the power of RNNs for learning non-linear dependencies between events. But RMTPP performs poorly for the Protest and Crime datasets since it cannot capture changes in the event occurrences due to the temporal evolution of the communities' states, e.g., a large reduction in protest events due to COVID-19. DHP outperforms all other methods across the datasets on the two metrics. The above result reveals the effectiveness of encoding the community state dynamics governing the diffusion process for event prediction. It also suggests that the assumption of DHP, i.e., that the magnitude and speed of diffusion are related, holds for real diffusion processes.

In this section, we analyze the impacts of hyperparameters and experimental settings. We report the prediction performance of DHP under different settings for the four datasets.
Number of mixtures. We examine how the number of mixture components, $M$, determines the prediction performance of DHP. Figure 3a shows the negative log-likelihood (NLL) on the test data with respect to different numbers of mixtures {1, 2, 3, 4, 5}. In this experiment, we fixed the number of layers as 3 and used the power-law kernel. The NLL performance tends to be stable for all the datasets. It slightly decreases as the number of mixture components becomes larger for the Protest and Crime datasets. The results indicate that increasing the number of mixtures can improve the expressiveness of the model.
Kernel functions. We investigate the effect of three kernel functions: the exponential kernel, the power-law kernel, and the Rayleigh kernel, where the number of mixtures and the number of layers are set to 3. As shown in Figure 3b, the power-law kernel yields the best performance on all datasets.
Number of layers. Figure 3c evaluates the sensitivity of our neural network $\Phi_m(t)$ to the number of layers $L \in \{1, 2, 3, 4, 5\}$, fixing the number of mixtures as 3 and using the power-law kernel. We observe that DHP yields better NLL results for the Protest dataset with larger numbers of layers. For the other three datasets, the number of layers has little effect on the performance. In general, DHP shows stable and robust prediction performance across different settings.

In order to further verify the capability of DHP, we analyze the temporal dynamics of the community states behind the diffusion process learned by DHP from each dataset. Figure 4 visualizes the interactions between 8 selected Reddit communities (i.e., subreddits) learned from the Reddit dataset. In each row, we compare the estimated interactions between subreddits by DHP (middle) and Hawkes (bottom), and the ground truth (top).
For the ground truth, node size corresponds to the aggregated number of hyperlinks for each "target" community in the default 5-day interval; the weight of each edge represents the number of hyperlinks between the source community $u'$ and the target community $u$. For DHP, node size is proportional to $\sum_{u'} \alpha_{u, u'}\, \eta_u(t)$ and edge width to $\alpha_{u, u'}\, \eta_u(t)$. For Hawkes, node size is proportional to $\sum_{u'} \alpha_{u, u'}$ and edge width to $\alpha_{u, u'}$. Note that Hawkes produces the same results across times since it assumes the triggering kernel is static over time. We can see that the interactions learned by DHP are more consistent with the true evolution of the interactions between online user communities compared to Hawkes. The top panel of Figure 1 shows the intensity $\lambda_u(t)$ and the estimated latent dynamics function $\eta_u(t)$ learned for the Reddit dataset, along with the observed event sequences for the two subreddits. The latent dynamics function increases up to the end of May, rapidly for news and slowly for space. This is probably due to the COVID-19 lockdown. These results demonstrate that our DHP learns a reasonable representation of the latent temporal dynamics of the online communities.

Figure 5 shows the inferred interactions among news websites from 15 countries learned for the News dataset. In these figures, the node size denotes the value of the latent dynamics function $\eta_u(t)$ for each news website; the edge width denotes the strength of the interaction between them, $\alpha_{u, u'}\, \eta_u(t)$. East Asian and South-East Asian countries (denoted by blue and yellow) rise to their peaks around late February (see Figure 5a) and then decrease until mid-March (Figure 5b), while the other countries peak around or after March 15 (Figure 5b), not in February (Figure 5a). We can also see in Figure 6 that the latent dynamics function peaks around mid-February for China (left), followed by the United Kingdom with a peak in mid-March (right). These trends are synchronized with the growth of the pandemic in each country. East Asian and South-East Asian countries experienced their first peak in COVID-19 cases ahead of the other countries, which would trigger people's early interest in COVID-19-related topics and accelerate the spread of COVID-19-related news early on. This confirms that our proposal, DHP, well reproduces the complex evolution in news website activities.

Figure 7 shows the intensity and the latent dynamics function learned from the Protest dataset. According to a previous study, in contrast to the online events, the pandemic initially led to a reduction in protest events, and the trend was corrected after several weeks. DHP characterizes this trend well. In China (left), the latent dynamics function decreased following the onset of the coronavirus around the beginning of March and returned to a moderate level by mid-June. For Russia, it declined gradually from March until the beginning of July; the first peak of the pandemic occurred around May 11. In conclusion, DHP uncovers the latent community dynamics underlying the diffusion processes, and so provides meaningful insights about the diffusion mechanism.

Modeling and predicting diffusion processes are important tasks in many applications. We presented a novel Hawkes process framework, DHP (Dynamic Hawkes Process), that can learn the temporal dynamics of the community states underlying diffusion processes. The proposed DHP allows for the automatic discovery of the community state dynamics underlying the diffusion processes as well as offering tractable learning.
By conducting extensive experiments on four real event datasets, we demonstrated that DHP provides better performance for modeling and predicting diffusion processes than several existing methods. For future work, we plan to explore the following two directions. First, DHP can be extended to capture the pairwise dynamics of the interactions among communities by introducing the latent dynamics function for pairs of communities. We will extend DHP to this case and conduct experiments to evaluate the performance of the extended DHP in capturing the time-evolving dynamics of the pairwise interactions between communities. Second, DHP is built on the assumption that the magnitude and speed of the diffusion are related to each other, which may limit the flexibility of the model. We will explore how to modify DHP to relax this assumption.

For the readers' convenience, we list the important notations used throughout the paper in Table 3.

In our experiments, we used three types of triggering kernel: exponential, power-law, and Rayleigh. Table 4 presents their equations and integrals, where $\alpha$ and $\beta$ are parameters of the triggering kernel, $p$ is the scaling exponent of the power-law kernel ($p > 1$), and we fix $p = 2$ in the experiments.

Tensorflow: Large-scale machine learning on heterogeneous distributed systems
Hawkes processes in finance
A point process model for the reliability of a maintained system subject to general repair
The time-rescaling theorem and its application to neural spike train data analysis
Estimating value-at-risk: a point process approach
Neural likelihoods via cumulative distribution functions
Deep coevolutionary network: Embedding user and item features for recommendation
Recurrent marked temporal point processes: Embedding event history to vector
Time-sensitive recommendation from recurrent user activities
Multivariate Hawkes processes: an application to financial data
Coevolve: A joint point process model for information diffusion and network evolution
Spectra of some self-exciting and mutually exciting point processes
Hawkes processes and their applications to finance: a review
A self-correcting point process. Stochastic processes and their applications
Modeling stochastic processes in disease spread across a heterogeneous social system
Adam: A method for stochastic optimization
Tideh: Time-dependent hawkes process for predicting retweet dynamics
Community interaction and conflict on the web
Efficient inference of Gaussian-process-modulated renewal processes with application to medical event data
Gdelt: Global data on events, location, and tone
Self-exciting point process modeling of crime
Modeling response time in digital human communication
Statistical models for earthquake occurrences and residual analysis for point processes
Seismicity analysis through point-process modeling: A review
Inference for earthquake models: a self-correcting model. Stochastic processes and their applications
Fully neural network based model for general temporal point processes
A survey of random processes with reinforcement
Hawkes processes for events in social media
SIR-Hawkes: linking epidemic models and Hawkes processes to model diffusions in finite populations
Modeling and predicting popularity dynamics via reinforced poisson processes
Monotonic networks
Imperfect repair modeling using Kijima type generalized renewal process
Global transport networks and infectious disease spread
Generalized renewal process for repairable systems based on finite Weibull mixture. Reliability Engineering & System Safety
Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures
Coevolutionary latent feature processes for continuous-time user-item interactions
Forest-based point process for event prediction from electronic health records
Path to purchase: A mutually exciting point process model for online advertising and conversion
Seismic: A self-exciting point process model for predicting tweet popularity

• Reddit: The data collection procedure followed the one used in [19]. During crawling we selected the 25 most popular subreddits and retrieved hyperlinks among those subreddits: we identified and recorded posts in one source subreddit that contain links to different target subreddits. This process finally yielded a total of roughly 23,000 posts, each of which had a submission time, a source subreddit, and a target subreddit. We treated the list of hyperlinks to each target subreddit as a separate sequence and considered target subreddits as communities (i.e., dimensions). The source subreddit was not used for training but for qualitative evaluation (Figure 4).
• News: The original dataset contains over a million news articles related to COVID-19. Each piece of news had a timestamp and a URL. We extracted the domain of the news website from each URL and obtained more than 1,000 unique domains. We selected 40 country-specific domains and used them as communities. The granularity of time is one second.
• Protest: We sampled 35 popular countries and retrieved events from those countries. Each event was associated with two attributes: timestamp and country. The dataset was recorded at the minute level.
• Crime: Each event records the time and community area where a crime happened. The time granularity is one minute.

All code was implemented using Python 3.9 and Keras [7] with a TensorFlow backend [1]. We conducted all experiments on a machine with four 2.8 GHz Intel cores and 16 GB of memory. The model parameters were trained using the ADAM optimizer [17] with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and a learning rate of 0.002. For the neural network-based models (i.e., RMTPP and DHP), the number of hidden units in each layer is fixed as 8. In our experiments, the number of mixtures is set to 3 for the Reddit, News, and Protest datasets, and to 5 for the Crime dataset. In all experiments, we used the power-law kernel. The number of layers is set to 2 for the Reddit and Protest datasets, 1 for the News dataset, and 3 for the Crime dataset. A minimal code sketch of the monotonic-network construction used for the latent dynamics function is given at the end of this appendix.

• HPP (Homogeneous Poisson Process): The simplest point process, whose intensity is constant over time: $\lambda_u(t) = \mu_u$.
• RPP (Reinforced Poisson Processes) [29, 32]: For each dimension $u$, the intensity of RPP is characterized by the product of a relaxation function and the cumulative event count, $\lambda_u(t) \propto f_u(t)\, N_u(t)$, where $f_u(t)$ is the relaxation function that characterizes the aging effect, and $N_u(t)$ is the number of events of dimension $u$ that have occurred up to time $t$. Following the prior work [32], we define $f_u(t)$ by the following log-normal relaxation function:
$$f_u(t) = \frac{1}{\sqrt{2\pi}\, \sigma_u t} \exp\left(-\frac{(\ln t - m_u)^2}{2\sigma_u^2}\right),$$
where $m_u$ and $\sigma_u$ are parameters, which are local to the dimension.
• SelfCorrecting (Self-correcting Point Process) [15]: The intensity function of SelfCorrecting is assumed to increase steadily over time with rate $\nu_u > 0$; this trend is corrected by a constant $\rho_u > 0$ every time an event arrives. Its intensity function associated with dimension $u$ is given by
$$\lambda_u(t) = \mu_u \exp\bigl(\nu_u t - \rho_u N_u(t)\bigr),$$
where $\mu_u$, $\nu_u$, and $\rho_u$ are parameters, and $N_u(t)$ is the number of events of dimension $u$ in $(0, t]$.
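To complement the implementation details above, the following is a minimal, self-contained sketch of the mixture of monotonic networks used to model $\Lambda_u(t)$ in Equation 10: all weight matrices and the mixture weights are constrained to be non-negative, hidden layers use tanh and the last layer uses softplus, and automatic differentiation recovers the latent dynamics function $\eta_u(t)$. The layer sizes, defaults, and function names are illustrative assumptions, not the authors' released code.

```python
import tensorflow as tf

def build_monotonic_mixture(num_mixtures=3, num_layers=2, hidden_units=8):
    """Mixture of monotonic networks modeling Lambda_u(t) (Eq. 10).
    Non-negative kernels plus monotone activations (tanh, softplus) make each
    component Phi_m(t) non-decreasing in t, and the non-negative mixture layer
    (weights w_m and bias b_0) keeps Lambda_u(t) non-decreasing as well."""
    nonneg = tf.keras.constraints.NonNeg()
    t_in = tf.keras.Input(shape=(1,))
    components = []
    for _ in range(num_mixtures):
        h = t_in
        for _ in range(num_layers):
            h = tf.keras.layers.Dense(hidden_units, activation="tanh",
                                      kernel_constraint=nonneg)(h)
        phi = tf.keras.layers.Dense(1, activation="softplus",
                                    kernel_constraint=nonneg)(h)
        components.append(phi)
    # Lambda_u(t) = sum_m w_m * Phi_m(t) + b_0 with w_m, b_0 >= 0.
    stacked = components[0] if len(components) == 1 else tf.keras.layers.Concatenate()(components)
    integral = tf.keras.layers.Dense(1, kernel_constraint=nonneg,
                                     bias_constraint=nonneg)(stacked)
    return tf.keras.Model(t_in, integral)

model = build_monotonic_mixture()
t = tf.constant([[0.5], [1.0], [2.0]])
with tf.GradientTape() as tape:
    tape.watch(t)
    Lambda_t = model(t)               # integral of the latent dynamics function
eta_t = tape.gradient(Lambda_t, t)    # latent dynamics function eta_u(t) >= 0
```

The derivative obtained from the gradient tape is guaranteed non-negative by construction, which is what allows the intensity of Equation 7 to remain a valid (non-negative) rate while the closed-form integral of Equation 14 keeps the likelihood exact.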