key: cord-0196122-as62itrf authors: Wang, Xuhong; Chen, Sirui; He, Yixuan; Wang, Minjie; Gan, Quan; Yang, Yupu; Yan, Junchi title: CEP3: Community Event Prediction with Neural Point Process on Graph date: 2022-05-21 journal: nan DOI: nan sha: 1211ba752bfc4f35ab93f47cefed37f4accf5360 doc_id: 196122 cord_uid: as62itrf

Many real-world applications can be formulated as event forecasting on Continuous Time Dynamic Graphs (CTDGs), where the occurrence of a timed event between two entities is represented as an edge along with its occurrence timestamp in the graph. However, most previous works approach the problem in compromised settings, either formulating it as a link prediction task on the graph given the event time, or as a time prediction problem given which event will happen next. In this paper, we propose a novel model combining Graph Neural Networks and Marked Temporal Point Processes (MTPPs) that jointly forecasts multiple link events and their timestamps on communities over a CTDG. Moreover, to scale our model to large graphs, we factorize the joint event prediction problem into three easier conditional probability modeling problems. To evaluate the effectiveness of our model and the rationale behind such a decomposition, we establish a set of benchmarks and evaluation metrics for this event forecasting task. Our experiments demonstrate the superior performance of our model in terms of both model accuracy and training efficiency.

Modeling and learning from dynamic interactions of entities has become an important topic and has inspired great interest across a wide range of application fields. In particular, studying the evolution of community social events can enable preemptive intervention for pandemic spreading (e.g., influenza outbreaks [47]). Monitoring and forecasting traffic congestion [20] can help the police prevent congestion from spreading outside the community. A community needs more attention and help if there is a sudden change in the state of its economy [5] or political leanings [32]. In some cases, entities, such as those with dense connections or with similar characteristics, may form certain communities, and communities could also be defined by users based on their own criteria. In practice, people may only be interested in the interactions of a specific community of entities, for example in community behavior modeling [11], dynamic community discovery [26] and community outlier detection [10]. A Continuous Time Dynamic Graph (CTDG) is a common representation paradigm for organizing dynamic interaction events over time, with edges and nodes denoting the events and the pairwise involved entities, respectively. Each edge also contains information about the timestamp of an event (assuming that the event occurs instantaneously). To better understand and forecast the events in a community, we propose the community event forecasting task (Fig. 1) on a CTDG, which predicts a series of future events, i.e., not only which two entities they will involve but also when they will occur.

Problem Formulation. Formally, denote a CTDG as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the node set and $\mathcal{E} = \{e_i : i = 1, \cdots, N\}$ is the set of edges, with each edge $e_i = (u_i, v_i, t_i)$ consisting of source node $u_i$, destination node $v_i$, and timestamp $t_i$. The edges are ordered by timestamps, i.e., $t_i \le t_j$ for $1 \le i \le j \le N$. We further denote a CTDG within a temporal window as $\mathcal{G}_{s:e} = (\mathcal{V}, \mathcal{E}_{s:e})$, where $\mathcal{E}_{s:e} = \{e_i : s \le t_i < e\}$.
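To make this notation concrete, below is a minimal Python sketch (not the authors' code; the `CTDG` class and its fields are hypothetical) of storing a CTDG as timestamp-sorted edge arrays and slicing out a temporal window $\mathcal{G}_{s:e}$.

```python
# A minimal sketch of the CTDG representation defined above: edges stored as
# parallel arrays sorted by timestamp, with a helper that slices out the
# temporal window E_{s:e} = {e_i : s <= t_i < e}.
import numpy as np

class CTDG:
    def __init__(self, src, dst, ts):
        order = np.argsort(ts)             # edges must be ordered by timestamp
        self.src = np.asarray(src)[order]  # source node of each event
        self.dst = np.asarray(dst)[order]  # destination node of each event
        self.ts = np.asarray(ts)[order]    # occurrence time of each event

    def window(self, s, e):
        """Return the events within [s, e), i.e. the edge set E_{s:e}."""
        mask = (self.ts >= s) & (self.ts < e)
        return self.src[mask], self.dst[mask], self.ts[mask]

g = CTDG(src=[0, 1, 2, 0], dst=[1, 2, 0, 2], ts=[0.5, 1.2, 2.0, 3.1])
print(g.window(1.0, 3.0))  # events with 1.0 <= t < 3.0
```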
Given a queried community (or node candidates that people are interested in) $\mathcal{C} \subset \mathcal{V}$, predicting future events within the community given observed events requires modeling the following conditional distribution: $p(e_{n+1}, \cdots, e_{n+k} \mid \mathcal{G}_{1:n}, \mathcal{C})$, where the distribution of each edge $e_{n+i}$ is further a joint probability distribution over the triple of its source node, destination node, and timestamp. Compared with traditional time series prediction, event forecasting on a CTDG jointly considers the spatial information characterized by the graph and the temporal signal characterized by the event stream in order to make a more accurate prediction.

There have been several lines of work approaching the problem, but all in compromised settings. Graph Neural Networks (GNNs) [43] are an emerging family of deep neural networks for embedding structural information via their message passing mechanism. Temporal Graph Neural Networks (TGNNs) extend GNNs to CTDGs by incorporating temporal signals into the message passing procedure. Among them, JODIE [17], TGN [27] and TGAT [46] are designed for temporal link prediction, where the timestamp and one entity of the future event are given, i.e., they model the conditional distribution $p(v_{n+1} \mid \mathcal{G}_{1:n}, u_{n+1}, t_{n+1})$. Know-Evolve [34] and DyRep [35] combine GNNs with methods for multi-variate continuous time series such as Temporal Point Processes (TPPs) for event timestamp prediction, which predicts the time of the next event but requires the entities to be known, i.e., they model the conditional distribution $p(t_{n+1} \mid \mathcal{G}_{1:n}, u_{n+1}, v_{n+1})$. None of these models is directly applicable to the community event forecasting problem on CTDGs. Another line of work extends TPP methods to incorporate extra signals. Notably, the Marked TPP (MTPP) associates each event with a marker and jointly predicts the marker as well as the timestamp of future events. The Recurrent Marked Temporal Point Process (RMTPP) [8] proposes to use recurrent neural networks to learn the conditional intensity and marker distribution. MTPPs and RMTPP are capable of predicting CTDG events by treating the entity pair $(u, v)$ as the event marker. However, these MTPP-based methods face three major drawbacks. To begin with, MTPP methods treat edges as individual markers, which makes them unable to utilize the community and relationship information, resulting in a suboptimal training solution. Besides, individual markers also bring an $O(|\mathcal{V}|^2)$ marker distribution space, making the model unscalable to giant graphs. The last difficulty is that RMTPP is further constrained by its recurrent structure, which must process each event sequentially to preserve the events' contextual correlation.

Our contributions in this paper are: i) We consider the community event forecasting task on a CTDG that jointly predicts the next events' incident nodes and timestamps within a certain community, which is significantly harder than both temporal link prediction and timestamp prediction. To the best of our knowledge, we are the first to carefully consider and evaluate this task. ii) We handle the community event prediction task with a graph point process model (CEP3), which incorporates both spatial and temporal signals using GNNs and TPPs and can predict event entities and timestamps simultaneously. To scale to large graphs, we factorize the mark distribution of the MTPP and reduce the computational complexity from $O(|\mathcal{V}|^2)$ of previous TPP attempts [8, 35] to $O(|\mathcal{V}|)$.
Moreover, we employ a time-aware attention model to replace the TPP model's recurrent structure, significantly shortening the sequence length of each training step and enabling training in mini-batches. iii) We propose new benchmarks for the community event forecasting task on a CTDG. Specifically, we design new evaluation metrics measuring the prediction quality of both entities and timestamps. For baselines, we collect and carefully adapt state-of-the-art models from time series prediction, temporal link prediction and timestamp prediction. Our evaluation shows that CEP3 is superior across all four real-world graph datasets.

Temporal Graph Learning aims at learning node embeddings using both structural and temporal signals, which has given rise to a number of works. CTDNE [25] and CAWs [38] incorporate temporal random walks into the skip-gram model for capturing temporal motif information in CTDGs. JODIE [17] and TigeCMN [48] adopt recurrent neural networks (RNNs) and an attention-based memory module, respectively, to update node embeddings dynamically. Temporal Graph Neural Networks like TGAT [46] and TGN [27] enhance the attention-based message passing process of Graph Neural Networks with a Fourier time encoding kernel. These attempts focus on the temporal link prediction task. Besides, other works [18, 41] focus on the information diffusion task, which aims at predicting whether a user will perform an action at time $t$. None of them is designed for event forecasting. [15] and CoNN [11] study a similar event forecasting setting on Discrete Temporal Dynamic Graphs (DTDGs). However, continuous time prediction is much harder and their methods cannot be directly applied.

A temporal point process (TPP) [29] is a stochastic process modeling the distribution of a sequence of events associated with continuous timestamps $t_1, \cdots, t_n$. A TPP is mostly characterized by a conditional intensity function $\lambda(t)$, from which it computes the conditional probability of an event occurring between $t$ and $t + dt$ given the history $\{t_i : t_i < t\}$ as $\lambda(t)\,dt$. According to [1], the log conditional probability density of an event occurring at time $t$ can be formulated as

$$\log p(t) = \log \lambda(t) - \int_{t_n}^{t} \lambda(s)\,ds. \quad (2)$$

Additionally, a marked temporal point process (MTPP) associates each event with a marker, which is often regarded as the type of the event. An MTPP thus not only models when an event would occur, but also what type of event it is. Event forecasting over CTDGs can also be modeled as an MTPP by treating each event's incident node pair as its marker. The conditional intensity of the entire MTPP can be modeled as a sum of the conditional intensities of the individual markers: $\lambda(t) = \sum_{m} \lambda_m(t)$. This allows it to first predict the event time, and then predict the marker conditioned on time via sampling from a categorical distribution: $m \sim \mathrm{Categorical}\big(\lambda_1(t)/\lambda(t), \cdots, \lambda_M(t)/\lambda(t)\big)$. This avoids modeling time and marker jointly. Such an idea has been widely used in follow-up works such as the Recurrent Marked Temporal Point Process (RMTPP) [8], which parametrizes the conditional intensity and the marker distribution with a recurrent neural network (RNN). RMTPP's variants include CyanRNN [40] and ARTPP [45].
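As a concrete illustration of Eq. 2, the following minimal Python sketch evaluates the log conditional density for the special case of a constant intensity, where the integral term is analytic; the `rate` parameter and function name are illustrative, not from the paper.

```python
# A minimal sketch of Eq. 2: log p(t) = log lambda(t) - integral_{t_n}^{t} lambda(s) ds,
# for a homogeneous (constant-intensity) process where the integral is analytic.
import math

def log_density_constant_intensity(rate, t_prev, t):
    """Log-density of the next event at time t, given the last event at t_prev."""
    assert t >= t_prev and rate > 0
    # log lambda(t) = log(rate); the compensator integral equals rate * (t - t_prev).
    return math.log(rate) - rate * (t - t_prev)

# Example: with rate 2.0, the density of the next event arriving after 0.3 time
# units, which matches the Exponential(2.0) density 2 * exp(-0.6).
print(math.exp(log_density_constant_intensity(2.0, t_prev=0.0, t=0.3)))
```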
Several other relevant works and applications related to TPPs are surveyed in [30]. CoEvolving [39], a variant of the MTPP, uses Hawkes processes to model user-item interactions. NeuralHawkes [23] relaxes the positive-influence assumption of the Hawkes process by introducing a self-modulating model. DeepTPP [19] models the event generation problem as a stochastic policy and applies inverse reinforcement learning to efficiently learn the TPP. However, these methods cannot be directly applied to event forecasting on CTDGs for the following reasons. First, a CTDG is essentially a single event sequence, whose length ranges from tens of thousands to hundreds of millions. RNNs are known to have trouble dealing with very long sequences. One may consider dividing the sequence into multiple shorter windows, but this disconnects the events within a given window from all the data before it, and thus fails to explore the dependencies between events that are distant in time but topologically connected (i.e., sharing either of their incident nodes). One may also consider training an RMTPP with truncated BPTT [14] on the long sequence as a whole. However, this is inefficient because parallel training is impossible due to its recurrent nature, which means that one has to unroll the sequence one event at a time. Second, directly modeling the marker generation distribution with a unit softmax function will produce a vector with space complexity of $O(|\mathcal{V}|^2)$, as the event markers are essentially the events' incident node pairs. This is undesirable for large graphs. [9] models link and retweet generation on a social network with a TPP, and also provides a simulation algorithm that generates events from the TPP model. It is very similar to our task, except that it is focused on a specific social network setting, and the authors did not quantitatively evaluate the quality of the simulation model. Previous works, such as [12, 35, 42], use various recurrent architectures to approximate temporal point processes over graphs. However, a recurrent architecture prevents the model from parallelized mini-batch training, which is undesirable especially on large-scale graphs. This is because learning long-term dependencies with a recurrent architecture requires the model to traverse the event sequence one by one instead of selecting mini-batches randomly. MMDNE [21], HTNE [49] and DSPP [4] employ the attention mechanism to avoid the inefficiency of the recurrent structure in training with large CTDGs. However, these works are restricted to the link prediction or timestamp prediction tasks. Adapting them to event forecasting requires non-trivial changes, since neither handles efficient joint forecasting of the event's incident nodes.

Our model is depicted in Fig. 2. To predict the next events $e_{n+1}, \cdots, e_{n+k}$ given the history graph $\mathcal{G}_{1:n}$, we first obtain an initial representation $h_v^{(0)}$ for every node $v$ using a GNN encoder, as well as an initial graph $\tilde{\mathcal{G}}_0$. Then at the $i$-th step, we predict $e_{n+i} = (u_{n+i}, v_{n+i}, t_{n+i})$, i.e., the source node, destination node, and timestamp of the next $i$-th event. The predicted event is then added into $\tilde{\mathcal{G}}_{i-1}$ to form $\tilde{\mathcal{G}}_i$, to keep track of what we have predicted so far. The hidden states $h_v^{(i)}$ are then updated from the new graph and $h_v^{(i-1)}$. This generally follows the framework of RMTPP [8], except that i) we initialize the recurrent network states with a time-aware GNN, which allows our recurrent module to traverse a much shorter sequence without losing historical information, ii) we update the recurrent network states with a GNN to model the topological dependencies between entities caused by new events, and iii) we forecast the nodes and the timestamp of an event by decomposing the joint probability distribution, as sketched below.
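The following schematic Python sketch summarizes this forecasting loop under our reading of the framework; the `encoder`, `forecaster` and `updater` callables are hypothetical placeholders rather than the authors' API, and the predicted graph is simplified to a plain edge list.

```python
# A schematic sketch of the autoregressive forecasting loop: encode history
# once, then alternate between predicting the next event and propagating its
# effect through the predicted graph.
def add_edge(g_pred, u, v, t):
    """Append the predicted event; here a graph is just a list of (u, v, t) edges."""
    return g_pred + [(u, v, t)]

def forecast(history_graph, community, k, encoder, forecaster, updater):
    h = encoder(history_graph)        # h[v]: time-aware initial state per node
    g_pred = list(history_graph)      # simplified stand-in for the graph G~_0
    events = []
    for _ in range(k):
        t = forecaster.sample_time(h, community)        # p(t_{n+i})
        u = forecaster.sample_source(h, community, t)   # p(u_{n+i} | t_{n+i})
        v = forecaster.sample_dest(h, community, u, t)  # p(v_{n+i} | u, t)
        events.append((u, v, t))
        g_pred = add_edge(g_pred, u, v, t)  # G~_i = G~_{i-1} union {e_{n+i}}
        h = updater(g_pred, h)              # message passing + GRU state update
    return events
```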
We give the specific details of each component as follows. The encoder GNN should be capable of encoding relational dependencies, timestamps, and optionally edge features at the same time. Therefore, it has the following form:

$$h_u^{(0)} = \mathrm{agg}_{\mathrm{init}}\Big(\big\{\mathrm{msg}_{\mathrm{init}}\big(h_v, \phi(t - t_e), x_e\big) : (v, e) \in \mathcal{N}_{[0,t]}(u)\big\}\Big), \quad (3)$$

where $\mathcal{N}_{[0,t]}(u)$ represents the neighborhood of $u$ on the graph $\mathcal{G}_{1:n}$ and $\phi(\cdot)$ are the learnable time encodings used in [27, 31, 46]. $\mathrm{agg}_{\mathrm{init}}$ and $\mathrm{msg}_{\mathrm{init}}$ can be any aggregation and message functions of a GNN. We use a GNN to initialize the recurrent network's states because it takes all the historical events within a topological local neighborhood (including the incident nodes, the timestamps, and the features of events) as input, while enabling us to train on multiple history graphs in parallel. In particular, we use a neighborhood temporal-attention-based method for encoding, whose detailed formulation is as follows. The Temporal Graph Attention Module is a self-attention-based node embedding method inspired by [46]. For each source node at time $t_{\mathrm{src}}$, we sample its two-hop neighbors whose timestamps are prior to $t_{\mathrm{src}}$. The multi-head attention is computed as the standard scaled dot-product attention

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$

where the queries come from the target node and the keys and values come from its sampled temporal neighbors. The temporal encoding module is the same as in the original papers [27, 46]:

$$\phi(\Delta t) = \cos(\Delta t\,\mathbf{w} + \mathbf{b}),$$

where $\mathbf{w}$ and $\mathbf{b}$ are learnable parameters. Although a GNN cannot consider events outside the neighborhood, we argue that such an impact is minimal. We empirically verify this belief by comparing against a variant that uses both attention and an RNN-based memory module in the training phase (named CEP3 w RNN), which incorporates historical events and time-aware information from the views of topological locality and recursive impact, respectively. However, the RNN takes drastically more memory and time in training, for the same reason as in Section 2.3.

Fig. 3 demonstrates the details of our event forecaster. We can see that the forecaster predicts future events only according to the node embeddings and historical connections within the selected candidates (or communities). This means that we do not have to be concerned about a large number of communities slowing down the process, because the CEP3 model can handle numerous communities simultaneously during training and inference. Following the superposition property [50] of an MTPP described in Section 2.2, we forecast the next event $e_{n+i} = (u_{n+i}, v_{n+i}, t_{n+i})$ by a triple probability chain:

$$p(e_{n+i}) = p(t_{n+i}) \cdot p(u_{n+i} \mid t_{n+i}) \cdot p(v_{n+i} \mid u_{n+i}, t_{n+i}).$$

This means that we can predict first the timestamp, then the source node, and finally the destination node. We predict $t_{n+i}$ by modeling the distribution of the time difference $\Delta t_{n+i} = t_{n+i} - t_{n+i-1}$ as follows:

$$\lambda_v(t) = \mathrm{Softplus}\big(\mathrm{MLP}(h_v^{(i-1)})\big),$$

where the Softplus ensures that $\lambda_v(t)$ is above zero and the gradient still exists for negative MLP outputs; MLP denotes a multilayer perceptron that generates the intensity value over time. Since $\Delta t_{n+i}$ obeys an exponential distribution, we simply take the mean of the time intensity distribution, $1/\lambda$, as the final $\Delta t_{n+i}$ output during training. The conditional intensity of the next event occurring on one of the nodes in $\mathcal{C}$ is thus the sum of all $\lambda_v(t)$ [50], i.e., $\lambda(t) = \sum_{v \in \mathcal{C}} \lambda_v(t)$. Other conditional intensity choices are also possible. We then generate the source node $u_{n+i}$ of event $e_{n+i}$ from a categorical distribution conditioned on $t_{n+i}$, parametrized by another MLP:

$$p(u_{n+i} \mid t_{n+i}) = \mathrm{Softmax}_{u \in \mathcal{C}}\Big(\mathrm{MLP}\big(h_u^{(i-1)} \,\|\, \phi(\Delta t_{n+i})\big)\Big), \quad (9)$$

where $\|$ means concatenation and $\phi$ has the same form as in Eq. 3. We then generate the destination node $v_{n+i}$ similarly, conditioned on $u_{n+i}$ and $t_{n+i}$:

$$p(v_{n+i} \mid u_{n+i}, t_{n+i}) = \mathrm{Softmax}_{v \in \mathcal{C}}\Big(\mathrm{MLP}\big(h_v^{(i-1)} \,\|\, h_{u_{n+i}}^{(i-1)} \,\|\, \phi(\Delta t_{n+i})\big)\Big). \quad (10)$$

Note that the formulation above only generates two distributions with $|\mathcal{C}|$ elements each, instead of $|\mathcal{C}|^2$ as in RMTPP [8].
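Below is a minimal PyTorch sketch of this hierarchical head, assuming the candidate nodes' hidden states and the time encoding are given; the layer sizes and the greedy argmax decoding are illustrative choices, not taken from the paper.

```python
# A minimal sketch of the hierarchical forecaster: a Softplus-clamped MLP gives
# each candidate node an intensity (time model), and two further MLPs give the
# source and destination categorical distributions of Eqs. 9 and 10.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Forecaster(nn.Module):
    def __init__(self, dim, time_dim):
        super().__init__()
        self.intensity_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.src_mlp = nn.Linear(dim + time_dim, 1)      # scores p(u | t)
        self.dst_mlp = nn.Linear(2 * dim + time_dim, 1)  # scores p(v | u, t)

    def forward(self, h, time_enc):
        # h: (|C|, dim) hidden states of candidate nodes; time_enc: (time_dim,)
        lam = F.softplus(self.intensity_mlp(h)).squeeze(-1)  # per-node intensity
        delta_t = 1.0 / lam.sum()              # mean of Exponential(sum of lam)
        te = time_enc.expand(h.size(0), -1)
        p_src = F.softmax(self.src_mlp(torch.cat([h, te], -1)).squeeze(-1), -1)
        u = p_src.argmax()                     # greedy source pick
        hu = h[u].expand(h.size(0), -1)
        p_dst = F.softmax(self.dst_mlp(torch.cat([h, hu, te], -1)).squeeze(-1), -1)
        return delta_t, u, p_dst.argmax()

f = Forecaster(dim=8, time_dim=4)
dt, u, v = f(torch.randn(5, 8), torch.randn(4))  # 5 candidate nodes
```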
The implication is that during inference the strategy is greedy: we first pick the source node with the largest probability, then we pick the destination node conditioned on the picked source node. To verify the impact of this design choice, we also explore a variant of our method where we generate a joint distribution over the pair $(u_{n+i}, v_{n+i})$ with $O(|\mathcal{C}|^2)$ elements, which we name CEP3 w/o HRCHY (short for Hierarchy). If a dataset has many nodes, evaluating $p(u_{n+i})$ and $p(v_{n+i})$ still incurs a linear projection with $O(|\mathcal{V}|)$ complexity; this is another obstacle to scaling up to larger datasets.

Figure 3: The hierarchical probability-chain forecaster and its workflow relationship with the auto-regressive message passing module. The node embeddings are learned from the GNN encoder described in Section 3.1. Note that the 'selected community' refers to an application-dependent collection of candidate nodes for query.

As shown in Fig. 3, we assume that an event's occurrence directly influences the hidden states of its incident nodes. Moreover, the influence propagates to other nodes along the links created by historical interactions. Therefore, after generating the new event $e_{n+i}$, we update the nodes' hidden states by message passing on the graph with the new events. We achieve this by maintaining another graph $\tilde{\mathcal{G}}$ that keeps track of the historical interactions $\mathcal{G}_{1:n}$ and the newly predicted events up to $e_{n+i}$. Specifically, we initialize $\tilde{\mathcal{G}}_0$ with the candidate node set as its nodes. Two candidate nodes are connected in $\tilde{\mathcal{G}}_0$ if their distance is within $k$ hops. The resulting graph encompasses the dependency between candidate nodes during the encoding stage. Every time a new event $e_{n+i}$ is predicted, we add the event back in: $\tilde{\mathcal{G}}_i = \tilde{\mathcal{G}}_{i-1} \cup \{e_{n+i}\}$. Afterwards, we update the nodes' hidden states using a message passing network such as a GCN [16] for spatial propagation and a GRU [6] for temporal propagation:

$$h_v^{(i)} = \mathrm{GRU}\Big(\mathrm{agg}_{\mathrm{upd}}\big(\big\{\mathrm{msg}_{\mathrm{upd}}(h_w^{(i-1)}, x_e) : (w, e) \in \mathcal{N}_{\tilde{\mathcal{G}}_i}(v)\big\}\big),\; h_v^{(i-1)}\Big),$$

where $\mathcal{N}_{\tilde{\mathcal{G}}_i}(v)$ is the set of neighboring events of node $v$ in $\tilde{\mathcal{G}}_i$, $\mathrm{agg}_{\mathrm{upd}}$ can be any message aggregation function and $\mathrm{msg}_{\mathrm{upd}}$ can be any message function. To verify the necessity of updating the community via message passing after an event, we also explore a variant where we do not update all the nodes' hidden states in the community, but only those of the incident nodes $u_{n+i}$ and $v_{n+i}$. We name this variant CEP3 w/o AR.

The forecasting module outputs the next event's timestamp $t_{n+i}$ and incident nodes $u_{n+i}$ and $v_{n+i}$ for all events $e_{n+i}$, and is trained by minimizing the negative log likelihood. Specifically, the loss function is

$$\mathcal{L} = -\sum_{i=1}^{k}\Big[\log \lambda(t_{n+i}) - \int_{t_{n+i-1}}^{t_{n+i}} \lambda(s)\,ds + \log p(u_{n+i} \mid t_{n+i}) + \log p(v_{n+i} \mid u_{n+i}, t_{n+i})\Big],$$

where the first two terms within the summation are the log survival probability from Eq. 2 and the last two terms are the log probabilities for source and destination node prediction. The time integration term $\int \lambda(s)\,ds$ in Eq. 2 is approximated with a first-order integration method as $\lambda(t)\,\Delta t$ for ease of computation.
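A minimal PyTorch sketch of this per-event loss follows, using the first-order approximation $\int \lambda(s)\,ds \approx \lambda(t)\,\Delta t$; the tensors are illustrative stand-ins for the forecaster's outputs, not the authors' code.

```python
# A minimal sketch of the negative log-likelihood of one ground-truth event,
# with the integral in Eq. 2 replaced by the first-order term lambda(t) * dt.
import torch

def event_nll(lam_total, delta_t, p_src, p_dst, u_true, v_true):
    """NLL of the event (u_true, v_true) occurring after a gap of delta_t."""
    log_time = torch.log(lam_total) - lam_total * delta_t  # Eq. 2, first-order
    log_src = torch.log(p_src[u_true])                     # source term
    log_dst = torch.log(p_dst[v_true])                     # destination term
    return -(log_time + log_src + log_dst)

lam = torch.tensor(1.5)  # total intensity over the community
loss = event_nll(lam, torch.tensor(0.4),
                 p_src=torch.tensor([0.7, 0.3]),
                 p_dst=torch.tensor([0.2, 0.8]),
                 u_true=0, v_true=1)
print(loss)  # summed over all k predicted events in practice
```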
In this section, we test the performance and efficiency of the proposed method against multiple baselines on four public real-world temporal graph datasets: Wikipedia, MOOC [17], GitHub [35], and SocialEvo [22]. Table 1 shows the summary statistics of the datasets used in our experiments; detailed descriptions follow. The Wikipedia [17] dataset is widely used in temporal-graph-based recommendation systems. It is a bipartite graph consisting of user nodes, page nodes and edit events as interactions. We convert the text of each edit into an edge feature vector representing its LIWC categories [2]. The MOOC [17] dataset, collected from the Chinese MOOC learning platform XuetangX, consists of students' actions on MOOC courses, e.g., viewing a video, submitting an answer, etc. The Github [35] dataset is a social network built from GitHub user activities, where all nodes are real GitHub users and interactions represent user actions on another user's repository, such as Watch, Fork, etc. Note that we do not use the interaction types, following the same setting as [35]. The SocialEvo [22, 35] dataset is a small social network collected by the MIT Human Dynamics Lab. Since the public SocialEvo and Github datasets have no edge features, we generate a 10-dimensional edge feature from the following attributes: the current degrees of the two incident nodes of an edge, and the time differences between the current timestamp and the last updated timestamps of the two incident nodes. Note that the time differences are expressed in numbers of days, hours, minutes and seconds, respectively.

In addition to CEP3, CEP3 w RNN, and CEP3 w/o AR mentioned in Section 3, we compare against the following baselines: a Seq2seq model with a GRU [6], a Poisson process (TPP-Poisson), a Hawkes process (TPP-Hawkes) [13], RMTPP [8] and its variant with the same two-level hierarchical factorization as in Eqs. 9 and 10 (RMTPP w HRCHY), and an adaptation of DyRep [35] together with an auto-regressive variant (named DyRep and DyRep w AR, respectively). Notably, RMTPP and its variants are the state-of-the-art models for MTPPs in general, and DyRep is the state of the art in temporal link prediction and time prediction on CTDGs. Since our task is new, we made adaptations to the baselines above; the details of each baseline are as follows (a sketch of the TPP-Poisson baseline is given after this list).

Time series methods: As a baseline for sequential prediction, we use a GRU [6] without a TPP to model the time distribution, treating event forecasting on a CTDG as a sequential modeling task. It takes in the event sequence and outputs the next event in a Seq2seq fashion [33]: each source and destination cell has a hidden state, and the model outputs the predicted time mean and variance as well as the probability of each class to interact, with the time formulated as a Gaussian distribution and the source and destination nodes as categorical distributions. The loss term for time prediction is the mean squared error, and the loss term for source and destination prediction is the negative log likelihood. This formulation forces the GRU to predict the timestamps of upcoming events depending only on the hidden state of the RNN, whereas the other baselines adopt a TPP as a stochastic probability process, obtaining a better modeling capability.

Temporal point process methods: Following the benchmark setting of [8], we compare our model against traditional TPP models and deep TPP models. • TPP-Poisson: We assume that the events occurring at each node pair $(u, v)$ follow a Poisson process with a constant intensity value $\lambda_{u,v}$, which is learnt from data via Maximum Likelihood Estimation (MLE). • TPP-Hawkes [13]: We assume that the events occurring at each node pair $(u, v)$ follow a Hawkes process with a base intensity value $\mu_{u,v}$ and an excitation parameter $\alpha_{u,v}$, which are learnt from data via MLE.
• RMTPP [8]: We directly consider each source and destination node pair as a unique marker. We note that this formulation will exhaust memory and time on graphs with more than a few thousand nodes, since RMTPP assigns a learnable embedding to each node pair, resulting in $O(|\mathcal{V}|^2)$ parameters (here $|\mathcal{V}|$ is the total number of nodes in the entire CTDG), which is too expensive to update. • RMTPP w HRCHY: We consider a variant of RMTPP where we replace the source-destination node pair prediction with our hierarchical formulation: we first select the source node, then conditioned on the source node we select the destination node. This variant also serves as an ablation study demonstrating the benefit of considering the graph structure. • DyRep [35]: To the best of our knowledge, DyRep is the most popular work that combines the temporal point process with graph learning techniques to model both temporal and spatial dependencies. Since the original DyRep formulation only handles temporal link prediction and time prediction, but not auto-regressive forecasting, we compute an intensity value $\lambda_{u,v}$ with DyRep for each node pair and assume a Poisson process afterwards. • DyRep w AR: We make a trivial adaptation to the original formulation of DyRep by updating the source and destination nodes involved once a new edge is added to the graph. The update function is identical to DyRep's updating function during embedding. This benchmark is designed to demonstrate that propagating a newly forecast event to its local neighborhood is necessary for better performance.

We also want our proposed model to support mini-batch training. As described at the beginning of Section 3, CEP3 achieves large-scale and parallelized training by utilizing the hierarchical TPP and the GNN-based updating module, respectively. A summary of the mentioned baselines is shown in Table 2; the proposed model CEP3 satisfies all the desirable properties, with an $O(|\mathcal{V}|)$ marker space compared with the $O(|\mathcal{V}|^2)$ space of the other TPP-based baselines, while also supporting parallel training.
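To illustrate the TPP-Poisson baseline mentioned in the list above, the following sketch computes the closed-form MLE of a constant intensity, which for a Poisson process is simply the event count divided by the observed time span; the toy data is illustrative.

```python
# A minimal sketch of the TPP-Poisson baseline: per-pair constant intensities
# estimated by maximum likelihood (count / time span).
from collections import Counter

def poisson_mle(events, t_start, t_end):
    """events: list of (u, v, t); returns lambda_{u,v} for each observed pair."""
    counts = Counter((u, v) for u, v, _ in events)
    span = t_end - t_start
    return {pair: c / span for pair, c in counts.items()}

events = [(0, 1, 0.5), (0, 1, 2.0), (1, 2, 3.5)]
print(poisson_mle(events, t_start=0.0, t_end=4.0))  # {(0, 1): 0.5, (1, 2): 0.25}
```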
We provide the details of our network architecture, the hyperparameters and the selected community detection method for better reproducibility. Table 3 summarizes the other key parameters of our model. From a hardware perspective, for all the experiments we train our model and the benchmark models on an Intel(R) Xeon(R) Platinum server; our implementation builds on [37] and will be published after acceptance.

One of the most important contributions of our study is the concept of community event prediction, which is a novel task in the field of temporal graphs. To the best of our knowledge, we are the first to consider forecasting events sampled from a temporal point process model, given a historical graph and a candidate node set in which the user is interested. For evaluation on a specific dataset, we use the communities segmented by the conventional community detection algorithm Louvain [3] as the candidate node sets, and we report the average result over all communities. For each community $\mathcal{C}$, we measure the perplexity (PP) of the ground-truth source and destination node sequence to evaluate the node prediction performance, and the mean absolute error (MAE) of the predicted timestamps. Using such an MAE to evaluate the quality of an auto-regressively forecast sequence over multiple timesteps can also be seen in traffic flow prediction [20]. Perplexity (PP) [24] is a concept in information theory that assesses how closely a probability model's projected outcome matches the real sample distribution; the lower the perplexity, the higher the model's prediction confidence. In the field of natural language processing, perplexity is also commonly used to evaluate a language model's quality, i.e., to evaluate how closely the sentences generated by the language model match real human language samples. A language model predicts the next word from the word dictionary, whereas our event prediction model selects nodes from the node candidates (the community). Therefore, it is reasonable to employ perplexity as a metric for our task. Specifically, suppose we have a community's ground-truth event sequence $(u_i, v_i, t_i)$ and the prediction sequence $(\hat{u}_i, \hat{v}_i, \hat{t}_i)$, where $i = 1, \cdots, L$. We compute the per-step PP as

$$\mathrm{PP} = \exp\Big(-\frac{1}{L}\sum_{i=1}^{L}\big[\log p(u_i) + \log p(v_i)\big]\Big).$$

In order to measure the distance between two sequences of different lengths, we compute the MAE following [44] as

$$\mathrm{MAE} = \frac{1}{L\,(t_L - t_0)}\sum_{i=1}^{L}\big|t_i - \hat{t}_i\big|.$$

To keep it comparable across diverse datasets, the MAE is divided by the maximum time span $t_L - t_0$ and the sequence length $L$. We report the average over all communities as the PP of a certain dataset, and the MAE is calculated in the same way. Smaller values of both metrics indicate better model performance.
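The following sketch shows one way to compute the two metrics under our reading of the definitions above; the exact aggregation used in the paper may differ.

```python
# A minimal sketch of the two metrics: per-step perplexity exponentiates the
# average negative log-probability assigned to the ground-truth nodes, and MAE
# is normalized by the sequence length and the maximum time span.
import numpy as np

def perplexity(p_src_true, p_dst_true):
    """p_*_true[i]: model probability of the i-th ground-truth source/dest node."""
    nll = -(np.log(p_src_true) + np.log(p_dst_true))
    return float(np.exp(nll.mean()))

def normalized_mae(t_true, t_pred):
    span = t_true[-1] - t_true[0]  # maximum time span t_L - t_0
    return float(np.abs(t_true - t_pred).mean() / span)

print(perplexity(np.array([0.5, 0.25]), np.array([0.5, 0.5])))
print(normalized_mae(np.array([0.0, 1.0, 4.0]), np.array([0.2, 1.5, 3.5])))
```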
From Table 4 we can see the clear advantage of CEP3 over the other baselines on the different datasets under both MAE and perplexity. The MAE difference between GRU and RMTPP shows the effectiveness of the temporal point process in predicting timestamps. Comparing CEP3 with the sequence-based TPP model RMTPP, we can see that using a GNN to capture historical interaction information improves the forecasting performance. Furthermore, comparing DyRep w AR versus DyRep and CEP3 w/o AR versus pure CEP3, we can conclude that auto-regressive updates better capture the impact of newly predicted events. Our model performs better than DyRep w AR because in CEP3, during the auto-regressive update, a newly predicted event not only influences the nodes involved in the event but also propagates to other nodes via message passing.

From top to bottom, Fig. 4 visualizes the circle layout of a certain community within the graphs of the Github, Wikipedia and MOOC datasets, respectively. We plot the ground truth, our model's prediction and DyRep's prediction for comparison. The visualized graphs are generated as follows: we first apply the learned forecasting model to predict the edges using Monte Carlo sampling. This generation process is repeated three times. We then discard generated edges that are not within the 33% highest predicted edge probabilities and obtain the final generated graph. Generating multiple times and then discarding the less probable edges reduces the uncertainty of each one-time generated graph. In the first row, both CEP3 and DyRep capture the triangle connection in this small community. However, the triangle is lighter in the ground truth, which means that DyRep over-reinforces this connection in its predictions. In the second row, the prediction result of CEP3 is more similar to the ground truth, whereas DyRep generates a high-frequency purple edge which does not exist in the original graph. In the third row, CEP3 successfully learns the two black edges in the ground truth, but DyRep predicts more than two links in darker colors. We can see that our method successfully recognises the high-degree nodes and captures many interaction patterns as well as the evolution dynamics of the interaction graph. The fundamental goal of community event prediction, rather than focusing on a single local node, is to anticipate whether a certain high-frequency pattern will emerge in the community from a global perspective. Exploring money laundering patterns [28] in finance and disease transmission patterns [7] in communities are two of the most important examples.

In Section 3 we mentioned three variants: CEP3 w RNN, which trades parallelized training for long-term dependency modeling with an RNN-based memory module; CEP3 w/o HRCHY, which compares hierarchical (HRCHY) prediction versus joint prediction of the incident nodes; and CEP3 w/o AR, which compares using the auto-regressive module versus not using it.

CEP3 w/o HRCHY. The formulation without the hierarchical structure yields slower training and inference, as shown in Fig. 5. We demonstrate that CEP3 is more efficient on large communities compared to CEP3 w/o HRCHY: computing the intensity function of each possible node pair takes more time by orders of magnitude. We relax this issue by decomposing the node pair prediction problem into two node prediction sub-problems, which can be solved much faster, especially on large-scale communities.

CEP3 w/o AR. From Fig. 6, we can see that AR forecasting is not equally helpful across all numbers of prediction steps. When the number of prediction steps is small (such as 10), CEP3 can simply use the node embeddings at time $t$ to predict the events. When the number of steps becomes larger, the systematic errors from AR gradually accumulate, degrading the MAE. At the same time, as the number of steps increases, the initial node embeddings have little effect on distant future events. This indicates that without AR, it may be difficult for the CEP3 model to accurately predict long-term occurrences.

CEP3 w RNN. The results show that the memory module brings a performance improvement over pure CEP3, except on the Github dataset. The reason is that the Github dataset is a small dataset with a small number of nodes and edges, while having a significantly longer timespan than the other datasets. This means that the interactions are low-frequency, causing insufficient memory updating.

Inspired by Transformer [36] models in NLP, we believe it is essential to use a pure attention-based model when training temporal graph encoders. Using a pure attention-based GNN as the encoder enables us to train multiple time windows in mini-batches as described in Section 3.1, whereas models such as DyRep and RMTPP cannot utilize parallel training due to their RNN structures. This property allows our model to benefit from mini-batch training, such as gradient stabilization and faster convergence. Fig. 7 shows our experimental results on the Wikipedia dataset with different numbers of parallel processes, which suggest that parallel training can increase the speed significantly without losing accuracy.

In this paper, we present a novel community event forecasting task on a continuous time dynamic graph, and set up benchmarks using adaptations of previous works. We further propose a new model to tackle this problem utilizing graph structures. We also address the scalability problem of formulating a temporal point process on a graph and reduce the complexity with a hierarchical formulation. For future work, we are aware that the current evaluation metrics of this task focus only on either time or event type. Although we perform a community visualization to show the event type prediction in a time window, a comprehensive joint metric would be preferable to better evaluate the event forecasting task.
References

Survival and event history analysis: a process point of view
Developing a Tagalog Linguistic Inquiry and Word Count (LIWC) 'Disaster' Dictionary for Understanding Mixed Language Social Media: A Work-in-Progress Paper
Fast unfolding of communities in large networks
Deep Structural Point Process for Learning Temporal Interaction Networks
Multihorizon inflation forecasts using disaggregated data
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Predicting CoVID-19 community mortality risk using machine learning and development of an online prognostic tool
Recurrent marked temporal point processes: Embedding event history to vector
Coevolve: A joint point process model for information diffusion and network evolution
On community outliers and their efficient detection in information networks
The Co-Evolution Model for Social Network Evolving and Opinion Migration
Tracking Dynamic Point Processes on Networks
Spectra of some self-exciting and mutually exciting point processes
Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach
Recurrent Event Network for Reasoning over Temporal Knowledge Graphs
Semi-Supervised Classification with Graph Convolutional Networks
Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks
Modeling Information Diffusion over Social Networks for Temporal Dynamic Prediction
Learning Temporal Point Processes via Reinforcement Learning
Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting
Temporal Network Embedding with Micro- and Macro-dynamics
Sensing the "health state" of a community
The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process
Language Model Evaluation Beyond Perplexity
Continuous-time dynamic network embeddings
Community Discovery in Dynamic Networks: A Survey
Temporal Graph Networks for Deep Learning on Dynamic Graphs
Detection of Money Laundering Groups: Supervised Learning on Small Networks
An Introduction to the Theory of Point Processes
Neural Temporal Point Processes: A Review
Neural Temporal Point Processes: A Review
A Quantitative Political Spectrum and Forecasting of Social Evolution
Sequence to Sequence Learning with Neural Networks
Know-Evolve: Deep Temporal Reasoning for Dynamic Knowledge Graphs
DyRep: Learning Representations over Dynamic Graphs
Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs. ICLR Workshop on Representation Learning on Graphs and Manifolds
Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks
Coevolutionary Latent Feature Processes for Continuous-Time User-Item Interactions
Cascade Dynamics Modeling with Attention-based Recurrent Neural Network
Dual Sequential Prediction Models Linking Sequential Recommendation and Information Dissemination
Modeling Event Propagation via Graph Biased Temporal Point Process
A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems
Wasserstein learning of deep generative point process models
Modeling the Intensity Function of Point Process Via Recurrent Neural Networks
Inductive representation learning on temporal graphs
Multi-step prediction for influenza outbreak by an adjusted long short-term memory
Learning Temporal Interaction Graph Embedding via Coupled Memory Networks
Embedding Temporal Network via Neighborhood Formation
On the Superposition of Point Processes