ConTIG: Continuous Representation Learning on Temporal Interaction Graphs
Xu Yan, Xiaoliang Fan, Peizhen Yang, Zonghan Wu, Shirui Pan, Longbiao Chen, Yu Zang, Cheng Wang
September 27, 2021 · arXiv:2110.06088 [cs.SI]

Abstract — Representation learning on temporal interaction graphs (TIG) aims to model complex networks with the dynamically evolving interactions that arise in a broad spectrum of problems. Existing dynamic embedding methods on TIG update node embeddings discretely, only when an interaction occurs, and therefore fail to capture the continuous dynamic evolution of node embedding trajectories. In this paper, we propose a two-module framework named ConTIG, a continuous representation method that captures the continuous dynamic evolution of node embedding trajectories. With its two essential modules, our model exploits three factors in dynamic networks: the latest interaction, neighbor features, and inherent characteristics. In the first module, the update module, we employ a continuous inference block to learn the nodes' state trajectories from time-adjacent interaction patterns between node pairs using ordinary differential equations. In the second module, the transform module, we introduce a self-attention mechanism to predict future node embeddings by aggregating historical temporal interaction information. Experimental results demonstrate the superiority of ConTIG on temporal link prediction, temporal node recommendation, and dynamic node classification tasks compared with a range of state-of-the-art baselines, especially for long-interval interaction prediction.

1 INTRODUCTION
Graph representation learning has attracted a surge of research attention owing to the widespread existence of graph-structured data in the real world, such as social networks, citation networks, and other user-item interaction systems [1]. Learning graph embeddings is a powerful approach to graph representation learning: it maps the characteristics of nodes to a low-dimensional vector space so that the proximities of nodes in topological space are well preserved [2]. It has shown great success, from shallow graph embedding methods [3], [4], [5] to deep graph neural networks (GNNs) [6], [7], [8], [9]. Recently, representation learning on dynamic graphs has attracted much research attention [10], [11], [12]; existing works mainly model temporal graphs either as a sequence of snapshots [13], [14], [15], [16], [17], [18], or as real events with timestamps [19], [20], [21], [22], [23], [24]. Among the works that learn temporal interaction embeddings from the sequence of interactions, a number of studies discretely update the embeddings of the interacting nodes once an interaction occurs [25], [26]. Other works learn the interactions by aggregating temporal neighbor features to pass messages between nodes [19], [23], which models discrete dynamics of node representations in multiple propagation layers [27]. Although these discrete methods have made substantial advances in learning temporal interaction embeddings, they fail to capture the continuous dynamic evolution of node embedding trajectories, and thus lose temporal information for non-negligible yet inactive nodes with long-interval interactions (e.g., User A in Fig. 1). Therefore, we aim to learn node embedding trajectories and capture the continuous dynamics of node representations.
Learning dynamic node embedding trajectories is extremely challenging due to the complex non-linear dynamics in temporal interaction graphs (TIG). We summarize the challenges in three folds under the scenario of posting on the Reddit website (https://www.reddit.com/). First, the latest interaction information may have a significant impact on the current interaction; e.g., in Fig. 1, the posts made by User A at t_4 would be influential at t_q, since guidelines for holiday gatherings under the COVID-19 pandemic are regularly issued. Second, a node's state is also affected by its neighbors' attention over time. For instance, User A is expected to emulate her neighbors by posting about the "COVID-19 vaccine" at t_q in Fig. 1. Third, the inherent characteristics of a node may largely determine its state regardless of the two factors above. For example, User A is a frontline healthcare worker and may inherently pay close attention to COVID-19 training on infection control. In summary, existing methods only partially consider these three factors. To cover the shortcomings of previous methods, we propose a Continuous representation learning method on Temporal Interaction Graphs (ConTIG). The proposed ConTIG contains two modules: the update module and the transform module. When an interaction message arrives, in the update module, inspired by [28], we define a neural ordinary differential equation (ODE) [29] with the three aforementioned factors affecting node states (i.e., latest interaction, neighbor features, and inherent characteristics), and incorporate them with a continuous-time encoding function to trace dynamic knowledge and learn node state trajectories. The node embeddings at the end time of the neural ODE are then used to update the embeddings of the interacting nodes. In the transform module, a self-attention mechanism is introduced to recall the historical interaction information of the currently interacting nodes and convert it into future node representations. The results on temporal link prediction, temporal node recommendation, and dynamic node classification tasks show the effectiveness of our proposed model compared with a range of state-of-the-art baselines, especially for long-interval interaction prediction. The contributions of this work are summarized as follows:
• We propose a representation learning method that incorporates three factors (i.e., latest interaction, neighbor features, and inherent characteristics) into a neural ODE. This allows us to capture the continuous dynamics of node embedding trajectories by continuously updating node embeddings in the update module.
• We combine the aforementioned update module with a self-attention mechanism to predict future embeddings in the transform module. This makes ConTIG effective for long-interval interaction prediction.
• We evaluate ConTIG on temporal link prediction, temporal node recommendation, and dynamic node classification tasks, and ConTIG compares favorably against state-of-the-art methods.
The rest of this paper is organized as follows. In Section 2, we discuss related work. Sections 3 and 4 describe the notations and our proposed model in detail. In Section 5, we conduct experiments on several datasets and compare against state-of-the-art methods. In Section 6, the conclusion and our future work are presented.
2 RELATED WORK
Early works for graph representation learning are mainly shallow models, including graph factorization approaches [30], [31] and skip-gram models [3], [4], [5], which learn node representations via random walk objectives. With the success of deep learning, GNNs [7], [8], [9], [32] have gradually attracted great research interest. They are effective approaches to learning node representations by updating each node according to messages from neighboring nodes in each layer. GNNs essentially model discrete dynamics of node representations with multiple propagation layers [27]. However, all the above-mentioned approaches are limited to learning node embeddings from static graph structure, and the temporal behavior of interactions over time is generally ignored.

One common way to handle temporal graphs is to decompose them into multiple static graph snapshots at a regular time interval. Some works embed the graph convolution into recurrent neural network (RNN) based models or attention mechanisms [33], learning to exploit the dynamic information in the graph evolution within a period of time [13], [14], [16], [17], [34], [35], [36]. Other works are dynamic extensions of ideas applied in the static case, inspired by methods such as PageRank [37], graph autoencoders [38], and topic models [15], to capture both temporal community dynamics and the evolution of graph structure. However, learning embeddings from a sequence of graph snapshots sampled at a regular time interval may lose information by looking only at some snapshots of the graph over time.

Therefore, recent works learn temporal graph embeddings from the sequence of timed interactions. A number of studies learn from the sequence of interactions by discretely updating the node embeddings once an interaction occurs, using RNNs [24], [25], [39], memory networks [26], temporal point processes [40], [41], transformer networks [42], generative models [43], etc. Other works learn the interactions by aggregating temporal neighbor features to pass messages between nodes [20], [23], [24], [44], or by learning node representations from random walk objectives [21] and temporal motifs [45], [46], [47]. TGAT [23] proposes a temporal graph attention layer to aggregate temporal-topological neighborhood features. TGN [24] combines updating operators and aggregating operators. However, these methods learn discrete dynamics of node representations, which is not conducive to learning node state trajectories. Moreover, they ignore the influence of changing inherent characteristics on node states, which hinders capturing the complex non-linear dynamics of node representations. Different from the aforementioned works, we assemble the latest interaction, neighbor features, and inherent characteristics of the nodes into a neural ODE to learn node state trajectories and update node embeddings. As a result, our method captures the continuous dynamics of node representations.

Neural ordinary differential equations (ODEs) [29] are an approach for modelling continuous dynamics of hidden representations, where the dynamics are characterised by an ODE parameterised by a neural network and integrated with an ODE solver. Recently, some works have combined neural ODEs [29] with GNNs to characterise the continuous dynamics of node representations [28], [48].
However, these methods are not designed for temporal graphs; they are continuous-time GNNs that learn the dynamics of node representations in the static graph scenario in order to build "deeper" networks. For temporal graphs, [49] learns the evolution of temporal knowledge graph dynamics over time using ODEs as a tool. CG-ODE [50] proposes a latent ODE generative model that learns the coupled dynamics of nodes and edges with a GNN-based ODE in a continuous manner. Different from these works (i.e., [49], [50]), we take the perspective of node state modeling in temporal interaction graphs, using a differential equation that integrates the three factors (i.e., latest interaction, neighbor features, and inherent characteristics). This allows us to capture the continuous dynamics of node embedding trajectories.

3 PRELIMINARIES
We summarize notations and state the problem we want to solve in this section. A temporal interaction graph is a graph with temporal and attributed interaction edges between nodes. Formally, a temporal interaction graph is defined as a pair G = (V, E), where V is the vertex set and E is the set of time-labeled interaction sequences between pairs of vertices. An interaction e is a tuple of the form (u, i, t, f_ui), where u, i ∈ V are two vertices that interact at timestamp t, and f_ui is the feature of interaction e. Given the temporal interaction graph G = (V, E), we aim to learn a mapping function φ : V → R^d that maps node features into a low-dimensional latent space and generates node embeddings H ∈ R^{|V|×d}, where d ≪ |V| is the dimension of the node embeddings. In the temporal interaction graph, each interaction is an edge connecting a source node and a target node, so we can use the source node as the label of an interaction to compute the time interval Δt_e between the current interaction and its latest interaction, where the latest interaction is the edge in which the source node last appeared. For example, the latest interaction of the current interaction e_p^j = (u_j, i, t_p, f_ui) is the interaction e_q^{j−1} = (u_{j−1}, r_u, τ_u, f_{u r_u}) in which node u last appeared. As a result, the time interval Δt_{e_p} between e_p^j and e_q^{j−1} is t_p − τ_u, where r_u and τ_u denote the latest interactive node and the latest timestamp of u, respectively; p and q are interaction ids, and j denotes how many times u has appeared. If a node appears for the first time (i.e., j = 0), the time interval of the interaction is 0.
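To make the latest-interaction bookkeeping above concrete, the following is a minimal Python sketch of how r_v, τ_v, and the interval Δt_e can be maintained over an edge stream. The class and method names (e.g., LatestInteractionTracker) are illustrative assumptions, not identifiers from the paper's released code.

```python
from collections import defaultdict

class LatestInteractionTracker:
    """Keeps, for every node v, its latest interactive node r_v and timestamp tau_v,
    so the interval delta_t between a new interaction and the node's previous one
    can be computed (0 if the node appears for the first time, i.e., j = 0)."""

    def __init__(self):
        self.latest_node = defaultdict(int)    # v -> r_v (initialised to 0)
        self.latest_time = defaultdict(float)  # v -> tau_v (initialised to 0)
        self.seen = set()                      # nodes that have appeared at least once

    def delta_t(self, u, t):
        # time interval between the current interaction of u and its latest one
        return t - self.latest_time[u] if u in self.seen else 0.0

    def update(self, u, i, t):
        # after processing interaction (u, i, t), each endpoint becomes
        # the other's latest interactive node, with timestamp t
        self.latest_node[u], self.latest_time[u] = i, t
        self.latest_node[i], self.latest_time[i] = u, t
        self.seen.update((u, i))

# Illustrative usage over an edge stream:
# tracker = LatestInteractionTracker()
# for (u, i, t, f_ui) in edge_stream:
#     dt = tracker.delta_t(u, t)   # feeds F_T(dt) and the continuous inference block
#     ...                          # run the update and transform modules here
#     tracker.update(u, i, t)
```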
4 THE PROPOSED MODEL
In this section, we describe our model in detail. The proposed ConTIG consists of two essential modules (i.e., the update module and the transform module). We briefly introduce them in Section 4.1 and describe each module in detail in the remaining subsections. Fig. 2 provides the framework of our proposed ConTIG. The encoder-decoder framework (Fig. 2, left) consists of two modules: the update module and the transform module. In the update module (Fig. 2, middle), a continuous inference block is used to update the embeddings of the interacting nodes in a continuous setting. In the transform module (Fig. 2, right), a graph attention layer is applied to estimate the future embeddings of nodes by capturing observed historical temporal interaction information. When an interaction message e_p = (u, i, t_p, f_ui) arrives, we first employ a neural encoder to project the latest interaction message e_q = (u, r_u, τ_u, f_{u r_u}) of the input interacting node into a latent space, i.e., E = f(X). Afterwards, treating E as the initial value H^{t_p}(0) of the continuous inference block, we utilize a neural ordinary differential equation (ODE) to learn the nodes' continuously changing state trajectories over a time interval [0, Δt], and the node embedding H̄^{t_{p−1}}_u is updated to H̄^{t_p}_u using the node embedding obtained at the end time of the state trajectory, H^{t_p}(Δt). Then, selecting k observed historical neighbors of the current node u, we introduce a self-attention mechanism on graphs to convert this interaction information into the future node embedding Ĥ^t_u. Finally, the decoder uses the future node embeddings Ĥ^t_u for downstream tasks (i.e., temporal link prediction, temporal node recommendation, and dynamic node classification).

The proposed ConTIG adopts an encoder-decoder architecture (Fig. 2, left). For each interaction, we first project the features into a latent space to generate the initial hidden representation of the nodes. After learning the node embeddings at the current timestamp, we finally apply a decoder designed for the specific downstream task. Before the encoder, we initialize the latest interactive node r_v and timestamp τ_v as 0 for all nodes v ∈ V (line 1 in Algorithm 1). In the encoder, to learn the temporal pattern of each interaction, we use the time encoding function of [51] to obtain a continuous functional mapping F_T : T → R^{d_T} from the time domain to a d_T-dimensional vector space, which projects the timestamps of interactions into a continuous-time space. For any given time t, we generate its time embedding as follows:

F_T(t) = sqrt(1/d_T) [cos(ω_1 t), cos(ω_2 t), ..., cos(ω_{d_T} t)],    (1)

where ω_1, ω_2, ..., ω_{d_T} are trainable parameters.

Fig. 2. The encoder-decoder framework (left) consists of a continuous inference block (middle) and a graph attention layer (right). The former contains a neural ODE defined with three factors (i.e., latest interaction, neighbor features, and inherent characteristics) to update the node embeddings according to the output of the encoder. The latter estimates the future embeddings of nodes by aggregating observed historical temporal interaction information.

To learn an informative node representation H^{t_p} at timestamp t_p, we concatenate the latest embeddings of the interacting node and its latest interactive node, the interaction feature, and the time embedding. Here, as we focus on the interval between the current interaction and its latest interaction, we use the time embedding F_T(Δt). Afterwards, we adopt a fully connected layer as the encoder (line 4 in Algorithm 1), and the hidden representation can be defined as follows:

E^{t_p} = f( H̄_u || H̄_{r_u} || f_{u r_u} || F_T(Δt) ),    (2)

where || is the concatenation operation and f(·) is a linear projection:

f(x) = W x + b,    (3)

where W and b are learnable parameters. For the decoder, two fully connected layers are designed for the temporal link prediction and temporal node recommendation tasks, and three fully connected layers are designed for the dynamic node classification task.
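As an illustration of the time encoding and linear encoder above, here is a minimal PyTorch sketch. It assumes the cosine form of the functional time encoding popularised by [23], [51] and generic dimension names; the module names and exact basis are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TimeEncoder(nn.Module):
    """Functional time encoding F_T: maps a batch of time intervals to d_T-dimensional
    vectors with trainable frequencies (cosine basis, as in Eqn. (1) reconstructed above)."""

    def __init__(self, d_T: int = 172):
        super().__init__()
        self.freq = nn.Linear(1, d_T)  # trainable omega_1..omega_{d_T} (plus phases)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: [batch] -> [batch, d_T]
        return torch.cos(self.freq(t.unsqueeze(-1)))

class InteractionEncoder(nn.Module):
    """Linear encoder f(.): concatenates the two nodes' latest embeddings, the edge
    feature and the time encoding of the interval, then projects (Eqns. (2)-(3))."""

    def __init__(self, d_node: int, d_edge: int, d_T: int, d_out: int):
        super().__init__()
        self.time_enc = TimeEncoder(d_T)
        self.proj = nn.Linear(2 * d_node + d_edge + d_T, d_out)

    def forward(self, h_u, h_ru, f_uru, delta_t):
        # h_u, h_ru: [batch, d_node]; f_uru: [batch, d_edge]; delta_t: [batch]
        x = torch.cat([h_u, h_ru, f_uru, self.time_enc(delta_t)], dim=-1)
        return self.proj(x)  # E^{t_p}, used as the ODE initial value H^{t_p}(0)
```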
In the update module (Fig. 2, middle), to capture the continuous dynamics of node representations, inspired by [28], we define an ODE with the latest interaction information, neighbor features, and inherent characteristics of nodes, and use a neural solver to generate the node state trajectories over a time interval [0, Δt]. In this way, our method can estimate the likely direction of the embedding trajectory of an inactive node (lines 5 and 11-16 in Algorithm 1). We select the interactions between each node u in the temporal interaction graph and its last interactive node r_u, collect them into a set E_p, and generate an adjacency matrix S_p describing their relationships. S_p ∈ R^{|V|×|V|} is defined from E_p as follows:

S_p[u][v] = 1 if (u, v) ∈ E_p, and 0 otherwise.    (4)

As node degrees can be very different, we normalize the adjacency matrix as D^{-1/2} S_p D^{-1/2}, where D is the diagonal degree matrix of S_p. Such a normalized adjacency matrix always has an eigenvalue decomposition, and its eigenvalues lie in the interval [−1, 1]. To obtain positive eigenvalues and make graph learning algorithms stable in practice, we follow [7] and leverage the following regularized matrix for characterizing the graph structure:

A_p = (β/2) (I_N + D^{-1/2} S_p D^{-1/2}),    (5)

where β ∈ (0, 1) is a hyperparameter, I_N ∈ R^{N×N} is the identity matrix, N = |V|, and the eigenvalues of A_p lie in [0, β]. Afterwards, to capture the complex non-linear dynamics of node embeddings, we assume that three possible factors affect the node states: 1) the latest interaction information, exhibiting the latest node states; 2) the neighbors' features, affecting the change of node states; and 3) the inherent characteristics of nodes, determining the influence of the two aforementioned factors. Based on these considerations, we define an ODE on E_p to solve for the node embedding trajectories, adaptively fusing the factors with a gated mechanism. Here, we treat the encoder output E^{t_p} as the initial value of the ODE, i.e., H^{t_p}(0) = E^{t_p}. Then we use the following differential equation to learn the continuous node representations in the interval between interactions:

dH^{t_p}(t)/dt = z_l ⊙ H^{t_p}(0) + z_n ⊙ A_p H^{t_p}(t) − z_i ⊙ H^{t_p}(t),    (6)

where H^{t_p}(0) represents the latest interaction information of the nodes, A_p H^{t_p}(t) represents the influence from neighbors, −H^{t_p}(t) represents the inherent characteristics of the nodes, and ⊙ denotes the element-wise product. z_l, z_n, and z_i are three gates, computed as:

z_l = σ(W_l H^{t_p}(0) + b_l),  z_n = σ(W_n H^{t_p}(0) + b_n),  z_i = σ(W_i H^{t_p}(0) + b_i),    (7)

where W_l, W_n, W_i and b_l, b_n, b_i are trainable parameters, and σ(·) is the sigmoid activation function normalizing the output into [0, 1]. The gated fusion mechanism adaptively fuses the three factors according to their importance, computed from the initial value of the ODE. In addition, the updating process starts at the timestamp of the latest interaction of the node. To this end, following [24], when an interaction message arrives, we use the latest interaction information to update the node embeddings and save the current interaction message for the next time the node appears. Meanwhile, since the time encoding is integrated into H^{t_p}(0), the temporal behavior of nodes can be captured. As a result, we can leverage the above ODE to learn the nodes' state trajectories over the interval [0, Δt], and the three main factors jointly capture the continuous dynamics of node representations. The node representations are then updated by the ODE solver as follows:

H^{t_p}(Δt) = ODESolve(g, H^{t_p}(0), Δt) = H^{t_p}(0) + ∫_0^{Δt} g(t) dt,    (8)

where g(t) = dH^{t_p}(t)/dt. Finally, we use the hidden state at the end time, H^{t_p}(Δt_p), to update the previous embedding memory H̄^{t_{p−1}} to H̄^{t_p} (i.e., H̄^{t_p} = H^{t_p}(Δt_p)) (line 5 in Algorithm 1).
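The continuous inference block can be sketched as follows, using the torchdiffeq ODE solver and the gated dynamics of Eqns. (6)-(8) as reconstructed here. The class names, the fixed-step RK4 solver, and the sparse-matrix handling are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # assumed dependency; any solver with this interface works

class GatedODEFunc(nn.Module):
    """dH(t)/dt fusing three factors with gates computed from the initial value H(0):
    the latest interaction H(0), the neighbour influence A_p H(t), and the inherent
    decay term -H(t) (Eqns. (6)-(7) as reconstructed above)."""

    def __init__(self, d: int):
        super().__init__()
        self.gate_l = nn.Linear(d, d)
        self.gate_n = nn.Linear(d, d)
        self.gate_i = nn.Linear(d, d)

    def set_state(self, A_p: torch.Tensor, H0: torch.Tensor):
        # A_p: sparse normalised adjacency (Eqn. (5)); H0: encoder output E^{t_p}
        self.A_p, self.H0 = A_p, H0
        self.z_l = torch.sigmoid(self.gate_l(H0))
        self.z_n = torch.sigmoid(self.gate_n(H0))
        self.z_i = torch.sigmoid(self.gate_i(H0))

    def forward(self, t, H):
        return (self.z_l * self.H0
                + self.z_n * torch.sparse.mm(self.A_p, H)
                - self.z_i * H)

def continuous_update(ode_func: GatedODEFunc, A_p, H0, delta_t: float = 1.0):
    """Solve the state trajectory on [0, delta_t] and return the embedding at the end
    time, which overwrites the node's embedding memory (Eqn. (8))."""
    ode_func.set_state(A_p, H0)
    ts = torch.tensor([0.0, delta_t])
    trajectory = odeint(ode_func, H0, ts, method="rk4")  # shape [2, |V|, d]
    return trajectory[-1]                                # H^{t_p}(delta_t)
```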
After capturing the continuous dynamics of node representations from the latest-interaction sets and updating the node embeddings, a graph attention layer in the transform module (Fig. 2, right) is applied to convert the observed historical interaction features of nodes into future representations (lines 6 and 17-24 in Algorithm 1). In this module, for the current interaction e connecting u and v at time t, their temporal neighbors and the interaction information between them are taken as input. We introduce a self-attention mechanism to distinguish different neighbors and to take the structural information into account together with the temporal information [23]. For node u at time t, we consider its neighbors N(u, t) = {i_0, ..., i_{k−1}}, where the interaction between u and i_j ∈ N(u, t) occurred at time t_j prior to t; the temporal neighbors of node i are sampled in the same way as for u. We then take the node information with the time encoding of t, Z^t_u = h̄^t_u || F_T(t), and the neighborhood information with the time-delta encoding, Z^t_N = H̄^t_{N(u,t)} || f_{u N(u,t)} || F_T(t − t_j), as the input of the attention, where the time delta t − t_j is between the current interaction e and the interaction of u with its neighbor i_j ∈ N(u, t). In the attention, three different linear projections are used to obtain the query Q^t, key K^t and value V^t:

Q^t = Z^t_u W_Q,  K^t = Z^t_N W_K,  V^t = Z^t_N W_V,    (9)

where W_Q, W_K, W_V are the weight matrices. The attention weight α_j is given by:

α_j = exp(Q^t (K^t_j)^T) / Σ_q exp(Q^t (K^t_q)^T),    (10)

where α_j reveals how node u attends to the features of node i_j within the topological structure defined by N(u, t), after accounting for the interaction time with i_j. The layer therefore produces the time-aware representation of node u at time t, ĥ^t_u, which is the future hidden representation generated from the historically observed interactions:

ĥ^t_u = Σ_{j=0}^{k−1} α_j V^t_j.    (11)

To stabilize the learning process and improve performance, we extend the self-attention mechanism to a multi-head one [33]. Specifically, we concatenate M parallel attention mechanisms with different learnable projections:

ĥ^t_u = ( ĥ^{t,1}_u || ĥ^{t,2}_u || ... || ĥ^{t,M}_u ) W_O,    (12)

where ĥ^{t,m}_u is the output of the m-th attention head, computed as in Eqns. (9)-(11) with head-specific projections, and W_O is a learnable output projection. Finally, the future node embedding Ĥ^{t_p}_u is generated by calculating ĥ^{t_p}_u for each node in the interactions (line 6 in Algorithm 1). The latest interactive nodes r_u and r_i are updated as i and u, respectively, and the latest interactive timestamps τ_u and τ_i are both updated as t_p (lines 7-8 in Algorithm 1).

In this work, we adopt a time-sensitive link prediction binary cross-entropy loss to learn ConTIG's parameters. The loss is defined as follows, and our goal is to maximize the corresponding likelihood:

ℓ = Σ_{(u_p, i_p, t_p)} [ −log σ( ĥ^{t_p}_{u_p} · ĥ^{t_p}_{i_p} ) − Q · E_{i∼P(i)} log σ( −ĥ^{t_p}_{u_p} · ĥ^{t_p}_{i} ) ],    (13)

where the summation is over the observed edges between u_p and i_p that interact at time t_p, Q is the number of negative samples, and P(i) is the negative sampling distribution over the node space.

Algorithm 1 (ConTIG training procedure, summary). Input: an edge stream e = (u, i, t, f_ui) ∈ E of the temporal interaction graph G = (V, E); Output: embeddings ĥ^t_u for all nodes. The latest interactive node and timestamp (r_v, τ_v) are initialized for all v ∈ V; for each incoming interaction, the update procedure generates A_p by Eqns. (4)-(5) and computes h_v(Δt) ← ODESolve(g(t), h_v(0), Δt) for v ∈ V, returning h_u(Δt) and h_i(Δt); the transform procedure then computes ĥ^t_v for each v ∈ {u, i} with the graph attention layer and returns ĥ^t_u, ĥ^t_i.

The time complexity of our proposed method mainly consists of two parts. First, for the update module (i.e., the continuous inference block), both adjacency matrices are stored as sparse matrices, and the runtime of the ODE solver depends on the length of the time interval (i.e., the end time of the ODE solver) Δt and the complexity of the dynamics. The time complexity of the continuous inference block is therefore O(Δt |E|). Second, for the transform module (i.e., the graph attention layer), the masked self-attention operation is parallelizable, as suggested by [33]. The per-batch time complexity of the graph attention layer with m heads can be expressed as O(mk), where k is the average neighborhood size. Since batches are divided by edges, the time complexity of the graph attention layer is O(mk |E|), where m ≪ |E| and k ≪ |E|. Therefore, the time complexity of ConTIG is O((Δt + mk) |E|) ≈ O(C |E|), where C is a constant related to the runtime of the ODE solver.
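To make the transform module and the training objective above concrete, here is a minimal single-head PyTorch sketch together with the negative-sampling loss; multi-head attention concatenates M such heads as in Eqn. (12). Tensor shapes, names, and the scaled dot-product form are illustrative assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionHead(nn.Module):
    """Single attention head of the transform module: a query built from the target
    node (with F_T(t)) attends over its k historical neighbours, whose keys/values
    carry the edge features and the time-delta encodings (Eqns. (9)-(11) as
    reconstructed above)."""

    def __init__(self, d_q: int, d_kv: int, d: int):
        super().__init__()
        self.W_Q = nn.Linear(d_q, d, bias=False)
        self.W_K = nn.Linear(d_kv, d, bias=False)
        self.W_V = nn.Linear(d_kv, d, bias=False)

    def forward(self, z_u: torch.Tensor, z_neigh: torch.Tensor) -> torch.Tensor:
        # z_u: [B, d_q] (node state || F_T(t)); z_neigh: [B, k, d_kv]
        q = self.W_Q(z_u).unsqueeze(1)            # [B, 1, d]
        k = self.W_K(z_neigh)                     # [B, k, d]
        v = self.W_V(z_neigh)                     # [B, k, d]
        alpha = F.softmax((q * k).sum(-1) / k.size(-1) ** 0.5, dim=-1)  # [B, k]
        return (alpha.unsqueeze(-1) * v).sum(dim=1)                     # time-aware h^t_u

def link_prediction_loss(h_src, h_dst, h_neg):
    """Negative-sampling binary cross-entropy (Eqn. (13) as reconstructed): positive
    pairs are pushed together, the Q sampled negatives (h_neg: [B, Q, d]) apart."""
    pos = F.logsigmoid((h_src * h_dst).sum(-1))                        # [B]
    neg = F.logsigmoid(-(h_src.unsqueeze(1) * h_neg).sum(-1)).mean(1)  # [B]
    return -(pos + neg).mean()
```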
5 EXPERIMENTS
In this section, we use four networks to compare the performance of our model with four static methods and five dynamic methods. We conduct three main tasks, temporal link prediction, temporal node recommendation, and dynamic node classification, to evaluate the influence of introducing temporal information and of learning the continuous dynamics of node embedding trajectories.

Datasets. We evaluate the performance of ConTIG on the temporal link prediction, temporal node recommendation, and dynamic node classification tasks with four public datasets, where three datasets are user-item networks selected by [25] and one is an e-mail network. The statistics of the four datasets are summarized in Table 1.
• Wikipedia. The dataset describes interactions between active users and the pages whose content they edit, with unique timestamps and dynamic labels indicating whether a user is banned from editing.

Baselines. To evaluate the performance of ConTIG, we compare our method with state-of-the-art graph embedding methods on both static and dynamic graphs. Static graph embedding methods: GAE and VGAE [52] learn latent representations with graph auto-encoders. GraphSAGE [8] and GAT [9] are GNNs that capture graph dependencies via message passing between nodes. Dynamic graph embedding methods: CTDNE [21] defines a temporal random walk requiring walks to obey the temporal order. JODIE [25] learns to generate embedding trajectories of all users and items from temporal interactions through update and project operations. DyRep [40] uses a deep recurrent architecture and an attention mechanism to model fine-grained temporal dynamics. TGAT [23] introduces a self-attention mechanism and a functional time encoding technique to learn time-feature interactions. TGN [24] is a generic inductive framework operating on continuous-time dynamic graphs [10] represented as a sequence of events.

Parameter Settings. The dimensions of both node representations and time representations are set to 172. The optimizer is Adam, the learning rate is 0.0001, and the dropout probability is 0.1. In the continuous inference block, the parameter β in the adjacency matrix regularization is 0.95, and the end time of the ODE is set to 1.0 instead of the real interval Δt. In the graph attention layer, the number of selected neighbors k is set to 15, the number of heads M in attention is set to 2, and the negative sampling distribution P(i) is the uniform distribution.

Settings for Baselines. For static baselines, we adopt the same training procedures as in [23]. We refer to the PyTorch Geometric library for implementing the GAE and VGAE baselines, and develop off-the-shelf implementations for GraphSAGE and GAT by referencing their original implementations to accommodate the temporal setting and incorporate edge features. For dynamic baselines, we use the open-source implementation of CTDNE and the source code released by the authors to implement TGAT and TGN, and we implement JODIE and DyRep within the TGN framework in PyTorch.

Implementation. All our experiments are run on an Ubuntu 20.04 server with 32 GB memory, an Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz, and NVIDIA(R) 2080Ti GPUs. All code is implemented in PyTorch version 1.5.1.

The goal of the temporal link prediction task is, given two nodes and a certain future timestamp, to predict the probability that a link exists between them. For this task, we evaluate our method in both the transductive and inductive settings.
In the transductive task, we predict links between nodes observed during training. In the inductive task, we predict links between new nodes that have not been observed in the past. Our model is tuned on the validation set, and we report the average precision (AP) on the test set. We divide the data into training, validation, and test sets with a 70%-15%-15% split; after the division, nodes that do not appear in the training set are considered new nodes. The model is trained on temporal link prediction. The comparison between our method and the baseline methods on the temporal link prediction task is shown in Table 2. We observe that: 1) static graph embedding methods, including GAE, VGAE, GraphSAGE, and GAT, perform worse than the baselines that model temporal interaction information, because most interactions in real-world networks are time-stamped; 2) among the dynamic graph embedding works, both the temporal-neighbor-information methods (i.e., CTDNE, TGAT) and the latest-interaction-information methods (i.e., JODIE, DyRep) perform worse than the fusing methods (i.e., TGN, ConTIG); and 3) ConTIG achieves the best prediction performance on the Wikipedia, Reddit, and CollegeMsg datasets in both the transductive and inductive settings, and the second-best performance on Mooc for the inductive task. This observation demonstrates the advantage of our proposed method over existing methods. In fact, by considering the inherent node properties and modeling the three important factors (i.e., latest interaction, neighbor features, and inherent characteristics) in temporal interaction graphs, our method can capture the complex non-linear dynamics of node representations effectively, thereby achieving superior performance.

The goal of the temporal node recommendation task is to predict the top-K likely neighbors of node u at a future timestamp t. This task is also used to evaluate the performance of temporal network embedding methods. For this task, we evaluate all methods in the transductive setting: each method outputs user u's preference scores over all items at each timestamp in the test set. We sort the scores in descending order and record the rank of the paired node u and i. We evaluate the task on three user-item networks (i.e., Wikipedia, Reddit, and Mooc); the CollegeMsg dataset is not included because it is not a user-item network. Our evaluation metric for this task is Recall@K, where K ∈ {5, 10, 15, 20}. The comparison between our method and the baselines on the temporal node recommendation task is shown in Fig. 3, which shows that ConTIG performs better than all the baselines. Compared with the best competitor (i.e., TGN), the recommendation performance of ConTIG improves by 110.51% in terms of Recall@5 on Mooc; on Wikipedia and Reddit, the improvements are 3.25% and 13.13% with respect to Recall@5. These significant improvements verify that the differential equation fused with the three factors (i.e., latest interaction, neighbor features, and inherent characteristics) proposed in ConTIG is capable of learning the trend of node state trajectories in the network. Additionally, the significant improvement of ConTIG benefits from combining continuous node state modeling in the update module with dynamic sub-graph structure capturing in the transform module, which is good for the downstream node recommendation task.
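As an illustration of the Recall@K protocol described above, the following sketch computes the metric for a single test interaction; the function name and the NumPy-based interface are assumptions for illustration, and the upstream scores would come from the trained decoder.

```python
import numpy as np

def recall_at_k(scores: np.ndarray, true_item: int, ks=(5, 10, 15, 20)):
    """scores: a user's preference scores over all items at a test timestamp;
    true_item: the item actually interacted with. Scores are sorted in descending
    order and we check whether the true item falls within the top K."""
    rank = int(np.where(np.argsort(-scores) == true_item)[0][0]) + 1
    return {k: float(rank <= k) for k in ks}

# Illustrative usage: average recall_at_k over every (user, item, t) test interaction,
# where `scores` is the decoder output for the user's embedding at time t.
```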
The goal of the dynamic node classification task is to predict the state label of a user, given the user, the item, and a future timestamp. For this task, we evaluate our method in the transductive setting, predicting the state labels of users who have been observed during training. We evaluate the task on the three datasets with dynamic node labels (i.e., Wikipedia, Reddit, and Mooc); the CollegeMsg dataset is not included because it has no node labels. Specifically, we train a decoder on top of the model trained with the temporal link prediction task. Our evaluation metric for this task is the area under the ROC curve (AUC). The comparison between our method and the baselines on the dynamic node classification task is shown in Table 3. Again, our algorithm achieves the best or comparable performance compared with existing dynamic graph embedding approaches, demonstrating its effectiveness for the downstream node classification task.

The goal of the next analysis is to examine the link prediction performance of our method on interaction sets with different time intervals. The comparison between our method and the dynamic graph baselines on temporal link prediction over five interaction sets of Wikipedia and CollegeMsg is shown in Fig. 4. Comparing the AP results in each set, we find that our method outperforms TGN, especially on long-interval interactions (e.g., the frequency intervals 60-80% and 80-100%), demonstrating the superiority of ConTIG in capturing the continuous dynamics of node representations. We argue that long-interval link prediction is particularly beneficial for practical applications, because quite a few users are always inactive on social networks, citation networks, and other user-item interaction systems.

For the ablation study, we compare ConTIG with its ablated variants (i.e., ConTIG w/o update, ConTIG w/o adaptive, ConTIG w/o latest, ConTIG w/o neighbor, and ConTIG w/o inherent). ConTIG outperforms ConTIG w/o update by a large margin, which demonstrates the effectiveness of the main modules of our model. In particular, the introduction of the continuous update module significantly improves the results, indicating that the latest knowledge between two consecutive interactions is an essential feature for learning the node representations. By further modeling and aggregating the historical interaction information of nodes, ConTIG consistently improves the performance, showing the importance of the historical interaction information. Specifically, we investigate the effect of each component in the update module. By removing the adaptive fusion approach, the performance of ConTIG w/o adaptive degrades noticeably, pointing out that adaptive fusion is a key factor in the success of the ODE solver in the update module: it helps the network focus on the most correlated factor for updating the node state and adaptively fuses the three factors in a data-dependent way. Besides, ConTIG outperforms ConTIG w/o latest, ConTIG w/o neighbor, and ConTIG w/o inherent by a small margin, which indicates that the three factors are of great importance for learning the change of node states in the update module. In particular, by comparing ConTIG with ConTIG w/o latest, ConTIG w/o neighbor, and ConTIG w/o inherent, respectively, we observe that the neighbor features contribute most to the performance on all datasets, which indicates the importance of neighbors in the latest changes of node states. Meanwhile, for ConTIG w/o inherent, we observe obvious effects on the datasets without edge features (i.e., Mooc and CollegeMsg), although the effects are tiny on the datasets with edge features (i.e., Wikipedia and Reddit).

There are several hyper-parameters in our proposed method, and it is necessary to analyze how these parameters influence the performance of our model on the temporal link prediction task on the above datasets.
In detail, these parameters include the end time Δt in the continuous inference block, the number of neighbors k and the number of heads M in the graph attention layer, the node embedding dimension d, and the time embedding dimension d_T. We choose different values for them and use the AP score to evaluate the resulting performance.

End time in the continuous inference block (see Section 4.3). An important parameter is the end time Δt used to update the node states in the continuous inference block. In our experiment, Δt ranges from 0.6 to 1.6, and the node and time embedding dimensions are fixed to 172. As shown in Fig. 5a, as Δt grows, the AP score first increases and then decreases, demonstrating that more time for learning node changes between two consecutive interactions can yield better performance. However, when the updating time is very long, the current node over-relies on the latest information and forgets the historical information, which hinders the performance.

Number of neighbors (see Section 4.4). We next evaluate the influence of the number of neighbors, which controls how much historical interaction information is considered in the transform module. As shown in Fig. 5b, in general, as the neighborhood size k becomes larger, the model achieves better performance because more historical information is considered. However, when the neighborhood size is very large, useless or outdated information is introduced into the learning process, degrading the performance.

Heads in the graph attention layer (see Section 4.4). Fig. 5c shows the AP results under different numbers of heads M in the multi-head attention. We observe that the AP score first increases and then decreases or stabilizes, with a relatively stable and good performance at two-head attention. This means that attention with too few heads makes the results unstable and biased, while attention with too many heads may attend to useless or unrelated information.

Node embedding dimension. The influence of different node embedding dimensions {16, 64, 128, 172, 256} on our model is shown in Fig. 5d. We observe the same trend on two datasets: as the dimension rises from a small value, the performance of ConTIG improves rapidly and then becomes relatively stable once the dimension is large. The likely reason is that a higher dimension enables the model to learn more information in the latent space; however, it also makes the computational cost very large, so we choose a reasonable dimension with the best performance (i.e., 172).

Time embedding dimension. The influence of different time embedding dimensions {16, 64, 128, 172, 256} on our model is shown in Fig. 5e. We observe results similar to those for the node embedding dimension: as d_T grows, the AP score first increases and then tends to stabilize. Unlike the node embedding dimension, there is a fluctuation for high-dimensional time embeddings on Wikipedia and for low-dimensional time embeddings on Mooc, which means that both overly high and overly low dimensions may hinder the performance and shows the importance of choosing a reasonable time embedding dimension (i.e., 172).

Here, we examine how the training time of ConTIG depends on the number of edges |E| used for training. We record the runtime of ConTIG for training one epoch on the Wikipedia dataset using the best parameters from Section 5.7. Fig. 6 shows that the entire runtime of one-epoch training scales close to linearly with |E|.
This evaluation demonstrates our method's advantage of being scalable to long edge streams.

6 CONCLUSION
In this paper, we present ConTIG, a novel representation learning method that captures the continuous dynamics of node embedding trajectories by identifying three factors (i.e., latest interaction, neighbor features, and inherent characteristics). ConTIG contains two modules: an update module to learn the node embedding trajectories and a transform module to generate future representations from historical interaction information. Experimental results demonstrate that ConTIG achieves state-of-the-art performance, especially for long-interval interactions on temporal link prediction tasks. In the future, besides node state trajectories, we plan to pay more attention to community evolution, exploring the impact of communities on individuals during graph evolution.

REFERENCES
[1] Representation learning on graphs: Methods and applications.
[2] A survey on network embedding.
[3] Deepwalk: Online learning of social representations.
[4] Line: Large-scale information network embedding.
[5] node2vec: Scalable feature learning for networks.
[6] A comprehensive survey on graph neural networks.
[7] Semi-supervised classification with graph convolutional networks.
[8] Inductive representation learning on large graphs.
[9] Graph attention networks.
[10] Representation learning for dynamic graphs: A survey.
[11] Foundations and modelling of dynamic networks using dynamic graph neural networks: A survey.
[12] Dynamic network embedding survey.
[13] Large scale evolving graphs with burst detection.
[14] Evolvegcn: Evolving graph convolutional networks for dynamic graphs.
[15] Grade: Graph dynamic embedding.
[16] K-core based temporal graph convolutional network for dynamic graphs.
[17] Exploring temporal information for dynamic network embedding.
[18] Temporal network embedding for link prediction via vae joint attention mechanism.
[19] Continuous-time link prediction via temporal dependent graph neural network.
[20] Gtea: Representation learning for temporal interaction graphs via edge aggregation.
[21] Continuous-time dynamic network embeddings.
[22] Ties: Temporal interaction embeddings for enhancing social media integrity at facebook.
[23] Inductive representation learning on temporal graphs.
[24] Temporal graph networks for deep learning on dynamic graphs.
[25] Predicting dynamic embedding trajectory in temporal interaction networks.
[26] Learning temporal interaction graph embedding via coupled memory networks.
[27] Graph neural networks exponentially lose expressive power for node classification.
[28] Continuous graph neural networks.
[29] Neural ordinary differential equations.
[30] Laplacian eigenmaps and spectral techniques for embedding and clustering.
[31] Distributed large-scale natural graph factorization.
[32] Learning graph embedding with adversarial training methods.
[33] Attention is all you need.
[34] Node embedding over temporal graphs.
[35] Dysat: Deep neural representation learning on dynamic graphs via self-attention networks.
[36] Discrete-time temporal network embedding via implicit hierarchical learning in hyperbolic space.
[37] Subset node representation learning over large dynamic graphs.
[38] Dyngem: Deep embedding method for dynamic graphs.
[39] Transformer-style relational reasoning with dynamic memory updating for temporal network modeling.
[40] Dyrep: Learning representations over dynamic graphs.
[41] Temporal network embedding with micro- and macro-dynamics.
[42] Tcl: Transformer-based dynamic graph modelling via contrastive learning.
[43] A data-driven graph generative model for temporal interaction networks.
[44] Learning representation over dynamic graph using aggregation-diffusion mechanism.
[45] Motifs in temporal networks.
[46] Inductive representation learning in temporal networks via causal anonymous walks.
[47] Local motif clustering on time-evolving graphs.
[48] Neural dynamics on complex networks.
[49] Temporal knowledge graph forecasting with neural ODE.
[50] Coupled graph ODE for learning interacting system dynamics.
[51] Self-attention with functional time representation learning.
[52] Variational graph auto-encoders.

ACKNOWLEDGMENTS
The research is supported by the Natural Science Foundation of China (61872306), the Xiamen Science and Technology Bureau (3502Z20193017), and the Fundamental Research Funds for the Central Universities (20720200031).