authors: Fan, Yujie; Ju, Mingxuan; Zhang, Chuxu; Zhao, Liang; Ye, Yanfang
title: Heterogeneous Temporal Graph Neural Network
date: 2021-10-26

Graph neural networks (GNNs) have been broadly studied on dynamic graphs for their representation learning, the majority of which focus on graphs with homogeneous structures in the spatial domain. However, many real-world graphs - i.e., heterogeneous temporal graphs (HTGs) - evolve dynamically in the context of heterogeneous graph structures. The dynamics associated with heterogeneity have posed new challenges for HTG representation learning. To solve this problem, in this paper, we propose the heterogeneous temporal graph neural network (HTGNN) to integrate both spatial and temporal dependencies while preserving the heterogeneity to learn node representations over HTGs. Specifically, in each layer of HTGNN, we propose a hierarchical aggregation mechanism, including intra-relation, inter-relation, and across-time aggregations, to jointly model heterogeneous spatial dependencies and temporal dimensions. To retain the heterogeneity, intra-relation aggregation is first performed over each slice of the HTG to attentively aggregate information of neighbors with the same type of relation, and then inter-relation aggregation is exploited to gather information over different types of relations; to handle temporal dependencies, across-time aggregation is conducted to exchange information across different graph slices over the HTG. The proposed HTGNN is a holistic framework tailored to heterogeneity with evolution in time and space for HTG representation learning. Extensive experiments are conducted on the HTGs built from different real-world datasets, and promising results demonstrate the outstanding performance of HTGNN in comparison with state-of-the-art baselines. Our built HTGs and code have been made publicly accessible at: https://github.com/YesLab-Code/HTGNN.

Many real-world data come in the form of graphs, such as academic networks [1, 2], social networks [3, 4], and epidemiological networks [5, 6]. A graph consists of a set of nodes interconnected by a set of edges. Learning node representations on graphs is essential for various downstream tasks, such as node classification, link prediction, and recommendation. Recently, graph neural networks (GNNs) have been broadly studied and have achieved state-of-the-art performance by taking both node features and graph structures into consideration. Despite their superior performance, most of the current research efforts concentrate on static graphs [7, 8, 9, 10] or dynamic/spatial-temporal graphs with homogeneous structures [11, 12, 5, 13]. However, many real-world graphs evolve dynamically in the context of heterogeneous graph structures. From the perspective of the spatial domain, the graph is heterogeneous, with multi-typed nodes connected by multi-typed relations; from the perspective of the temporal domain, either the node features or the graph structures evolve over time. We call this type of graph the heterogeneous temporal graph (HTG). An HTG can be described as an ordered list of heterogeneous graph slices with a set of temporal relations connecting them. It is a general concept for modeling heterogeneous and dynamically changing graph data. Typical examples include dynamic academic networks and epidemiological networks.
For dynamic academic networks, the heterogeneous structures evolve along with the authors' research directions and co-authorships. In contrast, the graph structures remain unchanged for epidemiological networks, but the node features inevitably change with increased/decreased patient numbers. It is worth noting that dynamic heterogeneous graphs [14, 15, 16, 17] can be treated as an instance of HTGs, where the dynamic nature comes from the evolving graph structures. The dynamics associated with heterogeneity have posed new challenges for representation learning on HTGs. There exist some preliminary works, which can be roughly grouped into two categories: one first explores neural sequence models to process the time-series features attached to each node, and then performs graph representation learning with the processed node features in the spatial domain [18, 5, 6, 19]; the other first applies GNNs on each graph slice of an HTG, and then employs sequence models on the outputs of each slice to obtain the final representations [20, 16, 15]. Although these works can achieve satisfactory results, they still face the following limitations: (1) The existing models are graph-dependent. That is, the performance depends heavily on the characteristics of a graph. Specifically, for an HTG with dynamically evolving heterogeneous graph structures (e.g., academic networks), the approaches in the second category, which place more emphasis on spatial dependencies, usually obtain better results. On the contrary, for an HTG with constantly changing node features (e.g., epidemiological networks), the methods in the first category, which focus more on temporal dependencies, achieve superior performance. Selecting the model that best fits a given HTG thus requires empirical knowledge. (2) The spatial and temporal dependencies are processed in a serialized way. Most existing models either analyze the temporal domain first and the spatial domain later, or in the reverse order, which weakens the spatial-temporal interactions, as the information in these two domains is treated separately. Currently, it is not yet well understood how to jointly integrate both spatial and temporal dependencies while preserving the heterogeneity for node representation learning over HTGs. To fill this gap, in this paper, we propose the heterogeneous temporal graph neural network (HTGNN), a holistic framework tailored to heterogeneity with evolution in time and space to learn node representations on HTGs. More specifically, to retain the spatial heterogeneity, we design intra-relation aggregation and inter-relation aggregation, which are performed purely on each graph slice of an HTG, to successively aggregate the information of a target node's neighbors within the same type of relation and over different types of relations. To handle the temporal dependencies, we introduce across-time aggregation, which is conducted across different graph slices, to gather the information of the target node's temporal neighbors. To capture the spatial-temporal interactions, we equip each layer of HTGNN with a hierarchical aggregation mechanism, including intra-relation, inter-relation, and across-time aggregation modules, to jointly, rather than serially, model heterogeneous spatial dependencies and temporal dimensions. With increased model depth, the information is iteratively propagated in these two domains, making HTGNN agnostic to graph characteristics.
In sum, we make the following contributions: • We study the representation learning problem on HTGs. The HTG is a general concept to model graph data with heterogeneous spatial structures and temporal evolution patterns (i.e., dynamically evolving graph structures or constantly changing node features). • We propose HTGNN to learn node representations on HTGs. HTGNN is a holistic framework capable of jointly modeling heterogeneous spatial dependencies and temporal dimensions. This character differs from existing works that process these two types of dependencies in a serialized way. • We establish two HTGs from different real-world datasets, one with dynamically evolving heterogeneous structures (i.e., OGBN-MAG) and another with constantly changing node features (i.e., COVID-19). Extensive experiments on these two HTGs demonstrate that HTGNN consistently achieves strong performance in comparison with the state of the art on different graph mining tasks.

Heterogeneous Graph Neural Networks. Recently, various heterogeneous GNNs [21, 22, 10, 2, 9, 23, 24] have been proposed with successful applications [25, 26, 27, 28, 29]. RGCN [21] introduces relation-specific transformations for different relations during the learning process. HGT [10] utilizes meta relations to model graph heterogeneity and further learns the mutual attention for each meta relation based on the Transformer architecture. By leveraging metapaths defined on heterogeneous graphs, HAN [2] designs node-level attention and semantic-level attention to learn the importance of metapath-based neighbors and of different metapaths, respectively. These models are built for static heterogeneous graphs and cannot deal with the dynamic properties of HTGs. It is worth noting that HGT uses a relative temporal encoding to assign each node a timestamp to handle graph dynamics. However, this only partially addresses the problem, as the node embedding in HGT is assumed to be time-invariant, which is not in line with many real-world scenarios.

Dynamic Graph Learning. Spatial-temporal graphs [11, 18, 6, 5] and dynamic graphs with homogeneous structures [12, 30, 13] have been widely studied in the literature. To further consider graph heterogeneity, learning on dynamic heterogeneous graphs has drawn increasing attention, including dynamic heterogeneous graph embedding models [31, 32, 17, 14] that solely consider graph structures, and dynamic heterogeneous GNNs [15, 20, 33, 34] that take both graph structures and node features into consideration. DyHATR [15] first introduces node-level and edge-level attentions to learn heterogeneous information and then applies RNNs with temporal attention to capture temporal dependencies. HDGAN [35] combines heterogeneous attention and the Hawkes process to model graph heterogeneity and dynamics. However, there are some limitations in current works. Firstly, these models are graph-dependent, which requires empirical knowledge; secondly, the spatial and temporal dependencies are processed serially, which leads to a weakened connection between these two domains. This paper addresses these issues by designing a holistic, graph-agnostic framework to learn node representations on HTGs.

In this section, we define the concepts used in our model. Heterogeneous Temporal Graph. An HTG is denoted as $G = (\{G^{(t)}\}_{t=1}^{T}, \widetilde{E}) = (V, E)$, where $T$ is the number of timestamps, $G^{(t)} = (V^{(t)}, E^{(t)})$ is a heterogeneous graph at timestamp $t$, $\widetilde{E}$ describes the temporal relations between $G^{(t)}$ and $G^{(t+1)}$, and $V = \bigcup_{t=1}^{T} V^{(t)}$ and $E = \bigcup_{t=1}^{T} E^{(t)} \cup \widetilde{E}$ denote the node set and edge set of $G$, respectively. Figure 1 shows one example from the OGBN-MAG dataset.

Figure 1: An example HTG from the OGBN-MAG dataset, consisting of heterogeneous graph slices HG^(1), HG^(2), ..., HG^(T) with author, paper, institution, and field-of-study nodes.
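To make the data model concrete, the following sketch (our illustration, not the authors' released code; the relation names follow the OGBN-MAG example, while the edges and node counts are toy placeholders) materializes an HTG as an ordered list of DGL heterogeneous graph slices:

```python
# A minimal sketch of an HTG as an ordered list of heterogeneous graph slices.
import torch
import dgl

def build_htg_slice(edges_per_relation):
    """edges_per_relation maps (src_type, relation, dst_type) -> (src_ids, dst_ids)."""
    data_dict = {
        rel: (torch.tensor(src), torch.tensor(dst))
        for rel, (src, dst) in edges_per_relation.items()
    }
    return dgl.heterograph(data_dict)

# Toy HTG with T = 2 slices.
htg = [
    build_htg_slice({
        ("author", "writes", "paper"): ([0, 1], [0, 0]),
        ("paper", "cites", "paper"): ([0], [1]),
    }),
    build_htg_slice({
        ("author", "writes", "paper"): ([0, 1, 2], [0, 1, 1]),
        ("paper", "cites", "paper"): ([1], [0]),
    }),
]
# The temporal relations (E tilde) link the same node across consecutive
# slices; in HTGNN they are handled by the across-time aggregation module.
```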
Relation-r-based Neighbors. Given a relation type $r \in R$ and a node $v$ at timestamp $t$, its relation-$r$-based neighbors at timestamp $t$ are defined as $N_r^t(v) = \{u \mid (u, v) \in E^{(t)}, \psi(u, v) = r\}$.

Definition 3.4. Heterogeneous Temporal Graph Representation Learning. Given an HTG $G = (\{G^{(t)}\}_{t=1}^{T}, \widetilde{E}) = (V, E)$ consisting of $T$ timestamps, the task of HTG representation learning is to learn a general function that maps each node $v \in V$ to a low-dimensional representation. The node representations are able to capture both the spatial heterogeneity and the temporal dependencies of the HTG, and can be applied in various downstream tasks at timestamp $T + 1$.

The framework of HTGNN is shown in Figure 2. HTGNN takes an HTG as input and yields node representations as outputs for downstream tasks (see Figure 2(a)). It is composed of multiple heterogeneous temporal aggregation layers. Each layer is equipped with a hierarchical aggregation mechanism (see Figure 2(b)), including intra-relation, inter-relation, and across-time aggregation modules. The intra-relation and inter-relation aggregation modules are performed purely on each graph slice, aiming to depict the spatial heterogeneity, while the across-time aggregation module is conducted across different graph slices, aiming to capture the temporal dependencies. In one aggregation layer, each node successively receives messages from its spatial neighbors of the same relation type and of different relation types; the nodes then gather messages from their temporal neighbors across graph slices. After this, another layer follows, taking the node embeddings obtained from the previous layer. With increased model depth, the messages are iteratively propagated in the spatial and temporal domains. This design makes HTGNN a holistic framework that jointly models heterogeneous spatial dependencies and temporal dimensions.

In a heterogeneous graph, each node type may have its own feature space. Take the OGBN-MAG dataset as an example, where only nodes of the paper type are associated with input features. Metapath2vec [36] is a common approach to initialize features for the nodes without input features. Apparently, the feature spaces for the paper type and the other types are different, as the former reflects text content information while the latter represents graph structural information. To handle this problem, before feeding node features into the heterogeneous temporal aggregation layers, we first adopt a type-specific projection on each node to map its distinct feature vector into the same feature space. Mathematically, given a node $v$ with node type $\phi(v)$ at timestamp $t$, we have $h_v^{t,0} = W_{\phi(v)} \cdot x_v^t$, where $x_v^t$ is the input feature vector of node $v$ at timestamp $t$ and $W_{\phi(v)}$ is a type-specific trainable transformation matrix.

The intra-relation aggregation module is performed separately on each relation type in each graph slice. It takes the node embeddings of the last layer as inputs and outputs multiple relation embeddings for each node at each timestamp. Given a target node $v$ at timestamp $t$ and a relation type $r \in R$, the intra-relation aggregation can be described as $h_{v,r}^{t,l} = \mathrm{AGG}_{intra}\big(\{h_u^{t,l-1} : u \in N_r^t(v)\}; \Theta_{intra}\big)$, where $N_r^t(v)$ represents the relation-$r$-based neighbors of node $v$ at timestamp $t$, $h_u^{t,l-1}$ is node $u$'s embedding at timestamp $t$ in layer $l-1$, $h_{v,r}^{t,l}$ denotes relation $r$'s embedding with respect to node $v$ at timestamp $t$ in layer $l$, and $\Theta_{intra}$ denotes the trainable parameters, which are not shared across relation types, timestamps, or aggregation layers. Note that when $l = 1$, $\mathrm{AGG}_{intra}$ takes the outputs of the type-specific projection module introduced above as inputs.
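As an illustration of the type-specific projection, here is a minimal PyTorch sketch (ours, not the reference implementation; the node types and dimensions are placeholders):

```python
import torch
import torch.nn as nn

class TypeSpecificProjection(nn.Module):
    """Maps each node type's raw features into a shared d-dimensional space."""
    def __init__(self, in_dims, d):
        super().__init__()
        # One trainable projection W_phi per node type phi(v).
        self.proj = nn.ModuleDict(
            {ntype: nn.Linear(dim, d) for ntype, dim in in_dims.items()}
        )

    def forward(self, feats):
        # feats: {node_type: (num_nodes, in_dim)} raw features of one slice.
        return {ntype: self.proj[ntype](x) for ntype, x in feats.items()}

# Toy usage: both types happen to be 128-dimensional here, as in OGBN-MAG.
proj = TypeSpecificProjection({"paper": 128, "author": 128}, d=32)
h0 = proj({"paper": torch.randn(4, 128), "author": torch.randn(3, 128)})
```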
For a target node, its different neighbors, even within the same type of relation, contribute differently in the learning process, and thus we adopt the self-attention mechanism to assign each neighbor a weight reflecting its importance. Formally, given a target node $v$ and one of its relation-$r$-based neighbors at timestamp $t$, i.e., $u \in N_r^t(v)$, their attention coefficient can be computed by $e_{(u,v),r}^{t,l} = \sigma\big((a_r^{t,l})^{\top} [W_r^{t,l} h_u^{t,l-1} \,\|\, W_r^{t,l} h_v^{t,l-1}]\big)$, where $\sigma(\cdot)$ is LeakyReLU, $W_r^{t,l} \in \mathbb{R}^{d \times d}$ and $a_r^{t,l} \in \mathbb{R}^{2d}$ are the trainable transformation matrix and attention vector, respectively, and $\|$ concatenates the vectors. We then normalize the attention coefficient across all relation-$r$-based neighbors via the softmax function: $\alpha_{(u,v),r}^{t,l} = \exp(e_{(u,v),r}^{t,l}) \big/ \sum_{u' \in N_r^t(v)} \exp(e_{(u',v),r}^{t,l})$. After obtaining the normalized attention values of node $v$'s neighbors, we perform a weighted combination to compute relation $r$'s embedding with respect to $v$: $h_{v,r}^{t,l} = \sigma\big(\sum_{u \in N_r^t(v)} \alpha_{(u,v),r}^{t,l} W_r^{t,l} h_u^{t,l-1}\big)$. Figure 2(c) illustrates the implementation of the intra-relation aggregation module. Without loss of generality, the multi-head attention mechanism can be employed. Specifically, $K$ independent attention heads are executed in parallel, and their representations are then concatenated: $h_{v,r}^{t,l} = \big\Vert_{k=1}^{K} \sigma\big(\sum_{u \in N_r^t(v)} \alpha_{(u,v),r,k}^{t,l} W_{r,k}^{t,l} h_u^{t,l-1}\big)$, where $\alpha_{(u,v),r,k}^{t,l}$ denotes the attention value at the $k$-th attention head.

Through intra-relation aggregation, the target node gathers multiple relation embeddings. Based on this, the inter-relation aggregation module aims to learn a spatial embedding for the target node, summarizing the information of its spatial neighbors over all relation types. Formally, this process is denoted as $h_{v,R}^{t,l} = \mathrm{AGG}_{inter}\big(\{h_{v,r}^{t,l} : r \in R(v)\}; \Theta_{inter}\big)$, where $R(v)$ denotes the set of relations connected to node $v$, $h_{v,r}^{t,l}$ is relation $r$'s embedding with respect to node $v$ from the previous module, $h_{v,R}^{t,l}$ denotes the spatial embedding of node $v$ that will be learned in this module, and $\Theta_{inter}$ denotes the trainable parameters, which are not shared across timestamps or aggregation layers. A straightforward way to implement $\mathrm{AGG}_{inter}(\cdot)$ is to treat each relation embedding equally by conducting an element-wise sum/mean operation, or by concatenating them followed by a linear transformation. However, each relation type preserves a unique semantic meaning and thus should not be treated identically. Therefore, we learn an importance weight for each relation type and explore the attention mechanism for the implementation. Specifically, for relation type $r$, we use a three-step process to learn its importance: (1) we first retrieve its embeddings of all related nodes and feed them into a non-linear transformation; (2) we generate its summarized embedding by averaging the transformed relation embeddings; (3) finally, we calculate its attention coefficient by measuring the similarity between its summarized embedding and a relation attention vector. This learning process is formalized as $w_r^{t,l} = (c_R^{t,l})^{\top} \cdot \frac{1}{|V_r^t|} \sum_{v \in V_r^t} \tanh\big(W_R^{t,l} h_{v,r}^{t,l} + b\big)$, where $V_r^t$ denotes the set of nodes connected by relation $r$ at timestamp $t$, $b \in \mathbb{R}^d$ is the bias vector, and $W_R^{t,l} \in \mathbb{R}^{d \times d}$ and $c_R^{t,l} \in \mathbb{R}^d$ are the trainable transformation matrix and attention vector, respectively. The normalized importance of $r$ with regard to $v$ is computed as $\beta_r^{t,l} = \exp(w_r^{t,l}) \big/ \sum_{r' \in R(v)} \exp(w_{r'}^{t,l})$. With the importance of the different relations, we generate the spatial embedding for $v$ via a linear combination: $h_{v,R}^{t,l} = \sum_{r \in R(v)} \beta_r^{t,l} \cdot h_{v,r}^{t,l}$. An intuitive explanation is shown in Figure 2(d). We could also extend this module to the multi-head mechanism.
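To make the inter-relation step concrete, here is a minimal single-head PyTorch sketch of the three-step attention (our illustration, not the released implementation; it uses dense tensors and assumes every node is touched by every relation type, whereas the paper normalizes over each node's own relation set R(v)):

```python
import torch
import torch.nn as nn

class InterRelationAggregation(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W = nn.Linear(d, d)               # W_R with bias b
        self.c = nn.Parameter(torch.randn(d))  # relation attention vector c_R

    def forward(self, relation_embs):
        # relation_embs: (num_relations, num_nodes, d) stacked h_{v,r}.
        # (1) non-linear transform, (2) average over nodes -> one summary
        # per relation, (3) similarity with c -> unnormalized weight w_r.
        summary = torch.tanh(self.W(relation_embs)).mean(dim=1)  # (R, d)
        w = summary @ self.c                                     # (R,)
        beta = torch.softmax(w, dim=0)                           # (R,)
        # Linear combination of relation embeddings -> spatial embedding.
        return (beta.view(-1, 1, 1) * relation_embs).sum(dim=0)  # (N, d)

agg = InterRelationAggregation(d=32)
h_spatial = agg(torch.randn(3, 10, 32))  # 3 relation types, 10 nodes
```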
The intra-relation and inter-relation aggregation modules, which aggregate information from the target node's spatial neighbors, are performed purely on each graph slice. Across-time aggregation aims to capture the interactions among the target node's temporal neighbors. We define the temporal neighbors of a node as the same node in the different graph slices (including itself). This module takes in the spatial embeddings of the target node's temporal neighbors and outputs a spatial-temporal embedding for this target node. The process is formalized as $h_{v,ST}^{t,l} = \mathrm{AGG}_{across}\big(\{h_{v,R}^{t',l} : 1 \le t' \le T\}; \Theta_{across}\big)$, where $h_{v,R}^{t',l}$ is the spatial embedding of node $v$'s temporal neighbor at timestamp $t'$ in layer $l$, $h_{v,ST}^{t,l}$ is node $v$'s spatial-temporal embedding at timestamp $t$ in layer $l$, and $\Theta_{across}$ denotes the trainable parameters, which are not shared across node types or aggregation layers. As the transformer [37] has shown great performance in the natural language processing domain, we explore its attention mechanism to model the across-time aggregation process. Before calculating attentions for different timestamps, we define a time encoding function $PE(\cdot)$ for $h_{v,R}^{t,l}$ to incorporate time-related factors: $PE(h_{v,R}^{t,l}) = h_{v,R}^{t,l} + p(t)$, where $p(\cdot)$ is a frequency encoding function that characterizes a time-dependent sinusoid, with $p(t)_i = \sin(t/10000^{2i/d})$ if the element index $i$ is even and $p(t)_i = \cos(t/10000^{2i/d})$ if $i$ is odd. By feeding the embeddings at different timestamps into this function, they become discriminative with regard to time. We then transform the target node's spatial embedding into a query vector and its temporal neighbor's spatial embedding into a key vector, and calculate their dot product as the attention coefficient measuring the importance of this temporal neighbor: $e_v^{(t',t),l} = \big(W_{\phi(v),q}^{l} PE(h_{v,R}^{t,l})\big)^{\top} \big(W_{\phi(v),k}^{l} PE(h_{v,R}^{t',l})\big)$, where $W_{\phi(v),q}^{l}, W_{\phi(v),k}^{l} \in \mathbb{R}^{d \times d}$ denote the trainable transformation matrices for query and key, respectively. The normalized attention value is then calculated by $\alpha_v^{(t',t),l} = \exp(e_v^{(t',t),l}) \big/ \sum_{t''=1}^{T} \exp(e_v^{(t'',t),l})$. Finally, node $v$'s spatial-temporal embedding is computed via a linear combination of its temporal neighbors' transformed embeddings and the calculated attention values: $h_{v,ST}^{t,l} = \sum_{t'=1}^{T} \alpha_v^{(t',t),l} \, W_{\phi(v),k}^{l} PE(h_{v,R}^{t',l})$. A graphical explanation of across-time aggregation is shown in Figure 2(e). Similarly, the multi-head attention mechanism can also be applied in this module.

Using the hierarchical aggregation mechanism above, we obtain the spatial-temporal embedding for each node. Beyond this, the feature vector of the node itself also plays an essential role in learning its representation. Instead of directly summing them, we design a gate mechanism to control how much information the node and its neighbors should contribute. Given the node's embedding at timestamp $t$ in layer $l-1$ and its spatial-temporal embedding at the same timestamp in layer $l$, their combination is formulated as $h_v^{t,l} = \delta_{\phi(v)} \cdot h_{v,ST}^{t,l} + (1 - \delta_{\phi(v)}) \cdot W_{\phi(v)} h_v^{t,l-1}$, where $\delta_{\phi(v)} \in \mathbb{R}^1$ and $W_{\phi(v)} \in \mathbb{R}^{d \times d}$ are the trainable weight and transformation matrix, respectively.

4.5 Learning Algorithm. By stacking $L$ heterogeneous temporal aggregation layers, we derive the embedding of each node at each timestamp, denoted as $h_v^{t,L}$. We then simply sum the node embeddings over all timestamps to obtain the final embedding: $h_v = \sum_{t=1}^{T} h_v^{t,L}$. HTGNN can be trained in an end-to-end manner with the labeled data at timestamp $T+1$ by minimizing $\mathcal{L} = \sum_{v} J(y_v, \hat{y}_v) + \lambda \|\Theta\|_2^2$, where $J(\cdot)$ measures the loss between the ground truth $y_v$ and the predicted score $\hat{y}_v$, and $\|\Theta\|_2^2$ is the L2 regularizer to prevent over-fitting. Depending on the goals of the different tasks, $J(\cdot)$ can be set as the cross-entropy loss for node classification and link prediction problems, or the mean absolute error for regression problems.
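A minimal PyTorch sketch of the across-time module follows (our illustration: single head, dense tensors, one node type; the scaling by sqrt(d) is our addition in the transformer style, and the key-transformed embeddings double as the values, which the text leaves implicit):

```python
import math
import torch
import torch.nn as nn

def time_encoding(t, d):
    # p(t, i) = sin(t / 10000^(2i/d)) for even i, cos(t / 10000^(2i/d)) for odd i.
    i = torch.arange(d, dtype=torch.float)
    angle = t / torch.pow(torch.tensor(10000.0), 2 * i / d)
    return torch.where(i % 2 == 0, torch.sin(angle), torch.cos(angle))

class AcrossTimeAggregation(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.Wq = nn.Linear(d, d, bias=False)  # query transform W_q
        self.Wk = nn.Linear(d, d, bias=False)  # key transform W_k

    def forward(self, h, t):
        # h: (T, N, d) spatial embeddings of N nodes of one type in T slices.
        T, N, d = h.shape
        pe = torch.stack([time_encoding(s, d) for s in range(T)])  # (T, d)
        h = h + pe.unsqueeze(1)          # time-aware embeddings PE(h)
        q = self.Wq(h[t])                # (N, d): queries at target timestamp t
        k = self.Wk(h)                   # (T, N, d): keys of temporal neighbors
        att = torch.softmax((k * q).sum(-1) / math.sqrt(d), dim=0)  # (T, N)
        # Linear combination of the transformed temporal neighbors.
        return (att.unsqueeze(-1) * k).sum(dim=0)  # (N, d)

agg = AcrossTimeAggregation(d=8)
h_st = agg(torch.randn(7, 5, 8), t=6)  # window of 7 slices, 5 nodes
```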
In this section, we conduct four sets of experiments to evaluate the performance of the proposed HTGNN. We construct two HTGs from two different domains with distinct characteristics. The OGBN-MAG dataset, an academic network with dynamically evolving heterogeneous structures, is used for the link prediction task. The COVID-19 dataset, an epidemiological network with constantly changing node features, is used for the node regression task.

OGBN-MAG: The original OGBN-MAG dataset [1] is a static heterogeneous network composed of a subset of the Microsoft Academic Graph (MAG). We extract an HTG from OGBN-MAG consisting of 10 graph slices spanning from 2010 to 2019. We first select the authors who consecutively publish at least one paper every year. We then collect these authors' affiliated institutions, published papers, and the papers' fields of study in each year to construct the HTG. Each graph slice is a heterogeneous graph that contains four types of nodes (paper, author, institution, and field of study) and four types of relations among them (author-affiliated with-institution, author-writes-paper, paper-cites-paper, and paper-has a topic of-field of study).

COVID-19: The COVID-19 data is obtained from 1point3acres, which contains both state- and county-level daily case reports (e.g., confirmed cases, new cases, deaths, and recovered cases). We use the daily new COVID-19 cases as the time-series data for each state and county. We then build an HTG including 304 graph slices spanning from 05/01/2020 to 02/28/2021. Each graph slice is a heterogeneous graph consisting of two types of nodes (state and county) and three types of relations between them, i.e., one administrative affiliation relation (state-includes-county) and two geospatial relations (state-near-state and county-near-county).

In OGBN-MAG, each paper comes with a 128-dimensional feature vector obtained by averaging the embeddings of the words in its title and abstract. For the other nodes, we run metapath2vec [36] on each graph slice to generate 128-dimensional node embeddings as their input features. For COVID-19, we attach to each node in each graph slice its daily new cases as the node feature. We split each dataset into training, validation, and testing sets with a ratio of 8:1:1. Statistics of these datasets are summarized in Table 1.
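To illustrate how one COVID-19 slice can be assembled, here is a toy DGL sketch (ours; the edges and case counts are fabricated placeholders, not the real data):

```python
import torch
import dgl

# One heterogeneous graph slice for a single day: two node types (state,
# county), one affiliation relation, and two geospatial relations.
slice_t = dgl.heterograph({
    ("state", "includes", "county"): (torch.tensor([0, 0, 1]), torch.tensor([0, 1, 2])),
    ("state", "near", "state"): (torch.tensor([0, 1]), torch.tensor([1, 0])),
    ("county", "near", "county"): (torch.tensor([0, 1]), torch.tensor([1, 0])),
})
# Daily new cases attached as scalar node features.
slice_t.nodes["state"].data["feat"] = torch.tensor([[412.0], [95.0]])
slice_t.nodes["county"].data["feat"] = torch.tensor([[120.0], [33.0], [9.0]])
```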
We compare HTGNN with three classes of state-of-the-art baselines. Neural Sequence Models: This class of baselines is capable of capturing temporal dependencies. LSTM [38] is a type of recurrent neural network that learns the order dependence of sequences. Transformer [37] handles sequences with global message routing, weighing the influence of different parts of the input. Static Graph Models: We consider several static homogeneous/heterogeneous GNNs that depict spatial dependencies. GCN [7] and GAT [8] work on homogeneous graphs, where the neighbor information is aggregated through a mean function and a self-attention mechanism, respectively. For the heterogeneous GNNs, we choose RGCN [21] and HGT [10], which do not rely on metapaths. RGCN considers specialized transformation matrices for different types of relations. HGT applies the Transformer architecture to learn the mutual attention for each meta relation. In the experiments, we treat HGT as a static heterogeneous GNN, as its relative temporal encoding is not applicable to HTGs. Dynamic Graph Models: We select one spatial-temporal GNN, one dynamic homogeneous GNN, and two dynamic heterogeneous GNNs as baselines. CoGNN [6] applies a multilayer perceptron to process time-series node features and uses GCN [7] with skip connections for spatial information aggregation. DySAT [12] employs self-attention to aggregate the structural neighborhood and temporal dynamics for node representation learning. HDGAN [35] combines heterogeneous attention and the Hawkes process to model graph heterogeneity and dynamics; we replace its heterogeneous attention module with HGT [10] to avoid incorporating metapaths. DyHATR [15] uses hierarchical attention to learn heterogeneous information and incorporates RNNs with temporal attention to capture temporal dependencies. For the models designed for homogeneous graphs, we ignore the graph heterogeneity and directly feed the whole graph into the learning algorithms.

We employ the Adam optimizer with the learning rate set to 5e-3 and weight decay set to 5e-4. For the other parameters, we set the dropout rate to 0.2, the number of GNN layers to 2, and the hidden embedding dimension to 32 for OGBN-MAG and 8 for COVID-19; we use ReLU as the activation function. We train all models for a fixed 500 epochs and use an early stopping strategy with a patience of 50; that is, training stops when the validation loss does not decrease for 50 consecutive epochs, and the model with the best validation loss is selected. All models are trained five times, and the mean and standard deviation of the test performance are reported. All baselines and the proposed HTGNN are implemented with Python 3.7.10, PyTorch 1.8.1, and Deep Graph Library (DGL) 0.6.0. Experiments are conducted on a machine equipped with an i9-9900K processor, two RTX 2080Ti graphics cards, and 64 GB of RAM.
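The stated optimization setup translates into a compact training loop. This sketch is ours (`model`, `loss_fn`, and the data arguments are hypothetical placeholders); it implements Adam with lr 5e-3 and weight decay 5e-4, at most 500 epochs, and early stopping with a patience of 50 on the validation loss:

```python
import copy
import torch

def train(model, loss_fn, train_data, val_data, epochs=500, patience=50):
    opt = torch.optim.Adam(model.parameters(), lr=5e-3, weight_decay=5e-4)
    best_val, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(epochs):
        model.train()
        opt.zero_grad()
        loss = loss_fn(model, train_data)  # cross-entropy or MAE, per task
        loss.backward()
        opt.step()
        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model, val_data).item()
        if val_loss < best_val:  # keep the best model on validation loss
            best_val, bad_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # no improvement for 50 epochs
                break
    model.load_state_dict(best_state)
    return model
```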
We conduct the link prediction experiment on the OGBN-MAG dataset to evaluate the performance of the different methods. We split the dataset into three sets with a ratio of 8:1:1: the graphs of 2010-2017 are used for training, the graph of 2018 for validation, and the graph of 2019 for testing. We frame the task as new co-author link prediction, where a new co-author relation is a co-author link that exists in year T+1 but not in year T. We randomly select 10% of the new co-author links as positive samples and, following the standard practice of learning-based link prediction, randomly sample the same number of nonexistent co-author links as negative samples. We set the time window size to 3; that is, to predict the co-author relations in the next year, we consider the HTG of the past three years. Note that for static graph models, which do not consider temporal dependency, we simply set the time window to 1. For a pair of authors, after obtaining their embeddings via HTGNN, we feed their concatenation into Eq. (4.17) for training with the cross-entropy loss. Similar to [39, 40], we adopt the widely used AUC score (the Area Under the receiver operating characteristic Curve) and AP score (Average Precision) to measure the co-author link prediction performance. The experimental results, with means and standard deviations reported, are shown in Table 2. We draw the following conclusions from the results: (1) Both sequence and static graph models achieve satisfactory results, which indicates that the temporal and the spatial dependencies depicted by these two types of methods both contribute to the co-author link prediction problem. (2) Dynamic graph models improve the performance by taking the information in both the spatial and temporal domains into consideration. (3) GNNs designed for heterogeneous graphs (i.e., RGCN, HGT, HDGAN, DyHATR) perform better than homogeneous GNNs (i.e., GCN, GAT, CoGNN, DySAT), which demonstrates the advantage of incorporating graph heterogeneity. (4) Our proposed HTGNN, which jointly models the heterogeneous spatial dependencies and temporal dimensions, consistently outperforms all baselines.

The node regression task is conducted on the COVID-19 dataset, where we perform state-level daily new case forecasting. We again split the dataset into training, validation, and testing sets with a ratio of 8:1:1: 05/01/2020-12/30/2020 is used for training, 12/31/2020-01/29/2021 for validation, and 01/30/2021-02/28/2021 for testing. In this task, we set the time window to 7 (using the past one week of historical data for forecasting); as before, we set it to 1 for static graph models. As suggested in [11, 5], MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error) are adopted to measure the performance, and we report the average MAE and RMSE over the 54 states in the US. The experimental results, with means and standard deviations reported, are shown in Table 2. Besides conclusions similar to those drawn from the link prediction task, we notice that the sequence models yield better performance than the static graph models. This phenomenon indicates that the temporal domain contributes more to the COVID-19 forecasting task than the spatial domain: the graph structures remain unchanged in this case, while the node features (i.e., daily new cases) change remarkably over time.

In HTGNN, we propose to model heterogeneous spatial dependencies and temporal dimensions jointly. To evaluate this design, we establish two variants of HTGNN for comparison that handle these two types of dependencies serially. HTGNN-ST processes spatial dependencies first and temporal dependencies later, by first performing multi-layer intra- and inter-relation aggregations in each graph slice and then applying across-time aggregation across all graph slices. HTGNN-TS analyzes the two domains in the reverse order, by first employing across-time aggregation in the temporal domain and then conducting multi-layer spatial aggregation on the last graph slice. The experimental results shown in Table 3 demonstrate that: (1) HTGNN-ST outperforms HTGNN-TS on the OGBN-MAG dataset but is less effective on the COVID-19 dataset. This is because HTGNN-ST, which puts more emphasis on the spatial domain, fits better with OGBN-MAG and its dynamically evolving graph structures; conversely, HTGNN-TS, which pays more attention to the temporal domain, is more suitable for COVID-19 with its changing node features. (2) HTGNN achieves better performance than both variants on both datasets, showing that our proposed holistic model is agnostic to graph characteristics and delivers superior performance.

We then perform additional ablation studies to evaluate the three major components of HTGNN: the intra-relation, inter-relation, and across-time aggregation modules. Accordingly, we prepare three variants to examine the effect of each component: HTGNN w/o Intra, HTGNN w/o Inter, and HTGNN w/o Across replace the intra-relation, inter-relation, and across-time aggregation module in each layer, respectively, with a mean pooling mechanism. The experimental results are shown in Figure 3. We observe that HTGNN equipped with all three components achieves the best performance, which shows that each component contributes to the final performance.
It is also worth noting that HTGNN w/o Intra and HTGNN w/o Inter work better than HTGNN w/o Across on the COVID-19 dataset but yield worse results on the OGBN-MAG dataset. We attribute this to the distinct characteristics of the datasets: for OGBN-MAG, the heterogeneous graph structures evolve dynamically, which increases the difficulty of capturing spatial dependencies; in contrast, for COVID-19, the temporal dependencies are relatively harder to capture, as the node features change constantly while the graph structures remain unchanged.

Figure 3: Evaluation of each component in HTGNN.

In this section, we investigate HTGNN's sensitivity to key hyper-parameters. Model depth. We vary the model depth (i.e., the number of heterogeneous temporal aggregation layers) from 1 to 5 to examine the model's performance on the two datasets. The experimental results, with means and standard deviations reported, are shown in Figure 4(a)-(b). With the increase of model depth, the performance of HTGNN first improves and then gradually decreases; this phenomenon is attributed to the over-smoothing problem. Embedding dimension. We vary the embedding dimension from 4 to 64 for OGBN-MAG and from 2 to 32 for COVID-19 to investigate its influence. The comparison results are illustrated in Figure 4(c)-(d). Increasing the embedding dimension initially improves the performance, since a larger dimension can preserve more information; however, when the dimension is too large, the model suffers from over-fitting, which reduces performance. Time window size. We validate the effect of the time window size by varying it from 2 to 6 for OGBN-MAG and from 5 to 13 for COVID-19. The results are shown in Figure 4(e)-(f). A larger time window boosts the performance, as more historical information is included; however, further enlarging the window yields fluctuating performance.

In this paper, we study the representation learning problem on heterogeneous temporal graphs (HTGs), a general concept for modeling heterogeneous and constantly evolving graph data. We propose the heterogeneous temporal graph neural network (HTGNN), a holistic framework tailored to heterogeneity with evolution in time and space for HTG representation learning. In particular, HTGNN consists of several heterogeneous temporal aggregation layers, each of which employs a hierarchical aggregation mechanism, including intra-relation, inter-relation, and across-time aggregation modules, to jointly model heterogeneous spatial dependencies and temporal dimensions. Extensive experiments are conducted on two built HTGs: OGBN-MAG with dynamically evolving heterogeneous structures, and COVID-19 with constantly changing node features. Promising results demonstrate the strong performance of HTGNN in comparison with state-of-the-art baselines.
[1] Open graph benchmark: Datasets for machine learning on graphs
[2] Heterogeneous graph attention network
[3] Attributed social network embedding
[4] Graph neural networks for social recommendation
[5] Cola-GNN: Cross-location attention based graph neural networks for long-term ILI prediction
[6] Examining COVID-19 forecasting using spatio-temporal graph neural networks
[7] Semi-supervised classification with graph convolutional networks
[8] Graph attention networks
[9] MAGNN: Metapath aggregated graph neural network for heterogeneous graph embedding
[10] Heterogeneous graph transformer
[11] Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting
[12] DySAT: Deep neural representation learning on dynamic graphs via self-attention networks
[13] Continuous-time dynamic network embeddings
[14] DHNE: Network representation learning method for dynamic heterogeneous networks
[15] Modeling dynamic heterogeneous network for link prediction using hierarchical attention with temporal RNN
[16] Dynamic heterogeneous graph neural network for real-time event prediction
[17] Learning and updating node embedding on dynamic heterogeneous information network
[18] Graph WaveNet for deep spatial-temporal graph modeling
[19] HetETA: Heterogeneous information network embedding for estimating time of arrival
[20] Dynamic heterogeneous graph embedding using hierarchical attentions
[21] Modeling relational data with graph convolutional networks
[22] Heterogeneous graph neural network
[23] Multi-view self-supervised heterogeneous graph embedding
[24] Heterogeneous graph structure learning for graph neural networks
[25] Heterogeneous graph neural networks for malicious account detection
[26] Disentangled representation learning in heterogeneous information network for large-scale Android malware detection in the COVID-19 era and beyond
[27] Alpha-Satellite: An AI-driven system and benchmark datasets for hierarchical community-level risk assessment to help combat COVID-19
[28] Metagraph aggregated heterogeneous graph neural network for illicit traded product identification in underground market
[29] Community mitigation: A data-driven system for COVID-19 risk assessment in a hierarchical manner
[30] Inductive representation learning on temporal graphs
[31] Dynamic heterogeneous information network embedding with meta-path based proximity
[32] LIME: Low-cost incremental learning for dynamic heterogeneous information networks
[33] Heterogeneous dynamic graph attention network
[34] Heterogeneous temporal graph transformer: An intelligent system for evolving Android malware detection
[35] Dynamic heterogeneous graph embedding via heterogeneous Hawkes process
[36] metapath2vec: Scalable representation learning for heterogeneous networks
[37] Attention is all you need
[38] Long short-term memory
[39] Link prediction based on graph neural networks
[40] Heterogeneous hypergraph variational autoencoder for link prediction