key: cord-0171394-doeahx19
authors: Wu, Qitian; Zhang, Hengrui; Yan, Junchi; Wipf, David
title: Handling Distribution Shifts on Graphs: An Invariance Perspective
date: 2022-02-05
journal: nan
DOI: nan
sha: c559fd26624401e8bf1586c1f08e8de560b41a21
doc_id: 171394
cord_uid: doeahx19

There is increasing evidence suggesting neural networks' sensitivity to distribution shifts, so research on out-of-distribution (OOD) generalization has come into the spotlight. Nonetheless, current endeavors mostly focus on Euclidean data, and the formulation for graph-structured data is unclear and remains under-explored, given two fundamental challenges: 1) the inter-connection among nodes in one graph, which induces non-IID generation of data points even under the same environment, and 2) the structural information in the input graph, which is also informative for prediction. In this paper, we formulate the OOD problem on graphs and develop a new invariant learning approach, Explore-to-Extrapolate Risk Minimization (EERM), that facilitates graph neural networks to leverage invariance principles for prediction. EERM resorts to multiple context explorers (specified as graph structure editors in our case) that are adversarially trained to maximize the variance of risks from multiple virtual environments. Such a design enables the model to extrapolate from a single observed environment, which is the common case for node-level prediction. We prove the validity of our method by theoretically showing its guarantee of a valid OOD solution and further demonstrate its power on various real-world datasets for handling distribution shifts from artificial spurious features, cross-domain transfers and dynamic graph evolution.

As the demand for handling in-the-wild unseen instances draws increasing concern, out-of-distribution (OOD) generalization (Mansour et al., 2009; Blanchard et al., 2011; Muandet et al., 2013; Gong et al., 2016) occupies a central role in the ML community. Yet, recent evidence suggests that deep neural networks can be sensitive to distribution shifts, exhibiting unsatisfactory performance within new environments, e.g., Beery et al. (2018); Su et al. (2019); Recht et al. (2019); Mancini et al. (2020). A more concerning example is that a model for COVID-19 detection exploits undesired 'shortcuts' from data sources (e.g., hospitals) to boost training accuracy (DeGrave et al., 2021). Such a problem is hard to solve since the observations in training data cannot cover all the environments encountered in practice. Namely, the actual demand is to generalize a model trained with data from p(x, y|e = e_1) to new data from p(x, y|e = e_2). Recent research opens a new possibility via learning domain-invariant models (Arjovsky et al., 2019) under a cornerstone data-generating assumption: there exists a portion of information in x that is invariant for prediction on y across different environments. Based on this, the key idea is to learn an equipredictive representation model h that gives rise to an equal conditional distribution p(y|h(x), e = e) for all e ∈ E. The implication is that such a representation h(x) will bring up equally (optimal) performance for a downstream classifier under arbitrary environments. A model p(y|x) with such a property is called an invariant model/predictor.
Several up-to-date studies develop new objective designs and algorithms for learning invariant models, showing promising power for tackling OOD generalization (Chang et al., 2020; Ahuja et al., 2020; Krueger et al., 2021; Liu et al., 2021; Creager et al., 2021; Koyama & Yamaguchi, 2021). While the OOD problem has been extensively explored on Euclidean data (e.g., images), there are few existing works investigating the problem for graph-structured data, despite the fact that distribution shifts widely exist in real-world graphs. For instance, in citation networks, the distributions of paper citations (the input) and subject areas/topics (the label) undergo significant change over time (Hu et al., 2020). In social networks, the distributions of users' friendships (the input) and their activity (the label) depend heavily on when/where the networks are collected (Fakhraei et al., 2015). In financial networks (Pareja et al., 2020), the payment flows between transactions (the input) and the appearance of illicit transactions (the label) have strong correlation with external contextual factors (like time and market). In these cases, neural models built on graph-structured data, particularly Graph Neural Networks (GNNs), which are the common choice, need to effectively deal with OOD data at test time. Moreover, as GNNs have become popular and easy-to-implement tools for modeling relational structures in broad AI areas (vision, texts, audio, etc.), enhancing their robustness to distribution shifts is a pain point for building general AI systems, especially in high-stakes applications like autonomous driving (Dai & Gool, 2018), medical diagnosis (AlBadawy et al., 2018), criminal justice (Berk et al., 2018), etc.

Nonetheless, compared with images or texts, graph-structured data exhibits two fundamental differences. First, many graph-related problems (like the situations mentioned above) involve prediction tasks for individual nodes, in which case the data points are inter-connected via the graph structure, inducing a non-independent and non-identically distributed nature in data generation even within the same environment. Second, apart from node features, the structural information also plays a role in prediction and affects how the model generalizes under environment variation. These differences bring up unique technical challenges for handling distribution shifts on graphs.

In this paper, we endeavor to 1) formulate the OOD problem for node-level tasks on graphs, 2) develop a new learning approach based on an invariance principle, 3) provide theoretical results to dissect its rationale, and 4) design comprehensive experiments to show its practical efficacy.

Figure 1: (a) The proposed approach Explore-to-Extrapolate Risk Minimization, which entails K context generators that generate graph data of different (virtual) environments based on input data from a single (real) environment. The GNN model is updated via gradient descent to minimize a weighted combination of the mean and variance of risks from different environments, while the context generators are updated via REINFORCE to maximize the variance loss. (b) Illustration of our Assumption 1. (c) The dependence among variables in the motivating example in Section 3.1.

In this section, we present our formulation of the OOD problem on graphs. All random variables are denoted as bold letters, while the corresponding realizations are denoted as thin letters.
An input graph G = (A, X) contains two-fold information: an adjacency matrix A = {a_vu | v, u ∈ V} and node features X = {x_v | v ∈ V}, where V denotes the node set. Apart from these, each node in the graph has a label, which can be represented as a vector Y = {y_v | v ∈ V}. We define G as a random variable of input graphs and Y as a random variable of node label vectors. Such a definition takes a global view and treats the input graph as a whole. Based on this, one can adapt the definition of the general OOD problem in Eq. 1 via instantiating the input as G and the target as Y, and then the data generation can be characterized as p(G, Y|e) = p(G|e)p(Y|G, e), where e is a random variable of environments, a latent variable that impacts the data distribution.

However, the above definition makes little sense for node-level problems, where in most cases there is a single input graph that contains a massive number of nodes. To make the problem-solving reasonable, we instead take a local view and investigate each node's ego-graph that has influence on the centered node. Assume v to be a random variable of nodes. We define node v's L-hop neighbors as N_v (where L is an arbitrary integer) and the nodes in N_v form an ego-graph G_v, which consists of a (local) node feature matrix X_v and a (local) adjacency matrix A_v. We use G_v as a random variable of ego-graphs, whose realization is G_v = (A_v, X_v). Besides, we define y as a random variable of node labels. In this way, we can fragment a whole graph into a set of instances {(G_v, y_v)}_{v∈V}, where G_v denotes an input and y_v is a target. Notice that the ego-graph can be seen as a Markov blanket for the centered node, so the conditional distribution p(Y|G, e) can be decomposed as a product of |V| independent and identical marginal distributions p(y|G_v, e).

Therefore, the generation of {(G_v, y_v)}_{v∈V} from a distribution p(G, Y|e) can be considered a two-step procedure: 1) the entire input graph is generated via G ∼ p(G|e), which can then be fragmented into a set of ego-graphs {G_v}_{v∈V}; 2) each node's label is generated via y ∼ p(y|G_v = G_v, e). Then the OOD node-level prediction problem can be formulated as: given training data {G_v, y_v}_{v∈V} from p(G, Y|e = e), the model needs to handle testing data {G_v, y_v}_{v∈V} from a new distribution p(G, Y|e = e′). Denote E as the support of environments, f as a predictor model with ŷ = f(G_v), and l(·, ·) as a loss function. More formally, the OOD problem can be written as:

min_f max_{e∈E} E_{G∼p(G|e=e)} [ (1/|V|) Σ_{v∈V} E_{y∼p(y|G_v=G_v, e=e)} [ l(f(G_v), y) ] ]. (2)

We remark that the first-step sampling G ∼ p(G|e = e) can be ignored, since in most cases one only has a single input graph in the context of node-level prediction tasks.

Solving the OOD problem in Eq. 2 is impossible without any prior domain knowledge or structural assumptions, since one only has access to data from limited environments in the training set. Recent studies (Rojas-Carulla et al., 2018; Arjovsky et al., 2019) propose to learn invariant predictor models, which resort to an assumption on the data-generating process: the input instance contains a portion of features (i.e., invariant features) that 1) contributes sufficient predictive information for the target and 2) gives rise to equally (optimal) performance of the downstream classifier across environments. With our definition in Section 2.1, for node-level prediction on graphs, each input instance is an ego-graph G_v with target label y_v.
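To make the above fragmentation concrete, below is a minimal sketch of extracting the instances {(G_v, y_v)} via L-hop ego-graphs. It assumes PyTorch Geometric's k_hop_subgraph utility; the function name extract_ego_graphs is ours for illustration and not part of any released implementation.

import torch
from torch_geometric.utils import k_hop_subgraph

def extract_ego_graphs(edge_index, x, y, num_hops=2):
    """Fragment a whole graph into instances {(G_v, y_v)}: for each node v,
    collect its L-hop neighborhood N_v, the local features X_v and the local
    adjacency A_v (as a relabeled edge list), paired with the center label y_v."""
    ego_graphs = []
    for v in range(x.size(0)):
        # subset: nodes in N_v; sub_edge_index: local adjacency A_v;
        # center: position of v inside the relabeled subgraph
        subset, sub_edge_index, center, _ = k_hop_subgraph(
            v, num_hops, edge_index, relabel_nodes=True, num_nodes=x.size(0))
        ego_graphs.append((x[subset], sub_edge_index, center, y[v]))
    return ego_graphs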
It seems not straightforward to define invariant features on graphs, given two observations: 1) the ego-graph possesses a hierarchical structure over the associated nodes (i.e., G_v induces a BFS tree rooted at v where the l-th layer contains the l-th-order neighboring nodes N_v^(l)), and 2) the nodes in each layer are permutation-invariant and variable-length. Inspired by the WL test (Weisfeiler & Lehman, 1968), we extend the invariance assumption (Rojas-Carulla et al., 2018; Gong et al., 2016; Arjovsky et al., 2019) to accommodate structural information in graph data:

Assumption 1. (Invariance Property) Assume the input feature dimension is d_0. There exists a sequence of (non-linear) functions {h_l^*}_{l=0}^L and a permutation-invariant aggregation function Γ that recursively aggregate node features along the BFS tree of G_v to produce a node-level readout r_v. Denote r as a random variable of r_v; it satisfies 1) (Invariance condition): p(y|r, e) = p(y|r), and 2) (Sufficiency condition): y = c*(r) + n, where c* is a non-linear function and n is an independent noise.

A more intuitive illustration of the above computation is presented in Fig. 1(b). The node-level readout r_v aggregates the information from neighboring nodes recursively along the structure of the BFS tree given by G_v. Essentially, the above definition assumes that in each layer the neighboring nodes contain a portion of causal features that contribute to stable prediction of y across different e. Such a definition possesses two merits: 1) the (non-linear) transformation h_l^* can differ across layers, and 2) for an arbitrary node u in the original graph G, its causal effect on distinct centered nodes v can differ depending on its relative position in the ego-graph G_v. Therefore, this formulation provides enough flexibility and capacity for modeling graph data.

We next present our solution to the challenging OOD problem. Before going into the formal method, we first introduce a motivating example based on Assumption 1 to provide some high-level intuition. We consider a linear toy example and assume 1-layer graph convolution for illustration. Namely, the ego-graph G_v (and N_v) only contains the centered node and its 1-hop neighbors. We simplify h* and c* in Assumption 1 as identity mappings and instantiate Γ as a mean pooling function. Then we assume 2-dim node features x_v = [x_v^1, x_v^2], generated as y_v = (1/|N_v|) Σ_{u∈N_v} x_u^1 + n_v^1 and x_v^2 = y_v + n_v^2 + ε, where n_v^1 and n_v^2 are independent standard normal noise and ε is a random variable with zero mean and non-zero variance dependent on environment e. In Fig. 1(c) we show the dependency among these random variables in a graphical representation and instantiate them in an example of citation networks, where a paper's publication venue is an invariant feature for predicting the paper's sub-area, while its citation index (a spurious feature) is affected by both the label and the environment.

Based on this, we consider a vanilla GCN as the predictor model: ŷ_v = (1/|N_v|) Σ_{u∈N_v} (θ_1 x_u^1 + θ_2 x_u^2). The ideal solution for the predictor model is [θ_1, θ_2] = [1, 0], which indicates that the GCN identifies the invariant feature, i.e., x_v^1, insensitive to environment changes. However, we can show a negative result for standard empirical risk minimization: the unique solution minimizing the expected risk over environments has θ_2 ≠ 0 (see the derivation in Appendix C). This indicates that directly minimizing the expectation of risks across environments would inevitably lead the model to rely on the spurious correlation (x_v^2 depends on environments). Also, such reliance would be strengthened with smaller σ_e, i.e., when there is less uncertainty in the effect from environments.
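For a numerical illustration of this negative result, the following small simulation (ours) instantiates the toy generation as stated above with unit-variance noise, working directly with the aggregated features, and fits [θ_1, θ_2] by pooled least squares, i.e., ERM; the weight on the spurious feature stays clearly away from zero.

import numpy as np

rng = np.random.default_rng(0)

def sample_env(sigma_e, n=50000):
    """One environment of the toy model: a1 is the aggregated invariant
    feature, y depends only on a1, and the spurious aggregate a2 depends
    on both y and the environment-dependent noise eps."""
    a1 = rng.normal(size=n)
    n1, n2 = rng.normal(size=n), rng.normal(size=n)
    eps = rng.normal(scale=sigma_e, size=n)
    y = a1 + n1
    a2 = y + n2 + eps
    return np.stack([a1, a2], axis=1), y

# Pool several environments and fit [theta_1, theta_2] by least squares (ERM)
Xs, ys = zip(*(sample_env(s) for s in (0.5, 1.0, 2.0)))
X, y = np.concatenate(Xs), np.concatenate(ys)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # theta_2 is clearly nonzero, close to 1/(2 + E[sigma_e^2]) here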
Fortunately, to mitigate the issue, we can prove another result that implies a new objective as a sufficient condition for the ideal solution. The new objective tackles the variance across environments and guarantees the desirable solution. The insight is that if the model yields equal performance on different e's, it would manage to leverage the invariant features, which motivates us to devise a new objective for solving Eq. 2.

We now return to the general case, where we have {(G_v, y_v)} for training and leverage a GNN model as the predictor: ŷ_v = f_θ(G_v). The intuition in Section 3.1 implies a new learning objective:

min_θ Var_e({L(G^e, Y^e; θ)}) + β · E_e[L(G^e, Y^e; θ)], (4)

where L(G^e, Y^e; θ) = (1/|V_e|) Σ_{v∈V_e} l(f_θ(G^e_v), y^e_v) and β is a trading hyper-parameter. If we have training graphs from a sufficient number of environments E_tr = {e}, together with the correspondence of each graph to a specific e, i.e., {G^e, Y^e}_{e∈E_tr}, which induces {{G^e_v, y^e_v}_{v∈V_e} : e ∈ E_tr}, we can use the empirical estimation with risks from different environments to handle Eq. 4 in practice, as is done by the Risk Extrapolation (REX) approach (Krueger et al., 2021). Unfortunately, as mentioned before, for node-level tasks on graphs the training data is often a single graph (without any correspondence of nodes to environments), and hence one only has training data from a single environment. Exceptions are some multi-graph scenarios where one can assume each graph is from an environment, but even there the number of training graphs is very limited (e.g., fewer than five). The objective in Eq. 4 would require data from diverse environments to enable the model's desirable extrapolation. To circumvent this dilemma, we introduce K auxiliary context generators g_{w_k}(G) (k = 1, ..., K) that aim to generate K-fold graph data {G^k}_{k=1}^K (which induces {{G^k_v}_{v∈V} : 1 ≤ k ≤ K}) based on the input graph G, mimicking training data from different environments. The generators are trained to maximize the variance loss so as to explore the environments and facilitate stable learning of the GNN:

min_θ Var({L(G^k, Y; θ) : 1 ≤ k ≤ K}) + (β/K) Σ_{k=1}^K L(G^k, Y; θ), (5)

where G^k = g_{w*_k}(G) and [w*_1, ..., w*_K] = arg max_{w_1,...,w_K} Var({L(g_{w_k}(G), Y; θ) : 1 ≤ k ≤ K}).

One remaining problem is how to specify g_{w_k}(G). Following recent advances in adversarial robustness on graphs (Xu et al., 2019; Jin et al., 2020), we consider editing graph structures by adding/deleting edges. Assume a Boolean matrix B^k ∈ {0, 1}^{N×N} (k = 1, ..., K) and denote the supplement graph of A as Ā = 11^⊤ − I − A, where I is an identity matrix. Then the modified graph for view k is A^k = A + B^k ∘ (Ā − A), where ∘ denotes the element-wise product. The optimization for B^k is difficult due to its non-differentiability, and one also needs to constrain the modification within a threshold. To handle this, we use the policy gradient method REINFORCE, treating graph generation as a decision process and edge editing as actions (see details in Appendix A). We call our approach in Eq. 5 Explore-to-Extrapolate Risk Minimization (EERM) and present our training algorithm in Alg. 1.
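A minimal sketch (ours) of one round of the resulting bilevel optimization is given below; gnn and generators are placeholders, the loss is assumed to be cross-entropy, and the generators' REINFORCE update (sketched in Appendix A) is assumed to have been applied beforehand.

import torch
import torch.nn.functional as F

def eerm_round(gnn, generators, A, X, y, beta, opt_gnn):
    """One outer round of EERM (sketch): K context generators produce K
    edited views (virtual environments); the GNN descends on
    Var + beta * Mean of the per-view risks (Eq. 5)."""
    risks = []
    for g in generators:
        A_k = g(A)                                # edited adjacency for view k
        risks.append(F.cross_entropy(gnn(X, A_k), y))
    risks = torch.stack(risks)
    loss = risks.var() + beta * risks.mean()      # variance + mean of risks
    opt_gnn.zero_grad(); loss.backward(); opt_gnn.step()
    return loss.item()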
We next present theoretical analysis to shed insight on the objective and its relationship with our formulated OOD problem in Section 2.1. To begin with, we introduce some building blocks. The GNN model f can be decomposed into an encoder h for representation and a classifier c for prediction, i.e., f = c ∘ h, with z = h(G_v) and ŷ = c(z). Besides, we assume I(x; y) stands for the mutual information between x and y, and I(x; y|z) denotes the conditional mutual information given z. To keep notation simple, we define p_e(·) = p(·|e = e) and I_e(·) = I(·|e = e).

Another tricky point is that the computation of the KL divergence and mutual information requires samples from the joint distribution p_e(G, Y), which again raises the difficulty of handling data generation over inter-connected nodes. Therefore, we adopt our perspective in Section 2.1 and consider the two-step sampling procedure. Concretely, for any probability functions f_1, f_2 associated with ego-graphs G_v and node labels y, we define the computation of the KL divergence as

D_KL(f_1 ∥ f_2) = E_{G∼p_e(G)} [ (1/|V|) Σ_{v∈V} E_{y∼f_1(y|G_v)} [ log f_1(y|G_v) − log f_2(y|G_v) ] ]. (6)

We will show that the objective in Eq. 4 can guarantee a valid solution to the OOD problem in Eq. 2. To this end, we rely on another assumption on the data-generating distribution.

Assumption 2. (Environment Heterogeneity) For (G_v, r) that satisfies Assumption 1, there exists a random variable r̄ such that G_v = m(r, r̄), where m is a functional mapping. We assume that p(y|r̄, e = e) would arbitrarily change across environments e ∈ E.

Assumptions 1 and 2 essentially distill two portions of features in the input data: one is domain-invariant for prediction and the other contributes to environment-sensitive prediction. The GNN model f = c ∘ h induces two model distributions, q(z|G_v) (by the encoder) and q(y|z) (by the classifier). Based on this, we can dissect the effects of Eq. 4, which indeed forces the representation z to satisfy the invariance and sufficiency conditions illustrated in Assumption 1.

Theorem 1. If q(y|z) is treated as a variational distribution, then 1) minimizing the expectation term in Eq. 4 contributes to max_{q(z|G_v)} I(y; z), i.e., enforcing the sufficiency condition on z for prediction, and 2) minimizing the variance term in Eq. 4 plays a role in min_{q(z|G_v)} I(y; e|z), i.e., enforcing the invariance condition p(y|z, e) = p(y|z).

Based on these results, we can bridge the gap between the invariance principle and the OOD problem.

Theorem 2. Under Assumptions 1 and 2, if the GNN encoder q(z|G_v) satisfies that 1) I(y; e|z) = 0 (invariance condition) and 2) I(y; z) is maximized (sufficiency condition), then the model f* given by E[y|z] is the solution to the OOD problem in Eq. 2.

The above results imply that the objective in Eq. 4 can guarantee a valid solution to the formulated OOD problem on graph-structured data, which serves as a theoretical justification for our approach. We proceed to analyze the OOD generalization error given by our learning approach. Recall that we assume training data from p(G, Y|e = e) and testing data from p(G, Y|e = e′). Following similar spirits of Federici et al. (2021), the training error and OOD generalization error can be respectively measured by D_KL(p_e(y|G_v) ∥ q(y|G_v)) and D_KL(p_{e′}(y|G_v) ∥ q(y|G_v)), which can be computed based on our definition in Eq. 6. Based on Theorem 1, we arrive at the following theorem, which reveals the effect of Eq. 4 in tightening the bound for the OOD error.

Theorem 3. Optimizing Eq. 4 with training data can minimize the upper bound of the OOD error D_KL(p_{e′}(y|G_v) ∥ q(y|G_v)), on condition that I_e(G_v; y|z) = I_{e′}(G_v; y|z). The condition can be satisfied once z is a sufficient representation across environments.

Therefore, we have proven that the new objective can help to reduce the generalization error on out-of-distribution data and indeed enhance the GNN model's power for in-the-wild extrapolation.

In this section, we aim to verify the effectiveness and robustness of our approach in a wide variety of tasks reflecting real situations, using different GNN backbones.
Table 1 summarizes the information on the experimental datasets and evaluation protocols, and we provide more dataset information in Appendix E. We compare our approach EERM with standard empirical risk minimization (ERM). Implementation details are presented in Appendix F. In the following subsections, we investigate three scenarios that require the model to handle distribution shifts stemming from different causes.

Table 1: Summary of the experimental datasets that entail diverse distribution shifts ("Artificial Transformation" means that we add synthetic spurious features, "Cross-Domain Transfers" means that each graph in the dataset corresponds to a distinct domain, "Temporal Evolution" means that the dataset is a dynamic one with an evolving nature), different train/val/test splits ("Domain-Level" means splitting by graphs and "Time-Aware" means splitting by time) and the evaluation metrics. In Appendix E we provide more detailed information and discussions on the evaluation protocols.

We first consider artificial distribution shifts based on two public node classification benchmarks, Cora and Amazon-Photo. For each dataset, we adopt two randomly initialized GNNs to 1) generate node labels based on the original node features and 2) generate spurious features based on the node labels and environment id, respectively (see Appendix E.1 for details). We generate 10-fold graph data with distinct environment id's and use 1/1/8 of them for training/validation/testing. We use a 2-layer vanilla GCN (Kipf & Welling, 2017) as the GNN model. We report results on the 8 testing graphs (T1∼T8) of the two datasets in Fig. 2(a) and 3(a), respectively, where we also adopt 2-layer GCNs for data generation. The results show that our approach EERM consistently outperforms ERM on Cora and Photo, which suggests the effectiveness of our approach for handling distribution shifts.

We also observe that in Photo, the performance variances within one graph and across different test graphs are both much lower compared with those in Cora. We conjecture the reasons are two-fold. First, there is evidence that in Cora the (original) features from adjacent nodes are indeed informative for prediction, while in Photo this information contributes negligible gain over merely using the centered node's features. Based on this, once the node features are mixed up with invariant and spurious components, it would be harder to distinguish them in the former case, which relies more on graph convolution.

In Fig. 2(b) and Fig. 3(b), we compare the averaged training accuracy (achieved by the epoch with the highest validation accuracy) given by the two approaches when using all the input features and when removing the spurious ones for inference (we still use all the features for training in the latter case). As we can see, the performance of ERM drops much more significantly than EERM when we remove the spurious input features, which indicates that the GCN trained with standard ERM indeed exploits spurious features to increase training accuracy, while our approach helps to alleviate such an issue and guides the model to focus on invariant features. Furthermore, in Fig. 2(c) and Fig. 3(c), we compare the test accuracy averaged over eight graphs when using different GNNs, e.g., GCN, SGC (Wu et al., 2019) and GAT (Velickovic et al., 2018), for data generation (see Appendix G for more results). The results verify that our approach achieves consistently superior performance in different cases.
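The diagnostic behind Fig. 2(b) and 3(b) can be sketched as follows (our sketch; d_inv, the number of original feature dimensions in X_1, is a name we introduce):

import torch

@torch.no_grad()
def acc_without_spurious(model, X, A, y, d_inv):
    """Zero out the generated spurious features X2 (columns >= d_inv) at
    inference time, while the model was trained on the full features; a
    large accuracy drop indicates reliance on the spurious part."""
    X_masked = X.clone()
    X_masked[:, d_inv:] = 0.0
    pred = model(X_masked, A).argmax(dim=-1)
    return (pred == y).float().mean().item()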
We proceed to consider another scenario, where there are multiple observed graphs in one dataset and a model trained with one graph or a limited number of graphs is expected to generalize to new unseen graphs. The graphs of a dataset share the input feature space and output space and may have different sizes and data distributions, since they are collected from different domains. We adopt two public social network datasets, Twitch-Explicit and Facebook-100, collected by Lim et al. (2021).

Training with a Single Graph. In Twitch, we adopt a single graph DE for training, ENGB for validation and the remaining five networks (ES, FR, PTBR, RU, TW) for testing. We follow Lim et al. (2021) and use test ROC-AUC for evaluation. We specify the GNN model as GCN, GAT and a recently proposed model, GCNII (Chen et al., 2020a), that can address the over-smoothing of GCN and enable stacking of deep layers. The layer numbers are set as 2 for GCN and GAT and 10 for GCNII. Fig. 4 compares the results on the five test graphs, where EERM significantly outperforms ERM in most cases, with up to 7.0% improvement in ROC-AUC. The results verify the effectiveness of EERM for generalizing to new graphs from unseen domains.

Training with Multiple Graphs. In FB-100, we adopt three graphs for training, two graphs for validation and the remaining three for testing. We also follow Lim et al. (2021) and use test accuracy for evaluation. We use GCN as the backbone and compare different configurations of training graphs, as shown in Table 2. We can see that EERM outperforms ERM on average on all the test graphs, with up to 7.2% improvement. Furthermore, EERM maintains its superiority with different training graphs, which also verifies the robustness of our approach w.r.t. training data.

We consider a third scenario, where the input data are temporal dynamic graphs: the model is trained with a dataset collected at one time and needs to handle newly arrived data in the future. There are also two sub-cases that correspond to distinct real-world scenarios, as discussed below.

Handling Dynamic Graph Snapshots. We adopt a dynamic financial network dataset, Elliptic (Pareja et al., 2020), that contains dozens of graph snapshots, where each node is a Bitcoin transaction and the goal is to detect illicit transactions. We use 5/5/33 snapshots for train/valid/test. Following Pareja et al. (2020), we use the F1 score for evaluation. We consider two GNN architectures as the backbone: GraphSAGE (Hamilton et al., 2017) and GPRGNN (Chien et al., 2021), which can adaptively combine information from node features and graph topology. The results are shown in Fig. 5, where we group the test graph snapshots into 9 folds in chronological order. Our approach yields much higher F1 scores than ERM in most cases, with average improvements of 9.6%/10.0% using GraphSAGE/GPRGNN as backbones. Furthermore, there is an interesting phenomenon that both methods suffer a performance drop after T7. The reason is that this is the time when the dark market shutdown occurred (Pareja et al., 2020). Such an emerging event causes considerable variation in data distributions, leading to performance degradation for both methods, with ERM suffering more. In fact, the emerging event acts as an external factor which is unpredictable given the limited training data. The results also suggest that how well neural models generalize to OOD data depends on the learning approach, but the performance limit is dominated by the observed data.
Nonetheless, our approach contributes to better F1 scores than ERM even in such an extreme case.

Handling New Nodes in a Temporally Augmented Graph. Citation networks often go through temporal augmentation as new papers are published. We adopt OGB-Arxiv (Hu et al., 2020) for experiments and enlarge the time difference between training and testing data to introduce distribution shifts: we select papers published before 2011 for training, in-between 2011 and 2014 for validation, and within 2014-2016/2016-2018/2018-2020 for testing. Also, different from the original (transductive) setting in Hu et al. (2020), we use the inductive learning setting, i.e., test nodes are strictly unseen during training, which is more akin to practical situations. Table 3 presents the test accuracy and shows that EERM outperforms ERM in five cases out of six. Notably, when using GPRGNN as the backbone, EERM manages to achieve up to 8.1% relative improvement, which shows that EERM is capable of improving the GNN model's ability to extrapolate to future data.

We compare with some closely related works and highlight our differences. Due to the space limit, we defer more discussion of the literature to Appendix B.

Generalization/Extrapolation on Graphs. Recent endeavors (Scarselli et al., 2018; Garg et al., 2020; Verma & Zhang, 2019) derive generalization error bounds for GNNs, yet they focus on in-distribution generalization and put little emphasis on distribution shifts, which are the main focus of our work. Furthermore, some up-to-date works explore GNNs' extrapolation ability for OOD data, e.g., unseen features/structures (Xu et al., 2021) and varying graph sizes (Yehudai et al., 2021; Bevilacqua et al., 2021). However, they mostly concentrate on graph-level tasks rather than node-level ones (see the detailed comparison below). Moreover, some recent works probe into extrapolating feature embeddings (Wu et al., 2021a) or user representations (Wu et al., 2021b) for open-world learning in tabular data or real systems for recommendation and advertisement. These works consider distribution shifts stemming from an augmented input space or unseen entities, and the proposed models leverage GNNs as an explicit model that extrapolates to compute representations for new entities based on existing ones. In contrast, our work studies (implicit) distribution shifts behind observed data, and the proposed approach resorts to an implicit mechanism through the lens of the invariance principle.

Node-Level vs. Graph-Level OOD Generalization. The two classes of graph-related problems are often treated differently in the literature: node-level tasks target prediction on individual nodes, which are non-i.i.d. generated due to the interconnection of graph structure; graph-level tasks treat a graph as an instance for prediction and tacitly assume that all the graph instances are sampled in an i.i.d. manner. Their distinct nature suggests that they need to be tackled in different ways for the purpose of OOD generalization. Graph-level problems have a straightforward relationship to the general setting (in Eq. 1), since one can treat input graphs as x and graph labels as y. Based on this, the data from one environment becomes a set of i.i.d. generated pairs (x, y), and existing approaches for handling general OOD settings can be naturally generalized. Differently, node-level problems, where the instances are nodes inter-connected in one graph (e.g., a social network), cannot be handled in this way due to the non-i.i.d. data generation.
Our work introduces a new perspective for formulating OOD problems involving prediction for individual nodes, backed up with a concrete approach for problem solving and theoretical guarantees. While our experiments mainly focus on node-level prediction datasets, the proposed method can be applied to graph-level prediction and also extended beyond graph data (like images/texts) for generalization from limited observed environments.

This work targets out-of-distribution generalization for graph-structured data, with a focus on node-level problems where the inter-connection of data points hinders trivial extension of existing formulations and methods. To this end, we take a fresh perspective to formulate the problem in a principled way and further develop a new approach for extrapolation from a single environment, backed up with theoretical guarantees. We also design comprehensive experiments to show the practical power of our approach on various real-world datasets with diverse distribution shifts.

We illustrate the details of using policy gradients for optimizing the graph editors in Eq. 5. Concretely, for view k, we consider a parameterized matrix P^k = {π^k_nm}. For the n-th node, the probability of editing the edge between it and the m-th node is p(a^k_nm) = exp(π^k_nm) / Σ_{m′} exp(π^k_nm′). We then sample s actions {b^k_{nm_t}}_{t=1}^s from a multinomial distribution M(p(a^k_n1), ..., p(a^k_nN)), which give the nonzero entries in the n-th row of B^k. The reward function R(G^k) can be defined as the inverse loss. We can use the REINFORCE algorithm to optimize the generator with the gradient ∇_{w_k} log p_{w_k}(A^k) R(G^k), where w_k = P^k and p_{w_k}(A^k) = Π_n Π_{t=1}^s p(b^k_{nm_t}). We present the training algorithm in Alg. 1.

Algorithm 1: Stable Learning for OOD Generalization in Node-Level Prediction on Graphs.
1 INPUT: training graph data G = (A, X) and Y, initialized parameters of GNN θ, initialized parameters of graph editors w = {w_k}, learning rates α_g, α_f.
2 while not converged or maximum epochs not reached do
3   for t = 1, ..., T do
4     Obtain modified graphs G^k = (A^k, X) from graph editor g_{w_k}, k = 1, ..., K;
5     Compute the risks L(G^k, Y; θ) for k = 1, ..., K and their variance;
6     Update each editor's parameters w_k via REINFORCE with learning rate α_g to maximize the variance;
7   end for
8   Recompute the risks on the edited graphs and update the GNN's parameters θ via gradient descent with learning rate α_f to minimize the variance plus β times the mean of risks;
9 end while
10 OUTPUT: trained parameters of GNN θ*.
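A condensed sketch (ours) of the editor update described above; the reward function and optimizer wiring are placeholders rather than the released implementation.

import torch
import torch.nn.functional as F

def editor_reinforce_step(P_k, reward_fn, opt_gen, s=5):
    """One REINFORCE step for graph editor k. P_k is the N x N logit matrix
    {pi^k_nm}; each node samples s edit targets from its row-wise categorical
    distribution, forming the Boolean mask B^k, and we ascend the policy
    gradient grad log p(A^k) * R(G^k)."""
    probs = F.softmax(P_k, dim=1)                   # p(a^k_nm) per row
    actions = torch.multinomial(probs, s)           # N x s sampled edit targets
    B_k = torch.zeros_like(probs).scatter_(1, actions, 1.0)
    reward = float(reward_fn(B_k))                  # e.g., inverse loss / risk variance
    log_p = torch.log(probs.gather(1, actions) + 1e-12).sum()
    loss = -log_p * reward                          # maximize log p(A^k) * R(G^k)
    opt_gen.zero_grad(); loss.backward(); opt_gen.step()
    return B_k.detach()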
Out-of-distribution generalization has drawn extensive attention in the machine learning community. To endow learning systems with the ability to handle unseen data from new environments, it is natural to learn invariant features under the setting of the causal factorization of physical mechanisms (Schölkopf et al., 2012; Peters et al., 2016). A recent work (Arjovsky et al., 2019) proposes Invariant Risk Minimization (IRM) as a practical solution to the OOD problem via the invariance principle. Based on this, follow-up works make solid progress in this direction, e.g., with group distributionally robust optimization (Sagawa et al., 2019), invariant rationalization (Chang et al., 2020), game theory (Ahuja et al., 2020), etc. Several works attempt to resolve extended settings. For instance, Ahmed et al. (2021) proposes to match the output distribution spaces from different domains via some divergence, while a recent work (Mahajan et al., 2021) also leverages a matching-based algorithm that resorts to shared representations of cross-domain inputs from the same object. Also, Creager et al. (2021) and Liu et al. (2021) point out that in most real situations one has no access to the correspondence of each data point in the dataset with a specific environment, based on which they propose to estimate the environments as a latent variable. Krueger et al. (2021) devises Risk Extrapolation (REX), which aims at minimizing a weighted combination of the variance and the mean of risks from multiple environments. Xie et al. (2020) contributes a similar objective from a different theoretical perspective. Also, Koyama & Yamaguchi (2021) extends the spirit of the MAML algorithm and arrives at a similar objective form. In our model, we also consider minimizing a combination of variance and mean terms (in Eq. 4), and on top of that we further propose to optimize through a bilevel framework in Eq. 5.

Compared to existing works, the differences of EERM are two-fold. First, we do not assume input data from multiple environments or the correspondence between each data point and a specific environment. Instead, our formulation enables learning and extrapolation from a single observed environment. Second, on the methodology side, we introduce multiple context generators that aim to generate data of virtual environments in an adversarial manner. Besides, our formulation in this paper focuses on prediction tasks for nodes on graphs, where the essential difference, as mentioned before, lies in the inter-connection of data points in one graph (which corresponds to an environment), hindering trivial adaptation of existing works in the general setting.

Another line of research related to ours attempts to enhance the generalization ability of graph neural networks via modifying graph structures. One category of recent works learns new graph structures based on the input graph and node features. To improve the generalization power, a common practice is to enforce a certain regularization on the learned graph structures. For example, Jin et al. (2020) proposes to constrain the sparsity and smoothness of graphs via matrix norms and further adopts proximal gradient methods for handling the non-differentiability issue. Chen et al. (2020b) also aims to regularize sparsity and smoothness but differently harnesses an energy function to enforce the constraints. From a different perspective, Xu et al. (2019) attempts to attack the graph topology for improving the model's robustness and proposes to leverage projected gradient descent to make the optimization of discrete graph structures tractable. Alternatively, several works focus on pruning graph networks (Chen et al., 2021) or adaptively sparsifying the structures (Zheng et al., 2020; Hasanzadeh et al., 2020). In this paper, we focus on out-of-distribution generalization and target handling distribution shifts over graphs, which is more difficult than the setting of previous methods that concentrate on in-distribution generalization. On the methodology side, with a different training scheme, our context generators aim to generate data of multiple virtual environments and are learned by maximizing the variance of risks across multiple (virtual) environments. Also, one can specify our context generators with other existing graph generation/editing/attacking frameworks as mentioned above, which we leave as future work.

There are some very recent works that endeavor to study the out-of-distribution generalization ability of GNNs for node classification.
Baranwal et al. (2021) introduce the contextual stochastic block model, a mix-up of the standard stochastic block model and a Gaussian mixture model with each node class corresponding to a component, based on which the authors show some cases where linear separability can be achieved. Ma et al. (2021) focus on subgroup generalization and present a PAC-Bayesian theoretical framework, while Zhu et al. (2021) target distribution shifts between selecting training and testing nodes and propose a new GNN model called Shift-Robust GNN. By contrast, our work possesses the following distinct technical contributions. First, we formulate the OOD problem for node-level tasks in a general manner, without any assumption on specific distribution forms or the way graph structures are generated (our Assumptions 1 and 2 mainly concern the existence of causal features and spurious features in data generation across different environments). Second, our proposed model and algorithm are rooted in invariant models, providing a new perspective and methodology for OOD learning on graphs. Third, we design and conduct comprehensive experiments on diverse real-world datasets that reflect the in-the-wild nature of real situations (e.g., cross-graph transfers and dynamic evolution) and demonstrate the power of our approach in three scenarios.

We define the aggregated node features a_v = (1/|N_v|) Σ_{u∈N_v} x_u, with a_v = [a^1_v, a^2_v]. According to the definition and setup in Section 3.1, we derive the risk under a specific environment as R(e) = E[(ŷ_v − y_v)^2 | e = e]. Denote the objective for empirical risk minimization as L_1 = E_e[R(e)]. Its first-order derivative w.r.t. θ_1 is

∂L_1/∂θ_1 = 2(θ_1 + θ_2 − 1) E[(a^1_v)^2],

where the simplification is given by the independence among a^1_v, n^1_v, n^2_v and ε. Letting ∂L_1/∂θ_1 = 0, we obtain θ_1 + θ_2 = 1. Also, writing the residual as l(e) = ŷ_v − y_v = (θ_1 + θ_2 − 1)a^1_v + θ_2(n^1_v + n^2_v + ε) − n^1_v, the first-order derivative w.r.t. θ_2 is

∂L_1/∂θ_2 = 2[(θ_1 + θ_2 − 1) E[(a^1_v)^2] + θ_2(2 + E_e[σ_e^2]) − 1],

where the last step uses E[ε] = 0 and the unit variances of n^1_v and n^2_v. We further let ∂L_1/∂θ_2 = 0 and obtain the unique solution θ_2 = 1/(2 + E_e[σ_e^2]) and θ_1 = 1 − θ_2, whose second entry is nonzero. We next consider the variance objective L_2 = Var_e(R(e)) and derive its first-order derivatives w.r.t. θ_1 and θ_2. By letting ∂L_2/∂θ_1 = 0, we obtain the equation θ_1 + θ_2 = 1. Plugging it into ∂L_2/∂θ_2, the resulting derivative is a function of e unless [θ_1, θ_2] = [1, 0], which gives rise to ∂L_2/∂θ_2 = 0 for arbitrary distributions of environments. We thus conclude the proof.

We first present a useful lemma that interprets the invariance and sufficiency conditions with the terminology of information theory.

Lemma 1. The two conditions in Assumption 1 can be equivalently expressed as 1) (Invariance): I(y; e|r) = 0 and 2) (Sufficiency): I(y; r) is maximized.

Proof. For the invariance, we can easily arrive at the equivalence given the fact that I(y; e|r) = E[D_KL(p(y|e, r) ∥ p(y|r))]. For the sufficiency, we first prove that any (G_v, r, y) satisfying y = c*(r) + n also satisfies r = arg max_r I(y; r). We prove it by contradiction. Suppose that r ≠ arg max_r I(y; r) and r′ = arg max_r I(y; r), where r′ ≠ r. Then there exists a random variable r̄ such that r′ = m(r, r̄), where m is a mapping function. We thus have I(y; r′) = I(y; r, r̄) = I(c*(r); r, r̄) = I(c*(r); r) = I(y; r), which leads to a contradiction. We next prove that any (G_v, r, y) satisfying r = arg max_r I(y; r) also satisfies y = c*(r) + n. Suppose that y ≠ c*(r) + n and y = c*(r′) + n, where r′ ≠ r. We then have the relationship I(c*(r′); r) ≤ I(c*(r′); r′), which yields that r ≠ arg max_r I(y; r) and leads to a contradiction.
Given the dependency relationship z ← G_v → y, maximizing I(y; z) over q(z|G_v) is equivalent to minimizing I(y; G_v|z). Treating q(y|z) as a variational distribution and using our definition in Eq. 6, we can upper bound I(y; G_v|z) by D_KL(p(y|G_v, e) ∥ q(y|z)), where the key step is according to Jensen's inequality and the equality holds if q(z|G_v) is a delta distribution (induced by the GNN encoder h). Then the problem min_{q(z|G_v), q(y|z)} D_KL(p(y|G_v, e) ∥ q(y|z)) can be equivalently converted into minimizing this upper bound. We have thus proven that minimizing the expectation term in Eq. 4 minimizes an upper bound of I(y; G_v|z) and contributes to max_{q(z|G_v)} I(y; z). Besides, according to the definition in Eq. 6 and Jensen's inequality, D_KL(q(y|z) ∥ E_e[q(y|z)]) can be upper bounded by a quantity controlled by the variance of risks across environments. Hence we have proven that minimizing the variance term in Eq. 4 plays a role in solving min_{q(z|G_v)} I(y; e|z).

With Lemma 1, we know that 1) the representation z (given by the GNN encoder z = h(G_v)) satisfies the invariance condition, i.e., p(y|z) = p(y|z, e), if and only if I(y; e|z) = 0, and 2) the representation z satisfies the sufficiency condition, i.e., y = c*(z) + n, if and only if z = arg max_z I(y; z). We denote the GNN encoder that satisfies the invariance and sufficiency conditions as h* and the corresponding predictor model as f* = c* ∘ h*. Since we assume the GNN encoder q(z|G_v) satisfies the conditions in Assumption 1, then according to Assumption 2 we know that there exists a random variable z̄ such that G_v = m(z, z̄) and p(y|z̄, e) would change arbitrarily across environments. Based on this, for any environment e that gives the distribution p_e(y, z, z̄), we can construct an environment e′ with distribution p_{e′}(y, z, z̄) that satisfies p_{e′}(y, z, z̄) = p_e(y, z) p_{e′}(z̄). Then we follow the reasoning line of Theorem 2.1 in Liu et al. (2021) to finish the proof by showing that for an arbitrary function f = c ∘ h and environment e, there exists an environment e′ in which f performs no better than f*. Concretely, the required chain of equalities holds, where the first equality is given by Eq. 22 and the second/third steps are due to the sufficiency condition of h*.

Recall that, according to our definition in Eq. 6, the KL divergence D_KL(p_e(y|G_v) ∥ q(y|G_v)) is computed via the two-step sampling procedure over ego-graphs and node labels. Based on this, we can extend the information-theoretic framework of Federici et al. (2021) for analysis on graph data. First, we can decompose the training error (resp. OOD error) into a representation error and a latent predictive error.

Lemma 2. For any GNN encoder q(z|G_v) and classifier q(y|z), we have D_KL(p_e(y|G_v) ∥ q(y|G_v)) ≤ I_e(G_v; y|z) + D_KL(p_e(y|z) ∥ q(y|z)).

Proof. The bound follows from the variational argument above, where the third step is again due to Jensen's inequality and the equality holds once q(z|G_v) is a delta distribution. The result for D_KL(p_{e′}(y|G_v) ∥ q(y|G_v)) can be obtained in a similar way.

Lemma 3. For any q(z|G_v) and q(y|z), the latent predictive error on a new environment e′ can be bounded by the corresponding error on e together with I(y; e|z); the bound holds for any z satisfying p(z = z|e = e) > 0 for all e ∈ E.

Proof. The proof can be adapted from Proposition 3 in Federici et al. (2021) by replacing t in their statement with e in our case.

The results of Lemmas 2 and 3 indicate that if we aim to reduce the OOD error measured by D_KL(p_{e′}(y|G_v) ∥ q(y|G_v)), we need to control three terms: 1) I_e(G_v; y|z), 2) D_KL(p_e(y|z) ∥ q(y|z)), and 3) I(y; e|z). The next lemma unifies the minimization of the first two terms.

Lemma 4. For any q(z|G_v) and q(y|z), minimizing D_KL(p_e(y|G_v) ∥ q(y|z)) over q(z|G_v) and q(y|z) jointly minimizes I_e(G_v; y|z) and D_KL(p_e(y|z) ∥ q(y|z)).

Proof. Recall that q(y|z) is a variational distribution.
Expanding the variational bound, we can see that I_e(G_v; y|z) is upper bounded by D_KL(p_e(y|G_v) ∥ q(y|z)), and the equality holds if and only if D_KL(p_e(y|z) ∥ q(y|z)) = 0. We thus conclude the proof.

Recall that, according to Lemma 1, our objective in Eq. 4 essentially has a similar effect as min_{q(z|G_v), q(y|z)} D_KL(p_e(y|G_v) ∥ q(y|z)) + I(y; e|z). Based on Lemmas 2, 3 and 4, we know that optimizing the objective in Eq. 4 can reduce the upper bound of the OOD error given by D_KL(p_{e′}(y|G_v) ∥ q(y|G_v)), on condition that I_e(G_v; y|z) = I_{e′}(G_v; y|z). We conclude our proof of Theorem 3.

In this section, we introduce detailed information on the experimental datasets and provide the details of our evaluation protocols, including data preprocessing, dataset splits and the ways of computing evaluation metrics. In the following subsections, we present the information for the three scenarios, respectively.

Cora and Amazon-Photo are two commonly used node classification benchmarks, widely adopted for evaluating the performance of GNN designs. These datasets are of medium size, with thousands of nodes; see Table 1 for more statistics. Cora is a citation network where nodes represent papers and edges represent their citation relationships. Amazon-Photo is a co-purchasing network where nodes represent goods and edges indicate that two goods are frequently bought together. In the original datasets, the available node features have strong correlation with the node labels. To evaluate a model's ability for out-of-distribution generalization, we need to introduce distribution shifts between the training and testing data.

For each dataset, we use the provided node features to construct node labels and spurious, environment-sensitive features. Specifically, denote the provided node features as X_1. We adopt a randomly initialized GNN (with input of X_1 and the adjacency matrix) to generate node labels Y (via taking an argmax in the output layer to obtain one-hot vectors), and another randomly initialized GNN (with input of the concatenation of Y and an environment id) to generate spurious node features X_2. After that, we concatenate the two portions of features, X = [X_1, X_2], as input node features for training and evaluation. In this way, we construct ten graphs with different environment id's for each dataset. We use one graph for training, one for validation, and report the classification accuracy on the remaining graphs. One may notice that this data generation is a generalized version of our motivating example in Section 3.1, where we replace the linear aggregation with a randomly initialized graph neural network to introduce non-linearity.
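A sketch (ours) of this construction is given below; gnn_label and gnn_spurious stand for the two randomly initialized (and frozen) GNNs, and the exact output dimensions are left unspecified in the paper.

import torch
import torch.nn.functional as F

@torch.no_grad()
def make_env_graph(gnn_label, gnn_spurious, X1, A, env_id, num_envs=10):
    """Artificial distribution shift: labels Y come from a frozen random GNN
    on (X1, A); spurious features X2 come from a second frozen random GNN fed
    with [one-hot(Y), one-hot(env_id)], so X2 depends on both Y and e."""
    Y = gnn_label(X1, A).argmax(dim=-1)
    y_oh = F.one_hot(Y).float()
    e = torch.full((X1.size(0),), env_id, dtype=torch.long)
    e_oh = F.one_hot(e, num_classes=num_envs).float()
    X2 = gnn_spurious(torch.cat([y_oh, e_oh], dim=-1), A)
    return torch.cat([X1, X2], dim=-1), Y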
In fact, with our data generation, the original node features X_1 can be seen as domain-invariant features that are sufficiently predictive for node labels and insensitive to different environments, while the generated features X_2 are domain-variant features that are conditioned on environments. Therefore, in principle, the ideal case is for the model to identify and leverage the invariant features for prediction. In practice, there exist multiple factors that may affect the model's learning, including local optima and noise in the data. Therefore, one may not expect the model to exactly achieve the ideal case, since there also exists useful predictive information in X_2 that may help the model increase training accuracy. Yet, through our experiments in Fig. 2(b) and 3(b), we show that EERM relies on spurious features much less than ERM, which we believe serves as concrete evidence that our approach is capable of guiding the GNN model to alleviate reliance on domain-variant features.

A typical scenario for distribution shifts on graphs is the problem of cross-domain transfer. There are quite a few real-world situations where one has access to multiple observed graphs, each from a specific domain. For example, in social networks, the domains can be instantiated as where or when the networks are collected. In protein networks, there may exist observed graph data (protein-protein interactions) from distinct species, which can be seen as distinct domains. In short, since most graph data records the relational structures among a specific group of entities, and the interactions/relationships among entities from different groups often have distinct characteristics, the data-generating distributions vary across groups, which brings up domain shifts. Yet, to enable transfer learning across graphs, the graphs in one dataset need to share the same input feature space and output space. We adopt two public datasets, Twitch-Explicit and Facebook-100, that satisfy this requirement.

Twitch-Explicit contains seven networks where nodes represent Twitch users and edges represent their mutual friendships. Each network is collected from a particular region: DE, ENGB, ES, FR, PTBR, RU and TW. These seven networks have similar sizes but different densities and maximum node degrees, as shown in Table 4. Also, in Fig. 6, we compare the ROC-AUC results on different leave-out data. We consider GCN, GAT and GCNII as the GNN backbones and train the models with standard empirical risk minimization (ERM). We further consider two ways of splitting the data. In the first case, which we call "OOD", we train the model on the nodes of one graph, DE, and report the highest ROC-AUC on the nodes of another graph, ENGB. In the second case, which we call "IID", we train the model on 90% of the nodes of DE and evaluate the performance on the leave-out 10% of the data. The results in Fig. 6 show that model performance exhibits a clear drop from "IID" to "OOD", which indicates that there indeed exist distribution shifts among different input graphs. This also serves as a justification for our evaluation protocol in Section 5.2, where we adopt graph-level splitting to construct the training/validation/testing sets.

Figure 6: Comparison of different leave-out data on Twitch-Explicit. We consider three GNN backbones trained with ERM. "OOD" means that we train the model on one graph, DE, and report the metric on another graph, ENGB. "IID" means that we train the model on 90% of the nodes of DE and report the metric on the remaining nodes. The results clearly show that model performance suffers a significant drop from the "IID" case to the "OOD" case. This indicates that the graph-level training/validation/testing splits used in Section 5.2 indeed introduce distribution shifts and require the model to deal with out-of-distribution data during test.

Elliptic contains a sequence of 49 graph snapshots. Each graph snapshot is a network of Bitcoin transactions, where each node represents one transaction and each edge indicates a payment flow.
Approximately 20% of the transactions are marked as licit or illicit, and the goal is to identify illicit transactions in the future observed network. Since in the original dataset the first six snapshots have extremely imbalanced classes (with fewer than 10 illicit transactions among thousands of nodes), we remove them and use the 7th-11th/12th-16th/17th-49th snapshots for training/validation/testing. Also, because each graph snapshot has a very low positive label rate, we group the 33 testing graph snapshots into 9 test sets in chronological order. In Fig. 8 we present the label rate and positive label rate for the training/validation/testing sets. As we can see, the positive label rates differ considerably across the data sets; indeed, the model needs to handle distinct label distributions from training to testing data.

OGB-Arxiv is composed of 169,343 Arxiv CS papers from 40 subject areas and their citation relationships. The goal is to predict a paper's subject area. In Hu et al. (2020), the papers published before 2017, in 2018 and since 2019 are used for training/validation/testing. Also, the authors adopt the transductive learning setting, i.e., the nodes in the validation and test sets also exist in the graph during training. In our case, we instead adopt the inductive learning setting, where the nodes in the validation and test sets are unseen during training, which is more akin to the real-world situation. Besides, for a better evaluation of generalization, especially extrapolation to new data, we consider dataset splits with a larger year gap: we use papers published before 2011 for training, from 2011 to 2014 for validation, and after 2014 for testing. Such a splitting introduces distribution shifts between training and testing data, since several latent influential factors (e.g., the popularity of research topics) in data generation change over time. In Fig. 9, we visualize the T-SNE embeddings of the nodes and mark the training/validation/testing nodes with different colors. From Fig. 9(a) to Fig. 9(c), we can see that the number of testing nodes that do not overlap with the training/validation ones increases, which suggests that the distribution shifts enlarge as the time difference grows. This phenomenon echoes the results in Table 3, where model performance suffers a clear drop as the time difference between testing and training data becomes larger, with ERM suffering more than EERM.

Figure 9: T-SNE visualization of training/validation/testing nodes in OGB-Arxiv. We mark training nodes (within 1950-2011) and validation nodes (within 2011-2014) as red and blue, respectively. In (a)-(c), the test nodes within different time intervals are visualized as yellow points. We can see that as the time difference between testing data and training/validation data grows from (a) to (c), more testing nodes do not overlap with the training/validation ones, which suggests that the distribution shifts become more significant and require the model to extrapolate to more difficult future data.
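The year-based split described above can be sketched as follows (ours, assuming the ogb package's node_year attribute; the exact boundary handling of the split years is our guess):

from ogb.nodeproppred import PygNodePropPredDataset

dataset = PygNodePropPredDataset(name='ogbn-arxiv')
data = dataset[0]
year = data.node_year.squeeze()

# Train on papers up to 2011, validate on 2011-2014, and test on three
# increasingly distant intervals to enlarge the distribution shift.
train_mask = year <= 2011
valid_mask = (year > 2011) & (year <= 2014)
test_masks = [(year > lo) & (year <= hi)
              for lo, hi in ((2014, 2016), (2016, 2018), (2018, 2020))]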
In this section, we present the details of our implementation in Section 5, including the model architectures, hyper-parameter settings and training details, for reproducibility. Most of our experiments are run on a GeForce RTX 2080Ti with 11GB memory, except for some experiments requiring large GPU memory, for which we adopt an RTX 8000 with 48GB. The configurations of our environments and packages are listed below:

• PyTorch Geometric 1.7.2

In our experiments in Section 5, we adopt different GNN architectures as the backbone. Here we introduce the details for them.

GCN. We use the GCNConv available in PyTorch Geometric for implementation. The detailed architecture description is as below:
• A sequence of L-layer GCNConv.
• Add self-loops and use batch normalization for graph convolution in each layer.
• Use ReLU as the activation.

GAT. We use the GATConv available in PyTorch Geometric for implementation. The detailed architecture description is as below:
• A sequence of L-layer GATConv with head number H.
• Add self-loops and use batch normalization for graph convolution in each layer.
• Use ELU as the activation.

GraphSAGE. We use the SAGEConv available in PyTorch Geometric for implementation. The detailed architecture description is as below:
• A sequence of L-layer SAGEConv.
• Add self-loops and use batch normalization for graph convolution in each layer.
• Use ReLU as the activation.

GCNII. We use the implementation provided by the original paper (Chen et al., 2020a). The associated hyper-parameters in the GCNII model are set as: α_GCNII = 0.1 and λ_GCNII = 1.0.

GPRGNN. We use the implementation provided by Chien et al. (2021). We adopt the PPR initialization and GPRprop as the propagation unit. The associated hyper-parameter in the GPRGNN model is set as: α_GPRGNN = 0.1.

The hyper-parameters for model architectures are set to their default values in the different cases. Other hyper-parameters are tuned with grid search on the validation set. The search space is as follows: learning rate for the GNN backbone α_f ∈ {0.0001, 0.0002, 0.001, 0.005, 0.01}, learning rate for the graph editors α_g ∈ {0.0001, 0.001, 0.005, 0.01}, weight for combination β ∈ {0.2, 0.5, 1.0, 2.0, 3.0}, number of edge edits per node s ∈ {1, 5, 10}, number of iterations for the inner update before a one-step outer update T ∈ {1, 5}.

For Cora and Amazon-Photo, we consider a 2-layer GCN with hidden size 32. We use weight decay with coefficient 1e-3. Besides, we set α_g = 0.005, α_f = 0.01, β = 2.0, s = 5, T = 1.

For GCN, we set the layer number L as 2. For GAT, we set L = 2 and H = 4. For GCNII, we set the layer number as 10. We use hidden size 32 and weight decay with coefficient 1e-3. For Twitch-Explicit, other hyper-parameters are set as follows:
• GCN: α_g = 0.001, α_f = 0.01, β = 3.0, s = 5, T = 1.
• GAT: α_g = 0.005, α_f = 0.01, β = 1.0, s = 5, T = 1.
• GCNII: α_g = 0.01, α_f = 0.001, β = 1.0, s = 5, T = 1.
For Facebook-100, other hyper-parameters are set as: α_g = 0.005, α_f = 0.01, β = 1.0, s = 5, T = 1.

For GraphSAGE and GPRGNN, we set the layer number as 5 and hidden size as 32. For Elliptic, other hyper-parameters are set as follows:
• GraphSAGE: α_g = 0.0001, α_f = 0.0002, β = 1.0, s = 5, T = 1.
• GPRGNN: α_g = 0.005, α_f = 0.01, β = 1.0, s = 5, T = 1.
For OGB-Arxiv, other hyper-parameters are set as follows:
• GraphSAGE: α_g = 0.01, α_f = 0.005, β = 0.5, s = 1, T = 5.
• GPRGNN: α_g = 0.001, α_f = 0.01, β = 1.0, s = 1, T = 5.

For each method, we train the model for a fixed number of epochs and report the test result achieved at the epoch when the model attains the best performance on the validation set.
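For concreteness, here is a minimal sketch (ours) of the GCN backbone described above: stacked GCNConv layers with self-loops, batch normalization and ReLU.

import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class GCNBackbone(nn.Module):
    """L-layer GCN: GCNConv (with self-loops) -> BatchNorm -> ReLU for each
    hidden layer, and a plain GCNConv output layer."""
    def __init__(self, in_dim, hidden_dim, out_dim, num_layers=2):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * (num_layers - 1) + [out_dim]
        self.convs = nn.ModuleList(
            GCNConv(dims[i], dims[i + 1], add_self_loops=True)
            for i in range(num_layers))
        self.bns = nn.ModuleList(nn.BatchNorm1d(d) for d in dims[1:-1])

    def forward(self, x, edge_index):
        for conv, bn in zip(self.convs[:-1], self.bns):
            x = torch.relu(bn(conv(x, edge_index)))
        return self.convs[-1](x, edge_index)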
We provide additional experiment results in this section. In Fig. 10 and 11 we present the distribution of test accuracy on Cora when using SGC and GAT, respectively, as the GNNs for data generation. In Fig. 12 and 13 we further compare the training accuracy when using all the features with that when removing the spurious ones for inference. These results are consistent with those presented in Section 5.1, which again verifies the effectiveness of our approach. Besides, the corresponding extra results on Photo are shown in Fig. 14, 15, 16 and 17, which also back up our discussions in Section 5.1. [Figures 10-17 (plots omitted): distributions of test accuracy (40-80) across test sets T1-T8, comparing ERM and EERM.]

References

Systematic generalisation with group invariant predictions
Invariant risk minimization games
Deep learning for segmentation of brain tumors: Impact of cross-institutional training and testing
Graph convolution for semi-supervised classification: Improved linear separability and out-of-distribution generalization
Recognition in terra incognita
Fairness in criminal justice risk assessments: The state of the art
Size-invariant graph representations for graph classification extrapolations
Generalizing from several related classification tasks to a new unlabeled sample
Invariant rationalization
Simple and deep graph convolutional networks
A unified lottery ticket hypothesis for graph neural networks
Iterative deep graph learning for graph neural networks: Better and robust node embeddings
Adaptive universal generalized PageRank graph neural network
Environment inference for invariant learning
Dark model adaptation: Semantic image segmentation from daytime to nighttime
AI for radiographic COVID-19 detection selects shortcuts over signal
Collective spammer detection in evolving multi-relational social networks
An information-theoretic approach to distribution shifts
Generalization and representational limits of graph neural networks
Domain adaptation with conditional transferable components
Inductive representation learning on large graphs
Bayesian graph neural networks with adaptive connection sampling
Open graph benchmark: Datasets for machine learning on graphs
Graph structure learning for robust graph neural networks
Semi-supervised classification with graph convolutional networks
When is invariance useful in an out-of-distribution generalization problem? CoRR, abs
Out-of-distribution generalization via risk extrapolation (REx)
New benchmarks for learning on non-homophilous graphs
Heterogeneous risk minimization
Subgroup generalization and fairness of graph neural networks
Domain generalization using causal matching
Towards recognizing unseen categories in unseen domains
Domain adaptation: Learning bounds and algorithms
Domain generalization via invariant feature representation
EvolveGCN: Evolving graph convolutional networks for dynamic graphs
Causal inference by using invariant prediction: Identification and confidence intervals
Do ImageNet classifiers generalize to ImageNet?
Invariant models for causal transfer learning
Multi-scale attributed node embedding
Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization
The Vapnik-Chervonenkis dimension of graph and recursive neural networks
On causal and anticausal learning
Pitfalls of graph neural network evaluation
One pixel attack for fooling deep neural networks
Social structure of Facebook networks
Graph attention networks
Stability and generalization of graph convolutional neural networks
A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia
Simplifying graph convolutional networks
Towards open-world feature extrapolation: An inductive graph learning approach
Towards open-world recommendation: An inductive model-based collaborative filtering approach
Risk variance penalization: From distributional robustness to causality. CoRR, abs
Topology attack and defense for graph neural networks: An optimization perspective
How neural networks extrapolate: From feedforward to graph neural networks
Revisiting semi-supervised learning with graph embeddings
From local structures to size generalization in graph neural networks
Bayesian graph convolutional neural networks for semi-supervised classification
Robust graph representation learning via neural sparsification
Shift-robust GNNs: Overcoming the limitations of localized graph training data
Another dataset is Facebook-100, which consists of 100 Facebook friendship network snapshots from the year 2005; each network contains nodes representing Facebook users from a specific American university. We adopt fourteen networks in our experiments: Johns Hopkins, Caltech, Amherst, Bingham, Duke, Princeton, WashU, Brandeis, Carnegie, Cornell, Yale, Penn, Brown and Texas. Recall that in Section 5.2 we use Penn, Brown and Texas for testing, Cornell and Yale for validation, and three different combinations from the remaining graphs for training. These graphs have significantly diverse sizes, densities and degree distributions. In Fig. 7 we present a comparison indicating that the distributions of graph structures differ across these graphs. Concretely, the testing graphs Penn and Texas are much larger (with 41,554 and 31,560 nodes, respectively) than the training/validation graphs (most of which have thousands of nodes). Also, the training graphs Caltech and Amherst are much denser than the other graphs in the dataset, while some graphs like Penn contain nodes with very large degrees. These statistics suggest that our evaluation protocol requires the model to handle graph structures that differ from training/validation to testing data.

Another common scenario involves temporal graphs that dynamically evolve over time. The types of evolution can generally be divided into two categories. In the first case, there are multiple graph snapshots, each taken at one point in time; as time goes by, this yields a sequence of graph snapshots which may contain different node sets and data distributions. Typical examples include financial networks that record the payment flows among transactions within different time intervals. In the second case, there is a single graph that evolves through node/edge additions or deletions. Typical examples include large-scale real-world graphs like social networks and citation networks, where the distributions of node features, edges and labels are strongly correlated with time (at different scales). We adopt two public real-world datasets, Elliptic and OGB-Arxiv, for the node classification experiments; a sketch of how snapshot sequences can be grouped into chronological test sets follows below.
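As a small illustration of the snapshot-based setting, the sketch below groups an ordered snapshot sequence into contiguous chronological test sets, mirroring how Elliptic's 33 test snapshots are merged into 9 test sets. The even-split strategy and the `group_snapshots` helper are our assumptions, since the text does not specify group sizes.

```python
# Minimal sketch: split an ordered list of graph snapshots into contiguous
# chronological groups (e.g., 33 test snapshots -> 9 test sets for Elliptic).
def group_snapshots(snapshots, num_groups):
    """Split an ordered snapshot list into `num_groups` contiguous groups."""
    base, extra = divmod(len(snapshots), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        size = base + (1 if g < extra else 0)  # spread the remainder evenly
        groups.append(snapshots[start:start + size])
        start += size
    return groups

# Example: 33 chronologically ordered snapshots -> 9 test sets
test_sets = group_snapshots(list(range(33)), num_groups=9)
assert len(test_sets) == 9 and sum(len(g) for g in test_sets) == 33
```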