Out-Of-Distribution Generalization on Graphs: A Survey

Haoyang Li*, Xin Wang*, Ziwei Zhang, Wenwu Zhu (* equal contributions)

2022-02-16

Graph machine learning has been extensively studied in both academia and industry. Although the field is booming with a vast number of emerging methods and techniques, most of the literature is built on the I.I.D. hypothesis, i.e., that testing and training graph data are independent and identically distributed. However, this I.I.D. hypothesis can hardly be satisfied in many real-world graph scenarios, and model performance substantially degrades when there exist distribution shifts between testing and training graph data. To solve this critical problem, out-of-distribution (OOD) generalization on graphs, which goes beyond the I.I.D. hypothesis, has made great progress and attracted ever-increasing attention from the research community. In this paper, we comprehensively survey OOD generalization on graphs and present a detailed review of recent advances in this area. First, we provide a formal problem definition of OOD generalization on graphs. Second, we categorize existing methods into three classes from conceptually different perspectives, i.e., data, model, and learning strategy, based on their positions in the graph machine learning pipeline, followed by detailed discussions for each category. We also review the theories related to OOD generalization on graphs and introduce the commonly used graph datasets for thorough evaluations. Last but not least, we share our insights on future research directions. This paper is the first systematic and comprehensive review of OOD generalization on graphs, to the best of our knowledge.

Graph-structured data is ubiquitous in our daily life. It has been widely used to model the complex relationships and dependencies between entities, ranging from microscopic particle interactions in physical systems and molecular structures in proteins to macroscopic traffic networks and global communication networks. Machine learning approaches on graphs, especially graph neural networks (GNNs), have shown great successes in both academia and industry, illustrating their excellent capabilities in various applications, e.g., social network analysis [Qiu et al., 2018], recommendation systems, knowledge representation [Wang et al., 2017], traffic forecasting, etc. Despite the notable success of graph machine learning approaches, the existing literature generally relies on the assumption that the testing and training graph data are independently drawn from the identical distribution, i.e., the I.I.D. hypothesis. However, in real-world scenarios, such a hypothesis is difficult to satisfy due to the uncontrollable underlying data generation mechanism [Bengio et al., 2019]. In practice, there will inevitably be scenarios with distribution shifts between testing and training graphs [Li et al., 2021b; Wang et al., 2021b], which may cause significant performance drops for most existing approaches lacking the ability of out-of-distribution (OOD) generalization.
Therefore, it is of paramount importance to develop approaches capable of handling out-of-distribution (OOD) generalization on graphs, especially for high-stakes graph applications, e.g., criminal justice [Agarwal et al., 2021], financial analysis [Yang et al., 2020a], molecule prediction [Hu et al., 2020a], as well as pandemic prediction [Panagopoulos et al., 2021], medical detection [Horry et al., 2020], and drug repurposing [Hsieh et al., 2021] for COVID-19. Out-of-distribution (OOD) generalization algorithms [Shen et al., 2021; Wang et al., 2021b] aim to achieve satisfactory generalization performance under unknown distribution shifts. They occupy an important position in the research community due to the increasing demand for handling in-the-wild unseen data. Combining the strengths of graph machine learning and OOD generalization, i.e., OOD generalization on graphs, naturally serves as a promising research direction to facilitate the deployment of graph machine learning models in real-world scenarios. However, this problem is highly non-trivial due to the following challenges.

• Uniqueness of graph data: The non-Euclidean nature of graph-structured data leads to unique graph model designs and creates obstacles for directly adopting OOD generalization algorithms that are mainly developed on Euclidean data (e.g., images and texts).

• Diversity of graph tasks: Problems on graphs are inherently diverse, ranging from node-level and link-level to graph-level tasks, along with distinct settings, objectives, and constraints. It is necessary to integrate different levels of graph characterizations into graph OOD generalization methods.

• Complexity of graph distribution shift types: Distribution shifts on graphs can exist at the feature level (e.g., node features) and the topology level (e.g., graph size or other structural properties), rendering OOD generalization more difficult.

With both opportunities and challenges in mind, it is the right time to review and advance the study of graph OOD generalization methods. In this paper, we provide a systematic and comprehensive review of OOD generalization on graphs for the first time, to the best of our knowledge. Specifically, to cover the whole life cycle of OOD generalization on graphs, we start by providing a formal problem definition. We divide the existing methodologies into three conceptually different categories based on their positions in the graph machine learning pipeline, and elaborate typical approaches for each category. We also review the theories and datasets for evaluation to further promote research on OOD generalization on graphs. Last but not least, we share our insights on potential research topics deserving future investigation.

The rest of the paper is organized as follows. In Section 2, we formulate the problem of OOD generalization on graphs and present our categorization of the existing literature. We comprehensively review the three categories of methods in Sections 3-5, followed by our review of related theory (Section 6) and evaluation datasets (Section 7). Lastly, we point out future research opportunities in Section 8.

In this section, we first describe the formulation of OOD generalization on graphs. Then we provide the categorization of existing graph OOD generalization methods. Let $\mathcal{X}$ be the input space and $\mathcal{Y}$ be the label space. A graph predictor $f_\theta: \mathcal{X} \rightarrow \mathcal{Y}$ with parameter $\theta$ maps an input instance $X \in \mathcal{X}$, i.e., a node, link, or graph for node-level, link-level, or graph-level tasks, to its label $Y \in \mathcal{Y}$. A loss function $\ell$ measures the distance between a prediction and the ground-truth label. The graph OOD generalization problem is defined as:

Definition 1 (Graph OOD generalization). Given a training set of $N$ instances (i.e., nodes, links, or graphs) drawn from the training distribution $P_{\mathrm{train}}(X, Y)$, where $X_i \in \mathcal{X}$ and $Y_i \in \mathcal{Y}$, the goal is to learn an optimal graph predictor $f^{*}_{\theta}$ that achieves the best generalization on data drawn from the test distribution $P_{\mathrm{test}}(X, Y)$:

$$f^{*}_{\theta} = \arg\min_{f_\theta} \mathbb{E}_{X,Y \sim P_{\mathrm{test}}}\left[\ell\left(f_\theta(X), Y\right)\right], \quad \text{where } P_{\mathrm{test}}(X, Y) \neq P_{\mathrm{train}}(X, Y).$$

The distribution shifts between $P_{\mathrm{test}}(X, Y)$ and $P_{\mathrm{train}}(X, Y)$ can lead to the failure of graph predictors built on the I.I.D. hypothesis, since directly minimizing the average loss on training instances, $\mathbb{E}_{X,Y \sim P_{\mathrm{train}}}\left[\ell\left(f_\theta(X), Y\right)\right]$, cannot yield an optimal predictor that generalizes to testing instances under distribution shifts.
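To ground Definition 1, the following is a minimal PyTorch sketch of the standard empirical risk minimization (ERM) step that the definition generalizes beyond; `model` (a graph predictor $f_\theta$) and `train_loader` are hypothetical placeholders, not components defined by the survey.

```python
import torch

def erm_epoch(model, train_loader, optimizer,
              loss_fn=torch.nn.functional.cross_entropy):
    """One epoch of empirical risk minimization: directly minimizes the
    average training loss E_{(X,Y)~P_train}[l(f_theta(X), Y)] -- exactly
    the objective that can fail under distribution shifts."""
    model.train()
    total = 0.0
    for X, y in train_loader:  # X: a batch of nodes, links, or graphs
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / len(train_loader)
```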
To tackle the challenges brought by unknown distribution shifts and solve the graph OOD generalization problem, considerable efforts have been made in the literature, which can be categorized into three classes:

• Data: This category of methods aims to manipulate the input graph data, i.e., graph augmentation. By systematically generating more training samples to increase the quantity and diversity of the training set, graph augmentation techniques are effective in improving OOD generalization performance.

• Model: This category of methods aims to propose new graph models for learning OOD-generalized graph representations, including two types of representative methods: disentanglement-based and causality-based graph models.

• Learning Strategy: This category of methods focuses on exploiting training schemes with tailored optimization objectives and constraints to enhance OOD generalization capability, including graph invariant learning, graph adversarial training, and graph self-supervised learning.

These three categories of methods solve the graph OOD generalization problem from three conceptually different perspectives. We provide the taxonomy in Figure 1 and elaborate the methods of each category in the following sections. We also summarize the characteristics of these methods in Table 1.

The OOD generalization ability of machine learning models, including graph models, heavily relies on the diversity and quality of the training data [Wang et al., 2021b]. Broadly speaking, the more diverse and high-quality the training data, the better the generalization performance of graph models. With a proper graph augmentation strategy, this type of method can obtain more graph instances for training in a cheap and simple way, with the goal formulated as:

$$\min_{\theta} \mathbb{E}_{(X', Y') \sim D'}\left[\ell\left(f_\theta(X'), Y'\right)\right],$$

where $(X', Y')$ belongs to the training set $D'$ augmented from $D$. Specifically, GAUG [Zhao et al., 2020] proposes to generate augmented graphs via a differentiable edge predictor. DropEdge [Rong et al., 2020] randomly removes edges from the input graph during training. GRAND [Feng et al., 2020] randomly perturbs node features and then propagates the perturbed node features. ifMixup [Guo and Mao, 2021] leverages Mixup [Zhang et al., 2018] for graph augmentation. GraphCL [You et al., 2020a] further proposes four general data augmentations for graph-structured data: node dropping, edge perturbation, attribute masking, and subgraph sampling. It makes the graph encoder maximize representation consistency under augmentations and has shown good OOD generalization ability [Ding et al., 2021].
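As a concrete illustration of two such augmentations (edge dropping in the spirit of DropEdge, and node dropping in the spirit of GraphCL), here is a minimal PyTorch sketch; it assumes the common encoding of a graph as a node feature matrix `x` and a 2×|E| `edge_index` tensor, and is not the papers' reference implementation.

```python
import torch

def drop_edges(edge_index: torch.Tensor, p: float = 0.2) -> torch.Tensor:
    """Randomly remove a fraction p of edges (DropEdge-style)."""
    keep = torch.rand(edge_index.size(1)) >= p
    return edge_index[:, keep]

def drop_nodes(x: torch.Tensor, edge_index: torch.Tensor, p: float = 0.2):
    """Randomly drop a fraction p of nodes together with their incident
    edges (GraphCL-style node dropping), relabeling the remaining nodes."""
    num_nodes = x.size(0)
    keep_mask = torch.rand(num_nodes) >= p
    keep_idx = keep_mask.nonzero(as_tuple=True)[0]
    # map old node ids to a contiguous range; dropped nodes map to -1
    new_id = -torch.ones(num_nodes, dtype=torch.long)
    new_id[keep_idx] = torch.arange(keep_idx.numel())
    src, dst = edge_index
    edge_keep = keep_mask[src] & keep_mask[dst]  # keep edges between survivors
    new_edge_index = torch.stack([new_id[src[edge_keep]], new_id[dst[edge_keep]]])
    return x[keep_idx], new_edge_index
```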
Some works design new graph models with good theoretical support to produce OOD-generalized graph representations, including disentanglement-based and causality-based graph models, whose goal can be formulated as:

$$\min_{w, g} \mathbb{E}_{X,Y \sim P_{\mathrm{train}}}\left[\ell\left(w(g(X)), Y\right)\right],$$

where we decompose $f_\theta = w \circ g$: $g$ is the specially designed graph model for learning representations, and $w$ is the classifier making predictions from the representations.

The formation of a real-world graph typically follows a complex and heterogeneous process driven by the interaction of many latent factors. Disentangled graph representation learning aims to learn representations that separate these distinct and informative factors behind the graph data and characterize them in different parts of a factorized vector representation [Ma et al., 2019]. Such representations have been demonstrated to be more resilient to complex variations [Bengio et al., 2013] and able to benefit OOD generalization [Montero et al., 2020; Dittadi et al., 2020].

DisenGCN [Ma et al., 2019] is the first method to learn disentangled node representations; its key ingredient is a disentangled multichannel convolutional layer, DisenConv. Inside DisenConv, the proposed neighborhood routing mechanism infers the latent factors by iteratively analyzing the potential subspace clusters formed by a node and its neighbors. Each channel of DisenConv can therefore extract features specific to only one disentangled latent factor from the neighboring nodes and perform a convolution operation independently. IPGDN [Liu et al., 2020] extends DisenGCN by explicitly encouraging the latent factors to be as independent as possible, in addition to the neighborhood routing mechanism for disentangling the latent factors behind graphs. FactorGCN [Yang et al., 2020b] is a disentangled GNN model for graph-level representation learning. It adopts a factorizing mechanism that decomposes input graphs into several interpretable factor graphs for graph-level disentangled representations; each factor graph is separately sent to the GNN model, and the results are then aggregated together.

Besides the supervised methods above, there also exist unsupervised disentangling methods. VGAE [Kipf and Welling, 2016] and GraphVAE [Simonovsky and Komodakis, 2018] are generative models that utilize Variational Autoencoders (VAEs) on graphs for disentanglement, since the hyperparameter β of VAEs can balance reconstruction and disentanglement [Higgins et al., 2017; Burgess et al., 2017]. NED-VAE [Guo et al., 2020] is another unsupervised disentangling method, which can disentangle node and edge features from attributed graphs. Since reconstruction in generative methods can be computationally expensive and may even introduce bias that negatively affects the learned representations, DGCL [Li et al., 2021a] proposes to learn disentangled graph representations with self-supervision. It first identifies the latent factors of the input graph and derives its factorized representations, and then conducts factor-wise contrastive learning to encourage the factorized representations to independently reflect the expressive information of different latent factors.
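A simplified PyTorch sketch of the neighborhood routing idea follows, using a dense adjacency matrix for clarity; the class name, hyperparameters, and dense formulation are expository assumptions rather than the original DisenGCN implementation.

```python
import torch
import torch.nn.functional as F

class NeighborhoodRouting(torch.nn.Module):
    """Simplified sketch of DisenConv-style neighborhood routing: K channels,
    each intended to capture one latent factor; neighbors are softly routed
    to the channel they agree with most."""
    def __init__(self, in_dim: int, num_channels: int, channel_dim: int,
                 iterations: int = 5, tau: float = 1.0):
        super().__init__()
        self.K, self.d = num_channels, channel_dim
        self.T, self.tau = iterations, tau
        self.proj = torch.nn.Linear(in_dim, num_channels * channel_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node features; adj: (N, N) dense 0/1 adjacency
        N = x.size(0)
        # project features into K channel subspaces, L2-normalized per channel
        z = F.normalize(self.proj(x).view(N, self.K, self.d), dim=-1)
        c = z  # initialize each node's channel representation with its own features
        for _ in range(self.T):
            # score[v, u, k]: affinity of neighbor u to channel k of node v
            score = torch.einsum('ukd,vkd->vuk', z, c) / self.tau
            # route each neighbor over channels (softmax over k), mask non-edges
            p = torch.softmax(score, dim=-1) * adj.unsqueeze(-1)
            # aggregate routed neighbor features into each channel and renormalize
            c = F.normalize(z + torch.einsum('vuk,ukd->vkd', p, z), dim=-1)
        return c.reshape(N, self.K * self.d)  # concatenated disentangled channels
```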
Causal inference is an important technique for achieving OOD generalization. Graph machine learning models tend to exploit subtle statistical correlations existing in the training set to boost training accuracy, even when a correlation is spurious, i.e., an unexpected "shortcut" [Wu et al., 2022a]. The performance of graph models that heavily rely on spurious correlations can be substantially degraded, since the spurious correlations may change in the wild OOD testing environments. In contrast, causality-based graph models supported by causal inference theory can inherently capture the causal relations between input graph data and labels that are stable under distribution shifts [Schölkopf et al., 2021], leading to good OOD generalization.

Table 1: A summary of graph OOD generalization methods. "Task" denotes the task type that each method focuses on, including node/link/graph-level tasks. "Shift Type" denotes the type of distribution shifts that each method can handle, including topology-level (i.e., graph size and graph structure) and feature-level (i.e., node features) distribution shifts. "Backbone agnostic" indicates whether the method can be used with other GNN backbones. "|E| > 1" indicates whether the method relies on multiple environments during the training process.

Backed by confounder balancing theory [Kuang et al., 2018] in causality, OOD-GNN [Li et al., 2021b] first tackles the OOD generalization problem with a non-linear decorrelation operation on graphs. Specifically, OOD-GNN proposes to eliminate the statistical dependence between relevant and irrelevant graph representations of the graph encoder via a nonlinear graph representation decorrelation method utilizing random Fourier features [Rahimi and Recht, 2007], which scales linearly with the sample size and can get rid of unexpected spurious correlations. The parameters of the graph encoder and the sample weights for graph representation decorrelation are optimized iteratively to learn discriminative graph representations for prediction. Note that the decorrelation operation has the same effect as confounder balancing, which encourages the independence between treatment and confounder. In this way, OOD-GNN achieves satisfactory OOD generalization on several graph benchmarks with various types of distribution shifts (i.e., shifts on graph sizes, node features, and graph structures). Bevilacqua et al. [2021] adopt the twin network [Balke and Pearl, 1994] as their SCM to learn size-invariant representations that better extrapolate between test and train graph data; they also validate OOD generalization under the setting where distribution shifts exist on node features. Besides, inspired by counterfactual learning [Glymour et al., 2016], CFLP [Zhao et al., 2021] focuses on link prediction tasks and learns the causal relationship between the global graph structure and link existence by training GNN-based link predictors to predict both factual and counterfactual links. Gem [Lin et al., 2021], built upon Granger causality [Granger, 1969], inputs the original computation graph into an explainer and outputs a causal explanation graph, exhibiting better generalization abilities.
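To illustrate the core decorrelation idea, the sketch below penalizes sample-weighted cross-covariances between random Fourier features of every pair of representation dimensions; the per-dimension pairing, function names, and per-call resampling of RFF parameters are simplifying assumptions, and OOD-GNN's actual estimator and iterative optimization differ in detail.

```python
import math
import torch

def rff(h: torch.Tensor, num_features: int = 32) -> torch.Tensor:
    """Random Fourier features approximating an RBF kernel:
    phi(h) = sqrt(2/D) * cos(h W + b). W, b are resampled per call for brevity."""
    W = torch.randn(h.size(1), num_features, device=h.device)
    b = 2 * math.pi * torch.rand(num_features, device=h.device)
    return math.sqrt(2.0 / num_features) * torch.cos(h @ W + b)

def decorrelation_penalty(H: torch.Tensor, sample_logits: torch.Tensor) -> torch.Tensor:
    """Sum of squared sample-weighted cross-covariances between the RFFs of
    every pair of representation dimensions; driving it toward zero encourages
    statistical independence between dimensions. H: (N, D) representations;
    sample_logits: (N,) learnable logits defining the sample weights."""
    w = torch.softmax(sample_logits, dim=0).unsqueeze(1)    # (N, 1) sample weights
    feats = [rff(H[:, i:i + 1]) for i in range(H.size(1))]  # one RFF map per dimension
    penalty = H.new_zeros(())
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            f = feats[i] - (w * feats[i]).sum(0)            # center by weighted mean
            g = feats[j] - (w * feats[j]).sum(0)
            cov = (w * f).T @ g                             # weighted cross-covariance
            penalty = penalty + (cov ** 2).sum()
    return penalty
```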
Besides graph data augmentation and new graph models, some works focus on exploiting training schemes with tailored optimization objectives and constraints to promote OOD generalization, including graph invariant learning, graph adversarial training, and graph self-supervised learning.

Invariant learning, which aims to exploit the invariant relationships between features and labels across different distributions while disregarding the variant spurious correlations, can provably achieve satisfactory OOD generalization under distribution shifts [Arjovsky et al., 2019; Chang et al., 2020; Ahuja et al., 2021]. Specifically, when assessing causality is challenging or its strong assumptions are violated in practice, we can approximate the task by searching for features that are invariant under distribution shifts [Chang et al., 2020].

On the one hand, some graph methods adopt regularizers to encourage the similarity of representations across multiple environments so as to learn environment-invariant representations, which can be formulated as:

$$\min_{\theta} \mathbb{E}_{X,Y \sim P_{\mathrm{train}}}\left[\ell\left(f_\theta(X), Y\right)\right] + \lambda\, \ell_{\mathrm{reg}},$$

where $\ell_{\mathrm{reg}}$ denotes the loss of the adopted regularizer. Specifically, SR-GNN [Zhu et al., 2021] proposes to address node-level OOD generalization in GNNs by explicitly minimizing the distributional differences between biased training data and the true inference distribution of graphs. Zhang et al. [2021] try to capture environment-invariant node properties and explicitly balance the multiple environments to generalize well under distribution shifts.

On the other hand, some methods are built upon the invariance principle to address the OOD generalization problem in a more principled way. Following recent invariant learning based OOD generalization studies [Arjovsky et al., 2019; Chang et al., 2020; Ahuja et al., 2021], these methods treat the cause of distribution shifts between testing and training graph data as a potential unknown environmental variable $e$. The optimization objective can be formulated as:

$$\min_{\theta} \max_{e \in \mathrm{supp}(\mathcal{E})} R(f_\theta \mid e),$$

where $\mathcal{E}$ is the support of training environments and $R(f_\theta \mid e) = \mathbb{E}^{e}_{X,Y}\left[\ell\left(f_\theta(X), Y\right)\right]$ is the risk of $f_\theta$ on the training environment $e$. Therefore, as shown in the last column of Table 1, this type of method relies on an explicit multiple-environment split (i.e., $|\mathcal{E}| > 1$) that is either provided in advance or generated during the training process.

Specifically, EERM [Wu et al., 2022a] is designed to handle node-level OOD generalization tasks on graphs and can achieve a guaranteed solution for the node-level OOD problem under mild conditions. The GNN backbone in EERM is optimized by minimizing the mean and variance of the risks from multiple training environments that are generated by environment generators, while the environment generators are trained to maximize the variance loss via a policy gradient method. DIR [Wu et al., 2022c] is proposed to handle graph-level OOD generalization tasks by discovering invariant rationales for GNNs. It first conducts interventions on the training distribution to create multiple interventional distributions, and then captures the causal rationales that are invariant across different distributions while filtering out the spurious patterns that are unstable for OOD generalization.
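A minimal sketch of the mean-plus-variance risk at the heart of this objective family (in the spirit of EERM's backbone objective) follows; the `envs` batches and `beta` weight are illustrative placeholders, and EERM's adversarially trained environment generators are omitted.

```python
import torch

def invariant_risk(model, envs, beta: float = 1.0,
                   loss_fn=torch.nn.functional.cross_entropy):
    """Mean plus variance of per-environment risks: a low variance across
    environments steers the predictor toward environment-invariant predictive
    patterns. `envs` is a list of (X, y) batches, one per training environment
    (given in advance or produced by a generator)."""
    risks = torch.stack([loss_fn(model(X), y) for X, y in envs])
    return risks.mean() + beta * risks.var()
```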
Adversarial training has been demonstrated to improve model robustness against adversarial attacks as well as OOD generalization ability. Here we mainly focus on graph adversarial training methods that improve generalization; works protecting GNNs from attacks can be found in a previous survey [Sun et al., 2018].

Some works extend existing OOD generalization algorithms to graphs. DAGNN [Wu et al., 2019] is motivated by DANN [Ganin et al., 2016] and learns domain-invariant graph representations through a minimax game between a domain classifier and the encoder. Sadeghi et al. [2021] adapt DRO [Rahimian and Mehrotra, 2019], a classical algorithm for the OOD generalization problem, to graphs for node-level semi-supervised tasks. Other works take more account of graph properties. GraphAT [Feng et al., 2019] minimizes a graph adversarial regularizer that reduces the prediction divergence between a perturbed target example and its connected examples, so that it can resist worst-case perturbations in graph-based learning and enhance generalization ability. Xue et al. [2021] observe that GNNs are prone to falling into sharp local minima of the loss landscape with respect to both model weights and features; they therefore propose co-adversarial perturbation (CAP) optimization to alternately flatten the weight and feature loss landscapes, which avoids locally sharp minima and enhances generalization ability. Wu et al. [2022b] first derive a generalization bound for OOD graph classification tasks, and further improve generalization by proposing a weighted truncated adversarial weight perturbation (WT-AWP) algorithm on graphs, which alleviates the vanishing-gradient issue.

Self-supervision is an emerging technique that has been employed to train neural networks for more generalizable predictions in the image domain. It has also been shown that self-supervised learning can help GNNs gain more generalization ability [You et al., 2020b]. Hu et al. [2020b] explore several graph pre-training techniques at both the node level and the graph level to improve the OOD generalization of GNNs. Yehudai et al. [2021] study the ability of GNNs to generalize from small to large graphs, proposing a self-supervised pretext task aimed at learning useful d-pattern representations. Besides, graph contrastive learning can also be adopted to promote OOD generalization. You et al. [2020a] argue that self-supervision with handcrafted pretext tasks relies on heuristics for its design and thus could limit the generality of the learned graph representations; they therefore develop GraphCL, one of the representative contrastive self-supervised learning methods for GNNs, and show its generalization ability in practice. GCC [Qiu et al., 2020] designs subgraph instance discrimination to capture universal structural patterns for generalization. Xu et al. [2021] attribute such success to the fact that self-supervised learning maps semantically similar data to similar representations, so some OOD testing data may fall inside the training distribution after the mapping. Furthermore, Liu et al. [2022] propose a graph self-training framework, DR-GST, that can recover the original labeled dataset without distribution shifts, so that more unlabeled nodes can be assigned pseudo labels whose distribution matches that of the labeled nodes, benefiting OOD generalization. Here we mainly review the typical graph self-supervised methods that claim to improve OOD generalization; for other graph self-supervised methods, readers may refer to dedicated surveys.
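As an illustration of the contrastive objective shared by GraphCL-style methods, here is a minimal PyTorch sketch of the NT-Xent loss over two augmented views of a batch of graph-level representations; it is a generic formulation rather than any single paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """NT-Xent contrastive loss between two augmented views of the same batch:
    z1[i] and z2[i] encode two augmentations of graph i and are pulled together,
    while all other pairs in the batch are pushed apart."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, d), unit norm
    sim = z @ z.T / tau                                 # scaled cosine similarities
    n = z.size(0)
    sim.fill_diagonal_(float('-inf'))                   # exclude self-similarity
    # positive for row i is the other view of the same graph: i+B or i-B
    targets = torch.cat([torch.arange(n // 2, n), torch.arange(0, n // 2)])
    return F.cross_entropy(sim, targets)
```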
In this section, we review literature focusing on theoretical analysis, including generalization bounds of GNNs [Puny et al., 2020]. The work of [Verma and Zhang, 2019] analyzes the stability and generalization of graph convolutional networks. Another line of work studies the graph neural tangent kernel (GNTK). Note that GNTK is induced by infinitely wide GNNs, whose prediction depends only on pairwise kernel values between graphs and can be calculated efficiently with an analytic formula. It enjoys the full expressive power of GNNs while inheriting the benefits of graph kernels, e.g., being easy to train and having provable theoretical guarantees. Baranwal et al. [2021] investigate the OOD generalization of GNNs under a specific data-generating mechanism (i.e., the contextual stochastic block model) and analyze the relation between linear separability and OOD generalization on graphs.

To promote further research on graph OOD generalization, we summarize the existing popular graph datasets for evaluation in Table 2. There are two groups of datasets: one for node-level tasks and the other for graph-level tasks. These datasets cover multiple sources of graphs (e.g., social networks, citation networks, molecular graphs, etc.), and their causes of distribution shifts are complex and diverse (e.g., time, species, scaffold, etc.). Besides these real-world datasets, some synthetic datasets have been proposed in the literature for simulating controllable distribution shifts (e.g., Spurious-Motif [Wu et al., 2022c], TR3 [Wu et al., 2022d], etc.) or for artificially creating distribution shifts on benchmarks [Zhu et al., 2021; Zhang et al., 2021; Ding et al., 2021].

• Theoretical Guarantees: While some graph OOD generalization methods have made great progress empirically, there is still a large gap between these methods and the theories. It is a critical step to derive a theoretical characterization of a learnable graph OOD generalization problem and further develop methods with theoretical guarantees for OOD optimality.

• GNN Architecture: Recently, some studies [Knyazev et al., 2019; Veličković et al., 2020; Xu et al., 2021; Ko et al., 2021] highlight the importance of careful GNN architecture design (e.g., the readout operation) for gracefully generalizing to OOD data. How to design theoretically guaranteed or automatically learned GNN architectures for OOD generalization remains to be explored.

• Environment Split: The majority of general OOD generalization algorithms require multiple training environments [Shen et al., 2021]. However, it is prohibitively expensive to collect accurate environment labels for real-world graphs that are not easily human-understandable, limiting the adoption of those algorithms. It is worth investigating how to develop graph OOD generalization methods for single or even dynamic environments [Wang et al., 2021a; Galke et al., 2021], or how to infer environment splits accurately during training.

• Domain Knowledge: While incorporating proper domain knowledge can benefit predictive performance, there is still a lack of deep understanding of how to integrate domain knowledge to help graph OOD generalization, e.g., for biochemical graphs [Han et al., 2021], knowledge graphs [Sinha et al., 2020; Li et al., 2022], etc.

• Broader Scope of Applications: It is a promising direction to deploy OOD-generalized graph methods in various realistic applications, e.g., recommender systems, social networks, traffic prediction, and risk-sensitive finance or healthcare, for more effective, trustworthy, and satisfying predictions.
References

Kartik Ahuja et al. Invariance principle meets information bottleneck for out-of-distribution generalization. NeurIPS, 2021.
Martin Arjovsky et al. Invariant risk minimization. arXiv:1907.02893, 2019.
Aseem Baranwal et al. Graph convolution for semi-supervised classification: Improved linear separability and out-of-distribution generalization. ICML, 2021.
Wenzheng Feng et al. Graph random neural networks for semi-supervised learning on graphs. NeurIPS, 2020.
Irina Higgins et al. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.
Hsieh et al. Drug repurposing for COVID-19 using graph neural network and harmonizing multiple evidence. Scientific Reports, 2021.
Boris Knyazev et al. Understanding attention and generalization in graph neural networks. NeurIPS, 2019.
Haoyang Li et al. Disentangled contrastive learning on graphs. NeurIPS, 2021.
Li et al. How does knowledge graph embedding extrapolate to unseen data: a semantic evidence view. AAAI, 2022.
Liu et al. Confidence may cheat: Self-training on graph neural networks under distribution shift. WebConf, 2022.
Jianxin Ma et al. Disentangled graph convolutional networks. ICML, 2019.
Ma et al. Subgroup generalization and fairness of graph neural networks. NeurIPS, 2021.
Hyeonjin Park et al. Metropolis-Hastings data augmentation for graph neural networks. NeurIPS, 2021.
Jiezhong Qiu et al. DeepInf: Social influence prediction with deep learning. KDD, 2018.
Jiezhong Qiu et al. GCC: Graph contrastive coding for graph neural network pre-training. KDD, 2020.
Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. ICANN, 2018.
Koustuv Sinha et al. Evaluating logical generalization in graph neural networks. arXiv, 2020.
Petar Veličković et al. Neural execution of graph algorithms. ICLR, 2020.
Saurabh Verma and Zhi-Li Zhang. Stability and generalization of graph convolutional neural networks. KDD, 2019.
Quan Wang et al. Knowledge graph embedding: A survey of approaches and applications. TKDE, 2017.
Jindong Wang et al. Generalizing to unseen domains: A survey on domain generalization. IJCAI, 2021.
Wu et al. Adversarial weight perturbation improves generalization in graph neural networks. 2022.
Haotian Xue et al. CAP: Co-adversarial perturbation on weights and features for improving generalization of graph neural networks. 2021.
Zhao et al. Counterfactual graph learning for link prediction. 2021.