key: cord-0217020-ry60o42a authors: Zhang, Ziwei; Niu, Chenhao; Cui, Peng; Pei, Jian; Zhang, Bo; Zhu, Wenwu title: Permutation-equivariant and Proximity-aware Graph Neural Networks with Stochastic Message Passing date: 2020-09-05 journal: nan DOI: nan sha: fbec4ced22d2c3768ff4e59279b31029ab436ccf doc_id: 217020 cord_uid: ry60o42a

Graph neural networks (GNNs) are emerging machine learning models on graphs. Permutation-equivariance and proximity-awareness are two important properties highly desirable for GNNs. Both properties are needed to tackle some challenging graph problems, such as finding communities and leaders. In this paper, we first analytically show that the existing GNNs, mostly based on the message-passing mechanism, cannot simultaneously preserve the two properties. Then, we propose the Stochastic Message Passing (SMP) model, a general and simple GNN that maintains both proximity-awareness and permutation-equivariance. In order to preserve node proximities, we augment the existing GNNs with stochastic node representations. We theoretically prove that this mechanism enables GNNs to preserve node proximities and, at the same time, maintain permutation-equivariance with certain parametrization. We report extensive experimental results on ten datasets and demonstrate the effectiveness and efficiency of SMP for various typical graph mining tasks, including graph reconstruction, node classification, and link prediction.

Graph neural networks (GNNs), as generalizations of neural networks for learning on graph data, have enjoyed successes in many applications, such as social recommendation [1], physical simulation [2], and protein interaction prediction [3]. The existing GNNs are mostly based on the message-passing mechanism [4]. A fundamental property well preserved by the message-passing GNNs is permutation-equivariance, i.e., if we randomly permute the IDs of nodes while keeping the graph structure unchanged, the representations of nodes in those GNNs are permuted accordingly. Mathematically, permutation-equivariance reflects a basic symmetry group of graph structures. Permutation-equivariance is highly useful for many graph mining tasks, such as node or graph classification [5], [6]. As another important property, pairwise proximities between nodes are crucial for some other graph mining tasks, such as link prediction and community detection [7], [8]. Some GNNs, such as Position-aware GNN (P-GNN) [8], have specifically designed mechanisms to ensure proximity-awareness.

In many applications of GNNs, both proximity-awareness and permutation-equivariance are indispensable. Consider mining communities and leaders in graphs, where leaders are characterized by the k-core number, a type of node centrality. To discover the communities, proximity-awareness is essential since the nodes in the same community are tightly connected and have large proximities. Permutation-equivariance helps to measure the centrality because most centrality measurements are permutation-equivariant by definition [9]. Do the existing GNNs, which are built on message-passing, honor both proximity-awareness and permutation-equivariance? Surprisingly and unfortunately, the answer is no. We show that proximity-awareness and permutation-equivariance are incompatible in the existing GNNs (see Theorem 1).
This deficiency in the existing GNNs is particularly irritating since, for the same task, different datasets may rely on the two properties to different extents. Taking link prediction as an example, we observe that permutation-equivariant GNNs such as GCN [10] or GAT [11] show better performance than P-GNN on coauthor graphs, but perform worse on biological graphs (see Section 5.3 for details). A work on drug repurposing for COVID-19 [12] shows a similar dilemma: proximity-aware methods and permutation-equivariant GNNs discover completely different drug candidates.

Can we develop a general GNN that is proximity-aware and also maintains permutation-equivariance? In this paper, we propose Stochastic Message Passing (SMP)^1, a general and simple GNN to preserve both the proximity-awareness and permutation-equivariance properties. In order to preserve node proximities, we augment the existing GNNs with stochastic node representations. We theoretically prove that this mechanism enables GNNs to preserve node proximities (see Theorems 2 and 3). At the same time, SMP is equivalent to a permutation-equivariant GNN with certain parametrization and thus is at least as powerful as those GNNs on permutation-equivariant tasks (see Remark 1). Therefore, SMP is general and flexible in handling both proximity-aware and permutation-equivariant tasks, which is also demonstrated by our extensive experimental results. Besides, owing to its simple structure, SMP is computationally efficient, with a running time roughly the same as that of the simplest GNNs, such as SGC [13], and is at least an order of magnitude faster than P-GNN on large graphs. Our ablation studies further show that a linear instantiation of SMP is expressive enough, as adding extra non-linearities does not lift the performance of SMP on the majority of datasets.

Our contributions are summarized as follows.
• We propose Stochastic Message Passing (SMP), a simple and general GNN to handle both proximity-aware and permutation-equivariant graph tasks.
• We prove that SMP has theoretical guarantees in preserving walk-based proximities and is as powerful as the existing GNNs on permutation-equivariant tasks.
• Extensive experimental results demonstrate the effectiveness and efficiency of SMP. We show that a linear SMP instantiation is expressive enough on the majority of datasets.

The rest of the paper is organized as follows. We review related work in Section 2. In Section 3, we show the incompatibility between walk-based proximity and permutation-equivariance in message-passing GNNs. We develop our proposed Stochastic Message Passing model in Section 4. We report an extensive experimental study in Section 5, and conclude the paper in Section 6. We provide additional experiments, details for reproducibility, and proofs in the appendix.

We briefly review GNNs, the permutation-equivariance property, and the proximity-awareness property. We refer readers to [14] for a comprehensive survey. The earliest GNNs adopt a recursive definition of node states [15], [16] or a contextual realization [17]. GGS-NNs [18] replace the recursive definition with recurrent neural networks (RNNs). Spectral GCNs [19] define graph convolutions using graph signal processing [20], with ChebNet [21] and GCN [10] approximating the spectral filters using a low-order Chebyshev polynomial and the first-order polynomial, respectively.
MPNNs [4] , GraphSAGE [3] , and MoNet [22] are proposed as general frameworks by characterizing GNNs with a message-passing function and an updating function. More advanced variants such as GAT [11] , JK-Nets [23] , GIN [24] , and GraphNets [25] follow these frameworks. 1 . Code is available at https://github.com/NiuChH/SMP. Li et al. [26] , Xu et al. [24] , Morris et al. [27] , and Maron et al. [28] show the connection between GNNs and the Weisfeiler-Lehman algorithm [29] of graph isomorphism tests, in which permutation-equivariance holds a key constraint. Further, Maron et al. [6] and Keriven et al. [5] analyze the permutation-equivariance property of GNNs theoretically. To date, most of the existing GNNs are permutationequivariant and are not proximity-aware. One exception is P-GNN [8] , which proposes to capture the positions of nodes using the relative distance between the target node and some randomly chosen anchor nodes. However, P-GNN cannot satisfy permutation-equivariance and is computationally expensive. Concurrent to our work, some studies propose to use position encodings to enhance GNNs in preserving graph structures [30] , [31] , [32] , [33] . Most of these methods rely on eigenvectors of a graph matrix, which are computationally expensive. Our proposed stochastic node representations can also be regarded as a type of position encoding while being extremely simple yet efficient. In order to enhance the expressive power of GNNs in graph isomorphism tests and also motivated by the literature on distributed computing [34] , some studies suggest assigning unique node identifiers for GNNs [35] , such as one-hot IDs [36] or random numbers [37] , [38] , [39] . For example, Sato et al. [38] show that random numbers can enhance GNNs in tackling two graph-based NP problems with a theoretical guarantee, namely the minimum dominating set and the maximum matching problem. Fey et al. [40] empirically show the effectiveness of random features in the graph matching problem. Concurrent to our work, RNI [41] shows that GNNs with random node features are universal in theory. Our work here, which also adopts stochastic node representations, differs in that we systematically study how to preserve permutation-equivariance and proximityawareness simultaneously in a simple yet effective framework, a novel topic distinct from those existing studies. Besides, we theoretically prove that our proposed method can preserve walk-based proximities. We also demonstrate the effectiveness of our method on large-scale benchmarks for both node-and edge-level tasks, while no similar results are reported in the literature. Another line of research is tackling the over-smoothing problem [26] , [42] and developing deep GNNs [43] . Since these studies are orthogonal to our paper, we expect these strategies to also work for our proposed SMP. The design of our method is also inspired by the literature on random projection for dimensionality reduction [44] . To the best of our knowledge, we are the first to study random projection in the scope of GNNs. More remotely, our definition of node proximities is inspired and inherited from graph kernels [45] , [46] , network embedding [47] , [48] , and the general studies of graphs [49] . In this section, we first introduce preliminaries of messagepassing GNNs, walk-based proximities, and permutationequivariance. Then, we show the incompatibility between proximity-awareness and permutation-equivariance in the existing GNNs. 
We consider a graph G = (V, E, F), where V = {v_1, ..., v_N} is a set of N = |V| nodes, E ⊆ V × V is a set of M = |E| edges, and F ∈ R^{N×d_0} is the matrix of d_0-dimensional node features. Denote by A the adjacency matrix and by A_{i,:}, A_{:,j}, and A_{i,j}, respectively, the i-th row, the j-th column, and an element of the matrix. In this paper, we assume unweighted and undirected graphs. The neighborhood of node v_i is denoted by N(v_i), and [·, ·] denotes the concatenation operation.

The existing GNNs usually follow a message-passing framework [4], where a neighborhood aggregation function AGG(·) and an updating function UPDATE(·) are adopted in the l-th layer:

m_i^{(l)} = AGG^{(l)}({(h_j^{(l-1)}, e_{i,j}) : v_j ∈ N(v_i)}),  h_i^{(l)} = UPDATE^{(l)}(h_i^{(l-1)}, m_i^{(l)}),   (1)

where h_i^{(l)} ∈ R^{d_l} is the representation of node v_i at the l-th layer, d_l is the dimensionality, e_{i,j} is the edge feature when available, and m_i^{(l)} are the messages. The node representations are initialized as the node features, i.e., H^{(0)} = F. We represent a GNN following Eq. (1) with L layers by a parameterized function as follows^2:

H^{(L)} = F_GNN(A, F; W),

where H^{(L)} is the node representation learned by the GNN and W represents all the parameters. Footnote 2: since the final layer of GNNs is task-specific, e.g., a softmax layer for node classification or a readout layer for graph classification, we only consider the GNN architecture up to its last hidden layer.

A key property of GNNs is permutation-equivariance.

Definition 1 (Permutation-equivariance). Consider a graph G = (V, E, F) and any permutation P : V → V so that G' = (V, E', F') has an adjacency matrix A' = PAP^T and a feature matrix F' = PF, where P ∈ {0, 1}^{N×N} is the permutation matrix, i.e., P_{i,j} = 1 iff P(v_i) = v_j. A GNN satisfies permutation-equivariance if the node representations for G and G' are equivariant with respect to P, i.e., F_GNN(PAP^T, PF; W) = P F_GNN(A, F; W).

It is well known that GNNs following Eq. (1) satisfy permutation-equivariance [6].
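The paper itself contains no code listings; as a concrete illustration of Definition 1, the following minimal NumPy sketch (ours, not the authors') implements one mean-aggregation message-passing step of the form of Eq. (1) and numerically checks that permuting the input graph permutes the output representations. The function name gnn_layer and all sizes are hypothetical.

```python
# Illustrative sketch (not from the paper): one message-passing layer of the
# form of Eq. (1) and a numerical check of permutation-equivariance (Definition 1).
import numpy as np

rng = np.random.default_rng(0)
N, d0, d1 = 8, 5, 4
A = rng.integers(0, 2, size=(N, N))
A = np.triu(A, 1); A = A + A.T               # undirected adjacency, no self-loops
F = rng.normal(size=(N, d0))                 # node features
W = rng.normal(size=(d0, d1))                # shared parameters

def gnn_layer(A, H, W):
    """AGG: mean of neighbour states; UPDATE: combine self-state and message."""
    deg = np.maximum(A.sum(1, keepdims=True), 1)
    msg = (A @ H) / deg
    return np.tanh((H + msg) @ W)

perm = rng.permutation(N)
P = np.eye(N)[perm]                          # permutation matrix

H_perm = gnn_layer(P @ A @ P.T, P @ F, W)    # run on the permuted graph
H_orig = gnn_layer(A, F, W)
print(np.allclose(H_perm, P @ H_orig))       # True: the layer is permutation-equivariant
```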
Definition 2 (Automorphism). A graph G is said to have a (non-trivial) automorphism if there exists a non-identity permutation matrix P ≠ I_N so that A = PAP^T and F = PF, i.e., the graph has a non-trivial isomorphism to itself. We denote by C_G the set of automorphic node pairs, i.e., pairs (v_i, v_j) that are mapped onto each other by some such P.

Using Definitions 1 and 2, we immediately have the following corollary.

Corollary 1. If a graph has a non-trivial automorphism, a permutation-equivariant GNN produces identical node representations for automorphic node pairs, i.e., H_{i,:}^{(L)} = H_{j,:}^{(L)} for all (v_i, v_j) ∈ C_G.

Since node representations are used for downstream tasks, Corollary 1 shows that permutation-equivariant GNNs cannot differentiate automorphic node pairs. A direct consequence is that permutation-equivariant GNNs cannot preserve walk-based proximities between pairs of nodes.

Definition 3 (Walk-based Proximities). For a given graph G = (V, E, F), denote by the matrix S ∈ R^{N×N} the walk-based proximities between pairs of nodes, defined by S_{i,j} = S({v_i ⇝ v_j}), where {v_i ⇝ v_j} represents the set of walks from node v_i to node v_j and S(·) is a real-valued function. The length of a walk-based proximity is the maximum length of all the walks of the proximity.

Typical examples of walk-based proximities include the shortest distance [8], the high-order proximities (a sum of walks weighted by their lengths) [50], and random walk probabilities [51]. For a given walk-based proximity, a GNN is said to preserve the proximity if there exists a decoder function F_de(·) such that, for any graph G = (V, E, F) and any ε > 0, there exist parameters W_G so that

|S_{i,j} − F_de(H_{i,:}^{(L)}, H_{j,:}^{(L)})| < ε, ∀ v_i, v_j ∈ V, where H^{(L)} = F_GNN(A, F; W_G).

For notational convenience, we also write the decoder function as F_de(H^{(L)}) when there is no ambiguity. The definition applies to any GNN architecture as long as it fits Eq. (1). Moreover, in the definition, we only constrain the inputs of the decoder function to be the node representations and the proximity function S(·), but we do not constrain the form of the decoder function. In other words, the decoder function can be arbitrarily sophisticated, e.g., a deep neural network with a sufficient number of layers and hidden units. Now we are ready to present the incompatibility result.

Theorem 1. For any walk-based proximity function S(·) satisfying Definition 3, a permutation-equivariant GNN cannot preserve S(·), except for the trivial situation where all node pairs have the same proximity, i.e., S_{i,j} = c for all i, j, where c is a constant.^3 Footnote 3: Proposition 1 in P-GNN [8] can be regarded as a special case of Theorem 1 using the shortest distance proximity.

Proof. We prove by contradiction. Assume there exists a non-trivial S(·) that can be preserved by a permutation-equivariant GNN. Consider any graph G = (V, E, F). We construct a graph G' with automorphism from G so that the GNN cannot preserve S(·) on G'. Specifically, let G' consist of two disjoint "copies" of the original graph, one indexed from 1 to N and the other from N + 1 to 2N, i.e., A' is the block-diagonal matrix formed by two copies of A and F' stacks two copies of F. By assumption, there exists a permutation-equivariant GNN that preserves S(·) on G'. Let the node representations of such a GNN be H'^{(L)} = F_GNN(A', F'; W_{G'}). It is easy to see that nodes v_i and v_{i+N} in G' form an automorphic node pair. According to Corollary 1, their representations are identical, i.e., H'_{i,:}^{(L)} = H'_{i+N,:}^{(L)} for all 1 ≤ i ≤ N. Note that there exists no walk between the two copies, i.e., {v_i ⇝ v_j} = {v_j ⇝ v_i} = ∅ for all i ≤ N, j > N, so S'_{i,j} = S(∅) for such cross-copy pairs. As a result, for all i ≤ N, j ≤ N and all ε > 0, we have

|S_{i,j} − S(∅)| ≤ |S'_{i,j} − F_de(H'_{i,:}, H'_{j,:})| + |F_de(H'_{i,:}, H'_{j+N,:}) − S'_{i,j+N}| < 2ε,

where we use S'_{i,j} = S_{i,j}, S'_{i,j+N} = S(∅), and H'_{j,:} = H'_{j+N,:}. We can prove the same for all i > N, j > N, and cross-copy pairs satisfy the bound directly since their proximity equals S(∅). Combining the results, we have |S_{i,j} − S(∅)| < 2ε for all i, j and all ε > 0. Since ε can be arbitrarily small, all node pairs have the same proximity c = S(∅). In other words, S(·) is trivial, a contradiction.

An alternative proof that constructs connected counterexample graphs is provided in Appendix C.2. Since walk-based proximities are rather general and widely used in many graph mining tasks such as link prediction, Theorem 1 shows that the existing permutation-equivariant GNNs cannot handle these tasks well.
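To make the two-copy construction in the proof tangible, here is a small numerical sketch (ours, not the authors'): a parameter-free propagation GNN is run on two disconnected copies of a random graph, the copies receive identical representations, and an inner-product decoder therefore cannot separate within-copy pairs (which share walks) from cross-copy pairs (which share none).

```python
# Illustrative sketch (not from the paper): the two-copy construction used in
# the proof of Theorem 1. Automorphic copies get identical representations, so
# any decoder on the representations scores (i, j) and (i, j+N) identically.
import numpy as np

rng = np.random.default_rng(1)
N = 6
A = rng.integers(0, 2, size=(N, N)); A = np.triu(A, 1); A = A + A.T
F = rng.normal(size=(N, 3))

Z = np.zeros((N, N))
A2 = np.block([[A, Z], [Z, A]])               # two disconnected copies of G
F2 = np.vstack([F, F])

def propagate(A, H, steps=2):
    """A simple permutation-equivariant GNN: repeated mean propagation."""
    deg = np.maximum(A.sum(1, keepdims=True), 1)
    for _ in range(steps):
        H = (A @ H) / deg
    return H

H = propagate(A2, F2)
print(np.allclose(H[:N], H[N:]))              # True: node i and its copy i+N coincide
i, j = 0, 1
print(H[i] @ H[j], H[i] @ H[j + N])           # identical inner-product scores,
                                              # although no walk joins i and j+N
```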
In this section, we develop our Stochastic Message Passing (SMP) model. We first describe our framework, and then explore a linear implementation and non-linear extensions.

Theorem 1 indicates that a major shortcoming of permutation-equivariant GNNs is that they cannot differentiate automorphic node pairs. To solve this problem, we need to introduce a mechanism for "symmetry breaking", i.e., to enable GNNs to distinguish symmetric nodes. To achieve this goal, we sample a stochastic matrix E ∈ R^{N×d}, where each element follows an i.i.d. standard normal distribution N(0, 1). We leave exploring other possible stochastic signals besides Gaussian distributions as future work. The stochastic matrix can provide signals to distinguish the nodes because its rows are randomly sampled without being affected by the graph automorphism. In fact, we can easily calculate that the Euclidean distance between two stochastic signals divided by the constant √2, i.e., ||E_{i,:} − E_{j,:}||_2 / √2, follows a chi distribution χ_d. When d is reasonably large, e.g., d > 20, the probability of two signals being close is very low. Then, inspired by the message-passing framework, we apply a GNN to the stochastic matrix so that the nodes can exchange information about the stochastic signals:

Ẽ = F_GNN(A, E; W).   (10)

We call Ẽ the stochastic representation of the nodes. By message-passing on the stochastic signals, Ẽ can be used to preserve node proximities (as will be shown in Theorem 2 and Theorem 3 in a moment). To still allow our model to utilize node features, we concatenate Ẽ with the node representations from another GNN that takes the node features as inputs. That is,

H = F_output([Ẽ, F_GNN(A, F; W')]),   (11)

where F_output(·) is an aggregation function, such as a linear function or simply the identity mapping. In a nutshell, our proposed method augments the existing GNNs with a stochastic representation learned by message-passing to differentiate different nodes and preserve node proximities.

There is also a delicate choice worth mentioning, i.e., whether the stochastic matrix E is fixed or resampled in each epoch. On the one hand, by fixing E, the model can learn to memorize the stochastic representation and distinguish different nodes, but at the cost of being unable to handle nodes not seen during training. On the other hand, by resampling E in each epoch, the model can have a better generalization ability since it cannot simply remember one specific stochastic matrix. However, the node representations are then not fixed (though pairwise proximities are still preserved; see Theorem 2). In these cases, Ẽ is more capable of handling pairwise tasks such as link prediction or pairwise node classification. In this paper, we fix E for transductive datasets and resample E for inductive datasets (see Section 5.1 for the experimental settings and Section 5.7 for an ablation study of this design).

Time Complexity. From Eq. (11), the time complexity of our framework mainly depends on the two GNNs used to learn the stochastic and the permutation-equivariant node representations. In this paper, we instantiate these two GNNs with simple message-passing GNNs, such as GCN [10] and SGC [13] (see Section 4.2 and Section 4.3). Thus, the time complexity of our method is the same as that of the models employed, which is O(M), i.e., linear with respect to the number of edges. We also empirically compare the running time of different models in Section 5.8. Besides, GNN acceleration schemes such as sampling [52], [53], [54] or partitioning the graph [55] can be directly applied to our framework.

Based on the general framework in Eq. (11), let us explore its minimum model instantiation, i.e., a linear model. Specifically, inspired by Simplified Graph Convolution (SGC) [13], we adopt linear message-passing for both GNNs, i.e.,

Ẽ = Ã^K E,  H^{(L)} = Ã^K F,   (12)

where Ã = (D + I)^{−1/2}(A + I)(D + I)^{−1/2} is the normalized graph adjacency matrix with self-loops proposed in GCN [10], D is the diagonal degree matrix, I is the identity matrix, and K is the number of propagation steps. We also set F_output(·) in Eq. (12) to a linear mapping or the identity mapping. Elegantly, this simple SMP instantiation has a theoretical guarantee on preserving walk-based proximities.

Theorem 2. An SMP in Eq. (12) can preserve the walk-based proximity Ã^K(Ã^K)^T with high probability if the dimensionality d of the stochastic matrix is sufficiently large, i.e., for any ε > 0 and δ > 0, there exists d_0 so that for any d > d_0,

P( |[Ã^K(Ã^K)^T]_{i,j} − F_de(H_{i,:}, H_{j,:})| < ε, ∀ i, j ) > 1 − δ,

where H is the node representation obtained from SMP in Eq. (12). The result holds for any such stochastic matrix, no matter whether E is fixed or resampled in each epoch.
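Before the formal proof, a quick numerical illustration (ours, not the authors' code) of Eq. (12) and the claim of Theorem 2: with the normalized inner-product decoder used in the proof below, Ẽ Ẽ^T / d concentrates around Ã^K (Ã^K)^T as d grows. Graph size and values of d are hypothetical.

```python
# Illustrative sketch (not from the paper): the linear SMP of Eq. (12) and an
# empirical check of Theorem 2 on a random graph.
import numpy as np

rng = np.random.default_rng(0)
N, K = 30, 2
A = rng.integers(0, 2, size=(N, N)); A = np.triu(A, 1); A = A + A.T

# Normalized adjacency with self-loops, as in GCN.
d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1) + 1))
A_tilde = d_inv_sqrt @ (A + np.eye(N)) @ d_inv_sqrt
A_K = np.linalg.matrix_power(A_tilde, K)

S = A_K @ A_K.T                          # target walk-based proximity

for d in [16, 256, 4096]:
    E = rng.normal(size=(N, d))          # stochastic node signals ~ N(0, 1)
    E_tilde = A_K @ E                    # stochastic representations, Eq. (12)
    S_hat = (E_tilde @ E_tilde.T) / d    # normalized inner-product decoder
    print(d, np.abs(S - S_hat).max())    # max error typically shrinks as d grows
```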
Proof. Our proof is mostly based on the theory of random projections. First, since Theorem 1 shows that permutation-equivariant representations cannot preserve any non-trivial walk-based proximity, we develop the proof assuming H = Ẽ. This can easily be achieved in the model by ignoring H^{(L)}: for example, if we set F_output(·) as a linear function, the model can learn to set the weights corresponding to H^{(L)} to all-zeros and the weights corresponding to Ẽ to an identity matrix. We set the decoder function to be the normalized inner product F_de(H_{i,:}, H_{j,:}) = (1/d) H_{i,:} H_{j,:}^T. Let a_i = [Ã^K]_{i,:} denote the i-th row of Ã^K, so that Ẽ_{i,:} = a_i E, F_de(H_{i,:}, H_{j,:}) = (1/d) a_i E E^T a_j^T, and S_{i,j} = a_i a_j^T. Since E is a Gaussian random matrix, the Johnson–Lindenstrauss lemma [44] (in its inner-product preservation form, e.g., see Corollary 2.1 and its proof in [56]) states that for any 0 < ε' < 1/2, the probability that |(1/d) a_i E E^T a_j^T − a_i a_j^T| exceeds ε'(||a_i||^2 + ||a_j||^2)/2 decreases exponentially in d. Choosing ε' on the order of ε / max_i ||a_i||^2 makes this deviation smaller than ε, and a union bound over all N^2 node pairs then gives P(|S_{i,j} − F_de(H_{i,:}, H_{j,:})| ≥ ε for some i, j) ≤ δ for every d larger than a threshold d_0 obtained by solving the resulting inequality, which proves the theorem.

Next, we show that SMP is equivalent to a permutation-equivariant GNN with certain parametrization (Remark 1); the result is straightforward from the definition, and it yields the following corollary.

Corollary 2. For any task, Eq. (11) with a linear F_output(·) as described in Remark 1 is at least as powerful as the permutation-equivariant F_GNN(A, F; W), i.e., the minimum training loss of using H in Eq. (11) is equal to or smaller than that of using H^{(L)} = F_GNN(A, F; W) alone.

In other words, SMP will not hinder the performance even if the tasks are strictly permutation-equivariant, since the stochastic representations are concatenated with the permutation-equivariant GNN and followed by a linear mapping, which can simply zero out the stochastic part. In these cases, the linear SMP is equivalent to SGC [13]. Combining Theorem 2 and Corollary 2, the linear SMP instantiation in Eq. (12) is capable of handling both proximity-aware and permutation-equivariant tasks.

One may be curious whether a more sophisticated variant of Eq. (11) can further improve the expressiveness of SMP. There are three adjustable components in Eq. (11): the two GNNs that propagate the stochastic matrix and the node features, respectively, and the output function. In theory, adopting non-linear models for any of these components can enhance the expressiveness of SMP. Indeed, if we use a sufficiently expressive GNN in learning Ẽ instead of linear propagations, we can prove a more general version of Theorem 2.

Theorem 3. For any length-L walk-based proximity, i.e., a proximity S_{i,j} = S({v_i ⇝ v_j : len(v_i ⇝ v_j) ≤ L}), where len(·) is the length of a walk, there exists an SMP variant of Eq. (11) with F_GNN(A, E; W) containing L + 1 layers (including the input layer) that preserves the proximity if the following conditions hold: (1) the stochastic matrix E contains identifiable unique signals for different nodes, i.e., E_{i,:} ≠ E_{j,:} for all i ≠ j (here we assume that the Gaussian random vectors in E are rounded to machine precision so that E is drawn from a countable subspace); (2) the message-passing and updating functions in learning Ẽ are bijective; and (3) the decoder function F_de(·) also takes E as input and is a universal approximator.

Proof. The proof of the theorem is given in Appendix C.1.

We can also adopt more advanced methods for F_output(·), such as attention or even another GNN, so that the two GNNs are more properly integrated. Although non-linear extensions of SMP can, in theory, increase the model expressiveness, they also carry a higher risk of over-fitting due to the added model complexity, not to mention the increased computational cost. In practice, we find from our ablation studies that the linear SMP instantiation in Eq. (12) works reasonably well on most of the datasets (please refer to Section 5.7 for details).
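As an illustration of such non-linear variants (a minimal sketch of ours, not the authors' implementation; Section 5.7 studies variants of this kind), the following keeps the linear propagation for the stochastic signals, replaces the feature branch with a two-layer GCN-style network, and uses a small MLP as F_output. All dimensions and names are hypothetical.

```python
# Illustrative sketch (not from the paper): a non-linear SMP variant of Eq. (11).
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def normalized_adj(A):
    """GCN-style normalization with self-loops: (D+I)^-1/2 (A+I) (D+I)^-1/2."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1) + 1))
    return d_inv_sqrt @ (A + np.eye(A.shape[0])) @ d_inv_sqrt

def smp_nonlinear(A, F, d=64, K=2, hidden=32, out=16, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    N, d0 = F.shape
    A_t = normalized_adj(A)

    # Stochastic branch: E ~ N(0, 1), propagated K steps (kept linear here).
    E = rng.normal(size=(N, d))
    E_tilde = np.linalg.matrix_power(A_t, K) @ E

    # Feature branch: two GCN-style layers with ReLU.
    W1 = rng.normal(size=(d0, hidden)) * 0.1
    W2 = rng.normal(size=(hidden, hidden)) * 0.1
    H = relu(A_t @ relu(A_t @ F @ W1) @ W2)

    # F_output: a one-hidden-layer MLP on the concatenation [E_tilde, H].
    Wo1 = rng.normal(size=(d + hidden, hidden)) * 0.1
    Wo2 = rng.normal(size=(hidden, out)) * 0.1
    return relu(np.concatenate([E_tilde, H], axis=1) @ Wo1) @ Wo2

# Toy usage on a random graph.
rng = np.random.default_rng(1)
A = rng.integers(0, 2, size=(20, 20)); A = np.triu(A, 1); A = A + A.T
F = rng.normal(size=(20, 8))
print(smp_nonlinear(A, F).shape)   # (20, 16)
```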
In this section, we report our extensive experimental studies. We first describe the experimental settings, benchmarks, and baselines. Then, we use a synthetic dataset to demonstrate the simultaneous need for both permutation-equivariance and proximity-awareness in applications, illustrating the deficiencies of the existing GNNs in preserving the two properties and the capability of our SMP method. We use a series of benchmark tasks, including link prediction, node classification, and pairwise node classification, to comprehensively examine the capability of our SMP method against strong baselines. Next, we conduct ablation studies of SMP. Last, we evaluate the efficiency of our method.

Except for the proof-of-concept experiment in Section 5.2, we use the following setup. We conduct experiments on the following ten datasets: two simulation datasets, Grid and Communities (Comm for short) [8]; a communication dataset, Email [8]; two coauthor networks, CS and Physics [57]; two protein interaction networks, PPI [3] and PPA [7]; and three benchmarks, Cora, CiteSeer, and PubMed [58]. We summarize the statistics of the datasets in Table 1 and provide dataset details in Appendix B.1. These datasets cover a wide spectrum of application domains and sizes, with or without node features. Since Email and PPI contain more than one graph, we conduct experiments on them in an inductive setting, i.e., the training, validation, and testing sets are split with respect to different graphs. We repeat each experiment 5 times for all datasets except PPA (3 times for each experiment on PPA), and report the averaged results with the standard deviations after the plus-minus signs.

We adopt two sets of baselines. The first set is permutation-equivariant GNNs, including GCN [10], GAT [11], and SGC [13], which are widely adopted GNN architectures. The second set contains P-GNN [8], a representative proximity-aware GNN. In comparing with the baselines, we mainly evaluate two variants of SMP with different F_output(·): SMP-Identity, i.e., F_output(·) as an identity mapping, and SMP-Linear, i.e., F_output(·) as a linear function. Note that both variants adopt linear message-passing functions as in SGC. We conduct ablation studies with more SMP variants in Section 5.7. For fair comparisons, we adopt the same architecture and hyper-parameters for all the methods (please refer to Appendix B.2 for details). For datasets without node features, we adopt a constant vector as the node features.

We first conduct a proof-of-concept experiment to demonstrate the importance of preserving both permutation-equivariance and proximity-awareness. We generate a synthetic dataset similar in spirit to the example in Figure 1, as follows. First, the nodes are randomly partitioned into a set of communities. Nodes within the same community have a higher probability of forming edges than nodes in different communities, i.e., the well-known stochastic block model [49]. Then, within each community, we generate a social status for each node with two possible choices. If a node is active, it has a high probability of forming edges with other nodes in the same community; otherwise, the node is inactive and has a low probability of forming edges with others. From this generating process, we can see that proximity-awareness is essential to predict which community a node belongs to, since nodes within the same community have large proximities, while permutation-equivariance is helpful to predict whether a node is active or inactive, since the social status serves as a type of centrality measurement. Please refer to Appendix B.1 for further details of the synthetic dataset.
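For concreteness, here is a minimal generator for this synthetic dataset (our sketch, not the authors' code), using the sizes and edge probabilities reported in Appendix B.1 (400 nodes, 10 communities; probabilities 0.8/0.5/0.2 within a community depending on the two endpoints' statuses, 0.005 across communities). The community assignment scheme and all names are hypothetical.

```python
# Illustrative sketch (not from the paper): generating the proof-of-concept
# dataset described above. Probabilities and sizes follow Appendix B.1.
import numpy as np

def generate_synthetic(n_nodes=400, n_comms=10, seed=0):
    rng = np.random.default_rng(seed)
    comm = rng.integers(0, n_comms, size=n_nodes)      # community label
    active = rng.integers(0, 2, size=n_nodes)           # social status (1 = active)

    A = np.zeros((n_nodes, n_nodes), dtype=int)
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if comm[i] != comm[j]:
                p = 0.005                                # different communities
            elif active[i] and active[j]:
                p = 0.8                                  # both active
            elif active[i] or active[j]:
                p = 0.5                                  # one active, one inactive
            else:
                p = 0.2                                  # both inactive
            if rng.random() < p:
                A[i, j] = A[j, i] = 1

    # Three labelings used in Section 5.2: community, status, and their product.
    labels_both = comm * 2 + active
    return A, comm, active, labels_both

A, comm, status, both = generate_synthetic()
print(A.shape, len(np.unique(both)))                     # (400, 400) and up to 20 classes
```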
We conduct experiments on the synthetic dataset for the node classification task, i.e., predicting the node labels. We consider the following three cases. (1) Community: the node label is the community that the node belongs to. (2) Social Status: the node label is the social status of the node. (3) Both: the node label is the Cartesian product of (1) and (2), i.e., every community and social status pair is a distinct label. We use a softmax layer on the learned node representations as the classifier, and use accuracy, i.e., the percentage of nodes correctly classified, as the evaluation metric. We omit the results of SMP-Identity since the node representations in SMP-Identity have a fixed dimensionality that does not match the number of classes.

Table 2: The results of node classification measured in accuracy (%) on the proof-of-concept synthetic dataset. The best result and the second-best result for each task, respectively, are in bold and underlined.

Table 2 shows the results, which are consistent with our analyses. The permutation-equivariant GNNs perform reasonably well on predicting the social status labels but cannot discover communities, since node proximities are not preserved in those methods. P-GNN manages to handle the community labels well, but performs poorly on the social status labels. None of them can handle the most challenging setting where both properties are needed to predict the node labels of community and status. SMP performs consistently well in all three cases. The results clearly show that SMP can simultaneously preserve permutation-equivariance and proximity-awareness when needed and retain highly competitive performance for each property. In fact, for the community labels, SMP significantly outperforms P-GNN, demonstrating that SMP can better preserve proximities between nodes.

Next, we report experimental results on the benchmark datasets. Link prediction aims to predict missing links in a graph. We randomly split the edges into three exclusive parts of relative sizes 80%, 10%, and 10%, and use them for training, validation, and testing, respectively. Besides these positive samples, we obtain negative samples by randomly sampling an equal number of node pairs that do not have edges for training/validation/testing. For all the methods, we use a simple classifier, Sigmoid(H_{i,:} H_{j,:}^T), i.e., the inner product, to predict whether a node pair (v_i, v_j) forms a link, and use AUC (area under the ROC curve) as the evaluation metric. One exception to this setting is the PPA dataset, where we follow the splits and the evaluation metric (i.e., Hits@100) provided with the dataset [7]. Due to space limitations, the results for the three benchmarks (Cora, CiteSeer, and PubMed) are shown in Appendix A.2.

The results except PPA are shown in Table 3. SMP achieves the best results on five out of the six datasets and is highly competitive (the second-best result) on the other one (Physics). The results demonstrate the effectiveness of SMP in link prediction tasks. We attribute the strong performance of SMP to its capability of maintaining both the proximity-awareness and permutation-equivariance properties. On Grid, Communities, Email, and PPI, both SMP and P-GNN outperform the permutation-equivariant GNNs, confirming the importance of preserving node proximities. Although SMP is simpler and more efficient than P-GNN, SMP reports even better results. When node features are available (CS, Physics, and PPI), SGC outperforms GCN and GAT.
The results re-validate the findings in SGC [13] and LightGCN [59] that the nonlinearity in GNNs is not necessarily indispensable. Some plausible reasons include that the additional model complexity brought by non-linear operators makes the models tend to overfit or difficult to be trained. On those datasets, SMP retains comparable performance on two coauthor graphs and shows better performance on PPI, possibly because the node features on the protein graphs are less informative than the node features on coauthor graphs for predicting links. Thus, preserving graph structure is more beneficial on PPI. As we experiment on Email and PPI in the inductive setting, the results show that SMP also can handle inductive tasks well. The results on PPA are shown in Table 4 . SMP outperforms all the baselines, showing that it can handle largescale graphs with millions of nodes and edges. PPA is part of a recently released Open Graph Benchmark (OGB) [7] . The superior performance on PPA further demonstrates the effectiveness of SMP in link prediction. In the task of node classification, we need ground-truths in the evaluation. Thus, we only adopt datasets with node labels. Specifically, for CS and Physics, we adopt 20/30 labeled nodes per class for training/validation and the rest for testing [57] . For Comm, we adjust the number as 5/5/10 labeled nodes per class for training/validation/testing. For Cora, CiteSeer, and PubMed, we use the default splits that come with the datasets. We do not adopt Email because some graphs in the dataset are too small to show stable results. We also exclude PPI since it is a multi-label dataset. Other settings are the same as Section 5.2. The results are shown in Table 5 . SMP reports nearly perfect results on Comm. Since the node labels are generated by graph structures on Comm and there are no node features, a model has to be proximity-aware to handle Comm well. P-GNN, which shows promising results in the link prediction task, fails miserably here. On the other five graphs, SMP reports highly competitive performance. These graphs are commonly-used benchmarks for GNNs. P-GNN, which completely ignores permutation-equivariance, performs poorly as expected. The results of pairwise node classification tasks measured in AUC (%). The best result and the second-best result for each dataset, respectively, are in bold and underlined. In contrast, SMP manages to be competitive with the permutation-equivariant GNNs, as endorsed by Remark 1. In fact, SMP even shows better results than its counterpart, SGC, indicating that preserving proximities is also helpful. We follow P-GNN [8] and experiment on pairwise node classification, i.e., predicting whether two nodes have the same label. Compared with node classification in Section 5.4, pairwise node classification focuses more on the relation between nodes and thus more likely to require a model to be proximity-aware. Similar to link prediction, we split the positive samples (i.e., node pairs with the same label) into an 80%-10%-10% training-validation-testing set with an equal number of randomly sampled negative pairs. For large graphs, since the possible positive samples are intractable (i.e. O(N 2 )), we use a random subset. As we also need node labels as the ground-truth, we only conduct pairwise node classification on the datasets when node labels are available. We also exclude the results on PPI since the dataset is multi-labeled and cannot be used in a pairwise setting [8] . 
Similar to the link prediction task in Section 5.3, we adopt a simple inner product classifier and use AUC as the evaluation metric. The results are shown in Table 6 . We observe consistent results as the link prediction task, i.e., SMP reports the best results on four datasets and the second-best results on the other three datasets. These results again verify that SMP can effectively preserve and utilize node proximities while retaining comparable performance when the tasks are more permutation-equivariant like, e.g., on CS, Physics, and the three benchmarks (Cora, CiteSeer, and PubMed). To examine whether SMP can indeed preserve node proximities, we conduct experiments on graph reconstruction [60] , i.e., using the node representations learned by GNNs to reconstruct the edges of the graph. Graph reconstruction corresponds to the first-order proximity between nodes, i.e., whether two nodes directly have a connection, which is the most straightforward node proximity [61] . Specifically, following link prediction and pairwise node classification, we adopt the inner product classifier Sigmoid(H T i H j ) and use AUC as the evaluation metric. To control the impact of node features (i.e., many graphs exhibit assortative mixing [49] , thus even models only using node features can reconstruct the edges to a certain extent), we do not use node features for all the models. The results are reported in Table 7 . The results show that SMP greatly outperforms permutation-equivariant GNNs such as GCN and GAT for the graph reconstruction task, clearly demonstrating that SMP can better preserve node proximities. P-GNN shows highly competitive results as SMP. However, similar to the other tasks, the intensive memory usage makes P-GNN unable to handle mediumscale graphs such as Physics and PubMed. We conduct ablation studies by comparing different SMP variants, including SMP-Identity, SMP-Linear, and additional three variants. • In SMP-MLP, we set F output (·) to a fully-connected network with one hidden layer. We show the results of link prediction in Table 8 . The results for node classification and pairwise node classification, which show similar conclusions, are provided in Appendix A.3. In general, SMP-Linear shows impressive performance, achieving the best or second-best results on six datasets and highly competitive on the other (Comm). SMP-Identity, which does not have learnable parameters in the output function, performs slightly worse. The results demonstrate the importance of adopting a learnable linear layer in the output function, which is consistent with Remark 1. SMP-MLP does not lift the performance in general, showing that adding extra complexities in F output (·) brings no gain in those datasets. SMP-Linear-GCN feat reports the best results on Communities, PPI, and PPA, indicating that adding extra non-linearities in propagating node features is helpful for some graphs. SMP-Linear-GCN both reports the best results on Gird with a considerable margin. Recall that Grid has no node features. The results indicate that inducing nonlinearities can help the stochastic representations to better capture proximities for some graphs. We also assess the effects of whether the stochastic signals E are fixed or not during different training epochs for our proposed SMP. For brevity, we only report the results of link prediction in Table 9 . 
The results show that fixing E usually leads to better results on transductive datasets (recall that datasets except Email and PPI are transductive) and resampling E leads to better results on inductive datasets. The results are consistent with our analysis in Section 4.1. To compare the efficiency of different methods quantitatively, we report the running time of different methods in Table 10 . The results are averaged over 3,000 epochs on an NVIDIA TESLA M40 GPU with 12 GB of memory. The results show that SMP is computationally efficient, i.e., only marginally slower than SGC and comparable to GCN. P-GNN is at least an order of magnitude slower except for the extremely small graphs such as Grid, Communities, or Email, which have no more than a thousand nodes. In addition, the expensive memory cost makes P-GNN unable to work on large-scale graphs. In this paper, we propose SMP, a general and simple GNN to preserve both proximity-awareness and permutationequivariance. We augment the existing GNNs with stochastic node representations. We prove that SMP can enable GNNs to preserve node proximities and is equivariant to a permutation-equivariant GNN with certain parametrization. Our experimental results demonstrate the effectiveness and efficiency of SMP. Jian Pei is a Professor in the School of Computing Science at Simon Fraser University. He is a renown leading researcher in the general areas of data science, big data, data mining, and database systems. He is recognized as a Fellow of the Royal Society of Canada (Canada's national academy), the Canadian Academy of Engineering, ACM and IEEE. He is one of the most cited authors in data mining, database systems, and information retrieval. Since 2000, he has published one textbook, two monographs and over 300 research papers in refereed journals and conferences, which have been cited extensively by others. His research has generated remarkable impact substantially beyond academia. For example, his algorithms have been adopted by industry in production and popular open source software suites. Jian Pei also demonstrated outstanding professional leadership in many academic organizations and activities. We compare SMP with augmenting GNNs using a one-hot encoding of node IDs, i.e., the identity matrix. Intuitively, since the IDs of nodes are unique, such a method does not suffer from the automorphism problem and should also enable GNNs to preserve node proximities. However, using such a one-hot encoding has two major problems. First, the dimensionality of the identity matrix is N × N . Thus, the number of parameters in the first message-passing layer is also O (N ) . Therefore, the method is inevitably computationally expensive and may not be scalable to largescale graphs. Having a large number of parameters may also cause overfitting. Second, the node IDs are not transferable across different graphs, i.e., the node with ID v 1 in one graph and the node with ID v 1 in another graph do not necessarily share any similarity. Since the parameters in the messagepassing depend on the node IDs as input features, using one-hot encoding cannot handle inductive tasks well. We empirically compare such a method with SMP and report the results in Table 11 . The results show that SMP-Linear outperforms GCN onehot in most cases. Besides, GCN onehot fails to handle Physics, which is only a mediumscale graph, due to the heavy memory usage. 
One surprising result is that GCN onehot outperforms SMP-Linear on Grid, the simulated graph where nodes are placed on a 20 × 20 grid. A plausible reason is that since the edges in Grid follow a specific rule, using a one-hot encoding gives GCN onehot enough flexibility to learn the rules, and the model does not overfit because the graph has a rather small scale. One may wonder whether SMP is transferable across different graphs, since the stochastic features are independently drawn. Empirically, we find that SMP reports reasonably good performance on inductive datasets, such as Email and PPI. One plausible reason is that, since the proximities of nodes are preserved even the random features per se are different (see Theorem 2 and Theorem 3) , all subsequent parameters based on proximities can be transferred. Besides, only using the stochastic representation of SMP in Eq. (10) can be regarded as combining node IDs while fixing the first-layer parameters during training (assuming the first message-passing layer takes the form of matrix multiplication between node features and the weight matrix, as in SGC, GCN, and GAT). By fixing the parameters, both the computational bottlenecks and transferability problems are resolved. We further report the link prediction results on three GNN benchmarks: Cora, CiteSeer, and PubMed. The results in Table 12 show similar trends as other datasets presented in Section 5.3, i.e., SMP reports comparable results to the other permutation-equivariant GNNs while P-GNN cannot handle the task well. We report ablation study results for the node classification task and pairwise node classification task in Table 13 and Table 14, respectively. The results again show that SMP-Linear generally achieves good-enough results on the majority of the datasets, and adding non-linearity does not consistently lift the performance. To compare the effectiveness of our proposed method with position encodings of GNNs, we adopt one recent method EigenGNN [33] , which uses the eigenvectors of a graph structure matrix to enhance the existing GNNs in preserving graph structures. Specifically, we use GCN as the base architecture for EigenGNN. All hyper-parameters are kept consistently to ensure a fair comparison. The results for the link prediction task are shown in Table 15 . Notice that we omit the results on PPA since calculating eigenvectors for extremely large-scale graphs is infeasible. From the table, we can observe that SMP-Linear outperforms or is comparable to EigenGNN on eight of nine datasets, demonstrating the effectiveness of SMP. One exception is on Grid, where EigenGNN shows remarkably better results. We attribute this to the fact that Grid is an especially regular simulated graph (recall that nodes are placed on a 20 × 20 grid) and therefore EigenGNN can better capture this pattern. Besides, both EigenGNN and SMP greatly outperform GCN in most cases, demonstrating the importance of preserving proximities for GNNs. • Proof-of-concept synthetic dataset: the dataset is generated using an add-on version of the stochastic block model [49] . The graph has 400 nodes in 10 communities. Each node has an independent equal chance to be active or inactive. The probability of forming edges is (1) 0.8; (2) 0.5; (3) 0.2; and (4) 0.005, respectively, between (1) two active nodes in the same community; (2) an active node and an inactive one in the same community; (3) two inactive nodes in the same community; and (4) two nodes in different communities. There is no node feature. 
• Grid [8] : A simulated 2D grid graph with size 20 × 20. • Communities [8] : A simulated caveman graph [62] composed of 20 communities, each community containing 20 nodes. The graph is perturbed by rewiring 1% edges randomly. It has no node feature. The label of each node indicates the community that the node belongs to. • Email 5 [8] : Seven real-world email communication graphs. Each graph has six communities and each node has an integer label indicating the community that the node belongs to. • Coauthor Networks 6 [57] : Two networks from Microsoft academic graph in CS and Physics with nodes representing authors and edges representing co-authorships. The node features are embeddings of the paper keywords of the authors. • PPI 5 [3] : 24 protein-protein interaction networks. Each node has a 50-dimensional feature vector. 5 . https://github.com/JiaxuanYou/P-GNN/tree/master/data 6. https://github.com/shchur/gnn-benchmark/tree/master/data/ npz/ The results of comparing SMP with using one-hot node IDs. OOM indicates out of memory. "-" indicates that we do not adopt the dataset for the task because the dataset does not have ground-truth labels. where the nodes correspond to papers and the edges correspond to citations between papers. The node features are bag-of-words and the node labels are the ground truth topics of the papers. • All datasets except PPA: we uniformly set the number of hidden layers to 2 for all methods, i.e., two messagepassing steps, and set the dimensionality of hidden layers to 32, i.e., H (l) ∈ R N ×32 , for 1 ≤ l ≤ L (for GAT, we use 4 heads with each head containing 8 units). We use the Adam optimizer with an initial learning rate of 0.01 and decay the learning rate by 0. All experiments are conducted on a server with the following configurations. . Our proof strategy is to show that the stochastic node representations can remember all the information about the walks, which can be decoded by the decoder F de (·). First, as the message-passing and the updating function are bijective by assumption, we can recover from the node representations in each layer all their neighborhood representations in the previous layer. Specifically, there exists F (l) (·), 1 ≤ l ≤ L, such that To keep our notations clear, we split the function into two parts, one for the node itself and the other one for its neighbors 9. To let F (l) (·) output a set with arbitrary lengths, we can adopt sequence-based models such an LSTM. The ablation study of different SMP variants for the node classification task. The best results and the second-best results are in bold and underlined, respectively. For the first function, if we successively apply such functions from the l th layer to the input layer, we can recover the input features of the GNN, i.e., E. Since the stochastic matrix E contains an identifiable unique signal for each node, we can decode the node ID from e i ; E = i. We denote the process of applying such l + 1 functions to get the node ID as neighbor (·) to the decoded vector set so that we can recover their neighborhood representations in the (l − 2) th layer. Similar procedures can be performed successively until we reach the first layer. Next, we show that for e This result is easy to obtain as follows. (va 1 , va 2 , ..., va l+1 ) is a walk ⇔Ea i ,a i+1 = Ea i+1 ,a i = 1, ∀1 ≤ i ≤ l ⇔ ai ∈ Na i+1 , ∀1 ≤ i ≤ l ⇔∃e (i−1) ∈ F where F(·) is composed of F (l) self (·), 0 ≤ l ≤ L, and F (l) neighbor (·), 1 ≤ l ≤ L. 
Applying the proximity function S(·), we have: We finish the proof by setting the real decoder function F de (·) to arbitrarily approximate this desired function S (F (·, ·)) under the universal approximation assumption. Some may find that our proof of Theorem 1 in the main paper leads to multiple connected components in the constructed graph G . Next, we give an alternative proof maintaining one connected component in G assuming the original graph G is connected) under an additional assumption that the walk-based proximity is of finite length. Proof. Similar to the previous proof, we will prove by contraction. We assume there exists a non-trivial S(·) that a permutation-equivariant GNN can preserve. Besides, we assume the length of S(·) is upper bounded by l max , where l max is any finite number, i.e., ∀i, j, where len(v i v j ) is the length, i.e., number of nodes minus 1, of the walk v i v j . For a connected graph G = (V, E, F), we create G = (V , E , F ) that has automorphism. Then we will show that any GNN cannot preserve S(·) on G . Specifically, denoting N = N + l max , we let G have 3Ñ nodes so that: Intuitively, we create three "copies" of G and three "bridges" to connect the copies and thus make G also connected. It is also easy to see that nodes v i , v i+Ñ , and v i+2Ñ all form automorphic node pairs. Therefore, we have: Next, we see that the nodes in G are divided into six parts (three copies and three bridges), which we denote as V 1 = {v 1 ...v N }, V 2 = {v N +1 ...vÑ }, V 3 = vÑ +1 ...vÑ +N , V 4 = vÑ +N +1 , ..., v 2Ñ , V 5 = v 2Ñ +1 , ..., v 2Ñ +N , and V 6 = v 2Ñ +N +1 , ..., v 3Ñ . Since V 2 , V 4 , V 6 are bridges with a length l max , any walk crosses these bridges will have a length large than l max . For example, let us focus on v i ∈ V 1 , i.e., 1 ≤ i ≤ N . If v j is in V 3 , V 4 , or V 5 (i.e.,Ñ < j ≤ 2Ñ + N ), any walk v i v j will either cross the bridge V 2 or V 6 and thus has a length larger than l max . As a result, we have: If v j ∈ V 1 or v j ∈ V 2 , i.e., j ≤Ñ , we can use the fact that v j and v j+Ñ forms an automorphic node pair, i.e., ∀ > 0, we have Similarly, if v j ∈ V 6 , i.e., 2Ñ + N < j, we can use the fact that v j and v j−Ñ forms an automorphic node pair to prove the same inequality. Thus, we prove that if i ≤ N, ∀ > 0, ∀j, |S i,j − S(∅)| < 2 . The same proof strategy can be applied to i > N . Since can be arbitrarily small, the results show that all node pairs have the same proximity S(∅), which leads to a contraction and finishes our proof. 
REFERENCES
[1] Learning disentangled representations for recommendation
[2] Neural relational inference for interacting systems
[3] Inductive representation learning on large graphs
[4] Neural message passing for quantum chemistry
[5] Universal invariant and equivariant graph neural networks
[6] Invariant and equivariant graph networks
[7] Open graph benchmark: Datasets for machine learning on graphs
[8] Position-aware graph neural networks
[9] Centrality and network flow
[10] Semi-supervised classification with graph convolutional networks
[11] Graph attention networks
[12] Network medicine framework for identifying drug-repurposing opportunities for COVID-19
[13] Simplifying graph convolutional networks
[14] Deep learning on graphs: A survey
[15] The graph neural network model
[16] A new model for learning in graph domains
[17] Neural network for graphs: A contextual constructive approach
[18] Gated graph sequence neural networks
[19] Spectral networks and locally connected networks on graphs
[20] The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains
[21] Convolutional neural networks on graphs with fast localized spectral filtering
[22] Geometric deep learning on graphs and manifolds using mixture model CNNs
[23] Representation learning on graphs with jumping knowledge networks
[24] How powerful are graph neural networks?
[25] Relational inductive biases, deep learning, and graph networks
[26] Deeper insights into graph convolutional networks for semi-supervised learning
[27] Weisfeiler and Leman go neural: Higher-order graph neural networks
[28] Provably powerful graph networks
[29] Weisfeiler-Lehman graph kernels
[30] On positional and structural node features for graph neural networks on non-attributed graphs
[31] Benchmarking graph neural networks
[32] Directional graph networks
[33] Eigen-GNN: A graph structure preserving plug-in for GNNs
[34] Local and global properties in networks of processors
[35] What graph neural networks cannot learn: depth vs width
[36] Relational pooling for graph representations
[37] Coloring graph neural networks for node disambiguation
[38] Random features strengthen graph neural networks
[39] Principal neighbourhood aggregation for graph nets
[40] Deep graph matching consensus
[41] The surprising power of graph neural networks with random node initialization
[42] Graph neural networks exponentially lose expressive power for node classification
[43] DeepGCNs: Can GCNs go as deep as CNNs
[44] The random projection method
[45] On graph kernels: Hardness results and efficient alternatives
[46] Shortest-path kernels on graphs
[47] DeepWalk: Online learning of social representations
[48] node2vec: Scalable feature learning for networks
[50] Arbitrary-order proximity preserved network embedding
[51] Predict then propagate: Graph neural networks meet personalized PageRank
[52] Stochastic training of graph convolutional networks with variance reduction
[53] FastGCN: Fast learning with graph convolutional networks via importance sampling
[54] Adaptive sampling towards fast graph representation learning
[55] Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks
[56] Random projections
[57] Pitfalls of graph neural network evaluation
[58] Collective classification in network data
[59] LightGCN: Simplifying and powering graph convolution network for recommendation
[60] Structural deep network embedding
[61] LINE: Large-scale information network embedding
[62] Networks, dynamics, and the small-world phenomenon