title: Bottom-Up and Top-Down Graph Pooling
authors: Yang, Jia-Qi; Zhan, De-Chuan; Li, Xin-Chun
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47436-2_43

Pooling layers are crucial components for efficient deep representation learning. For graph data, however, it is not trivial to decide which nodes to retain in order to represent the high-level structure of a graph. Recently many different graph pooling methods have been proposed. However, they all rely on local features to conduct global pooling over all nodes, which contradicts pooling in CNNs, where local features are only used to conduct local pooling. We analyze why this may hinder the performance of graph pooling, and then propose a novel graph pooling method called Bottom-Up and Top-Down graph POOLing (BUTDPool). BUTDPool aims to learn a more fine-grained pooling criterion based on coarse global structure information produced by a bottom-up pooling layer, and can enhance local features with global features. Specifically, we propose to use one or multiple pooling layers with a relatively high retain ratio to produce a coarse high-level graph. By injecting the high-level information back into the low-level representation, BUTDPool learns a better pooling criterion. Experiments demonstrate the superior performance of the proposed method over the compared methods.

The revolution of deep learning [8] has profoundly affected the development of many application fields, such as computer vision [10], natural language processing [12], and audio signal processing [11]. The architecture of neural networks has evolved a lot in recent years, yet the convolutional neural network (CNN) [10] remains the most successful model in applications that can exploit a grid-like data structure. Some structured data, e.g., images, can be represented as a grid-like graph structure, but CNNs are not directly applicable to general graph data.

The traditional approach for dealing with graph data is to use graph kernels [5]. These methods suffer from the drawbacks of kernel methods, such as high computational complexity and very shallow models, and thus do not perform well on relatively large datasets. Graph neural networks (GNNs) [21] aim at building deep learning methods for graph data such as social networks, citation networks, and the world wide web, and their effectiveness has been shown in many real-world applications. The basic idea of most GNN models is to migrate successful deep learning building blocks to graph models. The graph convolutional network (GCN) [13] is a prominent GNN variant, which is an analogue of the CNN.

The pooling layer is a critical component of most deep neural networks [8]. State-of-the-art CNNs usually use successive convolutional layers with small kernel sizes followed by one or more pooling layers, which increases the effective receptive field while keeping the efficiency and representation learning power of convolutional layers. Graph pooling plays a similar role in GCNs, and pooling layers are also necessary for tasks such as graph classification, graph encoding, or sub-graph sampling. However, the most popular pooling methods, such as spatial max pooling and average pooling, cannot be incorporated into GCNs easily.
Some recent works focus on providing pooling methods applicable to GCNs, such as DiffPool [23] and SAGPool [14], and they have shown significant improvements on many graph classification tasks. Existing graph pooling methods [7, 24] rely on features extracted by graph convolution layers to calculate a score for each node. Graph convolution works by aggregating information from the neighbor nodes defined by the adjacency matrix, so the feature produced by graph convolution is a local feature, much like in common CNNs, where a convolution kernel considers only a small area. A local feature can only reflect the local structure around a node itself; it can hardly reflect the importance of a node within a larger area, since macroscopic information is never available in local features.

In CNNs the local features are exploited in an intuitive way: (1) only spatially nearby areas are compared during pooling, which is well-defined when considering local features; (2) pooling layers guarantee that a fixed fraction of local features is retained (e.g., 1/4). So even if the most important part is dropped, any one of its neighboring features will hopefully be a good enough substitute because of local similarity. On the other hand, existing graph pooling methods work in a very different way: (1) all nodes are compared based on their local features, including nodes that are far apart from each other; (2) a fixed fraction of all nodes is retained, so the global structure of the graph may change a lot if some critical parts are dropped completely.

To solve this problem, we propose a new graph pooling method called Bottom-Up and Top-Down Graph Pooling (BUTDPool). The core idea is to gather high-level information with one or more coarse pooling layers, then feed this information back to the low-level representation to help learn a more fine-grained pooling function. Specifically, we propose to apply a bottom-up pooling layer to enlarge the receptive field of each node, then use a top-down unpooling layer to map the pooled graph back and add it to the original graph. Finally, a fine-grained pooling layer is applied to the graph enriched with global information. Experiments show the advantage of our method compared to state-of-the-art graph pooling methods, especially on large graphs.

Generally, we make several noteworthy contributions as follows:
- We analyze a drawback of existing graph pooling methods that has not been noticed before: they all pool over the graph globally but only based on local features.
- We propose a new graph pooling structure called Bottom-Up and Top-Down Graph Pooling to tackle this drawback, which is generally applicable and complementary to existing methods.
- We conduct experiments on real-world network datasets. The experimental results demonstrate that Bottom-Up and Top-Down Graph Pooling achieves better results than many existing methods.

We give a brief overview of some related works in this section.

Graph Convolution (GC) is a graph representation learning method with a sound theoretical foundation; it has been the backbone of many successful graph learning methods and is also the workhorse of most graph pooling methods. The most widely used graph convolution [13] is an approximation of a localized spectral convolution. There are variants of graph convolution, e.g., GraphSAGE [9] can use an LSTM instead of the mean as the aggregation function; however, these variants preserve the localized property, so the drawback discussed in Sect. 3.2 remains.
Hierarchical and Global Graph Pooling. Global pooling aims at obtaining a global summary of a whole graph, which gives GCNs the ability to process graphs with different numbers of nodes and different structures. Typical global pooling methods include SortPool [24] and Set2Set [18]. However, global pooling methods lack the ability to learn hierarchical or high-level representations. Hierarchical pooling methods aim to provide texture-level to object-level representations at different layers and to wipe off useless information progressively through multiple pooling layers, which is very important in deep models. SAGPool [14] is a recently proposed hierarchical pooling method with state-of-the-art performance on many benchmark datasets. Since a single global pooling layer is unlikely to capture the global structure of a large graph, hierarchical pooling is more important in deep learning, so we only consider hierarchical pooling methods in the rest of this paper.

A natural idea for graph pooling is to learn the importance of each node, keep the most important nodes, and simply drop the other, unimportant nodes. This is the basic idea behind a family of graph pooling methods we call score-based methods, since they calculate a score as a measure of the relative importance of each node. SAGPool [14] and gPool [7] are typical score-based pooling methods.

Differentiable Graph Pooling. Score-based methods have the drawback that the top-k operation they use is not differentiable. Fully differentiable models are usually easier to optimize than models with non-differentiable components, so differentiable graph pooling methods have been proposed to tackle this issue. DiffPool [23] is a representative differentiable pooling method in which a large graph is downsampled to a small graph of pre-fixed size in a fully differentiable way. Since the time and space complexity of DiffPool is O(|V|^2) rather than the O(|E|) of sparse methods such as SAGPool, DiffPool can be very slow or even inapplicable on large graphs.

Visual attention mechanisms have been widely used in image captioning and visual question answering (VQA) [1, 16], and a similar attention mechanism has been shown to exist in the human visual system [3]. The combination of bottom-up and top-down attention was suggested by [1] for the VQA task, where the bottom-up attention uses an object detection model to focus on the regions of interest, and the top-down attention uses language features to attend to the image regions most related to the question. Although our method shares a similar name with the bottom-up and top-down attention in VQA, they are completely different in motivation, application, and implementation.

In order to learn a more fine-grained graph pooling criterion with a larger effective receptive field, we adopt a bottom-up and top-down architecture, which will be defined in detail in the following sections (Fig. 1).

We define a graph as G = (V, A), where V is the vertex set consisting of nodes {v_1, ..., v_n}, and A ∈ R^{n×n} is the adjacency matrix (typically symmetric and sparse), in which a_ij denotes the edge weight between nodes v_i and v_j, and a_ij = 0 denotes that the edge does not exist. The degree matrix D ∈ R^{n×n} is the diagonal matrix with d_ii = Σ_j a_ij. In graph convolution, the adjacency matrix with self-connections Ã ∈ R^{n×n} is usually used instead of A, where Ã = A + I_n, and D̃ ∈ R^{n×n} is the degree matrix of Ã. In a deep neural network, X^ℓ denotes the input of the ℓ-th layer, and X^0 denotes the original feature matrix of the input graph. Each node v_i in a graph has a feature vector denoted by x_i ∈ R^d. For a graph with n nodes, the feature matrix is denoted by X ∈ R^{n×d}, where row i corresponds to x_i. In the graph classification task, each graph belongs to one out of C classes; given a graph, we want to predict its class.
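As a concrete reference for this notation, the sketch below applies one graph convolution of the form popularized by [13], building Ã = A + I_n and using its degree matrix D̃ for symmetric normalization. It is a minimal dense PyTorch sketch for illustration only; the class name is ours, and the actual experiments rely on the sparse implementations in pytorch-geometric [6].

```python
import torch
import torch.nn as nn

class DenseGCNLayer(nn.Module):
    """One graph convolution X' = D~^{-1/2} A~ D~^{-1/2} X W with A~ = A + I (dense, illustrative)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):
        # x: node features of shape (n, d); adj: adjacency matrix A of shape (n, n)
        n = adj.size(0)
        adj_tilde = adj + torch.eye(n, device=adj.device)      # A~ = A + I_n (self-connections)
        deg = adj_tilde.sum(dim=1)                             # diagonal entries of D~
        d_inv_sqrt = deg.clamp(min=1e-12).pow(-0.5)
        # symmetric normalization D~^{-1/2} A~ D~^{-1/2}
        norm_adj = d_inv_sqrt.unsqueeze(1) * adj_tilde * d_inv_sqrt.unsqueeze(0)
        return norm_adj @ self.linear(x)                       # aggregate one-hop neighbours
```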
We analyze a drawback of existing pooling methods in this section, which we attribute to the contradiction between local features and a global pooling criterion. We define local features as features that only contain information gathered from a very nearby area. For example, in CNNs the features are typically calculated by a small kernel (e.g., 3×3). The kernel size determines the receptive field of a convolutional layer, which is the largest area a feature can see; the feature is blind to anything outside this receptive field. The features produced by graph convolutions are also local: typical graph convolutions only consider neighbor nodes, i.e., one-hop connections [13]. This is the expected behavior of GCNs by design, inheriting a merit of CNNs: parameter sharing across different local areas, which can be regarded as a strong prior that local patterns are applicable everywhere [8].

Typical pooling layers in CNNs, say max pooling without loss of generality, select one feature out of a small feature set (e.g., 1 out of 4 in a max pooling of size 2×2), and all selected features then form a smaller feature map. These local features are comparable since they carry information about each other, and even if the most important feature is dropped, the selected feature may still be representative of this small area because of the local similarity present in most natural data such as images. The overall structure is not disturbed after pooling even if the pooling is totally random, which makes pooling in CNNs robust and relatively easy to train. However, existing graph pooling methods utilize these local features in a very different and counterintuitive way compared to common pooling in CNNs. They use local features to calculate a score [7, 14] that defines the importance of each node, and then simply select the nodes with the largest scores. The scores are also local features as defined before, since they rely only on local features produced by GCNs. The scores of two far-apart nodes cannot reflect any information about their relationship or their relative importance in the whole graph, which harms the ability of pooling to retain the global structure and thus harms performance on downstream tasks.

The analysis of this drawback also sheds light on the idea to resolve it: we need to make local features non-local, i.e., give them a larger receptive field, in order to apply score-based pooling in a more fine-grained way. Stacking GC layers can increase the receptive field, but not efficiently, since a GC layer can only increase the receptive field by 2 hops, and it is indeed hard to train a deep GCN [15, 20].

To gather a macroscopic view of the whole graph, we use a stack of base pooling layers to perform a coarse pooling. A base pooling layer can be any pooling layer that reduces the number of nodes by a fixed ratio r, such as most score-based pooling methods. We define a base pooling layer (BPL) as

X̃, Ã = BPL_Θ(X, A, r),

where X̃ and Ã are the feature matrix and adjacency matrix of the graph after pooling, and Θ is the parameter of this base pooling layer. Without loss of generality, we use SAGPool [14] as our base pooling layer. Given input X, A, and r, SAGPool calculates a score vector y = GC(X). Based on this score, the k nodes with the largest scores are selected, and their indexes are denoted as idx. The output features are calculated as X̃ = tanh(y[idx]) ⊙ X[idx], and the output adjacency matrix is Ã = A[idx, idx], where the [] operator selects elements based on row and column indexes. GC is a graph convolution layer with output feature size 1, so that its output can be used as the score. When a retain ratio r is given instead of k, we define k = |V| · r (rounded to an integer).
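The following is a minimal dense sketch of such a base pooling layer, following the score / top-k / gather recipe just described and reusing the DenseGCNLayer sketch above to produce the scores. The class name, the dense adjacency indexing, and the ceiling used to turn r · |V| into an integer k are our own illustrative choices, not the reference SAGPool implementation.

```python
import math
import torch
import torch.nn as nn

class BasePoolingLayer(nn.Module):
    """Score-based pooling: keep the k highest-scoring nodes (SAGPool-style sketch)."""

    def __init__(self, in_dim, retain_ratio):
        super().__init__()
        self.retain_ratio = retain_ratio
        self.score_gc = DenseGCNLayer(in_dim, 1)   # graph convolution with output feature size 1

    def forward(self, x, adj):
        y = self.score_gc(x, adj).squeeze(-1)                  # score vector y = GC(X), shape (n,)
        k = max(1, math.ceil(self.retain_ratio * x.size(0)))   # k = r * |V|; rounding is an implementation choice
        _, idx = torch.topk(y, k)                              # indexes of the retained nodes
        x_pooled = torch.tanh(y[idx]).unsqueeze(-1) * x[idx]   # X~ = tanh(y[idx]) ⊙ X[idx]
        adj_pooled = adj[idx][:, idx]                          # A~ = A[idx, idx]
        return x_pooled, adj_pooled, idx
```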
A bottom-up pooling layer (BUPL) can be defined as a stack of base pooling layers. For example, a bottom-up pooling layer with two base pooling layers can be defined by

X_1, A_1 = BPL_Θ1(X, A, r_bu),
X̃, Ã = BPL_Θ2(X_1, A_1, r_bu).

The corresponding bottom-up pooling layer is denoted by

X̃, Ã, idx_bu = BUPL_Θbu(X, A, r_bu),

where X̃, Ã are the output of the bottom-up pooling layer. Different from a SAGPool layer, the BUPL also returns an index vector idx_bu that memorizes the mapping between input nodes and output nodes. The bottom-up pooling layer can produce a much smaller graph, since the retain ratio r_bu is relatively small. This graph can be viewed as a rough summary of the original graph, and the receptive field of each node is larger. Roughly, the receptive fields of the remaining nodes can still cover most nodes, so this is a high-level graph. Notice that DiffPool is not applicable as a bottom-up pooling layer, since we prefer a sparse pooling method.

Now we have a high-level pooling result produced by the bottom-up pooling layer, denoted by X̃, Ã. In order to feed the high-level information back to the low-level graph, we have to define a mapping from the downsampled small graph to the original large graph with more nodes, i.e., an unpooling. A natural idea is to apply attention between the small graph and the large graph. However, this attention operation takes O(|V_small| × |V_large|) time, which loses the merit of the sparse property of GCNs. To keep efficiency and simplicity, we save the index of the selected nodes at every bottom-up pooling step so that we can recover the mapping between nodes easily. This is similar to the gUnpool layer proposed by [7]. However, gUnpool fills the dropped nodes with zero values, which feeds no information back to those nodes. Instead, we give every dropped node the mean of all retained nodes as its high-level feature. The top-down unpooling can be denoted as

X̂ = TDUPL(X̃, idx_bu),

where TDUPL is the top-down unpooling layer and X̂ has one row per node of the original graph, with X̂[idx_bu] = X̃ and every other row set to the mean of the rows of X̃. The retained nodes in the unpooling result carry information about their own receptive fields, and the other, averaged nodes carry information about the whole graph. When this graph is injected into the low-level graph, each node has both local and global information (an averaged node will have a retained neighbour with large probability, and vice versa, so the one-hop relation is covered by the following graph convolution).

The high-level features are then summed with the low-level features as the input of the fine-grained pooling layer, which can be defined as

Z = X + X̂,
X^{ℓ+1}, A^{ℓ+1} = FGPL_Θfg(Z, A, r),

where FGPL_Θfg is the fine-grained pooling layer, which can be any score-based pooling layer, and r is the retain ratio, which is larger than the retain ratio r_bu used in the bottom-up pooling layers. Z combines the local information X with the higher-level information X̂, which gives the FGPL the power to learn a more fine-grained pooling.

The proposed bottom-up and top-down pooling layer combines the bottom-up pooling layer and the top-down unpooling layer defined above, and can be denoted by

X^{ℓ+1}, A^{ℓ+1} = BUTDPool_{Θbu, Θfg}(X^ℓ, A^ℓ, r_bu, r).
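Putting the pieces together, the sketch below stacks two base pooling layers as the bottom-up pooling layer (composing their retained-node indexes into idx_bu), performs top-down unpooling by giving every dropped node the mean of the retained high-level features, and runs a fine-grained pooling layer on Z = X + X̂. It builds on the BasePoolingLayer sketch above; the class names, the dense representation, and the default ratios are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class BUTDPool(nn.Module):
    """Bottom-up and top-down pooling sketch: coarse bottom-up pooling, mean-fill unpooling, fine-grained pooling."""

    def __init__(self, in_dim, r_bu=0.25, r=0.8):
        super().__init__()
        # bottom-up pooling layer: a stack of two base pooling layers with a small retain ratio r_bu
        self.bu1 = BasePoolingLayer(in_dim, r_bu)
        self.bu2 = BasePoolingLayer(in_dim, r_bu)
        # fine-grained pooling layer with a larger retain ratio r
        self.fg = BasePoolingLayer(in_dim, r)

    def forward(self, x, adj):
        # --- bottom-up pooling (BUPL): coarse high-level graph plus index map idx_bu ---
        x1, adj1, idx1 = self.bu1(x, adj)
        x2, adj2, idx2 = self.bu2(x1, adj1)
        idx_bu = idx1[idx2]                        # compose index maps back to the original node ids

        # --- top-down unpooling (TDUPL): dropped nodes receive the mean of the retained features ---
        x_up = x2.mean(dim=0, keepdim=True).expand(x.size(0), -1).clone()
        x_up[idx_bu] = x2                          # retained nodes keep their high-level features

        # --- fine-grained pooling (FGPL) on combined local + global features ---
        z = x + x_up                               # Z = X + X^
        x_out, adj_out, _ = self.fg(z, adj)
        return x_out, adj_out
```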
The procedure of a bottom-up and top-down pooling layer is summarized in Algorithm 1.

Algorithm 1. Bottom-up and top-down pooling layer.
Input: adjacency matrix A^ℓ; input feature matrix X^ℓ; bottom-up pooling layer parameters Θ_bu and fine-grained pooling layer parameters Θ_fg; bottom-up pooling ratio r_bu and pooling ratio r.
Output: a smaller graph with adjacency matrix A^{ℓ+1} and output feature matrix X^{ℓ+1}.

We give a brief introduction of the compared methods and the experiment protocol in the following sections.

gPool is a score-based method used in Graph U-Nets [7]. gPool supposes that there is a direction, defined by a vector p^ℓ at the ℓ-th layer, such that the nodes v_i whose features x_i align best with p^ℓ are the most relevant nodes. The score of node v_i is therefore defined as x_i^T p^ℓ / ||p^ℓ||. The score (after a sigmoid function) is multiplied into the input of the next layer so that p^ℓ remains optimizable.

SAGPool is a score-based method proposed by [14]. The authors of SAGPool argue that gPool does not consider the topology of the graph, since all nodes are projected onto the same p^ℓ. SAGPool uses graph convolution blocks to calculate the score instead.

DiffPool is a fully differentiable graph pooling method introduced in [23]. DiffPool uses GNN layers to learn a soft assignment of each node to a cluster, then pools each cluster into a single node.

As mentioned in [17], the data split is a crucial factor that strongly affects evaluation. We therefore first generate 20 different random splits (80% train, 10% validation, 10% test) of every dataset, then evaluate each model on these 20 splits and take the average accuracy as our measurement (a minimal sketch of this splitting procedure is given at the end of this section). All methods are implemented using pytorch-geometric [6].

Model Architecture. We follow the model architecture proposed in [14] to make a fair comparison, with the graph pooling layers replaced by the compared methods; Figure 2 depicts the model architecture.

Datasets. We select several graph classification benchmark datasets from real biological and chemical applications. The D&D dataset [4] and the PROTEINS dataset [2, 4] are protein classification datasets. The NCI1 and NCI109 datasets [19] are two balanced subsets of datasets for chemical compound classification. REDDIT-MULTI-5K [22] is a social network dataset with large graphs. The datasets are summarized in Table 1.

BUTDPool is a stack of existing graph pooling methods, so the time-complexity overhead is determined by the number of base pooling layers. For a BUTDPool layer with 2 bottom-up layers, the running time is roughly 3 times that of a single base pooling layer (2 bottom-up layers plus 1 fine-grained layer); the space complexity is similar. In a typical deep neural network, the number of pooling layers is small compared to the number of convolutional layers, so this overhead is affordable. On the other hand, the overhead of DiffPool is a factor of |V| in both time and space compared to sparse methods, which limits its usability even on medium-size graphs.
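As a small illustration of the evaluation protocol above, the snippet below draws one random 80%/10%/10% split of graph indices; running it with 20 different seeds and averaging the resulting test accuracies corresponds to the measurement described earlier. The function name and the use of numpy are our own choices for this sketch.

```python
import numpy as np

def random_split(num_graphs, seed, train_frac=0.8, val_frac=0.1):
    """Return train/validation/test index arrays for one random split."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(num_graphs)
    n_train = int(train_frac * num_graphs)
    n_val = int(val_frac * num_graphs)
    return (perm[:n_train],                      # 80% train
            perm[n_train:n_train + n_val],       # 10% validation
            perm[n_train + n_val:])              # 10% test

# e.g. accuracies = [evaluate_on_split(random_split(num_graphs, seed)) for seed in range(20)]
# where evaluate_on_split is a hypothetical training/evaluation routine.
```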
We analyzed the contradiction between local features and global pooling in existing graph pooling methods, then introduced BUTDPool, a novel and easy-to-implement improvement over existing graph pooling methods that mitigates this problem. The large graph is pooled by a bottom-up pooling layer to produce a high-level overview, the high-level information is then fed back to the low-level graph by a top-down unpooling layer, and finally a fine-grained pooling criterion is learned. The proposed bottom-up and top-down architecture is generally applicable whenever we need to select a sub-graph from a large graph and the quality of the sub-graph matters. Experiments demonstrated the effectiveness of our approach on several graph classification tasks. The proposed BUTDPool can be an alternative building block in GNNs, with the potential to improve many existing models.

References
[1] Bottom-up and top-down attention for image captioning and visual question answering
[2] Protein function prediction via graph kernels
[3] Control of goal-directed and stimulus-driven attention in the brain
[4] Distinguishing enzyme structures from non-enzymes without alignments
[5] Scalable kernels for graphs with continuous attributes
[6] Fast graph representation learning with PyTorch Geometric
[7] Graph U-Nets
[8] Deep Learning (Adaptive Computation and Machine Learning series)
[9] Inductive representation learning on large graphs
[10] Deep residual learning for image recognition
[11] CNN architectures for large-scale audio classification
[12] Convolutional neural networks for sentence classification
[13] Semi-supervised classification with graph convolutional networks
[14] Self-attention graph pooling
[15] Deeper insights into graph convolutional networks for semi-supervised learning
[16] Knowing when to look: adaptive attention via a visual sentinel for image captioning
[17] Pitfalls of graph neural network evaluation
[18] Order matters: sequence to sequence for sets
[19] Comparison of descriptor spaces for chemical compound retrieval and classification
[20] Simplifying graph convolutional networks
[21] A comprehensive survey on graph neural networks
[22] Deep graph kernels
[23] Hierarchical graph representation learning with differentiable pooling
[24] An end-to-end deep learning architecture for graph classification