title: Neural Architecture Performance Prediction Using Graph Neural Networks
authors: Lukasik, Jovita; Friede, David; Stuckenschmidt, Heiner; Keuper, Margret
date: 2021-03-17
journal: Pattern Recognition
DOI: 10.1007/978-3-030-71278-5_14

In computer vision research, the process of automating architecture engineering, Neural Architecture Search (NAS), has gained substantial interest. Due to the high computational costs, most recent approaches to NAS as well as the few available benchmarks only provide limited search spaces. In this paper we propose a surrogate model for neural architecture performance prediction built upon Graph Neural Networks (GNNs). We demonstrate the effectiveness of this surrogate model on neural architecture performance prediction for structurally unknown architectures (i.e. zero shot prediction) in several experiments on the NAS-Bench-101 dataset.

Deep learning using convolutional neural architectures has been the driving force of recent progress in computer vision and related domains. Multiple interdependent aspects, such as the increasing availability of training data and compute resources, are responsible for this success. Arguably, none has had as much impact as the advancement of novel neural architectures [11, 19]. Thus, the focus of computer vision research has shifted from a feature engineering process to an architecture engineering process. The direct consequence is the need to automate this process using machine learning techniques.

Neural Architecture Search (NAS) [7] attends to techniques automating architecture engineering. Due to very long compute times for the repeated search and evaluation of new candidate architectures [41], NAS research has hardly been accessible for researchers without access to large-scale compute systems. Yet, the publication of NAS-Bench-101 [38], a dataset of over 423k fully trained neural architectures, facilitates a paradigm change in NAS research. Instead of carefully evaluating each newly proposed neural architecture, NAS-Bench-101 makes it possible to experiment with classical data-based methods such as supervised learning to evaluate neural architectures. While the impact of benchmarks such as NAS-Bench-101 on the community is high, they come at extreme computational costs: all architectures in the search space have to be extensively evaluated, which leads to practical restrictions on the search space. This calls for accurate surrogate models that can extrapolate expected performances to structurally different and larger architectures in unseen areas of the search space, i.e. zero shot prediction.

In this paper, we first tackle the task of learning to predict the accuracy of convolutional neural architectures in a supervised way, i.e. we learn a surrogate model that predicts the performance of neural architectures on the CIFAR-10 image classification task. Furthermore, we evaluate our proposed model in two different zero shot prediction scenarios and show its ability to accurately predict performances in previously unseen regions of the search space.

Most current neural architectures for computer vision can be represented as directed, acyclic graphs (DAGs). Thus, we base our surrogate model on Graph Neural Networks. Graph Neural Networks (GNNs) [35] have proven very powerful at comprehending local node features and graph substructures.
This makes them a very useful tool to embed nodes as well as full graphs, such as the NAS-Bench-101 architectures, into continuous spaces. Furthermore, the benefit of GNNs over Recurrent Neural Networks (RNNs) has been shown in the context of graph generating models. The model Deep Generative Models of Graphs (DGMG) in [22] utilizes GNNs and outperforms RNN-based methods. DGMG is able to capture the structure of graph data and its attributes in a way that probabilistic dependencies between graph nodes and edges can be expressed, yielding a learned distribution over graphs. This makes DGMG a strong tool to map neural architectures into a feature representation that captures the complex relations within the neural architecture. Inspired by [22], we utilize a GNN as our surrogate model for the performance prediction task.

In summary, in this paper we make the following contributions: We present a surrogate model, a graph encoder built on GNNs, for neural architecture performance prediction, trained and evaluated on the NAS-Bench-101 benchmark, and show that this neural performance predictor accurately predicts architecture performances in structurally different, previously unseen regions of the search space, i.e. zero shot prediction.

The remaining paper is structured as follows: Sect. 2 gives a short review of the related work. In Sect. 3 we present our proposed encoder model. Section 4 gives implementation details of the proposed surrogate model. In Sect. 5, we describe the NAS-Bench-101 dataset on which we conduct our evaluation. In Sect. 6, we present our experiments and results. Finally, we give a conclusion and outline some future directions in Sect. 7.

Neural Architecture Search (NAS) [25, 30, 31, 40, 41], the process of designing neural network architectures in an automatic way, has gained substantial attention recently. See [7] for an overview and detailed survey of recent NAS methods. The currently most successful approaches follow different paradigms: Reinforcement learning (RL) [30, 40, 41] as a NAS strategy considers neural architecture generation as the agent's action, with its reward given in terms of validation accuracy. Evolutionary algorithm (EA) approaches [24, 31] optimize the neural architectures themselves by guiding the mutation of architectures and evaluating their fitness in terms of validation accuracy. Bayesian optimization (BO) [16] derives kernels for architecture similarity measurements to extrapolate the search space. Gradient-based methods [25, 26] use continuous relaxations of neural architectures to allow for gradient-based optimization.

NAS-Bench-101 [38] is a public dataset of 423k neural architectures and provides tabular benchmark results for a restricted, cell-structured architecture search space [41] with exhaustive evaluation on the CIFAR-10 image classification dataset [18]. As shown in [39], only subspaces of the architectures in NAS-Bench-101 can be used to evaluate one-shot NAS methods [25, 30], motivating the proposed variant NAS-Bench-1Shot1 [39]. Similarly to NAS-Bench-101, NAS-Bench-201 [6] uses a restricted, cell-structured search space, while the employed graph representation allows evaluating discrete and one-shot NAS algorithms. The search space is even more restricted than that of NAS-Bench-101, providing only 6k unique evaluated architectures in total. We conduct our experiments on NAS-Bench-101, which is the largest available tabular neural architecture benchmark for computer vision problems.
The work on performance prediction models for neural architectures is very limited. [23] uses a performance predictor in an iterative manner during the search process of NAS. [1] uses features of a neural architecture, such as the validation accuracy, architecture parameters like the number of weights and the number of layers, as well as hyperparameters, to predict learning curves during the training process by means of an SRM regressor. [26] proposes a performance prediction model learned in combination with an auto-encoder in an end-to-end manner. The neural architectures are mapped into a latent feature representation, which is then used by the predictor for performance prediction and further decoded into new neural architectures. Recently, [33] proposed a semi-supervised assessor of neural architectures. The graphs are passed through an auto-encoder to discover latent feature representations, which are then fine-tuned by means of a graph similarity measure. Lastly, a graph convolution network is used for performance prediction.

Combining modern machine learning methods with graph structured data has been gaining popularity. One can interpret it as an extension of deep learning techniques to non-Euclidean data [4] or even as inducing relational biases within deep learning architectures to enable combinatorial generalization [3]. Because of the discrete nature of graphs, they cannot trivially be optimized with differentiable learning methods that act on continuous spaces. The concept of graph neural networks is a remedy to this limitation. The idea of Graph Neural Networks as an iterative process that propagates the node states until an equilibrium is reached was initially mentioned in 2005 [12]. Motivated by the increasing popularity of CNNs, [5] and [15] defined graph convolutions in the Fourier domain by utilizing the graph Laplacian. The modern interpretation of GNNs was first mentioned in [17, 21, 28], where node information was inductively updated through aggregating information of each node's neighborhood. This approach was further specified and generalized by [13] and [10]. The research on GNNs enabled breakthroughs in multiple areas related to graph analysis, such as computer vision [20, 36, 37], natural language processing [2], recommender systems [27], chemistry [10] and others. The capability of GNNs to accurately model dependencies between nodes makes them the foundation of our research. We utilize them to move from the discrete graph space to the continuous space.

In this paper, we use continuous methods, GNNs, to handle the graphs characterizing neural architectures from the NAS-Bench-101 dataset [38]. More precisely, we show the ability of GNNs to encode neural architectures so as to allow for a regression of their expected performance on an image classification problem.

In this section we present our GNN-based model to encode the discrete graph space of NAS-Bench-101 into a continuous vector space. One can imagine a single GNN iteration as a two-step procedure. First, each node sends out a message to its neighbors along its edges. Second, each node aggregates all incoming messages to update itself. After a finite number of these iteration steps, the individual node embeddings are aggregated into a single graph embedding. For each node $v \in V$, we associate an initial node embedding $h_v \in \mathbb{R}^{d_n}$. In our experiments we use a learnable look-up table based on the node types.
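To make this concrete, the following minimal sketch shows such a look-up table in PyTorch, assuming the five NAS-Bench-101 node types (input, output, two convolutions and max-pooling); the type ordering and variable names are our own, not the authors' code.

```python
import torch
import torch.nn as nn

# The five NAS-Bench-101 node types; the exact index assignment is our choice.
NODE_TYPES = ["input", "output", "conv3x3-bn-relu", "conv1x1-bn-relu", "maxpool3x3"]
d_n = 250  # node embedding dimension used in the experiments

lookup = nn.Embedding(num_embeddings=len(NODE_TYPES), embedding_dim=d_n)

# A cell's nodes are mapped to type indices and then to initial embeddings h_v.
node_types = torch.tensor([0, 2, 4, 1])  # e.g. input -> conv3x3 -> maxpool -> output
h0 = lookup(node_types)                  # shape: (4, 250)
```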
Propagating information through the graph can be seen as an iterative message-passing process

$$ m^{(t)}_{u \to v} = M^{(t)}\big(h^{(t-1)}_v, h^{(t-1)}_u\big), \qquad (1) $$

$$ h^{(t)}_v = U^{(t)}\Big(h^{(t-1)}_v,\; \Xi\big(\{ m^{(t)}_{u \to v} : (u, v) \in E \}\big)\Big), \qquad (2) $$

with a message module $M^{(t)}$, an update module $U^{(t)}$ and a differentiable, permutation invariant aggregation function $\Xi$. The message module $M^{(t)}$ is illustrated by the green arrows in Fig. 1 (left). To address the directed nature of the NAS-Bench-101 graphs, we add a reverse message module that passes messages against the edge directions,

$$ m^{out,(t)}_{u \to v} = M^{(t)}_{out}\big(h^{(t-1)}_v, h^{(t-1)}_u\big). \qquad (3) $$

This is outlined in Fig. 1 (left) by the red arrows and leads to so-called bidirectional message passing. The update module $U^{(t)}$ utilizes each node's incoming messages to update that node's embedding from $h^{(t-1)}_v$ to $h^{(t)}_v$,

$$ h^{(t)}_v = U^{(t)}\Big(h^{(t-1)}_v,\; \Xi\big(\{ m^{(t)}_{u \to v} \} \cup \{ m^{out,(t)}_{u \to v} \}\big)\Big). \qquad (4) $$

Exploring many different choices for the message and update modules experimentally, we find that settings similar to [22] work best for our needs. We pick a concatenation together with a single linear layer for our message modules. The update module consists of a single gated recurrent unit (GRU), where $h^{(t-1)}_v$ is treated as the hidden state. For the aggregation function, we choose the sum. To increase the capacity of our model, we apply multiple rounds of propagation on the one hand, and use a different set of parameters for each round on the other. After the final round of message passing, the propagated node embeddings $h = (h_v)_{v \in V}$ are aggregated into a single graph embedding $h_G \in \mathbb{R}^{d_g}$, where

$$ h_G = A(h) = \sum_{v \in V} \sigma\big(g(h_v)\big) \odot f(h_v). \qquad (5) $$

We obtain good results by using a linear layer $f$ combined with a gating layer $g$, followed by a sigmoid $\sigma$, that adjusts each node's fraction in the graph embedding. This aggregation layer $A$ in (5) is further illustrated in Fig. 1 (right).

In this section, we give further details on the implementation of our GNN model. The message module $M^{(t)}$ concatenates the embedding of the considered node $h^{(t-1)}_v$ with the embedding of the sending node $h^{(t-1)}_u$ and applies a single linear layer,

$$ m^{(t)}_{u \to v} = W^{(t)}\big[h^{(t-1)}_v \,\|\, h^{(t-1)}_u\big] + b^{(t)}; $$

the reverse message module $M^{(t)}_{out}$ is a clone of $M^{(t)}$ initialized with its own weights. The message module (green) and the reverse message module (red) can be seen on the left side of Fig. 1. The update module $U^{(t)}$ is a single GRU cell. First, the incoming messages $m_{u \to v}$ and $m^{out}_{u \to v}$ are added and handled as the GRU input. Second, the node embedding $h^{(t-1)}_v$ is treated as the hidden state and is updated,

$$ h^{(t)}_v = \mathrm{GRU}\Big(\textstyle\sum_{u} \big(m^{(t)}_{u \to v} + m^{out,(t)}_{u \to v}\big),\; h^{(t-1)}_v\Big). $$

We use two rounds of propagation before aggregating the node embeddings into a single graph embedding. This graph aggregation consists of two parts. First, a linear layer transforms the node embeddings to the required graph embedding dimension $d_g$. Second, another linear layer combined with a sigmoid controls each node's fraction in the graph embedding, cf. (5). An illustration of the aggregation module is given in Fig. 1 (right).

NAS-Bench-101 [38] is a public dataset of neural architectures in a restricted, cell-structured search space [41] evaluated on the CIFAR-10 classification set [18]. NAS-Bench-101 considers the following constraints to limit the search space: it only considers directed acyclic graphs, the number of nodes is limited to $|V| \leq 7$, the number of edges is limited to $|E| \leq 9$, and only 3 different operations are allowed, {3 × 3 convolution, 1 × 1 convolution, 3 × 3 max-pooling}. These restrictions lead to a total of 423k unique convolutional architectures, which are built from the cells in the following way: each cell is stacked three times, followed by a max-pooling layer which reduces the feature map size by a factor of two. This pattern is repeated three times, followed by global average pooling and a dense softmax layer to produce the output. While this search space is limited, it covers relevant architectures such as ResNet-like [14] and Inception-like [32] models [38]. The architectures have been trained for four increasing numbers of epochs {4, 12, 36, 108}. Each of these architectures is mapped to its test, validation and training measures. In this paper we use the architectures trained for 108 epochs and aim to predict their corresponding validation and test accuracies.
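Before turning to the experiments, the following sketch condenses the encoder of the two preceding sections into a compact PyTorch module. It is a minimal reading of the text, not the authors' code: bidirectional message passing with concatenation plus a linear layer, a GRU cell as update module, sum aggregation, two rounds with separate parameters, and a gated-sum graph readout. The dense adjacency-matrix interface is our simplification (cells have at most 7 nodes, so the O(n²) pair tensor is cheap).

```python
import torch
import torch.nn as nn

class GNNEncoder(nn.Module):
    """Bidirectional message passing with GRU updates and a gated-sum readout."""

    def __init__(self, d_n: int = 250, d_g: int = 56, rounds: int = 2):
        super().__init__()
        self.rounds = rounds
        # One parameter set per propagation round; each message module is a
        # single linear layer over the concatenated node pair, and the reverse
        # message module is a structural clone with its own weights.
        self.msg = nn.ModuleList(nn.Linear(2 * d_n, d_n) for _ in range(rounds))
        self.msg_rev = nn.ModuleList(nn.Linear(2 * d_n, d_n) for _ in range(rounds))
        self.update = nn.ModuleList(nn.GRUCell(d_n, d_n) for _ in range(rounds))
        # Graph aggregation: a linear projection to d_g plus a sigmoid gate
        # that controls each node's fraction in the graph embedding.
        self.proj = nn.Linear(d_n, d_g)
        self.gate = nn.Linear(d_n, d_g)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        """h: (n, d_n) initial node embeddings; adj: (n, n), adj[u, v] = 1 iff u -> v."""
        n = h.size(0)
        for t in range(self.rounds):
            # pair[v, u] = [h_v, h_u] for every ordered node pair.
            pair = torch.cat([h.unsqueeze(1).expand(-1, n, -1),
                              h.unsqueeze(0).expand(n, -1, -1)], dim=-1)
            # Sum messages along incoming edges (u -> v) ...
            m_in = (adj.t().unsqueeze(-1) * self.msg[t](pair)).sum(dim=1)
            # ... and reverse messages along outgoing edges (v -> u).
            m_out = (adj.unsqueeze(-1) * self.msg_rev[t](pair)).sum(dim=1)
            # Summed messages form the GRU input; h is the hidden state.
            h = self.update[t](m_in + m_out, h)
        # Gated sum over all nodes yields the graph embedding h_G in R^{d_g}.
        return (torch.sigmoid(self.gate(h)) * self.proj(h)).sum(dim=0)
```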
We conduct experiments in three complementary domains. First, we evaluate the performance prediction ability of the proposed GNN in the traditional supervised setting. Then, we conduct zero shot prediction experiments in order to show the performance of the proposed model on graph structures unseen during training. Both experiments are carried out on the validation accuracies reported in NAS-Bench-101. Last, we compare our results to the recent publication by Tang et al. [33] in terms of test accuracy prediction.

Unless mentioned otherwise, we set $d_n = 250$ for the node dimensions and $d_g = 56$ for the dimension of the latent space. We split the dataset 70%/20%/10% into training, test and validation sets. All our experiments are implemented using PyTorch [29] and PyTorch Geometric [9]. The hidden layers of the regressor are of size 28, 14 and 7. We use no activation function for the very last output (linear regression) and train the joint encoder model with a learning rate of 1e-5 for 100 epochs. The hyperparameters were tuned with BOHB [8].

Supervised Performance Prediction. Here, we evaluate the latent space generated by the encoder with respect to its prediction error regarding a metric of interest of the NAS-Bench-101 graphs, i.e. the validation accuracy on CIFAR-10. For this purpose, we utilize a simple predictor, a four-layer MLP with ReLU non-linearities. We train the encoder and the predictor jointly end-to-end in a supervised manner. We test for prediction as well as for zero shot prediction errors.

There are a few outliers among the NAS-Bench-101 graphs that end up with a low validation accuracy on the CIFAR-10 classification task. Figure 2 (left) visualizes these outliers and shows that our model is able to find them even if it cannot perfectly predict their accuracies. One can see that the model predicts the validation accuracy of well-performing graphs very accurately. To further explore the loss, Fig. 2 (right) illustrates the mean and variance of the squared error of the test set, partitioned into 9 bins with respect to the ground truth accuracy. The greater part of the loss arises from graphs with a low accuracy. More importantly, our model is very accurate in its prediction for the graphs of interest, namely graphs with high accuracy. The weaker prediction for graphs with low and intermediate accuracy can be explained by their low share in the dataset: the distribution of the individual accuracies in the overall NAS-Bench-101 dataset, shown in Fig. 3 (left), illustrates how rare low and intermediate accuracies are and thereby explains this behaviour of our surrogate model. Figure 3 (middle) and (right) plot the validation accuracy against the test accuracy of the NAS-Bench-101 dataset. A more precise look into the area of interest (Fig. 3, right) shows that the best neural architecture in terms of test accuracy is not the best architecture in terms of validation accuracy; predicting the best architecture on the validation set therefore does not necessarily imply a proper prediction on the test set.
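For concreteness, here is a minimal sketch of the supervised setup described above, reusing the GNNEncoder sketched earlier: a four-layer MLP predictor with ReLU non-linearities (hidden sizes 28, 14, 7, linear output) trained jointly with the encoder at learning rate 1e-5 for 100 epochs. The optimizer choice (Adam) and the data-loading interface are our assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

# Four-layer MLP regressor with ReLU non-linearities, hidden sizes 28, 14, 7,
# and a linear output (no activation on the final layer).
predictor = nn.Sequential(
    nn.Linear(56, 28), nn.ReLU(),   # input: graph embedding of size d_g = 56
    nn.Linear(28, 14), nn.ReLU(),
    nn.Linear(14, 7), nn.ReLU(),
    nn.Linear(7, 1),
)

encoder = GNNEncoder()  # from the sketch above
optimizer = torch.optim.Adam(                    # optimizer type is our assumption
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-5)
loss_fn = nn.MSELoss()

def train(train_set, epochs: int = 100):
    """train_set: iterable of (h0, adj, val_acc) tuples; how these tensors are
    built from NAS-Bench-101 records is omitted here."""
    for _ in range(epochs):
        for h0, adj, val_acc in train_set:
            optimizer.zero_grad()
            pred = predictor(encoder(h0, adj))       # scalar prediction
            loss = loss_fn(pred.squeeze(), val_acc)  # val_acc: scalar tensor
            loss.backward()
            optimizer.step()
```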
We compare the results of the encoder to several baselines. Our baselines are a random forest approach and an MLP regressor with four layers, both using one-hot node feature encodings and graph depth/width feature encodings. In order to compare to an RNN baseline, we adapted the RNN surrogate model from [23], which, in its original implementation, only handles cells of equal length. For the application to NAS-Bench-101 with cell types of different lengths, we make the following slight modification: we input to the LSTM a 0-padded one-hot vector of node attributes, encoding up to 7 nodes and 5 operations.

Table 1 summarizes the results on the supervised performance prediction task. All experiments are repeated 3 times and we report the mean and the relative standard deviation. The experiments show that our surrogate model is able to predict the neural architecture performances in a stable way and outperforms all baselines in terms of the RMSE by a significant margin.

Zero Shot Performance Prediction. Next, we consider the task of predicting the validation accuracy of structurally unknown graph types, i.e. zero shot prediction. The zero shot prediction task is further divided into two subtasks. First, the encoder is trained on all graphs of length 2, 3, 4, 5, 7 and tested on graphs of length 6. In this scenario, the unseen architectures can be understood as interpolations of seen architectures. Second, we train the encoder on graphs of length 2, 3, 4, 5, 6 and test it on graphs of length 7. This case is expected to be harder, not only because the graphs of length 7 are the clear majority and have the highest diversity, but also because the prediction of their performance is an extrapolation beyond the seen training distribution.

Table 2 summarizes the results on the zero shot performance prediction task. All experiments are repeated 3 times and we report the mean and the relative standard deviation. As expected, the resulting RMSE is slightly higher for the extrapolation to graphs of length 7 than for the zero shot prediction for graphs of length 6. Yet, the overall prediction improves over all baselines by a significant margin. The higher standard deviation in comparison to the random forest baselines indicates that the performance of the GNN depends more strongly on the weight initialization than in the fully supervised case. Note, however, that this dependence on the initialization is still significantly lower than for the MLP and RNN baselines. The experiments show that our surrogate model is able to accurately predict data that it has never seen, i.e. it can predict the accuracies even for architectures not represented by the training distribution.

In the following, we analyse the training behaviour of our model in the different scenarios described above. To visualize this behaviour, we plot the development of the training loss against the validation loss for the supervised performance prediction from Sect. 6.2.

Supervised Performance Prediction. Figure 4 displays this development of training loss against validation loss, measured in terms of the RMSE. The smallest achieved RMSE is ∼0.0487 for training on 70% of the dataset, i.e. 296,558 samples.

Zero Shot Prediction. The progress of the training loss and test error for the zero shot prediction case of our encoder can be seen in Fig. 5. The training sets containing all graphs of length 2, 3, 4, 5, 7 / 2, 3, 4, 5, 6 have a total amount of 361,614/64,542 samples; the encoder is thus tested only on graphs of length 6/7, which corresponds to a total of 62,010/359,082 neural architectures, respectively. The behaviour of the test error during the second zero shot prediction task, see Fig. 5 (right), reveals interesting information: during the first epochs, the error rises before it starts decreasing and approaches the training loss asymptotically. One interpretation could be that the model first learns simple graph properties, like the number of nodes, before it learns more complex graph substructures that generalise to the unseen data.
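Summing up the zero shot protocol used above, the following is a minimal sketch of how the length-based splits can be constructed; the record layout and field names are hypothetical.

```python
# `dataset` stands in for the NAS-Bench-101 records; each entry is assumed to
# carry the cell's node count and its label (field names are hypothetical).
dataset = [{"num_nodes": n, "val_acc": 0.9} for n in (2, 3, 4, 5, 6, 7)]  # toy stand-in

def zero_shot_split(dataset, held_out_length):
    """Train on all cell lengths except `held_out_length`, test on that length."""
    train = [g for g in dataset if g["num_nodes"] != held_out_length]
    test = [g for g in dataset if g["num_nodes"] == held_out_length]
    return train, test

# First scenario: interpolate to the unseen length 6 (62,010 test graphs).
train_6, test_6 = zero_shot_split(dataset, held_out_length=6)
# Second scenario: extrapolate to the largest cells of length 7 (359,082 test graphs).
train_7, test_7 = zero_shot_split(dataset, held_out_length=7)
```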
In this section we compare our GNN surrogate model with the most recent state-of-the-art predictor [33], which is evaluated on the test accuracy of the NAS-Bench-101 dataset. Since predicting the validation accuracy well does not imply equally proper prediction behaviour on the test set, we evaluate our surrogate model in the same setting. In [33], an auto-encoder model is first trained on the entire NAS-Bench-101 dataset and then fine-tuned with a graph similarity metric and test accuracy labels. Because the training relies on an unsupervised pre-training, the authors refer to the approach as semi-supervised. To enable a direct comparison, we randomly sample 1,000/10,000/100,000 graphs from the training data set and evaluate the performance prediction ability of the GNN surrogate model on all remaining 431,624/413,624/323,624 graphs in the NAS-Bench-101 dataset. Please note that, at training time, the semi-supervised approach from [33] actually has access to more data than our fully supervised approach, because of the unsupervised pre-training.

Table 3 shows the experimental comparison, where we report an average over three runs for our approach, while the numbers for the semi-supervised assessor [33] (0.0031, 0.0026 and 0.0016 for 1,000, 10,000 and 100,000 training samples, respectively) are taken from their paper. The proposed GNN surrogate model surpasses the semi-supervised assessor [33] when 10,000 and 100,000 training architectures are available. With only 1,000 randomly drawn training samples, the results of our approach degrade. Yet, since we do not have access to the exact training samples used in [33], the results might become less comparable the lower the number of samples drawn.

In this paper, we propose a GNN surrogate model for the prediction of the performance of neural architectures. Through multiple experiments on NAS-Bench-101, we examined various capabilities of the encoder. The GNN encoder is a powerful tool for supervised performance prediction, especially in the zero shot setup. Future work will mainly explore the use of this performance predictor within neural architecture search.
References

[1] Accelerating neural architecture search using performance prediction
[2] Graph convolutional encoders for syntax-aware neural machine translation
[3] Relational inductive biases, deep learning, and graph networks
[4] Geometric deep learning: going beyond Euclidean data
[5] Spectral networks and locally connected networks on graphs
[6] NAS-Bench-201: extending the scope of reproducible neural architecture search
[7] Neural architecture search: a survey
[8] BOHB: robust and efficient hyperparameter optimization at scale
[9] Fast graph representation learning with PyTorch Geometric
[10] Neural message passing for quantum chemistry
[11] Generative adversarial nets
[12] A new model for learning in graph domains
[13] Inductive representation learning on large graphs
[14] Deep residual learning for image recognition
[15] Deep convolutional networks on graph-structured data
[16] Neural architecture search with Bayesian optimisation and optimal transport
[17] Semi-supervised classification with graph convolutional networks
[18] Learning multiple layers of features from tiny images
[19] ImageNet classification with deep convolutional neural networks
[20] Large-scale point cloud semantic segmentation with superpoint graphs
[21] Gated graph sequence neural networks
[22] Learning deep generative models of graphs
[23] Progressive neural architecture search
[24] Hierarchical representations for efficient architecture search
[25] DARTS: differentiable architecture search
[26] Neural architecture optimization
[27] Geometric matrix completion with recurrent multi-graph neural networks
[28] Learning convolutional neural networks for graphs
[29] Automatic differentiation in PyTorch
[30] Efficient neural architecture search via parameter sharing
[31] Large-scale evolution of image classifiers
[32] Rethinking the inception architecture for computer vision
[33] A semi-supervised assessor of neural architectures
[34] BANANAS: Bayesian optimization with neural architectures for neural architecture search
[35] A comprehensive survey on graph neural networks
[36] Scene graph generation by iterative message passing
[37] SyncSpecCNN: synchronized spectral CNN for 3D shape segmentation
[38] NAS-Bench-101: towards reproducible neural architecture search
[39] NAS-Bench-1Shot1: benchmarking and dissecting one-shot neural architecture search
[40] Neural architecture search with reinforcement learning
[41] Learning transferable architectures for scalable image recognition

Acknowledgement. We thank Alexander Diete for helpful discussions and comments. This project is supported by the German Federal Ministry of Education and Research Foundation via the project DeToL.