Title: Neural Capacitance: A New Perspective of Neural Network Selection via Edge Dynamics
Authors: Chunheng Jiang, Tejaswini Pedapati, Pin-Yu Chen, Yizhou Sun, Jianxi Gao
Date: 2022-01-11

Abstract: Efficient model selection for identifying a suitable pre-trained neural network for a downstream task is a fundamental yet challenging problem in deep learning. Current practice requires expensive computational costs in model training for performance prediction. In this paper, we propose a novel framework for neural network selection by analyzing the governing dynamics over synaptic connections (edges) during training. Our framework is built on the fact that back-propagation during neural network training is equivalent to the dynamical evolution of synaptic connections. Therefore, a converged neural network is associated with an equilibrium state of a networked system composed of those edges. To this end, we construct a network mapping $\phi$, converting a neural network $G_A$ to a directed line graph $G_B$ that is defined on those edges in $G_A$. Next, we derive a neural capacitance metric $\beta_{\rm eff}$ as a predictive measure that universally captures the generalization capability of $G_A$ on the downstream task using only a handful of early training results. We carried out extensive experiments using 17 popular pre-trained ImageNet models and five benchmark datasets, including CIFAR10, CIFAR100, SVHN, Fashion MNIST and Birds, to evaluate the fine-tuning performance of our framework. Our neural capacitance metric is shown to be a powerful indicator for model selection based only on early training results, and is more efficient than state-of-the-art methods.

Leveraging a pre-trained neural network (i.e., a source model) and fine-tuning it to solve a target task is a common and effective practice in deep learning, such as transfer learning. Transfer learning has been widely used to solve complex tasks in text and vision domains. In vision, models trained on ImageNet are leveraged to solve diverse tasks such as image classification and object detection. In text, language models trained on a large amount of public data comprising books, Wikipedia, etc., are employed to solve tasks such as classification and language generation. Although such a technique can achieve good performance on a target task, a fundamental yet challenging problem is how to select a suitable pre-trained model from a pool of candidates in an efficient manner. The naive solution of training each candidate fully with the target data can find the best pre-trained model, but is infeasible due to the considerable consumption of time and computational resources. This challenge motivates the need for an efficient predictive measure that captures the performance of a pre-trained model on the target task based only on early training results (e.g., predicting final model performance from the statistics of the first few training epochs). In order to implement efficient neural network (NN) model selection, this paper proposes a novel framework to forecast the predictive ability of a model from its cumulative information in the early phase of NN training, as practised in learning curve prediction (Domhan et al., 2015; Chandrashekaran & Lane, 2017; Baker et al., 2017; Wistuba & Pedapati, 2020).
Most prior work on learning curve prediction aims to capture the trajectory of learning curves with a regression function of the models' validation accuracy. Some of the previous algorithms developed in this field require training data from additional learning curves to train the predictors (Chandrashekaran & Lane, 2017; Baker et al., 2017; Wistuba & Pedapati, 2020). In contrast, our model does not require any such data; it relies solely on the NN architecture. Ranking models according to their final accuracy after fine-tuning is considerably more challenging because the learning curves are very similar to each other. The entire NN training process involves iterative updates of the weights of synaptic connections, according to a particular optimization algorithm, e.g., gradient descent or stochastic gradient descent (SGD) (Bottou, 2012; LeCun et al., 2015). In essence, many factors impact how the weights are updated, including the training data, the neural architecture, the loss function, and the optimization algorithm. Moreover, the weights evolving during NN training can in many respects be viewed as a discrete dynamical system. The perspective of viewing NN training as a dynamical system has been studied by the community (Mei et al., 2018; Chang et al., 2018; Banburski et al., 2019; Tano et al., 2020; Dogra & Redman, 2020; Feng & Tu, 2021), and many works have attempted theoretical explanations of the convergence rate and generalization error bounds. In this paper, we provide the first attempt at exploring its power in neural model selection. One limitation of current approaches is that they concentrate on the macroscopic and collective behavior of the system, but lack a dedicated examination of the individual interactions between the trainable weights or synaptic connections, which are crucial for understanding the dependency of these weights and how they co-evolve during training. To fill this gap, we study the system from a microscopic perspective and build edge dynamics of synaptic connections from SGD in terms of differential equations, from which we also build an associated network. The edge dynamics induced from SGD is nonlinear and highly coupled. It is very challenging to solve, considering the millions of weights in many convolutional neural networks (CNNs), e.g., about 4M weights in MobileNet (Howard et al., 2017) and 138M in VGG16 (Simonyan & Zisserman, 2014). Gao et al. (2016) proposed a universal topological metric for the associated network to decouple the system. This metric will be used for model selection in our approach, and it is shown to be powerful in the search for the best predictive model. We illustrate our proposed framework in Fig. 1. The main contributions of our framework can be summarized as follows:

• View NN training as a dynamical system over synaptic connections, and investigate, for the first time, the interactions of synaptic connections from a microscopic perspective.
• Propose the neural capacitance metric β_eff for neural network model selection.
• Empirical results of 17 pre-trained models on five benchmark datasets show that our β_eff based approach outperforms current learning curve prediction approaches.
• For rank prediction according to the performance of pre-trained models, our approach achieves relative improvements of 9.1/38.3/12.4/65.3/40.1% over the best baseline on CIFAR10/CIFAR100/SVHN/Fashion MNIST/Birds.

Figure 1: Illustration of our framework. (a) An example multilayer perceptron (MLP) G_A is mapped to a directed line graph G_B, which is governed by an edge dynamics B.
Each node (dichromatic square) of G_B is associated with a synaptic connection linking two neurons (in different colors) from different layers of G_A. (b) A diagram of transfer learning from the source domain (left stack) to a target domain (right stack). The pre-trained model is modified by adding additional layers, i.e., installing a neural capacitance probe (NCP) unit, on top of the bottom layers. The NCP is frozen with a set of randomly initialized weights, and only the bottom layers are fine-tuned. (c) Observed partial learning curves (green line segments) of validation accuracy over the early-stage training epochs and the corresponding neural capacitance metric β_eff during fine-tuning. The predicted final accuracy at β_eff → 0 (red dot) is used to select the best one from a set of models. The metric β_eff relies on G_B's weighted adjacency matrix P, which itself is derived from the reformulation of the training dynamics. To predict the performance, a lightweight β_eff of the NCP is used instead of the heavyweight one over the entire network on the right stack of (b).

Learning Curve Prediction. Some prior methods further learn embeddings to capture the similarities between the architectures of various models and the other datasets they were trained on.

Dynamical System View of NNs. There are many efforts to study the dynamics of NN training. Some prior work on SGD dynamics for NNs generally makes assumptions about the input distribution or how the labels are generated, and obtains global convergence for shallow NNs (Tian, 2017; Banburski et al., 2019). System identification itself is a complicated task (Haykin, 2010; Lillicrap et al., 2020). In studying the generalization phenomenon of deep NNs, Goldt et al. (2019) formulated SGD with a set of differential equations, but the analysis is limited to over-parameterized two-layer NNs under the teacher-student framework, where the teacher network determines how the labels are generated. Also, several interesting phenomena (Frankle et al., 2020) are observed during the early phase of NN training: trainable sparse sub-networks emerge (Frankle et al., 2019), gradient descent moves into a small subspace (Gur-Ari et al., 2018), and there exists a critical effective connection between layers (Achille et al., 2019). Bhardwaj et al. (2021) established a connection between architectures (with concatenation-type skip connections) and performance, and proposed a new topological metric to identify NNs with similar accuracy. Many of these studies build on dynamical systems and network science, which is a promising direction for studying the mechanisms of deep learning.

Dynamical System of a Network. Many real complex systems, e.g., plant-pollinator interactions (Waser & Ollerton, 2006) and the spread of COVID-19 (Thurner et al., 2020), can be described with networks (Mitchell, 2006; Barabási & Pósfai, 2016). Let G = (V, E) be a network with node set V and edge set E. Assuming n = |V|, the interactions between nodes can be formulated as a set of differential equations

$$\dot{x}_i = f(x_i) + \sum_{j=1}^{n} P_{ij}\, g(x_i, x_j), \qquad (1)$$

where x_i is the state of node i. In real systems, it could be the abundance of a plant in an ecological network, the infection rate of a person in an epidemic network, or the expression level of a gene in a regulatory network. The term P is the adjacency matrix of G, where the entry P_ij indicates the interaction strength between nodes i and j. The functions f(·) and g(·, ·) capture the internal and external impacts on node i, respectively. Usually, they are nonlinear. Let x = (x_1, x_2, . . . , x_n).
For a small network, given an initial state, one can run a forward simulation for an equilibrium state x*, such that $\dot{x}\,\big|_{x=x^*} = 0$. However, when the size of the system goes up to millions or even billions, it poses a big challenge to solve the coupled differential equations. The problem can be efficiently addressed by employing a mean-field technique (Gao et al., 2016), where a linear operator $\mathcal{L}_P(\cdot)$ is introduced to decouple the system. Specifically, the operator depends on the adjacency matrix P and is defined as

$$\mathcal{L}_P(z) = \frac{\mathbf{1}^\top P z}{\mathbf{1}^\top P \mathbf{1}}, \qquad (2)$$

where z ∈ R^n. Let δ^in = P1 be the nodes' in-degrees and δ^out = 1^T P be the nodes' out-degrees. For a weighted G, the degrees are weighted as well. Applying L_P(·) to δ^in gives

$$\beta_{\rm eff} = \mathcal{L}_P(\delta^{\rm in}) = \frac{\mathbf{1}^\top P\, \delta^{\rm in}}{\mathbf{1}^\top P \mathbf{1}}, \qquad (3)$$

which proves to be a powerful metric of the resilience of networks, and has been applied to make reliable inferences from incomplete networks (Jiang et al., 2020b; a). We use it to measure the predictive ability of an NN (see Section 4.3), whose training is in essence a dynamical system. For an overview of the related technique, readers are referred to Appendix H.

NN Training is a Dynamical System. Conventionally, training an NN is a nonlinear optimization problem. Because of the hierarchical structure of NNs, the training procedure is implemented by two alternating procedures: forward-propagation (FP) and back-propagation (BP), as described in Fig. 1(a). During FP, data goes through the input layer and hidden layers, up to the output layer, which produces the predictions for the input data. The differences between the outputs and the labels of the input data are used to define an objective function C, a.k.a. the training error function. BP proceeds to minimize C, in the reverse order of FP, by propagating the error from the output layer down to the input layer. The trainable weights of the synaptic connections are updated accordingly. Let G_A be an NN, w be the flattened weight vector of G_A, and z be the set of activation values. As a whole, the training of G_A can be described by two coupled dynamics: A on G_A and B on G_B, where nodes in G_A are neurons and nodes in G_B are the synaptic connections. The coupling arises from the strong inter-dependency between z and w: the states z (activation values or activation gradients) of G_A are the parameters of B, and the states w of G_B are the trainable parameters of G_A. If we put the whole training process in the context of networked systems, A denotes a node dynamics because the states of nodes evolve during FP, and B expresses an edge dynamics because of the updates of edge weights during BP (Mei et al., 2018; Poggio et al., 2020a; b). Mathematically, we formulate the node and edge dynamics based on the gradients of C:

$$\mathcal{A}: \; \frac{dz}{dt} = -\nabla_z C, \qquad \mathcal{B}: \; \frac{dw}{dt} = -\nabla_w C, \qquad (4)$$

where t denotes the training step. Let $a^{(\ell)}_i$ be the pre-activation of node i on layer $\ell$, and $\sigma_\ell(\cdot)$ be the activation function of layer $\ell$. Usually, the output activation function is a softmax. The hierarchical structure of G_A exerts constraints over z for neighboring layers, i.e., $z^{(\ell)} = \sigma_\ell(a^{(\ell)}) \in \mathbb{R}^{n_\ell}$, where $n_\ell$ is the total number of neurons on layer $\ell$, and G_A has L + 1 layers. It also presents a dependency between z and w. For example, when G_A is an MLP without bias, $a^{(\ell)} = W^{(\ell)} z^{(\ell-1)}$, which builds an interconnection from G_A to G_B. Obviously, given w, the activation z satisfying all these constraints is also a fixed point of A. Meanwhile, an equilibrium state of B provides a set of optimal weights for G_A.
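To make Eqs. 2–3 concrete, here is a minimal NumPy sketch of the operator L_P(·) and of β_eff for a small weighted network; the toy matrix P is illustrative and not derived from any NN.

```python
import numpy as np

def l_p(P, z):
    """Linear operator L_P(z) = 1^T P z / (1^T P 1), a weighted
    average of the entries of z (Eq. 2)."""
    one = np.ones(P.shape[0])
    return one @ P @ z / (one @ P @ one)

def beta_eff(P):
    """Resilience metric beta_eff = L_P(delta_in) (Eq. 3),
    where delta_in = P 1 are the weighted in-degrees."""
    delta_in = P @ np.ones(P.shape[0])
    return l_p(P, delta_in)

# toy example: a small random weighted network
rng = np.random.default_rng(0)
P = rng.random((5, 5))
print(beta_eff(P))
```

Since β_eff depends only on P, the whole mean-field reduction collapses to two matrix-vector products, which is what makes the metric cheap relative to simulating the coupled system.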
The metric β_eff is a universal metric to characterize different types of networks, including biological neural networks (Shu et al., 2021) (Section 3). Because of the generality of β_eff, we analyze how it behaves on artificial neural networks, which are designed to mimic their biological counterparts for general intelligence. Therefore, we set up an analogous system for the trainable weights. To this end, we build a line graph for the trainable weights (Section 4.1), and reformulate the training dynamics in the same form as the general dynamics (Eq. 1) (Section 4.2). The reformulated dynamics reveals a simple yet powerful property regarding β_eff (Section 4.3), which is utilized to predict the final accuracy of G_A from a few observations during the early phase of training (Section 4.4). For a detailed description of the core idea of our framework, see Appendix I.

We build a mapping scheme φ : G_A → G_B, from an NN G_A to an associated graph G_B. The topology of the synaptic connections (edges) is established as a well-defined line graph proposed by Nepusz & Vicsek (2012), and the nodes of G_B are the synaptic connections of G_A. More precisely, each node in G_B is associated with a trainable parameter in G_A. For an MLP, each synaptic connection is assigned a trainable weight, and the edge set of G_A is also the node set of G_B. For a CNN, this one-to-one mapping from neurons on layer ℓ to layer ℓ+1 is replaced by a one-to-many mapping because of weight-sharing, e.g., a parameter in a convolutional filter is repeatedly used in FP and associated with multiple pairs of neurons from the two neighboring layers. Since the error gradients flow in a reversed direction, we reverse the corresponding links of the proposed line graph for G_B. Specifically, given any pair of nodes in G_B, if they share an associated intersection neuron in the FP propagation routes, a link with a reversed direction is created for them. In Fig. 1(a), we demonstrate how the mapping is performed on an example MLP. With the topology of G_B in place, the weights of the links in G_B are not yet specified. To fill in these missing components, we reveal the interactions of synaptic connections from SGD, quantify the interaction strengths, and then define the weights of the links in G_B accordingly. The related technical details are disclosed in the next section.

In SGD, each time a small batch of samples is chosen to update w, i.e., w ← w − α∇_w C, where α > 0 is the learning rate. When the desired conditions are met, training is terminated. We denote the activation gradients as $\delta^{(\ell)} = [\partial C/\partial z^{(\ell)}_1, \cdots, \partial C/\partial z^{(\ell)}_{n_\ell}]^\top \in \mathbb{R}^{n_\ell}$ and the derivatives of the activation function σ_ℓ for layer ℓ as $\sigma'_\ell = [\sigma'_\ell(a^{(\ell)}_1), \cdots, \sigma'_\ell(a^{(\ell)}_{n_\ell})]^\top$. To understand how the weights W^(ℓ) affect each other, we explicitly expand δ^(ℓ):

$$\delta^{(\ell)} = \sigma'_\ell \odot \big( (W^{(\ell+1)})^\top \delta^{(\ell+1)} \big), \qquad (5)$$

where ⊙ is the Hadamard product; unrolling this recursion down from the output layer expresses δ^(ℓ) as a product over all downstream layers. We find that the parameters W^(ℓ) are associated with all accessible parameters on downstream layers, and such a recursive relation defines a high-order hyper-network interaction (Casadiego et al., 2017) between any W^(ℓ) and the other parameters. The Hadamard product x ⊙ y has an equivalent matrix multiplication form, i.e., x ⊙ y = Λ(y)x, where Λ(y) is a diagonal matrix consisting of the entries of y on the diagonal. Therefore, we have

$$\delta^{(\ell)} = \Lambda(\sigma'_\ell)\, (W^{(\ell+1)})^\top \delta^{(\ell+1)}.$$

Our purpose is to build direct interactions between synaptic connections. It can be done by identifying which units provide direct physical interactions to a given unit and appear on the right-hand side of its differential equation B in Eq. 4, and how much such interactions come into play.
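As a quick sanity check of the two equivalent forms above, the following NumPy snippet verifies, for random weights and assumed ReLU activations, that the Hadamard form and the Λ(·) matrix form of the δ recursion coincide; all sizes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_l, n_lp1 = 4, 3                      # layer widths (assumed)
W_lp1 = rng.normal(size=(n_lp1, n_l))  # W^{(l+1)}: layer l -> l+1
a_l = rng.normal(size=n_l)             # pre-activations on layer l
delta_lp1 = rng.normal(size=n_lp1)     # activation gradients on layer l+1

sigma_prime = (a_l > 0).astype(float)  # ReLU derivative sigma'_l(a^{(l)})

# Hadamard form: delta^{(l)} = sigma'_l (.) (W^{(l+1)T} delta^{(l+1)})
delta_l_hadamard = sigma_prime * (W_lp1.T @ delta_lp1)

# equivalent matrix form with Lambda(y) = diag(y)
delta_l_matrix = np.diag(sigma_prime) @ W_lp1.T @ delta_lp1

assert np.allclose(delta_l_hadamard, delta_l_matrix)
```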
There are multiple routes to build up a direct interaction between any pair of network weights from different layers, as presented by the product terms in δ^(ℓ). However, the coupled interactions make an exact attribution impossible, a difficulty well known as the credit assignment problem (Whittington & Bogacz, 2019; Lillicrap et al., 2020). We propose a remedy: the impact of all the other units on W^(ℓ) is approximated by the direct, local impact from W^(ℓ+1), and the contribution of the others as a whole is implicitly encoded in the activation gradient δ^(ℓ+1). Moreover, we have the weight gradient (see Appendix A for the detailed derivation)

$$\nabla_{W^{(\ell)}} C = \delta^{(\ell)} (z^{(\ell-1)})^\top = \Lambda(\sigma'_\ell)\,(W^{(\ell+1)})^\top \delta^{(\ell+1)} (z^{(\ell-1)})^\top, \qquad (6)$$

which shows the dependency of W^(ℓ) on W^(ℓ+1), and can itself be viewed as an explicit description of the dynamical system B in Eq. 4. In terms of a differential equation, we have

$$\frac{dW^{(\ell)}}{dt} = -\nabla_{W^{(\ell)}} C \equiv F(W^{(\ell+1)}). \qquad (7)$$

Because of the mutual dependency of the weights and the activation values, it is hard to make an exact decomposition of the impacts of the different parameters on W^(ℓ). However, in the gradient ∇_{W^(ℓ)} C, W^(ℓ+1) appears as an explicit term and contributes the direct impact on W^(ℓ). To capture this direct impact and derive the adjacency matrix P for G_B, we apply a Taylor expansion to ∇_{W^(ℓ)} C and obtain

$$P^{(\ell,\ell+1)} = \frac{\partial^2 C}{\partial W^{(\ell)}\, \partial W^{(\ell+1)}}, \qquad (8)$$

which defines the interaction strength between each pair of weights from layer ℓ+1 to layer ℓ. See Appendix B for the detailed derivation of P on an MLP, and Appendix C on general NNs. Let w = (w_1, w_2, . . .) be a flattened vector of all trainable weights of G_A. Given a pair of weights w_i and w_j, one from layer ℓ_1 and the other from layer ℓ_2: if ℓ_2 = ℓ_1 + 1, the entry P_ij is defined according to Eq. 8; otherwise P_ij = 0. Considering the scale of the trainable parameters in G_A, P is very sparse. Let W^(ℓ+1)* be the equilibrium states (Appendix C); the training dynamics Eq. 7 is then reformulated into the form of Eq. 1 and gives the edge dynamics B for G_B:

$$\dot{w}_i = f(w_i) + \sum_j P_{ij}\, g(w_i, w_j), \qquad (9)$$

with f(w_i) = F(w*_i) and g(w_i, w_j) = w_j − w*_j. The value of the weights at an equilibrium state {w*_j} is unknown, but it is a constant and does not affect the computation of β_eff. According to Eq. 8, we have the weighted adjacency matrix P of G_B in place. Now we can quantify the total impact that a trainable parameter (or synaptic connection) receives from itself and the others, which corresponds to the weighted in-degrees δ^in = P1. Applying L_P(·) (see Eq. 2) to δ^in, we get a "counterpart" metric β_eff = L_P(δ^in) to measure the predictive ability of a neural network G_A, as the resilience metric (see Eq. 3) does for a general network G (see Dynamical System of a Network in Section 3). If G_A is an MLP, we can explicitly write the entries of P, and hence an explicit β_eff (Eq. 10); for the details of how to derive P and β_eff for an MLP, see Appendix B. Moreover, we prove in Theorem 1 below that as G_A converges, $\nabla_{W^{(\ell)}} C$ vanishes and β_eff approaches zero (see Appendix D).

Theorem 1. Let ReLU be the activation function of G_A. When G_A converges, β_eff = 0.

For an MLP G_A, it is possible to derive an analytical form of β_eff. However, it becomes extremely complicated for a deep NN with multiple convolutional layers. To realize β_eff for deep NNs in any form, we take advantage of the automatic differentiation implemented in TensorFlow. Considering the number of parameters, it is still computationally expensive, and prohibitive to calculate β_eff for the entire G_A.
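The mixed second derivative of Eq. 8 can be obtained with automatic differentiation, as mentioned above. Below is a sketch using nested tf.GradientTape contexts on a toy three-layer MLP; the layer sizes, data, and loss are illustrative assumptions, not the paper's exact setup.

```python
import tensorflow as tf

# toy 3-layer MLP weights (sizes are illustrative)
W1 = tf.Variable(tf.random.normal((8, 4)))
W2 = tf.Variable(tf.random.normal((6, 8)))
W3 = tf.Variable(tf.random.normal((3, 6)))
x = tf.random.normal((16, 4))                 # a small batch
y = tf.one_hot(tf.random.uniform((16,), maxval=3, dtype=tf.int32), 3)

with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        z1 = tf.nn.relu(x @ tf.transpose(W1))
        z2 = tf.nn.relu(z1 @ tf.transpose(W2))
        logits = z2 @ tf.transpose(W3)
        C = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
    g2 = inner.gradient(C, W2)                # dC/dW^{(2)}
# P^{(2,3)}: response of dC/dW^{(2)} to W^{(3)} (mixed second derivative)
P_block = outer.jacobian(g2, W3)              # shape (6, 8, 3, 6)
print(P_block.shape)
```

Each such block is then flattened into the sparse block-banded adjacency matrix P, whose only non-zero blocks couple consecutive layers.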
Algorithm 1: Implement the NCP and compute β_eff
Input: a pre-trained model F_s (bottom layers F_s^(1) and original head F_s^(2)), a target dataset D_t, the maximum number of epochs T
1: Remove F_s^(2) from F_s and add on top of F_s^(1) an NCP unit U with multiple layers (Fig. 1b)
2: Initialize U with random weights and freeze it
3: Fine-tune F_s^(1) on D_t for T epochs
4: Obtain P from U according to Eq. 8
5: Compute β_eff from P according to Eq. 3 or Eq. 10

Because of this, we seek to derive a surrogate from a part of G_A. As shown in Section 4.4, we insert a neural capacitance probe (NCP) unit, i.e., put additional layers on top of the beheaded G_A (excluding the original output layer), and estimate the predictive ability of the entire G_A using the β_eff of the NCP unit. Therefore, in the context of model selection from a pool of pre-trained models, if no confusion arises, we call β_eff a neural capacitance. Here we show a novel application of our proposed neural capacitance β_eff to model selection. Specifically, we transfer the pre-trained models by (i) removing the output layer and (ii) adding some layers on top of the remaining layers (Fig. 1b), and fine-tune them using a small learning rate. As shown in Algorithm 1, the newly added layers U on top of the bottom layers of F_s are used as an NCP unit. The specifics of the NCP unit are detailed in Section 5. The NCP is not updated during fine-tuning; it is merely used to calculate β_eff, and then to estimate the performance of G_A over the target domain D_t. According to Theorem 1, when the model converges, β_eff → 0. In an indirect way, the predictive ability of the model can be determined from the relation between the training β_eff and the validation accuracy I. Since both β_eff and I are available during fine-tuning, we collect a set of data points of these two quantities in the early phase as observations, and fit a regularized linear model I = h(β_eff; θ) with Bayesian ridge regression (Tipping, 2001), where θ are the associated coefficients (see Appendix E for technical details). The estimated predictor I = h(β_eff; θ*) predicts the final accuracy of a model by setting β_eff = 0, i.e., I* = h(0; θ*); see an example in row 3 of Fig. 2. For full training of the best model, one can either retain or remove the NCP and fine-tune the selected model.

Pre-trained models and datasets. We evaluate 17 pre-trained ImageNet models implemented in Keras, including AlexNet, VGGs (VGG16/19), ResNets (ResNet50/50V2/101/101V2/152/152V2), DenseNets (DenseNet121/169/201), MobileNets (MobileNet and MobileNetV2), Inceptions (InceptionV3, InceptionResNetV2) and Xception, to measure the performance of our approach. Four benchmark datasets, CIFAR10, CIFAR100, SVHN and Fashion MNIST, of size 32 × 32 × 3, and one Kaggle challenge dataset, Birds, of size 224 × 224 × 3, are used, and their original train/test splits are adopted. In addition, 15K original training samples are set aside as a validation set for each dataset.

Experimental setup. To get a well-defined β_eff, G_A requires at least three hidden layers (see Appendix C). Also, batch normalization (Ioffe & Szegedy, 2015) is usually beneficial because it can stabilize training by adjusting the magnitude of activations and gradients. To this end, on top of each pre-trained model, we put an NCP unit composed of (1) a dense layer of size 256 and (2) a dense layer of size 128, each of which follows (3) a batch normalization layer and is followed by (4) a dropout layer with a dropout probability of 0.4.
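The construction of Algorithm 1 and the NCP unit just described can be sketched in Keras as follows; the choice of MobileNet as the base model, the 10-class head, and the name-based freezing scheme are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# bottom layers F_s^(1): a beheaded pre-trained model (choice is illustrative)
base = tf.keras.applications.MobileNet(weights="imagenet",
                                       include_top=False, pooling="avg")

inputs = tf.keras.Input(shape=(224, 224, 3))
h = base(inputs)
# NCP unit U: dense(256) and dense(128), each preceded by batch norm and
# followed by dropout(0.4); Kaiming normal ("he_normal") initialization
for idx, width in enumerate((256, 128)):
    h = layers.BatchNormalization(name=f"ncp_bn_{idx}")(h)
    h = layers.Dense(width, activation="relu",
                     kernel_initializer="he_normal",
                     name=f"ncp_dense_{idx}")(h)
    h = layers.Dropout(0.4, name=f"ncp_drop_{idx}")(h)
# new softmax head, treated here as part of the added (frozen) layers
outputs = layers.Dense(10, activation="softmax", name="ncp_head")(h)
model = tf.keras.Model(inputs, outputs)

# freeze the NCP unit; only the bottom layers are fine-tuned
for layer in model.layers:
    if layer.name.startswith("ncp_"):
        layer.trainable = False

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
# fine-tune with model.fit(...); after each early epoch, compute P on U
# (Eq. 8) and beta_eff (Eq. 3) from the frozen NCP weights
```

Freezing the layers by name keeps the randomly initialized NCP fixed while the pre-trained bottom layers receive gradient updates, matching the frozen-probe setup in Fig. 1(b).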
Before fine-tuning, we initialize the NCP unit using Kaiming normal initialization (He et al., 2015). We set a batch size of 64 and a learning rate of 0.001, fine-tune each pre-trained model for T = 50 epochs, and repeat the procedure 20 times. As shown in Fig. 2, the pre-trained models have converged after fine-tuning on CIFAR10. For each model, we collect the validation accuracy (blue stars in row 1) and β_eff on the training set (green squares in row 2) during the early stage of fine-tuning as the observations (e.g., green squares in row 3 marked by the green box for 5 epochs), then use these observations to predict the test accuracy, unseen before fine-tuning terminates. For better illustration, the learning curves are visualized on a log scale.

Evaluation. We apply Bayesian ridge regression to the observations to capture the relation between β_eff and the validation accuracy, and to estimate a learning curve predictor I = h(β_eff; θ*). The performance of the model is revealed as I* = h(β*_eff; θ*) with β*_eff = 0. As shown in row 3 of Fig. 2, the blue lines are the estimated h(·; θ), and the true test accuracy at T and the predicted accuracy are marked as red triangles and blue stars, respectively. Both the estimates and predictions are accurate. We aim to select the best model from a pool of candidates, so a relative ranking of these candidates matters more than the exact values of their predicted accuracy. To evaluate and compare different approaches, we choose Spearman's rank correlation coefficient ρ as the metric, and calculate ρ over the true test accuracy at epoch T and the predicted accuracy I* of all pre-trained models. In Fig. 3(a), we report the true and predicted accuracy for each model on CIFAR10, as well as the overall ranking performance measured by ρ. It indicates that our β_eff based model ranking is reliable, with ρ > 0.9. For the results on all five datasets, see Appendix Fig. F.4. The estimation quality of h determines how well the relation between I and β_eff is captured. Besides the regression method, the starting epoch t_0 of the observations also plays a role in the estimation. As shown in Fig. 3(b), we evaluate the impact of t_0 on the ρ of our approach. As expected, when the length of the learning curves is fixed, a higher t_0 usually produces a better ρ. Since our ultimate goal is to predict from early observations, t_0 should also be constrained to a small value. To make the comparisons fair, we view t_0 as a hyper-parameter, and select it according to the Bayesian information criterion (BIC) (Friedman et al., 2001), as shown in row 3 of Fig. 2.

Impact of the size of the training set. CIFAR10 has 50K original training and 10K testing samples. Generally, the 50K samples are further split into 35K for training and 15K for validation. In studying the dynamics of NN training, it is essential to understand how varying the training size influences the effectiveness of our approach. We select the first {10,15,20,25,30}K of the original 50K samples as training sets of reduced size, and the last 10K samples as the validation set, to fine-tune the pre-trained models for 50 epochs. As shown in Fig. 3(c), we can use a training set as small as 25K to achieve performance similar to that obtained with all 35K training samples. This has an important implication for efficient NN training, because the size of the required training set can be greatly reduced (by around 30% in our experiment) while maintaining similar model ranking performance.
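The extrapolation step can be sketched with scikit-learn's BayesianRidge and SciPy's Spearman correlation; the observation values below are made up for illustration, and the paper's actual feature transform (e.g., fitting on a log scale) may differ.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from scipy.stats import spearmanr

# early-phase observations for one model (illustrative values):
# beta_eff on the training set and validation accuracy per epoch
beta_obs = np.array([0.95, 0.61, 0.38, 0.24, 0.15]).reshape(-1, 1)
acc_obs = np.array([0.52, 0.63, 0.70, 0.74, 0.77])

reg = BayesianRidge().fit(beta_obs, acc_obs)        # I = h(beta_eff; theta)
final_acc_pred = reg.predict(np.array([[0.0]]))[0]  # extrapolate beta_eff -> 0

# rank a pool of models: Spearman's rho between true and predicted accuracy
true_acc = np.array([0.80, 0.91, 0.85])
pred_acc = np.array([0.78, 0.93, 0.84])
rho, _ = spearmanr(true_acc, pred_acc)
```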
Note that the true test accuracy used in computing ρ is that of the model trained on the 35K training samples, shared by all five cases {10,15,20,25,30}K in our analysis.

Table 1: A comparison between our β_eff based approach and the baselines in model ranking. The notation LLC represents the length of the learning curve, and Imprv represents the relative improvement of our approach over the best baseline. Due to a failure of the supporting package of CL, there is a missing ρ at an LLC of 10, which does not affect our conclusions.

Ours versus baselines. We select BGRN (Baker et al., 2017) and CL (Chandrashekaran & Lane, 2017) as the baselines, as well as two heuristic rules that extrapolate using the last seen value (LSV) (Klein et al., 2017b) or the best seen value (BSV) of a learning curve, and compare their performance against ours. As shown in Table 1 and Appendix Fig. F.5, using only a few observations, e.g., 5 epochs, our approach achieves 9.1/38.3/12.4/65.3/40.1% relative improvements over the best baseline on CIFAR10/CIFAR100/SVHN/Fashion MNIST/Birds. Our approach is efficient, especially for large and deep NNs. Unlike the training task, which involves a full FP and BP, i.e., T_train = T_FP + T_BP, computing β_eff only requires computing the adjacency matrix P according to Eq. 8 on the NCP unit, so T_{β_eff} = T_NCP. Although the computation is complicated, the NCP is lightweight: its computing cost per epoch is comparable to the training time per epoch (see Appendix Fig. G). Let T_{β_eff} = c × T_train; if c > 1, T_{β_eff} is higher than T_train, and vice versa. Considering the required epochs, our approach needs k observations and takes T_ours = k × T_{β_eff}. To obtain the ground-truth final accuracy by running K epochs, it takes T_full = K × T_train. If T_full > T_ours, our β_eff based prediction is cheaper than "just training longer": K × T_train − k × T_{β_eff} = (K − c × k) × T_train > 0, saving K − c × k training epochs. We performed a running time analysis of the two tasks with 4× NVIDIA Tesla V100 SXM2 32GB, and visualize the related times in Appendix Fig. G. On average, c = T_{β_eff}/T_train ≈ 1.3, i.e., computing β_eff takes 1.3 times the training time per epoch. But the effort pays off: since we can predict the final accuracy by observing only k = 10 of K = 100 full training epochs, T_ours is only 13% of T_full. When the observations are used for learning curve prediction, the heuristics LSV and BSV directly take one observation (last or best) as the predicted value, so they are computationally cheap but have suboptimal model ranking performance. Comparatively, BGRN and CL are more time-consuming because both require training a predictor with a set of full learning curves from other models. Our approach also estimates a predictor, but does not need any external learning curves. Here we assume that each model is observed for only k = 5 epochs, and conduct a running time analysis of these approaches for learning curve prediction, including estimating a predictor. As shown in Appendix Table G.2, our approach applies Bayesian ridge regression to efficiently estimate the predictor I = h(β_eff; θ), taking a comparable time to BGRN and significantly less than CL, while performing best in model ranking. In contrast, the most expensive approach, CL, does not perform well, and is sometimes even the worst.
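The break-even arithmetic above can be spelled out in a few lines; the numbers are the ones reported in this section.

```python
def prediction_cost_ratio(K=100, k=10, c=1.3):
    """Ratio T_ours / T_full = (k * c * T_train) / (K * T_train)."""
    return (c * k) / K

print(prediction_cost_ratio())  # 0.13 -> our cost is 13% of full training
print(100 - 1.3 * 10)           # 87 training-epoch equivalents saved
```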
We present a new perspective on NN model selection by directly exploring the dynamical evolution of synaptic connections during NN training. Our framework reformulates the SGD-based NN training dynamics as an edge dynamics B to capture the mutual interaction and dependency of synaptic connections. Accordingly, a networked system is built by converting an NN G_A to a line graph G_B with the governing dynamics B, which induces a definition of the link weights in G_B. Moreover, a topological property of G_B named neural capacitance β_eff is developed and shown to be an effective metric for predicting the ranking of a set of pre-trained models based on early training results. There are several important directions that we intend to explore in the future, including (i) simplifying the adjacency matrix P that captures the dependency and mutual interaction between synaptic connections, e.g., approximating gradients using local information (Jaderberg et al., 2017); (ii) extending the proposed framework to NAS benchmarks (Ying et al., 2019; Dong & Yang, 2020; Dong et al., 2021; Zela et al., 2020) to select the best subnetwork; and (iii) designing an efficient algorithm to directly optimize NN architectures based on β_eff.

Let G_A be an MLP. To understand the learning mechanism, we take a sample x with label y and go through the entire training procedure, including a forward pass (FP) and a backward pass (BP). For convenience, we write z^(0) = x for the inputs, let z^(1) and z^(L−1) be the activations of the first and the last hidden layers, respectively, and let z^(L) be the outputs. The MLP is a parameterized model ŷ = z^(L) = F_w(x) with w = (W^(1), W^(2), . . . , W^(L)), where W^(ℓ) is the weight matrix of the synaptic connections from layer ℓ−1 to layer ℓ, and 1 ≤ ℓ ≤ L. Suppose there are n_ℓ neurons on layer ℓ; then W^(ℓ) has size n_ℓ × n_{ℓ−1}. The inputs x are fed into the MLP, with a forward pass from layer 1 up through layers L−1 and L. Each neuron receives a cumulative input signal from the previous layer and sends an activated signal to the downstream layer. Let σ_ℓ be the activation function of layer ℓ and a^(ℓ)_i the pre-activation value of neuron i on layer ℓ; we have a^(ℓ) = W^(ℓ) z^(ℓ−1) and z^(ℓ) = σ_ℓ(a^(ℓ)), and z^(L) is a probability distribution over n_L classes, i.e., 1^T z^(L) = 1. With the predictions z^(L) and the ground truth y, we can calculate the prediction error C(z^(L), y), which is often a cross-entropy loss, i.e., C(z^(L), y) = −Σ_i y_i log z^(L)_i. To minimize C, BP is applied, and the weights w are updated backward, from the output layer up to the first hidden layer. Now we derive the gradients for a close examination. On layer L, with softmax outputs and the cross-entropy loss, the derivatives of C w.r.t. z^(L) and W^(L) yield the output error δ^(L) = z^(L) − y and the gradient ∇_{W^(L)} C = δ^(L) (z^(L−1))^T. On layer ℓ, where 1 ≤ ℓ ≤ L − 2, the errors propagate recursively, and the gradients can be written in a dense form:

$$\delta^{(\ell)} = \sigma'_\ell \odot \big( (W^{(\ell+1)})^\top \delta^{(\ell+1)} \big), \qquad \nabla_{W^{(\ell)}} C = \delta^{(\ell)} (z^{(\ell-1)})^\top.$$

We examine three hidden layers {ℓ−1, ℓ, ℓ+1} of G_A and three neurons {j, i, k} on these layers. Let w^(ℓ+1)_ki connect neuron k on layer ℓ+1 to neuron i on layer ℓ, let w^(ℓ)_ij be the synaptic connection weight between neuron i on layer ℓ and neuron j on layer ℓ−1, and let w^(ℓ−1)_jm connect neuron j on layer ℓ−1 and neuron m on layer ℓ−2. Now we take a close look at ∂C/∂w^(ℓ)_ij. According to the chain rule, we have

$$\frac{\partial C}{\partial w^{(\ell)}_{ij}} = \delta^{(\ell)}_i z^{(\ell-1)}_j, \qquad \delta^{(\ell)}_i = \sigma'_\ell(a^{(\ell)}_i) \sum_k w^{(\ell+1)}_{ki}\, \delta^{(\ell+1)}_k.$$

The gradient term δ^(ℓ+1)_k couples w^(ℓ)_ij to all downstream weights. To ease the analysis, we simplify it with a numerical value, or a synthetic one with no synaptic connection weight included.
Therefore, the summation term can be viewed as a simple linear function of all the synaptic connection weights w^(ℓ+1)_ki associated with neuron i on layer ℓ, and the associated coefficients give the entries of P that touch w^(ℓ)_ij. We are thus able to calculate the in-degree and out-degree of w^(ℓ)_ij. There are several exceptions, involving the first hidden (ℓ = 1), the last hidden (ℓ = L − 1) and the output (ℓ = L) layers. For the output layer, because σ_L is a softmax, no explicit linear relation regarding w^(L)_ij can be built. This implies that there are no well-defined in-bound connections to w^(L)_ij, i.e., δ^in(w^(L)_ij) = 0; however, we can still build the connections from w^(L)_ij to the weights below it. From the perspective of w^(L−1)_ij, the softmax σ_L(·) makes the output values sum to one, i.e., Σ_k y_k = 1, and δ^in(w^(L−1)_ij) = 0. Now we examine the first hidden layer. Similar to the output layer, there are no well-defined out-bound connections for w^(1)_ij, i.e., δ^out(w^(1)_ij) = 0. Setting ℓ = 1 in Eq. 13, we get the in-degree of w^(1)_ij.

Based on our definition of the weights of G_B, when the number of layers is small, it is trivial that β_eff = 0. To get a non-trivial β_eff, we identify the minimum number of hidden layers in G_A. First, we examine a G_A with one hidden layer, i.e., L = 2. Since the degrees sum up to zero, β_eff = 0, regardless of how many hidden neurons G_A has. If G_A has only two hidden layers, i.e., L = 3, the total degree may be non-zero, but β_eff = 0 always holds. Therefore, the minimum number of hidden layers required for a well-defined β_eff is three, i.e., L ≥ 4. Summing the in-degrees and the out-degrees over all weights, the definitions of in-degree and out-degree ensure that 1^T δ^in = 1^T δ^out must hold. With the fact that σ'^2_ℓ = σ'_ℓ for ReLU, according to Eq. 3, we obtain the explicit β_eff of an MLP (Eq. 10).

The right-hand side (RHS) of Eq. 6 is a function of W^(ℓ+1), denoted as F(W^(ℓ+1)). Here we derive the strength of the impact from W^(ℓ+1) and the other weights W^(−ℓ) = (W^(0), W^(1), . . . , W^(ℓ), W^(ℓ+2), . . . , W^(L)) for building the edge dynamics. Let W = (W^(ℓ+1), W^(−ℓ)) and F(W) = dW^(ℓ)/dt. We denote by Ŵ^(−ℓ) the current states of W^(−ℓ), by W*^(ℓ+1) an equilibrium point, and W* = (W*^(ℓ+1), Ŵ^(−ℓ)). According to the Taylor expansion, we linearize F at W* and have

$$F(W) \approx F(W^*) + \frac{\partial F(W^*)}{\partial W^{(\ell+1)}}\big(W^{(\ell+1)} - W^{*(\ell+1)}\big) + \frac{\partial F(W^*)}{\partial W^{(-\ell)}}\big(W^{(-\ell)} - \hat{W}^{(-\ell)}\big).$$

The last term on the RHS cancels out when the realizations of W^(−ℓ) take the current states Ŵ^(−ℓ). The term ∂F(W*^(ℓ+1), Ŵ^(−ℓ))/∂W^(ℓ+1) = ∂²C(W*^(ℓ+1), Ŵ^(−ℓ))/∂W^(ℓ)∂W^(ℓ+1) effectively captures the interaction strengths between W^(ℓ) and W^(ℓ+1), because it measures how much F is affected by a unit perturbation of W^(ℓ+1). Usually, W*^(ℓ+1) is not available before W^(ℓ+1) is updated following the update of W^(ℓ), so we use the current states of W^(ℓ+1) instead. The system can be viewed as a realization of the general Eq. 1, with linear f(W^(ℓ)) = F(W*) and g(W^(ℓ), W^(ℓ+1)) = W*^(ℓ+1) − W^(ℓ+1). Now we immediately have the adjacency matrix P of G_B: the second-order gradient P^(ℓ,ℓ+1) = ∂²C/∂W^(ℓ)∂W^(ℓ+1) measures the interaction strength between W^(ℓ) and W^(ℓ+1), ∀ 1 ≤ ℓ ≤ L.
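The chain-rule expressions of Appendix A can be verified numerically. The following NumPy sketch implements a tiny two-layer MLP with softmax cross-entropy (sizes and data are arbitrary) and checks the analytic gradients δ^(2) (z^(1))^T and δ^(1) x^T against finite differences.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda v: np.maximum(v, 0.0)

def forward(W1, W2, x, y):
    # two-layer MLP with softmax output and cross-entropy loss
    z1 = relu(W1 @ x)
    a2 = W2 @ z1
    z2 = np.exp(a2 - a2.max()); z2 /= z2.sum()
    return -np.log(z2[y]), z1, z2

W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(2, 5))
x, y = rng.normal(size=3), 1

C, z1, z2 = forward(W1, W2, x, y)
delta2 = z2 - np.eye(2)[y]                  # output error: z^{(L)} - y
grad_W2 = np.outer(delta2, z1)              # dC/dW^{(2)} = delta^{(2)} z^{(1)T}

delta1 = (W1 @ x > 0) * (W2.T @ delta2)     # delta^{(1)} via the recursion
grad_W1 = np.outer(delta1, x)               # dC/dW^{(1)} = delta^{(1)} x^T

# finite-difference checks of one entry in each layer
eps = 1e-6
W2p = W2.copy(); W2p[0, 2] += eps
assert abs((forward(W1, W2p, x, y)[0] - C) / eps - grad_W2[0, 2]) < 1e-4
W1p = W1.copy(); W1p[1, 0] += eps
assert abs((forward(W1p, W2, x, y)[0] - C) / eps - grad_W1[1, 0]) < 1e-4
```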
Consider an MLP, and assume that the activation function σ_ℓ is ReLU for every ℓ < L. When G_A converges, ∇_{W^(ℓ)} C vanishes for all ℓ, and hence β_eff = 0.

Ridge regression introduces an ℓ2-regularization to linear regression, and solves the problem

$$\arg\min_{\theta}\; \|y - X\theta\|_2^2 + \lambda \|\theta\|_2^2, \qquad (17)$$

where X ∈ R^{n×d}, y ∈ R^n, θ ∈ R^d is the associated set of coefficients, and the hyper-parameter λ > 0 controls the impact of the penalty term ||θ||²₂. Bayesian ridge regression introduces uninformative priors over the hyper-parameters of the model, and estimates a probabilistic model of the problem in Eq. 17. The ordinary least squares method posits the conditional distribution of y to be Gaussian, i.e., p(y|X, θ) = N(y|Xθ, σ²I_n), where σ > 0 is a hyper-parameter to be tuned and I_n is the n×n identity matrix. Moreover, we assume a spherical Gaussian prior over θ, i.e., p(θ) = N(θ|0, τ²I_d), where τ > 0 is another hyper-parameter to be estimated from the data at hand and I_d is the d×d identity matrix. According to Bayes' theorem, p(θ|X, y) ∝ p(θ)p(y|X, θ), and the estimates of the model are made by maximizing the posterior distribution p(θ|X, y), i.e.,

$$\arg\max_{\theta}\; \log p(\theta|X, y) = \arg\max_{\theta}\; \log \mathcal{N}(y|X\theta, \sigma^2 I_n) + \log \mathcal{N}(\theta|0, \tau^2 I_d),$$

which is a maximum-a-posteriori (MAP) estimation of the ridge regression with λ = σ²/τ². All of θ, λ and τ are estimated jointly during the fit of the model, and σ = τ√λ. To estimate I = h(β_eff; θ), we use scikit-learn, which implements the algorithm described in Tipping (2001), updating the regularization parameters λ and τ according to MacKay (1992).

Figure F.4: Predictions of the validation accuracy of pre-trained models on all five datasets based on β_eff vs. the true test accuracy of these models after fine-tuning for T = 50 epochs. Spearman's rank correlation ρ is used to quantify the performance in model selection. Each shape is associated with one type of pre-trained model; distinct models of the same type are marked in different colors. Note that each panel includes AlexNet in computing ρ.

Figure G: Running time of computing β_eff versus the training time per epoch, over the pre-trained models and five datasets discussed in the main text. Each data point is associated with one pre-trained model on one dataset. The relative cost of our approach in computing β_eff with respect to training more epochs can be measured by c = T_{β_eff}/T_train; on average, c ≈ 1.3 (the slope of the pink line).

We summarize the main idea of the mean-field approach developed by Gao et al. (2016), and show how Eq. 3 is obtained (Jiang et al., 2020a; b). Based on the notation described in Section 3, we consider a vertex i and the interaction term Σ_j P_ij g(x_i, x_j) in Eq. 1, where P_ij is the influence j has on i. Similarly, i influences j with a weight P_ji. We define the in-degree δ^in_i = Σ_j P_ij and the out-degree δ^out_i = Σ_j P_ji. The interaction term can be rewritten as

$$\sum_j P_{ij}\, g(x_i, x_j) = \delta^{\rm in}_i \cdot \frac{\sum_j P_{ij}\, g(x_i, x_j)}{\sum_j P_{ij}}.$$

Here the in-degree δ^in_i captures the idiosyncratic part, and the average of g(·, ·) captures the network effect. The mean-field approximation replaces local averaging with global averaging, which approximates the network impact on a vertex as nearly homogeneous. Specifically, we get

$$\frac{\sum_j P_{ij}\, g(x_i, x_j)}{\sum_j P_{ij}} \approx \mathcal{L}_P[g(x_i, x)],$$

where the vector g(x_i, x) has jth component g(x_i, x_j), and the linear operator L_P (Eq. 2) computes a weighted average of the entries of its argument. The mean-field approximation gives

$$\dot{x}_i = f(x_i) + \delta^{\rm in}_i\, \mathcal{L}_P[g(x_i, x)].$$

In the first-order linear approximation, we can take the L_P-average inside g: the average of the external interactions is approximately the interaction with their average, i.e., L_P[g(x_i, x)] ≈ g(x_i, L_P(x)), and

$$\dot{x}_i = f(x_i) + \delta^{\rm in}_i\, g(x_i, \mathcal{L}_P(x)), \qquad (22)$$

where L_P(x) is a global state. Let x_av ≜ L_P(x).
Applying L_P to both sides of Eq. 22 gives

$$\dot{x}_{\rm av} = \mathcal{L}_P[f(x)] + \mathcal{L}_P[\delta^{\rm in}\, g(x, x_{\rm av})].$$

According to the extensive discussion and tests in Gao et al. (2016), the in-degrees δ^in and the interaction with the external x_av are roughly uncorrelated, so the L_P-average of the product is roughly the product of the L_P-averages: L_P[δ^in g(x, x_av)] ≈ L_P(δ^in) L_P[g(x, x_av)]. Using the first-order linear approximation, we take the L_P-average inside f and g:

$$\dot{x}_{\rm av} = f(\mathcal{L}_P(x)) + \mathcal{L}_P(\delta^{\rm in})\, g(\mathcal{L}_P(x), x_{\rm av}).$$

Therefore, we have

$$\dot{x}_{\rm av} = f(x_{\rm av}) + \beta_{\rm eff}\, g(x_{\rm av}, x_{\rm av}),$$

where β_eff = L_P(δ^in) is the resilience metric, and its steady state is the effective network impact x_eff, satisfying ẋ_eff = f(x_eff) + β_eff g(x_eff, x_eff) = 0.

Our framework is built on several different techniques, and the related content is dispersed across different sections. Here we briefly summarize the core idea of this paper and show how these sections are organized; see Fig. I.7 for a flowchart of our core procedure. We view NN training as a dynamical system, and directly model the evolution of the trainable weights in SGD-based training as a set of differential equations (Section 3), characterized by the general dynamics in Eq. 1. Usually, it is convenient to study the dynamics of agents (trainable weights in our case) on a regular network, where each node represents an agent in the dynamical system and the interactions of agents are governed by Eq. 1. Many powerful techniques have been developed in network science and dynamical systems, e.g., the universal metric β_eff developed by Gao et al. (2016) to quantify and categorize various types of networks, including biological neural networks (Shu et al., 2021). Because of the generality of the metric, we analyze how it behaves on artificial neural networks, which are designed to mimic their biological counterparts for general intelligence. Therefore, an analogous system of the trainable weights is set up in our framework under the context of the general dynamics. To this end, we build a line graph for the trainable weights (Fig. 1a and Section 4.1) and "rewrite" (Section 4.2 and Appendix C) the training dynamics in the form of Eq. 1, which includes a self-driving force f(·), an external driving force g(·, ·) and an adjacency matrix P (Eqs. 8 and 9).

Figure I.7: A flowchart of the core procedure of our framework.

The reformulated training dynamics yields a simple yet powerful property: it is proved that as the neural network converges, β_eff approaches zero (Theorem 1 in Section 4.3, also one of our primary contributions). As shown in Fig. 1(c) and Section 4.3, we exploit this property to predict the final accuracy of a neural network model from a few observations during the early phase of training, and apply it to select among pre-trained models (Algorithm 1 in Section 4.4). Generally speaking, the metric β_eff should be calculated for the entire neural network. However, because many state-of-the-art neural network models have a large number of trainable weights, it would be prohibitive to compute the associated β_eff if all layers were considered. We make a compromise, and estimate the β_eff of the entire network using the β_eff of the NCP unit (i.e., a part of the entire network; see the second-to-last sentence of Section 4.3). Our empirical experiments (Section 5) confirm that this simplified, lightweight version of β_eff is still effective in predicting the final accuracy of the entire network.
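To illustrate the reduction, the toy simulation below compares the full coupled system of Eq. 1 with the one-dimensional β_eff equation; the self-dynamics f(x) = −x and the saturating coupling g(x_i, x_j) = x_j/(1 + x_j) are assumed for the demo and are not the paper's edge dynamics.

```python
import numpy as np
from scipy.integrate import odeint

rng = np.random.default_rng(3)
n = 50
P = rng.random((n, n)) * (rng.random((n, n)) < 0.2)  # sparse weighted network

f = lambda x: -x                       # assumed self-dynamics
g = lambda xi, xj: xj / (1.0 + xj)     # assumed saturating coupling

def full(x, t):
    # full coupled system: dx_i/dt = f(x_i) + sum_j P_ij g(x_i, x_j)
    return f(x) + (P * g(x[:, None], x[None, :])).sum(axis=1)

one = np.ones(n)
lp = lambda z: one @ P @ z / (one @ P @ one)   # operator L_P (Eq. 2)
beta = lp(P @ one)                             # beta_eff = L_P(delta_in)

def reduced(x, t):
    # 1-D mean-field equation: dx_av/dt = f(x_av) + beta_eff g(x_av, x_av)
    return f(x) + beta * g(x, x)

t = np.linspace(0, 20, 200)
x_full = odeint(full, rng.random(n) + 1.0, t)
x_red = odeint(reduced, [1.5], t)
print(lp(x_full[-1]), x_red[-1, 0])    # L_P-average vs. reduced steady state
```

The L_P-average of the full system's steady state and the fixed point of the reduced equation should be close whenever the uncorrelatedness assumption above roughly holds.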
The metric β_eff developed by Gao et al. (2016) is universal for characterizing different types of networks. Although our framework utilizes this metric, our application to artificial neural network training dynamics and the related theoretical results, as specified by Theorem 1, are novel. Specifically, the metric is applied to study NN training (Section 3) and to predict the final accuracy of an NN from a few observations during the early phase of training (Fig. 1c). However, β_eff relies on the adjacency matrix P of G_B (Eq. 3). To derive P, we resort to a reformulation (Section 4.2 and Appendix C) of the training dynamics in the same form as the general dynamics (Eq. 1). One issue in calculating β_eff is the complexity if the entire G_A is considered. As a resolution, we propose to use the lightweight β_eff of the NCP unit, a part of G_A, to predict the performance of the entire network (Sections 4.3 & 4.4).

References

- Critical learning periods in deep networks
- Accelerating neural architecture search using performance prediction
- Theory III: Dynamics and generalization in deep networks
- How does topology influence gradient propagation and model performance of deep networks with DenseNet-type skip connections?
- Stochastic gradient descent tricks
- Model-free inference of direct network interactions from nonlinear collective dynamics
- Speeding up hyper-parameter optimization by extrapolation of learning curves using previous builds
- AntisymmetricRNN: A dynamical system view on recurrent neural networks
- Dynamical systems and neural networks
- Optimizing neural networks via Koopman operator theory
- Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves
- NAS-Bench-201: Extending the scope of reproducible neural architecture search
- NATS-Bench: Benchmarking NAS algorithms for architecture topology and size
- Phases of learning dynamics in artificial neural networks: in the absence or presence of mislabeled data
- Stabilizing the lottery ticket hypothesis
- The early phase of neural network training
- The elements of statistical learning
- Universal resilience patterns in complex networks
- Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup
- Gradient descent happens in a tiny subspace
- Neural Networks and Learning Machines
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification
- MobileNets: Efficient convolutional neural networks for mobile vision applications
- Batch Normalization: Accelerating deep network training by reducing internal covariate shift
- Decoupled neural interfaces using synthetic gradients
- Inferring degrees from incomplete networks and nonlinear dynamics
- True nonlinear dynamics from incomplete networks
- Fast Bayesian optimization of machine learning hyperparameters on large datasets
- Learning curve prediction with Bayesian neural networks
- Deep learning
- HW-NAS-Bench: Hardware-aware neural architecture search benchmark
- Backpropagation and the brain
- Bayesian interpolation
- A mean field view of the landscape of two-layer neural networks
- Complex systems: Network thinking
- Controlling edge dynamics in complex networks
- Theoretical issues in deep networks (Proceedings of the National Academy of Sciences)
- Complexity control by gradient descent in deep networks
- The resilience and vulnerability of human brain networks across the lifespan
- Very deep convolutional networks for large-scale image recognition
- Accelerating training in artificial neural networks with dynamic mode decomposition
- A network-based explanation of why most COVID-19 infection curves are linear
- An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis
- Sparse Bayesian learning and the relevance vector machine
- Plant-pollinator interactions: from specialization to generalization
- Theories of error back-propagation in the brain
- Learning to rank learning curves
- NAS-Bench-101: Towards reproducible neural architecture search
- NAS-Bench-1Shot1: Benchmarking and dissecting one-shot neural architecture search

Acknowledgments. This work was supported by the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons).