key: cord-0433316-9qnly3um authors: Wang, Chunnan; Wang, Hongzhi; Feng, Guocheng; Geng, Fei title: Multi-Objective Neural Architecture Search Based on Diverse Structures and Adaptive Recommendation date: 2020-07-06 journal: nan DOI: nan sha: 6a4e6930616f277b66558a484ae76e7224b39e2d doc_id: 433316 cord_uid: 9qnly3um

The search space of neural architecture search (NAS) for convolutional neural networks (CNNs) is huge. To reduce the search cost, most NAS algorithms fix the outer network-level structure and search only the repeatable cell structure. Such a fixed architecture performs well when enough cells and channels are used; however, when the architecture becomes more lightweight, its performance decreases significantly. To obtain better lightweight architectures, more flexible and diversified neural architectures are needed, and more efficient methods must be designed for the larger search space. Motivated by this, we propose the MoARR algorithm, which utilizes existing research results and historical information to quickly find architectures that are both lightweight and accurate. We use previously discovered high-performance cells to construct network architectures. This increases the diversity of network architectures while also reducing the search space of cell structure design. In addition, we design a novel multi-objective method to effectively analyze the historical evaluation information, so as to efficiently search for Pareto optimal architectures with high accuracy and a small number of parameters. Experimental results show that MoARR can achieve a powerful and lightweight model (with 1.9% error rate and 2.3M parameters) on CIFAR-10 in 6 GPU hours, which is better than the state of the art. The explored architecture is transferable to ImageNet and achieves 76.0% top-1 accuracy with 4.9M parameters.

MnasNet [28] pointed out that cell structure diversity is significant in resource-constrained CNN models, and studied more flexible CNN architectures, in which each block of cells is allowed to contain different structures and may be repeated a different number of times. It searched for the optimal setting of cell structures and cell numbers across blocks, and achieved good results. Such a solution breaks with the traditional inflexible structure, but it also has a defect: the search cost is too high. The search space of a single cell is already large, let alone that of several cells combined with the parameters of the outer-level structure. This huge search space brings MnasNet a considerable search cost.

To control the size of the search space and thus reduce the search cost, while at the same time exploring more flexible architectures for better models, i.e., Pareto optimal CNN models with high accuracy and few parameters, we put forward in this paper the idea of high-performance cell stacking (HCS): utilize high-performance cells discovered by existing NAS algorithms to construct flexible architectures, as shown in Figure 3, and search for the optimal cell stacking scheme to obtain better lightweight CNNs. Introducing existing high-performance cells, on the one hand, guarantees the effectiveness of the components, effectively reduces the search cost caused by cell design, and greatly shrinks the search space by avoiding invalid architectures, thus ensuring search efficiency; on the other hand, it increases the cell diversity as well as the flexibility of CNN architectures.
Our HCS-based search space makes full use of existing research results, and thus makes it possible to explore more flexible CNN architectures efficiently, which is superior to existing search spaces. In addition, in order to efficiently find Pareto optimal architectures that are lightweight and accurate in our newly designed search space, we design a multi-objective optimization (MOO) algorithm, called Multi-Objective Optimization based on Adaptive Reverse Recommendation (MoARR). The idea of MoARR is to avoid selecting worse architectures by effectively analyzing the historical evaluation information, thus reducing the evaluation cost and accelerating the optimization. More specifically, MoARR utilizes historical information to study the potential relationship between the parameter quantity, the accuracy and the architecture encoding, and thereby adaptively learns the reverse recommendation model (RRModel), which is capable of selecting the most suitable architecture code according to the target performance. MoARR then recommends better architectures to be evaluated under the guidance of RRModel, i.e., it feeds higher accuracy and smaller parameter numbers into RRModel to obtain better architectures. As the number of evaluated architectures increases, RRModel becomes more reliable, and the architectures it recommends approach the Pareto optimality. Using RRModel, MoARR can optimize architectures in a targeted way, and thus greatly reduce useless architecture evaluations.

Compared with existing MOO approaches, MoARR is more suitable for MOO NAS problems, where architecture evaluations are expensive and time-consuming. More specifically, the existing approaches for seeking the Pareto-optimal front can be classified into two categories: approaches based on mathematical programming [26] and those based on genetic algorithms [7, 10, 24, 9]. The first class of methods cannot cope with our black-box MOO NAS problem, where the expressions and gradient information of the two optimization objectives are unknown. The genetic methods can deal with the black-box problem, but they may evaluate many useless architectures, due to the uncertainty introduced by their many random operations and their neglect of the valuable rules contained in the historical evaluation information. They may require many samples and generations to obtain good results, which makes them ill-suited to expensive MOO NAS problems.

Figure 2: Overall framework of MoARR.

To lay out our approach, we first give the specific definition of our research objective (Section 2.1) and define a new NAS search space (Section 2.2). We then introduce MoARR, which views NAS as a multi-objective optimization task and makes full use of historical evaluation information to obtain high-performance lightweight CNN models (Section 2.3). In order to accelerate the evaluation and reduce the computational cost, we also design an acceleration strategy that uses a small number of epochs and a few samples to quickly obtain accuracy scores (Section 2.4). Figure 2 shows our overall framework.

In this paper, we aim to increase the flexibility and diversity of CNN architectures, so as to obtain lightweight architectures with higher accuracy. Formally, our search target is defined as follows:

max_{x∈S} ACC(x),  min_{x∈S} PAR(x),  subject to PAR(x) ≤ P_max,

where S denotes all CNN architecture codes in our new search space, which is described in Section 2.2, ACC(x) denotes the accuracy score of x, PAR(x) denotes its number of parameters, and P_max is the upper limit on the parameter amount.
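To make the overall loop of Figure 2 concrete, the following Python sketch shows one way the pieces could fit together. It is our own illustration, not the authors' code: every helper (random sampling, evaluation, model fitting, superior-score sampling) is a placeholder supplied by the caller, and the names are ours.

```python
from typing import Callable, List, Tuple

Code = tuple                                # an architecture encoding from S (format in Section 2.2)
Perf = Tuple[float, float]                  # (accuracy, parameter count)
History = List[Tuple[Code, float, float]]   # HEI: evaluated <code, acc, par> triples

def moarr_loop(sample_random: Callable[[int], List[Code]],
               evaluate: Callable[[Code], Perf],
               fit_fe_model: Callable[[History], Callable[[Code], Perf]],
               fit_rr_model: Callable[[History, Callable[[Code], Perf]], Callable[[Perf], Code]],
               sample_superior: Callable[[History, int], List[Perf]],
               iterations: int = 10, batch: int = 50) -> History:
    """Hypothetical outer loop of MoARR: evaluate, refit FEModel/RRModel, recommend, repeat."""
    history: History = [(c, *evaluate(c)) for c in sample_random(batch)]  # bootstrap HEI
    for _ in range(iterations):
        fe = fit_fe_model(history)                        # auxiliary model: code -> (acc, par)
        rr = fit_rr_model(history, fe)                    # RRModel: (acc, par) -> code
        for target in sample_superior(history, batch):    # scores drawn from Inputs_Ideal
            code = rr(target)                             # reverse recommendation
            history.append((code, *evaluate(code)))
    return history                                        # its Pareto boundary is the final result
```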
This is a multi-objective optimization task, and our goal is to obtain architectures that provide the best accuracy/parameter-amount trade-off.

Structural flexibility and cell diversity are the two key points in the design of our search space. To achieve structural flexibility, we make the number of cells and channels in each stage adjustable; in this way, we can obtain architectures with diversified width and depth. As for cell diversity, we allow cells in different stages to have different structures, and take the existing high-performance cell structures discovered by previous NAS work as the available options. Details are as follows.

We provide a general network architecture for our search space, as shown in Figure 3 (general CIFAR network architecture). It consists of 5 stages. Stage 1 extracts common low-level features, stages 2-4 down-sample the spatial resolution of the input tensor with a stride of 2, and stage 5 produces the final prediction with a global pooling layer and a fully connected layer. Previous NAS approaches generally choose to use 2 reduction cells in CNN architectures, whereas some [23] use 3. In pursuit of a more general search space, we use Validity_RC4 ∈ {True, False} to decide whether the 3rd reduction cell is used. The initial channel number is denoted as C_init, and a per-stage ratio represents the growth of the width compared to the previous stage. The name of the i-th normal cell in stage s is denoted as NC_i^s, the name of the reduction cell used in stage s is denoted as RC^s, and the type of global pooling used in stage 5 is denoted as GP. The options of NC_i^s, RC^s and GP are shown in Table 1. Therefore, an architecture can be encoded as shown in Figure 3. The set of all possible codes is denoted as S and is referred to as our search space.

  Source architecture   Normal Cell Symbol   Reduction Cell Symbol
  DARTS (1st) [18]      Darts_V1_NC          DARTS_V1_RC
  DARTS (2nd) [18]      Darts_V2_NC          DARTS_V2_RC
  NasNet-A [35]         NasNet_NC            NasNet_RC
  AmoebaNet-A [25]      AmoebaNet_NC         AmoebaNet_RC
  ENAS [23]             ENAS_NC              ENAS_RC
  RENAS [6]             RENAS_NC             RENAS_RC
  GDAS [12]             GDAS_V1_NC           GDAS_V1_RC
  GDAS (FRC) [12]       GDAS_V2_NC           GDAS_V2_RC
  ASAP [22]             ASAP_NC              ASAP_RC
  ShuffleNet [33]       ShuffleNet_NC        ShuffleNet_RC

  Global Pooling Definition            Global Pooling Symbol
  Global average pooling               Avg_GP
  Global max pooling                   Max_GP
  The average of Avg_GP and Max_GP     AvgMax_GP

Table 1: Options of NC_i^s, RC^s and GP in network architectures. We extract 10 normal cell structures and 10 reduction cell structures from 10 high-performance CNN architectures discovered by previous work as the components, and we consider 3 classic global pooling operations for the final stage.

Let x, y ∈ S denote two elements of S. If ACC(x) ≤ ACC(y) and PAR(x) ≥ PAR(y), with at least one of the inequalities strict, we say that architecture y Pareto dominates x (y is better than x), denoted as x ≺ y. The elements of a set Ŝ ⊆ S that are not Pareto dominated by any other element of Ŝ form the Pareto boundary of Ŝ, denoted as B(Ŝ) = {x ∈ Ŝ | ∄ y ∈ Ŝ, x ≺ y}. The Pareto optimal solutions of our multi-objective NAS problem are then given by B(S). In MoARR, our target is to quickly optimize the elements of the Pareto boundary B(Ŝ), where Ŝ ⊆ S denotes the set of evaluated architectures, and to finally obtain B(S). More specifically, we aim to select the best possible architectures to evaluate in each iteration, avoiding worse architectures as much as possible, thus accelerating the optimization process and reducing the evaluation cost.
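To make the dominance relation and B(Ŝ) concrete, here is a small self-contained helper (our own illustration, not the authors' code), with each evaluated architecture represented as a (code, accuracy, n_params) triple:

```python
def dominates(y, x):
    """True if y Pareto dominates x: y is no less accurate and no larger, and strictly
    better in at least one of the two objectives. y and x are (accuracy, n_params) pairs."""
    acc_y, par_y = y
    acc_x, par_x = x
    return acc_y >= acc_x and par_y <= par_x and (acc_y > acc_x or par_y < par_x)

def pareto_boundary(evaluated):
    """B(S_hat): the evaluated (code, accuracy, n_params) triples not dominated by any other."""
    return [e for e in evaluated
            if not any(dominates((o[1], o[2]), (e[1], e[2]))
                       for o in evaluated if o is not e)]

# Example: the second entry dominates the first (same size, higher accuracy).
print(pareto_boundary([("a", 0.95, 2.3e6), ("b", 0.96, 2.3e6), ("c", 0.94, 1.8e6)]))
```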
To achieve this goal, we put forward Adaptive Reverse Recommendation (ARR), an architecture selection strategy that utilizes the historical evaluation information of Ŝ for effective and targeted architecture recommendation, i.e., recommending the most suitable architecture code according to the performance demands. Such a performance-oriented selection strategy can greatly reduce useless architecture evaluations and improve the quality of the selected architectures by setting superior performance scores, which coincides with the goal of MoARR. Besides, ARR avoids the defects of the genetic MOO methods mentioned in Section 1, which makes MoARR more suitable for the expensive NAS problem. We discuss ARR further as follows.

ARR. The model that maps a performance pair (acc, par) to a suitable architecture code x ∈ argmin_{x∈S} (|ACC(x) − acc|² + |PAR(x) − par|²), i.e., a code whose performance scores are closest to the given pair, is called the reverse recommendation model (RRModel) in ARR. The core idea of ARR is to make full use of the historical evaluation information HEI = { <x, ACC(x), PAR(x)> | x ∈ Ŝ } to adaptively build an effective RRModel, and then to utilize RRModel to select superior architectures SA ⊆ {x ∈ S \ Ŝ | ∄ y ∈ Ŝ, x ≺ y} directly by setting better performance values.

We note that the construction of an effective RRModel is the key point of ARR. A straightforward solution is to utilize HEI to construct performance-to-code training data TD_{p-c} = { <(acc, par), x> | <x, acc, par> ∈ HEI }, where the performance scores are the input and the corresponding codes are the target output, and then use TD_{p-c} to train a Multi-Layer Perceptron (MLP) as RRModel. Note, however, that TD_{p-c} may contain different codes with identical or very similar accuracy scores and parameter amounts, and the contradictory outputs may mislead the loss function and thus make RRModel less effective. To eliminate the influence of contradictory values, this solution would preserve only one code for each performance pair in TD_{p-c}. However, such an operation results in two defects: (1) information loss, as the valuable information contained in the deleted records is underutilized; (2) difficulty of selection, as it is unknown which code should be preserved to achieve the best recommendation effect, and many trials would be needed, which is time-consuming.

To avoid these two defects, we propose an auxiliary-model-based loss function for RRModel, which helps RRModel adaptively learn the most suitable output values by making full use of all the historical information in HEI. Suppose a forward evaluation model (FEModel) is capable of mapping a given architecture code to its performance pair, i.e., (accuracy score, parameter amount); note that the mappings of FEModel and RRModel go in opposite directions. Then the new loss function of RRModel is defined as follows:

Loss(x) = Σ_{i=1}^{n} || FEModel(y_i) − x_i ||²,  where y_i = RRModel(x_i),    (2)

where x = {x_1, ..., x_n} is a set of accuracy-parameter performance scores, and y = RRModel(x) denotes the architecture codes recommended by RRModel. Equation 2 measures the differences between the target performance x and the performance of the codes recommended by RRModel. It helps RRModel automatically determine suitable outputs under the guidance of the auxiliary model FEModel. More specifically, we can feed enough accuracy-parameter performance scores into RRModel without specifying target outputs, and RRModel adjusts its outputs adaptively according to the performance feedback provided by FEModel, thus achieving reasonable recommendations.
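As one concrete, deliberately simplified reading of this training scheme, the PyTorch sketch below treats the architecture code as a continuous vector of length CODE_DIM and fits RRModel against a frozen FEModel; the dimensions, layer sizes and optimizer settings are our assumptions, not the paper's.

```python
import torch
import torch.nn as nn

CODE_DIM = 16   # hypothetical length of a (relaxed) architecture encoding

# FEModel: code -> (accuracy, parameter count); assumed to be already fitted on TD_{c-p}.
fe_model = nn.Sequential(nn.Linear(CODE_DIM, 64), nn.ReLU(), nn.Linear(64, 2))
# RRModel: (accuracy, parameter count) -> code.
rr_model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, CODE_DIM))

def train_rr_model(target_scores: torch.Tensor, steps: int = 200, lr: float = 1e-3) -> None:
    """Fit RRModel so that FEModel(RRModel(x)) reproduces the target scores x
    (the auxiliary-model loss described above). FEModel stays frozen."""
    for p in fe_model.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(rr_model.parameters(), lr=lr)
    for _ in range(steps):
        codes = rr_model(target_scores)          # recommended codes for the target scores
        feedback = fe_model(codes)               # performance feedback from the auxiliary model
        loss = ((feedback - target_scores) ** 2).sum(dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

# Usage: normalised (accuracy, parameter) targets, e.g. high accuracy with few parameters.
train_rr_model(torch.tensor([[0.97, 0.10], [0.96, 0.05]]))
```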
As for FEModel, it is unknown, since neural architecture evaluation is a black box. However, we can utilize HEI to construct code-to-performance training data TD_{c-p} = { <x, (acc, par)> | <x, acc, par> ∈ HEI } and use TD_{c-p} to train an MLP that approximates FEModel. Note that, unlike TD_{p-c}, TD_{c-p} does not suffer from the contradiction problem. Therefore, with the new loss function, RRModel can be built automatically and effectively by making full use of all the historical data HEI, and the two defects above are avoided.

The next step is to use the obtained RRModel to select superior architectures. Since the target is to optimize B(Ŝ) = { x ∈ Ŝ | ∄ y ∈ Ŝ, x ≺ y }, we need to find architecture codes that are not Pareto dominated by the evaluated codes in Ŝ. Thus, we should feed more competitive performance scores into RRModel, i.e., performance scores with higher accuracy or lower parameter amount than the scores of B(Ŝ). We denote this set of performance scores as Inputs_Ideal: the scores that are not Pareto dominated by the score of any architecture in B(Ŝ). Figure 4 gives an example of Inputs_Ideal: supposing the marked points are the performance scores of the evaluated architectures, the shaded area is Inputs_Ideal. After obtaining Inputs_Ideal, we randomly sample some superior performance scores from it as the inputs of RRModel, and thus obtain superior architectures.

MoARR. Using ARR, we develop the MoARR algorithm, which deals with our multi-objective NAS problem effectively. Algorithm 1 is the pseudo code of MoARR. In MoARR, evaluating a CNN code is very time-consuming due to the huge training dataset and the large number of training epochs. In order to reduce the evaluation cost and thus speed up MoARR, we propose the fast evaluation strategy (FES) to quickly estimate the final validation accuracy of CNN architectures in S using only a few training epochs and a small part of the training data.

FES. The core idea of FES is to use the following three types of characteristic attributes of an architecture x ∈ S to predict ACC_Final(x), the validation accuracy obtained after x is fully trained on the whole training dataset:

• Model complexity, including the FLOPs and parameter amount of x;
• Structural attributes, including the density, layer number and reduction cell number of x, where Density(x) is the edge number divided by the node number in the DAG of x;
• Quick evaluation scores, including the top-1 accuracy, top-5 accuracy and loss value obtained by x after training for 12 epochs on 1% of the training data.

This attribute-based prediction method comes from the Easy Stop Strategy (ESS) [34], which has been successfully applied to CNN architectures stacked from many copies of a single discovered cell in order to reduce the evaluation cost. In FES, we make some adjustments to ESS so that it suits our more complex CNN architectures, which are stacked from diversified cell structures. Our adjustments fall into two categories, for the following reasons: (1) Replacing cell attributes with architecture-level ones. ESS uses the complexity and structural attributes of the single reused cell to represent those of the whole architecture, whereas FES must use the attributes of the whole architecture to achieve the same description. (2) Involving more attributes. Our architectures are more flexible, and we use less training data for the fast evaluation, so we need more complexity and structural attributes to distinguish different architectures, and more performance features to substitute for ACC_Early.
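As an illustration of how such an attribute-based predictor might be fitted (our own sketch, not the authors' implementation; the feature ordering and MLP size are assumptions), a small MLP regressor can be trained on the eight attributes listed above:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

FEATURE_NAMES = [                                # the 8 attributes listed above
    "flops", "n_params",                         # model complexity
    "density", "n_layers", "n_reduction_cells",  # structural attributes
    "top1_12ep", "top5_12ep", "loss_12ep",       # quick scores: 12 epochs on 1% of the data
]

def fit_accuracy_predictor(features: np.ndarray, final_accs: np.ndarray) -> MLPRegressor:
    """Fit an MLP regressor mapping the 8 attributes to ACC_Final.
    `features` has shape (n_samples, 8); `final_accs` holds fully-trained accuracies."""
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
    return model.fit(features, final_accs)
```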
To make this prediction method work for the architectures in our search space S, in FES we sample some architectures from S to study the relationship between these attributes and ACC_Final, and finally build an MLP regression model, denoted by Model_FES, to predict ACC_Final(x) from the above 8 attribute values of x ∈ S. In MoARR, we utilize the obtained Model_FES to efficiently estimate the ACC_Final of the candidate architectures recommended by the ARR selection strategy. With Model_FES, it takes only about 20 seconds to estimate the final accuracy of an architecture x ∈ S, which greatly reduces our evaluation cost and speeds up our algorithm.

In this section, we test MoARR on common image classification benchmarks and show its effectiveness compared to other state-of-the-art models (Section 3.1 to Section 3.3). We use the CIFAR-10 [15] dataset for the main search and evaluation phase, and perform transferability experiments on well-known benchmarks using the architecture found on CIFAR-10. In addition, we conduct an ablation study which examines the role of MoARR in discovering novel architectures (Section 3.4).

Using MoARR, we search on CIFAR-10 [15] for better lightweight CNN architectures. Considering that the CNN architectures discovered by existing NAS work generally have more than 2.5M parameters [12], we set the upper limit of the parameter amount, P_max in our search target (Section 2.1), to 2.5M. During the search phase, we select 50 architectures to evaluate in each iteration, and train each selected network for a fixed 12 epochs on CIFAR-10 using FES as described in Section 2.4. Following [34], we set the batch size to 256 and use the Adam optimizer with β1=0.9, β2=0.999, ε=10^-8. The initial learning rate is set to 0.001 and is reduced by a factor of 0.2 every 2 epochs. MoARR takes about six hours to complete the search phase on a single NVIDIA Tesla V100 GPU.

After the search phase, we extract four excellent lightweight architectures from the Pareto boundary B(Ŝ) obtained by MoARR: (1) the two architectures with the highest accuracy scores in { y ∈ B(Ŝ) | 2M ≤ PAR(y) ≤ 2.5M }, denoted MoARR-Small1 and MoARR-Small2; (2) the two architectures with the highest accuracy scores in { y ∈ B(Ŝ) | PAR(y) < 2M }, denoted MoARR-Tiny1 and MoARR-Tiny2. We then use these architectures (whose encodings are given in the supplementary material) to test the effectiveness of MoARR. We train the MoARR-Small and MoARR-Tiny networks for 600 epochs using a batch size of 96 and an SGD optimizer with Nesterov momentum and a weight decay of 3×10^-4. We start with a learning rate of 0.025 and reduce it to 0 with a cosine learning rate scheduler. For regularization we use cutout [11], scheduled drop-path [16], auxiliary towers [27] and random cropping. All the training parameters are the same as in DARTS [18]. We also report results obtained with the training scheme of [22], where 1500 epochs and more regularization methods are applied.

Our smallest network variant, MoARR-Tiny2, outperforms most previous models on CIFAR-10 while having far fewer parameters, i.e., it contains 33.3% to 94.4% fewer parameters than previous models. By considering more flexible and diversified structures, we discover CNN models with fewer parameters and higher accuracy using existing cell structures, which demonstrates the significance of cell diversity and structural flexibility. In addition, MoARR is the second fastest among the NAS methods listed in Table 2, next to GDAS [12].
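The optimizer settings described above translate directly into PyTorch; the snippet below is only a sketch (the SGD momentum value and the model argument are placeholders we assume, the remaining hyperparameters follow the text):

```python
import torch

def fes_search_optimizer(model):
    """Quick-evaluation (FES) setting: Adam, lr 0.001 reduced by a factor of 0.2 every 2 epochs."""
    opt = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=2, gamma=0.2)
    return opt, sched

def final_training_optimizer(model, epochs=600):
    """Final training: SGD with Nesterov momentum (0.9 assumed), weight decay 3e-4,
    cosine schedule from 0.025 down to 0 over 600 epochs."""
    opt = torch.optim.SGD(model.parameters(), lr=0.025, momentum=0.9,
                          weight_decay=3e-4, nesterov=True)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs, eta_min=0.0)
    return opt, sched
```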
Using the architecture found by MoARR on CIFAR-10, we perform transferability tests on 6 popular classification benchmarks.

ImageNet Results. Our ImageNet network is composed of two initial stem cells for downscaling and a new variant of the MoARR-Small1 architecture for feature extraction and image classification. Following previous work, in the new variant of MoARR-Small1 we set C_init to 184 and remove 1 normal cell from each stage (stages 1 to 3), so that the total number of network FLOPs is below 600M. Figure 3 in the supplementary material shows our ImageNet architecture. We train the network for 250 epochs with one cycle of the power cosine learning rate [13] and a Nesterov-momentum optimizer. Results are shown in Table 3 (transferability classification error on the ImageNet dataset). We can observe from Table 3 that MoARR's transferability results on ImageNet are highly competitive, outperforming all previous NAS models.

Additional Results. We further test MoARR's transferability on 5 smaller datasets: CIFAR-100 [15], Fashion-MNIST [31], SVHN [21], Freiburg [14] and CINIC10 [8]. We use the MoARR-Small1 architecture, with a training scheme similar to [22]. Table 4 shows the performance of our model compared to other NAS methods. On Fashion-MNIST, MoARR-Small1 surpasses the next best architecture by 0.04%, achieving the second highest reported score on Fashion-MNIST, second only to [18]. On CIFAR-100, Freiburg and CINIC10, MoARR-Small1 surpasses all the other 6 architectures, achieving the lowest test errors. Table 4: Transferability classification error on 5 datasets. Results marked with † are taken from [22].

In this part, we analyze the importance of MoARR. We examine whether MoARR is actually capable of finding good CNN architectures, or whether it is the design of our new search space that leads to MoARR's strong empirical performance.

Comparing to Guided Random Search. We uniformly sample a CNN code from our search space, build the corresponding CNN, and train this random model to convergence using the same settings as in Section 3.2. The random CNN model has 3.5M parameters and achieves a test error of 2.89% on CIFAR-10 (using the setting of [22] yields a 2.37% error rate), which is far worse than MoARR-Small1's 2.61% and 1.90%. This shows that our search space contains not only excellent lightweight architectures but also many architectures with poor performance; an efficient search strategy is therefore necessary for our MOO NAS problem. The ARR selection strategy designed in MoARR can recommend good CNN codes with fewer parameters by analyzing the known performance information, which is effective.

Comparing to Evolutionary Multi-Objective Search. In addition to random search, we compare with the classic evolutionary multi-objective method RVEA* [7]. We set the population size to 50 and use RVEA* instead of ARR to deal with our multi-objective NAS problem. Figure 5 reports the performance scores of the architectures evaluated by RVEA* and MoARR over five generations. We can observe that MoARR evaluates much fewer useless architectures and optimizes B(Ŝ) more quickly than RVEA*. Compared with the evolutionary method, our ARR selection strategy can recommend better architectures by utilizing the potential relations learned from the historical information.
Our optimization process is more efficient and thus reduces the evaluation cost, making it better suited to expensive multi-objective NAS problems, which coincides with the discussion in Section 1.

NAS is a popular and important research topic in deep learning, and many effective algorithms have been proposed to tackle it. The majority of them [12, 20, 6, 18, 23, 25] adopt the idea of micro search, which centers on learning cell structures and designs a neural architecture by stacking many copies of the discovered cells; the minority are macro search methods [23, 1, 3, 2], which directly discover entire neural networks. The former greatly reduce the computation cost but may miss some good architectures due to the inflexible network structure they use, while the latter consider more flexible structures but cannot find good architectures within a short time due to the huge search space. In this paper, we propose to construct more flexible network structures by reusing good cell structures discovered by previous work, and thus efficiently search a larger space for better CNN architectures. Our idea combines the merits of the two approaches and achieves better results.

More recently, with the increasing need to deploy high-quality deep neural networks on real-world devices, multiple objectives have been considered in NAS for real applications. Some works [28, 30, 4] tried to convert the multi-objective NAS task into a single-objective one and utilized existing single-objective search methods, such as reinforcement learning [35, 23], to deal with it. However, the weights in the single objective function are hard to determine; besides, the dimensional disunity of the multiple objectives may result in poor robustness of the single-objective optimization. In this paper, we design MoARR to directly optimize multiple objectives, thus avoiding these problems.

In this paper, we propose MoARR for finding good lightweight CNN architectures. We construct more flexible and diversified network architectures using existing cell structures, and adaptively learn a CNN recommendation model utilizing performance feedback to efficiently optimize architectures. Experimental results show that MoARR can discover more powerful and lightweight CNN models than the state-of-the-art methods, which demonstrates the importance of structural diversity and the effectiveness of our optimization method. Our cell reusing idea and our multi-objective NAS optimization method are applicable not only to CNNs but also to other kinds of neural networks. In future work, we will explore more varied kinds of network structures, such as GNNs and RNNs, and further improve the efficiency of MoARR.

We believe our work could benefit society in areas that require image classification. More specifically, it could help to quickly generate high-capacity models by utilizing existing SOTA models when a new problem emerges and the image datasets are fresh, for example, to quickly generate robust chest CT image classification models after the COVID-19 outbreak. Moreover, our lightweight models can be deployed on lightweight embedded devices, which makes the technique more accessible to the public. However, due to possible system failures (e.g., misclassification), there may be problems associated with public safety. For example, falsely classified products could enter the market and cause serious problems (e.g.,
unqualified medicine and agricultural products), and misclassified medical images may harm both patients and society. Such problems are hard to avoid due to the limited accuracy of current SOTA models as well as unavoidable data quality issues, and we strongly hold that any model should be carefully adjusted and exhaustively tested (with manual assistance where necessary) before being put into practice.

We compare MoARR with classic NAS algorithms (Section 3). Experimental results show that MoARR can find a powerful and lightweight model (with 1.9% error rate and 2.3M parameters) on CIFAR-10 in 6 GPU hours, which outperforms the state of the art. The explored network architecture is transferable to ImageNet and 5 additional datasets, and achieves good results, e.g., 76.0% top-1 accuracy with only 4.9M parameters on ImageNet.

References

[1] Designing neural network architectures using reinforcement learning
[2] SMASH: one-shot model architecture search through hypernetworks
[3] Efficient architecture search by network transformation
[4] Proxylessnas: Direct neural architecture search on target task and hardware
[5] DATA: differentiable architecture approximation
[6] RENAS: reinforced evolutionary neural architecture search
[7] A reference vector guided evolutionary algorithm for many-objective optimization
[8] CINIC-10 is not imagenet or CIFAR-10
[9] A fast and elitist multiobjective genetic algorithm: NSGA-II
[10] An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: solving problems with box constraints
[11] Improved regularization of convolutional neural networks with cutout
[12] Searching for a robust neural architecture in four GPU hours
[13] sharpdarts: Faster and more accurate differentiable architecture search
[14] The freiburg groceries dataset
[15] Learning multiple layers of features from tiny images
[16] Fractalnet: Ultra-deep neural networks without residuals
[17] Progressive neural architecture search
[18] DARTS: differentiable architecture search
[19] Neural architecture optimization
[20] XNAS: neural architecture search with expert advice
[21] Reading digits in natural images with unsupervised feature learning
[22] ASAP: architecture search, anneal and prune
[23] Efficient neural architecture search via parameter sharing
[24] Multi-objective evolutionary algorithms based on the summation of normalized objectives and diversified selection
[25] Regularized evolution for image classifier architecture search
[26] Multi-task learning as multi-objective optimization
[27] Going deeper with convolutions
[28] Platform-aware neural architecture search for mobile
[29] Rethinking model scaling for convolutional neural networks
[30] Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search
[31] Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms
[32] CARS: continuous evolution for efficient neural architecture search
[33] Shufflenet: An extremely efficient convolutional neural network for mobile devices
[34] Practical block-wise neural network architecture generation
[35] Learning transferable architectures for scalable image recognition