key: cord-0500706-qgxo1lcm
authors: Tang, Sheyang; Hosseini, Mahdi S.; Chen, Lina; Varma, Sonal; Rowsell, Corwyn; Damaskinos, Savvas; Plataniotis, Konstantinos N.; Wang, Zhou
title: Probeable DARTS with Application to Computational Pathology
date: 2021-08-16
journal: nan
DOI: nan
sha: 000cad8dfc9c3aa602688f1baafbd3a6f2e504ec
doc_id: 500706
cord_uid: qgxo1lcm

AI technology has made remarkable achievements in computational pathology (CPath), especially with the help of deep neural networks. However, the network performance is highly related to architecture design, which commonly requires human experts with domain knowledge. In this paper, we combat this challenge with the recent advance in neural architecture search (NAS) to find an optimal network for CPath applications. In particular, we use differentiable architecture search (DARTS) for its efficiency. We first adopt a probing metric to show that the original DARTS lacks proper hyperparameter tuning on the CIFAR dataset, and how the generalization issue can be addressed using an adaptive optimization strategy. We then apply our searching framework on CPath applications by searching for the optimum network architecture on a histological tissue type dataset (ADP). Results show that the searched network outperforms state-of-the-art networks in terms of prediction accuracy and computation complexity. We further conduct extensive experiments to demonstrate the transferability of the searched network to new CPath applications, the robustness against downscaled inputs, as well as the reliability of predictions.

Recent years have witnessed great advances in AI-based Computational Pathology (CPath) [22, 15]. Emerging AI techniques have enabled more accurate, efficient, and large-scale medical diagnoses [4]. In particular, Convolutional Neural Networks (CNNs) have been widely employed to extract meaningful information from medical images for various pathology applications, including disease diagnosis [5, 38] and medical image segmentation [27, 31]. Yet designing network architectures has long been a manual process that requires adequate domain knowledge. As a result, it has become common practice to transfer architectures from computer vision (CV) applications (such as ResNet [8] and GoogLeNet [32]) to other fields, including CPath [29, 34]. The key question is whether transferring architectures between the two domains is an efficient strategy. To answer it, we first demonstrate how CV and CPath datasets differ by comparing the CIFAR [19] and ADP [11] datasets. Besides the different data structures shown in Table 1, the nature of the images also differs, which makes CV datasets more complicated. First, the pixel resolution in CPath is fixed, corresponding to a fixed field of view (FOV) size. This uniformity arises because whole slide images are acquired by a scanner in a much more controlled environment in terms of both optics and illumination [11]. In contrast, the pixel resolution in CV varies across images because of differences in imaging setup and configuration: CV images are captured in natural scenes where the distance to the object varies widely. Examples from each imaging modality are shown in Figure 1(a), where the ship images are taken from a greater distance than the dog ones, resulting in larger pixel size and lower resolution.
Second, target objects in CV images occupy only part of the FOV; the rest is background that is irrelevant to the class label. Note that the diversity of the backgrounds in Figure 1(a) is very high. This is quite different in CPath, where the background is obtained from an empty area of the sample under uniform white light illumination [11], leading to more uniform and homogeneous images. This is illustrated in Figure 1(b), where the white part denotes the background. In light of this difference, we hypothesize that the simpler imaging modality in CPath translates to simpler network architectures than in CV. Network architectures should therefore be designed specifically for CPath applications.

Neural architecture search (NAS) has recently been proposed to automate the design of neural networks by searching for the optimal network structure on a given dataset. In many CV applications, NAS has outperformed state-of-the-art manually designed networks in terms of prediction accuracy and computation complexity [7]. In medical image analysis, it has been used to find suitable networks for various applications, such as image segmentation for Magnetic Resonance Imaging (MRI) [17, 36, 3], ultrasound imaging [35], and disease diagnosis from Computed Tomography (CT) scans [16, 9]. In pathology, however, NAS has not been fully explored: a general framework that can be easily extended to various CPath applications is still lacking.

In this work, we propose an architecture search platform based on differentiable architecture search (DARTS) [23]. We choose DARTS because it is gradient-based and thus much more computationally efficient than other searching strategies, including reinforcement learning [39] and evolutionary algorithms [26]. DARTS achieves this by relaxing the search space to be continuous and dividing the whole pipeline into a search phase and an evaluation phase. However, in CV applications, DARTS is reported to overfit, and the searched architecture does not generalize well in the evaluation phase [21, 37]. To combat these challenges, we first conduct searching on CIFAR [19] and compute a probing metric, stable rank [12], for each layer. In this way, we can better monitor the searching process and show that the overfitting issue comes from improper hyperparameter tuning. In addition, we use an adaptive optimizer, Adas [12], that automatically tunes the learning rate of each layer based on its probing metric, so that the generalization ability of the searched architecture is improved. We then apply this searching framework on ADP [11], which contains a wide and representative variety of histological tissue types, so that the searched architecture can generalize well across different CPath applications. The searched network outperforms state-of-the-art architectures in the speed-accuracy trade-off, which is crucial for real-time high-throughput CPath applications. We further conduct extensive experiments to show the transferability of the searched architecture to new CPath datasets, demonstrate its robustness against downscaled input images, and verify its superiority in extracting label-pertinent features. Our main contributions are listed below:

• We use a probing metric to show that the existing DARTS framework lacks proper hyperparameter tuning, and use an adaptive optimizer to improve the generalization ability of the searched model;

• We apply the proposed searching platform on CPath applications and show the superiority of the searched model in prediction accuracy and computation complexity;

• We demonstrate the transferability of the searched architecture in various CPath applications, and show its robustness against decreased resolutions and its reliability in prediction.

Table 2. Summary of NAS applications in medical image analysis.

                  Searching Strategy
Task              Gradient-based    Reinforcement Learning    Evolutionary Algorithms
Segmentation      [35, 17, 6]       [3]                       [36]
Classification    [25, 9]           [10]                      [16]

As NAS has achieved promising results in many CV applications [7], several attempts have been made to use NAS techniques to find optimum architectures for applications in medical image analysis. Based on the task and the searching strategy, these works can be categorized as in Table 2. In image segmentation, most works adopt a U-Net structure whose detailed configuration is searched in different ways. The works in [35, 17] use differentiable architecture search to find cell structures as building blocks in the encoder and decoder. Bae et al. [3] use reinforcement learning to search for hyperparameter configurations of the U-Net architecture. Yu et al. [36] first search for cell connections to form a U-Net topology using evolutionary algorithms, and then search for operations within each cell. Dong et al. [6] extend the differentiable searching framework to work in adversarial training.

For classification applications, the searching is more task-specific. Using gradient-based searching, Peng et al. [25] develop a network to predict distant metastases on PET-CT images, and He et al. [9] design a network for COVID-19 detection with Chest CT Scans. Hosseini et al. [10] use a reinforcement learning-based controller to find the best parameter configuration of a CNN model for histological tissue type classification. Jiang et al. [16] search for a network to classify pulmonary nodules with evolutionary algorithms.

To the best of our knowledge, no prior work has fully explored the potential of NAS in digital pathology applications.

In this section, we introduce our searching algorithm. We first review the basic concepts of DARTS [23] , then show how the existing DARTS framework can be improved using a probing metric and a new optimizer. Finally, a network size-based searching is proposed to seek a trade-off between prediction accuracy and model complexity.

The goal of DARTS [23] is to search for two types of cells (namely normal and reduction) as building blocks, which are stacked to form a full network. Each cell is represented as a directed acyclic graph with $N$ nodes, including two input nodes, several intermediate nodes, and one output node. Every node $x_i$ is a latent representation (e.g., a feature map in a CNN) and every edge $(i, j)$ is a mixture of weighted candidate operations from a pre-defined operation search space $\mathcal{O}$ (e.g., convolution, skip-connection). The output $\bar{o}_{i,j}$ of an edge $(i, j)$ is then a weighted sum of candidate operations [23]:

$$\bar{o}_{i,j}(x_i) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha^{o}_{i,j})}{\sum_{o' \in \mathcal{O}} \exp(\alpha^{o'}_{i,j})} \, o(x_i),$$

where $\alpha^{o}_{i,j}$ is an architecture parameter for weighting operation $o(x_i)$. The output of an intermediate node $x_j$ is the sum over all of its input edges, i.e., $x_j = \sum_{i<j} \bar{o}_{i,j}(x_i)$. The output node of a cell is the concatenation of all intermediate nodes. Normal cells keep the input resolution, while reduction cells decrease the resolution by using stride 2 in all candidate operations.
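The softmax-weighted mixture on an edge and the summation at an intermediate node can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the scalar lambdas in `candidate_ops` are hypothetical stand-ins for real operations on feature maps, and the names `mixed_op` and `node_output` are introduced here only for illustration.

```python
import math

def softmax(alphas):
    """Numerically stable softmax over a list of architecture parameters."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, ops, alphas):
    """Output of one edge: softmax(alpha)-weighted sum of candidate operations on x."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))

# Hypothetical scalar stand-ins for the candidate operations on an edge.
candidate_ops = [
    lambda x: x,        # skip-connection
    lambda x: 0.0,      # zero operation
    lambda x: 2.0 * x,  # stand-in for a learned convolution
]

def node_output(inputs, edge_alphas):
    """Intermediate node: sum of the mixed outputs of all incoming edges."""
    return sum(mixed_op(x, candidate_ops, a) for x, a in zip(inputs, edge_alphas))
```

With equal architecture parameters every operation receives weight 1/3, so each weight lies strictly between 0 and 1 and scales the gradient that reaches the corresponding operation.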

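In the searching procedure, the network weights $w$ and the architecture parameters $\alpha$ are jointly learned via bi-level optimization [23]:

$$\min_{\alpha} \; \mathcal{L}_{val}(w^{*}(\alpha), \alpha) \quad \text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w} \mathcal{L}_{train}(w, \alpha),$$

where $\mathcal{L}_{val}$ and $\mathcal{L}_{train}$ denote the losses on the validation and training sets, respectively. Using gradient descent, $w$ and $\alpha$ are updated alternately in each training iteration.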
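The alternating update scheme can be illustrated on a toy scalar problem. The sketch below is only an assumption-laden illustration: the two quadratic losses are invented so that the dynamics are easy to follow, and the architecture step uses the first-order approximation that ignores how the optimal weights depend on $\alpha$.

```python
def search(steps=200, lr_w=0.1, lr_alpha=0.1):
    """First-order alternating updates on a toy bi-level problem.

    Toy losses (assumptions, chosen only for readability):
        L_train(w, alpha) = (w - alpha)^2
        L_val(w, alpha)   = (w - 1)^2 + (alpha - 1)^2
    """
    w, alpha = 0.0, 0.0
    for _ in range(steps):
        # Architecture step: descend L_val w.r.t. alpha with w held fixed.
        grad_alpha = 2.0 * (alpha - 1.0)
        alpha -= lr_alpha * grad_alpha
        # Weight step: descend L_train w.r.t. w with alpha held fixed.
        grad_w = 2.0 * (w - alpha)
        w -= lr_w * grad_w
    return w, alpha
```

In this toy setting both variables converge to 1: the validation loss pulls `alpha` to 1, and the training loss pulls `w` towards the current `alpha`.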

When the searching is finished, the discrete cell architecture is obtained by replacing each edge by the operation with the largest architecture weight, then selecting the two strongest input edges for each intermediate node. Fig. 2 (b) illustrates the evolution of cell structure during searching. The discrete network is then retrained from scratch for final evaluation. The whole process is shown in Fig. 2 (a).
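The discretization step can be sketched as follows. The nested-dictionary layout, the operation names, and the function name `discretize_cell` are hypothetical; excluding the zero operation when picking the strongest operation follows the convention of the DARTS paper [23] and is an assumption here, since the text above does not state it.

```python
def discretize_cell(alpha, top_k=2, exclude=("zero",)):
    """Derive a discrete cell from continuous architecture weights.

    alpha: {node_j: {input_i: {op_name: weight}}}, weights already normalized.
    Returns {node_j: [(input_i, op_name), ...]} with the top_k strongest edges.
    """
    cell = {}
    for node, in_edges in alpha.items():
        best = []
        for src, op_weights in in_edges.items():
            # Strongest non-excluded candidate operation on this edge.
            op, weight = max(
                ((o, w) for o, w in op_weights.items() if o not in exclude),
                key=lambda item: item[1],
            )
            best.append((weight, src, op))
        # Keep the top_k incoming edges with the largest weights.
        best.sort(key=lambda t: t[0], reverse=True)
        cell[node] = [(src, op) for _, src, op in best[:top_k]]
    return cell

# Hypothetical weights for a cell with input nodes 0, 1 and intermediate nodes 2, 3.
toy_alpha = {
    2: {
        0: {"sep_conv_3x3": 0.50, "skip_connect": 0.30, "zero": 0.20},
        1: {"sep_conv_3x3": 0.25, "skip_connect": 0.15, "zero": 0.60},
    },
    3: {
        0: {"sep_conv_3x3": 0.10, "skip_connect": 0.60, "zero": 0.30},
        1: {"sep_conv_3x3": 0.40, "skip_connect": 0.30, "zero": 0.30},
        2: {"sep_conv_3x3": 0.70, "skip_connect": 0.20, "zero": 0.10},
    },
}
```

For `toy_alpha`, node 3 keeps the edges from nodes 2 and 0 and drops the weaker edge from node 1.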

To monitor the searching process of DARTS, we adopt the explainability metric stable rank to probe the intermediate convolutional layers in different cells and quantify their learning quality as explained in [12] . Given a convolutional weight matrix, we first decompose it by low-rank factorization. This factors out the perturbation noise in the layer while keeping the most useful information in the low-rank component. The stable rank S is the normalized sum of the singular values of the low-rank matrix. It measures the norm energy of the convolutional weights and encodes the low-rank structure's space span of the output mapping. A higher value indicates better propagation of information through a convolutional layer [12] .
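As a rough illustration of probing a layer, the sketch below computes the classical stable rank, i.e. the squared Frobenius norm divided by the squared spectral norm. This is an assumption made for illustration: it skips the low-rank factorization step and may use a different normalization from the metric defined in [12]. The function name and the power-iteration estimate of the largest singular value are choices made here, not part of the original method.

```python
import math

def stable_rank(weight, iters=500):
    """Classical stable rank ||W||_F^2 / ||W||_2^2 of a 2-D matrix (list of rows).

    A 4-D convolutional kernel would first have to be unfolded into 2-D.
    """
    rows, cols = len(weight), len(weight[0])
    frob_sq = sum(v * v for row in weight for v in row)
    if frob_sq == 0.0:
        return 0.0  # an all-zero layer
    # Power iteration on W^T W estimates the largest squared singular value.
    # Limitation: a start vector orthogonal to the top singular vector
    # would converge to a smaller singular value.
    v = [1.0 / math.sqrt(cols)] * cols
    sigma_sq = 0.0
    for _ in range(iters):
        wv = [sum(weight[r][c] * v[c] for c in range(cols)) for r in range(rows)]
        wtwv = [sum(weight[r][c] * wv[r] for r in range(rows)) for c in range(cols)]
        norm = math.sqrt(sum(x * x for x in wtwv))
        if norm == 0.0:
            raise ValueError("start vector lies in the null space; use another start")
        sigma_sq = norm  # ||W^T W v|| approaches sigma_1^2 for unit v
        v = [x / norm for x in wtwv]
    return frob_sq / sigma_sq
```

Under this definition a rank-one matrix has stable rank 1 and an identity matrix of size n has stable rank n, so larger values indicate that more directions carry energy.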

Using this probing metric, we monitor the searching phase of the existing DARTS framework applied to the CIFAR-100 dataset [19]. The left column of Fig. 3 shows an example of the stable rank evolution (top row) of the layers in one cell as well as the evolution of the architecture weights (bottom two rows) in two edges. The stable rank evolution shows that the convolutional layers are not learning well in the original DARTS with stochastic gradient descent (DARTS+SGD) and the default initial learning rate of 0.025 [23]: most layers produce zero stable rank throughout all training epochs. Recall from the edge structure introduced in Sec. 3.1 that each candidate operation is multiplied by a weight between 0 and 1. This makes the gradients small during backpropagation, so the convolutional layers learn slowly. The bottom two plots show that skip-connections (green curves) are preferred. The same phenomenon is reported in [21]: when searched on CV datasets, the original DARTS tends to select too many skip-connections, which is a form of overfitting and results in a shallow network with poor representation ability.

In light of this, we increase the initial learning rate from 0.025 to 0.05 and 0.175. The stable rank evolution of layers in the same cell, as well as the architecture weights in the same edges, are shown in Fig. 3 (b) and (c). As the initial learning rate increases, more layers reach a higher stable rank, which means they are learning better. Meanwhile, the evolution of the architecture weights reveals that the preference for skip-connections is suppressed. The probing metric thus reveals both that the existing DARTS framework lacks proper hyperparameter tuning and how the searching can be improved.

We have shown in the previous section that the probing metric can be used to tune the initial learning rate for DARTS. Why not, then, use an optimizer that incorporates this metric to adjust learning rates during searching? To this end, we adopt the Adas optimizer [12], which adaptively adjusts the learning rate of each layer based on its stable rank evolution. At the end of each searching epoch, it first computes the difference in stable rank over consecutive epochs for each convolutional layer, and then adds the weighted result to the learning rate momentum. A hyperparameter, the scheduler beta, is used for weighting this term. This process is illustrated in Fig. 2 (c). The Adas optimizer is aware of the learning quality of each layer and tunes its learning rate accordingly. Fig. 4 shows the training and validation errors when searching with different optimizers and initial learning rates. SGD with a 0.175 initial learning rate leads to overfitting during searching: the final gap between training and validation error is around 40%. With Adas, the overfitting is reduced and the final gap is around 20%. This indicates that Adas improves the searching process of DARTS, yielding better generalization.
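One plausible reading of the per-layer update described above is a momentum-like recursion in which the previous learning rate is decayed by the scheduler beta and the change in stable rank is added with a weight. The sketch below follows that reading and is not the Adas implementation of [12]; the `gain` parameter and the function name are hypothetical.

```python
def update_learning_rates(lrs, prev_rank, curr_rank, beta=0.98, gain=1.0):
    """Per-layer learning-rate update driven by the change in stable rank.

    lrs, prev_rank, curr_rank: dicts keyed by layer name.
    beta: scheduler beta (0.98 is the value reported for DARTS+Adas).
    gain: hypothetical weight on the stable-rank difference.
    """
    new_lrs = {}
    for layer, lr in lrs.items():
        delta = curr_rank[layer] - prev_rank[layer]
        # Decay the old rate, then add the weighted stable-rank gain.
        new_lrs[layer] = beta * lr + gain * delta
    return new_lrs
```

Under this reading, a layer whose stable rank is still rising keeps a high learning rate, while a layer whose stable rank has stalled sees its rate decay geometrically with factor beta.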

We investigate the trade-off between network size and test performance by searching for the optimal architectures with different numbers of cells and intermediate nodes.

This trade-off is crucial for high-throughput CPath applications in real-time. On the ADP dataset, DARTS+Adas obtains the best-performing architecture. It consists of four cells, and each cell contains three intermediate nodes. Fig. 6 (c) and (d) show the snapshots of the cell structures.

Our experiments contain two stages. In the first stage, we search for the optimum architectures on CIFAR and ADP. In the second stage, we evaluate the transferability of the architecture searched on ADP, as well as its robustness and reliability in various cases.

We carry out the searching on CIFAR and ADP datasets. The search space of candidate operations is the same as in [23] , including 1) 3x3 separable convolution, 2) 5x5 separable convolution, 3) 3x3 dilated separable convolution, 4) 5x5 dilated separable convolution, 5) 3x3 max pooling, 6) 3x3 average pooling, 7) skip connection, and 8) zero operation. We stack the cells sequentially to build a network for searching and evaluation. The details of network structures can be found in Section Network Structures of the supplementary material. In both CIFAR and ADP experiments, we test two optimization strategies for optimizing model weights, i.e., DARTS+SGD and DARTS+Adas. Detailed setup can be found in Section Hyperparameters of the Supplementary Material.
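To give a sense of the size of this search space, the sketch below lists the eight candidate operations and counts the edges of a cell in which every intermediate node receives one edge from each earlier node (the two input nodes and all preceding intermediate nodes). The identifier strings and function names are illustrative assumptions.

```python
# Illustrative identifiers for the eight candidate operations listed above.
CANDIDATE_OPS = [
    "sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5",
    "max_pool_3x3", "avg_pool_3x3", "skip_connect", "zero",
]

def num_edges(intermediate_nodes, input_nodes=2):
    """Edges in a cell where node j receives one edge from every earlier node."""
    return sum(input_nodes + j for j in range(intermediate_nodes))

def num_arch_params(intermediate_nodes):
    """One architecture weight per (edge, candidate operation) pair, per cell type."""
    return num_edges(intermediate_nodes) * len(CANDIDATE_OPS)
```

For example, a cell with 4 intermediate nodes has 14 edges and therefore 112 architecture weights per cell type, while a cell with 3 intermediate nodes has 9 edges.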

The searched network is discretized and then trained from scratch for final evaluation. Following [23], in each parameter setting, we conduct four independent runs of searching with different random seeds. We then perform a quick evaluation of each searched architecture by training it from scratch for 100 epochs and pick the best-performing one. The finalized architecture is trained from scratch for 600 epochs in three independent runs. We report the mean and standard deviation of the test accuracy. Training details can be found in Section Hyperparameters of the Supplementary Material.

On the CIFAR dataset, we search for the optimum optimizer and number of intermediate nodes. During searching, eight cells are stacked as in DARTS, while the number of nodes in each cell is varied among 4, 5, and 6. During evaluation, to prevent the network from becoming too large with more nodes, we also change the number of cells accordingly, i.e., 20 cells for 4 nodes, 17 cells for 5 nodes, and 14 cells for 6 nodes. Table 3 and Table 4 show the test performance of architectures searched on CIFAR-10 and CIFAR-100. In each setting, DARTS+Adas outperforms the default DARTS+SGD in terms of test accuracy, at the cost of a larger parameter size. This is because the original DARTS+SGD tends to select skip-connections, which carry no parameters, in the final architecture, as described in Sec. 3.2. The optimum number of nodes is 4.

To find the optimum architecture on the ADP dataset, we run the architecture search in all parameter settings, i.e., different numbers of cells and nodes and different choices of optimizer. Note that in the CIFAR experiments we use a shallower network (with fewer cells) during searching but train a deeper one (with more cells) for evaluation, due to the complexity of CV datasets. On ADP, however, we keep the number of cells the same in the two stages. This is because ADP is a simpler dataset, so we do not need to increase the model complexity during evaluation. This also brings more consistency to the search-evaluation pipeline.

Optimum optimizer and number of cells. We first search for the optimum optimizer and the number of cells. The test results of the searched architectures after final evaluation are shown in Table 5. We also plot the accuracy versus parameter size in Fig. 5 (a), where each dot represents a different choice of cell number. As the number of cells increases, the accuracy of DARTS+SGD drops while its parameter size increases. DARTS+Adas, in contrast, consistently achieves the higher test accuracy and the smaller parameter size across different numbers of cells. The highest accuracy is achieved with 6 cells, while with 4 cells the searched architecture has the smallest size but still obtains the second highest accuracy.

Optimum number of intermediate nodes. We then fix the number of cells at four and search for the optimum number of intermediate nodes. The test performance is shown in Table 6 and in Fig. 5 (b). DARTS+Adas achieves higher accuracy than DARTS+SGD with fewer nodes, and hence lower computation complexity. The highest accuracy, 94.46%, is achieved with 3 nodes, with 0.31M parameters and 0.27G MAC operations. Fig. 6 shows the cell architectures searched with 2, 3, and 4 nodes using DARTS+Adas.

We select the searched architectures with 2, 3, and 4 nodes (namely DARTS-ADP-N2, -N3, and -N4), and train them on three more datasets: BCSS [1], BACH [2], and Osteosarcoma [20]. Details including data augmentations can be found in Section Datasets of the Supplementary Material. The goal is to evaluate how the searched architectures perform when transferred to different CPath datasets that cover variations in single- vs. multi-label and multi-class problems, data samples, and organs. We also train several mobile-friendly architectures on them for comparison, including a 4-cell DARTS [23]. All networks are trained for 600 epochs with batch size 96, using the SGD optimizer with a 0.025 initial learning rate and a cosine annealing scheduler.

As shown in Table 7, across all datasets, the group of DARTS-ADP networks achieves test accuracy higher than or comparable to that of state-of-the-art networks, but with smaller parameter sizes. The 2-node version contains only 0.24M parameters but still ranks high in test accuracy. As for computation complexity, although MobileNetV2 0.35 [28], MobileNetV3-small [13], and MNASNet-small [33] require fewer MAC operations, their accuracies are two percent lower. This shows the superiority of DARTS-ADP in the speed-accuracy trade-off, which is desirable for CPath applications of high-throughput image analysis.

Another way to meet the needs of high-throughput applications is to decrease the image resolution. To evaluate the robustness of the architectures against downscaled inputs, we retrain several models with different resolutions (272, 136, 68, 34) on three datasets. Fig. 7 shows the performance of the selected networks. Each line represents a network and each dot represents a specific resolution. The DARTS-based networks consistently achieve the highest accuracy with the lowest computation complexity at all resolutions and across all datasets. As the resolution decreases, their test accuracy drops much less than that of ResNet18 [8] and MobileNetV2 [28], which shows the robustness of the DARTS-based networks. This robustness is also reflected in the standard deviation (denoted by shaded regions). Compared to the 4-cell DARTS [23], the two DARTS-ADP networks obtain higher or comparable test accuracy with lower computation complexity, which again shows their superiority in the speed-accuracy trade-off.
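The throughput benefit of downscaling can be estimated with a back-of-the-envelope rule: the MAC count of a convolution is proportional to the spatial size of its output, so halving the input side length cuts the convolutional cost roughly fourfold. The sketch below encodes only that approximation; it ignores the classifier layer and padding effects, and `relative_macs` is a hypothetical helper.

```python
def relative_macs(resolution, base_resolution=272):
    """Approximate convolutional MACs relative to the base input resolution."""
    return (resolution / base_resolution) ** 2
```

Under this approximation, the resolutions 136, 68, and 34 cost about 1/4, 1/16, and 1/64 of the computation at 272, respectively.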

To better understand and interpret the feature representations of different networks, we apply Grad-CAM [30] to the last convolutional layer of each network to obtain heatmaps for predicting the ground truth labels. This visualization technique allows us to evaluate the reliability of different networks in label prediction. We randomly select five patches that contain different ground truth labels from the test sets of ADP, BCSS, and BACH, and feed them into three networks for comparison. The selected networks are DARTS-ADP-N4, MobileNetV3-large [13], and MNASNet-A1 [33]. Results are shown in Fig. 8, where the heatmap indicates the pixel-level confidence of the pertinent labels of the image patch. According to pathologists' assessment, the overall performance of DARTS-ADP-N4 is the best. Examples are shown in the first column of Fig. 8(a), where DARTS-ADP-N4 successfully highlights the region of Erythrocytes. In contrast, MobileNetV3-large and MNASNet-A1 each highlight incomplete or false regions. This demonstrates the superiority of DARTS-ADP in extracting label-pertinent features from image patches, and hence its more reliable predictions.

In this paper, we propose a general DARTS-based searching framework for CPath applications. We first use a probing metric to show that the existing DARTS lacks proper hyperparameter tuning, and how the generalization performance of the searched model can be improved with an adaptive optimization strategy. We then apply this searching framework on a histological tissue type dataset ADP and develop architectures that outperform the state-of-the-art networks with higher prediction accuracy and lower computation complexity. We transfer the searched architectures to other CPath datasets including BCSS, BACH, and Osteosarcoma, and conduct extensive experiments to demonstrate the robustness and reliability of the networks in various cases.


The macro network structures in both the searching and evaluation phases are formed by stacking the normal and reduction cells sequentially. At 1/3 and 2/3 of the total depth of the network, there are reduction cells. Fig. 9 shows the general network structure, where the stem block contains several convolutional layers and the classifier consists of a global pooling layer and a fully connected layer.

The final architecture searched on ADP [11] is shown in Fig. 10 . Note that there are no normal cells between the two reduction cells since the total number of cells is four, which is not divisible by three. 
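The placement of reduction cells can be sketched with integer division, which reproduces the four-cell layout described above. The rule `num_cells // 3` and `2 * num_cells // 3` is an assumption borrowed from the public DARTS implementation rather than something stated in this text, and `cell_layout` is a hypothetical helper.

```python
def cell_layout(num_cells):
    """Return the cell type at each position, with reductions at 1/3 and 2/3 depth."""
    reduction_at = {num_cells // 3, 2 * num_cells // 3}
    return ["reduction" if i in reduction_at else "normal" for i in range(num_cells)]
```

For four cells the reduction cells fall at positions 1 and 2, giving the sequence normal, reduction, reduction, normal, with no normal cell between the two reductions. For eight cells they fall at positions 2 and 5.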

CIFAR [19] . In the searching phase, we follow [23] to split the original training set into two parts, one for training and one for evaluation. In the evaluation phase, we use the default splits. We use random cropping with size 32x32 and random horizontal flipping as data augmentations.

CPath datasets. ADP and BCSS [1] are multi-label datasets, while BACH [2] and Osteosarcoma [20] are single-label. All images have a resolution of 272x272. We only conduct searching on ADP but evaluate the searched architecture on all four datasets. During searching, we treat half of the training set of ADP as the validation set. Data augmentations in all datasets include random horizontal and vertical flipping, random affine transformation, and resizing. Note that during searching on ADP, we resize the images to 64x64 to alleviate the computation overhead, and during evaluation, images are resized only when testing different resolutions (136, 68, and 34).
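Holding out half of the training set for the architecture updates can be sketched as an index split. Whether the authors shuffle before splitting is not stated, so the seeded shuffle below is an assumption, as is the helper name `split_for_search`.

```python
import random

def split_for_search(num_samples, val_fraction=0.5, seed=0):
    """Split sample indices into a weight-training part and a validation part."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # assumed: seeded random shuffle
    cut = int(num_samples * (1.0 - val_fraction))
    return indices[:cut], indices[cut:]
```

The first part would be used to update the network weights and the second part to update the architecture parameters.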

In CIFAR experiments, we train the network for 50 epochs with batch size 64 and initial channels 16. We test two optimizers for optimizing model weights, which are the original SGD [23] and Adas [12]. For DARTS+SGD, we follow [23] to use initial learning rate 0.025, cosine annealing scheduler, momentum 0.9, and weight decay $3 \times 10^{-4}$. For DARTS+Adas, we use initial learning rate 0.175, scheduler beta 0.98, momentum 0.9, and weight decay $3 \times 10^{-4}$. As for architecture parameter optimization, we follow [23] to use the Adam [18] optimizer with initial learning rate $3 \times 10^{-4}$, momentum (0.5, 0.999), and weight decay $10^{-3}$.

In ADP experiments, most hyperparameters are the same except that we use batch size 32 due to computation overhead. We also increase the initial learning rate of DARTS+SGD to 0.175 for model weights optimization.

In both CIFAR and CPath experiments, we follow [23] to train the network for 600 epochs with batch size 96 and initial channels 36. We use the SGD optimizer with an initial learning rate of 0.025, cosine annealing scheduler, momentum 0.9, and weight decay $3 \times 10^{-4}$. Additional enhancements include cutout and auxiliary towers as in [23]. Note that we disable auxiliary towers in training when we compare the performance of the searched architectures with the state-of-the-art networks.
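For reference, the standard cosine annealing schedule can be written as a one-line function. The sketch assumes a minimum learning rate of zero and a single cycle spanning all 600 epochs, neither of which is stated above; `cosine_lr` is a hypothetical helper.

```python
import math

def cosine_lr(epoch, total_epochs=600, lr_init=0.025, lr_min=0.0):
    """Cosine-annealed learning rate at a given epoch (lr_min = 0 is assumed)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cosine)
```

With these settings the learning rate starts at 0.025, reaches half of that value at epoch 300, and decays to the minimum at epoch 600.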

[1] Structured crowdsourcing enables convolutional segmentation of histology images

[2] Grand challenge on breast cancer histology images

[3] Resource optimized neural architecture search for 3d medical image segmentation

[4] Artificial intelligence in digital pathology: new tools for diagnosis and precision oncology

[5] Mitosis detection in breast cancer histology images with deep neural networks

[6] Neural architecture search for adversarial medical image segmentation

[7] Neural architecture search: A survey

[8] Deep residual learning for image recognition

[9] Automated model design and benchmarking of 3d deep learning models for covid-19 detection with chest ct scans

[10] On transferability of histological tissue labels in computational pathology

[11] Atlas of digital pathology: A generalized hierarchical histological tissue type-annotated database for deep learning

[12] Adas: Adaptive scheduling of stochastic gradients

[13] Searching for mobilenetv3

[14] Squeeze-and-excitation networks

[15] Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases

[16] Learning efficient, explainable and discriminative representations for pulmonary nodules classification

[17] Scalable neural architecture search for 3d medical image segmentation

[18] Adam: A method for stochastic optimization

[19] Learning multiple layers of features from tiny images

[20] Osteosarcoma data from ut southwestern/ut dallas for viable and necrotic tumor assessment [data set]. The Cancer Imaging Archive

[21] Darts+: Improved differentiable architecture search with early stopping

[22] A survey on deep learning in medical image analysis

[23] Darts: Differentiable architecture search

[24] Shufflenet v2: Practical guidelines for efficient cnn architecture design

[25] Multi-modality information fusion for radiomics-based neural architecture search

[26] Regularized evolution for image classifier architecture search

[27] U-net: Convolutional networks for biomedical image segmentation

[28] Mobilenetv2: Inverted residuals and linear bottlenecks

[29] H&e-stained whole slide image deep learning predicts spop mutation state in prostate cancer

[30] Grad-cam: Visual explanations from deep networks via gradient-based localization

[31] Accurate cervical cell segmentation from overlapping clumps in pap smear images

[32] Going deeper with convolutions

[33] Mnasnet: Platform-aware neural architecture search for mobile

[34] Deep learning for identifying metastatic breast cancer

[35] Nasunet: Neural architecture search for medical image segmentation

[36] C2fnas: Coarse-to-fine neural architecture search for 3d medical image segmentation

[37] Understanding and robustifying differentiable architecture search

[38] Automatic detection and classification of leukocytes using convolutional neural networks

[39] Neural architecture search with reinforcement learning