key: cord-0614240-jaoj960t authors: Xiang, Tiange; Zhang, Chaoyi; Wang, Xinyi; Song, Yang; Liu, Dongnan; Huang, Heng; Cai, Weidong title: Towards Bi-directional Skip Connections in Encoder-Decoder Architectures and Beyond date: 2022-03-11 journal: nan DOI: nan sha: 15043aa66220b131f41102a9ba1388641a1d3824 doc_id: 614240 cord_uid: jaoj960t

A B S T R A C T

U-Net, as an encoder-decoder architecture with forward skip connections, has achieved promising results in various medical image analysis tasks. Many recent approaches have also extended U-Net with more complex building blocks, which typically increase the number of network parameters considerably. Such complexity makes the inference stage highly inefficient for clinical applications. Towards an effective yet economic segmentation network design, in this work, we propose backward skip connections that bring decoded features back to the encoder. Our design can be jointly adopted with forward skip connections in any encoder-decoder architecture, forming a recurrence structure without introducing extra parameters. With the backward skip connections, we propose a U-Net based network family, namely Bi-directional O-shape networks, which set new benchmarks on multiple public medical imaging segmentation datasets. On the other hand, with the most plain architecture (BiO-Net), network computations inevitably increase along with the pre-set recurrence time. We have thus studied the deficiency of such a recurrent design and propose a novel two-phase Neural Architecture Search (NAS) algorithm, namely BiX-NAS, to search for the best multi-scale bi-directional skip connections. The ineffective skip connections are then discarded to reduce computational costs and speed up network inference. The finally searched BiX-Net yields the least network complexity and outperforms other state-of-the-art counterparts by large margins. We evaluate our methods on both 2D and 3D segmentation tasks in a total of six datasets. Extensive ablation studies have also been conducted to provide a comprehensive analysis of our proposed methods.
Accurate and efficient analysis of medical images is of great interest to the computer vision and medical communities. The diagnosis of potential disease from medical images relies on a wide range of features implied by visual clues. Such a decision-making process requires time-consuming efforts from physicians, and better encoder-decoder aggregation strategies are needed for more advanced feature refinement. In this work, we study the skip mechanism in encoder-decoder architectures and design effective yet efficient segmentation networks.

Image segmentation. Computer-aided image segmentation has a long history (Haralick and Shapiro, 1985) in the computer vision field. With the recent success of deep learning, neural network based methods have demonstrated their ability in fast and accurate segmentation of digital images. As a pioneering segmentation model, U-Net (Ronneberger et al., 2015) adopts an encoder-decoder structure that first encodes the visual signals provided in an image into high-dimensional features and then decodes the abstract semantics to learn the pixel-to-pixel mapping between the input image and the segmentation mask. In contrast to conventional autoencoders (Hinton and Zemel, 1994), skip connections are added between the encoder and decoder, so that encoded features are forwarded to the decoder. Such forward skip connections enable fine-grained aggregation of features and preserve better gradient flows during network optimization. Although other advanced approaches exist in different forms (Chen et al., 2017; Long et al., 2015), U-Net often demonstrates superior performance and serves as the most important backbone in medical image segmentation tasks.

U-Net variants. Recent studies have developed different building blocks to extend U-Net. For instance, V-Net (Milletari et al., 2016) applies U-Net to process 3D voxel data for 3D vision tasks. W-Net (Xia and Kulis, 2017) concatenates two U-Nets head-to-tail to approach image segmentation tasks in an unsupervised style. M-Net (Mehta and Sivaswamy, 2017) studies the impact of using multi-scale features through paired downsampling and upsampling layers. U-Net++ (Zhou et al., 2018), as a direct extension of U-Net, designs nested building blocks with dense skip connections to better propagate encoded features. Apart from modifications on the network structure, attention U-Net (Oktay et al., 2018) incorporates an attention mechanism into U-Net by learning self-attention maps to be applied to the skipped features. However, the above U-Net variants require additional functional modules, leading to an escalation of network complexity. In our work, we improve the performance of U-Net by inserting novel backward feature skip connections, where building blocks are reused without introducing any extra network parameters.

Recurrent convolutional networks. Iterative refinement of features has been proven effective in many computer vision tasks (Han et al., 2018; Guo et al., 2019; Wang et al., 2019; Alom et al., 2018). Auto-context (Tu and Bai, 2009) implicitly extracts image features together with context information by learning a series of classifiers in a similar recurrent manner.
However, their classifiers need to be built independently in a boosting style, whereas our methods recurse semantic features inside a single encoder-decoder network. (Guo et al., 2019) reuses ResNet residual blocks (He et al., 2016) to fully utilize the limited parameters. With a similar recurrent strategy, (Wang et al., 2019) proposed R-U-Net, which connects multiple U-Net architectures head-to-tail with shared parameters to enhance segmentation performance. R2U-Net (Alom et al., 2018) adopts a similar approach that only recurses the last building block at each level of refinement. By contrast, our method learns recurrent bi-directional connections between the encoder and decoder.

Neural architecture search. Compared to manually crafted networks, Neural Architecture Search (NAS) algorithms automatically search for the optimal architecture in a defined search space. Reinforcement Learning (RL) based methods (Zoph and Le, 2016) utilize a stand-alone agent to monitor the search process under certain proxies. Evolution based methods (Real et al., 2017, 2019) start with a set of randomly sampled ancestor networks and then progressively evolve the population into new generations with more powerful offspring networks. Differentiable methods (Liu et al., 2019b; Guo et al., 2020) optimize "architecture parameters" through back-propagation by relaxing the discrete search space, and define the network topology based on the optimized architecture parameters. NAS-searched architectures have also achieved success in segmentation tasks. Auto-DeepLab (Liu et al., 2019a) designed a differentiable search space to determine the best operators and topologies for each building block. Similarly, NAS-Unet (Weng et al., 2019) also adopted a gradient-based search that automatically discovers basic cell structures to construct a U-Net like architecture. As a multi-scale counterpart, the search space in (Yan et al., 2020) covers multiple encoding and decoding levels. Their proposed MS-NAS has the ability to aggregate multi-level features, which leads to better segmentation performance. However, the search process of the above methods is time consuming and computationally inefficient. In this work, we design an efficient two-phase NAS method to find a subset of optimal bi-directional skip connections that yields the fewest network parameters and computations.

Our overall contributions include: (1) We propose bi-directional skip connections in encoder-decoder architectures for an iterative aggregation of encoded and decoded features; (2) We incorporate bi-directional skip connections into a simple U-Net architecture, namely BiO-Net, achieving better segmentation performance without using extra parameters; (3) The deficiency of BiO-Net is analyzed and an upgraded network, BiO-Net++, is designed with significantly less network complexity and multi-scale skip connections; (4) To further reduce the redundancies in BiO-Net++, we introduce an efficient two-phase BiX-NAS method to automatically search for resource-aware and optimal skip connections from the BiO-Net++. The finally searched BiX-Net model achieves on-par performance with the BiO-Net but with significantly less complexity; (5) Our proposed bi-directional skip connections along with the novel networks were evaluated on a variety of medical image segmentation tasks including: 2D nuclei segmentation, 2D multi-organ segmentation, 3D Covid-19 infection segmentation, and two 3D segmentation tasks from the Medical Segmentation Decathlon (MSD).
Our methods establish new benchmarks on these datasets; (6) Extensive ablation studies were conducted to fully study the usefulness of each of the proposed components.

Compared to the preliminary versions of this work published in MICCAI 2020 (Xiang et al., 2020) and MICCAI 2021 (Wang et al., 2021), our contributions made in this paper include: (1) The methodology is introduced with more intuition and presented in more detail through both plain text and algorithmic presentations; (2) Discussions and analysis of the drawbacks of our methods are provided in detail; (3) More experiments were conducted for a more comprehensive evaluation on both 2D and 3D segmentation tasks; (4) Extensive ablative studies and analysis were conducted to better comprehend the bi-directional connections together with the progressive evolutionary search. Our project page with source code has been made publicly available to foster future research.

For notational simplicity, we define several symbols and notations before presenting our methods:
T: the pre-set recurrence time. Note that T−1 represents the total time of backward feature skipping in our bi-directional networks;
N: the expansion multiplier for the feature map dimension. For example, with a base feature dimension of 32 and N = 1.25, the expanded dimension will be 40;
W: the number of backward skip connections (starting from the deepest encoding level);
L: the greatest encoding depth;
P: the population of candidate networks of a generation;
Extraction stage: a sequential encoding or decoding process. There are two extraction stages (i.e., one encoder and one decoder) in U-Net;
Searching block: any building block in an extraction stage;
C: the ready-to-be-searched candidate skips between a pair of extraction stages;
E: the evolved skips from a set of candidate skips between a pair of extraction stages.

Let us first consider a simple encoder-decoder architecture similar to U-Net, where skip connections are built at different levels, facilitating the information exchange between encoder and decoder.

Motivation. Skip connections across different network levels have been proven effective for learning finer-grained feature representations (He et al., 2016; Ronneberger et al., 2015). Toward better feature extraction, networks can be designed by stacking multiple U-Nets head-to-tail to construct skip connections in a more structured form (Xia and Kulis, 2017). However, with more network parameters introduced in the extra U-Nets, the resulting improvements are limited. This motivates us to explore skip connections in a recurrent manner: rather than expanding the network further, we reuse its parameters. In this way, we argue that semantic features can be skipped bi-directionally between the encoder and decoder. Our design aims not only to promote better feature aggregation but also to fully utilize the parameters of a single encoder-decoder network.

Forward Skip Connections. Forward skip connections carry the encoded low-level features forward to the decoder. There are two incoming feature streams to each decoder block: the sequential feature stream x̂_in from a lower-level decoder block and the forward-skipped feature stream f_enc from the corresponding encoder block at the same level. The decoder block then fuses the two streams through a FUSE operation, and the features are subsequently propagated through the decoding convolutions DEC to generate semantic features f_dec. This process can be defined as:

f_dec = DEC(FUSE(f_enc, x̂_in)).    (1)

Backward Skip Connections.
Paired with forward skip connections, we define backward skip connections that pass the decoded high-level semantic features f_dec back to the encoder. An encoder block now combines f_dec with its original sequential input x_in from an upper-level encoder block to create additional aggregations between low-level and high-level features. Similar to Eq. (1), our encoding process can be formulated with the encoding convolutions ENC as:

f_enc = ENC(FUSE(f_dec, x_in)).    (2)

2.1. Recurrent Inference

Unlike forward skips, our proposed backward skip connections cannot work directly in the classic one-pass encoder-decoder architecture, since encoder layers are only forwarded once. To enable backward feature skipping, we design an "O"-shape recurrent inference routine for bi-directional networks (Sec. 3) that propagates features iteratively between encoder and decoder through the bi-directional skip connections. Noticeably, such recurrent inference reuses existing network weights, and no extra parameters are introduced during the inference. Finally, the building block outputs using our bi-directional skip connections at iteration i can be formulated as:

x_out^i = DOWN(ENC(FUSE(f_dec^{i−1}, x_in^i))),    (3)
x̂_out^i = UP(DEC(FUSE(f_enc^i, x̂_in^i))),

where DOWN represents the downsampling process and UP represents the upsampling process.

[Fig. 1. Top: comparison among the autoencoder (Hinton and Zemel, 1994), U-Net (Ronneberger et al., 2015), and our BiO-Net. With the same architectural structure, the major differences are the proposed bi-directional skip connections. Bottom: unrolled overview of our BiO-Net when T = 3, L = 4, and W = 4. Note that in BiO-Net, same-level encoder/decoder blocks share the same weights, such that no extra parameters are required.]

Each building block consists of convolutional layers, batch normalization (Ioffe and Szegedy, 2015) layers, and ReLU (Nair and Hinton, 2010) activation layers. Note that no batch normalization layer is reused through network recursion. To align with U-Net, our BiO-Net adopts max pooling, convolution transpose, and concatenation as DOWN, UP, and FUSE, respectively. An overview of our BiO-Net is shown in Fig. 1. Given an input image, we first apply three convolution-normalization-activation combinations to extract low-level features. There is no skip connection attached to such pre-processing blocks, and their parameters are not reused. The initially extracted features are then sent to a cascade of encoder blocks that are recursed through the bi-directional skip connections. Note that no features are backward skipped at the first recurrence. To keep the feature map size consistent across iterations, we duplicate the encoded features at the first encoding iteration. After the encoding stage, a bridge block with additional convolutions is employed to transfer the encoded features. Subsequently, a series of decoder blocks consume the features and recover encoded details using convolution transpose. During the decoding stage, our backward skip connections preserve retrieved features by concatenating them with the ones skipped from the same-level encoder block. The recursion begins at the end of the last decoder block. After recursing for T iterations, the decoded output is fed into the post-processing block, which is constructed similarly to the pre-processing block. The post-processing blocks are not involved in the recurrence.
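To make the recurrence concrete, the following minimal PyTorch sketch instantiates Eqs. (1)-(3) at a single encoding level. It is an illustration under simplifying assumptions, not the released implementation: DOWN/UP and the multi-level hierarchy are omitted, FUSE is shown as element-wise averaging (the plain BiO-Net uses concatenation), and a zero-initialized backward feature stands in for the duplicated features used at the first pass.

```python
import torch
import torch.nn as nn

# ENC and DEC are reused at every iteration, so recurrence adds no parameters.
# BiO-Net additionally keeps separate per-iteration batch norm layers (omitted).
enc = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())   # ENC
dec = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())   # DEC
fuse = lambda a, b: 0.5 * (a + b)   # FUSE (averaging variant of BiO-Net++)

x_in = torch.randn(1, 32, 64, 64)       # sequential input to the encoder block
x_hat_in = torch.randn(1, 32, 64, 64)   # sequential input to the decoder block
f_dec = torch.zeros_like(x_in)          # no backward skip at the first recurrence

for i in range(3):                       # T = 3 passes around the "O" loop
    f_enc = enc(fuse(f_dec, x_in))       # Eq. (2): backward skip into the encoder
    f_dec = dec(fuse(f_enc, x_hat_in))   # Eq. (1): forward skip into the decoder
```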
Optimizing a recurrent network can be tricky, since multiple computation graphs may be involved in one network structure. In classic Recurrent Neural Networks (RNNs) (Hochreiter and Schmidhuber, 1997), the Back-Propagation Through Time (BPTT) strategy is used to calculate gradients based on outputs at different time stamps. However, in our bi-directional networks, the blocks at the finest resolution are not recursable and only one final segmentation mask is predicted during the entire inference, regardless of the recurrence time. To this end, the gradient computation graph remains consistent with an unrolled version of BiO-Net (Fig. 1, bottom). During back-propagation, the gradients are accumulated at each learnable layer and are subsequently used to update the weights only once.

For a fair validation of the proposed bi-directional skip connections, our BiO-Net structure is constructed identically to U-Net with the same upsampling, fusion, and channel expansion strategies. However, we find that this identical setting puts unexpected computation burdens on our BiO-Net, and there exist efficient alternatives that reduce the overall complexity. When using concatenation as the fusion strategy in BiO-Net, the encoder feature channels have to be doubled to carry the additional backward-skipped features. We resolve this discord by replacing channel-wise concatenation with element-wise averaging as the fusion strategy. Moreover, instead of using convolution transpose for upsampling with extra learnable parameters, we bilinearly resize the coarse-scale feature maps to align with the finer ones. As a result, these simple optimizations reduce the complexity of BiO-Net with a 34.86× reduction in network parameters. The optimization strategies are also adopted in our other networks, including BiO-Net++ and BiX-Net, which will be presented in detail shortly.

Multi-scale information provides better guidance for the analysis of objects of various sizes (Yan et al., 2020). To fuse multi-scale features in BiO-Net, precedent encoded/decoded features at all levels are densely connected to every subsequent decoding/encoding level through the proposed bi-directional skip connections. We average over the multi-scale features and align inconsistent spatial dimensions via simple bilinear resizing. The suggested multi-scale BiO-Net++ is outlined in Fig. 2, left.

Although BiO-Net++ promotes multi-scale feature fusion, we empirically found that the dense skip connections bring only marginal improvements in overall performance (Table 7). A natural question hence arises: does every level-to-level skip in BiO-Net++ carry indicative information, and is there a subset of sparse skip connections that not only benefits from multi-scale feature fusion but also yields the least complexity? To this end, we present BiX-NAS, a two-phase NAS algorithm, to automatically find optimal and sparsely connected skip connections in the BiO-Net++. In Phase1, we follow a differentiable NAS strategy that aims at narrowing down the huge search space swiftly. In Phase2, we design an evolution-based NAS to progressively discover the best skips with the optimal segmentation performance and computational efficiency. In this study, we searched the final architecture on one dataset only, and the same architecture is then transferred to all other tasks.
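Before detailing the two phases, here is a minimal sketch of the multi-scale fusion used in BiO-Net++ described above: features skipped from different levels are bilinearly resized to a common spatial size and averaged element-wise. Equal channel widths across levels and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multiscale_fuse(skips, target_hw):
    """Average skipped features from several levels at one target resolution.

    skips: list of (B, C, H_i, W_i) tensors; C is assumed equal across levels.
    """
    resized = [F.interpolate(s, size=target_hw, mode='bilinear',
                             align_corners=False) for s in skips]
    return torch.stack(resized, dim=0).mean(dim=0)   # element-wise average

feats = [torch.randn(1, 32, 16, 16),   # deeper level
         torch.randn(1, 32, 32, 32),
         torch.randn(1, 32, 64, 64)]   # shallower level
fused = multiscale_fuse(feats, target_hw=(32, 32))   # -> (1, 32, 32, 32)
```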
Phase1: Narrowing down the search space via a selection matrix. To identify a sparse set of skip connections, the dense ones between every pair of extraction stages are evaluated and sifted. Suppose there are N incoming feature streams to a searching block; we anticipate that only k ∈ [1, N−2] candidate(s) of them will be accepted, which results in a search over the SuperNet BiO-Net++ with L levels and T iterations. When N = 5, L = 4, and T = 3 (a common setup), the search space expands to 5^40, escalating the difficulty of searching for one optimal instance. To alleviate this difficulty, we determine k candidate skips from the N incoming skips in each searching block by pruning the easy-to-spot ineffective ones.

Intuitively, one-to-one relaxation parameters α (Liu et al., 2019b) could be registered for each skip connection x and optimized along with BiO-Net++. The skip with the highest α is then picked as the output of Φ(·), such that Φ(x) = x_{argmax α}, where x = {x_1, ..., x_N} and α = {α_1, ..., α_N} denote the full set of incoming skips and their corresponding relaxation scores, respectively. The above formulation outputs a fixed number of skips for every level. However, different levels may fuse different numbers of skips, so this intuitive formulation does not suffice. Towards a more flexible skip selection, we construct a learnable selection matrix M ∈ R^{N×(N−2)} that models the mapping between the N incoming skips and the k anticipated candidates, and formulate Φ(·) as a fully differentiable equation:

Φ(x) = AVG({xᵀ GumbelSoftmax(M)_{:,c} | c = 1, ..., N−2}),

where the Gumbel-Softmax trick (Jang et al., 2017) forces each of the (N−2) columns of M to be a one-hot vector that votes for one of the N incoming skips. Our formulation generates (N−2) selected skips with repetition allowed, achieving a flexible selection of [1, N−2] different candidate skips (Fig. 3). The unique candidate skips are then averaged out to be fed into subsequent blocks. Moreover, differing from (Liu et al., 2019b,a), our Φ(·) formulation unifies the forward propagation behaviour during both the searching and network inference stages.
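A minimal sketch of the Phase1 selection described above, shown discretely for clarity; the temperature tau and tensor shapes are assumptions. During the actual search, the straight-through Gumbel-Softmax keeps the column votes differentiable, whereas the explicit argmax/unique steps below only illustrate how the repeated votes collapse into averaged unique candidates.

```python
import torch
import torch.nn.functional as F

N = 5                                               # incoming skips per block
M = torch.nn.Parameter(torch.randn(N, N - 2))       # learnable selection matrix
skips = torch.randn(N, 1, 32, 64, 64)               # the N incoming skip features

# Each of the (N - 2) columns becomes a one-hot vote for one incoming skip.
onehot = F.gumbel_softmax(M, tau=1.0, hard=True, dim=0)   # (N, N-2)
votes = onehot.argmax(dim=0)                        # indices, repetition allowed
unique = torch.unique(votes)                        # 1 .. N-2 distinct candidates
phi = skips[unique].mean(dim=0)                     # averaged unique candidates
```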
Phase2: Progressive evolutionary search. After squeezing the initial search space, evolving randomly sampled candidate network instances becomes practical. To further reduce skip redundancies and identify the best network instance, we perform an additional evolutionary search to find the optimal skip set at all levels across all iterations. Specifically, we search the candidate skips C for all levels at the same time between a certain pair of adjacent extraction stages, and then progressively move to the next pair once the current search is concluded. As the connectivity of adjacent extraction stages depends on the connectivity of precedent ones, we initiate the search from the last extraction stage pair and progressively move to the first pair.

Conventional evolutionary NAS algorithms optimize the SuperNet with each sub-network in a population P individually (Real et al., 2017, 2019), and then update P when all individual training finishes. When searching for skip connections rather than operators, such a strategy has two major flaws: first, optimizing the SuperNet with sampled skip sets individually may result in unfair outcomes; second, the searching process can be empirically slow. Assuming the forward and backward passes of each extraction stage take I_F and I_B time, training all |P| instances individually per step takes 2T|P|(I_F + I_B) in total.

Improving searching fairness and efficiency. To overcome the first flaw above, we define the concept of skip fairness and claim that all skip search algorithms need to meet this principle.

Definition 1 (Skip Fairness). Let F = {f_1, ..., f_N} be the incoming skipped features to any searching block in each evolving architecture A_i within a population P. The skip fairness requires the corresponding incoming features f_1, ..., f_N to be identical across all architectures A_i ∈ P when searching between the same extraction stage pair.

The above principle yields that, when searching between the same extraction stage pair, any corresponding level-to-level skips across different sampled architectures are required to carry identical features. Otherwise, the inconsistent incoming features would impact the search decision on the skipping topology, hence causing unexpected search unfairness. Gradient-based search algorithms (e.g., our Phase1 search algorithm) meet this principle by definition, as the same forwarded features are distributed to all candidate skips equally. However, the aforementioned conventional strategy violates this principle due to the inconsistent incoming features produced by the individually trained architectures.

Our proposed Phase2 search algorithm meets the skip fairness by synchronizing partial forwarded features in all sampled candidate networks. Specifically, suppose we are searching skips between the t-th and (t+1)-th extraction stages (t ∈ [1, 2T−1]): the network topology from the 1st to the (t−1)-th stage is fixed, and the forward process between such stages can be shared. We denote such stages as the head network. On the contrary, the network topology from the t-th to the 2T-th stage varies with the different sampled skips. We denote such unfixed stages as tail networks, which share the same SuperNet weights but with distinct topologies. The forwarded features of the head network are fed to all candidate tail networks individually, as shown in Fig. 4. We average the losses of all tail networks, and backward the gradients through the SuperNet weights only once. Our Phase2 searching process thereby overcomes the second flaw above and is empirically efficient, as one training step requires only a single shared forward pass of the head network and a single backward pass through the SuperNet weights. After the search between an extraction stage pair completes, we follow a multi-objective selection criterion that retains the architectures on the Pareto front (Yang et al., 2020) based on both validation accuracy (IoU) and computational complexity (MACs). The proposed progressive evolutionary search is detailed in Algorithm 1, and our finally searched BiX-Net is shown in Fig. 2, right.

Algorithm 1: Progressive evolutionary search
for t = 2T−1 down to 1 do
    for i = 1, ..., s do
        for each searching block b do
            Randomly sample n skips from C^b_{t→t+1} to form the candidate set C_i
    for each data batch X with target Y do
        Forward the head network once with candidate skips C_{1→2}, ..., C_{t−2→t−1}
        for i = 1, ..., s do
            Forward each tail network with sampled skips C_i and previously evolved skips E_{t+1→t+2}, ..., E_{2T−1→2T}
        Average the losses of all tail networks and backward through the SuperNet weights once
    Get the Pareto front from {C_1, ..., C_s} and determine E_{t→t+1}
end for

We evaluated our methods on 4 tasks covering 6 public 2D and 3D datasets. Each task requires distinct segmentation targets and represents a different imaging modality. Details of the datasets are presented in Table 1.

The MoNuSeg dataset (Kumar et al., 2017) consists of tissue image tiles at 40× magnification from the TCGA (Tomczak et al., 2015) database. The original tissues were sampled from multiple patients with tumors in a variety of organs from different clinics. There are a total of 44 images, each of 1000 × 1000 pixels, divided into a training set of 30 images and a test set of 14 images. Following a common protocol (Graham et al., 2019), we extracted 512 × 512 patches at the 4 corners of each image, which enlarges the dataset by 4 times. Neither numeric normalization nor stain normalization was performed as pre-processing on this dataset.
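A short sketch of the four-corner patch extraction just described; the (H, W, C) array layout is an assumption.

```python
import numpy as np

def corner_patches(img, size=512):
    """Crop one patch at each corner of a (H, W, ...) image array."""
    h, w = img.shape[:2]
    return [img[:size, :size],           # top-left
            img[:size, w - size:],       # top-right
            img[h - size:, :size],       # bottom-left
            img[h - size:, w - size:]]   # bottom-right

tile = np.zeros((1000, 1000, 3), dtype=np.uint8)   # one MoNuSeg tile
patches = corner_patches(tile)                     # 4 patches of 512 x 512
```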
The TNBC dataset (Naylor et al., 2018) contains 50 sampled tiles from breast tissues, each of 512 × 512 pixels. There is no specific training-testing split available for TNBC. Compared to the data in MoNuSeg, TNBC tiles were extracted from considerably independent sources and were processed with different staining and color adjustment strategies. Therefore, we used TNBC as an extra validation set to evaluate the generalization ability of our proposed method on the nuclei segmentation task by training on the MoNuSeg dataset only.

The CHAOS (Combined Healthy Abdominal Organ Segmentation) dataset (Kavur et al., 2021) consists of CT and MRI volumes of different organs, including the liver, left kidney, right kidney, and spleen. Similar to (Yan et al., 2020), we use the training MRI image slices in this work to evaluate our methods on the 2D multi-class organ segmentation task. The dataset contains two types of MRI sequences, comprising 120 DICOM volumes in total: T1-DUAL (40 volumes each for in-phase and out-of-phase) and T2-SPIR (40 volumes). Each volume scans the abdomen under different radio-frequency pulse and gradient combinations. The data were acquired with a 1.5T Philips MRI scanner, which produces 12-bit DICOM images at a 256 × 256 resolution. The inter-slice distances (ISDs) vary between 5.5 and 9 mm (average 7.84 mm), the x-y spacing is between 1.36 and 1.89 mm (average 1.61 mm), and the number of slices is between 26 and 50 (average 36). In total, there are 1594 slices (532 slices per sequence) in the dataset. Before the raw sequences are processed by the networks, we performed additional pre-processing by min-max normalization and histogram auto-contrast, which extend the values to span [0, 1].

We used the recent Covid-19 chest CT dataset (Ma et al., 2020) to validate our methods on 3D segmentation tasks. Each of the 20 3D volumes in the dataset has been annotated into 4 classes: background, left lung, right lung, and Covid-19 infection. With a focus on the segmentation of Covid-19 infection regions, following (Müller et al., 2020) we combine both lung classes, resulting in a 3-class segmentation task. The CT volumes have an average of 176 slices, and were collected either from the Coronacases Initiative or from Radiopaedia, with a spatial resolution of 512 × 512 for Coronacases Initiative and 630 × 630 for Radiopaedia. For fair comparisons, we followed the pre-processing strategies introduced in (Müller et al., 2020). Specifically, based on the Hounsfield units (HU), the raw pixel intensity values within [-1250, 250] are kept, which covers the range of valid lung regions ([-1000, -700]) and Covid-19 infection regions ([50, 100]). The clipping was only applied to the Coronacases Initiative CTs. We then performed z-score normalization on the clipped data to obtain standardized gray-scale values. To resample the volumes for better neural network training, we adopted a target spacing of 1.58 × 1.58 × 2.70 mm³ and resampled all data to the median shape of 267 × 254 × 104 voxels.
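A sketch of the intensity pre-processing just described, assuming the input volume is already in Hounsfield units; resampling to the target spacing is omitted.

```python
import numpy as np

def preprocess_ct(volume_hu, clip=True):
    """Clip to [-1250, 250] HU (Coronacases Initiative only), then z-score."""
    v = np.clip(volume_hu, -1250, 250) if clip else volume_hu
    v = v.astype(np.float32)
    return (v - v.mean()) / (v.std() + 1e-8)   # standardized gray-scale values

ct = np.random.uniform(-1024, 1500, size=(176, 512, 512))
ct_norm = preprocess_ct(ct, clip=True)
```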
The Heart and Hippocampus datasets are part of the Medical Segmentation Decathlon (MSD) (Simpson et al., 2019), a commonly used 3D medical image segmentation benchmark. The Heart dataset contains 20 training MRI volumes scanned over the entire heart during a single cardiac phase. Images were originally obtained with a voxel resolution of 1.25 × 1.25 × 2.7 mm³. The target of this task is the segmentation of the left atrium, resulting in a binary segmentation task. The Hippocampus dataset includes MRI volumes acquired from 90 healthy adults and 105 adults with a non-affective psychotic disorder (56 schizophrenia, 32 schizoaffective disorder, and 17 schizophreniform disorder). All collected volumes are at a resolution of 1.0 × 1.0 × 1.0 mm³ with labels for the anterior and posterior regions, resulting in a three-class segmentation task. Our pre-processing strategy follows a common protocol (Isensee et al., 2018) with 3 main steps. First, due to the large amount of ineffective signal (i.e., background) present in the datasets, feeding the raw volumes directly to the networks is computationally expensive and time consuming. We thus cropped the data to the region of nonzero values, with at least one non-background voxel present in each volume. Second, similar to the Covid-19 dataset, we resampled the voxel spacing to ease network training. For the Heart dataset, raw volumes were resampled to a target shape of 115 × 320 × 232 voxels. For the Hippocampus dataset, the target shape is 36 × 50 × 35 voxels. Third-order spline interpolation was used for the input volumes and nearest-neighbor interpolation for the ground-truth masks. Finally, all volumes were z-score normalized individually before being fed into the networks. When the cropping step drops more than 1/4 of the average size, we applied the normalization only within the nonzero regions.

Multiple evaluation metrics are used to quantify the experimental results. For the 2D datasets (MoNuSeg, TNBC, and CHAOS), we rely on the mean Intersection over Union (mIoU) and Dice coefficient (DICE) scores to evaluate the models' performance. The two metrics are good indicators of the rate of true positive predictions. For the 3D datasets (Covid-19, Heart, and Hippocampus), metrics including the DICE score, sensitivity, and specificity are used. Additionally, the total number of network parameters and multiply-accumulate operations (MACs) are reported as measures of network complexity.

For the datasets without standardized train/test splits (CHAOS, Covid-19, Heart, and Hippocampus), we adopted 5-fold cross-validation (CV) on the training data. For all comparison experiments, we report the average results along with the standard deviation of multiple independent runs for each model. For the datasets with a clear test split (MoNuSeg and TNBC), we repeat the evaluation three times with different random seeds. For the datasets that require cross-validation (CHAOS, Covid-19, Heart, and Hippocampus), we compute the statistics across all validation folds. Additionally, we perform two-tailed paired t-tests to analyze the statistical significance between our methods and the competing methods. Smaller p-values indicate greater statistical significance of the improvements that our methods have achieved.

Our empirical experiments are designed as follows: (1) We first compare our BiO-Net and BiX-Net with other state-of-the-art segmentation networks on the 2D nuclei segmentation tasks and the multi-class organ segmentation task. (2) We then extend our networks to 3D segmentation tasks, comparing with counterparts particularly designed for 3D data. (3) Extensive ablation studies are conducted on our BiO-Net and BiX-NAS separately, using the nuclei segmentation datasets. Note that we utilized only the MoNuSeg dataset for searching the BiX-Net, and the same architecture is then transferred to all other 2D and 3D tasks.
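For reference, a minimal sketch of the two binary overlap metrics used for the 2D tasks above; the smoothing constant eps is an implementation assumption.

```python
import numpy as np

def iou_dice(pred, target, eps=1e-7):
    """Binary IoU and DICE between two boolean masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    iou = (inter + eps) / (union + eps)
    dice = (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    return iou, dice

pred = np.random.rand(512, 512) > 0.5
gt = np.random.rand(512, 512) > 0.5
print(iou_dice(pred, gt))
```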
BiO-Net implementation details. Unless explicitly specified, our BiO-Net is constructed with an encoding depth of L = 4 and a backward skip connection built at each stage of the network, except for the pre-processing and post-processing blocks. As a balance between computational efficiency and segmentation performance, we set the recurrence time to T = 3, such that all backward skip connections are triggered 2 times (relevant studies are presented in Sec. 7.2.2).

BiX-Net searching details. Following the same setup as BiO-Net, we construct the SuperNet BiO-Net++ with L = 4 and T = 3 as well, resulting in the same hierarchy in the finally searched BiX-Net. In the Phase1 search process of our BiX-NAS, we jointly optimize the weights of the SuperNet and the selection matrices using the same optimizer, rather than optimizing them separately as in (Liu et al., 2019b). The Phase1 searching process took roughly 0.09 GPU-day. In Phase2, there were in total 5 searching iterations when T = 3. At each iteration, for each retained architecture from the preceding iteration, we sampled s = 15 different skip sets to form the new P. Due to GPU memory limitations, we applied a channel expansion multiplier of N = 0.75 and retained at most three final architectures, ranked by IoU, on the Pareto front per search iteration. Our Phase2 searching process consumed 0.37 GPU-day. After the Phase2 search finished, we inserted the searched bi-directional skip connections into a re-initialized encoder-decoder architecture. For the blocks discarded by our search algorithm, we abandoned all computations (i.e., convolutions, normalizations, and activations) in such blocks but still resized the incoming features to the corresponding scale.

All our training configurations follow common usage without any explicit tuning. The competing methods were trained under identical settings for fair comparison. Note that it is a common protocol in the NAS community that searching is performed on a small dataset only and the searched network is then generalized to larger domains (Bender et al., 2018; Liu et al., 2019b; Yan et al., 2020). Such a protocol not only demands much lower search costs but also serves as a good evaluation of the generalization ability of the proposed NAS algorithms. Therefore, we strictly followed this protocol: we searched our BiX-Net on the small-scale MoNuSeg dataset only and transferred the same architecture to more complex 2D and 3D tasks.

Standard data augmentation strategies were applied to the input data on-the-fly, including: random rotation (within the range [−15°, +15°]), random translation (in both x- and y-directions, within the range [−5%, 5%]), random shearing, random zooming (within the range [0, 0.2]), and random flipping (both horizontally and vertically). We train all models using the Adam (Kingma and Ba, 2015) optimizer for 300 epochs with batch size 2. For manually crafted networks, including BiO-Net, the initial learning rate was set to 0.01 with a decay rate of 0.003 at every epoch. For all NAS-searched networks, including BiX-Net, the initial learning rate was set to 0.001 with the same decay rate. During the Phase1 search, we trained the BiO-Net++ SuperNet for 300 epochs with a base learning rate of 0.001 and a decay rate of 3e-3. During the Phase2 search, at each searching iteration, we trained the sampled candidate networks for 40 epochs, starting from a learning rate of 0.001 and decaying it by 0.1 every 10 epochs.
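The per-epoch decay above can be realized in several ways; the sketch below assumes the Keras-style inverse-time rule lr_t = lr_0 / (1 + 0.003·t), which is one common reading of "a decay rate of 0.003 at every epoch".

```python
import torch

model = torch.nn.Conv2d(3, 1, 3)                    # stand-in for BiO-Net
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 / (1.0 + 0.003 * epoch))

for epoch in range(300):
    # ... one training epoch with batch size 2 goes here ...
    optimizer.step()        # placeholder; a real loop steps once per batch
    scheduler.step()        # lr_t = 0.01 / (1 + 0.003 * t)
```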
Since there are many more training slices in the CHAOS dataset than in the nuclei segmentation datasets, we lowered the initial learning rate to 0.001, which remains consistent across both handcrafted and NAS-searched networks. The learning rate was cosine-annealing scheduled to 10⁻⁷ over 300 epochs. Data augmentations including random spatial translation within the range [−30%, 30%], random flipping, and random elastic transformation with alpha 1.5 and sigma 0.07 were used. We trained all models using the SGD optimizer for 300 epochs with momentum 0.9, weight decay 0.0001, and batch size 16.

Differing from the above 2D segmentation tasks, for the Covid-19 dataset we employed the Adam optimizer to minimize both the cross-entropy loss and the Tversky loss (Salehi et al., 2017) with alpha 0.5, beta 0.5, and batch size 2. The initial learning rate was set to 0.0003 for BiO-Net and BiX-Net and 0.001 for the other competing methods (the recommended settings in (Müller et al., 2020)). Learning rates were scaled down by 0.1 once learning stagnated, with a patience of 20 epochs. The minimum limit of the learning rate is 1e-5. Following (Müller et al., 2020), the data augmentation process includes: random mirroring, random elastic transformations, random rotations, random brightness enhancement, random contrast, random gamma corrections, and random Gaussian noise injections. The probability of triggering each augmentation strategy is 15%. Due to GPU memory limitations, the networks were trained with 160 × 160 × 80 voxel patches randomly sampled from the raw volumes. During inference, we sampled the patches via a sliding-window strategy with an overlap of half the patch size (80 × 80 × 40 voxels) to stabilize boundary predictions.

For fair comparisons on the MSD datasets, we followed the same training settings and augmentation strategies introduced in (Isensee et al., 2018). The Adam optimizer with an initial learning rate of 0.0003 was used across all experiments. We trained the networks with randomly sampled patches from the pre-cropped volumes. To better stabilize optimization, we ensure that at least 1/3 of each batch contains data from a foreground class. We denote 250 training steps as one epoch, and set the maximum epoch limit to 1000. Similar to the Covid-19 dataset, the learning rate was reduced by a factor of 5 when the training loss did not change by at least 0.005. We stopped training when the learning rate fell below 0.000001 and no significant change was observed in the training loss. The augmentation strategies include random rotations, random scaling, random elastic deformations, gamma corrections, and mirroring.
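A sketch of the Tversky loss (Salehi et al., 2017) used above; with alpha = beta = 0.5 the Tversky index reduces to the Dice coefficient. Soft probabilistic inputs and the smoothing term eps are implementation assumptions.

```python
import torch

def tversky_loss(probs, target, alpha=0.5, beta=0.5, eps=1e-7):
    """probs: (B, C, ...) softmax outputs; target: one-hot of the same shape."""
    dims = tuple(range(2, probs.dim()))              # sum over spatial dims
    tp = (probs * target).sum(dims)                  # true positives
    fp = (probs * (1 - target)).sum(dims)            # false positives
    fn = ((1 - probs) * target).sum(dims)            # false negatives
    ti = (tp + eps) / (tp + alpha * fp + beta * fn + eps)   # Tversky index
    return (1 - ti).mean()

logits = torch.randn(2, 3, 16, 16, 8)                # 3-class 3D toy batch
probs = torch.softmax(logits, dim=1)
labels = torch.randint(0, 3, (2, 16, 16, 8))
onehot = torch.nn.functional.one_hot(labels, 3).permute(0, 4, 1, 2, 3).float()
loss = tversky_loss(probs, onehot)
```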
In this section, we present the experimental results obtained following the implementation details introduced in Sec. 6. The quantitative results on the nuclei segmentation datasets are reported in Table 2. We compare our networks with three types of counterparts: (1) state-of-the-art U-Net variants: U-Net++ (Zhou et al., 2018) and Attention U-Net (Oktay et al., 2018); (2) recurrent networks: R-UNet (Wang et al., 2019) and R2U-Net (Alom et al., 2018); (3) NAS-searched networks: NAS-UNet (Weng et al., 2019), Auto-DeepLab (Liu et al., 2019a), and MS-NAS (Yan et al., 2020). Although the U-Net variants demonstrate good performance on the validation set of MoNuSeg, they require a large number of network parameters. When validated on the TNBC dataset, which has a different data distribution, the generalization ability of such networks becomes limited. One advantage of recurrent networks is that the number of parameters remains the same with increased recurrence time. By recursing through the same network layers, performance is improved on both the MoNuSeg and TNBC datasets. However, as discussed earlier, the computational costs (MACs) inevitably escalate with longer recurrence. With a good balance between segmentation performance and network complexity, the NAS-searched networks yield better generalization results on the TNBC dataset than the manually crafted U-Net variants. By incorporating our proposed bi-directional skip connections, BiO-Net demonstrates superior performance on both datasets. However, when T = 3, an increase of 75.7% in MACs is observed against the plain U-Net baseline. As expected, our proposed BiX-NAS successfully discovered an optimal network instance with minimal computational costs. Our resource-aware BiX-Net achieves on-par performance with the heavy BiO-Net on the MoNuSeg validation set and even better generalization results on the TNBC dataset.

Given the strong results achieved on binary segmentation tasks, we then examined whether the same performance gain can be obtained on a multi-class segmentation task. The results on the organ segmentation task for the CHAOS dataset are shown in Table 3. Compared to the other competing methods, our BiO-Net and BiX-Net demonstrate superior performance on all metrics across all classes. Noteworthily, BiX-Net is able to achieve the best results on a binary nuclei segmentation dataset when searched on another dataset in the same domain. However, the searched BiX-Net appears to be less effective when transferred to the CHAOS dataset for multi-organ segmentation. Such a performance gap indicates a common generalization bottleneck between different domains. With 4× lower computational complexity and 34.9× fewer parameters than BiO-Net, BiX-Net still achieved the second-best results, surpassing all other comparison methods.

After the evaluations on 2D datasets, we then validate our BiO-Net and BiX-Net on 3D segmentation tasks. First, we present the quantitative results on the Covid-19 dataset in Table 4. The compared methods include the baseline network 3D U-Net (Çiçek et al., 2016) and the state-of-the-art NAS-searched networks V-NAS and UXNet (Ji et al., 2020). Since 3D segmentation is much more challenging than the 2D tasks, the results achieved by our BiO-Net and BiX-Net are not statistically significant (i.e., p-values > 0.05) compared to the other methods, and therefore no cells are highlighted in Table 4. However, except for the DICE score on lung segmentation, our proposed networks yield the highest results on all other metrics. Our findings are as follows: (1) A naive bi-directional network recursion for 3D segmentation cannot bring the same performance enhancement as in 2D tasks. This can be observed by comparing BiO-Net with the non-recursable 3D U-Net. Our understanding is that 3D features are more complex than 2D ones, so the network easily saturates at the same capacity. The same behaviour can also be observed in Sec. 7.2.1, where BiO-Net can hardly benefit from bi-directional skip connections with limited parameters. (2) BiX-Net stands out with superior results on almost all metrics. Noticeably, BiX-Net was searched on MoNuSeg only, and its strong performance demonstrates the generalization ability of transferring the identical architecture to tasks in significantly different domains.
Finally, our proposed methods were evaluated on the Heart and Hippocampus datasets from the Medical Segmentation Decathlon. We compare our networks with the second place on the MSD official leaderboard, nnU-Net, and used their provided model checkpoints to produce their predictions. The comparisons are reported in Table 5. Due to the difficulty of the tasks, our methods only demonstrate marginal improvements. However, such improvements can be observed on nearly all the metrics. Noteworthily, as a computationally efficient mobile network, our BiX-Net is able to achieve state-of-the-art results.

For more intuitive comparisons, we present two qualitative segmentation results on each of the MoNuSeg, TNBC, CHAOS, and Covid-19 datasets in Fig. 5. Clearly, our BiO-Net and BiX-Net produce the most accurate segmentation masks. According to the visualization results, there are two primary advantages of our methods: (1) Our methods include less noise in the final predictions (MoNuSeg first case, TNBC first case). (2) Our methods produce the fewest false positive predictions (CHAOS).

We conducted extensive ablative experiments to inspect the behaviours of our methods under different setups. Unless explicitly specified, we used the nuclei segmentation datasets for the ablation studies. To better investigate the effectiveness of our backward skip connections, we study the impact of 3 different factors under different scenarios: the network size controlled by the channel expansion multiplier N, the number of backward skip connections used from the deepest encoding level W, and the encoding depth of the network L. For more comprehensive studies, we report the ablative results for different recurrence times from T = 1 to T = 3. We ignore all other influential factors and use the plain BiO-Net for evaluation.

We present the ablation results impacted by the different factors in Table 6. With increasing recurrence time, performance improvements can be easily observed in nearly all experiments. Specifically, we observe that: (1) When the number of training parameters is restricted (e.g., N = 0.25), our BiO-Net can hardly benefit from reusing convolutional layers through bi-directional skip connections. This observation aligns with the ones spotted in the 3D segmentation tasks. (2) Without significant reductions in training parameters, using fewer backward skip connections while maintaining the same forward ones (e.g., W = 3) causes a slight performance drop on MoNuSeg as T increases. However, the generalization results on TNBC appear even better than the reference setting, with a further 0.6 improvement in the DICE score. (3) Surprisingly, building bi-directional skips on a shallower network (e.g., L = 3) not only reduces the total parameters but also leads to impressive results on the MoNuSeg dataset. The same performance boost was, however, not observed on the TNBC dataset.

Given the quantitative improvements brought by our bi-directional skip connections, one may wonder whether the additional recurrence actually generates meaningful features, instead of replicated or uninformative values. We visualize the averaged feature maps inferred at different recurrence steps in Fig. 7. The visualized features clearly evolve across different time steps, validating that the recurrent inference through bi-directional skip connections is indeed able to generate indicative feature representations.

With the help of backward skip connections, our network is able to recurse through the encoder and decoder for an iterative refinement of the features. Under memory constraints, we are interested in inspecting how the recurrence time T impacts the final segmentation results.
To this end, we conducted experiments varying the recurrence time for T ∈ [1, 6]. In addition to the metrics measuring segmentation performance, we report extra metrics, including peak inference GPU memory usage and MACs, to demonstrate how the computational burdens change with increasing recurrence time. Although recursing through the encoder and decoder earns extra performance improvements, we hypothesize that our BiO-Net would eventually saturate given the fixed training parameters, resulting in performance drops after a certain number of recurrences. As shown in Fig. 8, the segmentation performance of BiO-Net consistently increases when T ≤ 3 and reaches its best results at T = 3. However, longer network recurrence harms the feature representations and leads to even worse results. When T ≥ 5, both IoU and DICE scores are even poorer than at T = 1. Also, we note that the peak GPU memory usage and MACs increase linearly with the recurrence time; hence, unrestricted recurrent inference is impractical. In Fig. 6, we compare the network predictions under different maximum T. The above observations validate the demand for a resource-aware and efficient substitute, which motivates our BiX-Net design. As a good trade-off between network performance and computational cost, we adopt T = 3 as our default setting and search BiX-Net across 6 extraction stages only (3 encoders + 3 decoders).

We claimed in Sec. 4.1 that the Phase1-searched architecture still has computational redundancies and that a subsequent evolution-based Phase2 search can spot more effective and efficient subsets of skips. We conducted two extensive experiments to prove the necessity of the proposed progressive evolutionary search by training and evaluating the SuperNet BiO-Net++ and the Phase1-searched architecture directly. We report the quantitative results of the different intermediate networks in Table 7. Compared to BiO-Net, which is constructed of building blocks and fusion functions identical to the plain U-Net, the modified BiO-Net++ requires only 1/34.86 of the parameters and 1/4 of the computations. With the help of multi-scale feature fusion, BiO-Net++ achieves even better results on both datasets. After performing BiX-NAS to reduce the skip redundancies in BiO-Net++, our Phase1-searched architecture reduces the computations by a further 8.6% with nearly no influence on the segmentation results. Finally, our Phase2-searched BiX-Net shrinks the computations of BiO-Net++ by 17% and still achieves on-par results.

We analyzed above that our proposed searching method is computationally efficient and could potentially lead to better results by adhering to the skip fairness principle. Here we provide empirical validation of these claims. First, we compare two counterpart strategies: (1) randomly search skip connections across all extraction stages from the SuperNet (i.e., without progressive evolution); and (2) progressively search skip connections and train each network instance independently (i.e., without adhering to skip fairness). Then, we construct the optimal network instances searched by the above strategies, and compare the searching cost and retraining performance on the MoNuSeg and TNBC datasets. As shown in Table 8, without an evolution schema, searching directly on randomly sampled skip connections leads to a suboptimal architecture instance: the results on both datasets are the lowest, and the search process requires a great amount of time.
As discussed earlier, a limited number of samples does not suffice to completely explore the vast search space, and the true optimal instance is unlikely to be covered. By evolving the skip connections from the last pair of extraction stages to the first, the final network instance achieves much better performance on both datasets. However, independent instance training violates the skip fairness and consumes non-negligible search time, resulting in inferior segmentation performance and an excessive search cost. Eventually, with the proposed progressive evolutionary search, our BiX-Net achieves the best performance on all metrics. By sharing the head network among the candidates, the skip fairness is fully met and the overall search time is greatly reduced. Compared to the other search strategies, our proposed algorithm demands only about half of the time, proving the efficiency and effectiveness of our progressive evolutionary search.

In this work, we studied the skip scheme in encoder-decoder architectures. Paired with forward skip connections, we proposed backward skip connections that ship decoded features back to the encoder. The bi-directional skip connections are incorporated into a simple encoder-decoder architecture, namely BiO-Net, which reuses network parameters in a recurrent manner. We analyzed the complexity overhead of the plain BiO-Net and designed the economic multi-scale BiO-Net++ with dense skip connections. A two-phase NAS algorithm, BiX-NAS, was further proposed to automatically search for the optimal subset of skip connections in BiO-Net++. In the first phase, we utilized a novel selection matrix to narrow down the search space. Our modification supports a flexible output of skips and unifies the forward behaviours during both the searching and network inference stages. In the second phase, we progressively searched for the optimal skip connections between every pair of encoder and decoder. The finally searched BiX-Net yields only 1/39.4 of the parameters and 1/4.1 of the computations of the plain BiO-Net, but achieves even better segmentation performance on different benchmarks. We evaluated our methods on 3 2D datasets and 3 3D datasets, achieving state-of-the-art performance. Extensive ablation studies were conducted to provide a more comprehensive analysis of each of the proposed methods and components. As a natural extension in the future, one can easily upgrade our BiO-Net/BiX-Net to a universal solution by adopting a 'domain adaptor' seamlessly into the networks. Moreover, it is also possible to search for more advanced structures of such a 'domain adaptor' in every building block by modifying our two-phase BiX-NAS.

References

Alom et al., 2018. Nuclei segmentation with recurrent residual convolutional neural networks.
Bender et al., 2018. Understanding and simplifying one-shot architecture search.
Chen et al., 2017. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs.
Çiçek et al., 2016. 3D U-Net: Learning dense volumetric segmentation from sparse annotation.
Graham et al., 2019. Hover-Net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images.
Guo et al., 2019. Dynamic recursive neural network.
Guo et al., 2020. Single path one-shot neural architecture search with uniform sampling.
Han et al., 2018. Image super-resolution via dual-state recurrent networks.
Haralick and Shapiro, 1985. Image segmentation techniques. Computer Vision, Graphics, and Image Processing 29.
He et al., 2016. Deep residual learning for image recognition.
Hinton and Zemel, 1994. Autoencoders, minimum description length, and Helmholtz free energy.
Hochreiter and Schmidhuber, 1997. Long short-term memory.
Huang et al., 2019. 3D U²-Net: A 3D universal U-Net for multi-domain medical image segmentation.
Ioffe and Szegedy, 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift.
Isensee et al., 2018. nnU-Net: Self-adapting framework for U-Net-based medical image segmentation.
Jang et al., 2017. Categorical reparameterization with Gumbel-Softmax.
Ji et al., 2020. UXNet: Searching multi-level feature aggregation for 3D medical image segmentation.
Kavur et al., 2021. CHAOS challenge - combined (CT-MR) healthy abdominal organ segmentation.
Kingma and Ba, 2015. Adam: A method for stochastic optimization.
Kumar et al., 2017. A dataset and a technique for generalized nuclear segmentation for computational pathology.
Liu et al., 2019a. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation.
Liu et al., 2019b. DARTS: Differentiable architecture search.
Long et al., 2015. Fully convolutional networks for semantic segmentation.
Ma et al., 2020. Towards efficient COVID-19 CT annotation: A benchmark for lung and infection segmentation.
Mehta and Sivaswamy, 2017. M-Net: A convolutional neural network for deep brain structure segmentation.
Milletari et al., 2016. V-Net: Fully convolutional neural networks for volumetric medical image segmentation.
Müller et al., 2020. Automated chest CT image segmentation of COVID-19 lung infection based on 3D U-Net.
Nair and Hinton, 2010. Rectified linear units improve restricted Boltzmann machines.
Naylor et al., 2018. Segmentation of nuclei in histopathology images by deep regression of the distance map.
Oktay et al., 2018. Attention U-Net: Learning where to look for the pancreas.
Real et al., 2019. Regularized evolution for image classifier architecture search.
Real et al., 2017. Large-scale evolution of image classifiers.
Ronneberger et al., 2015. U-Net: Convolutional networks for biomedical image segmentation.
Salehi et al., 2017. Tversky loss function for image segmentation using 3D fully convolutional deep networks.
Simpson et al., 2019. A large annotated medical image dataset for the development and evaluation of segmentation algorithms.
Tomczak et al., 2015. The Cancer Genome Atlas (TCGA): An immeasurable source of knowledge.
Tu and Bai, 2009. Auto-context and its application to high-level vision tasks and 3D brain image segmentation.
Wang et al., 2019. Recurrent U-Net for resource-constrained segmentation.
Wang et al., 2021. BiX-NAS: Searching efficient bi-directional architecture for medical image segmentation.
Weng et al., 2019. NAS-Unet: Neural architecture search for medical image segmentation.
Xia and Kulis, 2017. W-Net: A deep model for fully unsupervised image segmentation.
Xiang et al., 2020. BiO-Net: Learning recurrent bi-directional connections for encoder-decoder architecture.
Yan et al., 2020. MS-NAS: Multi-scale neural architecture search for medical image segmentation.
Yang et al., 2020. CARS: Continuous evolution for efficient neural architecture search.
Zhou et al., 2018. UNet++: A nested U-Net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (DLMIA).
Zhu et al., 2019. V-NAS: Neural architecture search for volumetric medical image segmentation.
Zoph and Le, 2016. Neural architecture search with reinforcement learning.