key: cord-0110564-e421wnz1 authors: Celaya, Adrian; Actor, Jonas A.; Muthusivarajan, Rajarajeswari; Gates, Evan; Chung, Caroline; Schellingerhout, Dawid; Riviere, Beatrice; Fuentes, David title: PocketNet: A Smaller Neural Network for Medical Image Analysis date: 2021-04-21 journal: nan DOI: nan sha: 3b8324f75d5c18c9b1e1d0077f58535944ca759a doc_id: 110564 cord_uid: e421wnz1 Medical imaging deep learning models are often large and complex, requiring specialized hardware to train and evaluate these models. To address such issues, we propose the PocketNet paradigm to reduce the size of deep learning models by throttling the growth of the number of channels in convolutional neural networks. We demonstrate that, for a range of segmentation and classification tasks, PocketNet architectures produce results comparable to that of conventional neural networks while reducing the number of parameters by multiple orders of magnitude, using up to 90% less GPU memory, and speeding up training times by up to 40%, thereby allowing such models to be trained and deployed in resource-constrained settings.
Index Terms-Neural network, segmentation, pattern recognition and classification Deep learning is an increasingly common framework for automating and standardizing essential tasks related to medical image analysis that would otherwise be subject to wide variability. For example, delineating regions of interest (i.e., image segmentation) is necessary for computer-assisted diagnosis, intervention, and therapy [1] . Manual image segmentation is a tedious, time-consuming task whose results are often subject to wide variability among users [2] , [3] . On the other hand, fully automated segmentation can substantially reduce the time required for target volume delineation and produce more consistent segmentation masks [2] , [3] . Over the last several years, deep learning methods have produced impressive results for segmentation tasks such as labeling tumors or various anatomical structures [4] - [7] . However, the performance of deep learning methods comes at an enormous computational and monetary cost, independent of concerns about data quality. Training networks to convergence can take several days or weeks using specialized computing equipment with sufficient computing capacity and available memory to handle large imaging datasets. The cost of a workstation with suitable hardware specifications for training large deep learning models ranges from roughly $5,700 to $49,000, whereas a dedicated deep learning-enabled server blade can range from $31,500 to $134,000 [8] . Cloud-based solutions offer a more economical option for training models by allowing users to pay for time on accelerated computing instances. However, the cost of such instances ranges from $3 to $32 per hour, and additional measures must be implemented to protect patient privacy [9] . The latter may involve an institution entering into a service agreement with a cloudcomputing resource provider. The cost of such an agreement is another consideration that must be taken into account. 
We wish to reduce the costs -in model size, training time, and memory requirements -associated with training deep learning models while preserving their performance. To do so, we propose the PocketNet paradigm for deep learning models, a straightforward modification for existing network architectures that dramatically reduces the number of parameters in these architectures. We demonstrate that the reduced models resulting from the PocketNet modification achieve comparable results to their full-sized counterparts. Additionally, we profile the memory and time footprints for training full-sized and PocketNet architectures, demonstrating that PocketNet models yield significant savings in hardware requirements and training time. To our knowledge, this study constitutes the first study of the effects of doubling the number of channels in convolutional neural networks for medical imaging tasks such as classification and voxel-wise segmentation. The last several years have seen several attempts to decrease the number of parameters in deep learning architectures in medical imaging. Broadly, these attempts fall into two categories: post-processing tools and architecture design strategies. More specifically, pruning (post-processing tool), depth-wise separable (DS) convolutions (architecture design strategy), and filter reductions (architecture design strategy) are known to help mitigate the overparameterization of deep convolutional neural networks (CNNs) for 3D medical image segmentation [10] - [12] . These methods, alone and combined, give rise to many of the novel, efficient deep learning architectures that are currently popular. We briefly survey these network reduction strategies below. Introduced by LeCun et al., the purpose of pruning is to remove the redundant connections within a neural network [10] . Pruning starts with a large pre-trained model and involves deleting weights and iteratively retraining the model until a significant drop in performance occurs. 
Pruning in medical image analysis reduces the required inference time and GPU memory while maintaining model performance [13] - [15] . However, network pruning is a post-processing step that is applied to an existing pre-trained network; although useful, pruning does not solve the demands of training large, overparameterized models, which requires a great deal of memory and high-end computing hardware. Network architecture design strategies for reducing the number of parameters in deep neural networks for medical image analysis include DS convolutions and reduction of the number of feature maps at each layer. DS convolutions are well known for reducing the number of parameters associated with convolution [11] , [16] , [17] . DS convolution factorizes the standard convolution operation into two distinct steps: depth-wise convolution followed by a point-wise convolution. A normal convolution layer has a number of parameters that is quadratic in the number of channels. In contrast, DS convolutions have a number of parameters that is linear in the number of channels. In practice, recently developed medical neural network architectures take advantage of DS convolution to reduce the number of parameters in their networks by up to a factor of 5 while achieving results comparable with those of conventional convolution [18] , [19] . However, 3D DS convolution is not supported in standard deep learning packages and requires more memory than standard convolution during training due to the storage of an extra gradient layer. By convention, the number of feature maps doubles after each downsampling operation in a CNN. This growth in the number of feature maps accounts for a large portion of the number of training parameters. With this in mind, perhaps the simplest way to reduce overparameterization in a deep learning model is to reduce the number of feature maps. Van der Putten et al. 
explore this idea in [12] by dividing the number of feature maps in the decoder portion of a U-Net by a constant factor r. For increasing values of r, the number of parameters in the decoder is reduced by up to a factor of 100, and segmentation performance remains the same. However, this approach introduces another hyperparameter r that needs to be tuned, does not reduce the number of parameters in the encoder branch of the U-Net architecture, and is only applicable to segmentation models. This paper makes the following novel contributions: We propose a modification to existing network architectures that dramatically reduces the number of parameters while also retaining performance. Many common network architectures for imaging tasks rely on manipulating images (or image features) at multiple scales because natural images encapsulate data at multiple resolutions. As a result, most CNN architectures -including many popular state-of-the-art methods such as nnUNet [20] and HRNet [21] -follow a pattern of downsampling and upsampling, following the intuition of the original U-Net paper [4] that popularized this approach. In the architecture first presented there, the number of feature maps (i.e., channels) in each convolution operator is doubled each time the resolution of the images decreases; the justification being that the increased number of feature maps offsets the loss of information from downsampling. This idea, of compensating for lost information by increasing the number of features, can be traced before U-Net, to the original ImageNet paper [22] and earlier. In all of these architectures, from ImageNet to U-Net to the state-of-the-art architectures today, the recurring refrain is that training is limited by compute availability due to the network's size, since the number of parameters in all of these architectures grows exponentially as the number of channels is doubled. 
However, other classical methods that manipulate images (or other signals) at multiple scales assume a hierarchy of scales i.e. that information can be decomposed into coarse-scale and fine-scale features independently; most prominent among such methods are those based on wavelets [23] and on multigrid methods [24] . Intrinsic to these methods is the construction of a series of grids of appropriate resolution. At each resolution, the constructed grid allows for the specific frequencies to be resolved, without aliasing. Because of this hierarchy of scales, and that specific sets of frequencies are resolved by specific grids at specific scales, the information capacity of coarser scales is guaranteed to be less than that of finer scales, and as a result, fewer operations (for multigrid solvers) and memory (for wavelets) are required: information "lost" by downsampling into a smaller, coarser subspace is accounted for at a grid of different resolution, and when images are downsampled to a coarser resolution, it is not necessary to double the number of channels or dimensions to preserve the information capacity at each downsampling instance. As a result, we propose that the doubling of the channels at each resolution in CNN architectures like U-Net is not intrinsically necessary, since each depth in these architectures corresponds to features at a different scale. A generic U-Net framework is fully written out in Algorithm 1. In the Block procedure of this algorithm, each convolution kernel K i has some number of channels-in and channels-out that depend on the layer's depth in the 'U' of the architecture. The overall depth D used in this architecture is commonly set to D = 3 to D = 6 [4] , [5] . If the convolutions have c in channels-in and c out channels-out at the network's finest resolution, the convolutions in the next layer will have 2c in channels-in and 2c out channels-out. 
Subsequently, at resolution depth $d$, there are $2^d c_{in}$ channels-in and $2^d c_{out}$ channels-out for the convolutions. As a result, the number of parameters in a CNN grows exponentially with increasing depth, and this exponential growth is the driving factor for the large size of image segmentation networks. We remark that Algorithm 1 is nearly identical to a single V-cycle of a geometric multigrid solver for solving the linear system of equations $Au = f$, where the linear system $A$ and the unknown variable $u$ relate to a geometric grid involving multiple resolutions [24], [25]. An algorithm for a V-cycle is shown in Algorithm 2.
[Algorithm 2: Multigrid V-Cycle, adapted from [25].]
The PocketNet paradigm, defined in Definition 2.1, exploits this similarity between multigrid methods and U-Net-like architectures. This paradigm proposes that the number of feature maps used at the finest resolution is sufficiently rich to capture the relevant information for the imaging task at hand and that doubling the number of channels is unnecessary. Instead of doubling the number of feature maps at every level of a CNN, we keep them constant, substantially reducing the number of parameters in our models in the process. We designate network architectures that keep the number of feature maps constant as Pocket Networks, or PocketNets for short, in the sense that these networks are "small enough to fit in one's back pocket". For example, "Pocket U-Net" refers to a U-Net architecture where we apply our proposed modification. Figure 1 provides a visual representation of the PocketNet paradigm applied to a U-Net. We present a formalized definition of the PocketNet paradigm in Definition 2.1. Definition 2.1: A network architecture obeys the PocketNet paradigm if the range of all convolution operators (except the final output layer) present in the network is a subset of $\mathbb{R}^{h \times w \times c_{out}}$, where $c_{out}$ is fixed.
Such a network is called a Pocket Network, or PocketNet for short. While the relationship between multigrid algorithms and U-Nets has been explored (see [26] and references therein), the PocketNet paradigm is applicable to any CNN architecture, regardless of whether or not the architecture is similar in overall structure to a U-Net. The number of parameters saved in the PocketNet architectures is substantial. Assume that a PocketNet and its corresponding full-sized architecture are identical apart from the doubling of the number of channels in standard convolutions. For simplicity's sake, we further assume that the number of convolutions performed at each resolution is the same, denoted by $C$. Also for simplicity's sake, we assume that the window of each convolution kernel is isotropic, with a stencil width of $k$ in each dimension. We denote the maximum resolution depth, i.e., the number of downsampling operations in the network architecture, as $D$. In a full-sized network, a convolution kernel at resolution depth $d$ operating on an $n$D image (for $n = 2$ or $n = 3$) with a stencil width $k$ has $k^n (2^d c_{in})(2^d c_{out})$ parameters. Therefore, the full-sized network has the following number of parameters:
$$\sum_{d=0}^{D} C k^n (2^d c_{in})(2^d c_{out}) = C k^n c_{in} c_{out} \sum_{d=0}^{D} 4^d = \frac{4^{D+1} - 1}{3} \, C k^n c_{in} c_{out}. \quad (1)$$
On the other hand, the convolution kernels in a PocketNet architecture have $k^n c_{in} c_{out}$ parameters at each resolution depth, resulting in the following number of parameters:
$$\sum_{d=0}^{D} C k^n c_{in} c_{out} = (D + 1) \, C k^n c_{in} c_{out}. \quad (2)$$
Therefore, the PocketNet paradigm reduces the number of parameters in a network by up to a factor of
$$\frac{4^{D+1} - 1}{3 (D + 1)}.$$
This analysis highlights that the growth of parameters with increasing depth is exponential for full networks but only linear for PocketNets.
B. Experiments
1) Data: We test PocketNet on a range of recent public medical imaging challenge datasets. These datasets encompass three segmentation problems and one classification task, all with various data set sizes, complexity, and modalities.
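The parameter counts in Equations (1) and (2) can be checked numerically. The following is a minimal sketch under the same simplifying assumptions as the analysis (C convolutions per resolution, depths d = 0 to D, isotropic stencil of width k, biases ignored). The function names and the example values (C = 2, k = 3, n = 3, 32 channels) are illustrative choices, not taken from the authors' code.

```python
def conv_params(k, n, c_in, c_out):
    """Weights in one n-D convolution with stencil width k (biases ignored)."""
    return (k ** n) * c_in * c_out


def full_network_params(D, C, k, n, c_in, c_out):
    """Channels double at every depth d = 0..D, with C convolutions per depth."""
    return sum(
        C * conv_params(k, n, c_in * 2 ** d, c_out * 2 ** d)
        for d in range(D + 1)
    )


def pocket_network_params(D, C, k, n, c_in, c_out):
    """Channels stay fixed at every depth d = 0..D."""
    return (D + 1) * C * conv_params(k, n, c_in, c_out)


def reduction_factor(D):
    """Ratio full / pocket from the geometric sum: (4^(D+1) - 1) / (3 (D + 1))."""
    return (4 ** (D + 1) - 1) / (3 * (D + 1))


if __name__ == "__main__":
    # Illustrative settings only: 3D convolutions, 3x3x3 stencil, 32 channels.
    for D in range(3, 7):
        full = full_network_params(D, C=2, k=3, n=3, c_in=32, c_out=32)
        pocket = pocket_network_params(D, C=2, k=3, n=3, c_in=32, c_out=32)
        print(D, full, pocket, round(full / pocket, 1))
```

Under these assumptions the ratio depends only on D, which is why the savings grow quickly as networks get deeper. Real architectures also contain decoder, skip, and output layers that this count ignores.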
Two of our segmentation tasks -binary liver segmentation in the MICCAI Liver and Tumor Segmentation (LiTS) Challenge 2017 dataset [27] and single-contrast brain extraction in the Neurofeedback Skull-stripped (NFBS) repository [28] -are comparatively simple. We use these datasets for baseline comparisons, much like the MNIST Fashion and CIFAR-10 datasets are used to evaluate newly proposed classification architectures. The third segmentation task -multilabel tumor segmentation in the MICCAI Brain Tumor Segmentation (BraTS) Challenge 2020 dataset [29] - [31] -is commonly viewed as a more complex and technical segmentation problem than LiTS or NFBS segmentation. Therefore, we use BraTS data for our performance benchmarks (see Section III-B). Finally, our classification task -binary classification in the COVIDx8B dataset [32] -evaluates the PocketNet paradigm with a much larger dataset, demonstrating its appropriateness for pertinent problems other than image segmentation. These datasets and their related pre-processing and post-processing methods are described below. a) LiTS Data: For the LiTS dataset, we perform binary liver segmentation. This dataset consists of the 131 CT scans from the MICCAI 2017 Challenge's multi-institutional training set. These scans vary significantly in the number of slices in the axial direction and voxel resolution, although all axial slices are at 512×512 resolution. As a result, we use the preprocessing steps proposed by nnUNet to handle this variability [20] . We resample each image to the median resolution of the training data in the x and y-directions and use the 90th percentile resolution in the z-direction. For intensity normalization, we window each image according to the foreground voxels' 0.5 and 99.5 percentile intensity values across all of the training data. This scheme results in windowing from -17 to 201 HU. We also apply z-score normalization according to the foreground voxels' mean and standard deviation. 
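The intensity normalization described for the LiTS data (windowing followed by z-score normalization) can be sketched as follows. This is a simplified illustration on a flat list of intensities. The function name is hypothetical, and when no mean and standard deviation are passed in, the sketch falls back to the statistics of the clipped input itself, whereas the text computes them from foreground voxels across the training data.

```python
import statistics


def window_and_zscore(voxels, lower=-17.0, upper=201.0, mean=None, std=None):
    """Clip intensities to [lower, upper] HU, then z-score normalize.

    mean/std would normally come from the training-set foreground voxels;
    if omitted, they are estimated from the clipped input (an assumption
    made here only to keep the sketch self-contained).
    """
    clipped = [min(max(v, lower), upper) for v in voxels]
    if mean is None:
        mean = statistics.fmean(clipped)
    if std is None:
        std = statistics.pstdev(clipped)
    if std == 0:
        # Constant input: return zeros instead of dividing by zero.
        return [0.0 for _ in clipped]
    return [(v - mean) / std for v in clipped]


if __name__ == "__main__":
    print(window_and_zscore([-1000.0, -17.0, 92.0, 201.0, 3000.0]))
```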
The LiTS dataset is available for download at https://competitions.codalab.org/competitions/17094#learn_the_details-overview. b) NFBS Data: The segmentation task for the NFBS dataset is extraction (i.e., segmentation) of the brain from MR data. The NFBS dataset consists of 125 T1-weighted MR images with manually labeled ground truth masks. All images are provided with an isotropic voxel resolution of 1×1×1 mm³ and are of 256×256×192 resolution. For pre-processing, we apply z-score intensity normalization. The NFBS dataset is available for download at http://preprocessed-connectomes-project.org/NFB_skullstripped/. c) BraTS Data: The BraTS training set contains 369 multimodal scans from 19 institutions. Each set of scans includes a T1-weighted, post-contrast T1-weighted, T2-weighted, and T2 Fluid Attenuated Inversion Recovery volume and a multilabel ground truth segmentation. We merge the labels in each ground truth segmentation and perform whole tumor segmentation for our analysis. All volumes are provided at an isotropic voxel resolution of 1×1×1 mm³, co-registered to one another, and skull stripped, with a size of 240×240×155. We crop each image according to the brain mask (i.e., non-zero voxels) and apply z-score intensity normalization on only non-zero voxels for pre-processing. The BraTS training dataset is available for download at https://www.med.upenn.edu/cbica/brats2020/registration.html. d) COVIDx8B Data: The classification task for the COVIDx8B dataset is COVID-19 detection on 2D chest x-rays. The COVIDx8B dataset consists of a training set with 15,952 images and an independent test set with 400 images. The training set is class-imbalanced, with 13,794 COVID-19-negative images and 2,158 COVID-19-positive images. However, the COVIDx8B test set is class-balanced, with 200 COVID-19-negative and 200 COVID-19-positive images. We resize each image to a resolution of 256×256 and apply z-score intensity normalization as pre-processing steps.
The COVIDx8B dataset is available for download at https://github.com/lindawangg/COVID-Net/blob/master/docs/COVIDx.md. 2) Network Architectures: For each of our tasks and datasets, we compare the performance of various full-sized architectures with their PocketNet counterparts. a) Segmentation Architectures: We examine the effects of our proposed modification strategy using four segmentation architectures: U-Net [5], ResNet [7], DenseNet [6], and HRNet [21]. The first three architectures possess a standard U-Net backbone [4]. The proposed U-Net, ResNet, and DenseNet architectures differ in their block designs, as seen in Figure 2. The HRNet architecture does not utilize a U-Net backbone. Instead, it maintains high-resolution features throughout the architecture. b) COVIDx8B Architectures: Our classification architecture for the COVIDx8B dataset is a U-Net encoder with four downsampling layers. The final layer of the U-Net encoder is flattened via global max-pooling and passed to a fully connected layer for the network's final output. Visually, this architecture is represented by the left half of the Pocket U-Net architecture in Figure 1. 3) Training Protocols and Hyperparameters: For each task (e.g., segmentation and classification), we initialize the first layer in each network with 32 feature maps and use the Adam optimizer [33]. The initial learning rate is set to 0.001 and, once learning stagnates, is reduced by a factor of 0.90. Training uses a batch size of two for all segmentation tasks and a batch size of 32 for COVIDx8B classification. For all segmentation tasks, we use a patch size of 128 × 128 × 128 and apply random flips, additive noise, and Gaussian blur to each batch as data augmentation. To evaluate a predicted segmentation mask's validity, we use the Sørensen-Dice Similarity Coefficient (Dice), the 95th percentile Hausdorff distance, and the average surface distance.
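As a point of reference for the first of these metrics, the Dice coefficient between two binary masks is 2|P ∩ T| / (|P| + |T|). The sketch below computes it on flattened 0/1 masks in plain Python. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, and this is not the SimpleITK implementation used in the paper.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 |P ∩ T| / (|P| + |T|) for flat binary masks with 0/1 entries."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    if total == 0:
        # Both masks empty: treated as perfect agreement (a convention, assumed here).
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    # One overlapping voxel, two predicted, one true: 2 * 1 / (2 + 1).
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
```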
Implementations of these metrics are available through the SimpleITK Python package [34]-[36]. For the segmentation tasks, our loss function is calculated as an $L_2$ relaxation of the Dice score; for a true segmentation $Y_{true}$ and a predicted segmentation $Y_{pred}$, our $L_2$-Dice loss function is taken from [37]. For the classification tasks, we use categorical cross-entropy as our loss function, with outputs being of two classes (COVID-19 positive and negative). To evaluate each classification model's validity, we use the receiver operating characteristic area under the curve (AUC) metric. This metric is available via the scikit-learn Python package. Our models are implemented in Python using TensorFlow (v2.8.0) and trained on an NVIDIA Quadro RTX 8000 GPU [38]. All network weights are initialized using the default initializers from TensorFlow, and all other hyperparameters are left at their default values. The code for each network architecture is available at github.com/aecelaya/pocketnet. We perform inference on test images using a sliding window approach for each segmentation task, where the window size equals the patch size set during training. After each window prediction, we slide the window by half the size of the patch. Additionally, we apply a Gaussian importance weighting (σ = 0.125) to each window prediction [20]. As a post-processing step, we take the largest connected component in each image. We use the training parameters described in Section II-B3 to train the architectures listed in Section II-B2. For the segmentation tasks described in Section II-B1, we employ a five-fold cross-validation scheme to obtain predictions for each dataset and architecture. For classification on the COVIDx8B dataset, we train each model with the training set and generate predictions on the test set. The results of these experiments are shown in Tables I and II.
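The sliding-window inference described above (window equal to the patch size, stride of half a patch, Gaussian importance weighting) can be illustrated in one dimension. This is a minimal sketch, not the authors' implementation. It assumes that σ = 0.125 is expressed as a fraction of the patch size, as in nnU-Net, and that `predict` is a hypothetical stand-in for a trained network that returns one value per input voxel.

```python
import math


def window_starts(length, patch, stride):
    """Start indices of windows covering [0, length).

    The last window is shifted back so that it ends at the image border.
    """
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)
    return starts


def gaussian_weights(patch, sigma_scale=0.125):
    """Gaussian importance map centered on the patch (sigma = sigma_scale * patch)."""
    center = (patch - 1) / 2.0
    sigma = sigma_scale * patch
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(patch)]


def sliding_window_predict(signal, predict, patch, sigma_scale=0.125):
    """Blend overlapping window predictions with Gaussian weights."""
    stride = max(1, patch // 2)  # slide by half the patch size
    weights = gaussian_weights(patch, sigma_scale)
    acc = [0.0] * len(signal)   # weighted sum of predictions
    norm = [0.0] * len(signal)  # sum of weights per voxel
    for start in window_starts(len(signal), patch, stride):
        out = predict(signal[start:start + patch])
        for i, (value, w) in enumerate(zip(out, weights)):
            acc[start + i] += w * value
            norm[start + i] += w
    return [a / n for a, n in zip(acc, norm)]


if __name__ == "__main__":
    # With an identity "network", blending should reproduce the input.
    data = [float(i) for i in range(11)]
    print(sliding_window_predict(data, lambda window: list(window), patch=4))
```

The weighting gives voxels near the center of each window more influence than voxels near its border, where predictions tend to be less reliable.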
Generally, we do not see significant (p < 0.05 [Wilcoxon signed-rank test]) differences in performance between PocketNet and full-sized architectures. In the cases where the differences are significant, they are small (≤ 1%). These insignificant or minor differences in performance are obtained despite a reduction in the number of parameters by more than an order of magnitude. Using the training parameters described in Section II-B3, we profile the training performance of a full U-Net and a Pocket U-Net using the BraTS dataset. Namely, we measure peak GPU memory utilization during training and the average time per training step for varying batch sizes for each network using the TensorFlow Profiler [39]. To ensure accurate comparisons of performance, we conduct these experiments on a Google Colaboratory notebook with a dedicated NVIDIA Tesla T4 GPU with 16 GB of available memory. The GPU memory usage and training time per step for this experiment are shown in Figure 3; we see that our PocketNet architecture reduces memory usage and speeds up training time for every batch size. Specifically, the Pocket U-Net reduces the peak memory usage for training by between 28.3% and 87.7%, with smaller batch sizes resulting in greater savings. This relationship is possibly due to the increasing portion of GPU memory allocated for storing data as the batch size increases. The PocketNet models improve the average time per training step by between 25.0% and 43.2%, with larger batches yielding greater time savings. This behavior may be due to larger batch sizes taking advantage of the computational parallelism of modern GPUs. In addition to training performance, we profile the inference throughput of full-sized networks and their PocketNet counterparts. Table III shows the inference throughput of each architecture for various image sizes.
For architectures with a standard U-Net backbone (i.e., U-Net, ResNet, and DenseNet), we see modest improvements in inference speed for the PocketNet variants that range from 1% to 12%. A possible explanation for these minor improvements in inference speed is the highly parallelized computation of convolution operators on modern GPUs. For HRNet, we see more significant improvements in inference throughput ranging from 12% to 17%. Within the HRNet architecture, we see more non-convolution operations like upsampling, downsampling, and addition than in U-shaped architectures, which may explain the more significant increases in throughput for the Pocket HRNet architecture. To assess the effects of feature map doubling in U-shaped architectures, we perform an ablation study on a standard U-Net using the LiTS dataset. In the first iteration of the ablation study, we start with a Full U-Net where we double the number of feature maps at every resolution level. In the next iteration, we construct a U-Net where we stop doubling feature maps after the second-to-last resolution level (i.e., 8 For each of these networks, we perform a five-fold cross-validation using training and inference parameters described in Sections II-B3 and II-B4. Table IV shows the results of each iteration. In every case, we see small differences in the distribution of the resulting Dice scores. This small difference in performance among iterations suggests that doubling the number of feature maps at each resolution level might be unnecessary. We conjecture that the comparable performance between the PocketNet and full architectures is due to both networks having similar representation capabilities, i.e., that ultimately both networks build similar representations of the image data as they compute the final segmentations.
To test whether the networks learn similar features, we look at the mean of each feature map in the final layer, averaged over all voxels and all test patches,
$$\bar{f}_i = \frac{1}{N_{test}} \sum_{j=1}^{N_{test}} \frac{1}{V_j} \sum_{k=1}^{V_j} f^j_{i,k},$$
where $N_{test}$ is the number of patches in the test set, $V_j$ is the volume of test patch $j$, $f^j_i$ is the $i$-th feature map for test patch $j$, and $f^j_{i,k}$ is the intensity of the $k$-th voxel in $f^j_i$. Figure 4 shows the averages of the resulting feature maps. We see a similar number of features being activated with similar intensities for both cases. This similarity suggests that the full and Pocket U-Nets learn similar latent feature representations used for the final voxel-wise classification. Note that we sort the mean feature activations from highest to lowest for visual purposes. This order does not matter because the indexing in any hidden layer can always be permuted by the next layer. A possible concern is that PocketNet models, due to their reduced parameter count, could saturate earlier during training than do full-sized architectures, which could result in the comparable performance we observe in our results in Section III-A. To test this, we repeat the experiments described above for the COVIDx8B and BraTS data challenges using successively less data in the training set. For every iteration, we keep a fixed validation and test set. For the COVIDx8B dataset, we fix 10% of the training data as a validation set and use the original test set. Similarly, for the BraTS data, we take 20% of the training data as a test set and use 10% of the remaining patients as a validation set. Additionally, we do not use data augmentation for this particular experiment. The results of this experiment are shown in Figure 5a and Figure 5b. In Figure 5a, we see that the AUC values increase for both the PocketNet and full architectures as the size of the training set increases. Furthermore, we observe that both of these AUC values plateau to 1.0 (i.e., perfect prediction) and the PocketNet classifier saturates sooner than its full-sized counterpart does.
These observations suggest that the reduced architecture resulting from the PocketNet paradigm learns faster with fewer data points than its full-sized counterpart. Similarly, Figure 5b shows that the segmentation accuracy of the full-sized and PocketNet BraTS U-Net architectures improves as the number of data points included during training increases, and that both architecture types show the expected improvement in performance with each increase in the dataset size, plateauing to similar distributions. Our results show that large numbers of parameters (millions or tens of millions) may not be necessary for deep learning in medical image analysis, as comparable performance is achievable with substantially smaller networks using the same architectures but without doubling the number of channels at coarser resolutions. This suggests that overparameterization, which is increasingly regarded as a key reason why neural networks learn efficiently, might not be as critical as previously suggested [40]-[42]. However, we note that our PocketNets may still be overparameterized, and the combination of our proposed PocketNet paradigm with other model reduction techniques should be explored. For example, replacing the traditional convolution layers with DS convolution layers in our Pocket ResNet for LiTS liver segmentation further reduces the number of parameters to roughly 10,000. Pruning an already trained PocketNet model may also yield further parameter reductions. The deep learning tasks presented in this study are all single-label segmentation or binary classification. The goal of ongoing and future work using the PocketNet paradigm is to test this approach on more complex domains such as BraTS multi-class tumor segmentation and LiTS tumor segmentation. Figure 6 shows an example of a multiclass segmentation prediction mask produced by a Pocket DenseNet.
Our results for PocketNet architectures applied to the BraTS multi-class segmentation task are available at https://www.cbica.upenn.edu/BraTS20/lboardValidation.html under the team name "aecmda" and will be updated periodically. When we employ PocketNet models, we achieve similar performance to full-sized networks while enjoying the advantages of faster training times and lower memory requirements. The smaller models produced by our proposed PocketNet paradigm can potentially lower the entry costs (computational and monetary) of training deep learning models in resource-constrained environments without access to specialized computing equipment. With less GPU memory required for training, cheaper hardware can be purchased, or less expensive cloud computing instances can be used to train deep learning models for medical image analysis. The faster training times for PocketNets can also reduce costs by reducing the number of hours spent training models on cloud computing instances. As to why the PocketNet models perform at least comparably to their full-sized counterparts, the similarity in intensities of each signal in the final layer activation maps suggests that the models ultimately learn the same representations. Despite the reduced number of parameters, the approximation space that the Pocket architectures can represent is comparable to the approximation space that the full models can achieve. This conjecture is supported by the ablation study results in Table IV: the median Dice scores for, e.g., the U-Net are nearly identical regardless of the depth at which the number of feature maps stops doubling. In this study, the additional features supplied at depths where the doubling continued did not improve the model's performance, as the approximation spaces are all similar regardless of whether the number of features was doubled at that depth.
This ablation study suggests that, because larger models whose number of features doubles at each layer offer no added benefit, doubling the number of channels is unnecessary for these medical imaging problems, and PocketNet models can be used to achieve comparable accuracy instead. Since these smaller models are just as expressive (and capable) as their full counterparts, they can be trained (and later deployed) with cheaper hardware or by provisioning smaller cloud instances, saving time, money, and effort for institutions performing deep learning medical image analysis.
[1] Automated medical image segmentation techniques
[2] Evaluation of an automatic segmentation algorithm for definition of head and neck organs at risk
[3] Fully automated brain resection cavity delineation for radiation target volume definition in glioblastoma patients using deep learning
[4] U-Net: Convolutional networks for biomedical image segmentation
[5] 3D U-Net: Learning dense volumetric segmentation from sparse annotation
[6] Densely connected convolutional networks
[7] Identity mappings in deep residual networks
[8] GPU cloud, workstations, servers, laptops for deep learning
[9] Amazon AWS EC2 pricing
[10] Optimal brain damage
[11] 3D depthwise convolution: Reducing model parameters in 3D vision tasks
[12] Influence of decoder size for binary segmentation tasks in medical imaging
[13] UNet++: A nested U-Net architecture for medical image segmentation
[14] NAS-UNet: Neural architecture search for medical image segmentation
[15] Medical image segmentation algorithm based on feedback mechanism convolutional neural network
[16] Xception: Deep learning with depthwise separable convolutions
[17] MobileNets: Efficient convolutional neural networks for mobile vision applications
[18] Efficient 3D deep learning model for medical image semantic segmentation
[19] X-Net: Brain stroke lesion segmentation based on depthwise separable convolution and long-range dependencies
[20] nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation
[21] Deep high-resolution representation learning for human pose estimation
[22] ImageNet classification with deep convolutional neural networks
[23] Image processing and analysis: Variational, PDE, wavelet, and stochastic methods
[24] Multigrid methods. Routledge
[25] Iterative methods for sparse linear systems
[26] MgNet: A unified framework of multigrid and convolutional neural network
[27] The liver tumor segmentation benchmark (LiTS)
[28] The preprocessed connectomes project repository of manually corrected skull-stripped T1-weighted anatomical MRI data
[29] The multimodal brain tumor image segmentation benchmark (BraTS)
[30] Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BraTS challenge
[31] Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features
[32] COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images
[33] Adam: A method for stochastic optimization
[34] The design of SimpleITK
[35] SimpleITK image-analysis notebooks: A collaborative environment for education and reproducible research
[36] Image segmentation, registration and characterization in R with SimpleITK
[37] Identification of kernels in a convolutional neural network: Connections between the level set equation and deep learning for image segmentation
[38] Keras
[39] TensorFlow: Large-scale machine learning on heterogeneous systems
[40] A convergence theory for deep learning via over-parameterization
[41] On the optimization of deep networks: Implicit acceleration by overparameterization
[42] Overfitting in adversarially robust deep learning