key: cord-0911984-1u5olzz1 authors: Nirthika, Rajendran; Manivannan, Siyamalan; Ramanan, Amirthalingam; Wang, Ruixuan title: Pooling in convolutional neural networks for medical image analysis: a survey and an empirical study date: 2022-02-01 journal: Neural Comput Appl DOI: 10.1007/s00521-022-06953-8 sha: b2d28c909752afa746aface3233bf6f0ca0c3b89 doc_id: 911984 cord_uid: 1u5olzz1 Convolutional neural networks (CNN) are widely used in computer vision and medical image analysis as the state-of-the-art technique. In CNN, pooling layers are included mainly for downsampling the feature maps by aggregating features from local regions. Pooling can help CNN to learn invariant features and reduce computational complexity. Although the max and the average pooling are the widely used ones, various other pooling techniques are also proposed for different purposes, which include techniques to reduce overfitting, to capture higher-order information such as correlation between features, to capture spatial or structural information, etc. As not all of these pooling techniques are well-explored for medical image analysis, this paper provides a comprehensive review of various pooling techniques proposed in the literature of computer vision and medical image analysis. In addition, an extensive set of experiments are conducted to compare a selected set of pooling techniques on two different medical image classification problems, namely HEp-2 cells and diabetic retinopathy image classification. Experiments suggest that the most appropriate pooling mechanism for a particular classification task is related to the scale of the class-specific features with respect to the image size. As this is the first work focusing on pooling techniques for the application of medical image analysis, we believe that this review and the comparative study will provide a guideline to the choice of pooling mechanisms for various medical image analysis tasks. In addition, by carefully choosing the pooling operations with the standard ResNet architecture, we show new state-of-the-art results on both HEp-2 cells and diabetic retinopathy image datasets. Convolutional neural networks (CNNs) are the state-of-theart methods for various computer vision and medical image analysis tasks such as image classification [55, 95, 98, 109, 124, 126] and segmentation [35, 109, 118] . CNN often consists of multiple convolutional layers followed by one or more fully connected layers, where each convolutional layer often includes convolution, nonlinear activation and optionally pooling operators. The purpose of pooling is mainly to down-sample the feature maps and to learn larger-scale image features that are invariant to small local transformations (e.g., translation, scaling, and rotation). It is a process of aggregating the features from each spatial region, e.g., averaging the values in each 3  3 region at each feature channel. Pooling does not only increase the size of the receptive field of convolutional kernels (neurons) over layers, but also reduces the computational complexity and the memory requirements as it reduces the resolution of the feature maps while preserving important features that are needed for processing by the subsequent layers. In medical image analysis, pooling can help to handle variance in lesion sizes [3] and positions [94] . Various pooling methods have been proposed for different purposes. 
For example, soft pooling (e.g., [42, 55, 90, 124, 129] ) is proposed to take advantages of both the widely used max and average pooling; stochastic pooling (e.g., [39, 101, 132, 142, 143] ) is proposed to overcome the overfitting issue in CNN training; spatial pyramid pooling and its variants are to capture spatial or structural information in the images (e.g., [45, 91, 130] ); higher-order pooling (e.g., [25, 31, 34, [69] [70] [71] 141] ) is to capture higher-order statistical information of the feature maps, etc. However, most of these approaches were proposed for and evaluated on computer vision image datasets (e.g., PASCAL VOC 2012 [29] , Cityscapes [23] , CIFAR-10 [58] ) and their applicability for medical image classification has not been well-investigated. In this work, we review different pooling methods proposed in computer vision and medical imaging literature, and report examples of medical imaging applications where some of these pooling methods are used (refer Table 2 ). In addition, we conduct an experimental study to compare the performance of pooling methods on two different medical image classification tasks, i.e., classifications of HEp-2 cells and diabetic retinopathy images. Selection of identified papers for review: An initial selection of papers was done by the aid of Google scholar. Different keywords related to pooling (e.g., pooling, pooling in CNN, pooling in medical imaging, attention weighted pooling, feature aggregation, etc.) were used to identify relevant papers. As the majority of the identified papers use existing pooling techniques, the papers which propose novel pooling approaches were mainly identified, and selected for review. This gave us around 121 papers in total, among them 87 papers proposed different pooling techniques and in 34 papers different pooling techniques are applied for different tasks. Among the selected papers, 90 and 31 papers, respectively, discuss pooling methods in computer vision and medical imaging. The main contributions of this work include: • To our best knowledge, this is the first work to review various pooling methods in deep learning particularly for medical imaging applications. • As many of the pooling methods (e.g., higher-order pooling [25, 31, 34, [69] [70] [71] 141] ) have not been explored for medical imaging, we perform an extensive set of comparative experiments on selected pooling methods to investigate their performance on two public medical image datasets. The rest of this paper is organized as follows. Section 2 reviews the work related to different pooling methods proposed in computer vision and medical image analysis. Section 3 summarizes the dataset and the experimental settings. Results are reported and discussed in Sect. 4 . A detail discussion about our work is given in Sect. 5 and Sect. 6 concludes this paper. There are two groups of pooling generally used in CNNs. The first one is local pooling, where the pooling is performed from small local regions (e.g., 3  3) to downsample the feature maps. The second one is global pooling, which is performed from each of the entire feature map to get a scalar value of a feature vector for image representation. This representation is then passed to the fully connected layers for classification. For example, there are four local pooling and one global pooling layers included in the well-known DenseNet [51] . The part of a feature map (channel) within the pooling region P. The channel index is omitted for simplicity. 
Table 2 (reconstructed) Pooling techniques reviewed in this survey and example applications. For each technique: computer vision applications | medical imaging applications (imaging modality); "--" indicates no reported application.

Max and average pooling:
- Max and/or average pooling: Classification [46], Segmentation [90], Object localization [111] | Image classification and localization of lesions [93, 126] (Retina); Cell image classification [77] (HEp-2 cells); Image classification and detection of pneumonia [95] (X-ray, chest); Weakly supervised learning [55] (X-ray, chest); Multiple sclerosis identification [122] (MRI, brain)

Linear combination of max and average pooling:
- Mixed max-average pooling [63]: Classification [63] | --
- Gated max-average pooling [63]: Classification [63] | --
- Dynamic correlation pooling [11]: Classification [11] | --

Soft pooling:
- Generalized mean (GM) pooling: Segmentation [129], Classification [7] | Multiple instance learning [135] (Histopathology)
- Root-mean-square pooling [53]: Classification [53] | --
- Log-sum-exp pooling [90]: Segmentation [90] | Weakly supervised classification and localization of thorax diseases [124] (X-ray, chest); Proximal femur fractures [55] (X-ray, bone); Histopathology cancer image classification [135] (Histopathology)
- Polynomial pooling [129]: Segmentation [129] | --
- Learned-norm pooling [42]: Classification [42] | --
- ℓp pooling [7]: Classification [7] | --
- Rank-based pooling [101]: Classification [101] | Cerebral micro-bleed detection [120] (MRI, brain)
- Multipartite pooling [99]: Classification [99] | --
- Ordinal pooling [60]: Classification [60] | --
- Multi-activation pooling [151]: Classification [151] | --
- aI pooling [28]: Classification [28] | --
- Global feature guided local pooling [57]: Classification [57] | --
- SQUare-root (SQU) pooling [15]: Image instance retrieval [15] | --

Stochastic pooling to handle overfitting:
- Stochastic pooling [142]: Classification [142] | Multiple sclerosis identification [122] (MRI, brain); Alcoholism detection [121] (MRI, brain); COVID-19 diagnosis [149] (CT, chest)
- Rank-based stochastic pooling [101]: Classification [101] | Abnormal breast identification [148] (Breast)
- Mixed pooling [139]: Classification [139] | Brain tumor segmentation [10] (MRI, brain)
- Hybrid pooling [112]: Classification [82, 112, 113] | --
- Max pooling dropout [132]: Classification [132] | --
- S3 pooling [143]: Classification [143] | --
- Fractional max pooling [39]: Classification [39] | Retinopathy image classification [40] (Retina)
- Sparsity-based stochastic pooling [104]: Classification [104] | --
- Pooling method of [100]: Classification [100] | --
- PatchShuffle stochastic pooling [123]: -- | Diagnosis of COVID-19 [123] (CT, chest)

Pooling to encode spatial or structural information:
- Spatial pyramid pooling [45]: Classification and detection [45], Hand gesture recognition [110], Image steganalysis [146] | Brain image segmentation [118] (MRI, brain); Prostate image segmentation [35] (MRI, prostate); Tumor segmentation for rectal cancer radiotherapy [79] (MRI, CT, rectum)
- Concentric circle pooling [91]: Remote sensing scene classification [91] | --
- Polycentric circle pooling [92]: Remote sensing image recognition [92] | --
- Pose pooling kernels [145]: Fine-grained image classification [145] | --
- Geometric ℓp-norm pooling [30]: Classification [30] | --
- Cell pyramid matching [130] (non-CNN): -- | Cell image classification [77, 130] (HEp-2 cells)
- Multi-pooling [117]: -- | Brain tumor segmentation [117] (MRI, brain)
- Donut-shaped spatial pooling [62]: -- | Cell image classification [62] (HEp-2 cells)
- Structure-based graph pooling [14]: Action recognition [14] | --
- Atrous spatial pyramid pooling (ASPP) [12]: Segmentation [12] | Multi-scale retinal vessel segmentation [134] (Retina)

Higher-order pooling:
- Second-order pooling [9]: Classification, Segmentation [9] | --
- Bilinear pooling [71]: Fine-grained classification [71] | --
- Improved bilinear pooling [70]: Fine-grained classification [70] | --
- α-pooling [102]: Fine-grained classification [102] | --
- Statistically-motivated second-order pooling [141]: Classification, Fine-grained classification [141] | --
- Global second-order pooling [34]: Classification [34] | --
- Kernel pooling [25]: Classification [25] | --
- Global covariance pooling [68]: Classification [68] | --

Other pooling techniques:
- Deep generalized max pooling [20]: Writer identification and document classification [20] | --
- Adaptive spatial pooling [75]: Classification [75] | Retrieving brain tumors [17] (CE-MRI, brain)
- Deep adaptive temporal pooling (DATP) [103]: Human activity recognition [103] | --
- Dynamic temporal pooling [64]: Time series classification [64] | --
- Learnable pooling module (LPM) [86]: Full-face gaze estimation [86] | Brain surface analysis [38] (MRI, brain)

Notations: Consider a set of feature maps F and a pooling region P defined on one of these feature maps, F_k, as in Fig. 1. Assume that x ∈ R^{W'×H'} represents the features inside the pooling region P on F_k. For example, W' = H' = 3 in the local pooling case, and W' = W and H' = H in the global pooling case, where W and H denote the width and the height of the feature map, respectively. In the following, we assume that x is vectorized to simplify the math operations, i.e., x = [x_1, x_2, ..., x_N]^T with N = W'H'. The average pooling and the max pooling [6] are widely used in CNNs [46, 49, 51, 59] because of their simplicity: they do not have any parameters to tune. Average pooling summarizes all the features in the pooling region and can be defined as

    f_avg(x) = (1/N) Σ_{i=1}^{N} x_i.    (1)

On the other hand, max pooling selects only the strongest activation in the pooling region, i.e.,

    f_max(x) = max_{1≤i≤N} x_i.    (2)

The average pooling and the max pooling have their own merits and disadvantages. Averaging reduces the effect of noisy features, but, as it gives equal importance to all the elements in the pooling region, background regions may dominate the pooled representation and hence reduce its discriminative power. In contrast, max pooling selects the largest value in each pooling region and can therefore avoid the effect of unwanted background features; however, because it keeps only the maximum element, the pooled representation may capture noisy features. The average and the max pooling suit different scenarios. Consider a situation in medical image analysis where a lesion appears only in a small part of the image. In this case, average pooling may not be a good choice, as the elements of the pooling region corresponding to background pixels will tend to dominate the pooled representation. Average pooling may be more appropriate in other scenarios, e.g., separating abnormal images from normal ones when the abnormality spreads over the entire image. Unlike average pooling, max pooling is a nonlinear operator, which increases the nonlinearity of the network.
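To make the two basic operators concrete, the following is a minimal NumPy sketch of Eqs. (1) and (2) applied to one vectorized pooling region; the function names are illustrative and not taken from any particular library.

```python
import numpy as np

def avg_pool(x):
    """Average pooling over one vectorized pooling region, Eq. (1)."""
    return x.mean()

def max_pool(x):
    """Max pooling over one vectorized pooling region, Eq. (2)."""
    return x.max()

# A 3 x 3 pooling region with one strong activation and background noise.
region = np.array([0.1, 0.0, 0.2,
                   0.1, 2.5, 0.0,
                   0.0, 0.1, 0.2])
print(avg_pool(region))  # ~0.36: background values dilute the strong activation
print(max_pool(region))  # 2.5: only the strongest activation is kept
```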
In the training stage of a network, all the neurons connected to an average pooling layer are updated via backpropagation, as every neuron contributes to the output of average pooling. In contrast, because max pooling selects only the strongest activation, only the neurons connected to the neuron producing that strongest activation are allowed to learn. Note that, in addition to CNNs, max and average pooling have also been well explored in traditional feature encoding approaches such as bag-of-words [24] and its variations, e.g., sparse coding [53], vector of locally aggregated descriptors [52] and Fisher vectors [88] (discussed in Sect. 2.10). Average pooling is widely used in all of these methods except sparse coding, where max pooling is preferred. As listed in Table 2, max and average pooling are well explored in medical image analysis for different problems, including HEp-2 cell image classification [93, 126], multiple sclerosis identification from MRI images [122], etc. Since neither max pooling nor average pooling consistently outperforms the other [6], approaches have been proposed to take advantage of both. This line of research includes direct weighted combinations of max and average pooling (Sect. 2.2) and soft pooling (Sect. 2.3). However, unlike max and average pooling, these approaches introduce new parameters, causing additional overhead in parameter learning or tuning. To overcome the problems associated with the max and the average pooling (discussed in Sect. 2.1), mixed max-average pooling [63] simply adds the max and the average pooling together with weights to take advantage of both, i.e.,

    f_mix(x) = α f_max(x) + (1 − α) f_avg(x),    (3)

where α ∈ [0, 1] is a learnable parameter that determines the mixing proportion. There are multiple options when choosing this parameter. The same α could be used for the entire network, or a set of α's could be used, one for each pooling layer (i.e., α_l, 1 ≤ l ≤ L, where L is the number of layers), or even different regions of different pooling layers may use different mixing proportions. The mixing proportion α in Eq. (3) does not depend on the individual characteristics of a given image, although it can be learned during network training. Images from the same dataset can have different characteristics; for example, in medical images the lesions may be localized (appear only in some parts) in some images but spread over the whole image in others. In that case, the mixing proportion should depend on the characteristics of each image rather than of the dataset, and therefore should be determined for each image separately. This is the motivation behind gated max-average pooling [63], which can be defined as

    f_gate(x) = σ(w^T x) f_max(x) + (1 − σ(w^T x)) f_avg(x),    (4)

where w ∈ R^N is a weight vector (called the gating mask in [63]) to be learned when training the network, and σ(·) is a sigmoid function that maps the transformed input w^T x to a value between 0 and 1. This value is then used to weight the contributions of the max- and the average-pooled results as shown in Eq. (4). As with mixed max-average pooling (Eq. (3)), the new parameters (w) can be learned in different ways, e.g., separately for each layer or separately for each channel in each layer of the network. In both mixed max-average and gated max-average pooling, each pooling region (of a particular feature map) is considered independently of the others.
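The following is a minimal PyTorch-style sketch of gated max-average pooling (Eq. (4)). The module name, the choice of one gating mask per channel, and the 2 × 2 pooling window are assumptions for illustration rather than the reference implementation of [63].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMaxAvgPool2d(nn.Module):
    """Sketch of gated max-average pooling (Eq. (4)) over k x k regions."""
    def __init__(self, channels, kernel_size=2, stride=2):
        super().__init__()
        self.k, self.s = kernel_size, stride
        # One gating mask w per channel (other sharing schemes are possible).
        self.w = nn.Parameter(torch.zeros(channels, kernel_size * kernel_size))

    def forward(self, x):                                      # x: (B, C, H, W)
        regions = F.unfold(x, self.k, stride=self.s)           # (B, C*k*k, L)
        B, _, L = regions.shape
        C = x.shape[1]
        regions = regions.view(B, C, self.k * self.k, L)
        # gate = sigmoid(w^T x) for every pooling region
        gate = torch.sigmoid((self.w.unsqueeze(0).unsqueeze(-1) * regions).sum(2))
        pooled = gate * regions.max(2).values + (1 - gate) * regions.mean(2)
        h_out = (x.shape[2] - self.k) // self.s + 1
        w_out = (x.shape[3] - self.k) // self.s + 1
        return pooled.view(B, C, h_out, w_out)

# For mixed max-average pooling (Eq. (3)), the gate would instead be a single
# learnable scalar alpha in [0, 1], e.g. torch.sigmoid(self.alpha).
```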
Dynamic Correlation pooling [11] also uses the formulation of Eq. (3); however, the mixing proportion for each pooling region is determined by the correlation between that region and its adjacent regions: average pooling receives a higher weight if the correlation is high, and max pooling otherwise. To the best of our knowledge, as listed in Table 2, soft pooling approaches are more widely used in medical imaging than linear combinations of max and average pooling. Soft pooling serves as an intermediate form between max and average pooling. Unlike simply adding the max and average pooling as in Sect. 2.2, soft pooling uses a smooth, differentiable function that approximates max and average pooling under different parameter settings. For example, the Generalized Mean (GM) [135] pools a region as

    f_GM(x) = ( (1/N) Σ_{i=1}^{N} x_i^r )^{1/r},    (5)

where the parameter r controls the softness: when r = 1 this function is equivalent to average pooling, and as r → ∞ it approximates max pooling. Various such approximations have been used, including Log-Sum-Exp pooling (LSE) [55, 90, 124], Polynomial pooling [129], Learned-Norm pooling [42], ℓp pooling [7], α-Integration (aI) pooling [28], Rank-based pooling [99, 101], Dynamic pooling [84], Smooth-Maximum pooling [5], Soft pooling [107], Maxfun pooling [26], and Ordinal pooling [60]. As most of these functions are differentiable approximations of max pooling, they are widely explored in (non-CNN-based) Multiple Instance Learning approaches in computer vision [8] and medical image analysis [76, 135] (Table 2), in addition to CNN-based image classification [28, 42, 124, 144] and segmentation [90, 129]. The learned-norm pooling [42] and ℓp pooling [7] use formulations similar to Eq. (5). Root-Mean-Square pooling [53] is a special case (r = 2) of GM. The aI pooling [28] is based on α-integration, a parametric family of means in which statistics such as the arithmetic mean, harmonic mean, maximum and minimum arise as special cases of a single parameter. This pooling shows marginal improvements over max pooling, α-pooling (Sect. 2.6) and ℓp pooling on some computer vision datasets in [28]. In rank-based pooling [101], the elements in each pooling region are first ordered (ranked) and then the top-k elements (those with the highest activations) are averaged as the pooled representation. When k = 1 and k = N this pooling is equivalent to max and average pooling, respectively. Ordinal pooling [60] and multi-activation pooling [151] are similar to rank-based pooling in that they also use the rank of the elements when pooling. The free parameter(s) of the above soft pooling functions can be shared across the entire network or differ between layers, and can either be fixed [90] or learned [42, 129]. For example, in aI pooling [28] the parameters (α's) are learned for each layer separately via back-propagation, and in polynomial pooling [129] a side-branch network determines the parameters of each pooling region. In all the above soft pooling approaches, the pooled result is based only on the characteristics of the pooling region of a particular feature map. In contrast, in Global Feature Guided Local pooling (GFGP) [57], the pooled result of a particular region depends not only on that region itself, but also on global statistics of the feature map.
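A minimal PyTorch-style sketch of Eq. (5) used as a global pooling layer is shown below; it assumes non-negative activations (e.g., after ReLU) and a fixed r, though r can also be made a learnable parameter. The class name is illustrative.

```python
import torch
import torch.nn as nn

class GeneralizedMeanPool2d(nn.Module):
    """Global generalized-mean (GM) pooling, Eq. (5): ((1/N) sum x_i^r)^(1/r)."""
    def __init__(self, r=3.0, eps=1e-6):
        super().__init__()
        self.r, self.eps = r, eps

    def forward(self, x):                                   # x: (B, C, H, W)
        x = x.clamp(min=self.eps)                           # assumes ReLU-like inputs
        return x.pow(self.r).mean(dim=(2, 3)).pow(1.0 / self.r)   # (B, C)

# r = 1 recovers global average pooling; large r approaches global max pooling.
pool = GeneralizedMeanPool2d(r=3.0)
feats = torch.relu(torch.randn(8, 512, 7, 7))
print(pool(feats).shape)    # torch.Size([8, 512])
```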
The GFGP is formulated as

    f_GFGP(x) = Σ_{i=1}^{N} w_i x_i,

where the weights w_i determine the type of pooling and are learned through an optimization process, and λ is a channel (feature map)-dependent parameter determined (learned) from the statistics of the global features of that channel. Note that average and max pooling are obtained when λ = 0 and λ → ∞, respectively. One of the main issues when training CNNs with limited data is overfitting. Mixed pooling [139], Hybrid pooling [112], Stochastic pooling [142], Rank-Based Stochastic pooling [101], Max pooling dropout [132], Stochastic Spatial Sampling (S3 pooling) [143] and Fractional Max pooling [39] were proposed to reduce overfitting by introducing various forms of randomness into the pooling configuration and/or the way pooling is performed during training. Because of this randomness, the trained model can be viewed as an ensemble of similar networks, with each random pooling configuration defining a different member of the ensemble. As listed in Table 2, these stochastic pooling approaches are widely used in medical imaging (e.g., COVID-19 diagnosis [149], abnormal breast identification [148], brain tumor segmentation [10]), since models in medical imaging are usually trained with small amounts of data and these pooling approaches help to mitigate overfitting. Mixed pooling [139] and hybrid pooling [112] introduce randomness in training by randomly selecting either the max or the average operation for pooling, i.e.,

    f_mixed(x) = λ f_max(x) + (1 − λ) f_avg(x),    λ ∈ {0, 1},

where λ is a random value, either 0 or 1, that determines which pooling is selected: max pooling when λ = 1 and average pooling when λ = 0. This randomness cannot be used at test time. Therefore, in [139], statistics of how often the max and the average operations were selected for each feature map during training are recorded, and whichever operation was used more frequently for each layer is selected at test time. Stochastic pooling [142] introduces randomness in training by randomly selecting an activation (instead of the maximum as in max pooling, or all elements as in average pooling) within each pooling region according to a multinomial distribution given by the values within that region. The values in each pooling region are first converted into probabilities by dividing each value by the sum of all values in that region, i.e.,

    p_i = x_i / Σ_{j=1}^{N} x_j.

Then a location l within each pooling region is sampled based on these probabilities to obtain the pooled representation of that region. The locations sampled for each pooling region, in each layer, for each training example are drawn independently of one another. At test time, a probabilistic weighting scheme is used, where the pooled representation of a pooling region is calculated as

    f(x) = Σ_{i=1}^{N} p_i x_i.

This can be seen as a weighted average pooling, where the probability values weight the corresponding elements of the pooling region. With stochastic pooling, overfitting may still occur, particularly when the training data are limited, because strong activations always have the highest probability of being sampled. Therefore, rank-based stochastic pooling [101] suggests a different way of computing the probabilities, based on the ranks of the activations inside each pooling region.
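A minimal PyTorch-style sketch of the train-time sampling and test-time probabilistic weighting described above is given below; it assumes non-negative activations and an input whose spatial size is divisible by the pooling window, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def stochastic_pool2d(x, k=2, training=True, eps=1e-12):
    """Sketch of stochastic pooling over k x k regions."""
    B, C, H, W = x.shape                                        # H, W divisible by k
    regions = F.unfold(x, k, stride=k).view(B, C, k * k, -1)    # (B, C, k*k, L)
    probs = regions / (regions.sum(dim=2, keepdim=True) + eps)  # p_i = x_i / sum_j x_j
    if training:
        # Sample one location per region from the multinomial distribution.
        flat_p = probs.permute(0, 1, 3, 2).reshape(-1, k * k)
        flat_x = regions.permute(0, 1, 3, 2).reshape(-1, k * k)
        idx = torch.multinomial(flat_p + eps, 1)                # sampled locations
        pooled = flat_x.gather(1, idx).view(B, C, -1)
    else:
        # Test time: probability-weighted average of each region.
        pooled = (probs * regions).sum(dim=2)                   # (B, C, L)
    return pooled.view(B, C, H // k, W // k)
```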
Instead of sampling only one value from each pooling region as stochastic pooling does, a set of values can be randomly sampled first and pooling then applied to these randomly sampled activations, as in max pooling dropout [132]. Max pooling dropout first applies dropout to the feature maps to drop p% of the features and then applies max pooling to the retained features; it shows better performance than stochastic pooling for particular values of p. Unlike the above approaches, where randomness is introduced in the pooling stage, in S3 pooling [143] and fractional max pooling [39] randomness is introduced in the spatial sampling stage. The standard max pooling can be viewed as a two-step procedure (Fig. 2d). In the first step, max pooling is performed on the feature map with a stride of 1. In the second step, spatial downsampling is performed uniformly on the resulting map by extracting the top-left element of each disjoint s × s window, yielding a feature map with s times smaller spatial dimensions. S3 pooling differs from traditional max pooling in the second step: instead of the uniform sampling used by max pooling, S3 pooling uses non-uniform sampling to downsample the pooled feature maps. Max pooling reduces the size of the feature maps by an integer multiplicative factor s (the value of the stride). Usually s is set to two in most architectures (e.g., ResNet [46]), halving the size of the feature maps every time pooling is applied and hence limiting the number of pooling layers that can be used. In fractional max pooling [39], s is allowed to take a non-integer value, i.e., 1 < s < 2, to allow a larger number of pooling layers. Because of the non-uniform downsampling used in S3 pooling and fractional max pooling, the downsampled feature maps become distorted. This distortion provides a form of data augmentation that improves the generality of the network. For some problems, encoding spatial information is necessary; for example, in natural images the sky is always in the upper part of the image. Encoding such information may lead to a more informative and discriminative feature representation. Similar information is very useful in some medical images. For example, the Golgi class in Fig. 4 has a unique ring-like structure around the cells; encoding that structure in the feature representation may help to discriminate that class from the others. Various approaches [30, 43, 45, 91, 92, 115, 117, 130, 136, 145] have been proposed to encode local structure information in the pooled representation. Spatial Pyramid pooling (SPP) [45] (Fig. 3a) is a popular way to include spatial structure information in the pooled representation. It divides the feature map into grids of cells and applies standard max or average pooling to each cell separately; these cell-based pooled representations are then concatenated as the image representation. SPP is very useful for rigid structures, but it may not be appropriate for images containing objects with different poses, e.g., birds. To overcome this, a part-based pooling strategy was proposed in [145] for fine-grained image classification. Here, different parts (e.g., head, tail, body of a bird) are first detected in each image, and the features from each detected part are pooled and concatenated as the final image representation (Fig. 3b).
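The following is a minimal PyTorch-style sketch of SPP as described above, producing a fixed-length descriptor from grids of 1 × 1, 2 × 2 and 4 × 4 cells regardless of the input resolution; the function name and pyramid levels are illustrative choices.

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(x, levels=(1, 2, 4)):
    """Sketch of SPP: max pooling over grids of 1x1, 2x2 and 4x4 cells,
    concatenated into one fixed-length vector per image."""
    B, C = x.shape[:2]
    pooled = [F.adaptive_max_pool2d(x, g).view(B, C * g * g) for g in levels]
    return torch.cat(pooled, dim=1)           # (B, C * (1 + 4 + 16))

feats = torch.randn(2, 256, 13, 17)           # works for any input resolution
print(spatial_pyramid_pool(feats).shape)      # torch.Size([2, 5376])
```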
Both SPP and the part-based pooling strategies may not be very useful for images with rotated objects. To capture rotationally invariant spatial structure representations with CNNs, Concentric Circle pooling [91] and Polycentric Circle pooling [92] were proposed and applied to recognizing remote sensing images, where the pooling regions are defined as concentric circles (Fig. 3d). A similar approach, Multi-pooling [117], was proposed to cope with lesions (brain tumors) of different sizes, where features extracted from concentric regions of different sizes are concatenated as the representation. Cell Pyramid Matching (CPM) [130] is another approach to capture spatial structure information, specifically for cell image classification. In CPM, the segmentation mask of each cell is used to define the pooling regions, as shown in Fig. 3c. CPM was also adopted in [77] for the same purpose. In both [130] and [77], CPM was used with traditional feature representations such as bag-of-words, not with CNNs. Note that CPM requires additional input in the form of segmentation masks to identify the border of each cell. All the above approaches aim to capture large-scale spatial structure information. Geometric ℓp-Norm pooling [30], on the other hand, aims to capture local structure information (e.g., from image regions of size 5 × 5) for sparse coding-based (non-CNN) representations by applying weights to different locations of the pooling region. With a CNN, however, this pooling is equivalent to first applying a nonlinear transformation to the feature maps and then applying a convolution for aggregation. Average pooling captures only the first-order statistics (i.e., the mean) of each pooling region, pooling from each channel (feature map) separately. It therefore captures neither the interaction between different feature maps nor the interaction between features from different regions of the same feature map. Such interactions may capture additional details such as object co-occurrence [137]. Therefore, capturing higher-order statistical information via covariance matrices can improve the ability of CNNs to learn complex nonlinear class boundaries. Recently, incorporating higher-order statistical pooling into CNNs has received attention [18, 25, 31, 34, 69-71, 141] and has achieved state-of-the-art results on a variety of tasks, including object recognition, fine-grained visual categorization, and object detection. Second-Order pooling was initially proposed in [9] for aggregating SIFT descriptors (non-CNN). The average and the max second-order pooling are defined in [9] as

    G_avg = (1/n) Σ_{i=1}^{n} u_i u_i^T,        G_max = max_{i=1..n} u_i u_i^T (element-wise maximum),

where u_i ∈ R^d is the i-th feature descriptor (e.g., SIFT [72]) from region R, n is the number of descriptors in R, d is the dimensionality of u_i, and u_i u_i^T is the outer product of the descriptor u_i with itself, capturing the pairwise correlations between the elements of u_i. The pooled representation (a matrix of size d × d) was then passed through a nonlinear transformation and a normalization step before being given to a linear classifier. This idea was later extended to CNN features in [25, 34, 69-71, 141]. For example, in Improved Bilinear pooling [70], u_i is a feature from the last layer of a CNN model (Fig. 1). The pooled features are then passed through a normalization layer before performing fine-grained classification.
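A minimal PyTorch-style sketch of global bilinear (second-order) pooling is given below, with the signed square-root and ℓ2 normalization commonly used with such descriptors; improved bilinear pooling additionally applies a matrix square-root normalization, which is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def bilinear_pool(x, eps=1e-12):
    """Sketch of global second-order (bilinear) pooling.
    x: (B, C, H, W) -> (B, C*C) descriptor of pairwise channel correlations."""
    B, C, H, W = x.shape
    u = x.view(B, C, H * W)                            # n = H*W descriptors u_i
    g = torch.bmm(u, u.transpose(1, 2)) / (H * W)      # (1/n) sum_i u_i u_i^T
    g = g.view(B, C * C)
    g = torch.sign(g) * torch.sqrt(torch.abs(g) + eps) # signed square-root
    return F.normalize(g, dim=1)                       # l2 normalization

feats = torch.randn(4, 128, 7, 7)
print(bilinear_pool(feats).shape)   # torch.Size([4, 16384])
```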
In both [70, 71], second-order pooling is applied only at the end of the network; in contrast, in [34] second-order pooling is applied throughout the network (from lower to higher layers) and shows improved performance over applying it only at the end. Extensions of this pooling include compact [31, 141] and kernelized [25] versions. In addition, the relationship between second-order pooling and attention-based pooling is analyzed in [36]. The formulation of α-pooling [102] allows a continuous transition between average and bilinear pooling through a trainable parameter α. Discriminative details can be lost due to improper pooling mechanisms, particularly in the early stages of a network. This information loss may hinder the learning process and result in sub-optimal models [32]. Detail-Preserving pooling (DPP) [96] and Local Importance-based pooling (LIP) [32] aim to reduce this information loss by preserving important features when pooling. Because some activations are more important than others, both approaches compute the pooled value as a normalized weighted sum of the activations in the pooling region, f(x) = Σ_i w_i x_i / Σ_j w_j. However, they differ from each other (and from [57] discussed in Sect. 2.3) in the way the weights are determined. In DPP, higher weights are given to activations that differ from the activation at the center of the pooling region, as those activations are assumed to carry more information, i.e.,

    w_i = α + |x̃_i − x̃_c|^λ,

where x̃_c is the activation at the center of the preprocessed pooling region. The parameters α and λ are learned together with the other network parameters during training. In LIP, on the other hand, the weights are determined by a subnetwork attached to each pooling layer. LIP can therefore also be considered an attention-based pooling approach (Sect. 2.8), as the subnetwork learns a saliency map to weight each element of the feature map. LIP shows an improved recognition rate over DPP on the ImageNet dataset in [32]. In addition, both approaches can be considered soft pooling approaches, since for particular parameter settings they approximate the standard average and max pooling; for example, when α = λ = 0, DPP reduces to average pooling. Larger networks cannot be deployed on resource-constrained devices because of their large memory requirements. One way to handle this problem is to reduce the number of layers of the network through rapid downsampling. However, rapidly downsampling the feature maps by a large factor can simply lead to information loss and hence reduced performance. RNNPool [97] tries to alleviate this problem by using recurrent networks for downsampling: the first recurrent network summarizes the feature maps horizontally and vertically, and the second summarizes the outputs of the first as the pooled result. In attention-based pooling approaches, each element of the feature map is weighted by the corresponding weight from an attention/saliency map, and pooling is then performed on this weighted feature map as a weighted average pooling [27]. The attention map highlights discriminative regions in the feature maps by giving them higher weights than non-discriminative regions. One can therefore expect a more discriminative pooled representation when pooling from attention-weighted feature maps than when pooling directly from the original feature maps. Attention-based pooling [36, 41, 50, 67, 73, 89, 127] has received much attention recently.
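A minimal PyTorch-style sketch of LIP-style (attention-weighted) pooling is shown below: a small subnetwork predicts importance logits and the pooled value is the normalized weighted sum described above. The 1 × 1 convolution used as the logit module and the class name are assumptions for illustration, not the exact design of [32].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalImportancePool2d(nn.Module):
    """Sketch of local importance-based pooling: a tiny subnetwork predicts
    per-pixel importance logits; pooling is an exp(logit)-weighted average
    over each pooling window."""
    def __init__(self, channels, kernel_size=2, stride=2):
        super().__init__()
        self.k, self.s = kernel_size, stride
        self.logit = nn.Conv2d(channels, channels, 1)    # importance subnetwork

    def forward(self, x):                                # x: (B, C, H, W)
        w = torch.exp(self.logit(x))                     # positive weights
        num = F.avg_pool2d(w * x, self.k, self.s)        # proportional to sum_i w_i x_i
        den = F.avg_pool2d(w, self.k, self.s)            # proportional to sum_i w_i
        return num / (den + 1e-12)                       # normalized weighted sum
```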
Attention models differ in the way the attention maps are generated. For example, in Cross Convolutional Layer pooling [73], the feature vectors from the feature map of a particular layer are weighted by each of the feature maps of its subsequent layer. In [67], a separate subnetwork is used to learn the attention maps. The double-attention network (A²-network) [13] uses a double attention mechanism, where the first attention step uses second-order attention pooling to aggregate features from the entire feature map, and the second attention step distributes the key features. The Convolutional Block Attention Module (CBAM) [131] contains two attention mechanisms, a channel attention module followed by a spatial attention module: the channel attention module captures the inter-channel relationships of the features, while the spatial attention module captures their inter-spatial relationships. Global Learnable Pooling (GLPool) [147] can also be considered an attention mechanism, where the weight of each pooling location is treated as a parameter and learned together with the other network parameters in an end-to-end manner. It is shown mathematically in [36] that attention-weighted pooling is equivalent to a low-rank approximation of second-order pooling. Attention mechanisms have also been investigated in medical imaging (Table 2); for example, a separate branch of the network was used to obtain attention maps for glaucoma detection in [67]. A reinforcement learning-based recurrent attention model for pulmonary lesion detection from chest X-rays was proposed in [89]. An attention-guided CNN was proposed for thorax disease classification in [41], where the regions identified by a global branch are further analyzed by a local branch, and the outputs of both branches are fused for the final classification. Generalized Max pooling (GMP) [83] does not explicitly specify the pooling function; instead, it implicitly learns the 'pooled' representation via an optimization framework that equalizes the similarity between the local descriptors of an image, U = [u_1, ..., u_n], and their 'pooled' representation û, i.e.,

    û = argmin_u ‖U^T u − 1‖²,    so that u_i^T û ≈ 1 for all i,

where 1 denotes an n-dimensional vector of all ones. By doing so, the 'pooled' representation captures the properties of max pooling for bag-of-words-based hard-encoded local descriptors (binary representations). However, for other descriptors, such as features from the last layer of a CNN, this 'pooled' representation can be affected by frequent descriptors and hence may not be similar to max pooling. It is shown in [83] that this 'pooled' representation of an image is equivalent to a weighted average of its local descriptors, i.e.,

    û = Σ_{i=1}^{n} β_i u_i,

where β is the vector of weights. Since GMP is an unsupervised representation learning method, Task-Driven Feature pooling [133] extended GMP to supervised learning, where the 'pooled' representations are learned jointly with a classifier to maximize classification accuracy; it showed improved accuracy over traditional max and average pooling with fixed feature representations. Deep Generalized Max pooling [20] integrates the idea of GMP into a deep learning framework.
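A minimal NumPy sketch of GMP is given below: the pooled vector û is found so that u_i^T û ≈ 1 for every local descriptor by solving a ridge-regularized least-squares problem; the regularization weight lam and the function name are assumptions for illustration.

```python
import numpy as np

def generalized_max_pool(U, lam=1.0):
    """Sketch of generalized max pooling (GMP).
    U: (d, n) matrix whose columns are the n local descriptors u_1..u_n.
    Solves  min_u ||U^T u - 1||^2 + lam * ||u||^2,  so that u_i^T u_hat ~ 1
    for every descriptor, which down-weights frequent (bursty) descriptors."""
    d, n = U.shape
    ones = np.ones(n)
    return np.linalg.solve(U @ U.T + lam * np.eye(d), U @ ones)

U = np.random.rand(64, 200)              # e.g., 200 column features of dimension 64
u_hat = generalized_max_pool(U)
print(np.round((U.T @ u_hat)[:5], 2))    # each entry should be close to 1
```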
Bag-of-words (BoW) [24] and its variants such as Vector of Locally Aggregated Descriptors (VLAD) [52] and Fisher Vectors (FV) [88] are well-known (non-CNN-based) feature encoding and aggregation techniques for order-less representation of handcrafted local descriptors, and have been widely used in Computer Vision [24, 52, 81, 88] and Medical Imaging [77, 114] community. In these approaches, first the local features from all the training images are clustered into a set of clusters and then the local features from each image falling inside each cluster are aggregated using different statistics; BoW uses count statistics, VLAD aggregates gradients and FV uses second-order statistical information in addition to the statistics used by BoW and VLAD. The aggregated statistics from each cluster are then concatenated as the final feature representation of an image. Methods also proposed to integrate these approaches with CNN as feature aggregation techniques by either using them with features extracted from pre-trained CNN [21, 37, 138] , or learning CNNs together with the parameters of BoW [80, 87] , VLAD [2, 80, 140] and FV [80] in an end-to-end manner. A recent work [80] reports significant performance improvement by the learned aggregation schemes (BoW, VLAD and FV) over average pooling for video classification. To the best of our knowledge, these techniques are only used at the end of the network for feature aggregation. Various other approaches based on the max, average, and their variants also proposed for different reasons. For example, Transformation Invariant pooling (TI-pooling) [61] applies max pooling on the CNN features extracted from the transformed versions of an image to represent that image, so that the representation will capture transformation invariant features. The Hierarchical Mix-pooling [78] applies max pooling on the average pooled feature maps (and vice versa) to reduce information loss and shows improved performance than applying either max or average pooling alone on the Sparse Coding-based representations. Most of the pooling techniques for downsampling the feature maps are not invertible due to information loss, i.e., upsampling a downsampled feature map cannot recover the lost information in the downsampling. LiftPool [150] is a recently proposed pooling technique, which aims to build pooling layers that are invertible. Kernelized subspace pooling [128] was proposed to obtain a highly invariant description (invariant to flipping, rotation, etc.) from the CNN for image patch matching, where, the principal components of the feature maps from the last layer of the CNN are considered as the pooled output, and showed that descriptors obtained in this way are discriminative and highly invariant for image patch matching. Convolutions also can be viewed as weighted average pooling, where the filters are learned in the training process. Tree pooling [63] learns a set of filters and ways to combine them. In [63] , Tree pooling shows improved performance over max, average, mixed max-average (Sect. 2.2), and gated pooling (Sect. 2.2). However, Tree pooling contains more parameters to learn than mixed maxaverage and gated pooling. Strided Convolutions [105] , on the other hand, are convolutions, use larger strides ( [ 1) for downsampling the feature maps. 
Unlike traditional max and average pooling, where pooling is performed on each input feature map (channel) independently, strided convolutions use all input feature channels to generate each output channel, and therefore need many extra parameters. In LEArning Pooling (LEAP) [108], strided convolutions are applied independently to each channel of the feature maps to reduce the number of parameters required by strided convolutions. As discussed in Sect. 1, pooling makes the features robust to local transformations; strided convolutions, in contrast, capture local structure or positional information. In this section, we explain the datasets, evaluation criteria, network architecture, experimental settings, and the pooling strategies considered. We use the following two public medical image datasets to compare different pooling strategies: (1) the Human Epithelial type 2 (HEp-2) cells dataset and (2) the Diabetic Retinopathy (DR) dataset. In the HEp-2 cells dataset, each cell covers the entire image, as shown in Fig. 4. The lesions in the DR dataset (Fig. 5) cover only small parts of the images. In both datasets, the task is to classify each image into one of the predefined classes. The HEp-2 cells dataset is from the I3A HEp-2 (Indirect Immunofluorescence Image Analysis - Human Epithelial Type-II) Cell and Specimen image classification competition organized at the International Conference on Pattern Recognition (ICPR), 2014. There were two tasks in this competition: Task 1 is to classify individual cell images into one of six classes (Homogeneous, Speckled, Nucleolar, Centromere, Nuclear Membrane, and Golgi), and Task 2 is to classify specimen images into one of seven classes (Homogeneous, Speckled, Nucleolar, Centromere, Nuclear Membrane, Golgi, and Mitotic Spindle). Each specimen image contains a large number of cell images of the same type. In this work we focus on Task 1, cell image classification. The training sets of both tasks were released to the participants of the competition, and the test sets were kept private by the organizers. As the training set of the Task 1 dataset contains a smaller number of images (13,596) and its test set is inaccessible, we used the Task 2 dataset to extract 26,078 cell images, as explained in [77]. We sample 60% of these images from each class as our training set and use the rest as the test set. When sampling, we make sure that the training data and the test data contain cell images from disjoint sets of specimen images. The number of images in the training and test sets from each class of this dataset is given in Table 3. Note that all of these images are in grayscale, and the size of each image is approximately 70 × 70 pixels. We resize each image to 112 × 112 pixels. Images are normalized (zero mean and unit variance) before being given to the CNN. Data augmentation, such as random mirroring, rotations (±180°), and random cropping of size 96 × 96 pixels, was used at training time. At test time, images were cropped at the center and no augmentation was used. Mean Class Accuracy (MCA) was used as the evaluation measure, as it is the metric required by the competition. It is defined as

    MCA = (1/K) Σ_{k=1}^{K} R_k,

where R_k is the correct classification rate for class k, and K (= 6) is the number of classes. The second dataset is from the Kaggle Diabetic Retinopathy Detection challenge. The number of images in each class is given in Table 4. Example images from each class are shown in Fig. 5.
Each image is preprocessed as explained in [40]: first, the images are rescaled to have the same radius, then the local average color is subtracted from each channel and the result is mapped to 50% gray (intensity value 128). In training, each image is first resized to 512 × 512 pixels. Each channel of the image is normalized (zero mean and unit variance) before it is used by the CNN. Data augmentation, such as random mirroring, rotations (±180°), and random cropping of size 448 × 448 pixels, was used at training time. At test time, images were cropped at the center and no augmentation was used. We used the Quadratic Weighted Kappa (QK) as the evaluation measure, as it is the measure used by the competition. QK measures the level of agreement between the predictions made by the system (A) and the annotator (B), and can be defined as

    QK = 1 − ( Σ_{i,j} W_ij O_ij ) / ( Σ_{i,j} W_ij E_ij ),

where W, O and E are matrices of size K × K. O is the confusion matrix: each element O_ij indicates how many times an image received rating i from A and rating j from B. The expected outcomes E are calculated as the outer product between the actual histogram vector of outcomes and the predicted histogram vector, and E is normalized such that E and O have the same sum. The element W_ij of the weight matrix W is given as

    W_ij = (i − j)² / (K − 1)².

For both datasets, we use a ResNet [46]-based CNN architecture. Table 5 illustrates the components of our selected CNN, which contains an input layer and three residual blocks. The input layer and each of the first two residual blocks are followed by transition layers that downsample the feature maps by half of their original size. Two approaches were considered for the transition layers: in the first case, a 3 × 3 convolution with a stride of 2 is applied, and in the second case this convolution is replaced by a pooling operation. Global pooling is applied at the end of the network to obtain an image representation, which is then passed to a linear classification layer to obtain the classification scores. Note that, as the images from the HEp-2 cells dataset are small (96 × 96), a stride of one is used in the first convolutional layer for this dataset. The settings below were used unless otherwise specified. For the HEp-2 cells dataset, the network was trained from scratch, as we have a larger number of training images and their sizes are small. The initial learning rate was set to 0.01 and was divided by a factor of 10 at the end of the 40th and 70th epochs, respectively; the total number of epochs was set to 80. We used a weighted Cross-Entropy loss to handle the imbalanced classes, where the weights are set to the inverse class frequencies. The network is optimized using Stochastic Gradient Descent (SGD) with Nesterov momentum of 0.9 and a weight decay of 10^-4. The batch size was set to 64. For the DR dataset, the weights of the CNN were initialized with those of an ImageNet pre-trained model, as recommended in [109]. The initial learning rate was set to 0.005 and divided by a factor of 10 at the end of the 90th and 120th epochs, while the total number of epochs was set to 130. Following [4], we directly use the Quadratic Weighted Kappa as the loss function. The network was optimized using SGD with Nesterov momentum of 0.9 and a weight decay of 5 × 10^-4. The batch size was set to 32. For bilinear pooling, however, we found the above initial learning rate too large, and therefore set the learning rate to 0.001 and the batch size to 18 due to memory constraints.
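The following is a minimal NumPy sketch of the two evaluation measures defined above (MCA for the HEp-2 cells dataset and QK for the DR dataset), following the formulas given in the text; `sklearn.metrics.cohen_kappa_score(y_true, y_pred, weights='quadratic')` should give the same QK value.

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred, K):
    """MCA = (1/K) * sum_k R_k, where R_k is the per-class correct classification rate."""
    rates = [np.mean(y_pred[y_true == k] == k) for k in range(K)]
    return float(np.mean(rates))

def quadratic_weighted_kappa(y_true, y_pred, K):
    """QK = 1 - sum(W*O) / sum(W*E), with W_ij = (i - j)^2 / (K - 1)^2."""
    O = np.zeros((K, K))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1                                        # observed agreement matrix
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / len(y_true)  # expected, same sum as O
    i, j = np.indices((K, K))
    W = (i - j) ** 2 / (K - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()

y_true = np.array([0, 1, 2, 2, 3, 4, 0, 1])
y_pred = np.array([0, 1, 2, 1, 3, 4, 0, 2])
print(mean_class_accuracy(y_true, y_pred, K=5))
print(quadratic_weighted_kappa(y_true, y_pred, K=5))
```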
All the reported experiments in this work are repeated ten times, and the mean and standard deviation over these iterations of the MCA for the HEp-2 cells dataset and of the QK for the DR dataset are reported as the evaluation measures. In addition, for each experiment, an independent-samples t test was used to compute p values and assess the statistical significance of the results against the top-performing method in that experiment. Table 6 compares the performance of max pooling, average pooling, mixed max-average pooling [63], and GM pooling [7]. Here, each of these pooling operations was applied to all the transition layers and the global pooling layer. On the HEp-2 cells dataset, average pooling gives significantly better performance (p < 0.05) than max pooling (87.50% vs 85.62%). The performance of mixed max-average and GM pooling lies between that of average and max pooling. This result is expected, since in most cases each image of the HEp-2 cells dataset contains exactly one cell that covers almost the whole image; averaging therefore helps to capture the properties of each cell and gives better performance than max pooling. On the other hand, GM pooling gives significantly better performance (p < 0.05) than both max and average pooling on the DR dataset. This result is also expected, as the lesions in the DR dataset are small (they do not cover the entire image). Max pooling may capture noisy features, as it focuses on the top activated elements of each feature map, while average pooling averages out all the activations so that the background features dominate the image representation. GM pooling is a balance between max and average pooling: it considers not only the top activated element, but also other elements with high activations. Table 7 reports the effect of the value of r in GM pooling. On the HEp-2 cells dataset, larger values of r (r > 5) lead to a significant drop (p < 0.05) in performance. This aligns with the results obtained with average and max pooling, as r = 1 corresponds to average pooling and r → ∞ to max pooling. On the DR dataset, r = 3 gives better QK values than the other settings. Larger values of r (r > 5) give a significant drop in performance (p < 0.05), as they approximate max pooling and hence may capture noisy features. However, although r = 3 gives the best kappa scores, we observed that smaller values (i.e., r ≤ 5) give statistically similar (p > 0.05) performance. Table 8 reports the effect of the mixing proportion α of mixed max-average pooling. Small values of α give better performance than large values. As explained in Sect. 3.2, the transition layers for downsampling the feature maps can be either pooling layers or convolutional layers. In Sect. 4.1, pooling was used for downsampling in all the transition layers. This section investigates the effect on performance of using convolutions instead of pooling to downsample the feature maps. Table 9 reports the results. For the HEp-2 cells dataset, applying average pooling at the first transition layer gives significantly better performance (p < 0.05) than applying max pooling. We believe this is due to the size of the images: as the images are small (96 × 96), applying max pooling at an early stage of the network discards much valuable information and leads to a drop in performance.
Here, applying average pooling at the first layer generally gives significantly better performance (p < 0.05) regardless of the global pooling operation used. Applying average pooling at both the first transition layer and the global pooling layer leads to significantly better performance than any other combination (p < 0.05). For the DR dataset, applying max or average pooling at the first transition layer gives similar performance when the global pooling operation is fixed, but applying average pooling as the global pooling operator gives better QK values than applying max pooling. Applying GM pooling at both the first transition layer and the global pooling layer, on the other hand, gives the best QK values (the reason is discussed in Sect. 4.1) and outperforms most of the considered combinations (p < 0.05). However, GM pooling gives statistically similar performance (p > 0.05) to applying max and average pooling at the first and the global pooling layers, respectively. Comparing Tables 6 and 9, applying convolution at the intermediate transition layers for downsampling the feature maps gives performance similar (p > 0.05) to applying pooling on both datasets. Here, we used convolution as the downsampling operation in all the transition layers except the first one. Improved bilinear pooling [70] was used to capture higher-order statistical information between feature channels at the global pooling layer. Table 10 reports the results. On both datasets, bilinear pooling does not show a significant improvement over applying average pooling as the global pooling operator. We conducted another experiment to investigate whether the higher-order information obtained by bilinear pooling adds complementary information to the feature representation obtained by other pooling approaches, e.g., global average pooling. The results in Table 10 do not show any considerable improvement (p > 0.05) when combining bilinear pooling with global average pooling. The CNN trained on the HEp-2 cell image dataset may overfit the training data, as we obtained a high training MCA (about 96%). In this experiment, we therefore investigate different stochastic pooling approaches to reduce overfitting, namely stochastic pooling [142], max pooling dropout [132], and S3 pooling [143]. Here, we applied stochastic pooling/dropout only at the global pooling layer. As expected, stochastic pooling and max pooling dropout give lower MCA (Table 11), as they are based on max pooling. Recall that stochastic pooling selects a single activation from each pooling region based on the multinomial distribution given by the values inside it, and that max pooling dropout randomly drops s% of the elements (sets them to zero; in our experiments s = 0.1%) and then applies max pooling to this new pooling region. We also considered average pooling dropout, where we randomly drop 0.1% of the elements (set them to zero) and then apply average pooling instead of max pooling; this gives about 1% improvement over max pooling dropout. S3 pooling gives the lowest MCA of all the approaches considered. We found that none of these stochastic pooling approaches gives a significant improvement over our baseline, average pooling without any stochastic pooling/dropout (88.02 ± 0.19% from Table 9). Here, we experiment with two different attention mechanisms (explained in Sect.
2.8): Double Attention (A 2 -block) [13] and Convolutional Block Attention Module (CBAM) [131] . These blocks were added before the global pooling layer. Note that the focus of this work is to compare different pooling mechanisms to find out which one is better under some scenarios, and we are not particularly focused on building a state-of-the-art system. However, based on the findings from our previous experiments reported in this paper, in this section, we compare our results with the state-of-the-art and show that on both datasets our approach leads to new state-of-the-art results. Most of the existing work [1, 40, 56, 93] for DR image analysis focus on building custom CNN architectures. For example, multiple filter sizes and different color spaces were explored for fine-grained classification of DR lesions in [116] . Attention-based mechanisms [127] were also explored. A recent work [85] analyses different loss functions for optimizing Kappa as the evaluation measure for DR image analysis. As explained in Sect. 3.1.2 all the experiments on the DR dataset reported previously in this paper are based on a subset of the entire training set. To compare with the stateof-the-art, in this section we use all the images from the training set (35, 126 images) to train the CNN, and test it In this experiment, we select four different pooling settings from the previous experiment (Table 12 ) and train four separate ResNet-18 models based on each of these pooling settings. The pooling settings considered here are: (1) max-avg: max and the average pooling are applied to the first and the global pooling layers respectively, (2) GM-GM: GM pooling is applied at the first and the global pooling layers, (3) max-A 2 -avg: max and the average pooling are applied to the first and the global pooling layers, respectively, and A 2 attention layer is added before the global pooling layer, (4) GM-A 2 -GM: GM pooling is applied to the first and the global pooling layers, and A 2 attention layer is added before the global pooling layer. As most of the state-of-the-art methods (e.g., [40, 127] ) make use of the features from both eyes (left and right, as they have high correlation) for the classification of a particular eye, we also combine the features from both eyes before pass it to the classification layer of each model. To make it consistent with other state-of-the-art methods, we use mean squared error as our loss function. Adam optimizer with an initial learning rate of 10 À4 was used to optimize the network parameters. The number of epochs and the batch size was set to 60 and 16, respectively. From Table 13 , we can observe that all the different pooling settings (our single models) give similar QK values compared to each other, and compared to the state-of-theart methods. We believe that this is because the results are almost saturated at a QK value of $ 0:855. We can also observe that the ensemble of our four models improves the overall QK values and leads to the state-of-the-art results on both the validation and the test sets. Note that compared to Zoom-in-net [127] , our method is not only simple, but also make use of a standard network architecture (ResNet-18) with different pooling mechanisms. A significant amount of work has been done for HEp-2 cell image classification, and can be categorized into handcrafted features-based approaches, and deep learningbased approaches. 
A significant amount of work has been done for HEp-2 cell image classification, and it can be categorized into handcrafted feature-based approaches and deep learning-based approaches. Various handcrafted features such as multi-resolution local patterns [77], shape index histograms [62], gray-level histogram statistics [48], co-occurrence matrix features [48], Local Binary Patterns [47], and SIFT [47, 77] have been explored. Recently, CNN-based approaches [33, 65] have also become popular for HEp-2 cell image classification. In the literature on HEp-2 cell image classification, different methods use different test sets for comparison, as the test set of this dataset is not publicly available (explained in Sect. 3.1.1). Some methods completely discard the specimen information when constructing the test set. It is observed in [77] that when the specimen information is discarded, a very high MCA (> 95%) can easily be obtained even with handcrafted features. As explained in Sect. 3.1.1, we considered specimen information when splitting the dataset, and we compare our method with those methods which also consider specimen information when constructing the test set. Table 14 compares our results with the state-of-the-art results on the HEp-2 cell image dataset and shows that our results set a new state of the art. We can observe from Table 14 that our method, a ResNet architecture with carefully chosen pooling layers, beats the other methods. It can also be noted that we achieve the new state-of-the-art results with a small amount of training data (15,314 images) compared to other methods; for example, the work of [65] uses a training set which contains over 100,000 images.

The following section summarizes the work of this paper based on the pooling techniques reviewed and the findings of the experiments.

As discussed, pooling can help to learn invariant features, reduce overfitting, and reduce computational complexity by downsampling the feature maps. There are two types of pooling operations used in CNNs: local pooling and global pooling. Local pooling is applied to small regions (e.g., 3 × 3) at the early stages of a CNN to capture local features, and global pooling is applied at the end of the network to each entire feature map to obtain a feature representation, which is then used by the fully connected layer for classification. The max and the average pooling are the most widely used pooling techniques. They are used in both local and global pooling layers, and their applicability depends on the application.

Max pooling considers only the most activated element in each pooling region and discards all the other activations as irrelevant; this most activated element could be a noisy one. Our experiments suggest that max pooling is appropriate in situations where the class-specific features (e.g., abnormal regions in medical images) are small compared to the image size. In the learning stage of the network, only the network nodes connected to this maximally activated element are updated, which makes the learning of the network slow. Usually max pooling is applied at the early stages of the network to capture important local image features, which is appropriate when the images are large enough. However, our experiments show that when the images are small, applying max pooling at the early stages of the network leads to information loss and, hence, a drop in classification performance compared to applying average pooling at the early layers. The average pooling, on the other hand, gives equal weight to all the activations regardless of their importance. Therefore, the class-specific information in the feature maps could be downgraded, and features corresponding to the background could dominate the pooled representation.
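A toy numerical example makes this trade-off concrete. Assume a single 8 × 8 feature map in which a small, strongly activated lesion-like region covers only four pixels of an otherwise inactive background (the values are purely illustrative):

```python
import torch
import torch.nn.functional as F

fmap = torch.zeros(1, 1, 8, 8)      # one feature map, near-zero background
fmap[0, 0, 2:4, 2:4] = 5.0          # small "lesion": 4 of 64 pixels strongly activated

global_max = F.adaptive_max_pool2d(fmap, 1).item()   # 5.0: the lesion response survives
global_avg = F.adaptive_avg_pool2d(fmap, 1).item()   # 0.3125: diluted by the background
```

With global max pooling the small abnormal region fully determines the pooled value, whereas with global average pooling it is largely averaged away; the opposite situation arises when the class-specific evidence covers most of the image.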
Usually, average pooling is used as the global pooling operator to capture the contribution of all the features (e.g., ResNet [46]). In addition, the network may converge faster, as all the network nodes are updated in the learning stage. Our experiments also confirm that average pooling is a better choice than max pooling as the global pooling operator.

As discussed, neither the max nor the average pooling is applicable in all scenarios; each has its own merits and demerits. To overcome this, and to take advantage of both, mixed max-average pooling (Sect. 2.2) and soft pooling techniques (Sect. 2.3) have been proposed. In mixed max-average pooling, the max and the average pooling are combined with weights. In soft pooling, the pooled representation is obtained as a weighted sum of the local features. In this way, all the elements in a feature map contribute to the pooled representation, and their contributions are determined by their activation values: larger weights for high activations and lower weights otherwise. There are various ways to determine these weights (refer to Sect. 2.3). Rank-based pooling (explained in Sect. 2.3) can also be considered a soft pooling technique, but it differs in how the local features are weighted in the pooled representation: the top k activations receive a weight of one, and the others receive zero weight. However, unlike the max and the average pooling, these approaches (mixed max-average, soft, and rank-based pooling) introduce new free parameters which need to be selected carefully for improved classification performance. We show by experiments that these approaches are applicable in situations where the class-specific features are small compared to the size of the images; in such scenarios, soft pooling gives improved performance compared to max and average pooling. In addition, we also show by experiments that when the class-specific features are spread all over the images (e.g., in the medical domain, each abnormal image contains more abnormal regions than normal ones), average pooling is the better choice.

One of the main problems with training CNNs is overfitting, particularly when the CNN is trained with a small amount of data. To reduce overfitting, various pooling techniques have been investigated which introduce some stochasticity into the pooling process (these techniques are explained in Sect. 2.4). Mainly, two types of stochasticity are used: the first focuses on the pooling stage itself, and the second on the spatial sampling stage of pooling. For example, stochastic pooling [142] introduces randomness into the pooling stage of network training by randomly selecting an activation within each pooling region according to a multinomial distribution given by the values within that region. In S3 pooling [143] and Fractional Max Pooling [39], randomness is applied at the spatial sampling stage of the pooling. Note that dropout [106] is another way to reduce overfitting, by randomly dropping some network nodes at the training stage, and it is computationally more efficient than most of the above-mentioned approaches.
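As a concrete illustration of the soft pooling idea above, and of the GM pooling used in our experiments, the sketch below shows two common soft global-pooling formulations. The parameter values (beta and p) are illustrative defaults rather than the settings used in the paper.

```python
import torch

def soft_pool(x, beta=1.0):
    """Soft global pooling sketch: each activation is weighted by a softmax over
    its own feature map, so larger activations contribute more, smoothly.
    beta -> 0 approaches average pooling; beta -> infinity approaches max pooling.
    x: (B, C, H, W) -> (B, C)."""
    b, c, h, w = x.shape
    flat = x.view(b, c, h * w)
    weights = torch.softmax(beta * flat, dim=2)
    return (weights * flat).sum(dim=2)

def gm_pool(x, p=3.0, eps=1e-6):
    """Generalised-mean (GM) pooling sketch: ((1/N) * sum_i x_i^p)^(1/p).
    p = 1 recovers average pooling; large p approaches max pooling.
    x: (B, C, H, W) -> (B, C)."""
    return x.clamp(min=eps).pow(p).mean(dim=(2, 3)).pow(1.0 / p)
```

Both formulations reduce to average pooling at one extreme of their parameter and approach max pooling at the other, which is exactly the trade-off these methods are designed to control.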
The global pooling operation helps to obtain an orderless representation of the local features. This orderless representation is very useful for capturing discriminative features regardless of where they appear in the images. However, for some classification problems this orderless representation may fail to capture very important information, as it completely discards the location information of the local features. In natural images, for example, the sky is always in the upper part of the image. Similarly, for some medical image classification problems, such as classifying cell images, the location information may be useful: the Golgi class (refer to Fig. 4) has a ring-like structure, and capturing this information is very useful for discriminating the Golgi class from the others. The orderless representation may fail to capture such information. To capture this local structure information, various approaches such as Spatial Pyramid Pooling [45] and Cell Pyramid Pooling [130] have been proposed. These approaches are discussed in detail in Sect. 2.5.

It has been reported in the literature that for fine-grained image classification, capturing the interaction between different features is very useful for improving the classification performance [71]. The widely used average pooling uses a first-order statistic, i.e., the mean, to aggregate the features. This statistic is not designed to capture the interaction between different local features, such as object co-occurrence, and may not be useful for fine-grained image classification. Higher-order statistical pooling techniques such as bilinear pooling [31, 70, 71] have been proposed for this purpose and reportedly show improved performance over max and average pooling for fine-grained image classification. We discuss these techniques in detail in Sect. 2.6. However, our experiments hardly show any improvement by bilinear pooling compared to the traditional max and average pooling as the global pooling operators.

All the features in a particular feature map cannot be treated equally: some features are more important than others. Attention-based pooling (discussed in Sect. 2.8) has recently received much attention. These approaches weight the importance of the local image features using attention maps generated during the training process. In the attention maps, the 'important' regions receive higher weights than the 'unimportant' regions. The pooled representation is then obtained as the attention-weighted aggregation of the local features. Our experiments on the DR dataset show improved performance when attention-based pooling is applied compared to not using it.
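A generic attention-weighted global pooling layer of the kind described above can be sketched as follows. This is only a minimal illustration, not the A²-block [13] or CBAM [131] used in our experiments; the module name and the single-map attention design are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Illustrative attention-weighted global pooling: a 1x1 convolution predicts
    one attention map, a softmax over spatial positions turns it into weights,
    and the pooled vector is the weighted sum of the local features."""

    def __init__(self, in_channels):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        attn = self.score(x).view(b, 1, h * w)
        attn = torch.softmax(attn, dim=2)      # weights sum to 1 over positions
        feats = x.view(b, c, h * w)
        return (feats * attn).sum(dim=2)       # (B, C)
```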
Most of the pooling approaches discussed above (including max, average, and mixed max-average [63] pooling) only consider the statistics of the features inside the considered pooling region of a particular feature map; pooling regions in each feature map are treated independently of each other. However, some approaches also consider the statistics of the entire feature map (e.g., Global Feature Guided Local pooling [57]), or statistics from the pooling regions adjacent to the considered one (e.g., Dynamic Correlation pooling [11]), to calculate the output or to determine the type of pooling to be applied to that region.

Usually, pooling is performed separately on each feature map (or channel), and therefore the number of channels in the pooled representation is the same as the number of input channels. Examples of such approaches include the average and max pooling, linear combinations of them, soft pooling, and stochastic pooling. However, different channels of the same set of feature maps are highly correlated and should be treated jointly [83]. Second-order pooling (Sect. 2.6), implicit pooling mechanisms (Sect. 2.9), clustering-based aggregation schemes (Sect. 2.10), and strided convolutions (Sect. 2.11) do not consider feature channels independently. Therefore, the number of channels in the output is not necessarily the same as the number of channels in the input feature maps. For example, in strided convolution [105] the number of channels in the output feature maps is equal to the number of convolution filters used. To take advantage of different types of pooling mechanisms, their combinations have also been considered; e.g., in [96] DPP is combined with S3 pooling to retain the important information in the feature maps while at the same time learning representations which are less prone to overfitting. It should be noted that there is no single pooling technique which works in all scenarios, and its selection usually depends on the characteristics of the application. One limitation of our study is that only two datasets were considered, but note that they differ from each other in terms of their modalities and characteristics.

In this paper, we reviewed different kinds of pooling techniques proposed in the literature of computer vision, together with the medical imaging domains where these techniques are used (refer to Table 2). Their advantages, disadvantages, and applicability in different scenarios were discussed in detail. In addition, a comprehensive set of experiments was conducted on a selected set of pooling techniques tested on two public medical image datasets. Our experimental results suggest that the pooling technique for a particular classification task should be selected by considering the scale of the class-specific features that appear in the images. We found that global average pooling generally gives better results than global max pooling. In addition, applying max pooling at the earlier stages of the network, particularly for datasets with smaller images, may lead to a drop in performance due to information loss. Higher-order statistics in the form of bilinear pooling, which captures the interaction between different feature channels, do not seem to provide a significant improvement over simple approaches such as max and average pooling on the two datasets we considered. Adding attention layers improves the classification performance compared to a system without attention layers. We believe that this review and the comparative study will provide a guideline for the choice of pooling mechanisms for various medical image analysis tasks.
Acknowledgements NR was partially supported by the NSF Sri Lanka grant NSF/RPHS/2016/D02. We gratefully acknowledge the NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

Conflict of interest The authors declare that they have no conflict of interest.

References
Kaggle diabetic retinopathy detection: team o_O solution
NetVLAD: CNN architecture for weakly supervised place recognition
A progressively-trained scale-invariant and boundary-aware deep neural network for the automatic 3D segmentation of lung lesions
A simple squared-error reformulation for ordinal classification
Comparison of methods generalizing max- and average-pooling
A theoretical analysis of feature pooling in visual recognition
Signal recovery from pooling representations
Multiple instance learning: a survey of problem characteristics and applications
Semantic segmentation with second-order pooling
A mix-pooling CNN architecture with FCRF for brain tumor segmentation
A convolutional neural network with dynamic correlation pooling
DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs
A²-Nets: double attention networks
Graph convolutional network with structure pooling and jointwise channel attention for action recognition
Gated square-root pooling for image instance retrieval
ViP: virtual pooling for accelerating CNN-based image classification and object detection
Retrieval of brain tumors by adaptive spatial pooling and Fisher vector representation
Second-order temporal pooling for action recognition
Higher-order pooling of CNN features via kernel linearization for action recognition
Deep generalized max pooling
Deep filter banks for texture recognition and segmentation
Diabetic retinopathy classification using an efficient convolutional neural network
The Cityscapes dataset for semantic urban scene understanding
Visual categorization with bags of keypoints
Kernel pooling for convolutional neural networks
Maximal function pooling with applications
Images as sets of locally weighted features
Alpha-pooling for convolutional neural networks
The PASCAL Visual Object Classes (VOC) challenge
Geometric lp-norm feature pooling for image classification
Compact bilinear pooling
LIP: local importance-based pooling
HEp-2 cell image classification with deep convolutional neural networks
Global second-order pooling convolutional networks
Encoder-decoder with dense dilated spatial pyramid pooling for prostate MR images segmentation
Attentional pooling for action recognition
Multi-scale orderless pooling of deep convolutional activation features
Learnable pooling in graph convolution networks for brain surface analysis
Kaggle diabetic retinopathy detection competition report
Diagnose like a radiologist: attention guided convolutional neural network for thorax disease classification
Learned-norm pooling for deep feedforward and recurrent neural networks
HEp-2 cell classification using K-support spatial pooling in deep CNNs
CABNet: category attention block for imbalanced diabetic retinopathy grading
Spatial pyramid pooling in deep convolutional networks for visual recognition
Deep residual learning for image recognition
International competition on cells classification by fluorescent image analysis
HEp-2 cell classification in indirect immunofluorescence images
Squeeze-and-excitation networks
FC4: fully convolutional color constancy with confidence-weighted pooling
Densely connected convolutional networks
Aggregating local descriptors into a compact image representation
Linear spatial pyramid matching using sparse coding for image classification
RunPool: a dynamic pooling layer for convolution neural network
Weakly-supervised localization and classification of proximal femur fractures
Kaggle diabetic retinopathy detection: 3rd place solution report
Global feature guided local pooling
Learning multiple layers of features from tiny images
ImageNet classification with deep convolutional neural networks
Ordinal pooling networks: for preserving information over shrinking feature maps
TI-POOLING: transformation-invariant pooling for feature learning in convolutional neural networks
HEp-2 cell classification using shape index histograms with donut-shaped spatial pooling
Generalizing pooling functions in convolutional neural networks: mixed, gated, and tree
Learnable dynamic temporal pooling for time series classification
Deep CNNs for HEp-2 cells classification: a cross-specimen analysis
Detachable second-order pooling: toward high-performance first-order networks
A large-scale database and a CNN model for attention-based glaucoma detection
Towards faster training of global covariance pooling networks by iterative matrix square root normalization
Is second-order information helpful for large-scale visual recognition?
Improved bilinear pooling with CNNs
Bilinear CNNs for fine-grained visual recognition
Object recognition from local scale-invariant features
Cross-convolutional-layer pooling for image recognition
Hierarchical adaptive pooling by capturing high-order dependency for graph representation learning
Adaptive spatial pooling for image classification
Subcategory classifiers for multiple-instance learning and its application to retinal nerve fiber layer visibility classification
An automated pattern recognition system for classifying indirect immunofluorescence images of HEp-2 cells and specimens
Hierarchical mixpooling and its applications to biomedical image classification
Cascaded atrous convolution and spatial pyramid pooling for more accurate tumor target segmentation for rectal cancer radiotherapy
Learnable pooling with Context Gating for video classification
Bags of local convolutional features for scalable instance search
Accurate classification of cherry fruit using deep CNN based on hybrid pooling approach
Generalized max pooling
A dynamic pooling based convolutional neural network approach to detect chronic kidney disease
Loss functions for optimizing kappa as the evaluation measure for classifying diabetic retinopathy and prostate cancer images
LPM: learnable pooling module for efficient full-face gaze estimation
Learning bag-of-features pooling for deep convolutional neural networks
Fisher kernels on visual vocabularies for image categorization
Learning to detect chest radiographs containing pulmonary lesions using visual attention networks
From image-level to pixel-level labeling with convolutional networks
Concentric circle pooling in deep convolutional networks for remote sensing scene classification
Polycentric circle pooling in deep convolutional networks for high-resolution remote sensing image recognition
Deep image mining for diabetic retinopathy screening
Convolutional neural networks: an overview and application in radiology
CheXNet: radiologist-level pneumonia detection on chest x-rays with deep learning
Detail-preserving pooling in deep networks
RNNPool: efficient non-linear pooling for RAM constrained inference
Evaluation of pooling operations in convolutional architectures for object recognition
Multipartite pooling for deep convolutional neural networks
EasyConvPooling: random pooling with easy convolution for accelerating training and testing
Rank-based pooling for deep convolutional neural networks
Generalized orderless pooling performs implicit salient matching
Deep adaptive temporal pooling for activity recognition
A sparsity-based stochastic pooling mechanism for deep convolutional neural networks
Striving for simplicity: the all convolutional net
Dropout: a simple way to prevent neural networks from overfitting
Refining activation downsampling with SoftPool
Learning pooling for convolutional neural network
Convolutional neural networks for medical image analysis: full training or fine tuning?
Convolutional neural network with spatial pyramid pooling for hand gesture recognition
Particular object retrieval with integral max-pooling of CNN activations
A hybrid pooling method for convolutional neural networks
Hybrid pooling for enhancement of generalization ability in deep convolutional neural networks
Bag-of-words representation in image annotation: a review
Adaptive region pooling for object detection
New deep neural nets for fine-grained diabetic retinopathy recognition on hybrid color space
The application of series multi-pooling convolutional neural networks for medical image segmentation
RP-Net: a 3D convolutional neural network for brain segmentation from magnetic resonance imaging
Global gated mixture of second-order pooling for improving deep convolutional neural networks
Cerebral microbleed detection based on the convolution neural network with rank based average pooling
Alcoholism detection by data augmentation and convolutional neural network with stochastic pooling
Multiple sclerosis identification by 14-layer convolutional neural network with batch normalization, dropout, and stochastic pooling
PSSPNN: PatchShuffle stochastic pooling neural network for an explainable diagnosis of COVID-19 with multiple-way data augmentation
ChestX-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
Second-order pooling for graph neural networks
Diabetic retinopathy detection via deep convolutional networks for discriminative localization and visual explanation
Zoom-in-Net: deep mining lesions for diabetic retinopathy detection
Kernelized subspace pooling for deep local descriptors
Building detail-sensitive semantic segmentation networks with polynomial pooling
Automatic classification of human epithelial type 2 cell indirect immunofluorescence images using cell pyramid matching
CBAM: convolutional block attention module
Max-pooling dropout for regularization of convolutional neural networks
Task-driven feature pooling for image classification
Multi-scale retinal vessel segmentation using encoder-decoder network with squeeze-and-excitation connection and atrous spatial pyramid pooling
Multiple clustered instance learning for histopathology cancer image classification, segmentation and clustering
Exploit all the layers: fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers
Spatial pyramid co-occurrence for image classification
Multi-scale pyramid pooling for deep convolutional representation
Mixed pooling for convolutional neural networks
Spatial pyramid-enhanced NetVLAD with weighted triplet loss for place recognition
Statistically-motivated second-order pooling
Stochastic pooling for regularization of deep convolutional neural networks
S3Pool: pooling with stochastic spatial sampling
AlphaMEX: a smarter global pooling method for convolutional neural networks
Pose pooling kernels for sub-category recognition
Depth-wise separable convolutions and multi-level pooling for an efficient spatial CNN-based steganalysis
Global learnable pooling with enhancing distinctive feature for image classification
Abnormal breast identification by nine-layer convolutional neural network with parametric rectified linear unit and rank-based stochastic pooling
A five-layer deep convolutional neural network with stochastic pooling for chest CT-based COVID-19 diagnosis
LiftPool: bidirectional ConvNet pooling
Multiactivation pooling method in convolutional neural networks for image recognition