WaveCNet: Wavelet Integrated CNNs to Suppress Aliasing Effect for Noise-Robust Image Classification
Qiufu Li, Linlin Shen, Sheng Guo, Zhihui Lai
IEEE Transactions on Image Processing, 2021-07-28. DOI: 10.1109/tip.2021.3101395

Abstract: Though widely used in image classification, convolutional neural networks (CNNs) are prone to noise interruptions, i.e., the CNN output can be drastically changed by small image noise. To improve the noise robustness, we integrate CNNs with wavelets by replacing the common down-sampling operations (max-pooling, strided-convolution, and average-pooling) with discrete wavelet transform (DWT). We first propose general DWT and inverse DWT (IDWT) layers applicable to various orthogonal and biorthogonal discrete wavelets like Haar, Daubechies, and Cohen, and then design wavelet integrated CNNs (WaveCNets) by integrating DWT into commonly used CNNs (VGG, ResNets, and DenseNet). During down-sampling, WaveCNets apply DWT to decompose the feature maps into low-frequency and high-frequency components. The low-frequency component, which contains the main information including the basic object structures, is transmitted into the following layers to generate robust high-level features. The high-frequency components are dropped to remove most of the data noise. The experimental results show that WaveCNets achieve higher accuracy on ImageNet than various vanilla CNNs. We have also tested the performance of WaveCNets on the noisy version of ImageNet, ImageNet-C, and under six adversarial attacks; the results suggest that the proposed DWT/IDWT layers provide better noise-robustness and adversarial robustness. When WaveCNets are applied as backbones, the performance of object detectors (i.e., faster R-CNN and RetinaNet) on the COCO detection dataset is consistently improved. We believe that suppression of the aliasing effect, i.e., the separation of low-frequency and high-frequency information, is the main advantage of our approach. The code of our DWT/IDWT layer and different WaveCNets is available at https://github.com/CVI-SZU/WaveCNet.

Fig. 1: Max-pooling is a commonly used down-sampling operation in deep networks, which can easily break the basic object structures. Discrete wavelet transform (DWT) decomposes an image X into its low-frequency component X_ll and high-frequency components X_lh, X_hl, X_hh. While X_lh, X_hl, X_hh represent image details including most of the noise, X_ll is a low-resolution version of the image, in which the basic object structures are represented. In the figure, the window boundary in area A (AP) and the poles in area B (BP) are broken by max-pooling, while the principal features of these objects are kept in the DWT output (AW and BW).

Small noise, including the common spatial noise [1] and the specially designed adversarial noise [2]-[6], can drastically change the final prediction of a well-trained convolutional neural network (CNN) for image classification. The recent studies [7], [8] show that the noise may be enlarged as the image data flows through the deep networks. These phenomena illustrate the weak noise-robustness of CNNs, which is closely related to the down-sampling.
The commonly used down-sampling operations in deep networks, such as max-pooling, average-pooling, and strided-convolution, usually ignore the classic sampling theorem [9], which results in aliasing among the data components in different frequency intervals. While the noise in the data mostly consists of high-frequency components, the low-frequency component contains the main information, such as the basic object structures. The aliasing therefore introduces residual noise into the down-sampled data and breaks the basic structures, which degrades the accuracy and noise-robustness of CNNs. Fig. 1 presents a max-pooling example.

In signal processing, to avoid aliasing, the data is usually decomposed into different frequency intervals using time-frequency analysis tools, such as wavelets [10], [11]. Discrete wavelet transform (DWT), consisting of filtering and down-sampling, decomposes 2D data into one low-frequency and three high-frequency components, as Fig. 1 shows. DWT separates the main information and the details of the data, while inverse DWT (IDWT), consisting of up-sampling and filtering, reconstructs the original data from the DWT output.

In this paper, to suppress the aliasing effect in CNNs for noise-robust image classification, we integrate the commonly used CNN architectures with the discrete wavelet transform. We first analyze the data forward and backward propagation in the wavelet transforms (DWT and IDWT), and rewrite them as general network layers in PyTorch [12]. Then, we design wavelet integrated convolutional networks (WaveCNets). In WaveCNets, during down-sampling, the feature maps are decomposed by DWT into low-frequency and high-frequency components. While the low-frequency component is transmitted to the following layers for robust high-level features, the high-frequency components are dropped to resist noise propagation. WaveCNets are evaluated on ImageNet [13] and COCO [14] in terms of classification accuracy, noise-robustness, adversarial robustness, and detection precision, when various discrete wavelets and CNN architectures are used. In summary:
1) We design general DWT/IDWT layers applicable to various discrete wavelets, which can be used to design end-to-end wavelet integrated deep networks.
2) We propose WaveCNets, which transform the feature maps using DWT during down-sampling to suppress the aliasing effect for noise-robust image classification.
3) Our WaveCNets are evaluated on both classification and detection tasks. The results on ImageNet classification show that our approach achieves higher accuracy, better noise-robustness, and increased adversarial robustness. The detection results on the COCO benchmark suggest that faster R-CNN [15] and RetinaNet [16], using WaveCNets as backbones, achieve better performance on detecting small, medium, and large objects.

An earlier version of this paper [17] was accepted by the 2020 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2020). We have extensively extended the contents by presenting an introduction to wavelet theory, exploring the details of the 2D DWT/IDWT layers, and giving the details of the backward propagation of both the 1D and 2D DWT/IDWT layers. For experiments, we tested the performance of using wavelet denoising as a preprocessing step, and the comparison shows that our DWT/IDWT layers achieve significantly better noise-robustness. We have also tested the performance of our WaveCNets against adversarial attacks.
As expected, the results suggest that our DWT/IDWT layers can reasonably improve the robustness of different CNNs against adversarial samples. In addition, we also evaluate the performance of WaveCNets based backbones for the object detection task using the COCO dataset. The results show that wavelet consistently improves the detection performance on small, medium, and large objects.

The recent studies show that ImageNet-trained CNNs prefer to extract features from object textures, which are sensitive to noise [8], [18], and that the noise can be enlarged as the image data flows through the layers of a CNN [7], [19], leading to wrong final predictions. These works illustrate the weak noise-robustness of modern CNNs. When the input image is corrupted by noise, the output of a CNN can be significantly changed, regardless of whether the noise is easily perceived by humans or not [1]-[6]. The common spatial noise, such as Gaussian noise, shot noise, and impulse noise, can noticeably degrade the image quality and decrease the classification accuracy of CNNs [1]. Adversarial noise, produced by specially designed algorithms [2]-[6], can successfully attack well-trained CNNs, although the noise is not perceived by the human visual system. A benchmark evaluating CNN performance on noisy images is proposed in [1]. Our WaveCNets will be evaluated using this benchmark.

The conventional data augmentation of adding noise to the training images can increase the CNN performance on noisy images [7]. Stylized ImageNet [8] is proposed by stylizing ImageNet images with style transfer, to train CNNs to extract more robust features from object structures. Such augmentation of the training data can noticeably increase the training time or decrease the accuracy on normal clean images.

Recent studies propose CNN architectures that integrate filtering blocks to denoise the feature maps during network inference. In [19], the authors design a high-level representation guided denoiser to filter the contaminated images before inputting them into the CNN, which complicates the whole deep network architecture. A spatial filtering block is presented in [7] to denoise the CNN feature maps and suppress the noise effect on the CNN prediction. However, this block filters the feature maps with Gaussian, mean, or median filtering, which denoises over the whole frequency domain and easily breaks the basic object structures represented by the low-frequency component. Therefore, this denoising block requires a residual structure for the CNN to converge.

The weak noise-robustness of CNNs is closely related to the down-sampling in the networks. The down-sampling operations, such as max-pooling, average-pooling, and strided-convolution, are introduced into deep networks for local connectivity and weight sharing. These operations can easily erase or dilute image details [20], [21], and then break the basic object structures. While mixed-pooling [20] and stochastic pooling [21] have been proposed to address these drawbacks, max-pooling, average-pooling, and strided-convolution are still the most widely applied down-sampling operations in CNNs [22]-[25]. The above down-sampling operations usually ignore the Nyquist sampling theorem [9], [26], [27], which results in aliasing among the data components, breaks basic object structures, and accumulates random noise during CNN inference. Fig. 1 presents a max-pooling example.
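The folding of high frequencies into low frequencies under plain down-sampling is easy to reproduce. The following toy script (our own illustration, not from the paper) compares plain stride-2 down-sampling with Haar low-pass filtering followed by down-sampling, i.e., the low-frequency band of a one-level Haar DWT:

```python
# Aliasing under plain stride-2 down-sampling vs. Haar low-pass + down-sampling.
import math
import torch

t = torch.arange(256, dtype=torch.float32)
x = torch.cos(2 * math.pi * 0.45 * t)         # high-frequency signal: 0.45 cycles/sample

naive = x[::2]                                 # plain down-sampling (no filtering)
haar_ll = (x[0::2] + x[1::2]) / math.sqrt(2)   # Haar low-pass filter, then down-sample

# After stride-2 sampling, 0.45 cycles/sample folds back to 0.1 cycles/sample:
# the high-frequency content reappears as a spurious low-frequency (alias) wave.
for name, sig in [("naive", naive), ("haar_ll", haar_ll)]:
    print(f"{name:8s} energy = {sig.pow(2).mean().item():.4f}")
# naive keeps the alias at full strength (energy ~0.5), while the Haar low band
# attenuates its energy by roughly 20x (~0.025): low-pass filtering before
# down-sampling suppresses, though does not remove, the aliasing.
```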
In anti-aliased CNNs [27], low-pass filtering is integrated with the down-sampling to increase the shift-invariance of CNNs, and the author was surprised to also observe increased classification accuracy and better noise-robustness. Our WaveCNets and anti-aliased CNNs differ significantly in two aspects: (1) While anti-aliased CNNs are designed to increase the CNN shift-invariance, WaveCNets are conceived on the idea of suppressing the aliasing effect using wavelet transforms, to increase the noise-robustness. (2) The low-pass filters used in anti-aliased CNNs are empirically designed based on the row vectors of Pascal's triangle, which is ad hoc, and no theoretical justification is given. No up-sampling operation, i.e., reconstruction, is available for such a low-pass filter. Therefore, the anti-aliased U-Net [27], the encoder-decoder version of the anti-aliased deep network, has to apply the same filtering after normal up-sampling to achieve the anti-aliasing effect. In comparison, our WaveCNets are justified by the well-defined wavelet theory [10], [11], and both the usual down-sampling and up-sampling operations can be replaced by DWT and IDWT [17], [28], respectively.

In deep networks for image-to-image translation tasks, up-sampling operations, such as max-unpooling in SegNet [29], deconvolution in U-Net [30], and bilinear interpolation in DeepLab [31], [32], are widely applied to upgrade the feature map resolution. However, these up-sampling operations cannot precisely recover the original data, due to the absence of a strict mathematical foundation. They do not perform well in the restoration of image details, while the proposed DWT/IDWT layers can relieve this drawback [28].

Wavelets [10], [11] have wide applications in signal processing, pattern recognition, etc., due to their superior performance in time-frequency analysis. Discrete wavelet transform (DWT) decomposes data into various components and separates the main information from the details of the data. The original data can be reconstructed by inverse DWT (IDWT) from the DWT output. In signal processing, DWT is a useful tool for anti-aliasing, and in this paper we mainly explore its application in suppressing the aliasing effect in CNNs for noise-robust image classification.

Exploring the optimal deep network from mathematical and algorithmic perspectives, Mallat et al. present ScatNet [33] by cascading wavelet transform with average-pooling and a nonlinear modulus operation. ScatNet preserves the image detail information and extracts translation-invariant features robust to deformations. Compared with the CNNs of the same period, ScatNet achieves better performance on texture discrimination and handwritten digit recognition tasks. In [34], Wiatowski et al. extend the idea of ScatNet to semi-discrete frames and general nonlinear operations (ReLU, sigmoid, etc.). However, ScatNet is essentially a hand-designed feature extractor without learnable parameters. Due to its strict mathematical terms, ScatNet cannot be easily transferred to image-to-image tasks, such as image segmentation.

In the early studies of wavelet integrated neural networks, researchers implemented wavelet transforms using parameterized one-layer networks and searched for the optimal wavelet in the parameter domain, for function approximation [35] and signal representation [36]. The recent work [37] applies this method with a deeper network for image classification.
However, this wavelet parameterized deep network is difficult to train because of the significantly increased computation [37]. In current deep learning, while the wavelet transform is commonly applied as image preprocessing or postprocessing [38]-[41], it is also used as a down-sampling or up-sampling operation in the design of deep networks [42]-[45]. Multi-level wavelet CNN (MWCNN) [42] is a wavelet integrated encoder-decoder for image restoration, which implements the wavelet packet transform (WPT) by processing the concatenation of the various components of the input data in a unified way. However, the details represented by the high-frequency components may not be perceived by MWCNN, because the data amplitude of the high-frequency components is much smaller than that of the low-frequency one. In [44], the authors apply the dual-tree complex wavelet transform (DT-CWT) and design a convolutional-wavelet neural network (CWNN) to extract robust features from SAR images. CWNN, which contains only two convolutional layers, adopts DT-CWT to suppress the noise and keep the object structures in the SAR images. DT-CWT is redundant, and the average value of its two low-frequency components is taken as the down-sampling output of CWNN. In [43], the authors propose a wavelet pooling layer using a two-level DWT and a one-level IDWT, while the back-propagation is implemented using a one-level DWT and a two-level IDWT, which does not follow the mathematical principle of the gradient. The authors design wavelet integrated networks and evaluate them on various datasets (MNIST [46], CIFAR-10 [47], SVHN [48], and KDEF [49]). However, these wavelet integrated networks consist of only four or five layers, and they are not systematically studied on the standard large-scale image dataset ImageNet [13]. The recent work [45] also applies the wavelet transform in deep network based image style transfer. Due to the absence of general wavelet transform layers, the above methods are evaluated with only one or two wavelets, like Haar or the dual-tree complex wavelet, and are thus not extensively evaluated. Our work designs general discrete wavelet transform (DWT/IDWT) layers and applies them to improve the performance of CNNs for image classification.

We first present the basic wavelet theory. A wavelet [10], [11] is associated with a scaling function φ(x) and a wavelet function ψ(x), whose shifts and expansions compose a stable basis for the signal space L²(ℝ). With this basis, a signal can be decomposed and reconstructed. The scaling and wavelet functions of a discrete wavelet are closely related to a low-pass filter l = {l_k}_{k∈ℤ} and a high-pass filter h = {h_k}_{k∈ℤ}, respectively. In practice, these filters are applied for the data decomposition and reconstruction in DWT and IDWT.

Orthogonal wavelets. The Daubechies wavelets are orthogonal; a set of orthogonal basis functions for L²(ℝ) can be derived from their scaling and wavelet functions. A Daubechies wavelet has an approximation order parameter p, and the length of its filter is 2p. Table I shows the low-pass filters l = {l_k} of the wavelets with order p, 1 ≤ p ≤ 6, while the high-pass filter h = {h_k} can be deduced from

    h_k = (−1)^k l_{N−k},    (1)

where N is an odd number. Daubechies(1) is the Haar wavelet.

Biorthogonal wavelets. The Cohen wavelets are symmetric biorthogonal wavelets, and each of them is associated with a scaling function φ, a wavelet function ψ, and their dual functions φ̃ and ψ̃. Correspondingly, it has four filters: l, h, l̃, and h̃.
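For concreteness, the decomposition and reconstruction filters of these wavelets can be inspected with the PyWavelets package (our illustration; the paper tabulates the coefficients in Tables I and II instead). In PyWavelets' naming, Daubechies(p) is "dbp" and Cohen(p, p̃) is "biorp.p̃", and the alternating-sign relation (1) can be checked directly:

```python
# Inspect wavelet filter banks with PyWavelets (pip install PyWavelets).
import pywt

for name in ["haar", "db2", "bior2.2"]:       # Haar, Daubechies(2), Cohen(2,2)
    w = pywt.Wavelet(name)
    print(name, "orthogonal:", w.orthogonal)
    print("  dec_lo:", [round(c, 4) for c in w.dec_lo])   # decomposition filters (DWT)
    print("  dec_hi:", [round(c, 4) for c in w.dec_hi])
    print("  rec_lo:", [round(c, 4) for c in w.rec_lo])   # reconstruction filters (IDWT)
    print("  rec_hi:", [round(c, 4) for c in w.rec_hi])

# Deduce the high-pass filter from the low-pass filter via h_k = (-1)^k l_{N-k}
# with N = len(l) - 1; for db2 this reproduces PyWavelets' dec_hi (note that
# sign and indexing conventions vary between references).
l = pywt.Wavelet("db2").rec_lo
h = [(-1) ** k * l[len(l) - 1 - k] for k in range(len(l))]
print("derived h:", [round(c, 4) for c in h])
print("dec_hi   :", [round(c, 4) for c in pywt.Wavelet("db2").dec_hi])
```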
While a signal is decomposed using the filters l and h in DWT, it can be reconstructed using the dual filters l̃ and h̃ in IDWT. A Cohen wavelet has two order parameters, p and p̃. Table II shows the low-pass filters with orders 2 ≤ p = p̃ ≤ 5. Their high-pass filters can be deduced from

    h_k = (−1)^k l̃_{N−k},    h̃_k = (−1)^k l_{N−k},    (2), (3)

where N is an odd number. Cohen(1,1) is the Haar wavelet.

Wavelet theory is valid for finite or infinite filters, but the infinite case is rarely of practical interest. Besides the orthogonal and biorthogonal wavelets, various wavelets and beyond-wavelets, including multi-wavelets [50], the dual-tree complex wavelet [51], ridgelets [52], curvelets [53], bandelets [54], and contourlets [55], have been designed and studied. Wavelets have been widely used in signal processing, numerical analysis, pattern recognition, computer vision, quantum mechanics, etc.

The key issues for the general wavelet transform (DWT/IDWT) layers are the data forward and backward propagations. Although the following analysis is for orthogonal wavelets and 1D/2D data, it can be generalized to other discrete wavelets and 3D data with only slight changes.

1) 1D DWT/IDWT: DWT decomposes a given 1D data x = {x_j}_{j∈ℤ} into its low-frequency component x_low and high-frequency component x_high,

    x_low,k = Σ_j l_{j−2k} x_j,    x_high,k = Σ_j h_{j−2k} x_j,    (4)

where l = {l_k}_{k∈ℤ} and h = {h_k}_{k∈ℤ} are the low-pass and high-pass filters of an orthogonal wavelet. According to Eq. (4), DWT consists of filtering and down-sampling. IDWT reconstructs x from x_low and x_high via

    x_j = Σ_k ( l_{j−2k} x_low,k + h_{j−2k} x_high,k ).    (5)

In matrix-vector expressions, Eq. (4) and Eq. (5) can be rewritten as

    x_low = L x,    x_high = H x,    (6), (7)
    x = Lᵀ x_low + Hᵀ x_high,    (8)

where L and H are the matrices built from the shifted filters, i.e., (L)_{k,j} = l_{j−2k} and (H)_{k,j} = h_{j−2k}. Eqs. (6)-(8) present the forward propagations of 1D DWT and IDWT. The backward propagation of 1D DWT is closely associated with the gradients ∂x_low/∂x and ∂x_high/∂x, which can be derived from Eqs. (6) and (7),

    ∂x_low/∂x = L,    ∂x_high/∂x = H.    (9), (10)

Similarly, for the backward propagation of IDWT, the key issues are the gradients ∂x/∂x_low and ∂x/∂x_high, which can be derived from Eq. (8),

    ∂x/∂x_low = Lᵀ,    ∂x/∂x_high = Hᵀ.    (11), (12)

2) 2D DWT/IDWT: Given 2D data X, DWT usually performs 1D DWT on every row and column, i.e.,

    X_ll = L X Lᵀ,    X_lh = H X Lᵀ,    X_hl = L X Hᵀ,    X_hh = H X Hᵀ,    (13)

and the corresponding 2D IDWT is implemented with

    X = Lᵀ X_ll L + Hᵀ X_lh L + Lᵀ X_hl H + Hᵀ X_hh H.    (14)

In the output of 2D DWT, X_ll is the low-frequency component of the input X, which represents the main information, including the basic object structures; X_lh, X_hl, X_hh are the three high-frequency components, which save the horizontal, vertical, and diagonal details of X, respectively.

Fig. 2: The general denoising approach based on wavelet transforms and the one used in WaveCNets. (a) The general denoising approach using wavelet. (b) The simplest wavelet based "denoising" method, DWT_ll. While the general denoising approach keeps the data size, the simplest "denoising" method, DWT_ll, halves the data size.

Suppose 2D DWT/IDWT are applied in a deep network. The backward propagation of 2D DWT can then be implemented with the gradients

    ∂loss/∂X = Lᵀ G_ll L + Hᵀ G_lh L + Lᵀ G_hl H + Hᵀ G_hh H,    (15)

where G_ll, G_lh, G_hl, G_hh are the backward propagation outputs from the layer following the 2D DWT. Similarly, the backward propagation of 2D IDWT is implemented with the gradients

    ∂loss/∂X_ll = L G Lᵀ,    ∂loss/∂X_lh = H G Lᵀ,    ∂loss/∂X_hl = L G Hᵀ,    ∂loss/∂X_hh = H G Hᵀ,    (16)

where G is the backward propagation output from the layer following the 2D IDWT. The forward and backward propagations of 3D DWT and IDWT are slightly more complicated, but similar to those of 1D/2D DWT and IDWT. In practice, we choose the wavelets associated with finite filters, as shown in Table I and Table II. For finite data x ∈ ℝ^N and X ∈ ℝ^(M×N), L and H are truncated to the size N/2 × N or M/2 × M. We rewrite 1D/2D/3D DWT and IDWT as network layers in PyTorch [12], which are applicable to various discrete orthogonal and biorthogonal wavelets, like Haar, Daubechies, and Cohen.
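As a concrete illustration of such a layer, the following is a compact 2D Haar DWT/IDWT sketch in PyTorch (our simplification, not the released WaveCNet implementation, which additionally supports general Daubechies and Cohen filters). The filters are applied channel-wise via grouped convolutions, and autograd automatically reproduces the gradients of Eqs. (15) and (16):

```python
# 2D Haar DWT/IDWT as PyTorch modules (a minimal sketch of Eqs. (13)-(16)).
import torch
import torch.nn.functional as F

class HaarDWT2D(torch.nn.Module):
    """Decompose (B, C, H, W) into X_ll, X_lh, X_hl, X_hh at half resolution."""
    def __init__(self):
        super().__init__()
        lo = torch.tensor([1.0, 1.0]) / 2 ** 0.5          # Haar low-pass filter l
        hi = torch.tensor([1.0, -1.0]) / 2 ** 0.5         # Haar high-pass filter h
        # Four separable 2x2 kernels; band order (ll, lh, hl, hh) follows the
        # (row filter, column filter) convention, which varies between references.
        k = torch.stack([torch.outer(lo, lo), torch.outer(hi, lo),
                         torch.outer(lo, hi), torch.outer(hi, hi)])
        self.register_buffer("kernels", k.unsqueeze(1))   # (4, 1, 2, 2)

    def forward(self, x):
        b, c, h, w = x.shape
        weight = self.kernels.repeat(c, 1, 1, 1)          # one filter bank per channel
        y = F.conv2d(x, weight, stride=2, groups=c)       # filtering + down-sampling
        y = y.view(b, c, 4, h // 2, w // 2)
        return y[:, :, 0], y[:, :, 1], y[:, :, 2], y[:, :, 3]

class HaarIDWT2D(torch.nn.Module):
    """Reconstruct X from its four Haar sub-bands (perfect reconstruction)."""
    def __init__(self):
        super().__init__()
        self.dwt = HaarDWT2D()                            # reuse the same kernels

    def forward(self, ll, lh, hl, hh):
        b, c, h, w = ll.shape
        y = torch.stack([ll, lh, hl, hh], dim=2).reshape(b, 4 * c, h, w)
        weight = self.dwt.kernels.repeat(c, 1, 1, 1)
        return F.conv_transpose2d(y, weight, stride=2, groups=c)  # up-sampling + filtering

x = torch.randn(2, 3, 56, 56)
bands = HaarDWT2D()(x)
print(torch.allclose(HaarIDWT2D()(*bands), x, atol=1e-5))  # True: IDWT(DWT(x)) == x
```

Since the Haar analysis operator is orthogonal, the transposed convolution with the same kernels is both the adjoint used in the backward pass and the exact inverse used by IDWT.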
In the layers, we perform DWT and IDWT channel by channel for multi-channel data.

In noisy 2D data X, the noise is mostly represented by the high-frequency components. Therefore, as Fig. 2(a) shows, the general wavelet based denoising [56], [57] consists of three steps: (1) decompose the noisy data X using DWT into the low-frequency component X_ll and the high-frequency components X_lh, X_hl, X_hh; (2) filter the high-frequency components; (3) reconstruct the data from the processed components using IDWT. In this paper, we choose the simplest wavelet based "denoising", i.e., dropping the high-frequency components, as Fig. 2(b) shows. DWT_ll denotes the transform mapping the data to its low-frequency component. While the commonly used down-sampling operations may suffer from aliasing effects, our wavelet integrated down-sampling applies a low-pass filter before down-sampling, which helps suppress the aliasing effects.

Based on the commonly used CNNs, including VGG16bn, ResNets, and DenseNet121, we design WaveCNets (WVGG16bn, WResNets, WDenseNet121) by replacing their down-sampling operations with DWT_ll. As Fig. 3 shows, WaveCNets replace max-pooling and average-pooling with DWT_ll, and upgrade strided-convolution using a convolution with stride 1 followed by DWT_ll, i.e.,

    MaxPool_s → DWT_ll,    Conv_s → DWT_ll ∘ Conv_1,    AvgPool_s → DWT_ll,

where "MaxPool_s", "Conv_s", and "AvgPool_s" denote the max-pooling, strided-convolution, and average-pooling with stride s, respectively (a minimal code sketch of this replacement is given below). In the down-sampling of WaveCNets, DWT_ll halves the size of the feature maps and denoises them by removing their high-frequency components. The output of DWT_ll, i.e., the low-frequency component, saves the main information of the feature map for extracting identifiable features. In other words, DWT_ll helps WaveCNets to suppress the aliasing effect, i.e., to maintain the basic object structures in the CNN feature maps and to resist noise propagation. Therefore, wavelets are expected to lead to higher accuracy and better noise-robustness for CNN based image classification.

Compared with the original CNNs, WaveCNets employ no new learnable parameters, while the wavelet transform introduces additional computation. For a 2D tensor X of size M × N with C channels, the amounts of multiply-add operations used in 2D DWT and 2D IDWT follow from Eqs. (6)-(16) and grow linearly with C, M, N, and the filter length. Table III presents the ratios of wavelet related multiply-adds over the total operations of WaveCNets, when the input size is 3 × 224 × 224. We only count the multiply-adds of DWT_ll for WaveCNets.
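The following sketch (ours; the released WaveCNet code performs the equivalent surgery for all down-sampling layers and wavelets) shows the MaxPool_s → DWT_ll replacement on a torchvision ResNet-18. For the Haar wavelet, DWT_ll reduces to a 2 × 2 average pooling scaled by 2, so no filter bank is needed:

```python
# MaxPool_s -> DWT_ll on torchvision's ResNet-18 (Haar case).
import torch
import torch.nn.functional as F
import torchvision

class DWTll(torch.nn.Module):
    """Haar DWT_ll down-sampling: keep only the low-frequency band X_ll.
    For Haar, X_ll = (a + b + c + d) / 2 over each 2x2 block, i.e. 2 * avg-pool."""
    def forward(self, x):
        return 2.0 * F.avg_pool2d(x, kernel_size=2, stride=2)

model = torchvision.models.resnet18(num_classes=1000)
model.maxpool = DWTll()   # stride-2 max-pooling -> DWT_ll; parameter count unchanged
# The stride-2 convolutions inside the residual blocks would be handled analogously:
# a stride-1 convolution followed by DWT_ll (Conv_s -> DWT_ll o Conv_1).

x = torch.randn(2, 3, 224, 224)
print(model(x).shape)     # torch.Size([2, 1000]); spatial sizes match the original network
```

General wavelets require the grouped-convolution filter bank of the previous listing in place of the scaled average pooling.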
In our experiments, we mainly evaluate the classification accuracy and noise-robustness of WaveCNets using ImageNet [13]. We also test the improvement of WaveCNets on the classification of adversarial samples and on COCO detection.

A. ImageNet classification

ImageNet contains 1.2M training and 50K validation images from 1000 categories. WaveCNets are trained on the training set from scratch when various wavelets are used. We train them for 90 epochs using stochastic gradient descent (SGD) with a batch size of 256. The initial learning rate is 0.05 for WVGG16bn and 0.1 for WResNets and WDenseNet121, and it is decayed by a factor of 0.1 every 30 epochs. Table IV presents the top-1 accuracy of WaveCNets on the ImageNet validation set, where "haar", "dbp", and "chp.p̃" denote the Haar wavelet, the Daubechies wavelet with approximation order p, and the Cohen wavelet with orders (p, p̃). While the Haar and Cohen wavelets are symmetric, the Daubechies wavelets are not. In Table IV, the classification results of the original CNNs, sourced from the official PyTorch model zoo, are taken as the baseline. The accuracy differences of WaveCNets with respect to the baseline are parenthesized in the table.

The symmetric Haar and Cohen wavelets improve the classification accuracy of all CNNs, while the best wavelet varies with the CNN. Taking ResNet18 as an example, the symmetric wavelets improve the accuracy by about 1.50%, and the best performance (71.62%) of WResNet18 is achieved with wavelet "ch2.2". However, as the order increases, the asymmetric Daubechies wavelets decrease the classification accuracy of CNNs. Daubechies wavelets with lower orders ("db2" and "db3") can improve the CNN accuracy, while those with higher orders ("db5" and "db6") may reduce it. For example, the top-1 accuracy of WResNet18 decreases from 71.48% to 68.74% as the order increases from 2 to 6. In general, CNNs transformed by the symmetric wavelets perform better than those transformed by the asymmetric ones.

Fig. 4: Training and validation losses of ResNet18 and WResNet18(Haar). In the first 10 epochs, the loss of WResNet18(Haar) decreases faster than that of ResNet18, and the trend keeps nearly the same in the following epochs, which suggests that wavelet accelerates the CNN training in the initial training stage.

We retrain the original ResNet18 using the standard ImageNet classification training repository in PyTorch. Fig. 4 compares the losses of ResNet18 and WResNet18(Haar) during the training procedure, adopting red dashed and blue dashed lines for the training losses of ResNet18 and WResNet18(Haar), respectively. In the initial training stage (the first 10 epochs), the loss of WResNet18(Haar) decreases faster than that of ResNet18, and the trends become similar in the following epochs, which suggests that wavelet accelerates the ResNet18 training in the initial training stage. Finally, the training loss of WResNet18(Haar) is about 0.08 lower than that of ResNet18, although they employ the same amount of learnable parameters. On the validation set, the WResNet18 loss (blue solid line) is also always lower than the ResNet18 loss (red solid line), which leads to an increase of the classification accuracy by 1.71%.

The examples in Fig. 5 cover various deep network architectures (VGG16bn, ResNet18, ResNet34, ResNet50, and ResNet101) and various objects (suit, espresso, lycaenid, palace, jay, minivan, wall clock, stupa, etc.). While the original CNNs adopt various down-sampling operations, including max-pooling, average-pooling, and strided-convolution, WaveCNets replace them with DWT_ll. From Fig. 5, one can find that the backgrounds of the feature maps produced by WaveCNets are cleaner than those produced by the CNNs, and the object structures in the former are more complete. For example, in the first row of Fig. 5(g), the wall clock boundary in the ResNet50 feature map of size 56 × 56 is fuzzy, and the basic structures of the wall clocks are totally hidden by strong noise in the feature map of size 28 × 28. In the second row, the backgrounds of the feature maps produced by WResNet50(ch3.3) are very clean, and it is easy to figure out the wall clock structures in the feature map of size 56 × 56 and the wall clock areas in the feature map of size 28 × 28.
The above observations illustrate that the common down-sampling operations can result in the aliasing effect, i.e., accumulate noise and break the basic object structures, while DWT in WaveCNets relieves these drawbacks. We believe that this is the reason why WaveCNets achieve increased accuracy.

Fig. 5: In each subfigure, the first row shows the input image and its two feature maps output from the original CNN, while the second row shows the related information (the image, CNN, and WaveCNet names) and the feature maps output from the WaveCNet. Compared with the CNNs, the feature maps of WaveCNets are cleaner and the object structures are more complete.

In [27], the author was surprised at the increased classification accuracy of CNNs after low-pass filtering was integrated into the down-sampling. In [8], the authors show that "ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes". Our experimental results suggest that these observations may be sourced from the commonly used down-sampling operations, which tend to break the object structures and accumulate noise in the feature maps.

In [1], the authors corrupt the ImageNet validation set using 15 visual corruptions with five severity levels, to create ImageNet-C and test the robustness of ImageNet-trained classifiers to input corruptions. The 15 corruptions are sourced from four categories, i.e., noise (Gaussian noise, shot noise, impulse noise), blur (defocus blur, frosted glass blur, motion blur, zoom blur), weather (snow, frost, fog, brightness), and digital (contrast, elastic, pixelate, JPEG-compression). The corruption error

    CE_c^f = ( Σ_{s=1..5} E_{s,c}^f ) / ( Σ_{s=1..5} E_{s,c}^AlexNet ),    (31)

and its average over the corruption types c, the mean corruption error

    mCE^f = (1/|𝒞|) Σ_{c∈𝒞} CE_c^f,    (32)

are used to evaluate the performance of a trained classifier f, where E_{s,c}^f denotes the top-1 error of f on corruption type c at severity level s. In Eq. (31), the authors normalize the error using the top-1 error of AlexNet [58] to adjust for the varying difficulty of the corruptions. In this section, we use the noise part (750K images, 50K × 3 × 5) of ImageNet-C to evaluate the noise-robustness of WaveCNets, i.e., mCE_noise^WaveCNet. We test the top-1 errors of WaveCNets and AlexNet on each noise corruption type c at each severity level s, when WaveCNets and AlexNet are trained on the clean ImageNet training set, and then compute mCE_noise^WaveCNet according to Eqs. (31) and (32). Table V shows the detailed results (the "noisy" columns).

In Fig. 6, we show the noise mCEs of WaveCNets for different network architectures and various wavelets. The "baseline" corresponds to the noise mCEs of the original CNN architectures, while "dbp", "chp.p̃", and "haar" correspond to the mCEs of WaveCNets with the different wavelets. Except for VGG16bn, our method obviously increases the noise-robustness of the CNN architectures for image classification. For example, the noise mCE of ResNet18 (navy blue color and down-triangle marker in Fig. 6) decreases from 88.97 ("baseline") to 80.38 ("ch2.2"). One can find that all wavelets, including "db5" and "db6", improve the noise-robustness of ResNet18, ResNet34, and ResNet50, although the classification accuracy of the WResNets with "db5" and "db6" on clean images may be lower than that of the original ResNets. This means that our method indeed increases the noise-robustness of these network architectures.

Fig. 7 shows two example feature maps for well-trained ResNet18 and WResNet18 with clean and noisy images as input. In each subfigure, the first column shows the clean example image of size 224 × 224 from the ImageNet validation set and its three noisy versions corrupted by Gaussian noise, shot noise, and impulse noise.
The second column shows their feature maps generated by ResNet18, while the third column shows the feature maps generated by WResNet18 integrated with wavelet "ch2.2". These feature maps are captured from the 16th output channel of the last layer in the network blocks with tensor size 56 × 56. From these examples, one can find that it is difficult for the original CNN to suppress noise, while WaveCNet suppresses the noise and maintains the object structures during its inference. For example, in Fig. 7(a), the vase structures in the two feature maps generated by ResNet18 and WResNet18(ch2.2) are both complete when the clean vase image is fed into the networks. However, after the image is corrupted by noise, the ResNet18 feature maps contain very strong noise and the vase structures vanish, while the basic structures can still be observed in the WResNet18 feature maps. For the moped in Fig. 7(b), the network without wavelet, ResNet18, tends to focus its response on the two indigo areas of the moped in the feature maps, and the noise totally breaks the moped structures in the feature maps of the three noisy images. On the contrary, wavelet suppresses the noise, and the basic moped structures remain complete in the feature maps generated by WResNet18, whether the input image is corrupted by noise or not. This advantage improves the robustness of WaveCNets against noise.

Fig. 7: The feature maps sourced from clean and noisy images. In each subfigure, the first column shows the clean example image from the ImageNet validation set and its three noisy versions corrupted by Gaussian noise, shot noise, and impulse noise, respectively. The second column shows their feature maps generated by ResNet18, while the third column shows the feature maps generated by WResNet18 using wavelet "ch2.2", i.e., Cohen(2,2). It is difficult for the original CNN to suppress noise, while WaveCNet suppresses the noise and maintains the object structures during its inference.

The noise-robustness of VGG16bn is inferior to that of ResNet34, although they achieve similar accuracy (73.37% and 73.30%). Our method does not significantly improve the noise-robustness of VGG16bn, although it increases the accuracy by 1.03%. This means that VGG16bn may not be a proper architecture in terms of noise-robustness.

While wavelet integrated CNNs like WVGG16bn, WResNets, and WDenseNet121 have been shown to achieve better classification accuracy on ImageNet and noise-robustness on ImageNet-C, the wavelet denoising block shown in Fig. 2(a) can also be applied to preprocess noisy images and further improve the performance of various CNNs. For the noisy images in ImageNet-C, we apply a soft-threshold filtering operation to the high-frequency components X_lh, X_hl, X_hh, and test the performance of different CNNs when such denoising is applied or not. The commonly used soft-threshold filter is defined as

    soft_λ(x) = sign(x) · max(|x| − λ, 0).

For multiple levels of DWT and IDWT, various thresholds λ, or threshold functions λ(x), can be selected. In our work, we perform one level of DWT and IDWT, and the value of λ is set to 0.1. Table V shows the performance of different CNNs when the wavelet based denoising block is applied to the input images, or not. The numbers in brackets denote the differences in mCE when the denoising is applied, or not. One can observe from the table that the denoising preprocessing reduces the noise mCEs of the original CNNs.
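A runnable sketch of this preprocessing (our illustration, reusing the HaarDWT2D and HaarIDWT2D modules from the earlier listing; the experiment in Table V uses one DWT level and λ = 0.1):

```python
# One-level wavelet soft-threshold denoising (Fig. 2(a)): threshold only the
# high-frequency bands, keep X_ll, then reconstruct; the data size is preserved.
import torch

def soft_threshold(x, lam):
    # soft_lam(x) = sign(x) * max(|x| - lam, 0)
    return torch.sign(x) * torch.relu(x.abs() - lam)

def wavelet_denoise(x, lam=0.1):
    dwt, idwt = HaarDWT2D(), HaarIDWT2D()   # modules defined in the earlier sketch
    ll, lh, hl, hh = dwt(x)
    return idwt(ll, soft_threshold(lh, lam),
                soft_threshold(hl, lam), soft_threshold(hh, lam))

noisy = torch.rand(1, 3, 224, 224)
print(wavelet_denoise(noisy).shape)         # torch.Size([1, 3, 224, 224])
```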
The results also suggest that our proposed DWT integration in CNNs produces a much better improvement of the noise-robustness than the wavelet based denoising block; e.g., the mCE (80.38) of WResNet18(ch2.2) using the original noisy input is significantly lower than that of ResNet18 (87.77) fed with the denoised data.

Fig. 8: Wavelet integrated down-sampling in various modes. (a) Average mode [44]. (b) Concatenation mode [42]. In these two modes, the high-frequency components are added into or concatenated with the low-frequency component, which damages the useful information because of the high-frequency noise.

Different from our wavelet based down-sampling (Fig. 2(b)), other wavelet integrated down-sampling modes exist in the literature. In [44], the authors adopt as down-sampling output the average value of the multiple components of the wavelet transform, as Fig. 8(a) shows. In [42], the authors concatenate all the components output from DWT and process them in a unified way, as Fig. 8(b) shows. Taking ResNet18 as the backbone, we compare our wavelet based down-sampling with these previous approaches in terms of classification accuracy and noise-robustness. We rebuild ResNet18 using the three down-sampling modes shown in Fig. 2(b) and Fig. 8, denote them as WResNet18, WResNet18_A, and WResNet18_C, respectively, and train them on ImageNet with various wavelets. Table VI shows the accuracy on ImageNet and the noise mCEs on ImageNet-C. Generally, the networks using wavelet based down-sampling achieve better accuracy and noise mCE than the original ResNet18 (69.76% accuracy and 88.97 mCE).

Similar to WResNet18, the number of parameters of WResNet18_A is the same as that of the original ResNet18. However, the high-frequency components added into the feature maps damage the information contained in the low-frequency component, because of the high-frequency noise. WResNet18_A performs the worst among the networks using wavelet based down-sampling. Due to the tensor concatenation, WResNet18_C employs many more parameters (21.62 × 10^6) than WResNet18 and WResNet18_A (11.69 × 10^6). WResNet18_C thus increases the accuracy of WResNet18 by 0.11% to 0.77%, when various wavelets are used. However, due to the included noise, the concatenation does not evidently improve the noise-robustness. In addition, the number of parameters of WResNet18_C is almost the same as that of WResNet34 (21.80 × 10^6), while the accuracy and noise mCE of WResNet34 are obviously superior to those of WResNet18_C.

Many studies [2]-[7] have shown that CNNs can be fooled by adversarial samples, which are generated by adding specially designed noise to normal samples. While such noise usually consists of high-frequency components [59] and is not noticeable to human eyes, we believe that our WaveCNets can filter such noise and increase the robustness of CNNs against such adversarial attacks. We test the adversarial robustness of WaveCNets using six attack algorithms, i.e., the fast gradient sign method (FGSM) [2], iterative FGSM (IFGSM) [3], the iterative least-likely class attack (iterLL) [3], random FGSM (RFGSM) [4], Carlini and Wagner's L2 attack (CW) [5], and projected gradient descent (PGD) [6]. Using a public implementation of the above attackers, we generate adversarial samples from the ImageNet validation set in a black-box attack scenario, where the attacked model is "InceptionV3" trained on ImageNet.
Table VII shows the accuracy of WaveCNets (WResNet101 and WDenseNet121) on the six adversarial sample sets; the classifiers are trained only on the clean ImageNet training set. In Table VII, "none" corresponds to the accuracy of the original CNNs (ResNet101 and DenseNet121) on the ImageNet validation set attacked by the six algorithms. The parenthesized numbers are the accuracy differences of WaveCNets compared with the CNN results. From Table VII, we find that the wavelets increase the accuracy of CNNs on the adversarial samples. Taking the PGD attack as an example, our WResNet101 with different wavelets generally achieves about 3% higher accuracy than the original ResNet101 (53.94%). The best performance is achieved by "ch2.2", which is about 4.35% higher than that of ResNet101.

While our WaveCNets achieve consistent resistance to various attackers, their performance is inferior to that of specially trained defense methods [7]. With adversarial training, the feature denoising method proposed in [7] achieves the state-of-the-art result in defending against adversarial attackers. However, as the authors point out in their paper, the custom-designed denoising block requires a residual connection for stable training of their deep networks. The spatial filtering used in their method, such as mean filtering and median filtering, tends to perform denoising over the whole frequency domain, which might easily break the basic object structures in the feature maps. In contrast, the general DWT module in WaveCNets denoises the feature maps in the high-frequency intervals, which keeps the basic object structures and thus leads to increased adversarial robustness without requiring adversarial training.

To further illustrate the efficiency of wavelets in deep learning, we conduct detection experiments on the COCO [14] benchmark, comparing the performance of faster R-CNN [15], RetinaNet [16], and their wavelet integrated versions. We apply the original off-the-shelf detectors in mmdetection [60], and build their wavelet integrated versions by replacing the original backbones (ResNet50 and ResNet101) with WaveCNets (WResNet50 and WResNet101). In the wavelet integrated detectors, the DWT layer is applied to down-sample the feature maps and suppress the aliasing effects. Wavelets "haar" and "ch3.3" are applied in these two backbones, since they show the best performance on the ImageNet classification task. With the default hyper-parameters, the detectors are trained on trainval135k (115K images) of COCO; Table VIII shows the hyper-parameter settings. We train each detector for 12 epochs from scratch, and drop the learning rate by a factor of 0.1 at epochs 8 and 11. The trained detectors are tested on the minival (5K images) and test-dev (20K images) sets; Table VIII shows the detection results, and the numbers in brackets indicate the performance difference with respect to the baseline. For comparison, we also list the detection performance of the original detectors trained for 24 epochs. As shown in Table VIII, with 12-epoch training, the detection performance of the wavelet integrated detectors is superior to that of the original detectors, and comparable to or even better than that of the original detectors trained for 24 epochs. Taking faster R-CNN with ResNet101 as an example, on the test-dev set, the AP of the 12-epoch trained detector is 39.8%, which is increased by 1.4% to 41.2% after integrating wavelet "ch3.3". The increased AP is even higher than that of the original detector (40.5%) trained for 24 epochs.
Comparing the detection performance on objects of different sizes, one can find that wavelet consistently improves the detector performance on all three categories of objects. For faster R-CNN with ResNet101 on the test-dev set, the wavelet integrated detector increases the performance on small, medium, and large objects by 1.1%, 1.7%, and 1.2%, respectively. While extending the training to 24 epochs could further increase the AP_M and AP_L of the baseline detector by 0.7% and 1.6%, such extension decreases AP_S by 0.7%. Since aliasing effects are present in the original detectors, extending the training time seems to enlarge such effects. As the basic structures of small objects are more easily damaged by the aliasing effects, the detection performance on small objects decreases with the increase of training epochs. In contrast, wavelet suppresses the aliasing effects in the deep networks, and consistently increases the performance of detectors on objects of different sizes.

(Notes to Table VIII: (a) in mmdetection, the default ratio of initial learning rate to batch size is 1/800 for faster R-CNN and 1/1600 for RetinaNet; (b) the original detectors trained for 24 epochs are downloaded from mmdetection; in their training, the learning rates are decayed at epochs 16 and 22.)

Fig. 9 visually shows four example images detected by various detectors, i.e., faster R-CNNs and RetinaNets with different backbones (ResNet101 and WResNet101). In Fig. 9, the first column shows the manual detections, which are taken as ground truths. The second and third columns show the images detected by the detectors trained for 12 epochs, with and without wavelet integration, while the fourth column shows the images detected by the detector trained for 24 epochs without wavelet. As Fig. 9 shows, the detectors without wavelet usually falsely detect or miss the small target objects in the images, while the detectors integrated with wavelet correctly detect all target objects. Taking the "bench" image (shown in the first row of Fig. 9) as an example, the confidence of the detected bench increases from 0.810 to 0.861 when wavelet "ch3.3" is integrated into the ResNet101 backbone. Though increasing the training epochs (24) also improves this confidence for the baseline RetinaNet with ResNet101 backbone, that detector falsely detects some poles as birds.

Compared with the commonly used down-sampling, DWT requires more multiply-add operations, which decreases the detection speed. As shown in the sixth column of Table VIII, the detection speed of faster R-CNN slightly decreases from 18.1 FPS (frames per second) to 16.3 FPS when wavelet "haar" is integrated into the ResNet50 backbone, with an Nvidia Tesla V100 GPU as the computing platform.

We conduct 3D object classification experiments on the 3D point cloud dataset ModelNet40 [61], using 3D VGG and ResNet. ModelNet40 consists of 12311 (9843 for training and 2468 for test) 3D point clouds captured from 40 categories of objects. As shown in Fig. 10(a), the point cloud data representing a 3D object contains a series of 3D coordinates. As shown in Fig. 10(b), we transform the 3D point cloud data into a 3D tensor of size 32 × 32 × 32 as the input of the deep networks.

Fig. 10: (a) 3D point cloud data. (b) 3D tensor.

We perform the classification experiments using 3D VGG9, 3D ResNet16, and their wavelet integrated versions (3D WVGG9 and 3D WResNet16). As 3D versions of VGG and ResNet, 3D VGG9 and 3D ResNet16 are designed by removing the last two convolutional layers of 3D VGG11 and 3D ResNet18.
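The 3D wavelet down-sampling used by these networks keeps only the low-frequency band of a one-level 3D DWT (DWT_lll). For the Haar wavelet this again reduces to a scaled average pooling; a minimal sketch (ours, under that assumption):

```python
# 3D Haar DWT_lll down-sampling for voxel inputs.
import torch
import torch.nn.functional as F

class DWTlll(torch.nn.Module):
    """Halve a (B, C, D, H, W) tensor by keeping the Haar low-frequency band:
    X_lll = (sum over each 2x2x2 block) / 2^(3/2) = 2^1.5 * avg-pool."""
    def forward(self, x):
        return 2 ** 1.5 * F.avg_pool3d(x, kernel_size=2, stride=2)

voxels = torch.randn(8, 1, 32, 32, 32)   # e.g. voxelized ModelNet40 point clouds
print(DWTlll()(voxels).shape)            # torch.Size([8, 1, 16, 16, 16])
```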
The two 3D WaveCNets, 3D WVGG9 and 3D WResNet16, apply 3D DWT to down-sample the 3D feature maps. Using the ModelNet40 training set, we train 3D WVGG9 and 3D WResNet16 from scratch for 80 epochs, with a batch size of 256 and an initial learning rate of 0.1. The learning rate is dropped by a factor of 0.1 every 25 epochs. For comparison, we also train 3D VGG9 and 3D ResNet16 using the same hyper-parameter settings. We evaluate their performance on the test set of ModelNet40 and list the results in Table IX. In Table IX, "3D CNN (baseline)" corresponds to the results of the original 3D VGG9 and 3D ResNet16, and "3D WaveCNet" represents those of 3D WVGG9 and 3D WResNet16 with different wavelets, like Haar ("haar"), Cohen ("chp.p̃"), and Daubechies ("dbp"). On the test set, the accuracies of 3D VGG9 and 3D ResNet16 are 87.08% and 84.32%, respectively. The integration of the Cohen wavelets "ch5.5" and "ch2.2" improves the accuracy to 88.25% and 85.90%, respectively. Similarly, the Haar and other Cohen wavelets also improve the performance of both 3D VGG9 and 3D ResNet16. While all of the Daubechies wavelets improve the accuracy of 3D ResNet16, "db4", "db5", and "db6" decrease the performance of 3D VGG9. In summary, similar to the results of 2D image classification, the symmetric Haar and Cohen wavelets consistently improve the performance of 3D CNNs, while the improvements of the asymmetric Daubechies wavelets are not guaranteed.

In deep networks, the commonly used down-sampling operations result in aliasing effects on the feature maps, accumulating noise and breaking the object structures, which leads to the weak noise-robustness of CNNs for image classification. To suppress the aliasing effects, we design general discrete wavelet transform (DWT) and inverse DWT (IDWT) layers, and propose wavelet integrated convolutional networks (WaveCNets) by replacing the down-sampling operations in common CNNs with DWT. During the inference of WaveCNets, wavelet helps the networks keep the basic object structures and resist noise propagation. WaveCNets achieve higher image classification accuracy, better noise-robustness, and increased adversarial robustness with various commonly used CNN architectures on ImageNet. Our method can also improve the detection performance of faster R-CNN and RetinaNet on the COCO benchmark. In the future, we will exhaustively study the application of DWT/IDWT layers in image-to-image tasks. We will also explore the applications of 3D DWT/IDWT integrated deep networks for 3D medical image processing.
References
[1] Benchmarking neural network robustness to common corruptions and perturbations.
[2] Explaining and harnessing adversarial examples.
[3] Adversarial examples in the physical world.
[4] Ensemble adversarial training: Attacks and defenses.
[5] Towards evaluating the robustness of neural networks.
[6] Towards deep learning models resistant to adversarial attacks.
[7] Feature denoising for improving adversarial robustness.
[8] ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.
[9] Certain topics in telegraph transmission theory.
[10] A theory for multiresolution signal decomposition: The wavelet representation.
[11] Ten lectures on wavelets.
[12] Automatic differentiation in PyTorch.
[13] ImageNet: A large-scale hierarchical image database.
[14] Microsoft COCO: Common objects in context.
[15] Faster R-CNN: Towards real-time object detection with region proposal networks.
[16] Focal loss for dense object detection.
[17] Wavelet integrated CNNs for noise-robust image classification.
[18] Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet.
[19] Defense against adversarial attacks using high-level representation guided denoiser.
[20] Mixed pooling for convolutional neural networks.
[21] Stochastic pooling for regularization of deep convolutional neural networks.
[22] Deep residual learning for image recognition.
[23] Densely connected convolutional networks.
[24] Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[25] Very deep convolutional networks for large-scale image recognition.
[26] Why do deep convolutional networks generalize so poorly to small image transformations?
[27] Making convolutional networks shift-invariant again.
[28] WaveSNet: Wavelet integrated deep networks for image segmentation.
[29] SegNet: A deep convolutional encoder-decoder architecture for image segmentation.
[30] U-Net: Convolutional networks for biomedical image segmentation.
[31] Semantic image segmentation with deep convolutional nets and fully connected CRFs.
[32] Encoder-decoder with atrous separable convolution for semantic image segmentation.
[33] Invariant scattering convolution networks.
[34] A mathematical theory of deep convolutional neural networks for feature extraction.
[35] Wavelet networks.
[36] Neural network adaptive wavelets for signal representation and classification.
[37] Multipath learnable wavelet neural network for image classification.
[38] Wavelet-SRNet: A wavelet-based CNN for multi-scale face super resolution.
[39] Attribute enhanced face aging with wavelet-based generative adversarial networks.
[40] Wavelet-enhanced convolutional neural network: A new idea in a deep learning paradigm.
[41] WaveletFCNN: A deep time series classification model for wind turbine blade icing detection.
[42] Multi-level wavelet-CNN for image restoration.
[43] Wavelet pooling for convolutional neural networks.
[44] SAR image segmentation based on convolutional-wavelet neural network and Markov random field.
[45] Photorealistic style transfer via wavelet transforms.
[46] Gradient-based learning applied to document recognition.
[47] Learning multiple layers of features from tiny images.
[48] Reading digits in natural images with unsupervised feature learning.
[49] The Karolinska Directed Emotional Faces (KDEF), CD ROM from Department of Clinical Neuroscience, Psychology Section, Karolinska Institutet.
[50] Wavelets and multiwavelets.
[51] The dual-tree complex wavelet transform.
[52] Combined complex ridgelet shrinkage and total variation minimization.
[53] Fast discrete curvelet transforms.
[54] Image compression with geometrical wavelets.
[55] The contourlet transform: An efficient directional multiresolution image representation.
[56] De-noising by soft-thresholding.
[57] Ideal spatial adaptation by wavelet shrinkage.
[58] ImageNet classification with deep convolutional neural networks.
[59] High-frequency component helps explain the generalization of convolutional neural networks.
[60] MMDetection: Open MMLab detection toolbox and benchmark.