key: cord-0301504-2l2yaj6d
authors: Sharma, Prasen Kumar; Abraham, Arun; Rajendiran, Vikram Nelvoy
title: A Generalized Zero-Shot Quantization of Deep Convolutional Neural Networks via Learned Weights Statistics
date: 2021-12-06
journal: nan
DOI: 10.1109/tmm.2021.3134158
sha: f441e85f9a1a26fcbb1543f79d7b7511a780c034
doc_id: 301504
cord_uid: 2l2yaj6d

Abstract: Quantizing the floating-point weights and activations of deep convolutional neural networks to fixed-point representation yields reduced memory footprints and inference time. Recently, efforts have been afoot towards zero-shot quantization that does not require original unlabelled training samples of a given task. The best-published works heavily rely on the learned batch normalization (BN) parameters to infer the range of the activations for quantization. In particular, these methods are built upon either an empirical estimation framework or the data distillation approach for computing the range of the activations. However, the performance of such schemes severely degrades when presented with a network that does not accommodate BN layers. In this line of thought, we propose a generalized zero-shot quantization (GZSQ) framework that neither requires original data nor relies on BN layer statistics. We utilize the data distillation approach and leverage only the pre-trained weights of the model to estimate enriched data for range calibration of the activations. To the best of our knowledge, this is the first work that utilizes the distribution of the pre-trained weights to assist the process of zero-shot quantization. The proposed scheme significantly outperforms the existing zero-shot works, e.g., an improvement of ~33% in classification accuracy for MobileNetV2 and several other models that are w & w/o BN layers, for a variety of tasks. We also demonstrate the efficacy of the proposed work across multiple open-source quantization frameworks. Importantly, our work is the first attempt towards the post-training zero-shot quantization of futuristic unnormalized deep neural networks.

© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Prasen Kumar Sharma was an intern with Samsung R&D Institute India-Bangalore during this work. Part of this work was done when he was at Indian Institute of Technology Guwahati, India.

Deep CNNs are renowned for their tremendous capability to learn robust and complex features and produce remarkable results. Over the last decade, they have been widely adopted in various fields such as Computer Vision [1], Speech Processing [2], Natural Language Processing [3], etc., spanning from system to on-device platforms. However, several modern deep neural networks have an enormous size, which makes it challenging to deploy them on real-time, resource-constrained devices. For example, LeCun et al. [4] proposed the LeNet5 model, which has approximately 400K weights. About two decades later, AlexNet [5] and VGG-16 [6] were proposed, with more than 60M and 130M weights, respectively. Each of these models, when stored in 32-bit floating-point (FP32) format, requires excessive storage; e.g., VGG-16 [6] has a size of 130M × 4 Bytes ≈ 520 MBytes. Coates et al.
[7] even proposed a model that has 11B weights. Such models may also result in higher inference latency, processing capacity, energy consumption, and difficulty in training [8]. Furthermore, a majority of the inference-specific accelerators, such as the Google Edge TPU, support only quantized models. For example, MobileNetV2 [9], which takes 51 milliseconds (ms) per inference for ImageNet [10] classification on a desktop CPU, takes only 2.6 ms on a device with a Google Edge TPU, about 20× faster. Therefore, it becomes necessary to compress large-sized models in order to accommodate them on resource-constrained devices and accelerators for efficient real-time computing. Existing methods for deep model compression are primarily based on matrix factorization, pruning, quantization, and designing deep CNNs with fewer parameters [9], [11]-[19]. Also, quite a few schemes [20]-[22] have been proposed that train a deep CNN based on Bayesian methods in order to aid quantization and pruning at a later stage. However, in this work, we focus on quantization and propose a novel, principled zero-shot approach.

Background: Quantization allows us to store or process tensors at lower and ultra-low bit precisions, e.g., <4, 8, 16>-bit fixed-point instead of the FP32 format [23]. It yields more resource-efficient integer operations. Given a tensor x (in FP32), its quantized value Q(x) can be approximated as

Q(x) ≈ round(x/∆) + z,   (1)

where ∆ and z denote the scale and zero-point offset, respectively. The scale ∆ can be computed as

∆ = (max(x) − min(x)) / n,   (2)

where n = 2^b − 1 with b denoting the lower bit precision. The zero-point offset can then be computed as

z = round(−min(x)/∆).   (3)

Overall, quantization schemes can be classified into (a) per-tensor (QS-PT) and (b) per-channel (QS-PC), alias per-axis. Per-tensor quantization requires ∆ and z to be computed across the whole tensor, whereas in per-channel quantization they are calculated across each channel of the tensor. Further, schemes are categorized as symmetric if z is 0, and asymmetric (alias affine) otherwise [23]. We refer the readers to [23, 24] for a detailed review of various quantization schemes. In the case of post-training quantization, ∆ and z are computed for both weights and activations prior to inference. Their computation (see Eqns. 2 and 3) depends on the range of the input tensor x, i.e., min(x) and max(x). We refer to the process of estimating the range of the input tensor as "range calibration" in the rest of the paper. Given a pre-trained model in FP32, range calibration of the weights can be achieved without any difficulty. However, for activations 1, one requires access to the original unlabelled training samples. Once the dataset, either in full or limited, is available, the activations can be generated, followed by computing their ∆ and z. Based on the availability of the original unlabelled training samples, either in full, limited, or none, the recent works can be categorized into (a) quantization-aware training [23]-[30], (b) limited-data approaches [15], [31]-[33], and (c) data-free approaches [34]-[38], respectively.

Challenges and Motivation: Recently, much attention has been given to data-free approaches [34]-[36], [38], since the availability of the original data is infeasible in some cases, e.g., medical imaging, where the user's privacy is prioritized above all. These schemes may also be analogously referred to as zero-shot approaches 2.
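As a concrete illustration of Eqns. 1-3 and the range-calibration step, the following minimal Python sketch (our own, following the standard affine formulation of [23]; the function names are ours, not the paper's) derives ∆ and z from a tensor's min/max and applies the quantize/de-quantize round trip. For weights the range is read directly from the checkpoint; for activations, the whole point of the later sections is obtaining data from which this range can be estimated.

```python
import torch

def calibrate_affine(x: torch.Tensor, bits: int = 8):
    """Per-tensor asymmetric (affine) quantization parameters from a tensor's range."""
    n = 2 ** bits - 1                                  # number of quantization steps (Eqn. 2)
    x_min, x_max = x.min(), x.max()
    delta = torch.clamp((x_max - x_min) / n, min=1e-8) # scale; clamp guards a constant tensor
    z = torch.round(-x_min / delta)                    # zero-point offset (Eqn. 3)
    return delta, z

def quantize(x, delta, z, bits: int = 8):
    return torch.clamp(torch.round(x / delta) + z, 0, 2 ** bits - 1)   # Eqn. 1

def dequantize(q, delta, z):
    return (q - z) * delta

# Weights can be calibrated directly from the pre-trained checkpoint;
# activations require calibration data (original or distilled).
w = torch.randn(64, 3, 3, 3)
delta_w, z_w = calibrate_affine(w)
w_int8 = quantize(w, delta_w, z_w)
w_rec = dequantize(w_int8, delta_w, z_w)
```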
For estimating the ∆ and z of the activations w/o utilizing the original data, these zero-shot schemes mainly rely on the learned BN [39] layer's parameters. The long-term statistics of BN [39] layer may consist of some information about the original training samples [36] , which may be useful for range calibration of the activations. BN [39] layer in a deep CNN is an indispensable component that is believed to improve the generalization, accelerate convergence, aid higher learning rate, and stabilize the training process [40] . However, in recent studies [40] [41] [42] , it has been shown that none of these benefits is unique to the BN [39] layer. Also, BN [39] parameters are folded into the weights and biases of convolution and fully-connected layers for an efficient inference, which is a widely adopted practice [23] . The BN [39] folding introduces high variance into the weights of the pre-trained model, that makes quantization difficult [23] . Hence, the existing zero-shot approaches [34, 35] fail when the BN [39] layer is either folded or absent in the network. Now, an obvious question that arises, is, "what else can be utilized from the pre-trained model that may represent original training samples, given the zero-shot condition holds." Perhaps, the pre-trained weights. For any given task, weights are learned to extract the imperative features from the input training samples in order to achieve the desired objective with minimal penalty [43] . 1 interchangeably referred to as features in the paper 2 since they do not require original data samples. Further, when the BN [39] layer is folded, the introduced variance in the weights may have some auxiliary information about the distribution of the original data. Both the pre-trained weights and folded-variance have not been exploited in any of the recent works [15, [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] . Therefore, in this work, we have proposed a generalized zero-shot approach, namely GZSQ, that neither requires original training samples nor depends on BN [39] layer parameters for range calibration of the activations. The proposed GZSQ acts as an API call and is built upon the data distillation framework that only considers the pre-trained weights of the model. In particular, GZSQ approximates the substitutes for BN statistics for each layer by only utilizing the pre-trained weights of the network. Hence, it eliminates the requirement of the BN layers in the model if one wants to incorporate data distillation for the range calibration of the activations under zero-shot condition. We have evaluated the proposed method across multiple quantization frameworks with both QS-PC and QS-PT schemes. Further, in our experiments, we have considered the models that are w and w/o BN [39] layers, including the BN [39] folding in the former case, for the tasks of classification [10, 44] and object detection [45] . The rest of the paper is organized as follows: Section II briefly reviews the related existing works. Section III presents the proposed GZSQ framework. The experimental details, results, and ablation study are presented in Sections IV, V, and VI, respectively. Lastly, Section VII summarizes the work. II. RELATED DEVELOPMENTS Besides quantization, several works have been proposed to address the restricted memory footprints and inference latency of the modern deep CNNs. 
Going orthogonally to quantization, these methods are based on − efficient neural architecture design, knowledge distillation, pruning, hardware co-design, and matrix factorization. Architecture design-based methods [9, [16] [17] [18] [19] proposed the deep CNNs that have fewer number of parameters using Neural Architecture Search (NAS) [46] , for e.g., MobileNetV2 [9] . Knowledge distillation based methods [47] [48] [49] extracted the latent information from the original model using studentteacher paradigm and proposed the sparse version of the model. To reduce the number of trainable parameters, pruning [11] [12] [13] [14] sets the insignificant weights to zero. This includes methods based on Hessian matrix of the loss functions [50, 51] , and iterative pruning [52] . Kwon et al. [53] proposed to co-design the deep CNNs and neural net accelerators [54] to achieve significant performance gain. Matrix factorization based schemes [55, 56] decompose the weight matrix into the low-rank (L) and sparse (S) matrices, where both L and S require less storage. Later, Han et al. [57] proposed a method based on the ensemble of pruning, quantization and Huffman coding [58] . Recently, He et al. [59] proposed the reinforcement learning-based method for model compression. However, here, we focus on existing quantization works based on the requirement and availability of the original training samples. Require full training set: Quantization reduces the memory footprints of the deep CNNs by reducing the bit precisions of weights and activations. However, it may also introduce noise and lead to the significant performance degradation, especially when quantizing FP32 to ultra-low bit precisions. To address this, several methods [23-30, 60, 61] follow the trivial approach for quantization by performing the Quantizationaware Training (QAT) [24] . These approaches may result in the lowest degradation in final accuracy. However, it requires a full training set to train the model, which may not be possible in some cases, e.g., Medicine or Satellite Image-based systems, due to privacy. Besides the availability of original data, QATbased schemes are time and resource consuming. Require limited data: To overcome this drawback, a few methods [15, [31] [32] [33] , based on weights factorization and channel splitting, have been proposed, requiring limited training samples. In particular, Choukroun et al. [32] proposed a method, namely OMSE, that minimizes the 2 distance between quantized and original tensor. Also, Banner et al. [31] proposed a method called ACIQ that analytically computes the clipping range and per-channel bit allocations for deep CNNs. However, per-channel quantization for activations is infeasible in practice. Zhao et al. [15] proposed a method called OCS to solve the problem of outlier channel. Zero-shot approaches: To overcome the drawback of original data dependency, either in full or limited, Nagel et al. [34] proposed a data-free quantization method, namely DFQ. It is based on the weights equalization and bias correction. However, to estimate the ∆ and z of the activations, DFQ entirely depends on the BN's [39] shift (β) and scale (γ) parameters. The j th layer activation ranges are first calculated as β j ± n.γ j , where n has been empirically set to 6. Later, ∆ j and z j have been computed using Eqns. 2 and 3, respectively. Subsequently, following [36, 37] , Cai et al. [35] proposed a zero-shot framework, called ZEROQ, to generate the data that matches the statistics of the training set. 
For this, ZEROQ [35] has utilized the BN [39] layer's statistics, as where µ r i /σ r i are the mean/std-dev of the distilled data's (y r ) distribution at layer i, and µ i /σ i are the corresponding mean/std-dev parameters stored in the i th BN [39] layer. The distilled data y r then has been used to infer the range of the activations to estimate the ∆ and z. Later, Xu et al. [38] proposed a generative approach [62] to produce a fake data for range calibration by exploiting the classification boundary knowledge and BN [39] statistics. The proposed generator architecture is inspired from AC-GAN [63] , which also results in additional memory overhead. Recently, a shift in trend has been observed towards the Mixed-Precision Quantization (MPQ) [64] [65] [66] , where a different bit-precision is set adaptive to each layer in the network. However, it further raises another level of difficulty in finding the optimal bit-width for each layer, given a predetermined support by the hardware accelerators to MPQ. Along with DFQ [34] , these best-published works fail when the model does not comprise of BN [39] layers. In that case, it becomes difficult to either utilize the distillation approach or BN [39] parameters (β, γ) and infer activation ranges without original training samples. Moreover, the trend [40] [41] [42] observed in the past two years shows that the community has started to look for alternatives to BN [39] . Therefore, it should be mentioned that in the coming years, a majority of the cutting-edge deep CNNs may not consist of BN [39] layers at all. This work is the first attempt towards the quantization of such futuristic models. Our Contributions: We present a novel solution that allows us to utilize the distillation approach, even if the BN [39] layers are folded or absent in a model. For this, we first estimate the substitutes to BN statistics (µ, σ) for every layer by utilizing the pre-trained weights of the model only. We divide this substitutes estimation phase into (a) statistics estimation, followed by (b) empirical statistics adjustment sub-phases. We then perform the data distillation, similar to in Eqn. 4, by utilizing the estimated substitutes with a novel objective function. Essentially, our key contributions are fourfold: • We propose a generalized zero-shot quantization (GZSQ) framework that neither requires original data nor relies on BN [39] layers statistics of the models. In other words, we leverage the pre-trained weights of the model "only" and estimate an enriched data which can be used for range calibration of the activations. • We also propose to use the absolute Z-score based loss function over 1 and 2 norms or Kullback-Leibler (KL) divergence for an efficient data distillation. • To test the generalization proficiency of the proposed GZSQ, we have evaluated our distillation approach across multiple open-source quantization frameworks. We have also benchmarked the models that are w & w/o BN [39] layers on the tasks of classification and object detection. • To appraise the efficacy of the proposed method, we have shown the results on a different problem domain, e.g., Medical Imaging. Moreover, we have presented an ablation study to demonstrate the effect of various cost functions and BN [39] parameters folding before & after data distillation. 
TABLE I: Notations. W_n — weights of the n-th conv2d layer in F; f_n — n-th layer activations in F; µ_{f_n}, σ_{f_n} — mean and std-dev of f_n; µ_{a_{1:N}}, σ_{a_{1:N}} — substitutes of the BN [39] statistics; A — set consisting of the BN [39] substitutes; N(µ, σ) — Gaussian with mean µ and std-dev σ.

In this section, following the notations in Table I, we describe the regime of operations of the proposed GZSQ scheme. We first describe the process of estimating A for F w/o using original data samples. Following [67], if W_{1:N} are initialized with a Gaussian distribution N(µ, σ) [68] 3, then, owing to the Central Limit Theorem, W_{1:N} will also converge to N(µ, σ) at the end of training. We also assume the input J for distilled data estimation to follow N(0, 1). Now, let us first consider the base case where, in F, BN [39] layers exist up to layer n, where n ∈ (1, N). The activations of layer n can then be written as

f_n = g_n(BN_n(W_n * f_{n−1})),   (5)

where g_n denotes the n-th layer activation function. Knowing that the pre-activations of layer n follow N(µ, σ) due to BN_n [34], and if g_n is some form of the class of clipped linear activations, such as ReLU or ReLU6, then f_n will also follow a clipped (i, j) Normal distribution, where i < j, and j can be ∞. Now, let us assume that from the (n+1)-st layer onwards, BN [39] layers do not exist in F. The activations of the (n+1)-st layer can be written as

f_{n+1} = g_{n+1}(W_{n+1} * f_n + b_{n+1}).   (6)

We know that W_{n+1} and f_n follow N(µ, σ). Following the property that the convolution of two Gaussian distributions N(p_1, q_1) and N(p_2, q_2) is also a Gaussian N(p_1 + p_2, sqrt(q_1^2 + q_2^2)) 4, f_{n+1} will also follow a clipped N(µ, σ) when b_{1:N} = 0 5. However, we will consider the extra biases introduced due to BN [39] folding in Section III-D. The substitutes µ_{a_{n+1}} and σ_{a_{n+1}} can then be estimated as

µ_{a_{n+1}} = µ_{W_{n+1}} + µ_{a_n},   σ_{a_{n+1}} = sqrt(σ^2_{W_{n+1}} + σ^2_{a_n}).   (7)

This can go on for the subsequent layers (n+2 : N) in F. Backtracking our initial assumption of BN [39] layers up to the n-th layer in F down to the 1st layer, µ_{a_2}, σ_{a_2} can be estimated following the above procedure. Now, suppose even the 1st layer in F is not followed by a BN [39] layer. Then its activations can be written as f_1 = g_1(W_1 * J). Holding to our initial assumption that J follows N(0, 1), f_1 will also follow N(µ, σ), and µ_{a_1}, σ_{a_1} can be estimated using Eqn. 7.

The channel-wise addition in Eqn. 7 holds iff the numbers of channels (C) of the addends are the same. However, with the recent advancements in deep learning, there exists a variety of networks where this condition does not hold [9], [70]-[72]. We classify such pairs of layers into the sets of (a) Expansion, when C(W_{n+1}) > C(a_n), and (b) Contraction, when C(W_{n+1}) < C(a_n). We empirically define the following set of values, {min, mean ± min, mean, max ± mean, max}, to be used for accommodating the required number of channels in the addend with lower C. Intuitively, a pair of filters of a deep CNN layer may have some correlation (strong or weak), resulting in correlated feature maps [73]. Therefore, the best approximation of the required statistics from the current ones, w/o utilizing any explicit information, may be made by leveraging a value from the set mentioned above; a minimal sketch of this propagation and channel-matching step is given below.

3 In practice, weights of many deep CNNs are initialized with the Gaussian version of the Kaiming initialization [69]. 4 The readers are advised to refer to this link for more details. 5 The pre-trained models we have studied in this paper do not use bias.
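The following Python sketch is our illustration, under the assumptions stated above, of the statistics-estimation (SE) pass; it is not the authors' implementation. Per-output-channel weight statistics are propagated layer by layer via Eqn. 7, and the hypothetical match_channels helper is a deliberately simplified stand-in for the empirical statistics adjustment (ESA): here it merely repeats or averages values, whereas the paper selects the best value from {min, mean ± min, mean, max ± mean, max} per architecture, as discussed next.

```python
import torch
import torch.nn as nn

def match_channels(stat: torch.Tensor, c_target: int) -> torch.Tensor:
    """Toy stand-in for the empirical statistics adjustment (ESA):
    expand by repetition, contract by mean-pooling."""
    c = stat.numel()
    if c == c_target:
        return stat
    if c < c_target:                                   # expansion
        return stat.repeat((c_target + c - 1) // c)[:c_target]
    return stat.mean().repeat(c_target)                # contraction

@torch.no_grad()
def estimate_bn_substitutes(model: nn.Module):
    """Propagate per-channel (mu, sigma) substitutes layer by layer (Eqn. 7),
    assuming the input follows N(0, 1) and the conv layers carry no bias."""
    mu_a, sigma_a = None, None
    substitutes = []
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            w = m.weight                               # shape (C_out, C_in, k, k)
            mu_w = w.mean(dim=(1, 2, 3))               # per-output-channel statistics
            sigma_w = w.std(dim=(1, 2, 3))
            if mu_a is None:                           # first layer: input ~ N(0, 1)
                mu_a, sigma_a = mu_w, (sigma_w ** 2 + 1.0).sqrt()
            else:
                mu_prev = match_channels(mu_a, mu_w.numel())
                sig_prev = match_channels(sigma_a, sigma_w.numel())
                mu_a = mu_w + mu_prev                  # Eqn. 7
                sigma_a = (sigma_w ** 2 + sig_prev ** 2).sqrt()
            substitutes.append((mu_a.clone(), sigma_a.clone()))
    return substitutes
```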
For example, in the case of expansion, repeating the current statistics across the channels works for most of the networks, except for ShuffleNet [71]- and InceptionV3 [72]-based models, where mean(µ_{a_n}; σ_{a_n}) − min(µ_{a_n}; σ_{a_n}) works. Similarly, for contraction, mean(µ_{a_n}; σ_{a_n}) − min(µ_{a_n}; σ_{a_n}) works in most cases, except for SqueezeNet [74], where the mean values are set with mean(µ_{a_n}), and for MobileNetV2 [9], where the std-dev values are set using mean(σ_{a_n}) + min(σ_{a_n}).

With the set A as substitutes for the BN [39] statistics, we follow a distillation approach similar to ZEROQ [35]. For this, we input a random instance of N(0, 1), denoted by y, to F and generate the set of activations for each layer. We then compute the statistics of each activation as (µ_{f_n}, σ_{f_n}) ∀n ∈ 1 : N, and minimize the distributional difference with the corresponding values in A. However, instead of utilizing the ℓ1 or ℓ2 norm, or KL divergence, we consider the absolute difference of the Z-score test as a loss function L_Z(·, ·) for an efficient knowledge distillation, as

L_Z(u, v) = | µ_u − µ_v | / ( sqrt(σ_u^2 + σ_v^2) + s ),   (8)

where u, v can be any two tensors with different distributions and s = 1e-6 6. The ℓ1 and ℓ2 norms are known to favour sparse and non-sparse data, respectively. It has been widely accepted that in a deep CNN, the first few activations may have high non-sparsity, whereas later layers may have high sparsity [75]. The shortcomings of utilizing KL divergence in GZSQ are discussed in Section VI-B. Therefore, instead of a data- or sparsity-based optimization, we propose a distributional minimization (L_Z) and define the final loss as Σ_n L_Z(f_n, a_n) + L_Z(y, N(0, 1)). We finetune y using the following objective function

ŷ = arg min_y Σ_{n=1}^{N} L_Z(f_n, a_n) + L_Z(y, N(0, 1)),   (9)

and then use ŷ (the distilled data obtained w/o utilizing the BN [39] statistics) to infer the ranges of the activations in F. Algorithm 1 summarizes the procedure: for each layer n, the previous-layer substitutes are first adjusted for channel mismatch, (µ_{a_{n−1}}, σ_{a_{n−1}}) ← ESA(µ_{a_{n−1}}, σ_{a_{n−1}}, W_n), and then propagated, (µ_{a_n}, σ_{a_n}) ← SE(µ_{a_{n−1}}, σ_{a_{n−1}}, W_n); y is initialized with a random instance, y ← N(0, 1), and optimized with Eqn. 9.

6 Similar to ZEROQ [35], s has been set to avoid the division by 0 in Eqn. 8 if σ_u and σ_v are 0.

In practice, for a resource-efficient inference, the BN [39] parameters are folded into the weights and biases of the convolutional and fully-connected layers of a deep CNN. For example, the folded weights and biases of the n-th convolutional layer in F can be written as

W_{n,fold} = (γ_n / σ_{B_n}) W_n,   b_{n,fold} = β_n − (γ_n µ_{B_n}) / σ_{B_n},

where γ_n, β_n, µ_{B_n}, σ_{B_n} represent the scale, shift, long-term mean, and long-term std-dev of the n-th BN [39] layer, respectively. It has been observed during our experiments that the introduced bias, when BN [39] layers are folded before distillation, may lead to a slight (severe in some cases, e.g., ResNet18 and ResNet50; see Section VI-C and Table XIII) degradation in the accuracy of the quantized models. Therefore, the introduced bias term has to be accommodated into the proposed distillation approach (see Sections III-A, III-B) when folding is performed before distillation. With σ_{b_{n,fold}} = 0, the refined mean in Eqn. 7 can be written as µ_{a_n} = µ_{W_{n,fold}} + µ_{a_{n−1}} + b_{n,fold}.

By utilizing the distilled data ŷ to infer the ranges of the activations, the scale ∆ and zero-point offset z can be computed for each layer activation. For W, one can compute the ranges and ∆, z as usual. In this paper, we have presented the results for both QS-PC and QS-PT schemes.
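With the substitutes A in hand, the distillation of Eqns. 8 and 9 can be sketched as follows. This is our illustrative Python sketch, not the reference implementation: the form of l_z reflects our reading of Eqn. 8 (a two-sample Z-statistic with the s term guarding against zero std-devs), the optimizer settings (Adam, learning rate 1e-4, 500 iterations) follow Section IV, and forward hooks on Conv2d layers stand in for whichever layers the method actually monitors. The returned tensor is used only for range calibration of the activations, never for retraining.

```python
import torch
import torch.nn as nn

def l_z(mu_u, sigma_u, mu_v, sigma_v, s=1e-6):
    """Z-score-style distributional distance (our reading of Eqn. 8);
    s avoids division by zero when both std-devs are zero."""
    return ((mu_u - mu_v).abs() / ((sigma_u ** 2 + sigma_v ** 2).sqrt() + s)).mean()

def distill_data(model, substitutes, shape=(1, 3, 224, 224), iters=500, lr=1e-4):
    """Fine-tune y ~ N(0, 1) so that each layer's activation statistics match the
    BN substitutes (Eqn. 9). Activations are collected with forward hooks."""
    model.eval()
    y = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([y], lr=lr)
    acts = []
    hooks = [m.register_forward_hook(lambda _m, _i, o: acts.append(o))
             for m in model.modules() if isinstance(m, nn.Conv2d)]
    for _ in range(iters):
        acts.clear()
        opt.zero_grad()
        model(y)
        # keep the distilled input itself close to N(0, 1), as in the final loss
        loss = l_z(y.mean(), y.std(), torch.zeros(()), torch.ones(()))
        for f, (mu_a, sigma_a) in zip(acts, substitutes):
            mu_f = f.mean(dim=(0, 2, 3))
            sigma_f = f.std(dim=(0, 2, 3))
            loss = loss + l_z(mu_f, sigma_f, mu_a, sigma_a)
        loss.backward()
        opt.step()
    for h in hooks:
        h.remove()
    return y.detach()   # distilled data, used only to infer activation ranges
```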
Also, to test the generalization capacity of the proposed GZSQ scheme, we have incorporated both (a) ZEROQ's [35] quantization simulation 7 , and (b) Pytorch's Post-Training Static Quantization 8 , in our experimentation. In the latter case, we have adopted the QS-PT (affine) scheme with histogram observer, considering the hardware practicality, for the activations. Whereas, for the weights, we have shown the results for both QS-PC (symmetric) with per-channel min-max observer & QS-PT (affine) with min-max observer, schemes. Observers record the range of the tensor and use it to compute the quantization parameters (∆, z). The details are beyond the scope of this work, hence, we refer the readers to 9 . An overview of the proposed GZSQ framework has been given in the Algorithm 1. This section first describes the details of the datasets, covering the vast diversity across the tasks, that have been used to evaluate the quantized models. It then shifts the attention towards the details of experimental settings and the stateof-the-art baselines, which have been appraised against the proposed GZSQ framework. Datasets: We start our results (see: Section V) by showing the classification accuracy on CIFAR-10 [44] and ImageNet [10] datasets. CIFAR-10 consists of ∼ 60K images across 10 different classes, of which pre-determined 10K samples have been used for testing the accuracy of the quantized model. ImageNet is one of the largest available benchmark datasets for image classification that consists of ∼ 1.2M natural images for training and 50K for validation, across 1K classes. We have also presented the results for the task of object detection using Microsoft COCO [45] dataset. It consists of more than 200K images across 80 different object categories. Lastly, to test the efficacy of the proposed GZSQ, we have evaluated the quantized models in the domain of medical imaging for the task of pneumonia classification in chest Xrays. The incorporated binary classification aims to categorize the given X-ray samples into normal and pneumonic classes. The pneumonic samples have been collected by utilizing both bacterial and viral infected cases. For this, we exploited the pre-split publicly available dataset 10 [76] consisting of ∼ 6K chest X-rays. The complete dataset has been divided into train, validation, and test subsets. The training subset consists of 5.2K X-rays with 1341 and 3875 samples corresponding to normal and pneumonic cases. The validation subset comprises 16 equally distributed radiographs. Whereas the test set contains 624 images with 234 and 390 samples corresponding to normal and pneumonic cases. Samples of chest X-rays of normal and pneumonic persons are shown in Figure 1 . The normal chest X-ray (see Figure 1 ; top) exhibits clear lungs without any abnormal opacified regions. Whereas the pneumonic X-ray (see Figure 1 ; bottom) typically depicts a focal lobar consolidation or interstitial pattern in one or both lungs. It should be mentioned that for most of the tasks, we first adopt the publicly available pre-trained models, perform GZSQ, and test the quantized models on the validation or test set. For the tasks where it is difficult to get the pretrained models, we first train the models using the suggested configuration in the respective papers. Later, we perform GZSQ, and test the quantized models on the validation or test set. The proposed GZSQ scheme does not utilize either of the train, test, or validation samples in any form for any purpose at any stage to quantize a model. 
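As a concrete example of the Pytorch post-training static quantization path mentioned above, a minimal eager-mode sketch is given below. It is our illustration, not the paper's code: the model is assumed to already contain the QuantStub/DeQuantStub placement that eager-mode PTQ requires, and the default 'fbgemm' qconfig (histogram observer for activations, per-channel min-max observer for weights) is only close to, not necessarily identical with, the configuration described above. The distilled batches serve as the sole calibration data.

```python
import torch
from torch.quantization import get_default_qconfig, prepare, convert

def post_training_quantize(model_fp32, distilled_batches):
    """Eager-mode post-training static quantization, calibrated with distilled data only.
    Assumes model_fp32 already defines QuantStub/DeQuantStub as eager-mode PTQ requires."""
    model = model_fp32.eval()
    model.qconfig = get_default_qconfig('fbgemm')
    prepare(model, inplace=True)              # insert observers
    with torch.no_grad():
        for y in distilled_batches:           # range calibration with distilled data
            model(y)
    convert(model, inplace=True)              # replace modules with INT8 counterparts
    return model
```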
We have performed our experiments using Pytorch [77] framework with Adam [78] optimizer and a learning-rate of 10 −4 for 500 iterations on Nvidia Tesla P100 GPU. In all our experiments, mean results have been presented with the corresponding std-devs across 10 runs, each with a different random seed. Competing methods: We have evaluated the proposed scheme on various models that are with BN [39] layers, such as ResNets [70] , MobileNets [9] , ShuffleNets [71] , InceptionV3 [72] , SqueezeNext [16] for the task of image classification. We have also considered the models that are w/o BN [39] layers, such as Fixup-ResNets [40] and ISONets [41] , for image classification. RetinaNet [79] has been preferred for the task of object detection. For the case of medical imaging, we have considered the problem of Pneumonia classification in chest X-rays [76] . For this, we have adopted ResNet18 [70] and ResNext [80] '20) , and the most recent zero-shot work, namely, ZEROQ [35] (CVPR '20) . For a fair comparison, all methods were evaluated using respective default parameters. A. Classification Table II shows the CIFAR-10 classification accuracy on Fixup-ResNets [40] models that are w/o BN [39] layers, when quantized with different configuration. W8A8 denotes 8-bit integer precision for both weights and activations. It can be observed that for most of the Fixup-ResNet variants, the proposed GZSQ has outperformed the existing ZEROQ [35] approach across different quantization configurations, especially for W2A8. It should also be noted that for W2A8 11 , even though the proposed GZSQ utilizes the pre-trained weights for estimating the ∆ and z of the activations, there is not much degradation in the accuracy, compared to the ZEROQ [35] . Similarly, Table III shows the quantization accuracy of ISONets [41] models that are also w/o BN [39] layers on ImageNet [10] classification. It can be observed that for most of the models, GZSQ has achieved near FP32 accuracy, outperforming the ZEROQ [35] by a significant margin. From Tables II, III, it may be noted that ZEROQ [35] behaves similar to unit Gaussian in the absence of BN [39] layers due to its underlying backbone architecture, and results lower accuracy with quantized models. Whereas, the results obtained by using proposed GZSQ are not only superior to unit Gaussian and ZEROQ [35] , but also close to FP32 accuracy. Table IV shows the results obtained on ImageNet [10] classification for various models 12 that consists of BN [39] layers. It can be observed that the proposed GZSQ has shown a remarkable improvement of over OCS [15] and ZEROQ [35] on majority of the models, especially with +1.25% for SqueezeNextV5 [16] . Tables. V, VI present the ImageNet [10] classification accuracy for MobileNetV2 [9] and ResNet18 The results in Tables II:IV , XI are computed using ZEROQ's [35] quantization simulation, whereas the results shown in Tables VII:X, XIII are computed using Pytorch's post-training static quantization framework. It can be observed from the Tables VII, VIII that the proposed GZSQ has outperformed the ZEROQ [35] on majority of the models 13 . Even though we have assumed the input for the data distillation (see Sections III-A, III-B, III-C) to be a random N (0, 1) instance, the GZSQ has shown a significant performance gain over Gaussian (0,1) input. It may be due to the incorporation of pre-trained weights in the proposed distillation process that has not been considered in ZEROQ [35] . 
It may also be concluded that the GZSQ's distilled data has been more beneficial for inferring the activation ranges over N (0, 1) and ZEROQ's [35] . We have shown the results for both the cases when BN [39] folding is performed before and after data distillation. We have also shown the quantization results when a random subset of training set is used to infer the activation ranges. Among high-level vision tasks, the complexity increases when attention shifts from classification to object detection in an image. Object detection aims to predict the set of bounding boxes around the objects with pre-defined category labels in an input image. To demonstrate the robustness of our approach, we have also evaluated the proposed scheme on the task of object detection using the Microsoft COCO [45] dataset. For this, we have adopted RetinaNet [79] single-stage detector which has a state-of-the-art results i.e., 36.4 mean average precision (mAP) (standard 0.5:0.05:0.95 metric). The adopted RetinaNet has a ResNet50 as backbone architecture. For quantization, in this paper, we have processed the pre-trained RetinaNet with INT8 configuration by using Unlike natural images, X-rays and Magnetic Resonance Imaging (MRI) [84] scans do not comprise of many indistinguishable features, in general, among each other. It is challenging to differentiate between the X-rays or MRI scans of a normal and diseased person. Several methods based on deep learning frameworks have been proposed for many diseases [85, 86] . In this work, considering the COVID-19 [87] era, we have adopted one of the significant problems, namely, Pneumonia classification [88] in chest X-rays. For this, we have considered the ResNet18 [70] and ResNext101 [80] as baseline models and trained for the binary classification problem using the benchmark dataset [76] with a pre-determined train/val/test split. The trained model then has been quantized, and it can be observed from Table. IX that the proposed GZSQ has outperformed the ZEROQ [35] and N (0, 1) by a noticeable margin. Additional results are given in the provided supplementary material, wherein, Figure 5 shows the comparison among the class activation maps (CAMs) generated by utilizing the quantized models using ZEROQ [35] , and GZSQ. It can be observed that the CAMs generated by using the GZSQ's quantized model are more similar to those of original FP32 model, than those of ZEROQ's. The readers are encouraged to refer Section VIII of the supplementary material for more details. Table. X shows the reduction in size (MB) of the models due to quantization. In this section, first, we deeply investigate the distilled data for various models. We then present a brief ablation study of cost functions and the effect of extra biases introduced due to BN [39] folding before distillation. A. Distilled Data Figure 3 shows the histograms of distilled data obtained for the models that are w and w/o BN [39] layers. In Figure 3 (Row 1), the histograms of obtained distilled data have been plotted using unit Gaussian, ZEROQ [35] and GZSQ for the ISONet [41] model (w/o BN layers). It can be observed that the ZEROQ [35] , by its underlying backbone structure, results the distilled data similar to Gaussian but with a higher variance in the absence of BN [39] layers. This high-variance may be due to the initialization of distilled data using uniform distribution in ZEROQ [35] approach. 
The initialized uniform noise, in the absence of BN [39] layers, learns to shift towards a unit Gaussian within certain iterations 14 . After finetuning, the left-out traces of uniform noise in the distilled data of the ZEROQ [35] may have caused the degradation in the accuracy for certain models, when compared to unit Gaussian and GZSQ, as shown in Tables. III, II. Whereas, in Row 2, similar plots have been given for the MobileNetV2 [9] model (w BN [39] layers). To show the effect of BN [39] folding, which has been ignored in the ZEROQ [35] , we have presented the plots for both the scenarios, when the BN [39] folding is performed before and after the data distillation. Unit Gaussian generated with different seeds is independent of the BN [39] folding. Therefore not much degradation has been observed in the case of N (0, 1), as shown in Tables. VII, VIII. However, it can be observed from Figure 3 (Row 2) that distilled data obtained when BN [39] folding is performed before distillation is totally different when the same is performed after distillation. Especially, in the case of ZEROQ [35] , which results in a significant difference between final accuracies (see Tables. VII, VIII) . This is due to the fact that extra bias introduced because of BN [39] folding into the weights has not been considered in the ZEROQ's distillation process. Consequently, it may have resulted in a poor correlation between the folded weights and the distilled data generated by looking at the BN [39] statistics prior to folding. Whereas, the same has been considered in the proposed GZSQ framework, N (0, 1) , ZEROQ [35] , and the proposed GZSQ schemes. Row 1 shows the data for the ISONet [41] model. Row 2 shows the data for MobileNetV2 [9] when BN-folding is performed before and after data distillation. as described in Section III-D. Thus the obtained distilled data is not only distinguishable from unit Gaussian, but also does not exhibit any severe deviation because of the BN [39] folding. Figure 2 shows the histogram plots of the distilled data obtained for the task of pneumonia classification using ResNext101 [80] . To show the effect of proposed L Z cost function (see Eqn. 8), we have presented a detailed study by performing the following baselines: • GZSQ-µ 1 , where 1 norm between the means has been used as a cost function instead of proposed L Z , • GZSQ-σ 1 , where 1 norm between the std-devs has been used as a cost function instead of proposed L Z , • GZSQ-1 , where 1 norm has been used as a cost function instead of proposed L Z , • GZSQ-µ 2 , where 2 norm between the means has been used as a cost function instead of proposed L Z , • GZSQ-σ 2 , where 2 norm between the std-devs has been used as a cost function instead of proposed L Z , • GZSQ-2 , where 2 norm (similar to ZEROQ [35] ) has been used as a cost function instead of proposed L Z , • GZSQ-L * KL , where KL divergence has been used to measure the difference between two distributions instead of proposed L Z . Since the estimated mean of BN [39] substitutes may have negative values (see Section III-A, ref: Eqn. 7), for which the KL divergence may be undefined. Hence, we have only utilized the std-devs to measure the difference. Therefore the adopted loss for GZSQ-L * KL is where KL(.) refers to Kullback-Leibler divergence, instead of Eqn. 8 [89] . It has been observed that for most of the cases, the parent baselines GZSQ-1 and GZSQ-2 have shown considerable superiority over aforementioned leaf baselines GZSQ-µ 1 , GZSQσ 1 , GZSQ-µ 2 , GZSQ-σ 2 . 
Further, it can be observed from Table. XI that the proposed L Z cost function has been proven to be more beneficial than the 1 and 2 norms. It has also been observed that the 1 and 2 norms are useful in restoring the structural information (see Figure 4 ). Whereas the proposed L Z loss restores the distributional properties of the data. To show the difference qualitatively, we present the visuals of distilled data when BN [39] folding is performed before and after the distillation phase for both ZEROQ [35] and GZSQ on ResNet18 [70] model, in Figure 4 . It can be observed that the ZEROQ [35] retains some structural information when BN [39] folding is performed after distillation phase (see Figure 4 , row:top, col:left). This may be due to the utilized BN [39] layer statistics. However, in this case, ZEROQ [35] in general results the poor accuracy (see Section VI-A). Whereas, a significant loss in structural information can be observed when folding is performed prior to distillation (see Figure 4 , row:bottom, col:left). Similar to ZEROQ [35] , we evaluated our baseline GZSQ-2 and observed that the proposed scheme is also able to retain the structural information when folding is performed after distillation phase (see Figure 4 , row:top, col:middle). However, in contradiction to ZEROQ [35] , the proposed baseline also retains some structural information in the distilled data when folding is performed before distillation phase (see Figure 4 , row:bottom, col:middle). It may be due to the incorporation of extra biases in the distillation phase, introduced due to BN [39] folding (see Section III-D), which may have carried, inherently, some information about the distribution of original training samples. Therefore it can be concluded that some structural information about the data can be retained even without utilizing the BN [39] statistics in the distillation phase. However, such information may not be much beneficial for some models, e.g., SqueezeNextV5 [16] , as shown in Table. XI or ZEROQ's [35] results for ResNet18 [70] as shown in Table. VII. Coming to the distributional minimization, we performed a baseline, namely, GZSQ-L * KL , that utilizes the KL divergence instead of the proposed L Z cost function. However, it can observed from the Table. XII that the proposed L Z loss yields better performance than the KL divergence based cost function. It should be mentioned that the KL divergence has been utilized with half of its capability since the mean values could not be incorporated in Eqn. 19 . Though, it may be taken as a slight drawback of GZSQ. But, nothing for certain can be concluded about the end performance if KL divergence would have been utilized fully against the proposed L Z . It can also be observed from Figure 4 (row:top/bottom, col:right) that the stills of distilled data, when utilizing GZSQ, may not comprise of much structural information. However, it retains rich distributional properties that helps in achieving the near FP32 accuracy post quantization, as shown in Table. XI. When comparing with unit Gaussian samples, the inclusion of pre-trained weights in the distillation process of GZSQ may have acted as the cherry on the top. In Table XIII , we have shown the results to demonstrate the effect of extra biases, introduced due to BN [39] folding, when considered or ignored during our distillation process. 
It can be observed that for most of models, especially in the case of ResNet18 [70] , incorporating extra biases in the distillation phase (see Section III-D) yields better results than ignoring the same in the distillation phase, irrespective of the quantization configuration. The incorporated biases may have reduced the correlation shift between the folded weights and distilled data, resulting in a near-accurate estimation of activation ranges to calculate quantization parameters. In this work, we have presented a novel generalized zeroshot quantization (GZSQ) framework for the post-training quantization of deep CNNs, that leverages only the pre-trained weights of the model. The proposed zero-shot approach does not rely on the original unlabelled data or the learned BN layer parameters to infer the activation ranges. Our proposed scheme is built upon the data distillation approach. And, we argued that Z-score based loss could be more effective than 1 and 2 norms or KL divergence, during distillation. We have benchmarked the models (w & w/o BN layers) for the tasks of classification and object detection. We have also presented the quantization results for pneumonia classification in chest X-rays. We have shown that our method performed well, even when BN layer is folded or absent. Further, our approach is more generalized, outperforming the best-published works across multiple quantization frameworks. We have also presented a detailed ablation study, demonstrating the effect of various cost functions and BN folding. The presented work is the first attempt towards the posttraining zero-shot quantization of the futuristic unnormalized deep CNNs. And, in the subsequent time, we would like to extend our approach to mixed precision quantization and other problem domains. For e.g., multi-modal classification, image de-noising & reconstruction, a glimpse of whose foundation has been given in the Figure 6 of the supplemental file. On an abstract level, class activation maps depict the regions in the input image, which have been addressed by the deep models with more focus during classification or any other similar high-level vision task. Figure 5 shows the class activation maps generated by using the quantized models for the task of Imagenet [10] classification. For this, we have utilized the SqueezeNextV5 [16] model as FP32 baseline. This brief study aims to demonstrate the effect of quantization on the model's focus through the lens of activation maps. In Figure 5 , PC denotes the predicted class by the quantized model, whereas TC refers to the true class of the input image. It can be observed that across a variety of classes, activation maps generated by the quantized model using GZSQ are more similar to FP32 maps when compared to ZEROQ [35] maps. Further, the scores generated by the GZSQ quantized model are more aligned towards those of the FP32 against ZEROQ [35] . The warmer colors in the activation map denote the regions with more focus, whereas the lighter colors show the regions with less emphasis. Therefore, in terms of priority, it can be concluded that the quantized model using the GZSQ scheme performs more similarly to the original FP32 than the ZEROQ's [35] quantized model. To show the robustness of the GZSQ across different domains, we have presented a brief ablation study on single image de-raining using quantized models. For this, we have adopted the pre-trained DRN [91] model as FP32 baseline and perform INT8 quantization using N (0, 1), ZEROQ [35] and GZSQ. 
For quantitative analysis, we have utilized Real Internet [90] dataset that consists of 146 real-world rainy images collected from the internet. The selected real-world rainy images do not have ground truth; hence for a fair comparison, we have utilized the reference-less image quality metrics, namely, Naturalness Image Quality Evaluator (NIQE) [92] , Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) [93] , and Total Variation Error (TV). The average spatial dimension of the images in the incorporated dataset is 371 px in height and 568 px in width. In general, image de-raining models trained on synthetic datasets tend to learn a different distribution of rain-streaks, unlike in real-world images, which is why most deep learning-based methods for image de-raining do not generalize well. However, in this paper, we restrict our analysis to the effect of quantization only. It can be observed from 6 : Sample results on single image rain-streak removal by using the quantized models (INT8) from ZEROQ [35] and GZSQ in terms of Naturalness Image Quality Evaluator (NIQE; smaller is better). The original baseline FP32 model is DRN [91] . obtained by using GZSQ's quantized model have surpassed ZEROQ [35] in terms of NIQE [92] and BRISQUE [93] . We have also presented the qualitative results in Figure 6 in terms of NIQE [92] against ZEROQ [35] . Rethinking zero-shot video classification: End-to-end training for realistic applications Discriminative multi-modality speech recognition On the general value of evidence, and bilingual scene-text visual question answering Gradient-based learning applied to document recognition Imagenet classification with deep convolutional neural networks Very deep convolutional networks for large-scale image recognition Deep learning with cots hpc systems Rate distortion for model Compression:From theory to practice Mo-bilenetv2: Inverted residuals and linear bottlenecks ImageNet: A Large-Scale Hierarchical Image Database Hrank: Filter pruning using high-rank feature map Multi-dimensional pruning: A unified framework for model compression Neural network pruning with residual-connections and limited-data Learning filter pruning criteria for deep convolutional neural networks acceleration Improving Neural Network Quantization without Retraining using Outlier Channel Splitting Squeezenext: Hardware-aware neural network design Mobilenets: Efficient convolutional neural networks for mobile vision applications Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size Shufflenet: An extremely efficient convolutional neural network for mobile devices Bayesian compression for deep learning Soft weight-sharing for neural network compression Improved bayesian compression Quantizing deep convolutional networks for efficient inference: A whitepaper Quantization and training of neural networks for efficient integer-arithmetic-only inference PACT: parameterized clipping activation for quantized neural networks Xnor-net: Imagenet classification using binary convolutional neural networks Lq-nets: Learned quantization for highly accurate and compact deep neural networks Incremental network quantization: Towards lossless cnns with low-precision weights Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients CAT: compression-aware training for bandwidth reduction Post training 4-bit quantization of convolutional networks for rapid-deployment Low-bit quantization of neural networks for efficient inference Same, same but 
different -recovering neural network quantization error through weight factorization Datafree quantization through weight equalization and bias correction Zeroq: A novel zero shot quantization framework The knowledge within: Methods for data-free model compression Inceptionism: Going deeper into neural networks Generative low-bitwidth data free quantization Batch normalization: Accelerating deep network training by reducing internal covariate shift Fixup initialization: Residual learning without normalization Deep isometric learning for visual recognition Characterizing signal propagation to close the performance gap in unnormalized resnets Visualizing and understanding convolutional networks Cifar-10 Microsoft coco: Common objects in context Neural architecture search with reinforcement learning Fitnets: Hints for thin deep nets Distilling the knowledge in a neural network Variational student: Learning compact and sparser networks in knowledge distillation framework Optimal Brain Damage Second order derivatives for network pruning: Optimal brain surgeon Learning both weights and connections for efficient neural network Co-design of deep neural nets and neural net accelerators for embedded vision applications A survey of FPGA based neural network accelerator Exploiting linear structure within convolutional networks for efficient evaluation An exploration of parameter redundancy in deep networks with circulant projections Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding A method for the construction of minimum-redundancy codes Amc: Automl for model compression and acceleration on mobile devices Iterative deep neural network quantization with lipschitz constraint Progressive learning of low-precision networks for image classification Generative adversarial nets Conditional image synthesis with auxiliary classifier GANs Haq: Hardware-aware automated quantization with mixed precision Adabits: Neural network quantization with adaptive bit-widths Adaptive loss-aware quantization for multi-bit networks Tradi: Tracking deep neural network weight distributions Understanding the difficulty of training deep feedforward neural networks," ser Delving deep into rectifiers: Surpassing human-level performance on imagenet classification Deep residual learning for image recognition Shufflenet: An extremely efficient convolutional neural network for mobile devices Rethinking the inception architecture for computer vision Leveraging filter correlations for deep model compression Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5mb model size Seernet: Predicting convolutional neural network feature-map sparsity through low-bit quantization Identifying medical diagnoses and treatable diseases by image-based deep learning Pytorch: An imperative style, highperformance deep learning library Adam: A method for stochastic optimization Focal loss for dense object detection Aggregated residual transformations for deep neural networks Value-aware quantization for training and inference of neural networks Resiliency of deep neural networks under quantization Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems MRI from Picture to Proton Dagan: Deep dealiasing generative adversarial networks for fast compressed sensing mri reconstruction Anatomically constrained neural networks (acnns): Application to cardiac image enhancement and segmentation The proximal origin of sars-cov-2 Chexnet: Radiologist-level 
pneumonia detection on chest x-rays with deep learning Dreaming to distill: Data-free knowledge transfer via deepinversion Spatial attentive single-image deraining with a high quality real rain dataset Dual recursive network for fast image deraining Making a "completely blind" image quality analyzer No-reference image quality assessment in the spatial domain The authors would like to thank the anonymous Reviewers and Editors for their insightful observations, comments and suggestions, which helped to improve the quality of this work.