key: cord-0520769-m2yrrekl authors: Baruch, Moran; Greenberg, Lev; Moshkowich, Guy title: Fighting COVID-19 in the Dark: Methodology for Improved Inference Using Homomorphically Encrypted DNN date: 2021-11-05 journal: nan DOI: nan sha: d45c90e0d161d5ab63c57e92f76d7a85d2b129c4 doc_id: 520769 cord_uid: m2yrrekl

Privacy-preserving deep neural network (DNN) inference is a necessity in regulated industries such as healthcare, finance, and retail. Recently, homomorphic encryption (HE) has been used as a method to enable analytics while addressing privacy concerns. HE enables secure predictions over encrypted data. However, there are several challenges related to the use of HE, including DNN size limitations and the lack of support for some operation types. Most notably, the commonly used ReLU activation is not supported under some HE schemes. We propose a structured methodology to replace ReLU with a quadratic polynomial activation. To address the accuracy degradation issue, we use a pre-trained model that trains another HE-friendly model, using techniques such as "trainable activation" functions and knowledge distillation. We demonstrate our methodology on the AlexNet architecture, using the chest X-ray and CT datasets for COVID-19 detection. Our experiments show that with our approach, the gap in F1 score and accuracy between the models trained with ReLU and the HE-friendly models is narrowed down to a 1.1-5.3 percent degradation.

The ability to run DNN inference on untrusted cloud environments is becoming critical for many industries such as healthcare, finance, and retail. Doing so while adhering to privacy regulations such as HIPAA [Cen96] and GDPR [EU 16] is not trivial. For example, consider a hospital that wishes to analyze and classify medical images (e.g., [WLW20, GWW20]) on the cloud. Regulations may force the hospital to encrypt these images before uploading them to the cloud, which can prevent the analytical evaluation of the data in the cloud without first performing decryption. HE for DNN inference is an active research topic [NRPH19, GBDL+16, HTG17], with the goal of using a trained DNN model to classify encrypted data. However, this task is not trivial due to limitations that an HE scheme may possess. We describe two principal challenges of HE inference.

Multiplication depth. Some HE schemes only allow a certain number of consecutive multiplication operations. To tackle this challenge, such schemes use a bootstrapping operation [Gen09] that allows further computation. Multiplication depth is defined as the longest chain of sequential multiplication operations in the HE-evaluated function. Because bootstrapping is an expensive operation in terms of run-time, reducing the multiplication depth allows us to reduce or even avoid bootstrapping altogether, while speeding up the entire computation.

Non-polynomial operations. Some modern HE schemes support only the basic arithmetic operations of addition and multiplication, e.g., the CKKS [CKKS17] and BGV [BGV14] schemes. This means that a DNN component that cannot be represented as a composition of these arithmetic operations cannot be computed directly in HE. One way to overcome this limitation is to approximate the operation with a polynomial. For example, the ReLU activation function, defined as ReLU(x) = max(0, x), is approximated by a polynomial in [LKL+21, TPD+19, MZ17, HTG17]. A second option is to replace the operation with a similar but different HE-friendly operation.
For example, this may involve replacing a max-pooling operation with the HE-friendly operation of average-pooling, which in many use cases does not affect the DNN performance [GBDL+16]. A third way is to use a client-aided design [LTJS+21], where the hard-to-compute operation is sent to the data-owner, who decrypts the data, computes the operation, encrypts the result, and sends it back to the cloud to continue its HE computation. We prefer to avoid this method because, in addition to the communication complexity, a recent theoretical attack [AV21] raised concerns about the security of client-aided designs.

Our Contributions. We propose a new methodology that combines several techniques for adapting and training HE-friendly DNNs for encrypted inference. In these DNNs, we replace the ReLU activation functions with customized quadratic polynomial approximations. Our methodology also enables the entire inference process to occur without interaction with the data-owner, e.g., in the cloud environment. We show empirically that the resulting inference accuracy is comparable to that of our baseline, the original DNN with ReLU activations and max-pooling. The applied techniques include:

• Low-degree polynomial activation functions with trainable coefficients
• A method for gradual replacement of the original activation functions during training
• An adaptation of the knowledge distillation (KD) technique to train an HE-friendly model from a pre-trained baseline model in its vanilla setting

We evaluated the efficiency of our method on the AlexNet [KSH12] architecture for COVID-19 detection over large CT and chest X-ray (CXR) images of size 224×224×3. The results demonstrated a minimal degradation of up to 5.3% in the F1 score, compared to the original AlexNet model.

Organization. The paper is organized as follows. Section 2 surveys the relevant literature. Section 3 presents our methodology and the techniques that we use. We present our experiments in Section 4 and conclude in Section 5.

The ReLU function uses the non-polynomial max operation, which is not supported by some HE schemes such as CKKS. For these schemes, this limitation can be addressed using methods such as lookup tables and polynomial approximations. The use of lookup tables to approximate ReLU was introduced in [PUZ93, KM10] and was later used for training DNNs homomorphically in [LFFJ19, NRPH19]. One disadvantage of this approach is the low resolution of the lookup table, which is limited by the number of lookup table entries. This number is significantly lower than the number of values representable in a single- or double-precision floating-point number. In addition, this technique is not available for all HE schemes, e.g., for CKKS.

The second approach involves techniques for replacing the ReLU activation function with a polynomial approximation. This can be done either by using an analytical method to derive the polynomial or by using machine learning to train the polynomial coefficients. The work of [CKK+19] describes a method for approximating the generic max function. However, using this method often leads to a high-degree polynomial approximation, which increases the multiplication depth and the accumulated noise. In addition, this approximation is applicable only when the input operands are limited to a specific range. Unlike the above methods that approximate generic functions, we are interested in ReLU. ReLU is a special case of the max function, where one of the max input operands is fixed to zero.
Hence, it is possible to use other approximation methods that yield polynomials with even lower degrees and better performance, while improving the overall efficiency.

The square function (square(x) = x²) [GBDL+16] is a well-known low-degree polynomial replacement for the ReLU function. However, when the number of layers in the model grows, the accuracy of the model degrades significantly. To mitigate this degradation, [LKL+21, TPD+19, MZ17, HTG17] suggested using a higher-degree polynomial, which again leads to a high computational depth. Another mitigation was suggested in [TPD+19], which approximated ReLU using a quadratic polynomial instead of a simple square function. To evaluate the performance of these methods, the authors used a lighter variant of AlexNet [KSH12] with images of size 32 × 32 × 3. We tested their approach on the original AlexNet architecture with larger images of size 224 × 224 × 3. As reported in Section 4, this approximation suffers from a significant accuracy degradation of up to 35%.

The studies above searched for a ReLU replacement that would serve as a good polynomial approximation, and then replaced all the ReLU occurrences in a model with this approximation. In contrast, we consider a fine-tuning approach, in which we use a different activation per layer, without necessarily approximating the ReLU activation function. To this end, we train the coefficients of the different polynomials as part of the DNN training. We call this technique 'trainable activation'. A similar approach was suggested by [PUZ93, ZXF02, SSCU19], where the authors trained a polynomial per neuron in small networks of several dozen neurons. Obviously, this approach is not feasible in modern networks, where the number of neurons is in the order of millions and the number of parameters requiring optimization would be huge. Instead of training a polynomial per neuron, [WLW+18] suggested training a polynomial per layer, for all channels together. They evaluated their activation functions on three models, where the largest model has 4 convolutional layers, 2 average-pooling layers, 2 fully-connected layers, and 3 polynomial activations, accompanied by a batch normalization layer. The experiments were applied to the MNIST dataset, which consists of images of size 28 × 28 × 1. We scaled up this approach by using a larger model (see Appendix A) over larger images, and extended their methodology by combining additional techniques to help the model converge better. Our evaluation shows that we were able to narrow down the accuracy gap between the HE-friendly AlexNet, trained using the approach of [WLW+18], and our baseline. Our experiments show that, compared to their approach, we improved the performance of the model by up to 12.5%.

Our goal was to replace the ReLU function with a polynomial activation function. This section describes the training methodology we used to achieve performance comparable with the baseline model. Our approach was to design a trainable polynomial that would replace the ReLU activation, without approximating it. Recent papers suggested approximating ReLU using high-degree polynomials to achieve a good approximation. However, for a polynomial of degree n, this requires an order of log_2(n) multiplications [Sch75], which significantly increases the computation depth as the number of multiplications grows. Therefore, we suggest using a trainable activation in the form of a 2nd-degree polynomial without the constant term, as sketched below.
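The following is a minimal PyTorch sketch of such a trainable activation layer. The class name is ours, and the default initialization a = 0, b = 1 follows the linear, ReLU-like starting point described in the next paragraph; this is an illustrative sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TrainablePolyAct(nn.Module):
    """Quadratic activation a*x^2 + b*x with per-layer trainable coefficients.

    A sketch of the 'trainable activation' idea described in the text;
    the class name and defaults are illustrative.
    """
    def __init__(self, a_init: float = 0.0, b_init: float = 1.0):
        super().__init__()
        # One (a, b) pair per layer, shared across all channels and neurons.
        self.a = nn.Parameter(torch.tensor(a_init))
        self.b = nn.Parameter(torch.tensor(b_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only additions and multiplications: HE-friendly, adds only 1 to the
        # multiplication depth.
        return self.a * x * x + self.b * x
```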
We use the form ax² + bx, where a and b are trainable coefficients that we train individually per layer. A similar approach was presented by [WLW+18]; each such activation layer increments the multiplication depth by only 1.

Applying such a significant architecture change to a complex model at once, without first adapting the model weights, can lead to a steep drop in accuracy. Hence, we designed a new approach that we call Smooth-Transition. We start by training a model that includes ReLU activation layers for e_0 epochs. Over the next d epochs, we smoothly transition from the ReLU functions to the polynomial activation functions poly_act(). Then, we continue to train the model on the transitioned HE-friendly architecture. To model this, we use a ratio parameter λ_e per epoch e and set the weighted activation function at epoch e as

weighted_act_{λ_e}(x) := (1 − λ_e) · ReLU(x) + λ_e · poly_act(x).

To help the network converge, we initialize the quadratic function poly_act(x) = ax² + bx as a linear function that is somewhat similar to ReLU, by setting a = 0 and b = 1.

Replacing the ReLU activations with polynomial activations makes the model weaker for the classification task. To strengthen the model, we adopted the well-known KD [HVD15] technique. KD enables a "knowledge" transfer from a stronger pre-trained 'teacher' model to a weaker 'student' model. In practice, the student model is usually smaller than the teacher [MFL+20]. In our case, replacing ReLU by a polynomial activation weakens the HE-friendly model; therefore, the original model is used as a 'teacher' model to assist in training the 'student' HE-friendly model. We use the response-based KD approach, one of the simplest KD methods [GYMT21]. Here, an additional term is added to the loss function to measure discrepancies between the predictions of the teacher and student models. Following the commonly used soft-target technique [HVD15], soft targets are used instead of the original predictions of the teacher and student models:

q_i^τ = exp(z_i / τ) / Σ_j exp(z_j / τ),

where q_i^τ is the soft-target version of the prediction for class i, z_i are the original prediction logits, and τ is the temperature [HVD15]. With τ = 1 the above formula becomes the standard "softmax" output; using a higher temperature value (τ > 1) produces a more uniform distribution of the probabilities over the classes. The resulting loss function becomes [Li18, Li21]:

L = CE(Q_s^1, y_true) + α · CE(Q_s^τ, Q_t^τ),

where Q_s^τ and Q_t^τ are the vectors of soft-target predictions of the student and teacher models with the same temperature τ > 1, Q_s^1 is the "softmax" student prediction, y_true is the vector of original labels, CE is the cross-entropy loss function, and α is the hyperparameter controlling the relative weight of the additional KD loss term.
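A minimal PyTorch sketch of this loss is shown below, assuming the combination written above (a hard-label cross-entropy term plus an α-weighted soft-target cross-entropy); the function name is ours and the defaults follow the values reported in Appendix B. Some KD formulations additionally scale the soft term by τ² to balance gradient magnitudes; we omit that here since the text does not specify it.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau: float = 10.0, alpha: float = 0.1):
    """Response-based KD loss sketch: CE(student, labels) plus an alpha-weighted
    cross-entropy between temperature-softened student and teacher predictions."""
    # Hard-label term: standard cross-entropy on the student's logits.
    ce_hard = F.cross_entropy(student_logits, labels)

    # Soft-target term: soften both distributions with temperature tau.
    log_q_student = F.log_softmax(student_logits / tau, dim=1)
    q_teacher = F.softmax(teacher_logits / tau, dim=1)
    # Cross-entropy between the soft teacher and soft student distributions.
    ce_soft = -(q_teacher * log_q_student).sum(dim=1).mean()

    return ce_hard + alpha * ce_soft
```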
For our experiments we consider two datasets: COVIDx and COVIDx CT-2A.

COVIDx [WLW20]. This is a dataset of CXR images labeled as Normal, Pneumonia, or COVID-19. It is an open-access benchmark dataset comprising ∼20,000 CXR images, with the largest number of publicly available COVID-19-positive cases. It collects its data from six chest X-ray datasets [CMD+20, A20, TSL+21, CRK+20, CVS+13, RKQ+21] and combines them into one large dataset that is updated over time with more COVID-19-positive CXR images. The number of images we used per class is given in Table 1. When creating this dataset, we verified that no patients overlap between the train, test, and validation subsets. We applied an augmentation process to the data, similar to [WLW20].

COVIDx CT-2A [GWW20]. This dataset consists of 194,922 chest CT slices from 3,745 patients, with the same classes as in the previous dataset. Due to time limitations, we used a random subset of the original dataset, as depicted in Table 1. Each image was augmented as follows: resize to 224 × 224 × 3, random rotation, horizontal flip, vertical flip, color jitter, and normalization.

The authors of the COVIDx chest X-ray and CT datasets [WLW20, GWW20] also developed tailored models named COVID-Net. Due to the constraint on computational depth, we instead used a pre-trained AlexNet [KSH12] model that yields similar results. Since the original AlexNet is not HE-friendly, we present the steps for transforming the original model into an HE-friendly one. We implemented an AlexNet model based on PyTorch and added a batch normalization layer. To avoid additional computational depth, after the training process ends, we absorb the coefficients of the batch normalization into the weights of the next layer, as suggested by [IÖ18]; a sketch of this folding is given below. See Appendix A for a detailed description of the neural network architectures.
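The following sketch illustrates this absorption step for a batch-normalization layer followed by a fully-connected layer. The function name, the choice of BatchNorm1d/Linear, and the fusion formula W' = W·diag(γ/σ), b' = b + W·(β − (γ/σ)·μ) are our illustration of the standard folding, assumed to correspond to the approach of [IÖ18] rather than taken from the authors' code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_next_linear(bn: nn.BatchNorm1d, linear: nn.Linear) -> nn.Linear:
    """Absorb a batch-norm layer into the fully-connected layer that follows it,
    so that linear(bn(x)) == fused(x) at inference time. Illustrative sketch."""
    s = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # per-feature scale gamma / sigma
    shift = bn.bias - s * bn.running_mean                 # per-feature offset beta - s * mu

    fused = nn.Linear(linear.in_features, linear.out_features, bias=True)
    # W' = W * diag(s): scale each input column of W.
    fused.weight.copy_(linear.weight * s.unsqueeze(0))
    # b' = b + W @ shift.
    base_bias = linear.bias if linear.bias is not None else torch.zeros(linear.out_features)
    fused.bias.copy_(base_bias + linear.weight @ shift)
    return fused
```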
Table 2 summarizes our experimental results using different methods on the COVIDx and COVIDx CT-2A datasets. In every experiment, we measured the accuracy and the macro-average of the F1 scores over all three classes. We repeated every experiment five times with different seeds and report the average results and the standard deviation in Table 2. For more details regarding the training setup, see Appendix B. As can be seen from Table 2, our approach narrows the gap between the HE-friendly models and the ReLU baseline to within 1.1-5.3% in F1 score and accuracy.

We introduced a new methodology for training an HE-friendly model that replaces the ReLU activation functions with a trainable quadratic approximation. Our approach uses techniques such as polynomial activation functions with trainable coefficients, gradual replacement of activation layers during training, and KD. We tested our methodology on chest CT and CXR image datasets using the AlexNet architecture, and showed that the performance of our trained model is only 5% less accurate than the reference model, making it at least 15% better than all previously suggested HE-friendly models. In future work, we plan to further evaluate our methods on more complicated models and domains, and to evaluate the new HE-friendly model's performance in the encrypted domain.

References
• Actualmed COVID-19 chest x-ray data initiative
• On the privacy of protocols based on CPA-secure homomorphic encryption
• (Leveled) fully homomorphic encryption without bootstrapping
• The Health Insurance Portability and Accountability Act of 1996 (HIPAA)
• Numerical method for comparison on homomorphically encrypted numbers
• Homomorphic encryption for arithmetic of approximate numbers
• Nasser Al Emadi, Mamun Bin Ibne Reaz, and Mohammad Tariqul Islam. Can AI help in screening viral and COVID-19 pneumonia?
• Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)
• CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy
• Fully homomorphic encryption using ideal lattices
• Computing arbitrary functions of encrypted data
• COVIDNet-CT: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest CT images
• Knowledge distillation: A survey
• Deep neural networks over encrypted data
• Distilling the knowledge in a neural network
• FHE-compatible batch normalization for privacy preserving deep learning
• An optimized lookup-table for the evaluation of sigmoid function for artificial neural networks
• ImageNet classification with deep convolutional neural networks
• Fast and accurately training deep neural networks on encrypted data
• Exploring knowledge distillation of DNNs for efficient hardware solutions
• Exploring knowledge distillation of DNNs for efficient hardware solutions
• Privacy-preserving machine learning with fully homomorphic encryption for deep neural network
• Enabling homomorphically encrypted inference for large DNN models
• Improved knowledge distillation via teacher assistant
• SecureML: A system for scalable privacy-preserving machine learning
• Towards deep neural network training on encrypted data
• Neural networks with digital LUT activation functions
• Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images
• A lower bound for the length of addition chains
• Learning activation functions from data using cubic spline interpolation
• Privacy preserving neural network inference on encrypted data with GPUs
• The RSNA international COVID-19 open annotated radiology database (RICORD)
• PPolyNets: Achieving high prediction accuracy and efficiency with parametric polynomial activations
• COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images
• Neuron-adaptive higher-order neural-network models for automated financial data modeling

Appendix A: Neural network architectures
[Layer table, partially recovered. Baseline AlexNet (fragment): MaxPool2d(kernel=3 × 3, stride=2), MaxPool2d(kernel=3 × 3, stride=2), Dropout(p=0.2), FC(in=9216, out=4096, ReLU), Dropout(p=0.2), FC(in=4096, out=4096, ReLU), FC(in=4096, out=3). HE-friendly variant (fragment): AvgPool2d(kernel=3 × 3, stride=2), AvgPool2d(kernel=3 × 3, stride=2), Dropout(p=0.2), FC(in=4096, out=3).]

To the original AlexNet network, we added a batch normalization layer as our baseline. Here, all convolution layers use padding='same'. Note that the original model implemented by PyTorch has a global average pooling layer with an output size of 6 × 6. For input images of size 224 × 224, it is equivalent to the identity function, so we ignored it in our experiments.

Appendix B: Training setup
We evaluated each experiment with 5 different seeds: 111, 222, 333, 444, 555. During the training process, we loaded images in mini-batches of size 32, and optimized the loss function using the Adam optimizer. The number of epochs differs for each task, as does the learning rate, which was usually 3e-5 or 3e-4. The activation replacement started at epoch 3, and in the case of smooth transition, the activations were gradually replaced over 10 epochs. We replaced all ReLU activations in parallel; a sketch of such a transition schedule is given below.
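The following is a minimal PyTorch sketch of such a schedule. The module name, the linear ramp for λ_e, and the set_epoch hook are our illustration (the text states only that the transition is gradual), with the defaults taken from the numbers above.

```python
import torch
import torch.nn as nn

class SmoothTransitionAct(nn.Module):
    """Weighted mix (1 - lam) * ReLU(x) + lam * poly_act(x), with lam ramped
    from 0 to 1 over the training epochs. Illustrative sketch of Smooth-Transition."""
    def __init__(self, poly_act: nn.Module, start_epoch: int = 3, duration: int = 10):
        super().__init__()
        self.poly_act = poly_act
        self.relu = nn.ReLU()
        self.start_epoch = start_epoch
        self.duration = duration
        self.lam = 0.0  # updated once per epoch by the training loop

    def set_epoch(self, epoch: int) -> None:
        # Linear ramp: lam = 0 before start_epoch, 1 after start_epoch + duration.
        self.lam = min(max((epoch - self.start_epoch) / self.duration, 0.0), 1.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.lam >= 1.0:
            return self.poly_act(x)  # fully transitioned: HE-friendly path only
        return (1.0 - self.lam) * self.relu(x) + self.lam * self.poly_act(x)
```

In a training loop, set_epoch(e) would be called on every such module at the beginning of epoch e, so that all ReLU activations are replaced in parallel.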
We found that it is better to initialize the coefficients with values similar to the form of ReLU; therefore, the coefficients of each trainable activation were initialized as 0.0x² + 1.1x. For the distillation process described in Section 3.3, we set the temperature value to 10 and the α parameter to 0.1.
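To show how these pieces fit together, here is a hypothetical training-loop sketch that wires the TrainablePolyAct, SmoothTransitionAct, and kd_loss sketches from the earlier sections to the hyperparameters reported in this appendix (Adam, mini-batches of 32, learning rate 3e-4, τ = 10, α = 0.1). The function name and structure are ours, not the authors' code.

```python
import torch
from torch.optim import Adam

def train_he_friendly(student, teacher, train_loader, epochs: int, device: str = "cpu"):
    """Hypothetical wiring of the sketches above: per-epoch smooth transition of the
    activations plus KD against a frozen ReLU teacher. Illustrative only."""
    student, teacher = student.to(device), teacher.to(device)
    teacher.eval()  # the pre-trained baseline model is kept fixed
    optimizer = Adam(student.parameters(), lr=3e-4)

    for epoch in range(epochs):
        student.train()
        # Advance the ReLU -> polynomial ratio in every SmoothTransitionAct module.
        for module in student.modules():
            if hasattr(module, "set_epoch"):
                module.set_epoch(epoch)

        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                teacher_logits = teacher(images)
            student_logits = student(images)
            # kd_loss is the sketch defined earlier (tau and alpha as in Appendix B).
            loss = kd_loss(student_logits, teacher_logits, labels, tau=10.0, alpha=0.1)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```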