title: Securing content-based image retrieval on the cloud using generative models
authors: Wang, Yong; Wang, Fan-chuan; Liu, Fei; Wang, Xiao-hu
date: 2022-04-08 journal: Multimed Tools Appl DOI: 10.1007/s11042-022-12880-6

Content-based image retrieval (CBIR) with deep neural networks (DNNs) on the cloud has tremendous business and technical advantages for handling large-scale image repositories. However, cloud-based CBIR services raise challenges in image data and DNN model security. Typically, users who request CBIR services on the cloud require that their input images remain confidential. On the other hand, image owners may intentionally (or unintentionally) upload adversarial examples to the cloud servers, which can potentially lead to the misbehavior of CBIR services. Generative Adversarial Networks (GANs) can be utilized to defend against such malicious behavior. However, the GAN model, if not well protected, can easily be abused by the cloud to reconstruct the users' original image data. In this paper, we focus on the problem of secure generative model evaluation and secure gradient descent (GD) computation in GANs. We propose two secure generative model evaluation algorithms and two secure minimizer protocols. Furthermore, we propose and implement Sec-Defense-Gan, a secure image reconstruction framework which keeps the image data, the generative model details and the corresponding outputs confidential from the cloud. Finally, we carried out a set of benchmarks over two publicly available image datasets to show the performance and correctness of Sec-Defense-Gan.

Deep neural networks (DNNs) have shown unprecedented superiority for visual object recognition, which may be the most powerful approach for content-based image retrieval on the cloud. Sec-Defense-Gan consists of two major components, i.e., SecGan and SecMinimizer. SecGan provides a secure generative model evaluation functionality, while SecMinimizer performs the secure gradient update. -We apply Sec-Defense-Gan to a secure content-based image retrieval system which assumes that the image data owner may be an adversary. The system can protect the image data, the GAN model details (e.g., the weight matrices) and the image retrieval results from the cloud, while defending against adversarial examples whose distribution differs from that of the training samples used by the NN model. The rest of the article is organized as follows. We review the related works in Section 2. Section 3 explains the preliminaries. In Sections 4, 5 and 6, we describe our framework and implementations in detail. The experimental evaluations are shown in Section 7. Finally, we discuss and draw brief conclusions in Section 8.

Content-based image retrieval (CBIR) has been studied for several decades. CBIR technologies are typically computationally complex and storage-intensive because images need to be indexed by their visual contents (features). Hence, it is an inevitable choice to store users' images on cloud servers, which can further provide CBIR services. Although outsourcing CBIR services to the cloud has great business advantages, privacy concerns arise because users assume by default that the cloud service providers are not trustworthy. Homomorphic encryption (HE) based techniques can be applied to secure CBIR services.
In such schemes, users encrypt images pixel by pixel with a homomorphic cryptosystem (e.g., Paillier [26], ElGamal [10], or lattice-based AHE [8]), which allows the cloud to index and process their images in the encrypted domain. Hsu et al. [18] proposed a high-precision CBIR algorithm by adopting the Paillier cryptosystem to encrypt private images. This approach suffers from significant ciphertext expansion, which leads to slow encryption and decryption and to scalability issues. Hu et al. [19] further proposed an efficient scheme for SIFT feature extraction by utilizing the ring-Learning-With-Errors (r-LWE) homomorphic cryptosystem [6]. Different from their previous scheme proposed in [18], their batched secure multiplication protocol is built on a Somewhat Homomorphic Encryption (SWHE) scheme that enables the two parties to securely compute the products of multiple pairs of private integers simultaneously, with greatly reduced computation and communication costs. A variant work proposed by Zheng et al. [33] replaced Paillier ciphertexts with pointers to a ciphertext table, which reduces the number of encryption operations and minimizes ciphertext expansion. Li et al. [22] proposed a double-decryption SIFT feature extraction scheme based on the BCP cryptosystem, an additively homomorphic scheme with two independent decryption algorithms. Although HE-based schemes allow the cloud server to process and index encrypted images in a semantically secure manner, they incur much higher time and space complexity [11]. More importantly, these schemes inherently face ciphertext expansion and noise growth problems [5, 7-9, 13, 24], which have potentially negative effects on scalability and accuracy. For example, the schemes in [19, 22, 33] can only deal with the integer values of SIFT vectors and accept a limited number of additive homomorphic operations. They are hardly applicable to CBIR with deep features, such as features extracted by convolutional neural networks (CNNs), because they perform very poorly given the large multiplicative depth of a CNN. Last but not least, these schemes naturally assume that image owners are honest; applying a GAN to defend against malicious owners would further require the design of HE-based de-convolution operations. In this work, we consider the secure reconstruction of images with the generative model of a GAN by utilizing homomorphic encryption in conjunction with multi-party computation.

GANs include two neural networks, a generative model G and a discriminative model D. G: R^k → R^n maps a low-dimensional latent space to the high-dimensional sample space of x. D is a binary neural network classifier. In the training phase, G and D are typically learned in an adversarial fashion using actual input data samples x and random vectors z. An isotropic Gaussian prior is usually assumed on z. While G learns to generate outputs G(z) that have a distribution similar to that of x, D learns to discriminate between "real" samples x and "fake" samples G(z). D and G are trained in an alternating fashion to minimize the following min-max loss: min_G max_D E_{x∼ρ_data(x)}[log D(x)] + E_{z∼ρ_z(z)}[log(1 − D(G(z)))]. It was shown that the optimal GAN is obtained when the resulting generator distribution ρ_g = ρ_data(x) [14]. Because GANs are difficult to train in practice [16], many alternative formulations have been proposed.
Wasserstein GANs (WGANs) [16] have shown improved training stability; they use the Wasserstein distance in the loss function: min_G max_D E_{x∼ρ_data(x)}[D(x)] − E_{z∼ρ_z(z)}[D(G(z))]. One application scenario of WGANs is to use a WGAN trained on legitimate training samples to defend against adversarial samples. Defense-Gan is a framework proposed by Samangouei et al. [27] for defending against adversarial attacks in classification problems. Defense-Gan employs a novel defense strategy which uses a WGAN trained on legitimate (un-perturbed) training samples to "denoise" adversarial examples. At inference time, given a trained GAN generator G and an image x to be classified, z* is first found to minimize ||G(z) − x||_2^2. This non-convex minimization problem is approximated by performing a fixed number L of Gradient Descent (GD) steps using R different random initializations of z. Figure 1 shows the process of the L Gradient Descent (GD) steps in detail. G(z*) is then given as an input to the CBIR system. The overview of this application scenario is illustrated in Fig. 2 (the GAN is trained on the available classifier training dataset in an unsupervised manner; the classifier can be trained on the original training images or on their reconstructions G(z*) using the generator G; as long as the GAN is appropriately trained and has enough capacity to represent the data, original clean images and their reconstructions should not differ much). In the figure, G denotes the generator network and Z_0 denotes a group of random vectors. Given an image x, Z_0 is updated by iteratively minimizing the loss to obtain the result Z_L. Finally, z* is selected from the vector group Z_L by executing the argmin function.

A general abstraction of packed additively homomorphic encryption (PAHE) schemes supports packing multiple plaintexts into a single ciphertext, performing SIMD homomorphic additions (SIMDAdd), scalar multiplications (SIMDScMult), and permuting the plaintext slots (Perm), i.e., producing the vector u whose slots are permuted according to a permutation Π. PAHE includes an encryption algorithm, a deterministic decryption algorithm, and a homomorphic evaluation algorithm. The encryption algorithm takes a plaintext message vector u from some message space and encrypts it using a private key sk into a ciphertext denoted as [u]. The decryption algorithm takes the ciphertext [u] and the key sk and recovers the message vector u. The homomorphic evaluation algorithm takes as input one or more ciphertexts that encrypt messages u_0, u_1, ..., and outputs another ciphertext that encrypts a message u = f(u_0, u_1, ...) for some function f. The function f is constructed by using the three basic homomorphic operations: Single Instruction Multiple Data (SIMD) addition (SIMDAdd), SIMD scalar multiplication (SIMDScMult), and permutation of the plaintext slots (Perm). Generally, a PAHE scheme is parameterized by four constants: the cyclotomic order m, the ciphertext modulus q, the plaintext modulus p, and the standard deviation σ of a symmetric discrete Gaussian noise distribution (χ). It satisfies: (1) IND-CPA security, which requires that ciphertexts of any two messages u and u' be computationally indistinguishable; and (2) function privacy, which requires that the ciphertext generated by homomorphic evaluation, together with the private key sk, reveals the underlying message, namely the output f(·), but does not reveal any other information about the function f.
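To make the reconstruction step of Defense-Gan concrete before describing how Sec-Defense-Gan secures it, the following is a minimal plaintext sketch of the L-step GD reconstruction with R random restarts. The toy linear generator, learning rate and sizes are illustrative assumptions, not part of the original paper; a trained WGAN generator and its gradient would be substituted in practice.

```python
# Plaintext sketch of the Defense-Gan reconstruction step: R random restarts of z,
# L gradient-descent steps on L(z, x) = ||G(z) - x||^2, then argmin over restarts.
# The generator here is a toy *linear* map G(z) = A z so the gradient is closed-form.
import numpy as np

rng = np.random.default_rng(0)

k, n = 128, 784                              # latent size, image size (e.g. flattened 28x28)
A = rng.normal(size=(n, k)) / np.sqrt(k)     # toy generator weights (illustrative)

def G(z):
    return A @ z

def grad_L(z, x):
    # d/dz ||G(z) - x||^2 = 2 A^T (A z - x) for the linear toy generator
    return 2.0 * A.T @ (G(z) - x)

def reconstruct(x, R=10, L=200, eta=0.01):
    best_z, best_loss = None, np.inf
    for _ in range(R):                       # R random initializations of z
        z = rng.normal(size=k)               # isotropic Gaussian prior on z
        for _ in range(L):                   # L gradient-descent steps
            z = z - eta * grad_L(z, x)
        loss = np.sum((G(z) - x) ** 2)
        if loss < best_loss:
            best_z, best_loss = z, loss
    return G(best_z)                         # G(z*) is fed to the CBIR classifier

x = rng.normal(size=n)                       # stand-in for an input image
x_rec = reconstruct(x)
print("reconstruction error:", np.sum((x_rec - x) ** 2))
```

In Sec-Defense-Gan, the calls to G and to the gradient inside this loop are replaced by SecGan and by the secure gradient computation of SecMinimizer described later.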
Any linear layer in this paper, which is either a Deconv or an FC layer, can be evaluated by homomorphic matrix-vector multiplication and homomorphic convolution. The PAHE scheme ensures that nothing about the linear layer's inputs or outputs is leaked. Secure two-party computation (2-PC) is a class of protocols that allows two parties to jointly compute a function (f_1(x, y), f_2(x, y)) ← f(x, y) without learning each other's input. Obliv-C [32] is a robust garbled-circuit implementation of 2-PC. It provides an extensible programming tool for secure computation by exposing the important aspects of data-oblivious computation, while providing a high-level language and the ability to seamlessly integrate with standard C code. Programs typically read, process, and output data in native C code, performing only the secure computation in Obliv-C code. The main language addition is an obliv qualifier, applied to C types and constructs. Typing rules enforce that obliv types remain secret unless explicitly revealed. Code within oblivious functions and conditionals cannot modify public data, except within a qualified obliv block, in which the code is always executed. These rules allow programmers to reason about data security and to develop modular libraries. A recent evaluation of Obliv-C [17] indicates its usability on a range of criteria, including language expressiveness, capabilities of the cryptographic back-end, and accessibility to developers. Any non-linear layer in this paper, typically ReLU or Sigmoid, can be securely computed with the 2-PC scheme.

As illustrated in Fig. 3, there are three entities in Sec-Defense-Gan, i.e., the cloud servers (S_1 and S_2), the image owner, and the model provider. The model provider supplies a pre-trained WGAN model to the cloud servers; it treats the pre-trained WGAN model details (e.g., the weight matrices) as sensitive and wants them to be protected from the cloud servers. We assume the model provider is honest because it has no motivation to provide a "fake" pre-trained model. Technically, this assumption can easily be guaranteed by access control and authorization mechanisms. The image owner generates or collects images and outsources the image data to the cloud servers. We assume that the image data are sensitive and need to be protected from the cloud servers. We also assume that the image owner can potentially be malicious and may intentionally (or unintentionally) upload adversarial examples to the cloud servers. These adversarial examples are carefully crafted perturbations added to legitimate input samples that cause the legitimate samples to be misclassified at inference time. The cloud servers (S_1 and S_2) filter out adversarial examples uploaded by the image owners by running the pre-trained WGAN model. We assume that the cloud servers follow the semi-honest adversary model, i.e., they follow the protocol specifications and algorithms exactly, but may attempt to learn additional information by analyzing intermediate computations. In general, secure protocols under the semi-honest model are more efficient than those under the malicious adversary model, and most practical SMC protocols are secure under the semi-honest model. We refer the reader to [12] for more details about the security definitions and models. In addition to the semi-honest model, we implicitly assume that the cloud servers do not collude.
This model is realistic in the current cloud service market: S_1 and S_2 can be cloud servers provided by legitimate, well-known companies (e.g., Amazon, Google, and Microsoft), and collusion between them is highly unlikely. Our system goal is to protect the image data, the outsourced WGAN model details (e.g., the weight matrices), and the outputs of the WGAN model from the cloud servers S_1 and S_2, while defending against adversarial examples whose distribution differs from that of the training samples used by the classifier. An image owner splits an image I into two additive shares I_a and I_b, with I_a = I + I_b, which are then stored on S_1 and S_2, respectively. In addition, S_2 holds the pre-trained WGAN model to reduce the adversarial noise of the uploaded images. The WGAN model is outsourced by the model provider after the model parameters (e.g., the weight matrices) are encrypted with sk. S_2 carries out the homomorphic operations of the linear layers of the WGAN model. S_1 keeps its private key sk and runs Yao's GC protocols with S_2 to perform the secure computations of the non-linear layers. S_1 and S_2 each hold a share, {G(z*)_a, G(z*)_b}, of the final outputs of the WGAN model, which can be treated as the inputs of the classifier (e.g., VGG-16) for the image I. As illustrated in Fig. 4, the primary functionality of our system consists of two components, SecGan and SecMinimizer. Given the encrypted WGAN model and the additive share form {I_a, I_b} of image I, SecMinimizer finds z* by cooperatively running a fixed number L of Gradient Descent (GD) steps.

The generative model used in GANs includes four layers (Fig. 5): -FC(m) is a fully-connected layer with m outputs; its input is a 128-dimensional vector z. -DeConv(m, k × k, s) is a de-convolution with m feature maps, filter size k × k, and stride s. The layers used in GANs can be classified into two types: linear layers (e.g., FC and DeConv) and non-linear layers (e.g., ReLU and Sigmoid). Many works have proposed 2-PC secure computation of non-linear layer functions [20]. As for the linear layers, homomorphic solutions can achieve good performance; however, for the de-convolution a practical solution is needed that achieves both security and efficiency. We therefore try to convert the operations in both kinds of linear layers (FC and DeConv) into the same homomorphic operations, so as to deal with the linear layers in one shot. The de-convolution in the Deconv layer is the transpose of convolution, which returns a feature map that has the same shape as the training samples. We can implement it by utilizing a lattice-based PAHE scheme. There are two methods to implement the de-convolution, i.e., direct convolution and matrix-vector product. The remaining problem is to pad the inputs for direct convolution, or to convert the Deconv into matrix multiplication and addition operations. For ease of description, the Deconv layer can be defined by 5 parameters: s denotes the stride, k = k_w = k_h denotes the filter size (width and height), i = i_w = i_h denotes the input size, pa = pa_w = pa_h denotes the padding size, and o = o_w = o_h denotes the output size. The direct convolution can be thought of as dilating the input (by adding s − 1 zeros between adjacent input elements), padding it with k − 1 zeros, and cross-correlating it with the filter. For convolution, we use half (SAME) padding (padding size pa is ⌊k/2⌋) and unit strides (s is 1).
That is, the output size is the same as the input size of the convolution. Taking Fig. 6(a) as an example, let the input size i = 5, k = 3 and s = 1; then pa = 1 and the output size is o = i = 5, i.e., we only need to pad the original input (blue entries) with zeros (white entries). For s > 1, the padding size becomes pa' = k − pa − 1 and the output size is o = s · (i − 1) + k − 2 · pa. Taking Fig. 6(b) as another example, let the input size i = 3, k = 3, s = 2 and pa = 1; the output size is o = 5, which is equivalent to convolving a 3 × 3 filter over a 3 × 3 input (with s − 1 = 1 zero inserted between adjacent inputs) padded with a 1 × 1 border of zeros using unit strides. The disadvantage of direct convolution is its inefficiency: padding zeros into the input increases the number of addition and multiplication operations performed on ciphertexts. Consider the convolution process again for an arbitrary input size, where pa_h and pa_w are the height and the width of the padding, respectively. The required padding can then be expressed per direction, as in (4), where pa_top, pa_down, pa_left and pa_right are the numbers of elements required in the four directions of the input matrix: top, bottom, left and right.

We use a matrix-vector product to represent the convolution process. First, we initialize an (o_h · o_w) × (i_h · i_w) sparse matrix C with zeros. Then, we use a matrix with the same shape as the input, where each value is set to the index number of the corresponding input element, i.e., 1, 2, ..., i_h × i_w. We pad this matrix with zeros according to (4) and slide the filter over the padded matrix with stride s. Each time the filter slides, we record the k_h × k_w filter window. Finally, we fill each value of the filter into the element of the corresponding sparse matrix C indicated by the non-zero index numbers. Consider a de-convolution with i = 4, k = 3 and s = 3; according to (4), we have pa_top = 1, pa_down = 1, pa_left = 1 and pa_right = 1. Taking the first slide in Fig. 7 as an example, we record the 3 × 3 filter window, whose non-zero positions are 1, 2, 5 and 6; hence, we fill the corresponding filter values (i.e., w_{1,1}, w_{1,2}, w_{2,1} and w_{2,2}) into them. Similarly, we can fill the remaining non-zero positions of the sparse matrix by sliding the filter window with stride s = 3 (Fig. 7 shows this example of the matrix-vector product with the sparse matrix: the filter size is 3 × 3, the sparse matrix size is 4 × 16, and i = 4, k = 3, s = 3). Single channel de-convolution. The sparse matrix C can be derived directly. Figure 8 shows an example of single-channel de-convolution parameterized as i = 2, k = 2, s = 1, pa = 0 and o = 3. The corresponding forward convolution has input size 3 and output size 2, so the sparse matrix C is of size 4 × 9. We flatten the input X into a 4-dimensional vector. Hence, the de-convolution can easily be computed by the linear operation Y = C^T X, where C^T is the transpose of C.
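To illustrate this matrix-vector formulation, the short NumPy sketch below builds the forward-convolution matrix C for the single-channel example above (i = 2, k = 2, s = 1, pa = 0, o = 3) and computes the de-convolution as Y = C^T X. The concrete filter and input values are illustrative; the reference check simply scatters scaled filter copies, which is the textbook definition of a stride-1 transposed convolution.

```python
import numpy as np

def conv_matrix(W, i_h, i_w, stride=1):
    """Build the sparse matrix C of the *forward* convolution (valid padding)
    so that conv(X) = C @ X.flatten().  Shape: (o_h*o_w, i_h*i_w)."""
    k_h, k_w = W.shape
    o_h = (i_h - k_h) // stride + 1
    o_w = (i_w - k_w) // stride + 1
    C = np.zeros((o_h * o_w, i_h * i_w))
    for r in range(o_h):
        for c in range(o_w):
            row = r * o_w + c
            for u in range(k_h):
                for v in range(k_w):
                    C[row, (r * stride + u) * i_w + (c * stride + v)] = W[u, v]
    return C

# Single-channel example from the text: i = 2, k = 2, s = 1, pa = 0, o = 3.
# The corresponding forward convolution maps a 3x3 input to a 2x2 output,
# so C is 4 x 9 and the de-convolution is Y = C^T X.
W = np.arange(1.0, 5.0).reshape(2, 2)        # illustrative 2x2 filter
X = np.arange(1.0, 5.0).reshape(2, 2)        # 2x2 de-convolution input
C = conv_matrix(W, 3, 3)                     # 4 x 9

Y = (C.T @ X.flatten()).reshape(3, 3)        # de-convolution via matrix-vector product

# Reference: stride-1 transposed convolution "scatters" W scaled by each input pixel.
Y_ref = np.zeros((3, 3))
for r in range(2):
    for c in range(2):
        Y_ref[r:r+2, c:c+2] += X[r, c] * W
assert np.allclose(Y, Y_ref)
print(Y)
```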
Multiple channel de-convolution. The sparse matrix C can be derived by matrix splicing. We use c_o × c_i × k_h × k_w to denote the filter size and c_i × i_h × i_w to denote the input size. Figure 9 shows an example of multiple-channel de-convolution with c_i = 3, c_o = 2, i = 2, k = 2, s = 1 and pa = 0. The input X is of size 3 × 2 × 2, the filter is of size 2 × 3 × 2 × 2 and the output Y is of size 2 × 3 × 3. Considering the corresponding forward convolution process, its input is of size 2 × 3 × 3 and its output is of size 3 × 2 × 2. Its filter is of size 3 × 2 × 2 × 2, which is obtained by swapping the input and output channels of the de-convolution filter. For each output channel, corresponding to c_i filters, we have c_i sparse matrices. We horizontally join these c_i sparse matrices and vertically join the c_o output channels, and thus obtain the sparse matrix C. Finally, we perform the linear operation Y = C^T X, which takes the 3 × 2 × 2 input matrix flattened as a 12-dimensional vector and produces an 18-dimensional vector; the latter is reshaped into the 2 × 3 × 3 output matrix.

Secure de-convolution with large inputs. The lattice-based PAHE scheme can be used to evaluate matrix-vector products over ciphertexts, for example with the libraries provided by Gazelle [20]. In the PAHE scheme, the input is packed into a single n-slot ciphertext, which allows better utilization of SIMD and control of the noise growth. However, the packing methods require that the input size n_i and the output size n_o be smaller than n. Let us consider the noise in the ciphertext. In the lattice-based PAHE scheme, the noise is bounded by the coefficients of the sampled error polynomials, the plaintext size and the number of operations (additions and multiplications); we refer readers to [1] for details. In our framework, the size of a plaintext is a single machine word (64 bits), and the coefficients of the sampled error polynomials are pre-defined constants. The number of homomorphic operations is determined by the size of the sparse matrix C, which is n_o × n_i. The matrix-vector product can be correctly decrypted if η < q/(2p), where η is the overall worst-case noise growth. We have η = n_i · p · √n · η_c, where η_c is the noise growth of the element-wise multiplication between C and X. Hence, the maximum input size n_i should satisfy n_i < q/(2 · p^2 · √n · η_c). In practice, when n_i is larger than n, a straightforward way is to divide the input into small blocks and pack each of them [20, 23]. Figure 10(a) shows the original input X and C split into n × n blocks that are packed independently. However, homomorphic additions over these blocks are required to obtain the final matrix-vector product. When n_i becomes particularly large, too many blocks are involved in the homomorphic additions; consequently, the noise introduced by these operations may cause decryption to fail. Since we have two semi-honest servers (S_1 and S_2) in our framework, the server S_1 keeps the private key sk and decrypts the encrypted intermediate results before performing the non-linear layer computations based on 2-PC secure computation, while S_2 performs the homomorphic evaluation of the linear layers. A successful decryption clears the noise introduced by the homomorphic operations. Therefore, we combine the PAHE and 2-PC plaintext addition techniques. The idea is to divide the input (X and C) into blocks and perform the homomorphic matrix-vector product on each block separately on S_2. After homomorphically adding a uniform random vector r to each intermediate ciphertext, S_2 sends them to S_1, which performs the decryption. After the additions in plaintext, S_1 encrypts the result with sk and sends it back to S_2. S_2 obtains the matrix-vector product result by homomorphically subtracting r. In this way, the scheme can accept a large number of inputs and correctly perform the homomorphic matrix-vector product (homomorphic additions and multiplications) with low memory consumption (see Section 7). Figure 10(b) takes one group of column-block summations as an example. S_2 obtains the ciphertext [y_ij] by performing a homomorphic matrix-vector product on each block [C_ij] and [X_j], where i = 1, ..., n_o/n and j = 1, ..., n_i/n. Then, S_2 computes [y_ij + r_j] = [y_ij] + r_j by homomorphically adding a uniform random vector r_j to [y_ij], and sends the results to S_1. The latter decrypts them with the private key sk to obtain the plaintexts y_ij + r_j. The summation of the plaintexts, Σ_{j=1}^{n_i/n} (y_ij + r_j), is then encrypted again by S_1, denoted [y_i + r], and sent back to S_2. Finally, S_2 removes the random vector r by homomorphically subtracting it, obtaining [y_i].
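The arithmetic behind this blocked, masked summation can be sketched in plain Python. All ciphertext operations are elided here (every value is handled in the clear), so the snippet only demonstrates the blocking and the mask-then-unmask bookkeeping over Z_p; the modulus, block size and matrix sizes are illustrative assumptions.

```python
# Plaintext sketch of the blocked matrix-vector product with additive masking:
# S2's per-block results are masked with fresh random vectors r_j, summed by S1
# in the clear, and unmasked by S2, so homomorphic noise never accumulates
# across blocks.  Real PAHE SIMDAdd/SIMDScMult operations are elided.
import numpy as np

p = 2**22              # plaintext modulus (illustrative)
n = 4                  # slot count per "ciphertext" block (illustrative)
rng = np.random.default_rng(1)

n_o, n_i = 8, 12                          # C is n_o x n_i, X has length n_i
C = rng.integers(0, p, size=(n_o, n_i))
X = rng.integers(0, p, size=n_i)

def blocked_masked_matvec(C, X):
    y = np.zeros(n_o, dtype=np.int64)
    for i in range(0, n_o, n):                       # row-block index
        masked_parts, masks = [], []
        for j in range(0, n_i, n):                   # column-block index
            # S2: per-block product, then add a fresh mask r_j ("encrypted" result)
            y_ij = C[i:i+n, j:j+n] @ X[j:j+n] % p
            r_j = rng.integers(0, p, size=n)
            masked_parts.append((y_ij + r_j) % p)    # sent to S1
            masks.append(r_j)
        # S1: decrypts the masked parts, adds them in plaintext, re-encrypts
        s = np.sum(masked_parts, axis=0) % p
        # S2: homomorphically subtracts the accumulated masks to recover y_i
        y[i:i+n] = (s - np.sum(masks, axis=0)) % p
    return y

assert np.array_equal(blocked_masked_matvec(C, X), (C @ X) % p)
```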
The implementation of SecMinimizer includes three components. -Initialization: it returns a pair of d-dimensional vector sets (Z_{0a}, Z_{0b}), each of which has R vectors. S_1 (respectively, S_2) generates the d-dimensional vector set Z_{0a} (respectively, Z_{0b}) from a Gaussian distribution. Then, S_1 encrypts Z_{0a} and sends the ciphertext [Z_{0a}] to S_2. The latter obtains the initial vector set Z_0 in encrypted form by performing a homomorphic addition, i.e., [Z_0] = [Z_{0a}] + Z_{0b}. -Secure gradient descent: given the secret shares X_a and X_b of the input X, it returns the secret shares Z*_a and Z*_b of the optimal vector set Z*, where Z*_a is held by S_1 and Z*_b is kept by S_2. We use the mean squared error (MSE) L(X, Z) = ||G(Z) − X||_2^2 as the loss function, where G(Z) = G(Z)_a + G(Z)_b is the output of the WGAN obtained by invoking the function SecGan, and we use L iterations of gradient descent (GD) to find the optimal vector set Z*. -(X_a, X_b) ← Reconstruction(Z*_a, Z*_b): given the optimal vector set share Z*_a held by S_1 and Z*_b held by S_2, S_1 and S_2 cooperatively invoke argmin and the function SecGan to obtain the final reconstructed input X, which is split into two shares X_a held by S_1 and X_b held by S_2. The overall view of the secure minimizer is illustrated in Algorithm 1. It includes secure gradient computation and secure gradient update based on garbled circuits. For the sake of simplicity, we assume R = 1.

Given the loss function L(X, z) = ||G(z) − X||_2^2, a straightforward computation is to use numerical differentiation to approximate the true value of the gradient by choosing a small h, e.g., the symmetric difference quotient. Hence, the finite-difference approximation of the gradient for the vector z is ∇_z L(z, X)_i ≈ (L(z + h · e_i, X) − L(z − h · e_i, X)) / (2h), where e_i is the i-th unit vector, z is a vector of size 128, and X is a given input of size i_h × i_w. Note that S_1 and S_2 keep the additive shares G(z)_a, G(z)_b of the SecGan outputs and X_a, X_b of the inputs, respectively; hence, we have L(z, X) = ||(G(z)_a − X_a) + (G(z)_b − X_b)||_2^2. The gradient descent update is computed recursively as z^(t+1) = z^(t) + η · ∇_z L(z, X)|_{z=z^(t)} (8). Let us take the first iteration as an example. Recall that S_1 keeps the additive share z^0_a and ∇_z L(z, X)_a|_{z=z^0}, and S_2 keeps the other share z^0_b and ∇_z L(z, X)_b|_{z=z^0}; we have z^(1) = (z^0_a + η · ∇_z L(z, X)_a|_{z=z^0}) + (z^0_b + η · ∇_z L(z, X)_b|_{z=z^0}). Accordingly, we let S_1 and S_2 independently compute x = G(z)_a − X_a and y = G(z)_b − X_b. Then, they compute the loss function L(z, X) and further the gradient ∇_z L(z, X) using 2-PC secure computation based on garbled circuits. The inputs of the circuits are x from S_1, and y together with a randomly generated value r from S_2. The outputs are r, denoted as ∇_z L(z, X)_b, to S_2 and ∇_z L(z, X) + r, denoted as ∇_z L(z, X)_a, to S_1. According to (8), S_1 and S_2 can then independently compute z^1_a = z^0_a + η · ∇_z L(z, X)_a|_{z=z^0} and z^1_b = z^0_b + η · ∇_z L(z, X)_b|_{z=z^0}. Similarly, the gradient update in every iteration can be performed using garbled circuits, which can be implemented using addition, multiplication and division gates.
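For concreteness, the sketch below shows the symmetric difference quotient gradient in the clear, with a toy linear generator standing in for the WGAN (an illustrative assumption only). In the protocol, each loss evaluation corresponds to a SecGan invocation on shared data, so a 128-dimensional z already costs 256 such invocations per gradient.

```python
# Finite-difference (symmetric difference quotient) gradient of L(z, x) = ||G(z) - x||^2.
# Each coordinate of z requires two generator evaluations: 2 * dim(z) calls in total.
import numpy as np

def loss(G, z, x):
    return np.sum((G(z) - x) ** 2)

def numeric_grad(G, z, x, h=1e-4):
    grad = np.zeros_like(z)
    for i in range(z.size):                   # 2 * dim(z) generator evaluations
        e = np.zeros_like(z)
        e[i] = h
        grad[i] = (loss(G, z + e, x) - loss(G, z - e, x)) / (2 * h)
    return grad

# Toy check against the analytic gradient of a linear generator G(z) = A z.
rng = np.random.default_rng(2)
A = rng.normal(size=(784, 128))
G = lambda z: A @ z
z, x = rng.normal(size=128), rng.normal(size=784)
assert np.allclose(numeric_grad(G, z, x), 2 * A.T @ (G(z) - x), rtol=1e-3, atol=1e-3)
```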
The numerical differentiation requires repeatedly invoking the function SecGan to get the outputs. For example, when z is a 128-dimensional vector, it needs to call the function SecGan 256 times to compute ∇_z L(z, X)|_{z=z^(i)}. Obviously, its efficiency is very low. Moreover, its accuracy depends on h and on the linearity of the function SecGan, which may make the results unacceptable. We can instead find the derivative of the composite function by the chain rule: the derivative of the composition equals the derivative of the outer function with respect to the inner function, times the derivative of the inner function. The number of functions that make up the composition determines how many differentiation steps are necessary. Considering the loss function again, we have ∇_z L(z, X) = (∂L(z, X)/∂G(z)) · (∂G(z)/∂z), where L(z, X) is a scalar and G(z) can be treated as a matrix. Hence, ∂L(z, X)/∂G(z) = Mean(2 × (G(z) − X)) has the same shape as G(z). The term ∂G(z)/∂z can itself be computed by the chain rule. Taking the WGAN network of Fig. 5 as an example, the generator is a composition of the layer functions, G(z) = o_σ(o_d(· · · o_f(z) · · ·)), and we give the mathematical derivation in (10), where '·' denotes the matrix-vector product and '∘' denotes the component-wise product between vectors. If we can compute the gradient of each layer independently, we can convert (10) into these two types of operations. Consequently, we can utilize the PAHE scheme instead of 2-PC secure computation to securely calculate the gradient of the WGAN, which makes our implementation simpler and more efficient. Let us take a closer look at each layer. -Derivative of the FC and DeConv layers. The FC layer performs a matrix-vector product; hence, given the input x, we have ∂o_f(x)/∂x = W, where W is the weight matrix used by the FC layer. Recall that we can transform the de-convolution into a matrix-vector product as well; similarly, given the input x, we have ∂o_d(x)/∂x = C^T, where C^T is the sparse matrix derived in Section 5. -Derivative of the Sigmoid layer. The Sigmoid is applied component-wise; hence, ∂o_σ(x)/∂x = o_σ(x) ∘ (1 − o_σ(x)), which is a vector with the same shape as the input x. Based on the gradient calculations above, (10) can be transformed into (11), where G' denotes ∂o_σ. Now, we can use the chain rule to calculate the gradient of the WGAN by invoking the function SecGan only once. Fortunately, the operations involved in (11) are matrix-vector products and component-wise vector products. Therefore, we can use the PAHE scheme combined with the 2-PC plaintext addition to implement the secure gradient computation, in the same way as the secure de-convolution computation technique proposed in Section 5. Considering the last layer of the example in Fig. 5, the result is the encrypted gradient [∇_z L(z, x)], of size 128 × 1. In Section 7, we will further show the detailed information and performance of each layer.
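The following NumPy sketch illustrates this chain-rule computation in the clear on a tiny FC → ReLU → Deconv-as-matrix → Sigmoid generator and checks it against a finite-difference gradient. The layer sizes and weights are illustrative assumptions; in Sec-Defense-Gan the matrix-vector and component-wise products below are the operations carried out with the PAHE scheme plus 2-PC plaintext addition.

```python
import numpy as np

rng = np.random.default_rng(3)
W_fc = rng.normal(size=(64, 16)) * 0.1      # FC weights (illustrative)
C_T  = rng.normal(size=(100, 64)) * 0.1     # Deconv as C^T (matrix-vector product)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def forward(z):
    a1 = W_fc @ z                 # FC
    h1 = np.maximum(a1, 0.0)      # ReLU
    a2 = C_T @ h1                 # Deconv as matrix-vector product
    return a1, h1, a2, sigmoid(a2)

def grad_chain(z, x):
    a1, h1, a2, g = forward(z)
    dg  = 2.0 * (g - x)                      # dL/dG(z) for L = ||G(z) - x||^2
    da2 = dg * g * (1.0 - g)                 # through Sigmoid (component-wise)
    dh1 = C_T.T @ da2                        # through Deconv (matrix-vector)
    da1 = dh1 * (a1 > 0)                     # through ReLU (component-wise)
    return W_fc.T @ da1                      # through FC (matrix-vector)

def grad_numeric(z, x, h=1e-5):
    L = lambda zz: np.sum((forward(zz)[3] - x) ** 2)
    return np.array([(L(z + h * e) - L(z - h * e)) / (2 * h)
                     for e in np.eye(z.size)])

z, x = rng.normal(size=16), rng.uniform(size=100)
assert np.allclose(grad_chain(z, x), grad_numeric(z, x), rtol=1e-4, atol=1e-6)
```

Note that, unlike the numerical approach, a single forward pass (one SecGan invocation) provides everything the backward pass needs.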
Dataset Mnist. We select the Mnist dataset of handwritten digits to measure the feasibility of our framework. The MNIST dataset has a training set of 60,000 examples and a test set of 10,000 examples. The training set is made up of digits written by 250 different people, and the value of each label is an integer between 0 and 9. Each image consists of 28 × 28 pixels, and each pixel is represented by a gray value. We randomly selected 5,000 images for our experiments. They are grouped by content into 10 categories, each of which contains 500 images and corresponds to one of the integers 0 to 9. In the experiments, we selected distinct collections of images containing 500, 1,000, 1,500, ..., and 5,000 distinct images, respectively. Dataset Coral. The publicly available Coral database has been used for the CBIR task. The database consists of 1,000 test images which are divided into 10 different classes. Each class consists of 100 images. The classes are created based on the different objects present in the images: African people, elephant, beach, rose, building, horse, bus, mountain, dinosaur, and food dish. The attacking model. We adopt the Fast Gradient Sign Method (FGSM) under the black-box attack model to evaluate the effectiveness of our framework. Given an image x and its corresponding true label y, FGSM sets the perturbation δ to δ = ε · sign(∇_x J(x, y)), where J is the loss of the model under attack (in the black-box setting, of the substitute model) and ε controls the perturbation magnitude. It has been shown that FGSM under the black-box attack [15] is extremely fast rather than optimal: it simply uses the sign of the gradient at every pixel to determine the direction in which to change the corresponding pixel value. The FGSM black-box attack assumes that adversaries have access neither to the neural network parameters nor to a large training dataset. The adversary trains a substitute model using a very small dataset augmented by synthetic images. Adversarial samples are then found by applying the FGSM method to the substitute model.
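A minimal sketch of this perturbation is given below, assuming the gradient of the substitute model's loss has already been computed (an autodiff framework would normally supply it); the ε value and array shapes are illustrative.

```python
import numpy as np

def fgsm_perturb(x, grad_x, eps=0.3):
    """x: input image in [0, 1]; grad_x: gradient of the substitute model's loss
    J(x, y) w.r.t. x; eps: attack budget.  Returns the adversarial example."""
    delta = eps * np.sign(grad_x)            # delta = eps * sign(grad_x J(x, y))
    return np.clip(x + delta, 0.0, 1.0)      # keep pixel values in a valid range

# Illustrative usage with a random "gradient" standing in for the substitute model's.
rng = np.random.default_rng(4)
x = rng.uniform(size=(28, 28))
x_adv = fgsm_perturb(x, rng.normal(size=(28, 28)), eps=0.3)
```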
Evaluation metrics. We evaluate the performance of the implementations of the functions SecGan and SecMinimizer, including the time consumption (termed time cost), the memory consumption (termed memory cost), and the overall communication overhead between the cloud servers S_1 and S_2 (termed communication cost). The communication cost refers to the size, in bytes, of the intermediate data exchanged between the servers. The functions include online and offline phases. The online phase begins with the cloud servers S_1 and S_2 sharing their inputs and generating the random vector z; then, S_1 and S_2 execute the input reconstruction task by cooperatively running L steps of GD. The offline phase includes the pre-processing of the inputs and the circuit garbling. The classification accuracy (termed accuracy) with adversarial samples is also measured to show the correctness of our implementation. The communication cost is measured with the tool NetHogs 0.8. Please note that we only measured the time consumed by each server; the communication delay is ignored. Implementation details. Our framework needs a packed additively homomorphic encryption (PAHE) scheme and a secure two-party computation (2-PC) scheme. Parameters for both schemes are selected for a 128-bit security level. The PAHE scheme was implemented using Gazelle, where the plaintext modulus p is set to 22 bits and the ciphertext modulus q is chosen to be a 60-bit pseudo-Mersenne prime. The 2-PC scheme was implemented using Obliv-C. The circuit garbling phase is counted in the offline phase because it is independent of the users' inputs. Secure de-convolution benchmarks. We first benchmark the impact of the direct convolution on the secure de-convolution performance. Because we use the channel packing technique, the input is packed and encrypted into c_i/c_n ciphertexts and the output contains c_o/c_n ciphertexts, where c_n represents the number of channels that fit in a single ciphertext. Hence, we need c_o × c_i × k_h × k_w × i_h × i_w SIMDAdd and SIMDScMult calls. We measured the memory cost, time cost and communication cost under different numbers of input channels and output channels, which are presented in Figs. 11 and 12, respectively. In our benchmark, for each channel, the input size is set to 4 × 4, the output size is set to 8 × 8, the filter size is set to 5 × 5 and the stride is set to 2; thus, we have c_n = 32. Secure matrix-vector product benchmarks. We further benchmark the impact of the matrix-vector product on the secure de-convolution performance. The de-convolution operation is transformed into a matrix-vector product by deriving the sparse matrix C of size n_o × n_i = (c_o · o_w · o_h) × (c_i · i_w · i_h). Hence, we need n_i SIMDScMult and n_i − 1 SIMDAdd calls; the number of homomorphic operations clearly drops compared with the direct convolution. We used the same input parameters and measured the memory cost, time cost and communication cost under different numbers of input channels and output channels, which are presented in Figs. 13 and 14, respectively. Because we divide the input into blocks to perform the homomorphic matrix-vector product separately when the number of input channels is larger than 128, the communication cost increases in that regime. We next report runtimes and network bandwidth for the secure evaluation of the generator in the WGAN network, comparing the two de-convolution methods composed into the evaluation; the results are shown in Table 1. We also benchmark the two methods used during the secure GD computation in SecMinimizer. Table 2 shows the time cost and the communication cost of the two methods for one gradient update. Because the numerical differentiation method needs to call the generator in the WGAN network repeatedly, its time cost is very heavy; in stark contrast, the chain rule method is 400× faster. The online time of one gradient update in the WGAN network is about 1.6 seconds. We then compose the SecGan and SecMinimizer from the previous sections and evaluate the complete WGAN network over the Mnist dataset, setting L = 1 and R = 1. Table 3 compares the runtimes and bandwidth of our method with those of Samangouei's method [27] in plaintext. Since our work focuses on the accuracy difference between our method and Samangouei's rather than on absolute accuracy, the selection of the NN network is not important. We refer to the model structure of Samangouei [27] and choose model A (a four-layer convolutional neural network with two Conv layers and two FC layers) as our classifier and model B (a four-layer convolutional neural network with three Conv layers and one FC layer) as our substitute network. When the original images are used to train the classifier and Defense-GAN is used, we refer to it as Defense-GAN-Ori. Under Defense-GAN-Ori, we compare Sec-Defense-Gan with the Defense-GAN proposed by Samangouei [27] under FGSM black-box attacks as well as under no attack. Figure 15 shows the classification accuracy of the classifiers using Sec-Defense-Gan with various numbers of iterations L (R = 10, ε = 0.3) on the Mnist dataset. We use a simple parameter discretization method in the experiments, i.e., we represent real numbers with a fixed precision of n decimal digits by directly scaling them up by 10^n and converting them into 64-bit integers. Obviously, a larger n leads to better accuracy but may cause overflow when performing homomorphic additions. Therefore, there is a certain loss of accuracy in the experiments, even though the results show that our solution performs consistently well under the FGSM black-box attack.
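As a small illustration of this discretization, the following sketch encodes real-valued parameters as 64-bit integers with n decimal digits of precision and decodes them back; n = 4 and the sample values are illustrative assumptions.

```python
# Fixed-point discretization: scale real values by 10^n, round to 64-bit integers
# before encryption; larger n preserves more precision but risks overflow under
# repeated homomorphic additions.
import numpy as np

def encode(x, n=4):
    return np.round(x * 10**n).astype(np.int64)

def decode(x_int, n=4):
    return x_int.astype(np.float64) / 10**n

w = np.array([0.12345678, -0.9876])
assert np.allclose(decode(encode(w)), w, atol=0.5 * 10**-4)
```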
In our experiments, we use the Mnist and Coral datasets. The images of the different classes are divided into training and testing sets; 80% of the images are used for training the classifier model. In the experiment, a two-layer CNN model is used. For the first layer of the CNN model, 32 feature maps of size 3 × 3 are convolved with the input image to derive the features; after the convolution, 2 × 2 max pooling is used to reduce the feature size. For the second layer, 16 feature maps of size 3 × 3 are convolved with the output of the first layer to derive the features; after the convolution, 2 × 2 max pooling is again used to reduce the feature size. After the max pooling, flattening converts the feature maps into a single column vector, which is passed to the neural network. The performance of the CBIR can be measured by the precision rate, defined as the ratio of the total number of relevant images retrieved to the total number of images retrieved: Precision = (Number of relevant images retrieved) / (Total number of images retrieved) (13). Tables 4 and 5 show the classification precision of the CBIR on the Mnist and Coral datasets. Finally, Fig. 16 shows content-based image retrieval results on the Mnist dataset, where the most relevant images have been successfully retrieved at the top ranks. In the figure, the first column is the provided query image; the remaining columns are the retrieved images sorted according to their similarity (the Euclidean distance, see [23] for details) to the query image.

In this paper, we design two secure de-convolution algorithms and two secure gradient computation and update protocols to accomplish the image reconstruction task. With these building blocks, we propose and implement Sec-Defense-Gan and apply it to a secure CBIR system. Sec-Defense-Gan uses a judicious combination of a PAHE scheme and garbled-circuit-based two-party computation, both of which satisfy IND-CPA security, i.e., ciphertexts of any two messages are computationally indistinguishable. Sec-Defense-Gan guarantees the confidentiality of the input/output image data and the GAN model details while providing the ability to defend against adversarial image examples. The application of Sec-Defense-Gan to CBIR over these datasets shows that it can successfully retrieve similar images from the datasets without leaking the input image data, intermediate results or retrieved image data to the cloud. There are several ways in which Sec-Defense-Gan can be improved. First, although the PAHE scheme supports SIMD operations in the ciphertext domain, the time cost is still heavy because the Deconv and FC layers require a number of homomorphic operations quadratic in the input size. A straightforward improvement is to exploit the sparsity of the weight matrix in the Deconv or FC layer: the weight matrix can be 'compressed' before it is packed and encrypted. Another important direction is to focus on specific adversarial image types rather than general examples, which may reduce the number L of gradient descent (GD) iterations needed to find the optimal vector set Z*. More importantly, our work did not investigate theoretical bounds on the information leakage of each function in our framework, which remains a topic for future research. The authors declare no competing interests.
References
Functional encryption for inner product predicates from learning with errors
2D object recognition: a comparative analysis of SIFT, SURF and ORB feature descriptors
An efficient technique for object recognition using Shi-Tomasi corner detection algorithm
Transfer learning for image classification using VGG19: Caltech-101 image data set
Efficient fully homomorphic encryption from (standard) LWE
Fully homomorphic encryption from ring-LWE and security for key dependent messages
TFHE: fast fully homomorphic encryption over the torus
Faster fully homomorphic encryption: bootstrapping in less than 0.1 seconds
CryptoNets: applying neural networks to encrypted data with high throughput and accuracy
A public key cryptosystem and a signature scheme based on discrete logarithms
Privacy-preserving content-based image retrieval in the cloud
The knowledge complexity of interactive proof systems
Explaining and harnessing adversarial examples
Improved training of Wasserstein GANs
SoK: general purpose compilers for secure multi-party computation
Image feature extraction in encrypted domain with privacy-preserving SIFT
Securing SIFT: privacy-preserving outsourcing computation of feature extractions over encrypted image data
GAZELLE: a low latency framework for secure neural network inference
Face detection in still images under occlusion and non-uniform illumination
Privacy-preserving outsourcing of image feature extraction in cloud computing
Intelligent and secure content-based image retrieval for mobile users
A unified probabilistic framework for robust manifold learning and embedding
Comprehensive privacy analysis of deep learning: passive and active white-box inference attacks against centralized and federated learning
Public-key cryptosystems based on composite degree residuosity classes
Defense-GAN: protecting classifiers against adversarial attacks using generative models. arXiv:1805.06605
Face mask detection using YOLOv3 and Faster R-CNN models: COVID-19 environment
Intriguing properties of neural networks
AT-GAN: an adversarial generator model for non-constrained adversarial examples
Beyond inferring class representatives: user-level privacy leakage from federated learning
Obliv-C: a language for extensible data-oblivious computation
An efficient image homomorphic encryption scheme with small ciphertext expansion
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.