title: FDFtNet: Facing Off Fake Images Using Fake Detection Fine-Tuning Network authors: Jeon, Hyeonseong; Bang, Youngoh; Woo, Simon S. date: 2020-08-01 journal: ICT Systems Security and Privacy Protection DOI: 10.1007/978-3-030-58201-2_28

Abstract. Creating fake images and videos such as "Deepfake" has become much easier these days due to the advancement in Generative Adversarial Networks (GANs). Moreover, recent research such as few-shot learning can create highly realistic personalized fake images with only a few images. Therefore, the threat of Deepfake being used for a variety of malicious purposes, such as propagating fake images and videos, has become prevalent, and detecting these machine-generated fake images is more challenging than ever. In this work, we propose a light-weight, robust fine-tuning neural network-based classifier architecture called Fake Detection Fine-tuning Network (FDFtNet), which is capable of detecting many of the new fake face image generation models and can easily be combined with existing image classification networks and fine-tuned on a small amount of data. In contrast to many existing methods, our approach reuses popular pre-trained models with only a few images for fine-tuning to effectively detect fake images. The core of our approach is an image-based self-attention module called the Fine-Tune Transformer, which uses only the attention module and the down-sampling layer. This module is added to the pre-trained model and fine-tuned on a small amount of data to search for new feature spaces to detect fake images. We experiment with FDFtNet on a GAN-based dataset (Progressive Growing GAN) and Deepfake-based datasets (Deepfake and Face2Face) with a small input image resolution of 64 × 64 that complicates detection. FDFtNet achieves an overall accuracy of 90.29% in detecting fake images generated from the GAN-based dataset, outperforming the state-of-the-art.

The emergence of Generative Adversarial Networks (GANs) [6], which produce high-quality images through a generator and a discriminator trained adversarially against each other, enables generated outputs that are highly realistic and sophisticated [17, 18, 34, 38]. However, such high-quality machine-generated images and videos have been abused to harm the general public (e.g., DeepFake [33]). Furthermore, a recent study applying few-shot learning [28] to GANs allows deep learning models to produce high-quality outputs with only a small amount of training data. Zakharov et al. [38] demonstrated that models capable of generating highly realistic personalized talking heads can be constructed using few-shot learning, where the training inputs provide attention to the generator in the form of compressed facial landmarks extracted through embedding layers. Leveraging this method, DeepFakes can easily be generated even with only a small amount of training data. Recently reported incidents [36] related to DeepFake [33] and DeepNude show that these technologies are an imminent threat to the public. Most previous approaches have focused on exploiting metadata or handcrafted image characteristics to detect fake images. However, these approaches fail on GAN-based fake images, because such images are created from scratch: metadata can also be forged, and handcrafted features are no longer useful for detection.
Recent models, such as ShallowNet [30] and FakeTalkerDetect [16], used neural networks to detect GAN-generated fake images. Yu et al. [37] used patterns from GAN-generated fakes to improve detection, and FaceForensics [23] presented various forgery detection techniques. However, these methods lack generalization and will thus have difficulty coping with newly developed Deepfake generation techniques.

In this paper, we propose Fake Detection Fine-tuning Network (FDFtNet), a new robust fine-tuning neural network-based architecture for fake image detection. FDFtNet combines a Fine-Tune Transformer (FTT) with a pre-trained Convolutional Neural Network (CNN) as a backbone and MobileNet block V3 (MBblockV3). Figure 1 shows an overview of our approach, in which we utilize well-known, existing CNN architectures [7, 11-13, 27, 29] for fake image detection. Our FTT is designed to extract different features from images using self-attention, while MBblockV3 extracts features using different convolution and structure techniques. MBblockV3 is added to the pre-trained backbone network after removing its classification layers. We apply data augmentation with the Cutout method to overcome the limitation of a small fine-tuning dataset and to improve performance. Our approach provides a reusable fine-tuning network that improves existing backbone CNN architectures, which were not designed to detect fake images effectively. Our main contributions are as follows:

- We propose FDFtNet, a novel neural network-based fake image detector that outperforms previous approaches, achieving 97.02% accuracy and improving baseline model accuracy by 4% to 45%.
- We provide a robust fine-tuning neural network-based classifier that requires only a small amount of data for fine-tuning and can easily be integrated with popular CNN architectures.

Traditional Image Forgery Detection. Many researchers [5, 19, 21, 35, 37] have investigated digital forensics algorithms to detect forged images. One approach is to analyze images in the frequency domain; however, it struggles with images that have refined, smooth edges, which gave rise to other methods. JPEG Ghost [5] exploits the fact that a forged region is often copied from a different real image: the normalized pixel distance of the re-compressed image differs from that of the original, revealing a difference in JPEG quality. This method fails, however, when the original image and the forged image have the same quality level. Another approach is Error Level Analysis (ELA) [19], which checks the error level of images; with GAN-generated fake images, ELA cannot distinguish the error level of real images from that of generated ones. Copy-move forgery detection [21] is a pixel-based approach. First, the dyadic wavelet transform (DWT) is applied to the input image, transforming it into a reduced-dimension representation, i.e., the LL1 sub-band. This LL1 sub-band is then divided into sub-images, and phase correlation is adopted to compute the spatial offset between the copy-move regions. The copy-move regions can then be located by pixel matching: the input image is shifted according to the offset, and the difference between the shifted version and the original image is computed. In the final step, mathematical morphological operations (MMO) are used to remove isolated points and refine the localization.
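As a concrete illustration, the following is a minimal Python sketch of the DWT and phase-correlation steps of that pipeline. It uses PyWavelets and NumPy; splitting the LL1 sub-band into two halves is a crude stand-in for the exhaustive sub-image matching described above, not the exact procedure of [21].

```python
# Minimal sketch of copy-move localization via DWT + phase correlation.
import numpy as np
import pywt

def phase_correlation(a, b):
    """Estimate the translational offset between two equally sized patches."""
    A = np.fft.fft2(a)
    B = np.fft.fft2(b)
    R = A * np.conj(B)
    R /= np.abs(R) + 1e-8                       # normalized cross-power spectrum
    corr = np.fft.ifft2(R).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    return dy, dx

def copy_move_residual(img):
    """Return a residual map whose near-zero regions suggest duplicated areas."""
    # 1. Dyadic wavelet decomposition: keep the LL1 sub-band (reduced resolution).
    ll1, _ = pywt.dwt2(img, "haar")
    # 2. Phase correlation between two halves of LL1 (stand-in for full
    #    sub-image matching) gives a candidate copy-move offset.
    h = ll1.shape[0] // 2
    dy, dx = phase_correlation(ll1[:h], ll1[h:2 * h])
    # 3. Shift the sub-band by the offset and difference it against itself;
    #    isolated points would then be removed with morphological operations.
    shifted = np.roll(np.roll(ll1, dy, axis=0), dx, axis=1)
    return np.abs(ll1 - shifted)
```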
Traditional digital forensic tools fail to detect GAN-generated images because each image is synthesized from scratch as a single whole; for this reason, these approaches are not effective.

Image Forgery Detection with Neural Networks. Various CNN-based models have been used to detect forged images. ShallowNet [30] outperformed previous architectures in distinguishing real images from PGGAN-generated ones with a shallow-layer architecture, but showed limitations when detecting other types of Deepfake images. FaceForensics++ [24] proposed a forgery detection method tailored to facial manipulations and provided an extensive evaluation in a supervised manner. In addition, the authors introduced an automatic metric that takes into account four forms of distortion in realistic scenarios (i.e., random encoding and random dimensions), and they used these benchmarks to analyze various forgery detection pipelines. However, transfer learning and fine-tuning capabilities were not explored. Recent research by Yu et al. [37] proposed learning metadata, referred to as GAN fingerprints, to effectively detect GAN-generated images. In contrast, our method covers Deepfake datasets as well as GAN-generated images, without using metadata.

To capture long-range dependencies in image data, a CNN must stack deeper layers and spend more computation, because a single convolution sees only its kernel-sized neighborhood. Self-attention resolves this long-range dependency issue: softmax outputs computed over the entire sequence provide attention to the CNN. Zhang et al. [39] used self-attention modules to generate images with GANs. Our FTT differs in that we build Transformer-like modules from self-attention alone for feature extraction in a classification task; we apply FTT as an image feature extractor, not as a generator. This approach is similar to the Multi-Head Attention module [32] with its Query, Key, and Value, but FTT adapts it to images by computing these projections with 1 × 1 convolutions.

CelebA. The CelebFaces Attributes dataset (CelebA) [20] is a large-scale face attributes dataset with more than 200,000 celebrity images. It is widely used for benchmarking and for generating training and test datasets for various GAN and VAE approaches. We use CelebA as the input for generating PGGAN [17] fake images.

PGGAN. For GAN-generated images, we used the Progressive Growing GAN dataset (PGGAN) [17], consisting of 100,000 GAN-generated fake celebrity images at 1024 × 1024 resolution, produced from the CelebA dataset. The key idea of PGGAN is to grow both the generator and the discriminator progressively: training starts with both at low resolution, and new layers are added as training advances, increasing the resolution of the generated images.

Deepfakes. Deepfakes [33] was the first publicly available method that anyone can download and use to produce fake images and videos. The code is based on two autoencoders with a shared encoder: the trained encoder and the decoder of the source identity are applied to the target face to produce a forged image, and the autoencoder output is then blended with the target image. For our experiments, we used the dataset provided by Google/Jigsaw (Fig. 2).
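To clarify the shared-encoder scheme, here is a minimal Keras sketch of the two-autoencoder design. The layer sizes and model names are illustrative assumptions, not the actual faceswap implementation.

```python
# Two autoencoders with a shared encoder: each identity has its own decoder.
from tensorflow.keras import layers, Model

def make_encoder():
    inp = layers.Input((64, 64, 3))
    x = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(inp)
    x = layers.Conv2D(128, 5, strides=2, padding="same", activation="relu")(x)
    z = layers.Dense(256)(layers.Flatten()(x))
    return Model(inp, z, name="shared_encoder")

def make_decoder(name):
    z = layers.Input((256,))
    x = layers.Dense(16 * 16 * 128, activation="relu")(z)
    x = layers.Reshape((16, 16, 128))(x)
    x = layers.Conv2DTranspose(64, 5, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="sigmoid")(x)
    return Model(z, out, name=name)

encoder = make_encoder()
decoder_src = make_decoder("decoder_source")
decoder_tgt = make_decoder("decoder_target")

# Training: each autoencoder reconstructs its own identity through the shared encoder.
# Face swap at inference: encode a target face, decode it with the source decoder.
#   fake = decoder_src(encoder(target_face))
# The swapped face is then blended back into the target frame (omitted here).
```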
Fig. 2. Illustration of our datasets. CelebA [20] images are used as inputs for PGGAN [17] fake image generation; images from the FaceForensics [23] dataset are cropped and used as inputs for Deepfakes [33] and Face2Face [31] fake image generation.

FaceForensics. FaceForensics [23] is a video dataset comprising more than 500,000 frames of faces from 1,004 videos, intended for studying image and video forgeries. An automated version of the Face2Face [31] approach is used to create the videos: the goal is to animate the facial expressions of the target video with a source actor and re-render the manipulated output in a photo-realistic fashion. Face2Face re-renders the synthesized target face on top of the corresponding video stream so that it blends seamlessly with the real-world illumination. Since our goal is to detect fake images, we use each frame of the generated output.

We used the following CNN architectures as our backbone networks, as shown in Fig. 1, which also serve as our baselines: SqueezeNet, ShallowNetV3, ResNetV2, and Xception. Each network is pre-trained on each dataset (i.e., PGGAN, Deepfakes, and Face2Face).

SqueezeNet. SqueezeNet [14] achieves AlexNet-level accuracy with far fewer parameters, but would generally perform poorly on fake detection, for which it was not designed. We chose SqueezeNet as a baseline because FDFtNet can provide a large improvement on it. In addition, since this approach has not been tested on deepfakes other than those generated by PGGAN, we aim to investigate its performance.

ResNetV2. ResNetV2 has been widely adopted in many image classification tasks. We chose ResNetV2 [8] as one of the baselines because it has the opposite characteristic to ShallowNetV3 in terms of model depth: ResNetV2 has 50 layers, while ShallowNetV3 has only 8. We expect these two architectures to show complementary results, allowing us to observe the effect of our approach on both deep and shallow CNNs.

Xception. Xception [2] has served as the baseline for fake image detection in [24, 30]. On FaceForensics++, Xception showed the highest accuracy, i.e., 96.36% on Deepfake and 86.86% on Face2Face, justifying our choice of it as a baseline. Xception has no FC layers but extracts diverse image feature spaces thanks to depthwise separable convolutions, in contrast to the burdensome FC layers of ShallowNetV3.

Fine-Tune Transformer. We cut the classification layers from a pre-trained model and add our FTT and MBblockV3 modules. The Fine-Tune Transformer (FTT) consists of several self-attention modules, as shown in Fig. 3, where each attention module computes f(x), g(x), and h(x) with 1 × 1 convolution filters. We iterate the module M times over the image inputs; M is a hyper-parameter, and we empirically determined that M = 3 yields the highest performance. As shown in Fig. 3, the input x from the previous layer (or the input image) is projected into three feature spaces f(x), g(x), and h(x). As shown in Eq. 1, all three are obtained through 1 × 1 convolutions, where W_f, W_g, and W_h are the respective filter weights. f(x) and g(x) have C/b channels, where C is the number of channels and b is a bottleneck ratio parameter; we choose b = 8 as suggested by Zhang et al. [39]. We use dot-product attention followed by a softmax to produce the attention map β in Fig. 3, which relates the i-th and j-th locations. After obtaining the attention map, we apply a batch-dot operation to multiply β_{j,i} with h(x), as shown in Eq. 2, producing the output o_j. Finally, o_j is multiplied by γ and added to the input x_i, giving the self-attention feature map y_i, as shown in Eq. 3. Here γ is a learnable parameter initialized to 0, which is favorable since the softmax then provides attention equally to all feature spaces at the early stage of learning.
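Written out explicitly, Eqs. 1-3 follow the self-attention formulation of Zhang et al. [39] on which FTT is based; N denotes the number of spatial locations:

```latex
\begin{aligned}
f(x) &= W_f\,x, \qquad g(x) = W_g\,x, \qquad h(x) = W_h\,x &&\text{(Eq. 1)}\\
\beta_{j,i} &= \frac{\exp(s_{ij})}{\sum_{i=1}^{N}\exp(s_{ij})},
  \qquad s_{ij} = f(x_i)^{\top} g(x_j) &&\text{(attention map)}\\
o_j &= \sum_{i=1}^{N} \beta_{j,i}\, h(x_i) &&\text{(Eq. 2)}\\
y_i &= \gamma\, o_i + x_i &&\text{(Eq. 3)}
\end{aligned}
```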
Next, in our FTT, we apply the self-attention module three times (M = 3) to an input of size 64 × 64 × 3, as shown in Table 2. The first layer is a 3 × 3 separable convolution with 32 filters and stride 2, followed by Batch Normalization (BN) [15] and ReLU. The output feature maps of the self-attention modules have 32, 64, and 128 channels, respectively; the width (the number of channels) is doubled whenever the resolution is down-sampled, as shown in Table 2. After that, self-attention is performed three times (M = 3), each followed by a SeparableConv 3 × 3, BN, and ReLU. The main reason we apply self-attention modules in FTT is to overcome the limitation of CNNs in capturing long-range dependencies, which stems from stacking many small convolution filters. A single pass through the FTT is enough to achieve these long-range dependencies without constructing deep CNN layers, and applying the self-attention module three times allows us to explore and learn diverse deep features of the input images via fine-tuning.

MBblockV3. We chose MobileNet block V3 (MBblockV3) to explore the image feature space through its inverted residual structure and linear bottleneck [25]. Depthwise separable convolutions, as in Xception and MobileNetV1 [11], are also included in MBblockV3. Overall, MobileNet is an architecture whose efficiency is well proven: it uses a small number of parameters and drastically increases computational efficiency. We chose MBblockV3 because it is well suited to efficiently extracting features on top of the pre-trained feature space. FTT and MBblockV3 are applied M and N times, respectively, and each is added before the final classification layer; MBblockV3 is repeated N times after the pre-trained model. In our experiments, we use N = 4, determined empirically to yield the best fine-tuning performance. In particular, we use the modified h-swish [10] and ReLU6 as activation functions. This nonlinearity [4, 9, 22] significantly improves the performance of neural networks and is defined as h-swish(x) = x · ReLU6(x + 3)/6, where ReLU6(x) = min(max(0, x), 6). Since clipping input values at the bottom layers can have the side effect of distorting the data distribution [26], we apply these activation functions at the top layers to reduce distortion and extract signals different from those of ReLU. Next, the Squeeze-and-Excitation blocks (SE blocks) of Squeeze-and-Excitation Networks [12] are applied in the bottleneck layer. Global information across the image resolution is embedded in the squeeze stage; in the excitation stage, the aggregated information captures channel dependencies and is re-calibrated through a gated computation (element-wise multiplication), similar to an attention mechanism. Details of the SE block parameters are summarized in Table 3.
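To make the module structure concrete, here is a minimal Keras sketch of one self-attention block and the three-stage FTT stack. The filter widths follow the Table 2 description above, while details such as padding and the softmax axis are assumptions.

```python
# Sketch of an FTT self-attention block (SAGAN-style) and the M = 3 stack.
import tensorflow as tf
from tensorflow.keras import layers

class SelfAttention(layers.Layer):
    """Dot-product self-attention over spatial positions, gated by gamma (init 0)."""
    def __init__(self, channels, b=8, **kwargs):
        super().__init__(**kwargs)
        self.c, self.cb = channels, channels // b
        self.f = layers.Conv2D(self.cb, 1)   # query projection (Eq. 1, W_f)
        self.g = layers.Conv2D(self.cb, 1)   # key projection   (Eq. 1, W_g)
        self.h = layers.Conv2D(self.c, 1)    # value projection (Eq. 1, W_h)
        self.gamma = self.add_weight(name="gamma", shape=(), initializer="zeros")

    def call(self, x):
        s = tf.shape(x)
        hw = s[1] * s[2]                     # number of spatial locations N
        q = tf.reshape(self.f(x), (-1, hw, self.cb))
        k = tf.reshape(self.g(x), (-1, hw, self.cb))
        v = tf.reshape(self.h(x), (-1, hw, self.c))
        beta = tf.nn.softmax(tf.matmul(q, k, transpose_b=True), axis=-1)  # attention map
        o = tf.reshape(tf.matmul(beta, v), (-1, s[1], s[2], self.c))      # Eq. 2
        return self.gamma * o + x                                         # Eq. 3

def fine_tune_transformer(x):
    """Three down-sampling stages, each followed by self-attention (M = 3)."""
    for width in (32, 64, 128):              # width doubles as resolution halves
        x = layers.SeparableConv2D(width, 3, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = SelfAttention(width)(x)
    return x
```

Because gamma starts at zero, each block initially behaves as an identity mapping, matching the property described above for the early stage of learning.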
All datasets are split into train, validation, test, and fine-tune sets; the size of each split is shown in Table 4. Our FDFtNet is trained with Stochastic Gradient Descent (SGD) with momentum for 300 epochs on all datasets. The learning rate is initialized at 0.3 and annealed with a cosine function. The momentum rate is set to 0.9, and the mini-batch size to 128. Early stopping is applied when the validation loss ceases to decrease for 20 epochs. To re-enact the most challenging scenario for detecting fake images, all input images are resized to 64 × 64 resolution.

Table 4. The respective sizes of the train, validation, test, and fine-tune sets. We use only 1,000 real and 1,000 fake images for fine-tuning.

Data Augmentation. Input images are translated within a width and height range of [−2, 2] pixels, with nearest-padding applied to the empty pixels created by the translation. Zoom and rotation are applied within a range of [−0.2, 0.2], and we also perform random horizontal flipping. These augmentations are applied to all fine-tune sets; for the validation and test sets, only a 1/255 rescaling of the input image is applied.

Cutout. The Cutout method applies square zero masks at random locations of each input image; Fig. 4 presents an example. DeVries et al. [3] used random 16-pixel zero masks for CIFAR-10 (32 × 32 images), with α = 5 random iterations for cutting and β = 16 random size multipliers for the cutting masks. For our 64 × 64 images, we use a 4 × 4 pixel base mask, 3 iterations, and a size multiplier of 5 (α = 3 and β = 5). Since we use random translation, we do not use the random center cropping of the original paper. When we ran the original setting, we faced severe underfitting with no convergence of the losses; we observed higher performance with the low-parameter Cutout setting (α = 3 and β = 5) than without Cutout, which showed strong overfitting. Because we fine-tune with a small amount of data, we adopt this non-aggressive parameter setting.
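A small NumPy sketch of this non-aggressive Cutout setting follows. Reading α as the number of patches and β as the maximum size multiplier is our interpretation of the description above.

```python
# Cutout with alpha square patches, each side base * U{1..beta} pixels.
import numpy as np

def cutout(image, base=4, alpha=3, beta=5, rng=np.random):
    """Zero out alpha random square patches of a HxWxC image (copy returned)."""
    img = image.copy()
    h, w = img.shape[:2]
    for _ in range(alpha):
        size = base * rng.randint(1, beta + 1)   # patch side in pixels
        y = rng.randint(0, h)                    # random patch center
        x = rng.randint(0, w)
        y0, y1 = max(0, y - size // 2), min(h, y + size // 2)
        x0, x1 = max(0, x - size // 2), min(w, x + size // 2)
        img[y0:y1, x0:x1] = 0.0                  # square zero mask
    return img
```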
We present our overall performance results in Table 5, using accuracy (ACC) and AUROC as evaluation metrics. We experimented with all four baseline models on each dataset under similar training strategies. The results show that FDFtNet achieves superior detection performance in both ACC and AUROC compared with all baselines, and it reaches this performance using only 1,000 real and 1,000 fake training images.

PGGAN. To yield the best detection performance, we freeze the weight parameters of all layers of the pre-trained models; FTT with M = 3 is used, MBblockV3 with N = 2 is added, and the same data augmentation is applied. Table 5 compares our models with the baselines. Among the baselines, Xception achieves the highest performance (87.12% ACC and 94.96% AUROC). Our model reaches 90.29% ACC and 95.98% AUROC, higher than ShallowNetV3 with an ensemble [30]. ShallowNetV3 improves from 85.73% ACC and 92.90% AUROC to 88.03% ACC and 94.53% AUROC, similar to its ensemble version. The SqueezeNet baseline shows the lowest performance, but applying our model raises it from 50.00% to 92.76%, a level similar to that of ShallowNetV3.

Deepfake. Here too, the same data augmentation techniques are applied. For FTT we use M = 3, and N = 4 for MBblockV3; Cutout uses α = 3 iterations and a β = 10 multiplier. The results show that all models achieve significant performance improvements.

Face2Face. The training strategies for Face2Face are very similar to those for the Deepfake dataset. Data augmentation is applied, and M, N, α, and β are set to 3, 4, 3, and 10, respectively. Notably, the ResNetV2 baseline performs poorly (58.83% ACC and 62.47% AUROC), but our method brings significant improvements (85.15% ACC and 92.91% AUROC). These results demonstrate the generalization ability of our approach, lifting even poorly performing baselines to above 90% AUROC across all models and datasets. Compared with the FaceForensics benchmark results [1], the strongest state-of-the-art method is Xception, with 96.4% ACC on Deepfake and 86.9% ACC on Face2Face; our FDFtNet achieves higher performance (97.02% and 96.67%, respectively) on the same datasets.

Ablation Study. We validate each module and technique through an ablation study. In Table 6, we choose the Xception model and the Deepfake dataset to compare our model with and without FTT, keeping all other settings the same. With FTT, we achieve about 2.5% higher performance, as shown in Table 6.

Limitations. Our current work has the following limitations. First, we used both real and fake data for training and fine-tuning, whereas resources are constrained in practice; FakeTalkerDetect [16] trained Siamese networks on real data only for fake detection, but in our implementation, few-shot learning and unbalanced learning remain major obstacles to high performance. Second, transfer learning is required to improve performance, since we trained each model on each dataset separately; for future work, we plan to study the transfer learning ability to further generalize our model.

Conclusion. We propose FDFtNet, a robust fine-tuning neural network-based architecture that detects fake images and significantly improves baseline CNN architectures. Our model achieves state-of-the-art accuracy in fake image detection on both the GAN-based and Deepfake-based datasets. Our experimental results with a limited amount of data show the exploration and exploitation of image feature spaces beyond those of the pre-trained models. FDFtNet is a promising method for detecting fake images generated by powerful deep learning methods, requiring only a small number of images for re-training; it can therefore be a viable option for detecting new fake images in real-world scenarios, where available datasets are extremely limited. We also offer an open-source version of our work so that it can be widely leveraged by the research community.
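Putting the pieces together, the sketch below shows the overall fine-tuning recipe described in this paper: a frozen pre-trained backbone, a simplified MBblockV3-style stack, the FTT branch from the earlier sketch, and SGD with momentum under cosine annealing and early stopping. How the two branches are fused and the shape of the classifier head are our simplifying assumptions, and pretrained_backbone is a hypothetical headless model that accepts 64 × 64 inputs.

```python
# End-to-end sketch of the FDFtNet fine-tuning recipe (assumptions noted above).
from tensorflow.keras import layers, Model, optimizers, callbacks

def mb_block(x, expand=4):
    """Simplified inverted residual with linear bottleneck (SE block omitted)."""
    c = x.shape[-1]
    y = layers.Conv2D(c * expand, 1, activation="relu")(x)  # expansion
    y = layers.DepthwiseConv2D(3, padding="same")(y)        # depthwise conv
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(max_value=6.0)(y)                       # ReLU6
    y = layers.Conv2D(c, 1)(y)                              # linear projection
    return layers.Add()([x, y])                             # residual connection

def build_fdftnet(backbone, n=4):
    """Frozen pre-trained backbone + MBblockV3-style stack + FTT branch."""
    backbone.trainable = False                 # freeze all pre-trained layers
    inp = layers.Input((64, 64, 3))
    feat = backbone(inp)                       # backbone with its head removed
    for _ in range(n):                         # N = 4 blocks after the backbone
        feat = mb_block(feat)
    ftt = fine_tune_transformer(inp)           # FTT branch (sketch above)
    x = layers.Concatenate()([layers.GlobalAveragePooling2D()(feat),
                              layers.GlobalAveragePooling2D()(ftt)])
    return Model(inp, layers.Dense(1, activation="sigmoid")(x))

# SGD with momentum, cosine-annealed learning rate, early stopping on val loss.
lr = optimizers.schedules.CosineDecay(0.3, decay_steps=10_000)
# model = build_fdftnet(pretrained_backbone)   # pretrained_backbone: hypothetical
# model.compile(optimizer=optimizers.SGD(learning_rate=lr, momentum=0.9),
#               loss="binary_crossentropy", metrics=["accuracy"])
# early = callbacks.EarlyStopping(monitor="val_loss", patience=20)
# model.fit(x_finetune, y_finetune, batch_size=128, epochs=300,
#           validation_data=(x_val, y_val), callbacks=[early])
```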
References

[1] FaceForensics benchmark: results for the binary classification scenario.
[2] Xception: deep learning with depthwise separable convolutions.
[3] Improved regularization of convolutional neural networks with Cutout.
[4] Sigmoid-weighted linear units for neural network function approximation in reinforcement learning.
[5] Exposing digital forgeries from JPEG ghosts.
[6] Generative adversarial nets.
[7] Deep residual learning for image recognition.
[8] Identity mappings in deep residual networks.
[9] Bridging nonlinearities and stochastic regularizers with Gaussian error linear units.
[10] Searching for MobileNetV3.
[11] MobileNets: efficient convolutional neural networks for mobile vision applications.
[12] Squeeze-and-excitation networks.
[13] DenseNet: implementing efficient ConvNet descriptor pyramids.
[14] SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size.
[15] Batch normalization: accelerating deep network training by reducing internal covariate shift.
[16] FakeTalkerDetect: effective and practical realistic neural talking head detection with a highly unbalanced dataset.
[17] Progressive growing of GANs for improved quality, stability, and variation.
[18] A style-based generator architecture for generative adversarial networks.
[19] A picture's worth.
[20] Deep learning face attributes in the wild.
[21] Image forgery types and their detection: a review.
[22] Swish: a self-gated activation function.
[23] FaceForensics: a large-scale video dataset for forgery detection in human faces.
[24] FaceForensics++: learning to detect manipulated facial images.
[25] MobileNetV2: inverted residuals and linear bottlenecks.
[26] A quantization-friendly separable convolution for MobileNets.
[27] Very deep convolutional networks for large-scale image recognition.
[28] Meta-transfer learning for few-shot learning.
[29] Going deeper with convolutions.
[30] GAN is a friend or foe? A framework to detect various fake face images.
[31] Face2Face: real-time face capture and reenactment of RGB videos.
[32] Attention is all you need.
[33] Deepfakes: publicly available face-swapping software. https://github.com/deepfakes/faceswap
[34] Sliced Wasserstein generative models.
[35] Estimating distribution costs with the Eaton-Kortum model.
[36] Altering faces via AI deepfake may be outlawed. China Daily.
[37] Attributing fake images to GANs: learning and analyzing GAN fingerprints.
[38] Few-shot adversarial learning of realistic neural talking head models.
[39] Self-attention generative adversarial networks.

Acknowledgments. We thank Siho Han for providing his expertise to greatly improve this work. This work was partly supported by Institute of Information