Deep Learning-based Forgery Attack on Document Images

Lin Zhao, Student Member, IEEE, Changsheng Chen, Senior Member, IEEE, and Jiwu Huang, Fellow, IEEE

IEEE Transactions on Image Processing, DOI: 10.1109/TIP.2021.3112048

The authors are with the Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security, and National Engineering Laboratory for Big Data System Computing Technology, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China. They are also with the Shenzhen Institute of Artificial Intelligence and Robotics for Society, China (e-mail: zhaolin2016@email.szu.edu.cn, cschen@szu.edu.cn, jwhuang@szu.edu.cn).

Abstract-With the ongoing popularization of online services, digital document images have been used in various applications. Meanwhile, some deep learning-based text editing algorithms have emerged which alter the textual information of an image in an end-to-end fashion. In this work, we present a low-cost document forgery algorithm that uses existing deep learning-based technologies to edit practical document images. To achieve this goal, the limitations of existing text editing algorithms towards complicated characters and complex backgrounds are addressed by a set of network design strategies. First, unnecessary confusion in the supervision data is avoided by disentangling the textual and background information in the source images. Second, to capture the structure of some complicated components, the text skeleton is provided as auxiliary information and the continuity in texture is considered explicitly in the loss function. Third, the forgery traces induced by the text editing operation are mitigated by post-processing operations which consider the distortions from the print-and-scan channel.
Quantitative comparisons of the proposed method and the existing approach have shown the advantages of our design: the reconstruction error measured in MSE is reduced by about 2/3, while the reconstruction quality measured in PSNR and in SSIM is improved by 4 dB and 0.21, respectively. Qualitative experiments have confirmed that the reconstruction results of the proposed method are visually better than those of the existing approach for both complicated characters and complex textures. More importantly, we have demonstrated the performance of the proposed document forgery algorithm under a practical scenario where an attacker is able to alter the textual information in an identity document using only one sample in the target domain. The forged-and-recaptured samples created by the proposed text editing attack and recapturing operation have successfully fooled some existing document authentication systems.

Index Terms-Document Image, Text Editing, Deep Learning

Due to the COVID-19 pandemic, we have observed an unprecedented demand for online document authentication in the applications of e-commerce and e-government. Some important document images are uploaded to online platforms for various purposes. However, the content of a document can be altered by image editing tools or deep learning-based technologies. As an illustration, Fig. 1(a) shows an example from the Document Forgery Attack dataset of the Alibaba Tianchi Competition [1], forged with the proposed document forgery approach. Some key information on the original image is edited and then the document is recaptured to conceal the forgery traces. It is a low-cost (automatic, and without the need of a skilled professional) and dangerous act if an attacker uses such forge-and-recapture document images to launch illegal attacks. Recently, it has been demonstrated that characters and words in natural images can be edited with convolutional neural networks [2]-[4] in an end-to-end fashion. Similar to the framework of DeepFake [5], these models have been trained to disentangle different components in the document images, such as text, style and background. During the process of text editing, the input textual information (plain text with the targeted contents) is converted to a text image with the targeted style and background. It should be noted that these works [2]-[4] were originally proposed for visual translation and AR translation applications. To the best of our knowledge, there are no existing works on evaluating the impact of the above deep learning-based textual content generation schemes on document security. The edited text images have not been investigated from a forensic aspect. Authentication of hardcopy documents with digitally acquired document images is a forensic research topic with broad interest. Although an edited document image in the digital domain can be traced with some existing tamper detection and localization schemes [6], it has been shown that detection of document forgery with a small manipulation region (e.g., key information in a document) is challenging [7].
Moreover, the recapturing operation (replay attack) is an effective way to conceal the forgery traces [8], [9]. A formal attack model with two scenarios is shown in Fig. 2. For a common document (e.g., an identity card), the attacker's own copy can be edited to perform an impersonation attack of a target identity. For a document with a specific template, the attacker would steal a digital copy of the document and forge his/her own document image to get unauthorized access. To understand the security threat, one should note that detecting a recapturing attack in digital documents is very different from detecting spoofing in other media, e.g., face and natural images. For example, the forensic traces from depth in face [10], [11] and natural images [9], [12], as well as the Moiré pattern artifacts [13] in displayed images, are not available in document images. Both the captured and recaptured versions of a hardcopy document are acquired from flat paper surfaces, which lack the distinct differences between a 3D natural scene and a flat surface or a pixelated display. Thus, the advancement of deep learning technologies in text editing may have already put our document images at risk.

In this work, we build a deep learning-based document forgery network to attack existing digital document authentication systems under a practical scenario. The approach can be divided into two stages, i.e., document forgery and document recapturing. In the document forgery stage, the target text region is disentangled to yield the text, style and background components. To allow text editing of characters with complicated structures under complex backgrounds, several important strategies are introduced. First, to avoid confusion among different components of the source images (e.g., between complex background textures and texts), the textual information is extracted by successively performing inpainting and differentiation on the input image. Second, to capture the structure of some complicated components, the text skeleton is provided as auxiliary information and the continuity in texture is considered explicitly in the loss function. Last but not least, the forgery traces between the forged and background regions are mitigated by post-processing operations with considerations on distortions from the print-and-scan process. In the recapturing stage, the forged document is printed and scanned with some off-the-shelf devices. In the experiment, the network is trained with a publicly available document image dataset and some synthetic textual images with complicated backgrounds. An ablation study shows the importance of our strategies in designing and training our document forgery network. Moreover, we demonstrate the document forgery performance under a practical scenario where an attacker generates a forged document with only one sample in the target domain. In our case, an identity document with a complex background can also be edited by a single-sample fine-tuning operation. Finally, the edited images are printed and scanned to conceal the forgery traces. We show that the forge-and-recapture samples created by the proposed attack have successfully fooled some existing document authentication systems. The main contributions of this work are summarized as follows.

• We propose the first deep learning-based text editing network towards document images with complicated characters and complex backgrounds.
Together with the recapturing attack, we show that the forge-and-recapture samples have successfully fooled some state-of-the-art document authentication systems.

• We mitigate the visual artifacts introduced by the text editing operation by color pre-compensation and inverse halftoning operations, which consider the distortions from the print-and-scan channel, to produce a high-quality forgery result.

• We demonstrate the document forgery performance under a practical scenario where an attacker alters the textual information in an identity document (with Chinese characters and complex texture) by fine-tuning the proposed scheme with one sample in the target domain.

The remainder of this paper is organized as follows. Section II reviews the related literature on deep learning-based text editing. Section III introduces the proposed document forgery method. Section IV describes the datasets and training procedure of our experiments. Section V compares the proposed algorithm with the existing text editing methods, and demonstrates the feasibility of attacking the existing document authentication systems with the forge-and-recapture attack. Section VI concludes this paper.

Recently, text image synthesis has become a hot topic in the field of computer vision. Text synthesis tasks have been implemented on scene images for visual translation and augmented reality applications. The GAN-based text synthesis technique renders more realistic text regions in natural scene images. Wu et al. first addressed the problem of word- or text-line-level scene text editing with an end-to-end trainable Style Retention Network (SRNet) [2]. SRNet consists of three learnable modules, including the text conversion module, background inpainting module and fusion module, which are used for text editing, background erasure, as well as text and background fusion, respectively. The design of the network allows the modules to be pre-trained separately, which reduces the difficulty of end-to-end training of such a complicated network. Compared with character-replacement approaches, SRNet works at the word level, which is a more efficient and intuitive way of document editing. Experimental results show that SRNet is able to edit the textual information in some natural scene images. Roy et al. [3] designed a Scene Text Editor using Font Adaptive Neural Network (STEFANN) to edit texts in scene images. However, a one-hot encoding of length 26 is adopted in STEFANN to represent the 26 upper-case English alphabets in the latent feature space. Such an encoding can be extended to lower-case English alphabets and Arabic numerals, but it is not applicable to Chinese, which has a much larger character set (more than 3,000 characters in common use) [14]. Thus, STEFANN is not suitable for editing Chinese documents. Yang et al. [4] proposed an image text swapping scheme (SwapText) for scenes, with special attention on the performance in perspective and curved text images. In the following, we mainly focus on SRNet [2] since it is the most relevant work to our task of editing text in document images, for two reasons. First, it is applicable to Chinese characters, unlike STEFANN [3]. Second, it keeps a relatively simple network structure compared to SwapText [4], which considers curved texts that are uncommonly found in documents. The difficulty of editing Chinese text in document images mainly lies in background inpainting and text style conversion.
In the background inpainting process, we need to fill the background after erasing the textual region. The image background, as an important visual cue, is the main factor affecting the similarity between the synthesized and the ground-truth text images. However, as shown in Fig. 3, the reconstructed regions show discontinuity in texture that degrades the visual quality. This is mainly because the background reconstruction loss of SRNet compares the inpainted and original images pixel by pixel and weights the distortions in different regions equally, whereas humans assess the results mainly from structural components, e.g., texture. In the text style conversion process, SRNet inputs the source image (with source text, target style and background) to the text conversion subnet. However, as shown in Fig. 4(c), the text style has not been transferred from (a) to (c). In particular, the Chinese characters with more strokes are distorted more seriously than the English alphabets. This is because the different components (source text, target style, and background) in the source image introduce confusion in the text style conversion process. It should be noted that such distortion is more obvious for Chinese characters for two reasons. On the one hand, the number of Chinese characters is huge, with more than 3,000 characters in common use. It is more difficult to train a style conversion network for thousands of Chinese characters than for dozens of English alphabets. On the other hand, the font composition of Chinese characters is complex, as a character is composed of the five standard stroke types and multiple radicals. Therefore, text editing of Chinese characters in documents with complex backgrounds still presents great challenges. In addition, most of the target contents of the existing works are scene images rather than document images. These works only require the artifacts in the synthesized text image to be unobtrusive to the human visual system, rather than undetectable by forensic tools. Therefore, the existing works [2]-[4] have not considered further processing of the text editing results with regard to the distortions from the print-and-scan channel, such as color degradation and halftoning [15].

As shown in Fig. 5, the document forgery attack is divided into the forgery (through the proposed deep network, ForgeNet) and recapturing steps. For the forgery process, the document image acquired by an imaging device is employed as input to the ForgeNet. It is divided into three regions, i.e., the text region, the image region, and the background region (the areas that are not included in the first two categories). The background region is processed by the inverse halftoning module (IHNet) to remove the halftone dots in the printed document. The original content in the image region is replaced by the target image, and the resulting image is fed into the print-and-scan pre-compensation module (PCNet) and IHNet. It should be noted that the PCNet deliberately distorts the color and introduces halftone patterns in the edited region such that the discrepancies between the edited and background regions are compensated. The text region is subsequently forwarded to the text editing module (TENet), PCNet and IHNet. After being processed by the ForgeNet, the three regions are stitched together to form a complete document image. Lastly, the forged document image is recaptured by cameras or scanners to finish the forge-and-recapture attack. For clarity, the definitions of the main symbols in our work are summarized in Tab. I.
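The region-wise flow described above can be outlined with the following sketch. This is an illustrative outline rather than the authors' implementation: the module callables (tenet, pcnet, ihnet), the region masks and the simple mask compositing used for stitching are all assumptions made for the example.

```python
import numpy as np

def forge_document(doc_img, text_mask, image_mask, target_text, target_image,
                   tenet, pcnet, ihnet):
    """Hypothetical ForgeNet-style pipeline: edit the text/image regions,
    pre-compensate print-and-scan distortion, remove halftone dots, stitch."""
    # Background region: everything not covered by the text or image masks.
    background_mask = 1.0 - np.clip(text_mask + image_mask, 0.0, 1.0)

    # Background region is only processed by inverse halftoning (IHNet).
    background = ihnet(doc_img) * background_mask[..., None]

    # Image region: replace the content, then apply PCNet and IHNet.
    image_region = ihnet(pcnet(target_image)) * image_mask[..., None]

    # Text region: edit with TENet, then apply PCNet and IHNet.
    edited_text = tenet(doc_img, target_text)
    text_region = ihnet(pcnet(edited_text)) * text_mask[..., None]

    # Stitch the three regions back into a complete document image;
    # the result would then be printed and recaptured.
    return background + image_region + text_region
```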
In the following paragraphs, the TENet, PCNet, and IHNet within the ForgeNet will be elaborated.

In this part, a deep learning-based architecture, TENet, is proposed to edit the textual information in document images. As shown in Fig. 6, TENet consists of three subnets. The background inpainting subnet generates a complete background by filling the original text region with the predicted content. The text conversion subnet replaces the text content of the source image I_s with the target text I_t while preserving the original style. The fusion subnet merges the outputs of the last two subnets and yields the edited image with the target text and original background.

1) Background Inpainting Subnet: Prior to performing text editing, we need to erase the text in the original text region and fill the background. In this part, we adopt the original encoder-decoder structure of SRNet [2] to complete the background inpainting. The L1 loss and adversarial loss [16] are employed to optimize the initial background inpainting subnet. The loss function of the background inpainting subnet combines these two terms, where E denotes the expectation operation, D_b denotes the discriminator network of the background inpainting subnet, O_b is the output of the background inpainting subnet, I_b is the ground-truth background image, and λ_b is the weighting factor (set to 10 in our experiment) that balances the adversarial and L1 losses. As shown in Fig. 3, the background inpainting performance degrades seriously under complex backgrounds. As discussed in Sec. II, the texture continuity in the background region was not considered in the existing network designs [2], [4]. In our approach, we adopt the background inpainting subnet of SRNet for a rough reconstruction, and the fine details of background inpainting will be reconstructed in the fusion subnet (Sec. III-A3).

2) Text Conversion Subnet: The purpose of the text conversion subnet is to convert the target texts to the style of the source texts. In this subnet, the text properties that can be transferred include fonts, sizes, color, etc. However, the performance of the text conversion subnet in [2] degrades significantly (as shown in Fig. 3) if the background region of the source image I_s contains complex textures. Therefore, we propose to isolate the text region from the background texture before carrying out text conversion. Firstly, the background image O_b is obtained by the background inpainting subnet proposed in Sec. III-A1. Secondly, we differentiate the background image O_b and the source image I_s to get the source text image without background. Due to the subtle differences between O_b and the corresponding ground-truth I_b, there will be some residuals in the differential image of I_s and O_b. These residuals can be removed by post-processing operations, such as filtering and binarization, and the source text image without background is obtained. The target text image I_t and the source text image without background are fed into the text conversion subnet, which follows an encoder-decoder FCN framework. The network can then convert I_t according to the style of the source text without interference from the background region. However, different from the training data provided in [2], our target documents (as shown in Fig. 1) contain a significant number of Chinese characters, whose structures are more complex than those of English alphabets and Arabic numerals. Besides, the number of Chinese characters is huge, with more than 3,000 characters in common use.
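To make the differentiation step described above concrete, a minimal sketch using OpenCV is given below. The median filtering and Otsu binarization are hypothetical choices for the post-processing that the paper only describes as "filtering and binarization".

```python
import cv2

def extract_source_text(source_img, inpainted_bg):
    """Isolate the styled source text by differencing the source image I_s
    against the inpainted background O_b, then suppressing residuals."""
    # Absolute difference between the source image and the inpainted background.
    diff = cv2.absdiff(source_img, inpainted_bg)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)

    # Median filtering suppresses small residuals left by imperfect inpainting.
    gray = cv2.medianBlur(gray, 3)

    # Otsu binarization yields a text-only image (white strokes on black).
    _, text_only = cv2.threshold(gray, 0, 255,
                                 cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return text_only
```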
Given the complexity and the large number of Chinese characters, instead of using a ResBlock-based text skeleton extraction subnet as in [2], we directly adopt a hard-coded component [17] for text skeleton extraction in our implementation to avoid unnecessary distortions. Such a design avoids the training overhead for Chinese characters, though the flexibility of the network is reduced. Intuitively, the L1 loss can be applied to train the text conversion subnet. However, without weighting the text and background regions, the output of the text conversion subnet may leave visible artifacts on character edges. We propose to add a binary mask of the target styled text image, M_t, to weight different components in the loss function. The loss of the text conversion subnet weights the reconstruction error with this mask, where |M_t|_0 is the L0 norm of M_t, and L_t1 is the L1 loss between the output of the text conversion subnet O_t and the corresponding ground-truth. It should be noted that, during testing, the text skeleton T_sk is replaced with the text skeleton image of the intermediate result O_t after decoding.

3) Fusion Subnet: We use the fusion subnet to fuse the output of the background inpainting subnet O_b and the output of the text conversion subnet O_t. In order to improve the quality of the text editing image, we further divide the fusion subnet into a coarse fusion subnet and a fine fusion subnet. The coarse fusion subnet follows a generic encoder-decoder architecture. We first perform three layers of downsampling on the text-converted output O_t. Next, the downsampled feature maps are fed into 4 residual blocks (ResBlocks) [18]. It is noteworthy that we connect the feature maps of the background inpainting subnet to the corresponding feature maps with the same resolutions in the decoding layers of the coarse fusion subnet to allow a direct path for feature reuse. After decoding and up-sampling, the coarse fusion image O_cf is obtained. The loss function of the coarse fusion subnet is adopted from SRNet [2], where D_f denotes the discriminator network of the coarse fusion subnet, I_f is the ground-truth, O_cf is the output of the coarse fusion subnet, and λ_cf is the balance factor, which is set to 10 in our implementation. Next, we further improve the quality by considering the continuity of the background texture in the fine fusion subnet. The input to this subnet is a single feature tensor, [O_cf, T_e]^T, obtained by concatenating the coarsely fused image O_cf and the edge map T_e along the channel axis. It should be noted that T_e is extracted from the ground-truth using the Canny edge detector in the training process, while in the testing process, T_e is the edge map extracted from the output of the coarse fusion subnet O_cf. In the fine fusion subnet, the edge map of the ground-truth plays a role in correcting the details in the background area and maintaining texture continuity [19]. We feed [O_cf, T_e]^T to 4 ResBlocks to enhance the high-frequency details in the image and to remove the artifacts created by the low-frequency reconstruction in the coarse fusion subnet. The loss function of the fine fusion subnet is computed between its output O_ff and the ground-truth I_f. In order to reduce perceptual image distortion, we introduce a VGG-loss based on VGG-19 [20].
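A plausible form of this VGG-loss, following the standard perceptual [21] and style [22] formulations, is sketched below; the choice of the L1 distance and the exact inputs are assumptions rather than the authors' exact definitions.

```latex
\begin{aligned}
L_{per}   &= \mathbb{E}\Big[\sum_{i=1}^{5} \big\lVert \phi_i(I_f) - \phi_i(O_{ff}) \big\rVert_1\Big], \\
L_{style} &= \mathbb{E}\Big[\sum_{i=1}^{5} \big\lVert G^{\phi}_i(I_f) - G^{\phi}_i(O_{ff}) \big\rVert_1\Big], \\
L_{vgg}   &= \lambda_{g1}\, L_{per} + \lambda_{g2}\, L_{style}.
\end{aligned}
```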
The VGG-loss is thus divided into a perceptual loss [21] and a style loss [22], where i ∈ [1, 5] indexes the layers from relu1_1 to relu5_1 of the VGG-19 model, φ_i is the activation map of the i-th layer, G^φ_i is the Gram matrix of the i-th layer, and the weighting factors λ_g1 and λ_g2 are set to 1 and 500, respectively. The whole loss function for the fusion subnet combines the coarse fusion, fine fusion and VGG losses. Eventually, the overall loss for TENet aggregates the losses of its subnets, where G denotes the generator of TENet.

The edited text regions are digital images (without print-and-scan distortions), while the background regions have been through the print-and-scan process. If the edited text and background regions are stitched directly, boundary artifacts will be obvious. We propose to pre-compensate the text regions with print-and-scan distortion before combining the different regions. The print-and-scan process introduces nonlinear distortions, such as changes in contrast and brightness and various sources of noise, which can be modelled as a non-linear mapping function [15]. However, it is difficult to model the distortion parametrically under uncontrolled conditions. Inspired by the display-camera transfer simulation in [23], we propose the PCNet with an auto-encoder structure (shown in Fig. 7) to simulate the intensity variation and noise in the print-and-scan process. To improve the overall performance of the network, we adopt a local patch-wise texture matching loss based on the more lightweight VGG-16 network [19]. The loss function of PCNet is defined on the output of PCNet, O_p, and its ground-truth I_p; the local patch-wise texture matching loss between O_p and I_p, weighted by λ_p, is also considered. In our experiment, the weight λ_p is set to 0.02. In practice, the original document image I_o is not accessible to the attacker. Therefore, a denoised version of the document image, I_d, is employed in the training process as an estimation of the original document image. In our experiment, the denoised images are generated by the NoiseWare plugin of Adobe Photoshop [24]. Essentially, PCNet learns the intensity mapping and noise distortion in the print-and-scan channel. As shown in Sec. V-B2, the distortion model can be trained adaptively with a small number of fine-tuning samples to pre-compensate the channel distortion.

According to [25], halftoning is a technique that simulates the continuous intensity variation in a digital image by changing the size, frequency or shape of the ink dots during printing or scanning. After the print-and-scan process or processing by our PCNet, the document image can be regarded as clusters of halftone dots. If the image is re-printed and recaptured without restoration, the halftone patterns generated during the first and second printing processes will interfere with each other and introduce aliasing distortions, e.g., Moiré artifacts [26]. In order to make the forge-and-recapture attack more realistic, the IHNet is proposed to remove the halftone patterns in the forged document images before recapturing. We follow the network design in [19] to remove the halftone dots in the printed document images. The IHNet can be divided into two steps. The first step extracts the shape and color (low-frequency features) and the edges (high-frequency features) of the document image via CoarseNet and EdgeNet, respectively. The resulting features are fed into the second stage, where image enhancements such as recovering missing texture details are implemented.
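A structural sketch of this two-step data flow is given below in PyTorch style; the three sub-networks are treated as opaque modules since their layer configurations are not restated here, so the constructor arguments are assumptions.

```python
import torch
import torch.nn as nn

class IHNetSketch(nn.Module):
    """Data flow of the inverse-halftoning module: CoarseNet and EdgeNet
    extract low- and high-frequency features, DetailNet refines the result."""
    def __init__(self, coarse_net: nn.Module, edge_net: nn.Module,
                 detail_net: nn.Module):
        super().__init__()
        self.coarse_net = coarse_net   # rough shape/color reconstruction
        self.edge_net = edge_net       # character/texture contour extraction
        self.detail_net = detail_net   # residual refinement of details

    def forward(self, halftone_img: torch.Tensor) -> torch.Tensor:
        coarse = self.coarse_net(halftone_img)
        edges = self.edge_net(halftone_img)
        # Concatenate low- and high-frequency features along the channel axis.
        fused = torch.cat([coarse, edges], dim=1)
        return self.detail_net(fused)
```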
However, a much simpler structure than that of [19] is adopted, since the content of a document image is much more regular and simpler than that of a natural image. The simplification includes removing the high-level network components (e.g., the object classification subnet) and the discriminator in [19]. With such simplification, the network is much more efficient. Specifically, the CoarseNet with an encoder-decoder structure is employed for the rough reconstruction of the shape and color of the halftone input images. Besides the L1 loss, a global texture loss function (defined in Eq. 10) based on the VGG-16 structure is used to measure the loss in texture statistics. The overall loss function of CoarseNet combines these two terms, where O_c is the output of CoarseNet, I_d is the denoised version of the document image, and λ_c is the weighting factor, set to 0.02 in our implementation. Due to the downsampling operation in the encoder part of CoarseNet, the high-frequency features are not preserved in the reconstructed images. However, the high-frequency components, such as the edges and contours of objects, are important visual landmarks in the image reconstruction task. Therefore, the edge map is provided as auxiliary information to the reconstruction process. Instead of detecting edges with the Canny edge detector (as in the fusion subnet in Sec. III-A3), an end-to-end convolutional network is proposed here to extract the contours of characters and background textures from I_p. This is because the traditional edge detector would also detect the edges of halftone dots in I_p, which should be removed by the IHNet. Due to the binary nature of an edge map, the cross-entropy function is used as the loss function of EdgeNet, where I_e and O_e are the edge maps of the ground-truth and the output of EdgeNet, respectively. The output maps from CoarseNet and EdgeNet are concatenated along the channel axis to form a single feature tensor, [O_c, O_e]^T, before being fed into the DetailNet. DetailNet adopts a residual network that integrates low- and high-frequency features. It removes the remaining artifacts in the low-frequency reconstruction and enhances the details. The loss function of the network is defined on the output of DetailNet, O_d, and the edge map O_d^e obtained by feeding O_d to EdgeNet. We set the weights as λ_d1 = 100, λ_d2 = 0.1, and λ_d3 = 0.5, respectively.

A. Datasets

1) Synthetic Character Dataset: The editing objects of our task contain a large number of Chinese characters. To train TENet, we construct a synthetic character dataset D_t including Chinese characters, English alphabets and Arabic numerals. As shown in Fig. 9, the dataset consists of eight types of images. The synthetic text dataset D_t contains a total of 400,000 images, with 50,000 images of each type.

2) Student Card Image Dataset: To facilitate the training of our ForgeNet, a high-quality dataset consisting of document images captured by various devices is needed. As shown in Fig. 10, we use the student card dataset from our group [27]. The original images in this dataset are synthesized using CorelDRAW and printed on acrylic plastic material by a third-party manufacturer. It contains a total of 12 student cards from 5 universities.
The dataset is collected by 11 off-the-shelf imaging devices, including 6 camera phones (XiaoMi 8, RedMi Note 5, Oppo Reno, Huawei P9, Apple iPhone 6 and iPhone 6s) and 5 scanners (Brother DCP-1519, Benq K810, Epson V330, Epson V850 and HP Laserjet M176n). In total, the dataset consists of 132 high-quality captured images of the student cards. In our experiments, these document images are used in the forgery and recapture operations. This dataset is denoted as D_c.

The training process of the proposed ForgeNet is carried out in several phases. The TENet, PCNet and IHNet are pre-trained separately.

1) Training TENet: For training TENet, we use the synthetic character dataset D_t in Sec. IV-A1. In order to cater for the network input dimension, we adjust the height of all images to 128 pixels and keep the original aspect ratio. In addition, the 400,000 images in the dataset are divided into training, validation and testing sets in an 8:1:1 ratio. Different portions of the dataset are fed into the corresponding inputs of the network for training. With a given training dataset, the model parameters (with random initialization) are optimized by minimizing the loss function. We implement a pix2pix-based network architecture [28] and train the model using the Adam optimizer (β_1 = 0.5, β_2 = 0.999). The batch size is set to 8. Since it is not simple to conduct end-to-end joint training on such a complicated network, we first input the corresponding images into the background inpainting subnet and text conversion subnet for pre-training with a training period of 10 epochs. Subsequently, the fusion subnet joins the end-to-end training with a training period of 20 epochs, and the learning rate gradually decreases from 2×10^-4 to 2×10^-6. We use an NVIDIA TITAN RTX GPU card for training, with a total training duration of 3 days.

2) Training PCNet: As shown in Fig. 11, training PCNet requires pairs of the original and the captured document images. PCNet learns the mapping from the original image to the captured image to simulate the print-and-scan distortions. We use the dataset D_c in Sec. IV-A2 to train PCNet. The dataset D_c consists of 12 original images I_o and 132 captured images I_p of the documents. One may employ I_o and I_p to train PCNet so as to learn the distortions in the print-and-scan channel. However, in practice, it is often difficult for an attacker to obtain the original document image I_o. Alternatively, we use the denoised versions of the captured document images, I_d, as approximations of the original images. In our experiment, the NoiseWare plugin in Adobe Photoshop [24] is employed to remove the distortions in the captured images. In order to accommodate the network input size, all images in the dataset D_c are cropped to image patches with a resolution of 256 × 256 pixels. In addition, data augmentation strategies such as rotation, cropping, and mirroring are carried out to expand the dataset D_c to 20,000 image patches, with 80% of the data used for training, 10% for validation, and 10% for testing. PCNet is trained for 20 epochs using the Adam optimizer (β_1 = 0.9, β_2 = 0.999) with a learning rate of 1 × 10^-4, no weight decay, and a batch size of 8. The parameters of the LeakyReLU and ELU activation functions are set to α = 0.2 and α = 1.0, respectively. The training process lasts for 1 day on an NVIDIA TITAN RTX GPU card.

3) Training IHNet: As shown in Fig. 12, the denoised document images I_d in dataset D_c are also used to train IHNet.
The edge image I_e of I_d and the artificially generated halftone image I_h are also needed in the training process. The halftone image I_h is generated by applying color halftone patterns to I_d in Photoshop with the amplitude modulation technique and various parameters (random halftone angles for different color channels). The edge image I_e is the edge map of I_d obtained by Canny edge detection. Similarly, all images in the dataset are cropped to a resolution of 256 × 256 pixels to fit the size of the network input. Data augmentation strategies are also employed to expand the dataset to 20,000 images, which are then divided in an 8:1:1 ratio into training, validation and testing sets. IHNet uses the Adam optimizer (β_1 = 0.9, β_2 = 0.999) with an initial learning rate of 1 × 10^-4. Since the network is divided into three subnets, we first pre-train CoarseNet and EdgeNet separately. After 10 epochs, DetailNet joins the end-to-end training with a decay rate of 0.9. The batch size is set to 8. The training stops after 20 epochs and lasts for 2 days on an NVIDIA TITAN RTX GPU card.

In the following, we first evaluate the performance of the proposed TENet on both the synthetic character dataset and the student card dataset without distortions from the print-and-scan channel. Then, the performance of ForgeNet (including TENet, PCNet and IHNet) is studied under practical setups, including forgery under the channel distortion, forgery with a single sample, and attacking the state-of-the-art document authentication systems. Finally, some future research directions on the detection of such forge-and-recapture attacks are discussed.

A. Performance Evaluation on TENet

1) Performance on Synthetic Character Dataset: In Sec. III-A, we propose the text editing network, TENet, by adapting SRNet [2] to our task. However, SRNet is originally designed for editing English alphabets and Arabic numerals in scene images for visual translation and augmented reality applications. As shown in Fig. 3, 4 and 13(b), it does not perform well on Chinese characters with complicated structures, especially in documents with complex backgrounds. In this part, we qualitatively and quantitatively examine the modules in TENet which are different from SRNet so as to show the effectiveness of our approach. Three main differences between SRNet and our proposed TENet are as follows. First, we perform an image differentiation operation between the source image I_s and the output O_b of the background inpainting subnet to obtain the styled text image without background. Second, this text image is fed into a hard-coded component to extract the text skeleton of the styled text, which is then directly input to the text conversion subnet as supervision information. Third, instead of only using a general U-Net structure to fuse different components (as in SRNet), we adopt a fine fusion subnet in TENet with consideration of texture continuity. We randomly select 500 images from our synthetic character dataset D_t as a testing set for comparison. Quantitative analysis with three commonly used metrics is performed to evaluate the resulting image distortion, including Mean Square Error (MSE, a.k.a. l2 error), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity (SSIM) [29]. The edited results of the different approaches are compared with the ground-truth to calculate these metrics.

Image Differentiation (ID). After removing the image differentiation part, we find that the text generation gets worse (as shown in Fig. 13(c)).
The distortion is more severe in the case of source images with complex backgrounds. Due to the interference of the background textures, the text conversion subnet cannot accurately distinguish the foreground (characters) from the background. This leads to difficulty in extracting text style features, and character strokes are therefore distorted. For example, residuals of the original characters are still visible in the background of the last two figures in Fig. 13(c). In contrast, using the differential image of the source image I_s and the output O_b as input to the text conversion subnet avoids background interference, allowing the text conversion subnet to focus on extracting text styles without confusion from the background, which leads to better text conversion performance. From Tab. II, we can see that without image differentiation, there is a significant degradation in MSE and PSNR compared to the proposed TENet. The above experiments indicate that the differentiation operation is essential in producing high-quality styled text.

Fine Fusion (FF). The performance of TENet under complex backgrounds mainly relies on the fine fusion subnet. If the fine fusion subnet is removed, the resulting image suffers from a loss of high-frequency details. This is because the remaining subnets (the background inpainting subnet and the text conversion subnet) are of a U-Net based structure which downsamples the input images before reconstruction. As shown in Fig. 13(d), the resulting text images are blurry. Besides, SRNet does not take into account the continuity of the background texture during image reconstruction, and the texture components in the resulting images are discontinuous. The results in Tab. II show that the impact of removing the fine fusion component is much more significant than that of the others. This is due to the fact that the background region is much larger than the foreground region in our test images, and the contribution of the fine fusion subnet is mainly in the background.

Skeleton Supervision (SS). Visually, Chinese characters are much more complex than English alphabets and Arabic numerals in terms of the number of strokes and the interaction of the strokes in a character. The skeleton supervision information is important in providing accurate supervision on the skeleton of Chinese characters. If the skeleton is extracted using a general trainable network (as designed in SRNet) instead of the hard-coded component, the text skeleton extraction performance is degraded. As shown in Fig. 13(e), by removing the skeleton supervision component, the character strokes in the resulting images appear distorted and the characters are not styled correctly. From Tab. II, we learn that the skeleton supervision has less impact on the overall image quality, as it only affects the character stroke generation. However, the style of characters is vital in creating high-quality forgery samples. In summary, the results look unrealistic in the absence of these three components, as shown in the ablation study in Fig. 13(c)-(e). The importance of image differentiation, fine fusion, and skeleton supervision is reflected in the quality of the characters, the background texture, and the character skeleton, respectively. Both the quantitative analysis and the visual examples clearly indicate the importance of the three components. Although TENet shows excellent text editing performance on most document images, it still has some limitations.
When the structure of the target characters is complex or the number of characters is large, TENet may fail. Fig. 14 shows two failure cases. In the top row, the performance of the text conversion subnet is degraded due to the complex structure and large number of strokes of the target characters, and the editing results thus show distorted strokes. In the bottom row, the conversion is across languages and between texts of different lengths. In dataset D_t, we follow the dataset generation strategy of SRNet [2], where the source and target styled characters have the same geometric attribute settings (e.g., size, position). However, for text pairs of different lengths, the strategy for setting the text geometry attributes is to make the overall style of the text with fewer characters converge to that of the text with more characters. Inevitably, some geometric attributes of the text with fewer characters are missing. The text conversion process of TENet faithfully transfers the geometric attributes from the source text to the target styled text, thus causing the generated results to deviate from the ground-truth. These failures occur because the number and diversity of samples in the training data are insufficient, which leads to unsatisfactory generalization performance of the model. We therefore believe that these problems could be alleviated by adding more complex characters and more font attributes to the training set.

2) Performance on the Student Card Forgeries: In Sec. V-A1, we perform an ablation study of the text editing module on a target text region of the document. However, this does not reflect the forgery performance on the entire image, including the text, image and background regions shown in Fig. 5. In this part, we perform text editing on the captured student card images and stitch the edited text regions with the other regions to yield the forged document image. It should be noted that the print-and-scan distortion is not considered in this experiment since we are evaluating the performance of TENet. In this experiment, SRNet [2] and the proposed TENet are compared in the text editing task with some student cards of different templates from dataset D_c. The training data contains 50,000 images of each type introduced in Sec. IV-A1. The height of all images is fixed to 128 pixels, and the original aspect ratio is maintained. The edited text fields are name, student No. and expiry date, including Chinese characters, English alphabets and Arabic numerals. It should be noted that the text lengths may be different before and after editing. As can be seen from Fig. 15, the proposed TENet significantly improves the performance in character style conversion.

B. Performance Evaluation on ForgeNet

1) Ablation Study of PCNet and IHNet: This part shows the tampering results of ForgeNet under print-and-scan distortion. The ForgeNet consists of three modules, namely, TENet, PCNet, and IHNet. We perform an ablation study to analyze the role of each module. The role of the TENet is to alter the content of the text region. However, as shown in Fig. 16(b), the resulting text regions from TENet are not consistent with the surrounding pixels. This is because the edited region has not been through the print-and-scan channel. The main channel distortions include the color differences introduced by illumination conditions, different color gamuts and calibrations of different devices, as well as halftoning patterns.
One of the most significant differences is in color, because the printing and scanning processes have different color gamuts, and the resulting color is thus distorted. Another difference is at the micro-scale of the image, introduced by the halftoning process and various sources of noise in the print-and-scan process. Thus, the role of PCNet is to pre-compensate the output images with print-and-scan distortions. As shown in Fig. 16(c), both the edited and background regions are more consistent after incorporating the PCNet. However, the halftoning artifacts (visible yellow dots) remain. The remaining halftoning artifacts interfere with the halftoning patterns introduced in the recapturing (print-and-capture) process. Thus, IHNet removes the visible halftoning artifacts (as shown in Fig. 16(a) and (d)) before the recapturing attack is performed. The resulting image processed with both PCNet and IHNet is closer to the original image, which shows that all three modules in ForgeNet play important roles.

2) Document Forgery with a Single Sample: In the previous section, we show the performance of the proposed ForgeNet on editing student card images. However, the background regions of these samples are relatively simple, usually with solid colors or simple geometric patterns. In this part, we choose the Resident Identity Card of the People's Republic of China, which has a complex background, as the target document. Identity card tampering is a more practical and challenging task for evaluating the performance of the proposed ForgeNet. However, identity cards contain private personal information, and it is very difficult to obtain a large number of scanned identity cards as training data. Thus, we assume the attacker has access to only one scanned identity card image, which is his/her own copy according to our threat model in Fig. 2(a). This identity card image is regarded as both the source document image (to be edited) and the sample in the target domain for fine-tuning TENet, PCNet and IHNet. The attacker then tries to forge the identity card image of a target person by editing the text. The identity card is scanned with a Canoscan 5600F scanner at a resolution of 1200 DPI. The whole image is cropped according to the different network input sizes, and data augmentation is performed. In total, 5,000 image patches are generated to fine-tune the network. It is worth noting that the complex textures of the identity card background pose a significant challenge to the text editing task. To improve the background reconstruction performance, the attacker could include some additional texture images which are similar to the identity card background for fine-tuning. Some state-of-the-art texture synthesis networks can be employed to generate such textures automatically [30]. The image patches are fed to TENet, PCNet, and IHNet for fine-tuning. To carry out the forgery test, we collected personal information from our research group. Ten sets of personal information (e.g., name, identity number) were gathered for a small-scale ID card tampering test, and 10 forged identity card images were generated accordingly. As shown in Fig. 17, some key information on the identity card is mosaicked to protect personal privacy. It is shown that ForgeNet achieves good forgery performance by fine-tuning with only one image, while the text and background in the image reconstructed by SRNet are distorted.
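As an illustration of how a single scanned card can be expanded into a fine-tuning set, a minimal sketch is given below; the patch size, augmentation choices and their parameters are illustrative assumptions rather than the exact recipe used above.

```python
import random
import numpy as np

def make_finetune_patches(card_img, patch_size=256, num_patches=5000, seed=0):
    """Expand one scanned document image into a set of augmented patches
    by random cropping, mirroring and 90-degree rotation."""
    rng = random.Random(seed)
    h, w = card_img.shape[:2]
    patches = []
    for _ in range(num_patches):
        # Random crop of patch_size x patch_size pixels.
        top = rng.randint(0, h - patch_size)
        left = rng.randint(0, w - patch_size)
        patch = card_img[top:top + patch_size, left:left + patch_size].copy()
        # Random horizontal mirroring.
        if rng.random() < 0.5:
            patch = np.fliplr(patch)
        # Random rotation by a multiple of 90 degrees.
        patch = np.rot90(patch, k=rng.randint(0, 3))
        patches.append(patch)
    return patches
```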
3) Forge-and-Recapture Document Attack on Authentication: In this part, the forged identity card images obtained in Sec. V-B2 are processed by the print-and-scan channel to demonstrate the threat posed by the forge-and-recapture attack. The printing and scanning devices used for the recapturing process are a Canon G3800 and a Canoscan 5600F, respectively. The highest printing quality of 4800 × 1200 DPI is employed. The print substrate is Kodak 230 g/m^2 glossy photo paper. The scanned images are in TIFF or JPEG formats, with scanning resolutions (ranging from 300 DPI to 1200 DPI) adjusted according to the required size of the different authentication platforms. The popular off-the-shelf document authentication platforms in China include Baidu AI, Tencent AI, Alibaba AI, Netease AI, Jingdong AI, MEGVII Face++ AI, iFLYTEK AI, Huawei AI, etc. Among these, the document authentication platforms which detect identity card recapturing and tampering are Baidu AI [32], Tencent AI [33], and MEGVII Face++ AI [31]. We uploaded the tampered results to these three state-of-the-art document authentication platforms for validation of the forge-and-recapture identity documents. The authentication results on MEGVII Face++ AI are shown in Tab. III. It is shown that the 10 forge-and-recapture identity images in our test are successfully authenticated. All the tested images also pass the other two authentication platforms (including inspection against editing, recapturing, etc.). Given that the state-of-the-art document authentication platforms have difficulties in identifying the forge-and-recapture document images, the success of our attack is fully demonstrated. This calls for immediate research efforts in detecting such attacks.

As discussed in Section I, the main focus of this work is to build a deep learning-based document forgery network to study the risk to existing digital document authentication systems. Thus, developing forensic algorithms against the forge-and-recapture attack is not within the scope of this work. Moreover, in order to study such attacks, a large and well-recognized dataset of forge-and-recapture document images is needed. However, no such dataset is currently available in the public domain. Without such a resource, some data-driven benchmarks in digital image forensics with hundreds or thousands of feature dimensions [34], [35] are not applicable. Meanwhile, this work provides an end-to-end framework for generating high-quality forged documents, which facilitates the construction of a large-scale and high-quality dataset. Last but not least, it has been shown in our parallel work [27] that the detection of the document recapturing attack alone (without forgery) is not a trivial task when the devices in the training and testing sets are different. The performance of a generic data-driven approach (e.g., ResNet [18]) and a traditional machine learning approach with handcrafted features (e.g., LBP+SVM [36]) was studied there. The detection performance degrades seriously under a cross-dataset experimental protocol where different printing and imaging devices are used in collecting the training and testing datasets.

In this work, the feasibility of employing deep learning-based technology to edit document images with complicated characters and complex backgrounds has been studied.
To achieve good editing performance, we address the limitations of existing text editing algorithms towards complicated characters and complex backgrounds by avoiding unnecessary confusion among different components of the source images (via the image differentiation component introduced in Sec. III-A2), and by constructing a texture continuity loss and providing auxiliary skeleton information (via the fine fusion and skeleton supervision components in Sec. III-A3). Comparisons with the existing text editing approach [2] confirm the importance of our contributions. Moreover, we propose to mitigate the visual artifacts of the text editing operation by post-processing (color pre-compensation and inverse halftoning) that considers the print-and-scan channel. Experimental results show that the consistency among different regions of a document image is maintained by these post-processing operations. We also demonstrate the document forgery performance under a practical scenario where an attacker generates an identity document with only one sample in the target domain. Finally, the recapturing attack is employed to cover the forensic traces of the text editing and post-processing operations. The forge-and-recapture samples created by the proposed attack have successfully fooled some state-of-the-art document authentication systems. From this study, we conclude that the advancement of deep learning-based text editing techniques has already introduced significant security risks to our document images.

REFERENCES
[1] Security AI Challenger Program Phase 5: Counterattacks on Forged Images.
[2] Editing Text in the Wild.
[3] STEFANN: Scene Text Editor using Font Adaptive Neural Network.
[4] SwapText: Image based Texts Transfer in Scenes.
[5] Deepfakes: A New Threat to Face Recognition? Assessment and Detection.
[6] Document Forgery Detection using Distortion Mutation of Geometric Parameters in Characters.
[7] Detecting Copy-move Forgeries in Scanned Text Documents.
[8] An Image Recapture Detection Algorithm based on Learning Dictionaries of Edge Profiles.
[9] A Diverse Large-scale Dataset for Evaluating Rebroadcast Attacks.
[10] Learning Generalized Deep Feature Representation for Face Anti-spoofing.
[11] Attention-based Two-stream Convolutional Networks for Face Spoofing Detection.
[12] Image Recapture Detection with Convolutional and Recurrent Neural Networks.
[13] Face-spoofing 2D-detection based on Moiré-pattern Analysis.
[14] Which Encoding is the Best for Text Classification in Chinese.
[15] Accurate Modeling and Efficient Estimation of the Print-capture Channel with Application in Barcoding.
[16] Non-stationary Texture Synthesis by Adversarial Expansion.
[17] K3M: A Universal Algorithm for Image Skeletonization and a Review of Thinning Techniques.
[18] Deep Residual Learning for Image Recognition.
[19] Deep Context-aware Descreening and Rescreening of Halftone Images.
[20] Very Deep Convolutional Networks for Large-scale Image Recognition.
[21] Perceptual Losses for Real-time Style Transfer and Super-resolution.
[22] Image Style Transfer using Convolutional Neural Networks.
[23] Light Field Messaging with Deep Photographic Steganography.
[24] Noiseware for Adobe Photoshop.
[25] Technique for Generating Additional Colors in a Halftone Color Image through Use of Overlaid Primary Colored Halftone Dots of Varying Size.
[26] A Copy-Proof Scheme Based on the Spectral and Spatial Barcoding Channel Models.
[27] A Database for Digital Image Forensics of Recaptured Document.
[28] Image-to-image Translation with Conditional Adversarial Networks.
[29] Image Quality Assessment: from Error Visibility to Structural Similarity.
[30] TextureGAN: Controlling Deep Image Synthesis with Texture Patches.
[31] Authenticity Recognition of Documents (Recapture, PS, etc.): Face++ Artificial Intelligence Open Platform.
[32] ID Identification and Risk Detection.
[33] Card OCR Recognition and Recapture, PS, Copy Alerting.
[34] Rich Models for Steganalysis of Digital Images.
[35] Identification of Various Image Operations using Residual-based Features.
[36] Face Spoof Detection with Image Distortion Analysis.