key: cord-0057846-be2sm4n5
authors: Souibgui, Mohamed Ali; Kessentini, Yousri; Fornés, Alicia
title: A Conditional GAN Based Approach for Distorted Camera Captured Documents Recovery
date: 2021-02-22
journal: Pattern Recognition and Artificial Intelligence
DOI: 10.1007/978-3-030-71804-6_16
sha: 1a79f4fae66a42a399ed044278965e5a3d94a239
doc_id: 57846
cord_uid: be2sm4n5

Many existing documents are digitized using smartphone cameras. These are highly vulnerable to capture distortions (perspective angle, shadow, blur, warping, etc.), making them hard to read by a human or by an OCR engine. In this paper, we tackle this problem by proposing a conditional generative adversarial network that maps the distorted images from their domain into a readable domain. Our model integrates a recognizer in the discriminator part to better distinguish the generated images. Our proposed approach proves able to enhance highly degraded images from their degraded condition into a cleaner and more readable form.

With the increasing daily use of smartphones and the advancement of their applications, they are starting to replace other tools and machines in many different tasks, such as scanning. Nowadays, a smartphone can be used to digitize a paper document by simply taking a photo with its camera. Indeed, smartphones allow scanning anywhere, unlike a classic scanning machine, which is not mobile due to its size and weight. However, despite the mobility advantage, problems occur in most camera-based scans: bad perspective angles, shadows, blur, light imbalance, warping, etc. [17]. Consequently, the text extracted from these document images by directly using a standard Optical Character Recognition (OCR) system becomes unreliable. Lately, thanks to the success of deep and machine learning models, some recent works show a higher robustness when reading distorted documents (at line level). However, some of these methods apply a preprocessing step to segment the scanned text images into separate lines. For example, [2] applied Long Short-Term Memory (LSTM) networks directly on the gray-scale text line images (without removing the distortions), to avoid error-prone binarization of blurred documents, as the authors claim. Similarly, [3] used Convolutional Neural Networks (CNN) to extract the features from the lines and pass them through an LSTM to read the text. All these approaches indeed lead to better performance compared to a standard OCR engine in this specific domain (i.e. distorted line images). Those neural networks can be seen as direct mapping functions from the distorted lines to the text. This means that they do not provide a clean version of the line images to be read by a human, or by the widely used OCR systems, which are much more powerful when dealing with clean images because they are trained on a huge amount of data from different domains and languages. For this reason, we believe that restoring the line images, i.e. mapping them from the distorted domain to a readable domain (by an OCR system or by a human), is a better solution. Figure 1 illustrates our approach: a preprocessing module to improve the posterior reading step (either manual or automatic).

Fig. 1. The proposed reading process: the role of our system is to preprocess the images to be read by an OCR system or by a human.
Since OCR accuracy has always largely depended on the preprocessing step, which is generally the first stage of any pattern recognition problem, a lot of research has addressed this preprocessing stage (i.e. document binarization and enhancement) during the last decades. The goal of this step is to transform the document image into a better or cleaner version. In our case, this means removing (or minimizing) the distortions in these lines (e.g. shadows, blur and warping). The most common step to clean a text image is binarization, which is usually done by finding either local or global thresholds to separate the text pixels from the distorted ones (including background noise) using classic Image Processing (IP) techniques [18, 19, 21]. These approaches can be used to remove shadows and fix lighting distortions, but they usually fail to restore blurred images or to fix the baselines. Thus, machine learning techniques for image domain translation have recently been used for this purpose. These methods mainly consist of CNN auto-encoders [4, 14, 16] and Generative Adversarial Networks (GANs) [7, 11, 22, 23]. The latter lead to better performance than the classic IP techniques because they can handle more complex distortion scenarios such as densely watermarked [22], shadowed [8, 13], highly blurred [10, 22] and warped [15] document images. However, despite the success of the mentioned machine learning approaches for image domain translation, they still address those distortion scenarios separately. In contrast, in this paper we provide a single model to solve different types of degradation in camera captured documents [6, 17]. Moreover, in those image domain translation approaches, the goal is to map an image to a desired domain relying only on visual, pixel-level information. In our case, when translating the text images, they should not only look clean, but also be legible. It must be noted that, sometimes, the model could consider the resulting images as text when in fact they are just random pixels that emulate the visual shape characteristics of text, or random characters that construct a wrong and meaningless script. For this reason, current machine learning text generation models use a recognition loss in addition to the visual loss to validate the readability of a generated text image [1, 12]. Similarly, we add a recognizer to our proposed conditional GAN model to guide the generator in producing readable images (by a human or by an OCR system) when translating them from the distorted domain to the clean domain. This simple idea leads to a better recovery of our distorted lines. The rest of the paper is organized as follows. Our proposed model is described in the next section. Afterwards, we evaluate it against related approaches in Sect. 3. Finally, a brief conclusion is given in Sect. 4.

The proposed architecture is illustrated in Fig. 2. It is mainly composed of three components: a generator G, a discriminator D (with trainable parameters θ_G and θ_D, respectively) and an OCR system R, which is not trainable since it is only used to validate the generations. It must be noted that we use the same generator and discriminator architectures as [22], because of the superiority they showed in document enhancement tasks. During training, the generator takes as input the distorted image, denoted I_d, and outputs a generated image I_g, hence I_g = G_{θ_G}(I_d).
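As a rough illustration of these three roles, the following Python sketch (assuming PyTorch) defines toy stand-ins for G and D and shows the mapping I_g = G_{θ_G}(I_d). The names TinyGenerator and TinyDiscriminator are hypothetical placeholders; the real layer configurations are those listed in Tables 1 and 2, not the ones shown here.

import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    # Stand-in for the U-Net generator G (trainable parameters θ_G).
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, distorted):
        # I_g = G_{θ_G}(I_d): map a distorted grey-scale line to a cleaned one.
        return self.net(distorted)

class TinyDiscriminator(nn.Module):
    # Stand-in for D (trainable parameters θ_D); it receives image channels
    # plus a CER map stacked into a 3-channel volume (described below).
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(16, 1, 3, stride=2, padding=1), nn.Sigmoid())

    def forward(self, volume):
        return self.net(volume)  # a matrix of "real" probabilities

# The recognizer R (e.g. Tesseract) is kept frozen: it only scores generations.
generator, discriminator = TinyGenerator(), TinyDiscriminator()
I_d = torch.rand(1, 1, 64, 512)   # a distorted grey-scale line image
I_g = generator(I_d)              # cleaned line image, same size as I_d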
During training, the generated image is then passed through the recognizer (OCR system) to obtain the recognition quality, measured by the Character Error Rate (CER): CER_g = R(I_g). After that, a matrix with the same shape as I_g is created and filled with the resulting CER_g. This matrix is concatenated with I_g along the depth and passed to the discriminator with the label Fake to train it. The discriminator also sees the Ground Truth (GT) images I_gt, which are concatenated with a CER close to zero and labeled Real. Clearly, concatenating a matrix with the same number of pixels as the generated image could be replaced by attaching a simple loss directly to the CER and forcing it to be reduced. However, the choice of a CER matrix was made to keep the method extendable to measuring the error rate of each word (or even each character or pixel) separately. Thus, we can provide better feedback to the model, so that it can focus on enhancing the parts of the image with a high CER (which can be read from the matrix), while keeping the parts of the line that were correctly recovered (with a low CER in the matrix). The discriminator is then used to predict the degree of 'reality' (i.e. how realistic) of the generated image, where P(Real) = D_{θ_D}(I_g, CER_g). We noted that it is better to assign a high CER to the GT images at the beginning of the training stage and then start decreasing it after some epochs. Thus, we start with a weak discriminator that is progressively strengthened in parallel with the generator to obtain a better adversarial training. The whole adversarial training can hence be formalized with the following loss:

min_{θ_G} max_{θ_D} L_adv(θ_G, θ_D) = E_{I_gt}[ log D_{θ_D}(I_gt, CER_gt) ] + E_{I_d}[ log(1 − D_{θ_D}(G_{θ_G}(I_d), CER_g)) ]

To speed up the convergence of the generator parameters θ_G, we use an additional loss, the usual Binary Cross Entropy (BCE) between the generated images and the ground truth images. The whole loss becomes:

L = L_adv + L_BCE(I_g, I_gt)

For a better understanding, we describe in what follows the architecture of each component.

Similar to [22], the generator follows the U-Net encoder-decoder architecture detailed in [20]. It consists of 17 fully convolutional layers arranged in an encoder-decoder fashion: 8 layers for the encoder (down-sampling with max-pooling every two layers) up to the 9th (bottleneck) layer, followed by 8 layers for the decoder (up-sampling every two layers), with skip connections (concatenations between corresponding encoder and decoder layers). Table 1 presents the architecture. As can be seen, the output is an image with 1 channel, since we produce a grey-scale image.

We use Tesseract 4.0 as the recognizer. This OCR engine version is based on deep learning techniques (LSTM), which show good recognition performance. The recognizer takes an image as input and outputs its predicted text. Nevertheless, any other OCR system could be used for this purpose.

The discriminator is composed of 6 convolutional layers, described in Table 2, which output a 2D matrix of probabilities denoting how realistic the generated image is. The discriminator receives three inputs: the degraded image, its cleaned version (ground truth or cleaned by the generator) and the obtained CER. These inputs are concatenated into an H × W × 3 volume. This volume is then propagated through the model, ending up as an H/16 × W/16 × 1 matrix in the last layer. This matrix contains probabilities that should be 1 if the clean image is the ground truth and 0 if it comes from the generator. Therefore, the last layer uses a sigmoid activation function.
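To make the training procedure concrete, the following sketch shows one training iteration with the CER-conditioned discriminator, reusing the hypothetical TinyGenerator/TinyDiscriminator stand-ins from the earlier sketch. The CER schedule for the GT images, the optimizers, the image normalization to [0, 1] and the equal weighting of the adversarial and BCE terms are assumptions made for illustration, not the authors' exact settings; the OCR call uses pytesseract and the edit distance uses the editdistance package.

import torch
import torch.nn as nn
import editdistance
import pytesseract
from PIL import Image

bce = nn.BCELoss()

def recognizer_cer(line_img: torch.Tensor, gt_text: str) -> float:
    # CER_g = R(I_g): run the frozen OCR engine on the generated line and
    # compute (S + D + I) / N against the ground-truth transcription.
    arr = (line_img.detach().squeeze().clamp(0, 1) * 255).byte().cpu().numpy()
    pred = pytesseract.image_to_string(Image.fromarray(arr)).strip()
    return editdistance.eval(pred, gt_text) / max(len(gt_text), 1)

def disc_input(degraded, cleaned, cer_value):
    # Fill an H x W map with the scalar CER and stack it with the degraded
    # and cleaned line images into the H x W x 3 volume D receives.
    cer_map = torch.full_like(cleaned, cer_value)
    return torch.cat([degraded, cleaned, cer_map], dim=1)

def training_step(G, D, opt_g, opt_d, I_d, I_gt, gt_text, cer_gt=0.0):
    # cer_gt is the CER attached to GT images: high in early epochs, then
    # decreased towards zero (assumed schedule, controlled by the caller).
    I_g = G(I_d)
    cer_g = recognizer_cer(I_g, gt_text)

    # Discriminator step: GT volume labelled Real, generated volume Fake.
    p_real = D(disc_input(I_d, I_gt, cer_gt))
    p_fake = D(disc_input(I_d, I_g.detach(), cer_g))
    d_loss = bce(p_real, torch.ones_like(p_real)) + \
             bce(p_fake, torch.zeros_like(p_fake))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: fool the discriminator, plus the pixel-wise BCE
    # between the generated image and the GT image (both in [0, 1]).
    p_fake = D(disc_input(I_d, I_g, cer_g))
    g_loss = bce(p_fake, torch.ones_like(p_fake)) + bce(I_g, I_gt)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()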
Once the training is finished, the discriminator is no longer used: given a distorted image, we only use the generative network to recover it. During training, however, the discriminator forces the generator to produce a realistic result, in addition to the BCE loss with the GT images.

As mentioned above, the goal of this study is to provide a mapping from the distorted document into a clean and readable version. For evaluation, we compare in this section our proposed approach with the relevant methods that can handle the same task. For a fair comparison, all the methods are tested on the same dataset containing the distorted line images and their clean versions. This data was taken from SmartDoc-QA [17], which consists of document images captured with smartphone cameras under varying capture conditions (light, shadow, different types of blur and perspective angles). SmartDoc-QA is categorized into three subsets of documents: contemporary documents, old administrative documents and shop receipts. For computational reasons, we use only the contemporary documents category in our experiments. An example of those documents is presented in Fig. 3. A preprocessing step was performed to segment those documents at line level and construct our desired dataset. First, we extract the document paper from the background by applying a Canny edge detector [5] and finding the four corners of the document. Then, a geometric transformation is applied for dewarping. Finally, a horizontal projection is applied to detect the lines. This results in 17,000 line image pairs (distorted and clean); from them, 10,200 pairs were taken for training the different approaches and 6,800 pairs for testing. The training was done for 80 epochs with a batch size of 32, using the Adam optimization algorithm with a learning rate of 1e−4.

We study the performance of our method by comparing it with the following existing approaches, which are widely used for similar tasks:

- DE-GAN [22]: This method uses the same architecture as ours, but without a recognizer (only a generator and a discriminator). In this way, we can evaluate whether adding the recognizer helps to provide cleaner documents.
- Pix2Pix-HD [23]: This method extends [11] to provide higher-resolution and more realistic images. Both methods fall within the widely used approaches for translating images between different domains.
- CNN [10]: In this approach, a CNN is used to clean the image. Concretely, it was proposed for text image deblurring.

The comparison is performed using two types of metrics. The first type measures the visual similarity between the predicted images and the GT images; for this purpose, we use the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM) [9]. The second type measures the readability of the provided image; for this purpose, we simply use the CER metric after passing the cleaned images through Tesseract 4.0. The CER metric is defined as CER = (S + D + I) / N, where S is the number of substitutions, D the number of deletions, I the number of insertions and N the length of the ground truth. So, the lower the CER value, the better.
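The short sketch below illustrates how these metrics can be computed, using scikit-image for PSNR and SSIM and pytesseract plus editdistance for the CER; the image scaling and the exact Tesseract invocation are assumptions about the evaluation setup rather than the authors' script.

import numpy as np
import editdistance
import pytesseract
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def visual_metrics(cleaned: np.ndarray, gt: np.ndarray):
    # PSNR and SSIM between a cleaned line and its GT, both uint8 grey-scale.
    psnr = peak_signal_noise_ratio(gt, cleaned, data_range=255)
    ssim = structural_similarity(gt, cleaned, data_range=255)
    return psnr, ssim

def readability_metric(cleaned: np.ndarray, transcription: str) -> float:
    # CER = (S + D + I) / N: read the cleaned line with Tesseract 4.0 and
    # compare its output with the ground-truth transcription.
    predicted = pytesseract.image_to_string(Image.fromarray(cleaned)).strip()
    return editdistance.eval(predicted, transcription) / max(len(transcription), 1)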
The obtained results are presented in Table 3. As can be seen, cleaning the distorted images with the different approaches leads to a higher SSIM and PSNR compared to the degraded lines (without any cleaning). This means that we are able to recover a visually enhanced version of the line images using any of these approaches, with a slightly better performance for the CNN [10] approach. However, this does not mean that all these approaches lead to better versions of the line images, because the text itself is also an important factor when evaluating the cleaning. In fact, the CER of the distorted images is much better than that of the images cleaned with the CNN, pix2pix-HD and DE-GAN approaches. As stated before, the reason is that with those methods the text degrades during the mapping to the clean space, since the model only enhances the visual form of the distorted line images. In contrast, when using our proposed approach, we observe that the CER also improves by 7% compared to the distorted images. This demonstrates the utility of the recognition-rate input in our proposed model, which cleans the image while taking text preservation into account. Thus, from these results, we can conclude that, among the compared approaches, our model is the best way to perform the distorted-to-clean text image mapping.

To compare the performance of the different methods visually, we show in what follows some qualitative results. In Fig. 4, we present the recovery of a slightly distorted line. It could be correctly read by the OCR even without any preprocessing, since the distortion only consists of the baseline (due to the warped document) and the background color. It can be observed from the figure that applying the CNN, pix2pix-HD and DE-GAN methods fixes the baseline and cleans the background, but deteriorates the text quality and leads to some character errors when read by the OCR. In contrast, our proposed approach is the one that best preserves the text readability while visually enhancing the text line. Another example of a slightly blurred and warped line is presented in Fig. 5. Although the OCR result on the distorted image is still similar to that on our generated line (with a clear superiority over the CNN and pix2pix-HD methods), our model clearly produces an image that is much easier for a human to read, since it successfully deblurs and unwarps it. We also note in this example that the regular DE-GAN (the same architecture as ours, except for the recognizer) results in a weak discriminator, which can be fooled by a wrong generation. This can be observed by comparing the visual similarity between our approach and DE-GAN: when reading the text, DE-GAN makes more character errors than our generated text. Next, we show the recovery of some highly distorted lines in Fig. 6. In this case, we tried to recover two distorted lines containing heavy blur, shadows and warping. Obviously, reading those lines directly with Tesseract is the worst option, since it clearly leads to a bad result by missing a lot of words, which yields a high CER. However, by applying the different cleaning approaches, we are able to remove the distortion and produce better text. As in the previous experiments, our proposed model achieves the highest results by giving the best line image recovery. Our produced image is visually close to the GT image, with preserved text, as can be seen from its low CER compared to the other methods. Finally, it is worth mentioning that our proposed model sometimes fails to produce readable lines.
This happens when dealing with extremely distorted lines. Some examples of this particular case are presented in Fig. 7. As can be seen, despite the fact that some words have been correctly enhanced and recognized by the OCR after applying our method, the line images are still visually degraded and unsatisfactory. Of course, this is due to the extreme complexity of fixing such lines, which are hard to read even with the naked human eye.

In this paper we have proposed an approach for recovering distorted camera captured documents. The goal is to provide a clean and readable version of the document images. Our method integrates an OCR into a cGAN model to preserve readability while translating the document domain. As a result, our method leads to a better CER compared to the widely used methods for this task. As future work, our proposed model could be extended to handle full pages instead of lines. Furthermore, the CER matrix provided to the discriminator could include the error rates at a local level instead of the CER of the whole text line. This could help the discriminator provide better feedback to the generative model, and thus improve the overall model performance. Finally, it would be interesting to test the model on degraded historical handwritten documents, using, of course, a Handwritten Text Recognition system instead of the OCR system.

References.
1. Adversarial generation of handwritten text images conditioned on sequences
2. High performance OCR for camera-captured blurred documents with LSTM networks
3. Text recognition in document images obtained by a smartphone based on deep convolutional and recurrent neural network
4. A selectional auto-encoder approach for document image binarization. Pattern Recogn.
5. A computational approach to edge detection
6. SmartATID: a mobile captured Arabic text images dataset for multi-purpose recognition tasks
7. StarGAN: unified generative adversarial networks for multi-domain image-to-image translation
8. Image shadow removal using end-to-end deep convolutional neural networks
9. Image quality metrics: PSNR vs. SSIM
10. Convolutional neural networks for direct text deblurring
11. Image-to-image translation with conditional adversarial networks
12. GANwriting: content-conditioned generation of styled handwritten word images
13. Shadow removal via shadow image decomposition
14. LLNet: a deep autoencoder approach to natural low-light image enhancement
15. DocUNet: document image unwarping via a stacked U-Net
16. Deep networks for degraded document image binarization through pyramid reconstruction
17. SmartDoc-QA: a dataset for quality assessment of smartphone captured document images - single and multiple distortions
18. An Introduction to Digital Image Processing
19. A threshold selection method from gray-level histograms
20. U-Net: convolutional networks for biomedical image segmentation
21. Adaptive document image binarization
22. DE-GAN: a conditional generative adversarial network for document enhancement
23. High-resolution image synthesis and semantic manipulation with conditional GANs

Acknowledgment. This work has been partially supported by the Swedish Research Council (grant 2018-06074, DECRYPT), the Spanish project RTI2018-095645-B-C21, the Ramon y Cajal Fellowship RYC-2014-16831 and the CERCA Program/Generalitat de Catalunya.