title: Improving Accuracy and Speeding Up Document Image Classification Through Parallel Systems
authors: Ferrando, Javier; Domínguez, Juan Luis; Torres, Jordi; García, Raúl; García, David; Garrido, Daniel; Cortada, Jordi; Valero, Mateo
date: 2020-06-15
journal: Computational Science - ICCS 2020
DOI: 10.1007/978-3-030-50417-5_29

This paper presents a study showing the benefits of the EfficientNet models compared with heavier Convolutional Neural Networks (CNNs) in the Document Classification task, an essential problem in the digitalization process of institutions. We show on the RVL-CDIP dataset that we can improve previous results with a much lighter model, and we present its transfer learning capabilities on a smaller in-domain dataset such as Tobacco3482. Moreover, we present an ensemble pipeline which is able to boost image-only input by combining the image model predictions with those generated by a BERT model on the text extracted by OCR. We also show that the batch size can be effectively increased without hindering accuracy, so that the training process can be sped up by parallelizing over multiple GPUs, decreasing the computational time needed. Lastly, we report the training performance differences between the PyTorch and TensorFlow Deep Learning frameworks.

Document digitization has become a common practice in a wide variety of industries that deal with vast amounts of archives. Document classification is a task to face when trying to automate their document processes, but high intra-class and low inter-class variability between documents has made this a challenging problem. First attempts focused on structural similarity between documents [40] and on feature extraction [12, 24, 30] to find differentiating characteristics of each class. The combination of both approaches has also been tested [14]. Several classic machine learning techniques have been applied to this problem, e.g. the K-Nearest Neighbor approach [7], Hidden Markov Models [19] and Random Forest classifiers [24, 29] using SURF local descriptors, before Convolutional Neural Networks (CNNs) came onto the scene.

With the rise of Deep Learning, researchers have tried deep neural networks to improve the accuracy of their classifiers. CNNs were proposed in past works, initially in 2014 by Kang et al. [26], who started with a simple 4-layer CNN trained from scratch. Then, transfer learning was demonstrated to work effectively [1, 21] by using a network pre-trained on ImageNet [17], and the latest models have become increasingly heavy (greater number of parameters) [2, 16, 46], as shown in Table 1, with the speed and computational resource drawbacks this entails. Recently, textual information has been used on its own or in combination with the visual features extracted by the previously mentioned models. Although Optical Character Recognition (OCR) is prone to errors, particularly when dealing with handwritten documents, the use of modern Natural Language Processing (NLP) techniques has been demonstrated to boost classifier performance [5, 6, 35].
The contributions of this paper can be summarized in two main topics:

- Algorithmic performance: we propose a model and a training procedure to deal with images and text that outperforms the state-of-the-art in several settings and is lighter than any previous neural network used to classify the BigTobacco dataset, the most popular benchmark for Document Image Classification (Table 1).
- Training process speed up: we demonstrate the ability of these models to maintain their performance while saving a large amount of time by parallelizing over several GPUs. We also show the performance differences between the two most popular Deep Learning frameworks (TensorFlow and PyTorch) when using their own libraries dedicated to this task.

The Document Image Classification task tries to predict the class to which a document belongs by analyzing its image representation. This challenge can be tackled in two ways: as an image classification problem and as a text classification problem. The former looks for patterns in the pixels of the image to find elements such as shapes or textures that can be associated with a certain class. The latter tries to understand the language written in the document and relate it to the different classes.

As mentioned earlier, in this work we make use of two publicly available datasets containing images of scanned documents from USA tobacco companies, published by the Legacy Tobacco Industry Documents and created by the University of California San Francisco (UCSF). We find these datasets a good representation of what enterprises and institutions may face, based on the quality and type of classes. Furthermore, they have been the go-to datasets in this research field since 2014, which allows us to compare results.

RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) is a 400,000-document sample (BigTobacco from now onwards) presented in [21] for document classification tasks. This dataset contains the first page of each of the documents, labeled in 16 different classes with an equal number of elements per class. A smaller sample containing 3482 images was proposed in [24] as Tobacco3482 (SmallTobacco henceforth). This dataset comprises documents belonging to 10 classes, not uniformly distributed.

The proposed methods in this work are based on supervised Deep Learning, where each document is associated with a class (label) so that the algorithms are trained by minimizing the error between the predictions and the truth. Deep Learning is a branch of machine learning that deals with deep neural networks, where each layer is trained to extract higher-level representations of the previous ones. These models are trained by iteratively solving an unconstrained optimization problem. In each iteration, a random batch of training data is fed into the model to compute the loss function value. Then, the gradient of the loss function with respect to the weights of the network is computed (backpropagation) and the weights are updated in the negative direction of the gradient. These networks are trained until they converge to a minimum of the loss function (a minimal sketch of this loop is given below).

The field where machines try to get an understanding of visual data is known as Computer Vision (CV). One of the best-known tasks in CV is image classification. In 2010, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was introduced, a competition dealing with a dataset of 1.2 million images belonging to 1000 classes.
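Returning to the generic training loop described above, the following is a minimal sketch in PyTorch; `model` and `train_loader` are hypothetical placeholders for any classifier and dataset, and the hyperparameters are illustrative rather than the ones used in our experiments:

```python
# Minimal sketch of the generic supervised training loop described above.
# `model` and `train_loader` are hypothetical placeholders.
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, epochs: int = 20, lr: float = 0.1):
    criterion = nn.CrossEntropyLoss()                      # classification loss
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for epoch in range(epochs):
        for images, labels in train_loader:                # random mini-batch
            optimizer.zero_grad()
            loss = criterion(model(images), labels)        # forward pass + loss
            loss.backward()                                # backpropagation
            optimizer.step()                               # step in the negative
                                                           # gradient direction
```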
In 2012, the first CNN-based model significantly reduced the error rate, marking the beginning of the explosion of deep neural networks. From then onwards, deeper networks have become the norm, and CNN-based networks have been the most used architectures in Computer Vision. Their main operation is the convolution, which consists of a succession of dot products between the vector representations of the input space ($L_q \times B_q \times d_q$) and the filters ($F_q \times F_q \times d_q$). We slide each filter over the input volume, obtaining an activation map of dimensions $L_{q+1} = (L_q - F_q + 1)$ and $B_{q+1} = (B_q - F_q + 1)$. The output volume then has dimension $L_{q+1} \times B_{q+1} \times d_{q+1}$, where $d_{q+1}$ refers to the number of filters used. We refer the reader to [3] (whose notation we adopt for simplicity) for a more detailed explanation. Usually, each convolution layer is paired with an activation layer, where an activation function is applied to the whole output volume. To reduce the number of parameters of the network, a pooling layer is typically placed between convolution operations. The pooling layer takes a region of $P_q \times P_q$ in each of the $d_q$ activation maps and performs an arithmetic operation; the most used pooling layer is max-pooling, which returns the maximum value of that region.

The features learned from the OCR output are obtained by means of Natural Language Processing techniques. NLP is the field that deals with the understanding of human language by computers, capturing underlying meanings and relationships between words. Machines handle words through real-valued vector representations. Word2Vec [34] showed that a vector can represent semantic and syntactic relationships between words. CoVe [32] introduced the concept of context-based embeddings, where the same word can have a different vector representation depending on the surrounding text. ELMo [36] followed CoVe but with a different training approach, predicting the next word in a text sequence (Language Modelling), which made it possible to train on large available text corpora. Depending on the task (such as text classification, named entity recognition...), the output of the model can be treated in different ways, and custom layers can be added on top of the features extracted by these NLP models. For instance, ULMFiT [23] introduced a language model and a fine-tuning strategy to effectively adapt the model to various downstream tasks, which pushed transfer learning forward in the NLP field. Lately, the Transformer architecture [47] has dominated the scene, with the bidirectional Transformer encoder (BERT) [18] recently establishing state-of-the-art results over several downstream tasks.

Several ways of evaluating models have appeared in past years for document classification on the Legacy Tobacco Industry Documents [31]. Some authors have tested their models on the large-scale BigTobacco sample. Others have used the smaller SmallTobacco version, which could be seen as a more realistic scale of annotated data that users might be able to find. Lastly, transfer learning from in-domain datasets has been tested by using BigTobacco to pre-train the models before fine-tuning on SmallTobacco. Table 2 summarizes the results of previous works in the different categories over time. First results in the Deep Learning era were mainly based on CNNs using transfer learning techniques.
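As an illustration of this transfer-learning recipe (and of the one applied later in this paper, where all layers but the classification head are frozen), here is a minimal sketch. It assumes torchvision's EfficientNet implementation, which post-dates the original experiments, and the 16 BigTobacco classes:

```python
# Minimal sketch of ImageNet transfer learning: load pre-trained weights,
# freeze the backbone and retrain only a new classification head.
# torchvision's EfficientNet is an assumption for illustration purposes.
import torch.nn as nn
from torchvision.models import efficientnet_b0

model = efficientnet_b0(weights="IMAGENET1K_V1")   # ImageNet-pre-trained weights

for param in model.parameters():                   # freeze the whole backbone
    param.requires_grad = False

# Replace the 1000-class ImageNet head with a 16-class document head;
# the new layer's parameters are trainable by default.
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 16)
```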
Multiple networks were trained on specific sections of the documents [21] to learn region-based high-dimensional features, later compressed via Principal Component Analysis (PCA). The use of multiple Deep Learning models was also exploited by Das et al., who used an ensemble as a meta-classifier [16]. A stack of VGG-16 [41] networks with 5 different classifiers has been proposed, one of them trained on the full document and the others specifically on the header, footer, left body and right body; a Multi-Layer Perceptron (MLP) was the ensemble that performed best. A committee of models with an SVM as the ensemble was also proposed [37]. The addition of content-based information has been investigated on SmallTobacco by extracting text through OCR and embedding the obtained features into the original document images as a phase prior to training [35]. Lately, a MobileNetV2 architecture [38] together with a 2D CNN [27, 49] taking FastText embeddings [9, 25] as input has achieved the best results on SmallTobacco [6]. A study of several CNNs was carried out in [2], where the VGG-16 architecture was found optimal. Afzal et al. also demonstrated that transfer learning from an in-domain dataset like BigTobacco improves the results on SmallTobacco by a large margin. This was further investigated by adding content-based information through a 2D CNN with ranking textual features (ACC2) over the OCR-extracted text.

To the best of our knowledge, there is no study on the use of multiple GPUs in the training process for the Document Image Classification task. However, parallelizing a computer vision task has been shown to work properly using ResNet-50, a widely used network that usually gives good results despite its low-complexity architecture. Several training procedures have been demonstrated to work effectively with this model [4, 20]: a learning rate proportional to the batch size, a learning rate warmup, batch normalization and an SGD-to-RMSProp optimizer transition are some of the techniques exposed in these works. A study of distributed training methods using the ResNet-50 architecture on an HPC cluster is shown in [10, 11]. To know more about the algorithms used in this field, we refer to [8].

In this section we present the models used and a brief explanation of them. We also show the training procedure used for both BigTobacco and SmallTobacco and the pipeline of our approach to the problem.

EfficientNets [45] are a set of light CNNs designed to scale up in a structured manner. The network's depth ($d$), width ($w$) and resolution ($r$) are defined as $d = \alpha^{\phi}$, $w = \beta^{\phi}$ and $r = \gamma^{\phi}$, where $\phi$ is the compound scaling coefficient. The optimization problem is set by constraining $\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2$ with $\alpha \geq 1$, $\beta \geq 1$, $\gamma \geq 1$. By means of a grid search over $\alpha$, $\beta$, $\gamma$ with the AutoML MNAS framework [44] and fixing $\phi = 1$, a baseline model (B0) is generated, optimizing FLOPs and accuracy. Then, the baseline network is scaled up uniformly by fixing $\alpha$, $\beta$, $\gamma$ and increasing $\phi$ (a short illustration is given below). We find that scaling the resolution parameter as proposed in [45] does not improve the accuracy obtained. In our experiments in Sect. 5 we proceed with an input image size of 384 × 384, which corresponds to a resolution $r = 1.71$, as proposed by Tensmeyer et al. in [46] for the AlexNet architecture [28]. The main building block of the EfficientNets is the mobile inverted bottleneck convolution (MBConv) [38, 44].
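As a minimal illustration of the compound scaling rule above, the following sketch uses the base coefficients reported in [45]; the loop over $\phi$ is purely illustrative and does not claim to reproduce the exact B1-B7 configurations:

```python
# Illustration of EfficientNet compound scaling: the depth, width and
# resolution multipliers grow as alpha^phi, beta^phi and gamma^phi.
# Coefficients below are those reported in [45] for the baseline.
alpha, beta, gamma = 1.2, 1.1, 1.15

def scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

# The grid-search constraint alpha * beta^2 * gamma^2 ~ 2 means FLOPs grow
# roughly as 2^phi when the network is scaled up.
assert abs(alpha * beta ** 2 * gamma ** 2 - 2.0) < 0.1

for phi in range(1, 5):                   # larger phi -> larger model variant
    d, w, r = scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```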
The MBConv block is formed by two linear bottlenecks connected through both a shortcut connection and an intermediate expansion layer with a depthwise separable convolution (3 × 3) [13]. Probabilities $P(\mathrm{class} \mid FC)$ are obtained by applying the softmax function on top of the fully connected layer $FC$ of the EfficientNet model.

We train the EfficientNets (pre-trained on ImageNet) on BigTobacco using Stochastic Gradient Descent for 20 epochs with a learning rate warmup strategy [22]; specifically, we follow STLR (Slanted Triangular Learning Rates) [23], which linearly increases the learning rate at the beginning of the training process and linearly decreases it after a certain number of iterations. We choose the reference learning rate $\eta$ following the formula proposed in [20] and used in [4] and [22]; specifically, we set $\eta = 0.2 \cdot \frac{nk}{256}$, where $k$ denotes the number of workers (GPUs) and $n$ the number of samples per worker. Figure 1 shows the multi-GPU training procedure used to obtain EfficientNet$_{BigTobacco}$, which represents the EfficientNet model pre-trained on BigTobacco. EfficientNet is loaded with ImageNet weights (EfficientNet$_{ImageNet}$) and then placed on different GPUs within the same node.

We fine-tune the pre-trained models on SmallTobacco by freezing the entire network except the last softmax layer. Just 5 epochs are enough to reach the peak of accuracy. STLR is used again, this time with $\eta = 0.8 \cdot \frac{nk}{256}$. Since only the last layer is trained, we reduce the risk of catastrophic forgetting [33]. The final fine-tuned model is represented as EfficientNet$_{BigTobacco}$ in Fig. 1.

Predictions on the text extracted by the Tesseract OCR engine [42] are obtained by means of the BERT model [18]. BERT is a multi-layer bidirectional Transformer encoder model pre-trained on a large corpus. In this work we use a modification of the original pre-trained BERT$_{BASE}$ version: we reduce the number of BERT layers to 6, since we find less variance in the final results and faster training/inference times. The output vector size is kept at 768 and the maximum length of the input sequence is set to 512 tokens. The first token of the sequence is defined as [CLS], while [SEP] is the token used at the end of each sequence. A fully connected layer is added to the final hidden state $h_{[CLS]}$ of the [CLS] token of the BERT model, which is a representation of the whole sequence. Then, a softmax operation is performed, giving $P(\mathrm{class} \mid h_{[CLS]})$, the probability of the whole input sequence pertaining to a certain class.

The training strategies used in this paper are similar to the ones proposed in [43, 48]. We use a learning rate $\eta_B = 3\mathrm{e}{-5}$ for the embedding, pooling and encoder layers and a custom learning rate $\eta_C = 1\mathrm{e}{-6}$ for the layers on top of the BERT model. A decay factor $\xi = 1\mathrm{e}{-8}$ is used to gradually reduce the learning rate along the layers, $\eta_{l} = \xi \cdot \eta_{l-1}$. The ADAM optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and an $L_2$ weight decay factor of 0.01 is used. The dropout probability is set to 0.2. Just 5 epochs are enough to find the peak of accuracy, with a batch size of 6, the maximum we could use due to memory constraints.

In order to get the final enhanced prediction combining both the text and the image model, we use a simple ensemble as in [5]. In this work, $w_1 = w_2 = 0.5$ are found optimal. These parameters could be found by a grid search subject to $\sum_{i=1}^{N} w_i = 1$, with $N$ the number of models.
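A minimal sketch of this ensemble follows, assuming (as the weights suggest) a weighted average of the two models' class probabilities; `p_image` and `p_text` are hypothetical softmax outputs used only for illustration:

```python
# Minimal sketch of the weighted ensemble described above, assuming a
# weighted average of per-class probabilities (soft voting).
import torch

def ensemble(p_image: torch.Tensor, p_text: torch.Tensor,
             w1: float = 0.5, w2: float = 0.5) -> torch.Tensor:
    """Combine the two models' class probabilities; weights must sum to 1."""
    return w1 * p_image + w2 * p_text

# Hypothetical softmax outputs of the image (EfficientNet) and text (BERT) models.
p_image = torch.tensor([0.7, 0.2, 0.1])
p_text  = torch.tensor([0.4, 0.5, 0.1])
print(ensemble(p_image, p_text).argmax().item())  # predicted class: 0
```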
This ensembling procedure proves to be an effective solution when both models reach similar accuracy, and it allows us to avoid another training phase [6]. The whole process is depicted in Fig. 2.

In this section we compare the performance of the different EfficientNets on SmallTobacco and BigTobacco, as shown in Table 2. In order to compare with previous results on the SmallTobacco dataset, we divide the dataset following the procedure in [24]. Documents are split into training, test and validation sets containing 800, 2482 and 200 samples each. 10 different splits of the dataset are created by randomly sampling from the 3482 documents, such that 100 samples per class are guaranteed between the train and validation sets. In Fig. 4 we report the accuracy on SmallTobacco as the median over the 10 dataset splits, to compare with previous results. Accuracy on BigTobacco is reported on the test set. The BigTobacco dataset used in Sect. 5.3 is slightly modified: documents overlapping with SmallTobacco are removed. The top-performing models' accuracies are written down in Table 2.

We show in Fig. 3 the time it takes to train the different networks while using 1, 2, 3 or 4 GPUs in a single node. In order to take advantage of the multiple GPUs we use data parallelism, which consists of placing a copy of the model in each of them. Since every GPU shares the parameters, this is equivalent to having a single GPU with a larger batch size. The time needed to complete the entire training process with the B0 variant is ≈61.14% lower than with B4 (on 4 GPUs). The time reduction obtained by using multiple GPUs is clearly shown in the left plot of Fig. 3; for instance, EfficientNet-B0 benefits from a ≈75.4% time reduction after parallelizing over 4 GPUs. The total training time of the EfficientNets on the different numbers of GPUs is shown on the right side of Fig. 3. The best-performing model on the BigTobacco dataset is EfficientNet-B4, with 92.31% accuracy on the test set.

Accuracies of the EfficientNets pre-trained on BigTobacco and finally fine-tuned on SmallTobacco are depicted in the left plot of Fig. 4. Simpler models perform with less variability across the 10 random splits than the heavier ones. The best-performing model is EfficientNet-B1, achieving a new state-of-the-art accuracy of 94.04% as the median over the 10 splits.

In this work we also wanted to test the potential of light EfficientNet models on a small dataset such as SmallTobacco without transfer learning from an in-domain dataset, and to compare it with the previous state-of-the-art. Results given by our proposed method described in Sect. 4.3 are shown in the right plot of Fig. 4. Although we perform the tests over 10 different random splits to give a wider view of how these models behave, in order to compare with Audebert et al. [6] we calculate the average over 3 random splits, which gives us 89.47% accuracy. Every ensemble model achieves better accuracy than previous results, and again, there is almost no difference between the different EfficientNets' results.

Single-GPU training requires a huge amount of time, especially when dealing with heavy architectures such as EfficientNet-B4, which takes almost two days to complete the whole training phase. For this reason, experimenting with several workers is crucial to minimize the amount of time spent on these tasks. We test the same model and training procedure with two of the most used frameworks for training Deep Learning models, PyTorch and TensorFlow; both follow the synchronous data-parallel pattern sketched below.
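The following is a minimal sketch of this setup using PyTorch's DistributedDataParallel. The dataset, model, batch size and launch mechanism are hypothetical placeholders; only the learning-rate rule $\eta = 0.2 \cdot nk/256$ is taken from the training procedure described earlier:

```python
# Minimal sketch of synchronous data-parallel training with PyTorch's
# DistributedDataParallel. One process is launched per GPU (e.g. via torchrun);
# dataset, model and batch size are hypothetical placeholders.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup(rank, world_size, dataset, model, samples_per_gpu=64):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Each GPU gets a non-overlapping shard of the data every epoch, loaded
    # through page-locked (pinned) host memory by multiple worker processes.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=samples_per_gpu, sampler=sampler,
                        num_workers=4, pin_memory=True)

    # Identical model replica per GPU; NCCL all-reduce averages the gradients
    # across processes before each parameter update.
    model = DDP(model.cuda(rank), device_ids=[rank])

    # Reference learning rate scaled linearly with the global batch size:
    # eta = 0.2 * n * k / 256, with n samples per worker and k workers.
    eta = 0.2 * samples_per_gpu * world_size / 256
    optimizer = torch.optim.SGD(model.parameters(), lr=eta)
    return loader, model, optimizer
```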
In both cases we use their own APIs for synchronous distributed training over several GPUs by means of data parallelism, where training on each GPU runs in its own process. We use PyTorch's DistributedDataParallel and TensorFlow's tf.distribute.Strategy (tf.distribute.MirroredStrategy). In both libraries, data is loaded from disk to page-locked memory on each host, and from there to each GPU in parallel by means of multiple workers. Each GPU is ensured to get a minibatch with non-overlapping data. Every GPU holds an identical copy of the model and performs its own forward pass. Finally, NCCL is used as the backend to run the all-reduce algorithm, computing the gradients in parallel across GPUs before the model parameters are updated.

Since we have not been able to apply the shear transformation efficiently in TensorFlow, we show the results of both frameworks without that preprocessing step. For this experiment we use the B0, B2 and B4 EfficientNet models. The time it takes to train each model is shown on the left side of Fig. 5. PyTorch training is faster, and its speedup more linear, than TensorFlow's. Some of this difference could be due to the data loading process, which we have not fully optimized in the TensorFlow framework.

In this paper we have presented the use of EfficientNets for the Document Image Classification task and their scaling capabilities across several GPUs. By means of two versions of the Legacy Tobacco Industry Documents, a huge and a small dataset, we demonstrated a training process that obtains high accuracy on both of them. We have compared the different versions of the EfficientNets and raised the state-of-the-art classification accuracy to 92.31% on BigTobacco and 94.04% when fine-tuned on SmallTobacco. We consider B0 the best choice when computational resources are limited. We have also presented an ensemble method that adds the content extracted by OCR: a reduced version of the BERT model is trained and both models' predictions are combined to achieve a new state-of-the-art accuracy of 89.47%. Finally, we have tested the same image models and training procedures in TensorFlow and PyTorch, where we have observed similar speedup values when exploiting their libraries for distributed training. We have also tried distributed training over several GPU nodes by means of the Horovod framework [39]; however, the software stack on our IBM Power9 cluster is still in its early stages and we have not been able to obtain the desired results. Future work may focus on testing this approach. Future work may also evaluate the use of different OCR engines, as we suspect this could have a great impact on the quality of the text model predictions. With this work we also want to provide researchers with a benchmark in the Document Image Classification task, which can serve as a reference point to effortlessly test parallel systems in both PyTorch and TensorFlow.

References

[1] DeepDocClassifier: document classification with deep convolutional neural network. In: ICDAR.
[2] Cutting the error by half: investigation of very deep CNN and advanced training strategies for document image classification
[3] Neural Networks and Deep Learning: A Textbook
[4] Extremely large minibatch SGD: training ResNet-50 on ImageNet in 15 minutes
[5] Two stream deep network for document image classification
[6] Multimodal deep networks for text and image-based document classification
[7] Using tree-grammars for training set expansion in page classification
[8] Demystifying parallel and distributed deep learning: an in-depth concurrency analysis
[9] Enriching word vectors with subword information
[10] Scaling a convolutional neural network for classification of adjective noun pairs with TensorFlow on GPU clusters
[11] Distributed training strategies for a computer vision deep learning training algorithm on a distributed GPU cluster
[12] Structured document classification by matching local salient features
[13] Xception: deep learning with depthwise separable convolutions
[14] A clustering-based algorithm for automatic document separation
[15] What is the right way to represent document images
[16] Document image classification with intra-domain transfer learning and stacked generalization of deep convolutional neural networks
[17] ImageNet: a large-scale hierarchical image database
[18] BERT: pre-training of deep bidirectional transformers for language understanding
[19] Hidden tree Markov models for document image classification
[20] Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR
[21] Evaluation of deep convolutional nets for document image classification and retrieval
[22] Bag of tricks for image classification with convolutional neural networks
[23] Universal language model fine-tuning for text classification
[24] Structural similarity for document image classification and retrieval
[25] Bag of tricks for efficient text classification
[26] Convolutional neural networks for document image classification
[27] Convolutional neural networks for sentence classification
[28] ImageNet classification with deep convolutional neural networks
[29] Unsupervised classification of structurally similar document images
[30] Learning document structure for retrieval and classification
[31] Building a test collection for complex document information processing
[32] Learned in translation: contextualized word vectors
[33] Catastrophic interference in connectionist networks: the sequential learning problem
[34] Efficient estimation of word representations in vector space
[35] Embedded textual content for document image classification with convolutional neural networks
[36] Deep contextualized word representations
[37] Generalized stacking of layerwise-trained deep convolutional neural networks for document image classification
[38] MobileNetV2: inverted residuals and linear bottlenecks
[39] Horovod: fast and easy distributed deep learning in TensorFlow
[40] Document image retrieval based on layout structural similarity
[41] Very deep convolutional networks for large-scale image recognition
[42] An overview of the Tesseract OCR engine
[43] How to fine-tune BERT for text classification?
[44] MnasNet: platform-aware neural architecture search for mobile
[45] EfficientNet: rethinking model scaling for convolutional neural networks
[46] Analysis of convolutional neural networks for document image classification
[47] Attention is all you need
[48] To tune or not to tune?
[49] A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification

Acknowledgements
This work was partially supported by the Spanish Ministry of Science and Innovation and the European Regional Development Fund under contract TIN2015-65316-P, by the BSC-CNS Severo Ochoa program SEV-2015-0493, by grant 2017-SGR-1414 of the Generalitat de Catalunya, and by the research agreement CaixaBank-BSC 2016-2021.