key: cord-0057804-tlvjvro1
authors: Nasir, Sidra; Siddiqi, Imran
title: Learning Features for Writer Identification from Handwriting on Papyri
date: 2021-02-22
journal: Pattern Recognition and Artificial Intelligence
DOI: 10.1007/978-3-030-71804-6_17
sha: 70f406987297e1a5e5ff4faa64fc5079536ba436
doc_id: 57804
cord_uid: tlvjvro1

Computerized analysis of historical documents has remained an interesting research area for the pattern classification community for many decades. From the perspective of computerized analysis, key challenges in historical manuscripts include automatic transcription, dating, retrieval, classification of writing styles and identification of scribes. Among these, our current study focuses on identifying writers from digitized manuscripts. We exploit convolutional neural networks for feature extraction and characterization of writers. The ConvNets are first trained on contemporary handwriting samples and then fine-tuned to the limited set of historical manuscripts considered in our study. Dense sampling is carried out over a given manuscript, producing a set of small writing patches for each document. Decisions on patches are combined using a majority vote to conclude the authorship of a query document. Preliminary experiments on a set of challenging and degraded manuscripts report promising performance.

Historical documents contain rich information and provide useful insight into the past. Drawings, embellishments, shapes, letters and signatures not only provide explicit details on the content; the style of writing and its evolution also reflect diverse cultural and social attributes. Paleographers are particularly interested in tasks like identifying the scribe, determining the date and place of origin of a manuscript and so on. Such problems, naturally, require significant experience and domain knowledge.

Over the last few decades, there has been a significant increase in efforts to digitize ancient documents [1, 2]. Digitization not only aims at preserving cultural heritage and making it publicly accessible but also allows research on these rich collections without the need to physically access them. This, in turn, has exposed pattern classification researchers in general, and the document and handwriting recognition community in particular, to a whole new set of challenging problems [3]. A few of the prominent digitization projects include the International Dunhuang Project (IDP) [4], the Monk system [5], Madonne [3] and NAVIDOMASS (NAVIgation in Document MASSes). Besides digitization, these projects are also supported by the development of automated tools to assist paleographers in tasks like spotting keywords in manuscripts or retrieving documents with a particular writing style or dropcap. The SPI (System for Paleographic Inspection) [6] software, for instance, has been employed by experts to compare and analyze paleographic content morphologically. Such systems help paleographers in inferring the origin of a manuscript, as morphologically similar strokes are likely to originate from similar temporal and cultural environments. The notion of similarity can also be exploited to identify the scribe of a given manuscript.

In the past, paleographers and historians have been hesitant to accept computerized solutions. The key contributing factor to this resistance has been the lack of 'trust' in machine-based solutions. In recent years, however, thanks to advances in different areas of image analysis and machine learning as well as the success of joint ventures between paleographers and computer scientists, experts are more open to accepting automated solutions in their practices [7]. The main motivation of such solutions is to assist and not replace the human experts. These tools can be exploited to narrow down the search space so that the experts can focus on a limited set of samples for detailed and in-depth analysis [8].

Among the various challenges in computerized analysis of historical manuscripts, identification of scribes carries significant importance. Identifying the writer can also be exploited to estimate the date and region in which the manuscript was produced by correlating with the 'active' period of the scribe [9]. The writer of a document can be characterized by capturing the writing style, which is known to be specific to each individual [10]. Writing style is typically exploited through a global (page or paragraph) scale of observation. Textural features, for example, have been extensively employed to capture the writing style [11] [12] [13]. Another series of methods employs low-level statistical measures computed at a relatively closer scale of observation (characters or graphemes, for example) [10]. Studying the frequency of certain writing patterns in a given handwriting has also been exploited to characterize writers under the category of codebook-based writer identification [14, 15]. In recent years, feature learning using deep convolutional neural networks (CNNs) has also been investigated to characterize the writer [16]. A major proportion of work on writer identification targets contemporary documents, which do not offer the challenges encountered when dealing with ancient manuscripts. Noise removal, segmentation of text from background and segmentation of handwriting into smaller units for feature extraction are a few of the challenges that hinder the direct application of many established writer identification methods to historical manuscripts. Another important factor is the medium on which writing is produced, which has evolved over time (stone, clay, papyrus, parchment, paper etc.). Each medium has its own unique challenges that must be addressed to effectively identify the scribe. This paper addresses the problem of writer identification from handwriting on papyrus.
The digitized images of handwriting are pre-processed and divided into patches using dense sampling. Machine-learned features are extracted from each patch using a number of pre-trained ConvNets. Since handwriting images are very different from the images on which most of the publicly available CNNs are trained, the networks are first fine-tuned using a large dataset of contemporary writings. These networks are further tuned on the papyrus images to identify the scribe. The experimental study is carried out on the GRK-Papyri [17] dataset and results are reported at patch as well as document level (by applying a majority vote on patch-level decisions).

We first present an overview of recent studies on similar problems in Sect. 2. Section 3 introduces the dataset and presents the details of the proposed technique. Details of experiments along with a discussion on the reported results are presented in Sect. 4. At the end, we summarize our findings in Sect. 5.

In recent years, computerized analysis of ancient handwriting has gained significant attention from the document recognition community [18] [19] [20] [21]. The key challenge in automatic writer identification (AWI) is the selection of distinguishing features which effectively capture the writing style of the scribe from the handwriting images. The scale of observation at which features are computed is also critical, as features can be extracted from complete pages, small patches of handwriting, text lines, words, characters or even graphemes. These units represent different scales of observation at which the handwriting is analyzed.

As noted in the introduction, a recent trend in writer identification from contemporary documents is to learn features from data, typically using ConvNets. Our discussion therefore focuses on machine learning based methods for writer identification. Readers interested in comprehensive reviews of this problem can find details in the relevant survey papers [22, 23].

From the perspective of feature learning, ConvNets are either trained from scratch or pre-trained models are adapted to the writer identification problem using transfer learning. Rehman et al. [18], for instance, employed the well-known AlexNet [24] architecture pre-trained on the ImageNet [25] dataset as a feature extractor. Handwriting images are fed to the trained model and the extracted features are fed to an SVM for classification. In another deep learning based solution, Xing and Qiao [19] introduced a deep multi-stream CNN termed DeepWriter. Small patches of handwriting are fed as input to the network, which is trained with softmax classification. Experiments on English and Chinese writing samples report high identification rates. The authors also demonstrate that joint training on both scripts leads to better performance.

Among other significant contributions, Tang and Wu [26] employ a CNN for feature extraction and the joint Bayesian technique for identification. To enlarge the training data, writing samples are split into words and their random combinations are used to produce text lines. The technique is evaluated through an experimental study on the ICDAR2013 and CVL datasets, and Top-1 identification rates of more than 99% are reported in different experiments. In another similar work, writer identification is carried out from Japanese handwritten characters using AlexNet as the pre-trained model [27]. Fiel et al. [28] mapped handwriting images to feature vectors using a CNN and carried out identification using a nearest neighbor classifier. Christlein et al. [20] investigate unsupervised feature learning using SIFT descriptors and a residual network. Likewise, the authors in [29] employ a semi-supervised learning approach with ResNet. Weighted Label Smoothing Regularization (WLSR) was introduced to regularize the unlabeled data. Words in the CVL dataset were used as the labeled data while IAM words served as the unlabeled set in the experimental study.

While CNNs are mostly employed in the classification framework for writer identification, Keglevic et al. [21] propose to learn similarity between handwriting patches using a triplet network. The network is trained by minimizing the intra class and maximizing the inter class distances and the writing patches are represented by the learned features. A relatively recent trend is to exploit hyperspectral imaging to capture handwriting images, mainly for forensic applications. Authors in [30] demonstrate the effectiveness of employing multiple spectral responses of a single pixel to characterize the writer. These responses are fed to a CNN to identify the writer. Experiments on the UWA Writing Inks Hyperspectral Images (WIHSI) dataset reveal the potential of this interesting area for forensic and retrieval applications.

From the perspective of writer identification in historical manuscripts, the literature is relatively limited as opposed to contemporary documents [31, 32]. In some cases, standard writer identification techniques have also been adapted for historical manuscripts [33]. A recent work reported in [34] targets writer identification in medieval manuscripts (Avila Bible). Transfer learning is employed to detect text lines (rows) from images and the writer of each line is identified. Majority voting is subsequently applied on the row-wise decisions to assign a writer to the corresponding page, and a page-level accuracy of more than 96% is reported. Studer et al. [35] present a comprehensive empirical study to investigate the performance of multiple pre-trained CNNs on analysis of historical manuscripts. The networks were investigated for problems like character recognition, dating and handwriting style classification.

In another similar work, Cilia et al. [36] propose a two-step transfer learning based system to identify writers from historical manuscripts. The text rows in images are first extracted using an object detection system based on MobileNet. The CNN pre-trained on ImageNet is subsequently employed for writer identification on digitized images from a Bible of the XII century. Mohammed et al. [37] adapt a known writer identification method (Local Naïve Bayes Nearest-Neighbour classifier [38] ) for degraded documents and demonstrate high identification rates on 100 pages from the Stiftsbibliothek library of St. Gall collection [39] . The same technique was applied to the GRK-Papyri dataset [17] with FAST keypoints and reported a low identification rate of 30% (using a leave-one-out evaluation protocol).

After having discussed the recent contributions to writer identification in general and historical documents in particular, we now present the proposed methods in the next section.

We now present the details of the proposed method for characterization of writers from the challenging papyrus handwriting. We first introduce the dataset employed in our study followed by the details of pre-processing, sampling and writer identification through ConvNets. The approach primarily relies on characterizing small patches of handwriting using machine-learned features in a two-step fine-tuning process. An overview of the key steps is presented in Fig. 1 while each of these steps is discussed in detail in the subsequent sections.

The experimental study of the system is carried out on the GRK-Papyri dataset presented in [17] . The dataset consists of 50 handwriting samples of 10 different scribes on papyri. All writings are in Greek and come from the 6th century A.D. The dataset has been made available for research along with the ground truth information of writers. Sample images from the dataset are shown in Fig. 2 .

All images are digitized as JPEGs; image height varies from 796 to 6818 pixels and width from 177 to 7938 pixels. The DPI also varies from a minimum of 96 to a maximum of 2000. A few of the images are digitized in gray scale while the others are three-channel RGB images. The samples suffer from severe degradation including low contrast, holes and glass reflection (Fig. 2). The background contains papyrus fibers of varying sizes and frequencies, adding further complexity from the perspective of automated processing. The samples are not uniformly distributed across the 10 scribes and the number varies from 4 to 7 samples per writer as presented in Table 1.

Fig. 2. Sample images of GRK-Papyri dataset [17]

Prior to feeding the images to ConvNets for feature extraction, we need to process the images. Since the dataset comprises both colored and gray scale images with diverse backgrounds of papyrus fiber, directly feeding raw images may lead to learning features that could be linked with the background information rather than handwriting. We therefore first convert all images to gray scale and preprocess them in different ways to investigate which of the representations could yield better performance. These include:

-Binarization using adaptive (Sauvola [40]) thresholding.

-Application of Canny edge detector to preserve edges of writing strokes only.

-Edge detection on adaptively binarized images.

-Binarization of images using a recent deep learning based technique, DeepOtsu [41].

The output images resulting from these different types of processing are illustrated in Figure 3 . 
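To make the first option in the list above concrete, the following is a minimal sketch of Sauvola-style adaptive thresholding, in which each pixel is compared against a threshold T = m · (1 + k · (s / R − 1)) computed from the local mean m and local standard deviation s. The function name `sauvola_binarize`, the window size and the values of `k` and `r` are illustrative assumptions and not settings reported in this study; the plain-Python loops favor readability over speed.

```python
from math import sqrt


def sauvola_binarize(image, window=15, k=0.2, r=128.0):
    """Binarize a gray-scale image given as a list of rows (values 0..255).

    Returns a list of rows with 0 for ink (dark) and 1 for background.
    window, k and r are illustrative assumptions, not values from the paper.
    """
    height, width = len(image), len(image[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the local neighborhood, clipped at the image borders.
            values = [
                image[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            mean = sum(values) / len(values)
            std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
            # Sauvola threshold: lowered in flat regions, raised near strokes.
            threshold = mean * (1.0 + k * (std / r - 1.0))
            row.append(1 if image[y][x] > threshold else 0)
        result.append(row)
    return result


# Toy example: light background (200) with one dark vertical stroke (30).
toy = [[30 if x == 4 else 200 for x in range(9)] for _ in range(9)]
binary = sauvola_binarize(toy, window=5)
```

Because the threshold follows the local statistics, a uniformly light region stays background while the dark stroke is marked as ink even when the overall contrast of the page varies.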

When employing pre-trained ConvNets in a transfer learning framework (fine-tuning them on the target dataset), the resolution of images must match the input expected by the network. Naturally, resizing the complete page to a small square and feeding it to a network is not very meaningful, as not only is all writer-specific information likely to be lost but the aspect ratio is also highly disturbed. We, therefore, carry out a dense sampling of the complete image using overlapping square windows. The size of the window determines the scale of observation, and extracting square windows ensures that the aspect ratio is not disturbed once the extracted patches are resized to match the input layer of the pre-trained CNN. Figure 4 illustrates a few patches of size 512 × 512 extracted from one of the images in the dataset.
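The dense sampling described above can be sketched as a computation of overlapping window coordinates. This is only an illustration: the function names, the stride of 256 pixels (50% overlap) and the handling of the image border are assumptions, since the amount of overlap is not specified in the text.

```python
def window_origins(length, window, stride):
    """Start offsets of windows of size `window` along an axis of `length` pixels."""
    if length < window:
        return []  # axis too short for even one window
    origins = list(range(0, length - window + 1, stride))
    # Add a final window flush with the border so the margin is not dropped.
    if origins[-1] != length - window:
        origins.append(length - window)
    return origins


def dense_patches(height, width, window=512, stride=256):
    """Return (top, left, bottom, right) boxes of overlapping square patches.

    The stride is a hypothetical choice; only the 512 x 512 patch size
    appears in the text.
    """
    return [
        (top, left, top + window, left + window)
        for top in window_origins(height, window, stride)
        for left in window_origins(width, window, stride)
    ]


# Example: an image of 796 x 1200 pixels yields 3 x 4 = 12 patches.
boxes = dense_patches(796, 1200)
```

Each box can then be cropped from the pre-processed image and resized to the input resolution of the network. Note that in this sketch an image narrower than the window (the smallest width in the dataset is 177 pixels) yields no patches, so a smaller window or padding would be needed for such samples.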

As discussed in the earlier sections, deep ConvNets have become the gold standard for feature extraction as well as classification. Designing a new architecture and training CNNs from scratch for every problem, however, is neither required nor feasible. In most cases, architectures and weights of ConvNets can be borrowed from those trained on millions of images and made publicly available by the research community. This concept is commonly termed transfer learning and has been successfully applied to a number of recognition tasks.

Pre-trained ConvNets can be employed either purely as feature extractors in conjunction with another classifier (an SVM, for example) or they can be fine-tuned to the target dataset by changing the softmax layer (to match the classes under study) and continuing back propagation. Fine-tuning can update the weights of all layers or only a subset, by freezing a few of the initial layers. Most of the publicly available pre-trained networks are trained on the ImageNet [25] dataset and have been fine-tuned to solve many other problems.

In our study, we employ the pre-trained ConvNets by fine-tuning them to our problem. More specifically, we employ three standard architectures, namely VGG16 [42], Inceptionv3 [43] and ResNet50 [44], trained on the ImageNet dataset. Since we deal with handwriting images, which are different from the images in the ImageNet dataset, we employ a two-step fine-tuning. First, we fine-tune the networks using the IAM handwriting dataset [45], which contains writing samples of more than 650 writers. Although these are contemporary samples and do not offer the same challenges as those encountered in historical documents, they do contain handwriting, so we expect enhanced feature learning. Once the networks are fine-tuned on the IAM handwriting samples, we further tune them on the writing patches in our papyri dataset. The softmax layer of the final network is changed to match the 10 scribes in our problem.

We now present the experimental protocol, the details of experiments and the reported results. The GRK-Papyri dataset is provided with two experimental settings for the writer identification task.

-Leave-one-out approach

-A training set of 20 and a test set of 30 images

Since we employ a machine learning based technique, experiments under a leave-one-out approach would mean training the system 50 times for each evaluation. We, therefore, chose to employ the training and test set distribution provided in the database i.e. 20 images in the train and 30 in the test set.

We first present the identification rates as a function of different pre-processing techniques. These classification rates are computed by fine-tuning Inceptionv3 first on the IAM dataset and subsequently on the training images in the GRK-Papyri dataset. Results are reported at patch level as well as document level by applying a majority vote on the patch-level decisions. It can be seen from Table 2 that among the different pre-processing techniques investigated, DeepOtsu reports the highest identification rates of 27% at patch level and 48% at document level. The subsequent experiments are therefore carried out using DeepOtsu as the pre-processing technique.

Table 3 presents a comparison of the three pre-trained models VGG16, Inceptionv3 and ResNet50 employed in our study. We present the identification rates obtained by directly fine-tuning the models from ImageNet to our dataset (single-step tuning) as well as by first tuning them on the IAM dataset and subsequently on the papyri dataset (two-step tuning). It can be seen that in all cases two-step fine-tuning serves to enhance the identification rates by 2 to 6%. The highest document-level identification rate is reported by fine-tuning ResNet50 and reads 54%. Considering the complexity of the problem and the small set of training samples, the reported identification rate is indeed very promising.
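The relation between patch-level and document-level rates can be illustrated with a short sketch of the majority vote. The function names, the tie-breaking rule (smallest label wins) and the toy predictions are assumptions made for illustration; they are not taken from the experiments reported here.

```python
from collections import Counter


def document_label(patch_labels):
    """Majority vote over the predicted writer labels of one document's patches."""
    counts = Counter(patch_labels)
    best = max(counts.values())
    # Assumed tie-break: pick the smallest label among the most frequent ones.
    return min(label for label, count in counts.items() if count == best)


def identification_rates(predictions, truth):
    """Return (patch-level rate, document-level rate).

    predictions maps a document id to the list of labels predicted for its
    patches; truth maps a document id to its true writer label.
    """
    patch_total = sum(len(labels) for labels in predictions.values())
    patch_correct = sum(
        label == truth[doc] for doc, labels in predictions.items() for label in labels
    )
    doc_correct = sum(
        document_label(labels) == truth[doc] for doc, labels in predictions.items()
    )
    return patch_correct / patch_total, doc_correct / len(predictions)


# Toy data (hypothetical): three documents with four patches each.
toy_predictions = {"a": [1, 1, 2, 3], "b": [2, 2, 2, 1], "c": [3, 1, 1, 1]}
toy_truth = {"a": 1, "b": 2, "c": 3}
patch_rate, doc_rate = identification_rates(toy_predictions, toy_truth)
```

In the toy data only half of the patches are classified correctly, yet two of the three documents receive the right writer, which shows how voting can lift the document-level rate above the patch-level rate when the correct writer is the most frequent prediction within a document.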

We also study the impact of patch size (scale of observation) on the identification rates. Document-level identification rates with two-step fine-tuning of Inceptionv3 and ResNet50 as a function of patch size are summarized in Fig. 5. It is interesting to observe that both models exhibit a broadly similar trend and the highest identification rates are reported at a patch size of 512 × 512, i.e. 48% and 54% for Inceptionv3 and ResNet50 respectively. Patches that are too small or too large report relatively lower identification rates, indicating that the scale of observation is a critical parameter that must be carefully chosen.

For comparison, writer identification rates on this dataset are reported in [17] using Normalized Local Naïve Bayes Nearest-Neighbor with FAST keypoints. The authors report an identification rate of 30.0% with the leave-one-out protocol and 26.6% with the split into training and test sets. Using the same distribution of 20 images in the training set and 30 in the test set, we report an identification rate of 54%, roughly double the 26.6% baseline.

This study aimed at identification of scribes from historical manuscripts. More specifically, we investigated the problem on Greek handwriting on papyrus. Handwriting is extracted from the degraded images in a pre-processing step and is divided into small patches using dense sampling. Features are extracted from handwriting patches by fine-tuning state-of-the-art ConvNets. A two-step fine-tuning is carried out by first tuning the models to contemporary handwriting and subsequently to the papyri dataset. Patch-level identification decisions are combined at document level using majority voting, and identification rates of up to 54% are reported. Considering the challenging set of writing samples, the realized identification rates are indeed very promising.

In our further study, we intend to extend the analysis to other relevant problems like classification of writing styles and dating. Furthermore, the current study revealed that pre-processing is a critical step in analyzing such documents and further investigating different pre-processing techniques could indeed be an interesting study. In addition to standard pre-trained models, relatively shallower networks can also be trained from scratch to study the performance evolution.

Document analysis systems for digital libraries: challenges and opportunities

Document images analysis solutions for digital libraries

Digitizing a million books: challenges for document analysis

International Dunhuang project: the silk road online

Handwritten-word spotting using biologically inspired features

A case study on the system for paleographic inspections (SPI): challenges and new developments

Historical manuscript dating using textural measures

Deep learning based approach for historical manuscript dating

Image-based historical manuscript dating using contour and stroke fragments

Individuality of handwriting

Personal identification based on handwriting

Writer identification using global wavelet-based features

Deep adaptive learning for writer identification based on single handwritten word images

Text-independent writer identification and verification using textural and allographic features

Text independent writer recognition using redundant writing patterns with contour-based orientation and curvature features

DeepWriter: a multi-stream deep CNN for text-independent writer identification

GRK-Papyri: a dataset of Greek handwriting on papyri for the task of writer identification

Automatic visual features for writer identification: a deep learning approach

DeepWriter: a multi-stream deep CNN for text-independent writer identification

Unsupervised feature learning for writer identification and writer retrieval

Learning features for writer retrieval and identification using triplet CNNs

State of the art in off-line writer identification of handwritten text and survey of writer identification of Arabic text

Writer identification: a comparative study across three world major languages

ImageNet classification with deep convolutional neural networks

ImageNet: a large-scale hierarchical image database

Text-independent writer identification via CNN features and joint Bayesian

Writer identification for offline Japanese handwritten character using convolutional neural network

Writer identification and retrieval using a convolutional neural network

Semi-supervised feature learning for improving writer identification

Hyperspectral image analysis for writer identification using deep learning

Binarization, character extraction, and writer identification of historical Hebrew calligraphy documents

Writer identification for historical Arabic documents

Using codebooks of fragmented connected-component contours in forensic and historic writer identification

An end-to-end deep learning system for medieval writer identification

A comprehensive study of ImageNet pre-training for historical document image analysis

A two-step system based on deep transfer learning for writer identification in medieval books

Writer identification for historical manuscripts: analysis and optimisation of a classifier as an easy-to-use tool for scholars from the humanities

Local Naive Bayes nearest neighbor for image classification

e-codices-virtual manuscript library of Switzerland

Adaptive document image binarization

DeepOtsu: document enhancement and binarization using iterative deep learning

Very deep convolutional networks for large-scale image recognition

Rethinking the inception architecture for computer vision

Resnet in resnet: generalizing residual architectures

The IAM-database: an English sentence database for offline handwriting recognition

An improved canny edge detection algorithm

The authors would like to thank Dr. Isabelle Marthot-Santaniello from the University of Basel, Switzerland for making the dataset available.