key: cord-0057825-vz6rug1s
authors: Mekhaznia, Tahar; Djeddi, Chawki; Sarkar, Sobhan
title: Personality Traits Identification Through Handwriting Analysis
date: 2021-02-22
journal: Pattern Recognition and Artificial Intelligence
DOI: 10.1007/978-3-030-71804-6_12
sha: 198637a508aa5eb55968d9392dcff68587224d05
doc_id: 57825
cord_uid: vz6rug1s

Personality traits are of paramount importance in identifying human behavior. They represent a useful information source for forensic control, recruitment profiling, medical symptoms, and other applications. Personality traits are identified through various physical aspects, including sense, honesty, and other emotions. These aspects can be revealed through handwriting features. Since handwriting is unique to each person, its identification is not as straightforward as it appears; rather, it requires efficient tools for the extraction and classification of features. The process has been the subject of various research works. However, the reported results remain unsatisfactory, mainly because of dissimilarities in handwriting. In this paper, we present an approach to the recognition of personality traits based on textural features extracted from handwritten samples. Experiments are carried out using artificial neural networks and the TxPI-u database. The results show a significant recognition rate, which supports the effectiveness of the approach against similar works.

Handwriting is viewed as a combined psycho-mechanical process performed by the writer's hand according to commands from the brain. It is, to some extent, the writer's private seal and trademark, which cannot be reproduced by others. This individuality is supported by two natural factors: first, a genetic factor, which is responsible for the bio-mechanical structure of the hand, muscular strength and the properties of the brain system; second, a mimetic factor, related to the training acquired through basic education and the influence of the cultural environment.
In addition, individual writing changes over a lifetime; it starts with a basic classroom copybook style and is progressively affected by new attributes that depend on personal life events, skills, etc. Finally, it becomes specific to its owner. Moreover, personal handwriting is one of the global human characteristics, like gait and voice. To some degree, these characteristics disclose the psychological state of their owner. They accordingly represent a useful data source for forensic control, recruitment profiling, medical diagnosis and many other applications. The related research area adopted graphology as a science of recognizing human traits and emotional state [1]. It reveals a good deal about the writer's psychology on the basis of handwriting features, and in this way handwriting recognition emerged. Handwriting recognition, also known as character recognition, is the process of converting handwritten text into symbol codes usable by a computer. The process is carried out by appropriate applications, including machine learning and OCR engines; it involves script extraction, segmentation and classification of features. The useful information is stored in reference databases, known as classifiers, which are used afterward as a tool for identifying the writer of manuscript samples. Writer identification, also called personality traits identification, refers to the scientific methodology for understanding and evaluating personality and emotion. It operates on the structure and patterns of handwriting and aims to build the writer's personality profile from a piece of his or her handwriting. The identification process is not stable in any setting.
It rests on the principle that no two people write exactly alike and that no one reproduces the same writing twice, and it depends essentially on the analyst's experience and skills; the related results are therefore often costly and prone to errors. Consequently, experts turn to automated handwriting analysis, which seems to be effective for personality trait prediction. It performs a one-to-many search against samples of known authorship in a given classifier and returns the most similar results, which may be processed manually afterward.

To solve the handwriting recognition problem, we propose an approach for evaluating personality traits through various handwriting characteristics. Its principle consists of extracting features from handwriting samples and classifying them with an artificial neural network (ANN). The results lead to the recognition of various personality traits with regard to the Five-Factor Model (FFM). The proposed handwriting features are based on edge direction, run length and ink distribution. The experiments are carried out using a new resource database called TxPI-u, with 534 samples produced by several writers, which represents a common set of personality traits. Part of the database contents is used in the training process. It is worth noting that full identification of personality traits is never reached. This may be due to the diversity of writers' manuscripts, the scan quality of documents and foreground/background separation problems. In addition, the character recognition rate within a given document cannot be viewed as a valid result, owing to the lack of a standard evaluation context.

The paper is organized into five sections: After an introduction and a literature review in Sects. 1 and 2, the proposed approach is described in Sect. 3. Section 4 illustrates some preliminary experiments and statistical analysis of results. Finally, the paper is concluded in Sect.
5; it summarizes the paper and suggests avenues for further investigation of the problem.

Handwriting-based personality traits identification remains a thriving research field. It is prominent in pattern recognition, classification and, more generally, in the artificial intelligence (AI) field. Accordingly, researchers have adopted diverse strategies for extracting features from symbols, retaining only the effective information and thereby decreasing the dimension of classifiers [2]. Before that, the staff lines problem must be fixed; staff lines are the writing guidelines, encountered especially in old documents, that often overlap parts of symbols. Staff line removal must be performed before or after the segmentation stage, with the risk of losing parts of useful data. It involves specific transformations and symbol reconstruction, which lead to useless results in various cases. The literature in this context is abundant; the earliest works date back to the 1990s, when Sheikholeslami et al. [3] proposed a computer-aided graphology system for the extraction and analysis of handwriting, with reduced data and limited results. In the 2000s, various similar works emerged. They performed their experiments on datasets of numerous writers' samples [4]. For their experiments, researchers extracted a variety of features: character dimensions, slants and loop frequency [5]; document and paragraph layout, pixel density [6]; document layout, pen pressure, word spacing [7]; baseline behavior and t-bar, y-loop characteristics [8]; characteristics of the letters "f" and "i" [9]; baseline layout, slants, margins [10]; isolated character behavior [11].
For classification, researchers adopted various alternatives: neural networks [9, 12]; grey-level co-occurrence matrix, Gabor filters [13]; crisp and fuzzy approach [14]; fuzzy inference [15]; other specific tools [16]; machine learning with KNN [17]; combination of CNN and SVM classifiers [18]. In terms of effectiveness, while some of the cited approaches have yielded promising results, various other attempts remain unfruitful [19]. Their focus on Latin-derived alphabets alone is undoubtedly the main cause, coupled with their inability to deal with real databases. Besides that, we noticed the significant amount of training data required by certain approaches, which is a drawback for their efficiency. Overall, in the absence of a standard for predicting behavior from handwriting, most of the obtained results still depend on their experimental environment.

The proposed approach is based on a three-layer artificial neural network (ANN) architecture; it uses the contents of a handwriting database to evaluate the personality traits of the writer according to the FFM. It focuses on analyzing off-line samples, a suitable alternative that replaces the questionnaire and psychological interview used in classical processes. The handwriting samples are produced in a consistent format that is easily recognized by the computer with a high degree of accuracy. Writing features are then extracted and classified according to a predefined model. First, part of the features is used for training, which enhances the recognition accuracy. Figure 1 depicts the main steps of the approach.

The dataset used for the evaluation of the present work is a new standardized multimodal corpus called Text for Personality Identification of Undergraduates (TxPI-u) [20], dedicated to experiments on the personality traits problem. It contains manuscript samples from a group of 418 Mexican undergraduate students.
It is presented as a set of image samples from various academic programs (management, humanities, social studies, communication, etc.). A class of 1 or 0 is assigned to each sample; it corresponds to the level of a specific personality trait (extroversion, agreeableness, conscientiousness, emotional stability and openness) according to the FFM. An image sample is illustrated in Fig. 2.

Staff lines (also called ruled paper) are a series of continuous horizontal lines used as a guide that helps readers, and especially writers, to keep their writing on a straight line and their drawing in page order (Figs. 2 and 3). The ruling layout is determined by the style defined by the manuscript's author or by the entity for which it is intended. Staff lines are intended to maintain note pitch on musical supports and to guide writing in school notebooks. Nowadays, they have gradually disappeared from documents due to the use of mechanical writing, but they are still present in archives and old manuscripts.

Noise removal consists of removing staff lines and other unwanted data, such as extra symbols and ink blots, that carry no useful information. It eases character recognition and consequently improves the quality of the image. In general, staff lines are drawn in the same color as the writing pen and sometimes overlap parts of written symbols. Therefore, their automatic removal is likely to alter relevant data. Moreover, the issue is compounded by the texture quality of the support and by the writer's handwriting, which partly modify the standard behavior of characters. In the literature, staff line identification and removal have been approached from various angles: thickness, distance between lines, straightness and exploitation of the contrast between fonts and paper [21] [22] [23] [24]. In this paper, we adopted the idea presented by Dos Cardoso et al.
[25]; it puts forward the notion of a stable path, defined as the shortest horizontal path that links two pixels u and v enclosed in two distinct sub-graphs Ω1 and Ω2 situated respectively on the two margins of the music score (Fig. 4). A staff line is then viewed as an extended object of black pixels with a homogeneous width supported by a given shortest path. Staff line thickness is modeled as an average over contiguous stable paths (Fig. 5). Several runs are needed on the same series of pixels to fix the staff space height. Finally, the removal of staff lines consists simply of replacing their pixels with white and keeping intact other objects that exceed the considered width.

Feature extraction is a very important step in pattern recognition systems; it partially emulates human thinking about the direction of handwritten traces. Handwriting features are various: line regularity, letter and word spacing, pen pressure, lifts, etc. They appear as dominant factors in the visual appearance of handwritten shapes and are independent of the amount of written material and of variations in the writer's life behavior. In the context of the present work, we retain for the experiments the slants, writing direction and ink trace features, as described below.

Edge-Hinge Distribution (f1). The edge-hinge distribution, EHD in short, is a statistical feature that captures the directional orientation of a handwriting pattern. It describes the behavior of a pair of neighboring edge fragments that start from a central pixel and evolve in two distinct directions, oriented respectively at angles ϕ1 and ϕ2 to the horizontal, as shown in Fig. 6. The probability distribution p(ϕ1, ϕ2) is extracted over a large sample of pixel pairs that appear in the opposite corners of a square window moving over an edge-detected handwriting piece; it concerns only one scale direction rather than multiple ones.
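As an illustration of the edge-hinge idea, the following is a minimal sketch and not the implementation used in the experiments. It assumes a binary image in which ink pixels are 1, a very simple edge detector (an ink pixel is an edge pixel if one of its four neighbours is background), and a window radius of 2; the function names `edge_map`, `window_border` and `edge_hinge_distribution` are hypothetical. As a further simplification, a direction is counted whenever an edge pixel lies on the window border, without checking that it is connected to the central pixel by an edge fragment.

```python
import numpy as np

def edge_map(binary):
    """Mark ink pixels that touch at least one background pixel (4-neighbourhood)."""
    ink = binary.astype(bool)
    padded = np.pad(ink, 1, constant_values=False)
    all_neighbours_ink = (
        padded[:-2, 1:-1] & padded[2:, 1:-1] & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return ink & ~all_neighbours_ink

def window_border(radius):
    """Offsets (dy, dx) on the border of a square window, ordered by angle."""
    offsets = [(dy, dx)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)
               if max(abs(dy), abs(dx)) == radius]
    # Image rows grow downwards, hence the sign flip on dy.
    return sorted(offsets, key=lambda o: np.arctan2(-o[0], o[1]) % (2 * np.pi))

def edge_hinge_distribution(binary, radius=2):
    """Joint histogram p(phi1, phi2) over pairs of border directions, phi1 < phi2."""
    edges = edge_map(binary)
    border = window_border(radius)
    n = len(border)
    hist = np.zeros((n, n), dtype=float)
    h, w = edges.shape
    for y, x in zip(*np.nonzero(edges)):
        # Skip centres whose window would leave the image.
        if y < radius or x < radius or y >= h - radius or x >= w - radius:
            continue
        # Simplification: any edge pixel on the border counts as a direction.
        hits = [k for k, (dy, dx) in enumerate(border) if edges[y + dy, x + dx]]
        for a in range(len(hits)):
            for b in range(a + 1, len(hits)):
                hist[hits[a], hits[b]] += 1
    total = hist.sum()
    return hist / total if total > 0 else hist
```

With a radius of 2 the window border holds 16 positions, so the feature is a 16 x 16 matrix in which only the upper triangle is populated; flattening that triangle would give the feature vector.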
The EDD is the main feature of the writing stroke and gives more accurate information about writer identification, as described below.

Fig. 6. The orientation of segments emerging from a central point

Run-Length Distribution (f2/f5). The run-length distribution, RLD in short, involves the behavior of the text direction, loop size and curvature [27]. The method consists of scanning pixel columns in various directions within a binary image and counting the number of dark pixels (which correspond to ink width) in each direction after removing salt-and-pepper noise. The best values are used to construct a template distribution. Once the dark pixels are processed, the histogram of run lengths is normalized and interpreted as a probability distribution. In the experiments, we consider the RLD for both black pixels (f2) and white pixels (f5), which seem informative about symbol and word spacing.

Auto-Regressive Model (f3). The auto-regressive model, AR in short, is a statistical tool for describing the dynamic characteristics of discrete data within textures and images. AR describes the intensity of a given pixel as a function of the intensities of its neighbors within a certain distance in all directions. It can then be used to complete the contents of a missing area within an image shape. The pixel intensity is represented as a linear combination of the intensities of neighboring pixels according to Eq. 1:

$$\dot{I}_{xy} = \sum_{i=x_{\min}}^{x_{\max}} \sum_{j=y_{\min}}^{y_{\max}} a_{ij}\, I_{i-p,\,j-p} + n_{xy} \qquad (1)$$

where $\dot{I}_{xy}$ is the complete sample at location (x, y) and (i, j) denotes the known neighborhood values. The bounds $[x_{\min}, y_{\min}, x_{\max}, y_{\max}]$ refer to the model order, generally represented by a square window Ω; a is the prediction coefficient and n represents the white noise process. The AR model has been successfully applied to texture synthesis, segmentation and image classification [28, 29].

The Edge-Direction Distribution (f4).
The edge-direction distribution, EDD in short, is a texture descriptor obtained by convolving the image with two orthogonal differential kernels, followed by thresholding [26]. This feature has long been used as a main descriptor of the handwriting trace [30, 31]. It is extracted by considering the line that links two adjacent edge points k and k + 1, with respective coordinates $(x_k, y_k)$ and $(x_{k+1}, y_{k+1})$, in a binary image in which only the edge pixels are visible. This line forms an angle ϕ with the horizontal [32], computed as in Eq. 2:

$$\phi = \arctan\left(\frac{y_{k+1} - y_k}{x_{k+1} - x_k}\right) \qquad (2)$$

The EDD is evaluated within a square neighborhood of pixels in various directions (Fig. 7), each with a probability distribution p(ϕ). In the experiments, we consider only the EDD with the highest probability, since the direction of the writer's pen cannot be predicted in advance.

Fig. 7. Edge-direction distribution from 4-pixel-long edge fragments

In the classification stage, the considered features are arranged into sets according to the personality traits. Various classifiers are available for this purpose. Each is devoted to a dedicated task that depends on its specificities: fast recognition [33], parameter setting rules [34], automatic feature retrieval [18], etc. Their performance depends on their ability to process input [35], their training background [36], and the way the user adjusts and manages the considered database [24]. Most classifiers are built on a neural network architecture [37]; they have found application in a wide variety of problems, including pattern recognition. We adopted the ANN for its flexible architecture, connection weight evaluation, and activation functions. It also seems well suited to the recognition of handwritten styles [38]. It consists of three main types of layers arranged in a feed-forward structure and fully interconnected. The base layer acquires the handwriting sample features as data vectors.
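The forward pass of such a fully connected, three-layer feed-forward network can be sketched as follows. This is only an illustration under stated assumptions, not the network used in the experiments: the input width is taken to be the length of one feature vector, the five outputs stand for the five FFM traits with a 0/1 decision each, and the activation functions (tanh and sigmoid), the weight initialization, the 0.5 threshold and all function names are choices made here for the example. Training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_network(n_inputs, n_hidden, n_outputs=5):
    """Random weights for an input -> hidden -> output network (assumed init)."""
    return {
        "W1": rng.normal(0.0, 0.1, size=(n_inputs, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, size=(n_hidden, n_outputs)),
        "b2": np.zeros(n_outputs),
    }

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, features):
    """Map a batch of feature vectors to one score in (0, 1) per trait."""
    hidden = np.tanh(features @ params["W1"] + params["b1"])
    return sigmoid(hidden @ params["W2"] + params["b2"])

def predict_traits(params, features, threshold=0.5):
    """Binary decision (1 = trait present, 0 = absent) for each of the outputs."""
    return (forward(params, features) >= threshold).astype(int)
```

For example, `predict_traits(init_network(16, 32), np.ones((3, 16)))` returns a 3 x 5 array of 0/1 labels, one row per sample and one column per trait.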
The intermediate hidden layers build a map of neuron connections based on the learned handwriting features, while the last layer generates distinct feature spaces according to the Big Five personality traits. We have five outputs that correspond to the FFM, whereas the input dimension is variable and depends on the size of the feature vectors, as shown in Table 1. The considered classifier performs both the training and test stages. In the training process, it sets aside part of the extracted features to build a comprehensive database. In the test process, each feature vector is compared with the classifier patterns to locate the closest features. The retained data should, in principle, closely match the corresponding personality aspects of the writer.

Evaluation is conducted on the database presented in Sect. 3; its contents are split into three disjoint sets, assigned separately to the training, validation, and test processes. The main experimental process consists of three phases, namely noise removal, construction of the feature vectors, and classification of the features.

In the first phase, the idea presented in Sect. 3.3 is applied to each image of the database. As shown in Fig. 8a, staff lines in the database images are relatively clear and exhibit regular vertical spacing, so it is not hard to locate the sub-graphs Ω1 and Ω2 related to each staff line. They are also straight and of equal length, which makes the construction of stable paths easy. Nevertheless, line width is defined as an average over contiguous stable paths, so it does not reflect the real thickness of each line. Hence, removal may in some cases alter parts of the overlapped data, especially for lines with inhomogeneous thickness. Even so, the lost data concern only letters that extend into the lower zone, such as g, j and p, which constitute no more than 5% of the written contents; this cannot alter the features of the whole image.
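The removal rule discussed above (whiten staff pixels while keeping objects that exceed the line width) can be illustrated with a simplified sketch. It does not implement the stable-path method of [25]; instead it assumes straight, horizontal staff lines, as observed in this database, and locates them by row projection. The function names and the parameters `row_fill` and `slack` are illustrative assumptions.

```python
import numpy as np

def vertical_run_lengths(ink):
    """For each ink pixel, length of the vertical ink run that contains it."""
    h, w = ink.shape
    runs = np.zeros((h, w), dtype=int)
    for x in range(w):
        y = 0
        while y < h:
            if ink[y, x]:
                start = y
                while y < h and ink[y, x]:
                    y += 1
                runs[start:y, x] = y - start
            else:
                y += 1
    return runs

def remove_staff_lines(binary, row_fill=0.6, slack=1):
    """Erase pixels of near-full rows unless a longer vertical stroke crosses them."""
    ink = binary.astype(bool)
    # Rows that are mostly ink are assumed to belong to a staff line.
    staff_rows = ink.mean(axis=1) >= row_fill
    if not staff_rows.any():
        return ink.copy()
    runs = vertical_run_lengths(ink)
    # Line thickness: median vertical run length of ink pixels on staff rows.
    thickness = int(np.median(runs[staff_rows][ink[staff_rows]]))
    cleaned = ink.copy()
    # Keep pixels whose vertical run exceeds the line thickness (crossing strokes).
    erase = staff_rows[:, None] & ink & (runs <= thickness + slack)
    cleaned[erase] = False
    return cleaned
```

Because a single thickness is estimated for the whole image, strokes that merely touch a line, or lines of uneven thickness, can still lose pixels, which is the kind of loss described above for letters reaching into the lower zone.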
An example of an image with its staff lines removed is shown in Fig. 8.

The second phase is dedicated to feature extraction. As described in Sect. 3.4, five feature methods have been explored in this study (f1 to f5). Each method builds its feature vector from the characteristics extracted from the characters of the database. The goal of the operation is to identify each character with a minimum of characteristics. Given the variable nature of handwritten characters, feature vectors have distinct sizes, which correspond to the number of characteristics of characters according to each feature. Table 1 summarizes the sizes of the vectors (as presented in Sect. 3.4). We propose four experimental scenarios with various amounts of data for each process, as shown in Table 2.

The proposed classifier involves three layers with a feed-forward architecture. The input layer is a set of five neurons; it accepts the data of the five feature vectors f1 to f5. The output layer also has five neurons; it delivers a result that corresponds to the five personality traits. We consider one hidden layer, fully connected on both sides to the other layers, whose number of neurons is chosen to give the best results. Hence, experiments were performed using a range of 1000 to 5000 neurons.

The best statistical accuracies for personality traits, averaged over 10 runs for each scenario, are shown in Tables 3, 4, 5 and 6. The overall results vary from 50% to 60%. The best accuracies (more than 65%) were noted for features f1 and f5. They were observed when the hidden layer size varies from 2500 to 3500 neurons. Furthermore, most experiments provide roughly similar results for all presented scenarios, as shown in Fig. 9. The experiments showed that the run length of white segments (f5) performs better than the other features. It provides more informative results than the black segments feature (f2).
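The two run-length features compared here (f2 for ink runs, f5 for background runs) can be computed along the following lines. This is a minimal sketch under assumptions of our own: the image is binary with ink equal to 1, runs are measured along rows or columns only, runs longer than `max_len` are discarded, and background runs that touch the image border are counted like any other. The function names are hypothetical.

```python
import numpy as np

def run_lengths(line, value):
    """Lengths of maximal runs equal to `value` in a 1-D boolean array."""
    padded = np.concatenate(([False], line == value, [False])).astype(np.int8)
    changes = np.diff(padded)
    starts = np.flatnonzero(changes == 1)
    ends = np.flatnonzero(changes == -1)
    return ends - starts

def run_length_distribution(binary, value=True, max_len=100, axis=1):
    """Normalized histogram of run lengths; value=True for ink, False for background."""
    ink = binary.astype(bool)
    lines = ink if axis == 1 else ink.T  # axis=1 scans rows, axis=0 scans columns
    hist = np.zeros(max_len, dtype=float)
    for line in lines:
        lengths = run_lengths(line, value)
        lengths = lengths[(lengths >= 1) & (lengths <= max_len)]
        np.add.at(hist, lengths - 1, 1)  # bin k-1 counts runs of length k
    total = hist.sum()
    return hist / total if total > 0 else hist
```

Calling the function once with `value=True` and once with `value=False` gives the black-run and white-run histograms respectively; each sums to 1 and can be used directly as a feature vector.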
In addition, the overall best results are obtained with a hidden layer of 3200 to 3600 neurons. Furthermore, a significant variation in outcome has been observed with distinct parts of the data within the database, for different features, or when varying the proportions of the training and test sets. With regard to recent literature, similar works provide divergent results due to their specific test environments: 47 to 75% [39], 48% [40], 62 to 85% [41], 60 to 90% [42] [43] [44] [45], 76% [46, 47] and more than 90% [48] [49] [50]. Consequently, a comparative study with these works would be of little value.

Personality trait identification is a methodology for evaluating human emotions from a piece of handwriting. To some extent, it discloses the inner psychological state of the writer. This offers a useful data source for forensic control, recruitment profiling, medical diagnosis and many other applications. Personality trait identification consists of handwriting pattern analysis, character feature extraction and classification. The idea is currently an active research subject, but results are still far from satisfactory. This is due to the complexity of writing patterns and the variation in writers' styles. In this paper, a personality trait identification approach has been proposed in which a set of writing-level features has been evaluated. Experiments have been carried out on a database of 543 handwritten samples. The results showed a prediction accuracy of more than 70% for both the edge-hinge distribution and the run-length distribution and more than 55% for the other features, which seems effective compared with similar works in the literature. Moreover, we think the approach would perform well if the experimental space were expanded to include more features and classifiers. It hence opens avenues for further studies that improve handwriting recognition accuracy.
1. Les mystères de l'ecriture
2. A trainable feature extractor for handwritten digit recognition. Pattern Recogn
3. Computer aided graphology
4. The MNIST database of handwritten digits
5. Morphological waveform coding for writer identification. Pattern Recogn
6. Writer identification: statistical analysis and dichotomizer
7. Development of an automated handwriting analysis system
8. Automated human behavior prediction through handwriting analysis
9. Behavior prediction through handwriting analysis
10. Feature extraction from handwritten documents for personality analysis
11. Recognition of online isolated handwritten characters by backpropagation neural nets using sub-character primitive features
12. Automatic detection of handwriting forgery
13. Personal identification based on handwriting
14. A combined crisp and fuzzy approach for handwriting analysis
15. Towards emotional control recognition through handwriting using fuzzy inference
16. HABIT: handwritten analysis based individualistic traits prediction
17. Handwriting analysis for detection of personality traits using machine learning approach
18. A novel hybrid CNN-SVM classifier for recognizing handwritten digits
19. Survey on handwriting-based personality trait identification
20. TxPI-u: a resource for Personality Identification of undergraduates
21. A critical survey of music image analysis
22. An efficient staff removal approach from printed musical documents
23. Music score binarization based on domain knowledge
24. Comparison between neural network and support vector machine in optical character recognition
25. Staff detection with stable paths
26. Writer identification using edge-based directional features
27. Handwriting identification by means of run-length measurements
28. On the usage of the 2D-AR-model in texture completion scenarios with causal boundary conditions: a tutorial. Signal Process Image Commun
29. Texture recognition via auto regression. Pattern Recogn
30. Writer style from oriented edge fragments
31. A set of handwriting families: style recognition
32. Advanced Topics in
33. A multi-scale deep quad tree based feature extraction method for the recognition of isolated handwritten characters of popular Indic scripts. Pattern Recogn
34. Random forests
35. Finding the optimum classifier: classification of segmentable components in offline handwritten Devanagari words
36. Feature extractor based deep method to enhance online Arabic handwritten recognition system
37. Pattern classification
38. Artificial Neural Networks: Concepts, Tools and Techniques Explained for Absolute Beginners
39. Automatic prediction of age, gender, and nationality in offline handwriting
40. Visual aesthetic analysis for handwritten document images
41. Automatic personality identification using writing behaviours: an exploratory study
42. Personality analysis based on letter 't' using back propagation neural network
43. Study on determining the Myers-Briggs personality type based on individual's handwriting
44. Human behavior recognition based on hand written cursives by SVM classifier
45. Personality analysis through handwriting detection using android based mobile device
46. An overview of character recognition focused on off-line handwriting
47. Detecting features of human personality based on handwriting using learning algorithms. Adv
48. Automated handwriting analysis system using principles of graphology and image processing
49. An improved method for handwritten document analysis using segmentation, baseline recognition and writing pressure detection
50. Handwriting analysis based on segmentation method for prediction of human personality using support vector machine