key: cord-0045738-8dz9n31i
authors: Biel, Mikołaj; Kuta, Marcin; Kitowski, Jacek
title: Personality Recognition from Source Code Based on Lexical, Syntactic and Semantic Features
date: 2020-06-15
journal: Computational Science - ICCS 2020
DOI: 10.1007/978-3-030-50417-5_26
sha: dbeffe6225d5cd7bbe869b29d529ed7673950aaa
doc_id: 45738
cord_uid: 8dz9n31i

Automatic personality recognition from source code is a scarcely explored problem. We propose personality recognition with handcrafted features based on lexical, syntactic and semantic properties of source code. Out of 35 proposed features, 22 are completely novel. We also show that n-gram features are simple but surprisingly good predictors of personality, and present results arising from joint usage of both handcrafted and baseline features. Additionally, we compare our results with scores obtained within the Personality Recognition in SOurce COde track during the Forum for Information Retrieval Evaluation 2016 and set up state-of-the-art results for the conscientiousness and neuroticism traits.

Personality influences many aspects of human behaviour, e.g. the decisions made, the propensity to communicate with other people, the way of writing, or the music listened to [22]. In the context of computer science, personality may influence the organization of created source code, or the choice of a software project a person takes part in. While automatic personality recognition from text has attracted remarkable attention [37], personality recognition from source code is still a scarcely explored problem.

Automatic personality recognition can be useful to customize the learning process or to assess cultural fit in a company. Each company has a different culture [26]: there are places where programmers are supposed to contact clients often and, conversely, places where they talk only to their supervisors. Firms may also differ in workplace organization, be it an open-plan office, a small room, or remote work. Cultural fit is the degree to which an employee is satisfied with these and other aspects of the workplace. If a person fits in their company, they are more involved in what they are doing, more satisfied with what they have accomplished during their work time, and more productive, which is beneficial both for them and their employer. Cultural fit depends on one's personality; thus, an automatic personality recognition system that detects whether a person fits into the company's environment, based on a programming assessment completed during a recruitment phase, could save both the employee's and the employer's time and stress. Both in academia and industry, the psychological and sociological predispositions of programmers could be examined to better recognize their soft skills and choose a suitable job.

This paper proposes personality recognition from source code with a random forest on the basis of 35 handcrafted features capturing lexical, syntactic and semantic properties of source code. Out of these 35 features, 22 are novel and have not been used earlier in personality recognition from source code. We compare the above features with n-gram features serving as baseline features and present results arising from joint usage of both handcrafted and baseline features. Finally, we compare our results with scores obtained within the Personality Recognition in SOurce COde (PR-SOCO) track during the Forum for Information Retrieval Evaluation (PAN@FIRE 2016).

As a model of personality, we adopt the Big Five, a five-factor model of personality [27, 28].
The Big Five is a widely accepted model, being the result of long-time research, and there is a consensus that its five traits concisely describe independent personality dimensions [5]. The Big Five model assumes that personality can be described by the following five personality traits:

- Conscientiousness (C) - consistency, persistence, good organizational skills.
- Agreeableness (A) - attitude towards others (whether a person is suspicious or trustful, modest, willing to compromise).
- Neuroticism (N) - impulsiveness, susceptibility to stress and anxiety.
- Openness to experience (O) - intellectual curiosity, willingness to explore, rich imagination, searching for original solutions rather than following in someone's footsteps.
- Extroversion (E) - assertiveness, building relationships at ease.

Each personality trait can be divided further into six facets, but facets are out of the scope of our work.

Deep learning personality predictors require no feature engineering, preprocessing, scanning, or parsing of source code. An example of such an approach is an LSTM neural network which reads source code byte by byte [12]. The low amount of learning data is, however, especially problematic for this approach, as not only the correct predictor (a classifier or a regressor) but also the relevant features have to be learned from data.

Features designed for personality recognition from source code have been based mainly on the source code itself, but also on the structure of the project, the content of comments, or code complexity. In the PR-SOCO task, the following features were taken into account [33]:

- number of files submitted by each programmer,
- mean number of lines in programs,
- mean length of variable names,
- mean number of classes,
- mean length of classes (computed on the basis of the number of lines of code),
- mean number of attributes and methods in a class,
- number of programs implementing the same class,
- number of errors,
- Halstead complexity measures (e.g. difficulty and time needed for implementation and understanding),
- duplicated fragments of source code,
- cyclomatic complexity,
- frequency of occurrence of comments and their length,
- occurrence of comments written exclusively in capital letters,
- number of comments in classes,
- number of words inside comments,
- usage of punctuation marks inside comments,
- number of lines with missing whitespace characters inside arithmetic expressions,
- number of import declarations which import the whole content of a module (usage of * instead of concrete classes),
- used whitespace characters,
- ways of indentation and formatting used by the programmer,
- number of empty lines between methods and blocks of code, and number of whitespace characters between parentheses,
- occurrence of digits, capital and small letters, and symbols in names, as well as length of names.

In [3], the frequency distribution of different types of nodes in an abstract syntax tree was examined, yielding, however, low results, only slightly above baseline approaches.

Another type of features are character n-grams: versatile, easy-to-implement features which are language independent and have a wide range of applications in classification tasks, including authorship attribution [14, 35], author profiling [2], authorship verification [9, 20] and plagiarism detection [23]. They may also provide convenient features for a baseline solution of PR-SOCO. In the context of personality recognition from source code, character n-grams were used in [17, 32].
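To make this feature type concrete, the sketch below builds frequency profiles of character trigrams and token trigrams for one source file. It is a minimal illustration, not the implementation from [17, 32]; the function names and top-k cutoffs are our own (the cutoffs mirror the baseline described later in the experiments), and the javalang tokenizer stands in for a generic Java scanner.

```python
from collections import Counter

import javalang.tokenizer  # Java scanner; also used later in the project


def char_trigram_profile(code, top_k=1000):
    """Frequencies of the top_k most frequent character trigrams."""
    counts = Counter(code[i:i + 3] for i in range(len(code) - 2))
    return dict(counts.most_common(top_k))


def token_trigram_profile(code, top_k=500):
    """Frequencies of the top_k most frequent token trigrams, with
    tokens taken from the javalang scanner."""
    values = [token.value for token in javalang.tokenizer.tokenize(code)]
    counts = Counter(tuple(values[i:i + 3]) for i in range(len(values) - 2))
    return dict(counts.most_common(top_k))
```

In a profile-based setting, profiles of all files authored by one programmer would be merged before the cutoff is applied.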
The choice of a predictor (a regressor or a classifier) is a more standard procedure and includes mainly: linear regression [16, 25, 32], support vector regression [7, 11], decision trees [11], nearest neighbours [25, 36] and neural networks [12, 36].

The research concerning personality recognition from source code is scarce, and extraction of novel features will likely extend the possibilities of distinguishing traits. Table 1 shows the proposed handcrafted lexical, syntactic and semantic features for automatic personality recognition from source code. Consistency in using curly brackets around one-line branches of code is implemented in two variants, so it gives rise to two features. The number of consecutive lines with aligned characters represents four features, as it is computed separately for four groups of characters. Thus, in total there are 35 proposed features.

The proposed features are grounded in the extension of the lexical hypothesis to programming languages. The lexical hypothesis [1] says that the most important differences in personality are reflected in the natural language and vocabulary used. According to [21], the more important the difference, the more likely it is to be reflected in a single word. We suppose that, in the domain of programming code, a conscientious person will likely apply consistent indentation throughout the code, a person high in openness might use richer vocabulary, while an extrovert might use longer names for variables, methods and classes. Additionally, a correlation between personality traits and programming style has been found in [10], according to which persons high in openness prefer a breadth-first programming style, while persons low in openness prefer a depth-first programming style.

We describe in detail the three features that are most complicated due to their involved implementation: the number of code duplications, the length of comments in characters, and the level of indentation.

Detection of code duplication is quite a complex task, which could even be cast as another machine learning problem, provided suitable learning data were available or generated [24]. We adopted a simpler solution consuming fewer computing resources: syntax tree rewriting [30]. Two pieces of code, one being a duplicate of the other, exhibit the same structure but differ in the names of constants, variables or methods. The syntax tree generated with the javalang parser is transformed into a topologically equivalent syntax tree whose nodes are simplified to only reflect the structure of the code and discard irrelevant data. For instance, the name of a declared method is discarded, but the structure of its body, the types of formal parameters and the returned type are retained. For blocks of instructions, information about entrance conditions is discarded. Detection of code duplication in one block is performed on the basis of such a simplified tree: a list of all subtrees in the block is created, and subtrees which serialize to the same expression are treated as duplications (a minimal sketch of this idea is given below).
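The following sketch illustrates the simplification step on top of the javalang parser. It is an illustrative reconstruction under our own assumptions, not the authors' implementation: the simplify function, the size threshold and the counting scheme are ours.

```python
from collections import Counter

import javalang      # Java parser used in the project
import javalang.ast


def simplify(node):
    """Rewrite an AST fragment into a structure-only tuple: node types
    are kept, while identifiers, literals and other leaf values are
    dropped, so two fragments differing only in naming serialize to
    the same expression."""
    if isinstance(node, javalang.ast.Node):
        return (type(node).__name__,
                tuple(simplify(child) for child in node.children))
    if isinstance(node, (list, tuple, set)):
        return tuple(simplify(child) for child in node)
    return None  # names, literals, modifiers: only structure is retained


def count_duplications(source, min_length=80):
    """Count repeated simplified subtrees; min_length is an illustrative
    threshold that skips trivially small subtrees."""
    tree = javalang.parse.parse(source)
    counts = Counter()
    for _, node in tree.filter(javalang.ast.Node):  # visit every node
        serialized = repr(simplify(node))
        if len(serialized) >= min_length:
            counts[serialized] += 1
    return sum(c - 1 for c in counts.values() if c > 1)
```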
Computing the length of a comment, otherwise simple, requires detecting whether the comment contains parts of source code. Parsing a comment with a parser of whole Java programs would end in failure, as programmers usually comment out a few lines of code or single methods rather than entire programs. To solve this problem, besides the main parser of the whole program, parsers of smaller grammatical units of a program are used.

As whitespace characters are discarded during lexical analysis, and even less information is passed to the parser, the level of indentation feature was implemented as a state machine (separate from the parser used) which reads tokens one by one and tracks the level of indentation. One difficulty in implementing this feature lies in distinguishing between correct and wrong indentation after a sequence of empty lines of code. Although based only on the finite automata formalism, the state machine has to roughly understand the syntax of Java: it tracks the number of opening parentheses and curly brackets, closes an open block at the correct indentation level, and reopens a block of code at a wrong indentation level. The state machine also knows which instructions require indentation. An additional difficulty arises from one-line bodies of if and for instructions, where curly brackets are not required. This seemingly simple task becomes a complex programming problem due to the great number of cases which have to be considered. Due to the above difficulties, the implementation of the discussed feature does not check the level of indentation in conditional instructions and loops whose bodies contain only one line of code, nor in switch instructions. For the switch instruction it is even impossible to determine which notation is correct, as the flat form was used by programmers mainly in the past, while the indented form of the switch instruction is predominant currently.
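To illustrate the state-machine idea, the sketch below checks indentation line by line against the current curly bracket depth. It is deliberately simplified relative to the feature described above: it is line-based rather than token-based and ignores one-line bodies, switch instructions, braces inside strings or comments, and reopened blocks.

```python
def indentation_violations(source, unit=4):
    """Count lines whose indentation disagrees with the current block
    depth; a simplified sketch of the indentation state machine."""
    depth = 0
    violations = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("//", "/*", "*")):
            continue  # skip empty lines and comment lines
        # A line that starts by closing a block sits one level shallower.
        expected = depth - 1 if stripped.startswith("}") else depth
        if len(line) - len(line.lstrip(" ")) != max(expected, 0) * unit:
            violations += 1
        # Update the block depth from the braces appearing on this line.
        depth = max(depth + stripped.count("{") - stripped.count("}"), 0)
    return violations
```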
As many of the proposed features were based on the syntactic structure of source code, the choice of a Java parser was an important part of the feature engineering. Three parsers were considered due to their established popularity: an ANTLR-generated parser, JavaParser and javalang. Table 2 presents the measured time of parsing source code with the above parsers for the Hello world program (Listing 1.1) and for the PR-SOCO corpus. The parser generated by ANTLR was incorrect, as it was not able to parse all source codes from the training corpus; it was also very slow. Although JavaParser turned out to be faster than javalang on the PR-SOCO corpus, we chose the latter, as it is implemented in Python, which was the language of the whole project.

As learning and evaluation data we used the corpus of source codes released for the PR-SOCO track [33], which accompanied PAN@FIRE 2016. The track was aimed at automatic personality recognition of programmers on the basis of Java source codes they authored. In the PR-SOCO corpus, personality was modelled with the Big Five, and each trait was given a value from [20, 80]. The corpus contains 2492 source code programs written in Java by 70 students of computer science, along with the values of their personality traits. The values of personality traits were determined on the basis of a 25-item BFI questionnaire called the Big Five Locator, which was completed by the students. The students made their code submissions through a web-based online judge for grading. The judge system does not have tools for style correction; however, it is not known whether students used an IDE before submission. The training and test sets contain source codes of 49 and 21 programmers, respectively. During the PR-SOCO contest, the personalities of the 21 persons from the test set were concealed from the participating teams. Each team was allowed to submit 6 trial solutions (shots). A single solution predicted five traits for each of the 21 persons from the test set. Figure 1 presents the distribution of values taken by each of the five traits. Values from the ranges [0, 20) and [80, 100] are never taken by any trait.

We followed the PR-SOCO track and used two measures to assess our solution and compare it with existing personality predictors: Root Mean Square Error (RMSE) and the Pearson Product-Moment Correlation coefficient (PCC). RMSE measures the effectiveness of a regressor. For each personality trait t, the root mean square error is defined as:

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2},

where x_i denotes the true value of trait t for the i-th instance (programmer), y_i is the value of trait t predicted by a personality predictor, and N is the number of instances (programmers). The lower the RMSE, the better. The Pearson Product-Moment Correlation is defined as:

PCC = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}},

with \bar{x}, \bar{y} denoting the mean values of the samples (x_i)_{i=1}^{N} and (y_i)_{i=1}^{N}, respectively. PCC indicates whether the obtained RMSE is a random artifact or whether there is a correlation between the actual and guessed values of traits. The larger the absolute value of PCC, the better.

In the experiments we took the profile-based approach, i.e., all source codes of a programmer were treated as one learning instance. Since personality traits take continuous values, personality recognition was cast as a regression problem, with random forest regression [4] from the scikit-learn package [31] as the prediction module. Random forest regressors were trained on 85% of the original training set; the remaining 15% of the training set was reserved for the model selection procedure. We examined random forest regression with the number of decision trees varying from 64 to 128 and their depth varying from 2 to 6. Optimal values of the above hyperparameters were selected separately for each personality trait with grid search [8]. Mean Square Error (MSE) was used as the function measuring the quality of a split.

Besides regression with the 35 handcrafted features, we used N = 1500 n-gram features as our baseline: N_1 = 1000 most frequent character trigrams (n-grams with n = 3) and N_2 = 500 most frequent token trigrams. By tokens we mean the lexical units returned by the Java scanner. Finally, we tried personality recognition with 1535 features, both handcrafted and n-gram features.

Table 3 presents the results of personality recognition we obtained with three sets of features: the proposed handcrafted features, n-grams, and handcrafted features together with n-grams. For comparison, the best results, medians and mean results of the FIRE competitors are given in Table 4 (the summary of the FIRE competition [33] also shows the first, second and third quartiles, all extreme values, and detailed results of all participating teams). Additionally, we computed confidence intervals with the pairs bootstrap method [13]. For the conscientiousness personality trait, the model with handcrafted features obtained an RMSE equal to 8.17 (with 95% confidence interval [6.00, 9.98]), which is lower than the minimum error achieved in the competition. The obtained value of PCC is 0.33.

For the state-of-the-art results, we inspected the random forest regressors and identified the features with the highest importance; for the model predicting conscientiousness with handcrafted features, we ranked the features from the most to the least important (a sketch of the training and inspection procedure is given below).
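A minimal sketch of the setup described above, assuming a feature matrix X (one row per programmer) and a vector y with one trait's values have been prepared elsewhere; the helper name, grid points and random seed are illustrative and not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def fit_trait_regressor(X, y, seed=0):
    """Grid search over forest size (64-128 trees) and depth (2-6),
    keeping the model with the lowest RMSE on a held-out 15% split."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.15, random_state=seed)      # 85% / 15% split
    best_model, best_rmse = None, np.inf
    for n_trees in (64, 96, 128):                     # illustrative grid points
        for depth in range(2, 7):
            model = RandomForestRegressor(
                n_estimators=n_trees, max_depth=depth,
                criterion="squared_error",            # MSE split criterion
                random_state=seed).fit(X_train, y_train)
            rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
            if rmse < best_rmse:
                best_model, best_rmse = model, rmse
    return best_model


# Feature importances of the selected model, most important first:
# order = np.argsort(model.feature_importances_)[::-1]
```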
The effect of the joint usage of handcrafted features and n-grams is a reduced error (in comparison to usage of only one type of features) for extroversion and agreeableness, although it does not set up new state-of-the-art results.

Finally, we examined the statistical significance of the obtained trait predictions (RMSEs). Statistical tests, conducted on the STAC platform [34], were computed for 14 algorithms (11 solutions from the PR-SOCO task and our three solutions: with handcrafted features, with n-grams, and with all features) and five datasets (predictions for each of the five traits were counted as a separate dataset). For solutions from the PR-SOCO task we always chose the best shot. As the omnibus test we used the Friedman F-test [15] for testing the hypothesis H_0 that the means of the results of two or more algorithms are the same, followed by the Nemenyi test [29] as the post-hoc test for pairwise comparison of predictors. At the significance level α = 0.05, hypothesis H_0 should be rejected, but the pairwise comparison revealed no pair of algorithms with a statistically significant difference in results.
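The same comparison can be reproduced outside STAC. The sketch below uses scipy, which implements the chi-square form of the Friedman test, and the scikit-posthocs package for the Nemenyi post-hoc test; both packages are our assumed substitutes for the STAC platform, and the random matrix is a placeholder for the actual 5 x 14 table of RMSEs.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# rmse[i, j]: RMSE of algorithm j on dataset i (each trait is one dataset).
# Placeholder data standing in for the actual 5 x 14 table of results.
rmse = np.random.default_rng(0).uniform(7.0, 12.0, size=(5, 14))

statistic, p_value = friedmanchisquare(*rmse.T)   # omnibus test
print(f"Friedman test: statistic={statistic:.2f}, p={p_value:.3f}")

if p_value < 0.05:
    # Pairwise Nemenyi comparison; entry (j, k) is the p-value for the
    # difference between algorithms j and k.
    print(sp.posthoc_nemenyi_friedman(rmse).round(3))
```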
In this work we proposed new features for automatic personality recognition from source code. The handcrafted features turned out to be most useful for predicting openness and conscientiousness, traits which (together with extroversion) are connected with programming aptitude [19]. These features, despite their low number, achieved state-of-the-art results for conscientiousness. The lowest error in conscientiousness prediction is in line with the fact that conscientiousness (and extroversion) are easily inferred even from thin slices of behaviour [6, 18]. N-gram features are surprisingly good predictors of personality; at the same time, they are easy to implement and language independent.

While the programmers' personalities may be connected with the code they write, we could not capture the relation between them. The results we achieved in neuroticism and conscientiousness recognition are state-of-the-art in personality recognition from source code, yet still insufficient to state that such a correlation exists. The large confidence intervals of the RMSEs and PCCs, and the conducted statistical tests, indicate that larger datasets are needed to increase the statistical strength of our results as well as of other methods proposed so far. New datasets should take into account more programming languages and more programmers, including professional programmers.

References

[1] Trait names: a psycho-lexical study
[2] NGram: new Groningen author-profiling model
[3] Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation
[4] Random forests
[5] A large-scale, in-depth analysis of developers' personalities in the Apache ecosystem
[6] A thin slice perspective on the accuracy of first impressions
[7] Personality recognition applying machine learning techniques on source code metrics
[8] Hyperparameter search in machine learning
[9] A basic character N-gram approach to authorship verification - notebook for PAN at CLEF
[10] Links between the personalities, styles and performance in computer programming
[11] A supervised approach for personality recognition in source code using code analysis tool at FIRE
[12] Shallow recurrent neural network for personality recognition in source code
[13] An Introduction to the Bootstrap. No. 57 in Monographs on Statistics and Applied Probability
[14] Local histograms of character N-grams for authorship attribution
[15] The use of ranks to avoid the assumption of normality implicit in the analysis of variance
[16] Indian Statistical Institute Kolkata at PR-SOCO 2016: a simple linear regression based approach
[17] PRHLT at PR-SOCO: a regression model for predicting personality traits from source code
[18] The elusive general factor of personality: the acquaintance effect
[19] What makes a computer wiz? Linking personality traits and programming aptitude
[20] N-gram feature selection for authorship identification
[21] The big five trait taxonomy: history, measurement, and theoretical perspectives
[22] The influence of listener personality on music choices
[23] Optimisation of character n-gram profiles method for intrinsic plagiarism detection
[24] CCLearner: a deep learning-based clone detection approach
[25] Pisco: a computational approach to predict personality types from Java source code
[26] Organizational Culture: Mapping the Terrain
[27] Validation of the five-factor model of personality across instruments and observers
[28] An introduction to the five-factor model and its applications
[29] Distribution-free multiple comparisons
[30] Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages
[31] Scikit-learn: machine learning in Python
[32] Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation
[33] PAN@FIRE: overview of the PR-SOCO track on personality recognition in SOurce COde
[34] STAC: a web platform for the comparison of algorithms using statistical tests
[35] Not all character N-grams are created equal: a study in authorship attribution
[36] Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation
[37] A survey of personality computing

Acknowledgments. The research presented in this paper was supported by the funds assigned to AGH University of Science and Technology by the Polish Ministry of Science and Higher Education. Paolo Rosso, Francisco Rangel and Felipe Restrepo-Calle are acknowledged for making the PR-SOCO corpus available for our research and for information about its construction.