"FIJO": a French Insurance Soft Skill Detection Dataset
David Beauchemin, Julien Laumonier, Yvan Le Ster and Marouane Yassine (authors contributed equally to this work)
April 11, 2022 (arXiv:2204.05208v1 [cs.CL])

Understanding the evolution of job requirements is becoming more important for workers, companies and public organizations to follow the fast transformation of the employment market. Fortunately, recent natural language processing (NLP) approaches allow for the development of methods to automatically extract information from job ads and recognize skills more precisely. However, these efficient approaches need a large amount of annotated data from the studied domain, which is difficult to access, mainly due to intellectual property constraints. This article proposes a new public dataset, FIJO, containing insurance job offers, including many soft skill annotations. To understand the potential of this dataset, we detail some of its characteristics and limitations. Then, we present the results of skill detection algorithms using a named entity recognition (NER) approach and show that transformer-based models achieve good token-wise performance on this dataset. Lastly, we analyze some errors made by our best model to emphasize the difficulties that may arise when applying NLP approaches.

The digital transformation's impact on professional practices has led to a rapid evolution of in-demand job skills, making these changes difficult to track for enterprises and workers. Manual evaluation is becoming exceedingly complex and time-consuming, justifying the need for an automatic evaluation of these changes [1]. One way to study these changes in job ads is automatic skill recognition [2]. However, job offer data is not easy to access, even on job offer web platforms, mainly due to intellectual property issues. Moreover, the annotation needed to achieve good skill recognition performance through supervised machine learning is another complex and costly task. While several job description datasets are accessible online, such as the mycareersfuture public dataset [3], very few public annotated datasets exist, as [4] reported, and none are in French.

As contributions, in this article, we propose "French Insurance Job Offer (FIJO)", a free and public dataset comprising both non-annotated and annotated job offers, to facilitate research in this domain. This dataset focuses on soft skills, which describe the way employees work alone and with others, instead of hard skills, which represent more formal knowledge used at work [5]. We also explore the training of a token-wise NER French skill detection algorithm in the field of insurance with state-of-the-art algorithms (all code used to obtain these results will be available on GitHub).

The rest of this paper is structured as follows. Firstly, we begin with a brief overview of the literature on skill detection in section 2. Secondly, we present FIJO in section 3, including how we constructed the corpus, along with some statistics and an analysis of the dataset. Thirdly, we present our skill detection algorithms, the training settings and our results in section 4. Finally, we draw some concluding remarks in section 5.

The first approach to recognizing skills inside a job offer is using statistical techniques.
It is a commonly used approach, as in [6], which detects hard and soft skills in job offers by matching a list of keywords in the text against a skill database. Their skill database draws on external knowledge bases, namely DBpedia and StackOverflow. Other works [2, 7, 8] do not rely on skill databases but use content analysis to detect the presence of certain words or concepts in offers.

Approaches using machine learning to detect skills in job offers have received a lot of attention recently [4]. The skill recognition problem has been modeled using topic modeling through Latent Dirichlet Allocation [9], text classification with CNN and LSTM [10], or NER with LSTM [11] and transformer-based models [12]. However, these works either focus on specialized skills (e.g. IT skills [9]) or on soft skills alone [10], or are applied to English job ads only.

As [4] conclude in their survey on skill identification, very few datasets are available online. Most recent works do not release their dataset nor mention the reasons for the non-publication. For example, [13] uses content analysis on 5 million non-annotated job ads to determine whether or not software testing is a standard in the IT industry. The lack of publication was possibly due to intellectual property constraints from their industrial partner. However, a recent public dataset released by [3] focuses on identified hard skills in job ads. The dataset consists of 20,298 job ads, where each ad includes nearly 20 hard skills on average. A unique skill term corresponds to a unique class; thus, the overall dataset includes 2,548 skill classes. For example, "Microsoft Word" is one skill, and "Microsoft Excel" is another. This dataset lacks annotations for soft skills, which have been in higher demand than hard skills over the past decade [5].

FIJO was created in partnership with four Canadian insurance companies. The dataset consists of non-annotated and annotated French job ads published by them between 2009 and 2020, as well as their metadata (e.g. date of publication). Each job offer's text was manually extracted and semi-manually cleaned using the following procedure: removal of carriage returns inside incomplete sentences (caused by bullet point text) and of multiple carriage returns, removal of bullet point characters, normalization of apostrophe characters, and removal of whitespace at the beginning or end of a sentence.

In order to protect the interests of the companies to whom the published data belongs, we chose to de-identify the job ads before making them publicly available. This process consists of three steps. Firstly, we used regular expressions to substitute the different variations of the companies' names and the email addresses present in the offers. Next, a SpaCy French pre-trained NER model (fr_core_news_lg) was used to identify potential names and locations to help with the following step. Finally, a manual check was conducted on each offer to substitute the following elements: names, locations, postal addresses and miscellaneous elements that could help identify the companies (e.g. products, department names). Table 1 describes the substitution tags employed.
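The following is a minimal sketch of this de-identification pipeline, assuming the fr_core_news_lg model is installed; the company-name pattern and substitution tags below are illustrative placeholders, not the actual patterns or tags listed in Table 1:

```python
import re
import spacy

# Hypothetical patterns; the real list of company-name variations is confidential.
COMPANY_PATTERN = re.compile(r"\b(CompanyA|Company-A)\b", flags=re.IGNORECASE)
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

nlp = spacy.load("fr_core_news_lg")  # French pre-trained NER model

def deidentify(text: str) -> str:
    """Substitute obvious identifiers, then flag names and locations for the manual check."""
    text = COMPANY_PATTERN.sub("[COMPANY]", text)
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    for ent in nlp(text).ents:
        if ent.label_ in {"PER", "LOC"}:  # candidates for manual substitution
            print(f"To review manually: {ent.text} ({ent.label_})")
    return text
```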
The dataset is composed of 867 de-identified French job ads. As shown in Figure 1, job ad lengths vary greatly, with an average of 300.97 tokens and a standard deviation of 119.78 tokens (punctuation marks are counted as tokens, since our pre-processing procedure does not include punctuation removal, lemmatization or stemming). We can also observe that a few offers (16) are outliers, with a length of more than 572 tokens. Moreover, Table 2 presents statistics of the dataset, where the lexical richness corresponds to the ratio of a job offer's number of unique words over the vocabulary cardinality, without removing stop words or normalizing them [14]. We can see that the lexical richness is relatively low, which means the offers are quite similar in terms of vocabulary.

To learn to identify soft skills inside ads, 47 offers were annotated, and more precisely each job offer sentence, for a total of 499 annotations. Our annotation process consists of creating a skills reference, which defines the skills used for the annotations, and randomly selecting 47 offers to be annotated by a domain expert. Annotations consist of non-overlapping entities within individual sentences; however, the overall job offer was given to the annotator for context. Each entity contains at least one word or, at most, the complete sentence. Based on the skill groups of the AQESSS public skills repositories and the one used by our insurance partners, which are based on the commercial Korn Ferry and SPB repositories, a set of four skills has been identified, namely "Thoughts", "Results", "Relational" and "Personal". The number of classes has been limited to four mainly because, in general, learning algorithms are known to struggle with a large number of tags [15], but also because of the possible confusion between skills during the annotation process (see subsection 3.4). Figure 2 presents an example of a sentence annotation.

First, as shown in Figure 3, our annotated portion of FIJO consists of 932 entities distributed unevenly between the four classes. We can see that the class with the most entities is "Thoughts" with 317 entities, followed by "Personal" with 297 and "Relational" with 216. The class with the lowest number of entities is "Results" with 102 entities. Second, as illustrated in Figure 4, our entities are on average 9.6 tokens long with a standard deviation of 7.14 tokens. Their length ranges from a single token to 50 tokens, but 50% are below 8 tokens. Moreover, Table 3 presents statistics of the annotated dataset. We have a similar average number of words and sentences to the non-annotated dataset. However, our annotated dataset uses fewer words and even fewer unique words. Finally, Figure 5 presents the number of occurrences of stop words inside an entity text (blue) or outside of an entity (red). It shows that some stop words, such as "de" and "des", are over-represented in entities, mostly because long skills tend to contain a high number of stop words.

We have identified three limitations to our dataset: unbalanced entity classes, lexical overlap and the difficulty of soft skill identification.

Firstly, as illustrated in Figure 3, our annotated dataset is composed of an unbalanced number of classes, where two classes are more represented than the other two. When class imbalance exists in training data, a classification algorithm will typically over-classify the majority group ("Thoughts") due to its increased prior probability. As a result, the instances belonging to the minority group ("Results") will likely be misclassified more often than those belonging to the majority group [16].

Secondly, Figure 6 illustrates the two-dimensional PCA of the TF-IDF scores of each entity's text after stop word removal and lemmatization, separated by class. It shows that the terms (and centroids) of some skill classes can be separated from the others. For example, the "Personal" (purple) centroid (upper left) and the "Thoughts" (red) centroid (lower left) can be distinctly separated from each other and from the other two centroids. Such separation may make it easier to discern those cases, since these skills use specific terms that are less common in other skill texts. For example, the word "collaborer" (collaborate) appears only in the "Personal" entities. By contrast, terms well distributed among the different groups may be more challenging; for example, the word "atteindre" (achieve) occurs in all four classes. However, we can also see that some terms are quite close to each other, possibly leading to a more difficult distinction between the four classes' word terminologies.
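A minimal sketch of this analysis with scikit-learn, assuming the entity texts have already been lemmatized and stripped of stop words (the function and variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

def project_entities(entity_texts, entity_labels):
    """Project TF-IDF vectors of entity texts onto two principal components and
    compute one centroid per skill class (for a plot such as Figure 6)."""
    vectors = TfidfVectorizer().fit_transform(entity_texts).toarray()  # dense for PCA
    points = PCA(n_components=2).fit_transform(vectors)
    centroids = {
        label: points[np.array(entity_labels) == label].mean(axis=0)
        for label in set(entity_labels)
    }
    return points, centroids
```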
Finally, soft skill identification is not an easy task, as mentioned by [2], and our dataset reflects it. Firstly, the distinction between some skills can be quite confusing, as seen in the example in Figure 7. This example can be read as "Welcoming visitors and responding to their various requests for information" and is tagged as Thoughts. However, a reader might find that such a skill could also represent a Relational one, thus creating a confusing distinction between some examples. Secondly, some examples contain two consecutive skills of the same class separated by a coordinating conjunction, as seen in Figure 8. We can see that the French coordinating conjunction "et" (and) splits the two Personal entities. However, such a coordinating conjunction is not always used this way; it can also add information within a single skill, such as in "vérifier et contrôler" (verify and control). Thus, in some cases it can be quite challenging to determine whether a subset of a sentence is one skill or two. Finally, it is common to see job ads that list expected soft skills in the same sentence where not all skills belong to the same skill class. An example of such switching between two entity types is illustrated in Figure 9, which shows an entity "squeezed" between two other entities of a different class: the first token is an entity, followed by another entity of a different type, and the rest of the sentence is another entity of the same class as the first entity. This kind of "squeezing" of two entities sharing the same class around an entity of a different class can be challenging for a NER model. All these limitations justify the fact that, in this article, we apply token-wise approaches to the dataset to start with an easier learning task.

Since skill detection is a sequence classification task similar to NER, we approach it using a recurrent neural network, namely a bidirectional long short-term memory (bi-LSTM) network [17]. First, we encode each word in a given sequence using FastText's pre-trained French embedding model [18], which produces 300-dimensional word embeddings. Once a sequence is encoded, we feed each of the word embeddings to our bi-LSTM, which has a hidden state dimension of 300, and obtain a new representation for each word. The final step in the prediction process is to classify each word using a fully connected network composed of one linear layer followed by a softmax activation function.
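A minimal PyTorch sketch of this token classifier, assuming pre-computed 300-dimensional FastText embeddings as input; the class name and implementation details are illustrative and may differ from our released code:

```python
import torch
from torch import nn

class BiLSTMTagger(nn.Module):
    """Token-wise skill tagger: FastText embeddings -> bi-LSTM -> linear layer -> softmax."""

    def __init__(self, embedding_dim: int = 300, hidden_dim: int = 300, num_tags: int = 5):
        super().__init__()
        # Bidirectional LSTM over the word embeddings of a sentence.
        self.bilstm = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            bidirectional=True,
            batch_first=True,
        )
        # One linear layer mapping each contextual representation to the tag space
        # (the four skill classes plus an "O" tag for tokens outside any skill).
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, sequence_length, embedding_dim) FastText vectors.
        contextual, _ = self.bilstm(embeddings)
        logits = self.classifier(contextual)
        # Softmax over the tag dimension yields per-token class probabilities
        # (during training, the raw logits would typically feed a cross-entropy loss).
        return torch.softmax(logits, dim=-1)
```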
LSTM-based classifiers have proven to be quite efficient in terms of performance for sequence classification tasks. However, these models usually require a large amount of data to guarantee such performance. Therefore, since our annotated dataset contains a limited amount of data, we also use a pre-trained transformer model [19] in a transfer learning setting. Our model of choice is CamemBERT [20], a French transformer encoder based on the BERT architecture [21]. We use it to encode our text sequences, and we employ a fully connected network identical to the one used with the bi-LSTM model in order to accomplish the classification. We experiment with two configurations of this model, one in which CamemBERT's weights are frozen and one in which they are not, dubbed CamemBERT frozen and CamemBERT unfrozen respectively. To further investigate the sensitivity of our models to the amount of training data, as well as the transfer learning potential of the pre-trained transformer model, we experiment with different data subset sizes and report the results in subsection 4.2.

We train each of the aforementioned models five times using different random initialization seeds ([5, 10, 15, 20, 25]). The models were trained for at most 300 epochs with an initial learning rate of 0.01 for bi-LSTM and CamemBERT frozen, and of 0.0001 for CamemBERT unfrozen, as suggested by [21]. A learning rate schedule that decreased the learning rate by a factor of 0.1 after every five epochs without any decrease of the validation cross-entropy loss was applied, and early stopping with a patience of 15 epochs was used to prevent overfitting. Additionally, for the CamemBERT unfrozen model, we experimented with the training procedure and hyperparameters proposed in [22] in order to address the possible training instability associated with fine-tuning transformer-based language models. As such, an additional five experiments (using the same random seeds) were run with CamemBERT unfrozen by limiting the number of epochs to 20 and scheduling the learning rate as follows: we start the training with a linear learning rate warmup (i.e. the learning rate is linearly increased) up to 0.2e-5 for the first 10% of epochs, followed by a linear learning rate decay for the rest of the training epochs. We use CamemBERT unfrozen warmup to refer to this model.

The training data was divided using an 80%-10%-10% train-validation-test split with simple random sampling, resulting in a total of 400 training samples. We also experiment with different training data subsets. Each subset is composed by sampling the first X data samples from the full training set placed in order, with X ∈ {50, 100, 150, 200, 350, 400}. The models and training procedures were implemented using Poutyne [23], HuggingFace's Transformers [24] and spaCy [25].
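A plain-PyTorch sketch of the two learning rate schedules described above (our code relies on Poutyne callbacks instead, and the optimizer choices below are assumptions for illustration):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR, ReduceLROnPlateau

def plateau_schedule(model, initial_lr=0.01):
    """Divide the learning rate by 10 after 5 epochs without validation-loss improvement;
    call scheduler.step(validation_loss) at the end of each epoch."""
    optimizer = torch.optim.SGD(model.parameters(), lr=initial_lr)
    return optimizer, ReduceLROnPlateau(optimizer, factor=0.1, patience=5)

def warmup_schedule(model, peak_lr=0.2e-5, total_epochs=20, warmup_fraction=0.1):
    """Linear warmup to peak_lr over the first 10% of epochs, then linear decay;
    call scheduler.step() at the end of each epoch."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)
    warmup_epochs = max(1, int(total_epochs * warmup_fraction))

    def lr_factor(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        return max(0.0, (total_epochs - epoch) / (total_epochs - warmup_epochs))

    return optimizer, LambdaLR(optimizer, lr_factor)
```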
Comparing both unfrozen models using only the best seed model per approach (i.e. seeds 20 and 25, respectively; Table 5) yielded a p-value of 0.5334. Thus, we can only assume no significant difference between the two predictive models. Moreover, the CamemBERT unfrozen model seems to suffer from a certain degree of instability, as shown by its consistently high standard deviation. This issue mostly persists when using a learning rate warmup followed by a linear decay as proposed by [22]. CamemBERT frozen is the least sensitive to random initialization, while bi-LSTM presents the highest sensitivity and the lowest performance.

When it comes to training subsets, we can observe that all models perform best with a high amount of data. However, performance is quite close across the 200 to 350 subset size range. We hypothesize that this is due to the data distribution of the training and test sets, since companies use slightly different ways to express skills. For example, one uses a bullet point style to enumerate skills in a pragmatic approach, while another uses a more situational approach that puts skills within context. Figure 10 shows that both the train and test sets contain skills belonging, in the majority, to one company (sky blue). Two more companies (light green and pink) are present in the test set, while their skills are under-represented in the training set. Furthermore, Figure 11 shows that the dominant company's (sky blue) skills are well represented in most training subsets, including the smaller ones, while increasing the number of training samples mostly adds skills related to a company that is not present at all in the test set (red). This means that the training is quite sensitive to how the data is shuffled because of the limited number of data samples. Therefore, more data would need to be annotated to reach a more balanced dataset, and stratified random sampling should be used instead of the current simple random sampling to better reflect the imbalance when splitting the data.
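A minimal sketch of such a stratified split with scikit-learn, assuming each annotated sentence carries a company identifier in its metadata (variable and function names are illustrative):

```python
from sklearn.model_selection import train_test_split

def stratified_split(sentences, tags, companies, seed=42):
    """80/10/10 train-validation-test split preserving per-company proportions."""
    train_x, rest_x, train_y, rest_y, _, rest_c = train_test_split(
        sentences, tags, companies, test_size=0.2, stratify=companies, random_state=seed
    )
    valid_x, test_x, valid_y, test_y = train_test_split(
        rest_x, rest_y, test_size=0.5, stratify=rest_c, random_state=seed
    )
    return (train_x, train_y), (valid_x, valid_y), (test_x, test_y)
```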
Finally, Table 6 presents the mean token-wise accuracy and one standard deviation per skill on the test set for CamemBERT unfrozen warmup and CamemBERT unfrozen, for a subset size of 400 (in Table 6, the "O" tag means that a token does not belong to a skill). We can see that for both approaches, the best performance occurs in the class with the fewest examples (Results). We hypothesize that this might be due to the "simplicity" of those examples, which are shorter than average, with an average length of around seven tokens. Thus, these examples are possibly more straightforward than the others, leading to an easier classification. Also, we observe a higher variance on both the Thoughts and Personal classes for both our models. These two tags have the highest standard deviation, even though they are the classes with the most examples. This means that the training is quite sensitive to the initialization of the models; because we have a limited number of data samples, minimizing such instability during training is more difficult. Therefore, more data would need to be annotated to reach a more stable training for all classes. Furthermore, we can see that there is still room for improvement for most tags.

Using the approach that yielded the highest accuracy (CamemBERT unfrozen), we conducted an error analysis of its 24 errors. We found that most of these errors were of a type similar to the cases illustrated in Figure 8, namely two consecutive skill annotations of the same class separated by a coordinating conjunction. The NER identified the two skills as a single skill in all those error cases. Moreover, a few cases (3) were of a similar error type, where the part of the sentence before a coordinating conjunction is not an entity, as illustrated in Figure 12, which shows an example of a wrongly predicted sentence using the best seed CamemBERT unfrozen model. The figure gives each token's ground truth and prediction, along with the model probabilities ("Prob" rows), using the same color scheme as Figure 6, namely red is the "Thoughts" class, purple is "Personal", blue is "Relational", orange is "Results", and green is a word not in an entity. We can see that not only does the NER wrongly predict the class, Personal rather than Thoughts, but it also wrongly predicts that the first part of the sentence, "les dossiers sont de nature courante", is a skill. We hypothesize that this is due to two things. First, the words "dossiers", "nature" and "courante" appear in other Personal examples; for instance, the word "dossiers" appears 49 times in a Personal entity, which could confuse our model as to whether such a sentence piece is a skill. Second, the coordinating conjunction "et" plus the determiner "les" mostly appear within an entity and rarely appear just outside of it (one token before). We argue that our model annotated the overall sentence as a skill in that specific case due to the overwhelming number of such examples. However, the entity class prediction is inconsistent with the sentence vocabulary distribution: the second part of the sentence is composed of words that only appear in Thoughts examples, such as "décision", "analyse" and "recherche". This leads to lower confidence in our NER's predictions, with those three words having the lowest probabilities.

This article presents a new public dataset in French, including annotated and non-annotated job offers in the insurance domain. It aims to support the development of machine learning models that perform automatic skill recognition inside job ads, an increasingly useful task for understanding the evolution of the labour market. The dataset statistics and characteristics show limitations that could make the learning task challenging. The dataset can be further improved by rebalancing the company and skill class distributions to make the annotated portion more representative of the non-annotated distribution. Moreover, the impact of lexical overlap and of the difficulty of soft skill identification could be lowered by having more experts annotate more job ads. In any case, this dataset will be improved by adding annotations.

Despite these limitations and the size of the dataset, we have obtained interesting results with pre-trained models using a token-wise approach. Although the skill-wise problem is closer to our main objective, our preliminary experiments on it with common NLP algorithms seem to lead to poor accuracy. Since our results here are token-wise and not skill-wise, it is harder to extract the correct span of skill entities, and consequently, we cannot determine the actual number of skills inside the non-annotated dataset. Thus, our work is a first step toward discovering some trends in the labour market and studying the evolution of skills. As our next step, our objective is to build efficient models that identify skills instead of tokens. Indeed, our models cannot distinguish two skills with the same tag if they are next to each other. Therefore, we need to detect the beginning of each skill inside the text. To achieve that, we plan to use the BIO tagging scheme instead of IO [27], as illustrated below. However, adding a new beginning tag for each skill group would probably reduce the overall accuracy because of the small size of the dataset.
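As a small illustration of the difference between the two schemes on two adjacent skills of the same class (the French phrase is an invented example, not taken from the dataset):

```python
# Two adjacent "Personal" skills. With IO tags, the boundary between them is lost;
# the BIO scheme recovers it with the "B-" prefix on the first token of each skill.
tokens   = ["faire", "preuve", "d'autonomie", "travailler", "en", "équipe"]
io_tags  = ["I-Personal"] * 6  # reads as one six-token skill
bio_tags = ["B-Personal", "I-Personal", "I-Personal",
            "B-Personal", "I-Personal", "I-Personal"]  # two three-token skills
```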
Lastly, improving the cleaning phase by detecting more precisely the usual conjunctions between two skills could be another way to keep the token-wise results while identifying skills more efficiently. We have conducted some error analyses, but this kind of analysis is constrained by the limited explainability of deep learning models. In the long term, we would like to make our model more explainable, both to help us understand its strengths and weaknesses and to explain results to recruiters and human resources staff so they can adapt their recruitment needs. To do so, we plan to explore counterfactual generation [28, 29]. Finally, we have only explored a few of the possibilities this dataset offers. Other tasks could include measuring the impact of writing style (e.g. long sentences vs. bullet points, different ways of addressing a potential applicant) on performance, the impact of gendered wording [30], the impact of COVID-19 and teleworking on skill requirements [31], or extending the dataset to include new tools used by human resources such as social media [32].

References
[1] The Future of Skills: Employment in 2030.
[2] Demand for AI Skills in Jobs: Evidence From Online Job Postings.
[3] Retrieving Skills from Job Descriptions: A Language Model Based Extreme Multi-label Classification Framework.
[4] A Survey on Skill Identification From Online Job Ads.
[5] Soft Skills, Hard Skills: What Matters Most? Evidence From Job Postings.
[6] Bridge the Terminology Gap Between Recruiters and Candidates: A Multilingual Skills Base Built From Social Media and Linked Data.
[7] Content Analysis of OR Job Advertisements to Infer Required Skills.
[8] Skill Requirements in Big Data: A Content Analysis of Job Advertisements.
[9] Big Data Software Engineering: Analysis of Knowledge Domains and Skill Sets Using LDA-Based Topic Modeling.
[10] Learning Representations for Soft Skill Matching.
[11] Representation of Job-Skill in Artificial Intelligence with Knowledge Graph Analysis.
[12] DataOps for Societal Intelligence: A Data Pipeline for Labor Market Skills Extraction and Matching.
[13] What 5 Million Job Advertisements Tell Us About Testing: A Preliminary Empirical Investigation.
[14] Comparing Measures of Lexical Richness.
[15] Training Highly Multiclass Classifiers.
[16] Survey on Deep Learning With Class Imbalance.
[17] Long Short-Term Memory.
[18] Enriching Word Vectors With Subword Information.
[19] Attention Is All You Need.
[20] CamemBERT: A Tasty French Language Model.
[21] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[22] On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines.
[23] Poutyne: A Simplified Framework for Deep Learning.
[24] Transformers: State-of-the-Art Natural Language Processing.
[25] spaCy: Industrial-Strength Natural Language Processing in Python.
[26] Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages.
[27] Segment Representations in Named Entity Recognition. In: International Conference on Text, Speech, and Dialogue.
[28] Generate Your Counterfactuals: Towards Controlled Counterfactual Generation for Text.
[29] Text Counterfactuals via Latent Optimization and Shapley-Guided Search.
[30] Evidence That Gendered Wording in Job Advertisements Exists and Sustains Gender Inequality.
[31] Trends and Disparities in Teleworking During the COVID-19 Pandemic in the USA.
[32] The Influence of Online Professional Social Media in Human Resource Management: A Systematic Literature Review.

This research was made possible thanks to the support of the Future Skills Centre and four Canadian insurance companies. We wish to thank the reviewers for their comments regarding our work and methodology.