title: Automatic Human Resources Ontology Generation from the Data of an E-Recruitment Platform
authors: Boudjedar, Sabrina; Bouhenniche, Sihem; Mokeddem, Hakim; Benachour, Hamid
date: 2021-02-22
journal: Metadata and Semantic Research
DOI: 10.1007/978-3-030-71903-6_10

Abstract. Over the last decade, several e-recruitment platforms have been developed, allowing users to publish their professional information (training, work history, career summary, etc.). However, the representation of this huge quantity of knowledge is still limited. In this work, we present a method based on community detection and natural language processing techniques to generate a human resources (HR) ontology. The data used in the generation process consists of user profiles retrieved from the Algerian e-recruitment platform Emploitic.com (www.emploitic.com), including occupations, skills and professional domains. Our main contribution lies in the identification of new relationships between these concepts using community detection in each area of work. The generated ontology has hierarchical relationships between skills, occupations and professional domains. To evaluate the relevance of this ontology, we used both a manual method, with experts in the human resources domain, and an automatic method, through comparisons with existing HR ontologies. The evaluation has shown promising results.

The recent development of web technologies has revolutionized the world by offering both permanent accessibility and high availability of data. Nevertheless, existing knowledge representation tools have shown their limits in keeping up with this revolution [7]. An ontology, defined as "a formal and explicit specification of a shared conceptualization that is characterized by high semantic expressiveness" [2], can provide a better knowledge representation that improves data exploitation. In this context, using ontologies in the human resources domain can be useful for both recruiters and candidates. An ontology can be used to build semantic search engines for job offers and candidate profiles. In addition, it can make it easier for recruiters to select relevant candidates by matching their profiles against job requirements.

The human resources domain is characterized by a huge number of concepts. For example, the ESCO ontology (European Skills, Competences, Qualifications and Occupations), created by the European Commission, provides 2,942 occupations and 13,485 skills linked to these occupations [10]. Creating a human resources ontology manually is therefore a time-consuming task, because it requires managing huge quantities of data that need to be processed and exploited automatically. Moreover, the human resources domain evolves quickly, so an automatic generation process can help manage this evolution. Many research works have thus studied the generation of HR ontologies, such as AGOHRA (Automatic Generation of an Ontology for Human Resource Applications) [6] and HOLA (HR Ontology Learned Automatically) [1]. However, these ontologies leave information gaps and omit many concepts.

In this paper, we propose a new method, based on community detection and natural language processing techniques, for the automatic generation of our HR ontology from data provided by the Algerian e-recruitment platform Emploitic.com.
The paper is structured as follows: in Sect. 2, we present related works on HR ontologies and their generation processes; in Sect. 3, we describe our ontology generation process and its evaluation.

Several research works have aimed to create a valid HR ontology. In this section, we go through the most relevant ones, presenting their structure and generation process.

ESCO [10] is an ontology resulting from a European project that brings together occupations, skills and qualifications. It was created manually in order to provide a common European repository for use in recruitment. The ontology includes 2,942 occupations, 13,485 skills/competences and 9,455 qualifications, available in 26 different languages. This knowledge is organized in a structure of three pillars:

1. The occupation pillar organizes occupation concepts. Each occupation concept is accompanied by a description, its related skills and the knowledge essential for this occupation.
2. The skill pillar organizes skill concepts. Each skill concept is accompanied by a description and its related occupations.
3. The qualification pillar organizes qualification concepts. The qualifications displayed in ESCO come from national qualification databases owned and managed by the European Member States.

The authors of [6] proposed a semi-automatic approach to generate the HR ontology AGOHRA from job offers collected on the Internet. The generation process follows these steps: first, the job offers are standardized, separating them according to the language used and removing duplicates. Second, occupations are extracted from the job title of each offer, and only occupations with a significant frequency are kept. Third, the skills linked to each occupation are extracted and represented in word n-gram format; to filter them, they are compared against a prefix tree containing about 25,000 skills from social media profiles. Finally, the ontology is generated in RDF (Resource Description Framework) format, where each universe contains a set of occupations and each occupation contains a set of skills. For validation, a manual evaluation was performed by experts in the HR domain; the detailed analysis showed good-quality results (average precision of 0.79).

More recently, the authors of [1] automatically generated a human resources ontology named HOLA from professional social network data (LinkedIn). They used data extracted from users' profiles and kept only occupations and skills in the generation process. This process follows these steps: first, data are extracted and represented in a graph where nodes represent either occupations or skills and edges represent the relationships between them; the weight of an edge represents the importance of the skill for the given occupation. Second, since the data are entered by users, the noise level is high, so this step eliminates noise and keeps only the data useful for the creation. Finally, nodes are grouped into communities using the Louvain algorithm in order to obtain more relationships in the graph. For evaluation and validation, the authors performed an automatic comparison between HOLA and ESCO, which showed that HOLA contains more nodes and relations, with 19,756 nodes and 154,259 edges.
AHROGA is an HR ontology generated automatically from the data of the Algerian e-recruitment platform Emploitic.com. Our ontology can be characterized as follows: it is a human resources ontology that aims to provide useful services making it possible to evaluate whether a candidate has the skills required for an occupation, to suggest to candidates how to highlight their skills, and to propose the most relevant profiles to recruiters. For the ontology generation, we use candidate profiles from Emploitic.com, which include occupations, skills and professional domains. The concepts of our ontology are: professional domain, occupation, community and skill.

The generation process of our HR ontology rests on three main aspects:
- Cleaning and validating the data.
- Using professional domains to split the data into different clusters (representing sub-domains) and performing community detection on each cluster separately.
- Merging the resulting clusters into one cluster.

Our generation process is inspired by the process proposed by the authors of [2]. In addition, it brings a new way to validate and evaluate the resulting ontology and its concepts, using both automatic and manual methods. The process is described in the following pipeline (see Fig. 1).

The first step aims to extract the data necessary to build the AHROGA ontology and to perform preprocessing and cleaning.

Data Extraction. Our data corpus contains 300,000 profiles extracted in JSON format from the Emploitic.com database. For each profile, we identified three concepts that we consider essential to create AHROGA: professional domain, occupation and skill. The most important reasons for this choice are:
- The availability of these attributes in each profile.
- The strong semantic link between occupations, skills and professional domains.

Data Preprocessing. The data preprocessing phase is necessary to clean the data and eliminate all types of noise. Preprocessing concerns only occupations and skills, because they are filled in manually by users, which increases the probability of erroneous data, unlike professional domains, which are proposed by the system. In order to standardize and validate occupations, we performed a preprocessing stage that goes through the following steps (a code sketch of this normalization is given below):
- Unifying the occupation format in lower case.
- Detecting the language of occupations to keep only those written in French, since 80% of job seekers use it.
- Removing special characters, accents, punctuation, numbers, stop words, verbs, and proper names such as first and last names and Algerian localities.
- Eliminating non-significant terms heavily used by job seekers, such as: CV, job search, profile, experience, curriculum vitae, internship.
- Grouping occupations that share the same stem.

After performing these tasks, many insignificant occupations, representing 3% of the data corpus, were eliminated. Through this step we identified 290,458 occupations.
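To illustrate the normalization steps above, here is a minimal sketch in Python with NLTK. It is not the authors' implementation: the noise-term list is a hypothetical stand-in for the non-significant terms cited in the paper, and language detection and verb/proper-name filtering are omitted.

```python
import re
import unicodedata

from nltk.corpus import stopwords          # requires: nltk.download("stopwords")
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()
french_stopwords = set(stopwords.words("french"))
# Hypothetical list of non-significant terms; the paper cites "CV",
# "job search", "profile", "experience", "internship", etc.
NOISE_TERMS = {"cv", "recherche", "emploi", "profil", "experience", "stage"}

def normalize(text: str) -> str:
    """Lower-case, strip accents/punctuation/digits, drop stop words and
    noise terms, then reduce each remaining token to its stem."""
    text = text.lower()
    # Strip accents by decomposing characters and dropping non-ASCII marks.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    # Remove special characters, punctuation and numbers.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [t for t in text.split()
              if t not in french_stopwords and t not in NOISE_TERMS]
    # Grouping variants by stem lets occupations with the same root collapse.
    return " ".join(stemmer.stem(t) for t in tokens)

print(normalize("Développeur Web - 5 ans d'expérience"))
```

The same routine can be applied to skills, since the paper states that skill preprocessing reuses the tasks performed on occupations.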
Preprocessing skills revealed two types of problems:
- Syntax problems, such as misspellings.
- Validity problems of the skill itself; for example, "internet" is not a valid skill.

Among the syntax problems we encountered are misspellings, multilingual skills, skills expressed as very long paragraphs (more than six words), and words that cannot be considered skills, such as "pc", "internet" or "computer". To fix these problems, we performed the same tasks used in the occupation preprocessing.

The purpose of the analysis stage is to extract the existing relationships between the data concepts. It goes through the three following steps.

Similar Jobs Detection. Similar jobs detection creates new relationships between occupations based on their syntactic similarity, in order to improve the final ontological structure. We link two occupations using a distance calculated with the Jaro-Winkler metric [4], a string metric measuring an edit distance between two sequences. It is a variant of the Jaro metric proposed by William E. Winkler [5], based on the number and order of characters common to the two strings s and t, and it uses the length P of their longest common prefix. Letting P' = min(P, 4), we define:

JaroWinkler(s, t) = Jaro(s, t) + (P'/10) * (1 - Jaro(s, t))

For example, through this step we create syntactic links between "Architect" and the occupations "Urban planning architect", "Project manager architect", "Designer architect" and "Architect designer".

Initial Graph Generation. Before running the community detection algorithm, we first represent the cleaned data in a weighted graph structured as follows:
- Nodes represent domains, occupations and skills.
- Edges represent relations between two different nodes. There are three types of relations: Domain-Occupation, Occupation-Occupation and Occupation-Skill.
- Edge weights are numerical values representing the importance of an occupation in its domain or of a skill in its occupation, calculated from occurrence frequencies.

We initially had 27 large, separate domain graphs including many occupations. In order to create small communities with strong semantic links, we split the occupations into smaller communities by performing community detection on each of the 27 graphs separately. We tested two community detection algorithms: Louvain [3] and the Label Propagation Algorithm (LPA) [11]. A performance comparison showed that the Louvain algorithm gives more accurate results. The performance metric is calculated using a confusion matrix whose element m(i, j) gives the number of nodes of true community i assigned to estimated community j; a node contributes 1 if its estimated community is correct and 0 otherwise [8], so that:

performance = Σ_i m(i, i) / Σ_{i,j} m(i, j)

Table 1 shows an example of the performance obtained by applying the Louvain and LPA algorithms on three different domains. A second comparison, based on the quantity and quality of the generated communities, is shown in Table 2. Based on these results, we chose the Louvain algorithm for community detection because:
1. It generates partitions (sets of communities) with better performance than the LPA algorithm.
2. It generates fewer communities, with a better grouping of occupations.
3. It takes the importance of the skills of each occupation as a grouping parameter, which helps to group together occupations sharing a significant number of common skills.

Minimal code sketches of the similarity computation and of this community detection step are given below.
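Returning to the similar-jobs step, the Jaro-Winkler similarity defined above can be computed directly. The following is a minimal self-contained implementation of the standard metric (libraries such as jellyfish provide the same function ready-made); it is a sketch, not the authors' code.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity: based on the number and order of common characters."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(len(s), len(t)) // 2 - 1
    s_match = [False] * len(s)
    t_match = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_match[j] and t[j] == ch:
                s_match[i] = t_match[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: half the matched characters appearing in a different order.
    k = transpositions = 0
    for i in range(len(s)):
        if s_match[i]:
            while not t_match[k]:
                k += 1
            if s[i] != t[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    m = matches
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3


def jaro_winkler(s: str, t: str) -> float:
    """Jaro similarity boosted by the common prefix P' = min(P, 4)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + (prefix / 10) * (1 - j)


print(jaro_winkler("architecte", "architecte urbaniste"))  # ≈ 0.90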
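For the community detection step itself, a minimal sketch using networkx's built-in Louvain implementation on a toy weighted graph for one domain; the node names and edge weights are illustrative stand-ins for the occurrence frequencies computed from the profiles.

```python
import networkx as nx

# Toy graph for one professional domain.
G = nx.Graph()
G.add_edge("IT domain", "web developer", weight=300)           # Domain-Occupation
G.add_edge("IT domain", "network engineer", weight=180)        # Domain-Occupation
G.add_edge("web developer", "python", weight=120)              # Occupation-Skill
G.add_edge("web developer", "javascript", weight=95)
G.add_edge("front end developer", "javascript", weight=60)
G.add_edge("network engineer", "cisco", weight=80)
G.add_edge("web developer", "front end developer", weight=1)   # similar jobs link

# Louvain maximizes modularity on the weighted graph, so occupations sharing
# many heavily weighted skill edges end up in the same community.
communities = nx.community.louvain_communities(G, weight="weight", seed=42)
for i, nodes in enumerate(communities):
    print(f"community {i}: {sorted(nodes)}")
```

Running this once per domain graph, as the paper describes, yields the per-domain communities that are then merged into the final structure.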
To enable computer programs to exploit the ontology and provide useful features to users, we transformed the graph obtained after community detection into the Web Ontology Language (OWL) through the creation of the ontological components: axioms, classes, instances and relations (a sketch of this transformation is given at the end of this section).

Validation is performed at the end of each of the previous steps. It verifies that the relationships and concepts are correct and checks that the generated ontology is consistent. To validate occupations, we kept only those with an occurrence frequency above a threshold determined from the percentage of valid occupations in each domain. Using a threshold of 1% allowed us to validate 43,110 profiles, i.e. 15% of the total number of profiles.

Confirmed Skills. The purpose of this validation is to keep only correct skills. The approach we followed is based on text classification. We first built a dataset by retrieving 5,000 skills from profiles and using the ESCO ontology API to get their descriptions, falling back on Wikipedia for skills that had no description in ESCO; we then added a third attribute indicating validity. The dataset used for skill classification is presented in the table below. In a second step, we used a naïve Bayes classifier [9], which is recommended for text classification, to classify skills into two classes (valid, not valid). Applying this approach to our dataset, we obtained a model accuracy of 95% and validated 77.10% of all skills.

The purpose of the expert evaluation is to validate with experts the ontology concepts of three professional domains: (1) IT, information systems, internet; (2) construction site, construction jobs, architecture; and (3) telecommunications, networks. The evaluation was conducted with five experts in the human resources domain, based on a questionnaire with three sections, each covering one professional domain. The questions for each domain concern the generated communities and the relationships between occupations and skills. After analyzing the experts' responses, we calculated for each occupation the average number of valid skills and the percentage of valid domain skills; the results are presented in the table below. We then calculated for each domain the percentage of community validation; the results are presented in the figure below.

To evaluate the quality of our results, we compared our ontology with the ESCO ontology, based on the concepts common to both. We found that 68.20% of all generated skills and 73.61% of all generated occupations exist in ESCO, which confirms that these instances are valid. To ensure that our results are relevant, we also applied the approach used to generate the HOLA ontology to our dataset and compared it with ours (see Table 5). This comparison shows that our approach performs better than HOLA's. Using professional domains as a concept in AHROGA allows a better grouping of occupations into communities: with the HOLA approach, two occupations that are not in the same domain but share the same skills (depending on users) can be assigned to the same community.
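The skill-validation step above can be reproduced in outline with scikit-learn. This is a minimal sketch, not the authors' implementation; the toy descriptions below stand in for the 5,000 annotated ESCO/Wikipedia descriptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins: one description per candidate skill,
# labeled 1 (valid skill) or 0 (not a valid skill).
descriptions = [
    "Python is a general-purpose programming language.",
    "Project management is the practice of leading a team to reach goals.",
    "The Internet is a global system of interconnected computer networks.",
    "A curriculum vitae is a document describing a person's career.",
]
labels = [1, 1, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    descriptions, labels, test_size=0.5, random_state=0)

# Bag-of-words features plus multinomial naive Bayes,
# a common baseline for text classification.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```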
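Likewise, the graph-to-OWL transformation can be sketched with owlready2. The class and property names mirror the four concepts of the ontology (professional domain, community, occupation, skill), but the IRI, property names and instances are illustrative assumptions, not the published schema.

```python
from owlready2 import get_ontology, Thing, ObjectProperty

# Illustrative IRI; the paper does not give the real ontology IRI.
onto = get_ontology("http://example.org/ahroga.owl")

with onto:
    class ProfessionalDomain(Thing): pass
    class Community(Thing): pass
    class Occupation(Thing): pass
    class Skill(Thing): pass

    # Hierarchical relations between the four concept types (names are ours).
    class hasCommunity(ObjectProperty):
        domain = [ProfessionalDomain]; range = [Community]
    class hasOccupation(ObjectProperty):
        domain = [Community]; range = [Occupation]
    class requiresSkill(ObjectProperty):
        domain = [Occupation]; range = [Skill]

# Toy instances mirroring one detected community.
it = ProfessionalDomain("it_information_systems_internet")
web = Community("web_development")
dev = Occupation("web_developer")
py = Skill("python")
it.hasCommunity = [web]
web.hasOccupation = [dev]
dev.requiresSkill = [py]

onto.save(file="ahroga.owl", format="rdfxml")
```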
In this paper, we have proposed an automatic ontology generation method using data from the Algerian e-recruitment platform Emploitic.com. From the 300,000 profiles collected, we retrieved common data to build a structured hierarchy of concepts representing each professional domain. Since our dataset contained a lot of noisy data, we started with a preprocessing step to clean it and keep only valid data. We then represented the cleaned data in a weighted graph, on which we applied community detection with the Louvain algorithm to create new relationships between concepts based on shared skills. Finally, this hierarchical structure was represented in OWL format so that it can serve practical use cases. For its evaluation and validation, we used both manual methods, such as having human resources experts validate the ontology concepts, and automatic methods, such as comparing our ontology concepts with ESCO. AHROGA contains 13,606 instances and 113,612 relationships, including 27 domains, 716 communities, 1,437 occupations and 11,426 skills. The approach we propose is generic and can be applied to any dataset that contains occupations, skills and professional domains.

As future work, we plan several improvements to our generation pipeline. First, to enrich the ontology, we plan to use a combination of job offers and profile data rather than profile data alone. Second, we believe the ontology can be further enriched by adding multilingual instances, since most of the data is provided in three languages (French, Arabic and English), and by considering more concepts in its creation, such as professional experience and education qualifications. Finally, useful services based on the ontology need to be implemented, for example automatically matching job offers and candidate profiles.

References

1. Automatically learning a human-resource ontology from professional social-network data.
2. Automatic ontology generation: state of the art.
3. Fast unfolding of communities in large networks.
4. A comparison of string distance metrics for name-matching tasks.
5. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa.
6. AGOHRA: generation of an ontology in the field of human resources.
7. A two-layered integration approach for product information in B2B e-commerce.
8. A comparison of community detection algorithms on artificial networks.
10. ESCO: boosting job matching in Europe with semantic interoperability.
11. SLPA: uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process.