key: cord-0470781-dwxuwebr authors: Tavakoli, Mohammadreza; Hakimov, Sherzod; Ewerth, Ralph; Kismihók, Gábor title: A Recommender System For Open Educational Videos Based On Skill Requirements date: 2020-05-21 journal: nan DOI: nan sha: 2e6000c1c817dd97859071c3de51beb4ef9d62f1 doc_id: 470781 cord_uid: dwxuwebr

In this paper, we propose a novel method to help learners find relevant open educational videos to master skills demanded on the labour market. We have built a prototype, which 1) applies text classification and text mining methods to job vacancy announcements to match jobs with their required skills; 2) predicts the quality of videos; and 3) creates an open educational video recommender system that suggests personalized learning content to learners. For the first evaluation of this prototype we focused on data science related jobs. The prototype was evaluated through in-depth, semi-structured interviews. 15 subject matter experts provided feedback on how our recommender prototype performs in terms of its objectives, logic, and contribution to learning. More than 250 videos were recommended, and 82.8% of these recommendations were rated as useful by the interviewees. Moreover, the interviews revealed that our personalized video recommender system has the potential to improve the learning experience.

In recent decades, we have been facing an indisputable change in the quantity and quality of skills demand and supply [1]. This dramatic change comes with a number of educational challenges for all labour market stakeholders (e.g. educational decision makers, employers, and employees). The gap between skill demands and educational programs [2], [3], for instance, is evident. Moreover, the existence of transparent information on skill requirements should be a deep concern of our current societies and citizens alike.
This information is of significant importance to those who want to secure long-term employability or achieve promotions in their current workplace [1], [4], or, in the light of the current COVID-19 pandemic, to those who need to re-skill themselves to adapt to post-COVID-19 labour markets (and regain employment). To supply stakeholders with labour market information, a number of occupational taxonomies (e.g. ESCO, O*NET) exist that provide information to job-holders and job-seekers about occupations. However, most of these taxonomies are manually maintained. Maintaining them is therefore time-consuming and expensive, and they are susceptible to becoming outdated [5]. At the same time, the literature shows that open education is gaining traction in the area of personal skills development [6]. Open Educational Resources (OERs), especially videos, have become popular content sources for learners [7]. They are supplied by a large number of experts around the world, with different levels of expertise, in a wide range of professional contexts (e.g. discipline, location or language). However, the uptake of OERs is limited in most user groups (e.g. educators or learners) [7]-[11], as OER repositories still struggle to provide personalised services for their users. As an illustration, a learner who wants to find appropriate learning content must search manually through several OER repositories with different interfaces. Furthermore, the few OER recommendation algorithms available are limited to approaches such as building (or reusing existing) ontologies [12], [13], analysing user behavior in social networks [14], or applying text mining techniques to identify similar OER documents [15]. Most of the research on learning content recommendation focuses on textual resources, as does the field of search as learning [16].
However, insights from educational psychology and multimedia learning [17] suggest that images and videos might be preferable when addressing specific learning needs (e.g., procedural learning tasks [18]). For many topics, a large number of educational and lecture videos are available on the Web, ranging from single video tutorials to entire Massive Open Online Courses (MOOCs). However, when exploring the Web and video platforms, it is difficult to find (high-quality) videos that are tailored to the user's (required) skills and knowledge levels. In this paper, therefore, we address the above-mentioned challenges and build a software prototype that provides personalized OER (specifically video) recommendations to help learners master the skills needed for their current or future jobs. Thus, the main objectives of this paper are:
• Empower learners to construct their own learning trajectories on the basis of labour market information and OERs
• Create an algorithm to decompose jobs into unique skills
• Build a model to predict the quality of video-based OERs
• Develop and evaluate a personalized open educational video recommender system prototype, relying on labour market information and videos' properties

This paper is organized as follows: Section II discusses the state of the art of OER recommender systems. Subsequently, in Sections III and IV, we explain the processes of data collection and the construction of our recommender system. Section V shares the validation results of our prototype. Finally, we conclude the paper and define our future steps in Section VI.

Based on the available literature [11], there is enormous development potential in building OER-based recommender systems, due to the vast amount of open educational content available globally and to the limited functionality of current systems. Moreover, there is no indication that typical lifelong learning factors (skills, jobs) play any role in current work.
We grouped the relevant available studies into the following three categories:
1) Semantic and Ontology Based Methods: Some studies make use of ontologies, linked data, and open source RDF data to leverage semantic content and define recommendation algorithms [9]-[11], [13]. For instance, [12] builds an ontology for learners, learning objects, and their environments to establish similarity measures between learning objects, update their properties, and provide diverse and adaptive recommendations.
2) Social Network Analysis: [14] builds graphs of OERs and learners based on social networks. The authors find tweets with valid educational URLs and build an OER graph based on the co-occurrences of hashtags. They also use mentions and retweets to build a learner graph. Finally, they identify important and influential nodes, and use density and centrality measures from the graph to provide recommendations.
3) Machine Learning Based Methods: [15] finds similar OERs using document clustering and Latent Semantic Analysis (LSA), and makes recommendations based on these similarities. Moreover, [19] considers video content and sequential topic relationships to provide courses across multiple platforms, while [20] uses a topic modeling algorithm to recommend videos based on user context and interests.

In this section we describe the data we collected to build an open educational platform that recommends educational videos. First, we describe the procedure for collecting skills, followed by an explanation of the retrieval of educational videos.

The first step in building our recommender was the identification of skills that are correlated with particular jobs. We used online job vacancies and built a model to extract skills for jobs dynamically, avoiding any dependency on existing taxonomies (which are susceptible to becoming outdated). To build such a model, we used a crawled sample dataset from Monster.com containing 22,000 job vacancies¹.
After an exploratory analysis, we found that a large number of vacancies do not contain a "Required Skills" section. Therefore, we selected the vacancies that explicitly contain a "Required Skills" section and used them to build a classification model that detects sentences defining skill requirements. To build the model, we first ran the following pre-processing procedure:
• Removal of unimportant characters, punctuation, and stop words
• Sentence tokenization, lowercase conversion and lemmatisation

¹ https://www.kaggle.com/PromptCloudHQ/us-jobs-on-monstercom

Altogether we obtained more than 60,000 sentences: around 15,000 sentences that appeared in a "Required Skills" section (labelled 1) and around 45,000 sentences from other sections of the vacancies (labelled 0). We trained a binary classifier using the FastText library in Python [21]. The classifier uses word n-grams and learns embeddings during training. The dataset was split into 80% for training and 20% for evaluation. On the test set, the model detects skill-related sentences with an F1 score (the harmonic mean of precision and recall) of 88.7%. We then used the trained classifier to detect sentences that contain skill-related content. After identifying those sentences, we applied TF-IDF weighting (with a minimum document frequency of 3 as the cut-off point) to detect skill terms in them. All n-grams from the classified sentences were scored, and the highest-ranking n-grams were extracted as skill terms. We ran our skill extraction method on 300 randomly crawled data science job vacancies (the context of our first prototype), which were published on Monster.com in December 2019, and obtained a list of skills that learners should focus on to build a career in data science.
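The TF-IDF step described above can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the function names `ngrams` and `top_skill_terms`, and the toy sentences are all assumptions made for illustration. Only the minimum document frequency of 3 and the idea of ranking n-grams by score come from the text.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Yield all word n-grams of length 1..n_max from a list of tokens."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def top_skill_terms(sentences, min_df=3, n_max=2, top_k=5):
    """Rank n-grams by summed TF-IDF over sentences already classified as skill-related."""
    docs = [Counter(ngrams(s.lower().split(), n_max)) for s in sentences]
    # Document frequency: in how many sentences each n-gram occurs.
    df = Counter()
    for doc in docs:
        df.update(doc.keys())
    n_docs = len(docs)
    scores = Counter()
    for doc in docs:
        total = sum(doc.values())
        for term, count in doc.items():
            if df[term] < min_df:
                continue  # cut-off: ignore n-grams seen in fewer than min_df sentences
            idf = math.log(n_docs / df[term]) + 1.0  # assumed smoothing, keeps weights positive
            scores[term] += (count / total) * idf
    return [term for term, _ in scores.most_common(top_k)]


# Hypothetical sentences standing in for the classifier's positive predictions.
skill_sentences = [
    "experience with python programming",
    "strong python programming skills",
    "python programming and sql",
    "knowledge of sql databases",
]
terms = top_skill_terms(skill_sentences, min_df=3, top_k=3)
```

In this toy corpus only "python", "programming" and "python programming" appear in at least three sentences, so they are the only candidates that survive the cut-off, while "sql" (two sentences) is dropped.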
In total, we extracted 16 important and unique data science skills. We provide a sample skill with its metadata below.
Skill: Python programming
Keywords: python, python programming
Description: Python is an interpreted, high-level, general-purpose programming language.

To find skill descriptions, we used the Wikipedia Python API² and crawled the Wikipedia content related to the skills³.

We collected educational videos from two main sources: YouTube and the TIB AV-Portal⁴. YouTube is the most popular platform for hosting any type of video content. The TIB AV-Portal⁵ is a dedicated portal for scientific videos from the realms of architecture, chemistry, computer science, engineering and technology, mathematics, etc. Its videos include, among others, computer visualisations, learning material, simulations, experiments, interviews, video abstracts, and recordings of lectures and conferences. We retrieved videos by performing a keyword search on each portal. As explained above, each skill has a set of keywords. All keywords were used to search for and retrieve relevant videos. Videos from both sources may come with transcriptions of the audio: YouTube includes them as subtitles, while the TIB AV-Portal shows the body of the transcribed text. Where available, we extracted these transcriptions for the retrieved videos. Missing transcriptions were obtained by applying Google Cloud Speech⁶ to audio files extracted from the videos.

Videos contain different types of information depending on the source. We collected or calculated the following metadata for YouTube and TIB AV-Portal videos:
YouTube: title, target skill, URL, length, description, transcription, view count, rating, likes, dislikes, relevancy score (assigned according to the rank in the search results), and textual similarity (calculated from the similarity between the skill description and the video transcription; we explain this calculation in Section IV)
TIB AV-Portal: title, target skill, URI, description, transcription

For our first prototype, we retrieved 550 videos from YouTube and 57 videos from the TIB AV-Portal that mentioned the 16 skills extracted previously. These videos were presented to six experts in data science (with more than six years of industrial and more than three years of teaching experience in data science related positions), who annotated whether or not each video fits its target skill. Each video was reviewed by at least three annotators, and annotators spent at least two minutes on labelling each video. The final label was assigned by majority vote. In total, the annotators provided labels for 550 videos, of which 213 fit a skill (positive label) and 337 do not (negative label). The complete list of videos and labels is available to the research community⁷.

The following section provides details about how the recommender system was built and how learners interact with our learning dashboard.

We trained a machine learning model to predict whether or not a given video fits a skill. As mentioned earlier, 550 videos covering 16 skills were annotated, with 213 labelled as fitting their skill and 337 as not fitting. A Random Forest model was trained on the annotated data to output a binary decision: match or no match. The model uses the following video features:
• Length: the length of a video in seconds
• Rating: the user rating that a video received on its platform
• View count: the number of views of a video
• Relevancy score: the score assigned during the search process based on the video platform's results ranking, computed as 1 / (ranking position)
• Level: one of the pre-defined levels beginner, intermediate or advanced. The levels are set during the collection process by concatenating the search term with "beginner", "intermediate" or "advanced" to search for videos at different levels.
• Text similarity: the similarity between the skill description and the video transcription. First, each word in a text is encoded using pre-trained 300-dimensional GloVe vectors [22]. Second, we average the word vectors of a text to obtain a single vector that represents the whole text. We apply this method to obtain a vector representation for both the video transcription and the skill description. Finally, the text similarity is the cosine similarity between the two resulting vectors.

⁶ https://pypi.org/project/google-cloud-speech/
⁷ Annotated dataset including videos' properties and labels: https://github.com/rezatavakoli/ICALT2020_recommender

70% of the data was used to train the model and the remaining 30% was used for evaluation. The classifier achieves an F1 score of 86.3% in predicting whether a video matches a skill. Additionally, we analysed the importance of each feature for the classification task. The trained model assigns an importance score to each feature on the basis of the provided training data, and these scores determine how much weight each feature carries in the decision. The weights are calculated by pruning out trees below a particular node (as feature selection). The weights for the selected features are as follows: length: 0.61, rating: 0.10, view count: 0.10, relevancy score: 0.08, level: 0.02, text similarity: 0.09.
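The text-similarity feature listed above (average the word vectors of each text, then take the cosine similarity) can be sketched as follows. This is an illustrative sketch only: the 3-dimensional `TOY_VECTORS` are made-up stand-ins for the pre-trained 300-dimensional GloVe vectors, and the whitespace tokenizer and the choice to skip out-of-vocabulary words are assumptions not stated in the paper.

```python
import math

# Hypothetical 3-dimensional vectors standing in for 300-dimensional GloVe embeddings.
TOY_VECTORS = {
    "python": [0.9, 0.1, 0.0],
    "programming": [0.8, 0.2, 0.1],
    "language": [0.6, 0.3, 0.2],
    "cooking": [0.0, 0.1, 0.9],
    "recipe": [0.1, 0.0, 0.8],
}


def mean_vector(text, vectors):
    """Average the vectors of all known words in a text (unknown words are skipped)."""
    rows = [vectors[w] for w in text.lower().split() if w in vectors]
    if not rows:
        return None
    dim = len(rows[0])
    return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def text_similarity(text_a, text_b, vectors=TOY_VECTORS):
    """Cosine similarity between the averaged word vectors of two texts."""
    va, vb = mean_vector(text_a, vectors), mean_vector(text_b, vectors)
    if va is None or vb is None:
        return 0.0
    return cosine(va, vb)


skill_description = "python programming language"
on_topic = text_similarity(skill_description, "python programming")
off_topic = text_similarity(skill_description, "cooking recipe")
```

With these toy vectors, a transcript about Python programming scores markedly higher against the skill description than one about cooking, which is the behaviour the feature is meant to capture.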
The model assigns the highest score to the Length feature, followed by the Rating feature. In the following section we describe how the trained binary classifier is used within a recommender system.

Our proposed recommender system suggests new content to learners based on different parameters. The goal is to optimize the weights of these parameters so as to increase learner satisfaction (based on their ratings). The recommender system uses the following parameters:
• Popularity: we calculate the difference between the number of likes and the number of dislikes for each video, group videos by their target skill, and normalize the values within each group using min-max normalization
• Fit probability: the probability of a video fitting a skill (explained in Subsection IV-A)
• Length: the length of the video; we group videos by their target skill and normalize the length within each group using min-max normalization
• Text similarity: the textual similarity between a video transcription and a skill description

We build a 4-dimensional vector X in which each item is the value of one of the parameters mentioned above. For each user, we define a preference vector P (the user's preference matrix) that contains a weight for each parameter in X. The goal is to optimize the weights in P for each learner based on their previous ratings. In this way, we capture learners' preferences to provide personalised recommendations. The loss function is defined over the videos a user has rated, where X_i is the 4-dimensional vector of a recommended video i and Y_i is the user's satisfaction rating for that video. We use gradient descent to find the best P for each user, and the initial weights in P are taken from similar users (e.g. users with the same job, location, etc.). Finally, the system generates a recommendation based on the optimised weights in the preference matrix P, given the parameters of each video for a target skill.
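The optimisation of P described above can be sketched with plain batch gradient descent. The exact loss used by the authors is not recoverable from the text, so this sketch assumes a mean squared error between the weighted sum P · X_i and the rating Y_i. The learning rate, number of epochs, function names and all data values are likewise hypothetical.

```python
def min_max(values):
    """Min-max normalise a list of numbers to [0, 1]; a constant list maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def loss(P, X, Y):
    """ASSUMED loss: mean over rated videos of (P . X_i - Y_i)^2."""
    return sum(
        (sum(p * xi for p, xi in zip(P, x)) - y) ** 2 for x, y in zip(X, Y)
    ) / len(X)


def fit_preferences(X, Y, P0, lr=0.1, epochs=2000):
    """Fit the preference weights P by batch gradient descent on the assumed loss."""
    P = list(P0)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(P)
        for x, y in zip(X, Y):
            err = sum(p * xi for p, xi in zip(P, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n  # derivative of the squared error w.r.t. P[j]
        P = [p - lr * g for p, g in zip(P, grad)]
    return P


# Hypothetical rated videos: [popularity, fit probability, length, text similarity], all in [0, 1].
X = [
    [0.9, 0.8, 0.2, 0.7],
    [0.1, 0.4, 0.9, 0.3],
    [0.5, 0.9, 0.1, 0.8],
    [0.3, 0.2, 0.6, 0.2],
]
Y = [0.9, 0.2, 1.0, 0.1]  # hypothetical satisfaction ratings scaled to [0, 1]
P_init = [0.25, 0.25, 0.25, 0.25]  # e.g. weights copied from a similar user
P = fit_preferences(X, Y, P_init)
```

Starting from the weights of a similar user rather than from zeros or random values, as the text suggests, gives new users sensible recommendations before they have rated many videos. The helper `min_max` shows the per-skill normalisation applied to popularity and length before they enter X.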
The videos are ranked by computing the cosine similarity between their vector X and the preference matrix P of a user. The video with the highest score is recommended to the user.

We have built the prototype of our recommender system in the form of a dashboard⁸. Users interact with the dashboard to search for or add skills they want to master, set their level of expertise for each skill, and add contextual information about their occupation, geographical location and educational level. Subsequently, on the learning tab, the dashboard shows the list of their target skills, the learner's current expertise level in each of them, and links to the recommended open educational videos. Learners can watch the recommended content or ask for a new recommendation if they are not satisfied. After watching a recommended video, the learner rates her satisfaction with the recommendation. The system then changes the learner's expertise level, updates her preference matrix P, and provides a new recommendation based on the new expertise level and preferences. The process continues until the learner reaches the highest mastery level for a particular target skill. Figure 1 depicts the building blocks of our proposed approach.

⁸ Demo of our prototype is available: https://github.com/rezatavakoli/ICALT2020_recommender

To validate our proposed approach, we conducted semi-structured interviews with subject matter experts in the job area of data science. We randomly selected 300 data science job vacancies published on Monster.com in December 2019. Afterwards, we collected the required skills for data scientists as described previously in Section III-A.
To validate the proposed recommender system, we invited five university instructors with at least 10 years of teaching and 13 years of industrial experience, and 10 PhD students with at least one year of teaching and three years of industrial experience, to a semi-structured interview⁹. Participants gave feedback on our prototype with regard to its general objectives, logic, and potential contribution to individual learning. Each participant completed the following protocol:
1) Learning about the research problems and the proposed approach (15 minutes)
2) Working with our prototype dashboard (15 minutes)
3) Going through a semi-structured interview with the help of a qualitative questionnaire¹⁰ (30 minutes)

Participants received more than 250 video recommendations while working with our prototype dashboard (each participant received 15-17 recommendations). 82.8% of these recommendations were marked as useful and relevant to the participants' skill levels and properties. 2.8% of the recommended videos were judged irrelevant, and in 14.4% of the cases participants decided to change the recommended video. The outcomes of the interviews are summarised in the following three sections.

Interviewees confirmed that there is potential value in recommending open educational videos based on labour market information. Both instructors and PhD students said that, although there are many open educational videos on the Internet, finding the content that best suits a learner's preferences is a complicated and time-consuming task. For instance, Instructor 2, Instructor 5, and Student 10 stated that the personalisation of open educational content recommendations is one of the most important features of our proposed approach. Moreover, Instructor 3 suggested that we should identify the level of expertise that the learner needs to achieve, in order to prevent over-qualification and wasted time.
Participants emphasized that our recommendation model can help learners find the most relevant videos covering particular skill areas. Student 6 suggested that the system should take into account the job area of a skill, which may result in fine-grained recommendations that target a specific skill for a specific job. For instance, content for the skill data visualization might differ between job areas such as E-commerce and Bioinformatics. Regarding the recommendation logic, participants thought that suggesting videos based on learners' previous ratings is a novel idea. Instructor 2 said that the system should use more user properties (such as language preferences, province of residency, etc.), as well as the properties of similar users, in order to generate better recommendations.

Participants confirmed that engaging with learners based on their preferences may result in better retention rates. Instructor 4 and Student 1 appreciated that setting specific goals and recommending videos accordingly helps learners focus on their skill targets and could potentially improve their learning performance. Also, Student 4 and Student 7 recommended building a list of topics associated with each skill and using these topics to improve skill assessment, both at the beginning (setting initial expertise levels) and during the learning process (e.g. evaluating knowledge gains after watching videos).

In this paper, we presented a recommender system prototype that capitalizes on open educational videos, and built a personalized learning environment in which users can select skills and master them based on labour market information. We showed that our skill extraction approach can detect skills in vacancies with an F1 score of 88.7%. The recommender was validated through semi-structured interviews with subject matter experts. The initial results showed that participants were satisfied with 82.8% of the generated recommendations.
The validation also revealed that our proposed recommender system has high potential to support learners in constructing individual learning scenarios and in directing their learning towards their individual skill targets. As future work, we plan the following improvements: 1) increase user satisfaction by adding more contextual features to the recommendations, such as user traits or a more fine-grained topic classification, for better personalisation; 2) incorporate more video and OER repositories and collect or predict more properties for them, such as level and completeness; 3) improve self-assessment and learning pathway recommendation by generating target topics for skills; and finally 4) use experimental designs to further validate the system in a number of use cases with a large number of learners.

[1] Applying machine learning tools on web vacancies for labour market and skill analysis
[2] Skills and vacancy analysis with data mining techniques
[3] Defining the expectation gap: a comparison of industry needs and existing game development curriculum
[4] An ontology-based approach for the semantic representation of job knowledge
[5] Developing Skills in a Changing World of Work: Concepts, Measurement and Data Applied in Regional and Local Labour Market Monitoring Across Europe
[6] Global trends in OER: What is the future?
[7] A novel approach towards skill-based search and services of open educational resources
[8] A heuristic approach for new-item cold start problem in recommendation of micro open education resources
[9] A semantically enriched context-aware OER recommendation strategy and its application to a computer science OER repository
[10] A user profile definition in context of recommendation of open educational resources. An approach based on linked open vocabularies
[11] Recommendation of open educational resources. An approach based on linked open data, in Global Engineering Education Conference
[12] An e-learning recommendation approach based on the self-organization of learning resource
[13] Towards massive data and sparse data in adaptive micro open educational resource recommendation: a study on semantic knowledge base construction and cold start problem
[14] Recommendation of OERs shared in social media based-on social networks analysis approach
[15] OER recommender: A recommendation system for open educational resources and the national science digital library
[16] Current challenges for studying search as learning processes
[17] The Cambridge handbook of multimedia learning
[18] Examining learning from text and pictures for different task types: Does the multimedia effect differ for conceptual, causal, and procedural tasks?
[19] MOOCex: Exploring educational video via recommendation
[20] VideoTopic: Content-based video recommendation using a topic model
[21] Bag of tricks for efficient text classification
[22] GloVe: Global vectors for word representation