authors: Brahim, Ghassen Ben (Department of Computer Science, College of Computer Engineering and Science, Prince Mohammad Bin Fahd University, Al-Khobar 31952, Saudi Arabia; gbrahim@pmu.edu.sa)
title: Predicting Student Performance from Online Engagement Activities Using Novel Statistical Features
date: 2022-01-17
journal: Arab J Sci Eng
DOI: 10.1007/s13369-021-06548-w

Predicting students' performance during their years of academic study has been investigated extensively. It offers important insights that can help and guide institutions to make timely decisions and changes leading to better student outcome achievements. In the post-COVID-19 pandemic era, the adoption of e-learning has gained momentum and has increased the availability of online learning data. This has encouraged researchers to develop machine learning (ML)-based models to predict students' performance during online classes. The study presented in this paper focuses on predicting student performance during a series of online interactive sessions by considering a dataset collected using the digital electronics education and design suite (DEEDS). The dataset tracks the interaction of students during online lab work in terms of text editing, number of keystrokes, time spent in each activity, etc., along with the exam score achieved per session. Our proposed prediction model consists of extracting a total of 86 novel statistical features, which were semantically categorized in three broad categories based on different criteria: (1) activity type, (2) timing statistics, and (3) peripheral activity count. This set of features was further reduced during the feature selection phase, and only influential features were retained for training purposes. Our proposed ML model aims to predict whether a student's performance will be low or high. Five popular classifiers were used in our study, namely: random forest (RF), support vector machine, Naïve Bayes, logistic regression, and multilayer perceptron. We evaluated our model under three different scenarios: (1) 80:20 random data split for training and testing, (2) fivefold cross-validation, and (3) training the model on all sessions but one, which is used for testing. Results showed that our model achieved the best classification accuracy performance of 97.4% with the RF classifier. We demonstrated that, under a similar experimental setup, our model outperformed other existing studies.

The prediction of student performance has been the focus of many educational institutions due to the insights it provides in terms of dropouts and learning outcome attainment. Such insights may support institutions in making learning adjustments or adopting new learning management strategies to create novel learning opportunities for the students. With the application of technology in the learning process through the wide deployment of learning management systems (LMS) and course learning platforms, the amount of collected data is becoming large enough to warrant sophisticated and intelligent techniques to manage and analyze it. Student performance prediction is considered an important area of educational data mining (EDM) and Learning Analytics (LA) [1]. It is becoming more challenging in the presence of ever-increasing data volumes.
EDM and LA are considered to be closely related disciplines and they aim to support the process of analyzing related educational data. Together, they provide tools and techniques to collect, process, and analyze educational data. In most cases, the analysis is based on data mining, machine learning, and statistical analysis techniques to explore hitherto unknown patterns in different types of historical data. Researchers have focused on different data features to predict student performance: these include previous exam scores, background, demographic data, and activity in and outside the classroom, to name but a few [2] . Though most of the performance prediction work tends to focus on previous exam scores, very little work seems to target the analysis of data wherein student interactions with online systems is being logged and analyzed [3] . Using past scores has two disadvantages: one is that it predicts performance in the long term such as predicting students' performance in his junior or sophomore year courses based on exam grades and course grades from the freshman or sophomore year, second is that even if it takes grades of major1 and major2 exams to predict the performance in the course, then it is too far in the semester for any corrective actions to be taken to assist the student to succeed. Therefore, in the work introduced herein, we propose focusing on performance prediction problems while exploring data describing students' online interactions during the online exam sessions. We experiment with a dataset that has been collected using digital electronics education and design suite (DEEDS)-a simulation software used for e-learning in a Computer Engineering course (Digital Electronics) [1] . The DEEDS platform logs student activities and actions during exams (such as viewing/studying exam content, working on a specific exercise, using a text editor or IDE, or viewing exam-related material). Our literature survey indicates that this DEEDS dataset was explored twice: in [1, 4] , wherein authors attempted to predict student performance based on exam complexity and predict the exam difficulty based on student activities from prior sessions, respectively. The aim of this research work is to build a prediction model which is based on newly extracted statistical features aimed at predicting students' performance based on their online activities. To build and refine the model, we have proceeded as follows. Initially, we have proposed new features that were categorized into three broad categories, based on different criteria: (1) Activity-type count-based, (2) Timing statistics-based, and (3) Peripheral activity count-based, resulting in a total of 86 features. We have also proposed further improvement in the model by reducing the set of features and eventually keeping the most influential (significant) ones using the entropy-based feature selection method [35, 36] . The proposed model was then evaluated and compared with other existing similar research work. We compared the performance with some of the existing work addressing the same problem and using the same DEEDS dataset. We have shown that our proposed model outperforms existing ones in terms of classification accuracy results. The key contributions of this research work can be summarized as follows. • Categorization of student academic performance-related features in existing datasets. • Statistical analysis of the DEEDS dataset which supported the feature extraction process. 
• The design of a student performance prediction model based on the extraction of a set of statistical features which were categorized into three broad categories: (1) activitytype based, (2) timing statistics-based, and (3) peripheral activity count-based. These features comprise an extraction phase followed by a feature selection phase using an entropy-based selection method. • Performance evaluation of the proposed model by considering the following classifiers: random forest (RF), support vector machine (SVM), Naïve Bayes (NB), logistic regression (LR), and multilayer perceptron (MLP). • Comparative performance analysis between our proposed model and some of the existing published research proposing students' performance prediction models using the DEEDS dataset [4, 30, 31] . The rest of the paper is organized as follows. Section 2 presents the background about the student performance prediction domain: its importance, applicable prediction metrics, dataset categorization, and overview of prediction models. Section 3 describes our proposed approach considering student engagement data wherein the DEEDS dataset is presented along with the feature extraction process. Section 4, details the model performance in terms of prediction accuracy, and then a comparative study is presented in relation to the existing work. Finally, in Sect. 5, conclusions are drawn, and future research directions are suggested. This section starts with a brief background about the importance of student performance prediction. Then it overviews the performance prediction targets in terms of prediction goals. Next, it surveys and categorizes the set of features commonly considered in most of the datasets used during the prediction process. Finally, it overviews the prominent approaches and models being used along with their achieved performance. The problem of predicting student performance has been extensively studied by the research community as part of the learning analytics topic due to its importance for many academic disciplines [2, 3, [5] [6] [7] . Based on the goal of the performance prediction model, the benefits may include the following: (1) improved planning and accurate adjustments in education management strategies to yield enhanced attainment rates in program learning outcomes [8] , (2) identify, track and improve student learning outcomes and their impact on classroom activities. For instance, prediction models could be tuned to classify student performance as low, average, or high. Based on the classification results, concerted measures may be taken by the education managers to support the low-performing students [7] , (3) propose new, formative learning approaches for the students based on their predicted performance [8] . For instance, students are advised to adopt different learning strategies such as emphasizing more on practical aspects of course material, (4) allocating resources to the students based on their predicted performance. For instance, the identification and prediction of high-performing students will support institutions to estimate the number of awarded scholarships [9] , (5) minimize the student dropout rates which is considered a resources black hole that impacts graduation rates, quality, and even institutional ranking [10] . Student performance prediction models have targeted several metrics which are both quantitative and qualitative in nature. The amount of research work to predict quantitative metrics outweighs those for qualitative metrics [8] . 
Qualitative metrics have mainly focused on Pass/Fail or Letter Grade classifications of students in particular courses [11] or overall student assessment prediction in terms of high/average/low. This type of assessment could be performed per course, major, topic, etc. [3] , or student knowledge accomplishment levels: First/Second/Third/Fail [4, 12] , or to classify students into low risk/high risk/medium risk [7] . By contrast, quantitative metrics have mainly attempted to predict scores or course/exam/assignment grades [5] , range of course/exam/assignment grades [6] , major dropout/retention rates [10] , prediction of the time needed for exam completion, prediction of on-duration/delay of graduation and student engagement as well [13] . Most of the datasets that have been used to machine-learn the student performance have considered historic data that can be categorized into three broad categories based on the attribute types [2] : (1) student historic performance attributes, (2) student demographic attributes, and (3) student learning platform interactions (engagement) attributes. These categorizations were further extended into a more comprehensive classification of features to include two more categories [3] , namely: (4) personality-to better describe the subject capability and ability (such as efficacy, commitment, efficiency, etc.), and (5) institutional-to better describe the teaching methods, strategies, and qualities [14] . We have surveyed several datasets that have been considered for student performance prediction studies and have summarized them in Table 1 , where we enlisted the five categories along with most of the common and relevant features being used in each of these five categories. Table 1 also includes the stated aim of the prediction study per category. Several methods and approaches were considered in predicting student performance; most of these approaches are statistical in nature and designed for machine learning (ML) models. The models attempt to estimate an inherent correlation between input variables and identify patterns within the input data. Following our review of most of the existing datasets, these attributes can be classified under any of five categories, namely: (1) student historic performance attributes, (2) student demographic attributes, and (3) student learning platform interactions attributes, (4) personality attributes, and (5) institutional attributes; as detailed in Table 1 . Among the two existing ML models types, supervised learning techniques are a better fit for handling classification and regression problems and were more widely used to deal with the student prediction problem as compared to the unsupervised learning techniques. Classification approaches attempt to classify entities into some known classes, which are two in the case of binary classification (for example, classifying students into Passing or Failing classes) or more in the case of multinomial classification. On the other hand, in regression-based approaches, the model attempts to predict a continuous type of value (for instance, predict the final exam score, which could be a real number between "0 to 100"). This makes regression techniques more challenging problems to solve compared to the classification problems. 
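As a rough illustration of the two formulations, the following Python sketch trains a classifier on a binary pass/fail label and a regressor on the raw score; the synthetic feature matrix, the 0-100 score range, and the 60-point pass cutoff are placeholder assumptions for illustration, not values from any of the surveyed studies.

    # Two formulations of the same prediction task on synthetic placeholder data.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.random((200, 10))                  # hypothetical engagement features
    scores = rng.integers(0, 101, size=200)    # hypothetical exam scores in 0-100

    # Classification: predict a discrete label (pass/fail at an assumed 60-point cutoff).
    y_class = (scores >= 60).astype(int)
    clf = RandomForestClassifier(random_state=0).fit(X, y_class)
    print(clf.predict(X[:3]))                  # discrete class labels, e.g. [1 0 1]

    # Regression: predict the continuous score itself (a harder target to estimate).
    reg = RandomForestRegressor(random_state=0).fit(X, scores)
    print(reg.predict(X[:3]))                  # real-valued score estimates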
In the context of student performance prediction, several supervised learning models were used while considering datasets with each of the five feature categories as well as their combinations and targeting the prediction of specific performance features (as specified in Table 1 ). For instance, the authors in [2] have studied the impact of each of the three categories of features (student engagement, demographics, and performance data) on predicting student performance using binary classification-based models to predict at-risk students and regression techniques to predict the student scores. Also, they studied the prediction performance at different time instances before taking the final exam. The analysis was performed on a public open university learning analytics dataset (OULAD) while using Support vector machine (SVM), decision tree (DT), artificial neural networks (ANNs), Naïve Bayes (NB), K-nearest neighbor (K-NN), and logistic regression (LR) models for classification and SVM, ANN, DT, Bayesian Regression (BN), K-NN, and Linear Regression models for regression analysis. For the classification task, better performance results were obtained with ANN while considering engagement and exam score data (F1-score~96%). The same performance was also obtained for the regression analysis task where ANN outperformed other algorithms while inputting the model with more historic assessment scores (RMSE~14.59). The authors in [5] have focused on predicting student performance based on historic exam grades and in-progress course exam grades. The goal of such a study is to identify, per area and per subject, the "at-risk" students and hence provide real-time feedback about the current status of the students. Such feedback would drive the appropriate remedial strategies to support these students and eventually help improve retention rates. The authors have conducted their studies on 335 students and 6358 assessment grades while considering 68 subjects categorized into 7 different knowledge areas. The prediction model used the Decision Tree algorithm to classify students into passing and failing categories. In their effort to identify the most influential variable, the authors run their model while considering all possible combinations of Final grades and their weighted Partial (1 to 3) ones. The best model accuracy performance was reported (96.5%) when all Partial grades were included in the prediction. In a different study, Huang and Fang [7] focused on predicting student performance in a specific field: Engineering dynamics. The authors conducted their research while considering four predictive models (multiple linear regression (MLR), multilayer perceptron network, radial basis function network, and the support vector machine-based model). Their study aimed to identify the best prediction model and the most influential variable among a set of six considered ones leading to a more accurate prediction model. The dataset being considered in this study included only the historic performance type of data. The dataset was a collection of 323 students over 4 semesters and included nine different grade dynamics and pre-requisite dynamic courses. Results showed that the type of mathematical model being used had little impact on the prediction accuracy. Regarding the most influential variables, results showed that it varies based on the goal of the instructor in terms of what to predict (average performance of the entire dynamic class or individual student performance). 
Best performance results in terms of accuracy were achieved with MLR (reaching 89.7%). Similar to the proposed model in [5] , this model only considered one category of student data (historic performance-related data) and did not include different categories such as engagement or demographic data. As online teaching is gaining more and more popularity, it is becoming necessary for all schools to provide e-learning options to their students, especially after the COVID-19 pandemic. Many research works have studied and evaluated the performance of such learning in a virtual environment. For instance, Hussain et al. [15] have focused on studying the impact of student engagement in a virtual learning environment on their performance in terms of attained exam scores. This study has considered various variables including demographics, assessment scores, and student-system engagements such as the number of clicks that student executes to access certain URLs, resources, homepages, etc. The authors have considered several machine learning classification algorithms to predict low engagement instances, such as Decision Trees (DT), JRIP, J48, gradient-boosted tree (GBT), CART, and Naïve Bayes. The best performance in terms of predicting students with low engagement was obtained with the first four algorithms (topping 88.5%). Also, these results have identified the best predictive variables in terms of the number of clicks students executed per activity during the identification of low-engagement students. In the same context, Vahdat et al. [1] aimed to study the impact of student behavior during the online assessment and the scores obtained. The authors have used complexity matrix and process mining techniques to identify any correlation between attained scores and student engagement. The dataset was collected using digital electronics education and design suite (DEEDS), a simulation software used for e-learning in a Computer Engineering course, namely Digital Electronics. Analysis has shown that: (1) the complexity matrix and student scores are positively correlated, (2) and the complex matrix and session difficulties are negatively correlated. Additionally, the authors demonstrated that the proposed process discovery could provide useful information about student learning practices. In a different study Elbadrawy et al. [16] , the authors have built a model to accurately predict student grades based on several types of features (past grade performance, course characteristics, and student interaction with the online learning management system (aka. Student Engagement). The built model relies on a weighted sum of collaborative multiregression-based models, which were able to improve the prediction accuracy performance by over 20%. Along the same directions as in [16] , Liu and d'aquin [17] have attempted to predict student performance based on two categories of features: Demographics and Student Engagement with the online learning system. They have applied supervised learning-based algorithms in their model on the Open University Learning Analytics dataset [18] and investigated the relationship between demographic features and the achieved performance. Analysis has shown that the bestperforming students were those who had acquired a higher education level and were residing in the most privileged areas. Hussain et al. in [4] have investigated predicting difficulties that students may face during Digital Electronics lab sessions based on previous student activities. 
They identified the best predicting model, as well as the most influential features. The authors only considered the engagement type of data, collected using the digital electronics education and design suite (DEEDS) simulator [1]. They conducted their study considering the following five features: average time, average idle time, the total number of activities, the total related activities, and the average number of keystrokes. Five classification algorithms were explored: support vector machines (SVMs), artificial neural networks (ANNs), Naïve Bayes classifiers, Logistic Regression, and Decision Trees. Under fivefold cross-validation and random data division, the best accuracy results were obtained with the ANN and SVM-based models (75%). This performance was later improved and reached 80% when applying the Alpha Investing technique to the SVM-based model. The DEEDS dataset has also been used in other research work where researchers attempted to predict the performance of students and the difficulties they face by analyzing students' behavior during interactive online learning sessions. For instance, in [30], the authors attempted to predict student performance using the DEEDS dataset with regression models. The input variables of the model covered the students' work across all sessions, while the model output was the student's grade for a specific session. Among the three models used (Linear Regression, Artificial Neural Networks, and Support Vector Machine (SVM)), SVM performed the best and achieved an accuracy of 95%. In a different research work, the authors in [31] also considered the DEEDS dataset to perform a comparative analysis of various machine learning models, including Artificial Neural Network, Logistic Regression, Decision Tree, Support Vector Machines, and Naïve Bayes. In their study, the authors extracted and considered six features: average time, total activities, average mouse clicks, related activities in an exercise, average keystrokes, and average idle time per exercise. SVM performed the best compared to the rest of the models and achieved an accuracy of 94%. In [32], the authors considered different datasets in their attempt to predict the performance of students in two different courses (Mathematics and Portuguese language). Both datasets contain 33 attributes each, with 396 and 649 records, respectively. The authors applied two different models, Support Vector Machines and Random Forest, to separate passing students from failing ones. From these attributes, 16 features covering historic performance, demographic, and personality data were used. Experimental results showed an accuracy of more than 91% and indicated that the historic features were the most influential during classification. It is also worth noting that Random Forest performed better on the larger dataset (Portuguese language). The authors in [33] proposed a model to predict students' performance in higher educational institutions using video learning analytics and data mining techniques. A sample of 722 students was considered in the collection of the dataset. Among the five categories highlighted in Table 1, only the historic performance, engagement, personality, and institutional categories were used.
Out of the eight algorithms used, RF performed best and achieved an accuracy of 88% following the feature reduction step. In another study [34], the authors used Artificial Neural Networks (ANN) to predict the performance of students in an online environment. The dataset was collected with the participation of 3518 students. Out of the five categories of features described in Table 1, only the historic performance and learning platform interaction categories were considered. Results showed that the ANN-based model was able to predict the students' performance with an accuracy of 80%. A summary of student performance prediction using machine learning-based models is presented in Table 2. It is worth noting that the last entry in Table 2 captures the performance achieved by the proposed model in predicting students' performance while considering student engagement and historic types of features. Our proposed research explores the area of predicting student performance in an e-learning environment, which is gaining popularity, especially after the COVID-19 pandemic. We propose exploring the DEEDS dataset, which has, to the best of our knowledge, been studied twice, by Vahdat et al. [1] and Hussain et al. [4]. DEEDS is a technology-enhanced learning platform that logs the real-time interactions of students during classwork as well as exam performance in terms of grades; these logs were collected in six different interactive sessions. We intend to conduct a statistical study on the DEEDS dataset followed by the design of a new prediction model based on a new set of statistical features to predict student performance using their interaction logs registered by DEEDS. We propose assessing our model's performance using five different types of classifiers under different experimental setups. We will also compare the achieved performance in terms of accuracy and F1-score to that reported by Hussain et al. [4]. The proposed method aiming to classify student performance follows a typical machine learning classification approach. Initially, we start with a statistical analysis and feature engineering of the DEEDS dataset, which resulted in the reduction of DEEDS activity features from 15 to 9 activities. This reduction (as will be described in Sect. 3.2) consists of aggregating semantically similar activities into a single category. The next step of this process consists of the data pre-processing phase, where entries showing some discrepancy were discarded, and DEEDS logs with no corresponding lab or final exam grades were also excluded. This is further explained in Sect. 3.3. The next step is a feature extraction phase, which produced three broad categories of features: (1) activity-type count-based, (2) timing statistics-based, and (3) peripheral activity count-based features, followed by a feature selection phase in which only influential features were retained. Our proposed model attempts to predict weak versus good performing students based on their interaction with DEEDS, through the use of binary classification models. The proposed model's performance was also compared to that of the model proposed in [4], where the authors considered a set of 30 extracted features on the same DEEDS dataset. Figure 1 shows a workflow diagram of the entire process that we followed to build our prediction model. The process involves the typical five machine learning steps: (1) DEEDS raw dataset collection, (2) DEEDS data log pre-processing, (3) feature extraction, (4) feature selection, and (5) classification and analysis.
Fig. 1 Workflow diagram of the proposed model for student performance prediction
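As a minimal sketch of this five-step workflow, the snippet below assumes the per-student, per-session feature table has already been extracted and labeled; the file name, the label column, and the use of scikit-learn's mutual-information criterion as a stand-in for the entropy-based ranking are illustrative assumptions, not the study's exact implementation.

    # Skeleton of the five-step workflow (file name, column names, and the
    # mutual-information criterion are illustrative stand-ins).
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # (1)-(3) assume the raw logs were already collected, pre-processed, and
    # turned into one feature row per student-session, with a binary label column
    data = pd.read_csv("deeds_features.csv")            # hypothetical file
    X, y = data.drop(columns=["label"]), data["label"]

    # (4) keep only the highest-ranked features (information-based criterion)
    X_sel = SelectKBest(mutual_info_classif, k=68).fit_transform(X, y)

    # (5) classification and analysis (RF shown; SVM, NB, LR, MLP are analogous)
    X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.2, random_state=1)
    model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)
    print(classification_report(y_te, model.predict(X_te)))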
In this research work, the DEEDS dataset was used. It is among the very few existing datasets that capture real-time, in-class student interactions and behavior on a technology-enhanced learning platform, DEEDS (Digital Electronics Education and Design Suite). This dataset was chosen for this study due to its richness in terms of the variety of student interaction information logged by the system, such as: the average time spent by students on various types of problems and learning environments, the types of activities, the total number of activities, the average idle time, the average number of keystrokes, the total related activities for the various exercises during individual sessions, the grade achieved by students in each session, etc. We also believe there is room to develop a classification model that achieves better performance than the existing ones, which is the main objective of this research work. General information about DEEDS is presented in Table 3. DEEDS logs the real-time interaction of students during classwork as well as exam performance in terms of grades. The DEEDS dataset was collected over six working lab sessions. In each session, students first study a specific topic and then work on a set of exercises. The number of exercises per session ranges from 4 to 6: in sessions 1, 3, and 5, four exercises were planned; in sessions 2 and 6, six exercises were planned; and in session 4, five exercises were planned. Student performance (in terms of attained grades) is subsequently recorded, with the maximum session grade varying between 4 and 6. During a study session, DEEDS records all student activities, such as using the text editor, using a simulation timing diagram tool, reviewing study material, using Aulaweb, etc. After attending all lab sessions, students take an exam whose questions cover each of the six topics studied in the six lab sessions, and student performance per topic is also recorded. DEEDS creates a separate file for each student attending a session and adds a new entry at every one-second interval. Each new entry corresponds to a new row in these log files describing student activities during the session. Each row is a collection of 13 comma-separated features arranged in the order indicated in Table 4. Figure 2 shows a snapshot of the comma-separated log file corresponding to student ID 21, collected during the 4th session. As a result of the preliminary data pre-processing and metadata analysis, we are able to better describe the dataset, as shown in Table 5. Although 115 students were expected to participate in this experiment and eventually take exams, 7 students did not attend all sessions, and an average of 86 students were registered per lab session. Our dataset included more than 230,000 entries, which were fairly uniformly distributed across sessions, as shown in Fig. 3: sessions 1 through 5 each included 13-17% of the entire dataset, while the last session included 23% of all data.
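A minimal sketch of loading one such per-student log with pandas is shown below; the column names are assumptions based on the fields described above, and the file path is hypothetical.

    # Loading one raw DEEDS log: one CSV per student per session, one row per
    # one-second interval, 13 comma-separated fields (column names assumed).
    import pandas as pd

    cols = ["session", "student_id", "exercise", "activity",
            "start_time", "end_time", "idle_time",
            "mouse_wheel", "mouse_wheel_click", "mouse_click_left",
            "mouse_click_right", "mouse_movement", "keystrokes"]

    log = pd.read_csv("Session4/student_21.csv", names=cols)   # hypothetical path

    # Quick checks mirroring the statistics discussed in this section
    print(log.shape)                                      # number of one-second entries
    print((log["keystrokes"] == 0).mean())                # share of zero keystroke counts
    print(log["activity"].value_counts(normalize=True))   # relative activity frequencies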
Statistical analysis also showed that some dataset features logged very few events, with most of the registered values being "0". For the "number of mouse wheel" feature, 90.7% of the logged values were zeros; for the "number of mouse wheel click" feature, 99.9% were zeros; for the "number of mouse click right" feature, 95% were zeros; and 86% of the logged "keystrokes" values were zeros. On the other hand, the "mouse click left" and "mouse movement" features were well distributed across ranges of 0 to 1096 and 0 to 85,945, respectively. Our next analysis focused on the representation of each of the activities in the entire dataset. Some activities did not have sufficient representation and appeared in less than 1% of the dataset, such as "Text Editor no exercise" (0.02%), "Deeds no exercise" (0.2%), "Deeds other activity" (0.4%), and "FSM related" (0.1%). This is in comparison with the rest of the activities, which had representation rates between 7.5 and 16.3% of the entire dataset. Proportional activity frequency results are depicted in Fig. 4. To avoid dropping any of the least frequent activity entries, we propose aggregating semantically equivalent activities under new activity categories. These are represented in Fig. 5, along with their representation frequencies. In this figure, the activity "Editing" includes the similar editing activities "Text Editor exercise", "Text Editor no exercise", and "Text Editor other activity", with an 18% frequency. The same applies to the new category "Deeds activity", which includes "Deeds exercise", "Deeds no exercise", and "Deeds other activity", with a 17% overall frequency; to the new category "Study", which includes "study exercise" and "study material", with a 10% frequency; and to the new category "FSM", which includes "FSM exercise" and "FSM material", with a 9% frequency.
Pre-processing is a typical first step in the data classification and pattern recognition process. During this phase, the data are transformed, irrelevant information is discarded from the original raw dataset, and only relevant data are kept for further processing. In the current work, the following steps were executed on the original raw data.
(a) Data discrepancy removal: We noticed that the original dataset had some inconsistencies between the pairs (Session, Exercise) and (Exercise, Activity), as indicated in Fig. 6. The DEEDS platform appears to have wrongly logged the session ID against its corresponding exercise ID (Fig. 6a) and the exercise ID against its corresponding activity ID (Fig. 6b). The frequency of such inconsistencies was minimal, accounting for less than 0.5% of the entire dataset, and was proportionally equivalent across all sessions and exercises; therefore, these entries were discarded from the dataset. During this phase, we also excluded the entries where students had logged on but had not started working on the exercises; these are the logs with the exercise ID field matching "ES".
(b) Session 1 data removal: The session 1 data was filtered out. This was necessary since session 1 did not have any corresponding intermediate or final exam grades related to the topics covered in that session. It is worth mentioning that student grades (intermediate or final exam) are used to label our data for classification purposes.
(c) Time format conversion: The DEEDS platform logs time using the format "hours:minutes:seconds". These times were converted to seconds to facilitate computations on them.
(d) Exclusion of exercises 5 and 6: We restricted our analysis to the first 4 exercises of each of the 5 sessions (2 through 6); the remaining exercises were discarded. This step is necessary since not all sessions have the same number of exercises, and a consistent number of exercises per session is pivotal for the classification problem.
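A sketch of these pre-processing steps, under the column-name assumptions above, is given below; the raw activity label strings in the aggregation map are illustrative placeholders rather than the exact strings written by DEEDS.

    # Pre-processing and activity aggregation (column names as assumed above;
    # the raw activity strings in ACTIVITY_MAP are placeholders).
    import pandas as pd

    ACTIVITY_MAP = {
        "TextEditor_Ex": "Editing", "TextEditor_No_Ex": "Editing", "TextEditor_Other": "Editing",
        "Deeds_Ex": "Deeds", "Deeds_No_Ex": "Deeds", "Deeds_Other": "Deeds",
        "Study_Ex": "Study", "Study_Materials": "Study",
        "FSM_Ex": "FSM", "FSM_Related": "FSM",
    }

    def preprocess(logs: pd.DataFrame) -> pd.DataFrame:
        logs = logs.copy()
        # (a) drop entries logged before any exercise was selected
        logs = logs[logs["exercise"] != "ES"]
        # (b) drop session 1, which has no corresponding grades
        logs = logs[logs["session"] != 1]
        # (c) convert "hours:minutes:seconds" strings into seconds
        for col in ("start_time", "end_time"):
            logs[col] = pd.to_timedelta(logs[col]).dt.total_seconds()
        # (d) keep only the first four exercises of every session
        logs = logs[logs["exercise"].astype(str).isin(["1", "2", "3", "4"])]
        logs["exercise"] = logs["exercise"].astype(int)
        # aggregate semantically similar activities into broader categories
        logs["activity"] = logs["activity"].replace(ACTIVITY_MAP)
        return logs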
In this work, we propose using an augmented set of numeric features compared to those used by Hussain et al. [4], where the authors considered 30 features (5 features per exercise) for each student-session combination. We extracted a set of 86 features for each student per session and categorized these features into 3 broad categories: (1) activity-type count-based features, (2) timing statistics-based features, and (3) peripheral activity count-based features, which are explained below. As stated in our dataset description, DEEDS defines 15 types of activities. Based on our statistical analysis of the log count distribution of each of these activities (Fig. 4), along with their significance, we propose reducing this list from 15 activities to 9 aggregated activity categories, which form the Activity_type set. Each of these 9 activities is referenced by its order of appearance in the Activity_type set; for instance, activity 3 represents "Study" and activity 9 represents "Properties". Before we can mathematically express our activity-based features, we introduce the following activity matrix, which captures statistics about student activity occurrence counts in all 4 exercises per session:

A = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} & \cdots & a_{1,9} \\ a_{2,1} & a_{2,2} & a_{2,3} & \cdots & a_{2,9} \\ a_{3,1} & a_{3,2} & a_{3,3} & \cdots & a_{3,9} \\ a_{4,1} & a_{4,2} & a_{4,3} & \cdots & a_{4,9} \end{bmatrix}

In matrix A, which is a 4 by 9 matrix, element a_{i,j} represents the occurrence count of activity j in exercise i. For instance, a_{1,3} represents the occurrence count of activity 3 ("Study") in exercise 1. Following the definition of matrix A, we define the first 36 features as depicted in Eq. 1:

F_{9(e-1)+j} = a_{e,j},  e = 1, ..., 4,  j = 1, ..., 9    (1)

These 36 features track the occurrence count of the 9 activities in each of the 4 exercises (indexed by e). F_1, for instance, corresponds to a_{1,1} and represents the occurrence count of the "Editing" activity in exercise 1. The next 9 features (F_37 through F_45) track the occurrence count of each of the 9 activities across all exercises. These are captured in Equation set 2; F_37, for instance, represents the occurrence count of the "Editing" activity over all 4 exercises.

F_{36+j} = \sum_{e=1}^{4} a_{e,j},  j = 1, ..., 9    (2)

The next 4 features track the occurrence count of all 9 activities together in each of the 4 exercises. These are captured in Equation set 3; F_46, for instance, represents the aggregated occurrence count of all 9 activities in exercise 1.

F_{45+e} = \sum_{j=1}^{9} a_{e,j},  e = 1, ..., 4    (3)

The final feature in this activity count-based category tracks the aggregated occurrence count of all activities across all exercises, as captured in Eq. 4:

F_{50} = \sum_{e=1}^{4} \sum_{j=1}^{9} a_{e,j}    (4)

The second category of features captures the timing performance of a student while working on the exercises. We used 2 sets of timing features of 4 features each. The first set, {F_51, F_52, F_53, F_54}, describes the time spent in each exercise, calculated as the difference between the maximum end time and the minimum start time for each of the 4 exercises, as indicated in Eq. 5; F_51, for instance, corresponds to the period of time spent by a student in exercise 1.

F_{50+e} = \max(End\_Time_e) - \min(Start\_Time_e),  e = 1, ..., 4    (5)

The next set of 4 features, {F_55, F_56, F_57, F_58}, describes the total idle time registered in each of the 4 exercises; F_55, for instance, describes the total idle time registered in exercise 1. This is captured in Eq. 6, where the sum runs over the n_e log entries of exercise e:

F_{54+e} = \sum_{k=1}^{n_e} Idle\_Time_e(k),  e = 1, ..., 4    (6)
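The sketch below illustrates how the activity matrix A and features F_1 through F_58 could be computed for one pre-processed student-session table, under the column-name assumptions introduced earlier; only the "Editing", "Study", and "Properties" positions are fixed by the text, so the remaining category names in the list are placeholders.

    # Building matrix A and features F_1..F_58 for one student-session table.
    import numpy as np

    ACTIVITIES = ["Editing", "Deeds", "Study", "FSM", "Aulaweb",
                  "Simulation", "Diagram", "Other", "Properties"]   # positions 1, 3, 9 fixed by the text

    def activity_and_timing_features(session_log):
        A = np.zeros((4, 9))
        for e in range(1, 5):
            ex = session_log[session_log["exercise"] == e]
            for j, act in enumerate(ACTIVITIES):
                A[e - 1, j] = (ex["activity"] == act).sum()

        f1_36 = A.flatten()          # F_1..F_36: per-exercise, per-activity counts (Eq. 1)
        f37_45 = A.sum(axis=0)       # F_37..F_45: per-activity counts over all exercises (Eq. 2)
        f46_49 = A.sum(axis=1)       # F_46..F_49: per-exercise totals (Eq. 3)
        f50 = A.sum()                # F_50: overall activity count (Eq. 4)

        f51_54, f55_58 = [], []
        for e in range(1, 5):
            ex = session_log[session_log["exercise"] == e]
            f51_54.append(ex["end_time"].max() - ex["start_time"].min())   # Eq. 5
            f55_58.append(ex["idle_time"].sum())                           # Eq. 6

        return np.concatenate([f1_36, f37_45, f46_49, [f50], f51_54, f55_58])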
The final set of features tracks the count of student interactions with the computer input peripherals (mouse and keyboard). DEEDS logs five different types of mouse activities, {Mouse_Wheel_Count, Mouse_Wheel_Click_Count, Mouse_Click_Left_Count, Mouse_Click_Right_Count, Mouse_Movement_Count}, along with a single keyboard activity, Keystroke_Count. We therefore used 24 new features to track the total occurrence count of each of these 6 peripheral-related activities in each of the 4 exercises. Each of these 6 peripheral activities is mapped to its order of appearance in the Peripheral_Activity_type set; for instance, peripheral activity 3 represents "Mouse Click Left Count" and activity 6 represents "Keystroke Count". Similar to the case of the activity-based features, we introduce the following peripheral activity matrix, which captures statistics about student interactions with the computer peripherals in all 4 exercises per session:

P = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,6} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,6} \\ p_{3,1} & p_{3,2} & \cdots & p_{3,6} \\ p_{4,1} & p_{4,2} & \cdots & p_{4,6} \end{bmatrix}

In matrix P, which is a 4 by 6 matrix, element p_{i,j} represents the occurrence count of peripheral activity j in exercise i. For instance, p_{1,3} represents the occurrence count of peripheral activity 3 ("Mouse Click Left Count") in exercise 1. Each element of the peripheral activity matrix maps to a different feature to be considered during the classification phase. These constitute a set of 24 new features (F_59 through F_82); F_59, for instance, describes the total occurrence count of mouse wheel events in exercise 1. The last set of features within this category defines 4 more features describing the overall level of utilization of the computer peripherals in each of the 4 exercises. These are described in Equation set 7; F_83, for instance, specifies the total occurrence count of the five mouse activities and the keystroke activity in exercise 1.

F_{82+e} = \sum_{j=1}^{6} p_{e,j},  e = 1, ..., 4    (7)

Our feature selection phase consists of ranking all features by applying an entropy-based technique based on Shannon's theory [35, 36]. This approach assigns ranks based on the level of uncertainty, or disorder, within the data: features that contribute little to reducing the uncertainty about the target class are assigned low rank values. The entropy-based technique is powerful for determining the level of association between variables, as it captures both linear and non-linear relationships. We propose assessing the performance of our prediction model with the full set of extracted features and with a reduced set obtained after eliminating the non-influential (low-ranked) features.
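As a rough sketch of this ranking step, scikit-learn's mutual_info_classif can serve as a stand-in for the entropy-based criterion of [35, 36]; the feature matrix and labels below are synthetic placeholders with the dimensions used in this study (575 rows, 86 features, 18 features dropped).

    # Entropy/information-based feature ranking (mutual information used as a
    # stand-in for the criterion of [35, 36]); X and y are synthetic placeholders.
    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    rng = np.random.default_rng(0)
    X = rng.random((575, 86))                 # placeholder 86-feature matrix
    y = rng.integers(0, 2, size=575)          # placeholder binary labels (A/B)

    scores = mutual_info_classif(X, y, random_state=0)   # one relevance score per feature
    ranking = np.argsort(scores)[::-1]                   # most informative features first
    X_reduced = X[:, ranking[:68]]                       # drop the 18 lowest-ranked features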
Among the many existing classification models, we evaluated our model using five different classifiers. Many factors influence the suitability of a machine learning model, and they may differ from one problem to another; they include the type of data, the dataset size, the number of features, the data distribution, etc. Also, a model may work well for some problems but not others; for instance, SVM is known to perform well on relatively small datasets, which is the case for DEEDS. In this work, models were chosen based on two criteria: (1) coverage of the existing work dealing with the same problem on the same DEEDS dataset (for instance, SVM, LR, and NB), and (2) coverage of the broad spectrum of classifier categories. For instance, RF was considered as a representative of ensemble classifiers, NB of probabilistic models, and MLP of neural-network-based models. The model selection process also included a step in which models exhibiting high error rates were eliminated. Next, a brief description of each of the classification models used in this work is presented.

RF is an ensemble of decision trees bundled together [19]. Training consists of executing a bagging process on a dataset of N entities: a set of N training samples is drawn with replacement and used to train a decision tree, and this process is repeated T times. The prediction for an unseen entity is then made through a majority vote of the T trees in the case of classification, or by averaging their outputs in the case of regression, as given by Eq. 8:

\hat{y} = \frac{1}{T} \sum_{i=1}^{T} f_i(x')    (8)

where \hat{y} is the predicted value, x' is the unseen sample, f_i is the decision tree trained on data sample i (so f_i(x') is its prediction for x'), and T is the number of trees. The RF technique has shown its ability to handle large datasets with a large number of attributes while weighing the importance of each feature, and it is robust to noise, outliers, and overfitting [20]. Contrary to single-model classification techniques, RF relies on a combination of classifiers, each contributing a vote during the classification process.

MLP is a supervised learning-based approach [21]. It is based on the concept of the perceptron in neural networks, which generates a single output from a multidimensional input by combining the inputs with their corresponding weights and passing the result through an activation function:

y = \alpha\left(\sum_{i} w_i x_i + \beta\right)

where w_i, x_i, \beta, and \alpha are the weights, the input variables, the bias, and the (generally non-linear) activation function, respectively. The MLP is composed of three or more node layers, including the input and output layers and one or more hidden layers. Training an MLP consists of adjusting the model parameters (biases and weights) through a feed-forward pass followed by a backward (back-propagation) pass with respect to the prediction error.

SVM is a supervised ML model for classification and regression problems; it has demonstrated efficiency in solving a variety of linear and non-linear problems. The idea of SVM lies in creating a hyperplane that distinctly separates the data into classes [22]. SVM works well for multi-domain applications with large datasets; however, the model has a high computational cost.

NB is a probabilistic algorithm based on Bayes' theorem. It is naïve in the sense that each feature is assumed to make an equal and independent contribution to the probability of the target class. NB has the advantage of noise immunity [23]. It has proven to perform well on large, high-dimensional datasets, is fast in terms of computational complexity, and is relatively easy to implement.

LR is an ML algorithm based on probability concepts and used for classification, i.e., separating success from failure events. LR can be considered a linear regression model with a more complex cost function based on the sigmoid function, compared to the linear function used in linear regression [24]. LR has the advantage of being computationally efficient and relatively simple to implement, with good performance on various types of problems. However, its main disadvantage is the assumption of linearity between the independent and dependent variables [25].
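For reference, the five classifiers can be instantiated in scikit-learn as sketched below; the hyper-parameter values are illustrative defaults rather than the tuned settings reported later. Each model is then trained and evaluated under the protocols described in the next section.

    # Instantiating the five classifiers (hyper-parameters are illustrative).
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    classifiers = {
        "RF":  RandomForestClassifier(n_estimators=100, random_state=1),
        "SVM": SVC(kernel="rbf", random_state=1),
        "NB":  GaussianNB(),
        "LR":  LogisticRegression(max_iter=1000),
        "MLP": MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=1),
    }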
Three sets of experiments aiming at three different goals were conducted during this study: (1) Goal 1: evaluate the performance of the proposed model while considering the full set of extracted features; (2) Goal 2: study the importance of each of the extracted features in the classification process using the entropy-based ranking approach; (3) Goal 3: compare the performance of our model to that of the models proposed in [4, 30, 31]. The general experimental setup consists of using DEEDS log data from five different sessions (sessions 2 through 6) along with the corresponding intermediate grades attained by the 115 students. Only the first 4 exercises from each session were considered, resulting in a total dataset size of 575 entries. We proceeded in the same direction as [4] and [31] in terms of data labeling: a student achieving a grade higher than 2 is labeled as a class "A" student (a student with "no difficulty"); otherwise, the student is labeled as a class "B" student (a student with "difficulty"). With this labeling strategy, 74% of our dataset falls under category A and 26% under category B. For the training and evaluation phase, we considered three sets of experiments. The first experiment consists of a random split where we randomly chose 80% of the data (460 records, roughly 4 data sessions) for training and 20% (115 entries, roughly 1 data session) for testing. The second experiment is a more generic approach and consists of classic fivefold cross-validation (80% of the data for training and 20% for testing in each fold). The third experiment consists of independently assessing the performance of our model per session: data from four sessions (equivalent to 80% of the data) were used for training and the remaining session (equivalent to 20%) for testing. In the classification phase, we used five well-known classifiers (MLP, RF, SVM, LR, and NB) to classify student performance and then identified the most accurate one. The parameter tuning phase led us to run all classifiers with a batch size of 100, a learning rate of 0.3, and a loss of 0.1. We evaluated the effectiveness of the proposed classification model through the analysis of the following four metrics: (1) accuracy, (2) precision, (3) recall (also known as sensitivity), and (4) F1-score. These are briefly described in Table 6, where T_P, T_N, F_P, and F_N represent the True Positive, True Negative, False Positive, and False Negative testing cases, respectively. Along with these four metrics, we also considered the Receiver Operating Characteristic (ROC) to analyze the proposed model's ability to distinguish between classes by looking at the True Positive rate versus the False Positive rate under different settings. The proposed models were evaluated based on the metrics mentioned in Sect. 4.1. Initially, we tested our model with the full set of 86 extracted features; next, we studied the level of influence of each of the 86 features on the overall model accuracy through the application of the entropy-based ranking approach.
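The three evaluation protocols can be sketched as follows on synthetic placeholder data with the dimensions described above (575 rows, 86 features, five sessions of 115 entries each); scikit-learn's LeaveOneGroupOut reproduces the train-on-four-sessions, test-on-one setup.

    # The three evaluation protocols on synthetic placeholder data.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import (train_test_split, cross_val_score,
                                         LeaveOneGroupOut)

    rng = np.random.default_rng(0)
    X = rng.random((575, 86))
    y = rng.integers(0, 2, size=575)
    sessions = np.repeat([2, 3, 4, 5, 6], 115)     # session id of each row

    clf = RandomForestClassifier(n_estimators=100, random_state=1)

    # (1) 80:20 random split
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
    print(clf.fit(X_tr, y_tr).score(X_te, y_te))

    # (2) fivefold cross-validation
    print(cross_val_score(clf, X, y, cv=5).mean())

    # (3) train on four sessions, test on the held-out one
    for tr, te in LeaveOneGroupOut().split(X, y, groups=sessions):
        print(sessions[te][0], clf.fit(X[tr], y[tr]).score(X[te], y[te]))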
In line with Table 7, Fig. 7 shows a breakdown of the average performance of all five classifiers in terms of their effectiveness in predicting the correct classes (A for students with no difficulty versus B for students with difficulty). Results demonstrated a pattern of better prediction rates for class A entries than for class B entries across all classifiers. For example, RF achieved a 98.8% F1-score for class A entries compared to 95% for class B entries. By contrast, NB consistently showed low performance, especially for class A entries, where a 73.7% F1-score was attained. In this set of experiments, we studied the relevance and level of influence of the extracted features through the application of the entropy-based ranking approach. Feature ranking results are captured in Table 8. Results show that about 20% of the features received a low rank (less than 0.21), indicating that these features may not influence the classification model's performance. In fact, re-running our prediction model after excluding these 18 features from the original dataset resulted in very similar accuracy for all classifiers. For instance, for our best performing classifier (the RF algorithm), and in the case of a random split of training and test data (randomly choosing 80% of the data for training and the remaining 20% for testing), we achieved an accuracy of 96.7% compared to 97.4% when the full list of features was considered. This insignificant variation in accuracy could be attributed to the size of our dataset. It is worth highlighting that, although the achieved accuracy was essentially the same, running the model with a reduced number of features reduced the overall complexity of the model. Our next results are depicted in Table 9, where we show the confusion matrices obtained with the MLP, RF, SVM, LR, and NB classifiers. Results are in line with those presented in Fig. 7. For class A entries and when considering the SVM classifier, Table 9 shows that 85 out of 86 were classified properly, which is slightly better than the performance of RF, where 84 out of 86 were classified correctly. However, RF showed better performance than SVM when classifying class B entries (28 out of 29 versus 24 out of 29). It is noteworthy that Table 9 reflects unbalanced data, where class A entries outnumber class B entries (86 versus 29, respectively). Unlike Table 7, where 80% of the data was chosen randomly for training and the remaining 20% for testing, Table 10 shows the performance achieved by all five classifiers under fivefold cross-validation, a more conservative approach. Results reported in Table 10 are generally in line with those reported in Table 7 in terms of classifier performance. The RF classifier, for instance, performed the best in terms of Accuracy and F1-score (93.37% and 95.4%, respectively), and NB continued to show the lowest performance. It is also noticeable that there was a slight decrease of about 4% in the overall performance reported in Table 10 compared to that reported in Table 7, which could be attributed to the random nature of the fivefold cross-validation technique.
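As a worked check of how the confusion-matrix counts relate to the metrics of Table 6, the RF counts reported above (84 of 86 class A entries and 28 of 29 class B entries classified correctly) reproduce the 97.4% accuracy quoted earlier; treating class A as the positive class:

    # Worked check using the RF confusion-matrix counts reported above.
    tp, fn = 84, 2     # class A entries: correctly / wrongly classified
    tn, fp = 28, 1     # class B entries: correctly / wrongly classified

    accuracy  = (tp + tn) / (tp + tn + fp + fn)                  # ~0.974
    precision = tp / (tp + fp)                                   # ~0.988
    recall    = tp / (tp + fn)                                   # ~0.977
    f1        = 2 * precision * recall / (precision + recall)    # ~0.982
    print(accuracy, precision, recall, f1)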
Our next set of results studies the performance of our model in a setup where data from four sessions are used for training and the remaining, unseen session data are used for testing. These results are captured in Table 11, which shows the results of five different experiments. For instance, the set of results for "session ID for testing" equal to 2 represents the performance of the model trained on sessions 3, 4, 5, and 6 and then tested on session 2. NB had the poorest performance (between 85 and 88% F1-score), followed by LR, then MLP, then SVM, then RF (between 94 and 97% F1-score). However, the differences in specific metrics varied from one session to another; this was most evident for the Recall and Accuracy metrics, where variations of up to 20% and 14%, respectively, were observed across the held-out sessions. The final set of results compares the performance of our proposed model with that of [4, 30, 31], where the DEEDS dataset was used to predict students' academic performance under the same experimental setup. Figure 8 illustrates the performance comparison of all four sets of research results (ours with three others from the recent literature, named Hussain, Sriram, and Maksud in Fig. 8) in terms of the Accuracy, Precision, Recall, and F1-score achieved with the best performing classifiers (RF in the case of our model, ANN in the case of [4], and SVM in the case of [30, 31]). Results show that our proposed model outperformed all three existing models in terms of accuracy, with an improvement ranging between 2 and 22% compared to that achieved in [4, 30, 31]. The F1-score was 12% higher than that achieved in [4] using the ANN classifier and 2% higher than that of the SVM classifier used in [30, 31]. We believe that such improvement is attributable to the extended set of features introduced by our model compared to the reduced and abstract list of five features per exercise proposed in [4]. While in [4] the authors did not differentiate between the types of activities within a single exercise, our model provisioned for the different types of activities, resulting in 9 distinct features along with the total activity occurrence count per exercise, i.e., a total of 10 activity-related features per exercise. Also, contrary to the model in [4], where the interaction of students with DEEDS was only captured by a single feature counting the number of keystrokes, our model took into account the different types of interactions of students with DEEDS through the input peripherals (mouse and keyboard), leading to a total of 6 peripheral-related features per exercise. The extra information provided to the prediction model explains the significant classification performance improvement captured in Fig. 8.
In this article, we demonstrated the ability to predict student performance by analyzing the interaction logs of students in the DEEDS dataset. We extracted a total of 86 statistical features, categorized into three main categories based on different criteria: (1) activity-type based, (2) timing statistics-based, and (3) peripheral activity count-based features. This set of features was further reduced during the feature selection phase, where we applied the entropy-based selection technique and retained only the influential features for training purposes. We trained our model considering three different scenarios: (1) an 80:20 random data split for training and testing, (2) fivefold cross-validation, and (3) training the model on all sessions but one, which is used for testing.
We then collected performance results in terms of Accuracy, Precision, Recall, F1-score, and ROC using five prominent classifiers (RF, SVM, MLP, LR, and NB). Results showed that the best performance was obtained using the RF classifier, with a classification accuracy of 97% and an F1-score of 97%, while the poorest results were achieved with NB, which can be attributed to the dependencies among the proposed features (violating the model's feature-independence assumption). When comparing our model with the benchmark models proposed by Hussain et al. [4], Sriram et al. [30], and Maksud et al. [31], we were able to demonstrate that, under a similar experimental setup, our model outperformed the existing models in terms of classification accuracy and F1-score.

For future work, we propose exploring the following research directions:
1. Compare the proposed model with models that use more sophisticated machine learning algorithms for feature extraction and classification, such as decision trees, fuzzy entropy-based analysis, and transfer learning.
2. Extend the prediction model to a multi-class problem aimed at classifying students into four broad categories: (1) very weak, (2) weak, (3) average, and (4) good.
3. Propose a regression model to predict exam grades, in addition to classifying students' performance with the binary classification approach.

Funding: Not applicable.

Data availability: The dataset used in this study is publicly available at https://sites.google.com/site/learninganalyticsforall/data-sets/epm-dataset.

References
1. A learning analytics approach to correlate the academic achievements of students with interaction data from an educational simulator
2. An overview and comparison of supervised data mining techniques for student exam performance prediction
3. Predicting academic performance: a systematic literature review
4. Using machine learning to predict student difficulties from learning session data
5. Application of machine learning in predicting performance for computer engineering students: a case study
6. Using machine learning algorithms to predict students performance and improve learning outcome: a literature based review
7. Predicting student academic performance in an engineering dynamics course: a comparison of four types of predictive mathematical models
8. Analyzing and predicting students' performance by means of machine learning: a review
9. A comparative study for predicting students academic performance using Bayesian network classifiers
10. Data mining for modeling students' performance: a tutoring action plan to prevent academic dropout
11. Student pass rates prediction using optimized support vector machine and decision tree
12. Student and school performance across countries: a machine learning approach
13. Combining university student self-regulated learning indicators and engagement with online learning events to predict academic performance
14. An application of classification models to predict learner progression in tertiary education
15. Student engagement predictions in an e-learning system and their impact on student course assessment scores
16. Collaborative multi-regression models for predicting students' performance in course activities
17. Unsupervised learning for understanding student achievement in a distance learning setting
18. OU Analyse: analysing at-risk students at The Open University
19. Random decision forests
20. An empirical comparison of voting classification algorithms: bagging, boosting, and variants
21. Enhanced MR image classification using hybrid statistical and wavelets features
22. Machine learning models and algorithms for big data classification
23. Machine learning for subsurface characterization
24. Statistics review 14: logistic regression
25. Logistic regression diagnostics: understanding how well a model predicts outcomes
26. Automatic visual features for writer identification: a deep learning approach
27. Determining the impact of demographic features in predicting student success in Croatia
28. Feature selection with the Boruta package
29. ANOVA for unbalanced data: an overview
30. A comparative analysis of student performance prediction using machine learning techniques with DEEDS lab
31. Machine learning approaches to digital learning performance analysis
32. Predicting student academic performance using Support Vector Machine and Random Forest
33. Predicting student performance in higher educational institutions using video learning analytics and data mining techniques
34. Predicting student final performance using artificial neural networks in online learning environments
35. Feature ranking methods based on information entropy with Parzen windows
36. Development of an entropy-based feature selection method and analysis of online reviews on real estate

Conflict of interest: The author has no conflicts of interest.