key: cord-0627672-dzvl358a authors: Copur, Onur; Nakip, Mert; Scardapane, Simone; Slowack, Jurgen title: Engagement Detection with Multi-Task Training in E-Learning Environments date: 2022-04-08 journal: nan DOI: nan sha: dc52d5df8179a2dfaca0f5d298a1264e9c8ce911 doc_id: 627672 cord_uid: dzvl358a

Recognition of user interaction, in particular engagement detection, has become highly important for online working and learning environments, especially during the COVID-19 outbreak. Such recognition and detection systems significantly improve user experience and efficiency by providing valuable feedback. In this paper, we propose a novel Engagement Detection with Multi-Task Training (ED-MTT) system, which minimizes mean squared error and triplet loss together to determine the engagement level of students in an e-learning environment. The performance of this system is evaluated and compared against the state of the art on a publicly available dataset as well as on videos collected from real-life scenarios. The results show that ED-MTT achieves 6% lower MSE than the best state-of-the-art performance, with highly acceptable training time and lightweight feature extraction.

During the COVID-19 outbreak, nearly all learning activities, like other meeting activities, moved to online environments [32]. Online learners participate in various educational activities, including reading, writing, watching video tutorials, online exams, and online meetings. While participating in these activities, they exhibit various engagement-related states, e.g., boredom, confusion, and frustration [11]. To provide feedback to both instructors and students, online educators need to detect their learners' engagement status precisely and efficiently. For example, a teacher can adapt and make lessons more interesting by increasing interaction, such as asking questions to involve non-interacting students.

Since students in e-learning environments are not speaking most of the time, engagement detection systems must extract valuable information from visual input alone [29]. This makes the problem non-trivial and subjective, because annotators can perceive different engagement levels from the same input video. The reliability of the dataset labels is therefore a major concern in this setting, but it is often ignored by current methods [29, 30, 32]. As a consequence, deep learning models overfit to the uncertain samples and perform poorly on validation and test sets.

In this paper, we propose a system called Engagement Detection with Multi-Task Training (ED-MTT) to detect the engagement level of participants in an e-learning environment. The proposed system first extracts features with OpenFace [2], then aggregates frames in a window to calculate feature statistics as additional features. Finally, it uses a Bidirectional Long Short-Term Memory (Bi-LSTM) [13] network to generate vector embeddings from the input sequences. In this system, we introduce triplet loss as an auxiliary task and design the system as a multi-task training framework, taking inspiration from [22], where self-supervised contrastive learning of multi-view facial expressions was introduced. The motivation for using triplet loss is that it can utilize many more elements for training by combining the original samples into triplets. In this way, it avoids overfitting and makes the feature representation more discriminative [9]. To the best of our knowledge, this is a novel approach in the context of engagement detection.
The key novelty of this work is the multi-task training framework using triplet loss together with Mean Squared Error (MSE). The main advantages of this approach are as follows:
- Multi-task training with triplet and MSE losses introduces additional regularization and reduces possible over-fitting due to the very small sample size.
- Using triplet loss mitigates the label reliability problem, since it measures the relative similarity between samples.
- A system with lightweight feature extraction is efficient and highly suitable for real-life applications.

Furthermore, we evaluate the performance of ED-MTT on the publicly available "Engagement in The Wild" dataset [7], which is comprised of separate training and validation sets. In our experimental work, we first analyze the importance of feature sets in order to select the best set of features for the resulting trained ED-MTT system. Then, we compare the performance of ED-MTT with 9 different works [1, 5, 15, 20, 24, 25, 27, 31, 32] from the state of the art, which are reviewed in the next section. Our results show that ED-MTT outperforms these state-of-the-art methods with at least a 6% improvement in MSE.

The rest of this paper is organized as follows: Section 2 reviews the related works in the literature. Section 3 explains the architectural design of ED-MTT. Section 4 presents experimental results for the performance evaluation of ED-MTT and the comparison with state-of-the-art methods. Section 5 concludes our work and experimental results.

One of the first attempts to investigate the relationship between facial features, conversational cues, and emotional expressions on the one hand and engagement detection on the other was presented by D'Mello et al. in [8]. The authors in [10, 28] used the Facial Action Coding System (FACS), which describes discrete emotions through facial muscle movements, and pointed out the relation between specific engagement labels and facial actions. In [28], Whitehill et al. addressed automatic recognition of student engagement from facial expressions, while in [19, 21, 23] the authors used models based on Convolutional Neural Networks (CNN) and Residual Networks (ResNet) [12]. All the works above treated engagement detection as a multi-class classification problem.

In contrast, in this paper, we follow a more recent line of research that considers engagement detection as a regression problem, where MSE loss is used to measure a continuous distance between predicted and ground truth engagement levels. Yang et al. [31] also used MSE loss and developed a method that ensembles four separate LSTMs using facial features extracted from four different sources. In [20], Niu et al. combined the outputs of three Gated Recurrent Units (GRU) based on a 117-dimensional feature vector composed of eye gaze, action unit, and head pose features. In [24], Thomas et al. used a Temporal Convolutional Network (TCN) on the same set of features as in [20].

In previous works [29, 32], the most common ways to overcome over-fitting are data augmentation and cross-validation training. Some other works [1, 27] consider imbalanced sampling [17] and weighted/ranked loss functions. Moreover, some works also use spatial dropout and batch normalization as regularization techniques [5, 24]. All of these studies address small sample sizes and imbalanced labels, but none of them considers the reliability of the labels. In contrast, ED-MTT aims to handle both overfitting and label reliability at the same time via multi-task training with triplet loss.
ED-MTT consists of four main parts: Feature Extraction, Frame Aggregation, Sequence Modeling, and Multi-Tasking. The inputs of this architecture are three batches of samples: Anchor, Positive, and Negative. In each batch, each sample is a sequence of images obtained by segmenting a video into m frames, each of size h × w × c, where h denotes the height in pixels, w the width in pixels, and c the number of color channels of each frame (the RGB color space is used). During training with this approach, each sample s in the anchor batch is assumed to have a labeled engagement level E_s between 0 and 1. Each sample s is assigned to either the low engagement or the high engagement class: if E_s < 0.5, s is assigned to the low engagement class; otherwise, i.e. if E_s ≥ 0.5, s is assigned to the high engagement class. Then, for each sample s in the anchor batch, the positive batch contains a random sample from the same engagement class as s, while the negative batch contains a random sample from the opposite engagement class. Furthermore, the outputs of the architecture in Fig. 1 are the MSE and triplet losses, which are combined to train the Bi-LSTM model. Note that during inference, the engagement level prediction is the output of the fully connected neural network. By creating a multi-task learning problem through triplet loss, which aims to prevent overfitting due to the very few samples available for engagement detection in e-learning, we are still able to perform regression of continuous engagement levels using MSE. In the rest of this section, we explain each part of the training architecture.

In order to narrow down the feature space by extracting the important features from the sequence of video frames, we first determine the features that are related to the engagement level of a subject. Accordingly, as done in [14, 20, 24, 29, 31], we consider 29 features related to eye gaze, head pose, head rotation, and facial action units. We extract these features with OpenFace, which provides many different facial features [2]. This operation can be described as Y^s_{m×n} = FeatureExtraction(X^s_{m×h×w×c}), where X^s_{m×h×w×c} is the tensor of the frame sequence of sample s and Y^s_{m×n} is the matrix of the feature sequence of sample s, whose (i, j)-th element is the value of feature j for frame i.

As a result of feature extraction, the eye gaze-related features are gaze_0_x, gaze_0_y, gaze_0_z, which form the eye gaze direction vector in world coordinates for the left eye, and gaze_1_x, gaze_1_y, gaze_1_z for the right eye in the image. The head pose-related features are pose_Tx, pose_Ty, pose_Tz, representing the location of the head with respect to the camera in millimeters (positive Z is away from the camera), and pose_Rx, pose_Ry, pose_Rz, indicating the rotation of the head in radians around the x, y, and z axes. This can be seen as pitch (Rx), yaw (Ry), and roll (Rz); the rotation is in world coordinates with the camera as the origin. Finally, the following 17 facial action unit intensities, varying in the range 0-5, are used: AU01_r, AU02_r, AU04_r, AU05_r, AU06_r, AU07_r, AU09_r, AU10_r, AU12_r, AU14_r, AU15_r, AU17_r, AU20_r, AU23_r, AU25_r, AU26_r, AU45_r.
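As an illustration of this feature set, the following is a minimal sketch of selecting the 29 features from a CSV produced by OpenFace's FeatureExtraction tool. The column names follow OpenFace 2.0's output convention; the pandas-based loading and the file path are illustrative assumptions rather than our exact pipeline.

    import pandas as pd

    # The 29 engagement-related OpenFace features:
    # eye gaze (6), head pose (6), and facial action unit intensities (17).
    GAZE = [f"gaze_{i}_{ax}" for i in (0, 1) for ax in ("x", "y", "z")]
    POSE = [f"pose_{c}" for c in ("Tx", "Ty", "Tz", "Rx", "Ry", "Rz")]
    AUS = [f"AU{n:02d}_r" for n in (1, 2, 4, 5, 6, 7, 9, 10, 12,
                                    14, 15, 17, 20, 23, 25, 26, 45)]
    FEATURES = GAZE + POSE + AUS  # n = 29 features

    def load_openface_features(csv_path: str) -> pd.DataFrame:
        """Load an OpenFace FeatureExtraction CSV and keep only the 29 features above."""
        df = pd.read_csv(csv_path)
        df.columns = [c.strip() for c in df.columns]  # OpenFace headers may contain leading spaces
        return df[FEATURES]  # shape: (m frames, n = 29 features)

    # Hypothetical usage:
    # Y = load_openface_features("subject_video.csv").to_numpy()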
We now explain the aggregation of feature statistics over time windows spanning multiple video frames. In this way, the number of features (which was equal to n at the end of the Feature Extraction phase) is increased to b in order to provide more information to the Sequence Model. Let the operation of "Feature Aggregation over Time Windows" be denoted as Z^s_{a×b} = FeatureAggregation(Y^s_{m×n}), where Z^s_{a×b} is the matrix of the b feature statistics for the a aggregated time windows. Let z be the number of frames in each time window considered for feature aggregation, where m = a × z. Then, in each of the a windows, we compute the mean, variance, standard deviation, minimum, and maximum of each feature over the z consecutive frames, resulting in b feature statistics, where b = 5 × n.

Multi-task learning aims to learn multiple different tasks simultaneously while maximizing performance on one or all of the tasks [4]. The proposed architecture contains two tasks: the first is predicting the multi-level engagement label by optimizing the MSE loss between actual and predicted labels; the second is learning hidden vector embeddings by optimizing the triplet loss. As shown in Fig. 1, during sequence modeling we use three parallel (siamese) Bi-LSTM models with weight sharing to compute the hidden vectors for the triplet loss and, cascaded to the fully connected neural network, for the MSE loss. Note, however, that effectively only one Bi-LSTM model is trained, since the Bi-LSTM models in Fig. 1 share their weights. We denote the Bi-LSTM model applied to the aggregated feature matrix Z^s_{a×b} as T^s_v = BiLSTM(Z^s_{a×b}), where T^s_v is the hidden vector, i.e., the hidden state of the last layer of the Bi-LSTM model. Thus, the length of this vector, denoted by v, is equal to twice the number of hidden units of the last layer of the Bi-LSTM.

Triplet loss is a loss function in which a baseline (anchor) sample is compared with a positive and a negative sample: the distance between the anchor and the positive sample is minimized, while the distance between the anchor and the negative sample is maximized. We use the triplet loss function presented in [26], defined as

L_triplet = (1/S) × Σ_{s=1}^{S} max( d(Anchor_s, Positive_s) − d(Anchor_s, Negative_s) + margin, 0 ),

where S is the number of samples in a batch, d is the Euclidean distance, and margin is a non-negative margin representing the minimum difference between the positive and negative distances that is required for the loss to be 0. Moreover, Anchor_s, Positive_s, and Negative_s denote the Anchor, Positive, and Negative samples for sample s, respectively. In addition to the triplet loss, we also minimize the MSE loss, which measures the error for the engagement regression. To this end, we cascade the Bi-LSTM model to the fully connected neural network whose output is the engagement level. Recall that engagement regression is the main task during the real-time application. Accordingly, during training, the minimization of MSE can be considered the main task, while the minimization of the triplet loss is the auxiliary task.

For the performance evaluation of the proposed technique, we use both the training and validation datasets published for the "Emotion Recognition in the Wild" (EmotiW 2020) challenge [7], where engagement regression is a sub-task. The dataset is comprised of 78 subjects (25 females and 53 males) whose ages range from 19 to 27. Each subject is recorded while watching an approximately 5-minute-long stimulus video of a Korean language lecture. This procedure results in a collection of 195 videos, where the environment varies over videos and the subjects are not disturbed during recording. The engagement level of each video recording is labeled by a team of five annotators on a scale between 0 and 3, resulting in the distribution shown in Fig. 2. We implemented ED-MTT using PyTorch on Python 3.7.12.
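To make the multi-task objective concrete, the sketch below shows how the shared Bi-LSTM encoder, the fully connected regression head, and the combined MSE + triplet loss could be wired together in PyTorch. This is a simplified sketch rather than the released implementation: the sigmoid output activation, the ReLU activations, the margin value, and the unweighted sum of the two losses are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class EDMTT(nn.Module):
        """Sketch of the ED-MTT model: a Bi-LSTM over aggregated feature windows,
        followed by a fully connected head predicting the engagement level."""

        def __init__(self, num_features: int, hidden_size: int = 1024, num_layers: int = 2):
            super().__init__()
            self.encoder = nn.LSTM(
                input_size=num_features,      # b feature statistics per window
                hidden_size=hidden_size,
                num_layers=num_layers,
                batch_first=True,
                bidirectional=True,
            )
            # Hidden vector length v = 2 * hidden_size (forward and backward states).
            self.head = nn.Sequential(
                nn.Linear(2 * hidden_size, 64),
                nn.ReLU(),
                nn.Linear(64, 32),
                nn.ReLU(),
                nn.Linear(32, 1),
                nn.Sigmoid(),                 # keeps the prediction in [0, 1] (assumed activation)
            )

        def embed(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, a windows, b features) -> hidden vector T of shape (batch, 2 * hidden_size)
            _, (h_n, _) = self.encoder(x)
            return torch.cat([h_n[-2], h_n[-1]], dim=-1)  # last layer, both directions

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.head(self.embed(x)).squeeze(-1)

    # A single module realizes all three weight-sharing (siamese) branches of Fig. 1.
    model = EDMTT(num_features=145)                   # b = 5 * 29 = 145
    mse = nn.MSELoss()
    triplet = nn.TripletMarginLoss(margin=1.0)        # Euclidean distance; margin value is an assumption

    def multitask_loss(anchor, positive, negative, labels):
        """Combined objective: MSE on anchor predictions plus triplet loss on the embeddings."""
        emb_a = model.embed(anchor)
        emb_p = model.embed(positive)
        emb_n = model.embed(negative)
        pred = model.head(emb_a).squeeze(-1)
        return mse(pred, labels) + triplet(emb_a, emb_p, emb_n)

Because the branches share weights, passing the anchor, positive, and negative batches through the same module, as above, is equivalent to the siamese setup in Fig. 1, and only one set of Bi-LSTM parameters is trained.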
The experiments are executed on the Google Colab platform, where the operating system is Linux-5.4.144 and the GPU device is a Tesla P100-PCIE-16GB. The model is trained via the Adam optimizer [16] for 500 epochs with an initial learning rate of 5 × 10^−5 and a batch size of 16. Furthermore, during our experiments, we first fixed the number of aggregation windows at a = 100. At the input of the Bi-LSTM, we used batch normalization together with an imbalanced sampler from the Python "imbalanced-learn" library [17]. Then, in order to determine the architectural hyperparameters of the sequential model, we performed a random search over the number of Bi-LSTM layers, the size of the hidden state, and the number of neurons in each of the two fully connected neural network layers. The random search sets are as follows: {1, 2, 3} for the number of Bi-LSTM layers, {128, 256, 512, 1024} for the size of the hidden state of each Bi-LSTM layer, {256, 128, 64} for the first layer of the fully connected neural network, and {32, 16, 8} for the second layer of the fully connected neural network. At the end of this search, the resulting architecture comprises 2 Bi-LSTM layers, each with a hidden state size of 1024, and two sequential fully connected layers with 64 and 32 neurons, respectively.

We now evaluate the performance of ED-MTT for engagement detection on the publicly available "Engagement in The Wild" dataset. During the performance evaluation, we first aim to select the subset of facial and head position features with respect to their effect on the performance of our system. To this end, Table 1 displays the performance of the model under different combinations of feature sets, where the combinations are selected empirically to achieve high performance. The main observations are as follows:
- The best performance is achieved by using all features except the Head Rotation features. Accordingly, in the rest of our results, we use the combination of Eye Gaze, Head Pose, and Action Unit features.
- The most effective individual feature set is Action Units.
- The MSE loss significantly decreases for the majority of the cases when Action Unit features are included in the selected features.

Furthermore, in Fig. 3, we present the comparison of ED-MTT against the state-of-the-art engagement regression methods that are evaluated on the Engagement in The Wild dataset. In this figure, the MSE scores of the state-of-the-art methods are taken from their original papers. The results show that ED-MTT achieves the best performance with 0.0427 MSE loss on the validation set. Although the performances of all methods are highly competitive with each other, ED-MTT improves on the best performance in the literature (Chang et al. [5]) by 6%. In addition, the training time of ED-MTT is around 38 minutes for 149 samples over 500 epochs.

Fig. 4 displays the box plot of the predicted engagement levels on the validation set, grouped with respect to the ground truth engagement labels in the dataset. From the medians and percentiles of the predicted engagement levels, one may see that the continuous predictions of ED-MTT distinctly reflect the four engagement classes in the ground truth labels. Moreover, ED-MTT can easily distinguish between classes 0, 0.33, and 0.66, while the difference between 0.66 and 1.0 is more subtle. Finally, ED-MTT is also tested on a preliminary real-life engagement detection task, for which the prediction results are presented in Fig. 5.
These results show that the proposed model, ED-MTT, trained on the Engagement in The Wild dataset, is able to provide highly successful predictions in real-life use cases that are entirely different from the cases in the training set. According to our observations on the prediction results for approximately 12 minutes of video in total, covering 8 people, the model can successfully distinguish different levels of engagement (very low, low, high, and very high). However, the predicted engagement levels lie between 0.2 and 0.92, which forces us to determine smaller quantization intervals to classify engagement levels in real-life use cases.

Online working and learning environments have become more essential in our lives, especially since the COVID-19 outbreak. In order to improve user experience and efficiency, advanced tools, such as recognition of user interaction, have become highly important in these digital environments. For e-learning, one of the most important tools might be the engagement detection system, since it provides valuable feedback to the instructors and/or students. In this paper, we developed a novel engagement detection system called "ED-MTT" based on multi-task training with triplet and MSE losses. For the engagement regression task, ED-MTT uses the combination of Eye Gaze, Head Pose, and Action Unit feature sets and is trained to minimize the MSE and triplet losses together. This training approach is able to improve the regression performance for the following reasons: 1) multi-task training with two losses introduces additional regularization and reduces over-fitting due to the very small sample size; 2) triplet loss measures relative similarity between samples, which mitigates the label reliability problem; 3) minimizing MSE ensures that the main loss considered for the regression problem is minimized alongside the triplet loss. The performance of ED-MTT was evaluated and compared against the state-of-the-art methods on the publicly available Engagement in The Wild dataset, which is comprised of separate training and validation sets. Our results showed that the novel ED-MTT method achieves 6% lower MSE than the lowest MSE achieved by the state of the art, while the training of ED-MTT takes around 38 minutes for 149 samples over 500 epochs. We tested the performance of ED-MTT in real-life use cases with 8 different participants, and the prediction results for the majority of these cases were shown to be highly successful.
[1] Affect-driven engagement measurement from videos
[2] OpenFace 2.0: Facial behavior analysis toolkit
[3] Toward active and unobtrusive engagement assessment of distance learners
[4] Multitask learning
[5] An ensemble model using face and body tracking for engagement detection
[6] A deep learning approach to detecting engagement of online learners
[7] EmotiW 2020: Driver gaze, group emotion, student engagement and physiological signal based challenges
[8] Multimethod assessment of affective experience and expression during deep learning
[9] Triplet loss in siamese network for object tracking
[10] Automatically recognizing facial expression: Predicting engagement and frustration
[11] DAiSEE: Towards user engagement recognition in the wild
[12] Deep residual learning for image recognition
[13] Long short-term memory
[14] Fine-grained engagement recognition in online learning environment
[15] Engagement analysis of students in online learning environments
[16] Adam: A method for stochastic optimization
[17] Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning
[18] Deep facial spatiotemporal network for engagement prediction in online learning
[19] Engagement detection in e-learning environments using convolutional neural networks
[20] Automatic engagement prediction with GAP feature
[21] Recognition of learners' cognitive states using facial expressions in e-learning environments
[22] Self-supervised contrastive learning of multi-view facial expressions
[23] Engagement detection through facial emotional recognition using a shallow residual convolutional neural network
[24] Predicting engagement intensity in the wild using temporal convolutional network
[25] Engagement intensity prediction with facial behavior features
[26] Learning local feature descriptors with triplets and shallow convolutional neural networks
[27] Bootstrap model ensemble and rank loss for engagement intensity regression
[28] The faces of engagement: Automatic recognition of student engagement from facial expressions
[29] Advanced multi-instance learning method with multi-features engineering and conservative optimization for engagement intensity prediction
[30] Multi-feature and multi-instance learning with anti-overfitting strategy for engagement intensity prediction
[31] Deep recurrent multi-instance learning with spatio-temporal features for engagement intensity prediction
[32] Multi-rate attention based GRU model for engagement prediction