Inferring User Facial Affect in Work-like Settings
Authors: Ilyas, Chaudhary Muhammad Aqdus; Song, Siyang; Gunes, Hatice
Date: 2021-11-22

Abstract— Unlike the six basic emotions of happiness, sadness, fear, anger, disgust and surprise, modelling and predicting dimensional affect in terms of valence (positivity - negativity) and arousal (intensity) has proven to be more flexible, applicable and useful for naturalistic and real-world settings. In this paper, we aim to infer user facial affect when the user is engaged in multiple work-like tasks under varying difficulty levels (baseline, easy, hard and stressful conditions), including (i) an office-like setting where the user undertakes a task that is less physically demanding but requires greater mental strain; (ii) an assembly-line-like setting that requires the use of fine motor skills; and (iii) an office-like setting representing teleworking and teleconferencing. In line with this aim, we first design a study with different conditions and gather multimodal data from 12 subjects. We then perform several experiments with various machine learning models and find that: (i) the display and prediction of facial affect vary from non-working to working settings; (ii) prediction capability can be boosted by using datasets captured in a work-like context; and (iii) segment-level (spectral representation) information is crucial in improving the facial affect prediction.

I. INTRODUCTION

There are different ways of modelling and analysing human affect. The theory of six basic emotions (happiness, sadness, surprise, fear, anger and disgust) has been a useful simplification, but in the real world, e.g., where people live and work, people do not display emotions in an exaggerated manner that can be categorized into these six classes. The dimensional perspective has been proposed as a viable alternative for non-acted emotions. It suggests that emotions are responses to environmental stimuli that vary along three key dimensions: valence/pleasure, arousal, and dominance/control. This approach has now been widely adopted by the affective computing community; see [1], [2] for extensive reviews. Previous research studies have explored facial affect recognition through hand-crafted features [3], [4] and deep neural networks [5]-[8]. However, all of these studies were conducted on facial datasets recorded in laboratory (controlled) conditions or in in-the-wild settings (data scraped from the internet); please see [9] for details.

Systems created for ambient assistive living environments [1], [10], [11] aim to perform both automatic affect analysis and response. Ambient assistive living relies on the use of information and communication technology (ICT) to support a person's everyday living and working environment, keeping them healthier and active longer and enabling them to live independently as they age. Thus, ambient assistive living aims to support health workers, nurses, doctors, factory workers, drivers, pilots and teachers, as well as various industries, via sensing, assessment and intervention. Such a system is intended to determine physical, emotional and mental strain and to respond and adapt as and when needed; for instance, a car equipped with a drowsiness detection system can prompt the driver to be attentive and suggest that they take a short break to avoid accidents [12], [13].
Research shows that employees' moods, emotions, and dispositional affect influence critical organizational outcomes such as job performance, decision making, creativity, turnover, prosocial behaviour, teamwork, negotiation, and leadership [14]. Therefore, analysing and understanding the affect of employees in an organisational setting is crucial in shaping organizational behaviours and decisions. To understand and evaluate emotional and affective states in a working environment, it is necessary to gather data from an actual working environment or from work-like settings, as emotional and affective states in these environments differ from those in other settings due to the specific physical and mental workload, and physiological activity, of the worker [15]. More importantly, workers' affect in working environments relates to their performance, wellbeing, and risk perception and assessment, and can even be used for quality control [16], [17]. Therefore, in this paper, we aim to train and evaluate machine learning models to infer user facial affect when the user is engaged in multiple work-like tasks under varying difficulty levels. More specifically, we explore (i) an office-like setting where the user undertakes a task that is less physically demanding but requires greater mental strain; (ii) an assembly-line setting that requires the use of fine motor skills; and (iii) an office-like setting representing teleworking and teleconferencing.

The rest of this paper is organised as follows: Section II describes the study protocol developed to acquire data and the tasks performed to simulate work-like settings; Section III presents the methodology; and Section IV analyses the results of the experiments conducted and concludes the paper.

II. STUDY DESIGN AND DATA ACQUISITION

To investigate people's affective and emotional states in work-like settings, we designed a study and conducted different experiments that simulate working-environment conditions and challenges such as varying mental workload, varying physical load and varying stress levels. With this, we aim to investigate the following research questions:
• RQ1: Does facial affect (valence and arousal) predicted by the machine learning models vary significantly in work-like settings as compared to non-working conditions?
• RQ2: How does segment-level information, as compared to frame-level information, influence the accuracy of the facial affect predictions?
• RQ3: How well do the predictions generated by the models in terms of valence and arousal match the self-reported affect obtained using the Self-Assessment Manikin (SAM), i.e., to what extent do the self-reported labels and the system-predicted labels agree?

The recording setup is illustrated in Fig. 1. A participant is asked to sit at a table, where a laptop displaying slides with instructions is provided to guide the participant through the required tasks. Meanwhile, two cameras (a Logitech web camera and a Dahua IP camera) are placed in front of the participant and a GoPro camera is placed on the table. In addition, the participant is asked to wear three sensors: a Jabra microphone around their neck that records the participant's voice, and an Empatica wristband and a Muse sensor that record physiological signals. As a result, the dataset contains multimodal recordings for each participant, including audio, video, and a set of physiological signals. In this paper, only the Dahua IPC-HFW1320S-W camera recordings are considered for facial affect analysis.
This study was approved by the Ethics Committee of the Department of Computer Science and Technology, University of Cambridge. Following an explanation of the study and informed consent from each participant, the experiment was carried out in accordance with the principles outlined in the ethical approval. Additional COVID-related measures were also put in place prior to undertaking the study. Twelve participants were recruited from the University of Cambridge (5 male and 7 female, from 9 countries) with an average age of 28.25 years (min = 22, max = 41). All participants were proficient in English.

To simulate various working conditions, a standard protocol was designed to assess affect in work-like contexts. The experiments to simulate working conditions included three tasks: the N-back task, the Operations Game task (participants were looking down during this task), and the Webcall task. In the N-back and Operations Game tasks, participants were asked to conduct different sub-tasks performed under varying challenging conditions, namely baseline, easy (conducted twice), hard (conducted twice) and hard-under-stress. For the Webcall task, there are three sub-tasks: baseline, conversation about a happy memory, and conversation about a negative memory. Example facial displays triggered by the different tasks are visualized in Fig. 2.

The N-back task: The activities in this task represent the office-like setting (less physical effort but greater mental strain) and test the worker's memory load as a reasonable approximation of work load. In this task, a series of letters is presented on a computer screen and the participant is requested to press a button when the letter on the screen matches the letter that appeared n stages earlier. The task's complexity can be increased by raising the value of n, challenging participants to memorize more letters in order. In this study, the task is categorized into a Baseline and three conditions: Easy, Hard and Hard-under-stress. Under all conditions, 21 uppercase letters (33% target letters) were displayed for 500 milliseconds each, with a random inter-stimulus interval of 500 to 3000 milliseconds. In the Hard-under-stress condition, the participant responds when the current stimulus matches the stimulus that appeared two stages before (equivalent to the Hard condition, NBH), but in the presence of additional noise (85 dB) and the white-coat effect (the experimenter remaining in the room and monitoring the participant's performance).

The Operations Game task: The activities in this task represent an assembly-line setting and test the fine motor skills of the participant. In this task, the participant is presented with a board in the form of a patient and is asked to use tweezers to extract several objects from several slots without touching the sides of the slots. The Webcall task: The activities in this task represent a teleworking and teleconferencing setting; after a baseline recording, the participant talks about a happy memory and about a negative/sad memory from their recent past. These tasks were executed in random order, balanced across all participants. We refer to the data collected through this study and these tasks as the Working-Environment-Context-Aware Dataset (WECARE-DB).

To compare the objective sensor data with subjective self-reported evaluations, a questionnaire called the Self-Assessment Manikin (SAM) was used. The SAM is a picture-oriented questionnaire [18] developed to measure the valence/pleasure of the response (from positive to negative), the perceived arousal (from high to low levels), and the perceived dominance/control (from low to high levels) associated with a person's affective reaction to a wide variety of stimuli.
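To make the n-back protocol concrete, below is a minimal sketch of a stimulus-sequence generator. The 21-letter stream length and the roughly 33% target rate come from the protocol above; the function name, the uppercase English alphabet and the random-selection strategy are illustrative assumptions, and presentation timing and response logging are omitted.

```python
import random
import string

def nback_sequence(n, length=21, target_ratio=0.33, seed=None):
    """Generate an n-back letter stream in which roughly `target_ratio`
    of the positions repeat the letter shown n steps earlier (targets)."""
    rng = random.Random(seed)
    letters = []
    for i in range(length):
        if i >= n and rng.random() < target_ratio:
            # Target: repeat the letter shown n positions back.
            letters.append(letters[i - n])
        else:
            # Non-target: pick any uppercase letter that does not create an accidental match.
            pool = [c for c in string.ascii_uppercase
                    if i < n or c != letters[i - n]]
            letters.append(rng.choice(pool))
    targets = [i for i in range(n, length) if letters[i] == letters[i - n]]
    return letters, targets

# Example: a 2-back stream of 21 uppercase letters (the Hard condition, per the text).
stream, target_positions = nback_sequence(n=2, length=21, target_ratio=0.33, seed=0)
print("".join(stream), target_positions)
```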
The person is asked to provide three simple judgments, one along each affective dimension (on a scale of 1 to 9), that best describe how they felt regarding the provided stimuli. In our study, the SAM was filled in after each baseline task and after each condition within a task. The questionnaire was introduced at the beginning of the experiment with example exercises.

III. METHODOLOGY

First, to be able to use the self-reported SAM labels as ground truth, we map the collected values to the valence and arousal dimensions. Specifically, we map the unhappy-happy dimension to valence and the calm-excited dimension to arousal. For both dimensions, we use 5 as the threshold to map the corresponding values to negative (< 5), neutral (= 5) and positive (> 5). Second, to process the camera input, for each grabbed frame we apply face detection followed by facial landmark detection [19]. The facial landmarks are utilized for face alignment [20]. The aligned facial image is then fed to a ResNet-18 network [21] with two additional convolution layers that provide a deeper feature representation for valence/arousal prediction. This network is trained with a Mean Squared Error (MSE) loss to predict the valence and arousal values for each incoming facial image. The employed network is pre-trained using the AffectNet dataset [9], which contains a large number of images from the Internet obtained by querying different search engines using emotion-related tags; 450,000 of these images are manually annotated along the valence and arousal dimensions. Third, we fine-tune this network using our collected WECARE-DB and refer to this model as F-Res. Finally, as the goal is to predict the affect of a user over a certain period of time, the spectral representation [22], [23] is utilised to summarize the frame-level predictions for that period; this representation is then fed to a two-layer fully connected network to generate the segment-level valence/arousal predictions. We refer to this model as S-Res. This pipeline is illustrated in Fig. 5.

IV. RESULTS AND CONCLUSION

The performance of all three models is evaluated using three metrics: the Concordance Correlation Coefficient (CCC), the Pearson Correlation Coefficient (PCC) and the Root Mean Square Error (RMSE), as these are well-known metrics for the automatic prediction of affect [24], [25]. Importantly, we found that using the spectral representation [22], [23] to summarise segment-level information provides a large improvement, suggesting that representing facial displays over time is crucial for predicting facial affect in work-like settings. Table II presents the average and standard deviation of the sign agreement between the model predictions and the self-reported SAM ratings; the higher the agreement, the better the model performance. We observe that the S-Res model, which incorporates information from multiple frames, performs best among the three models, thus supporting RQ2, i.e., that temporal (segment-level) information improves model performance.

As future work, we will extend this study to a larger multi-site dataset that has been acquired at different European sites, where the acquisition and study protocol used in this paper has been utilised to record participants performing work-like tasks. We will also investigate the information contained in other modalities, such as physiological signals, for predicting user affect when performing work-like tasks.
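As an illustration of the label-mapping step described above, the following sketch converts 1-9 SAM ratings into the ternary classes used as ground truth. Only the thresholding rule (< 5 negative, = 5 neutral, > 5 positive) comes from the text; the function and variable names are hypothetical.

```python
def sam_to_class(rating):
    """Map a 1-9 SAM rating to a ternary affect class using 5 as the threshold."""
    if rating < 5:
        return "negative"
    if rating == 5:
        return "neutral"
    return "positive"

# Unhappy-happy answers are mapped to valence, calm-excited answers to arousal.
valence_class = sam_to_class(7)  # -> "positive"
arousal_class = sam_to_class(3)  # -> "negative"
```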
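The frame-level predictor could be sketched as follows, assuming a PyTorch implementation. The overall design (a ResNet-18 trunk plus two extra convolution layers, trained with an MSE loss to regress valence and arousal) follows the description above, but the channel sizes, input resolution and regression head are assumptions rather than the authors' released architecture; in practice the network would be pre-trained on AffectNet and fine-tuned on WECARE-DB to obtain F-Res.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FrameAffectNet(nn.Module):
    """Frame-level valence/arousal regressor: ResNet-18 trunk + two extra conv layers."""
    def __init__(self):
        super().__init__()
        trunk = resnet18(weights=None)  # pre-train on AffectNet in practice
        # Drop the average-pooling and classification layers; output is (B, 512, 7, 7) for 224x224 input.
        self.features = nn.Sequential(*list(trunk.children())[:-2])
        self.extra = nn.Sequential(     # two additional conv layers (sizes assumed)
            nn.Conv2d(512, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(128, 2)   # outputs [valence, arousal]

    def forward(self, x):               # x: (B, 3, 224, 224) aligned face crops
        f = self.extra(self.features(x))
        return self.head(self.pool(f).flatten(1))

model = FrameAffectNet()
criterion = nn.MSELoss()                # MSE loss on continuous valence/arousal targets
preds = model(torch.randn(4, 3, 224, 224))
loss = criterion(preds, torch.zeros(4, 2))
```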
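The segment-level model (S-Res) summarises a window of frame-level predictions with a spectral representation before a small fully connected predictor. The sketch below shows one plausible instantiation of that idea, assuming the representation is built from the magnitudes of the lowest Fourier components of the per-frame valence/arousal signals; the number of retained components and the layer sizes are assumptions, not the configuration used in the paper or in [22], [23].

```python
import torch
import torch.nn as nn

def spectral_summary(frame_preds, n_components=16):
    """Summarise a (T, 2) sequence of frame-level valence/arousal predictions
    with the magnitudes of its lowest-frequency Fourier components."""
    spec = torch.fft.rfft(frame_preds, dim=0)   # (T//2 + 1, 2), complex spectrum per dimension
    mags = spec.abs()[:n_components]            # keep low-frequency magnitudes
    if mags.shape[0] < n_components:            # zero-pad very short segments
        pad = torch.zeros(n_components - mags.shape[0], mags.shape[1])
        mags = torch.cat([mags, pad], dim=0)
    return mags.flatten()                       # (n_components * 2,)

segment_head = nn.Sequential(                   # two-layer fully connected segment predictor
    nn.Linear(16 * 2, 64), nn.ReLU(),
    nn.Linear(64, 2),                           # segment-level [valence, arousal]
)

frame_preds = torch.randn(150, 2)               # e.g. 150 frames of frame-level outputs
segment_pred = segment_head(spectral_summary(frame_preds))
```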
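For completeness, the three evaluation metrics named above can be computed as in the sketch below; these are the standard definitions written with NumPy, not code from the paper.

```python
import numpy as np

def pcc(x, y):
    """Pearson correlation coefficient between predictions and labels."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.corrcoef(x, y)[0, 1])

def ccc(x, y):
    """Concordance correlation coefficient: penalises both low correlation and bias."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return float(2 * cov / (x.var() + y.var() + (mx - my) ** 2))

def rmse(x, y):
    """Root mean square error between predictions and labels."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(((x - y) ** 2).mean()))
```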
The ultimate goal is to implement and use the trained models in real time and in real work settings, in order to provide input to decision support systems that promote the health and well-being of people during their working age, in the context of the EU WorkingAge Project [26].

REFERENCES
[1] Automatic, dimensional and continuous emotion recognition.
[2] Categorical and dimensional affect analysis in continuous input: Current trends and future directions.
[3] PCA-based dictionary building for accurate facial expression recognition via sparse representation.
[4] Facial expression recognition utilizing local direction-based robust features and deep belief network.
[5] Deep transfer learning in human-robot interaction for cognitive and physical rehabilitation purposes.
[6] Rehabilitation of traumatic brain injured patients: Patient mood analysis from multimodal video.
[7] Deep emotion recognition through upper body movements and facial expressions.
[8] Softmax regression based deep sparse autoencoder network for facial emotion recognition in human-robot interaction.
[9] AffectNet: A database for facial expression, valence, and arousal computing in the wild.
[10] Exploring the ambient assisted living domain: A systematic review.
[11] A review of internet of things technologies for ambient assisted living environments.
[12] Real-time driver drowsiness detection for embedded system using model compression of deep neural networks.
[13] Real-time driver-drowsiness detection system using facial features.
[14] Why does affect matter in organizations?
[15] Wearable technologies for mental workload, stress, and emotional state assessment during working-like tasks: A comparison with laboratory technologies.
[16] Stress assessment by combining neurophysiological signals and radio communications of air traffic controllers.
[17] Measuring neurophysiological signals in aircraft pilots and car drivers for the assessment of mental workload, fatigue and drowsiness.
[18] Measuring emotion: The Self-Assessment Manikin and the semantic differential.
[19] Joint face detection and alignment using multitask cascaded convolutional networks.
[20] OpenFace 2.0: Facial behavior analysis toolkit.
[21] Deep residual learning for image recognition.
[22] Human behaviour-based automatic depression analysis using hand-crafted statistics and deep learned spectral features.
[23] Spectral representation of behaviour primitives for depression analysis.
[24] Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space.
[25] Your fellows matter: Affect analysis across subjects in group videos.
[26] Decision support systems to promote health and well-being of people during their working age: The case of the WorkingAge EU project.