Landmark Guidance Independent Spatio-channel Attention and Complementary Context Information based Facial Expression Recognition

Darshan Gera, S Balasubramanian

Date: 2020-07-20

A recent trend in recognizing facial expressions in the real-world scenario is to deploy attention-based convolutional neural networks (CNNs) locally to signify the importance of facial regions, and to combine this with global facial features and/or other complementary context information for performance gain. However, in the presence of occlusions and pose variations, different channels respond differently, and the response intensity of a channel also differs across spatial locations. Further, modern facial expression recognition (FER) architectures rely on external sources like landmark detectors for defining attention; failure of the landmark detector has a cascading effect on FER. Additionally, no emphasis is laid on the relevance of the features that are input to compute complementary context information. Leveraging these observations, an end-to-end architecture for FER is proposed in this work that obtains both local and global attention per channel per spatial location through a novel spatio-channel attention net (SCAN), without seeking any information from landmark detectors. SCAN is complemented by a complementary context information (CCI) branch. Further, using efficient channel attention (ECA), the relevance of the features input to CCI is also attended to. The representation learnt by the proposed architecture is robust to occlusions and pose variations. Robustness and superior performance of the proposed model are demonstrated on both in-lab and in-the-wild datasets (AffectNet, FERPlus, RAF-DB, FED-RO, SFEW, CK+, Oulu-CASIA and JAFFE), along with a couple of constructed face-mask datasets resembling masked faces in the COVID-19 scenario. Code will be made publicly available.

• Validation of the proposed architecture on a variety of datasets, both in-lab and in-the-wild, with superior performance over recent methods. Additionally, comparison against a baseline on a couple of constructed masked-face datasets resembling masked faces in the COVID-19 scenario.

2 Related Work

Earlier works relied on hand-crafted features like deformable features, motion features and statistical features, along with traditional machine learning (ML) paradigms like SVM, for FER [1], [24], [50]. These works validated their models on in-lab datasets [31], [57], [32]. However, models trained for a controlled environment do not generalize well outside that environment. The current trend is driven by the DL paradigm, with the availability of huge volumes of data collected in an uncontrolled 'in-the-wild' setting. It is to be noted that in-the-wild datasets, though available for FER, are not upscaled yet, except for AffectNet [36]. However, as mentioned earlier, datasets for FR are very large. So, a general strategy for FER is to adopt transfer learning [38] from FR. CNNs have stamped their supremacy with regard to performance on recognition problems. In fact, almost two decades ago, CNNs were found to be robust to some degree of affine transformations with regard to FER [34]. In [38], a two-stage architecture is proposed for FER. In the first stage, the parameters of an FR model guide the learning of the convolutional layers of the FER model.
In the subsequent stage, the expression representation is learnt by the addition of fully connected (fc) layers and further refinement. [27] addresses FER through an ensemble network with a novel pre-processing transformation that maps image intensities into a 3D space, providing robustness to illumination variation. While [27] used an ensemble network, [21] uses an ensemble of supervisions through a supervised scoring ensemble (SSE) learning mechanism wherein, apart from the output layer, intermediate and shallow layers are also supervised. Spatial information from facial expression images and differential information from changes between neutral and non-neutral expressions have been fused using multiple fusion strategies in [44] for FER. Multi-scale features obtained using hybrid inception-residual units with concatenated rectified linear activations enabled the architecture in [52] to capture variations in expression. A two-channel CNN, with one channel trained in an unsupervised fashion and subsequent merging of information from both channels, aids FER in [20]. Apart from stand-alone CNNs, generative models where CNNs define the generator and discriminator have also been used for FER. In [51], a facial image is viewed as the sum of a neutral and an expressive component, and the residual expressive component is learnt using a de-expression residual learning module in a generative model setting. Another generative model, which suppresses identity while preserving expression to account for within-subject variance, is proposed in [4]. While all these methods outperform traditional ML methods based on hand-crafted features, most of them still validate their analysis only on in-lab datasets. They are not robust to challenges like the presence of occlusions and pose variations that are common in 'in-the-wild' datasets. Incorporating an attention mechanism plays a vital role in facing these challenges in FER. The attention mechanism, an attempt to imitate how humans focus on salient regions in an image, has been successful in many computer vision tasks, including abnormal behaviour recognition [46] and visual explanation [16]. With regard to FER robust to occlusions and pose variations, the recent works [30], [48], [13] have pushed up the performance. In [30], unobstructedness or importance scores of local patches of feature maps corresponding to certain landmark points are computed using self-attention, and the respective local feature maps are weighted by these scores. The expectation is that, over training, patches corresponding to occluded areas in the image will receive low importance and hence become less relevant. In parallel, global context is captured through self-attention on the entire feature map. The concatenation of local and global context is passed to a classifier for expression recognition. The region attention network (RAN) [48] is conceptually similar to [30] but selects patches directly from the image input. RAN, combined with a region-biased loss, quantifies the importance of patches. Subsequently, a relation-attention module that relates local and global context provides the expression representation for classification. It is to be noted that selecting patches directly from the image input increases inference time. In [13], attention weights are generated as samples from a Gaussian distribution centered at spatial locations of the feature map corresponding to certain confident landmark points, where the confidence score is provided by an external landmark detector.
The selection of local patches follows [30]. Concurrently, complementary information is gathered from non-overlapping partitions of the feature map. Together, the patch-based information and the complementary information guide the classifier to report state-of-the-art results. This work is also an attention-based architecture for robust FER. It differs from [30], [48], [13] in the following aspects:

• [30] assigns a single attention weight to the entire feature map per patch. [48] assigns an attention weight to every feature per input crop but works with a flattened vector, thereby losing spatial information. [13] assigns spatially varying attention weights that are constant across the channel dimension, per patch. In this work, a per-channel, per-spatial-location attention weight per patch is defined through the proposed SCAN.

• [48] and [13] use a landmark detector to define attention. The proposed model does not need external sources like landmark detectors to define attention.

• [13] does not emphasize the relevance of the input feature maps used to collect complementary information. The proposed work uses ECA to attend to the relevance of the feature maps input to the CCI branch.

• [48] fixes patches at the level of the image input, thereby potentially increasing inference time, especially when the number of patches is scaled up. The proposed work does not crop the input into multiple patches.

3 Proposed Model

The proposed model is motivated by the attention-based FER methods detailed in [30], [48], [13] and by the key observation in [43]. A generic depiction of the types of attention blocks in these attention-based models is shown in Fig. 1. The three types of attention blocks, in order, are adopted in [30], [48] and [13], respectively. It can be noticed from [30], [48], [13] that either the attention weight is constant across spatial and channel dimensions (Type 1, [30]), or it is constant across spatial dimensions (Type 2, [48]), or it is constant across channel dimensions (Type 3, [13]). However, it has been illustrated in [43] that, in the presence of occlusions, different channels respond differently and, further, that each channel exhibits variation in intensity across spatial locations for a given input stimulus. For example, Fig. 2 shows 8 chosen channels of the median relative difference in response between a clean image and its occluded counterpart at the output of the Conv_3x block in ResNet-50. The median is computed from a sample of 100 pairs of clean and occluded images, with the occlusion fixed at the same location across the entire sample. Clearly, some channels display minimal response to occlusion while others show differing responses around the region of occlusion. Based on this observation, this work proposes an attention block that provides attention per channel per spatial location for a given input feature map. This attention block is called the Spatio-Channel Attention Net (SCAN). The input-output pipeline of SCAN is shown in Fig. 3. Mathematically, let P be the C × H × W dimensional input feature map to SCAN, and let f be the non-linear function that models SCAN. In this work, f is defined to be a 'same' convolution operation, with the number of output channels equal to the number of input channels, followed by a parametric ReLU:

    A = f(P) = PReLU(Conv(P)),    (1)

where Conv is the 'same' convolution operation that takes a C × H × W input feature map and outputs C × H × W attention weights A. Following the computation of the attention weights, the input feature map is weighted. Let W_P denote the weighted input feature map. Then,

    W_P = A ⊙ P,    (2)

where ⊙ is the element-wise multiplication operator. For the sake of convenience, it is assumed that this element-wise multiplication is merged into SCAN; hence the output of SCAN is W_P for the input P.
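To make the block concrete, below is a minimal PyTorch sketch of SCAN as described by Eqs. (1)-(2). This is a sketch under stated assumptions, not the authors' implementation: the 3 × 3 kernel size is an assumption (the text only specifies a 'same' convolution that preserves the channel count), and the class and argument names are ours.

    import torch
    import torch.nn as nn

    class SCAN(nn.Module):
        """Spatio-channel attention (Eqs. 1-2): one attention weight per
        channel per spatial location of the input feature map."""
        def __init__(self, channels: int, kernel_size: int = 3):  # kernel size assumed
            super().__init__()
            # 'same' convolution: output keeps the C x H x W shape of the input
            self.conv = nn.Conv2d(channels, channels, kernel_size,
                                  padding=kernel_size // 2)
            self.prelu = nn.PReLU(channels)

        def forward(self, p: torch.Tensor) -> torch.Tensor:
            a = self.prelu(self.conv(p))  # Eq. 1: A = PReLU(Conv(P))
            return a * p                  # Eq. 2: W_P = A element-wise P

For example, SCAN(512)(torch.randn(8, 512, 28, 28)) returns a tensor of the same 8 × 512 × 28 × 28 shape, i.e., the attention-weighted feature map W_P.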
It is to be noted that the input P could be a local patch from a feature map, or the entire feature map providing global context. For this reason, SCAN is called the local-global attention branch. When multiple patches from a feature map are input to SCAN, the learnable parameters of SCAN can either be shared across all the input patches or kept independent across patches. In this work, both shared and independent sets of SCAN parameters reported similar performance; this is illustrated in Section 6. For the remainder of the presentation, parameters are assumed to be independent across patches.

While SCAN endeavours to discriminate between expressions by emphasizing relevant features locally and globally, extra complementary information would surely enhance the discriminating ability of the model. Since transfer learning from FR is adopted for FER, the wealth of information available in FR can easily be leveraged to extract complementary context information. In particular, information from feature maps of the middle layers of the FR model can be used for this purpose. Middle layers are likely to contain features surrounding parts of the face, like the eyes, nose and mouth, that are useful for FER, whereas final layers generally contain identity-specific features that are not suitable for FER. Features from a middle layer may be redundant, and not all of them may be important for FER. To eliminate redundancy and emphasize the important features, ECA [49] is applied to the feature map from the middle layer. ECA is a local cross-channel interaction mechanism, without dimensionality reduction, that computes channel-wise attention constant across spatial dimensions. ECA has shown that dimensionality reduction decreases performance while cross-channel interaction improves it; this philosophy is followed in SCAN as well. A natural question is why ECA is used here and not SCAN. The goal here is to obtain context information complementary to that provided by SCAN; hence SCAN is not used. Also, before the information from the middle layer of the FR model is used for FER, the important features need to be emphasized and the redundancy eliminated, so channel-wise attention suffices here. In this regard, since ECA has demonstrated its superiority over other attention methods, it is adopted here. The CCI branch proposed in this paper is similar to the one in [13] but with prior feature emphasis by ECA. The influence of ECA is illustrated in Section 6.7. The CCI branch is described as follows. Let F be the feature map from a chosen middle layer of the base model trained for FR, and let OF be the output of the CCI branch. Then,

    OF_i = GAP(PARTITION(ECA(F))_i), i = 1, ..., k,    (3)

where PARTITION partitions the ECA-weighted feature map into k non-overlapping blocks and GAP is global average pooling. OF_i, the output of GAP on the i-th block, is a vector of features. Following [13], k is chosen to be 4 in this work.
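Analogously, a hedged sketch of ECA followed by the CCI partitioning of Eq. (3). Two assumptions: ECA is implemented as in [49] with global average pooling, a 1-D convolution across channels and a sigmoid, but with a fixed kernel size of 3 rather than the adaptively chosen size of [49]; and the k = 4 non-overlapping blocks are taken as the 2 × 2 spatial quadrants of the feature map, one plausible partition that the text does not pin down.

    import torch
    import torch.nn as nn

    class ECA(nn.Module):
        """Efficient channel attention [49]: channel-wise weights, constant
        across spatial locations, via local cross-channel interaction."""
        def __init__(self, kernel_size: int = 3):  # fixed size; [49] adapts it to C
            super().__init__()
            self.conv = nn.Conv1d(1, 1, kernel_size,
                                  padding=kernel_size // 2, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            y = x.mean(dim=(2, 3)).unsqueeze(1)         # GAP -> (B, 1, C)
            w = torch.sigmoid(self.conv(y)).squeeze(1)  # (B, C) channel weights
            return x * w[:, :, None, None]              # reweight channels of x

    def cci_branch(f: torch.Tensor, eca: ECA, grid: int = 2):
        """Eq. 3: partition the ECA-weighted map into k = grid**2 blocks
        and global-average-pool each block into a feature vector OF_i."""
        f = eca(f)
        b, c, h, w = f.shape
        outs = []
        for i in range(grid):
            for j in range(grid):
                block = f[:, :, i * h // grid:(i + 1) * h // grid,
                                j * w // grid:(j + 1) * w // grid]
                outs.append(block.mean(dim=(2, 3)))     # OF_i: (B, C)
        return outs                                     # {OF_1, ..., OF_k}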
The complete proposed architecture is shown in Fig. 4. The input I is a 224 × 224 RGB image, processed by a pre-trained ResNet-50 model. For the SCAN branch, the output of the Conv_3x block from ResNet-50, a 28 × 28 × 512 feature map denoted by F_u, goes as input. Next, F_u is partitioned into m non-overlapping blocks, each block acting as a local patch. In this work, m is chosen to be 25 based on the ablation study reported in Section 6.9, resulting in 25 local patches. The output of SCAN for each of the 25 local patches is global-average-pooled across the spatial dimensions and subsequently max-pooled per channel across the patches to provide a summary of the local context in the form of a 512-dimensional vector, denoted by V_l. Also, the whole of F_u is fed to SCAN and the output is global-average-pooled to capture the global context in the form of another 512-dimensional vector, denoted by V_g. V_l and V_g are concatenated and sent through an expression classifier trained with a cross-entropy loss denoted by L_u. It is to be noted that the entire local and global context processing through SCAN is devoid of any information from external landmark detectors. For the CCI branch, the output of the Conv_4x block from ResNet-50, denoted by F_l, goes as input. It has dimension 14 × 14 × 1024. As expressed in Eq. 3, the output of CCI is a set of k vectors {OF_1, OF_2, ..., OF_k}. Subsequently, each OF_i is densely connected to a 256-node layer and then passes through an expression classifier with a cross-entropy loss denoted by L_i. The total loss from the CCI branch is

    L_CCI = Σ_{i=1}^{k} L_i.    (4)

The overall loss from both branches is given by

    L = λ L_u + (1 − λ) L_CCI,    (5)

where λ belongs to [0, 1]. In this work, λ is set to 0.2 based on the ablation study reported in Section 6.9.
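The following sketch ties the two branches and the losses of Eqs. (4)-(5) together. Two details are our reading of the text rather than confirmed specifics: V_l is formed by an element-wise max over the 25 patch summaries, and the overall loss is taken as the convex combination of Eq. (5); patch boundaries are also approximate, since 28 is not divisible by 5.

    import torch
    import torch.nn.functional as F

    def scan_head(f_u, patch_scans, global_scan, grid=5):
        """F_u: (B, 512, 28, 28) from Conv_3x. Returns [V_l ; V_g], (B, 1024).
        patch_scans: a ModuleList of 25 SCAN blocks (or one shared block)."""
        b, c, h, w = f_u.shape
        summaries, idx = [], 0
        for i in range(grid):
            for j in range(grid):
                patch = f_u[:, :, i * h // grid:(i + 1) * h // grid,
                                  j * w // grid:(j + 1) * w // grid]
                summaries.append(patch_scans[idx](patch).mean(dim=(2, 3)))  # GAP
                idx += 1
        v_l = torch.stack(summaries).max(dim=0).values  # max over patches, per channel
        v_g = global_scan(f_u).mean(dim=(2, 3))         # global context, (B, 512)
        return torch.cat([v_l, v_g], dim=1)             # fed to the L_u classifier

    def total_loss(logits_u, cci_logits, target, lam=0.2):
        l_u = F.cross_entropy(logits_u, target)                      # SCAN branch
        l_cci = sum(F.cross_entropy(z, target) for z in cci_logits)  # Eq. 4
        return lam * l_u + (1.0 - lam) * l_cci                       # Eq. 5 (assumed form)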
4 Datasets

The Extended Cohn-Kanade dataset (CK+) [25, 31] contains 593 video sequences recorded from 123 subjects. The first frame and the last three frames of each video sequence are considered as the neutral and target expressions, resulting in 1254 and 1308 images for the cases of 7 and 8 expressions, respectively. The Oulu-CASIA dataset [57] contains six prototypic expressions from 80 people between 23 and 58 years old. The first frame and the last three frames are considered as the neutral and target expressions, resulting in 1920 images. JAFFE [32] contains 213 images of 7 facial expressions posed by 10 Japanese female models; each image has been rated on 6 emotion adjectives by 60 Japanese subjects. Against the backdrop of the COVID-19 pandemic, in which worldwide restrictions have forced people to mask their nose and mouth regions, this work also constructs two datasets from FERPlus and RAF-DB using publicly available code, simulating masked faces in the COVID-19 scenario. Performance on these datasets is also reported.

5 Implementation Details

The proposed work is implemented in PyTorch using a single GeForce RTX 2080 Ti GPU with 11 GB memory. The backbone of the proposed architecture is a ResNet-50 pre-trained on VGGFace2 [5]; a ResNet-50 trained on ImageNet [10] is also considered. All images are aligned using MTCNN [56] and then resized to 224 × 224.

6 Results

Clearly, the proposed model outperforms all the baselines, with a minimum improvement of around 9% (on AffectNet-7) and a maximum of around 39% (on SFEW) over the best baseline. This indicates that pre-trained models from FR or ImageNet cannot be used directly for FER; different branches (like SCAN and CCI) for processing feature maps from the middle layers of the pre-trained model are necessary for enhancing performance on FER. The proposed model is compared with GACNN [30], RAN [48], OADN [13], FMPN [7] and SCN [47], the most recent state-of-the-art methods published in the last year. Table 2 reports the comparison of the proposed model against these methods on in-the-wild datasets. As can be seen, except for the proposed method, none of the other five methods reports results on all the datasets. From Table 2, it is evident that the proposed model has outperformed all the other models on all the datasets except RAF-DB. This clearly emphasizes that SCAN and CCI are vital for enhancing performance. It is to be noted that all the state-of-the-art methods have quoted only their best performance; for fair comparison, Table 2 highlights the best performance of our model across all the datasets. However, the influence of randomness in training is quantified for our model in the last row of Table 1. Even while accounting for randomness, the one-standard-deviation interval of accuracy of the proposed model is narrower, emphasizing its superior generalizing capability.

To validate the robustness of the proposed method to the presence of occlusions and pose variations, its performance is tested on the challenging test subsets of AffectNet, RAF-DB and FERPlus, and compared against RAN and OADN. Table 3 depicts the results. The proposed method exhibits superior robustness to occlusions and pose variations, with around 2 to 5% gains in the case of occlusions and around 1 to 7% gains in the case of pose variations, over RAN and OADN. To further validate the robustness to occlusions, the proposed method is tested on another real-world occlusions dataset, namely FED-RO. Table 4 enumerates the results; again, the proposed method leads all the state-of-the-art methods, with a solid 2.3% advantage over the next best performing method, OADN.

Table 4: Accuracy (%) on FED-RO
GACNN [30]: 66.5
RAN [48]: 67.98
OADN [13]: 71.17
Ours: 73.5

A workaround to such annotation uncertainties, and thereby a way to further enhance performance, would be a hybrid approach that combines an architecture like the one proposed in this paper with a meta-learner [47] that can automatically suppress uncertainties in annotations. This is future work to be carried out.

Table 5 highlights the performance of the proposed method on the CK+ and Oulu-CASIA datasets, using 10-fold cross-validation. The proposed model, trained on the combination of AffectNet and RAF-DB, is fine-tuned during the cross-validation. It is to be noted that there is no standard way of selecting expression images from the video sequences of CK+ and Oulu-CASIA; this work follows [9] in this regard. In Table 5, CK+8 corresponds to consideration of all 8 expressions, while CK+7 corresponds to only the first 7 expressions (no contempt). Likewise, CK+7 for IACNN [35], DEFR [51] and FMPN [7] corresponds to the 7 expressions without contempt. The proposed method performs almost on par with the state-of-the-art methods for in-lab datasets. It is to be noted that most of the compared methods are specifically designed for in-lab datasets, while the proposed method focuses on robustness to wild conditions like occlusions and pose variations in the real world. Despite this difference, the proposed method performs well on in-lab datasets, which strengthens its generalizing capability. Apart from cross-validation with fine-tuning, direct cross-evaluation is also performed on the in-lab datasets: the best model trained on AffectNet-7 is evaluated on the CK+, Oulu-CASIA and JAFFE datasets. Results are displayed in Table 6.
To study how good SCAN is, the proposed model is trained on AffectNet with SCAN replaced by: (i) a single attention weight (SA), (ii) a spatial attention constant across channels (SPA), (iii) only local patch attention (LPA), (iv) only global feature map attention (GA), and (v) ECA. Table 7 presents the comparison results. SCAN clearly has an advantage over the other attention modules. Though the gain may not look pronounced, it is to be noted that the local patches input to SCAN are of size around 5 × 5 × 512, and the spatial size of 5 × 5 is small. Nevertheless, the gain of around 0.3 to 1.2% achieved by SCAN over the other attention modules emphasizes that, even when the spatial size is small, per-channel per-location attention pushes up the performance and makes the model more robust.

To study how good the CCI branch is, the proposed model is trained with and without CCI. The results are tabulated in Table 8. Without CCI, the performance drops significantly, by 11 to 21%. This emphasizes that the wealth of information available from the FR model must be put to use. However, weighting the features prior to use stresses important features and eliminates redundancies. To validate this point, Table 9 shows the performance of the model in the presence and absence of ECA before the CCI branch. Incorporating ECA before the CCI branch attains a significant gain of 2% over the ECA-less CCI branch.

Another study, carried out keeping in mind the current COVID-19 pandemic, evaluates how well the proposed model performs when the mouth and nose regions are masked. FED-RO is, of course, an already available real-world occlusions dataset on which the proposed method has been evaluated (see Table 4); however, it has a variety of occlusions and does not predominantly contain occlusions covering the mouth and nose regions. The interest here is in the performance of the proposed model exclusively on masked mouth and nose regions. Towards this end, using publicly available code, two datasets have been constructed from the whole of RAF-DB and FERPlus. A sample of the constructed images is shown in Fig. 6. Table 10 displays the results on these synthetic masked datasets. Row 3 in Table 10 corresponds to the performance of the proposed model initialized with weights from the associated best-performing proposed model on the non-masked datasets. Row 4 corresponds to the same initialization followed by fine-tuning, and row 5 corresponds to the proposed model trained from scratch on the masked datasets. Performance is compared with Baseline3 and RAN; Baseline1 and Baseline2 are not considered because they gave lower results than Baseline3. Compared to Baseline3, the proposed model has done exceedingly well in being robust to masks. Further, the proposed model clearly outperforms RAN by around 3% and 10% on masked RAF-DB and masked FERPlus, respectively. However, a dip of around 13 to 14% in performance is seen when compared to the performance on the non-masked datasets (see Table 1). This is because some of the relevant regions, like the nose and mouth, are unavailable for discriminating between expressions. Further, fine-tuning the parameters of the non-masked model does not perform as well as training from scratch: some kind of unlearning and fresh learning is required when certain regions are completely blocked, and this is not possible through fine-tuning. This experiment clearly suggests that new ways need to be explored to tackle the challenges depicted here.

In SCAN, each local patch goes through a separate convolutional unit (see Eq. 1). In this regard, an experiment is conducted where all the local patches share the convolutional parameters. The results are available in Table 11. Clearly, shared parameters almost match the performance of SCAN with non-shared parameters; in fact, on RAF-DB, shared parameters outperform non-shared SCAN. With shared parameters, a 25-fold reduction in the number of SCAN parameters is achieved, making SCAN lighter.

In order to investigate the salient regions focused upon by the proposed model in the presence of occlusions and pose variations, the attention-weighted activation maps from Conv_3x are visualized using Grad-CAM [41] for Baseline3, the proposed model without ECA, and the proposed model with ECA. Fig. 9 shows the visualizations; dark color indicates high attention while lighter color indicates negligible attention. For Baseline3, either the attention is spread over occluded parts (1st column, 2nd and 4th rows) or the attention is negligible all over. The proposed model without ECA also sometimes focuses on irrelevant parts (the ear region in the 2nd column, 1st row) or misses relevant parts (like the mouth region in the 2nd to 4th rows). In comparison to Baseline3 and the ECA-less case, the proposed model with ECA attends to non-occluded and relevant parts for expression recognition. Though the predictions made by Baseline3 and the proposed model without ECA are correct in some of the cases in Fig. 9, the presence of ECA provides attention that seems more natural. Also, the proposed model without ECA still has SCAN to propel it to provide better attention weights than Baseline3, which again emphasizes the importance of SCAN in the proposed model. More visualizations are provided in the supplementary material.
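For completeness, a minimal hook-based Grad-CAM [41] sketch over the Conv_3x output; for a torchvision-style ResNet-50, this layer corresponds to model.layer2, but the model and layer handles here are placeholders, not the authors' code.

    import torch
    import torch.nn.functional as F

    def grad_cam(model, layer, image, class_idx):
        """Grad-CAM [41]: weight the layer's activations by the spatially
        pooled gradients of the target class score, then ReLU and normalize."""
        acts, grads = [], []
        h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
        h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
        try:
            model(image.unsqueeze(0))[0, class_idx].backward()
        finally:
            h1.remove()
            h2.remove()
        a, g = acts[0], grads[0]                    # both (1, C, H, W)
        weights = g.mean(dim=(2, 3), keepdim=True)  # GAP over gradient maps
        cam = F.relu((weights * a).sum(dim=1))      # (1, H, W) saliency map
        return (cam / (cam.max() + 1e-8)).squeeze(0).detach()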
Fig. 10 shows some of the failure cases of the proposed method. It is to be noted that, though the predictions are wrong, attention has been given to relevant regions (except perhaps in image 4), avoiding occluded parts. The wrong predictions are among commonly confused pairs of expressions: surprise is easily confused with happy and fear, and sad with disgust. Such confusions sometimes arise due to inconsistent labeling by human taggers, and also because some expressions exhibit compound emotions like happily surprised or fearfully sad. These failures open up further scope for research in the FER domain.

Supplementary Material

In-the-wild datasets are highly imbalanced, as it is difficult to annotate and collect images in categories like disgust and contempt. So, oversampling is used to overcome the imbalance. Results on AffectNet-7 with and without oversampling are shown in Table 1.
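One standard way to realize such oversampling is sketched below with PyTorch's WeightedRandomSampler; the inverse-class-frequency weighting is an assumption, as the text does not specify the exact scheme used.

    from collections import Counter
    from torch.utils.data import DataLoader, WeightedRandomSampler

    def balanced_loader(dataset, labels, batch_size=64):
        """Draw each sample with probability inversely proportional to its
        class count, so minority expressions (e.g., disgust, contempt)
        appear as often as majority ones in expectation."""
        counts = Counter(labels)
        weights = [1.0 / counts[y] for y in labels]
        sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                        replacement=True)
        return DataLoader(dataset, batch_size=batch_size, sampler=sampler)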
Additional visualizations, continuing from the main text, are provided in Fig. 1. Further visualizations, as described in the main text, are also provided for the masked FERPlus dataset in Fig. 2; it is clear that the proposed model does not focus on the masked area, while the baseline's attention spreads over it. More confusion plots for different datasets are available in Figures 2-7. The happy expression is easily recognizable across most of the datasets. The contempt expression could not be recognized well in FERPlus, as the number of samples in this category is much smaller than in the others; even though oversampling has been used, further enquiry is required into how performance can be improved. In the SFEW dataset, disgust and anger are poorly recognized, again probably because of imbalance. Fear and surprise are a commonly confused pair of expressions; similarly, anger and sad are confused at times.

References

[1] Automatic interpretation and coding of face images using flexible models.
[2] Human-computer interaction using emotion recognition from facial expression.
[3] Training deep networks for facial expression recognition with crowdsourced label distribution.
[4] Identity-free facial expression recognition using conditional generative adversarial network.
[5] VGGFace2: A dataset for recognising faces across pose and age.
[6] Facial expression recognition based on local binary patterns: A comprehensive study.
[7] Facial motion prior networks for facial expression recognition.
[8] Facial expression recognition approach for performance animation.
[9] MicroExpNet: An extremely small and fast model for expression recognition from face images.
[10] ImageNet: A large-scale hierarchical image database.
[11] Acted facial expressions in the wild database.
[12] Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark.
[13] Occlusion-adaptive deep network for robust facial expression recognition.
[14] Style aggregated network for facial landmark detection.
[15] Deep convolution network based emotion analysis towards mental health care.
[16] Attention branch network: Learning of attention mechanism for visual explanation.
[17] Performance comparisons of facial expression recognition in JAFFE database.
[18] MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. ECCV.
[19] Facial expression measurement for detecting driver drowsiness.
[20] Face expression recognition with a 2-channel convolutional neural network.
[21] Learning supervised scoring ensemble for emotion recognition in the wild.
[22] Facial expression recognition for intelligent tutoring systems in remote laboratories platform.
[23] Abstracts of scientific cooperations international workshops on electrical and computer engineering subfields.
[24] Recognition of facial expressions for optical flow.
[25] Comprehensive database for facial expression analysis.
[26] ImageNet classification with deep convolutional neural networks.
[27] Emotion recognition in the wild via convolutional neural networks and mapped binary patterns.
[28] Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition.
[29] Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild.
[30] Occlusion aware facial expression recognition using CNN with attention mechanism.
[31] The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression.
[32] Coding facial expressions with Gabor wavelets.
[33] Facial expression recognition using hybrid texture features based ensemble classifier.
[34] Subject independent facial expression recognition with robust face detection using a convolutional neural network.
[35] Identity-aware convolutional neural network for facial expression recognition. FG.
[36] AffectNet: A database for facial expression, valence, and arousal computing in the wild.
[37] Facial expression recognition system for autistic children in virtual reality environment.
[38] Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods and Techniques.
[39] Deep face recognition. British Machine Vision Conference.
[40] ImageNet large scale visual recognition challenge.
[41] Grad-CAM: Visual explanations from deep networks via gradient-based localization.
[42] Very deep convolutional networks for large-scale image recognition.
[43] Occlusion robust face recognition based on mask learning with pairwise differential siamese network.
[44] Deep spatial-temporal feature fusion for facial expression recognition in static images.
[45] Going deeper with convolutions.
[46] Abnormal behavior recognition using CNN-LSTM with attention mechanism.
[47] Suppressing uncertainties for large-scale facial expression recognition.
[48] Region attention networks for pose and occlusion robust facial expression recognition.
[49] ECA-Net: Efficient channel attention for deep convolutional neural networks.
[50] Facial expression recognition using Fisher weight maps.
[51] Facial expression recognition by de-expression residue learning.
[52] HoloNet: Towards robust emotion recognition in the wild.
[53] Learning face representation from scratch.
[54] Generative image inpainting with contextual attention. CVPR.
[55] Occlusion-free face alignment: Deep regression networks coupled with de-corrupt autoencoders.
[56] Joint face detection and alignment using multitask cascaded convolutional networks.
[57] Facial expression recognition from near-infrared videos.

We dedicate this work to Bhagawan Sri Sathya Sai Baba, Divine Founder Chancellor of Sri Sathya Sai Institute of Higher Learning, Prasanthi Nilayam, A.P., India.