key: cord-0138191-1yl0c5je authors: Alonso-Fernandez, Fernando; Diaz, Kevin Hernandez; Ramis, Silvia; Perales, Francisco J.; Bigun, Josef title: Facial Masks and Soft-Biometrics: Leveraging Face Recognition CNNs for Age and Gender Prediction on Mobile Ocular Images date: 2021-03-31 journal: nan DOI: nan sha: 2563af17fe83bccaee802e694213d5294265e32b doc_id: 138191 cord_uid: 1yl0c5je We address the use of selfie ocular images captured with smartphones to estimate age and gender. Partial face occlusion has become an issue due to the mandatory use of face masks. Also, the use of mobile devices has exploded, with the pandemic further accelerating the migration to digital services. However, state-of-the-art solutions in related tasks such as identity or expression recognition employ large Convolutional Neural Networks, whose use in mobile devices is infeasible due to hardware limitations and size restrictions of downloadable applications. To counteract this, we adapt two existing lightweight CNNs proposed in the context of the ImageNet Challenge, and two additional architectures proposed for mobile face recognition. Since datasets for soft-biometrics prediction using selfie images are limited, we counteract over-fitting by using networks pre-trained on ImageNet. Furthermore, some networks are further pre-trained for face recognition, for which very large training databases are available. Since both tasks employ similar input data, we hypothesize that such strategy can be beneficial for soft-biometrics estimation. A comprehensive study of the effects of different pre-training over the employed architectures is carried out, showing that, in most cases, a better accuracy is obtained after the networks have been fine-tuned for face recognition. Recent research has explored the automatic extraction of information such as gender, age, ethnicity, etc. of an individual, known as soft-biometrics [1] . It can be deduced from biometric data like face photos, voice, gait, hand or body images, etc. One of the most natural ways is face analysis [2] , but given the use of masks due to the COVID-19 pandemic, the face appears occluded even in cooperative settings, leaving the ocular region as the only visible part. In recent years, the ocular region has gained attention as a stand-alone modality for a variety of tasks, including person recognition [3] , softbiometrics estimation [4] , or liveness detection [5] . Accordingly, this work is concerned with the challenge of estimating soft-biometrics when only the ocular region is available. Additionally, we are interested in mobile environments [6] . The pandemic has accelerated the migration to the digital domain, converting mobiles in data hubs used for all type of transactions [7] . In such context, selfie images are increasingly used in a variety of applications, so they enjoy huge popularity and acceptability [8] . Social networks or photo retouching are typical examples, but selfies are becoming common for authentication in online banking or payment services too. Soft-biometrics information may not distinctive to allow accurate person recognition, but in unconstrained scenarios where hard biometric traits (like face or iris) may suffer from degradation, it has been shown to improve the performance of the primary system [9] . If a sufficient number of characteristics are available, it might be even possible to carry out recognition with just soft-biometrics [10] . Such information has other diverse practical applications as well [1] . 
One example is targeted advertising, where customized products or services can be offered if age, gender or other characteristics of the customer are automatically inferred. In a similar vein, Human-Computer Interaction (HCI) can be greatly improved by knowing the particularities of the person who is interacting with the system. In biometrics identification, search across large databases can be facilitated by filtering subjects with the same characteristics. On the one hand, it reduces the amount of comparisons, since only a portion of the database would be searched. On the other hand, it also allows to attain a better accuracy, since the errors of identification systems increases in proportion to the amount of comparisons [11] . Similarly, searches can be facilitated while looking for specific individuals in images or videos [12] . The complexity can be reduced enormously by searching or tracking persons only fulfilling certain semantic attributes (e.g. a young male with beard), filtering out those that are sufficiently distinct [13] . Another important fields of application are access control to products or services based on age (such as gambling, casinos, or games) and child pornography detection. The rapid growth of image and video collections due to high-bandwidth internet and cheap storage is being accompanied by the necessity of efficient identification of child pornography, often within very large repositories of hundreds of thousands or millions of images [14] . Soft-biometrics using RGB ocular images captured by front cameras of smartphones (selfies) is a relative new problem [15] , with very few works [16] [17] [18] [19] [20] . Selfie images usually contain degradations like blur, uneven light and background, variable pose, poor resolution, etc. due to unconstrained environments and mobile operation. In addition, front cameras usually have lower quality in comparison to back-facing ones. In such conditions, soft-biometric attributes like gender or age may be extracted more reliably than features from primary biometric traits such as face or iris [21] . It may not even be necessary to look actively to the camera, so after initial authentication with a primary modality, the user may be continuously authenticated via soft-biometrics without active cooperation [22] . Transparent authentication is possible with other smartphone sensors as well, such as keystroke dynamics [23] or readings from the accelerometer or gyroscope [24] . Solutions to counteract the lack of resolution in primary modalities have been proposed too, such as super-resolution [25] , so they are usable even at low resolution. However, the techniques in use are sensitive to acquisition conditions, degrading quickly with non-frontal view, illumination or expression changes. They also rely on a precise image alignment, which is an issue in low resolution, where blurring creates ambiguities for proper localization of facial landmarks or iris boundaries. Another issue has to do with the limited resources of mobile devices. Recent developments in computer vision involve deep learning solutions which, given enough data, produce impressive performance in a variety of tasks, including those using biometric data [26] [27] [28] [29] . But state-of-the-art solutions are usually based on deep Convolutional Neural Networks (CNN) with dozens of millions of parameters, and whose models typically have hundreds of megabytes, e.g. [30] . 
This makes them unfeasible for mobile devices, both because of computational constraints and because of the size limitations imposed by marketplaces on downloadable applications. If we look at state-of-the-art results with the database that we employ in the present paper [31][32][33] (Table 9), they all use very deep networks which would not be transferable to mobiles. Thus, models capable of operating under the restrictions of such environments are necessary. Another limitation is the lack of large databases of ocular images for soft-biometrics [15]. To overcome this, it is common to start with networks pre-trained on other tasks for which large databases exist. Examples include the generic ImageNet Challenge [34], as done e.g. in [19, 35], or face recognition datasets [32, 36]. Both are approaches that we follow in the present paper as well. This article focuses on the use of smartphone ocular images to estimate age and gender. Partial faces can be expected in unconstrained environments, but also in controlled ones due to the use of masks, hence our focus on the ocular region as the only visible part of the face. To be clear, we have not employed images of people wearing masks, or occluded images, but we have cropped the ocular area from selfie face images. This also allows comparing the use of the entire face and of the ocular region only with the same input data. A preliminary version appeared in a conference [4]. Here, we employ another database, Adience [21], consisting of Flickr images uploaded with smartphones that are jointly annotated with age and gender. It also has a more balanced distribution between classes. Given its in-the-wild nature, it provides a more demanding setup. In some other works (see Tables 1 and 2), images are taken in controlled environments, for example from face databases (such as MORPH or FERET), or using the close-up capture typical of iris acquisitions (such as Cross-Eyed, GFI, UTIRIS, ND-Iris-0405, etc.).
Figure 2: Images from the Adience database (from [21]).
Datasets for age and gender prediction from social media are still relatively limited [1]. To counteract over-fitting, some works use small CNNs of 2 or 3 convolutional layers trained from scratch [16, 17, 35, 37]. To be able to use more complex networks, one possibility is to pre-train them on a generic task for which large databases exist, like ImageNet [34]. This is done for example in [19, 35], and in the present paper. In the previous study, we employed CNNs pre-trained on ImageNet as well, and classification was done with Support Vector Machines (SVMs). In contrast, end-to-end training of the networks on the target domain is evaluated here too. Also, the present study evaluates networks pre-trained on a related task, face recognition [6, 38], for which large databases are available. Since both tasks use the same type of input data, we aim to analyze whether such face recognition pre-training can be beneficial for soft-biometrics. Other works have followed this strategy as well [32, 36], but they employ the entire face. Thus, to the best of our knowledge, taking advantage of networks pre-trained for face recognition for the task of ocular soft-biometrics can be considered novel. Finally, this paper is oriented towards the use of smartphone images. This demands architectures capable of working on mobile devices, a constraint not considered in our previous study.
The lighter CNNs that we employ [39, 40] have been proposed for common visual tasks in the context of the ImageNet challenge, and they have been bench-marked for face recognition as well [6, 41, 42] . To achieve less parameters and faster processing while keeping accuracy, they use techniques such as point-wise convolution, depth-wise separable convolution, bottleneck layers, or residual connections [40] . The models obtained have a few megabytes (Table 3 ), in contrast to other popular models such as ResNet [30] , which occupy dozens or hundreds of megabytes. The contributions of this paper to the state of the art are thus: • We summarize related works in age and gender classification using ocular images. • We apply two generic lightweight CNN architectures to the tasks of age and gender estimation. The networks, SqueezeNet [39] and MobileNetv2 [40] , were proposed in the context of the ImageNet Challenge [34] , where the networks are pre-trained with millions of images to classify thousands of generic object categories. The networks proposed within ImageNet have been used in the literature as base models in many other recognition tasks [57] , specially when available data is insufficient to train them from scratch. There is the assumption that architectures that perform well on a generic task like ImageNet will perform well on other vision tasks [58] . Thus, it is common to use ImageNet pre-trained networks just as fixed feature extractors, taking the output of the last layers as descriptor, and use it to train a classifier (like SVM) for the new task. In some cases, the network is re-trained taking ImageNet weights as initialization. even if there is sufficient training data for the new task, since it can produce faster convergence than scratch initialization [58] . The networks that we have selected for the present paper are two of the smallest generic architectures proposed within ImageNet, specifically tailored for mobile environments. To be precise, the architectures employed were presented by their respective authors [39, 40] in the context of the ImageNet challenge, and here we apply them to the task of ocular soft-biometrics classification. We have also implemented two existing lightweight architectures proposed specifically in previous studies for face recognition using mobile devices, MobileFaceNets [41] and MobiFace [42] . They are based on MobileNetv2, but with a smaller size and number of parameters. • To assess if more complex networks can be beneficial, we also evaluate two CNNs based on the large ResNet50 model [30] and on Squeeze-and-Excitation (SE) blocks [59] . ResNet was also proposed within ImageNet, presenting the concept of residual connections to ease the training of CNNs. They have been also applied successfully to face recognition [38] . In this paper, we apply these existing architectures to soft-biometric classification. Proposed without mobile restrictions in mind, they have significantly more parameters and size than the networks of the previous point (Table 3) . However, as we have observed, it does not translate in superior performance, at least with the amount of training data available in this paper. • The available networks are comprehensively evaluated for age and gender prediction with smartphone ocular images. For comparative purposes, we also use the entire face. To this aim, we use a challenging dataset, Adience [21] , which consists of selfie images captured in real-world conditions with smartphones. 
To the best of our knowledge, this is the first work that compares the use of face and ocular images for age and gender prediction with this database. We also conduct experiments using two different ocular ROIs, consisting of single-eye images and combined descriptors from both eyes. • Classification experiments with the networks are done in two ways: by using feature vectors from the layer prior to the classification layer and then training a separate SVM classifier; and by training the networks end-to-end. Prior to this, the networks are initialized in different ways. First, we use the large-scale ImageNet pre-training [34], an approach followed in many other classification tasks [57]. It allows using the network as a feature extractor and simply training a classifier, or it facilitates end-to-end training when there is little data in the target domain [58]. Due to previous research [6, 38], the CNNs are also available after being fine-tuned for face recognition with two large databases [38, 60]. Even if face recognition is a different task, we hypothesize that such fine-tuning can be beneficial for soft-biometrics classification. Indeed, facial soft-biometric indicators also allow separating identities [9], so features learned for one task can aid the other. In addition, since the ocular region appears in face images, we speculate that networks trained for face recognition can benefit soft-biometric estimation using ocular images as well. • Results of our experiments are reported in several ways. First, the accuracy of the networks is reported for the various initializations and classification options evaluated. Convergence of the end-to-end training is also analyzed by showing the training curves, together with training and inference times. Finally, t-SNE scatter plots of the vectors given by the last layer of the networks are also provided, showing progressive separation of the classes as the network progresses from a generic training (ImageNet) to an end-to-end training which also includes face recognition fine-tuning in the process. The rest of the paper is organized as follows. A summary of related works in age and gender classification using ocular images is given in Section 2. Section 3 then describes the networks employed. The experimental framework, including database and protocol, is given in Section 4. Extensive experimental results are provided in Section 5, followed by conclusions in Section 6. Pioneering studies of age or gender estimation from RGB ocular smartphone images were carried out by Rattani et al. [16, 18, 19]. Previously, near-infrared (NIR) iris images had been employed for age estimation, taking advantage of available iris databases [61, 62]. These studies used geometric or textural information, attaining an accuracy of ∼64%. Gender estimation from iris texture had also been proposed [45, 63-69], reaching an accuracy of over 91% [66]. These early works followed the pipeline of iris recognition systems, so soft-biometrics classification was done by extracting features from the segmented iris region (even if the surrounding ocular region is visible). Later works, mentioned below, have incorporated the ocular region into the analysis, even if the images are captured using traditional NIR iris sensors. Before the availability of specific ocular databases, it was also common to crop the ocular region from face databases like FRGC [45], FERET [46], web-retrieved data [44], or pictures of groups of people [47].
There are also works using the entire face [70, 71] but, due to space, we concentrate only on ocular images. Tables 1 and 2 summarize previous work on age and gender prediction. Only two databases (Adience and VISOB) are captured with frontal smartphone cameras (selfie-like). Databases like MORPH, LFW, FRGC, FERET, etc. contain face images, from which the ocular region is cropped. Other databases are of ocular images captured with digital cameras (Cross-Eyed), or iris images captured with NIR sensors (e.g. BioCOP, GFI, UTIRIS, UNAB, ND-Iris-0405). Age classification from smartphone ocular images is carried out in [16] using their own proposed CNN. To avoid over-fitting, they use a small sequential network with 3 convolutional and 2 fully-connected layers (41416 learnable parameters), which takes as input a crop of 32×92 pixels of the two eyes. Experiments are done with 12460 images of the Adience benchmark [21], which is also employed in the present paper. The database contains face images, so the ocular ROI is extracted by landmark localisation with the DLib detector [72]. To simulate the selfie-like case, only frontal images are retained. The reported accuracy is 46.97±2.9 (exact) and 80.96±1.09 (1-off). A set of works applies a patch-based approach for age estimation, in which crops of face regions are used [17, 37]. In [37], the authors use 23 patches around facial landmarks to feed 23 small CNNs (of 3 convolutional layers), each CNN specialized in one patch. Landmarks are detected using Active Shape Models. The patches operate at different scales, with the largest scale covering the entire face, and their outputs are connected together in a fully-connected layer. Therefore, the algorithm relies on combining regions of the entire face. Experiments are done with 55244 images of the MORPH database, which includes age labels from 16 to 77 years. The Mean Absolute Error (MAE) is 3.63 years. The authors also found that patches capturing smaller areas of the face give better results than patches that capture big areas, although the best accuracy is obtained when all scales are combined. Inspired by [37], the authors of [17] use a CNN architecture of 4 branches, having 4.8M learnable parameters. Each branch, of just 3 convolutional layers, is specialized on one patch around the eyebrows, eyes, nose or mouth. These regions are detected using the OpenFace and DLib detectors [72]. The branches are then connected to a fully-connected layer. During training, the loss of each branch and the loss of the entire network are summed up. However, each branch estimator is not used at inference time, but only the concatenated soft-max, so the system relies on the availability of all regions. The approach is evaluated with 19370 in-plane aligned images of Adience. The accuracy is 51.03±4.63 (exact) and 83.41±3.17 (1-off). The authors also removed different branches to evaluate their contribution, noticing that the absence of the eyes and mouth contributed most to reducing the accuracy (especially the eyes). This supports studies like the present one, which concentrates on the ocular region as the most prominent facial part for soft-biometrics. In a recent work [43], the authors use SURF (Speeded Up Robust Features) to detect key-points and extract features from the ocular region. Then, a hybrid SVM-kNN classifier is applied. With a small database of 500 images, they achieve an age accuracy of 96.57%.
More recently, we applied CNNs pre-trained on Imagenet to the tasks of age, gender and ethnicity [4] with 12007 images of the Labelled Faces in the Wild (LFW) database. One of the CNNs is also pre-trained for face recognition, as in the present work, although in [4] it did not prove to be an advantage. We extract features of different regions (face, eyes and mouth) using intermediate layers of the networks identified in previous works as providing good performance in ocular recognition [73, 74] . Then, we train SVMs for classification. In overall terms, the accuracy using ocular images only drops ∼2-4% in comparison to the entire face. The reported accuracy is 95.8/64.5% in gender/age estimation (entire face), 92.6/60.2% (ocular images), and 90.5/59.6% (mouth images). The approach is also evaluated against two commercial off-the-shelf systems (COTS) that employ the whole face, which are outperformed in several tasks. Regarding gender estimation, the work [48] pioneered the use of different regions around the iris for prediction. It uses BSIF (Binarized Statistical Image Feature) texture operator, and SVM as classifier. Data consists of 3314 NIR images of the BioCOP database. The work found that the entire ocular region provides the best accuracy (∼85%) and excluding the iris has a small impact (∼84%). On the other hand, using only the iris texture pushes down accuracy to less than 75%, highlighting the importance of the periocular region. The first study making use of selfie ocular images was presented in [18] . It evaluates several textural descriptors in combination with SVMs and Multi-layer Perceptrons (MLPs). They use 1200 selfie images of the VISOB database captured with 3 smartphones. The left and right eyes are cropped to 240×160 pixels with the Viola-Jones eye detector. The work reports results for each smartphone, with the best accuracy being 90.2%. Later, the same authors evaluated pre-trained and custom CNNs on the same database [19] . The very deep VGG and ResNet networks (pre-trained on ImageNet), along with a custom CNN of 3 convolutional layers, are employed. Experiments are conducted on single eye images (of 120×123) and on strips of both eyes (120×290). The pre-trained networks are used to extract feature vectors (from the last layer before soft-max) that feed a external classifier. The authors evaluated SVMs, MLPs, K-nearest neighbours (KNN), and Adaboost. The best accuracy (90.0±1.35) was obtained with pre-trained networks and both eyes. The custom CNN is just behind (89.60±2.91). Using only one eye, the best accuracy is 89.01±1.30 (pre-trained CNNs) and 87.41±3.07 (custom CNN). In [49] , Tapia and Viedma address gender classification with RGB and NIR ocular images. They employ pixel intensity, texture features (Uniform Local Binary Patterns, ULBP), and shape features (Histograms of Oriented Gradients, HOG) at different scales. Classification is done with Random Forest using 3840 images of the Cross-Eyed database. Among the different findings, we can highlight that: it is better to extract features at different scales than in a single scale only, and the fusion of features from RGB and NIR images improves accuracy. They also compare the extraction of features from the iris texture or the surrounding ocular area, finding that the ocular area is best, attaining an accuracy of 90%. In subsequent works, Viedma et al. [35, 50] study gender classification with NIR ocular images. In [35] , they train two small CNNs of 2 and 3 convolutional layers from scratch. 
They also use the very deep VGG-16, VGG-19 and Resnet-50 architectures (pretrained on ImageNet). As in [19] , the pre-trained networks are used as fixed feature extractors to feed a classifier (a dense neural network in this case). The authors also fine-tune these pre-trained networks by freezing the initial convolutional layers. Experiments are done with 4976 images of 120×160 from the GFI database, which are augmented using several spatial transformations. The custom CNNs were found to perform better (best accuracy 85.48%). They also observed (via activation maps of the networks) that the ocular area that surrounds the iris is the most relevant to classify gender, more than the iris area itself. In [50] , the authors employ the same features as in [49] , together with SVMs and 9 ensemble classifiers. They use 4 databases with gender information: GFI (4976 images), UTIRIS (389), Cross-Eyed (3840) and UNAB-gender (2768). The best accuracy is 89.22%, achieved by selecting features from the most relevant regions using XgBoost. As in [35] , the relevant features are spread throughout the whole ocular area with the exception of the iris. Later on, authors from the same group [51] applied superresolution convolutional networks (SRCNNs) to counteract scale variability in the acquisition of selfie ocular images in real conditions. They use 4 databases of VIS images: CSIP (2004 images), MOBBIO (800), MICHE (3196) and a self-captured one (450). Classification is done with Random Forest. The work shows that increasing resolution (2× and 3×) improves accuracy, achieving 90.15% (right eye) and 87.15% (left eye). In another paper [52] , they applied a small CNN of 4 convolutional layers, both trained separately for each eye, and for the fused left-right eye images. They use 3000 NIR images of the GFI database, showing that training the network separately for each eye is best (87.26% accuracy). The work [53] apply a variant of an auto-encoder (Deep Class-Encoder) to predict gender and race using NIR iris images of 48×64 pixels. The databases employed for gender experiments are GFI (2999 images) and ND-Iris-0405 (64980 images). The best gender accuracy is 83.17% (GFI) and 82.53% (ND-Iris-0405). In [54] , they use GIST perceptual descriptors with weighted kernel representation to carry out gender classification from images captured in 8 different spectral bands simultaneously. To this aim, the authors use a spectral imaging camera. With a self-captured database of 104 ocular instances (10 different captures per instance, totalling 104×10×8 images), they achieve an average accuracy of 81%. In [55] , the authors use NIR ocular images to estimate gender and race. They apply typical iris texture descriptors used for recognition (Binarized Statistical Image Feature, BSIF, Local Binary Patterns, LBP, and Local Phase Quantization, LPQ) with SVM classifiers. Three datasets are used: BioCOP2009 (41830 images), Cosmetic Contact (4200), and GFI (4976). The gender accuracy from a single eye image is of 86%. The study also confirms previous research that showed that excluding the iris region provides greater accuracy. The authors of [20] apply a patch-based approach for gender estimation with 10 crops around landmarks (left eye, right eye, complete eye region, lower nose, lip, left face, right face, forehead and upper nose). Then, compass LBP features are extracted from each region, and classified with one SVM per region. Finally, the classification scores of all regions are combined with a genetic algorithm. 
Experiments are done with Adience (1757 images), color FERET (987), LFW (5749) and two sketch datasets, CUFS (606) and CUFSF (987 sketches from color FERET). The best accuracy is 95.75% (color FERET). The performance on Adience using the whole face is 87.71%. The authors also study each facial region individually on the Adience database, with an accuracy of 84.06% (one eye) and 83.27% (both eyes). Other regions of the face provide lower accuracy (73.95-82.71%), with the lip region providing 78.25%. This supports the findings of our previous study, which revealed the eye region as having superior accuracy to other regions of the face [4]. Lastly, a multimodal system that fuses features from the face and ocular regions is proposed in [56]. They use 300 NIR images of CASIA-Iris-Distance, and 405 VIS images of the MBGC database (one third are faces, one third are left eye, and one third are right eye images). As features, they employ ULBP (with overlapping blocks).
Figure 3: (a) MS1M images of three users (by row) and three viewpoints (by column: frontal (1-2), three-quarter (3-4), and profile (5)). (b) VGGFace2 images from three viewpoints (frontal, three-quarter, and profile, arranged by row). Image from [38].
We extract features from the face and the left and right ocular regions (Figure 1, top) using different CNN architectures (Table 3). Two lightweight pre-trained generic architectures, SqueezeNet and MobileNetv2, are used for feature extraction and classification. • SqueezeNet [39] is one of the early networks designed to reduce the number of parameters and model size. The authors proposed the use of squeeze and expand modules that follow the bottleneck concept. First, dimensionality is reduced with 1×1 point-wise convolutions (squeeze or bottleneck layer), followed by a layer with a larger number of filters (expansion layer), which includes 3×3 filters too. The network uses late downsampling, since keeping large activation maps should lead to a higher accuracy. With only 1.24M parameters, 4.6 MB and 18 convolutional layers, it matched AlexNet accuracy on ImageNet with 50× fewer parameters. • MobileNetv2 [40] employs depth-wise separable convolutions and inverted residual structures to achieve a light architecture. Depth-wise separable convolution works in two stages, first performing filtering with a single filter per input channel, followed by a 1×1 point-wise convolution that linearly combines the channels. In the case of 3×3 filters, this reduces computations by a factor of 8 or 9 compared to a standard full convolution, with a small cost in accuracy [75]. Inverted residual structures, also called bottleneck residual blocks with expansion, consist of first expanding the number of channels with 1×1 point-wise filters. Then, the channels are processed with a large number of 3×3 depth-wise separable filters. Finally, the number of channels is reduced again with 1×1 point-wise filters. A shortcut (residual) connection is added between the input and the output of such a structure to improve the ability of gradients to propagate across layers. This network has 3.5M parameters, a size of 13 MB and 53 convolutional layers. The original SqueezeNet and MobileNetv2 are modified to employ an input size of 113×113×3. The stride of the first convolutional layer is changed from 2 to 1, so the rest of the network can remain unchanged (more importantly, we can use ImageNet weights). We have also implemented two lightweight architectures proposed specifically for face recognition using mobile devices.
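Before detailing these two networks, the inverted residual (bottleneck-with-expansion) block just described can be illustrated with the short PyTorch-style sketch below. It is only a minimal illustration of the idea, not the implementation used in our experiments (which relies on the reference models); channel counts and the expansion factor are chosen arbitrarily.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Minimal MobileNetv2-style block: 1x1 expansion -> 3x3 depth-wise -> 1x1 projection."""
    def __init__(self, in_ch, out_ch, expansion=6, stride=1):
        super().__init__()
        mid = in_ch * expansion
        self.use_shortcut = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),               # 1x1 point-wise expansion
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                      groups=mid, bias=False),                   # 3x3 depth-wise (one filter per channel)
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False),               # 1x1 point-wise projection (linear bottleneck)
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.block(x)
        # residual shortcut only when input and output shapes match
        return x + y if self.use_shortcut else y

# e.g. a feature map derived from a 113x113 ocular crop
x = torch.randn(2, 32, 113, 113)
print(InvertedResidual(32, 32)(x).shape)  # torch.Size([2, 32, 113, 113])
```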
They are MobileFaceNets [41] and MobiFace [42]. Both are based on MobileNetV2, but with smaller expansion factors on the bottleneck layers to make the network smaller. They also employ a reduced input image size of 113×113×3. MobileFaceNets has 0.99M parameters, 50 convolutional layers, and 4 MB. It uses Global Depth-wise Convolution (GDC) to substitute the standard Global Average Pooling (GAP) at the end of the network. The motivation is that GAP treats all pixels of the last channels equally, but in face recognition, the center and corner pixels should be weighted differently. It also uses PReLU as non-linearity, and fast down-sampling at the beginning. MobiFace [42] also employs fast down-sampling and PReLU, but the authors replaced GAP with a fully-connected layer to allow learning different weights for each spatial region of the last channels. This network has a size of 11.3 MB and 45 convolutional layers. Finally, we evaluate the large models of [38] for face recognition. They use ResNet50 [30] and SE-ResNet50 (abbreviated as SENet50) [59] as backbone architectures, with an input size of 224×224×3. ResNet networks presented the idea of residual connections to ease CNN training. Followed later by many others (including MobileNetV2), residual connections allow much deeper networks. The network employed here, ResNet50, has 50 convolutional layers, but there are deeper ResNets of even 1001 layers [76]. The Squeeze-and-Excitation (SE) blocks [59], on the other hand, explicitly model channel relationships to adaptively recalibrate channel-wise feature responses. We use the Adience benchmark [21], designed for age and gender classification. The dataset consists of Flickr images uploaded automatically with smartphones. Some examples are shown in Figure 2. Given the uncontrolled nature of such images, they have high variability in pose, lighting, etc. The downloaded dataset includes 26580 images from 2284 subjects. To simulate selfie captures, we removed images without frontal pose, resulting in 11299 images. They are then rotated w.r.t. the axis crossing the eyes, and resized to an inter-eye distance of 105 pixels (the average of the database). Facial landmarks are extracted using the MTCNN detector [77]. Then, a face image of 224×224 is extracted around the mass center of the landmarks, together with the ocular regions (of 113×113 each). The breakdown of images into the different classes is given in Table 4. The Adience benchmark specifies a 5-fold cross-validation protocol, with splits pre-selected to avoid images from the same Flickr album appearing in both the training and testing sets of the same fold. Given a test fold, classification models are trained with the remaining four folds. Classification results, therefore, consist of mean accuracy and standard error over the five folds. Following [21], we also provide the 1-off age classification rate, in which errors of one age group are considered correct classifications. The training folds are augmented by mirroring the images horizontally. In addition, the illumination of each image is varied via gamma correction with γ=0.5, 1, 1.5 (γ=1 logically leaving the image unchanged). This way, from a single face or ocular image, we generate 6 training images, with which we expect to counteract over-fitting and accommodate variations in illumination. Finally, when feeding the CNNs, images are resized to the corresponding input size indicated in Table 3.
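The augmentation just described (horizontal mirroring plus gamma correction with γ = 0.5, 1, 1.5, i.e. six training images per crop) can be sketched as follows. This is an illustrative snippet that assumes crops are already extracted as floating-point arrays in [0, 1]; it is not the exact pipeline used in our experiments.

```python
import numpy as np

def augment(img, gammas=(0.5, 1.0, 1.5)):
    """img: HxWx3 float array in [0, 1]. Returns the 6 augmented copies:
    {original, horizontally mirrored} x {gamma 0.5, 1.0, 1.5}."""
    out = []
    for flipped in (img, img[:, ::-1, :]):               # original + horizontal mirror
        for g in gammas:
            out.append(np.clip(flipped ** g, 0.0, 1.0))  # gamma correction; g=1 leaves the image unchanged
    return out

crop = np.random.rand(113, 113, 3)  # one ocular crop
print(len(augment(crop)))           # 6 training images
```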
Classification is done in two ways (Figure 1, bottom): i) by training a linear SVM [78] using feature vectors extracted from the CNNs, and ii) by training the CNNs end-to-end. Prior to training, the CNNs are initialized in different ways, as will be explained in Section 5. To train the SVMs, we use vectors from the layer prior to the classification layer, with the size of the feature vectors given in Table 3. When there are more than two classes (age classification), a one-vs-one multi-class approach is used. For N classes, N(N−1)/2 binary SVMs are trained, and classification is based on which class receives the highest number of binary votes (voting scheme). Regarding end-to-end training, we change the last fully-connected layer of each network to match the number of classes (2 for gender, 8 for age). Batch-normalization and dropout at 50% are added before the fully-connected layer to counteract over-fitting. The networks are trained using soft-max as loss function and Adam as optimizer, with mini-batches of 128 (we also tried SGDM initially, but Adam provided better accuracy overall, so we skipped SGDM). The learning rate is 0.001. During training, 20% of the images of each class are set aside for validation in order to detect over-fitting and stop training accordingly. When the networks are initialized from scratch, training is stopped after 5 epochs. In all other cases, training is stopped after 2 epochs. Experiments have been done on stationary computers running Ubuntu, with an i9-9900 processor, 64 GB RAM, and two NVIDIA RTX 2080 Ti GPUs. We carry out training using Matlab r2020b, while the implementations of ResNet50 and SENet50 are run using MatConvNet. The SqueezeNet, MobileNetv2 and ResNet50 CNNs are available pre-trained on the large-scale ImageNet dataset [34]. They are also available after being fine-tuned for face recognition using two large face databases [6, 38]. To do so, the networks are trained for biometric identification on the MS-Celeb-1M database [60] (MS1M for short), and then fine-tuned on the VGGFace2 database [38]. The images of these databases, downloaded from the Internet, show large variations in pose, age, ethnicity, lighting and background (see Figure 3). MS1M has 10M images from 100k celebrities (with an average of 81 images per subject), while VGGFace2 has 3.31M images of 9131 subjects (362.6 images per subject). Fine-tuned ResNet50 and SENet50 models are made available by the authors of [38], initialized from scratch. SqueezeNet and MobileNetv2 are trained by us as described in [6], initialized using ImageNet weights, producing models trained first on MS1M and then on VGGFace2. MobileFaceNets and MobiFace are also trained by us with the same protocol, but initialized from scratch.
Table 6: Accuracy of gender and age estimation using CNN models trained end-to-end. The best results with each network are marked in bold. For each column, the best accuracy is highlighted with a grey background. Underlined elements indicate that the accuracy is worse than the corresponding combination in Table 5.
Table 7: Training and inference times of the networks evaluated in this paper. Training times correspond to the plots shown in Figure 4.
Table 5 shows the classification performance obtained with these pre-trained networks, using SVM as classifier, according to the protocol of Section 4.2.
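As a concrete reference for this SVM route, the sketch below shows how descriptors from the layer prior to the classification layer could feed a linear SVM with one-vs-one voting for the 8 age classes. It uses scikit-learn as a stand-in for the Matlab implementation actually employed, and the random arrays merely mark where real CNN descriptors and labels would go.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: penultimate-layer CNN descriptors (here 1280-D, the MobileNetv2 size)
X_train = np.random.randn(200, 1280)
y_train = np.random.randint(0, 8, 200)   # 8 age groups (2 classes for gender)

# Linear SVM; with N > 2 classes, SVC trains N(N-1)/2 binary classifiers
# and predicts the class that wins the most binary votes (one-vs-one scheme)
clf = SVC(kernel='linear', decision_function_shape='ovo')
clf.fit(X_train, y_train)

# 'Ocular L+R': average the left- and right-eye descriptors before classification
left, right = np.random.randn(1, 1280), np.random.randn(1, 1280)
print(clf.predict((left + right) / 2.0))
```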
For each CNN, different possibilities based on the available pre-training are reported. We provide age and gender classification results using as input either the whole face or the ocular region. The columns named 'ocular' refer to the left and right eyes separately (each image is classified independently), while 'ocular L+R' refers to the combination of both eyes (by averaging the CNN descriptors before calling the SVM classifier). In the majority of networks, a better accuracy is obtained after the CNNs are fine-tuned for face recognition on MS1M or VGGFace2. Also, it is better in general if the networks have undergone the double fine-tuning, first on MS1M and then on VGGFace2. This is in line with the experimentation of [6, 38], which showed that face recognition performance could be improved after this double fine-tuning. These results also show that a CNN trained for face recognition can be beneficial for soft-biometrics classification too, even if just the ocular region is employed. Given that facial soft-biometric cues can be used for identity recognition as well [9], features learnt for recognition are expected to carry soft-biometrics information, and vice-versa. The only exception is ResNet50, where a better accuracy in general is obtained only with the ImageNet training. This also shows that even ImageNet training can be beneficial for soft-biometrics (as shown in our previous paper too [4]), since the accuracy of ResNet50 on ImageNet is similar or, in some cases, better than the accuracy obtained with other CNNs after they are fine-tuned for face recognition. With SENet50 we cannot draw any special conclusion since there is only one pre-training available. What can be said is that it performs worse than ResNet50, even if SENet50 is better in face recognition tasks (as reported in [6, 38]).
Figure 4: Training progress of the CNNs for soft-biometrics (with networks pre-trained on MS1M for face recognition, the case that provides the best accuracy in Table 6). All plots correspond to 2 training epochs over the training set of the first fold of Adience.
Regarding the comparison between face and ocular input with ImageNet pre-training, the ocular area shows comparable accuracy, and is even better in some cases, for example: 38.7% vs. 40.4% (ResNet50, exact accuracy), or 36.6% vs. 37.8% (MobileNetv2, exact accuracy). The indicated ocular accuracy refers to both eyes ('ocular L+R'), which is observed to improve by 3-4% in comparison to using one eye only.
Figure 5 (caption): t-SNE scatter plots of the vectors given by the layer prior to the classification layer of each network (feature sizes in Table 3). All plots are generated with the test set of the first fold of the Adience database. Best in colour and with zoom.
This comparable accuracy between face and ocular regions is a very interesting result. Since the networks are trained for a generic recognition task like ImageNet, and not particularly optimized to the use of facial or ocular images, we can safely assume that the ocular region is a powerful region for soft-biometrics estimation, comparable to the entire face. This is in line with our previous findings as well [4]. When the networks are fine-tuned for face recognition with MS1M or VGGFace2, accuracy with the entire face becomes substantially better (sometimes by ∼15%). Still, accuracy with the ocular area is improved as well. This may be because it appears in the training data, although in a small portion of the image. This leads us to think that accuracy with the ocular area could be made comparable if the CNNs are fine-tuned for ocular recognition instead. Lastly, from the results of Table 5, we cannot conclude that one CNN is better than another.
A good CNN for gender is MobileNetv2, which is the best with the ocular region, and its face accuracy is good as well. For age classification, MobiFace stands out. It should be highlighted, though, that the difference between CNNs is 2-3% or less in most columns. This is interesting, considering that the networks differ in size, sometimes substantially (Table 3). It is especially relevant to observe that ResNet50 and SENet50 do not outperform the others, even if their input image size and network complexity are higher. A final observation is that gender classification is in general more accurate than (exact) age classification. Being a binary problem, gender may be regarded as less difficult than age recognition, which has eight classes. In addition, we have employed the same database for both tasks, so age classes contain fewer images for training. If we consider the 1-off age rate, on the other hand, age accuracy becomes better than gender accuracy. Four networks are further fine-tuned to do the classification end-to-end, according to the protocol of Section 4.2. We keep only the small CNNs, since they will be less prone to over-fitting, given the reduced number of images in comparison to, for example, face recognition [6, 38]. Table 6 shows the classification results considering different pre-training, including from scratch. As in the previous section, a better accuracy is obtained with the CNNs that are fine-tuned first for face recognition on MS1M or VGGFace2, rather than only on ImageNet. However, in this case, it is sufficient if the networks are just fine-tuned on MS1M. Training from scratch produces the worst results, suggesting that the amount of training data is not yet sufficient in comparison to other domains. A way to overcome such a problem is to train the networks first on other tasks for which large-scale databases are available, as we do in this paper. A generic task like ImageNet can be useful [57], producing better results than if the network is just trained from scratch. But according to our experiments, a better solution is to use a task for which similar training data is employed, such as face recognition. In Table 7 and Figure 4, we provide the training curves over two epochs, and the training/inference times of the different models (pre-trained on MS1M, which is the model that provides the best accuracy overall in Table 6). Due to space constraints, we show only the results over the first fold of the database. Figure 4 shows that most models converge over the first epoch (first half of the horizontal axes), with the validation accuracy showing little improvement over the second epoch. The horizontal axes of the periocular plots reach a higher value because for each face image there are two separate ocular images, so the number of iterations is doubled. It can also be seen that the validation accuracy after the second epoch (red and blue for gender and age, respectively) is similar in most cases to the accuracy reported in Table 6, i.e. 70-80% for gender estimation, and 40-50% for age estimation. Regarding training times, ocular training obviously takes about twice as long due to the duplication of images. Also, gender and age training take comparatively the same time for each CNN, given that the same images are used, only divided into different classes. The depth of each network (convolutional layers, see Table 3) correlates with the training time. The lightest network (SqueezeNet) takes the least time, while the deepest ones (MobileNetv2 and MobileFaceNets) take the longest.
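For reference, the end-to-end configuration behind Table 6 and Figure 4 (Section 4.2) can be sketched roughly as follows: the last fully-connected layer is replaced to match the number of classes, batch-normalization and 50% dropout are added before it, and training uses the soft-max (cross-entropy) loss with Adam at a learning rate of 0.001 and mini-batches of 128. The snippet is an assumed PyTorch equivalent of our Matlab setup, shown here for MobileNetv2 only.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision.models import mobilenet_v2

num_classes = 8                    # 2 for gender, 8 for age groups
net = mobilenet_v2(weights=None)   # in practice, initialized with ImageNet or face-recognition weights

# Replace the classification head: batch-norm + 50% dropout + a new fully-connected layer
in_features = net.classifier[-1].in_features   # 1280 for MobileNetv2
net.classifier = nn.Sequential(
    nn.BatchNorm1d(in_features),
    nn.Dropout(p=0.5),
    nn.Linear(in_features, num_classes),
)

criterion = nn.CrossEntropyLoss()                   # soft-max loss
optimizer = optim.Adam(net.parameters(), lr=0.001)
# Training would then loop over mini-batches of 128 for 2 epochs (5 if from scratch),
# keeping 20% of each class aside as a validation set to detect over-fitting.
```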
Inference times have been computed with the CPU to simulate lack of graphical power. Still, times are on the order of milliseconds, and they also correlate with the depth of the CNN. Regarding face vs. ocular classification, the same conclusions as in the previous section apply. When the networks have not seen this type of images before (scratch or ImageNet pre-training), face and ocular images produce comparable performance. The difference is just 2-3% with most networks, the only exception being SqueezeNet, for which face is better than ocular by up to 10%. On the other hand, when the CNNs are fine-tuned for face recognition, accuracy with the entire face becomes substantially better, although accuracy with the ocular area is improved as well. In contrast to the previous section, Table 6 shows MobileNetv2 as the clear winner, producing the best accuracy in all tasks. This network is the most complex of the four (see Table 3), which may explain its superiority. The other networks should not be dismissed though, since their accuracy is just 2-3% below in most cases, so a lighter network provides only a slightly worse accuracy. Comparing the results of Tables 5 and 6, we observe that the best accuracy per network (bold elements) is in general equal or better in the experiment of this section (Table 6), except for the 1-off age accuracy. Even if all networks improve the exact age estimation to a certain extent, accuracy in this task is still below 50%, which may be a sign that more training data would be desirable. The degradation in 1-off age accuracy may be another sign of over-fitting. The network that benefits the most from the training of this section is MobileNetv2, with improvements of 3-5%. MobileFaceNets and MobiFace show improvements of 2-3% in the majority of tasks. SqueezeNet shows marginal improvements in gender estimation, with some improvements of 2-3% in age estimation only. To further evaluate the improvements given by the end-to-end training of this section, we have removed the fully-connected layers of each network and trained SVMs instead for classification, as in Section 5.1. Results are shown in Table 8. Interestingly, gender accuracy is degraded, but age shows some improvement in exact estimation (1-3%), and a substantial improvement in 1-off estimation (18-20%). For example, the best 1-off age face/ocular accuracy is 91.3/86.2%, surpassing the best gender results obtained in this paper. The results of Table 8 suggest that the SVM is a better classifier in the difficult age estimation task. Table 8 also shows the superiority of MobileNetv2, which has the best accuracy in nearly all tasks. Finally, we evaluate the benefits of the progressive fine-tuning proposed here by showing in Figure 5 the scatter plots created by t-SNE [79] of the vectors provided by each network just before the classification layer. The t-SNE settings are exaggeration=4, perplexity=30, learning rate=500. For MobileNetv2, we show results after different network trainings (left column, top to bottom): i) ImageNet (generic pre-training), ii) ImageNet+MS1M (fine-tuning to face recognition), and iii) ImageNet+MS1M+Adience (fine-tuning to soft-biometrics classification). It can be seen that, as the plots progress from top to bottom, the clusters of each class tend to separate more from the others. For gender classification, the blue and red dots form two distinct clouds in the third row.
For age classification, clusters of age 0-2 (red), 4-6 (orange) and 8-13 (green) appear nearby, and in progressive (circular) trajectory towards the light blue clusters (15-20 and 25-32 age groups). Then, the dark blue clusters (38-43 and 48-53) appear, and finally, the 60-99 group (magenta). This progressive ordering of age groups in the feature space reflects the expected correlation in human age [1] , with adjacent age groups being closer to each other, and non-adjacent age groups appearing more distant. Similar class separations after ImageNet+MS1M+Adience training is also observed with the other three networks (right column). On the contrary, in training i and ii with MobileNetv2, the clusters are spread across a common region, without a clear separation among them, specially with only ImageNet training. Male/female clusters are intertwined, forming a circle-like cloud, and the same happens with age groups. Light blue (young adults), dark blue (middle age) and magenta (old age) dots are spread across the same region. Even red and orange dots (children) sometimes appear in opposite extremes of the circle-like shape. Table 9 shows a summary of the best reported accuracy of the two previous sub-sections (cells highlighted with a grey background in Tables 5-8 ). For reference, the performance of other works using the same database for ocular age or gender estimation is also shown [16, 20] . Most of the literature making use of the Adience database employ full-face images, with the best published accuracy shown also at the bottom of Table 9 . To identify these works [31] [32] [33] , we have reviewed all citations to the papers describing the database [21, 80] reported by IEEEXplore (circa 305 citations), and selected the ones with the best published accuracy for each column. It must be noted that although the Adience database is divided in pre-defined folds, the works of Table 9 may not necessarily employ the same amount of images per fold, so results are not completely comparable. In gender estimation, we do not outperform the related work that uses the ocular region [20] . It should be highlighted that the latter uses 1757 images (see Table 2 ), while we employ 11299. We outperform previous age accuracy using the ocular region [16] , which uses a set of comparable size (12460 images). To prevent over-fitting, the paper [16] uses a small custom CNN trained from scratch with images of 32×92 (crop of the two eyes). In contrast, our networks are pre-trained on several tasks, including generic object classification [34] , and face recognition [6, 38] , which seems a better option. Our input image size is also bigger (113×113). Compared to works using the full face, we do not outperform them either [31] [32] [33] . In gender estimation, we obtain an accuracy ∼8% behind the best method [31] . The latter uses the very deep VGG19 CNN (535 MB, 144M parameters), which is much more complex than the networks employed here ( Table 3 ). The size of the input image is 448×448, which is much bigger than ours. Also, a saliency detection network is trained first on the PASCAL VOC 2012 dataset to detect the regions of interest ('person' or face' pixels) and indicate the classification CNNs the pixels to look at. In age estimation, our accuracy is more competitive, ∼4% behind the best method in 1-off classification [33] , although the exact accuracy is still way behind the best result [32] . 
The work [33] combines residual networks (ResNet) or residual networks of residual networks (RoR) with Long Short-Term Memory (LSTM) units. First, a ResNet or a RoR model pre-trained on ImageNet is fine-tuned on the large IMDB-WIKI-101 dataset (500k images with exact age labels within 101 categories) for age estimation. Then, the model is fine-tuned on the target age dataset to extract global features of face images. Next, to extract local features of age-sensitive regions, an LSTM unit is presented to find such age-sensitive regions. Finally, age group classification is conducted by combining the global and local features. The size of the input image is 224×224, and the best reported accuracy is obtained with a ResNet152 network as base model (214 MB), an even deeper network than the ResNet50 evaluated in the present paper. The work [32] follows an approach similar to ours to prevent over-fitting. They use the very deep VGG-Face CNN (516 MB) [81], which is trained to recognize faces using ∼1 million images from the Labeled Faces in the Wild and YouTube Faces datasets. To fine-tune the model for age classification, the CNN is frozen, and only the fully-connected layers are optimized. The network uses images of 224×224 for training. For testing, they use images of 256×256, of which 5 images of 224×224 are extracted (four corners and center). Then, the five images are fed into the CNN, and the soft-max output vectors are averaged. This combination method is also followed by the authors of Adience [80], showing some improvement in comparison to the center crop only.
Table 9: Summary of the best reported accuracy of the experiments of this paper. The table also includes results of recent works using the same database. Different papers may not employ exactly the same amount of images per fold, so results are not completely comparable. The best results of our experiments are marked in bold. For each column, the best accuracy is highlighted with a grey background.
We lastly report the detailed gender and age estimation accuracy of each class for our approach (Tables 10, 11). We also include (when available) the details of other approaches of Table 9. It can be observed that gender recognition is relatively equal between classes (1-2% of variation around the overall accuracy), which can be a result of the classes being well balanced in the database (Table 4). Regarding age, the accuracy between classes is more variable. This may be a product of the classes being less balanced, although there is not always a correlation between class representation and accuracy. It can also be seen that all methods show the same relative performance among classes. This includes other works [32, 80], even if they are based on different networks or training strategies, suggesting that some classes may be more difficult. The classes with the worst accuracy are 38-43 and 48-53 in the majority of columns, but the class 48-53 is much less represented in the database. The class 15-20 also has comparatively low performance. On the other hand, other classes with low representation (0-2 and 60-99) have better performance, and in some cases, 0-2 even shows the best accuracy. The most represented class (25-32) does correlate with the best accuracy in some cases, and its performance is among the best in most columns. We are interested in lightweight network architectures capable of providing age and gender recognition using selfie ocular images.
The literature review suggests that many of the proposed methods use data captured in controlled ways, either cropped from RGB face databases or from iris databases that employ close-up near-infrared sensors. Also, to be able to operate on mobile devices, the models have to be sufficiently small, making infeasible the use of the very large Convolutional Neural Networks (CNNs) that provide state-of-the-art results in related tasks such as identity or expression recognition [27, 29]. Their typical size (hundreds of megabytes) prevents their incorporation into downloadable mobile applications, where the entire file typically cannot exceed 100 MB. Accordingly, we have adapted very light models of a few megabytes [39-42] to operate with small ocular images. The networks employed can also provide inference in <30 ms on a CPU, so a mobile device with sufficient power should be able to run them in real time too. To counteract over-fitting due to the lack of very large selfie datasets for age and gender prediction, we use architectures pre-trained on the ImageNet Challenge [34], where the networks have learnt to classify thousands of generic object categories by using millions of training images. We also exploit the availability of very large face recognition databases [38, 60]. Due to previous research [6, 38], the networks are fine-tuned first for face recognition. We hypothesize that such large-scale fine-tuning can be beneficial for soft-biometrics classification too, since both tasks use the same type of input data. Experiments are done with 11299 images of the Adience benchmark, which contains in-the-wild smartphone images uploaded to Flickr. The networks are evaluated for age and gender prediction using images of the ocular region. For comparison, they are also evaluated with the entire face. Classification is done in two ways: by extracting feature vectors from the layer prior to the classification layer of the network and then training an SVM classifier; and by training the network end-to-end. We also compare different network initializations, including from scratch, with ImageNet weights, and fine-tuned for face recognition (as mentioned above). In our experiments, training from scratch provides the worst results, suggesting that the training data is not yet sufficient compared to other domains. Initializing the networks with a generic task for which large databases exist (like ImageNet) is more effective [57], as done in other soft-biometrics works too [19, 35]. But in most cases, the best accuracy is obtained when the CNNs are fine-tuned first for face recognition. This is also observed in the t-SNE plots of the vectors given by the networks, where the classes appear more separated after such face recognition pre-training. Such a phenomenon is observed even if only the ocular region is used for soft-biometrics estimation, which we attribute to the ocular region appearing in face images, so it has been 'seen' by the networks previously. Identity and soft-biometrics are inter-related tasks, since they use the same input data. Indeed, soft-biometrics can aid identity recognition as well [9], so it is expected that one task benefits the other. Regarding face vs. ocular classification, there is no clear winner when the networks are initialized with ImageNet, as observed in previous research too [4]. In such a case, the networks are trained for a generic task, without a particular optimization to facial or ocular images.
Thus, we can consider the ocular region a powerful stand-alone region for soft-biometrics, comparable to the entire face. By contrast, when the networks are initialized with face recognition weights, soft-biometrics classification with the entire face becomes substantially better (although accuracy with the ocular area improves as well). Our interpretation is that, since the ocular region appears in portions of the face image, such initialization also benefits the ocular soft-biometric task, although to a lesser extent. We believe that if the networks were instead fine-tuned for ocular recognition, ocular soft-biometric classification would become comparable to the entire face, as observed with the agnostic ImageNet initialization. Regarding absolute numbers, our best accuracy is 85.3/93.4% in gender/1-off age estimation with the entire face, and 78.9/86.2% with the combination of the two eyes. In ocular gender recognition, we do not outperform the best accuracy reported in the literature for the Adience database [20], although that work uses only 10% of the images that we employ in this paper. In ocular age recognition, we outperform previous research [16]. The majority of research with this database is done with full-face images, but the papers producing state-of-the-art results [31] [32] [33] (Table 9) all use very deep networks, which would not be transferable to mobile devices. As future work, we are looking into fine-tuning the networks for ocular recognition, given that this area can be cropped from face databases. In this way, we expect to increase ocular soft-biometrics accuracy by transfer learning, as observed after the networks are trained for face recognition. Also, this work has addressed age and gender recognition simultaneously with a single database, but larger repositories of unconstrained data containing only one of these indicators are becoming available, e.g. [28, 82]. This would allow each task to be addressed separately with bigger datasets, although it would hinder another direction that we want to pursue, which is the joint estimation of both indicators. We foresee that improvements can be obtained by sharing weights between the networks, since a single facial feature can carry information not only about identity, but about different soft-biometrics at the same time. One plausible direction to overcome this would be to train the networks on larger databases for each task, as done by works that focus on gender [31] or age estimation [33] separately, and then combine them on a database labelled with several soft-biometric indicators simultaneously. Freezing initial layers after the networks have been pre-trained on a related task (such as face recognition) can be another approach to counteract the lack of sufficient data in the target database, as done by other studies as well [32] (a minimal sketch is given below). Age estimation using ocular data also deserves extra attention. Its exact accuracy is still low in comparison to gender estimation. With the employed database, state-of-the-art accuracy is 93.52% (gender) vs 70.96% (age), see Table 9. In ocular works with other databases (Tables 1 and 2), a gender accuracy of 90-95% is common, while exact age estimation barely reaches 60%. We expect to achieve improvements in this direction with larger facial repositories [83].
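To make the layer-freezing strategy above concrete, the following sketch freezes the early feature blocks of a lightweight backbone and fine-tunes only the remaining layers and a new classification head. It again assumes a torchvision MobileNetV2; which blocks are frozen, the head size and the optimizer settings are illustrative assumptions, not the configuration of [32] or of this work.

```python
import torch
import torchvision.models as models

NUM_CLASSES = 8  # e.g. the eight Adience age groups (illustrative)

model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)

# Freeze the early feature blocks so that generic low-level filters learned on
# the large source task (ImageNet or face recognition) are kept fixed.
for block in model.features[:14]:
    for param in block.parameters():
        param.requires_grad = False

# Replace the head and fine-tune only the unfrozen parameters on the
# (smaller) target soft-biometrics database.
model.classifier[1] = torch.nn.Linear(model.last_channel, NUM_CLASSES)
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```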
References

[1] Demographic analysis from biometric data: Achievements, challenges, and new frontiers
[2] What else does your biometric data reveal? A survey on soft biometrics
[3] A survey on periocular biometrics research
[4] Soft-biometrics estimation in the era of facial masks
[5] Presentation attack detection methods for face recognition systems: A comprehensive survey
[6] SqueezeFacePoseNet: Lightweight face verification across different poses for mobile platforms
[7] Biometrics: In search of identity and security (Q&A)
[8] Introduction to selfie biometrics
[9] Facial soft biometrics for recognition in the wild: Recent works, annotation and COTS evaluation
[10] Bag of soft biometrics for person identification
[11] An introduction to biometric recognition
[12] Soft biometrics and their application in person recognition at a distance
[13] Search pruning in video surveillance systems: Efficiency-reliability tradeoff
[14] A benchmark methodology for child pornography detection
[15] Soft-biometric attributes from selfie images
[16] Convolutional neural network for age classification from smart-phone based ocular images
[17] Age estimation from facial parts using compact multi-stream convolutional neural networks
[18] Gender prediction from mobile ocular images: A feasibility study
[19] Convolutional neural networks for gender prediction from smartphone-based ocular images
[20] Recognizing gender from human facial regions using genetic algorithm
[21] Age and gender estimation of unfiltered faces
[22] Attribute-based continuous user authentication on mobile devices
[23] Gender recognition from mobile biometric data
[24] Investigating gender recognition in smartphones using accelerometer and gyroscope sensor readings
[25] Super-resolution for selfie biometrics: Introduction and application to face and iris
[26] Deep learning for biometrics: A survey
[27] A survey on deep learning based face recognition
[28] Age from faces in the deep learning revolution
[29] Deep facial expression recognition: A survey
[30] Deep residual learning for image recognition
[31] Multi-stage learning for gender and age prediction
[32] Age range estimation using MTCNN and VGG-Face model
[33] Fine-grained age estimation in the wild with attention LSTM networks
[34] ImageNet large scale visual recognition challenge
[35] Deep gender classification and visualization of near-infra-red periocular-iris images
[36] How transferable are CNN-based features for age and gender classification?
[37] Age estimation by multi-scale convolutional network
[38] VGGFace2: A dataset for recognising faces across pose and age
[39] SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
[40] MobileNetV2: Inverted residuals and linear bottlenecks
[41] MobileFaceNets: Efficient CNNs for accurate real-time face verification on mobile devices
[42] MobiFace: A lightweight deep learning face recognition on mobile devices
[43] Extract features from periocular region to identify the age using machine learning algorithms
[44] An exploration of gender identification using only the periocular region
[45] Eyebrow shape-based features for biometric recognition and gender classification: A feasibility study
[46] Periocular gender classification using global ICA features for poor quality images
[47] On using periocular biometric for gender classification in the wild
[48] Iris or periocular? Exploring sex prediction from near infrared ocular images
[49] Gender classification from multispectral periocular images
[50] Relevant features for gender classification in NIR periocular images
[51] Sex-classification from cellphones periocular iris images
[52] Gender classification from periocular NIR images using fusion of CNNs models
[53] Gender and ethnicity classification of iris images using deep class-encoder
[54] Fused spectral features in kernel weighted collaborative representation for gender classification using ocular images
[55] Analyzing covariate influence on gender and race prediction from near-infrared ocular images
[56] Effect of face and ocular multimodal biometric systems on gender classification
[57] CNN features off-the-shelf: An astounding baseline for recognition
[58] Do better ImageNet models transfer better?
[59] Squeeze-and-excitation networks
[60] MS-Celeb-1M: A dataset and benchmark for large-scale face recognition
[61] Age prediction from iris biometrics
[62] The prediction of old and young subjects from iris texture
[63] Learning to predict gender from iris images
[64] Predicting ethnicity and gender from iris texture
[65] Exploring gender prediction from iris biometrics
[66] Gender classification from iris images using fusion of uniform local binary patterns
[67] Gender classification from the same iris code used for recognition
[68] Gender-from-iris or gender-from-mascara?
[69] Gender classification from near infrared iris images
[70] Age estimation via face images: A survey
[71] Computational intelligence in automatic face age estimation: A survey
[72] Dlib-ml: A machine learning toolkit
[73] Periocular recognition using CNN features off-the-shelf
[74] Cross-sensor periocular biometrics: A comparative benchmark including smartphone authentication
[75] MobileNets: Efficient convolutional neural networks for mobile vision applications
[76] Identity mappings in deep residual networks
[77] Joint face detection and alignment using multitask cascaded convolutional networks
[78] The nature of statistical learning theory
[79] Visualizing data using t-SNE
[80] Age and gender classification using convolutional neural networks
[81] Deep face recognition
[82] SensitiveNets: Learning agnostic representations with application to face images
[83] Deep expectation of real and apparent age from a single image without facial landmarks