title: SELM: Siamese extreme learning machine with application to face biometrics
authors: Kudisthalert, Wasu; Pasupa, Kitsuchart; Morales, Aythami; Fierrez, Julian
date: 2022-03-15
journal: Neural Comput Appl
DOI: 10.1007/s00521-022-07100-z

Extreme learning machine (ELM) is a powerful classification method that is very competitive with existing classification methods and fast to train. Nevertheless, it cannot perform face verification tasks properly, because such tasks require comparing facial images of two individuals simultaneously and deciding whether the two faces belong to the same person. The ELM structure was not designed to be fed two input data streams simultaneously; thus, in 2-input scenarios, ELM methods are typically applied using concatenated inputs. However, this setup consumes twice the computational resources, and it is not optimized for recognition tasks where learning a separable distance metric is critical. For these reasons, we propose and develop a Siamese extreme learning machine (SELM). SELM is designed to be fed with two data streams in parallel simultaneously. It utilizes a dual-stream Siamese condition in an extra Siamese layer to transform the data before passing it to the hidden layer. Moreover, we propose a Gender-Ethnicity-dependent triplet feature trained exclusively on specific demographic groups, which enables learning and extracting useful facial features of each group. Experiments were conducted to evaluate and compare the performances of SELM, ELM, and a deep convolutional neural network (DCNN). The experimental results showed that the proposed feature could perform correct classification at [Formula: see text] accuracy and [Formula: see text] area under the curve (AUC). They also showed that using SELM in conjunction with the proposed feature provided 98.31% accuracy and 99.72% AUC, outperforming the well-known DCNN and ELM methods.

During the COVID-19 pandemic, a New Normal was introduced. People all around the world had to change their daily habits: they had to be constantly aware of their surroundings and keep everything around them clear of the virus at all times. The traveling history of every suspected COVID vector in an area had to be retraced when an infected person was detected in that area, e.g., everyone arriving at or leaving a building or community at the same time. Accurate personal identification is of utmost importance for retracing traveling history. At the time of writing, some communities required visitors to identify themselves correctly before they were permitted access. There are several ways to identify an individual, such as by ID card, passport, fingerprint, iris, or DNA [14, 28], but one of the most convenient ways in many setups (like the discussed retracing of travelers due to COVID-19) is facial identification. Numerous monitoring cameras have already been installed almost everywhere, such as in department stores, airports, border crossing facilities, cities, and transportation stations, as a security and surveillance measure. An accurate and reliable face identification algorithm is required to identify individuals by their facial features [23, 45].
Identification from facial features is a type of one-to-many mapping process, i.e., an unknown face is identified between multiple faces already registered in a database. The identification is assisted by taking into account demographic information-identity, age, gender, and ethnicity [19, 21, 54, 56] . On the other hand, a face verification task is a one-to-one mapping process. The task verifies whether the individual with the recognized face is the same person registered in a system [52] . This task is often used for authorizing a system, for example, for authorizing access to a mobile device or a laptop [43] . The advantage of this method over others like fingerprint recognition [2] is that it does not require anyone touching anything [17] . Face recognition techniques have been developed for decades [18] , e.g., geometric-based approaches [49] , local feature analysis [4] , dictionary-based learning [8, 42] , hand-crafted features [3, 29] and, recently, deep convolutional neural network (DCNN) [64] . Recently, many largescale face datasets containing millions of images have been available [7, 30, 36] for training deep learning models. Nevertheless, the class distributions of some variates in those datasets were rather imbalanced, causing statistical bias [54] . This issue was associated with an imbalanced representation of classes in a dataset. An effect of the bias was reported in [44] . They reported that algorithms invented by Asian researchers could distinguish Asian subjects better than Caucasian subjects. Conversely, algorithms from the West performed better on Caucasian subjects. Along the same line, a study by [6] reported that a commercial face recognition system yielded better outcomes on male individuals and lighter individuals but worse outcomes on darker females. Therefore, bias in class proportion and demographic variates would strongly affect a biometric system performance [48] . This concern could be alleviated by utilizing datasets evenly distributed across demographics [31, 47] . Training a model on a specific group could reduce data diversity and allows the model to learn better characteristics of each class. Interestingly, the performance of a model that was intensely trained on a very specific group, like male and female or every different ethnic in an area, might be superior to the performance of a conventionally trained model [1] . Face representation is an essential part of the face verification task. Historically, different representation techniques have been used to extract facial information from face images. In the past, hand-crafted techniques were employed to transform face images into useful features. For example, geometry-based features utilized face shape and its landmarks to represent the appearance of the face and its components. At the time of writing, the most competitive face representations are obtained using DCNN optimized according to different loss functions [12, 34, 61] . Among the different loss functions, triplet loss (a triplet network) is a distance-metric approach designed as a type of Siamese network [25] . This triplet network has a hierarchy that starts learning from low-level features to high-level features, i.e., from pixels to classes. It could be fed with two inputs in parallel. A pair of faces can be fed into a triplet network to output a similarity/distance coefficient between the two input face images. The value of this coefficient is then usually compared against a threshold. 
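As a minimal sketch of this similarity-and-threshold pipeline (ours, not the authors' code; the 512-dimensional embeddings are random stand-ins for DCNN face descriptors and the threshold value is arbitrary):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two face-embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def verify(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.5) -> bool:
    """Declare an identity match when the similarity exceeds the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold

# Stand-ins for the descriptors a deep network would produce for two faces.
rng = np.random.default_rng(0)
emb_a, emb_b = rng.normal(size=512), rng.normal(size=512)
print(verify(emb_a, emb_b))  # True -> same identity, False -> mismatch
```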
Several machine learning algorithms can be employed to enhance the performance of the face verification task. They can learn the data pattern and distinguish classes directly, instead of measuring a similarity/distance coefficient between two faces. Nevertheless, most of them cannot deal with this task without some modification, because their architecture was designed to be fed with one input at a time. This can be solved by linking the two inputs into a concatenated input. However, certain unavoidable biases would be introduced; e.g., the exact order of concatenation of the two inputs might introduce a bias, since a different order yields a different output. In this work, we restructured a well-known classification algorithm, extreme learning machine (ELM) [27], to accept twin inputs simultaneously and eliminate this kind of bias. The restructured algorithm is based on a single hidden layer feedforward neural network (SLFN). The following are the main contributions of the present paper:

• We propose a novel classification method for verification tasks called Siamese extreme learning machine (SELM). The proposed method adapts the conventional ELM architecture to process parallel inputs in an efficient way.
• We develop a demographic-dependent triplet model that improves the performance in face verification.
• The proposed framework is demonstrated to distinguish gender, ethnicity, and face accurately.
• We perform a performance comparison in face biometrics between biased and unbiased triplet models under different setups: subject-independent, gender-dependent, and gender-ethnicity-dependent.
• We carry out a performance comparison between Siamese and non-Siamese algorithms.

Some of the key challenges in face recognition are the following: (1) inadequate quality of facial images deteriorates the performance of face detection and verification [23]; and (2) biases between cohorts of people, especially with respect to privileged ones, deteriorate the performance of face recognition in general and introduce undesired discrimination between population groups [46, 51]. There are many powerful and well-known techniques for face recognition [45]. In this section, we first discuss the strengths and weaknesses of key techniques for face recognition with emphasis on the two challenges indicated above. Then, we position our proposed machine learning methods in context. Gender and race are two important demographic variates representing subject-specific characteristics of the human face. Other variates have also been proven useful for face recognition; for example, skin tone can help improve face recognition performance. Returning to demographic variates, Cook et al. [9] examined the effects of demographic variates on face recognition through leading commercial face biometric systems. They investigated the effects with a dataset of 363 subjects in a controlled environment and found that many demographic covariates significantly affected face recognition performance, including gender, age, eyewear, height, and especially skin reflectance. Lower skin reflectance (darker skin tone) was associated with lower efficiency (longer transaction time) and accuracy, in terms of mated similarity score. The study also revealed that skin reflectance was a significantly better predictor than self-identified race variates.
Buolamwini and Gebru [6] reported a significant bias in well-known commercial gender classification systems, i.e., Microsoft [11], IBM [24], and Face++. They found that darker-skinned females were the most misclassified group, with an error rate of 34.7%, while the misclassification rate of lighter-skinned males was only 0.8%. The largest difference in error rate between the best and the worst classified groups was 34.4%. They concluded that these three classification systems yielded the best accuracy for lighter-skinned individuals and males but the worst accuracy for darker-skinned females due to the mentioned bias. Several studies have reported that Caucasian and male individuals are easier to distinguish by face recognition algorithms [6, 9, 31]. Recently, Lu et al. [36] investigated the effects of demographic groups on face recognition and found that the difficulty of unconstrained face verification varies significantly with different demographic variates. Males are easier to verify than females, and old subjects are recognized better than young individuals, while light-pink skin tone is recognized with the best performance. Moreover, gender and skin tone variates are not significantly correlated. On the other hand, some works have exploited the inherent differences between population groups for stronger and fairer recognition. Phillips et al. [44] and O'Toole et al. [40] showed the importance of demographic composition and modeling. They reported that recognizing face identities from a homogeneous population (same-race distribution) was easier than from a heterogeneous population. Liu et al. [37] showed that a model trained on a set containing facial images of Caucasians and East Asians at a ratio of 3:1 was better at identifying East Asians in every case. Klare et al. [31] and Vera-Rodriguez et al. [57] improved face-matching accuracy by training exclusively on specific demographic cohorts whose demographic variates were evenly distributed. This solution can reduce face bias and increase accuracy across all demographic cohorts. Vera-Rodriguez et al. [57] proposed a gender-dependent training approach to improve face verification performance that reduced the effect of gender as a recognition covariate; the approach improved AUC performance from 94.0 to 95.2. Vera-Rodriguez et al. [57] and Serna et al. [46, 47] applied deep learning methods to train face recognition models and benchmarked the models over multiple privileged classes. Conventional methods (not exploiting data diversity) resulted in poor performance when demographic diversity was large. Their experimental results showed a big performance gap between the best class (Male-White) and the worst class (Female-Black), reaching up to 200%. The above studies also demonstrated that training the models on specific demographic cohorts can be a possible solution to those large performance differences between cohorts. For example, useful features for distinguishing black individuals may differ from those for white individuals. Thus, training a model with specific groups of individuals may direct the model to learn the special characteristics of those groups better. Many well-known large-scale face recognition datasets have been published, such as MS-Celeb-1M [36], Megaface [30], and VGGFace2 [7].
These datasets contain more than a million face images each, but most of them are highly biased, composed mainly of Caucasian people (70%+), while 40%+ come from a Male-Caucasian cohort. Recently, Wang et al. [58, 59] introduced diverse and discrimination-aware face databases with evenly distributed populations: Asian, Black, Caucasian, and Indian. However, they did not balance the gender distribution. Along the same line, Morales et al. [39] introduced the DiveFace database with equal distribution over six demographic groups: Female-Asian, Male-Asian, Female-Black, Male-Black, Female-Caucasian, and Male-Caucasian. The dataset was designed to be unbiased in terms of Gender and Ethnicity, which is useful both for training fair recognizers and for evaluating them in terms of fairness across population groups. Machine learning classification techniques have been popular for face recognition tasks. Successful algorithms are, for example, random forests [35], support vector machines (SVM) [10], ELM [22], and DCNN [62], the last one now dominating the field. Goswami et al. [20] summarized the performances of features extracted by deep and shallow feature extractor approaches; the experimental results clearly showed the superiority of deep features. Other works such as Liu et al. [35], Bianco [5], and Wong et al. [63] have also shown the robustness and improved recognition of face biometrics based on features extracted from DCNNs. However, the typical classification architecture in those works was designed to be fed with one input image at a time. To compare two input faces (e.g., for authentication), there is a need to extend the basic DCNN architecture to process two inputs. One popular approach to exploit a DCNN backbone for comparing two inputs is the Siamese architecture. The concept is to train a feature representation by comparing pairs of facial images; the conceptual diagram is shown in Fig. 1. In this work, we adopt this architecture in combination with an ELM (cf. Sect. 3.1 for an introduction to this type of network). ELMs have been shown to be quite successful in various tasks related to face biometrics, but so far, Siamese architectures have not been explored for enhancing basic ELM methods. As examples of ELMs for face biometrics, Laiadi et al. [33] predicted kinship relationships by comparing facial appearances. They used three different types of features: deep features using the VGG-Face model, and BSIF-Tensor and LPQ-Tensor features using MSIDA. These three features of the two considered face images were compared by cosine similarity, and the measured values were concatenated as a vector for computing a kinship score by ELM. The proposed approach was up to 3% more accurate than a baseline ResNet-based method. Wong et al. [63] adopted ELM to tackle face verification. They added a top layer to DeepID [53] with ELM as the classification layer instead of a soft-max layer. This approach improved accuracy by 1.32% and 26.33% over conventional DeepID and ELM, respectively. In this paper, we develop and explore a novel Siamese classification algorithm for face verification with an ELM backbone. This concept is motivated by the need to improve the architecture of machine learning methods for verification tasks. In particular, the proposed method aims to reduce the performance gap of pre-trained deep feature extractors across different demographic groups.
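As a toy illustration (ours) of the order bias discussed in the introduction: a classifier fed concatenated pairs sees (A, B) and (B, A) as different inputs, whereas symmetric combinations, like the four later used in SELM's Siamese layer, map both orders to the same vector.

```python
import numpy as np

a = np.array([0.2, 0.7, 0.1])  # toy feature vector of face A
b = np.array([0.9, 0.3, 0.5])  # toy feature vector of face B

# Concatenation is order-dependent: the classifier sees two different points.
print(np.array_equal(np.concatenate([a, b]), np.concatenate([b, a])))  # False

# Symmetric combinations are order-independent.
combinations = {
    "sum":  (a + b,         b + a),
    "dist": (np.abs(a - b), np.abs(b - a)),
    "mult": (a * b,         b * a),        # Hadamard product
    "mean": ((a + b) / 2,   (b + a) / 2),
}
for name, (ab, ba) in combinations.items():
    print(name, np.array_equal(ab, ba))    # True for all four
```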
The proposed algorithm utilizes trained feature representations from a DCNN together with an improved version of ELM, redesigned as a Siamese architecture, as a classifier. It compares pairs of facial images based on demographic attributes. These traits are used as factors for selecting feature extraction models and to guide the learning process. The main aim of this work is to boost the performance of the algorithm by decreasing the verification errors on all the demographic groups. A secondary aim is to investigate the dependency of the performance on demographic variates.

ELMs were first introduced by Huang et al. [27]. They are based on an SLFN architecture whose weights are obtained by the closed-form solution of an inverse problem, instead of the typical iterative back-propagation optimization. It has been demonstrated that this closed-form solution in ELMs yields a small classification error and extremely fast learning. Assume that $\mathbf{x}$ is an input sample, $\mathbf{x} \in \mathbb{R}^m$. The ELM architecture consists of $m$ input neurons ($m$ = input dimensions). The input neurons are fully connected with $l$ hidden neurons, each one with weighted inputs according to $\mathbf{w}_i$, with $i = 1, \dots, l$, $\mathbf{w}_i \in \mathbb{R}^m$. The weights between the hidden layer and the output layer are the hidden layer output weights $\boldsymbol{\beta}$, which determine the prediction outputs $\hat{\mathbf{y}}$.

Fig. 1: The Siamese network concept was designed to deal with particular classification problems, such as validation tasks. The architecture consists of three components, i.e., input image, feature extractor, and classifier method.

The model is expressed mathematically as (scalars in italics, column vectors in bold lowercase, matrices in bold uppercase, $\top$ denotes transpose):

$$\hat{y} = \sum_{i=1}^{l} \beta_i \, g(\mathbf{w}_i^\top \mathbf{x} + b),$$

where $b$ is a bias and $n$ is the number of input samples. The hidden layer output matrix $\mathbf{H}$ is obtained by applying an activation function $g(\cdot)$ to a linear combination of the input $\mathbf{X}$ and the synaptic weights $\mathbf{W}$ plus the bias $b$, where $\mathbf{H} \in \mathbb{R}^{n \times l}$, input matrix $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n]^\top$, $\mathbf{X} \in \mathbb{R}^{n \times m}$, and set of weights $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_l]^\top$, $\mathbf{W} \in \mathbb{R}^{l \times m}$. It should be noted that the sets of $\mathbf{w}$ and $b$ are randomly generated once to speed up the training process. Therefore, the activity of the hidden nodes can be written as:

$$\mathbf{H} = g(\mathbf{X}\mathbf{W}^\top + b).$$

The prediction score is then expressed by:

$$\hat{\mathbf{y}} = \mathbf{H}\boldsymbol{\beta}.$$

ELM minimizes the mean square error between the true target labels $\mathbf{y}$ and the predicted targets $\hat{\mathbf{y}}$ with the following objective function:

$$\min_{\boldsymbol{\beta}} \|\mathbf{H}\boldsymbol{\beta} - \mathbf{y}\|^2.$$

The optimal solution for the hidden layer output weights $\boldsymbol{\beta}$ is finally calculated by the Moore-Penrose pseudo-inverse:

$$\boldsymbol{\beta} = \mathbf{H}^{\dagger}\mathbf{y}.$$

The architecture of WELM, a weighted-similarity variant of ELM [32], is shown in Fig. 2, where the conventional activation function $g(\cdot)$, e.g., sigmoid or radial basis function, is replaced with a similarity-based activation function $s(\cdot)$, e.g., cosine similarity or Euclidean distance. WELM can reduce training time because it does not need any tuning of kernel parameters, and it yields better performance, especially when dealing with similarity-based tasks [32, 41]. In WELM, the $\mathbf{H}$ matrix of the conventional activation is replaced by:

$$\mathbf{H} = \begin{bmatrix} s(\mathbf{x}_1, \mathbf{w}_1) & \dots & s(\mathbf{x}_1, \mathbf{w}_l) \\ \vdots & \ddots & \vdots \\ s(\mathbf{x}_n, \mathbf{w}_1) & \dots & s(\mathbf{x}_n, \mathbf{w}_l) \end{bmatrix}.$$

The set of weights $\mathbf{W}$ is randomly selected from the training set $\mathbf{X}$; thus, $\mathbf{W} \subset \mathbf{X}$. This paper proposes a novel SELM architecture to handle verification tasks that require the simultaneous comparison of two identities. SELM is developed on a WELM network backbone.
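Before moving on to SELM, the closed-form ELM training above and the WELM similarity layer can be sketched in a few lines of NumPy (our illustration, not the authors' code; the sigmoid activation and RBF similarity are example choices):

```python
import numpy as np

def elm_train(X, y, n_hidden, rng):
    """ELM: random input weights, closed-form output weights beta = H^+ y."""
    W = rng.normal(size=(n_hidden, X.shape[1]))  # random weights w_i
    b = rng.normal(size=n_hidden)                # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))     # H = g(XW^T + b), sigmoid g
    beta = np.linalg.pinv(H) @ y                 # Moore-Penrose pseudo-inverse
    return W, b, beta

def welm_hidden(X, W, gamma=1.0):
    """WELM: similarity activation s(.) instead of g(.), with W drawn from X."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)                   # RBF similarity s(x_j, w_i)

# Usage on toy data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.integers(0, 2, 100).astype(float)
W, b, beta = elm_train(X, y, n_hidden=20, rng=rng)
y_hat = (1.0 / (1.0 + np.exp(-(X @ W.T + b)))) @ beta  # prediction scores
```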
Input vectors $\mathbf{x}_A$ and $\mathbf{x}_B$ from identities A and B, respectively, are fed into WELM after a Siamese input layer, turning the conventional WELM architecture into a SELM architecture capable of feeding two inputs simultaneously and in parallel into the network, as shown in Fig. 3. A Siamese condition function $sc(\cdot)$ in the Siamese layer is the core of SELM. The function combines the two input vectors using one of the following equations:

• Summation condition function: $\mathbf{x} = \mathbf{x}_A + \mathbf{x}_B$
• Distance condition function: $\mathbf{x} = |\mathbf{x}_A - \mathbf{x}_B|$
• Multiply (Hadamard product) condition function: $\mathbf{x} = \mathbf{x}_A \odot \mathbf{x}_B$
• Mean condition function: $\mathbf{x} = (\mathbf{x}_A + \mathbf{x}_B)/2$

Note that this Siamese layer can also be interpreted as an initial feature-level information fusion stage [15]. The pseudocodes of the training and prediction processes of SELM are shown in Algorithm 1. The training process takes the training input matrix $\mathbf{X}_{Train}$ and the class labels $\mathbf{y}$. The training samples $\mathbf{X}_{Train}$ are then paired, and the Siamese condition is computed in the Siamese layer to obtain $\mathbf{X}_{Train,EL}$. The weight samples $\mathbf{W}$ are a subset of $\mathbf{X}_{Train,EL}$, randomly selected with a normal distribution function from $\mathbf{X}_{Train,EL}$. The hidden layer $\hat{\mathbf{H}}$ measures the similarity between $\mathbf{X}_{Train,EL}$ and $\mathbf{W}$ and is used to calculate the hidden layer output weights $\boldsymbol{\beta}$ in the next step. It should be noted that the SELM algorithm can converge with a small amount of training data and little time consumption: due to the use of the Moore-Penrose pseudo-inverse in computing the hidden layer output weights $\boldsymbol{\beta}$, the solution is guaranteed to be the global minimum in a single step. However, SELM requires a large amount of memory to train a model, as it cannot be fed data in small batches, unlike random forest or DCNN techniques.

Fig. 3: The SELM architecture was designed to deal with validation tasks. The extra Siamese layer is added between the input and hidden layers in order to calculate the new input $\mathbf{x}$ with the Siamese condition function $sc(\cdot)$ between the inputs from identities A and B.
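A compact sketch of the SELM training and scoring just described (our reading of Algorithm 1, not the authors' release; the RBF similarity, the uniform sampling of W, and all parameter values are assumptions, whereas the paper samples W with a normal distribution function):

```python
import numpy as np

# The four Siamese condition functions sc(.) from the list above.
SIAMESE_CONDITIONS = {
    "sum":  lambda xa, xb: xa + xb,
    "dist": lambda xa, xb: np.abs(xa - xb),
    "mult": lambda xa, xb: xa * xb,        # Hadamard product
    "mean": lambda xa, xb: (xa + xb) / 2,
}

def _hidden(X, W, gamma):
    """Similarity-based hidden layer; an RBF similarity is assumed here."""
    return np.exp(-gamma * ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1))

def selm_train(Xa, Xb, y, condition="sum", hidden_frac=0.2, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = SIAMESE_CONDITIONS[condition](Xa, Xb)           # Siamese layer
    n_hidden = max(1, int(hidden_frac * len(X)))
    W = X[rng.choice(len(X), n_hidden, replace=False)]  # W subset of X_train
    H = _hidden(X, W, gamma)
    beta = np.linalg.pinv(H) @ y                        # closed form, one step
    return W, beta

def selm_score(Xa, Xb, W, beta, condition="sum", gamma=1.0):
    X = SIAMESE_CONDITIONS[condition](Xa, Xb)
    return _hidden(X, W, gamma) @ beta                  # genuine/impostor scores
```

Because every condition is symmetric, `selm_score(Xa, Xb, ...)` and `selm_score(Xb, Xa, ...)` are identical by construction, which is exactly the order invariance that concatenated inputs lack.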
The triplet network model was proposed for learning useful representations by distance comparisons [25] between three samples: an anchor sample $x$, a positive sample $x^+$, and a negative sample $x^-$. The triplet network structure is shown in Fig. 4. The network employs DCNNs as the backbone to optimize the model's weights with backpropagation. These core networks are identical and share the same weights. The triplet network aims to minimize the distance $d_p$ between the anchor and the positive sample and to maximize the distance $d_n$ between the anchor and the negative sample. The positive sample and the anchor sample come from the same identity, while the negative sample comes from a different identity. The Euclidean distances $d_p$ and $d_n$ are expressed as

$$d_p = \|f(x) - f(x^+)\|_2, \qquad d_n = \|f(x) - f(x^-)\|_2,$$

where $f(\cdot)$ denotes the embedding produced by the network. The triplet loss is then calculated as the loss function of the network as follows:

$$L = \max(d_p - d_n + \alpha, 0),$$

where the parameter $\alpha$ is a soft margin. The objective of the learning function is to satisfy $d_n \geq d_p + \alpha$. In this study, we trained a number of triplet networks on several demographic groups so that they could learn population-specific facial information.

Fig. 4: The triplet network structure comprises input vectors from the anchor $x$, positive $x^+$, and negative $x^-$ samples, a feature extractor Net (which can be a DCNN), and a comparator used to classify the identity.

The workflow of the proposed framework is shown in Fig. 5. It consists of five stages and was designed to verify the identity of two input facial images. The input images are first classified into gender and ethnicity to select gender- and ethnicity-dependent triplet models for each input. The details of each stage are explained below.

1. First stage (Input): input color facial images are first cropped and aligned properly [13] before being fed into the next stage. It should be noted that the two images pass in parallel through every process in the framework simultaneously.

Second, machine learning models are applied to verify whether both images come from the same identity. In this work, we compare the proposed SELM approach to the performance of standard ELM and ResNet (now one of the most common DCNNs used for face recognition [33]). Incidentally, ResNet is also a core component of our proposed approach for training the triplet models.

In this study, we used two datasets: DiveFace and Labeled Faces in the Wild. DiveFace is a diversity-aware face recognition dataset for training models such as gender classification, ethnicity classification, and Gender-Ethnicity-dependent triplet models. Labeled Faces in the Wild is a well-known large-scale face dataset in the face recognition domain, used here for performance evaluation. DiveFace was constructed to be an unbiased face recognition dataset. Each image was carefully selected from the Megaface MF2 training dataset [30], which contains 4.7 million faces from 672K identities from the Yahoo Flickr dataset [55]. There are 24,000 identities from six demographic groups, 4000 identities for each group, and three poses for each identity. Thus, each demographic group contains 12,000 faces, for a total of 72,000 faces in the whole dataset (see Table 1). The identities in the DiveFace database are equally distributed among six classes (16.67% for each class) related to gender (Female-Male) and ethnicity. Three ethnicity categories are available, related to the physical characteristics of each ethnic group:

• Group 1: people with ancestral origin in Japan, China, Korea, and other countries in that region.
• Group 2: people with ancestral origins in Sub-Saharan Africa, India, Bangladesh, Bhutan, and others.
• Group 3: people with ancestral origins in Europe, North America, and Latin America with European origin.

In this study, we denote Groups 1, 2, and 3 as Asian, Black, and Caucasian, respectively. A two-dimensional t-distributed stochastic neighbor embedding (t-SNE) [38] of ResNet-50 descriptors of the full DiveFace dataset is shown in Fig. 6. The six clusters are clearly separated from each other. However, a few data points in the Male-Black category also fall within the clusters of Male-Asian and Male-Caucasian. The Labeled Faces in the Wild (LFW) database was introduced to evaluate the performance of face verification algorithms under unconstrained conditions, such as position, pose, lighting, background, camera quality, and gender [26]. The database contains 13,233 faces collected from the web from 5,749 unique individuals. LFW was published in 2007 and has been a very popular database in the face recognition field. LFW has already been appropriately split into standard training and test sets. In this work, we used the test set to evaluate our framework's performance. It contains a balanced set of 1000 sample pairs (500 pairs of genuine facial images and 500 pairs of imposter images). This study divided the DiveFace dataset into training, validation, and test sets: the training set size was 60% of the whole dataset, the validation set 10%, and the test set 30%. The training set was used to train the gender and ethnicity classifier models and the triplet models; the validation set was used to select optimal models; and the test set was used to evaluate the prediction performances of all tested models. The performance of our entire framework, on the other hand, was evaluated with the LFW database.
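As a minimal sketch (ours) of the triplet objective optimized when training those triplet models, following the equations above; the margin value and embeddings are placeholders:

```python
import numpy as np

def triplet_loss(f_anchor, f_pos, f_neg, alpha=0.2):
    """Triplet loss L = max(d_p - d_n + alpha, 0) on embedded samples."""
    d_p = np.linalg.norm(f_anchor - f_pos)  # anchor-positive distance
    d_n = np.linalg.norm(f_anchor - f_neg)  # anchor-negative distance
    return max(d_p - d_n + alpha, 0.0)      # zero once d_n >= d_p + alpha

# Toy embeddings: the positive lies near the anchor, the negative far away.
rng = np.random.default_rng(0)
anchor = rng.normal(size=128)
positive = anchor + rng.normal(scale=0.05, size=128)
negative = rng.normal(size=128)
print(triplet_loss(anchor, positive, negative))
```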
The average and standard deviation of the metrics over ten experimental runs, each with a different random split, are reported. For image pairing, the set of positive samples was constructed by pairing all pose images in all possible ways within each identity, while the set of negative samples was constructed by randomly pairing different identities. The performance of the proposed SELM is evaluated in comparison with ResNet and ELM. ResNet is one of the most well-known DCNN methods. We used a ResNet-50 architecture pre-trained for face recognition on VGGFace2 (millions of images) as a comparison baseline. The pre-trained ResNet-50 was then used to train our triplet models. These triplet models classify input image pairs into two classes (genuine or impostor match) based on Euclidean distance. The ELM and SELM methods have a similar architecture, based on an SLFN that can be trained much faster than common artificial neural networks; SELM has one additional layer (the Siamese layer). Both ELM and SELM use a kernel trick together with a pseudo-inverse technique to generate the weights of the model that provide the lowest error rate. Moreover, we evaluate the performance when using four different Siamese conditions to improve the classification outcome. As for parameter settings, the parameters of the three methods were tuned to obtain the best results. The false acceptance rate (FAR) and false rejection rate (FRR) were used to find an optimal threshold, taken at the Equal Error Rate (EER). For ELM, three parameters were tuned: the regularization parameter $C$, set in $[10^{-6}, 10^{-5}, \dots, 10^{5}, 10^{6}]$; the percentage of hidden nodes, in the range $[10, 20, \dots, 90, 100]\%$; and the gamma $\gamma$ of the RBF kernel, in $[10^{-6}, 10^{-5}, \dots, 10^{5}, 10^{6}]$. For SELM, two parameters were tuned: the regularization parameter $C$ and the percentage of hidden nodes. As for ResNet, we used the same Euclidean coefficients for calculating the loss function as those used in the kernel trick in SELM; hence, no kernel parameters needed to be tuned.

In this section, we report the experimental results for the following types of evaluation: evaluation of feature performance, evaluation of classifiers, evaluation of Siamese and non-Siamese architectures, and evaluation of the performance of the whole framework. Two evaluation metrics are employed: verification accuracy and area under the curve (AUC). The average and standard deviation of ten runs are reported for each experiment. The performances of all features used in the experiments are presented in this section. ResNet-50 was used to train three different feature-extraction models, trained as follows: a subject-independent (SI) model, trained without regard to demographic cohorts; a Gender-dependent (GD) model, trained independently on each gender cohort; and a Gender-Ethnicity-dependent (GED) model, trained independently on each of the six considered cohorts. The number of training samples from every cohort was assigned to be the same.
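The threshold selection at the EER described above can be sketched as follows (our illustration with synthetic score distributions; the paper does not publish this code):

```python
import numpy as np

def eer_threshold(genuine_scores, impostor_scores):
    """Sweep thresholds; return the one where FAR and FRR are closest."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))   # closest FAR/FRR crossing
    return thresholds[i], far[i], frr[i]

# Synthetic genuine/impostor similarity scores, e.g., 500 pairs of each.
rng = np.random.default_rng(0)
genuine = rng.normal(0.8, 0.1, 500)
impostor = rng.normal(0.4, 0.1, 500)
threshold, far, frr = eer_threshold(genuine, impostor)
```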
The experimental results on the DiveFace dataset are shown in Table 2a and b. The best features among all types of features in every cohort are marked in bold. As can be seen in Table 2, the accuracy and AUC values reflect each other: the higher the accuracy, the higher the AUC, and vice versa. The feature performance of SI, the baseline, was the worst, but it still reached up to 93.65% overall accuracy and 97.87% overall AUC; it was therefore a challenge to improve on those metrics. Nevertheless, GED and GD yielded better AUC performance: 99.45% and 99.04%, respectively. GED turns out to be the best among the tested methods, followed by GD. Furthermore, GED exhibits better metrics for every cohort compared to SI and GD. This result confirms our hypothesis that training on specific, distinctive groups can induce the model to learn more useful facial features. The reason that GD performed better than SI, and GED better than GD, is that GD learned intensively and independently on each gender group, while GED learned in the same way but on both gender and ethnicity groups. Nevertheless, GED performance was only 0.41% better than that of GD. To check whether that difference was significant, we used a one-way ANOVA to test the null hypothesis that SI, GD, and GED have the same population mean ($\mu_{SI} = \mu_{GD} = \mu_{GED}$) [50]. The statistical result, $F = 144.06$, indicates that the difference is statistically significant at a level of $p < .001$; hence the null hypothesis $H_0$ was rejected. GED is the best feature type among the three models tested in this work.

The performances of ResNet, ELM, and SELM embedded with four different types of Siamese conditions (summation, distance, multiply, and mean, denoted as Sum, Dist, Mult, and Mean, respectively) are shown in Table 3a and b. We used the best feature, GED, obtained from the previous experiment (Sect. 6.1). Table 3 lists the performance metrics (accuracy and AUC) achieved by the proposed SELM in comparison with standard ELM and the ResNet baseline. The best metric achieved by the best classifier candidate for each identity cohort is marked in bold. The experimental results in Table 3a show that SELM_Mean is the best classification method in terms of overall accuracy score, followed by SELM_Sum, SELM_Mult, SELM_Dist, ResNet, and ELM. SELM_Mean yields the highest accuracy for four of the six demographic groups, SELM_Sum for two of the six, and SELM_Mult for one group. Nevertheless, the accuracy scores achieved by the first and second best methods, SELM_Mean and SELM_Sum, differ by only 0.01%. Furthermore, SELM_Sum achieves the highest AUC (99.72) for only one of the six groups, but SELM_Mean (99.72) achieves the highest AUC for four of the six groups. SELM_Dist, ELM, SELM_Mult, and ResNet follow those two in this order. Figure 7 compares the number of wins of SELM_Sum and SELM_Mean in terms of both accuracy and AUC. Since the graphs aggregate data from ten experimental runs over six demographic cohorts, the ideal score would be 10 × 6 = 60. Figure 7 shows clearly that SELM_Sum is better than SELM_Mean in 51 out of 60 cases in terms of accuracy and 45 out of 60 cases in terms of AUC. The performances of SELM_Sum and SELM_Mean were thus very competitive, with almost identical results, because the Sum and Mean Siamese conditions are the same function up to a constant factor.
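For reference, the one-way ANOVA used above can be reproduced along these lines (a sketch with toy per-run AUC values seeded from the reported means; not the paper's data):

```python
import numpy as np
from scipy import stats

# Toy per-run AUCs for the three feature types; means follow the text above.
rng = np.random.default_rng(0)
auc_si  = rng.normal(97.87, 0.3, 10)
auc_gd  = rng.normal(99.04, 0.3, 10)
auc_ged = rng.normal(99.45, 0.3, 10)

# H0: mu_SI = mu_GD = mu_GED; the paper reports F = 144.06, p < .001.
f_stat, p_value = stats.f_oneway(auc_si, auc_gd, auc_ged)
print(f_stat, p_value)
```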
In addition, we show the accumulated AUC-score ranks across the ten experimental runs as a way to rank the methods in Fig. 8. The ideal summation would be first rank in all 60 experimental runs, i.e., 60 is the lowest accumulated rank possible (best method); at the other extreme, sixth rank in all runs, i.e., 360, would be the highest accumulated rank possible (worst method). We then used Kendall's coefficient of concordance $W$ to calculate the degree of reliability of the ranked order:

$$W = \frac{12 \sum_{i=1}^{N} \left( \bar{R}_i - \frac{N+1}{2} \right)^2}{N(N^2 - 1)},$$

where $\bar{R}_i$ is the average rank assigned to the $i$-th candidate; $N$ is the number of candidate methods (six); and the number of runs times the number of cohort groups is $k = 60$. The value of $W$ was found to be 0.7526. The critical value in the $\chi^2$ distribution was converted from $W$ by the following equation:

$$\chi^2 = k(N-1)W.$$

We obtained $\chi^2 = 225.78$, which indicates that the ranked order shown in Fig. 8 is reliable at a confidence level of 99.9%.

In this section, we compare the performance of the most robust Siamese architecture (SELM_Sum) to that of WELM, an ELM with a non-Siamese architecture. Their backbone architectures are identical except for the additional Siamese layer in SELM. The simultaneous dual inputs were concatenated for training the WELM network, whereas in SELM they were not concatenated but passed through the Siamese layer; all subsequent procedural steps of the two architectures are the same. Figure 9 shows the accuracy and AUC performances of WELM and SELM for an increasing number of hidden nodes used to train the model (Fig. 9a and b, respectively). The performance values are obtained by averaging across the six available demographic cohorts. It can be seen that WELM has to use a large number of hidden nodes, up to 80% of the training samples, in order to compete with SELM, while SELM needs less than 10% to achieve excellent results. The optimal WELM model achieves 94% accuracy when the number of its hidden nodes is 99.0% of the training samples, while SELM achieves 97.00% accuracy with only 81%. It should also be noted that SELM achieved 96.80% accuracy and 99.50% AUC with only 20% hidden nodes. We used a two-sample t-test to check the statistical significance of the difference between the mean scores of both methods at $p < .001$ and found that the t-values for accuracy and AUC are $t = 9.08$ and $t = 6.78$, respectively. Hence, we conclude that the proposed Siamese ELM performs significantly better than the standard non-Siamese ELM.
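A minimal sketch (ours, with toy data) of the two statistical checks used in this section: the two-sample t-test just mentioned and the Kendall's W concordance computation with its $\chi^2$ conversion:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two-sample t-test on toy per-run accuracies of SELM vs. WELM.
acc_selm = rng.normal(97.0, 0.5, 60)
acc_welm = rng.normal(94.0, 0.5, 60)
t_value, p_value = stats.ttest_ind(acc_selm, acc_welm)

def kendalls_w(ranks):
    """Kendall's W from a (k, N) array: k rankings of N candidate methods."""
    k, n = ranks.shape
    s = ((ranks.mean(axis=0) - (n + 1) / 2) ** 2).sum()
    return 12.0 * s / (n * (n ** 2 - 1))

ranks = np.array([rng.permutation(6) + 1 for _ in range(60)])  # k=60, N=6
w = kendalls_w(ranks)
chi2 = 60 * (6 - 1) * w  # chi2 = k(N-1)W; the paper's W=0.7526 gives 225.78
```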
We evaluated the proposed system, described in Sect. 4, in conjunction with the most robust feature, GED (Sect. 6.1), and the most robust classification method, SELM_Sum (Sect. 6.2). The whole system is termed SELM_Sum^GED. It should be noted that the proposed system first classifies individuals according to their respective Gender-Ethnicity class so that a proper feature-extraction model can be selected, and input image pairs that are not in the same Gender-Ethnicity class are classified as impostor comparisons. SELM_Sum^SI is similar to SELM_Sum^GED but without the initial Gender-Ethnicity classification. In Fig. 10, we show the performances of ResNet (baseline), SELM_Sum^GED, and SELM_Sum^SI tested on the standard test set of the LFW database. The ranked order of each demographic group is shown on top of the bar representing that group in Fig. 10. It can be seen that SELM_Sum^SI is the best method, producing the smallest sum of ranked orders (9.5), followed by ResNet (10.5) and SELM_Sum^GED (16). The performances of both SELM_Sum^SI and SELM_Sum^GED for the Black demographic class are lower than those obtained for the Asian and Caucasian classes. This is because the systems were trained on DiveFace, whose Black group contains individuals with origins in the Sub-Saharan region, Africa, India, Bangladesh, and Bhutan, while the Black group in the tested LFW dataset is not well represented by those regions. Regarding the performance of SELM_Sum^GED, it works like a two-stage prediction system, and the accuracy of the final prediction in the second stage depends highly on the performance of the first stage, the Gender-Ethnicity prediction model. In this study, SELM_Sum^GED yielded very accurate outcomes when the first stage provided an ideal classification of the Gender-Ethnicity group. Figure 11 shows bar graphs of two evaluation metrics, the false acceptance rate (FAR) and false rejection rate (FRR), produced by ResNet, SELM_Sum^GED, and SELM_Sum^SI. FAR is considered the most important metric for this kind of task, as it represents the rate at which wrong persons are given access to the system. The performance results show that both SELM_Sum^GED and SELM_Sum^SI provided a very low FAR (0.2%), 12 times lower than that provided by ResNet (2.4%), indicating that they would make far fewer errors in face recognition tasks.

A framework for face verification has been proposed. The framework employs a new classification method called Siamese extreme learning machine (SELM), an improved version of the powerful extreme learning machine classifier that can accept two image inputs in parallel and process them concurrently. It utilizes trained feature representation techniques together with a Siamese architecture to accomplish the framework. In our performance evaluation, SELM was studied in conjunction with several features trained on unbiased demographic-dependent groups. With this training, the feature-extraction model in our proposed SELM was able to recognize distinct features of individuals in demographic groups better than a conventional feature-extraction model. In an evaluation experiment, four different types of Siamese conditions embedded in the Siamese layer were compared; SELM with the summation and mean conditions provided the highest overall performance scores. Furthermore, in another experiment, SELM with the Sum Siamese condition was demonstrated to be more robust than the baseline ResNet and ELM methods. In particular, the proposed method performed the verification task better than the other methods, with 98.31% accuracy and 99.72% AUC. More importantly, SELM_Sum^SI provided a very low 0.2% false acceptance rate, 12 times lower than that provided by ResNet (2.4%), a considerable improvement. For future work, we aim to do the following: (i) train our own face recognition model from scratch to eliminate any bias from the beginning [54]; (ii) explore other architectures for processing multiple inputs on top of ELM backbones beyond Siamese settings, using recent advances from the information fusion field [16]; and (iii) apply SELM to other types of image comparison tasks in addition to human face verification.

Conflict of interest: The authors declare that they have no competing interests.
References

Measuring the gender and ethnicity bias in deep models for face recognition
Learned vs. hand-crafted features for pedestrian gender recognition
A face recognition system based on local feature analysis
Large age-gap face verification by feature injection in deep networks
Gender shades: intersectional accuracy disparities in commercial gender classification
VGGFace2: a dataset for recognising faces across pose and age
Dictionary-based face recognition from video
Demographic effects in facial recognition and their dependence on image acquisition: an evaluation of eleven commercial systems
Improved face recognition rate using HOG features and SVM classifier
Introducing Microsoft cognitive services
ArcFace: additive angular margin loss for deep face recognition
RetinaFace: single-shot multi-level face localisation in the wild
Adapted fusion schemes for multimodal biometric authentication
Multiple classifiers in biometrics. Part 1: fundamentals and review
Multiple classifiers in biometrics. Part 2: trends and challenges
Benchmarking touchscreen biometrics for mobile authentication
Study on face identification technology for its implementation in the Schengen information system. Publications Office of the European Union
Facial soft biometrics for recognition in the wild: recent works, annotation and COTS evaluation
Face verification via learned representation on feature-rich video frames
Learning meta face recognition in unseen domains
Kernel ELM and CNN based facial age estimation
Biometric quality: review and application to face recognition with FaceQnet
The era of cognitive systems: an inside look at IBM Watson and how it works. IBM Corporation, Redbooks
Deep metric learning using triplet network
Labeled faces in the wild: a database for studying face recognition in unconstrained environments
Extreme learning machine: a new learning scheme of feedforward neural networks
50 years of biometric research: accomplishments, challenges, and opportunities
Hand-crafted features or machine learnt features? Together they improve RGB-D object recognition
The MegaFace benchmark: 1 million faces for recognition at scale
Face recognition performance: role of demographic information
Counting and classification of malarial parasite from Giemsa-stained thin film images
Kinship verification based deep and tensor features through extreme learning machine
SphereFace: deep hypersphere embedding for face recognition
Conditional convolution neural network enhanced random forest for facial expression recognition
An experimental evaluation of covariates effects on unconstrained face verification
A meta-analysis of face recognition covariates
Visualizing data using t-SNE
SensitiveNets: learning agnostic representations with application to face images
Demographic effects on estimates of automatic face recognition performance
Virtual screening by a new clustering-based weighted similarity extreme learning machine approach
Dictionary-based face recognition under variable lighting and pose
Quickest intruder detection for multiple user active authentication
An other-race effect for face recognition algorithms
Deep learning for understanding faces: machines may be just as good, or better, than humans
Algorithmic discrimination: formulation and exploration in deep learning-based face biometrics
SensitiveLoss: improving accuracy and fairness of face representations with discrimination-aware deep learning
InsideBias: measuring bias in deep networks and application to face gender biometrics
How effective are landmarks and their geometry for face recognition
Nonparametric statistics for the behavioral sciences
FairFace challenge at ECCV 2020: analyzing bias in face recognition
Hybrid deep learning for face verification
Deep learning face representation from predicting 10,000 classes
A comprehensive study on face recognition biases beyond demographics
The new data and new challenges in multimedia research
Facial soft biometric features for forensic face recognition
FaceGenderID: exploiting gender information in DCNNs face recognition systems
Mitigating bias in face recognition using skewness-aware reinforcement learning
Racial faces in the wild: reducing racial bias by information maximization adaptation network
Benchmarking deep learning techniques for face recognition
Distance metric learning for large margin nearest neighbor classification
A discriminative feature learning approach for deep face recognition
Realization of a hybrid locally connected extreme learning machine with DeepID for face verification
A convolutional neural network based on TensorFlow for face recognition