key: cord-0203956-vzjgwoun
authors: Alonso-Fernandez, Fernando; Raja, Kiran B.; Raghavendra, R.; Busch, Christoph; Bigun, Josef; Vera-Rodriguez, Ruben; Fierrez, Julian
title: Cross-Sensor Periocular Biometrics in a Global Pandemic: Comparative Benchmark and Novel Multialgorithmic Approach
date: 2019-02-21
journal: nan
DOI: nan
sha: 4965bb8be625411e605015d209d8c46486b14fc1
doc_id: 203956
cord_uid: vzjgwoun

The massive availability of cameras results in a wide variability of imaging conditions, producing large intra-class variations and a significant performance drop if heterogeneous images are compared for person recognition. However, as biometrics is deployed, it is common to replace damaged or obsolete hardware, or to exchange information between heterogeneous applications. Variations in spectral bands can also occur. For example, surveillance face images (typically acquired in the visible spectrum, VIS) may need to be compared against a legacy iris database (typically acquired in near-infrared, NIR). Here, we propose a multialgorithmic approach to cope with periocular images from different sensors. With face masks in the front line against COVID-19, periocular recognition is regaining popularity since it is the only face region that remains visible. We integrate different comparators with a fusion scheme based on linear logistic regression, in which scores are represented by log-likelihood ratios. This allows easy interpretation of scores and the use of Bayes thresholds for optimal decision-making, since scores from different comparators are in the same probabilistic range. We evaluate our approach in the context of the Cross-Eyed Competition, whose aim was to compare recognition approaches when NIR and VIS periocular images are matched. Our approach achieves EER=0.2% and an FRR of just 0.47% at FAR=0.01%, representing the best overall approach of the competition. Experiments are also reported with a database of VIS images from different smartphones. We also discuss the impact of template size and computation times, with the most computationally heavy comparator playing an important role in the results. Lastly, the proposed method is shown to outperform other popular fusion approaches, such as the average of scores, SVMs or Random Forest.

As biometric technologies are widely deployed, it will be common to replace acquisition hardware as it is damaged or newer designs appear, or to exchange information between agencies or applications operating in different environments. Furthermore, variations in imaging spectral bands can also occur. For example, face images are typically acquired in the visible (VIS) spectrum, while iris images are usually captured in the near-infrared (NIR) spectrum. However, cross-spectrum comparison may be needed if, for example, a face image obtained from a surveillance camera needs to be compared against a legacy database of iris imagery. Here, we propose a multialgorithmic approach to cope with periocular images captured with different sensors. With face masks in the front line of the fight against the COVID-19 pandemic, periocular recognition is regaining popularity since it is the only region of the face that remains visible. As a solution to the mentioned cross-sensor issues, we integrate different biometric comparators using a score fusion scheme based on linear logistic regression.

Periocular biometrics has gained attention during the last years as an independent modality for person recognition [1, 2] after concerns about the performance of the face and iris modalities under non-ideal or uncooperative conditions [3, 4].
The mandatory use of face masks due to the COVID-19 pandemic has produced that, even in cooperative settings, face recognition systems are presented with occluded faces where the periocular region is often the only visible area. This face occlusion comes with a reduction in facial information that may be significant for recognition [5, 6] . To what extent this information reduction is detrimental for face recognition is yet something largely unexplored. In practice, recent studies have shown that commercial face recognition engines, even in cooperative settings, struggle with persons wearing face masks [7] , driving vendors to include capabilities for recognition of masked faces in their products [8] . In parallel, hygiene concerns are triggering fears against the use of contact-based biometric solutions such as fingerprints [9] . According to the Merriam-Webster dictionary, the medical definition of "periocular" is "surrounding the eyeball but within the orbit". From a forensic/biometric application perspective, our goal is to improve the recognition performance by using information extracted from the face region in the immediate vicinity of the eye, including the sclera, eyelids, eyelashes, eyebrows and the surrounding skin ( Figure 1 ). This information may include textural descriptors, but also the shape of the eyebrows or eyelids, or colour information [1] . With a surprising high discrimination ability, the resulting modality is the ocular one requiring the least constrained acquisition. It is sufficiently visible over a wide range of distances, even under partial face occlusion (close distance) or low-resolution iris (long distance), facilitating increased performance in unconstrained or uncooperative scenarios. It also avoids the need for iris segmentation, an issue in difficult images [10] . The COVID-19 outbreak has imposed the necessity of dealing with partially occluded faces even in cooperative applications in security, healthcare, border control or education. Another advantage in the context of the current global pandemic is that the periocular region appears in iris and face images, so it can be easily obtained with existing setups for face and iris. The ocular region consists of several organs such as the cornea, pupil, iris, sclera, lens, retina, optical nerve, and periocular region. Some of them are shown in Figure 1. Among these, iris, sclera, retina and periocular have been studied as biometric modalities [2] . The significant progress of ocular biometrics in the last decade has been primarily due to efforts in iris recognition since the late 80s, resulting in large-scale deployments [11] . Iris provides very high accuracy with near-infrared (NIR) illumination and controlled, close-up acquisition. However, deployment to non-controlled environments is not yet mature due to the impact of low resolution, variable illumination, or off-angle views, which makes very difficult to locate and segment the iris [10] . Even if the latter can be achieved, the quality of the resulting iris image might not be sufficient for accurate recognition either [12] . The feasibility of vasculature of the sclera as a biometric modality (sometimes simply referred to as sclera) has also been established by several studies [13] , although its acquisition in non-controlled environments poses the same problems as the iris modality. The vasculature of the retina is also very discriminative, and the retina is regarded as the most secure biometric modality due to being extremely difficult to spoof. 
However, its acquisition is very invasive, requiring high user cooperation and specialized optical devices. In this context, periocular has rapidly evolved as a very popular modality for unconstrained biometrics [2, 1, 13] , and recently due to the use of face masks even in constrained settings [7] . The term periocular is used loosely in the literature to refer to the externally visible region of the face that surrounds the eye socket. Therefore, images of the whole eye, such as the one in Figure 1 , are employed as input [13] . While the iris, sclera and other elements are present in such images, they are not explicitly used in isolation. It may be that the iris texture or the vasculature of the sclera cannot be reliably obtained either to be used as stand-alone modalities [12] . Some works even suggest that with visible light data, recognition performance is improved if components inside the ocular globe (iris and sclera) are discarded [14] . The fast-growing uptake of face technologies in social networks and smartphones, as well as the widespread use of surveillance cameras or face masks, has arguably increased the interest in periocular biometrics, especially in the visible (VIS) range. In such scenarios, samples captured with different sensors are to be compared if, for example, users are allowed to use their own acquisition devices, leading to a cross-sensor comparison in the same spectrum (VIS-VIS in this case). Unfortunately, this massive availability of cameras results in heterogeneous quality between images [15] , which is known to decrease recognition performance significantly [11] . These sensor interoperability issues also arise when a biometric sensor is replaced with a newer one without reacquiring the corresponding template, thus forcing biometric samples from different sensors to co-exist. Sensors may also operate in a range other than VIS, such as NIR, leading to cross-sensor NIR-NIR comparisons, e.g. [16] . In addition, iris images are largely acquired beyond the visible spectrum [17] , mainly using NIR illumination, but there are several scenarios in which it may be necessary to compare them with periocular images in the VIS range, leading in this case to a cross-sensor comparison in different spectra (NIR-VIS in this case), also known as cross-spectral comparison. This happens, for example, in law enforcement scenarios where the only available image of a suspect is obtained with a surveillance camera in the VIS range, but the reference database contains images in the NIR range [18, 19] . These interoperability problems, if not properly addressed, can affect the recognition performance dramatically. Unfortunately, widespread deployment of biometric technologies will inevitably cause the replacement of hardware parts as they are damaged, or newer designs appear. Another application case is the exchange of information among agencies or applications which employ different technological solutions or whose data is captured in heterogeneous environments. The different types of image comparisons mentioned, based on the spectrum in which they have been acquired, are summarized in Figure 2 . Accordingly, to counteract the reduction in recognition performance that is usually observed when comparing data from different sensors, we propose to combine the output of different periocular comparators at the score level, referred to as multialgorithm fusion (in contrast to multimodal fusion, which combines information from different modalities) [20, 21] . 
The consolidation of identity evidence from heterogeneous comparators (also called experts, feature extraction techniques, or systems in the present paper) is known to increase recognition performance, because the different sources can compensate for the limitations of the others [20, 22] . Integration at the score level is the most common approach because it only needs the output scores of the different comparators, greatly facilitating the integration. With this motivation, we employ a multialgorithm fusion approach to cope with periocular images from different sensors which integrates scores from different comparators. It follows a probabilistic fusion approach based on linear logistic regression [23] , in which the output scores of multiple systems are combined to produce a log-likelihood ratio according to a probabilistic Bayesian framework. This allows easy interpretation of output scores and the use of Bayes thresholds for optimal decision-making. This fusion scheme is compared with a set of simple and trained fusion rules widely employed in multibiometrics based on the arithmetic average of normalized scores [24] , Support Vector Machines [25] , and Random Forest [26] . The fusion approach based on linear logistic regression served as an inspiration to our submission to the 1 st Cross-Spectral Iris/Periocular Competition (Cross-Eyed 2016) [27] , with an outstanding recognition accuracy: Equal Error Rate (EER) of 0.29%, and False Rejection Rate (FRR) of 0% at a False Acceptance Rate (FAR) of 0.01%, resulting in the best overall competing submission. This competition was aimed at evaluating the capability of periocular recognition algorithms to compare visible and near-infrared images (NIR-VIS). In the present paper, we also carry out cross-sensor experiments with periocular images in the visible range only (VIS-VIS), but with two different sensors. For this purpose, we employ a database captured with two smartphones [28] , demonstrating the benefits of the proposed approach to smartphone-based biometrics as well. The rest of the paper is organized as follows. This introduction is completed with a description of the paper contributions. A summary of related works in periocular biometrics is given in Section 2. Section 3 then describes the periocular comparators employed. The score fusion methods evaluated are described in Section 4. Recognition experiments using images in different spectra (cross-spectral NIR-VIS) and in the visible spectrum (cross-sensor VIS-VIS) are described in Sections 5 and 6, respectively, including the databases, protocol used, results of the individual comparators, and fusion experiments. Finally, conclusions are given in Section 7. The contribution of this paper to the state of the art is thus as follows. First, we summarize related works in periocular biometrics using images from different sensors. Second, we evaluate nine periocular recognition comparators under the frameworks of different spectra (NIR-VIS) and same spectrum (VIS-VIS) recognition. The Reading Cross-Spectral Iris/Periocular Dataset (Cross-Eyed) [27] and the Visible Spectrum Smartphone Iris (VSSIRIS) [28] databases are respectively used for this purpose. We employ the three most widely used comparators in periocular research, which are used as a baseline in many studies [1] : Histogram of Oriented Gradients (HOG) [29] , Local Binary Patterns (LBP) [30] , and Scale-Invariant Feature Transform (SIFT) key-points [31] . 
Three other periocular comparators, proposed and published previously by the authors, are based on Symmetry Descriptors [32] , Gabor features [33] , and Steerable Pyramidal Phase Features [34] . The last three comparators use feature vectors extracted by three Convolutional Neural Networks: VGG-Face [35] , which has been trained for classifying faces, so the periocular region appears in the training data, and the very-deep Resnet101 [36] and Densenet201 [37] networks. Two example images from the two databases employed are shown in Figure 3 (first column). The second column shows the two images after applying Contrast Limited Adaptive Histogram Equalization (CLAHE) [38] , whereas the last two columns show the regions of interest (ROI) used by the different comparators. The comparators are evaluated both in terms of performance, template size and computation times. In a previous study [39] , we presented preliminary results with the VSSIRIS database using a subset of the mentioned comparators [12, 32, 33] , which are extended in the present paper with additional experiments using new comparators [34, 35, 36, 37] and the mentioned Cross-Eyed database. Third, we describe our multialgorithm fusion architecture for periocular recognition using images from different sensors ( Figure 4) . The input to a biometric comparator is usually a pair of biometric samples, and the output is, in general, a similarity score s. A larger score favours the hypothesis that the two samples come from the same subject (target or client hypothesis), whereas a smaller score supports the opposite (non-target or impostor hypothesis). However, if we consider a single isolated score from a biometric comparator (say a similarity score of s=1), it is in general not possible to determine which is the hypothesis the score supports the most, unless we know the distributions of target or non-target scores. Moreover, since the scores output by the various comparators are heterogeneous, score normalization is needed to transform these scores into a common domain prior to the fusion process [20] . We solve these problems by linear logistic regression fusion [40, 41] , a trained classification approach in which scores of the individual comparators are combined to obtain a log-likelihood ratio. This is the logarithm of the ratio between the likelihood that input signals were originated by the same subject and the likelihood that input signals were not originated by the same subject. This form of output is comparator-independent in the sense that this log-likelihood-ratio output can theoretically be used to make optimal (Bayes) decisions. To convert scores from different comparators into a log-likelihood ratio, we evaluate two possibilities ( Figure 5 ). In the first one (top part), the mapping function uses as input the scores of all comparators, producing a single log-likelihood ratio as output. In the second one (bottom), several mapping functions are trained (one per comparator), so one log-likelihood ratio per comparator is obtained. Under indepen- dence assumptions (as in the case of comparators based on different feature extraction methods), the sum of log-likelihood ratios results in another log-likelihood ratio [42] . Therefore, in the second case, the outputs of the different mapping functions are just summed. The latter provides a simple fusion framework that allows obtaining a single log-likelihood ratio by simply summing the (mapped) score given by each available comparator. 
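For completeness, the additivity underlying the second scheme can be written out explicitly (this is the standard result cited from [42], writing θ_t and θ_nt for the target and non-target hypotheses): if the M comparator scores are statistically independent given the hypothesis, then

s_cal = s_cal,1 + s_cal,2 + ... + s_cal,M
      = Σ_i log [ p(s_i | θ_t) / p(s_i | θ_nt) ]
      = log [ Π_i p(s_i | θ_t) / Π_i p(s_i | θ_nt) ]
      = log [ p(s_1, ..., s_M | θ_t) / p(s_1, ..., s_M | θ_nt) ],

i.e. the sum of the individual log-likelihood ratios is itself the log-likelihood ratio of the joint score vector.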
Summing per-comparator log-likelihood ratios in this way also allows coping with missing modalities [43], since the output would still be a log-likelihood ratio regardless of the number of systems combined. This fusion approach has previously been applied successfully to cross-sensor comparison in the face and fingerprint modalities [23], achieving excellent results in other competition benchmarks as well [43]. Fourth, we compare this fusion approach with a set of simple and trained score fusion rules based on the arithmetic average of normalized scores [24], Support Vector Machines [25], and Random Forest [26]. These fusion approaches are very popular in the literature and have been shown to give good results in biometric authentication [44, 20]. Fifth, in our experiments, conducted according to the 1st Cross-Spectral Iris/Periocular Competition (Cross-Eyed 2016) protocol [27], reductions of up to 29/47% in EER/FRR error rates (with respect to the best individual system) are obtained by fusion under NIR-VIS comparison, resulting in a cross-spectral EER of 0.2% and a FRR @ FAR=0.01% of just 0.47%. Regarding cross-sensor VIS-VIS smartphone recognition, the reductions in error rates achieved are 85/93% in EER/FRR, respectively, with corresponding cross-sensor error values of 0.3% (EER) and 0.3% (FRR).

Interoperability between different sensors is an area of high research interest due to new scenarios arising from the widespread use of biometric technologies, coupled with the availability of multiple sensors and vendor solutions. A summary of existing works in the literature is given in Table 1. Most of them employ the Genuine Acceptance Rate (GAR) as metric, computed as 100-FRR(%). For this reason, we report GAR values in this subsection; in the rest of the paper, we follow the Cross-Eyed protocol and report FRR values. Cross-sensor comparison of images in the visible range (VIS-VIS) from smartphone sensors is carried out, for example, in [45, 46, 47], while the challenge of comparing images from different sensors in the near-infrared spectrum (NIR-NIR) has been addressed in [16]. In the work [46], the authors apply Laplacian decomposition (LD) of the image coupled with dynamic scale selection, followed by frequency decomposition via the Short-Term Fourier Transform (STFT). In the experiments, they employ a subset of 50 periocular instances from the MICHE I dataset (Mobile Iris Challenge Evaluation I dataset) [54], captured with the front and rear cameras of two smartphones under indoor and outdoor illumination. The cross-sensor EER obtained ranges from 6.38 to 8.33% for the different combinations of reference and probe cameras. The authors in [45] use a sensor-specific colour correction technique, estimated by using a colour chart in a dark acquisition scene that is further illuminated by a standard illuminant. The authors also carry out a score-level fusion of six iris and five periocular comparators.

In another line of work, surveillance at night or in harsh environments has prompted interest in new imaging modalities. For example, the authors in [48] presented the IIITD Multispectral Periocular database (IIITD-IMP), with a total of 1240 VIS, NIR and Night Vision images from 62 subjects (the latter captured with a video camera in Night Vision mode). To cope with cross-spectral periocular comparisons, they employ Neural Networks to learn the variabilities caused by each pair of spectra. The employed comparator is based on a Pyramid of Histograms of Oriented Gradients (PHOG) [57].
They report results for each eye separately and for the combination of both eyes, obtaining a cross-spectral GAR of 38-64% at FAR=1% (best of the two eyes), and a GAR of 47-72% when combining the two eyes. The use of pre-trained Convolutional Neural Networks (CNNs) as a feature extraction method for NIR-VIS comparison was recently proposed in [53]. Here, the authors identify the layer of the ResNet101 network that provides the best performance on each spectrum, and then train a Neural Network that uses as input the feature vectors of the best respective layers. Using the IIITD-IMP database, they report results considering the left and right eyes of a person as different users (effectively duplicating the number of classes). Comparators based on FPLBP (Four-Patch LBP) and TPLBP (Three-Patch LBP) have also been evaluated on the IIITD-IMP and PolyU databases, with a reported cross-spectral periocular GAR at FAR=0.1% of 16-18% (IIITD-IMP) and 45-73% (PolyU). These two databases, together with the Cross-Eyed database (with 3840 images in NIR and VIS spectra from 120 subjects) [27], are used in the work [51]. To normalize the differences in illumination between NIR and VIS images, they apply Difference of Gaussians (DoG) filtering. The comparators employed were based on Local Binary Patterns (LBP) and Histogram of Oriented Gradients (HOG) features. They report results for each eye separately and for the combination of both eyes. The IIITD-IMP database gives the worst results, with a cross-spectral EER of 45% and a GAR at FAR=0.1% of only 25% (two eyes combined). The reported accuracy with the other databases is better, ranging between 10-14% (EER) and 83-89% (GAR). The latest advancements have resulted in devices with the ability to see through fog and rain, at night, and to operate at long ranges; in the work [49], the authors carry out periocular experiments with images captured by such devices.

This section describes the biometric comparators used for periocular recognition. We employ nine different comparators, whose choice is motivated as follows. Three comparators are based on the most widely used features in periocular research, which are employed as a baseline in many studies [1]: Histogram of Oriented Gradients (HOG) [29], Local Binary Patterns (LBP) [30], and Scale-Invariant Feature Transform (SIFT) key-points [31]. Three other comparators, developed in-house by the authors and published previously with competitive results, are based on Symmetry Descriptors (SAFE) [32], Gabor features (GABOR) [33], and Steerable Pyramidal Phase Features (NTNU) [34]. We also employ three comparators based on deep Convolutional Neural Networks: the VGG-Face network [35], which has been trained for classifying faces (so the periocular region appears in the training data), and the two very-deep Resnet101 [36] and Densenet201 [37] architectures.

The SAFE comparator employs the Symmetry Assessment by Feature Expansion (SAFE) descriptor [32], which encodes the presence of various symmetric curve families around image key-points (Figure 6, top). We use the eye centre as the anchor point for feature extraction. The algorithm starts by extracting the complex orientation map of the image via symmetry derivatives of Gaussians [59]. We employ S=6 different scales in computing the orientation map, therefore capturing features at different scales, with the standard deviation of each scale given by σ_s = K^(s−1) σ_0 (with s = 1, 2, ..., S; K = 2^(1/3); σ_0 = 1.6). These parameters have been chosen according to [31].
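As a quick numerical illustration of the scale parameters above (plain arithmetic from the formula just given; variable names are ours), the six standard deviations span roughly 1.6 to 5.1 pixels:

# Standard deviations of the S = 6 scales used for the SAFE orientation map,
# following sigma_s = K**(s-1) * sigma_0 with K = 2**(1/3) and sigma_0 = 1.6.
S, K, sigma_0 = 6, 2 ** (1 / 3), 1.6
sigmas = [K ** (s - 1) * sigma_0 for s in range(1, S + 1)]
print([round(v, 2) for v in sigmas])   # [1.6, 2.02, 2.54, 3.2, 4.03, 5.08]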
For each scale, we then project N_f = 3 ring-shaped areas of different radii around the eye centre onto a space of N_h = 9 harmonic functions. We use the result of scalar products of complex harmonic filters (shown in Figure 6) with the orientation image to quantify the presence of the different symmetric pattern families within each annular band. The resulting complex feature vector is an array of S × N_h × N_f elements. The comparison score M ∈ C between a query array q and a test array t is computed as the normalized complex scalar product M = ⟨q, t⟩ / ⟨|q|, |t|⟩, so that, by the triangle inequality, |M| ∈ [0, 1]. The argument ∠M represents the angle between the two arrays (expected to be zero when the symmetry patterns detected coincide for the reference and test feature vectors, and 180° when they are orthogonal), and the confidence is given by |M|. To include confidence into the angle difference, both quantities are combined into the final comparison score.

The annular band of the first ring is set in proportion to the distance between eye corners (Cross-Eyed database) or to the radius of the sclera circle (VSSIRIS database), while the band of the last ring ends at the boundary of the image. This difference in setting the smallest ring is due to the ground-truth information available for each database, as explained later. However, in setting the origin of the smallest band, we have tried to ensure that the different annular rings capture approximately the same relative spatial region in both databases. The ROI of the SAFE comparator for each database is shown in Figure 3 (third column). Using the eye corners or the sclera boundary as reference for the first annular band alleviates the effect of the dilation that affects the pupil, which is more pronounced under visible illumination. Since the eye corners and the sclera are not affected by such dilation or by partial occlusion due to eyelids, they provide a more stable reference [60].

The GABOR comparator is described in [33] and is based on the face recognition system presented in [61]. The periocular image is decomposed into non-overlapped square regions (Figure 3, fourth column), and the local power spectrum is then sampled at the centre of each block by a set of Gabor filters organized in 5 frequency and 6 orientation channels. An example of Gabor filters is shown in Figure 6. The sparseness of the sampling grid allows direct Gabor filtering in the image domain without needing the Fourier transform, with significant computational savings, enabling real-time operation. Gabor responses from all grid points are grouped into a single complex vector, and the comparison between two images is made using the magnitude of the complex values via the χ² distance. Prior to the comparison, the magnitude vectors are normalized to a probability distribution (PDF). The χ² distance between a query PDF q and a test PDF t is computed as χ²(q, t) = Σ_{n=1..N} (p_q(n) − p_t(n))² / (p_q(n) + p_t(n)), where p(n) are the entries of the PDF, n is the bin index, and N is the number of bins in the PDF (dimensionality). Due to the denominator, the χ² distance gives more weight to low-probability regions of the PDF; for this reason, it has been observed to produce better results than other distances when comparing normalized histograms [62].

Image features from multi-scale pyramids have proven discriminative in many earlier works concerned with texture synthesis, texture retrieval, image fusion, and texture classification, among others [63, 64, 65, 66, 67, 68, 69, 70].
Inspired by this applicability, we employ steerable pyramidal features for periocular image classification using images from different sensors. Further, observing that the nature of the textures differs across spectra (NIR versus VIS), we propose to employ quantized phase information from the multi-scale pyramid of the image, as explained next. A steerable pyramid is a translation- and rotation-invariant transform that decomposes an image into a number of sub-bands over multiple scales and orientations in a self-inverting manner [71, 72, 73]. The pyramidal decomposition is performed using directional derivative operators of a specific order. The key motivation in using steerable pyramids is to obtain both linear and shift-invariant features in a single operation. Further, they not only provide a multi-scale decomposition, but also offer the advantages of orthonormal wavelet transforms, being localized both in space and in spatial frequency, while avoiding aliasing effects [71]. The basis functions of a steerable pyramid are K-order directional derivative operators, so the pyramid comes in different scales and K + 1 orientations. For a given input image, the steerable pyramid coefficients can be represented as S_(m,θ), where m represents the scale and θ the orientation. In this work, we generate a steerable pyramid with 3 scales (m ∈ {1, 2, 3}) and angular coefficients in the range θ_1 = 0° to θ_{K+1} = 360°, resulting in a pyramid that covers all directions. The set of sub-band images corresponding to one scale can therefore be represented as S_m = {S_(m,θ_1), S_(m,θ_2), ..., S_(m,θ_{K+1})}.

We further note that the textural information represented is different in the NIR and VIS domains. In order to obtain domain-invariant features, we extract local phase features [74] from each sub-band image S_(m,θ), computed over a local region ω of n pixels around each pixel location. The local phase responses, obtained through local Fourier coefficients, are computed for four frequency points u_1, u_2, u_3 and u_4, which relate to four distinct directions. The phase information, in the form of Fourier coefficients F, is then separated into the real and imaginary parts of each component, as given by [Re{F}, Im{F}], to form a vector R with eight elements. Next, the elements R_i of R are binarized to Q_i by assigning a value of 1 to components with a positive response, and 0 otherwise. The phase information is finally encoded into a compact pixel representation P in the 0-255 range using a simple binary-to-decimal conversion, P = Σ_{i=1..8} Q_i 2^(i−1). This procedure is followed for the different scales and orientations of the selected space, and all the phase responses P_(m,θ) of the input image are concatenated into a single vector. Comparison between the feature representations of two images is made by computing the distance between these concatenated vectors.

The SIFT comparator is based on the SIFT operator [31]. SIFT key-points (with dimension 128 per key-point) are extracted in the annular ROI shown in Figure 3, third column. The use of an annular ROI, as with SAFE, is inherited from our previous contribution [39]; to allow comparison with the other systems, which employ the entire input image (Figure 3, fourth column), we report experiments with the latter as well. The recognition metric between two images is the number of paired key-points, normalized by the minimum number of detected key-points in the two images being compared.
We use a free C++ implementation of the SIFT algorithm (http://vision.ucla.edu/∼vedaldi/code/sift/assets/sift/index.html), with the adaptations described in [75]. In particular, it includes a post-processing step that removes spurious pairings using geometric constraints, so pairs whose orientation and length differ substantially from the predominant orientation and length are discarded.

Together with SIFT key-points, LBP [30] and HOG [29] have been the most widely used descriptors in periocular research [1]. An example of LBP and HOG features is shown in Figure 6. For these two comparators, the periocular image is likewise divided into non-overlapped square regions (Figure 3, fourth column), and the descriptors extracted from each block are concatenated into a single vector. The Euclidean distance is usually used for this purpose [12], but here we employ the χ² distance for the same reasons as with the Gabor comparator.

Inspired by the works [76, 77, 53] in iris and periocular biometrics, we also leverage the power of existing CNN architectures pre-trained with millions of images to classify images into a large number of object categories. They have proven to be successful in very large recognition tasks beyond the detection and classification tasks for which they were designed [78]. Here, we employ the VGG-Face [35] and the very deep Resnet101 [36] and Densenet201 [37] architectures. VGG-Face is based on the VGG-Very-Deep-16 sequential CNN architecture, trained using ∼1 million images from the Labeled Faces in the Wild [79] and YouTube Faces [80] datasets. Since VGG-Face is trained for classifying faces, we believe that it can provide effective recognition with the periocular region as well, given that this region appears in the training images. Introduced later, the ResNet networks [36] presented the concept of residual connections to ease the training of CNNs; by reducing the number of training parameters, they can be substantially deeper. The key idea of residual connections is to make the input of a lower layer available to a higher layer, bypassing intermediate ones. There are different variants of ResNet networks, depending on their depth. In this work, we employ ResNet101, with a depth of 347 layers (including 101 convolutional layers). In DenseNet networks [37], the residual concept is taken even further, since the feature maps of all preceding layers of a Dense block are used as inputs of a given layer, and its own feature maps are used as inputs to all subsequent layers. This encourages feature reuse throughout the network. Similarly to ResNet, there are different variants of DenseNet (defined by their depth). In this paper, we employ Densenet201, with a depth of 709 layers (including 201 convolutional layers). In using these networks, periocular images are fed into the feature extraction pipeline of each pre-trained CNN [76, 77]. However, instead of using the vector from the last layer, we employ as feature descriptor the vector from the intermediate layer identified as the one providing the best performance; the best layers are identified in the respective experimental sections. This approach allows the use of powerful architectures pre-trained with a large number of images in a related domain, eliminating the need to design or re-train a new network for a specific task, which may be infeasible when large-scale databases are lacking in the target domain (as is the case for periocular recognition with images from different sensors). The extracted CNN vectors can simply be compared with distance measures; in our case, we employ the χ² distance, which has proven to provide better results than other measures such as the cosine or Euclidean distances [77].
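To make the comparison stage concrete, the following sketch illustrates the two types of scores used above: the χ² distance between descriptors normalized to probability distributions (used for the GABOR, LBP, HOG and CNN vectors) and the SIFT score normalized by the minimum number of detected key-points. This is a minimal illustration with our own function names, not the exact implementation used in the experiments.

import numpy as np

def chi2_distance(q, t, eps=1e-12):
    """Chi-squared distance between two non-negative feature vectors.
    Each vector is first normalized to a probability distribution (PDF), then
    chi2 = sum_n (p_q(n) - p_t(n))**2 / (p_q(n) + p_t(n))."""
    q = np.asarray(q, dtype=float)
    t = np.asarray(t, dtype=float)
    q = q / (q.sum() + eps)
    t = t / (t.sum() + eps)
    return float(np.sum((q - t) ** 2 / (q + t + eps)))

def sift_score(n_paired, n_keypoints_a, n_keypoints_b):
    """SIFT comparison score: number of paired key-points (after the geometric
    filtering step) normalized by the minimum number of detected key-points."""
    return n_paired / max(1, min(n_keypoints_a, n_keypoints_b))

Note that χ² is a dissimilarity (smaller means more similar), whereas the SIFT score is a similarity; the calibration stage described in Section 4 maps scores of either polarity to a common log-likelihood-ratio domain.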
A biometric verification comparator can be defined as a pattern recognition machine that, by comparing two (or more) samples of input signals, is designed to distinguish between two classes. The two hypotheses or classes defined for each comparison are the target hypothesis (θ_t: the compared biometric data comes from the same individual) and the non-target hypothesis (θ_nt: the compared data comes from different individuals). As a result of the comparison, the biometric system outputs a real number s known as a score. The higher the score, the more it supports the target hypothesis, and vice versa. The acceptance or rejection of an individual is based on a decision threshold τ, and this threshold depends on the priors and decision costs involved in the decision-making process. However, if we do not know the distributions of target and non-target scores of a comparator, or an appropriate threshold, we will in general not be able to classify the associated biometric samples.

Integration at the score level is the most common approach used in multibiometric systems due to the ease of accessing and combining the scores s = (s_1, ..., s_i, ..., s_N) generated by N different comparators [20]. Unfortunately, each biometric comparator outputs scores in a range that is specific to the comparator, so score normalization is needed to transform these scores into a common domain prior to the fusion; otherwise, little can be said about the meaning of the fused scores. From this viewpoint, outputs are dependent on the comparator, and thus the acceptance/rejection decision also depends on the comparator. These problems can be addressed with the concept of calibrated scores. During calibration, the scores s = (s_1, ..., s_i, ..., s_N) are mapped to a log-likelihood ratio (LLR) as s_cal ≈ log [ p(s|θ_t) / p(s|θ_nt) ], where s_cal represents the calibrated score. Then, a decision can be taken using the Bayes decision rule [42]: decide θ_t if s_cal > τ_B, and θ_nt otherwise. The parameter τ_B is known as the Bayes threshold, and its value depends on the prior probabilities of the hypotheses, p(θ_t) and p(θ_nt), and on the decision costs. This form of output is comparator-independent, since a log-likelihood-ratio output can theoretically be used to make optimal (Bayes) decisions for any given target prior and any costs associated with making erroneous decisions [42]. Therefore, the calibration process gives meaning to s_cal. In a Bayesian context, a calibrated score s_cal can be interpreted as a degree of support to either of the hypotheses: if s_cal > 0, the support to θ_t is higher, and vice versa. Also, the meaning of a log-likelihood ratio is the same across different biometric comparators, allowing their outputs to be compared in the same probabilistic range. This calibration transformation thus solves the two problems discussed above: first, it maps scores from different biometric comparators to a common domain; second, it allows the interpretation of biometric scores as a degree of support. A number of strategies can be used to train a calibration transformation [81]. Among them, logistic regression has been successfully used for biometric applications [40, 41, 82, 83, 23]. With this method, the scores of multiple comparators are fused together, primarily to improve the discriminating ability, in such a way as to encourage good calibration of the output scores.
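The next paragraphs formalize this objective. As a minimal illustration (our own simplified stand-in for the Bosaris toolkit used in the experiments, with hypothetical function names and assuming equal decision costs for the threshold), the weights of a linear logistic regression fusion can be trained and applied as follows:

import numpy as np
from scipy.optimize import minimize

def train_llr_fusion(S_tar, S_non, prior=0.5):
    """Train weights {a_0, ..., a_N} so that f = a_0 + sum_i a_i * s_i behaves as
    a calibrated log-likelihood ratio. S_tar: (N_T, N) scores of target trials,
    S_non: (N_NT, N) scores of non-target trials, from the same N comparators."""
    n_comp = S_tar.shape[1]
    logit_p = np.log(prior / (1.0 - prior))

    def cost(w):
        f = w[0] + S_tar @ w[1:]                           # fused target scores
        g = w[0] + S_non @ w[1:]                           # fused non-target scores
        c_tar = np.mean(np.logaddexp(0.0, -f - logit_p))   # log(1 + e^{-f - logitP})
        c_non = np.mean(np.logaddexp(0.0, g + logit_p))    # log(1 + e^{ g + logitP})
        return prior * c_tar + (1.0 - prior) * c_non       # weighted, normalized cost C

    return minimize(cost, np.zeros(n_comp + 1), method="BFGS").x

def fuse_to_llr(weights, scores):
    """Map the raw scores of one trial (one score per comparator) to a calibrated LLR."""
    return weights[0] + float(np.dot(weights[1:], scores))

def bayes_decision(llr, p_target=0.5):
    """Bayes decision with equal costs: accept if the LLR exceeds
    tau_B = log(p_nontarget / p_target)."""
    return llr > np.log((1.0 - p_target) / p_target)

Calibrating a single comparator corresponds to passing a single-column score matrix; summing the resulting per-comparator LLRs then implements the second scheme of Figure 5.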
Given N biometric comparators which output the scores s_j = (s_1j, s_2j, ..., s_Nj) for an input trial j, a linear fusion of these scores is

f_j = a_0 + a_1 s_1j + a_2 s_2j + ... + a_N s_Nj.

When the weights {a_0, ..., a_N} are trained via logistic regression, the fused score f_j is a well-calibrated log-likelihood ratio [81, 41]. Let [s_ij] be an N × N_T matrix of training scores built from N biometric comparators and N_T target trials, and let [r_ij] be an N × N_NT matrix of training scores built from the same N biometric comparators with N_NT non-target trials. We use a logistic regression objective [40, 41] that is normalized with respect to the proportion of target and non-target trials (N_T and N_NT, respectively) and weighted with respect to a given prior probability P = P(target). The objective is stated in terms of a cost C, which must be minimized:

C = (P / N_T) Σ_{j=1..N_T} log(1 + e^(−f_j − logit P)) + ((1 − P) / N_NT) Σ_{j=1..N_NT} log(1 + e^(g_j + logit P)),

where the fused target and non-target scores are respectively

f_j = a_0 + Σ_{i=1..N} a_i s_ij and g_j = a_0 + Σ_{i=1..N} a_i r_ij,

and where logit P = log [ P / (1 − P) ]. It can be demonstrated that minimizing the objective C with respect to {a_0, ..., a_N} results in a good calibration of the fused scores [81, 41]. In practice, changing the value of P has a small effect; the default of 0.5 is a good choice for a general application and will be used in this work. The optimization objective C is convex and therefore has a unique global minimum. Another advantage of this method is that, when fusing scores from different comparators, the most reliable comparator will implicitly be given a dominant role in the fusion (via the trained weights {a_0, ..., a_N}). In other standard fusion methods, such as the average of scores [24], all comparators are given the same weight in the fusion, regardless of their individual accuracy. It is also straightforward to show that if M calibrated scores {s_cal,1, s_cal,2, ..., s_cal,M} come from statistically independent sources (such as multiple biometric comparators), their sum s_cal,1 + s_cal,2 + ... + s_cal,M also yields a log-likelihood ratio [42]. The latter allows the scores s_i of each available biometric comparator to be calibrated separately (by using N = 1 in the linear fusion above) and the calibrated scores s_cal,i to be simply summed in order to obtain a new calibrated score, as shown in Figure 5. In this paper, the two possibilities are evaluated, i.e. calibrating the scores of all comparators together vs. calibrating them separately and then summing them up. To perform logistic regression calibration, the freely available Bosaris toolkit for Matlab has been used. For further details of this fusion method, the reader is referred to [23] and the references therein.

The probabilistic fusion method described above is compared in the present work with three strategies. Since each biometric comparator usually outputs scores in a range that is specific to the system, the scores of each comparator are normalized prior to the fusion using z-score normalization [24]. The three strategies are:

• Average. With this simple rule, the scores of the different comparators are simply averaged. Motivated by their simplicity, simple fusion rules have been used in biometric authentication with very good results [84, 85]. They have the advantage of not needing training, sometimes surpassing more complex fusion approaches [86].

• SVM. Here, a Support Vector Machine (SVM) is trained to provide a binary classification given a set of scores from different biometric comparators [87]. The SVM algorithm searches for an optimal hyperplane that separates the data into two classes.
SVM is a popular approach employed in multibiometrics [25], which has been shown to outperform other trained approaches [20]. In this work, we evaluate Linear, RBF, and Polynomial (order 3) kernels. Instead of using the binary predicted class label, we use the signed distance to the decision boundary as the output score of the fusion. This allows the presentation of DET curves and associated EER and FRR measures.

• Random Forest. Another method employed for the fusion of scores from multiple biometric comparators is the Random Forest (RF) algorithm [26]. An extension of the standard classification tree algorithm, the RF algorithm is an ensemble method where the results of many decision trees are combined [88]. This helps to reduce overfitting and to improve generalization capabilities. The trees in the ensemble are grown by using bootstrap samples of the data. In this work, we evaluate ensembles with 25, 150, and 600 decision trees. Instead of using the binary predicted class label, we use the weighted average of the class posterior probabilities over the trees that support the predicted class, so we can present DET curves and associated measures.

In the cross-spectral recognition experiments of this section, we employ the Reading Cross-Spectral Iris/Periocular Dataset (Cross-Eyed) [27], with NIR and VIS images from 120 subjects. To avoid the usage of iris information by periocular methods during the Cross-Eyed competition, periocular images were distributed with a mask on the eye region, as discussed in [12]. A new edition of the competition was held in 2017: the 120 subjects of the Cross-Eyed 2016 database were provided as the training set, and an additional set of 55 subjects was sequestered as the test set of the 2017 edition, but the latter was never released [89]. Prior to the competition, a training set of images from 30 subjects was distributed. The test set consisted of images from 80 subjects, sequestered by the organizers and distributed after the competition. Images from 10 additional subjects, not present in the test set, were also released after the competition. Here, we employ the same 30 subjects of the training set to tune our algorithms and the remaining 90 subjects as test set. Images are processed with Contrast Limited Adaptive Histogram Equalization (CLAHE), which is the preprocessing choice with ocular images [90], and then sent to feature extraction. We carry out verification experiments, with each eye considered a different user. We compare images both from the same device (same-sensor comparisons) and from different devices, i.e. different spectra (cross-spectral comparisons). Since the training set contains 30 subjects, this results in 29 × 4 × 2 (same-sensor) and 29 × 4 × 2 × 2 (cross-spectral) training impostor scores per subject; multiplying these amounts by 30 gives the total number of impostor scores of the training set. The experimental protocol is summarized in Table 2.

Regarding the comparators, the VGG-Face, ResNet101 and Densenet201 models are the pre-trained models available in Matlab r2019a. In line with the Cross-Eyed competition, we also provide the extraction and comparison time of each method (Table 4, second and third columns); the variation among the SIFT versions depending on the image or ROI size can also be appreciated there. Normalized periocular images are fed into the feature extraction pipeline of each pre-trained CNN. We investigate the representation capability of each layer by reporting the corresponding cross-spectral accuracy using features from that layer. The recognition accuracy of each network (EER and FRR @ FAR=0.01%) is given in Figure 8. It is worth noting that the best performance is obtained at some intermediate layer for all CNNs, in line with previous studies using ocular modalities [76, 77].
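For illustration, the following sketch shows how such intermediate-layer descriptors can be extracted. Our experiments use the pre-trained Matlab r2019a models, so the PyTorch code below is only an equivalent, hypothetical setup, and the layer named in it is an arbitrary example rather than the layer actually selected in Figure 8.

import torch
import torchvision.models as models
import torchvision.transforms as T

# Load an ImageNet pre-trained ResNet101 and hook an intermediate layer.
# "layer3" is only an example; in the paper the best layer is selected
# empirically per network and per database (Figures 8 and 12).
net = models.resnet101(pretrained=True).eval()
activations = {}
net.layer3.register_forward_hook(
    lambda module, inp, out: activations.update(feat=out))

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def cnn_descriptor(pil_image):
    """Feed a normalized periocular image through the frozen CNN and return the
    flattened activations of the chosen intermediate layer (clamped to be
    non-negative so that the chi-squared distance of Section 3 can be applied)."""
    with torch.no_grad():
        net(preprocess(pil_image).unsqueeze(0))
    return torch.relu(activations["feat"]).flatten().numpy()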
In selecting the best layer, we prioritize the FRR @ FAR=0.01%, since this was the metric employed to rank submissions to the Cross-Eyed competition, although we seek a balance with the EER as well. We have also searched, where possible, for layers that give optimum performance with both the Cross-Eyed and the VSSIRIS databases simultaneously (results with the latter are given in Figure 12).

We now report the performance of all periocular comparators in Table 5. Besides the EER, we also report the FRR at FAR=0.01%, the metric used to rank submissions to the Cross-Eyed competition. We report two types of results: i) same-sensor comparisons; and ii) cross-spectral comparisons. In Figure 9, we also give the DET curves of the cross-spectral experiments. Our results can be put in context with related studies such as [48]. However, in the mentioned studies, the images employed are of smaller size, ranging from 100×160 to 640×480, while the images employed in this paper are of 613×701 pixels; also, they evaluate at most three different periocular comparators. In the present paper, the use of bigger images may be the reason for the comparable performance between NIR and VIS images. Regarding cross-spectral experiments, we observe a significant worsening in performance w.r.t. same-sensor comparisons, although not all comparators are affected in the same way. HOG, NTNU and especially LBP are substantially affected in high-security mode (i.e. low FAR), as can be appreciated in the right part of Table 5; the relative FRR increase @ FAR=0.01% for these comparators is in the range of 200% to nearly 500%. But the comparator that is most affected is SIFT. Even if its cross-spectral performance is the best among all comparators, it is about one or two orders of magnitude worse than its same-sensor performance (meaning a thousand per cent worse or more). This is despite the use of a descriptor of bigger size (see Table 3). SIFT extracts features from a discrete set of local key-points, and it might be that the position of the detected key-points is not the same in each spectrum. With the other comparators, on the other hand, the image is divided into annular or square regions (Figure 3), and features are extracted from each region, ensuring a consistent extraction between both spectra.

Concerning the individual performance of each comparator, SIFT exhibits very low error rates at the original image size, but this comparator is computationally heavy both in processing time and template size. In this paper, we use the SIFT detector with the same parametrization employed in [75] for iris images of size 640×480. In the work [75], the iris region represented only ∼1/8 of the image, leading to some hundreds of key-points per image. However, images of the Cross-Eyed database are of 613×701 pixels, and the periocular ROI occupies a considerably bigger area than the iris region, leading to an average of ∼1900 key-points per image with the annular ROI, or ∼2543 with the full image (Table 4). We carry forward the annular-ROI (AR) configuration of SIFT to the fusion experiments of the next section. On the other hand, if the size of the input image is reduced to match the CNNs (224×224), the lower number of detected key-points (only 92 on average) causes both same-sensor and cross-spectral performance to degrade by one or two orders of magnitude. When this happens, SIFT becomes worse than e.g. DenseNet201 or ResNet101, and comparable to VGG-Face in some DET regions (Figure 9).
This also indicates the strength of the CNNs, which match SIFT's performance when the latter operates at the same reduced image size, and which rank ahead of the other comparators while using a smaller input image. In general, there is an inverse relation between error rates and template size: the comparators with the best performance (SIFT, NTNU and the three CNNs) are also the ones with the biggest feature vectors (see Table 3). The performance of NTNU is also remarkable, surpassing the CNNs in some cases with a smaller feature vector. When it comes to cross-spectral comparisons, however, the CNNs provide better performance. This is observed especially with the deeper networks (ResNet101 and DenseNet201), highlighting the capability of these powerful descriptors pre-trained with millions of images. The DET curves of Figure 9 better show the superiority of the three CNNs for cross-spectral comparison w.r.t. the other comparators (apart from SIFT). The behaviour of DenseNet201 is also remarkable, providing the second-best result of all comparators with a feature vector much smaller than those of the other CNNs. Among the three CNNs used in this paper, DenseNet201 is the one providing the best performance on the original task for which they were trained (ImageNet), so it could be expected that such superiority transfers to other tasks as well. It is also worth noting the relatively good cross-spectral EER values of some light comparators such as LBP or HOG: with a feature vector of only 384 real numbers and an EER of 5-6%, they would enable low-security applications where computational resources are limited.

We then carry out fusion experiments using all the available comparators, according to the fusion schemes presented in Section 4. We have tested all the possible fusion combinations. Whenever training is needed (i.e. to compute calibration weights, z-normalization, SVM, or Random Forest models), the training set of the Cross-Eyed database is used. In Figure 10, we show the best results obtained for an increasing number of combined comparators; the two calibration possibilities of Figure 5 are labelled in the figure, the second one (separate calibration of each comparator followed by summation) as 'LLR (sum)'. As can be observed, a substantial performance improvement is obtained when combining several comparators, with the best cross-spectral performance given by the calibration-based (LLR) fusion. In Table 6, we show the comparators involved in the best fusion cases. For the sake of space, we only provide results with a selection of fusion approaches, according to the observations made above when discussing Figure 10: the LLR method (best case), SVM linear (a good runner-up, which is also faster to train than its polynomial counterpart), and AVG or AVERAGE (a simple approach that does not need training). To allow a more comprehensive analysis, we provide not only the best cases but also the second- and third-best combinations for a given number of comparators. It can be seen that the best combinations for any given number of comparators always involve the SIFT method. The excellent accuracy of the SIFT comparator is not jeopardized by the fusion with other comparators whose performance is one or two orders of magnitude worse; rather, it is complemented to obtain even better cross-spectral error rates, especially with trained approaches. A careful look at the combinations of Table 6 shows that the CNN comparators are also chosen first for the fusion. Together with SIFT, they are the comparators with the best individual performance, and they appear to be very complementary too.
However, it should not be taken as a general statement that the best fusion combination always involves the best individual comparators. Different fusion algorithms may lead to different results [94, 86]; for example, the best FRR with the simple average rule involves the SAFE comparator. It is also worth noting that other comparators with worse individual performance and not based on deep networks (such as SAFE, LBP, or NTNU) are also selected in combinations whose performance is nearly as good as the best cases. At the same time, this shows the power of the fusion approaches employed, and especially of the calibration method, which are capable of reducing error rates substantially by fusing comparators with very heterogeneous performance and different feature representations.

To further illustrate the benefit of using calibrated scores, we plot in Figure 11 the False Acceptance/False Rejection (FA/FR) curves of the individual systems. This is done using the raw scores of each system (left), scores normalized with z-score normalization (center), and calibrated scores (right). One selected fusion case of Table 6 (the best combination of three systems: SIFT+LBP+ResNet101) is also plotted, using the average of normalized scores (center) and score calibration (right). It can be seen that the raw scores of each system lie in a different range, even if all comparators are expected to produce a score between 0 and 1 ([−1, 1] with SAFE). After z-score normalization, the impostor score distributions become aligned to a certain degree, since such normalization converts them to zero mean and unit variance. Also, the extent to which the genuine distributions spread is indicative of the performance of each system (in order: SIFT (gray), DenseNet201 (black), ResNet101 (green), etc.). However, this cannot always be expected, and the fusion (blue thick curve) is situated between the curves of the individual systems involved, due to the scores being averaged. The EER of each system also occurs at a different score value. Similar effects can be expected with other popular normalization techniques such as max-min, tanh, etc. [24]. When scores are normalized by calibration, two phenomena occur: i) the FA and FR curves cross at a score of ∼0 (the EER is always situated at this point), since a positive log-likelihood-ratio output supports the genuine (mated) decision and a negative value the opposite; and ii) the spread and order of the curves are indicative of the performance of each system. For example, the SIFT curves (gray) have a smaller slope and reach higher log-likelihood ratios (both positive and negative), due to this system being significantly better than the others (Table 5).

Table 7 shows the results of the submission of Halmstad University to the Cross-Eyed 2016 competition, together with the ranking of the submitted approaches in the evaluation [95]; for more information, refer to [27]. We provide both the results reported by the organizers [27] and our own computations on the training and test sets of the database, using the submitted executables and the protocol described in Section 5.1. For the evaluation, only the SAFE, GABOR, SIFT, LBP, and HOG comparators were available. We contributed three different fusion combinations, named HH1, HH2, and HH3, with the HH1 combination obtaining the first position in the competition.
Two key differences between the results reported in Table 7 and those of the present paper are that, in our submitted executables: i) the score of each comparator was calibrated separately, and the resulting calibrated scores were summed; and ii) the LBP and HOG comparators employed the Euclidean distance (the popular choice in the literature with these methods, instead of χ²). At the time of submission, the test set had not been released, so our decisions could only be based on the results on the training set. We observed that the SIFT comparator already provided cross-spectral error rates of nearly 0% on the training set (not shown in Table 7). However, it was reasonable to expect a higher error on a bigger dataset, as demonstrated later when the test set was released. Therefore, we contributed to the competition a fusion of the five available comparators (called HH1) to better cope with the generalization issue expected when performance is measured on a bigger set of images. Indeed, Table 7 shows that performance on the test set is systematically worse than on the training set. Since the combination of the five available comparators is computationally heavy in template size (due to the SIFT comparator), we also contributed combinations obtained by removing SIFT (HH2) and by further removing SAFE (HH3), which has a feature extraction time considerably higher than the rest of the comparators in our implementation (see Table 4). Thus, our motivation behind HH2 and HH3 was to reduce template size and feature extraction time. Some differences are observable between our results on the test set and the results reported by the competition [27]. We attribute this to two factors: i) the additional 10 subjects included in the released test set, which were not used during the competition; and ii) the employment of a different test protocol, since the organizers do not specify the exact images used for impostor trials during the competition. Therefore, the experimental framework used in this paper is not exactly the same as that employed in the Cross-Eyed competition.

6. Cross-Sensor (VIS-VIS) Smartphone Periocular Recognition

In the cross-sensor experiments of this section, we use the Visible Spectrum Smartphone Iris (VSSIRIS) database [28], which has images from 28 subjects (56 eyes) captured using the rear camera of two smartphones (Apple iPhone 5S, of 3264×2448 pixels, and Nokia Lumia 1020, of 3072×1728 pixels). They have been obtained in unconstrained conditions under mixed illumination (natural sunlight and artificial room light). Each eye has 5 samples per smartphone, thus 5×56=280 images per device (560 in total). The acquisition is made without flash, in a single session and with semi-cooperative subjects. Figure 7 (bottom) shows some examples. All images of VSSIRIS are annotated manually, so the radius and centre of the pupil and sclera circles are available. Images are resized via bicubic interpolation to have the same sclera radius (set to R_s=145, the average radius over the whole database). We use the sclera for normalization since it is not affected by dilation. Then, images are aligned by extracting a square region of 6R_s×6R_s pixels (871×871) around the sclera centre. This size is set empirically to ensure that all available images have sufficient margin to the four sides of the sclera centre.
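A minimal sketch of this normalization step (assuming the annotated sclera centre and radius are available, and using OpenCV with our own function names; the exact implementation in the experiments may differ):

import cv2

def normalize_periocular(img_bgr, sclera_center, sclera_radius, target_radius=145):
    """Rescale so the sclera radius equals target_radius (bicubic interpolation),
    crop a square of about 6*R_s x 6*R_s around the sclera centre, and apply CLAHE."""
    scale = target_radius / float(sclera_radius)
    img = cv2.resize(img_bgr, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)
    cx = int(round(sclera_center[0] * scale))
    cy = int(round(sclera_center[1] * scale))
    half = 3 * target_radius                     # half-side of the 6*R_s crop
    # Replicate-pad so the crop never falls outside the image, then cut the ROI
    # (2*half + 1 = 871 pixels per side for R_s = 145, as in the paper).
    img = cv2.copyMakeBorder(img, half, half, half, half, cv2.BORDER_REPLICATE)
    roi = img[cy:cy + 2 * half + 1, cx:cx + 2 * half + 1]
    # Local contrast compensation with CLAHE, applied here on the grayscale ROI.
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    return cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(gray)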
Here, there is sufficient margin on the four sides of the eye, so the normalized images have the eye centred, as can be seen in Figure 3 (bottom). Images are further processed by Contrast-Limited Adaptive Histogram Equalization (CLAHE) [38] to compensate for variability in local illumination. We carry out verification experiments, with each eye considered a different user. We compare images both from the same device (same-sensor) and from different devices (cross-sensor). The smaller size of VSSIRIS in comparison with the Cross-Eyed database results in the availability of fewer scores. Therefore, whenever a parameter needs training, 2-fold cross-validation [96] is used, dividing the available number of users into two partitions. Otherwise, we report results employing the entire VSSIRIS database. The parameters of the periocular comparators are as follows. As with the Cross-Eyed database, template sizes and computation times are given in Table 4 (right). We first identify the optimum layer of each CNN (a brief sketch of this feature extraction is given after the performance discussion below). The cross-sensor accuracy of each network is given in Figure 12 for each cross-validation fold. When selecting the best layer, we have tried to find the one that gives optimum performance with both the Cross-Eyed and the VSSIRIS databases simultaneously, but this has not always been possible, as per the discussion in Section 5.

The performance of the individual comparators is reported in Table 9. As in Section 5, we adopt the EER and the FRR at FAR=0.01% as accuracy measures. In Figure 13, we give the DET curves of the cross-sensor experiments. By comparing Table 5 and Table 9, it can be observed that same-sensor experiments with the VSSIRIS database usually exhibit lower error rates for any given comparator. Possible explanations might be that the ROI of VSSIRIS images is bigger (871×871 vs 613×701), or that the VSSIRIS database has fewer users (28 vs 90 subjects). On the other hand, cross-sensor error rates with VSSIRIS are significantly worse for some comparators (e.g. SIFT, HOG, NTNU, or VGG-Face). Lighter comparators such as LBP or HOG are not capable of providing good cross-sensor performance even in low-security applications (EER of 11% or higher). The difference is especially relevant for the SIFT comparator, whose cross-sensor error rates on Cross-Eyed (Table 5) were 0.28% (EER) and 0.88% (FRR), but here increase by one order of magnitude, up to 1.6% (EER) and 12.7% (FRR) (annular ROI). This is despite the higher number of SIFT key-points per image with VSSIRIS due to the larger image size (∼3000 vs ∼1900 on average). It is thus interesting that the comparators employed in this paper are more robust to the variability between images in different spectra (NIR and VIS) than to the variability between images in the same (VIS) spectrum captured with different sensors. One possible explanation is that the NIR and VIS images of Cross-Eyed were captured synchronously with a dual-spectrum sensor (Figure 7), so images are expected to be very well aligned after cropping. This synchronicity and absence of time span could be one of the reasons for the better cross-spectral performance obtained with the Cross-Eyed database, or for the lower sensitivity of the SIFT comparator to changes in the ROI. Another observation is that same-sensor performance with VSSIRIS is sometimes very different depending on the smartphone employed, even though the same subjects are involved and images are resized to the same size. Contrarily, same-sensor performance with Cross-Eyed tends to be similar regardless of the spectrum employed (Table 5), which might be explained as well by the synchronicity in the acquisition mentioned above.
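Returning to the preprocessing and CNN comparators described at the beginning of this section, the sketch below illustrates CLAHE equalization and the extraction of a descriptor from an intermediate layer of an off-the-shelf network. It is illustrative code only: the chosen layer, the parameter values, and the torchvision usage are assumptions, not our exact configuration.

```python
import cv2
import numpy as np
import torch
from torchvision.models import resnet101
from torchvision.models.feature_extraction import create_feature_extractor

def clahe_equalize(gray_u8):
    """Contrast-Limited Adaptive Histogram Equalization on an 8-bit grayscale image."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # assumed settings
    return clahe.apply(gray_u8)

# Tap an intermediate layer of a pretrained network; the best layer is found
# empirically for each database (cf. Figure 12).
backbone = resnet101(weights="IMAGENET1K_V1").eval()
extractor = create_feature_extractor(backbone, return_nodes={"layer3": "feat"})

def cnn_descriptor(img_rgb):
    """img_rgb: float32 array of shape (224, 224, 3), already ImageNet-normalized."""
    x = torch.from_numpy(img_rgb).permute(2, 0, 1).unsqueeze(0)
    with torch.no_grad():
        feat = extractor(x)["feat"]                    # (1, C, H, W) activation map
    vec = feat.flatten(1).squeeze(0).numpy()
    return vec / (np.linalg.norm(vec) + 1e-12)         # unit norm for cosine scoring
```

Two images can then be scored, for example, by the dot product (cosine similarity) of their unit-norm descriptors.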
Previous works have suggested that the discrepancy in colours between VIS sensors can lead to variability in performance, which is further amplified when images from such sensors are compared against each other. The sensitivity of SIFT to changes in the ROI can also be indicative of this. Although we apply local adaptive contrast equalization, our results suggest that other device-dependent colour corrections might be of help [45]. Another difference observed here is that the best individual comparator (in terms of FRR) is not SIFT. With Cross-Eyed, SIFT was the best by a large margin, but here other comparators have similar or better performance (e.g. DenseNet201, ResNet101). This is despite the higher number of SIFT key-points per image with VSSIRIS mentioned above. Nevertheless, the correlation between bigger template size and lower error rates remains, since the comparators with the best performance (SIFT, NTNU and the three CNNs) are also the ones with the biggest feature vectors. The superiority of these comparators can also be observed in the DET curves of Figure 13.

We now carry out fusion experiments using all the available comparators. Whenever a fusion method needs training, 2-fold cross-validation [96] is used, dividing the available number of users into two partitions. We have also tested all the possible fusion combinations, with the best combinations chosen based on the lowest cross-sensor FRR at FAR=0.01% (the search procedure is sketched below). The best results obtained for an increasing number M of combined comparators are given in Figure 14 (average values of the two folds). The comparators involved in the best fusion cases are also given in Table 10. As with Cross-Eyed, cross-sensor performance is also improved significantly here by fusion. The relative EER and FRR improvement of the best fusion case is even bigger, being 87.5% and 95.2%, respectively. This is high in comparison with the reductions observed with Cross-Eyed, which were in the order of 30-40%. It is also remarkable that similar or even better absolute performance values are obtained with VSSIRIS, despite the worse performance of the individual comparators discussed in the previous section. However, this comes at the price of needing more comparators to achieve maximum performance. Even if the biggest performance improvement also occurs after the fusion of two or three comparators, the smallest error is obtained with the fusion of four comparators; in contrast, Cross-Eyed needed only two or three (see Figure 10). The fusion methods evaluated also rank in the same order here (see Figure 14). The probabilistic fusion method based on calibration (LLR) outperforms all the others, followed by the linear and polynomial SVMs. The simple average rule also matches the performance of other trained approaches at some points, but it deteriorates quickly as more comparators are combined. Lastly, the Random Forest approach performs the worst in general. In addition, the SIFT comparator is decisive in achieving lower error rates, as it is selected in every best combination (Table 10). The same observations as in Section 5.4 can be made, in the sense that calibration provides alignment of the genuine and impostor distributions around zero, and that the arrangement and spread of the distributions on both sides of the horizontal axis are indicative of the relative performance among systems.
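The exhaustive search over comparator subsets mentioned above can be sketched as follows. This is illustrative code with assumed names; the fusion function would be, for instance, the calibrated-LLR fusion or the average of normalized scores.

```python
import itertools
import numpy as np

def frr_at_far(genuine, impostor, target_far=1e-4):
    """FRR at a fixed FAR: threshold set on the impostor scores, FRR measured on the genuine scores."""
    thr = np.quantile(impostor, 1.0 - target_far)   # threshold giving ~0.01% FAR
    return np.mean(genuine < thr)                    # fraction of genuine trials rejected

def best_combination(score_dict, labels, fuse_fn, max_size=4):
    """score_dict: {comparator_name: per-trial score array}; labels: 1 genuine / 0 impostor (numpy array).
    fuse_fn(list_of_score_arrays) -> fused scores."""
    names, best = list(score_dict), (None, np.inf)
    for m in range(1, max_size + 1):
        for combo in itertools.combinations(names, m):
            fused = fuse_fn([score_dict[n] for n in combo])
            frr = frr_at_far(fused[labels == 1], fused[labels == 0])
            if frr < best[1]:
                best = (combo, frr)
    return best
```

With nine comparators there are only 2^9 − 1 = 511 non-empty subsets, so exhaustive evaluation is feasible.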
Periocular biometrics has rapidly evolved to compete with face and iris recognition [1, 2]. The periocular region has been shown to be as discriminative as the full face, with the advantage of being more tolerant to variability in expression, blur, downsampling [97], or occlusions [98, 12]. Under difficult conditions, such as people walking through acquisition portals [99, 100, 101], distant acquisition [102, 103], smartphones [45], webcams, or digital cameras [33, 91], the periocular modality has also been shown to be clearly superior to the iris modality, mostly due to the small size of the iris or the use of visible illumination. The COVID-19 pandemic has also imposed the necessity of developing technologies capable of dealing with faces occluded by protective face masks, often with just the periocular area visible [9, 7, 8].

As biometric technologies are extensively deployed, it will be common to compare data captured with different sensors or in uncontrolled, non-homogeneous environments. Unfortunately, the comparison of heterogeneous biometric data for recognition purposes is known to decrease performance significantly [11]. Hence, as new practical applications evolve, new challenges arise, as well as the need for new algorithms to address them. In this context, we address in this paper the problem of biometric sensor interoperability, with recognition by periocular images as test-bed. Inspired by our submission to the 1st Cross-Spectral Iris/Periocular Competition (Cross-Eyed) [27], we propose to mitigate this problem via a multialgorithm fusion strategy at the score level that combines up to nine different periocular comparators. The aim of this competition was to evaluate periocular recognition algorithms when images from the visible and near-infrared spectra are compared. We follow a probabilistic score fusion approach based on linear logistic regression [81, 41]. With this method, scores from multiple comparators are fused not only to improve the discriminating ability but also to produce log-likelihood ratios as output scores. This way, output scores are always in a comparable probabilistic domain, since log-likelihood ratios can be interpreted as a degree of support to the target or non-target hypotheses. This allows the use of Bayes thresholds for optimal decision-making (see the sketch below), avoiding the need to compute comparator-specific thresholds. This is essential in operational conditions, since the threshold is critical to the accuracy of the authentication process in many applications. In the experiments of this paper, this method is shown to surpass other fusion approaches such as the simple arithmetic average of normalized scores [24] or trained algorithms such as Support Vector Machines [25] or Random Forest [26]. The employed fusion approach has also been applied previously to cross-sensor comparison of the face and fingerprint modalities [23], providing excellent results in other competition benchmarks involving these modalities [43].

We employ in this paper three comparators based on the most widely used features in periocular research [12], three in-house comparators that we proposed recently [32, 33, 34], and three comparators based on deep Convolutional Neural Networks [35, 36, 37]. The proposed fusion method, with a subset of the periocular comparators employed here, was used in our submission to the mentioned Cross-Eyed evaluation, obtaining the first position in the ranking of participants.
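For completeness, a minimal sketch of the Bayes decision threshold mentioned above is given here; it is illustrative code, and the priors and costs are application parameters rather than values used in our experiments.

```python
import numpy as np

def bayes_threshold(p_target, cost_fa=1.0, cost_fr=1.0):
    """Optimal log-likelihood-ratio threshold for a given target prior and decision costs."""
    return np.log((cost_fa * (1.0 - p_target)) / (cost_fr * p_target))

def accept(llr, p_target=0.5, cost_fa=1.0, cost_fr=1.0):
    """Accept the genuine (mated) hypothesis when the calibrated score exceeds the threshold."""
    return llr >= bayes_threshold(p_target, cost_fa, cost_fr)
```

With equal costs and a 50% target prior the threshold is 0, which is why a positive log-likelihood ratio supports the genuine hypothesis; the same rule applies unchanged to any calibrated comparator or fused combination.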
This paper is complemented with cross-sensor periocular experiments using images from the same spectrum as well. For this purpose, we use the Visible Spectrum Smartphone Iris database (VSSIRIS) [28], which contains images in the visible range from two different smartphones.

We first analyze the individual comparators not only from the point of view of their cross-sensor performance (Figures 9 and 13), but also taking into account their template size and computation times (Tables 3 and 4). We observe that the comparator with the biggest template size and computation time is usually the most accurate in terms of individual performance, and also contributes decisively to the fusion. In the experiments reported in this paper, significant improvements in performance are obtained with the proposed fusion approach, leading to an EER of 0.2% in visible-to-near-infrared comparisons (Figure 10) and 0.3% in visible-to-visible comparisons of smartphone images (Figure 14). The FRR in high-security environments (at FAR=0.01%) is also very good, being 0.47% and 0.3%, respectively. Interestingly, the best performance is not necessarily obtained by combining all available comparators. Instead, the best results are obtained by fusing just two to four comparators. A fundamental problem in classifier combination is to determine which systems to retain in order to attain the best results [104]. The systems retained are not necessarily the best individual ones, especially if they are not sufficiently complementary (for example, if they employ similar features) [86]. When the comparators are properly chosen (in our case, found by exhaustive search), the performance increases quickly with the addition of a small number of them. It then tends to stabilize, until the addition of new ones actually decreases the performance. The need to retain only the best features, and the mentioned performance 'peaking' effect, is well documented [104], and can be attributed to the correlation between classifiers or to the effect of a limited sample size. Such a phenomenon has also been observed in other related studies in biometrics [105, 86, 106, 107, 91]. It is also worth noting that the comparators producing the best fusion performance (Tables 6 and 10) have individual performances that differ by one or two orders of magnitude in some cases. In the probabilistic approach employed, each comparator is implicitly weighted by its individual accuracy, so the most reliable ones have a dominant role [108]. It is, therefore, a very efficient method to cope with comparators of heterogeneous performance. On the contrary, in conventional score-level fusion approaches (like the average of scores), each comparator is given the same weight regardless of its accuracy, a common drawback that makes the worst comparators produce misleading results more frequently [24].

Another relevant observation is that cross-sensor error rates of the individual comparators are higher with the database captured in a single spectrum (VSSIRIS) than with the database containing images in different spectra (Cross-Eyed). As a result, more comparators need to be fused with VSSIRIS to achieve maximum performance. This is an interesting phenomenon, since one would expect the comparison of images captured with visible cameras to produce better results than the comparison of near-infrared and visible images.
Some authors point out that the discrepancy in colours between sensors in the visible range can be very important, leading to a significant decrease in performance when images from these sensors are compared without applying appropriate device-dependent colour corrections [45]. Since NIR images do not contain colour information, this effect may not appear in NIR-VIS comparisons.

In the present work, we use the eye corners or the sclera boundary as references to extract the periocular region of interest (ROI). While we have employed ground-truth information, an operational system would need to locate these references automatically, so inaccuracies in their location would affect subsequent processing steps. In order to decouple the periocular matching performance of the different comparators from the effects of incorrect detection and obtain a measure of their capabilities in ideal conditions [12], we have not implemented any detector of the necessary references. Even if errors in the detection influence the overall performance of the recognition chain, feature extraction methods are not necessarily affected in the same way. This is seen for example in [109] with the iris modality, which will serve as inspiration for a similar systematic study with periocular images. The amount of periocular area around the eye necessary to provide good accuracy is another subject of study, with works showing differences depending on the spectrum [110]. In VSSIRIS, the available images (captured with smartphones) contain a bigger periocular portion than images from the Cross-Eyed database (Figure 7). However, this is not sufficient to provide better cross-sensor accuracy. Therefore, an interesting avenue of future research will be to test the resilience against a variable amount of periocular area, including occlusions [12].

Another observation is that the proposed fusion method needs to be trained separately for each domain (NIR-VIS or VIS-VIS). This is not exclusive to this method but an issue common to score-level fusion methods in general. Since the scores given by different systems do not necessarily lie in the same range, they are made comparable by mapping them to a common interval using score normalization techniques [20]. Even the score distributions of a given algorithm do not necessarily lie in the same range if the operational conditions differ, such as operating in the NIR-VIS or VIS-VIS domains. Just replacing a sensor with a more recent one from the same manufacturer may have the same effect [39], and the shapes of the distributions are not necessarily equal either. One obvious effect of the difference between score distributions in different domains is that the accuracy of the comparators differs, not only in absolute numbers but also in the relative differences among them (Table 5 vs Table 9). For example, the best comparator in Table 5 is SIFT, and it is one order of magnitude better than the others. On the other hand, in Table 9, the EER of SIFT is only slightly ahead of ResNet101 or DenseNet201, and its FRR is even worse. Another observable effect of this phenomenon is that the slope of the DET curves is not the same either (Figure 9 vs Figure 13). For these reasons, the normalization and fusion algorithms will usually need different training for each context. The calibration method employed implicitly finds the weight to be given to each system, so if their absolute or relative performance changes, the weights need to change accordingly.
The same can be said about the other fusion algorithms evaluated. The number of systems needed to achieve maximum performance will not necessarily be the same either (Figure 10 vs Figure 14), nor will the individual systems involved in the fusion (Table 6 vs Table 10). These observations are also backed up by a number of previous studies with different biometric modalities [91, 111, 112, 86]. As future work in this direction, we are looking at the robustness of the different comparators to cross-domain training, i.e. training the calibration in one domain and testing in the other. We speculate that some comparators may be more robust than others, so using only those for calibration would allow transferring the training from one domain to the other without needing to re-train in the target domain. The use of several databases in one domain is also another way to test the generalization of the suggested approach by cross-database training [113].

As future work, we are also exploring deep learning frameworks to learn the variability between images in different spectra or captured with different sensors. One plausible approach is the use of Generative Adversarial Networks [114] to map images as if they were captured by the same sensor. This has the advantage that images can be compared using standard feature extraction methods such as the ones employed in this paper, which have been shown to work better if images are captured with the same sensor. In the context of smartphone recognition, where high-resolution images may be available, fusion with the iris modality is another possibility to increase recognition accuracy [91]. However, it demands segmentation, which might be an issue if the image quality is not sufficiently high [15]. This motivates pursuing the periocular modality, as in the current study. We will also validate our methodology using databases not limited to two devices or spectra, e.g. [45, 52], including more extreme variations in camera specifications and imaging conditions, such as low resolution, illumination or pose variability. For such low-quality imaging conditions, super-resolution techniques may also be helpful [115] and will be investigated as well. Finally, recent interest in learning biases around face recognition [116, 117] motivates future research to study learning biases in the periocular region and to develop new methods to reduce undesired biases [118] in that important facial region.
References

A survey on periocular biometrics research
Ocular biometrics: A survey of modalities and fusion approaches
Facial soft biometrics for recognition in the wild: Recent works, annotation and COTS evaluation
Identification using face regions: Application and assessment in forensic scenarios
Combination of face regions in forensic scenarios
Ongoing FRVT part 6A: Face recognition accuracy with face masks using pre-COVID-19 algorithms
Rank One's next-generation periocular recognition algorithm
Face ID firms battle COVID-19 as users shun fingerprinting
Iris segmentation for challenging periocular images
50 years of biometric research: Accomplishments, challenges, and opportunities
Periocular biometrics in the visible spectrum
Ocular biometrics in the visible spectrum: A survey
Deep-PRWIS: Periocular recognition without the iris and sclera using deep learning frameworks
Quality measures in biometric systems
Fusion of iris and periocular biometrics for cross-sensor identification
Biometrics beyond the visible spectrum: Imaging technologies and applications
Matching face against iris images using periocular information
Facial soft biometric features for forensic face recognition
Multiple classifiers in biometrics. Part 1: Fundamentals and review
Overview of the combination of biometric matchers
A comprehensive overview of biometric fusion
Quality-based conditional processing in multi-biometrics: Application to sensor interoperability
Score normalization in multimodal biometric systems
Multi-modal identity verification using support vector machines (SVM)
A classification approach to multi-biometric score fusion, Audio- and Video-Based Biometric Person Authentication
Cross-Eyed - Cross-spectral iris/periocular recognition database and competition
Smartphone based visible iris recognition using deep sparse filtering
Histograms of oriented gradients for human detection
Multiresolution gray-scale and rotation invariant texture classification with local binary patterns
Distinctive image features from scale-invariant key points
Compact multi-scale periocular recognition using SAFE features
Near-infrared and visible-light periocular recognition with Gabor features using frequency-adaptive automatic eye detection
Scale-level score fusion of steered pyramid features for cross-spectral periocular verification
Deep face recognition
Deep residual learning for image recognition
Densely connected convolutional networks
Graphics Gems IV
Log-likelihood score level fusion for improved cross-sensor smartphone periocular recognition
Applying logistic regression to the fusion of the NIST'99 1-speaker submissions
Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST Speaker Recognition Evaluation
Pattern Classification - 2nd Edition
Benchmarking quality-dependent and cost-sensitive score-level multimodal biometric fusion algorithms
Adapted fusion schemes for multimodal biometric authentication
Fusing iris and periocular information for cross-sensor recognition
Dynamic scale selected Laplacian decomposed frequency response for cross-smartphone periocular verification in visible spectrum
Multi-source deep transfer learning for cross-sensor biometrics
On cross spectral periocular recognition
Fusion of operators for heterogeneous periocular recognition at varying ranges
On matching cross-spectral periocular images for accurate biometrics identification
Periocular recognition in cross-spectral scenario
Multi-spectral imaging for robust ocular biometrics
Cross spectral periocular matching using ResNet features
FIRME: Face and iris recognition for mobile engagement
A completed modeling of local binary pattern operator for texture classification
Modeling the shape of the scene: A holistic representation of the spatial envelope
Representing shape with a spatial pyramid kernel
WLD: A robust local image descriptor
Vision with Direction
Periocular recognition: Analysis of performance degradation factors
Retinal vision applied to facial features detection and face authentication
Text-independent writer identification and verification using textural and allographic features
Steerable pyramids and tight wavelet frames in L2(R^d)
Rotation invariant texture characterization and retrieval using steerable wavelet-domain hidden Markov models
Rotation-invariant texture retrieval with Gaussianized steerable pyramids
A parametric texture model based on joint statistics of complex wavelet coefficients
Modeling multiscale subbands of photographic images with fields of Gaussian scale mixtures
Comparison and fusion of multiresolution features for texture classification
Novel face recognition approach based on steerable pyramid feature extraction
Steerable pyramid-based face hallucination
The steerable pyramid: A flexible architecture for multi-scale derivative computation
The design and use of steerable filters
Rotation invariant texture recognition using a steerable pyramid
Blur insensitive texture classification using local phase quantization, in: Image and Signal Processing
Iris recognition based on SIFT features
Iris recognition with off-the-shelf CNN features: A deep learning perspective
Periocular recognition using CNN features off-the-shelf
CNN features off-the-shelf: An astounding baseline for recognition
Labeled faces in the wild: A database for studying face recognition in unconstrained environments
Face recognition in unconstrained videos with matched background similarity
Application Independent Evaluation of Speaker Detection
Emulating DNA: Rigorous quantification of evidential weight in transparent and testable forensic speaker recognition
System combination using auxiliary information for speaker verification
Expert Conciliation for Multi Modal Person Authentication Systems by Bayesian Statistics
On Combining Classifiers
Combining Multiple Matchers for Fingerprint Verification: A Case Study in FVC2004
The Nature of Statistical Learning Theory
Random forests
Cross-Eyed 2017: Cross-spectral iris/periocular recognition competition
Secure iris recognition based on local intensity variations
Comparison and fusion of multiple iris and periocular matchers using near-infrared and visible images
Human and machine performance on periocular biometrics under near-infrared light and visible light
Periocular region appearance cues for biometric identification
Incorporating image quality in multi-algorithm fingerprint verification
Biometric Performance Testing and Reporting - Part 1: Principles and Framework
Statistical pattern recognition: A review
Performance evaluation of local appearance based periocular recognition
Subspace-based discrete transform encoded local binary patterns representations for robust periocular matching on NIST Face Recognition Grand Challenge
On the fusion of periocular and iris biometrics in non-ideal imagery
A comparative evaluation of iris and ocular recognition methods on challenging ocular images
Matching highly non-ideal ocular images: An information fusion approach
Human identification from at-a-distance images by simultaneously exploiting iris and periocular features
Soft biometrics and their application in person recognition at a distance
Small sample size effects in statistical pattern recognition: recommendations for practitioners
An experimental comparison of classifier fusion rules for multimodal personal identity verification systems
Combining multiple matchers for fingerprint verification: A case study in BioSecure Network of Excellence
BioSecure reference systems for on-line signature verification: A study of complementarity
Discriminative Multimodal Biometric Authentication Based on Quality Measures, Pattern Recognition
Experimental analysis regarding the influence of iris segmentation on the recognition rate
Best regions for periocular recognition with NIR and visible images
Scenario-based score fusion for face recognition at a distance
Robustness of signature verification systems to imitators with increasing skills
Dataset bias exposed in face verification
Advances in Neural Information Processing Systems
A survey of super-resolution in iris biometrics with evaluation of dictionary-learning
Demographic bias in biometrics: A survey on an emerging challenge
A comprehensive study on face recognition biases beyond demographics
SensitiveNets: Learning agnostic representations with application to face images