key: cord-0906425-pa80d27z authors: Kamboj, Aman; Rani, Rajneesh; Nigam, Aditya title: A comprehensive survey and deep learning-based approach for human recognition using ear biometric date: 2021-04-22 journal: Vis Comput DOI: 10.1007/s00371-021-02119-0 sha: c4b0b74703821f7fb6c8f8b3f5947bbce92093ee doc_id: 906425 cord_uid: pa80d27z

Human recognition systems based on biometrics are much in demand due to increasing concerns of security and privacy. The human ear is unique and useful for recognition. It offers numerous advantages over popular biometric traits such as the face, iris, and fingerprint. A lot of work has been devoted to ear biometrics, and the existing methods have achieved remarkable success over constrained databases. However, in the unconstrained environment, a significant level of difficulty is observed as the images experience various challenges. In this paper, we first provide a comprehensive survey on ear biometrics using a novel taxonomy. The survey includes in-depth details of databases, performance evaluation parameters, and existing approaches. We have introduced a new database, NITJEW, for the evaluation of unconstrained ear detection and recognition. Modified deep learning models, Faster-RCNN and VGG-19, are used for the ear detection and ear recognition tasks, respectively. A benchmark comparative assessment of our database is performed with six existing popular databases. Lastly, we provide insight into open-ended research problems worth examining in the near future. We hope that our work will be a stepping stone for new researchers in ear biometrics and helpful for further development.

In the last decade, much progress has been witnessed in human recognition for border security, surveillance, access control, banking, etc. Humans are recognised based on possession (something they have), knowledge (something they know), and biometrics (something the person is). Possession- and knowledge-based methods often fail in real scenarios, as an item under possession may be stolen and a PIN or password may be forgotten, leaving one's identity vulnerable to breach. Biometric-based recognition methods are better than possession- or knowledge-based methods because they provide more security. Therefore, the recognition of humans using biometrics is a widely adopted method. Researchers have reported biometric systems using physiological traits such as the face [1], fingerprint [2], iris [3], palm print [4], knuckle print [5], and ear [6]. Figure 1 depicts examples of these physiological biometric traits. Every biometric trait has its advantages and disadvantages, and no single biometric trait is universal. Table 1 depicts various challenges and issues of physiological traits, as discussed by [7-10]. The human ear structure is depicted in Fig. 2, in which the 11 major anatomical ear components are shown. The helix is the outer rim that surrounds the ear, and the lobe is its lower part. The antihelix runs parallel to the outer helix. The area between the inner helix and the lower branch of the antihelix forms the concha, which has a shell-like shape. The lower part of the concha merges into a sharp intertragic notch. The crus of helix is the area of intersection between the helix and antihelix. A little bump on the side of the ear canal is the tragus. In 1890, French criminologist [11] first identified the ear structure's uniqueness and suggested its use as a biometric.
Later, in 1989, [12] practically investigated this aspect by collecting 10,000 ear images and identified that they were unique. He also suggested that ears are unique even among twins. This research supports evidence for the unique shape of the human ear. Police have used ear patterns as proof for recognition [13], and they have also been used as scientific evidence by Polish courts [14]. Amazon's patent on the ear shows that it may be useful in the near future to answer phone calls directly without unlocking the device and to control various features from a distance. Unlike the face, which changes with age, the ear's shape remains stable up to the age of 70 years [15]. Moreover, ear images are not affected by makeup and expression, whereas images of the face are [16]. The fingerprint and iris are intrusive and require much user cooperation at acquisition, whereas ear images can be acquired covertly without any consent from the target. Therefore, they are useful in surveillance and forensic investigations. A dedicated sensor is also required to capture fingerprint and iris data, whereas ear images can be acquired using the existing cameras on mobile devices. Additionally, the ear is useful in scenarios when only the side face of a person is available. Multi-modal biometrics ask users to provide multiple traits, which strengthens liveness detection and protects against various spoofing attacks. The human ear can be combined with biometrics such as the face, iris, and side face to improve security and performance. The data for the face and ear can be captured simultaneously, which facilitates building multi-modal recognition. The performance of an ear biometric system generally degrades due to occlusion by hair and illumination variations. An alternative is to use infrared images, as they are generated from body heat patterns and are independent of visible light conditions. Recent studies for human recognition have shown the use of infrared images and the fusion of both visual and infrared images [17-20]. Moreover, infrared images are also useful for the detection of spoofing attacks. With the increasing concern over the COVID-19 disease, touch-based biometrics like fingerprint, iris, and palm print are best avoided for public safety. Therefore, there is a huge demand for contactless biometric systems for human recognition in real-world applications, such as marking attendance in offices, access control, banking, and surveillance. The face is also a non-intrusive biometric, but it faces a challenge as faces are being concealed with masks. The ear is a useful biometric in this situation due to its non-intrusive nature, and it can be acquired even when the face is covered with a mask, as the ear region remains uncovered. Figure 3 depicts the overall architecture of the ear biometric system. Initially, a database of side face images is collected using a camera or video footage. During data collection, there are chances that certain kinds of noise may be introduced into the images, so they are first pre-processed. The next stage is to detect the location of the ear in the image. This step is crucial and needs to be error-free as it affects the system's overall performance. The cropped ear suffers from various challenges like pose and scale. Hence, ear normalisation is performed to eliminate these issues, and the ears are aligned along a vertical line. In the next step, a robust method is required to extract the ear biometric's unique features.
These features can be obtained using handcrafted or deep learning methods. These unique features, with their identities, are stored in the database; this is called the enrolment stage. During matching, a query image and identity pass through the system, and a matching score is calculated between the stored features and the features of the query image based on Euclidean distance. The matching score is compared against a threshold to decide between a genuine and an impostor match. Recognition of a person from the ear has made much progress. However, the existing work is mostly performed in laboratory-like conditions, and minimal work has been reported in the wild scenario. In the real world, images suffer due to varying pose, illumination, background clutter, and occlusion by ear accessories and hair.

(Caption of Fig. 2: Human Ear Anatomy. The human ear has very distinctive structural components. The outer ear is dominated by the shape of the helix rim and lobe. The inner ear has many prominent features like the antihelix, incisura intertragica, concha, triangular fossa, crus of helix, and tragus.)

Figure 6 depicts some images in the unconstrained environment. One can observe that these environmental conditions affect the images greatly and make ear detection and recognition very difficult. Moreover, the existing databases lack size and have few images per subject. Therefore, they are not compatible with deep learning technology. Contributions: The main motive of this work is to consolidate the research on ear biometrics, in which we provide an up-to-date status of benchmark databases, performance evaluation parameters, and existing methods on ear biometrics. We have also introduced a new large-scale database to evaluate ear detection and recognition methods. Additionally, we have performed a standardised evaluation using state-of-the-art deep learning methods. A comparative assessment of these methods is performed on seven benchmark databases. We have highlighted that there is significant room for further improvement in unconstrained ear detection and recognition. We have made the following contributions in this paper: This section provides an in-depth overview of ear biometrics. We have categorised this work into benchmark databases, performance measurement parameters, and existing approaches, as illustrated in Fig. 4. A detailed discussion about these is provided in subsequent sections. This section discusses the existing benchmark databases to assess the performance of ear detection and ear recognition algorithms. We have classified these databases based on the constrained and unconstrained environment, and their detailed discussion is provided in subsequent sections. In the constrained environment, the images are acquired in laboratory conditions in a controlled manner, and the source of variability is pre-decided. The sample images of the databases are depicted in Fig. 5, and their summary is provided in Table 2. The detailed discussion about these databases is as follows: (a) IITD: The database was designed by [23]. It contains images of 121 subjects. The images are in greyscale format, and for each subject, there are three images. The images include slight angle variations. (b) IITK: The database was contributed by [24]. It has three subsets, viz. IITK-I, IITK-II, and IITK-III. -IITK-I: It has 801 side face images obtained from 190 different subjects, with two to four images per subject. -IITK-II: It has 801 face images obtained from 89 subjects.
Each subject has nine images with various small rotations (looking 20° down and 20° up) and at three different scales. (c) USTB: This database was contributed by [25]. It has four subsets. In the unconstrained environment, images experience challenging environmental conditions. The sample images of the databases are depicted in Fig. 6, and their summary is provided in Table 3. The detailed discussion about these databases is provided as follows: (a) UBEAR: The University of Beira Interior contributed this database in 2011 [30]. The images are acquired from moving subjects under varying pose, illumination, and occlusion. From each video sequence of the 126 volunteers, 17 images are captured. The images are in greyscale format with 1280 × 980 resolution. The database has the following two collections: -UBEAR-1: It contains 4606 images, and the ear's ground truth location is provided. -UBEAR-2: It contains 4606 images, and the ear's ground truth location is not provided. (b) AWE: The AWE database was provided by the University of Ljubljana [29]. It contains 1000 images from 100 different subjects collected from the web. Each subject has ten images. The images experience challenges of pitch and yaw angles and occlusion due to accessories and hair. (c) In the Wild Ear (ITWE): This database was contributed by Imperial College London [31]. It has the following two collections: -Collection-A: It contains images from 605 subjects. In each image, the ear is annotated with 55 unique landmark points.

(Caption of Tables 2 and 3: The column "Images" indicates the total images in the database, and the column "Sub" represents the total number of subjects in the database. The column "Resolution" indicates the number of pixels of the images. The column "Sides" indicates whether images of both the right and left sides of the face are present in the database. The columns "Gender" and "Age Group" specify the presence of both sexes and the kind of age group the subjects belong to. The last column "Description" stands for the type of images and the environment used to collect the images.)

An ear-based biometric system is judged by the quality of its ear detection and recognition modules. The benchmark parameters to assess the performance of these modules are discussed in subsequent sections. The ear detection module detects the ear in the face image. The benchmark parameters used to assess the performance of this module are described as follows: (a) Intersection Over Union (IOU): It is the ratio of the area between the ground truth and the predicted box and is calculated using Eq. (1):

IOU = (X ∩ Y) / (X ∪ Y)   (1)

Here, X is the ground truth box, which is manually marked over the desired region of interest, and Y is the box predicted by the model. X ∩ Y represents the intersection area between X and Y, and X ∪ Y is the union of the areas of X and Y. The value of IOU ranges from 0 to 1, where 0 signifies no overlap and 1 indicates a tightly bound box.
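For concreteness, the IOU of two axis-aligned boxes can be computed as in the minimal sketch below (the (x, y, w, h) box format matches the ground truth annotation used later in this paper; the function name and example values are illustrative):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x, y, w, h), where (x, y) is the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Corners of the intersection rectangle.
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Ground truth vs. a slightly shifted prediction.
print(iou((10, 10, 50, 80), (15, 12, 50, 80)))  # ~0.78
```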
An IOU score of more than 0.5 indicates good quality of detection. (b) Precision: It is measured as the ratio of true positive pixels to the sum of true positive and false positive pixels and is calculated using Eq. (2):

Precision = TP / (TP + FP)   (2)

(c) Recall: It is measured as the ratio of true positive pixels to the sum of true positive and false negative pixels and is calculated using Eq. (3):

Recall = TP / (TP + FN)   (3)

(d) F1 Score: This score indicates the overall performance of the system and is calculated using Eq. (4):

F1 = 2 × (Precision × Recall) / (Precision + Recall)   (4)

Here, TP represents true-positive ear pixels that are correctly detected, FP represents false-positive pixels in which non-ear pixels are detected as ear, and FN represents false-negative pixels in which the background region is classified as ear. TN (true negative) is zero, as ear detection is a one-class problem. An ear recognition system is evaluated based on identification and verification. Identification answers "who are you?" and is also known as 1-to-n matching: the presented biometric is compared with all the biometrics enrolled in the database, and the best match is returned. Law enforcement and border security control applications work in identification mode. This system operates in two modes: open set and closed set. In the open set, the biometric data that an individual presents are either enrolled in the database or not, whereas in the closed set, the system returns the identity of the person whose reference has the highest match score with the presented identity. The identification accuracy of a system is measured in terms of the correct recognition rate (CRR). Verification is proving the identity of someone and is also called 1-to-1 matching. The claimed identity is matched against a specific biometric present in the database. The verification accuracy of a system is measured using the equal error rate (EER). A receiver operating characteristic (ROC) curve is plotted between the false acceptance rate (FAR) and the false rejection rate (FRR); the point on the curve at which FAR = FRR is called the EER. The detail about these parameters is provided as follows. The decidability index (DI) indicates how well the system distinguishes between the imposter and genuine scores and is calculated using Eq. (7):

DI = |μ_G − μ_I| / √((σ_G² + σ_I²) / 2)   (7)

Here, μ_G and σ_G are the mean and standard deviation of the genuine scores, and μ_I and σ_I are the mean and standard deviation of the imposter scores. This section provides an in-depth review of existing approaches. We have classified these approaches into four categories, viz. handcrafted features, deep learning features, multimodal, and ear anti-spoofing, as illustrated in Fig. 4. A detailed discussion on these approaches is provided in subsequent sections. This section discusses the various ear detection and recognition approaches based on handcrafted features. The traditional methods for feature extraction rely on hand-engineered features designed by domain experts to solve a specific domain's problem. The handcrafted features are learned using descriptors such as histograms of oriented gradients (HOG), binarised statistical image features (BSIF), scale-invariant feature transform (SIFT), local binary patterns (LBP), Gabor filters, and speeded-up robust features (SURF). Handcrafted feature learning-based methods are discussed in [34]. These descriptors encode distinct patterns based on texture, colour, edges, curvatures, shape orientations, and many more unique patterns in the images and learn the underlying features to understand the data.
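To illustrate one such descriptor, the sketch below encodes a greyscale ear image as a normalised histogram of uniform LBP codes using scikit-image (an assumed tooling choice for illustration, not the implementation of any surveyed paper):

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_image, points=8, radius=1):
    """Normalised histogram of uniform LBP codes: a typical
    handcrafted texture descriptor for ear images."""
    codes = local_binary_pattern(gray_image, points, radius, method="uniform")
    n_bins = points + 2  # uniform patterns plus one bin for non-uniform codes
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist  # feature vector, ready for a classifier
```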
These learned features are fed to classifiers such as SVM, KNN, random forest, or a neural network to classify the data. The use of these features for ear biometrics is described below. (a) Ear Detection: Here, we provide a detailed analysis of ear detection approaches based on handcrafted features. The summary of these approaches is provided in Table 4, and their detailed discussion is as follows: In [35], the authors proposed an ear localisation technique in which the clustering of edges is used for ear detection. Initially, pre-processing is performed based on the skin and non-skin regions, and then the edge map is computed. Edge length and curvature are used to eliminate spurious regions. Then, hierarchical clustering of edges is used to localise the ear. The method was evaluated on the IITK database. The main advantage is that their method is fast, as it prunes out 60% of the side face image area and searches for the ear only in the skin region. Additionally, cross-correlation is applied to verify the detection of the ear. As an evaluation criterion, a detected ear with up to 10% non-ear pixels is considered a correct localisation. The disadvantage is that their method fails in some cases when the images are noisy and heavily occluded by hair. Then, [36] presented an ear detection system for real-time scenarios. They used Haar features and a cascaded AdaBoost classifier. The cascaded AdaBoost classifier arranges classifiers in stages, so that a segment is passed to the next stage only once the current strong classifier accepts it. They also found that the cascaded AdaBoost classifier takes less time than simple AdaBoost, since most irrelevant segments are discarded at an early stage. The proposed technique was validated on UMIST, UND, WVHTF, and USTB. The method requires around 1.5 GB of memory space. The advantage is that their method is invariant to noise, multiple ears of different scales, and partial occlusion. However, the main drawback is that they have not specified the criteria for correct ear detection. In another study [37], an edge detection and template-based matching approach is presented for ear detection. Initially, skin segmentation is performed, and then the nose tip is detected. After that, the face region that contains an ear is extracted. In the edge-based method, connected component labelling is applied to the extracted region, and a rectangle is drawn where the maximum number of connected edges is found. In the template-based method, a template is designed by taking the average intensities of ear images. Then, the NCC (normalised correlation coefficient) is computed at every pixel, as sketched below. The technique has been evaluated on the CVL (Computer Vision Laboratory) database, and it has been found that the edge-based detection approach achieved more accurate results than the template-based one. The advantage is that their method is simple and easy to implement. The disadvantage of the skin-based approach is that if the skin is not segmented properly, it causes false ear detections. The template-based approach has the disadvantage that the template needs to be recreated for every database. Their method also fails if the side face image is oriented at an angle. Additionally, the work has not been evaluated on any benchmark database.
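For reference, the NCC search used in such template-based detection can be written in a few lines with OpenCV; this is a hedged sketch of the general technique, not the implementation of [37], and the file names and acceptance threshold are placeholders:

```python
import cv2

image = cv2.imread("side_face.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("mean_ear_template.png", cv2.IMREAD_GRAYSCALE)

# Normalised correlation response at every valid pixel position.
response = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(response)

h, w = template.shape
if max_val > 0.6:  # illustrative acceptance threshold
    x, y = max_loc
    print(f"Ear candidate at ({x}, {y}), size {w}x{h}, NCC = {max_val:.2f}")
```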
(Caption of Table 4: The column "Database" represents the database used for training and testing, the column "Pre-processing" represents the technique used to pre-process the images for better feature representation, the column "Technique" specifies the method used by the authors, the column "Evaluation Criteria" is the method used for evaluation of correct ear detection, and the last column "Accuracy" denotes the performance of the system.)

(Caption of Table 5: The column "Database" represents the database used for training and testing. The column "ROI" represents whether an ear detection technique is used to detect the ear, and the column "Feature" specifies the kind of method used to learn robust ear features. The column "Classifier" describes the method used to classify the learned features into their class. The columns "CRR" and "EER" are the performance evaluation parameters.)

In [24], the authors proposed a connected-components-of-graph-based approach for ear detection. The technique has been evaluated on IITK, UND-E, and UND-J2. The advantage of their method is that it is invariant to the pose, scale, and shape of the ear. The main drawback is that the method fails to detect the ear in images occluded by hair, or under noise and poor illumination. A geometric-based approach was used in [38] for ear detection. Three parameters, namely elongation, compactness, and rounded boundary, were used. The technique was evaluated on the UND-J2 database. The method is fast and has shown good results, but it is evaluated on a database with very few images. A swarm optimisation for ear detection is used in [39], in which the image is first processed by a skin segmentation algorithm. An entropy map is used to detect the location of the ear, and an entropic classifier is used to check whether the ear is detected correctly. Their method is evaluated on four different databases, viz. Pointing Head Pose, CMU PIE, UMIST, and Colour FERET. The main drawback is that none of the databases used for the evaluation is a benchmark ear database, and they have not compared their performance with existing methods. In another study [40], an entropy-cum-Hough-transform technique is applied. A combination of an ear localiser and an ellipsoid ear classifier is used to identify the presence of the ear in the face image. The technique is validated on five databases, viz. FERET, Pointing Head Pose, UMIST, CMU-PIE, and FEI. The localisation error rate has been used as a measure of true ear region detection and is calculated as the distance between the centre of the detected ear region and the annotated ground truth. However, the main drawback is that they have not evaluated their method on standard benchmark ear databases and did not compare the results with existing popular methods. In another study [41], the authors performed feature-level fusion of features extracted from the depth and texture of images and exploited context information. A Canny edge detector is used to extract the edges from the image. The experiment was validated on the UND-J2 collection. Their method has shown invariance to rotation; the authors used the intersection-over-union parameter and considered 50% overlap between the ground truth and the predicted box as the evaluation criterion for true ear region detection. A template-based approach is used in [42]. Template matching is performed using dynamic programming. The technique was validated on 212 images collected from the internet. They have used the ROC curve as an evaluation measure.
The drawback is that they have not evaluated the method on any standard benchmark database. In [43], the authors used a Banana wavelet and Hough transformation-based technique for ear detection. The Banana wavelet is used to detect curvilinear ear features, and the Hough transformation is used to find circular regions to improve the accuracy. They used adaptive histogram equalisation and the top-hat operation to pre-process the images. The method was evaluated on standard databases and has shown superior performance to template- and morphology-based operations. For verification of correct ear localisation, they used LBP and Gabor features with SVM and KNN as classifiers to assess whether the detected region belongs to the ear or not. The main disadvantage is that this automatic verification of a correct ear has shown poor performance; more robust features need to be extracted for further improvement. (b) Ear Recognition: Here, we provide a detailed analysis of ear recognition approaches based on handcrafted features. The summary of these approaches is provided in Table 5, and their detailed discussion is as follows: [44] presented a human ear recognition method. A Haar wavelet was used to locate the ear. The image is indexed using Partitioned Iterated Function Systems (PIFS). The experiment was validated on the UND-J2 database with 228 ear images. Their method has shown robustness to occlusion and superior results to PCA, LDA, KDA-Poly, and OLPP. However, the method has been evaluated on small databases, and exact performance measurement parameters are not used in the evaluation. In [45], the author presented ear recognition using SIFT features and homography distance. The homography distance is calculated between any four points matched between the query and test images. Their method has shown superior performance to PCA and robustness to 18% occlusion, 13 degrees of pose variation, and background clutter. The disadvantage of their method is that they have not specified the evaluation criteria used to test its performance and have not used any standard benchmark database. In another study, a model-based method was designed using SIFT features [46]. A wavelet-based method was used to capture the outer ear boundary. Ear features are enrolled in the database and are matched based on the part selected by the model. The method has been tested on 2D face images of the XM2VTS database using 269 images of 150 subjects. The drawback is that the method is evaluated on a small database, and exact performance measurement parameters are not used for its evaluation. A morphological and Fourier descriptor-based method was used in [47] to segment the ear. Then, Gabor, log-Gabor, and complex Gabor filters were used to extract local information. The method was evaluated on a private database of 465 ear images of 125 subjects. The experimental results indicate that log-Gabor-based features outperform approaches like eigen-ears, force field transforms, and shape features. However, the work is not validated on any benchmark ear database, and the images have limited orientation and scale variation. In [48], a 2D quadrature filter-based approach was employed. Morphological operators and Fourier descriptors were used for ear segmentation. Quaternionic and monogenic quadrature filters were used for feature extraction.
The technique has been evaluated on the UND and IITD ear databases, and the results indicate that the quaternionic quadrature filters perform better than the monogenic quadrature filters. The drawback of their method is that it is evaluated on databases with images possessing few variations. [49] presented ear recognition using feed-forward artificial neural networks. They defined seven elements of ear features for 51 ear images from 51 different subjects. After measuring these features, they conducted several experiments by varying the layers and numbers of neurons. The results indicate that 95% accuracy is achieved using a 30-layer neural network with 18 neurons. The disadvantage is that their method is validated on a private database with very few images. In [50], the authors used local binary patterns (LBP) for ear feature extraction. The LBP is applied to get histograms for matching. The experiment was validated on the IIT Delhi ear database, which contains cropped ear images of 125 different subjects. The experimental results suggest that LBP performs better than PCA. The drawback of their method is that it is evaluated on a database with images captured in an indoor environment that possess little variation. In [51], an unsupervised clustering-based technique was used. A descriptor-based approach comprising histograms of oriented gradients, local binary patterns, and local phase quantisation is used for ear classification. The technique is validated on three databases: UND-J2, AMI, and IITK. The disadvantage is that the method is evaluated on images having few variations. In another study [52], an AdaBoost algorithm and Gabor filters are used for ear recognition. Kernel Fisher discriminant analysis was used for dimensionality reduction. The technique was evaluated on the USTB and UND databases. In another study [53], the author proposed ear recognition using a gradient ordinal relationship pattern. They used connected components of a graph to crop the ear from a profile face image, and a reference point-based normalisation technique is used to align the ear. They used the IITD and UND-E collections for validation and achieved superior performance to existing methods. However, the images they used possess few variations and are captured in an indoor environment. A geometrical features-based approach was used in [54]. A snake-based model is used to localise the ear, and then geometrical features are used for recognition. They used the IIT Delhi ear database for validation. Their method is evaluated on a small database, and the images possess small variations and are captured in an indoor environment. In [55], the authors developed the CVL toolbox for ear biometrics. The CVL ear toolbox provides a standardised framework for ear biometrics in the wild. They included four different databases, including WPUTEDB containing 3348 images, the IIT Delhi database of 493 images, and the University of Notre Dame database containing 3480 2D and 3D ear images. They used HOG, SIFT, SURF, and MSER features for experimentation and achieved the maximum identification accuracy using HOG. The tool is useful for the benchmark evaluation of ear recognition methods. In [56], the author presented a comparative analysis of LBP and its variants for ear recognition. They also suggested average and uniform variants of LBP. The method was evaluated on three databases: IITD-I, AMI, and AWE. Their method has shown good performance on the constrained databases, with a significant drop in performance over unconstrained images.
A scattering wavelet network-based approach was used in [57] for unconstrained ear recognition. This method is able to extract robust features invariant to small translations and deformations. The method was evaluated on the AWE and USTB-HelloEar databases. The method has shown superior performance in comparison with existing local feature descriptor-based methods; however, on the unconstrained database, its performance remains poor. In [58], the author proposes a robust local-oriented patterns technique for ear recognition. The method learns local structure information by utilising edge directional information. The robust features extracted by the descriptor are invariant to rotation and illumination. The method was evaluated on the AMI, IITD-II, and AWE databases. The method has shown superior performance in comparison with other descriptor-based approaches. However, the performance observed on the unconstrained database is low. In [59], the author proposed a handcrafted feature-based technique. A Gabor-Zernike operator was used to extract global features and a local phase quantisation operator to extract local features. A genetic algorithm (GA) was applied to select optimal features. The method was evaluated on three databases (USTB-I, IITD-I, and IITD-II) and obtained promising results. On an unconstrained database, the method has shown poorer performance than deep learning-based approaches. This section discusses the various ear detection and recognition approaches based on deep feature learning. With the advancement of artificial intelligence techniques and powerful convolutional neural networks (CNNs), solutions to various computer vision problems have improved. Deep learning approaches have been inspired by the functioning of the human brain and have shown improvements in detection, recognition, regression, and classification problems.

(Caption of Table 6: The column "Database" represents the database used for training and testing, the column "Pre-processing" represents the technique used to pre-process the images for better feature representation, the column "Technique" specifies the method used by the authors, the column "Evaluation Criteria" is the method used for evaluation of correct ear detection, and the last column "Accuracy" denotes the performance of the system.)

(Caption of Table 7: The column "Database" represents the database used for training and testing. The column "Feature Extraction" specifies the kind of method used to learn robust ear features. The column "Classifier" describes the method used to classify the learned features into their class. The columns "CRR" and "EER" represent the correct recognition rate and equal error rate used to assess the performance of the system.)

One of the first neural networks, LeNet [60], was designed for the recognition of 10 handwritten digits; later, neural networks became more complex and classified 1000 classes on ImageNet. Popular networks for object detection are Faster-RCNN [21], Mask-RCNN [61], SSD [62], and SSH [63], and for object recognition VGG-16 [22], ResNet-150 [64], and Siamese networks [65]. These networks have multiple nonlinear layers, such as convolutional, max-pooling, batch normalisation, and activation layers. Each network has millions of parameters that must be trained on a large database. The advantages and challenges of these techniques for ear biometrics are described below. The use of these features for ear biometrics is as follows: (a) Ear Detection: Here, we provide a detailed analysis of ear detection approaches based on deep feature learning.
The summary of these approaches based on deep learning is provided in Table 6, and their detailed discussion is as follows: A multiple-scale Faster-RCNN was employed for ear detection in an unconstrained environment [66]. The network was trained over multiple scales of images, such as head, pan-ear, and ear. The experiment was validated over web images and the UND-J2 and UBEAR databases. The method achieved a high ear detection rate over images that suffer from occlusion, scale, and pose variations. Also, they used a region filtering approach to eliminate the redundant boxes predicted by the network; finally, the boxes with the highest scores are considered. Their method has shown remarkable performance. The drawback is that they used the objectness score as an evaluation parameter to measure the performance, which is not a standard way to measure the performance of an object detection network. In [67], the authors used manually annotated ear landmarks to train a CNN, which then obtained geometric morphometric distances automatically. Images from the CANDELA project, captured in an unconstrained environment, were used for ear detection. The drawback of their method is that it is not evaluated on any standard database and does not use any standard performance measurement parameters. In another study, an encoder-decoder-based pixel-wise ear detection approach was presented [68]. Their architecture is highly inspired by SegNet. The technique can distinguish between ear and non-ear pixels. At a later stage, a post-processing step is performed to eliminate spurious regions. They evaluated the technique on the AWE database and also used Haar-based features for comparison. Their method has shown superior results to the Haar-based approach, and they analysed the impact of environmental covariates on ear detection. In addition, the authors used the IOU parameter to measure the performance. In [69], the authors used an average of three CNNs of the same architecture but three different sizes (small, medium, and large) for ear detection. The technique has been validated on the UND, AMI, UBEAR, and Video databases. A spatial contrastive normalisation technique was used as a pre-processing step to enhance the quality of the images. To improve the performance, they used partition and grouping algorithms to clean up multiple overlapping windows. The major drawback is that they have not specified any evaluation criteria to correctly assess the performance of their method. An ensemble-based CNN model was used by [69] to detect the ear. Initially, three CNNs of different sizes were trained separately, and then the weighted average of the models was used to detect the ear. They evaluated the technique on the IIT Indore and AWE databases. The authors considered the intersection over union (IOU) parameter to measure the performance. Additionally, they specified that their method is robust to occlusion by hair, but they did not analyse the performance under other environmental covariates. In a recent study [70], the author presented a context-aware ear detection network for an unconstrained environment. The method was evaluated on six publicly available benchmark databases. The authors also used the IOU parameter for standardised evaluation. Their method outperformed various state-of-the-art methods.
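Many of the detectors above follow a common recipe: start from a detection network pretrained on natural images and retarget it to the single ear class. The sketch below shows that recipe with torchvision's Faster-RCNN (an assumed toolchain for illustration; note that its ResNet50-FPN backbone differs from the VGG-19 backbone used later in this paper):

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Load a detector pretrained on COCO and replace its box predictor so the
# classification head outputs only two classes: background and ear.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
num_classes = 2  # background + ear
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Training then follows the usual torchvision detection loop: the model takes
# a list of images plus {"boxes", "labels"} target dicts and returns a dict of
# classification and box-regression losses to be summed and backpropagated.
```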
(b) Ear Recognition: Here, we provide a detailed analysis of ear recognition approaches based on deep feature learning. The summary of these approaches based on deep learning is provided in Table 7, and their detailed discussion is as follows: A CNN features-based approach was used in [71]. The neural network has convolutional, max-pooling, and fully connected layers. The experiment was performed on the USTB-III ear database. The method is not evaluated using standardised evaluation parameters and is validated only on a constrained database with few images. In another study, the VGG-Face model has shown superior performance to other models. The drawback is that the authors have not evaluated their model using any standard parameters. In [75], the authors proposed an ear recognition pipeline in which ear detection is performed using RefineNet and recognition using ResNet and handcrafted features. The method is evaluated on the UERC database, and the deep learning-based approach has shown superior results. The major disadvantage is that they employed existing methods for ear detection and recognition, and the work has limited novelty. In a recent study [6], the authors explored the use of deep learning models such as VGG, ResNeXt, and Inception. They employed various learning strategies, such as feature extraction, fine-tuning, and ensemble learning. Also, they evaluated the performance based on custom-designed input sizes of the images. They evaluated the results on the EarVN1.0 database, which has ear images from an unconstrained environment. The drawback of their method is that it is evaluated on only one database, and a comparative performance evaluation on other popular databases and techniques is not performed. The authors of [76] employed a custom-designed six-layer deep CNN model for ear recognition. They evaluated their method on the IITD-II and AMI databases. Their method is not evaluated using standardised parameters, and the performance is evaluated only on constrained databases. A deep constellation-based ear recognition approach was provided in [77]. They used a two-pathway CNN approach to learn local and global information. To learn global information, the whole image is given as input, and for local information, image patches are provided. The network is evaluated on the unconstrained ear images of the AWE database. However, the method has shown somewhat poor performance. A NASNet model was used in [78] to perform ear recognition. The network was evaluated on the UERC-2017 database. The performance of the network is compared with VGG, ResNet, and MobileNet. They provided an optimised network by reducing the number of learnable parameters and the number of operations. NASNet outperformed the other methods and achieved the highest recognition rate. This section provides details of various multimodal ear recognition approaches. A single biometric cannot fulfil the security requirements of all applications. A unimodal system has various challenges like noisy data, intra-class variations, inter-class variations, uniqueness, and spoofing. Fusing information from multiple biometrics provides a more secure and reliable solution, as discussed in [81-83]. To overcome these problems, a multimodal system that combines information from more than one modality is suitable for improving security and performance in identification and matching tasks. In multimodal biometrics, the information acquired from multiple biometrics is fused at three levels, viz.
(1) feature, (2) score, and (3) decision level. At the feature level, features from the different biometric modalities are extracted and fused, and the fused vector is provided to the matching module; however, this is only possible when the biometric modalities are compatible with each other. At the score level, feature vectors are extracted separately and a matching score is calculated for each modality; these matching scores are then normalised and combined before being given to the matching module. At the decision level, each biometric modality produces its own decision, and the decisions are then combined to make a final decision. The ear biometric is suitable to combine with other modalities such as the face, iris, side face, and hand geometry to improve accuracy and security. The summary of these multimodal methods is provided in Tables 8 and 9, and their detailed discussion is provided as follows.

(Table residue, advantages and challenges of handcrafted vs. deep learning approaches:
Handcrafted: − They are problem-specific; therefore, much effort is required to find efficient features for a specific domain. − These approaches require less computation power. − Handcrafted features designed for one biometric are not suitable for another biometric. − Lower algorithmic complexity. − The performance of these methods for ear detection and ear recognition in the unconstrained environment is found to be poor [72]. − They mainly require an SVM, KNN, or neural network for classification, which works on a fixed number of classes; therefore, they are not applicable when the number of classes increases. This is the major drawback for their applicability to ear biometrics.
Deep learning: − Deep learning models are data-driven and need a large amount of data and high-end machines for computation; however, they beat handcrafted features in terms of performance. − There is data scarcity for ear biometrics; however, techniques like zero-shot learning and dropout layers help to train the network efficiently. − One can use weights of pretrained models like AlexNet, VGG-16, and ResNet-150 and fine-tune the network's final layers to solve the ear-matching problem. − In the ear ROI detection task, a deep learning method needs data that define the object's exact location; the preparation of such data for each image is a very strenuous task. − In [80], the authors found that CNN features are invariant to the left and right ear during recognition, whereas handcrafted features are severely affected. − These approaches require high computation power. − These methods require a vast amount of training time; however, the testing time per image is a few milliseconds.)

[84] presented a multimodal biometric for iris and ear. They use SIFT-based feature-level fusion. The proposed method is tested on the CASIA database of iris images and the USTB-2 database of ear images.
Moreover, it has been identified that the score-level fusion performed is better than the feature level. [86] proposed particle swarm optimisation (PSO)-based method. The technique was applied on the Face-Yale face database of 165 greyscale images, the IIT Delhi ear database, and 471 image ear images. The fusion of ear and face gives auspicious results as compared to single modality. In [87] , the author proposed a multimodal biometric system by fusion of ear and palm print. Texture-based features were extracted using LBP, Weber, BSIF to get a discriminating feature. The feature of both traits is combined. The results shows that the system has achieved 100% recognition accuracy. In [88] , authors fused ear and knuckle print. The unique patterns are extracted using LBP. Their proposed system gives 98.10% of accuracy. The main disadvantage is that their method is not evaluated on any benchmark database. The biometric authentication systems are vulnerable to various security attacks. As discussed by [89] [90] [91] , these attacks may occur at the sensor level, module level, or database. Like Table 9 Comparative summary of multi-modal approaches for ear recognition S.N. The column "Database" represents the database used for training and testing. The column "Modalities" represents combination of ear with other biometric trait. The column "Technique" specifies the method used to learn robust features. The column "Accuracy" describes the performance of the system Age group 15 to 62 years any other biometric system, ear-based biometrics system is also vulnerable to security threats discussed above. Any attack can jeopardise the security of the system. Researchers have devised various anti-spoofing methods for other biometrics traits such as face [90] , fingerprint [92] , and iris [93] . However, there has been small progress observed on-ear anti-spoofing methods, and they are summarised as follows. The first ear anti-spoofing database was prepared by [94] . This database contains images of three attacks printed photographs, display phones, and video attacks. An image quality assessment (IQA)-based technique was devised to extract the unique features, and the SVM classifier was trained to differentiate between the real and fake biometric. Further, this work was extended by [95] . They presented ear anti-spoofing methods using a three-level fusion of IQA parameters. They have shown results using various levels of fusion techniques and found that the score-and decision-level fusion techniques have given the best results. In another study [96] , the authors presented a new database Lenslet Light Field Ear Artefact Database (LLFEADB). The database contains images of various attacking devices like laptops, tablets, and mobile phones. They have applied face anti-spoofing methods for the verification of ear bonafide and found promising results. The existing databases discussed in section 2.1 are used by the researcher for the evaluation of ear detection and recognition technologies. In this section, we introduce a new database National Institute of Technology Jalandhar Ear in the Wild (NITJEW) Database, which is different from existing databases. Most of the existing databases contain images captured in controlled laboratory conditions, or the source of variability is predefined. Therefore, they are not suitable for real-time scenarios. Although there exist databases in the unconstrained environment, size of these databases is very small. 
Therefore, they are not compatible with existing deep learning technologies, which are data-driven and require a massive amount of data to train a model. Additionally, most of these databases contain images collected from the web and do not consider the suitability of the images for real-time ear recognition. One major drawback is that the ground truth for the ear location is not provided in most databases, and some databases contain only cropped ears. Therefore, they are not suitable for the evaluation of a complete pipeline, i.e. ear detection followed by ear recognition.

(Caption of Fig. 8: Sample images of NITJEW in the unconstrained environment. Images suffer from angle, occlusion, illumination, and scale variations.)

The existing methods have already obtained significant performance over the constrained databases; in [24, 41], an accuracy of more than 90% is already obtained. Therefore, there is a need for databases that contain challenging images of real-world conditions to provide room for further advancement in ear recognition technology. To overcome the existing database gaps in ear biometrics, we have prepared a new database, NITJEW, and made it available to the research community. The database will be freely available and made public after the acceptance of the manuscript. The database is acquired from 510 different volunteer students, staff, and faculty members of the National Institute of Technology Jalandhar (NITJ). It includes images of both genders (male/female). The database was acquired during different sessions from August 2017 to December 2019. The database was carefully designed by showing visual cues of real-world imaging conditions to the volunteers. Consent was taken from each participant for the use of their side face images in research studies. A desktop application was designed to capture the images of a subject through external cameras connected to a laptop, which stores the images on the drive with a unique identification number. A distance of 1 to 5 meters is maintained between the face and the camera. The subjects were asked to pose their head at a pitch angle between −50° and +50° and a roll angle between −50° and +50°. The images were taken in both indoor and outdoor environments. The size of the images was kept the same, i.e. 1280 × 980 pixels, which is useful for various deep learning techniques as they require images of the same size. The images are coloured and in TIFF format with an approximate size of 1 MB. The illumination conditions between different sessions varied highly. The complete detail about the images, environmental description, and volunteers is provided in Table 10. The main aim of our database is to simulate the covert acquisition of side face images in real-world conditions. For each subject, 20 side face images (10 for the left ear and 10 for the right ear) are captured. The database has a total of 10,200 images. The images experience challenging conditions of real-world scenarios (illumination, scale, pose, and occlusion by hair, scarf, and hat). A few sample images from the database are shown in Fig. 8. The images in the database were annotated by trained students using the LabelMe toolbox [97] developed by MIT.

(Caption of Fig. 9: Annotation sample. The bounding box in red represents the location of the ear. This is represented using four values (x_org, y_org, w, h), where (x_org, y_org) is the starting point, w is the width, and h is the height. The blue line passes through the normalisation points used to normalise the cropped ear before matching.)
On each image, a bounding box, the minimum-area rectangle that tightly encloses the ear boundary, is drawn. The bounding rectangle is represented using four values (x_org, y_org, w, h). Here, (x_org, y_org) is the starting point, w is the width, and h is the height. This bounding rectangle is called the ground truth (GT) box. The GT of the ear inside the face image is used to assess the ear detection module's performance by computing its overlap with the predicted bounding box. Figure 9 represents a sample of the annotation, in which the bounding box in red represents the location of the ear. Due to the unconstrained environment setting, the ear cropped from a side face image is not aligned properly. Therefore, before matching, these images are aligned along the y-axis for proper registration of the ear. Accordingly, two key points at the farthest distance from each other are marked on the extreme ear boundaries, and a line passing through these points is drawn. In Fig. 9, the blue line passes through these points and is used to normalise the ear image. To highlight the capability of deep learning models for ear biometrics, we have utilised a modified Faster-RCNN [21] for ear detection and VGG-19 [22] for recognition. These models are state of the art and have shown superior results compared to handcrafted approaches in various computer vision tasks. A detailed discussion about them is provided in the subsequent sections. The very first part of the ear biometric system is to detect the ear inside the face image. This step is crucial and needs to be error-free as it affects the system's overall performance. To perform this, we have used the Faster-RCNN object detection model. Faster-RCNN is preceded by RCNN and Fast-RCNN and has shown superior performance by overcoming the complexities of these methods. The FRCNN has several major components, viz. a deep convolutional neural network, a region proposal network, region filtering, ROI pooling, and classification and regression heads (refer to Fig. 10 for the detailed architecture). (a) Deep convolutional neural network: The first step in Faster-RCNN is a series of convolutional layers to compute the convolutional feature map from an image. We have taken the VGG-19 network [22], which contains a series of convolutional, activation (ReLU), max-pooling, and batch normalisation layers. (b) Region proposal network (RPN): it slides over the feature map and generates candidate regions (anchor boxes) that may contain an ear. (c) Region filtering: a filtering algorithm is used to keep only those anchor boxes having an IOU with the ground truth boxes of more than 70%. (d) ROI pooling: The regions produced by the RPN are of varying size. Therefore, the ROI pooling layer converts them into fixed-size (14 × 14) feature maps, followed by a max-pool operation. (e) Classification and regression heads: Finally, these feature maps are given to the classification head for prediction of the class score and to the regression head for the bounding box coordinates. The classification head consists of fully connected layers followed by a softmax operation to predict the class, i.e. ear. The regression head predicts the four coordinates of each ear location. Training strategy: The weights of the VGG-19 model pre-trained on the ImageNet database are used for initialisation, since in the case of training a network from scratch, the chances of over-fitting arise.
For efficient network training, different hyper-parameters are chosen, such as optimiser: Adam, epochs = 200, and early stopping with patience = 30. Loss function of Faster-RCNN: the loss is a multi-task loss over classification and regression, as per Eq. (8):

L({p_i}, {b_i}) = (1/N_cls) Σ_i L_cls(p_i, g_i) + λ (1/N_reg) Σ_i g_i · L_reg(b_i, b_i*)   (8)

Here, i is the index of an anchor in a batch, L_cls is the classification log loss between two classes (object or not), and N_cls is the number of anchor boxes in a mini-batch (i.e. 256). p_i is the predicted probability of anchor i, and g_i is the ground truth label, whose value is 1 for a positive anchor and 0 for a negative anchor. λ = 10 is a constant for equal weighting of the classification and regression terms. N_reg is the number of anchor locations (i.e. 2400). L_reg is the smooth L1 regression loss between b_i and b_i*, where b_i is a vector representing the four coordinates of the predicted bounding box and b_i* is a vector representing the four ground truth coordinates of the i-th anchor box.

Researchers have proposed various deep CNN models trained on natural images for object recognition. Transfer learning or fine-tuning of these models for similar tasks has shown remarkable performance. For scarce data scenarios, like ear biometrics, fine-tuning of models has achieved better performance than training a model from scratch, as discussed in [73]. This is because the earlier layers in the network extract low-level general features, and their weights can be learned from natural images. The fully connected (FC) layers learn high-level features and are trained on the new database. Inspired by fine-tuning, in this work we have employed the VGG-19 [22] network to extract robust ear features. Figure 11 depicts the detailed layout of the proposed method. VGG-19 is selected as it is a winner of the 2014 ImageNet competition and has fewer trainable parameters than other networks such as ResNet152, Inception, and Xception. The network has two parts: feature extraction and a classification head. The feature extraction part has five blocks consisting of 16 convolutional layers, each having a 3 × 3 filter. The number of filters doubles in each block (the first block has 64 filters, and the fourth and fifth blocks have 512 filters). The convolutional layers are followed by the ReLU activation function to introduce nonlinearity into the network. Each block is followed by a max-pool operation to reduce the spatial size of the feature maps. The classification head of the model has three fully connected layers and a softmax layer. We have modified the classification head for ear recognition by keeping only one fully connected layer with neurons equal to the number of classes (subjects) in the training database. Finally, a softmax operation is performed to predict the probability of each class. At testing, the feature map after the FC layer is used. The computed features for each image are then compared using Euclidean distance, which returns a final score. These scores are then normalised using the max-min algorithm, and ear identification and verification are performed to evaluate the system's performance. Training strategy of VGG-19: The weights of the VGG-19 model pre-trained on the ImageNet database are used for training, because in the case of training a network from scratch, the chances of over-fitting arise. Also, we have applied a new strategy for training, in which we have kept the learning rate of the earlier layers at 0.001, smaller than that of the last layer, i.e. 0.1. This is because the earlier layers are initialised with pre-trained weights, while the last FC layer is trained from scratch; a sketch of this scheme is given below. For efficient network training, different hyper-parameters are chosen, such as optimiser: SGD, epochs = 200, early stopping with patience = 30, initial learning rate = 0.001, momentum = 0.9, L1 regularisation with λ = 0.001, and batch size = 32.
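The layer-wise learning rates just described can be realised with optimiser parameter groups. The following is a minimal PyTorch sketch under the stated rates; the single-FC head size and variable names are illustrative assumptions, not the authors' exact code:

```python
import torch
from torch import nn
import torchvision.models as models

num_subjects = 510  # placeholder: number of classes in the training database
model = models.vgg19(weights="DEFAULT")                   # ImageNet weights
model.classifier = nn.Linear(512 * 7 * 7, num_subjects)   # single FC head

optimizer = torch.optim.SGD(
    [
        {"params": model.features.parameters(), "lr": 0.001},  # pretrained layers
        {"params": model.classifier.parameters(), "lr": 0.1},  # new FC layer
    ],
    momentum=0.9,
)
```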
For efficient network training, different hyper-parameters are chosen, such as optimiser: SGD, epochs = 200, early stopping with patience = 30, initial learning rate = 0.001, momentum = 0.9, L1 regularisation with λ = 0.001, and batch size = 32. Loss function of VGG-19: The loss is the categorical cross entropy as per Eq. 9:

$$L = -\sum_{i=1}^{M} y_i \log(\hat{y}_i) \qquad (9)$$

Here, M is the number of classes, y_i is the i-th ground truth value (1 for the true class and 0 otherwise), and ŷ_i is the predicted probability of the i-th class. The minus sign ensures that the value of the loss gets smaller when the two distributions get closer to each other.

The comparative performance assessment of FRCNN and VGG-19 is carried out on seven different databases, viz. IITD, IITK, UND-E, USTB, NITJEW, AWE, and ITWE. The readers are referred to Sect. 2.1 for detailed information about these databases. The FRCNN is trained using 50% of the images (5100) of the NITJEW database; these images suffer from various unconstrained environmental conditions. The trained model is evaluated using the performance metrics accuracy, precision, recall, and F1-score. Detailed information regarding these measures is provided in Sect. 2.2. In the past, different researchers have used their own self-compiled evaluation metrics and have not used the IOU parameter to measure the performance of ear detection: they have considered a detection correct irrespective of its overlap with the ground truth. However, the IOU represents the intersection over union between the predicted bounding box and the ground truth box, and it is used to judge the quality of an object detection model (a minimal computation of this measure is sketched at the end of this subsection). A higher IOU value indicates tight overlap, whereas a lower value indicates loose overlap. An ear detection method is considered accurate when it gives good results at higher values of IOU. Therefore, we assessed the performance of FRCNN at IOU values greater than 0.5. The comparative results on the different databases are depicted in Table 11 at different IOU values (0.5, 0.6, 0.7), and a graph of each parameter plotted over the whole grid of IOU values between 0.0 and 1.0 is shown in Fig. 12. At an IOU of 0.5, an accuracy of more than 95% is observed on the constrained databases. One can clearly observe in the graph that there is a sharp drop in performance as the IOU increases. The minimum accuracy is observed on the unconstrained databases due to the complexities of the images, as shown in Fig. 6.

Fig. 12 Comparative performance assessment of Faster-RCNN using accuracy, precision, recall, and F1-score against overlapped IOU over the IITD, IITK, UND-E, USTB, NITJEW, AWE, and ITWE databases. Note that the performance on constrained databases is better than on unconstrained databases.

Qualitative Result Analysis: Figure 13 represents the ear detection results of FRCNN over natural images. One can observe that the method can detect the ear even in the presence of extremely challenging conditions. Figure 13a-f indicates successful detection of the ear, and Fig. 13a and d shows that multiple ears in an image are also detected. Figure 13g-i represents failure cases of the network. The network fails when it encounters ear-like features in the images, and this can be addressed by training the network on such images. These results demonstrate that the method is suitable for ear detection in an unconstrained environment.

Fig. 13 The results of FRCNN over natural images. The regression head predicts the coordinates of the ear location, and the classification head gives the class (ear) and its probability. The first two rows represent correctly detected ears; note that multiple ears are also detected. The last row depicts examples where the ear is not detected properly.
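As a reference for the protocol above, here is a minimal sketch of the IOU computation between two boxes in the (x_org, y_org, w, h) convention used for our annotations; the box values below are hypothetical:

```python
from typing import Tuple

def iou(box_a: Tuple[float, float, float, float],
        box_b: Tuple[float, float, float, float]) -> float:
    """Intersection over union of two (x_org, y_org, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Corners of the intersection rectangle.
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A detection counts as correct only when its IOU with the GT box
# exceeds the chosen threshold (0.5, 0.6, 0.7, ...).
gt = (50, 40, 60, 90)    # hypothetical ground truth box
pred = (55, 45, 58, 85)  # hypothetical predicted box
for thr in (0.5, 0.6, 0.7):
    print(thr, iou(gt, pred) >= thr)
```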
The VGG-19 network accepts input images of size 224 × 224 × 3. However, the cropped ear images returned by the FRCNN are of varying sizes; therefore, all the images in the databases are resized using bilinear interpolation. Since we have few images per subject (5 to 10), it is difficult to train the network from scratch, so to avoid over-fitting we have used the weights of the VGG-19 network trained over ImageNet. Then, an identification experiment is performed to compute the CRR (rank-1 accuracy), and verification is performed to compute the EER, DI, and the ROC curve of FAR versus FRR. Detailed information regarding these measures is provided in Sect. 2.2. In each database, the first half of the images is used as the gallery, and the remaining images are used as queries. The comparative assessment of VGG-19 over the different databases is shown using the ROC plot of FAR and FRR in Fig. 14, and a quantitative assessment is provided in Table 12. The results indicate that the performance on the constrained databases is better than on the unconstrained databases because of the lower complexity of the images.

Fig. 14 Comparative performance assessment based on ROC curves on different databases, achieved on the testing part of each database. One can observe that the performance on the constrained databases is much better than on the unconstrained databases. This indicates the significance and difficulty of unconstrained databases and shows that there is room for further improvement in ear recognition.

In this section, we discuss various challenges and limitations of current research in ear biometrics and provide future directions for addressing them. Ear biometrics is less explored than other popular biometrics; it is a newer biometric trait and offers many research possibilities due to its advantages over the other traits. From the study of the literature, we have found several research problems that require exploration in the future, discussed as follows:

(a) Challenging Databases: Existing ear databases do not fulfil all conditions of the unconstrained environment, so they are not suitable for ear recognition in a real-world scenario. The size of these databases is very small, whereas deep learning-based methods require large annotated databases for training, and annotation is expensive and time-consuming. Therefore, there is a need to develop a more challenging large-scale database that includes plentiful scenarios, such as images from different acquisition devices, images across ages, partial data, pose, illumination and scale variations, intra-class variations, and a varying number of samples per subject. This would be another big step in support of real-world ear recognition.

(b) Standardisation of Ear Detection: A substantial amount of work has been reported on automated methods for ear detection in the unconstrained environment. However, it has been identified that the existing work is not evaluated using standardised benchmark evaluation parameters; researchers have used their own self-compiled evaluation metrics, which vary from paper to paper.
Moreover, standardised benchmark evaluation tools are not publicly available, which makes it difficult to compare methods with one another. Therefore, efforts are required to provide standardised evaluation metrics and tools for assessing ear detection methodologies.

(c) Unconstrained Ear Recognition: Several factors of unconstrained scenarios affect ear recognition, yet minimal work has been reported on ear recognition in the unconstrained environment. The existing methods give poor results and are not applicable to real-time scenarios or video footage. It is assumed that the left and right ears of a person are different; however, no standard evaluation has been performed to measure the similarity between the left and right ears, and algorithms should be developed to explore this. Few studies have been performed on the recognition of infants using ear images. It has also been identified that the size of the ear changes in older age, but how this influences ear recognition has not been verified. Occlusion by hair will always remain a challenge for recognition, and it can be addressed using thermal images. Therefore, there is a need to explore the power of deep learning algorithms to develop more effective and efficient ear recognition methods for real-world scenarios.

(d) Image Modalities: Most of the research work has been performed on 2D ear images acquired using cameras or CCTV. However, other modalities like 3D ear and ear print also need to be explored, and segmentation, alignment, and recognition models for these modalities need to be developed. Heterogeneous recognition, in which images are captured using different cameras, is also needed in the future.

(e) Ear Liveliness Detection: The privacy and security of ear biometrics can be compromised by presentation, adversarial, or template attacks. A few studies [94-96] have been performed on presentation attack detection and ear liveliness detection. However, there is much scope to build methods that can counter various security threats to ear-based biometric systems.

In this paper, we have provided a comprehensive survey of existing work in the field of ear biometrics, including benchmark databases, performance evaluation parameters, and existing techniques. We have introduced a new database, NITJEW. It contains images captured in an unconstrained environment and is challenging for existing technology. The database is large and suitable for deep learning technologies; to the best of our knowledge, it is the first large-scale database useful for evaluating both ear detection and ear recognition technologies. To perform a comparative assessment of our database with existing databases, we have modified the deep learning models Faster-RCNN and VGG-19 for ear detection and ear recognition. On analysis, it has been observed that these models perform well over constrained databases; however, due to challenging environmental conditions, there is a significant drop in performance over the unconstrained databases. The results demonstrate that there is still scope to build new models for unconstrained ear recognition with better performance suitable for commercial deployment. The open research problems that need to be addressed in the near future have been outlined. We hope that the taxonomic survey and the new database will inspire the research community and new researchers to further develop ear recognition.
- Face recognition and classification using GoogLeNet architecture
- A novel method based on deep learning for aligned fingerprints matching
- Iris recognition supported by best Gabor filters and deep learning CNN options
- Deep discriminative representation for generic palmprint recognition
- Finger-knuckle-print recognition using deep convolutional neural network
- Deep convolutional neural networks for unconstrained ear recognition
- A comprehensive survey on various biometric systems
- A survey of emerging biometric modalities
- A comparative study of different biometric features
- Physiological biometric authentication systems: advantages, disadvantages and future development: a review
- La photographie judiciaire: avec un appendice sur la classification et l'identification anthropométriques
- Ear identification
- Forensic otoscopy: new method of human identification
- The effect of time on ear biometrics
- Comparison and combination of ear and face images in appearance-based biometrics
- Survey on recent ear biometric recognition techniques
- On ear-based human identification in the mid-wave infrared spectrum
- Global temporal representation based CNNs for infrared action recognition
- Transferable feature representation for visible-to-infrared cross-dataset human action recognition
- Faster R-CNN: towards real-time object detection with region proposal networks
- Very deep convolutional neural network based image classification using small training sample size
- Automated human identification using ear imaging
- An efficient ear localization technique
- Ear Recognition Laboratory (University of Science and Technology Beijing): USTB database
- Biometric recognition using three dimensional ear shape
- CVRL data sets (University of Notre Dame): UND database
- Face database (University of Sheffield)
- Ear recognition: more than a survey
- UBEAR: a dataset of ear images captured on-the-move in uncontrolled conditions
- Deformable models of ears in-the-wild for alignment and recognition
- USTB-HelloEar: a large database of ear images photographed under uncontrolled conditions
- EarVN1.0: a new large-scale ear images dataset in the wild (Data in Brief)
- Image feature detectors and descriptors (Studies in Computational Intelligence)
- Ear localization using hierarchical clustering (in: Optics and Photonics in Global Homeland Security V and Biometric Technology for Human Identification VI)
- Fast learning ear detection for real-time surveillance
- Edge detection and template matching approaches for human ear detection
- HEARD: an automatic human ear detection technique
- Entropy based binary particle swarm optimization and classification for ear detection
- Entropy-cum-Hough-transform-based ear detection using ellipsoid particle swarm optimization
- Robust localization of ears by feature level fusion and context information
- Human ear localization: a template-based approach
- A novel approach to automatic ear detection using banana wavelets and circular Hough transform
- HERO: human ear recognition against occlusions
- Toward unconstrained ear recognition from two-dimensional images
- On guided model-based analysis for ear biometrics
- Automated human identification using ear imaging
- Reliable ear identification using 2-D quadrature filters (Pattern Recognition Letters: Novel Pattern Recognition-Based Methods for Re-identification in Biometric Context)
- Ear recognition with feed-forward artificial neural networks
- LBP-based ear recognition
- 2D ear classification based on unsupervised clustering
- Ear recognition based on Gabor features and KFDA
- Robust ear recognition using gradient ordinal relationship pattern
- Human ear recognition using geometrical features extraction
- Toolbox for ear biometric recognition evaluation
- Ear recognition using local binary patterns: a comparative experimental study
- Unconstrained ear recognition using deep scattering wavelet network
- Robust local oriented patterns for ear recognition
- Genetic algorithm based local and global spectral features extraction for ear recognition
- Gradient-based learning applied to document recognition
- Mask R-CNN
- SSD: single shot multibox detector
- SSH: single stage headless face detector
- Deep residual learning for image recognition
- FaceNet: a unified embedding for face recognition and clustering
- Ear detection under uncontrolled conditions with multiple scale faster region-based convolutional neural networks
- Automatic ear detection and feature extraction using geometric morphometrics and convolutional neural networks
- Convolutional encoder-decoder networks for pixel-wise ear detection and segmentation
- Ear detection and localization with convolutional neural networks in natural images and videos (Processes)
- CED-Net: context-aware ear detection network for unconstrained images
- Ear recognition based on deep convolutional network
- Employing fusion of learned and handcrafted features for unconstrained ear recognition
- Training convolutional neural networks with limited training data for ear recognition in the wild
- Ear verification under uncontrolled conditions with convolutional neural networks
- Deep ear recognition pipeline
- A deep learning approach for person identification using ear biometrics
- Constellation-based deep ear recognition
- Performance analysis of NASNet on unconstrained ear recognition
- Unconstrained ear detection using ensemble-based convolutional neural network model
- Handcrafted versus CNN features for ear recognition
- Multimodal biometric system iris and fingerprint recognition based on fusion technique
- Large-scale evaluation of multimodal biometric authentication using state-of-the-art systems
- Knuckle print biometrics and fusion schemes: overview, challenges, and solutions
- A SIFT-based feature level fusion of iris and ear biometrics
- Efficient human recognition system using ear and profile face
- New chaff point based fuzzy vault for multimodal biometric cryptosystem using particle swarm optimization
- Multimodal biometric recognition using human ear and palmprint
- Local binary pattern based multimodal biometric recognition using ear and FKP with feature level fusion
- Biometric authentication: the security issues
- Biometric antispoofing methods: a survey in face recognition
- Handbook of biometric antispoofing
- Fingerprint anti-spoofing in biometric systems
- Iris anti-spoofing
- An ear anti-spoofing database with various attacks
- Ear anti-spoofing against print attacks using three-level fusion of image quality measures
- Ear presentation attack detection: benchmarking study with first lenslet light field database
- LabelMe: a database and web-based tool for image annotation