key: cord-0809406-az6sbjx6
authors: Yin, Jianhua; Zhu, Longzhen; Bai, Yang; He, Zhenyu
title: VEDesc: vertex-edge constraint on local learned descriptors
date: 2021-05-17
journal: Signal Image Video Process
DOI: 10.1007/s11760-021-01914-5
sha: 5d31daacd5d63a33be5e553382ad28d7ace7bfbb
doc_id: 809406
cord_uid: az6sbjx6

To improve the performance of local learned descriptors, many researchers pay primary attention to the triplet loss network. As expected, it is useful to achieve state-of-the-art performance on various datasets. However, these local learned descriptors suffer from the inconsistency problem without considering the relationship between two descriptors in a patch. Consequently, the problem causes the irregular spatial distribution of local learned descriptors. In this paper, we propose a neat method to overcome the above inconsistency problem. The core idea is to design a triplet loss function of vertex-edge constraint (VEC), which takes the correlation between two descriptors of a patch into account. Furthermore, to minimize the non-matching descriptors’ influence, we propose an exponential algorithm to reduce the difference between the long and short sides. The competitive performance against state-of-the-art methods on various datasets demonstrates the effectiveness of the proposed method.

Image matching, which is a fundamental problem in computer vision, has been widely used in numerous fields, including structure from motion [1] , wide-baseline stereo [2, 3] , 3D reconstruction [4] and simultaneous localization and mapping (SLAM) [5, 6] . Generally, a patch-based matching method includes extracting the feature points and matching their feature descriptors. The stable descriptors with properties of scale and rotation invariance are crucial for correcting image matching. Traditional methods, including SIFT [7] and SURF [8] , have been proven to be effective in various applications. Meanwhile, to accelerate matching and reduce storage, several algorithms [9] [10] [11] [12] extract some binary face descriptors with the BRIEF algorithm to achieve well performance on automatic facial expression recognition.

However, handcrafted descriptors are limited in terms of robustness and precision due to the lack of high-level structural information. Recently, with the development of convolutional neural networks (CNNs), the research hotspot focuses on the learned descriptors [13] [14] [15] [16] [17] instead of the handcrafted descriptors. These learned descriptors achieve much higher performance than the handcrafted ones in various areas, including image matching and patch verification. Specifically, To improve the performance of learned descriptors, the triplet loss [14] encourages that the distance between matching descriptors is less than the distance between nonmatching ones. Inspired by the triplet loss, recent works [16, 17] try to minimize the distance between matching descriptors and maximize the distance between non-matching descriptors. These experimental results demonstrate the triplet loss is effective on various datasets, improving the robustness of learned descriptors in extreme conditions, including weather, season, illumination, and distortion.

Although these triplet loss methods achieve impressive performance, they suffer from an inconsistency issue, as shown in Fig. 1a . Specifically, Fig. 1 shows the Euclidean distance of four image patches involving two matching pairs. The symbols a i and p i (i = 1,2) refer to descriptors of anchor patches and positive patches, and d represents the Euclidean distance between two descriptors. If a 1p 1 and a 2p 2 is two matching descriptors pairs, d a 1 a 2 is equal to d p 1 p 2 . However, the difference between d a 1 a 2 and d p 1 p 2 obtained in the traditional method is relatively large, which leads to the inconsistency issue. As presented in Fig. 1a , the distance (d a 1 a 2 = 1.05) between descriptors a 1 and a 2 in anchor patches is significantly smaller than that (d p 1 p 2 = 2.21) between their descriptors p 1 and p 2 in positive patches. The learned descriptors suffer from the inconsistency issue in terms of representing the difference between two image patches. This issue would cause irregular spatial relationships of descriptors and further reduce matching accuracy, even though the distance between matching descriptors is smaller than the distance between non-matching ones. The reason for the issue is that these works are less effective in exploiting the interrelation between the two descriptors of anchor (or positive) patches in the traditional triplet loss function.

Generally, vertex information contains several properties, and an edge represents the relationship between two vertexes. Descriptors and the distance between two descriptors can be naturally illustrated as a graph, in which the vertex represents the descriptor, and the edge denotes the relationship between two descriptors. To solve this inconsistency problem, we introduce the vertex-edge constraint (VEC) to the triplet loss function: the vertex constraint denotes the distance between two vertexes in different images; the edge constraint represents the distance between two descriptors in a patch. To exploit the vertex constraint, we propose a strategy to increase the distance between two non-matching descriptors and decrease the distance between matching ones. Furthermore, we encourage reducing the difference between the longer and shorter sides in the edge constraint. Fig. 1b shows the VEC algorithm's result, which demonstrates that our method effectively reduces the difference between opposite sides and further promotes matching accuracy of descriptors.

We summarize the main contributions of the work as follows:

(1) We introduce VEC to the triplet loss function (denoted as VEC-loss), containing vertex constraint and edge constraint. (2) To facilitate regular distribution of descriptors in high dimensional Euclidean space, we design an exponential algorithm to reduce the difference between the long and short sides in the image matching pair. (3) To avoid the influence of the distance between nonmatching descriptors, we design an strategy that enlarges the distance of two descriptors.

Descriptors' detection is a critical technique for image pair matching. We briefly review descriptors detected by the handcrafted method and the learned method in section 2.1. Furthermore, we discuss some practical algorithms with metric learning to improve the performance of learned descriptors in section 2.2.

The handcrafted method and the learned descriptor method are two main methods to detect local feature descriptors. Early works on the local descriptor mainly focused on the handcrafted method with gradient filters and intensity comparisons. SIFT [7] , which computes histograms with gradient ltering approach, possibly is the most classical and popular handcrafted method. These handcrafted methods, including SIFT [7] , SURF [8] , BRIEF [9] , DSP-SIFT [18] and PCA-SIFT [19] , have limited in several aspects, such as unable to get semantic segmentation information and weak robustness in image classification.

With the emergence of the CNNs, more researches focus on the learned method instead of the traditional handcrafted method in recent years. The MatchNet [20] method proposes a Siamese network, containing feature network and metric-learning loss function, to train samples. Compared with the traditional handcrafted method, MatchNet [20] significantly improves images matching performance and demonstrates the extraordinary potential of learned descriptors. The recent methods [13, [21] [22] [23] study detecting learned descriptor on the Siamese network. Kumar et al. [21] explore a centralsurround two-stream network to improve image matching performance. DeepDesc [22] introduces a stochastic sampling strategy into the Siamese network to distinguish the positive and negative samples. TFeat [23] demonstrates that the triplets of training samples, with in-triplet mining of hard negatives, can improve the performance of learned descriptors. L2Net [13] designs a deeper network and produces descriptor normalized to the unit norm by L 2 distance. The architecture of L2Net is an application and foundation of learned descriptors in the later works [14, 16, 17] . However, the metric-learning loss function of L2Net is less effective in finding hard samples from negative and anchor samples. Unlike the above-mention methods, we propose a new triplet loss function to train the metric network.

Metric learning, which is a classical method depending on learned distance function, has been successfully applied to various fields including face action detection [24] , tracking [25] , classification [26] , and information retrieval [27] . Yu et al. [28] propose a triplet loss as metric learning to significantly improve the face verification performance. Yoshida et al. [29] defines an interpretable graph metric learning (IGML) method for graph classification. RFNet [30] extracts learned descriptors from corresponding patches by computing Euclidean distance. Kumar et al. [21] propose a global loss to minimize the classification error and produce the best performance in patch matching.

In summary, metric learning is a foundation in various fields. Many previous works of metric learning focused on learning Mahalanobis distance [31, 32] , while much more efforts are spent on the learning vectors with Euclidean distance metrics [33, 34] . Particularly, several main studies [14, 16, 17] , which adopt L2Net structure, redefine the improved triplet loss function with Euclidean distance metric to improve performance of learned descriptors. HardNet [14] introduces a new triplet loss into L2Net, which can minimize the distance between the matching descriptors and closest non-matching descriptors. DOAP [15] proposes optimizing average precision (OAP) to L2Net and achieves more competitive results on many datasets. ExpTLoss [16] introduces the exponential Siamese and triplet loss function into L2Net, and achieve better performance on many tasks. CDF [17] assigns a nonparametric soft margin instead of a hard margin in L2Net and improves the performance of learned descriptors. However, learned descriptors with these above methods suffer from an inconsistency issue, which can cause irregular spatial relationship of descriptors and further reduce matching accuracy. Unlike the above methods, we propose VEC to the triplet loss function, an effective approach in exploiting the mutual relationship between adjacent descriptors of a patch.

In this section, we firstly review the traditional triplet loss function and then propose VEC-loss to detect learned descriptors. Furthermore, to achieve regular distribution of descriptors in Euclidean space, we exploit a new strategy to reduce the opposite sides' difference in image matching pair.

Generally, a graph is composed of vertexes and edges. The edge indicates the relationship between the two vertexes. Descriptors and the distance between descriptors can be naturally illustrated as a graph, in which the vertex represents the descriptor, and the edge denotes the distance between two descriptors. Supposedly, anchor sample set G A = (V A , E A ) and positive sample set G P = (V P , E P ) are generated, where V stands for the descriptor and E for the distance.

To improve the performance of local learned descriptors, recent researchers focus on the triplet loss function. Supposedly, the spatial relationship between two adjacent descriptors is linearly invariant and rigid, preserving the distance between corresponding descriptors. Thus, the triplet loss function S can be formulated as:

where α is the margin, F + i (a i , p i ) represents the distance between matching descriptors in positive sample and F − i (a i , p j ) denotes the distance between non-matching descriptors in negative sample. In our implementation, the margin parameter α equals to 1.0, which has been demostrated to achieve good performance in HardNet [14] .

Due to the promising performance of Euclidean distance in images matching, this metric has been widely used for local learned descriptors, including L2Net [13] , RFNet [30] , Hard-Net [14] . In this section, we introduce L 2 pairwise distance into the vertex-constraint function. The loss function evaluates the distance between pair descriptors from the anchor and positive patches, which are as the input of L2Net [13] .

Supposedly, we extract a batch X = {V A , V P } of matching local patches, where V A = {a 1 , a 2 ,..., a k } stands for the descriptors of the anchor patches and V P = {p 1 , p 2 ,..., p k } represents the descriptors of the positive patches, and k is the batch size. We adopt the Euclidean distance to evaluate the distance between two descriptors a i and p j . Furthermore, the descriptor vectors a i and p j should be unit-length (||a i || 2 = 1) to reduce the computational cost. As such, the Euclidean distance between the unit-length descriptor vectors can be computed as following:

where d(a i , p j ) refers to the vertex-constraint function F ver tex .

To reduce the distance between matching descriptors and enlarge the distance between the closest non-matching descriptors, we propose the vertex constraint to find hard negatives. It can be formulated as:

where d is the Euclidean distance between descriptor pair, p j min is the closest non-matching positive descriptor to a i , and a i min is the closest non-matching anchor descriptor to p j , respectively.

In addition to the vertex information, a graph contains the edge information indicating the relationship between two vertexes. Similar to the graph information, an image patch has descriptors and their adjacent edge. In this section, we propose an edge constraint to reduce the difference between the long and short sides. We calculate the Euclidean distance between two different unit-length descriptor vectors in a patch to construct the edge equation. Furthermore, to promote the spatial stability of descriptors, we design an exponential algorithm to lengthen the shorter side and shorten the longer side. To reduce the computational cost, we propose a normalized method to keep its value within the range between 0 and 1. In this way, the edge constraint function can be formulated as:

where d(a i , a j ) is the Euclidean distance between two descriptors of anchor patch and d( p i , p j ) is the Euclidean distance between two descriptors of positive patch.

This section proposes a strategy to find the positives where the diagonal elements correspond to the matching descriptors pair distance. In summary, we redefine the positive loss function as following:

where λ is a weight parameter used to balance the vertexconstraint and the edge-constraint loss functions. The parameter λ = 1 (or λ = 0) indicates that the positive loss function only consider the vertex constraint (or the edge constraint). When λ is greater than 0 and less than 1, the positive loss function takes the vertex constraint and the edge constraint into account.

The main contribution of this paper is to introduce VEC into the triplet loss function. To measure performance of our method, we experiment with four benchmarks: UBC Photo-Tourism [35] , HPatches [36] , W1BS benchmark [37] and Oxford Affine benchmark [38] . As a classical and popular patch-based benchmark, UBC PhotoTourism effectively evaluates descriptor performance on patch verification tasks. As a comprehensive and complicated dataset, HPatches is used to evaluate learned descriptors' performance by three different tasks (Patch Verification, Image Matching, and Patch Retrieval). As a novel dataset of ground-truthed image pairs, W1BS is used to evaluate two images' correspondences in a wide baseline stereo. The Oxford Affine benchmark, containing various distorted images, is helping to improve the stability of descriptors.

Similar to the previous study, we adopt the training settings to ensure that the VEC-loss function is the main factor for better performance in the following datasets. For these experiments, we introduce VEC-loss function into HardNet [14] and CDF [17] , denoted as HN+-VEC and CDF-VEC. The HN+-VEC or CDF-VEC method adopts the algorithm VEC to replace the triplet loss function in HardNet or CDF, and the other parts remain unchanged. The network consists of seven convolutional layers and is regularized with Batch Normalization and Dropout. We adopt the UBC PhotoTourism dataset [40] as training data. The dataset has three subsets, known as Liberty, Yosemite, and Notredame. We propose one subset for training data and the other two subsets for testing data. HardNet [14] and CDF [17] adopt L2Net as the descriptors extractor, where the input image size is required Numbers shown are FPR95(%). The lower FPR95 indicates the better performance of learned descriptors. "+" denotes training with data augmentation. The best results are in bold to be 32x32. Because the size of image patches in the UBC PhotoTourism is 64x64, we adopt a downsample method to obtain the patch size of 32x32 as the input of the L2Net [13] .

As the input data augmentation, the image patches are flipped randomly and rotated by 90,180 or 270 degrees. Similar to HardNet [14] and CDF [17] , we set momentum and weight of stochastic gradient descent (SGD) to 0.9 and 10 −4 . Furthermore, the learning rate decays linearly from 0.1 to 0. To match the result of HardNet and CDF, we set the batch size to 1024.

As one of the Brown databases, the UBC PhotoTourism dataset has been widely used to detect local learned descriptors. It contains three subsets: Liberty, Notredame, and Yosemite. Each subgroup contains more than 400k normalized 64x64 patches. These patches are reoriented with Difference-of-Gaussians (DOG) keypoints, which are extracted from 3D reconstruction scenes. The dataset comprises 100k matching and non-matching pairs. Furthermore, we assign one subset for training and the other two subsets for testing. For example, we adopt Liberty as the training dataset and the other two subsets as test datasets. For a fair comparison and without losing of generality, we keep the input approach of dataset consistent [14, 17] . We propose the false positive rate of 0.95 accurate positive recall (FPR95) to evaluate the learned descriptors' performance. The lower FPR95 indicates a better understanding. We introduce VEC to the HardNet [14] which is the first to adopt the triplet loss for learned descriptor, and CDF [17] which is the state-of-the-art method of learned descriptor. To verify performance of our method, we compare with a collection of a handcrafted method (SIFT [7] ) and some learned methods (DeepDesc [22] , L2-Net [13] , HardNet [14] , DOAP [15] , ExpTLoss [16] and CDF [17] ). The results of all methods are shown in Table 1 .

As presented in Table 1 , the mean FPR95 of HN+-VEC and CDF-VEC is less than the two original methods Hard-Net+ and CDF. The mean FPR95 of HN+-VEC is reduced from 1.52 to 1.16, and the mean FPR95 of CDF-VEC is decreased from 1.18 to 1.05. Furthermore, the learned descriptor with our method achieves the best performance in these subtasks. Compared with these state-of-the-art methods, the learned descriptors' version of CDF-VEC is the best (the average of FPR95 = 1.05). Unlike the other methods, our approach considers the interrelation between the adjacent descriptors in the same patch. The UBC PhotoTourism is a testament to the generalization of our algorithm.

HPatches dataset contains more than 1.5 million patches, which are extracted from 116 sequences. The dataset consists of 59 sequences affected by geometric deformation and 57 sequences affected by illumination. In these images, the feature keypoints are detected by DoG, Hessian, and Harris detectors. HPatches dataset defines three evaluation subtasks: Patch Verification, Patch Retrieval, and Image Matching. In each subtask, the extracted patches can be divided into three geometric noise levels: Easy, Hard, and Tough. We adopt the mean average precision(mAP) to evaluate learned descriptors' performance in the dataset. The higher mAP indicates a better understanding.

For a fair comparison and without losing of generality, we use the model trained on the Liberty subset to generate learned descriptors in the HPatches. We compare our method with a collection of methods, such as SIFT [7] , DOAP [15] , HardNet [14] , CDF [17] . The results of all methods are shown in Fig. 2 .

As presented in Fig. 2 , our CDF-VEC method achieves promising performance with the best mAP score in task Image matching (mAP = 50.95) and task Patch Retrieval (mAP = 70.44). Furthermore, CDF-VEC is the secondbest in task Patch Verification (mAP = 88.47), close to the primary method CDF. We attribute these gains to the regular spatial relative of learned descriptors, enabling the performance more robustly and accurately in three subtasks. HPatches is also a testament to the generalization of our method.

To verify the generalization and the application in these extreme conditions, we propose the VEC method to generate learned descriptors on W1BS dataset [42] . Wide baseline stereo matching considers pair images matching in many extreme conditions, such as weather, season, illumination, etc. W1BS dataset contains 40 image pairs, which can be divided into five groups according to the image acquisition factor.

Appearance (A): difference in appearance according to season or weather changes; Geometry (G): difference in object position, camera, and scale; Illumination (L): difference in wavelength, direction, and intensity of light source; Sensor(S): difference in sensor type (IR, MR); Map to photo (map2photo): object image and map image.

In the W1BS dataset, the local affine-covariant features are detected by FOCI [39] , Hessian-Affine [40] and MSER [41] for the reference image pairs. Furthermore, W1BS normalizes the patch size into 41x41 to explore descriptor performance in extreme conditions. We adopt mean Area Under Curve (mAUC) to evaluate learned descriptors' performance. The larger mAUC means a better understanding.

Similar to the above experiments, we use the Liberty as a training dataset to generate learned descriptors in the W1BS dataset. We compare our method with a collection of methods, SIFT [7] , HardNet [14] , ExpTLoss [16] , and CDF [17] . The results of all methods are shown in Fig. 3 .

As presented in Fig. 3 , it is not surprising to CDF-VEC achieves better performance than CDF in subsets, which is consistent with our observation on UBC Photo-Tourism subsets. HN+-VEC produces better performance than HardNet+ in many subsets. Furthermore, Our approach in HardNet+ [14] performs better than the other methods, which achieves the best performance with the average mAUC score of 8.41(HN+-VEC), and CDF-VEC is the third-best(mAUC = 8.24). The rigid and regular relationship between adjacent descriptors is the leading factor of better performance. W1BS is also a testament to the generalization of our method.

To verify the learned descriptor generalization and robustness in the various distortion types, we test the VEC algorithm to evaluate image pairs matching performance on the Oxford Affine dataset [38] . These images on Oxford Affine dataset undergo some particular distortion types, such as JPEG compression, rotation, blur, viewpoint, and light. The dataset consists of 8 groups, which are bikes (blur), trees(blur), graf(viewpoint), wall(viewpoint), bark(rotation), boat(rotation), leuven (light), and ubc (JPEG compression). Every group contains six images in PPM or PGM format and homographs image pairs. The feature keypoints are extracted by the Harris-Affine detector, and the image patches are cropped with a magnification factor of 6. We adopt the matching score to evaluate learned descriptors' performance. The larger matching score means a better understanding.

We use the model trained on the Liberty subset to generate learned descriptors in the Oxford Affine dataset. We compare our method with a collection of methods, HardNet [14] , CDF [17] , Sosnet [42] , and ExpTLoss [16] . The results of all methods are shown in Fig. 4.  Fig. 3 Evaluating the W1BS patch dataset. W1BS dataset consists of 40 image pairs divided into 5 parts by the nuisance factor: Appearance (A), Geometry (G), Illumination (L), Sensor (S) and Map to photo. The larger mAUC indicates the better performance of learned descriptors Fig. 4 Evaluating on the Oxford Affine dataset. Oxford Affine dataset consists of 8 groups: bark, bikes, boat, graf, leuven, trees, ubc, and wall. The larger matching score indicates the better performance of learned descriptors As presented in Fig. 4 , the matching performance with our method CDF-VEC achieves the best (average matching score = 36.38), compared with these state-of-the-art methods in the dataset. Furthermore, the future performance of learned descriptors with the CDF-VEC algorithm achieves state-ofart results in the other six groups, except for leuven and ubc. The better understanding shows learned descriptors with the VEC method can withstand various distortion types in the Oxford Affine dataset. We can attribute the VEC method's contribution, which considers the interrelation between the adjacent descriptors in the same patch. The Oxford Affine dataset is also a testament to the generalization of our algorithm. It is demonstrated that the performance of learned descriptors in most tasks achieves state of the art.

We introduce the VEC algorithm into the triplet loss function, inspired by learned descriptors' consistency. Furthermore, we reduce the difference between non-matching descriptors to facilitate regular distribution of learned descriptors in high-dimensional Euclidean space. To achieve better performance of stable distribution, we enlarge the distance between non-matching descriptors and reduce the distance between matching descriptors. We test the VEC method to validate the generalization and application on four datasets, including some extreme conditions. It has been demonstrated that the performance of learned descriptors in most tasks achieves state of the art.

Very large-scale global sfm by distributed motion averaging

A survey of augmented realty, presence teleoperators and virtual

Recent advances in augmented reality

An accurate method for 3D object reconstruction from unordered sparse views. Signal Image Video Process

Simultaneous localization and mapping (slam) part I

Simultaneous localization and mapping (slam) part II

Distinctive image features from scale-invariant keypoints

Speeded up robust features

Brief: Binary robust independent elementary features

Brisk: binary robust invariant scalable keypoints

An efficient alternative to sift or surf

Brief-based face descriptor: an application to automatic facial expression recognition (afer)

L2-net: Deep learning of discriminative patch descriptor in euclidean space

Working hard to know your neighbor's margins: Local descriptor learning loss

Local descriptors optimized for average precision

Better and faster: exponential loss for image patch matching

Learning local descriptors with a cdf-based dynamic soft margin

Domain-size pooling in local descriptors: Dspsift

Pca-sift: a more distinctive representation for local image descriptors

Matchnet: unifying feature and metric learning for patch-based matching

Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions

Discriminative learning of deep convolutional feature point descriptors

Learning local feature descriptors with triplets and shallow convolutional neural networks

An efficient approach for facial action unit intensity detection using distance metric learning based on cosine similarity

Tracking: robust visual tracking with correlation filters and metric learning. Knowl-Based

Global and local metric learning via eigenvectors

Tc-net for isbir: Triplet classifcation network for instance-level sketch based image retrieval

Deep metric learning with dynamic margin hard sampling loss for face verification

Distance metric learning for graph structured data

Rf-net: An end-to-end image matching network based on receptive field

Learning a mahalanobis metric from equivalence constraints

Metric learning by collapsing classes

Incomplete multiview spectral clustering with adaptive graph learning

Adaptive graph completion based incomplete multi-view clustering

Learning local image descriptors

Hpatches: a benchmark and evaluation of handcrafted and learned local descriptors

Wxbs: wide baseline stereo generalizations

A performance evaluation of local descriptors

Edge foci interest points

Scale and affine invariant interest point detectors

Robust widebaseline stereo from maximally stable extremal regions

Sosnet: second order similarity regularization for local descriptor learning