key: cord-0129708-nx59roku authors: Zabihzadeh, Davood; Tuama, Amar; Karami-Mollaee, Ali title: Low-Rank Robust Online Distance/Similarity Learning based on the Rescaled Hinge Loss date: 2020-10-07 journal: nan DOI: nan sha: 043feeb1a7da894798a9c66badf4b9f06c54cc26 doc_id: 129708 cord_uid: nx59roku

An important challenge in metric learning is scalability to both the size and the dimension of input data. Online metric learning algorithms have been proposed to address this challenge. Existing methods are commonly based on the Passive-Aggressive (PA) approach and hence can rapidly process large volumes of data with an adaptive learning rate. However, these algorithms are based on the Hinge loss and so are not robust against outliers and label noise. Also, existing online methods usually assume that training triplets or pairwise constraints exist in advance, whereas many datasets in real-world applications come only as input data with their associated labels. We address these challenges by formulating the online Distance/Similarity learning problem with the robust Rescaled Hinge loss function. The proposed model is rather general and can be applied to any PA-based online Distance/Similarity algorithm. Also, we develop an efficient robust one-pass triplet construction algorithm. Finally, to provide scalability in high-dimensional DML environments, the low-rank version of the proposed methods is presented, which not only reduces the computational cost significantly but also keeps the predictive performance of the learned metrics. It also provides a straightforward extension of our methods to deep Distance/Similarity learning. We conduct several experiments on datasets from various applications. The results confirm that the proposed methods significantly outperform state-of-the-art online DML methods in the presence of label noise and outliers by a large margin.

The performance of many machine learning and data mining algorithms depends on the metric used to compute the Distance/Similarity between data. Generic Distance/Similarity measures such as the Euclidean distance or Cosine similarity in the input space often fail to discriminate different classes or clusters of data. Therefore, learning an optimal Distance/Similarity function has been actively studied over the last decade. Distance Metric Learning (DML) methods aim to bring semantically similar data items together while keeping dissimilar ones at a distance. One important challenge for DML algorithms is scalability to both the size and dimension of input data (Bellet, Habrard et al. 2014). For processing the massive volumes of data generated in today's applications, online Distance/Similarity methods have been proposed. Many of these algorithms are based on the Passive-Aggressive (PA) approach (Crammer, Dekel et al. 2006) (Chechik, Sharma et al. 2010, Xia, Hoi et al. 2014, Wu, Hoi et al. 2016, Zhong, Zheng et al. 2017, Hamdan, Zabihzadeh et al. 2018, Li, Gao et al. 2018, Rasheed, Zabihzadeh et al. 2020). The main advantages of PA-based methods are a closed-form solution and an adaptive learning rate, leading to a high convergence rate. However, since these algorithms are based on the Hinge loss, they are not robust against outliers and label noise. Nowadays, many modern datasets are collected from the Internet using crowdsourcing or similar techniques. Hence, examples with wrong labels are common in these datasets and can considerably degrade the performance of existing online DML methods.
We address this challenge by formulating the online Distance/Similarity learning task using the robust Rescaled Hinge loss function (Xu, Cao et al. 2017). The proposed model is rather general, and we can easily apply it to any existing PA-based method. It significantly improves the robustness of the existing methods in the presence of label noise without increasing their computational complexity.

Most DML algorithms learn the metric from pair or triplet side information, for example triplets of the form $(x, x^+, x^-)$ in which $x^+$ is similar to $x$ and $x^-$ is dissimilar to it. Existing online methods (Chechik, Sharma et al. 2010, Xia, Hoi et al. 2014, Wu, Hoi et al. 2016, Zhong, Zheng et al. 2017, Hamdan, Zabihzadeh et al. 2018, Li, Gao et al. 2018, Rasheed, Zabihzadeh et al. 2020) usually assume that training triplets or pairwise constraints exist in advance, but many datasets in real-world applications are not in this format. Also, the available batch triplet construction algorithms are very time-consuming and often require computing the pairwise distances between data items. Thus, they are not applicable to online tasks. We tackle this challenge by developing an efficient robust one-pass triplet construction algorithm.

Another important challenge in online DML applications, especially in the field of machine vision, is the high dimension of the input data. Many existing methods learn a Mahalanobis distance (Xia, Hoi et al. 2014, Zhong, Zheng et al. 2017, Hamdan, Zabihzadeh et al. 2018, Rasheed, Zabihzadeh et al. 2020) or a bilinear similarity (Chechik, Sharma et al. 2010, Xia, Hoi et al. 2014), which require $O(d^2)$ parameters (where $d$ denotes the data dimension). Therefore, these methods are infeasible in high-dimensional environments. We develop low-rank versions of the proposed methods that not only reduce the computational cost significantly but also keep the predictive performance of the learned measure. Also, one can easily replace the low-rank projection matrix in the proposed methods with a nonlinear deep neural network model. Therefore, extending our methods to online deep Distance/Similarity learning is straightforward.

The rest of the paper is organized as follows: Section 2 reviews related works. In Section 3, we present the formulation of the online Distance/Similarity learning problem using the Rescaled Hinge loss as well as the development of the proposed algorithms. Experiments conducted to evaluate the proposed methods are discussed in Section 4. Finally, Section 5 concludes with remarks and recommendations for future work.

Most existing online Distance/Similarity learning methods learn a Mahalanobis distance (Wu, Hoi et al. 2016, Zhong, Zheng et al. 2017, Hamdan, Zabihzadeh et al. 2018, Li, Gao et al. 2018, Rasheed, Zabihzadeh et al. 2020) or a bilinear similarity (Chechik, Sharma et al. 2010, Xia, Hoi et al. 2014). Most research in online metric learning is dedicated to learning the Mahalanobis distance, although some newer and more generic measures (Lin, Wang et al. 2017, Zabihzadeh, Monsefi et al. 2018) have also been presented. Mahalanobis-based methods learn a matrix $M \succeq 0$ defining the distance

(1) $d_M^2(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$.

Since the matrix $M \succeq 0$, it can be decomposed as $M = L^T L$, where $L \in \mathbb{R}^{k \times d}$ and $k = \mathrm{rank}(M)$. Therefore, Mahalanobis distance learning is equivalent to finding a linear transformation of the input space. On the other hand, bilinear similarity-based methods learn a similarity matrix $W$ defining

(2) $S_W(x_i, x_j) = x_i^T W x_j$.

The optimization problem of both Mahalanobis and bilinear methods is formulated based on the PA approach. We can express this optimization problem in the following general form:

(3) $M_{t+1} = \arg\min_{M} \ \mathrm{reg}(M, M_t) + C\,\ell(M, z_t)$,

where $M_t$ is the current Distance/Similarity matrix at time step $t$, $\mathrm{reg}(M, M_t)$ is the regularization term, and $z_t$ denotes the input constraint arriving at time $t$. The constraint $z_t$ is often in the form of a triplet $(x, x^+, x^-)$, and $\ell(M, z_t)$ indicates the margin-based Hinge loss function.
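As a concrete illustration of the two measures defined above, the following sketch (Python/NumPy; the function and variable names are ours, not the paper's) computes the Mahalanobis distance of equation (1) and the bilinear similarity of equation (2), and checks the low-rank equivalence $d_M^2(x_i, x_j) = \|L x_i - L x_j\|^2$ that follows from the decomposition $M = L^T L$:

```python
import numpy as np

def mahalanobis_sq(xi, xj, M):
    """Eq. (1): squared Mahalanobis distance (xi - xj)^T M (xi - xj)."""
    diff = xi - xj
    return float(diff @ M @ diff)

def bilinear_similarity(xi, xj, W):
    """Eq. (2): bilinear similarity xi^T W xj."""
    return float(xi @ W @ xj)

# With M = L^T L (L in R^{k x d}), the Mahalanobis distance equals the squared
# Euclidean distance after projecting both points by L.
rng = np.random.default_rng(0)
d, k = 10, 3
L = rng.normal(size=(k, d))
M = L.T @ L
xi, xj = rng.normal(size=d), rng.normal(size=d)
assert np.isclose(mahalanobis_sq(xi, xj, M), np.sum((L @ xi - L @ xj) ** 2))
```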
In distance-based methods, the Hinge loss function is defined as:

(4) $\ell(M, (x, x^+, x^-)) = \max\{0,\ 1 + d_M^2(x, x^+) - d_M^2(x, x^-)\}$.

In similarity-based methods, it is defined as:

(5) $\ell(W, (x, x^+, x^-)) = \max\{0,\ 1 - S_W(x, x^+) + S_W(x, x^-)\}$.

OASIS (Chechik, Sharma et al. 2010) is a popular bilinear similarity learning method that uses the Frobenius norm as a regularization term, i.e. $\mathrm{reg}(W, W_t) = \frac{1}{2}\|W - W_t\|_F^2$. OASIS eliminates the p.s.d. (positive semi-definite) constraint for scalability reasons. However, this constraint is very useful for producing a low-rank metric as well as for preventing overfitting. This work is extended in OMKS (Xia, Hoi et al. 2014) for multi-modal data. OMKS learns a separate linear similarity operator for each source of data in the feature space induced by an RKHS kernel. The final similarity function is a weighted average of these similarity measures. The weights are updated using Hedge (Freund and Schapire 1997) in an online manner. OMDML (Wu, Hoi et al. 2016) is similar to OMKS, but instead of a bilinear similarity, it learns a Mahalanobis distance for each source of data. To enforce the p.s.d. constraint per metric, OMDML uses a full eigenvalue decomposition, which involves $O(d^3)$ operations. Therefore, this method is infeasible for high-dimensional DML tasks. To address this problem, LSMDML (Rasheed, Zabihzadeh et al. 2020) utilizes DRP (Dual Random Projection) (Qian, Jin et al. 2015) in an online multi-modal environment to enforce the p.s.d. constraint per metric. DRP considerably decreases the time of the metric learning process on high-dimensional datasets while preserving the performance of the learned metric. Also, LSMDML combines the learned metrics using a novel PA-based method, which leads to a better convergence rate in comparison with the traditional Hedge algorithm. SLMOML (Zhong, Zheng et al. 2017) is the online version of the seminal ITML (Davis, Kulis et al. 2007) method. It uses the LogDet regularization term, which automatically enforces the p.s.d. constraint at each time step. However, it has a low convergence rate and requires $O(d^2)$ parameters.

In (Hamdan, Zabihzadeh et al. 2018) a large-scale local online Distance/Similarity framework is presented. It learns multiple metrics for the task at hand, one metric per class in the dataset. Each metric in this framework consists of a global and a local component learned simultaneously in an online fashion. It can find a nonlinear projection by learning multiple local metrics. Also, adding the global component to the local metrics shares discriminating information between them and efficiently reduces the overfitting problem. This framework also utilizes DRP (Dual Random Projection) to achieve scalability with respect to the data dimension. OPML (Li, Gao et al. 2018) is an online DML method which learns the projection matrix $L$ (see equation (1)) directly, so it does not require imposing the p.s.d. constraint. In practice, $L$ has a rectangular form ($L \in \mathbb{R}^{k \times d}$, $k \ll d$). However, OPML learns a square $d \times d$ matrix and obtains a closed-form solution with $O(d^2)$ time complexity. It also adopts the Frobenius norm regularization term and the popular Hinge loss function. The interesting feature of OPML is its triplet sampling strategy, which constructs triplets from incoming data in an online manner.
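A minimal sketch of the two margin-based Hinge losses in equations (4) and (5); the function names are illustrative assumptions rather than code from the paper:

```python
import numpy as np

def hinge_distance(M, x, x_pos, x_neg):
    """Eq. (4): max{0, 1 + d_M^2(x, x+) - d_M^2(x, x-)}."""
    d_pos = (x - x_pos) @ M @ (x - x_pos)
    d_neg = (x - x_neg) @ M @ (x - x_neg)
    return max(0.0, 1.0 + d_pos - d_neg)

def hinge_similarity(W, x, x_pos, x_neg):
    """Eq. (5): max{0, 1 - S_W(x, x+) + S_W(x, x-)}."""
    return max(0.0, 1.0 - x @ W @ x_pos + x @ W @ x_neg)
```

Both losses vanish only when the positive item beats the negative one by at least a unit margin under the current measure.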
OAHU (Gao, Li et al. 2019) aims to dynamically adapt the complexity of the model and to effectively utilize the input constraints during the learning process. For this purpose, this method introduces the Adaptive-Bound Triplet Loss (ABTL) instead of the commonly used Hinge loss. Also, it uses an over-complete neural network model and connects a different Metric Embedding Layer (MEL) to the input and to each hidden layer of this network. The overall loss is considered as a weighted average of the losses of these layers. Similar to previous approaches, it uses the Hedge algorithm to update the weights.

As seen, all the studied online Distance/Similarity models assume that the given training information is perfect. However, this assumption may be wrong in practical machine vision applications where this information is collected from the Internet by crowdsourcing or similar techniques. Although some robust DML methods such as (Yang, Jin et al. 2010, Huang, Jin et al. 2012, Wang and Tan 2014, Wang, Nie et al. 2014, Wang and Tan 2018, Zabihzadeh, Monsefi et al. 2019, Rasheed, Zabihzadeh et al. 2020) have been presented to address this emerging challenge, these methods focus on batch environments. Among these algorithms, only the Bayesian approaches (Wang and Tan 2018, Zabihzadeh, Monsefi et al. 2019) can be extended to online settings. However, while Bayesian learning helps to avoid over-fitting on a small dataset or a dataset with noisy features, it is less effective for dealing with the more complicated problem, i.e. label noise.

Many metric learning algorithms generate triplets from training data using the following batch procedure. Each data point $x_i$ is considered similar to its nearest neighbors with the same label (named the target neighbors of $x_i$). Suppose $x_j$ is a target neighbor of $x_i$. An imposter of $x_i$ is any data point $x_l$ of a different class (i.e. $y_l \neq y_i$) which violates the following condition:

$d(x_i, x_l) \ge d(x_i, x_j) + 1$,

where $d$ is a Distance/Similarity measure such as the Euclidean distance. The data point $x_i$ is set dissimilar to any of its imposters. Then, the triplets are formed by the natural join of the similar and dissimilar pairs. Figure 1 illustrates the concepts of target neighbors and imposters. Generating triplets using this procedure is both time- and space-consuming and is not feasible for online tasks. On the other hand, while the online triplet construction adopted by OPML is very efficient in terms of computational cost, it does not consider the distribution and structure of the data. Therefore, it has a lower performance in comparison with the batch algorithm.
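The batch procedure just described (target neighbors plus imposters) can be sketched as follows. This is our own illustrative implementation with an assumed unit margin, not the authors' code; it also makes explicit why the procedure is costly, since it builds the full pairwise distance matrix:

```python
import numpy as np

def build_triplets(X, y, n_targets=3, margin=1.0):
    """Return (anchor, positive, negative) index triplets from labeled data."""
    # Full pairwise Euclidean distance matrix: O(n^2) time and space.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = len(X)
    triplets = []
    for i in range(n):
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        diff = np.where(y != y[i])[0]
        targets = same[np.argsort(D[i, same])[:n_targets]]   # target neighbors
        for j in targets:
            imposters = diff[D[i, diff] < D[i, j] + margin]  # inside the margin
            triplets.extend((i, j, l) for l in imposters)
    return triplets
```

The quadratic cost of the distance matrix is exactly what makes this construction unsuitable for online settings, which motivates the OCTG algorithm proposed later.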
As observed, many Distance/Similarity algorithms are based on the margin-based Hinge loss function ($\ell_{hinge}$). Let us define the margin variable $z$ as follows:

(6) $z = S_W(x, x^+) - S_W(x, x^-)$ for similarity-based methods,
(7) $z = d_M^2(x, x^-) - d_M^2(x, x^+)$ for distance-based methods.

The Hinge loss can then be written as:

(8) $\ell_{hinge}(z) = \max\{0,\ 1 - z\}$.

Figure 2 shows this loss function. As seen, the loss grows linearly and without bound as $z$ decreases below 1. The unboundedness of the Hinge loss causes label-noise and outlier data to have a large effect on the training process, leading to poor performance of the learned Distance/Similarity measure. Most existing Distance/Similarity learning methods can be formulated as follows:

(9) $M_{t+1} = \arg\min_{M} \ \mathrm{reg}(M, M_t) + C\,\ell_{hinge}(M, z_t) \quad [\text{s.t. } M \succeq 0]$.

Note that the constraint $M \succeq 0$ is not mandatory in all methods, so we enclose it in brackets. We can derive many existing methods from this generic optimization problem. For example, if we set $\mathrm{reg}(W, W_t) = \frac{1}{2}\|W - W_t\|_F^2$ and omit the $M \succeq 0$ constraint, then by defining $z$ according to (6) we obtain the OASIS (Chechik, Sharma et al. 2010) and OKS (Online Kernel Similarity) (Xia, Hoi et al. 2014) optimization problems, and if we consider $z$ equal to (7), the optimization problem (9) reduces to that of OPML (Li, Gao et al. 2018). Similarly, if we set $\mathrm{reg}(M, M_t) = \frac{1}{2}\|M - M_t\|_F^2$, enforce $M \succeq 0$, and define $z$ as in (7), we reach the optimization problem in (Wu, Hoi et al. 2016). Finally, if we set $\mathrm{reg}(M, M_t) = D_{ld}(M, M_t) = \mathrm{trace}(M M_t^{-1}) - \log\det(M M_t^{-1}) - d$ and drop the $M \succeq 0$ constraint, then we obtain the optimization problem of (Zhong, Zheng et al. 2017).

One approach to limit the effect of label-noise data in PA-based problems (such as equation (9)) is to select a small value for the hyper-parameter $C$. However, this also lowers the adaptive learning rate of the PA-based algorithm. Instead, we propose to replace the Hinge loss function with the robust Rescaled Hinge loss defined as:

(10) $\ell_{rhinge}(z) = \beta\,\big(1 - \exp(-\eta\,\ell_{hinge}(z))\big)$.

Figure 3 shows the $\ell_{rhinge}(z)$ loss function for different values of $\eta$. In this function, $\eta$ is a rescaling parameter and $\beta = 1/(1 - \exp(-\eta))$ is just a normalizing constant that ensures $\ell_{rhinge}(0) = 1$. As seen, this loss function is more robust than the Hinge function against outliers and data contaminated with label noise. We can adjust the degree of robustness using the $\eta$ parameter. Also, the Hinge loss can be regarded as a special case of the Rescaled Hinge; more specifically, $\ell_{rhinge}(z)$ approaches $\ell_{hinge}(z)$ as $\eta \to 0$. By replacing the Hinge loss with the Rescaled Hinge loss in equation (9), we obtain the following optimization problem for online robust Distance/Similarity learning:

(11) $M_{t+1} = \arg\min_{M} \ \mathrm{reg}(M, M_t) + C\,\ell_{rhinge}(M, z_t) \quad [\text{s.t. } M \succeq 0]$.

In the next subsection, we derive two algorithms that efficiently solve the above optimization problem in an online fashion.

Since the Rescaled Hinge loss is not convex, we need an efficient algorithm to solve the optimization problem (11). The proposed algorithms are based on HQ (Half-Quadratic) optimization, which is an efficient alternating approach for optimizing non-convex problems. The main idea of HQ is to add an auxiliary variable $\sigma$ to the problem using conjugate function theory (Boyd, Boyd et al. 2004), such that the new optimization problem becomes quadratic with respect to the main variable (with the same optimal solution as the original non-convex one). Expanding $\ell_{rhinge}$ by its definition (10), we can obtain the following problem, which is equivalent to (11):

(12) $M_{t+1} = \arg\min_{M} \ \mathrm{reg}(M, M_t) + C\beta\,\big(1 - \exp(-\eta\,\ell_{hinge}(M, z_t))\big)$.

According to the definition of the conjugate function, we have (refer to Appendix A of (Xu, Cao et al. 2017)):

(13) $\exp(-\eta\,\ell_{hinge}(M, z_t)) = \sup_{\sigma < 0}\ \big\{\sigma\,\eta\,\ell_{hinge}(M, z_t) - g(\sigma)\big\}$,

where $g(\sigma) = -\sigma\log(-\sigma) + \sigma$ ($\sigma < 0$). By substituting equation (13) in (12), we obtain:

(14) $\arg\min_{M} \ \mathrm{reg}(M, M_t) + C\beta\,\big(1 - \exp(-\eta\,\ell_{hinge}(M, z_t))\big)$
$= \arg\min_{M} \ \mathrm{reg}(M, M_t) - C\beta\,\sup_{\sigma < 0}\big\{\sigma\,\eta\,\ell_{hinge}(M, z_t) - g(\sigma)\big\}$
$= \arg\min_{M}\ \min_{\sigma < 0} \ \mathrm{reg}(M, M_t) + C\beta\,\big(g(\sigma) - \sigma\,\eta\,\ell_{hinge}(M, z_t)\big)$.

The last relation in (14) holds since $-\mathrm{reg}(M, M_t)$ is constant with respect to $\sigma$. Using (14), we can rewrite (12) as:

(15) $(M_{t+1}, \sigma_{t+1}) = \arg\min_{M,\ \sigma < 0} \ \mathrm{reg}(M, M_t) + C\beta\,\big(g(\sigma) - \sigma\,\eta\,\ell_{hinge}(M, z_t)\big)$.

To solve the above problem, we use the alternating optimization approach. First, given $M$, we optimize (15) over $\sigma$, and then, given $\sigma$, we optimize it over $M$. Suppose $M^{(k)}$ is given (the superscript indicates the iteration number); then (15) is equivalent to:

(16) $\sigma^{(k)} = \arg\min_{\sigma < 0}\ g(\sigma) - \sigma\,\eta\,\ell_{hinge}(M^{(k)}, z_t)$.

The above problem has a closed-form solution, obtained by setting its derivative with respect to $\sigma$ equal to zero:

(17) $\sigma^{(k)} = -\exp\big(-\eta\,\ell_{hinge}(M^{(k)}, z_t)\big)$.

After obtaining $\sigma^{(k)}$, we optimize equation (15) with respect to $M$ to obtain $M^{(k+1)}$ as follows:

(18) $M^{(k+1)} = \arg\min_{M} \ \mathrm{reg}(M, M_t) - C\beta\,\sigma^{(k)}\,\eta\,\ell_{hinge}(M, z_t)$.

The above problem is equivalent to:

(19) $M^{(k+1)} = \arg\min_{M} \ \mathrm{reg}(M, M_t) + C\,\alpha_t\,\ell_{hinge}(M, z_t)$,

where $\alpha_t = -\sigma^{(k)} = \exp(-\eta\,\ell_{hinge}(M^{(k)}, z_t))$ is the adaptive weight of the current triplet (the constant factor $\beta\eta$ is absorbed into $C$). The robustness of the optimization problem (19) can be explained using the penalty factor $\alpha_t$. Suppose the current triplet contains label-noise data, so the hinge function $\ell_{hinge}(z_t)$ returns a large loss for it. Thus, $\alpha_t$ approaches zero, and consequently this triplet has less effect on the learning process. The obtained optimization problem, unlike existing models, assigns an adaptive weight ($\alpha_t$) to each incoming triplet. By adjusting $\mathrm{reg}(M, M_t)$, the p.s.d. constraint, and $z$, we can obtain a family of robust Distance/Similarity learning methods.
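The Rescaled Hinge loss of equation (10) and the adaptive triplet weight that results from equation (17) are cheap to compute. A small sketch (our names, assuming the margin variable $z$ has already been computed under the current measure):

```python
import numpy as np

def hinge(z):
    """Eq. (8): l_hinge(z) = max{0, 1 - z}."""
    return max(0.0, 1.0 - z)

def rescaled_hinge(z, eta=1.0):
    """Eq. (10): beta * (1 - exp(-eta * l_hinge(z))) with beta = 1 / (1 - exp(-eta)).
    Bounded above by beta, so heavily violated triplets (outliers) saturate."""
    beta = 1.0 / (1.0 - np.exp(-eta))
    return beta * (1.0 - np.exp(-eta * hinge(z)))

def triplet_weight(z, eta=1.0):
    """alpha = -sigma = exp(-eta * l_hinge(z)) from eq. (17): close to 1 for clean
    triplets and close to 0 for triplets with a large hinge loss (likely noise)."""
    return float(np.exp(-eta * hinge(z)))
```

For example, with eta = 1 a satisfied triplet (z >= 1) keeps weight 1.0, while a badly violated one with z = -4 receives weight exp(-5), about 0.007.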
For instance, we develop two proposed algorithms named Robust-OASIS and Robust-ODML. These algorithms can be considered as robust versions of the existing OASIS (Chechik, Sharma et al. 2010) and ODML (Wu, Hoi et al. 2016) methods, respectively.

Robust-OASIS. This robust similarity-based algorithm can be derived from the general optimization problem (15) with the following settings: $\mathrm{reg}(W, W_t) = \frac{1}{2}\|W - W_t\|_F^2$, drop the p.s.d. constraint, and define $z$ according to (6). Then, the following optimization problem is obtained:

(20) $W_{t+1} = \arg\min_{W} \ \frac{1}{2}\|W - W_t\|_F^2 + C\,\ell_{rhinge}(W, (x, x^+, x^-))$.

The solution of the above problem is obtained by iteratively computing $\sigma$ from equation (17) and then optimizing $W$ by solving the following optimization problem:

(21) $W^{(k+1)} = \arg\min_{W} \ \frac{1}{2}\|W - W_t\|_F^2 + C\,\alpha_t\,\ell_{hinge}(W, (x, x^+, x^-))$.

The problem (21) has a similar solution to that obtained in (Chechik, Sharma et al. 2010). The main difference is that the learning rate is now bounded by the adaptive value $C\,\alpha_t$ instead of the fixed aggressiveness parameter $C$ used in the OASIS method. Algorithm 1 summarizes the steps of Robust-OASIS. For each incoming triplet, the inner loop repeats the following steps:
1. Compute $\sigma$ from equation (17).
2. Compute the triplet weight $\alpha_t = \exp(-\eta\,\ell_{hinge})$.
3. Obtain the learning rate $\tau_t = \min\big(C\,\alpha_t,\ \ell_{hinge}/\|V_t\|_F^2\big)$, where $V_t = x\,(x^+ - x^-)^T$, and update $W \leftarrow W + \tau_t V_t$. (22)

Robust-ODML. This robust Mahalanobis-based algorithm can be derived from the general optimization problem (15) with the following settings: $\mathrm{reg}(M, M_t) = \frac{1}{2}\|M - M_t\|_F^2$, enforce the $M \succeq 0$ constraint, and define $z$ according to (7). Then, we obtain the following optimization problem:

(23) $M_{t+1} = \arg\min_{M \succeq 0} \ \frac{1}{2}\|M - M_t\|_F^2 + C\,\ell_{rhinge}(M, (x, x^+, x^-))$.

Similar to Robust-OASIS, we obtain the solution by iteratively computing $\sigma$ from equation (17) and then optimizing $M$ by solving the following optimization problem:

(24) $M^{(k+1)} = \arg\min_{M} \ \frac{1}{2}\|M - M_t\|_F^2 + C\,\alpha_t\,\xi$
subject to $\ell(M, (x, x^+, x^-)) = 1 + d_M^2(x, x^+) - d_M^2(x, x^-) \le \xi$, $\xi \ge 0$, $M \succeq 0$.

The solution of the above problem is similar to that proposed in (Wu, Hoi et al. 2016). To enforce the p.s.d. constraint, the simplest approach is to perform a full eigenvalue decomposition of the matrix $M$ and then set its negative eigenvalues to zero. This approach requires $O(d^3)$ operations, so it is infeasible for high-dimensional DML tasks. Although some improved methods are available (Qian, Jin et al. 2015, Hamdan, Zabihzadeh et al. 2018, Rasheed, Zabihzadeh et al. 2020), we address this problem by developing low-rank versions of the proposed algorithms in the next subsection.
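A minimal sketch of one Robust-OASIS update as summarized in Algorithm 1. The outer-product form of the OASIS update and all names are our reconstruction under stated assumptions, not the authors' code; the adaptive cap C * alpha on the learning rate follows equations (17) and (21):

```python
import numpy as np

def robust_oasis_step(W, x, x_pos, x_neg, C=0.1, eta=1.0, n_hq_iters=3):
    """One online update of the bilinear similarity matrix W on a single triplet."""
    alpha = 1.0
    for _ in range(n_hq_iters):
        z = x @ W @ x_pos - x @ W @ x_neg            # similarity margin, eq. (6)
        loss = max(0.0, 1.0 - z)                     # hinge loss, eq. (8)
        alpha = np.exp(-eta * loss)                  # adaptive triplet weight, eq. (17)
        if loss == 0.0:
            break                                    # passive step: nothing to update
        V = np.outer(x, x_pos - x_neg)               # OASIS update direction
        tau = min(C * alpha, loss / np.sum(V * V))   # weight-bounded learning rate
        W = W + tau * V
    return W, alpha
```

A few alternating iterations suffice; the paper reports that taking at most three gives the best results.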
The low-rank versions of the proposed algorithms learn the projection matrix $L$ (with $M = L^T L$) directly. Therefore, they automatically enforce the p.s.d. constraint. Also, in many practical applications, $L$ has a rectangular form ($L \in \mathbb{R}^{k \times d}$, where $k \ll d$ is the rank of the matrix $M$). Thus, the provided low-rank algorithms require fewer parameters. The optimization problem for low-rank online Distance/Similarity learning can be formulated as:

(26) $L_{t+1} = \arg\min_{L} \ \mathrm{reg}(L, L_t) + C\,\ell_{rhinge}(L, z_t)$.

We can rewrite both the bilinear similarity and the Mahalanobis distance as functions of $L$ as follows:

(27) $S_L(x_i, x_j) = x_i^T L^T L\,x_j = (L x_i)^T (L x_j)$,
(28) $d_L^2(x_i, x_j) = (x_i - x_j)^T L^T L\,(x_i - x_j) = \|L x_i - L x_j\|^2$.

Thus, bilinear similarity learning is equivalent to finding a linear projection and then applying the dot product to the inputs in the projected space. Similarly, Mahalanobis distance learning corresponds to computing the Euclidean distance after transforming the inputs by $L$. The variable $z$ can be expressed in terms of $L$ and the triplet $(x, x^+, x^-)$ as:

(29) $z = S_L(x, x^+) - S_L(x, x^-)$,
(30) $z = d_L^2(x, x^-) - d_L^2(x, x^+)$.

Now, we can easily develop the proposed low-rank robust similarity learning algorithm named Robust-LOSL from the generic optimization problem (26) with the following settings: $\mathrm{reg}(L, L_t) = \frac{1}{2}\|L - L_t\|_F^2$ and define $z$ according to (29). The obtained optimization problem can be solved by iteratively computing $\sigma$ from equation (17) and then optimizing $L$ by solving the following optimization problem:

(31) $L^{(k+1)} = \arg\min_{L} \ \frac{1}{2}\|L - L_t\|_F^2 + C\,\alpha_t\,\xi$
subject to $\ell(L, (x, x^+, x^-)) = 1 - S_L(x, x^+) + S_L(x, x^-) \le \xi$, $\xi \ge 0$.

The above optimization problem is non-convex. However, we can solve it efficiently by optimizing the simple linear neural network model depicted in Figure 4. Thus, we can train the network using backpropagation or more sophisticated optimizers such as Adam. The steps of Robust-LOSL are summarized in Algorithm 3. Similarly, we can derive the proposed low-rank robust distance learning algorithm named Robust-LODML from the generic optimization problem (26) with the following settings: $\mathrm{reg}(L, L_t) = \frac{1}{2}\|L - L_t\|_F^2$ and define $z$ according to (30). We solve the obtained problem iteratively by computing $\sigma$ from equation (17) and then updating $L$ by optimizing the neural network model presented in Figure 4. The sub-gradient of the loss function with respect to $L$ can be computed as:

$\nabla_L = (L - L_t) + 2\,C\,\alpha_t\,L\,\big[(x - x^+)(x - x^+)^T - (x - x^-)(x - x^-)^T\big]$ when $\ell_{hinge} > 0$, and $\nabla_L = L - L_t$ otherwise.

Algorithm 4 summarizes the steps of Robust-LODML. For each incoming triplet, the inner loop: (1) computes $\sigma$ from equation (17); (2) computes the triplet weight $\alpha_t = \exp(-\eta\,\ell_{hinge})$; and (3) optimizes the network model parameterized by $L$ using a subgradient descent step. We can easily replace the linear module in the proposed low-rank model with a nonlinear deep neural network module. Thus, extending our methods to online deep Distance/Similarity learning is straightforward. Also, the experimental results in the next section confirm that the proposed low-rank methods not only reduce the computational cost significantly but also keep the predictive performance of the learned measure.

As seen, the proposed robust online Distance/Similarity learning model is general and can easily be applied to existing online Distance/Similarity algorithms. Let $A$ be an online Distance/Similarity algorithm with time complexity $T_A$. By applying our method to $A$, besides optimizing the Distance/Similarity measure, we need to compute the weight of the incoming triplet ($\alpha_t$) using equation (17). Although this requires the evaluation of $\ell_{hinge}(z)$, this evaluation is also needed for updating the metric. Therefore, it does not imply additional costs, and the overall time complexity of the robust method is $O(r \times T_A)$, where $r$ is the number of iterations of the alternating loop. The experimental results confirm that the convergence of the alternating loop is fast, and the best results are obtained by taking $r \le 3$ in all experiments. Therefore, the obtained robust method has essentially the same time complexity as the corresponding algorithm $A$.

Generating triplets using the available batch algorithms is both time- and space-consuming. Also, the one-pass triplet construction strategy adopted in OPML has low performance, especially in noisy environments. To this end, we propose an online triplet construction algorithm named OCTG which is not only very efficient but also effective in comparison with the available batch methods. By utilizing the distribution and clusters of the input data, the proposed algorithm can effectively detect outliers and noisy-label data. Therefore, its performance surpasses existing methods in noisy environments. Suppose $\{c_j \mid j = 1, 2, \ldots, m\}$ is the set of cluster centers, initialized from a sample of the data at the beginning of the online algorithm. In this paper, we use the k-means algorithm to obtain the cluster centers per class in the dataset. OCTG receives the incoming data $(x_t, y_t)$ at time step $t$ and finds its closest cluster center $c_p$ of the same class. Then, it considers any cluster center $c_q$ of a different class (i.e. with a label different from $y_t$) which violates the following condition as an imposter (see Figure 5):

$d(x_t, c_q) \ge d(x_t, c_p) + 1$.

The triplet set constructed at time step $t$ is formed as:

$T_t = \{(x_t, c_p, c_q) \mid c_q \text{ is an imposter of } x_t\}$.

As seen, the proposed methods assign a weight to each incoming triplet. We assign the instance $x_t$ a weight $w_t$ equal to the minimum of the weights of its generated triplets.
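A sketch of one OCTG step for the distance-based case. The unit margin, the handling of the cluster centers, and all names are our assumptions; `weight_fn` stands for a routine that evaluates exp(-eta * l_hinge) for a triplet under the current metric:

```python
import numpy as np

def octg_step(x_t, y_t, centers, center_labels, weight_fn):
    """Build triplets (x_t, c_p, c_q) from cluster centers and return them together
    with the instance weight w_t (minimum of the triplet weights)."""
    centers = np.asarray(centers)
    center_labels = np.asarray(center_labels)
    dists = np.linalg.norm(centers - x_t, axis=1)
    same = np.where(center_labels == y_t)[0]          # assumes every class has a center
    c_p = same[np.argmin(dists[same])]                # closest same-class center
    imposters = [q for q in np.where(center_labels != y_t)[0]
                 if dists[q] < dists[c_p] + 1.0]      # violates the margin condition
    triplets = [(x_t, centers[c_p], centers[q]) for q in imposters]
    weights = [weight_fn(a, p, n) for a, p, n in triplets]
    w_t = min(weights) if weights else 1.0
    return triplets, w_t
```

An instance whose every generated triplet is badly violated therefore receives a small w_t, flagging it as a likely outlier or noisy-label point.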
A small value of $w_t$ means that $x_t$ is a potential outlier or a noisy-label instance. The weight and the input data can then be used to update the cluster centers using any existing online clustering method. The obtained weights can also be used to enhance the performance of any metric-based algorithm such as kNN or CBIR (Content-Based Information Retrieval) in noisy environments. For example, we use a modified version of kNN, named Robust-kNN, instead of the standard kNN to classify the objects in the experiments.

This section deals with the experiments performed to evaluate the performance of the proposed methods in noisy environments. We first study the effect of label noise on the generated triplets and then discuss how these noisy triplets affect the performance of online Distance/Similarity methods. Afterward, we evaluate the performance of the proposed methods on real datasets at different levels of label noise. The results are compared with peer Distance/Similarity methods.

As depicted in Figure 7, we can distinguish between three different types of noisy triplets: anchor, positive, and negative noisy triplets. To study the effects of the different types of noisy triplets, we apply 10% label noise to the Wine dataset. The noisy dataset is visualized using the t-SNE algorithm (Maaten and Hinton 2008) in Figure 8. The statistics of the triplets generated by both the batch and OCTG methods are summarized in Table 1. The proposed methods assign very small weights ($w = \exp(-\eta\,\ell_{hinge}(z))$) to the noisy triplets in the learning process, so these triplets have less effect on the learned metric.

To analyze the effect of the different types of triplet noise in a typical DML task, we run ODML (Wu, Hoi et al. 2016) with the following settings on the triplets generated by the batch method:
- Ideal ODML: the ideal algorithm which knows the noisy triplets in advance and so ignores them in the training process.
- Anchor Ideal ODML: the ideal algorithm which knows only the anchor noisy triplets in advance.
- Pos Ideal ODML: the ideal algorithm which knows only the positive noisy triplets in advance.
- Neg Ideal ODML: the ideal algorithm which knows only the negative noisy triplets in advance.

In this experiment, we divide the dataset into train/test sets with a 70/30 ratio and run the above algorithms 10 times on the dataset. Figure 9 depicts the mean of the results obtained by the various algorithms. For small values of $C$, the results indicate that the metric learned by ODML has no meaningful difference from the Euclidean one. For large values of $C$, ODML performs worse than Euclidean, and its accuracy substantially degrades in the noisy environment. Also, among the ideal methods (which cannot be implemented in practice), Anchor Ideal ODML has the same performance as Ideal ODML, while the others (Pos Ideal ODML, Neg Ideal ODML) are ineffective. Thus, the main reason for the low performance in this DML task is the anchor noisy triplets.

We repeat the experiment by running Robust-LODML using the triplets generated by our mechanism. The results are depicted in Figure 10 and Figure 11. As the results show, the proposed method is robust against label noise, and its performance surpasses the Euclidean metric even for large values of $C$. Also, Robust-LODML effectively identifies the contaminated instances and considerably reduces their weights in the training process. These results are obtained using only one dataset; in the next subsections, we evaluate the proposed methods on a variety of datasets at different levels of label noise and compare the results with state-of-the-art methods.
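The Robust-kNN classifier used in these experiments is referred to above but not reproduced in the extracted text; one plausible realization, given purely as our assumption, weights each neighbor's vote by its instance weight w so that suspected noisy-label neighbors contribute little:

```python
import numpy as np

def robust_knn_predict(X_train, y_train, w_train, x, k=3, metric=None):
    """Weighted kNN vote under a (possibly learned) metric; `metric(a, b)` defaults
    to the Euclidean distance, and w_train holds the instance weights from OCTG."""
    if metric is None:
        metric = lambda a, b: np.linalg.norm(a - b)
    dists = np.array([metric(xi, x) for xi in X_train])
    neighbors = np.argsort(dists)[:k]
    votes = {}
    for i in neighbors:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w_train[i]
    return max(votes, key=votes.get)
```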
Table 2 summarizes the statistics of the datasets evaluated in the experiments. The data in all datasets except Letters are normalized so that the mean and standard deviation of each attribute become 0 and 1, respectively. Also, the dimension of the images in Extended Yale Faces is reduced to 100 by applying PCA to alleviate noise effects. The dimension parameter in Table 2 denotes the input dimension after dimension reduction. In the experiments, triplet side information is generated using OCTG for the proposed methods, while the one-pass triplet construction in (Li, Gao et al. 2018) is adopted for the other methods. In the German Credit dataset, each instance represents a person who takes credit from a bank and is classified as a good or bad credit risk according to the set of attributes (https://www.kaggle.com/uciml/german-credit).

The results are obtained by k-fold cross-validation (k = 5 for Letters and Extended Yale Faces and k = 10 for the other datasets). The results are compared with peer distance-based methods: ODML (Wu, Hoi et al. 2016), LPA-ODML (Hamdan, Zabihzadeh et al. 2018), and OPML (Li, Gao et al. 2018). The hyper-parameters of the competing methods are adjusted by k-fold cross-validation as follows. The parameter $C$ in ODML and the proposed methods is selected from $(10^{-6}, 30)$. The $\eta$ in the proposed methods is chosen from the range $(0.01, 5)$. Also, the corresponding hyper-parameter of OPML is selected from $(10^{-6}, 0.05)$. To evaluate the performance of the learned metrics, we adopt the kNN classifier with $k = 3$ in the experiments.

Table 3 presents the classification accuracy of kNN using the learned metrics of the competing methods. Here, the parameter $p$ denotes the label noise level (in percent). The results are obtained by k-fold cross-validation on these datasets. Also, Figure 12 depicts the mean k-fold cross-validation accuracy of the competing methods versus $p$ (ranging from 0% to 20%). As the results in Table 3 and Figure 12 indicate, the proposed robust methods (i.e. Robust-ODML and Robust-LODML) significantly outperform the other DML methods in the presence of label noise. Also, the performance of these methods declines more slowly than that of the other methods as the noise level increases. This confirms our claim that using the robust loss function jointly with the proposed robust sampling preserves the discriminative power of the learned metric in a noisy environment. In addition, the low-rank version of the proposed method (i.e. Robust-LODML) has almost the same accuracy as Robust-ODML while reducing the computational cost significantly. In the next subsection, we evaluate the proposed methods on a more challenging dataset used to detect COVID-19 patients from chest X-ray images.

The dataset used in our experiments is publicly available in the Kaggle repository (Chowdhury, Rahman et al. 2020). Figure 13 depicts some examples from both classes. It contains 219 COVID-19 cases and 1341 normal images. As seen, the dataset is imbalanced and is too small to train a deep CNN model from scratch. To extract features from the images, we use the pretrained ResNet18 (He, Zhang et al. 2016). This network is trained on the ImageNet dataset (with 1.4 million labeled images and 1,000 different classes). It has 71 layers, and the input layer requires images of size 224-by-224-by-3. We resize the images to the specified size and obtain 512 features from the global pooling layer, 'pool5', at the end of the model. We use 5-fold cross-validation to obtain the results in the experiments.
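The feature-extraction step described above can be reproduced along the following lines in PyTorch. The paper names the 'pool5' global pooling layer of a 71-layer ResNet18; we assume torchvision's resnet18 and its final average-pooling layer as the equivalent, so this is a sketch rather than the authors' pipeline:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained ResNet18; dropping the final fc layer leaves the 512-d global-pooling output.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # the input size required by the network
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(image_path):
    """Return the 512-dimensional feature vector for one chest X-ray image."""
    img = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        feats = feature_extractor(preprocess(img).unsqueeze(0))  # shape (1, 512, 1, 1)
    return feats.flatten().numpy()
```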
The main concern in this task is to limit the number of missed COVID-19 cases. Hence, in addition to accuracy, we utilize a variety of metrics to evaluate our work. These metrics are Sensitivity (Recall), Precision, F1 Score, and G-mean (Geometric mean). Here, COVID-19 and Normal are considered as the positive and negative classes, respectively. The metrics are defined as follows:

Sensitivity $= TP / (TP + FN)$,
Precision $= TP / (TP + FP)$,
F1 $= 2 \cdot \mathrm{Precision} \cdot \mathrm{Sensitivity} / (\mathrm{Precision} + \mathrm{Sensitivity})$,
G-mean $= \sqrt{\mathrm{Sensitivity} \times \mathrm{Specificity}}$, where Specificity $= TN / (TN + FP)$.

Table 4 presents the classification results of kNN using the learned metrics at the different levels of label noise. The sensitivity and precision of the competing methods versus noise level are shown in Figure 13(a). Since sensitivity is more important in this task, we multiply it by 2. Also, Figure 13(b) presents the G-mean results versus noise level. A high value of G-mean indicates that the accuracy in both classes is high and balanced. As the results indicate, the proposed methods achieve high sensitivity for COVID-19 patients in noisy environments. This is very important since the primary goal of this task is to limit the number of misclassified COVID-19 cases as much as possible. For example, the confusion matrices of the proposed methods at $p = 20\%$ are shown in Table 5. As seen, only 1.8 and 1 COVID-19 patients (as the average over the 5-fold cross-validation) are misclassified as Normal by the proposed methods. Also, our methods obtain good precision (or positive predictive value). High precision is important since a high FP (False Positive) rate increases the burden on the healthcare system through additional care and tests such as PCR (Polymerase Chain Reaction). Therefore, based on the results, we can conclude that the proposed methods perform well in detecting COVID-19 cases in the presence of label noise. However, the difference between the sensitivity and specificity values indicates that further improvements are possible by adopting balancing techniques for this imbalanced dataset.

Existing online Distance/Similarity learning methods are usually formulated with the Hinge loss and so are not robust against outliers and label noise. Also, they often rely on the unrealistic assumption that training triplets or pairwise constraints exist in advance, while generating triplets using the available batch algorithms is both time- and space-consuming. To address these challenges, we formulate the online Distance/Similarity learning problem using the robust Rescaled Hinge loss function (Xu et al. 2017). Also, an efficient robust one-pass triplet construction algorithm is presented in this paper. We further extend our work by providing low-rank versions of the proposed methods, which not only reduce the computational cost significantly but also keep the predictive performance of the learned metrics. We study the effects of label noise in a DML task and conduct several experiments to measure the performance of the proposed methods at different noise levels. Extensive experimental results show that the proposed methods can effectively detect wrongly labeled data and reduce their influence in DML tasks. Thus, they consistently outperform peer online Distance/Similarity learning algorithms in noisy environments. We intend to extend this work to online deep Distance/Similarity learning. Some other directions for future work are:
I. Examining the performance of the proposed methods in other applications such as CBIR.
II. Extending the proposed methods to imbalanced environments.
III. Enhancing the performance of the proposed online triplet construction algorithm.
References:
- A Survey on Metric Learning for Feature Vectors and Structured Data
- Convex optimization
- Large Scale Online Learning of Image Similarity Through Ranking
- Can AI help in screening viral and COVID-19 pneumonia
- Online Passive-Aggressive Algorithms
- Information-theoretic metric learning
- A decision-theoretic generalization of on-line learning and an application to boosting
- Towards self-adaptive metric learning on the fly
- Large Scale Local Online Similarity/Distance Learning Framework based on Passive/Aggressive
- Deep residual learning for image recognition
- Robust metric learning by smooth optimization
- Acquiring linear subspaces for face recognition under variable lighting
- OPML: A one-pass closed-form solution for online metric learning
- UCI Machine Learning Repository
- Cross-domain visual matching via generalized similarity measure and feature learning
- Visualizing data using t-SNE
- Fine-grained visual categorization via multi-stage metric learning
- Large-Scale Multi-modal Distance Metric Learning with Application to Content-Based Information Retrieval and Image Classification
- Robust Distance Metric Learning in the Presence of Label Noise
- Robust Distance Metric Learning via Bayesian Inference
- Robust Distance Metric Learning via Simultaneous L1-Norm Minimization and Maximization
- Distance Metric Learning for Large Margin Nearest Neighbor Classification
- Online multi-modal distance metric learning with application to image retrieval
- Online multiple kernel similarity learning for visual search
- Robust support vector machines based on the rescaled hinge loss function
- Learning from noisy side information by generalized maximum entropy model
- Sparse Bayesian similarity learning based on posterior distribution of data
- Sparse Bayesian approach for metric learning in latent space
- Slmoml: online metric learning with global convergence

We would like to acknowledge the Machine Learning Lab in the Engineering Faculty of FUM for their kind and technical support.