Charoenphakdee, Nontawat; Lee, Jongyeong; Sugiyama, Masashi. A Symmetric Loss Perspective of Reliable Machine Learning. 2021-01-05.

When minimizing the empirical risk in binary classification, it is a common practice to replace the zero-one loss with a surrogate loss to make the learning objective feasible to optimize. Examples of well-known surrogate losses for binary classification include the logistic loss, hinge loss, and sigmoid loss. It is known that the choice of a surrogate loss can highly influence the performance of the trained classifier and therefore it should be carefully chosen. Recently, surrogate losses that satisfy a certain symmetric condition (a.k.a. symmetric losses) have demonstrated their usefulness in learning from corrupted labels. In this article, we provide an overview of symmetric losses and their applications. First, we review how a symmetric loss can yield robust classification from corrupted labels in balanced error rate (BER) minimization and area under the receiver operating characteristic curve (AUC) maximization. Then, we demonstrate how the robust AUC maximization method can benefit natural language processing in the problem where we want to learn only from relevant keywords and unlabeled documents. Finally, we conclude this article by discussing future directions, including potential applications of symmetric losses for reliable machine learning and the design of non-symmetric losses that can benefit from the symmetric condition.

Modern machine learning methods such as deep learning typically require a large amount of data to achieve desirable performance [Schmidhuber, 2015; Goodfellow et al., 2016]. However, it is often the case that the labeling process is costly and time-consuming.
To mitigate this problem, one may consider collecting training labels through crowdsourcing [Dawid and Skene, 1979; Kittur et al., 2008], which is a popular approach and has become more convenient in recent years [Deng et al., 2009; Crowston, 2012; Sun et al., 2014; Vaughan, 2017; Pandey et al., 2020; Vermicelli et al., 2020; Washington et al., 2020]. For example, crowdsourcing has been used for tackling the COVID-19 pandemic to accelerate research and drug discovery [Vermicelli et al., 2020; Chodera et al., 2020]. However, a big challenge of crowdsourcing is that the collected labels can be unreliable because non-expert annotators may fail to provide correct information [Lease, 2011; Zhang et al., 2014; Gao et al., 2016; Imamura et al., 2018]. Moreover, not only non-experts but even expert annotators can make mistakes. As a result, it is unrealistic to expect that the collected training data are always reliable. It is well known that training from data with noisy labels can give an inaccurate classifier [Aslam and Decatur, 1996; Biggio et al., 2011; Cesa-Bianchi et al., 1999; Frénay and Verleysen, 2013; Natarajan et al., 2013]. Interestingly, it has been shown that the trained classifier may perform only slightly better than random guessing even under a simple noise assumption [Long and Servedio, 2010]. Since learning from noisy labels is challenging and highly relevant in the real world, this problem has been studied extensively in both theoretical and practical aspects [Van Rooyen and Williamson, 2017; Jiang et al., 2018; Algan and Ulusoy, 2019; Liu and Guo, 2020; Wei and Liu, 2020; Karimi et al., 2020; Han et al., 2018]. Recently, a loss function that satisfies a certain symmetric condition has demonstrated its usefulness in learning from noisy labels.
A pioneering work in this direction is that of Manwani and Sastry [2013], who showed that learning with symmetric losses is robust under random classification noise (see also Ghosh et al. [2015] and van Rooyen et al. [2015a]). However, the assumption of random classification noise can be restrictive since it assumes that each training label can be flipped independently with a fixed probability regardless of its original label. As a result, it is important to investigate more realistic noise models that reflect real-world situations more accurately. In this article, we review the robustness result of symmetric losses in the setting of mutually contaminated noise [Scott et al., 2013; Menon et al., 2015]. This noise model has been proven to be quite general since it encompasses well-known noise assumptions such as random classification noise and class-conditional noise [Menon et al., 2015; Lu et al., 2019]. Furthermore, many instances of weakly-supervised learning problems can also be formulated in the setting of mutually contaminated noise [Kiryo et al., 2017; Bao et al., 2018; Lu et al., 2019; Shimada et al., 2020]. In this article, we will discuss how using a symmetric loss can be advantageous in BER and AUC optimization under mutually contaminated noise. Interestingly, with a symmetric loss, one does not need the knowledge of the noise rate to learn effectively with a theoretical guarantee [Charoenphakdee et al., 2019b]. This article also demonstrates how to use a symmetric loss in a real-world problem in the context of natural language processing. We discuss an application of symmetric losses for learning a reliable classifier from only relevant keywords and unlabeled documents [Jin et al., 2017; Charoenphakdee et al., 2019a]. In this problem, we first collect unlabeled documents. Then, we collect relevant keywords that are useful for determining the target class of interest.
Unlike collecting labels for every training document, collecting keywords can be much cheaper, and the number of keywords does not necessarily scale with the number of unlabeled training documents [Chang et al., 2008; Song and Roth, 2014; Chen et al., 2015; Li and Yang, 2018; Jin et al., 2017, 2020]. We will discuss how this problem can be formulated in the framework of learning under mutually contaminated noise and how using a symmetric loss can be highly useful for solving it [Charoenphakdee et al., 2019a]. In this section, we review the standard formulation of binary classification based on empirical risk minimization [Vapnik, 1998] and well-known evaluation metrics. Then, we review the definition of symmetric losses and the problem of learning from corrupted labels. Here, we review the problem of binary classification, where the goal is to learn an accurate classifier from labeled training data. Let x ∈ X denote a pattern in an input space X. For example, an input space X could be the space of d-dimensional real-valued vectors R^d. Also, let y ∈ {−1, +1} denote a class label and g : X → R denote a prediction function that we want to learn from data. In binary classification, we use the function sign(g(x)) to determine the predicted label, where sign(g(x)) = 1 if g(x) > 0, −1 if g(x) < 0, and 0 otherwise. In ordinary binary classification, we are given n labeled training examples, {(x_i, y_i)}_{i=1}^n, which are assumed to be drawn independently from a joint distribution D with density p(x, y). Next, to evaluate the prediction function, we define the zero-one loss as

ℓ_{0-1}(z) = (1 − sign(z)) / 2.

Given an input-output pair (x, y), if the signs of y and g(x) are identical, we have zero penalty, i.e., ℓ_{0-1}(y g(x)) = 0. On the other hand, we have ℓ_{0-1}(y g(x)) = 1 if the signs of y and g(x) differ, which indicates an incorrect prediction.
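As a minimal illustration (hypothetical code, not from the article), the sign-based prediction rule and the zero-one loss on the margin z = y·g(x) can be sketched as follows:

```python
import numpy as np

def zero_one_loss(margin):
    # zero-one loss on the margin z = y * g(x):
    # 0 for a correct prediction (z > 0), 1 for an incorrect one (z < 0),
    # and 1/2 on the decision boundary (z = 0)
    return (1.0 - np.sign(margin)) / 2.0

# a linear prediction function g(x) = <w, x> as a toy running example
w = np.array([1.0, -2.0])
x = np.array([3.0, 1.0])   # g(x) = 1.0 > 0, so the predicted label is +1
y = 1
print(zero_one_loss(y * np.dot(w, x)))  # 0.0: correct prediction
```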
The goal of binary classification is to find a prediction function g that minimizes the following misclassification risk, i.e., the risk corresponding to the classification error rate (CER):

R^{0-1}_CER(g) = E_{(x,y)∼D}[ℓ_{0-1}(y g(x))].   (2)

As suggested by Eq. (2), our goal is to find a prediction function that performs well w.r.t. the whole distribution on average; that is, g should also classify unseen examples from the same data distribution accurately, not only perform well on the observed training examples. In practice, the misclassification risk R^{0-1}_CER cannot be directly minimized because we only observe finite training examples, not the whole probability distribution. By using training examples, the empirical risk minimization framework suggests finding g by minimizing the following empirical risk [Vapnik, 1998]:

R̂^{0-1}_CER(g) = (1/n) Σ_{i=1}^n ℓ_{0-1}(y_i g(x_i)).   (3)

Although the risk estimator in Eq. (3) is an unbiased and consistent estimator of the misclassification risk [Vapnik, 1998], it is not straightforward to minimize it directly. Indeed, with the zero-one loss in the empirical risk, the minimization problem is known to be computationally infeasible: it is NP-hard even if the function class for g is the class of linear hyperplanes [Ben-David et al., 2003; Feldman et al., 2012]. Moreover, the gradient of the zero-one loss is zero almost everywhere, which hinders the use of gradient-based optimization methods. To mitigate this problem, it is common practice to replace the zero-one loss with a different loss function ℓ that is easier to minimize, called a surrogate loss [Zhang, 2004; Bartlett et al., 2006]. As a result, we minimize the following empirical surrogate risk:

R̂^ℓ_CER(g) = (1/n) Σ_{i=1}^n ℓ(y_i g(x_i)),

where regularization techniques can also be employed to avoid overfitting. The choice of a surrogate loss is highly crucial for training a good classifier and should be made carefully.
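The empirical surrogate risk can be sketched in a few lines. The surrogate losses and the toy linear model below are illustrative choices, not the article's experimental setup:

```python
import numpy as np

# common surrogate losses, each a function of the margin z = y * g(x)
def logistic_loss(z):
    return np.log1p(np.exp(-z))

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)

def sigmoid_loss(z):
    return 1.0 / (1.0 + np.exp(z))

def empirical_risk(loss, g, X, y):
    # empirical surrogate risk: (1/n) * sum_i loss(y_i * g(x_i))
    return np.mean(loss(y * g(X)))

# toy linear model on 2-D inputs (made-up data, for illustration only)
g = lambda X: X @ np.array([1.0, -1.0])
X = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 0.0]])
y = np.array([1, -1, 1])
print(empirical_risk(hinge_loss, g, X, y))  # 0.0: every margin is >= 1
```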
This is because the ultimate goal is still to minimize the misclassification risk R^{0-1}_CER(g), not the surrogate risk R^ℓ_CER(g). To ensure that minimizing the surrogate risk R^ℓ_CER(g) yields a meaningful solution for the misclassification risk R^{0-1}_CER(g), a surrogate loss should satisfy a classification-calibration condition, which is known to be a minimum requirement for binary classification (see Bartlett et al. [2006] for more details). Many well-known surrogate losses in binary classification satisfy this property. Table 1 provides examples of classification-calibrated losses and their properties [Bartlett et al., 2006; Masnadi-Shirazi and Vasconcelos, 2009; Masnadi-Shirazi et al., 2010; van Rooyen et al., 2015a]. Although CER has been used extensively, one should be aware that this evaluation metric may not be informative when the test labels are highly imbalanced [Menon et al., 2013]. For example, consider a trivial prediction function g_pos such that g_pos(x) > 0 for any x, that is, g_pos only predicts the positive label. If 99% of the test labels are positive, the CER of g_pos is 0.01, which may indicate very good performance. However, g_pos does not give any meaningful information since it always predicts the positive label regardless of the input x. Thus, a low CER may mislead someone into thinking that g_pos is a good classifier. Here, we review evaluation metrics that can be used as alternatives to CER to prevent such a problem. Let E_P[·] and E_N[·] be the expectations of x over p(x|y = +1) and p(x|y = −1), respectively. Then, the BER risk is defined as follows:

R^{0-1}_BER(g) = (1/2) ( E_P[ℓ_{0-1}(g(x))] + E_N[ℓ_{0-1}(−g(x))] ).

It is insightful to note that CER can also be expressed as

R^{0-1}_CER(g) = p(y = +1) E_P[ℓ_{0-1}(g(x))] + (1 − p(y = +1)) E_N[ℓ_{0-1}(−g(x))],

where p(y = +1) is the class prior given by p(y = +1) = ∫ p(x, y = +1) dx. We can see that the BER minimization problem is equivalent to the CER minimization problem if the class prior is balanced, i.e., p(y = +1) = 1/2.
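To make the contrast with CER concrete, the following sketch (toy data and a hypothetical always-positive predictor) computes the empirical BER of the trivial predictor g_pos; it is 1/2 no matter how imbalanced the classes are:

```python
import numpy as np

def zero_one(z):
    return (1.0 - np.sign(z)) / 2.0

def empirical_ber(g, X_pos, X_neg):
    # BER = (1/2) * (false negative rate + false positive rate)
    fnr = np.mean(zero_one(g(X_pos)))    # errors on positive examples
    fpr = np.mean(zero_one(-g(X_neg)))   # errors on negative examples
    return 0.5 * (fnr + fpr)

# an always-positive predictor: perfect on positives, wrong on every negative
g_pos = lambda X: np.ones(len(X))
X_pos = np.random.randn(99, 2) + 2.0   # 99% of the test data is positive
X_neg = np.random.randn(1, 2) - 2.0    # only 1% is negative
print(empirical_ber(g_pos, X_pos, X_neg))  # 0.5, regardless of imbalance
```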
Furthermore, unlike R^{0-1}_CER, any trivial prediction function that predicts only one label cannot have an R^{0-1}_BER lower than 1/2, regardless of the class prior. As a result, the prediction function g has an incentive to predict both classes to obtain a low balanced error risk R^{0-1}_BER. Therefore, BER is known to be useful for evaluating the prediction function g under class imbalance [Cheng et al., 2002; Guyon et al., 2005; Brodersen et al., 2010]. In addition, it is also worth noting that BER can be interpreted as the arithmetic mean of the false positive and false negative rates [Menon et al., 2013]. Similarly to the CER minimization problem, we can minimize the empirical surrogate BER risk using training data and a classification-calibrated loss as follows:

R̂^ℓ_BER(g) = (1/2) ( (1/n_P) Σ_{x ∈ X_P} ℓ(g(x)) + (1/n_N) Σ_{x ∈ X_N} ℓ(−g(x)) ),

where n_P and n_N are the numbers of positive and negative examples, respectively. In binary classification, a receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate at various decision thresholds. The area under the ROC curve is called the AUC score, which can be used to evaluate the performance of a prediction function over all possible decision thresholds on average. The AUC score can also be interpreted as the probability that the prediction function outputs a higher score for a random positive example than for a random negative example [Fawcett, 2006]. Let us consider the following AUC risk:

R^{0-1}_AUC(g) = E_P[ E_N[ ℓ_{0-1}(g(x_P) − g(x_N)) ] ],

which is the complement of the AUC score, since the expected AUC score is 1 − R^{0-1}_AUC(g). Therefore, maximizing the AUC score is equivalent to minimizing the AUC risk. Intuitively, a high AUC score indicates that g outputs higher values for positive examples than for negative examples on average. Unlike CER and BER, where the function sign(g(x)) is crucial for the evaluation, in AUC the sign function is evaluated on the difference between the outputs of g on positive and negative data.
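The empirical AUC score itself can be computed in its Wilcoxon-Mann-Whitney form, as the fraction of correctly ranked (positive, negative) score pairs; the scores below are made up for illustration:

```python
import numpy as np

def empirical_auc(scores_pos, scores_neg):
    # Wilcoxon-Mann-Whitney form of the AUC: the fraction of
    # (positive, negative) pairs ranked correctly (ties count as 1/2)
    diff = scores_pos[:, None] - scores_neg[None, :]
    return np.mean((diff > 0) + 0.5 * (diff == 0))

s_pos = np.array([0.9, 0.8, 0.3])  # scores g(x) on positive examples
s_neg = np.array([0.7, 0.2])       # scores g(x) on negative examples
print(empirical_auc(s_pos, s_neg))  # 5/6: one of the six pairs is mis-ranked
```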
As a result, an evaluation based on AUC is highly related to the bipartite ranking problem [Menon and Williamson, 2016], where the goal is to find a function g that ranks positive examples over negative examples. It is also worth noting that AUC is highly related to the Wilcoxon-Mann-Whitney statistic [Mann and Whitney, 1947; Hanley and McNeil, 1982]. Similarly to BER, AUC is known to be a useful evaluation metric under class imbalance [Cheng et al., 2002; Guyon et al., 2005]. Given training data, the empirical surrogate AUC risk can be defined as follows:

R̂^ℓ_AUC(g) = (1/(n_P n_N)) Σ_{x_P ∈ X_P} Σ_{x_N ∈ X_N} ℓ(g(x_P) − g(x_N)).

However, unlike the CER and BER minimization problems, the loss requirement for AUC optimization is AUC-consistency, which guarantees that the optimal solution of a surrogate AUC risk is also optimal for the AUC risk [Gao and Zhou, 2015; Menon and Williamson, 2016]. Note that this condition is not equivalent to classification-calibration. For example, the hinge loss is known to be classification-calibrated but not AUC-consistent [Gao and Zhou, 2015; Uematsu and Lee, 2017]. The term symmetric loss in this article refers to a loss function ℓ_sym : R → R such that ℓ_sym(z) + ℓ_sym(−z) = K, where K is a constant. It is known that if a loss function is symmetric and non-negative, it must be non-convex [du Plessis et al., 2014]. Figure 1 illustrates three symmetric losses: the zero-one loss, the sigmoid loss, and the ramp loss. It is worth noting that both the sigmoid loss and the ramp loss are classification-calibrated and AUC-consistent [Charoenphakdee et al., 2019b]. The standard formulation of binary classification in Section 2.1 does not take into account that training labels can be corrupted.
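The symmetric condition is easy to check numerically. The sketch below (illustrative, using one common parameterization of the ramp loss) verifies that the sigmoid and ramp losses satisfy ℓ(z) + ℓ(−z) = K with K = 1, while the hinge loss does not:

```python
import numpy as np

def sigmoid_loss(z):
    return 1.0 / (1.0 + np.exp(z))

def ramp_loss(z):
    # one common parameterization: (1/2) * max(0, min(2, 1 - z))
    return 0.5 * np.clip(1.0 - z, 0.0, 2.0)

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)

z = np.linspace(-5, 5, 101)
# symmetric losses: loss(z) + loss(-z) is a constant K for every z
print(np.allclose(sigmoid_loss(z) + sigmoid_loss(-z), 1.0))  # True (K = 1)
print(np.allclose(ramp_loss(z) + ramp_loss(-z), 1.0))        # True (K = 1)
# the hinge loss is not symmetric: hinge(z) + hinge(-z) depends on z
print(np.allclose(hinge_loss(z) + hinge_loss(-z), 2.0))      # False
```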
To extend the standard formulation to support such a situation, learning from corrupted labels under mutually contaminated noise assumes that the training data are given as follows [Menon et al., 2015; Lu et al., 2019; Charoenphakdee et al., 2019b]:

X̄_P := {x̄^P_i}_{i=1}^{n_P} ∼ p_{π_P}(x) := π_P p(x|y = +1) + (1 − π_P) p(x|y = −1),
X̄_N := {x̄^N_i}_{i=1}^{n_N} ∼ p_{π_N}(x) := π_N p(x|y = +1) + (1 − π_N) p(x|y = −1),

where 0 < π_N < π_P < 1. Concretely, the formulation assumes that corrupted positive examples X̄_P are drawn from a distribution whose density is a mixture of the class-conditional positive density p(x|y = +1) and the class-conditional negative density p(x|y = −1), where π_P controls the mixture proportion between the two densities. Corrupted negative examples X̄_N are assumed to be drawn similarly but with a different mixture proportion π_N. We can interpret this data generating process as follows. The given training data are clean if π_P = 1 and π_N = 0, i.e., the data of each class are drawn from the class-conditional distribution of that class. On the other hand, the training data can be highly noisy if π_P − π_N is small, i.e., the corrupted positive data and corrupted negative data have similar distributions and are therefore difficult to distinguish. Note that in this framework, it is reasonable to assume π_P > π_N because the corrupted positive distribution should still carry more information about positive data than the corrupted negative distribution. In learning from corrupted labels, it has been shown that CER minimization based on empirical risk minimization is possible if the knowledge of π_P, π_N, and the class prior of the test distribution is given [Lu et al., 2019]. On the other hand, the problem of estimating π_P and π_N from corrupted labeled data is known to be unidentifiable unless a restrictive condition is applied [Blanchard et al., 2010; Menon et al., 2015; Scott, 2015].
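A small simulation of this data generating process may help build intuition; the Gaussian class-conditionals and the mixture proportions below are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mc(n, pi, rng):
    # draw n points from pi * p(x|y=+1) + (1 - pi) * p(x|y=-1),
    # with toy Gaussian class-conditionals N(+2, 1) and N(-2, 1)
    from_pos = rng.random(n) < pi
    return np.where(from_pos, rng.normal(2.0, 1.0, n), rng.normal(-2.0, 1.0, n))

pi_P, pi_N = 0.8, 0.3                     # hypothetical proportions, pi_P > pi_N
X_P_corrupt = sample_mc(1000, pi_P, rng)  # corrupted "positive" sample
X_N_corrupt = sample_mc(1000, pi_N, rng)  # corrupted "negative" sample
# the corrupted positive set still leans toward the positive class on average
print(X_P_corrupt.mean() > X_N_corrupt.mean())  # True
```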
Furthermore, the class prior of the test distribution has no relationship with π_P and π_N, and it has to be specified if the goal is to minimize the misclassification risk [Lu et al., 2019]. For these reasons, CER minimization can be infeasible if one has access only to corrupted labeled examples without any additional information. As a result, it is important to explore other evaluation metrics in learning from corrupted labels that can be optimized without requiring the knowledge of π_P and π_N. In Section 3, we will show that BER and AUC can be effectively optimized in learning from corrupted labels without having to estimate π_P and π_N. In Table 2, we show related work on learning from clean and corrupted labels with different evaluation metrics that could be of interest to readers. In this section, we begin by describing related work on BER/AUC optimization from corrupted labels and then show that using a symmetric loss can be advantageous for BER and AUC optimization from corrupted labels. In learning from corrupted labels, Menon et al. [2015] proved that both BER and AUC are robust w.r.t. the zero-one loss. More precisely, without using a surrogate loss, the minimizer of the BER/AUC risk w.r.t. the clean distribution and that of the BER/AUC risk w.r.t. the corrupted distribution are identical. In their experiments, they used the squared loss as the surrogate loss, and a comparison between the squared loss and other surrogate losses was not conducted. Next, van Rooyen et al. [2015b] generalized the theoretical result of Menon et al. [2015] for BER minimization from corrupted labels from the zero-one loss to any symmetric loss. Then, Charoenphakdee et al. [2019b] proved the relationship between the clean and corrupted BER/AUC risks for a general loss function, which elucidates that using a symmetric loss can be advantageous for both BER and AUC optimization from corrupted labels. Furthermore, Charoenphakdee et al.
[2019b] also conducted extensive experiments to verify that using symmetric losses can perform significantly better than using non-symmetric losses. To verify the robustness of BER minimization from corrupted labels, we investigate the relationship between the clean risk and the corrupted risk for any surrogate loss ℓ. First, let us define the following surrogate risk for BER from corrupted labels:

R̄^ℓ_BER(g) = (1/2) ( Ē_P[ℓ(g(x))] + Ē_N[ℓ(−g(x))] ),

where Ē_P and Ē_N denote the expectations over p_{π_P} and p_{π_N}, respectively. Since we have samples X̄_P and X̄_N following p_{π_P} and p_{π_N}, respectively (see Section 2.4), the following empirical risk can be minimized in practice:

R̄̂^ℓ_BER(g) = (1/2) ( (1/n_P) Σ_{x ∈ X̄_P} ℓ(g(x)) + (1/n_N) Σ_{x ∈ X̄_N} ℓ(−g(x)) ).   (8)

Note that Eq. (8) can be minimized without the knowledge of π_P and π_N. Next, the following equation illustrates the relationship between the clean BER risk R^ℓ_BER and the corrupted BER risk R̄^ℓ_BER for any loss function ℓ [Charoenphakdee et al., 2019b]:

R̄^ℓ_BER(g) = (π_P − π_N) R^ℓ_BER(g) + (1/2) ( π_N E_P[γ_ℓ(g(x))] + (1 − π_P) E_N[γ_ℓ(g(x))] ),   (9)

where γ_ℓ(z) = ℓ(z) + ℓ(−z), and the last two terms are excessive terms unrelated to our goal. From Eq. (9), we can see that a g that minimizes R̄^ℓ_BER should also perform reasonably well for R^ℓ_BER for any loss function. However, a prediction function g that minimizes R̄^ℓ_BER also has to take the excessive terms into account. As a result, the minimizer of R̄^ℓ_BER is not guaranteed to be the minimizer of R^ℓ_BER because of the non-constant excessive terms. Next, let us consider a symmetric loss ℓ_sym such that ℓ_sym(z) + ℓ_sym(−z) = K, where K is a constant regardless of z. With a symmetric loss, we can rewrite Eq. (9) as

R̄^sym_BER(g) = (π_P − π_N) R^sym_BER(g) + (1 − π_P + π_N) K / 2.

We can see that if a loss is symmetric, then the excessive terms become a constant and the minimizers of R̄^sym_BER and R^sym_BER must be identical. This suggests that g can safely ignore the excessive terms when using a symmetric loss. As a result, BER minimization from corrupted labels can be done effectively without the knowledge of π_P and π_N by minimizing Eq. (8) with a symmetric loss.
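The relationship between the clean and corrupted BER risks can be verified numerically. The sketch below (toy Gaussian class-conditionals, an arbitrary fixed g, and the sigmoid loss as the symmetric loss with K = 1) checks the identity R̄_BER(g) = (π_P − π_N) R_BER(g) + (1 − π_P + π_N)K/2:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid_loss(z):
    return 1.0 / (1.0 + np.exp(z))  # symmetric with K = 1

X_pos = rng.normal(2.0, 1.0, 5000)   # sample from p(x|y=+1) (toy Gaussian)
X_neg = rng.normal(-2.0, 1.0, 5000)  # sample from p(x|y=-1)
g = lambda x: 0.7 * x                # an arbitrary fixed prediction function

def ber(e_pos, e_neg):
    # BER risk given the two class-conditional expectations of the loss
    return 0.5 * (e_pos + e_neg)

# clean class-conditional expectations
ep = sigmoid_loss(g(X_pos)).mean()   # E_P[loss(g(x))]
en = sigmoid_loss(-g(X_neg)).mean()  # E_N[loss(-g(x))]
# corrupted expectations under mutually contaminated noise (pi_P, pi_N)
pi_P, pi_N, K = 0.8, 0.3, 1.0
ep_c = pi_P * ep + (1 - pi_P) * sigmoid_loss(g(X_neg)).mean()
en_c = pi_N * sigmoid_loss(-g(X_pos)).mean() + (1 - pi_N) * en

clean, corrupted = ber(ep, en), ber(ep_c, en_c)
# corrupted risk = (pi_P - pi_N) * clean risk + (1 - pi_P + pi_N) * K / 2
print(np.isclose(corrupted, (pi_P - pi_N) * clean + (1 - pi_P + pi_N) * K / 2))  # True
```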
Let us consider a corrupted AUC risk with a surrogate loss ℓ that treats X̄_P as being positive and X̄_N as being negative:

R̄^ℓ_AUC(g) = Ē_P[ Ē_N[ ℓ(g(x̄_P) − g(x̄_N)) ] ],

which can be empirically approximated using training data as

R̄̂^ℓ_AUC(g) = (1/(n_P n_N)) Σ_{x̄_P ∈ X̄_P} Σ_{x̄_N ∈ X̄_N} ℓ(g(x̄_P) − g(x̄_N)).

Charoenphakdee et al. [2019b] showed that the relationship between R^ℓ_AUC(g) and R̄^ℓ_AUC(g) can be expressed as follows:

R̄^ℓ_AUC(g) = (π_P − π_N) R^ℓ_AUC(g)
  + π_N (1 − π_P) E_P[E_N[γ_ℓ(g(x_P), g(x_N))]]                          (excessive term)
  + (π_P π_N / 2) E_P[E_{P'}[γ_ℓ(g(x_P), g(x_{P'}))]]
  + ((1 − π_P)(1 − π_N) / 2) E_N[E_{N'}[γ_ℓ(g(x_N), g(x_{N'}))]],        (excessive terms)

where γ_ℓ(z, z') = ℓ(z − z') + ℓ(z' − z). Next, with a symmetric loss ℓ_sym, we have

R̄^sym_AUC(g) = (π_P − π_N) R^sym_AUC(g) + (1 − π_P + π_N) K / 2.

Similarly to BER minimization from corrupted labels, this result suggests that the excessive terms become a constant when using a symmetric loss, and it is guaranteed that the minimizers of R̄^sym_AUC(g) and R^sym_AUC(g) are identical. On the other hand, if a loss is non-symmetric, then we may suffer from the excessive terms and the minimizers of the two risks may differ. We can see that in both BER and AUC optimization from corrupted labels, by using a symmetric loss, the knowledge of π_P and π_N is not required and we can treat the corrupted labels as if they were clean. We refer readers to Charoenphakdee et al. [2019b] for more details on the experimental results, where symmetric losses are shown to be preferable over non-symmetric losses. Here, we give two examples where the formulation of BER and AUC optimization from corrupted labels can be useful for weakly-supervised learning: (i) learning from positive and unlabeled data and (ii) learning from two sets of unlabeled data. First, let us consider the problem of binary classification from positive and unlabeled data. We consider the case-control setting, where the training data are given as follows [Ward et al., 2009; du Plessis et al., 2014, 2015; Niu et al., 2016; Kiryo et al., 2017; Xu et al., 2019]:

X_P := {x^P_i}_{i=1}^{n_P} ∼ p(x|y = +1),
X_U := {x^U_i}_{i=1}^{n_U} ∼ π_U p(x|y = +1) + (1 − π_U) p(x|y = −1).

With π_P = 1 and π_N = π_U, we can relate the training data of learning from positive and unlabeled data to learning from corrupted labels. In this setting, Sakai et al.
[2018] showed that for AUC maximization, a convex surrogate loss can be applied, but the class prior π_U needs to be estimated to construct an unbiased risk estimator. By using a symmetric loss, we can safely perform both BER and AUC optimization without estimating the class prior π_U, with a theoretical guarantee. Concretely, with a symmetric loss ℓ_sym, BER minimization from positive and unlabeled data can be done effectively by minimizing the following empirical risk:

R̂^sym_BER-PU(g) = (1/2) ( (1/n_P) Σ_{x ∈ X_P} ℓ_sym(g(x)) + (1/n_U) Σ_{x ∈ X_U} ℓ_sym(−g(x)) ),

and AUC maximization can be done effectively by minimizing the following empirical risk:

R̂^sym_AUC-PU(g) = (1/(n_P n_U)) Σ_{x ∈ X_P} Σ_{x' ∈ X_U} ℓ_sym(g(x) − g(x')).

Next, let us consider the problem of binary classification from two sets of unlabeled data, where the training data are given as follows [Lu et al., 2019]:

X_U := {x^U_i}_{i=1}^{n_U} ∼ π_U p(x|y = +1) + (1 − π_U) p(x|y = −1),
X_U' := {x^U'_i}_{i=1}^{n_U'} ∼ π_U' p(x|y = +1) + (1 − π_U') p(x|y = −1),

where π_U > π_U'. We can relate the training data of this problem to learning from corrupted labels by setting π_P = π_U and π_N = π_U'. Therefore, BER and AUC optimization from two sets of unlabeled data with different class priors can also be carried out effectively with a symmetric loss without knowing the class priors π_U and π_U'. It is interesting to see that although the data collection procedures of learning from corrupted labels and learning from two sets of unlabeled data are very different in practice, the assumptions on the data generating process can be highly related. Concretely, with a symmetric loss ℓ_sym, BER minimization from two sets of unlabeled data can be done effectively by minimizing the following empirical risk:

R̂^sym_BER-UU(g) = (1/2) ( (1/n_U) Σ_{x ∈ X_U} ℓ_sym(g(x)) + (1/n_U') Σ_{x ∈ X_U'} ℓ_sym(−g(x)) ),

and AUC maximization can be done effectively by minimizing the following empirical risk:

R̂^sym_AUC-UU(g) = (1/(n_U n_U')) Σ_{x ∈ X_U} Σ_{x' ∈ X_U'} ℓ_sym(g(x) − g(x')).

In this section, we demonstrate how to apply the robustness result of symmetric losses to tackle a weakly-supervised natural language processing task, namely learning only from relevant keywords and unlabeled documents [Charoenphakdee et al., 2019a].
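As a small sketch of AUC optimization from positive and unlabeled data with a symmetric loss (toy Gaussian data; the linear scorer and the mixture proportion are illustrative assumptions), the empirical PU-AUC risk below is evaluated without ever using π_U, and a direction that ranks positives higher attains a lower risk:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid_loss(z):
    return 1.0 / (1.0 + np.exp(z))  # symmetric loss, K = 1

# toy PU data: positives and unlabeled points (a mixture with prior pi_U)
X_P = rng.normal(2.0, 1.0, (200, 1))
pi_U = 0.4  # used only to generate the data; the learner never sees it
mix = rng.random(500) < pi_U
X_U = np.where(mix[:, None], rng.normal(2.0, 1.0, (500, 1)),
               rng.normal(-2.0, 1.0, (500, 1)))

def auc_risk_pu(w, X_P, X_U):
    # empirical PU-AUC surrogate risk:
    # mean over all (positive, unlabeled) pairs of loss(g(x_P) - g(x_U))
    s_p, s_u = X_P @ w, X_U @ w
    return sigmoid_loss(s_p[:, None] - s_u[None, :]).mean()

# a direction that scores positives above unlabeled points on average
# attains a lower pairwise risk than the flipped direction
print(auc_risk_pu(np.array([1.0]), X_P, X_U)
      < auc_risk_pu(np.array([-1.0]), X_P, X_U))  # True
```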
To reduce labeling costs, weakly-supervised text classification has been studied extensively in various settings, e.g., positive and unlabeled text classification [Li and Liu, 2003, 2005], zero-shot text classification [Zhang et al., 2019], cross-lingual text classification [Dong and de Melo, 2019], and dataless classification [Chang et al., 2008; Song and Roth, 2014; Chen et al., 2015; Jin et al., 2017, 2020; Li and Yang, 2018]. Our target problem can be categorized as a variant of dataless classification, where we are given a set of n_W relevant keywords, which are keywords that provide characteristics of positive documents. Also, unlabeled documents drawn from the following distribution are provided:

X_U := {x_i}_{i=1}^{n_U} ∼ π_U p(x|y = +1) + (1 − π_U) p(x|y = −1),

where π_U ∈ (0, 1). Note that unlike ordinary dataless classification, where we need keywords for every class [Chang et al., 2008; Song and Roth, 2014; Chen et al., 2015; Li and Yang, 2018], in this problem only keywords for the positive class are provided. Therefore, this problem setting can be more practical in situations where negative documents are too diverse to collect representative keywords for the negative class [Hsieh et al., 2019]. It is worth noting that our problem is also called lightly-supervised learning [Jin et al., 2017], where the supervision comes from the relevant keywords. To solve this problem, Jin et al. [2017] proposed a method based on ensemble learning.

Figure 2: An overview of the framework for learning only from relevant keywords and unlabeled documents [Charoenphakdee et al., 2019a]. Blue documents indicate positive documents and red documents denote negative documents in the two sets of documents divided by a pseudo-labeling algorithm. Note that clean labels are not observed by the framework.

The bottleneck of the method proposed by Jin et al. [2017] is its lack of flexibility in model choices and optimization algorithms. This makes it difficult to bring many
useful models and techniques from more well-studied problems, such as supervised text classification, to help solve this problem. Moreover, the theoretical understanding of this problem was limited. To alleviate these limitations, Charoenphakdee et al. [2019a] proposed a theoretically justifiable framework that allows practitioners to choose their preferred models to maximize the performance, e.g., convolutional neural networks [Zhang et al., 2015] or recurrent neural networks [Lai et al., 2015]. Moreover, this framework does not limit the choice of optimization methods; one may use any optimization algorithm for their model, e.g., Adam [Kingma and Ba, 2014]. In learning only from relevant keywords and unlabeled documents, the choice of evaluation metric depends on the desired behavior of the prediction function g we want to learn. For example, AUC is appropriate if the goal is simply to learn a bipartite ranking function that ranks a positive document over a negative document. On the other hand, if the goal is document classification, one may use CER or the F1-measure, i.e., the harmonic mean of precision and recall, which has been widely used in text classification [Li and Liu, 2005; Jin et al., 2017, 2020; Lertvittayakumjorn and Toni, 2019; Lertvittayakumjorn et al., 2020; Mekala and Shang, 2020; He et al., 2020]. Here, we discuss the flexible framework for learning only from relevant keywords and unlabeled documents proposed by Charoenphakdee et al. [2019a]. Figure 2 illustrates an overview of the framework. First, pseudo-labeling is carried out to split the unlabeled documents into two sets. Next, using the pseudo-labeled documents, AUC maximization is conducted, where the surrogate loss is chosen to be a symmetric loss. Finally, after obtaining a bipartite ranking function by AUC maximization, a threshold selection procedure is performed to convert the ranking function into a binary classifier.
The first step is to utilize the relevant keywords to perform pseudo-labeling on the unlabeled documents. Concretely, given relevant keywords W and unlabeled documents X_U, the pseudo-labeling algorithm A(W, X_U) splits X_U into X̄_P and X̄_N. The key idea of this step is to use pseudo-labeling to bridge learning only from relevant keywords and unlabeled documents to learning from corrupted labels. More precisely, we assume that the pseudo-labeling algorithm A(W, X_U) returns data X̄_P ∼ p_{π_P}(x) and X̄_N ∼ p_{π_N}(x), where the assumption on the data generating process is identical to that of the setting of learning from corrupted labels (see Section 2.4). It is important to note that the pseudo-labeling algorithm employed here is not expected to perfectly split the documents into clean positive and negative documents. For the choice of the pseudo-labeling algorithm, Charoenphakdee et al. [2019a] simply used a cosine similarity score between keywords and documents and compared the score with a pre-defined threshold to split the unlabeled documents into two sets. To further improve the pseudo-labeling accuracy, one may utilize domain-specific knowledge or a keyword mining method to collect more relevant keywords. Examples of such keyword mining methods are TextRank [Mihalcea and Tarau, 2004] and TopicRank [Bougouin et al., 2013]. Moreover, one may also incorporate an unsupervised learning method [Ko and Seo, 2000, 2009] or apply a pre-trained model such as BERT [Devlin et al., 2019]. After pseudo-labeling, we can conduct AUC maximization using a symmetric loss to learn a good ranking function g from the pseudo-labeled documents. Recall that, with any symmetric loss, the AUC risk minimizers of the corrupted risk and the clean risk are identical, as suggested by the following equation:

R̄^sym_AUC(g) = (π_P − π_N) R^sym_AUC(g) + (1 − π_P + π_N) K / 2.   (11)

Eq.
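A minimal sketch of cosine-similarity pseudo-labeling in the spirit of this step (the bag-of-words vectors, the vocabulary, and the threshold are all hypothetical):

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def pseudo_label(keyword_vec, doc_vecs, threshold=0.1):
    # split unlabeled documents into pseudo-positive / pseudo-negative sets
    # by comparing each document's similarity to the keyword vector
    # against a pre-defined threshold (one simple choice among many)
    scores = np.array([cosine(keyword_vec, d) for d in doc_vecs])
    return doc_vecs[scores > threshold], doc_vecs[scores <= threshold]

# toy bag-of-words vectors over a 4-word vocabulary, e.g.
# ["baseball", "hockey", "stock", "market"]; keywords cover the first two
keywords = np.array([1.0, 1.0, 0.0, 0.0])
docs = np.array([
    [3.0, 1.0, 0.0, 0.0],   # sports-like document
    [0.0, 0.0, 2.0, 3.0],   # finance-like document
    [1.0, 0.0, 1.0, 0.0],   # mixed document
])
pseudo_pos, pseudo_neg = pseudo_label(keywords, docs)
print(len(pseudo_pos), len(pseudo_neg))  # 2 1
```

Note that the pseudo-positive set here contains the mixed document: the split is noisy, which is exactly the situation the corrupted-labels formulation is designed to handle.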
(11) indicates that as long as the pseudo-labeling algorithm succeeds in splitting the documents into two sets such that π_P > π_N, we can always guarantee that g can be effectively learned from the pseudo-labeled documents. More precisely, the minimizers of the risk w.r.t. the pseudo-labeled document distribution and the clean document distribution are identical. However, since any pseudo-labeling algorithm that gives π_P > π_N is guaranteed to allow learning a good ranking function, one important question is: how does the quality of the pseudo-labeling method impact the performance of the trained prediction function g in this framework? Intuitively, a good pseudo-labeling algorithm should give a high proportion of positive documents in the pseudo-positive set and a high proportion of negative documents in the pseudo-negative set. Mathematically, a good algorithm should return two sets of documents with a large π_P and a small π_N, that is, a large π_P − π_N. To elucidate the usefulness of a good pseudo-labeling algorithm, it is insightful to analyze an estimation error bound of AUC maximization from corrupted labels. Let ĝ ∈ G be a minimizer of the empirical corrupted AUC risk in the hypothesis class G and g* ∈ G be a minimizer of the clean AUC risk R^sym_AUC. Then, the following theorem can be obtained.

Theorem 1 (Estimation error bound [Charoenphakdee et al., 2019a]). Let Q^sym_G be a class of functions mapping X^2 to [0, K] induced by G and ℓ_sym, and consider the bipartite Rademacher complexity of Q^sym_G (see Usunier et al. [2005] for more details). For any δ ∈ (0, 1), with probability at least 1 − δ, the estimation error R^sym_AUC(ĝ) − R^sym_AUC(g*) is upper-bounded by the sum of a bipartite Rademacher complexity term and confidence terms decreasing in n_P and n_N, all scaled by the factor 1/(π_P − π_N), where the probability is over repeated sampling of X̄_P and X̄_N.

This theorem explains how the degree of corruption π_P − π_N affects the tightness of the bound and therefore the speed of convergence. When π_P − π_N is small, i.e., π_P and π_N are similar, the bound becomes loose.
Such a loose bound illustrates the difficulty of AUC maximization when the pseudo-labeling algorithm performs poorly: we may need a lot of data. On the other hand, a good pseudo-labeling algorithm that gives a large π_P − π_N yields a smaller constant 1/(π_P − π_N), which can lead to a tighter bound. Nevertheless, it is noteworthy that as long as π_P > π_N, for any parametric model with a bounded norm, such as neural networks with weight decay or kernel models, this learning framework is statistically consistent, i.e., the estimation error converges to zero as n_P, n_N → ∞. After obtaining a good ranking function g, an important question is how to convert the ranking function into a binary classifier. Here, we discuss how to decide the classification threshold for optimizing evaluation metrics such as the F1-measure. It is known that many evaluation metrics can be optimized if a suitable threshold and p(y = +1|x) are known [Yan et al., 2018]. For example, sign[p(y = +1|x) − 1/2] is the Bayes-optimal solution for the classification accuracy, where 1/2 is the threshold. Moreover, it has been proven that the Bayes-optimal solution of AUC maximization with an AUC-consistent surrogate loss is any function that has a strictly monotonic relationship with p(y = +1|x) [Clémençon and Vayatis, 2009; Gao and Zhou, 2015; Menon and Williamson, 2016]. Therefore, finding an appropriate threshold to convert a bipartite ranking function into a binary classifier can give a reasonable classifier. In learning from relevant keywords and unlabeled documents, no information about a proper threshold can be obtained from the training data since all given data are unlabeled. For this reason, we may not be able to derive an optimal threshold for optimizing the accuracy and F1-measure without additional assumptions. On the other hand, as shown in Section 3.3, a ranking function can be learned reliably with a theoretical guarantee based on AUC maximization. This is the main reason why Charoenphakdee et al.
[2019a] proposed to first learn a reliable ranking function instead of directly learning a binary classifier in this problem. Suppose the class prior p(y = +1) of the unlabeled documents is known. Then, a reasonable threshold β ∈ R can be chosen based on the following equation:

p(g(x) > β) = p(y = +1).    (12)

Intuitively, the threshold β makes g classify the top p(y = +1) proportion of unlabeled documents as positive and the rest as negative. With unlabeled documents and the known proportion of positive documents, one can decide β so that it satisfies the empirical version of Eq. (12). Concretely, given unlabeled documents X_val for validation, the threshold can be decided by finding β such that the fraction of documents in X_val whose score g(x) exceeds β equals p(y = +1). This threshold is known as a precision-recall breakeven point: the point at which the precision is equal to the recall (see Kato et al. [2019] for the proof). Therefore, this choice is arguably a reasonable threshold for the F1-measure, since this evaluation metric is the harmonic mean of precision and recall. In practice, it may not be possible to know the proportion p(y = +1), yet we still want a classifier. Without knowing the proportion of positive documents, it is likely that we learn a wrong threshold, which leads to low performance. For example, as shown in Table 3, the performance degrades dramatically with a wrong choice of the threshold. More details on a heuristic for threshold selection and the performance w.r.t. different thresholds can be found in Charoenphakdee et al. [2019a].

Table 3 reports results on the dataset of Pang and Lee [2004] and the 20 Newsgroups dataset (20NG) [Lang, 1995] with the baseball and hockey groups as positive. ACC denotes the classification accuracy and F1 denotes the F1-measure. "Sigmoid" is the framework using the sigmoid loss [Charoenphakdee et al., 2019a], which employs a recurrent convolutional neural network (RCNN) [Lai et al., 2015] with a two-layer long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997].
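Returning to the thresholding step above, the empirical breakeven condition can be sketched as follows (our illustration, with hypothetical scores; it assumes the class prior is given): β is the (1 − prior)-quantile of the validation scores, so that exactly a fraction p(y = +1) of the validation documents is classified as positive.

```python
import numpy as np

def breakeven_threshold(scores_val, class_prior):
    """Choose beta so that a fraction `class_prior` of the validation
    documents scores above beta, i.e., the empirical version of the
    breakeven condition: beta is the (1 - prior)-quantile of the scores."""
    return float(np.quantile(scores_val, 1.0 - class_prior))

# Hypothetical ranking scores g(x) on unlabeled validation documents.
scores = np.array([0.05, 0.1, 0.3, 0.4, 0.7, 0.8, 0.9, 0.95])
beta = breakeven_threshold(scores, class_prior=0.5)
labels = np.where(scores > beta, 1, -1)  # top-ranked half -> positive
```

If the supplied prior is wrong, the threshold shifts and precision and recall move in opposite directions, which is why threshold selection matters so much in this setting.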
The "heuristic threshold" is the ratio of pseudo-positive documents to unlabeled documents and the "default threshold" is a baseline choice (see Charoenphakdee et al. [2019a] for details). It can be seen that, given the same ranking function, the classification performance can drastically change depending on the choice of the threshold. It is also worth noting that if we cannot guarantee that π_U ≈ p(y = +1), i.e., that the proportion of positive documents in the unlabeled training documents is similar to that of the test data, then using a metric such as the F1-measure or the classification accuracy is highly biased toward the training distribution [Scott et al., 2012]. This problem is known as the class prior shift phenomenon and it can occur in real-world applications [Saerens et al., 2002; Tasche, 2017]. For example, when collecting unlabeled documents from the internet, the proportion of positive documents can be different from that of the test distribution where we want to deploy the trained classifier. Note that the class prior shift phenomenon can dramatically degrade the performance of a classifier [Saerens et al., 2002]. Therefore, if it is impossible to estimate the class prior p(y = +1), or the test environment is susceptible to the class prior shift phenomenon, we suggest using other evaluation metrics such as the BER or the AUC score. In this article, we have reviewed recent advances in reliable machine learning from a symmetric loss perspective. We showed in Section 3 that if a symmetric loss is used for BER and AUC optimization from corrupted labels, the corrupted and clean risks have the same minimizer regardless of the model. Furthermore, we demonstrated in Section 4 that the theoretical advantage of symmetric losses is also practically valuable in learning only from relevant keywords and unlabeled documents. In this section, we conclude this article by discussing two future directions for the symmetric loss perspective of reliable machine learning.
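As a small numerical illustration of the point on class prior shift (synthetic Gaussian scores, our example): the AUC averages over positive-negative score pairs and is therefore insensitive to the class prior, whereas accuracy at a fixed threshold shifts with the prior.

```python
import numpy as np

rng = np.random.default_rng(0)

def auc(scores_pos, scores_neg):
    # Empirical AUC: fraction of correctly ordered (positive, negative) pairs.
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def accuracy(scores_pos, scores_neg, thresh):
    correct = (scores_pos > thresh).sum() + (scores_neg <= thresh).sum()
    return float(correct) / (len(scores_pos) + len(scores_neg))

pos = rng.normal(1.0, 1.0, 2000)    # scores of positive documents
neg = rng.normal(-1.0, 1.0, 2000)   # scores of negative documents

# Simulate a class prior shift by subsampling the negatives 10:1:
# the AUC is essentially unchanged, the accuracy is not.
auc_balanced, auc_shifted = auc(pos, neg), auc(pos, neg[:200])
acc_balanced = accuracy(pos, neg, thresh=0.5)
acc_shifted = accuracy(pos, neg[:200], thresh=0.5)
```

This is one concrete reason to prefer prior-insensitive metrics such as the AUC when the class prior of the deployment environment is unknown.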
The first direction is exploring more applications of symmetric losses for reliable machine learning. Here, we provide two examples of this direction. First, it has been recently shown that using a symmetric loss can also be beneficial in imitation learning from noisy demonstrations, where the goal is to teach an agent to imitate expert demonstrations although the training data contain both expert and non-expert demonstrations. Tangkaratt et al. [2020a] showed that imitation learning with a symmetric loss can enable an agent to successfully imitate expert demonstrations without a strong noise assumption such as Gaussian noise [Tangkaratt et al., 2020b] and without requiring additional confidence scores for the given demonstrations [Wu et al., 2019; Brown et al., 2019, 2020]. Another example is the use of a symmetric loss in classification with rejection, where a classifier is allowed to refrain from making a prediction if the prediction is uncertain [Chow, 1970; Yuan and Wegkamp, 2010; Cortes et al., 2016; Ni et al., 2019; Mozannar and Sontag, 2020]. Although well-known symmetric losses have favorable properties such as classification-calibration and AUC-consistency, it is important to note that learning with a symmetric loss is not guaranteed to give a classifier with reliable prediction confidence [Charoenphakdee et al., 2019b]. Recently, an approach based on cost-sensitive classification was proposed that enables any classification-calibrated loss, including symmetric losses, to be applied to classification with rejection. In the experiments, the sigmoid loss was shown to be highly preferable in classification with rejection from noisy labels and in classification with rejection from positive and unlabeled data. These examples emphasize the potential of symmetric losses for reliable machine learning in addition to what we have introduced in Sections 3 and 4.
Therefore, it could be useful to explore the use of symmetric losses for a wider range of problems, e.g., domain adaptation [Sugiyama and Kawanabe, 2012; Ben-David et al., 2012; Redko et al., 2020], open-set classification [Saito et al., 2018; Ruff et al., 2020; Fang et al., 2020; Geng et al., 2020], and learning from aggregate observations [Maron and Lozano-Pérez, 1997; Hsu et al., 2019]. Although symmetric losses are useful in learning from corrupted labels, using them can sometimes lead to undesirable performance because training with a symmetric loss can be computationally hard [Ghosh et al., 2017; Wang et al., 2019; Ma et al., 2020]. Thus, it is interesting to explore non-symmetric losses that can still benefit from the symmetric condition. This is the second future direction we discuss in this section. Here, we provide two examples to demonstrate the potential of this research direction. The first example is motivated by the fact that a nonnegative symmetric loss must be non-convex [du Plessis et al., 2014]. To explore a robust convex loss, Charoenphakdee et al. [2019b] proposed the barrier hinge loss, a convex loss that satisfies a symmetric condition on a subset of its domain, but not everywhere. The barrier hinge loss was shown to be highly robust in BER and AUC optimization although it is not symmetric. This suggests that one can design a non-symmetric loss that benefits from the symmetric condition. Another example is an approach that combines a symmetric loss with a non-symmetric loss. Recently, Wang et al. [2019] proposed the reverse cross-entropy loss, which is a symmetric loss, and proposed to combine it with the ordinary cross-entropy loss via a linear combination. Their experiments showed that the classification performance of the combined loss can be better than using only the reverse cross-entropy loss or other symmetric losses such as the mean absolute error loss [Ghosh et al., 2017].
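The second example can be sketched as follows (our minimal reimplementation; α, β, and the truncation constant A are hyperparameters of Wang et al. [2019], set here to illustrative values): the symmetric cross entropy linearly combines the ordinary cross entropy with the reverse cross entropy, in which the log 0 terms are truncated to a negative constant A.

```python
import numpy as np

def symmetric_cross_entropy(probs, targets, alpha=0.1, beta=1.0, A=-4.0):
    """Sketch of the symmetric cross entropy of Wang et al. [2019]:
    a linear combination of the ordinary cross entropy (CE) and the
    reverse cross entropy (RCE), where log(0) in the RCE term is
    truncated to the constant A.
    probs: (n, k) predicted class probabilities; targets: (n,) labels."""
    k = probs.shape[1]
    onehot = np.eye(k)[targets]
    ce = -(onehot * np.log(np.clip(probs, 1e-12, 1.0))).sum(axis=1)
    # RCE swaps the roles of prediction and label; log 1 = 0, log 0 := A.
    log_onehot = np.where(onehot > 0, 0.0, A)
    rce = -(probs * log_onehot).sum(axis=1)
    return float((alpha * ce + beta * rce).mean())
```

The RCE term alone is symmetric (robust but slow to optimize), while the CE term restores the fast convergence of ordinary training; the linear combination trades the two off.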
Based on these examples, we can see that it could be beneficial to design a loss function that enjoys the advantages of both symmetric and non-symmetric losses.

Acknowledgments

We would like to thank our collaborators: Dittaya Wanvarie, Yiping Jin, Zhenghang Cui, Yivan Zhang, and Voot Tangkaratt. NC was supported by a MEXT scholarship and the Google PhD Fellowship program. MS was supported by JST CREST Grant Number JPMJCR18A2.