title: A robust approach for deep neural networks in presence of label noise: relabelling and filtering instances during training
authors: Gómez-Ríos, Anabel; Luengo, Julián; Herrera, Francisco
date: 2021-09-08

Deep learning has outperformed other machine learning algorithms in a variety of tasks, and as a result, it is widely used. However, like other machine learning algorithms, deep learning, and convolutional neural networks (CNNs) in particular, perform worse when the data sets present label noise. Therefore, it is important to develop algorithms that help the training of deep networks and their generalization to noise-free test sets. In this paper, we propose a robust training strategy against label noise, called RAFNI, that can be used with any CNN. This algorithm filters and relabels instances of the training set based on the predictions and their probabilities made by the backbone neural network during the training process. That way, this algorithm improves the generalization ability of the CNN on its own. RAFNI consists of three mechanisms: two mechanisms that filter instances and one mechanism that relabels instances. In addition, it does not suppose that the noise rate is known nor does it need to be estimated. We evaluated our algorithm using different data sets of several sizes and characteristics. We also compared it with state-of-the-art models using the CIFAR10 and CIFAR100 benchmarks under different types and rates of label noise and found that RAFNI achieves better results in most cases.

Over the last few years, deep learning and Convolutional Neural Networks (CNNs) in particular have become progressively popular, as they have been used in a variety of applications, especially in computer vision, and outperform other models [1] [2] [3]. Generally speaking, the networks used in these types of applications have become deeper and deeper over the years, due to their high performance. A common problem when dealing with real-world data sets in the context of supervised classification is label noise. The term 'label noise' refers to the situation where some instances in the data set have erroneous labels, thus misleading the training of machine learning algorithms [4]. This type of noise can be present in the data set because it was labelled automatically using text labels from the Internet, or because not enough experts were available to label an entire data set. In either case, the rate of label noise can vary and can increase to large values [5, 6]. As a result, label noise has been widely studied in machine learning. In this paper, we propose RAFNI, a training strategy that filters noisy instances and relabels some instances to their original (clean) class. We evaluated our proposal with a variety of data sets, including small and large data sets, and under different types of label noise. We also use CIFAR10 and CIFAR100 as benchmarks to compare our proposal with other state-of-the-art models, since these data sets are two of the most used in other studies.

The rest of the paper is organized as follows. In Section 2, we give a background on works that propose strategies to help neural networks learn with label noise, with special mention to the ones we compare our algorithm to. In Section 3, we provide a detailed description of the RAFNI algorithm and its hyperparameters. Section 4 details the experimental framework, including the data sets, the types and levels of noise and the network configurations we used.
The complete results obtained for all data sets and the comparison with the state-of-the-art models are shown in Section 5 and Section 6, respectively. We also compared our algorithm with an algorithm that supposes the noise rate is known and we show the results in Section 7. Then, in Section 8, we analyse the effectiveness of the RAFNI algorithm on two of the data sets used in this study. Finally, we give some final conclusions in Section 9.

In this section, we provide some background in the context of label noise and the types of noise we use in this study (Subsection 2.1). Then, we present an overview of the most popular approaches proposed in the context of deep learning to overcome the problem of label noise and describe the proposals we have selected to compare with RAFNI (Subsection 2.2).

In the context of supervised classification, we have a set X = {x_1, . . . , x_n} of n training instances and their corresponding labels y = (y_1, . . . , y_n), where y_i ∈ {1, 2, . . . , K}, i = 1, . . . , n, and K is the total number of classes. Label noise is present when some instances in X have erroneous labels. That is, an instance x_i ∈ X with true label y_i actually appears in the training set with another label ŷ_i ≠ y_i, i ∈ {1, . . . , n}. The percentage of instances that present label noise is called the noise rate or noise level. Depending on whether the label noise depends on the class of the instances and on the instances themselves, we can distinguish between the following types of noise:

• Symmetric noise (also called uniform noise or NCAR, noisy completely at random). The noise is independent of the original true class of the instances and of the attributes of the instances. Thus, the labels of a percentage of the instances of the training set are randomly changed to another class following a uniform distribution, where all the classes have the same probability of being the noisy label. This implies that the percentage of noisy instances is the same in all classes. There are two options for this type of noise: the noisy label is chosen from the set of all of the classes (which allows the possibility of leaving the label unchanged), or the noisy label is chosen from the set of all classes except the original one. We chose the second option.

• Asymmetric noise (also called NAR, noisy at random). The noise is dependent on the original true class of the instances and independent of the attributes of the instances. Therefore, the probability of each class being the noisy label is different and depends on the original true class, but all the instances in the same class have the same probability of being noisy. This implies that the percentage of noisy instances in each class can be different.

• NNAR (noisy not at random). The noise is dependent on the original true class of the instances and on the attributes of the instances. Hence, the probability of each class being the noisy label can be different and depends on the original class, and the instances within each class can have different probabilities of being noisy, since the probability also depends on the instances themselves.

As in classical machine learning, the types of noise that we used can be treated (either by filtering or relabelling) without a prior estimation of the probability distribution.

During the last few years, there has been an increase in the number of proposals to help deep neural networks, and CNNs in particular, to learn in the presence of label noise in supervised classification.
Most works fall into one or more of the following approaches: • Proposals that modify the loss function in some way, either to make the loss function robust to label noise [17] [18] [19] , or to correct its values, so the noisy labels do not negatively impact the learning [9, 11, 12, 20, 21] . • Proposals that create a specific deep network architecture [5] or modify an existing one by adding a noise adaptation layer at the end of the desired architecture to model the noise [16, 22] . • Proposals that try to correct the noisy instances [9, 11] . Some proposals suppose that a subset of clean samples is available [5] , and others assume that the noise rate is known [9, 11] , which is not usual when dealing with real-world noisy datasets, though in [11] the authors propose a mechanism to approximate the noise rate. A more in-depth survey of all the work that has been done to learn deep neural networks in presence of label noise can be found in [4] . However, it is important to note that the majority of these proposals are highly focused on classifying the available benchmarks (such as CIFAR10/100 or TinyImagenet), and they use specific networks designed for CIFAR (ResNet32 or ResNet44) along with specific learning schedules. As a result, they are sometimes not generalizable to real-world problems. We have selected a subset of five of these proposals to compare with our algorithm: one that uses a robust loss function, three that propose loss correction approaches, and one that proposes a hybrid approach between loss correction and sample selection. We choose them because they have official public implementations either on TensorFlow/Keras or PyTorch. In the following, we describe these five proposals. 1. Robust loss function approach that uses a generalization of the softmax layer and the categorical cross-entropy loss [17] . Here, the authors propose to make the loss function robust against label noise by modifying the loss function and the last softmax activation of the deep neural network with two temperatures, creating non-convex loss functions. These two temperatures can be tuned for each data set. This proposal has the advantage that using the code provided by the authors, it can easily be used with any combination of a deep network, data set and optimization technique, including transfer learning. 2. Loss correction approaches [11] . The authors propose two approaches to correct the loss values of the noisy instances, for which it is necessary to know the noise matrix of the data set, called backward correction and forward correction. They provide a mechanism to estimate the noise matrix, and when used, the approaches are called estimated backward correction and estimated forward correction. The first one uses the noise matrix to correct the loss values, so they are more similar to the loss values of the clean instances. The second explicitly uses the noise matrix to correct the predictions of the model. 3. Loss correction approach using the dimensionality of the training set [12] . The authors explain that, when dealing with noisy labels, the learning can be separated into two phases. In the first phase, which occurs in the first epochs of the training, the network learns the underlying distribution of the data. Then, in the second phase, the network learns to overfit the noisy instances. 
They use a measure called Local Intrinsic Dimensionality (LID) to detect the moment the training enters the second phase and use the LID to modify the loss function to reduce the effect of the noisy instances. 4. Loss correction approach [21] , which is based on the static hard bootstrapping loss proposed in [23] combined with a data augmentation technique called mixup proposed in [24] . They use a beta mixture model to fit the loss values of the instances so they can distinguish between clean and noisy instances and use the loss correction approach on the noisy ones. 5. Loss and label correction approach [9] . The authors propose a hybrid approach between sample selection and loss correction that tries to relabel noisy instances when possible and not use them when not. For the noisy instances, they rely on the network: if it returns the same label with a high probability in the first epochs of the training, it is possible to correct that instance and the algorithm changes its label to the one the network predicts. In contrast, if the network changes the prediction of an instance inconsistently, they stop using that instance. They assume that the noise rate in the data set is known, and they do not provide a way to estimate it. This approach can be used iteratively so that the training set is iteratively cleaned in several training processes. In this section, we describe our proposal. First, in Subsection 3.1, we give an overall description of the algorithm and explain its basics. Then, in Subsection 3.2, we present a formal definition of the algorithm. Finally, in Subsection 3.3 we give a guide on how to tune the hyperparameters of the algorithm. We propose the RAFNI algorithm, which filters and relabels instances based on the predictions and their probabilities made by the backbone neural network during the training process. In Figure 1 , we show the difference between training the backbone network with and without the RAFNI algorithm and the moment it is applied. The backbone network used is independent of the algorithm, and it can change or be modified, for example, including transfer learning. Generally speaking, we propose two mechanisms to filter an instance and one mechanism to relabel an instance, with some restrictions. These mechanisms are the following: • First filtering mechanism. This mechanism only uses the loss value of the instances. The foundation is that the noisy instances tend to have higher loss values than the rest of them. As a result, this mechanism filters out instances that have a loss value above a certain threshold. This threshold is dynamic and will change during training. • Second filtering mechanism. This mechanism depends on how many times an instance has been relabelled. Here we suppose that if the algorithm relabels an instance too many times is because the backbone network is unsure about its class and it is better to remove that instance. Thus, this mechanism filters an instance if it has been relabelled more than a certain number of times. In addition, we establish a period of a certain number of epochs after an instance has been relabelled during which the algorithm cannot filter nor relabel it again. • Relabelling mechanism. This mechanism takes into account the probability predictions of the backbone network. We suppose that if the backbone network predicts another class with a high probability as the training progresses, it is probable that the instance is noisy and its class is indeed the one predicted by the backbone network. 
As a consequence, the relabelling mechanism changes the class of an instance if the backbone network predicts another class with a probability that is above a certain threshold. This threshold is also dynamic and will change during training.

These mechanisms have restrictions related to the moment they are applied. Since we are using the backbone network to relabel and filter instances, we need to wait until the network is sufficiently trained for its predictions to be reliable. This can be measured using the loss values of the instances. Intuitively, we want to start the algorithm (and thus the three mechanisms) when the backbone network has learned to classify the clean instances but has not yet learned to overfit the noisy ones. Here, similarly to [21], we approximated the loss values of the instances in each epoch of the training process by a mixture model with two components, but in our case, we use a Gaussian mixture model. To do that, we used the expectation-maximization algorithm and used the two components to detect the moment where the RAFNI algorithm needs to start. A mixture model is a model that can represent different subpopulations inside a population. These subpopulations or components follow a distribution that, in a Gaussian mixture model, is assumed to be a Gaussian distribution. That way, if we have a Gaussian mixture model with two components, we are approximating two subpopulations, each one with a Gaussian distribution, so we obtain two means and two variances. In our case, we have two components, one for the clean instances, with mean µ_clean and standard deviation σ_clean, and one for the noisy instances, with mean µ_noisy and standard deviation σ_noisy. We detect the moment we need to start the RAFNI algorithm by calculating the overlap between the two Gaussians obtained by the mixture model over the loss values of the instances in all training epochs. At first, the two Gaussians will start separating from one another, while the network learns to classify the clean and easy examples. Then, at some point, they will start to get closer, as the network starts to overfit the noisy examples. Therefore, we start the algorithm when the overlap between the Gaussians is below a fixed value or when this overlap starts to increase. We have tested different values for this hyperparameter with different data sets and levels of noise and we found that 0.15 is a good value that can remain fixed across all data sets and noise rates.

Let X = {x_1, x_2, . . . , x_n} be the training set. The algorithm keeps, for each instance, a record of the classes it has been assigned during training, whose length is controlled by the hyperparameter record_length, and uses a second hyperparameter, not_change_epochs, so that if the label of an instance has been changed, the algorithm cannot change it again nor remove that instance from the training set until not_change_epochs epochs have passed. In Figure 2, we show the flowchart of the RAFNI algorithm, detailing how and when each mechanism is applied to each instance x_i during a specific epoch m of the training process. The numbers record_length and not_change_epochs are hyperparameters of the algorithm that can be set by the user. The two thresholds, loss_threshold and prob_threshold, are parameters that dynamically change every epoch m using the losses l_{m−1} of the instances and their probabilities in the previous epoch m − 1. Specifically, the loss_threshold is calculated for every epoch as the quantile of order x_1 (x_1 is a hyperparameter of the algorithm called quantile_loss, which can be set by the user) of the losses l_{m−1} of the previous epoch.
Similarly, the prob_threshold is calculated for every epoch as the quantile of order x_2 (also a hyperparameter, called quantile_prob) of the probabilities returned by the backbone network for the misclassified instances. That way, the loss_threshold usually decreases as the epochs advance and the training instances are filtered and their classes relabelled. Due to that, we need to stop updating the loss_threshold parameter at some point so as not to filter too many instances. Similarly, we also need to stop updating the prob_threshold parameter so it does not change the class of too many instances. To do this, we again use the two Gaussians obtained by the Gaussian mixture model and stop the update of both thresholds when the means of the Gaussians are sufficiently close. We tested different values and found that 0.3 is a good value that works for different data sets and levels of noise. That way, we stop updating the thresholds if µ_noisy − µ_clean < 0.3. This algorithm can be used with any CNN as a backbone network. The code of the algorithm is available at https://github.com/ari-dasci/S-RAFNI.

RAFNI has a list of hyperparameters that can be fine-tuned by the user. Here we specify which hyperparameters are most important to tune if a validation set is available, which ones are less important and which ones do not need to be tuned. We also give a guide on how to tune them. The complete list of hyperparameters of RAFNI is the following: the overlapping threshold between the noisy Gaussian and the clean Gaussian that we use to start the algorithm, the difference between the means of the two Gaussians that we use to stop the update of the loss_threshold and the prob_threshold, the quantile_loss, the quantile_prob, the record_length and the not_change_epochs. The loss_threshold and the prob_threshold are not really hyperparameters, as they cannot be tuned: they change dynamically in each epoch based on the quantile_loss and quantile_prob hyperparameters, respectively. We also have the hyperparameters inherent to training a deep neural network: the total number of epochs of the training, the batch size and whether or not to fine-tune the backbone CNN: if we do not use fine-tuning, the layers of the backbone neural network are not retrained and only the newly added layers are trained, and if we use fine-tuning, all the layers are trained. Whether to use fine-tuning or not depends on the backbone network used (how deep it is) and on whether the data set we want to classify has enough images to retrain the whole network. The number of epochs of the training depends on whether we are fine-tuning the backbone network and on the size of the data set. The batch size depends on the size of the data set; it will usually increase as the size of the data set increases.

If we focus on the specific hyperparameters of RAFNI, there are two of them that we recommend not changing: the overlapping threshold between the Gaussians that we use to start the algorithm and the difference between the means of the Gaussians that we use to stop the updates of loss_threshold and prob_threshold. We tested different values for these parameters across all the data sets and levels of noise we used and we found that 0.15 is a good value for the overlapping threshold and 0.3 a good value for the difference between the means of the Gaussians. Figures 3, 4 and 5 show why we chose these values.
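Putting the mechanisms and thresholds described above together, the following is a simplified, self-contained sketch of how one training epoch of RAFNI could be implemented. It is not the authors' code (the official implementation is available at https://github.com/ari-dasci/S-RAFNI); the per-instance bookkeeping, the state variables and the particular overlap measure (the area under the minimum of the two densities) are our assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture


def fit_two_gaussians(losses):
    """Fit a two-component Gaussian mixture (via EM) to the per-instance losses and
    return (mu_clean, sigma_clean, mu_noisy, sigma_noisy); the clean component is
    the one with the lower mean."""
    losses = np.asarray(losses, dtype=float)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses.reshape(-1, 1))
    means = gmm.means_.ravel()
    stds = np.sqrt(gmm.covariances_.ravel())
    clean, noisy = np.argsort(means)
    return means[clean], stds[clean], means[noisy], stds[noisy]


def gaussian_overlap(losses, grid_size=1000):
    """Overlap between the two fitted Gaussians, computed here as the area under
    the minimum of the two densities (one possible choice of overlap measure)."""
    losses = np.asarray(losses, dtype=float)
    mu_c, sd_c, mu_n, sd_n = fit_two_gaussians(losses)
    xs = np.linspace(losses.min(), losses.max(), grid_size)
    density = np.minimum(norm.pdf(xs, mu_c, sd_c), norm.pdf(xs, mu_n, sd_n))
    return density.sum() * (xs[1] - xs[0])


def rafni_epoch(losses_prev, probs, labels, state,
                quantile_loss=0.95, quantile_prob=0.95,
                record_length=3, not_change_epochs=2):
    """Apply the two filtering mechanisms and the relabelling mechanism for one
    epoch. `losses_prev` holds the previous epoch's per-instance losses, `probs`
    the current softmax outputs, `labels` the current (possibly relabelled)
    training labels, and `state` the bookkeeping: {"epoch": int,
    "last_change": array initialised to -inf, "n_changes": zero array}."""
    preds = probs.argmax(axis=1)
    loss_threshold = np.quantile(losses_prev, quantile_loss)
    wrong = preds != labels
    prob_threshold = (np.quantile(probs[wrong].max(axis=1), quantile_prob)
                      if wrong.any() else 1.0)

    keep = np.ones(len(labels), dtype=bool)
    for i in range(len(labels)):
        if state["epoch"] - state["last_change"][i] < not_change_epochs:
            continue                                    # cooldown after a relabel
        if losses_prev[i] > loss_threshold:
            keep[i] = False                             # first filtering mechanism
        elif wrong[i] and probs[i, preds[i]] > prob_threshold:
            labels[i] = preds[i]                        # relabelling mechanism
            state["n_changes"][i] += 1
            state["last_change"][i] = state["epoch"]
            if state["n_changes"][i] >= record_length:
                keep[i] = False                         # second filtering mechanism
    return labels, keep

# In this sketch, rafni_epoch would only start being applied once
# gaussian_overlap(losses) drops below 0.15 (or the overlap grows with respect to
# the previous epoch), and the two thresholds would stop being updated once
# mu_noisy - mu_clean < 0.3, as described in the text.
```

Instances marked keep = False would simply be excluded from the next epoch's training set.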
In Figure 3 we show the two components (the two Gaussians) obtained by the Gaussian Mixture Model (GMM) over the losses of the instances in the first epochs of the training of EILAT with 40% of symmetric noise. At first, as the learning progresses, the network starts to differentiate between the clean and the noisy instances, and thus the two components start to separate from each other. Then, the network starts to overfit the noisy instances and the two components start to come together again. To stop this overfitting, we start the RAFNI algorithm when the overlap between the two components is less than 0.15 or when the overlap in an epoch is greater than in the previous epoch. In Figure 4 we can see how this overlap changes through the epochs of the training and in which specific epoch we start the RAFNI algorithm. Finally, to stop the updating of the loss_threshold and the prob_threshold we use the distance between the means of the two components: if they are close, it means that there are not enough noisy instances left to keep the two components separated. In Figure 5 we can see the evolution of the difference between the means of the two Gaussians and the specific epoch where we stop the updating of the thresholds.

Regarding the rest of the hyperparameters of RAFNI, we recommend tuning them if possible, though we found that tuning the quantile_loss and quantile_prob is more important than tuning the record_length and not_change_epochs hyperparameters. The quantile_loss depends on how hard the data set is to classify: the more difficult it is, the less we can rely on the predictions of the backbone network, and it is more convenient to use higher values so that the algorithm is more conservative, that is, so that it does not remove or change the class of too many instances. Something similar happens with the values of the hyperparameters quantile_prob, record_length and not_change_epochs, though in these cases the choice is also related to the size of the data set: the harder it is to classify and the fewer images it has, the higher the values we should give to these hyperparameters. In the case of record_length and not_change_epochs, we should also take into account the total number of epochs of the training: if this number is small, these two hyperparameters should also be small, and they should not exceed the total number of epochs in any case. The ranges in which each hyperparameter can take values are the following. For the quantile_loss and the quantile_prob hyperparameters, given that they are quantiles, their maximum value is 1 and their minimum is 0. However, we found that they perform best if they vary in the range [0.6, 0.99]. The minimum value for record_length is 2, so it can track at least one change in the class of the instances, and its maximum is the total number of epochs in the training; the higher this value is, the fewer instances the algorithm will remove because their class has changed. Finally, the minimum value for not_change_epochs is 1 and the maximum is the total number of epochs in the training.

In this section, we describe the experimental framework we used to carry out the experiments. In Subsection 4.1, we describe the data sets we used. In Subsection 4.2, we detail the types of noise we used in each data set along with the noise levels we used in each one of them. Finally, in Subsection 4.3, we provide the specific configuration, backbone neural network and software we used for all the experiments.
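Returning to the tuning guide above, the recommended ranges can be summarised as a hypothetical hyperparameter search space; the specific grid values below are our example, not the grids reported in Table 4.

```python
# Internal parameters kept at the fixed values recommended by the authors:
# 0.15 for the Gaussian overlap that starts the algorithm and 0.3 for the
# difference between the means that stops the threshold updates.
search_space = {
    "quantile_loss":     [0.6, 0.8, 0.9, 0.95, 0.99],  # recommended range [0.6, 0.99]
    "quantile_prob":     [0.6, 0.8, 0.9, 0.95, 0.99],  # recommended range [0.6, 0.99]
    "record_length":     [2, 3, 5],                    # >= 2 and <= total training epochs
    "not_change_epochs": [1, 2, 3],                    # >= 1 and <= total training epochs
}
```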
We describe the data sets we used to analyse RAFNI under different types and levels of label noise. We used six data sets, each one with a different number of classes, images per class and total number of images: RSMAS, StructureRSMAS, EILAT, COVIDGR1.0-SN, CIFAR10 and CIFAR100. There is a summary of the statistics of these data sets in Table 1. RSMAS, StructureRSMAS and EILAT are small coral data sets. RSMAS and EILAT [2, 25] are texture data sets, containing coral patches, meaning that they are close-up patches extracted from larger images, while StructureRSMAS [26] is a structure data set, containing images of entire corals. The patches in EILAT have a size of 64×64 and come from images taken under similar underwater conditions, and the ones in RSMAS have a size of 256×256 and come from images taken under different conditions. StructureRSMAS is a data set collected from the Internet and therefore contains images of different sizes taken under different conditions. COVIDGR1.0-SN is a modification of COVIDGR1.0 [27]. COVIDGR1.0 contains chest x-rays of patients divided into two classes: positive for COVID-19 and negative for COVID-19, using the RT-PCR as ground truth. All the images in the data set were taken using the same protocol and similar x-ray machines. The authors made the data set available along with a list containing the degree of severity of the positive x-rays: Normal-PCR+, Mild, Moderate and Severe. The x-rays with Normal-PCR+ severity are x-rays from patients that tested positive in the RT-PCR test but where expert radiologists could not find signs of the disease in the x-ray. The modification that we use, COVIDGR1.0-SN, is the same data set as COVIDGR1.0 but with the 76 positive images with Normal-PCR+ severity removed. To keep the two classes balanced, as happens in the original data set, we also removed 76 randomly chosen negative images. Finally, CIFAR10 and CIFAR100 [28] are the well-known data sets of 60,000 tiny 32×32 images proposed by Alex Krizhevsky. Compared with the other data sets used in this study, CIFAR10 and CIFAR100 are much larger. Both of them have a predefined test hold-out of 10,000 images, meaning they both have a training set of 50,000 images. Both data sets contain classes of common objects, such as 'Airplane' and 'Ship' in CIFAR10 or 'Bed' and 'Lion' in CIFAR100.

We now state the types of noise we used for each data set and the rates we used for each one of them. In Table 2 we show a brief description. In summary, we used symmetric noise, asymmetric noise and NNAR noise. Symmetric noise is the most used type of noise and, since it does not require external information, we used it in all data sets except for COVIDGR1.0-SN. However, to also use more realistic and challenging types of noise, we used asymmetric and NNAR noise when possible. This is the case for CIFAR10 and COVIDGR1.0-SN. For CIFAR10, we used the asymmetric noise introduced in [11], which has become a standard when evaluating deep learning in the presence of asymmetric label noise. This noise is introduced between classes that are alike, simulating real label noise that could have occurred naturally. In particular, we introduced asymmetric noise between the following classes: TRUCK → AUTOMOBILE, BIRD → AIRPLANE, DEER → HORSE, CAT ↔ DOG, as defined in [11]. Note that since we are introducing an x% of noise in five of the ten classes, we are introducing an x/2% of noise in the total data set.
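To make the two noise models used for the CIFAR benchmarks concrete, below is a minimal sketch of how they could be injected. The function names are ours, and the class indices in the asymmetric mapping assume the usual CIFAR10 ordering (0 airplane, 1 automobile, 2 bird, 3 cat, 4 deer, 5 dog, 6 frog, 7 horse, 8 ship, 9 truck); the paper's own injection code is not reproduced here.

```python
import numpy as np


def symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Flip `noise_rate` of the labels to a class chosen uniformly among the other
    classes (the variant used in the paper, which never keeps the original label)."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    idx = rng.choice(len(noisy), size=int(noise_rate * len(noisy)), replace=False)
    for i in idx:
        noisy[i] = rng.choice([c for c in range(num_classes) if c != noisy[i]])
    return noisy


# Class pairs of [11]: truck->automobile, bird->airplane, deer->horse, cat<->dog.
ASYM_MAP = {9: 1, 2: 0, 4: 7, 3: 5, 5: 3}


def asymmetric_noise(labels, noise_rate, seed=0):
    """Flip `noise_rate` of the instances of each source class in ASYM_MAP to its
    paired class; since only five of the ten classes are affected, the overall
    noise level is noise_rate / 2, as noted above."""
    rng = np.random.default_rng(seed)
    original = np.asarray(labels).copy()
    noisy = original.copy()
    for src, dst in ASYM_MAP.items():
        idx = np.where(original == src)[0]
        flip = rng.choice(idx, size=int(noise_rate * len(idx)), replace=False)
        noisy[flip] = dst
    return noisy
```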
For COVIDGR1.0-SN, we have the additional information on the severity degree of the images of the positive class, so we used it to introduce NNAR noise, where we change the labels of a percentage x of the instances of the data set subject to some condition over the instances. COVIDGR1.0-SN has two classes: P (COVID-19 positive) and N (COVID-19 negative), and the instances from P have an associated severity (Mild, Moderate and Severe). In this scenario, it is more realistic that a positive image with mild severity has been misclassified as negative than a positive image with moderate or severe severity. Equivalently, it is more realistic that a positive image with moderate severity has been misclassified as negative than a positive image with severe severity. As a consequence, we define the probability of the instances in the groups N (to change class to P), Mild (to change class to N), Moderate (to change class to N) and Severe (to change class to N) as follows: 0.5 for N, 0.3 for Mild, 0.2 for Moderate and 0 for Severe. That way, we change the same amount of instances from P to N and vice versa, but when we change the class from P to N, it is more likely that a mild positive image is changed than a moderate positive image. In addition, we make sure that no positive image with severe severity changes from class P to N.

Here, we provide the specific configuration we used in the experiments we carried out. As the backbone neural network we used ResNet50 [29], though the implementation of RAFNI is independent of the backbone network and it can be changed easily. We used ResNet50 pre-trained on ImageNet, removing the last layer of the network and adding two fully connected layers, the first one with 512 neurons and ReLU activation and the second one with as many neurons as classes in the data set and softmax activation. Once we removed the last layer of ResNet50, its output had 2048 neurons. We chose 512 neurons for the first fully connected layer we added as an intermediate number between 2048 and the number of classes in the data sets. The fixed hyperparameters we used in each data set can be seen in Table 3. We used Stochastic Gradient Descent (SGD) with a learning rate of 1 × 10⁻³, a decay of 1 × 10⁻⁶ and a Nesterov momentum of 0.9. We did not optimize these hyperparameters. For the experimentation, we used TensorFlow 2.4 and an Nvidia Tesla V100. The values we gave to each hyperparameter can be seen in Table 4. Using those values, we performed an exhaustive grid search to find the best configuration in each case. Since the CIFAR data sets and the rest of the data sets we used had different sizes, the experimental framework we used for them was different. For the smaller data sets RSMAS, EILAT, StructureRSMAS and COVIDGR1.0-SN, we used five-fold cross-validation for the experiments in the grid search, while for CIFAR10 and CIFAR100 we used a 20% hold-out using only the original training set. Then, to ensure a more stable final result, we did the following. For the smaller data sets, we repeated the five-fold cross-validation with the best hyperparameter configuration five times (denoted 5x5fcv) and we report the mean and standard deviation of the 5x5fcv. This scheme of reporting mean and standard deviation is one of the most used in the literature. For the CIFAR data sets, we used the best hyperparameter configuration (found using only the training set) on the predefined test hold-out, we repeated that experiment five times and we report the mean and standard deviation.
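As an illustration of the backbone configuration described in this subsection, a minimal TensorFlow/Keras sketch follows. The 224×224 input size is our assumption (the paper does not state the input resolution), and the decay argument corresponds to the SGD optimizer of TensorFlow 2.4, the version used in the paper (newer releases moved it to the legacy optimizers).

```python
import tensorflow as tf


def build_backbone(num_classes, fine_tune=True):
    """ResNet50 pre-trained on ImageNet without its classification layer (2048-dim
    output), plus the two fully connected layers described above."""
    base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                          pooling="avg", input_shape=(224, 224, 3))
    base.trainable = fine_tune  # fine-tuning retrains every layer, not only the new ones
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    sgd = tf.keras.optimizers.SGD(learning_rate=1e-3, decay=1e-6,
                                  momentum=0.9, nesterov=True)
    model.compile(optimizer=sgd, loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```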
We used the accuracy measure, widely used for supervised classification. The accuracy is defined as the number of instances correctly classified in the test set divided by the total number of instances in the test set. To make the comparison with the baseline model (that is, the backbone network alone, without filtering or relabelling instances), we used the same scheme as we used with RAFNI: we repeated five times the five-fold cross-validation (or the hold-out for the CIFAR data sets) and we report the mean and standard deviation. The best values for the hyperparameters in each case can be found in Table 5.

In this section, we present the results we obtained for each data set using our proposal, and we compare it with the backbone network alone as the baseline. We first present the results obtained with the smaller data sets: RSMAS, EILAT, StructureRSMAS and COVIDGR1.0-SN. In Table 6, we can observe the results with symmetric noise for the data sets RSMAS, EILAT and StructureRSMAS using RAFNI with ResNet50 as the backbone network, and the comparison with the backbone network alone as the baseline. The results are similar in all data sets as the noise increases, especially for RSMAS and EILAT. At 0% of noise, the difference between the use of the RAFNI algorithm and the baseline is minimal. Then, as noise increases, this difference starts to increase. At 10% of noise, RAFNI obtains 3.26% more than the baseline for RSMAS and 1.72% for EILAT. At 40% of noise, this difference is 14.85% for RSMAS and 17.64% for EILAT. At 70% of noise, the gain of using RAFNI is 14.9% for RSMAS and 39.21% for EILAT. For StructureRSMAS, these differences are lower: for example, at 40%, the gain of using RAFNI is 5.84%. However, RAFNI is still consistently better than the baseline at all levels of noise.

The results obtained for COVIDGR1.0-SN with pseudo-symmetric noise are shown in Table 7. Here we only used levels of noise up to 50% because this data set has only two classes. This data set has the advantage that it is a real-world data set, and it is more difficult to train on (at 0% noise level) than the other data sets: ResNet50 obtains an accuracy of 77.06% at 0% noise. In addition, the noise we introduced in this data set is more realistic, so we can see how well the RAFNI algorithm behaves in a more real-life scenario. We can see that, at all noise levels, including 0%, the results are better using RAFNI than using the baseline, with gains that generally increase as the noise level rises. At 10% noise, RAFNI obtains 2.85% more than the baseline. This gain is 6.92% at 30% noise and 9.12% at 50% noise.

We now show the results we obtained using CIFAR10 and CIFAR100 with symmetric noise and CIFAR10 with asymmetric noise. In Table 8 we can see the results for CIFAR10 and CIFAR100 using symmetric noise. RAFNI achieves better results in both data sets at all levels of noise except for CIFAR100 at 0% of noise, where the baseline is slightly better. Similarly to what happened with the small data sets, the accuracy gain of using RAFNI increases as the noise level increases. For CIFAR10 we have a gain of 7.5% at 20% of noise and a gain of 34.61% at 60% of noise. For CIFAR100, these gains are 7.36% at 20% of noise.

Table 8: Mean ± std accuracy obtained using CIFAR10 and CIFAR100 with symmetric noise, using the baseline network (ResNet50) and RAFNI with that network as the backbone network. The best results in each case are stressed in bold.

In Table 9 we show the results for CIFAR10 using asymmetric noise.
In this scenario, RAFNI also achieves better results than the baseline at all levels of noise. The accuracy gain of using RAFNI increases as the noise increases, being 4.69% at 20% of noise, 8.69% at 40% of noise and 10.41% at 60% of noise.

In this section, we compare our proposal, RAFNI, with some state-of-the-art models that do not use external information (like the noise rate): the loss correction approaches proposed in [11], the robust loss function proposed in [17], the proposal in [12], and the one in [21], which we described in Section 2. The two methods proposed in [11] suppose that the noise rate is known, but the authors incorporated a mechanism to estimate it in the usual case where it is not known, so we used both of their approaches with this estimation (called estimated forward and estimated backward). To make the comparison we used CIFAR10, with symmetric and asymmetric noise, and CIFAR100 with symmetric noise, which are the benchmarks most of the papers in the literature use. To make a fair comparison, we used the same experimental frameworks as the other papers whenever possible, that is, we changed our framework to use the same backbone neural network, the same number of training epochs, optimizer, learning rate scheduler, data augmentation, etc., as the model we are comparing our algorithm to. In each case, we used the best hyperparameters reported in each paper for each scenario, except for the number of epochs. Due to time restrictions, if the original number of epochs used by the authors exceeded 120 for CIFAR10 and 150 for CIFAR100, we changed them to 120 epochs for CIFAR10 and 150 epochs for CIFAR100 (accordingly, we also trained RAFNI for the same number of epochs in each case). The authors in [17] only used CIFAR100 under symmetric noise, so we only had the best values for the two temperatures for this case. For the other two scenarios (CIFAR10 with symmetric noise and with asymmetric noise), we evaluated different values in the range the authors gave for each hyperparameter and selected the best ones for each scenario and level of noise using the same validation set we used to search for the best hyperparameters for RAFNI. For this proposal, we used ResNet50 in all scenarios, since their method can be used with any CNN. The authors in [21] used an original implementation of pre-activation ResNet18 in PyTorch. Unfortunately, we were not able to replicate that network with our algorithm. As a result, we compared our results using RAFNI with our experimental scheme (ResNet50 with fine-tuning, 10 and 15 epochs for CIFAR10 and CIFAR100, respectively) with their model using their experimental scheme (ResNet18, 300 epochs in both cases, data augmentation and learning rate scheduler). In their paper, they also compared their method with other proposals in the literature using different backbone networks. Finally, we used the same data sets as with our algorithm where it was possible (for the proposals made by [17] and [12]), and the given ones in the rest, making sure that the noise injection was the same as in our data sets. We argue that, since the noise level is the same, it is introduced randomly in all the cases, and the test sets are also the same as they are predefined, we can safely compare the algorithms. We also performed a Wilcoxon Rank-Sum test to check if the differences in the results were significant in each case.
Since we repeated each experiment five times, we used all five accuracies for each data set and level of noise to perform the Wilcoxon test, instead of using the mean.

In Table 10 we show the accuracy obtained with the two approaches proposed in [11] and our algorithm using the same backbone network and experimental scheme as the one used in [11]. In the comparison with the estimated backward algorithm, we can see that RAFNI performs better in both data sets at all levels and types of noise. It is interesting that when classifying CIFAR100 with a noise rate of 40% and more, the estimated backward algorithm does not finish the training. Here, the Wilcoxon Rank-Sum test shows that there are significant differences, with p-value 9.5 × 10⁻⁷. When comparing RAFNI with the estimated forward algorithm, we can see that there is less difference between them, especially in CIFAR10, where the differences are usually less than 1% using both types of noise, on average. However, when classifying CIFAR100, RAFNI outperforms the estimated forward algorithm when the noise rate is 40% or more. In particular, RAFNI obtains a gain in accuracy of 6.48% at 40% noise and 11.9% at 60% noise. Using the Wilcoxon Rank-Sum test, we can say that RAFNI performs better with significant differences (p-value 3.6 × 10⁻⁵).

The results we obtained for the D2L algorithm [12] and our algorithm using the same experimental scheme can be seen in Table 11. We can see that for CIFAR10 with symmetric noise there is not much difference between the accuracies obtained by D2L and RAFNI, with D2L being slightly better at 60% noise with a difference of 1.95% on average. However, when introducing asymmetric noise, RAFNI obtained better results at all noise rates, with a difference in accuracy of 4.06% on average at 40% noise. The biggest difference between these two methods, however, can be seen when classifying CIFAR100, where RAFNI outperforms D2L, especially as noise increases, with a difference of 26.43% at 40% noise and 40.97% at 60% noise. The Wilcoxon Rank-Sum test obtains significant differences with p-value 1.31 × 10⁻⁵.

The accuracies we obtained using the BiTempered method and RAFNI, both of them using ResNet50 as the backbone network, can be seen in Table 12. Here, RAFNI obtains better results than the BiTempered method on average.

Finally, the accuracies we obtained with the algorithm proposed by Arazo et al. in [21] and RAFNI, both of them using their original experimental schemes, can be seen in Table 13. This is the case where we can see the most discrepancies depending on the data set that is being classified, but also depending on the type of noise. For CIFAR10 with symmetric noise, which is easier to classify than CIFAR100, the algorithm from Arazo et al. is better than RAFNI, obtaining 9.86% more at 60% noise, on average, though the difference at 20% is considerably lower. However, if we introduce asymmetric noise on CIFAR10, which is a more complicated and realistic type of noise, RAFNI obtains better results. The differences increase as the noise rate increases, being 24.02% at 40% noise. For CIFAR100 with symmetric noise, RAFNI again outperforms the algorithm proposed by Arazo et al., and the differences in accuracy are present even at low percentages of noise, similar to what happened for CIFAR10 with asymmetric noise: RAFNI obtains 8.64% more accuracy, on average, at 20% noise, 9.29% at 40% noise and 12.25% at 60% noise.
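The significance test used throughout this section can be run, for instance, with SciPy; a minimal sketch (variable names are ours):

```python
from scipy.stats import ranksums


def significant_difference(rafni_accuracies, other_accuracies, alpha=0.05):
    """`rafni_accuracies` and `other_accuracies` pool, for each method, the five
    accuracies obtained for every data set and noise level, as described above."""
    statistic, p_value = ranksums(rafni_accuracies, other_accuracies)
    return p_value, p_value < alpha
```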
When we used the Wilcoxon Rank-Sum test on all results, we obtained that RAFNI achieved significant differences with p-value 1.15 × 10⁻⁷.

7 Comparison with an approach that supposes the noise rate is known

In this section, we compare our algorithm, which assumes no known external information, with the SELFIE algorithm, which assumes the noise rate of the data set is known. It is important to note that, in real scenarios, knowing the noise rate of the data set is not usual. We want to estimate the advantage of using this unusual information in specific models, so that we can quantify the potential loss of not having that kind of knowledge available. Thus, RAFNI is expected to perform worse than SELFIE, since RAFNI does not assume extra information, but we want to see in which cases it is competitive. To make this comparison, we used RAFNI with the same experimental scheme as SELFIE: DenseNet-25-12 as the backbone network, with no data augmentation and the same learning rate scheduler. We trained both algorithms for 120 and 150 epochs for CIFAR10 and CIFAR100, respectively, due to time restrictions. For SELFIE, we had to implement the matrix of asymmetric noise as given in [11], so as to use the same noise injection in the data sets used in both algorithms. The results we obtained for both algorithms are shown in Table 14. As expected, SELFIE obtained better results for both data sets and types of noise. Nonetheless, for the CIFAR10 data set RAFNI performs well, obtaining similar results at low noise rates. This also happened when using asymmetric noise, which is interesting since this type of noise is more realistic and difficult to deal with than symmetric noise. In fact, RAFNI outperforms SELFIE at 40% noise by more than 11%. However, in general, if the information about the noise rate is known, it is better to use an algorithm that uses that information. We did not suppose this type of information was known in our algorithm, as we wanted to make it applicable in all cases.

In this section, we analyse how well RAFNI removes and relabels instances using two data sets as samples: EILAT and COVIDGR1.0-SN. In particular, we take a look at a) how many instances were removed, and from those, how many of them were noisy; and b) how many instances were relabelled, and from those, how many of them were noisy and how many were correctly changed by the algorithm back to their original class. For both data sets, we analysed how well RAFNI behaves at every level of noise tested in Section 5. We used one five-fold cross-validation repeated five times for both data sets, so the results we give here are the total results, that is, for the total number of removals, for example, we sum all the removals over the five training sets. The results for EILAT can be seen in Table 15 and the results for COVIDGR1.0-SN can be seen in Table 16. In both tables we show 1) the percentage of good removals, that is, instances that were noisy and RAFNI removed from the training set; 2) the total number of instances that RAFNI removed during training; 3) the percentage of good changes, that is, instances that were noisy and RAFNI changed to their original clean class; 4) the total number of instances that RAFNI changed from one class to another during training. Since EILAT has more than two classes, in its case we also show the percentage of noisy changes, that is, instances that were noisy and RAFNI changed to another class, but not their original class.
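The quantities reported in Tables 15 and 16 can be computed directly when the noise is injected artificially, since the corrupted indices and the original clean labels are known; a minimal sketch (function and variable names are ours):

```python
def removal_and_change_stats(removed_idx, changes, noisy_idx, clean_labels):
    """removed_idx: indices RAFNI filtered out; changes: {index: new_label} for the
    instances RAFNI relabelled; noisy_idx: indices whose training label was
    corrupted; clean_labels: the original (clean) labels of all instances."""
    noisy = set(noisy_idx)
    good_removals = sum(i in noisy for i in removed_idx)
    good_changes = sum(i in noisy and new == clean_labels[i] for i, new in changes.items())
    noisy_changes = sum(i in noisy and new != clean_labels[i] for i, new in changes.items())
    return {
        "good removals (%)": 100 * good_removals / max(len(removed_idx), 1),
        "total removals": len(removed_idx),
        "good changes (%)": 100 * good_changes / max(len(changes), 1),
        "noisy changes (%)": 100 * noisy_changes / max(len(changes), 1),
        "total changes": len(changes),
    }
```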
In both cases, we can see that RAFNI does a good job both removing instances and changing them to their original class. In the case of the COVIDGR1.0-SN data set, the percentage of good changes is lower than with EILAT, but this is to be expected, since the type of noise introduced in the COVIDGR1.0-SN data set is more difficult and realistic than the symmetric noise introduced in EILAT. Even in that case, RAFNI removes instances that were noisy with an accuracy above 92% at all levels of noise for COVIDGR1.0-SN. On the other hand, RAFNI changes instances to their original class with a precision above 95% for EILAT, except at 70% of noise, when it descends to 81.81%. This shows that RAFNI is capable of detecting the noisy instances and either removing them or changing them to their original class with high precision, which improves the learning, as we have seen in Section 5. The total number of removals and changes tends to increase as the noise level increases in both data sets, as would be expected since the number of noisy instances is increasing, so this is another sign that the algorithm is behaving well.

In this paper, we proposed an algorithm, called RAFNI, that can filter and relabel noisy instances during the training process of any convolutional neural network using the predictions and loss values the network gives to the instances of the training set. This progressive cleaning of the training set allows the network to improve its generalisation at the end of the training process, improving the results the CNN obtains on its own. In addition, RAFNI has the advantage that it can be used with any CNN as the backbone network and that transfer learning and data augmentation can be easily applied. It also does not use prior information that is usually not known, like the noise matrix or the noise rate. In addition, it works well even when there is no introduced noise in the data set, so it is safe to use when we do not know the noise rate of a data set. We also made the code available so it is easier to use.

Developing algorithms that allow deep neural networks to perform better under label noise is an important task, since label noise is a common problem in real-world scenarios and it negatively affects the performance of the networks. We believe that our proposal is a great solution to this problem: it can be easily fine-tuned to every data set, it can be used with any CNN, and it allows the use of transfer learning and data augmentation. We proved its potential using various data sets with different characteristics and using three different types of label noise. Finally, we also compared it with several state-of-the-art algorithms, obtaining better results in most cases.
This publication was supported by the project with reference SOMM17/6110/UGR, granted by the Andalusian Consejería de Conocimiento, Investigación y Universidades and European Regional Development Funds (ERDF). This work was also supported by project PID2020-119478GB-I00 granted by Ministerio de Ciencia, Innovación y Universidades, and project P18-FR-4961 by Proyectos I+D+i Junta de Andalucía 2018. Anabel Gómez-Ríos was supported by the FPU Programme FPU16/04765 by Ministerio de Educación, Cultura y Deporte.

References

[1] Imagenet classification with deep convolutional neural networks
[2] Towards highly accurate coral texture images classification using deep convolutional neural networks and data augmentation
[3] Automatic handgun detection alarm in videos using deep learning
[4] Learning from noisy labels with deep neural networks: A survey
[5] Learning from massive noisy labeled data for image classification
[6] Cleannet: Transfer learning for scalable image classifier training with label noise
[7] Classification in the presence of label noise: A survey
[8] Webvision database: Visual learning and understanding from web data
[9] Selfie: Refurbishing unclean samples for robust deep learning
[10] Understanding deep learning (still) requires rethinking generalization
[11] Making deep neural networks robust to label noise: A loss correction approach
[12] Dimensionality-driven learning with noisy labels
[13] Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels
[14] Iterative learning with open-set noisy labels
[15] Collaborative learning for deep neural networks
[16] Learning deep networks from noisy labels with dropout regularization
[17] Robust bi-tempered logistic loss based on bregman divergences
[18] Robust loss functions under label noise for deep neural networks
[19] Generalized cross entropy loss for training deep neural networks with noisy labels
[20] Probabilistic end-to-end noise correction for learning with noisy labels
[21] Unsupervised label noise modeling and loss correction
[22] Training convolutional networks with noisy labels
[23] Training deep neural networks on noisy labels with bootstrapping
[24] mixup: Beyond empirical risk minimization
[25] Coral reef dataset, Mendeley Data, v2
[26] Coral species identification with texture or structure images using a two-level classifier based on convolutional neural networks
[27] Covidgr dataset and covid-sdnet methodology for predicting covid-19 based on chest x-ray images
[28] Learning multiple layers of features from tiny images
[29] Deep residual learning for image recognition