key: cord-0704254-35u0dtf7
authors: Zhu, Lei; Luo, Zhaojing; Wang, Wei; Zhang, Meihui; Chen, Gang; Zheng, Kaiping
title: Towards Robust Cross-domain Image Understanding with Unsupervised Noise Removal
date: 2021-09-09
journal: 29th ACM International Conference on Multimedia, MM 2021
DOI: 10.1145/3474085.3475175
sha: a97523efdaac65da4bc6bcd8b4d1cdfb7a3ecf52
doc_id: 704254
cord_uid: 35u0dtf7

Deep learning models usually require a large amount of labeled data to achieve satisfactory performance. In multimedia analysis, domain adaptation studies the problem of cross-domain knowledge transfer from a label-rich source domain to a label-scarce target domain, thus potentially alleviating the annotation requirement for deep learning models. However, we find that contemporary domain adaptation methods for cross-domain image understanding perform poorly when the source domain is noisy. Weakly Supervised Domain Adaptation (WSDA) studies the domain adaptation problem under the scenario where source data can be noisy. Prior WSDA methods remove noisy source data and align the marginal distribution across domains without considering the fine-grained semantic structure in the embedding space, which causes the problem of class misalignment, e.g., features of cats in the target domain might be mapped near features of dogs in the source domain. In this paper, we propose a novel method, termed Noise Tolerant Domain Adaptation (NTDA), for WSDA. Specifically, we adopt the cluster assumption and learn clusters discriminatively with class prototypes (centroids) in the embedding space. We propose to leverage the location information of the data points in the embedding space and model the location information with a Gaussian mixture model to identify noisy source data.
We then design a network which incorporates the Gaussian mixture noise model as a sub-module for unsupervised noise removal, and propose a novel cluster-level adversarial adaptation method based on the Generative Adversarial Network (GAN) framework which aligns unlabeled target data with the less noisy class prototypes to map the semantic structure across domains. Finally, we devise a simple and effective algorithm to train the network end to end. We conduct extensive experiments to evaluate the effectiveness of our method on both general images and medical images from COVID-19 and e-commerce datasets. The results show that our method significantly outperforms state-of-the-art WSDA methods.

There is great interest in using deep learning for various multimedia applications, such as media interpretation [9, 35] and multimodal retrieval [15, 37-39]. Much of its success is attributed to the availability of large-scale labeled training data [7]. However, in practice, large-scale labeled data are hardly available, as manually annotating sufficient label information for various multimedia applications is both expensive and time-consuming. Thus, it is desirable to reuse labeled data from a related domain for cross-domain image understanding. This process is called Domain Adaptation (DA), which transfers knowledge from a label-rich source domain to a label-scarce target domain [24].

Intuitively, the data quality in the source domain affects the domain adaptation performance. However, in practice, high-quality source data related to a target task of interest are hardly available. In contrast, the Internet and social media contain large-scale labeled multimedia data which can be downloaded with keyword search [14, 40]; unfortunately, these data contain noise, in features, labels or both. Similarly, in medical image analysis, annotating medical data requires medical expertise, and due to the subjectivity of domain experts and diagnostic difficulties, noisy labels are often inevitable. Thus, it is meaningful to study robust domain adaptation under the scenario where source data are noisy, in order to achieve better cross-domain image understanding. This problem has been referred to as Weakly Supervised Domain Adaptation (WSDA) [29]. Although WSDA enables many practical use cases of domain adaptation in real life and can substantially reduce annotation costs, it is still not well studied in the literature.

There are two entangled challenges in WSDA, namely source data noise and distribution shift across domains. Directly applying existing domain adaptation methods to WSDA does not work, as source data noise severely deteriorates the adaptation performance [16, 29]. Liu et al. [16] recently propose a Butterfly framework to address these issues. However, their work only considers the scenario where the source domain contains label noise and cannot handle feature noise. Another limitation of their method is its large model size; as a result, it incurs a large amount of computational resources to train the model. Shu et al. [29] recently propose a transferable curriculum for WSDA. The transferable curriculum selects transferable and clean source data for adversarial domain adaptation, which makes their method robust to source data noise.
However, one problem with their method is that the adversarial learning between the feature extractor and the domain discriminator only aligns the marginal distribution across domains and ignores the fine-grained class structure in the embedding space. Noisy source data distort the original source data distribution, so directly aligning the marginal distribution across domains would potentially transfer the noise information from the source domain to the target domain. Label noise can also cause class misalignment: even with perfect alignment of the marginal distribution, the class structure across domains may not be well aligned, e.g., features of cats in the target domain might be mapped near features of dogs in the source domain, which leads to poor target performance [31, 36].

Recently, several Unsupervised Domain Adaptation (UDA) methods [8, 28] adopt the cluster assumption [4] to alleviate the class misalignment problem, where they assume data distribute in the embedding space as separated clusters and data samples in the same cluster share the same class label. In [28], Shu et al. propose Virtual Adversarial Domain Adaptation (VADA), which combines virtual adversarial training [22] and a conditional entropy loss to push the decision boundaries away from class clusters in the embedding space. In [8], Deng et al. propose Cluster Alignment with a Teacher (CAT), which forces features of both the source domain and the target domain to form discriminative class-conditional clusters and aligns the corresponding clusters across domains. Although both works alleviate the class misalignment problem and demonstrate significant performance improvements over prior domain adaptation methods that only align the marginal distribution across domains, like most existing domain adaptation methods they perform poorly when the source domain contains noise [16, 29].

In light of the issues with existing WSDA methods, and inspired by recent clustering-based domain adaptation methods, in this paper we propose a novel method for WSDA with unsupervised noise removal to address these issues. For its noise tolerance property, we call our method Noise Tolerant Domain Adaptation, or NTDA in short. Specifically, we learn the clusters discriminatively with class prototypes (centroids) in the embedding space. We propose to estimate the probability of a source data point being noisy with a Gaussian mixture noise model based on its distance to the class prototype, and to filter out source data points with high probabilities of being noisy (see Sec. 3.2 for more details). We incorporate the Gaussian mixture noise model as a sub-module within a deep network and propose a cluster-level adversarial adaptation method based on the GAN framework which aligns unlabeled target data with the less noisy class prototypes to map the semantic structure across domains. Finally, we devise a simple and effective algorithm to train the network end to end. Fig. 1 presents the workflow of our proposed NTDA method.

To summarize, we make the following contributions in this paper:

• We identify several issues with existing WSDA methods and propose a simple and effective method, NTDA, to address these issues.

• We propose a novel Unsupervised Noise Removal (UNR) method based on the location information of data points, which can be applied to other clustering-based domain adaptation methods to make them robust to source data noise.
• We propose a Cluster-Level Adversarial Adaptation (CAA) method, which adversarially aligns target data points with the less noisy class prototypes in the embedding space and alleviates the class misalignment problem.

• We conduct extensive experiments to evaluate the effectiveness of our method on both general images and medical images. The results show that NTDA significantly improves state-of-the-art results for WSDA.

NTDA has been developed as part of the library of MLCask [21] for supporting healthcare analytics. MLCask is a model-data provenance and pipeline management system that rides on Apache SINGA [23] for supporting end-to-end analytics.

The remainder of the paper is organized as follows. Section 2 provides a brief background on domain adaptation and related work. In Section 3, we present our methodology for tackling the WSDA problem and propose NTDA. We conduct an extensive experimental study and present the results in Section 4. We conclude in Section 5.

Domain Adaptation methods can be divided into two major categories: discrepancy-based and adversarial-based methods. Discrepancy-based methods align feature distributions across domains by minimizing a certain distribution discrepancy, such as Maximum Mean Discrepancy (MMD) [17, 33], correlation distance [30] or Central Moment Discrepancy (CMD) [44]. Adversarial-based methods draw inspiration from the two-player game of Generative Adversarial Networks (GAN) [11]. DANN [10] learns domain-invariant features by adding a domain classifier to the deep feature learning pipeline via gradient reversal. ADDA [32] adopts asymmetric feature extractors for adversarial training. CDAN [18] conditions the adversarial domain adaptation model on the discriminative information conveyed in the classifier prediction. Our method is especially related to the more recent clustering-based domain adaptation methods [8, 28, 46] built on the cluster assumption.

Learning from Noisy Data is an active research area in machine learning. Recently, Zhang et al. [45] empirically demonstrate that noisy data will be memorized by deep networks, which destroys their generalization capability. Arpit et al. [1] find that when training with noisy data, deep networks learn simple patterns first before memorizing noisy data. Based on this memorization effect of deep networks, Han et al. [12] propose a training paradigm termed "co-teaching", where they train two networks simultaneously and utilize the small-loss data from one network to teach the other, which is called the small-loss trick. Yu et al. [42] further propose the "Update by Disagreement" strategy with "co-teaching" to prevent the two networks from converging to a consensus.

Weakly Supervised Domain Adaptation (WSDA) studies the domain adaptation problem under the scenario where source data can be noisy. Although WSDA studies a more practical problem than ordinary domain adaptation, it is still under-explored in the literature. Recently, Liu et al. [16] propose a Butterfly framework which consists of four networks for WSDA. However, their method demands a large amount of computational resources for training. Shu et al. [29] propose a transferable curriculum to select transferable and clean source data for WSDA, but they ignore the class structure in the embedding space for distribution alignment. Yu et al. [43] propose a theoretical framework for label-noise robust domain adaptation with a denoising Conditional Invariant Component. However, their method cannot handle cases where there is feature noise in the data.
Zhang et al. [48] propose Collaborative Unsupervised Domain Adaptation for general and medical image analysis, where they optimize two networks collaboratively to learn from noisy source data and perform weighted instance-level domain adaptation with unlabeled target data. However, their method aligns the marginal distribution across domains to reduce the distribution shift, and thus suffers from class misalignment under the label noise scenario.

In Unsupervised Domain Adaptation (UDA), we are given labeled data $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ in the source domain and unlabeled data $\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$ in the target domain. The source and target data share the same set of labels and are sampled from probability distributions $p$ and $q$ respectively, with $p \neq q$. In Weakly Supervised Domain Adaptation (WSDA), we relax the assumption that $\mathcal{D}_s$ is clean: $\mathcal{D}_s$ may be corrupted with noise in labels, features or both. The goal of WSDA is to effectively transfer knowledge from the noisy source domain to the unlabeled target domain.

In this paper, we adopt the cluster assumption [4], where we assume the data distribution in the embedding space contains separated data clusters and data samples in the same cluster share the same class label. With the labeled source data $\mathcal{D}_s$, we propose to learn the clusters discriminatively with class information. We employ the prototype learning framework from [41], where we assign a prototype to each class in the embedding space. Let $f_i = F(x_i; \theta_f)$ be the feature of source data point $x_i$ with label $y_i$, let $c_k$ be the class prototype of the $k$-th class and $\{c_k\}_{k=1}^{K}$ be the set of prototypes, where $F: \mathcal{X} \to \mathbb{R}^m$ is the feature extractor, $m$ is the embedded feature dimension, $\theta_f$ is the set of parameters of $F$, and $K$ is the total number of classes. We measure the probability of a data point belonging to a specific class with a softmax over its distances to the prototypes, as shown in Eqn. 1:

$$p(y = k \mid x_i) = \frac{\exp(-\alpha\, d(f_i, c_k))}{\sum_{k'=1}^{K} \exp(-\alpha\, d(f_i, c_{k'}))} \qquad (1)$$

where $d(a, b) = \|a - b\|_2^2$ is the squared Euclidean distance between two data points in the embedding space and $\alpha$ is a hyper-parameter for scaling the exponent value. To train the network, we employ the cross-entropy loss on the prediction probability, as shown in Eqn. 2:

$$\mathcal{L}_c = -\frac{1}{n_s} \sum_{i=1}^{n_s} \log p(y = y_i \mid x_i) \qquad (2)$$

From a probabilistic perspective, Eqn. 1 can be viewed as the posterior probability of a data point belonging to a specific class under a mixture of exponential distributions, where the prototypes act as the mean representations of the classes [41]. Minimizing Eqn. 2 increases the posterior probability of each data point belonging to its labeled class. Therefore, data points will cluster around the corresponding class prototypes in the embedding space, which conforms to the cluster assumption. We further propose a compact regularizer resembling the contrastive loss in [8], which minimizes the distances between data points and their class prototypes to make each cluster more compact:

$$\mathcal{L}_r = \frac{1}{n_s} \sum_{i=1}^{n_s} d(f_i, c_{y_i})$$

For prediction, we denote by $C(x; \{c_k\}_{k=1}^{K}) = p(y \mid x)$ a distance-based classifier with the class prototypes as its parameters. For a given input, we first obtain its feature with the feature extractor and then classify it into the category of its nearest prototype, which is also the category with the maximum predicted probability.
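To make the discriminative clustering concrete, below is a minimal PyTorch-style sketch of the distance-based prototype classifier and its two losses. The class and function names, tensor shapes, and the choice of `nn.Parameter` for the prototypes are our own illustrative assumptions rather than the authors' released code:

```python
# A minimal PyTorch-style sketch of the distance-based prototype classifier
# (Eqn. 1), the classification loss (Eqn. 2) and the compact regularizer.
# Names, shapes and the nn.Parameter prototypes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeClassifier(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int, alpha: float = 10.0):
        super().__init__()
        # One learnable prototype (centroid) per class in the embedding space.
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.alpha = alpha  # exponent-scaling hyper-parameter

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Squared Euclidean distances between features (B, m) and prototypes (K, m).
        dists = torch.cdist(feats, self.prototypes) ** 2  # (B, K)
        # Eqn. 1: softmax over negative scaled distances gives class posteriors.
        return F.softmax(-self.alpha * dists, dim=1)


def classification_loss(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Eqn. 2: cross entropy on the distance-based posteriors.
    return F.nll_loss(torch.log(probs + 1e-8), labels)


def compact_regularizer(feats: torch.Tensor, prototypes: torch.Tensor,
                        labels: torch.Tensor) -> torch.Tensor:
    # Pull each source feature towards its own class prototype.
    return ((feats - prototypes[labels]) ** 2).sum(dim=1).mean()
```

At prediction time, `probs.argmax(dim=1)` recovers the nearest-prototype category, matching the distance-based classifier described above.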
We aim to remove noisy source data so that they will not adversely affect domain adaptation [16, 29]. With the discriminative clustering from the previous subsection, we observe that noisy source data locate quite differently in the embedding space compared to clean source data at the early training phase of deep networks. We train a deep network with the objective $\mathcal{L}_c + 0.5\,\mathcal{L}_r$ for 100 epochs and measure the distances between source data points and their class prototypes; the resulting distance distributions are shown in Fig. 2.

We observe that clean data locate closer to their class prototypes than almost all noisy data at the early phase of training, but some noisy data also locate very close to their class prototypes at the late phase of training. To understand this phenomenon, we refer to Arpit et al.'s study [1] on the memorization effect of deep networks: when training with noisy data, deep networks learn simple patterns first before memorizing noisy data. Our observation conforms to this finding. At the early phase of training, as deep networks learn simple patterns first, clean data, which share simple patterns with each other, form class-wise clusters in the embedding space; thus clean data locate closer to their class prototypes than noisy data. At the late phase of training, deep networks memorize the complicated input-to-label patterns from label noise data and the complicated noise patterns from feature noise data; thus noisy data also come to locate close to their class prototypes.

The disparate distributions of distances to prototypes between clean and noisy source data at the early training phase suggest the use of a two-component mixture model [3] to estimate the probability of a data point being clean based on its distance to its class prototype. In this paper, we propose a two-component Gaussian mixture model for this purpose, as we empirically find that it fits both the distance distribution of clean data and that of noisy data well, as shown in Fig. 2(a). The probability density function of the two-component Gaussian mixture is

$$p(d) = \sum_{k=1}^{2} \pi_k\, p(d \mid k)$$

where $\pi_k$ is the prior probability of the clean ($k = 1$) or noisy ($k = 2$) component and $p(d \mid k) = \mathcal{N}(d \mid \mu_k, \sigma_k)$ is the corresponding normal distance distribution with mean $\mu_k$ and covariance $\sigma_k$. We employ the Expectation-Maximization algorithm [6] to estimate the parameters of the Gaussian mixture model and calculate the posterior probability of a data point being clean as follows:

$$p(k = 1 \mid d_i) = \frac{\pi_1\, \mathcal{N}(d_i \mid \mu_1, \sigma_1)}{\sum_{k=1}^{2} \pi_k\, \mathcal{N}(d_i \mid \mu_k, \sigma_k)}$$

With the unsupervised Gaussian mixture noise model, we remove data points with large probabilities of being noisy to prevent them from affecting domain adaptation.
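As an illustration, the noise model could be implemented as below, assuming the distances are collected once per epoch and scikit-learn's `GaussianMixture` performs the EM fit; the authors' actual implementation may differ:

```python
# A sketch of the unsupervised noise model: fit a two-component Gaussian
# mixture by EM (here via scikit-learn) to the per-sample distances and read
# off the posterior probability of the "clean" component.
import numpy as np
from sklearn.mixture import GaussianMixture


def clean_probabilities(dists: np.ndarray) -> np.ndarray:
    """dists: (n_source,) distances from source features to their class prototypes."""
    gmm = GaussianMixture(n_components=2, max_iter=100)
    gmm.fit(dists.reshape(-1, 1))
    posteriors = gmm.predict_proba(dists.reshape(-1, 1))  # (n_source, 2)
    clean_component = int(np.argmin(gmm.means_.ravel()))  # clean data lie closer
    return posteriors[:, clean_component]                 # p(clean | distance)
```

Because clean data sit closer to their prototypes in the early training phase, the mixture component with the smaller mean is taken as the clean component.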
Noisy source data distort the real source data distribution, which makes it more error-prone to align the data distributions across domains. More severely, due to label noise, the network cannot easily discriminate two data points from two different classes, as they might be annotated with the same label. Thus there is no clear boundary between different classes in the embedding space and data from different classes mix up with each other, which makes the class misalignment problem even more prominent for domain adaptation. Existing WSDA methods [29, 48], which align the marginal distribution across domains, fail to resolve this issue and potentially transfer noise information from the source domain to the target domain.

To address the problem, our idea is to align the target data with the more reliable class prototypes in the embedding space, so that we can reduce the distribution shift across domains. Specifically, we find that the prediction entropy of a data point encodes its location in the embedding space. If a data point has low prediction entropy, it locates close to some class prototype (see Eqn. 1); if a data point has high prediction entropy, it locates near the decision boundaries between class prototypes. Source data cluster around their class prototypes in the embedding space, so they have low prediction entropy. However, due to the distribution shift across domains, except for some easy-to-transfer target data, which locate close to class prototypes like source data, most target data locate near the decision boundaries [5, 26].

Thus, we design a domain discriminator based on a data point's prediction entropy. The domain discriminator is defined as

$$D(x) = \frac{1}{\log K}\, H\big(p(y \mid x)\big), \qquad H(p) = -\sum_{k=1}^{K} p_k \log p_k$$

which calculates the normalized prediction entropy of a data point as the probability of the data point belonging to the target domain, where the $\frac{1}{\log K}$ term normalizes the output to the interval $[0, 1]$. Different from existing adversarial-based domain adaptation methods [10, 18, 32], our domain discriminator shares its parameters with the classifier. Since optimizing the classifier with the source data already ensures that the prediction entropy on the source data is small, we train the domain discriminator with the cross-entropy loss only on target data as follows:

$$\mathcal{L}_d = -\mathbb{E}_{x_t \sim \mathcal{D}_t}\big[\log D(x_t)\big]$$

Minimizing $\mathcal{L}_d$ ensures that the domain discriminator can distinguish target data from source data based on their prediction entropy. To reduce the distribution shift across domains, we adversarially train the feature extractor with the GAN loss [11] as follows:

$$\mathcal{L}_{adv} = -\mathbb{E}_{x_t \sim \mathcal{D}_t}\big[\log\big(1 - D(x_t)\big)\big]$$

Minimizing $\mathcal{L}_{adv}$ with the feature extractor aligns target data towards their corresponding class prototypes in the embedding space, decreasing their prediction entropy to confuse the domain discriminator. As class prototypes are the representatives of each class in the embedding space, they are generally more reliable and less noisy than individual source data points; thus, aligning target data towards the less noisy class prototypes is beneficial in the WSDA scenario. In addition, our adaptation method maps target data towards the source data clusters at the cluster level, taking the semantic structure of the source data into consideration; thus our method alleviates the class misalignment problem.
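The entropy-based discriminator and the two adversarial losses can be sketched as follows; the function names and exact loss forms are our reconstruction from the text rather than verified code:

```python
# A sketch of the entropy-based domain discriminator and both adversarial
# losses; function names and exact loss forms are our reconstruction.
import math
import torch


def domain_discriminator(probs: torch.Tensor) -> torch.Tensor:
    # probs: (B, K) class posteriors from the prototype classifier.
    num_classes = probs.size(1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)
    return entropy / math.log(num_classes)  # normalized entropy in [0, 1]


def discriminator_loss(target_probs: torch.Tensor) -> torch.Tensor:
    # Cross entropy on target data only: push D(x_t) towards 1 ("target").
    return -torch.log(domain_discriminator(target_probs) + 1e-8).mean()


def adversarial_loss(target_probs: torch.Tensor) -> torch.Tensor:
    # GAN-style loss for the feature extractor: push D(x_t) towards 0,
    # i.e. decrease target prediction entropy to confuse the discriminator.
    return -torch.log(1.0 - domain_discriminator(target_probs) + 1e-8).mean()
```

Note that the "discriminator" here has no parameters of its own: it is a deterministic function of the classifier's posteriors, so $\mathcal{L}_d$ effectively trains the classifier/prototypes while $\mathcal{L}_{adv}$ trains the feature extractor.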
In the previous subsections, we have introduced the different components of our network. In this subsection, we combine these components into our final network and provide a simple and effective algorithm to train the network end to end. First, to remove noisy source data, we propose a weighting scheme based on the Gaussian mixture noise model. Denoting $d_i = \sqrt{d(f_i, c_{y_i})}$ as the Euclidean distance of source data point $x_i$ to its class prototype in the embedding space, the weight for $x_i$ is defined as follows:

$$w_i = \mathbb{1}\big(p(k = 1 \mid d_i) > \tau\big)\, p(k = 1 \mid d_i) \qquad (7)$$

where $\mathbb{1}(\cdot)$ is the indicator function, i.e., it returns 1 when the condition inside the brackets is true and 0 otherwise, and $\tau$ is a threshold hyper-parameter within the interval $[0, 1]$. We weight the supervision loss for source data as follows:

$$\mathcal{L}_w = -\frac{1}{n_s} \sum_{i=1}^{n_s} w_i \log p(y = y_i \mid x_i)$$

The weighting scheme only selects source data points whose probability of being clean is larger than $\tau$ for training, and it linearly scales this probability as the weight of each selected source data point, so that source data with a higher probability of being clean receive larger weights.

We present the network architecture of our model in Fig. 3. [Figure 3: The overview of the NTDA network architecture for WSDA with unsupervised noise removal and cluster-level adversarial adaptation. The network consists of a feature extractor, a domain discriminator, a label classifier and an unsupervised noise remover. Blue represents the source domain and green the target domain. In the embedding space, red represents noisy source data, different shapes represent different classes, and the "X" shape represents feature noise data.] The overall objective function to train the network is defined as follows (Eqn. 10 for the feature extractor and classifier, Eqn. 11 for the domain discriminator):

$$\min_{F,\,C}\ \mathcal{L}_w + \lambda_1 \mathcal{L}_r + \lambda_2 \mathcal{L}_{adv} \qquad (10)$$

$$\min_{D}\ \mathcal{L}_d \qquad (11)$$

where $\lambda_1$ and $\lambda_2$ are two trade-off hyper-parameters. We present the algorithm to train the network in Alg. 1. Our algorithm warms up the network for $E_w$ epochs to ensure that the network learns some simple patterns first for unsupervised noise modeling. Note that $E_w$ can be chosen by inspecting the distance distribution of source data, as shown in Fig. 2. After warming up, our algorithm alternately performs unsupervised noise removal and cluster-level adversarial adaptation: it models the distance distribution of source data with a Gaussian mixture model and calculates weights for the source data with Eqn. 7, then trains $F$, $C$ and $D$ on Eqn. 10 and Eqn. 11 in mini-batches after removing source data with weight 0. As our network trains mostly on clean source data, clean source data are drawn closer and closer to the class prototypes in the embedding space, which further separates the distance distributions of clean and noisy source data and makes our unsupervised noise model more accurate at selecting clean source data. This positive cycle between the unsupervised noise model and the network greatly boosts the performance of our method.
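Putting the pieces together, a high-level sketch of the training loop in Alg. 1 could look like the following. The helpers `compute_source_distances` and `weighted_ce`, the warm-up length, and the two-optimizer treatment of the shared discriminator/classifier parameters are our own assumptions:

```python
# A high-level sketch of the training procedure (Alg. 1), reusing the
# hypothetical helpers from the sketches above. compute_source_distances,
# weighted_ce, the warm-up length and the optimizer split are assumptions.
import torch


def train_ntda(model, source_loader, target_loader, opt_main, opt_disc,
               num_epochs=100, warmup_epochs=10, tau=0.5, lam1=0.5, lam2=1.0):
    for epoch in range(num_epochs):
        weights = None  # warm-up phase: keep all source data
        if epoch >= warmup_epochs:
            # Unsupervised noise removal: fit the GMM to source distances,
            # then compute the weights of Eqn. 7 (0 for likely-noisy data).
            dists = compute_source_distances(model, source_loader)
            p_clean = clean_probabilities(dists)
            weights = (p_clean > tau) * p_clean
        for (xs, ys, idx), xt in zip(source_loader, target_loader):
            # idx: numpy indices of the source samples in this mini-batch.
            w = None if weights is None else torch.as_tensor(weights[idx])
            probs_s, feats_s = model(xs)  # model returns (posteriors, features)
            probs_t, _ = model(xt)
            # Eqn. 10: weighted supervision + compact regularizer + GAN loss.
            # A faithful implementation would restrict the adversarial term
            # to the feature extractor (e.g. gradient reversal); elided here.
            loss = (weighted_ce(probs_s, ys, w)
                    + lam1 * compact_regularizer(feats_s, model.prototypes, ys)
                    + lam2 * adversarial_loss(probs_t))
            opt_main.zero_grad(); loss.backward(); opt_main.step()
            # Eqn. 11: train the entropy-based discriminator, i.e. the
            # classifier/prototype side, to raise target prediction entropy.
            loss_d = discriminator_loss(model(xt)[0])
            opt_disc.zero_grad(); loss_d.backward(); opt_disc.step()
```

The two separate optimizer steps reflect the adversarial game: Eqn. 10 updates the feature extractor (and the classifier for the supervised terms), while Eqn. 11 updates the entropy-based discriminator.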
Since these three datasets (Office-31, Office-Home and COVID-19) are clean, we create corrupted versions of them following the protocol in [29, 48]. In particular, we create noisy source data from the original clean data in three different ways: label corruption, feature corruption and mixed corruption. For label corruption, we change the label of each image uniformly to a random class with probability $p_{noise}$. For feature corruption, each image is corrupted by Gaussian blur and salt-and-pepper noise with probability $p_{noise}$. For mixed corruption, each image is processed by label corruption and feature corruption, each with probability $p_{noise}/2$, independently. We term $p_{noise}$ the noise level of a domain adaptation task. In all the experiments, we use the noisy data as the source domain and clean data as the target domain.

The Bing-Caltech [2] dataset consists of the Bing dataset and the Caltech-256 dataset. The Bing dataset contains images retrieved by Bing image search for each of the Caltech-256 categories. Apart from the statistical differences between Bing images and Caltech images, the Bing dataset contains rich noise, with multiple objects in the same image. We use the Bing dataset as the noisy source domain and Caltech-256 as the clean target domain. While the experiments on Office-31, Office-Home and COVID-19 use manually synthesized noise, the experiments on Bing-Caltech report the performance of real-world weakly supervised domain adaptation.

Implementation Details. We adopt the 50-layer ResNet [13] as the feature extractor for all experiments on general images and MobileNet-V2 [27] as the feature extractor for all experiments on medical images. The hyper-parameter $\alpha$ is set to 10, $\lambda_1$ to 0.5, $\lambda_2$ to 1, and $\tau$ to 0.5. We employ SGD with weight decay 5e-4 to train the network. All experiments are repeated three times and we report the average result. For fair comparison, we report baseline results directly from the original papers when the experiment setting is the same, and re-implement the methods to follow our setting when it differs.

[Table 2: Classification Accuracy (%) on Office-Home with 40% mixed corruption and Bing-Caltech with native noise.]

Performance Comparison on General Images. We present the results on Office-31 under 40% label corruption, feature corruption and mixed corruption in Table 1, and the results on Office-Home under 40% mixed corruption and on Bing-Caltech in Table 2. Our method outperforms all the baseline methods on almost all tasks and significantly pushes forward the state-of-the-art performance on Office-31, improving the average accuracy by 6.8%, 2.4% and 2.9% on the label corruption, feature corruption and mixed corruption tasks respectively compared to the second best. It improves the average accuracy on Office-Home by 5.2% and the accuracy on Bing-Caltech by 1.3% compared to the second best. For some tasks, the improvement in accuracy is more than 10%. More specifically, NTDA performs better than the state-of-the-art WSDA methods TCL [29] and CoUDA [48], indicating the effectiveness of our method for transferring knowledge from a noisy source domain to an unlabeled target domain. NTDA also outperforms state-of-the-art UDA methods by a large margin, indicating that existing UDA methods indeed suffer from noise in the source domain; it is thus necessary to devise methods that counter the negative effects of noisy data.

Performance Comparison on Medical Images. Following [48], we apply 10% label corruption to the source domain of the COVID-19 dataset and present the results in Table 3. We use Accuracy (Acc), Macro Precision (MP), Macro Recall (MR) and Macro F1-measure as metrics. Our method significantly outperforms the baseline methods on all metrics. More specifically, NTDA outperforms CoUDA [48], the state-of-the-art method for this task. The results demonstrate the applicability of our method to cross-domain medical image analysis. Given that medical annotations are subjective and contain noise, NTDA offers an appealing solution to the problem.

Ablation Study. Removing noisy source data and reducing the distribution shift across domains both boost the target performance significantly, improving the accuracy by 10% and 18.3% respectively. NTDA combines unsupervised noise removal and cluster-level adversarial adaptation to achieve the best performance. Note that NTDA (w/o UNR) also shows quite good performance, better than the state-of-the-art WSDA methods TCL [29] and CoUDA [48] on the same task. This is because our CAA method by itself is also robust to source data noise to some extent. Fig. 5 shows the average classification results on Office-31 with mixed corruption under various noise levels. As the noise level increases, the performance of all comparison methods degrades rapidly, while NTDA is more stable and provides much better performance, which indicates that our method can handle various scenarios of weakly supervised domain adaptation. In addition, NTDA performs as well as the state-of-the-art domain adaptation method CDAN+E [18] even when the noise level is 0, indicating that our method is also applicable in the standard domain adaptation scenario.

Unsupervised Noise Modeling Quality. To test how well our unsupervised noise modeling method selects clean data, we compare it against the transferable curriculum in [29]. We use precision and recall as metrics: precision measures the fraction of clean data among the selected data, while recall measures the fraction of the total clean data that is actually selected. Table 5 presents the precision and recall values of the transferable curriculum and our method on Office-31 with 40% mixed corruption.
On all tasks, NTDA provides higher precision and recall than the transferable curriculum. On average, NTDA selects 97.1% of the clean data for training the network, and among the selected data only 3% are noisy; this is significantly better than the transferable curriculum, which selects only 92.5% of the clean data with 11.2% noisy data among the selected. Our method can thus select almost all clean data and remove almost all noisy data for training.

Feature Visualization. Fig. 4 presents the t-SNE visualizations of the feature embeddings of DANN, CDAN+E, TCL, CoUDA and NTDA on task A→W with 40% mixed corruption. Fig. 4(a)-(e) show the target feature embeddings by class. NTDA's embeddings are more compact and discriminative, while the other methods' embeddings scatter and mix up among classes. Fig. 4(f)-(j) show both the source and target feature embeddings by domain. NTDA aligns the target features very well with the source data, while the other methods align the target features less well and mismatch the decision boundaries between the two domains. For DANN and CDAN+E, noisy source data degrade the feature embeddings. For TCL and CoUDA, as they align the marginal distribution across domains without considering the fine-grained semantic structure in the embedding space, their feature embeddings suffer from the class misalignment problem. These results validate the effectiveness of our method.

Hyper-parameter Sensitivity. Fig. 6(a) shows the sensitivity analysis of NTDA with respect to hyper-parameter $\lambda_1$ under 40% label corruption, feature corruption and mixed corruption. In general, NTDA performs stably when $\lambda_1$ is small, i.e., smaller than 0.5. When $\lambda_1$ becomes larger than 0.5, the performance of our method degrades. This is because $\lambda_1$ controls the strength of the compact regularizer that minimizes the distances between data points and their class prototypes; when $\lambda_1$ becomes too large, it tends to map all data points and class prototypes to a single point in the embedding space, making the performance much worse. Fig. 6(b) shows the sensitivity analysis with respect to $\lambda_2$ under the same corruptions. In general, NTDA performs stably as $\lambda_2$ changes: when $\lambda_2$ is within the interval [0.5, 8], the change in accuracy is within 2% for label corruption and mixed corruption, and within 5% for feature corruption.

Sample Analysis. Fig. 7 shows the sample analysis of NTDA for cross-domain COVID-19 diagnosis under 10% label corruption. Although the task of identifying pneumonia cases differs from that of identifying COVID-19 cases, pneumonia cases share some similar characteristics with COVID-19 cases [48]. There also exist distribution differences between the two datasets, e.g., differences in image color and in the artifacts displayed in the images. Despite these differences and the fact that the annotations in the source data are noisy, NTDA still provides high prediction probabilities on the correct class for target images, which demonstrates the effectiveness of our method when applied to medical image analysis.

In this paper, we study the under-explored Weakly Supervised Domain Adaptation (WSDA) problem for multimedia analysis. WSDA is a promising research area in view of its potential to significantly reduce the annotation cost of deep learning.
We identify several issues with existing WSDA methods and propose NTDA, a novel and effective method with unsupervised noise removal and cluster-level adversarial adaptation to alleviate the adverse effect of noisy data during domain adaptation. We conduct extensive experimental evaluation using four public datasets covering both general image analysis and medical image analysis, and the results show that our new method significantly outperforms existing methods.

References

[1] A Closer Look at Memorization in Deep Networks.
[2] Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach.
[3] Pattern recognition and machine learning.
[4] Semi-supervised classification by low density separation.
[5] Progressive feature alignment for unsupervised domain adaptation.
[6] Maximum likelihood from incomplete data via the EM algorithm.
[7] ImageNet: A large-scale hierarchical image database.
[8] Cluster alignment with a teacher for unsupervised domain adaptation.
[9] An Implementation of a DASH Client for Browsing Networked Virtual Environment.
[10] Unsupervised domain adaptation by backpropagation.
[11] Generative adversarial nets.
[12] Co-teaching: Robust training of deep neural networks with extremely noisy labels.
[13] Deep residual learning for image recognition.
[14] The unreasonable effectiveness of noisy data for fine-grained recognition.
[15] Interpretable multimodal retrieval for fashion products.
[16] Butterfly: A panacea for all difficulties in wildly unsupervised domain adaptation.
[17] Learning transferable features with deep adaptation networks.
[18] Conditional adversarial domain adaptation.
[19] Unsupervised domain adaptation with residual transfer networks.
[20] Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation.
[21] MLCask: Efficient Management of Component Evolution in Collaborative Data Analytics Pipelines.
[22] Virtual adversarial training: a regularization method for supervised and semi-supervised learning.
[23] SINGA: A distributed deep learning platform.
[24] A survey on transfer learning.
[25] Adapting visual category models to new domains.
[26] Maximum classifier discrepancy for unsupervised domain adaptation.
[27] MobileNetV2: Inverted residuals and linear bottlenecks.
[28] A DIRT-T Approach to Unsupervised Domain Adaptation.
[29] Transferable Curriculum for Weakly-Supervised Domain Adaptation.
[30] Deep CORAL: Correlation alignment for deep domain adaptation.
[31] Domain adaptation with conditional distribution matching and generalized label shift.
[32] Adversarial discriminative domain adaptation.
[33] Deep domain confusion: Maximizing for domain invariance.
[34] Deep hashing network for unsupervised domain adaptation.
[35] Multimodal and crossmodal representation learning from textual and visual features with bidirectional deep neural networks for video hyperlinking.
[36] Classes Matter: A Fine-grained Adversarial Approach to Cross-domain Semantic Segmentation.
[37] Effective multi-modal retrieval based on stacked auto-encoders.
[38] Effective deep learning-based multi-modal retrieval.
[39] Online asymmetric metric learning with multi-layer similarity aggregation for cross-modal retrieval.
[40] Learning from massive noisy labeled data for image classification.
[41] Robust classification with convolutional prototype learning.
[42] How does disagreement help generalization against label corruption.
[43] Label-noise robust domain adaptation.
[44] Central moment discrepancy (CMD) for domain-invariant representation learning.
[45] Understanding deep learning requires rethinking generalization.
[46] Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation.
[47] COVID-DA: Deep Domain Adaptation from Typical Pneumonia to COVID-19.
[48] Collaborative unsupervised domain adaptation for medical image diagnosis.

Acknowledgments

We thank the anonymous reviewers for their constructive comments, and our NUS colleagues and Beng Chin Ooi for their comments and contributions. This research is supported by the Singapore Ministry of Education Academic Research Fund Tier 3 under MOE's official grant number MOE2017-T3-1-007. Meihui Zhang's work is supported by the National Natural Science Foundation of China (62050099).