title: Learning to Match Distributions for Domain Adaptation
authors: Yu, Chaohui; Wang, Jindong; Liu, Chang; Qin, Tao; Xu, Renjun; Feng, Wenjie; Chen, Yiqiang; Liu, Tie-Yan
date: 2020-07-17

When the training and test data are from different distributions, domain adaptation is needed to reduce dataset bias and improve the model's generalization ability. Since it is difficult to directly match the cross-domain joint distributions, existing methods tend to reduce the marginal or conditional distribution divergence using predefined distances such as MMD and adversarial-based discrepancies. However, it remains challenging to determine which method is suitable for a given application, since these methods are built with certain priors or biases and may therefore fail to uncover the underlying relationship between transferable features and joint distributions. This paper proposes Learning to Match (L2M) to automatically learn the cross-domain distribution matching without relying on hand-crafted priors on the matching loss. Instead, L2M reduces the inductive bias by using a meta-network to learn the distribution matching loss in a data-driven way. L2M is a general framework that unifies task-independent and human-designed matching features. We design a novel optimization algorithm for this challenging objective with self-supervised label propagation. Experiments on public datasets substantiate the superiority of L2M over SOTA methods. Moreover, we apply L2M to transfer from pneumonia to COVID-19 chest X-ray images with remarkable performance. L2M can also be extended to other distribution matching applications; in a trial experiment we show that L2M generates more realistic and sharper MNIST samples.

Traditional machine learning generally assumes that training and test data are from the same distribution. In reality, this i.i.d. assumption barely holds. When an algorithm is trained on one domain and then tested on another domain, the performance is likely to drop due to the different data distributions [1]. Since the collection of massive labeled data is expensive and time-consuming, a more promising approach is to perform domain adaptation (DA) to enable the consistent performance of a predictive function on different domains. The core challenge of DA is to match the cross-domain joint distributions [2]. However, the labels on the target domain are often unavailable in unsupervised DA. Therefore, a trend is to approximately match the joint distributions by matching the marginal and conditional distributions, as theoretically verified in [2, 3]. Existing approaches achieve this goal by learning a domain-invariant representation, either by minimizing predefined distribution distances such as MMD [4-7], or by minimizing an implicit discrepancy through an adversarial min-max game [8-11]. Recent works suggest that in addition to jointly matching these two distributions with equal weights [12], an adaptive weighting scheme is necessary to achieve better distribution matching performance [7, 13-15]. Unfortunately, it remains challenging to apply DA to new applications. Existing methods are built with their own priors and inductive bias in approximating the joint distribution matching, which may fail to uncover the underlying relationship between transferable features and joint distributions [16].
For instance, MMD [17] may not be discriminative enough for high-dimensional data, and the Jensen-Shannon divergence is not sensitive to mode collapse [18-21]. A recent work, Learning to Transfer (L2T) [22], aims to reduce such bias by learning the "transfer experience" from thousands of precomputed tasks before applying it to new problems. However, L2T needs to build historical tasks from large auxiliary datasets, which is expensive and burdensome. Since deep learning makes it possible to learn features directly from the original datasets, can we design an automatic distribution matching strategy in a data-driven way?

In this work, we propose a Learning to Match (L2M) framework to automatically match the cross-domain distributions while reducing the inductive bias on matching functions. Stepping back from hand-crafted and predefined distances, we construct a meta-network to learn the distribution matching functions directly from the source and target domains. The meta-network is an MLP, which is theoretically a universal approximator for almost any continuous function [23]. We design a novel matching feature generator for L2M, where both task-independent and human-designed matching features can be taken as inputs to the meta-network for better distribution matching. Therefore, L2M can be seen as a general framework that unifies deep features and human-crafted features (predefined distances) from the view of traditional vs. deep learning. Since it is challenging to optimize L2M given the unavailability of target domain labels, we propose to construct and update meta-data in a self-supervised manner [42] for learning the distribution matching loss. On the basis of the matching features and meta-data, we propose an online optimization algorithm for L2M which achieves accurate and steady performance. Experiments show that L2M outperforms several state-of-the-art methods on public DA datasets. L2M is a general and flexible framework that can be used in other cross-domain tasks. We apply L2M to COVID-19 X-ray image classification by transferring knowledge from normal pneumonia to COVID-19, where L2M outperforms other methods in this data-hungry and imbalanced task. As an extension, L2M can be used for generating more realistic and sharper hand-written digits. The code of L2M will be released soon at https://github.com/jindongwang/transferlearning/tree/master/code/deep/Learning-to-Match.

Transfer learning and domain adaptation. Domain adaptation (DA) is a specific area of transfer learning [24]. Existing works tend to explicitly or implicitly reduce the distribution divergence. The explicit distances are predefined divergences, such as Maximum Mean Discrepancy (MMD) [17], KL or JS divergence, cosine similarity, mutual information, and higher-order moments [25], which are well investigated in recent DA works [4-7, 26]. Optimal transport (OT) is another popular measure for distribution matching [11, 27-30]. There are other geometrical distances or transforms such as GFK [31] and subspace learning [32, 33]. The implicit distance indirectly bridges the distribution gap through adversarial nets [34] or learnable metrics [35]. GAN-based DA methods learn domain-invariant features by confusing the feature extractor and discriminator [8-10], while metric learning [35] focuses on the sample-wise distance.
Recent research implies performance improvements by adding more priors to the matching strategy, such as adaptive weights between the marginal and conditional distributions [7, 14, 15] with weights generated by the A-distance [2]. Learning to Transfer (L2T) [22] is similar to our idea in spirit. However, L2T has to manually construct thousands of transfer tasks to learn a linear transformation matrix using MMD, while L2M does not rely on historical tasks and learns non-linear feature maps, which is more efficient and general. There are several works aiming at bridging two domains by normalization, such as BN [36], AutoDIAL [37], AdaBN [38], and TransNorm [39], which do not focus on directly learning the cross-domain joint distributions.

Distribution matching. Generative adversarial nets (GANs) [34] match distributions between training and generated samples by iteratively training a domain discriminator and a generator that tries to confuse the discriminator. L2M is model-agnostic: it can be applied in an adversarial manner by adopting GAN-based schemes such as DANN [8], or it can work without a GAN. Pixel-level DA [40] learns the distribution matching in pixel space.

Figure 1: The framework and computing flow of the proposed L2M approach.

We can decompose h_φ into a feature extractor f_φ and a classification layer G_y, where f_φ is explicitly parameterized by φ since it is more important for domain-invariant representation learning. Under the principle of structural risk minimization (SRM) [41], the optimal model parameter can be learned as:

$$\phi^{*} = \arg\min_{\phi}\, \mathcal{L}_{cls}(\mathcal{D}_s; \phi) + \lambda\, \mathcal{L}_{match}(\mathcal{D}_s, \mathcal{D}_t; \phi), \quad (1)$$

where L_cls is the classification loss on the source domain, L_match is the distribution matching loss, and λ is a trade-off parameter. It is challenging to directly match the cross-domain joint distributions since the labels for the target domain are not available. Therefore, existing methods tend to approximate L_match using different priors. For instance, if we let L_match = d(D_s, D_t), where d is a predefined distance such as MMD [4], we get explicit distribution matching. If L_match = D(D_s, D_t), where D is an adversarial discriminator [8], we get implicit distribution matching. In a nutshell, the main difference among existing works is the design of the explicit or implicit L_match.

In this paper, we propose L2M to automatically match the distributions across domains. The core of L2M is a meta-network that learns the distribution matching in a data-driven manner. To be specific, the meta-network is a Multi-Layer Perceptron (MLP) that can approximate any continuous function [23]. Therefore, L2M can learn the distribution matching loss directly from the source and target domains:

$$\mathcal{L}_{match} = g(\mathcal{D}_s, \mathcal{D}_t; \theta), \quad (2)$$

where g(·) is the distribution matching function (network) parameterized by θ. This formulation is a general form that can theoretically include existing predefined distances. With the meta-network g_θ, we build a general framework as shown in Fig. 1(a). The architecture consists of four parts: the feature extractor f_φ, the label classifier G_y, the meta-network g_θ, and the matching feature generator. Specifically, f_φ is a CNN that extracts the features of the input domains, G_y is trained to minimize the prediction loss on the labeled (source) domain, and g_θ is an MLP used to match the cross-domain distributions, i.e., to learn L_match. The most important part is the matching feature generator, which generates useful inputs to the meta-network g_θ.
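To make the architecture concrete, the following is a minimal PyTorch sketch of the three learnable components (f_φ, G_y, and g_θ). The module names, the choice of backbone, and the output aggregation of the meta-network are illustrative assumptions rather than the authors' exact implementation, although the supplementary material does report a d-1024-1024-1 meta-network.

```python
import torch
import torch.nn as nn
from torchvision import models

class FeatureExtractor(nn.Module):
    """f_phi: CNN backbone mapping images to feature embeddings."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the FC layer
        self.out_dim = 2048

    def forward(self, x):
        return self.backbone(x).flatten(1)       # (batch, 2048)

class MetaNetwork(nn.Module):
    """g_theta: MLP mapping matching features to a scalar matching loss
    (d-1024-1024-1 structure; averaging over the batch is an assumption)."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1),
        )

    def forward(self, matching_features):
        return self.net(matching_features).mean()

num_classes = 65                                 # e.g., Office-Home
f_phi = FeatureExtractor()
G_y = nn.Linear(f_phi.out_dim, num_classes)      # label classifier
g_theta = MetaNetwork(in_dim=f_phi.out_dim + 2)  # e.g., embeddings plus two distance values
```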
For a general framework that allows both deep and human-designed features, we concatenate the deep features (direct link) with the human-designed distances (the green module D) via the concatenation module ∪. The matching features can then be taken as inputs by the meta-network g_θ to learn the distribution matching functions. Our learning objective is in accordance with the DA theory proposed in [2]: it directly learns to reduce the distribution divergence between domains, such that the risk on the target domain can be bounded. The learning objective of L2M can be written as:

$$\phi^{*} = \arg\min_{\phi}\, \mathcal{L}_{cls}(\mathcal{D}_s; \phi) + \lambda\, g_\theta\!\left(\mathcal{F}(\mathcal{D}_s, \mathcal{D}_t; \phi)\right), \quad (3)$$

where F denotes the matching features. However, it remains challenging to optimize this objective for three reasons. Firstly, what kind of matching features should we take as inputs to the meta-network g_θ for better distribution matching? Secondly, since we only have a labeled source domain and an unlabeled target domain, how can we compute the distribution matching loss L_match without target domain labels? Thirdly, even if we have the matching features and the optimization data, we cannot use a simple EM algorithm for optimization, since updating L_cls and L_match on the same training data would lead to overfitting and poor local optima. Therefore, optimizing L2M is non-trivial. In the next sections, we introduce how to tackle these three challenges.

The matching feature generator generates useful representations as inputs to the meta-network g. We use F to denote the matching features. Technically, F can be any useful representation. In this paper, we propose two kinds of matching features, as shown in Table 1: (1) task-independent features, which are general and can be automatically computed by the main network f_φ, as shown in the direct link of Fig. 1(a); (2) human-designed distances (features), as indicated in the green module D in Fig. 1(a), which are predefined distances such as MMD or an adversarial game. These features can either be used alone or be concatenated by the concatenation module ∪. In later experiments, we find that combining the two kinds of features, which amounts to combining deep and human-designed features, generally leads to better performance. More details of the matching features are in the supplementary file.

We introduce the idea of "meta-data". Since directly computing the distribution matching loss L_match is hard due to the unavailability of target labels, we instead use the meta-data D_meta. To be more specific, $\mathcal{D}_{meta} = \{\boldsymbol{x}_j^t\}_{j=1}^{m \times C} \sim P_t(\boldsymbol{x}, \hat{y})$, where ŷ is the predicted (pseudo) label on the target domain. In each iteration, we randomly sample m instances per class with high prediction scores calculated by the main network and take their pseudo labels as the ground truth of the meta-data. This selection is iterated throughout the learning process for better performance. The pseudo labels of the meta-data become more confident over time since the meta-data are chosen from the target domain data with the highest prediction probabilities. This assumption is validated in earlier works [7, 10] and can also be seen as a self-supervised technique [42]. Therefore, the matching loss is calculated on the training data (L_match = L_match^(train)) when updating the main network f_φ, while the meta-network g_θ is updated on the meta-data.
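The meta-data construction just described can be sketched as below. The function and variable names are our own, and the 0.8 confidence threshold follows the value reported later in the paper; the exact selection code is an assumption.

```python
import torch

@torch.no_grad()
def build_meta_data(f_phi, G_y, target_loader, m=5, num_classes=65,
                    threshold=0.8, device="cuda"):
    """Select m high-confidence target samples per class as pseudo-labeled meta-data
    (a sketch of the selection rule described above, not the authors' exact code)."""
    f_phi.eval(); G_y.eval()
    xs, confs, preds = [], [], []
    for x_t, _ in target_loader:                        # target labels are never used
        p = torch.softmax(G_y(f_phi(x_t.to(device))), dim=1)
        conf, y_hat = p.max(dim=1)
        xs.append(x_t); confs.append(conf.cpu()); preds.append(y_hat.cpu())
    xs, confs, preds = torch.cat(xs), torch.cat(confs), torch.cat(preds)

    meta_x, meta_y = [], []
    for c in range(num_classes):
        idx = ((preds == c) & (confs >= threshold)).nonzero(as_tuple=True)[0]
        idx = idx[confs[idx].argsort(descending=True)][:m]   # keep the top-m most confident
        meta_x.append(xs[idx]); meta_y.append(preds[idx])
    return torch.cat(meta_x), torch.cat(meta_y)
```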
In this paper, we propose an online updating algorithm for L2M; Fig. 1(b) illustrates the key learning steps. It should be noted that the data for updating φ and θ are different: when updating φ, we use the normal training data from the source domain to calculate the cross-entropy loss; when updating θ, we use the source domain and the pseudo-labeled target domain meta-data. The learning procedure of L2M consists of two main steps: the main network update and the meta-network update. In the following, we use t to denote the learning step.

Main network update. This step updates φ of the main network. To enable the update of θ in the next step, we construct an assist model, which is a copy of the main model that inherits the architecture and parameters of (f_φ, G_y, g_θ), and use it for calculating the loss. We employ SGD to optimize the classification loss L_cls and the distribution matching loss L_match. L_cls can be formulated as:

$$\mathcal{L}_{cls} = \frac{1}{|B_s|}\sum_{(\boldsymbol{x}_i,\, y_i) \in B_s} \ell_{CE}\!\left(G_y(f_\phi(\boldsymbol{x}_i)),\, y_i\right), \quad (4)$$

where ℓ_CE is the cross-entropy loss and B_s denotes a mini-batch sampled from D_s. The distribution matching loss L_match is calculated by the meta-network g_θ:

$$\mathcal{L}_{match} = g_\theta\!\left(\mathcal{F}(B_s, B_t)\right), \quad (5)$$

where B_t is a mini-batch sampled from D_t. Note that this step does not need the meta-data from the target domain, since we only sample a batch of source and target domain inputs x and do not need the target domain labels; therefore we do not update the matching loss in this step. After obtaining the training loss, the update of the copied main model is obtained by moving the current φ(t) along the descent direction of the objective in Eq. (3):

$$\phi(t+1) = \phi(t) - \alpha \nabla_{\phi}\left(\mathcal{L}_{cls} + \lambda\, \mathcal{L}_{match}\right), \quad (6)$$

where α is the learning rate of the assist model.

Meta-network update. This step updates θ of the meta-network g_θ on the meta-data D_meta. Similar to updating φ, updating θ naturally requires a "ground truth" for the distribution matching loss L_match; however, this is not available in UDA problems. To solve this challenge, we employ a self-supervised strategy with the assumption that after one epoch of updating φ(t) to φ(t+1), the distribution matching loss should become smaller as the confidence of the target pseudo labels increases. This pseudo-label assumption is widely adopted in previous DA works [6, 7, 10]. Therefore, this validation loss is computed as the discrepancy between the distribution matching loss at φ(t) and at φ(t+1):

$$\mathcal{L}_{meta}(\theta) = \tanh\!\left(\mathcal{L}_{match}(\mathcal{D}_{meta};\, \phi(t+1), \theta) - \mathcal{L}_{match}(\mathcal{D}_{meta};\, \phi(t), \theta)\right), \quad (7)$$

where tanh(·) is an activation function. Note that we fix φ in this step; minimizing Eq. (7) w.r.t. θ gradually updates the meta-network g_θ. The pseudo labels can be easily obtained by a single forward pass and then selected according to their confidence (softmax probability). To ensure their confidence, we choose the samples with probabilities ≥ 0.8 in our experiments. Denoting β the learning rate of the meta-network g_θ, θ is updated as:

$$\theta(t+1) = \theta(t) - \beta \nabla_{\theta}\, \mathcal{L}_{meta}(\theta). \quad (8)$$

The above two steps are applied iteratively, so the pseudo labels of the meta-data become more confident and all the losses are iteratively minimized. In our experiments, we observe that the network converges within dozens of epochs. The complete algorithm and convergence analysis are presented in the supplementary file. As for inference, L2M is the same as existing DA methods [6, 7, 10, 43]: we simply fix φ and θ and use the main model to perform a single forward pass on the test data.
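The two update steps can be summarized in the simplified PyTorch sketch below. This is our reading of the procedure, not the authors' released code: the `matching_features` helper, the `main.classify` interface, and the final main-model update with the refined θ are assumptions, and in practice the optimizers would be created once outside the training loop.

```python
import copy
import torch
import torch.nn.functional as F

def l2m_iteration(main, g_theta, batch_s, x_t, meta_x, matching_features,
                  lam=1.0, alpha=1e-3, beta=1e-2):
    """One simplified L2M iteration. `main` bundles f_phi and G_y and exposes
    main.classify(x); `matching_features(model, xs, xt)` returns the inputs of g_theta."""
    x_s, y_s = batch_s

    # Step i: copy the main model into an assist model and take one SGD step on it
    # using the current learned matching loss (cf. Eqs. (4)-(6)).
    assist = copy.deepcopy(main)
    opt_assist = torch.optim.SGD(assist.parameters(), lr=alpha, momentum=0.9)
    loss = F.cross_entropy(assist.classify(x_s), y_s) + \
           lam * g_theta(matching_features(assist, x_s, x_t))
    opt_assist.zero_grad(); loss.backward(); opt_assist.step()   # assist now holds phi(t+1)

    # Step ii: update the meta-network on pseudo-labeled meta-data (cf. Eqs. (7)-(8)):
    # reward matching losses that decrease after the one-step update of phi.
    opt_theta = torch.optim.SGD(g_theta.parameters(), lr=beta)
    match_old = g_theta(matching_features(main, x_s, meta_x))    # at phi(t)
    match_new = g_theta(matching_features(assist, x_s, meta_x))  # at phi(t+1)
    meta_loss = torch.tanh(match_new - match_old)
    opt_theta.zero_grad(); meta_loss.backward(); opt_theta.step()

    # Finally, update the real main model with the refined matching loss
    # (our reading of the algorithm; this step is not spelled out verbatim in the paper).
    opt_main = torch.optim.SGD(main.parameters(), lr=alpha, momentum=0.9)
    loss = F.cross_entropy(main.classify(x_s), y_s) + \
           lam * g_theta(matching_features(main, x_s, x_t))
    opt_main.zero_grad(); loss.backward(); opt_main.step()
```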
Datasets. We adopt four public datasets: ImageCLEF-DA [44], Office-Home [45], VisDA-2017 [46], and Office-31 [47], which are widely used by most UDA approaches [7, 9, 10, 43]. Their statistics are presented in the supplementary file. We follow the standard protocols for UDA and take the classification accuracy on the target domain as the evaluation metric; target labels are only used for evaluation. The best parameters are tuned according to [52]. The results are the average accuracy over 10 runs, following the same protocol as [6, 7, 10, 43].

Before using L2M, a natural question is which matching feature should be used for better performance. Moreover, how does the performance of MMD and adversarial discrepancy within L2M compare to existing MMD- or adversarial-based DA methods? To answer these questions, we randomly choose two pairs of DA tasks from the Office-Home dataset (R → A, R → P, and vice versa) to compare L2M with existing distance-based methods (DAN [5] and MEDA [7] use MMD, while DANN [8], CDAN [43], and DAAN [15] use an adversarial-based discrepancy). Technically, all matching features can be combined, which would result in 2^4 = 16 different matching features. For computational reasons, we construct eight matching features: F_emb, F_logit, F_mmd, F_adv, [F_emb, F_mmd], [F_emb, F_adv], [F_logit, F_mmd], and [F_logit, F_adv]. It should be noted that both F_emb and F_logit can be applied to both explicit (deep) and implicit (adversarial) matching networks, leading to ten features in total. In addition, we do not combine three or four features: their performance could naturally be better, but at a higher computational cost. The feature dimensions of each matching feature are presented in the supplementary file.

The comparison results are in Table 2. For better clarification, we compare the performance of the best MMD- and adversarial-based methods in Fig. 2(a), along with the average performance of L2M using these features. More experiments can be found in the supplementary. Firstly, we see that with both explicit and implicit distribution matching, L2M generally achieves competitive performance with different matching features. This verifies that L2M is effective for distribution matching. Secondly, in some cases, the performance of L2M with MMD distances is better than previous adversarial-based methods; since adversarial-based methods require much more training time, this makes L2M+MMD a suitable solution for resource-constrained applications. Thirdly, the performance of L2M with both task-independent features and predefined distances is generally better than using either feature alone, indicating that the common practice of boosting performance by combining deep features (F_emb or F_logit) with human-designed features (F_mmd or F_adv) also holds here. We also observe that L2M with embeddings is generally better than with logits, probably because the embeddings contain richer information than the logits. Therefore, in the subsequent experiments, we adopt [F_emb, F_adv] as a balance between computation and performance. In real applications, more domain-dependent matching features can be added according to domain knowledge.

The results on the Office-Home dataset are shown in Table 3, while the results on the ImageCLEF-DA and VisDA-17 datasets are in Table 4. The results on Office-31 are provided in the supplementary file due to space limits. From the results, we see that L2M outperforms all comparison methods. Specifically, on the ImageCLEF-DA dataset, although the baseline is already very high, L2M still achieves an average accuracy of 89.1% with a 0.6% improvement over the second-best baseline. On the Office-Home dataset, L2M achieves an average accuracy of 69.6% with a 1.5% improvement over the second-best.
The Office-Home dataset is rather complicated and involves more samples and categories, which further indicates the effectiveness of L2M. On the Office-31 dataset, L2M achieves an average accuracy of 89.5%, which is also highly competitive. Last, on the VisDA-17 dataset, which is much larger than the other datasets (280,000+ images), L2M achieves an accuracy of 77.5% with a significant improvement of 2.9%. All these results demonstrate that L2M achieves competitive performance on DA tasks.

Analysis of meta-data. We empirically analyze the batch size m of the meta-data D_meta. A larger m brings more uncertainty, while a smaller m is likely to make the meta-network unstable. We record the performance of L2M using different values of m on several randomly selected tasks in Fig. 2(b). The results indicate that L2M is robust to m and that a small m already leads to competitive performance. Therefore, we set m = 5 in our experiments for computational efficiency.

Distribution discrepancy. The A-distance [2] measures the distribution discrepancy and is defined as $d_A = 2(1 - 2\epsilon)$, where ε is the error of a classifier trained to discriminate the source and target domains. A smaller A-distance indicates better domain-invariant features. Fig. 2(c) shows that L2M achieves a lower d_A, implying a lower generalization error.

Convergence analysis. L2M introduces a meta-network, which may make the training process harder. In this section, we empirically evaluate the convergence of L2M. As shown in Fig. 2(d), the results on a randomly chosen task show that L2M reaches quick and steady convergence within a limited number of iterations. Therefore, L2M can be easily trained.

Other than the public datasets, we apply L2M to a COVID-19 chest X-ray image classification dataset [56], where the source domain task is normal vs. pneumonia and the target domain task is normal vs. COVID-19. Note that this is a class-imbalanced task, which is more challenging and realistic. We use F1, Precision, and Recall as the evaluation metrics for this highly imbalanced binary classification task. As shown in Table 7, L2M achieves better results than fine-tuning and other DA methods. Here we use the 95% confidence interval, where the corresponding value of z is 1.96; the computed confidence interval r is around 1.3%. More details about the dataset, comparison methods, and results are in the supplementary file.

Extending L2M for image generation. We show the potential of L2M in generating MNIST hand-written digits. We use GAN [34] (adversarial distance) and GMMN [57] (MMD distance) as the baselines and replace the MMD module in GMMN with L2M. Hyperparameter settings and training details are in the supplementary. The generated samples are shown in Fig. 3. L2M generates more realistic samples than GAN, and sharper samples than GMMN. This indicates the potential of L2M in image generation. It should be noted that this is only a trial experiment and more effort is needed to achieve SOTA performance on image generation.

Limitations and solutions. L2M requires updating two networks iteratively. Therefore, compared with regular DA methods (e.g., DANN, CDAN, MDD), L2M needs more training time. It is suggested to use a smaller batch size for the meta-data than for the training data to reduce the GPU memory footprint and speed up training. However, the inference time is the same as other methods using the same backbone.
L2M can be made more efficient by adopting knowledge distillation as suggested in meta pseudo labels (MPL) [58], which is left for future research. Additionally, a pre-trained L2M model can be deployed to edge devices to achieve accurate and fast inference.

In this paper, for the first time, we step back from designing distribution matching features according to human knowledge and instead propose L2M to automatically match the cross-domain joint distributions for domain adaptation. Our work shows that by taking diverse matching features, including task-independent features and human-designed distances, L2M can directly learn the distribution matching in a data-driven way. L2M can be seen as a general framework that unifies deep feature learning and human-designed feature learning for better distribution matching. Experiments on public datasets substantiate the superiority of L2M over state-of-the-art approaches on DA and image generation tasks. We apply L2M to a COVID-19 X-ray image adaptation experiment, where it significantly outperforms existing methods in such a highly imbalanced task. We believe that L2M can be helpful in other problems such as domain generalization, open-set DA, and partial transfer learning, which will be the focus of future research.

We also include the framework and key learning steps of L2M here for better illustration. The complete learning procedure of L2M is listed in Algorithm 1.

Algorithm 1: The learning procedure of L2M.
while not converged do
    Build an assist model with its parameters inherited from the main model φ(t).
    Sample mini-batches B_s, B_t from the source and target domains.
    Update φ by step i in Fig. 1(b); the loss consists of L_cls and L_match, and only the assist model φ is updated here (the meta-network θ is only updated in step ii).
    Select the data with the highest prediction confidence from D_t to construct the meta-data D_meta.
    Update the meta-network θ by step ii in Fig. 1(b).
end while
return {φ, θ}

It is worth noting that this optimization is general and can be naturally used in image generation tasks. Hence, we also use the same optimization steps in the MNIST digit generation experiments by injecting this process directly into the GMMN [57] model. Therefore, L2M is a general and flexible framework that can work for most cross-domain distribution matching tasks.

Task-independent matching features. It is natural to use the feature embedding extracted by f_φ as one kind of task-independent feature, denoted as F_emb ∈ R^d, where d is the number of neurons in this layer. For classification tasks, another kind of feature is the network logit F_logit ∈ R^C, which is the activation of the last FC layer before the softmax. Note that, strictly speaking, F_logit is computed by G_y; for symbolic brevity we draw it in the same way as F_emb in Fig. 1(a). Denoting q the function of the last FC layer, these features can be computed as:

$$F_{emb} = f_\phi(\boldsymbol{x}), \qquad F_{logit} = q(f_\phi(\boldsymbol{x})). \quad (1)$$

Human-designed matching features. We adopt two popular distances as human-designed matching features: an explicit distribution matching distance using MMD (F_mmd ∈ R) and an implicit distribution matching distance using adversarial nets (F_adv ∈ R). Their basic idea is to approximate the joint distributions using the marginal and conditional distributions. A recent work, MEDA [7], showed that matching both the conditional and marginal distributions can be useful. Therefore, we denote d_m and d_c the marginal and conditional distances (losses), respectively; a sketch of how these features are assembled is given below.
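The following sketch shows how the task-independent features and a predefined distance could be assembled into the matching features fed to g_θ. The helper name and the exact concatenation scheme are assumptions, chosen to be consistent with the feature dimensions reported in Table 2 of the supplementary material.

```python
import torch

def build_matching_features(f_phi, G_y, x_s, x_t, dists=None, use_logits=False):
    """Assemble matching features F for the meta-network (a sketch).
    F_emb: embeddings from f_phi; F_logit: pre-softmax outputs of G_y;
    `dists` optionally holds human-designed distances (e.g., [d_m, d_c])."""
    emb = torch.cat([f_phi(x_s), f_phi(x_t)], dim=0)      # F_emb, shape (n_s + n_t, d)
    feats = G_y(emb) if use_logits else emb               # F_logit or F_emb
    if dists is not None:                                  # concatenate F_mmd / F_adv
        d = torch.stack(list(dists)).view(1, -1).expand(feats.size(0), -1)
        feats = torch.cat([feats, d], dim=1)               # e.g., [F_emb, F_mmd]
    return feats
```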
These features are built from the corresponding marginal distance d_m and conditional distance d_c. For explicit distribution matching using MMD [17], the marginal and conditional distances can be computed as:

$$d_m = \left\| \frac{1}{|B_s|}\sum_{\boldsymbol{x}_i \in B_s} \phi(\boldsymbol{x}_i) - \frac{1}{|B_t|}\sum_{\boldsymbol{x}_j \in B_t} \phi(\boldsymbol{x}_j) \right\|_{\mathcal{H}_k}^2, \qquad d_c = \frac{1}{C}\sum_{c=1}^{C} \left\| \frac{1}{|B_s^{(c)}|}\sum_{\boldsymbol{x}_i \in B_s^{(c)}} \phi(\boldsymbol{x}_i) - \frac{1}{|B_t^{(c)}|}\sum_{\boldsymbol{x}_j \in B_t^{(c)}} \phi(\boldsymbol{x}_j) \right\|_{\mathcal{H}_k}^2,$$

where H_k is the Reproducing Kernel Hilbert Space (RKHS) induced by kernel k, B^(c) denotes the samples belonging to class c, and φ(·) is some feature mapping function.

For implicit distribution matching using GAN [34], the main idea is to design a domain discriminator G_d to identify which domain the samples come from. We train f_φ and G_y to confuse G_d, so that eventually G_d fails to discriminate the domains. In this situation, the marginal and conditional adversarial distances can be computed respectively as:

$$d_m = \frac{1}{|B_s \cup B_t|}\sum_{\boldsymbol{x}_i \in B_s \cup B_t} \ell_d\!\left(G_d(f_\phi(\boldsymbol{x}_i)),\, d_i\right), \qquad d_c = \frac{1}{C}\sum_{c=1}^{C} \frac{1}{|B_s \cup B_t|}\sum_{\boldsymbol{x}_i \in B_s \cup B_t} \ell_d\!\left(G_d^{c}\big(\hat{y}_i^{c} f_\phi(\boldsymbol{x}_i)\big),\, d_i\right),$$

where ℓ_d is the cross-entropy loss for domain classification, d_i is the domain label (0 or 1) of the input sample x_i, G_d is the domain discriminator, and G_d^c is the class-wise discriminator for class c whose input is weighted by the predicted probability ŷ_i^c. Note that the target domain D_t has no labels, making it difficult to compute the conditional distance d_c. We apply the classifier G_y trained on D_s to D_t to obtain soft labels, which are iteratively refined. Clearly, MMD and the adversarial distance are only two options for the predefined distance, and others can be used. In specific problems, more task-dependent features can be adopted. This makes L2M a general and flexible framework.
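As a concrete illustration of the explicit matching distances above, here is a minimal sketch of the marginal and class-conditional MMD computation. The Gaussian kernel, its bandwidth, and the averaging over classes are assumptions of this sketch rather than the paper's exact choices.

```python
import torch

def gaussian_mmd(a, b, sigma=1.0):
    """Biased estimate of the squared MMD between two samples with a Gaussian kernel."""
    def k(x, y):
        return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

def mmd_distances(feat_s, y_s, feat_t, y_t_pseudo, num_classes):
    """Marginal distance d_m and class-conditional distance d_c (averaged over classes),
    using source labels and target pseudo labels as described above."""
    d_m = gaussian_mmd(feat_s, feat_t)
    d_c = feat_s.new_zeros(())
    used = 0
    for c in range(num_classes):
        fs, ft = feat_s[y_s == c], feat_t[y_t_pseudo == c]
        if len(fs) > 1 and len(ft) > 1:       # skip classes missing from the mini-batch
            d_c = d_c + gaussian_mmd(fs, ft)
            used += 1
    return d_m, d_c / max(used, 1)
```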
In addition to empirically analyzing the convergence of L2M, we provide some theoretical analysis. The convergence of L2M depends on two terms: the classification loss L_cls on the training data and the distribution matching loss L_match on the meta-data. The convergence of L_cls is well ensured since it is a standard cross-entropy loss in deep neural networks. The convergence of L_match depends on two factors: the construction of the meta-data and the loss itself. We adopt an iterative way to construct the meta-data by using the pseudo labels provided by the trained network. According to several recent works [7, 10, 43], the convergence of such iterative pseudo-labeling can be ensured, i.e., the pseudo labels become more accurate, providing strong support for the construction of the meta-data. On the other hand, the convergence of L_match can also be ensured as long as the meta-network g_θ is differentiable (in our work, it is), by following [59, 60]. Therefore, the convergence of L_match can be ensured.

From the viewpoint of domain adaptation theory, L2M is designed by following the DA theory of [2], where the risk on the target domain is bounded by the following theorem.

Theorem 1 Let h ∈ H be a hypothesis, and let ε_s(h) and ε_t(h) be the expected risks on the source and target domains, respectively. Then

$$\epsilon_t(h) \leq \epsilon_s(h) + d_{\mathcal{H}}(p, q) + C_0,$$

where C_0 is a constant accounting for the complexity of the hypothesis class plus the risk of an ideal hypothesis for both domains, and d_H(p, q) refers to the distribution divergence between domains. As can be seen, L2M directly minimizes the distribution distance (distribution matching loss) L_match, which is consistent with the above theorem.

A.3 Remarks

L2M vs. AutoML. L2M shares the same goal as AutoML: both try to reduce human intervention in the machine learning process. However, AutoML focuses more on "auto", while L2M can be seen as a combination of deep features and human-designed features. Moreover, AutoML focuses on architecture design, hyperparameter search, and channel pruning, which are different from L2M. The main goal of L2M is to learn a good and automatic distribution matching between domains. From this point of view, L2M can also be seen as an "automated" DA method. Future work may lay emphasis on domain adaptation architecture design, which is more in the spirit of AutoML.

L2M is not yet another "SOTA" and does not intend to replace other methods. The results in this paper demonstrate that L2M outperforms several SOTA methods. However, our goal is not to deliver yet another SOTA to the community, but to introduce another kind of DA algorithm that can be easily applied to real applications without specific effort on the loss function design and the distribution matching module. Therefore, for a new application, both L2M and existing SOTA methods are applicable. The advantage of L2M is that it requires less human intervention in algorithm selection, and a simple embedding matching feature can already achieve competitive performance. For better results, one still needs deep domain knowledge and can integrate it into L2M together with the embedding or logit features. Therefore, L2M can be used to enhance other methods.

The statistics of these datasets are shown in Table 1. For the different variants of L2M using different matching features, we report the dimensions of the eight matching features on each dataset in Table 2.

Table 2: Dimensions of the matching features on each dataset.
Dataset | F_emb | F_logit | F_mmd | F_adv | [F_emb, F_mmd] | [F_emb, F_adv] | [F_logit, F_mmd] | [F_logit, F_adv]
Office-31 | 2,048 | 31 | 2 | 2 | 2,050 | 2,050 | 33 | 33
ImageCLEF-DA | 2,048 | 12 | 2 | 2 | 2,050 | 2,050 | 14 | 14
Office-Home | 2,048 | 65 | 2 | 2 | 2,050 | 2,050 | 67 | 67
VisDA-2017 | 2,048 | 12 | 2 | 2 | 2,050 | 2,050 | 14 | 14

All methods use the ImageNet-pretrained ResNet-50 as the backbone network. Results of the comparison methods are obtained from their original papers. For L2M, we set the maximum number of iterations to 200,000. Mini-batch SGD with Nesterov momentum 0.9 and batch size 32 is used as the optimization strategy. The learning rate α of the meta-model and the overall model changes by following [8]:

$$\alpha_k = \frac{\alpha}{(1 + \gamma k)^{\upsilon}},$$

where k is the training iteration, changing linearly from 1 to the maximum number of iterations, γ = 0.001, α = 0.004, and the decay rate υ = 0.75. The initial learning rate β of the meta-network is 0.01 and gradually decreases to 0.0001 during training. The meta-network g_θ uses a d-1024-1024-1 structure, where d is the dimension of the input matching features; the dimensions of the different matching features are given in Table 2. We follow the standard protocols for unsupervised domain adaptation [61]: we use the classification accuracy on the target domain as the evaluation metric, and target labels are only used for evaluation. The results are the average accuracy over 10 runs following the same protocol as [6, 7, 10, 43]. We implement L2M in PyTorch and train it on a Linux machine with a 16GB P100 GPU.

Table 3 reports the results on Office-31, which indicate that L2M outperforms all the recent DA methods in classification accuracy. We show more ablation experiments of L2M on Office-Home and ImageCLEF-DA in Table 4 and Table 5, respectively. We did not run ablation experiments on VisDA-17 since this dataset is rather large and needs more computation; the ablation results on the other datasets are sufficient for observing the patterns of the L2M variants. Combining these results with those in the main paper, more insightful conclusions can be drawn. (1) L2M achieves the best performance on multiple datasets, which indicates its efficacy. (2) All four variants of L2M achieve competitive performance, implying the effectiveness of the meta-network for learning matching functions and showing that L2M is able to fit a wide range of matching features.
Despite the performance on these public datasets, we want to emphasize that in real applications, L2M (emb+adv) is not necessarily the best matching feature. Therefore, in order to achieve the best performance, users can try several combinations of matching features along with their own domain experience before finding the most suitable ones. Since the performance of most matching features has low variance, any matching feature can achieve competitive performance compared to existing methods.

We visualize the network activations (before the FC layer) on task P→R using t-SNE in Fig. 2. ResNet-50 does not align the distributions. JAN aligns both the marginal and conditional distributions with equal weights, while MEDA adaptively aligns these two distributions and obtains better results. However, the source and target domains are not fully matched by MEDA. For L2M, both the cross-domain distributions and the categories are aligned well, implying that L2M learns more discriminative features.

Other than benchmarking L2M on popular public datasets including Office-31, Office-Home, ImageCLEF-DA, and VisDA-17, we compare the performance of several DA methods, including L2M, in a real application. Different from the public datasets, this application demonstrates the effectiveness of L2M and other DA methods in a real-world task, which is more appealing and inspiring. We present more details for applying L2M to the COVID-19 chest X-ray image adaptation task. COVID-19 is a specific type of pneumonia compared to common pneumonia, and since there is not much COVID-19 data available, it becomes necessary and feasible to use the sufficient labeled pneumonia data to help classify COVID-19 cases. Therefore, this is a binary classification task: the source domain is the well-labeled pneumonia data used to classify whether a patient has pneumonia or not, and the target domain is the unlabeled COVID-19 data, where our task is to classify whether each target domain sample shows a COVID-19 symptom or not (i.e., normal vs. pneumonia on the source domain, and normal vs. COVID-19 on the target domain). We also note that this dataset is highly imbalanced (as shown in the next section). Therefore, to better illustrate the results, we adopt the F1 score, Recall, and Precision as the evaluation metrics rather than classification accuracy; these metrics are better suited to imbalanced classification tasks. This also demonstrates our contribution that L2M can achieve robust performance in imbalanced tasks compared to other DA methods.

Table 6 shows the description of the dataset. Note that in this task, we use some COVID-19 data as the validation set to better tune the hyperparameters. In the source domain, there are two classes: normal and pneumonia, while the target and validation sets contain the normal and COVID-19 classes. Fig. 3 shows some examples from the source and target domains. Data from the two domains are very similar, especially the pneumonia and COVID-19 classes. Therefore, it is feasible to perform domain adaptation or transfer learning between these two domains.

We mainly compare the performance of L2M with three categories of methods: (1) deep learning baselines, (2) deep diagnostic methods, and (3) unsupervised domain adaptation methods. The deep learning baselines include:
• Train on source: train a network on the source domain, and then apply the pretrained model to the target domain.
• Train on target: this is an oracle setting, since there are no labels for the target domain in our task. We directly use some extra labeled COVID-19 data from the dataset (30% of the target domain data) to train a network, and then make predictions on the target data.
• Fine-tuning: this is a combination of the above two baselines. We first train a network on the source domain, then fine-tune the pretrained model on the extra labeled target domain data, and finally make predictions on the target data.

The deep diagnostic method is DLAD [55]. The unsupervised DA methods are DANN [8], MCD [53], and CDAN+TransNorm [39]. All methods use ResNet-18 as the backbone network, following [56]. The results of these methods are obtained from COVID-DA [56] to ensure a fair comparison. Note that we did not compare against COVID-DA itself, since it is a semi-supervised method that explicitly uses labeled data on the target domain. Therefore, we only use the reported results of the unsupervised methods and train L2M with the same experimental settings.

The results are shown in Table 7. Here we use the 95% confidence interval, where the corresponding value of z is 1.96, and the computed confidence interval r is around 1.3%. This table is the same as in the main paper but with more analysis of the results. From the results, we see that L2M outperforms all comparison methods in terms of F1 score and Recall. In terms of Precision, training on labeled target data achieves the best result, which is reasonable since this approach trains on the labeled target domain data and is expected to achieve the best precision. The UDA methods, namely DANN, MCD, and CDAN+TransNorm, can sometimes achieve worse results than the baselines, indicating that the distribution divergence between pneumonia and COVID-19 data is not easy to capture by an adversarial distance (DANN and MCD use adversarial distances) or by statistical alignment (TransNorm uses a source-target normalization technique), since these methods are built with their own priors and biases. In this situation, it is necessary to perform domain adaptation in a data-driven way, stepping back from these predefined distances. Therefore, L2M can be useful in real-world applications. This also demonstrates our contribution that L2M can achieve robust performance in imbalanced tasks compared to other DA methods. On the other hand, we also notice that the performance of fine-tuning is worse than training on source and training on target, which is probably due to the distribution gap between the source and target domains. In a nutshell, among the baselines and DA methods other than L2M, training on target achieves the best performance, indicating the importance of labeled data. We can also see that in COVID-DA [56], the authors used a semi-supervised setting to improve the F1, Precision, and Recall scores to over 90%, which clearly shows promising performance. Therefore, L2M and other methods could also be applied to semi-supervised DA tasks by adopting several labeled target samples; this is left for future work since this paper mainly focuses on unsupervised DA. We show the ablation study of L2M on this COVID-19 data in Table 8. It shows that by combining different matching features, L2M generally achieves better performance than the comparison methods.

We train GMMNs on the benchmark MNIST dataset [65]. We use the standard test set of 10,000 images and randomly select 5,000 of the standard 60,000 training images for validation.
The remaining 55,000 images are used for training. We train the GMMN network in both the input data space and the code space of an auto-encoder. For all networks, a uniform distribution in [-1, 1]^H is used as the prior for the H-dimensional stochastic hidden layer at the top of the GMMN, which is followed by 4 ReLU layers. The output layer is a logistic sigmoid, which guarantees that the code space dimensions lie in [0, 1]. The auto-encoder has 4 layers, 2 for the encoder and 2 for the decoder. For more details about the architecture of the GMMN and the auto-encoder, please refer to the original paper [57]. We train the GMMNs with mini-batches of size 1,000; for each mini-batch, a set of 1,000 samples is generated from the network, and the loss and gradient are computed from these 2,000 samples. We replace the original square-root loss function L_MMD with the L_match of L2M to obtain GMMN+L2M. We set the maximum number of epochs to 500 and use Adam as the optimizer. The learning rate and momentum for both the GMMN and the auto-encoder, as well as the dropout rate of the auto-encoder, are tuned using Bayesian optimization [66]. Fig. 4 shows more MNIST samples generated by GMMN with MMD and with L2M. It is clear that L2M generates sharper samples than MMD. We believe that L2M has further potential in image generation, although this is only a test experiment. We are well aware that there have been many works on image generation in recent years and hope that L2M can be significantly extended for this task in the future.
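A minimal sketch of how GMMN's hand-crafted MMD loss can be swapped for the learned matching loss g_θ, as described above. The network interfaces, the feature construction, and the omitted meta-update are assumptions of this sketch, not the authors' released code.

```python
import torch

def gmmn_l2m_generator_step(generator, g_theta, real_batch, opt_gen, h_dim=100):
    """One generator update where GMMN's MMD loss is replaced by the learned
    matching loss g_theta (a sketch)."""
    z = torch.rand(real_batch.size(0), h_dim, device=real_batch.device) * 2 - 1  # prior in [-1, 1]^H
    fake = generator(z)

    # Feed generated and real samples (or their auto-encoder codes) to the
    # meta-network and minimize the learned discrepancy w.r.t. the generator.
    match_loss = g_theta(torch.cat([fake, real_batch], dim=0))
    opt_gen.zero_grad()
    match_loss.backward()
    opt_gen.step()
    # g_theta itself is updated with the same self-supervised rule as in the DA
    # setting (comparing the matching loss before and after this step); omitted here.
```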
References

How transferable are features in deep neural networks? In NIPS.
Analysis of representations for domain adaptation.
On learning invariant representation for domain adaptation.
Domain adaptation via transfer component analysis.
Learning transferable features with deep adaptation networks.
Deep domain confusion: Maximizing for domain invariance.
Visual domain adaptation with manifold embedded distribution alignment.
Unsupervised domain adaptation by backpropagation.
Adversarial discriminative domain adaptation.
Bridging theory and algorithm for domain adaptation.
Optimal transport for domain adaptation.
Transfer feature learning with joint distribution adaptation.
Balanced distribution adaptation for transfer learning.
Open set domain adaptation: Theoretical bound and algorithm.
Transfer learning with dynamic adversarial adaptation network.
Support and invertibility in domain-invariant representations.
A kernel two-sample test.
On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions.
On unifying deep generative models.
How (not) to train your generative model: Scheduled sampling, likelihood, adversary?
A note on the evaluation of generative models.
Transfer learning via learning to transfer.
Approximation with artificial neural networks.
A survey on transfer learning.
Central moment discrepancy (CMD) for domain-invariant representation learning.
Deep CORAL: Correlation alignment for deep domain adaptation.
Domain adaptation with regularized optimal transport.
Joint distribution optimal transportation for domain adaptation.
DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation.
Optimal transport in reproducing kernel Hilbert spaces: Theory and applications.
Geodesic flow kernel for unsupervised domain adaptation.
Subspace distribution alignment for unsupervised domain adaptation.
Return of frustratingly easy domain adaptation.
Generative adversarial nets.
Transfer metric learning: Algorithms, applications and outlooks.
Batch normalization: Accelerating deep network training by reducing internal covariate shift.
AutoDIAL: Automatic domain alignment layers.
Adaptive batch normalization for practical domain adaptation.
Transferable normalization: Towards improving transferability of deep neural networks.
Unsupervised pixel-level domain adaptation with generative adversarial networks.
Introduction to statistical machine learning.
Self-supervised visual feature learning with deep neural networks: A survey.
Conditional adversarial domain adaptation.
The ImageCLEF-DA challenge.
Deep hashing network for unsupervised domain adaptation.
VisDA: The visual domain adaptation challenge.
Adapting visual category models to new domains.
Deep residual learning for image recognition.
Deep transfer learning with joint adaptation networks.
Multi-adversarial domain adaptation.
Collaborative and adversarial network for unsupervised domain adaptation.
Towards accurate model selection in deep unsupervised domain adaptation.
Maximum classifier discrepancy for unsupervised domain adaptation.
Generate to adapt: Aligning domains using generative adversarial networks.
COVID-19 screening on chest X-ray images using deep learning based anomaly detection.
Deep domain adaptation from typical pneumonia to COVID-19.
Generative moment matching networks.
MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels.
Meta-Weight-Net: Learning an explicit mapping for sample weighting.
Domain-adversarial training of neural networks.
Unsupervised domain adaptation with residual transfer networks.
Learning disentangled semantic representation for domain adaptation.
Adversarial-learned loss for domain adaptation.
Gradient-based learning applied to document recognition.
Practical Bayesian optimization of machine learning algorithms.