Title: Domain-Aware Contrastive Knowledge Transfer for Multi-domain Imbalanced Data
Authors: Ke, Zixuan; Kachuee, Mohammad; Lee, Sungjin
Date: 2022-04-05
(* Work done as an intern at Amazon Alexa AI.)

In many real-world machine learning applications, samples belong to a set of domains, e.g., for product reviews, each review belongs to a product category. In this paper, we study multi-domain imbalanced learning (MIL), the scenario in which there is imbalance not only in classes but also in domains. In the MIL setting, different domains exhibit different patterns, and there is a varying degree of similarity and divergence among domains, posing both opportunities and challenges for transfer learning, especially when faced with limited or insufficient training data. We propose a novel domain-aware contrastive knowledge transfer method called DCMI to (1) identify the shared domain knowledge to encourage positive transfer among similar domains (in particular from head domains to tail domains); and (2) isolate the domain-specific knowledge to minimize negative transfer from dissimilar domains. We evaluated the performance of DCMI on three different datasets, showing significant improvements in different MIL scenarios.

The majority of existing works in imbalanced learning focus on the class imbalance setting, where classes are presented in a long-tailed distribution: a subset of classes (head classes) have sufficient samples, while other uncommon or rare classes (tail classes) are underrepresented by limited samples. This setting is challenging because the model naturally focuses largely on the majority classes, and there may not be sufficient data for tail classes to recover their underlying distribution. Even though extensive work has been done on the class imbalance problem, the consideration of domains is often missed. In many real-world scenarios, data naturally belongs to a set of domains, e.g., for an online store, a potential domain assignment for each customer review can be defined based on the corresponding store departments. A simplistic solution is to ignore domain assignments and train a single classifier for all domains, which we refer to as domain-agnostic learning (D-AL). D-AL entirely ignores domains and assumes that the model can "automatically" discover the data distribution for each domain and learn them all equally well. The drawbacks of such an approach are obvious: if the training data is sourced from many domains, updating all parameters may lead the model to focus on subsets of the data in proportion to their ease of access or frequency. Moreover, if the data from different domains are dissimilar, agnostic learning may cause undesirable convergence dynamics, i.e., negative transfer. We therefore argue that in multi-domain imbalanced learning (MIL) scenarios, a learning algorithm should consider domain information and leverage it to achieve effective knowledge transfer.

MIL is a challenging problem. First, different domains may have very different numbers of samples and show a long-tailed distribution. For example, an intelligent assistant (e.g., Amazon Alexa) may provide a wide variety of skills, and different skills may vary greatly in their number of examples. It is possible that some internally developed skills (e.g.,
music or weather) have hundreds of thousands of samples, while many third-party developed skills may have fewer than 10 samples in the same dataset (Kachuee et al., 2021). Second, domains may exhibit different semantic similarities and disparities with each other. For instance, a feature may show a positive correlation with a label for certain domains while it is negatively correlated for others. Third, the data-provided domain annotation may not be completely accurate or sufficiently fine-grained. For example, the sentence "Due to software or hardware issues, my computer cannot open my favorite text book, One Hundred Years of Solitude" may belong to both the computers and books domains, while it may have only one domain assignment in the dataset.

Perhaps the most intuitive approach for MIL is multi-task learning (MTL), where separate heads are used for different domains. While MTL considers domains, we will show that it performs poorly in our experiments due to the lack of knowledge transfer between the classifiers. We believe that the key to successful MIL is to not only enable but also encourage positive transfer learning across domains. In this paper, we propose Domain-aware Contrastive knowledge transfer for Multi-domain Imbalanced learning (DCMI). DCMI introduces a novel domain-aware representation layer based on domain embeddings, which enables fine-grained and scalable representation sharing or separation. Complementary to the dataset-provided domain assignments, we use an auxiliary domain classification task to help determine the relevance of a sample to each domain, i.e., soft domain assignments. DCMI uses a novel contrastive knowledge transfer objective to move representations from similar domains closer together and representations from dissimilar domains further apart. We conduct extensive experiments on three different multi-domain imbalanced datasets to demonstrate the effectiveness of DCMI.

The recent imbalanced learning literature can be organized into the following categories:

Data Resampling. This is one of the most widely used practices to artificially balance the distribution. Two popular options are under-sampling (Buda et al., 2018; More, 2016) and over-sampling (Buda et al., 2018; Sarafianos et al., 2018; Shen et al., 2016). Under-sampling removes data from the head (dominant classes), while over-sampling repeats data from the tail (minority classes). These approaches can be problematic, as discarding tends to remove important samples and duplicating tends to introduce bias or overfitting.

Data Augmentation. Data augmentation has been used to enrich the tail classes. A popular approach is to leverage the Mixup (Zhang et al., 2018) technique to augment the minority classes. Remix (Chou et al., 2020) assigns the mixup label in favor of minority classes; a related approach prepares a "feature cloud" for mixing that has a larger distribution range for tail classes. Kim et al. (2020) adds noise to head classes to generate tail-class samples. Chu et al. (2020) decomposes the feature space and generates tail-class samples by combining class-shared features from head classes with class-specific features from tail classes. However, generating meaningful samples that actually help tail classes is usually non-trivial.

Loss Reweighting. The basic idea of reweighting is to allocate larger weights to the loss terms corresponding to tail classes and smaller weights to those of head classes.
In the class-sensitive cross-entropy loss (Japkowicz and Stephen, 2002), the weight for each class is inversely proportional to its number of samples. Ren et al. (2018) leverages a hold-out evaluation set to minimize the balanced loss.

Regularization. This approach adds an additional regularization term to improve training for the tail samples. Lin et al. (2017) adds a factor to the standard cross-entropy loss to put more focus on hard, misclassified samples (usually attributed to the minority classes). Cao et al. (2019) proposed to regularize the minority classes strongly so that their generalization error can be improved. While regularization is simple and effective, the soft penalty can be insufficient to make the model focus on the tail classes, and a large penalty may negatively affect the learning itself.

Parameter Isolation. It has been shown that decoupling the learning into representation learning and classifier learning can be quite effective. BBN (Zhou et al., 2020) proposed a two-branch approach where the representation learning branch is trained as if there were no class imbalance (taking randomly sampled data as input), while the classifier learning branch applies a reverse sampling technique. The two branches are then combined by a curriculum learning strategy. A follow-up hybrid approach (referred to as HybridSC in our experiments) further improves BBN by replacing the cross-entropy loss in the representation learning branch with a prototypical supervised contrastive loss. This approach offers the opportunity to optimize each part separately, but it also makes it hard to transfer knowledge from head to tail classes.

Domain Imbalanced Learning. The approaches above mostly consider class imbalance but ignore the imbalance across domains. Prior work proposed a doubly balancing technique for both class imbalance and cross-domain imbalance, but it is limited to two domains and has no explicit mechanism to encourage positive transfer and avoid negative transfer.

In this paper, we assume access to a set of samples $\{(x_i, y_i, d_i)\}_{i=1}^{N}$, where $y_i \in \{1, \ldots, C\}$ is the class label and $d_i \in \{1, \ldots, M\}$ is the domain assignment. Here, N is the number of samples, C is the number of classes, and M is the number of domains; the feature space and label set are shared across domains. We assume a scenario where there exists (a) class imbalance: classes are not evenly distributed in each domain; (b) domain imbalance: domains are not evenly distributed, i.e., some domains may have many more or far fewer examples than others; and (c) domain divergence: while some domains are naturally similar to others and thus positively correlated, some domains are naturally dissimilar to others and negatively correlated. Given these assumptions, in multi-domain imbalanced learning (MIL) we seek a model that minimizes the expected loss over all domains (i.e., the macro average).

In the MIL problem, it is crucial to identify the shared knowledge that can be transferred across similar domains to improve tail domain performance, as well as the domain-specific knowledge that needs to be handled carefully to avoid negative transfer. To obtain domain-aware representations, we leverage domain embeddings to adaptively select the useful representation for each specific domain (Sec. 4.1). Additionally, regardless of the dataset-provided domain assignment, in reality a sample can belong to multiple domains to different degrees. To address this, we propose a domain classification task to obtain the relevance of a sample to each domain and transfer the related domain knowledge using a contrastive method (Sec. 4.2).
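To make the training target concrete, the macro-averaged MIL objective described above can be written as follows; this is a reconstruction from the stated setup (shared label set, per-domain expected loss), and the exact notation may differ from the original formulation:

```latex
% Macro-averaged MIL objective: each of the M domains contributes
% equally, no matter how many samples it contributes to the data.
\min_{\theta} \; \frac{1}{M} \sum_{j=1}^{M}
  \mathbb{E}_{(x,\,y) \sim \mathcal{D}_j}
  \Big[ \ell\big(f_{\theta}(x,\, j),\, y\big) \Big]
```

Here $\mathcal{D}_j$ denotes the data distribution of domain $j$ and $\ell$ is the per-sample classification loss; optimizing the macro average rather than the micro (per-sample) average is what prevents head domains from dominating training.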
We suggest a domain-aware representation layer to adaptively select the appropriate representation (neurons) for each domain. For a domain j, the corresponding embedding $v_j$ consists of differentiable parameters that can be learned in an end-to-end fashion.

(Figure 1: The flow of gradients from each loss term is controlled such that each term is only used to optimize a subset of trainable parameters, as indicated by green, blue, and orange colors in the drawing.)

Based on this, the sigmoid function is used to find the corresponding domain mask $m_j$:

$$m_j = \sigma(v_j / \tau) \quad (1)$$

where $\tau$ is a temperature variable, linearly annealed from 1 to $\tau_{\min}$ (a small positive value). To obtain the domain-aware representation, we use element-wise multiplication of the output of the body network (i.e., BERT in this paper) $h$ and the mask $m_j$:

$$\hat{h}_j = h \odot m_j \quad (2)$$

Note that the neurons selected by $m_j$ may overlap with those of other domain masks to enable knowledge sharing. To make sure that $v_j$ has a wide range and its gradient has a large magnitude, a gradient compensation technique is applied to the original gradient $g$ (Serrà et al., 2018). The embedding matrix is trained jointly with the supervised classification objective using a typical cross-entropy loss, denoted by $L_{sup}$.

Even though we obtain the domain-aware representation using the suggested domain embedding, there are two limitations: (a) apart from supporting shared features, there is no explicit mechanism to actively encourage knowledge transfer; (b) the dataset-provided domains are not necessarily accurate and fine-grained in the real world. Certain examples can be attributed to multiple domains with different degrees of relevance. For example, a review written on a product is usually considered in the general domain of that product (e.g., computers); however, semantically, it may involve discussion of other domains (e.g., the music playback quality of a laptop). To address the above issues, we employ a domain classification task to estimate the relevance of each sample to different domains. We leverage these relevance/confidence scores as soft labels to conduct contrastive learning, allowing knowledge transfer from similar domains at the instance level.

Domain Classification. To estimate the relevance of different domains for a given sample, we leverage a sigmoid classification head with M output neurons. For training, we employ a binary cross-entropy (BCE) loss $L_{dom}$ using the dataset-provided domain assignments as labels. Using the trained domain classifier, and assuming it can generalize and capture domain similarities, we estimate the relevance of sample i to domain j using the classifier's sigmoid output score, denoted by $a_i^j$. Note that the domain classification task is only an auxiliary task to be used in the contrastive learning objective explained next. Therefore, we block gradients from this objective from flowing outside the domain classifier head.

Contrastive Learning. Fig. 2 shows an illustration of the proposed contrastive objective. Here, for a certain sample, regardless of the dataset-provided domain, we compute its domain-aware representations for all domains: $\hat{h}_i^1, \ldots, \hat{h}_i^M$. Then, we compute an augmented view of the sample by simply taking the average of the domain-aware representations weighted by their normalized relevance scores:

$$\bar{h}_i = \sum_{j=1}^{M} \tilde{a}_i^j \, \hat{h}_i^j, \qquad \tilde{a}_i^j = \frac{a_i^j}{\sum_{k=1}^{M} a_i^k} \quad (4)$$

Based on this, we define the contrastive objective as

$$L_{con} = -\sum_{j=1}^{M} \tilde{a}_i^j \log \frac{\exp(\bar{h}_i \cdot \hat{h}_i^j)}{\sum_{k=1}^{M} \exp(\bar{h}_i \cdot \hat{h}_i^k)} \quad (5)$$

which is essentially a soft cross-entropy loss.
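The following is a minimal PyTorch sketch of the two components above, under stated assumptions: the module and function names are hypothetical, the hidden size of 768 matches BERT_BASE, the similarity in (5) is taken to be a dot product after l2 normalization, and the relevance scores are treated as fixed soft labels (detached) inside the contrastive loss, consistent with blocking gradients around the domain classifier head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainAwareLayer(nn.Module):
    """Sketch of the domain-aware representation layer (Sec. 4.1).

    Each domain j owns a trainable embedding v_j; a temperature-annealed
    sigmoid turns it into a soft neuron mask m_j = sigmoid(v_j / tau),
    applied element-wise to the encoder output h (Eqs. 1-2).
    """

    def __init__(self, num_domains: int, hidden_size: int = 768):
        super().__init__()
        self.domain_emb = nn.Embedding(num_domains, hidden_size)  # v_j

    def forward(self, h: torch.Tensor, domain_ids: torch.Tensor,
                tau: float) -> torch.Tensor:
        # tau is annealed linearly from 1 toward tau_min during training
        mask = torch.sigmoid(self.domain_emb(domain_ids) / tau)  # m_j
        return h * mask  # domain-aware representation h_hat_j


def contrastive_loss(h_hats: torch.Tensor, rel: torch.Tensor) -> torch.Tensor:
    """Soft cross-entropy contrastive objective (Eqs. 4-5).

    h_hats: (B, M, H) domain-aware representations for all M domains
    rel:    (B, M) sigmoid relevance scores from the domain classifier
    """
    a = rel.detach()  # soft labels only; no gradient into the domain head
    a = a / a.sum(dim=1, keepdim=True).clamp_min(1e-8)   # normalized relevance
    h_hats = F.normalize(h_hats, dim=-1)                 # l2 normalization
    h_bar = torch.einsum("bm,bmh->bh", a, h_hats)        # augmented view (Eq. 4)
    logits = torch.einsum("bh,bmh->bm", h_bar, h_hats)   # similarity per domain
    return -(a * F.log_softmax(logits, dim=1)).sum(dim=1).mean()  # Eq. 5
```

In the full model, this loss would be combined with $L_{sup}$ and $L_{dom}$ as in the final objective (Eq. 6, introduced below), with gradients routed to different parameter subsets as described in the text.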
Intuitively, the contrastive objective of (5) encourages learning representations that capture the attribution of the augmented view to each domain. Through this objective, similar domains are represented with closer representations, and dissimilar domains are moved further apart such that they are easily distinguishable from the augmented view. Note that $L_{con}$ is different from the typical contrastive objectives used in the literature, as it relies on soft domain assignments for the augmented view rather than distinguishing augmented from real data.

As an example, assume that the domain-aware representation $\hat{h}_i^j$ is not a good representation for sample i and lacks knowledge that is potentially transferable from other domains (indicated by a single color in the representation boxes of Fig. 3). We can see how $L_{con}$ helps (see Fig. 3):

• Sample i is semantically relevant to multiple domains (domain 1 and domain 3). In this case, $a_i^1$ and $a_i^3$ have large values while $a_i^2$ has a smaller value. Consequently, $\bar{h}_i$ is mostly the average of $\hat{h}_i^1$ and $\hat{h}_i^3$ (half orange and half green). Here, updating based on $L_{con}$ moves $\hat{h}_i^1$ and $\hat{h}_i^3$ closer to $\bar{h}_i$. In other words, knowledge transfer is encouraged between the first and third representations for that sample.

• Sample i is not semantically relevant to a domain (domain 2). Updating based on $L_{con}$, $\hat{h}_i^2$ moves further from $\bar{h}_i$ to reflect the difference between them. Consequently, $\hat{h}_i^2$ is discouraged from negative knowledge transfer. This is expected, as $\hat{h}_i^2$ is not relevant to sample i.

Final Objective. The final joint training objective is a combination of the supervised classification, domain classification, and sample-level contrastive loss terms:

$$L = L_{sup} + \lambda_1 L_{dom} + \lambda_2 L_{con} \quad (6)$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters that adjust the impact of each term. Note that gradients computed from each objective update different parts of the network, as shown via different colors in Fig. 1. For example, $L_{dom}$ only updates the domain classifier head, and $L_{con}$ updates all parameters except those in the supervised classification head.

Architecture. A fully connected layer with softmax output is used as the classification head in the last layer of BERT. We use the embedding of [CLS] as the output of BERT. The training of BERT follows that of Xu et al. (2019). We adopt BERT_BASE (uncased).

Hyperparameters. Unless otherwise stated, the domain embeddings have 768 dimensions. We use 0.0025 for $\tau_{\min}$. A dropout layer with a rate of 0.5 is placed between fully connected layers. To find the $\lambda_1$ and $\lambda_2$ hyperparameters in Eq. 6, we conducted a grid search in the [0, 5000] range using about 200 logarithmic increments. We provide the selected $\lambda_1$ and $\lambda_2$ for each dataset in Section 5.1.3. For the contrastive objective, an $\ell_2$ normalization is applied before computing the contrastive loss. The maximum input length is set to 128 tokens. We use the Adam optimizer and set the learning rate to $3 \times 10^{-5}$. For all experiments, we train for 5 epochs using a mini-batch size of 64.

We conduct experiments using three datasets: Document Sentiment Classification (DSC) (Ni et al., 2019), Aspect Sentiment Classification (ASC) (Ke et al., 2021), and Rumour and Fake News Detection (RFD) (Zubiaga et al., 2016; Wang, 2017). These datasets have natural class and domain imbalance. For all datasets, we use a random data split of 10% for test, 10% for validation, and the rest for training.
To better evaluate the performance of each method in terms of efficient knowledge transfer, we down-sample the training and validation sets of DSC, ASC, and RFD by factors of 1000, 10, and 10, respectively. We provide the exact domain and class statistics in the appendix. In addition to these datasets, we conduct additional experiments using an altered version of the ASC dataset with artificially dissimilar domains (Sec. 5.2.2).

DSC. For this dataset, the task is to classify each full product review into one of two opinion classes (positive and negative). The training data provides the particular type of product being reviewed as domain information. We adopt the text classification formulation of Devlin et al. (2019), where the [CLS] token is used to predict the opinion polarity. To build the DSC dataset, we use 29 domains from the Amazon Review Datasets (Ni et al., 2019), then binarize the ratings by converting 1-2 stars to negative and 4-5 stars to positive.

ASC. This dataset provides a classification of review sentences by their aspect-level sentiment (positive or negative). For example, the sentence "The picture is great but the sound is lousy" about a TV expresses a positive opinion about the aspect "picture" and a negative opinion about the aspect "sound." We adopt the ASC implementation of Xu et al. (2019), where the aspect term and sentence are concatenated via [SEP] in BERT. The opinion is predicted using the [CLS] token. The ASC dataset (Ke et al., 2021) consists of 19 domains from 4 sources: (a) HL5Domains (Hu and Liu, 2004) with reviews of 5 products; (b) Liu3Domains (Liu et al., 2015) with reviews of 3 products; (c) Ding9Domains (Ding et al., 2008) with reviews of 9 products; and (d) SemEval14 with reviews of 2 products (SemEval 2014 Task 4 for laptops and restaurants).

RFD. This dataset is composed of the PHEME rumor detection (Zubiaga et al., 2016) and LIAR fake news detection (Wang, 2017) datasets. For rumor detection, the task is to identify whether a given piece of news is a rumor or not, while for fake news detection, it is to identify fake or real news pieces. We follow Devlin et al. (2019), where the [CLS] token is used for the classification. The RFD dataset consists of 6 domains: 5 domains of rumor tweets from PHEME (Zubiaga et al., 2016) and 1 domain from LIAR (Wang, 2017). Note that domains in PHEME are defined by different news events (e.g., a specific shooting incident), while the domain in LIAR is defined by news genre (e.g., politics). We intentionally selected this dataset to evaluate the performance of different methods when domains are merely a segmentation of samples rather than following a consistent definition.

For each experiment, we report the Area Under the ROC Curve (AUC) as the performance measure. Two types of results are reported: macro and micro. Macro is computed by averaging the results computed for individual domains. Micro is computed over all test samples regardless of their domain assignments. Note that there is an imbalance in the frequency of class labels (positive and negative in ASC and DSC; fake and real in RFD) in addition to the imbalance in domains for each dataset. To ensure the statistical significance of the results, each experiment is repeated 5 times with random seeds and random initialization, and we report the mean and standard deviation of each result.
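As a concrete reading of the two metrics, the following sketch computes macro and micro AUC from per-sample scores; the function name and array-based interface are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def macro_micro_auc(y_true, y_score, domains):
    """Macro AUC: mean of per-domain AUCs, so tail domains count equally.
    Micro AUC: one AUC over all samples, ignoring domain assignments.
    Assumes binary labels; a domain containing only one class would make
    roc_auc_score raise, so such domains would need special handling."""
    y_true, y_score, domains = map(np.asarray, (y_true, y_score, domains))
    micro = roc_auc_score(y_true, y_score)
    per_domain = [roc_auc_score(y_true[domains == d], y_score[domains == d])
                  for d in np.unique(domains)]
    return float(np.mean(per_domain)), float(micro)
```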
As the main focus of this study is domain imbalance, we adopt the existing DRS method (Cao et al., 2019) in all experiments to address the class imbalance present in our benchmarks. In our comparisons, we use multi-task learning (MTL) and domain-agnostic learning (D-AL) as intuitive and straightforward baselines. Additionally, since little work has been done on MIL, we adapt recent class imbalance systems to MIL by re-sampling or re-weighting based on the domain statistics. For each case, we follow architectures similar to DCMI to ensure fair comparisons. The compared methods cover various approaches including: loss reweighting (D-DRW (Cao et al., 2019)), regularization (D-Focal (Lin et al., 2017)), re-sampling (D-DRS (Cao et al., 2019)), parameter isolation (D-BBN (Zhou et al., 2020) and D-HybridSC), and mixture-of-experts (D-MDFEND (Nan et al., 2021)). Note that the prefix "D-" in the model name indicates that we adapted the method to the domain imbalance setting. Among these approaches, D-DRW and D-DRS are re-weighting and re-sampling methods, respectively, with a deferred training schedule: as suggested by Cao et al. (2019), the re-sampling or re-weighting only takes effect after 80% of the epochs have been trained. D-Focal is a regularization-based method that uses a carefully designed loss function tailored to imbalanced data. D-BBN and D-HybridSC are two recent parameter isolation approaches that have shown state-of-the-art performance. D-MDFEND, originally proposed for multi-domain fake news detection, applies a mixture-of-experts to deal with multi-domain transfer and isolation. Regarding the DCMI hyperparameters, i.e., ($\lambda_1$, $\lambda_2$), we used (50, 6), (30, 15), and (4, 3) for the ASC, DSC, and RFD datasets, respectively. Refer to Section 4.3 for the hyperparameter search space and other implementation details.

Across the three datasets, DCMI is much more data-efficient compared to the other baselines, as it effectively encourages positive knowledge transfer across domains. Among the three datasets, DCMI has the largest improvement margin for RFD. This can be attributed to the fact that domains in RFD are more diverse than those in ASC and DSC. The sentiment classification domains, as in ASC and DSC, have similarities because in these tasks positive or negative sentiments are usually expressed with similar words/phrases. For example, "wonderful" and "terrible" have similar interpretations across different tasks/domains for expressing positive or negative sentiment. However, expressions in fake news or rumors are far more diversified, follow more complex semantics, and are even contradictory at times. For example, "guns" and "shooting" appear many times in the "Charlie Hebdo" domain while they almost never appear in other domains such as "Germanwings Flight". Even more interestingly, "Trump" appears frequently both in the fake news of the "COVID-19" domain and in the real news of the "government" domain; it is therefore a significant keyword with different interpretations across domains. Under such domain disparities, selectively transferring common knowledge while preventing negative transfer becomes crucial, which we believe is addressed by this work. For the most recent state-of-the-art methods presented in Table 1, we observe mixed MIL performance across datasets, indicating less adaptability compared to DCMI. This is perhaps because they do not employ any viable mechanism to explicitly encourage positive transfer.
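To illustrate the "D-" adaptation pattern, below is a minimal sketch of domain-level deferred re-weighting in the spirit of D-DRW; the inverse-frequency weighting and the normalization are illustrative assumptions, not the exact recipe from the original papers.

```python
import torch
import torch.nn.functional as F

def domain_weighted_ce(logits, labels, domain_ids, domain_counts,
                       epoch, total_epochs, defer_frac=0.8):
    """Cross-entropy re-weighted by inverse domain frequency, activated
    only after `defer_frac` of training (the deferred schedule of DRW,
    applied to domain statistics instead of class statistics)."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    if epoch < defer_frac * total_epochs:
        return per_sample.mean()              # plain CE in the deferred phase
    weights = 1.0 / domain_counts.float()[domain_ids]   # up-weight tail domains
    weights = weights * (len(weights) / weights.sum())  # normalize to mean 1
    return (weights * per_sample).mean()
```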
We claim that DCMI is capable of adaptively selecting the useful knowledge (neurons) for a given domain and is thus robust to extremely dissimilar domains. To demonstrate this, we create an artificial case where domains are extremely dissimilar by design. Specifically, we divide the ASC dataset into two parts. The first part contains the first 10 domains and the second part contains the other 9 domains. We keep the first part as is, while inverting the labels of the second part (i.e., flipping positive to negative and vice versa). Note that in a sentiment classification task such as ASC, domains are highly correlated, so inverting the labels for half of the domains creates a drastic domain disparity.

Table 1 shows the results of using the altered ASC data. We can see that all baselines except MTL and D-MDFEND reach only around 50% AUC. This is because the extremely high domain divergence causes severe negative transfer and makes it difficult for the majority of baselines to learn a good predictor. However, MTL and D-MDFEND perform better than the other baselines, perhaps because negative transfer is reduced by the separate heads for different domains in MTL and the mixture-of-experts in D-MDFEND. Nevertheless, DCMI still outperforms MTL and D-MDFEND, confirming that DCMI is not only capable of isolating domain-specific knowledge but is also able to encourage positive transfer among similar domains, which here are the domains within each part of the altered dataset.

Table 3: Qualitative comparison of predictions for different methods on a set of selected test samples from the ASC dataset (Ke et al., 2021). Italic text indicates the aspect in the review. "P." indicates positive and "N." indicates negative assignments.

Review sentence                                                    | Label | D-AL | DCMI-[L_dom, L_con] | DCMI
The nicest part is the low heat output and ultra quiet operation.  | P.    | N.   | P.                  | P.
The flaw is inside the Zen.                                        | N.    | P.   | N.                  | N.
It feels cheap, the keyboard is not very sensitive.                | N.    | P.   | P.                  | N.
The downstairs bar scene is very cool and chill...                 | P.    | N.   | N.                  | P.
The sushi is cut in blocks bigger than my cell phone.              | N.    | P.   | P.                  | N.

We conduct an ablation study to analyze the impact of each objective term. The results of this experiment are presented in Table 2, where "-$L_{dom}$" indicates DCMI without the domain classification loss and "-$L_{dom}$, $L_{con}$" indicates DCMI without both the domain classification and contrastive losses. Note that if we remove the domain-aware representation layer in addition to $L_{dom}$ and $L_{con}$, DCMI becomes D-AL. Based on the results in Table 2, the full DCMI system gives the best results, showing that every suggested component is crucial to the final model performance.

(Table 2: Ablation study of DCMI. "-$L_{dom}$" and "-$L_{con}$" indicate omitting the domain classification and contrastive loss terms, respectively.)

Table 3 shows several examples from the ASC test set. For each example, we show the ground-truth label (the "Label" column) and the predictions of D-AL, DCMI-[$L_{dom}$, $L_{con}$], and DCMI. By comparing D-AL and DCMI-[$L_{dom}$, $L_{con}$], we can see the effectiveness of the domain-aware representation layer. By comparing DCMI and DCMI-[$L_{dom}$, $L_{con}$], we can see whether the contrastive knowledge transfer is successful. In the first row, "quiet" is a positive sentiment word in the "laptop" domain. However, "quiet" can indicate negative sentiment in other domains (e.g., a "quiet" earbud in the "MP3" domain indicates negative sentiment).
We can see that DCMI and DCMI-[$L_{dom}$, $L_{con}$] are able to separate the different polarities of the same sentiment word across domains, while D-AL fails, suggesting that the knowledge selection in DCMI is capable of learning discriminative domain-aware representations. In the second row, we can see that D-AL mistakenly takes the review as positive due to the small amount of training data in the "MP3" domain. DCMI and DCMI-[$L_{dom}$, $L_{con}$] make the correct prediction because of their ability to transfer knowledge from similar domains.

The last three rows of Table 3 showcase cases where only DCMI is correct. In the "laptop" domain (the third row), "cheap" conveys a negative sentiment in the example. However, "cheap" can indicate positive sentiment even within the "laptop" domain when the text is semantically about software. Therefore, an MIL model that only considers the annotated domain (e.g., DCMI-[$L_{dom}$, $L_{con}$]) fails. Similarly, the polarities of "cool" and "chill" depend not only on the dataset-provided domain but also on the degrees of domain relevance for a given sample. The last case is an ironic expression, indicating that DCMI provides a deeper understanding of the review. In addition to the presented results, we provide a visual analysis of the domain-aware representation layer using t-SNE in the appendix.

In this work, we studied the problem of learning from multi-domain imbalanced data, where there is not only class imbalance but also imbalance among domains with varying degrees of similarity. We proposed a novel technique called DCMI that is capable of identifying the shared knowledge that can be transferred to improve tail domain performance as well as the domain-specific knowledge that needs to be handled carefully to avoid negative transfer. DCMI employs a domain-aware representation layer to adaptively select the relevant knowledge for each domain and leverages a novel contrastive learning objective to encourage knowledge transfer among relevant domains. Based on experiments using three challenging multi-domain imbalanced datasets, DCMI shows improvements over the current state of the art and demonstrates applicability to different scenarios.

References

Buda et al. (2018). A systematic study of the class imbalance problem in convolutional neural networks.
Cao et al. (2019). Learning imbalanced datasets with label-distribution-aware margin loss.
Representation learning for imbalanced cross-domain classification.
Chou et al. (2020). Remix: Rebalanced mixup.
Chu et al. (2020). Feature space augmentation for long-tailed data.
Devlin et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding.
Ding et al. (2008). A holistic lexicon-based approach to opinion mining.
Hu and Liu (2004). Mining and summarizing customer reviews.
Japkowicz and Stephen (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis.
Kachuee et al. (2021). Self-supervised contrastive learning for efficient user satisfaction prediction in conversational agents.
Ke et al. (2021). Adapting BERT for continual learning of a sequence of aspect sentiment classification tasks.
Kim et al. (2020). M2m: Imbalanced classification via major-to-minor translation.
Lin et al. (2017). Focal loss for dense object detection.
Deep representation learning on long-tailed data: A learnable embedding augmentation perspective.
Liu et al. (2015). Automated rule selection for aspect extraction in opinion mining.
Large-scale long-tailed recognition in an open world.
More (2016). Survey of resampling techniques for improving classification performance in unbalanced datasets.
Nan et al. (2021). MDFEND: Multi-domain fake news detection.
Ni et al. (2019). Justifying recommendations using distantly-labeled reviews and fine-grained aspects.
Ren et al. (2018). Learning to reweight examples for robust deep learning.
Sarafianos et al. (2018). Deep imbalanced attribute classification using visual attention aggregation.
Serrà et al. (2018). Overcoming catastrophic forgetting with hard attention to the task.
Shen et al. (2016). Relay backpropagation for effective learning of deep convolutional neural networks.
Contrastive learning based hybrid networks for long-tailed image classification.
Wang (2017). "Liar, liar pants on fire": A new benchmark dataset for fake news detection.
Xu et al. (2019). BERT post-training for review reading comprehension and aspect-based sentiment analysis.
Zhang et al. (2018). mixup: Beyond empirical risk minimization.
Zhou et al. (2020). BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition.
Zubiaga et al. (2016). Learning reporting dynamics during breaking news for rumour detection in social media.

Appendix

Tables 4-6 provide the frequency of samples corresponding to each domain for the ASC, DSC, and RFD datasets.

(Table 6: The number of samples in each domain and data split for the ASC dataset. ASC is composed of four datasets. "N." indicates negative labels and "P." indicates positive labels.)

We visualize sample representations before and after the domain-aware representation layer using t-SNE for the ASC dataset (see Figure 4). Here, we color the samples according to their domain assignments. Before the domain-aware representation layer, the points belonging to different domains are mixed and hard to differentiate. After the domain-aware representation layer, samples with similar colors form clusters, indicating a larger embedding distance between different domains. From this visualization, we can infer that the suggested method is able to learn discriminative domain-aware representations.
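For completeness, a minimal sketch of how such a visualization could be produced, assuming access to the pre-mask and post-mask [CLS] representations as NumPy arrays; all names here are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(before, after, domain_ids):
    """Project representations before/after the domain-aware layer to 2-D
    and color points by domain, mirroring the appendix visualization.

    before, after: (N, H) arrays of [CLS] representations
    domain_ids:    (N,) integer domain assignments used for coloring
    """
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, feats, title in zip(axes, (before, after),
                                ("before mask", "after mask")):
        z = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
        ax.scatter(z[:, 0], z[:, 1], c=domain_ids, s=5, cmap="tab20")
        ax.set_title(title)
    plt.show()
```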