key: cord-0462704-fdk7aydj
authors: Huang, Yi; Giledereli, Buse; Koksal, Abdullatif; Ozgur, Arzucan; Ozkirimli, Elif
title: Balancing Methods for Multi-label Text Classification with Long-Tailed Class Distribution
date: 2021-09-10
journal: nan
DOI: nan
sha: 2fe1c8221dd90ad8655d4b518f9c10420cc6ed55
doc_id: 462704
cord_uid: fdk7aydj

Multi-label text classification is a challenging task because it requires capturing label dependencies. It becomes even more challenging when the class distribution is long-tailed. Resampling and re-weighting are common approaches for addressing the class imbalance problem; however, they are not effective when there is label dependency besides class imbalance, because they result in oversampling of common labels. Here, we introduce the application of balancing loss functions for multi-label text classification. We perform experiments on a general domain dataset with 90 labels (Reuters-21578) and a domain-specific dataset from PubMed with 18211 labels. We find that a distribution-balanced loss function, which inherently addresses both the class imbalance and label linkage problems, outperforms commonly used loss functions. Distribution balancing methods have been successfully used in the image recognition field. Here, we show their effectiveness in natural language processing. Source code is available at https://github.com/Roche/BalancedLossNLP.

Multi-label text classification is one of the core topics in natural language processing (NLP) and is used in many applications such as search (Prabhu et al., 2018) and product categorization (Agrawal et al., 2013). It aims to find the related labels from a fixed set of labels for a given text that may have multiple labels. Consider an example from the Reuters-21578 multi-label text classification dataset (Hayes and Weinstein, 1990): for the document with the title PENN CENTRAL SELLS U.K. UNIT, the aim is to find the labels acq (acquisitions), strategic-metal, and nickel from 90 labels.
Multi-label classification becomes complicated when there is a long-tailed distribution (class imbalance) and linkage (co-occurrence) of labels. Class imbalance occurs when a small subset of the labels (namely head labels) have many instances, while the majority of the labels (namely tail labels) have only a few instances. For example, half of the labels in the Reuters dataset, including copper, strategic-metal, and nickel, occur in less than 5% of the training data. Label co-occurrence or label linkage is a challenge when some head labels co-occur with rare or tail labels, biasing classification toward the head labels. For example, even though the label nickel occurs less frequently, the co-occurrence information of nickel/copper and nickel/strategic-metal is important for accurate modeling (Figure 1). Solutions such as resampling of the samples with less-frequent labels in classification (Estabrooks et al., 2004; Charte et al., 2015), using co-occurrence information in the model initialization (Kurata et al., 2016), or providing a hybrid solution for head and tail categories with a multi-task architecture (Yang et al., 2020) have been proposed in NLP; however, they are not suitable for imbalanced datasets or they are dependent on the model architecture.

Multi-label classification has been widely studied in the computer vision (CV) domain, and has recently benefited from cost-sensitive learning through loss functions for tasks such as object recognition (Durand et al., 2019; Milletari et al., 2016), semantic segmentation (Ge et al., 2018), and medical imaging (Li et al., 2020a). Balancing loss functions such as focal loss (Lin et al., 2017), class-balanced loss (Cui et al., 2019) and distribution-balanced loss (Wu et al., 2020) provide improvements to resolve the class imbalance and co-occurrence problems in multi-label classification in CV.
Loss function manipulation has also been explored in NLP (Li et al., 2020b; Cohan et al., 2020), as it works in a model architecture-agnostic fashion by explicitly embedding the solution into the objective. For example, Li et al. (2020b) borrowed a dice-based loss function from a medical image segmentation task (Milletari et al., 2016) and reported significant improvements over the standard cross-entropy loss function in several NLP tasks.

In this work, our major contribution is the introduction of balancing loss functions to the NLP domain for the multi-label text classification task. We perform experiments on Reuters-21578, a general and small dataset, and PubMed, a biomedical domain-specific and large dataset. For both datasets, the distribution balancing methods not only outperform the other loss functions for the total metrics, but also lead to significant improvement for the tail labels. We suggest that the balancing loss functions provide a robust solution for addressing the challenges in multi-label text classification.

In NLP, Binary Cross Entropy (BCE) loss is commonly used for multi-label text classification (Bengio et al., 2013). Given a dataset $\{(x^1, y^1), \dots, (x^N, y^N)\}$ with $N$ training instances, each having a multi-label ground truth $y^k = [y^k_1, \dots, y^k_C] \in \{0, 1\}^C$ ($C$ is the number of classes), and a classifier output $z^k = [z^k_1, \dots, z^k_C] \in \mathbb{R}^C$, BCE is defined as (the average reduction step is not shown for simplicity):

$$L_{BCE} = \begin{cases} -\log(p^k_i) & \text{if } y^k_i = 1 \\ -\log(1 - p^k_i) & \text{otherwise} \end{cases} \tag{1}$$

The sigmoid function is used for computing $p^k_i$: $p^k_i = \sigma(z^k_i)$.

The plain BCE is vulnerable to label imbalance due to the dominance of head classes or negative instances (Durand et al., 2019). Below, we describe three alternative approaches that address the class imbalance problem in long-tailed datasets in multi-label text classification. The main idea of these balancing methods is to reweight BCE so that rare instance-label pairs receive a reasonable share of "attention".
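To make the BCE definition above concrete, the following is a minimal pure-Python sketch for a single instance. The function names (`sigmoid`, `bce_per_label`) and the toy logits are illustrative assumptions and are not taken from the authors' released code; no numerical stabilisation is attempted.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_per_label(logits, labels):
    """Return the BCE term of every label for one instance.

    logits: list of C real-valued classifier outputs z_i
    labels: list of C ground-truth values y_i in {0, 1}
    """
    losses = []
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        # Positive labels are penalised by -log(p), negatives by -log(1 - p).
        losses.append(-math.log(p) if y == 1 else -math.log(1.0 - p))
    return losses


# Hypothetical instance with C = 4 labels: one positive label, three negatives.
logits = [0.0, -2.0, -2.0, -2.0]
labels = [1, 0, 0, 0]
per_label = bce_per_label(logits, labels)
mean_loss = sum(per_label) / len(per_label)  # the "average reduction" step
```

In this toy case every term enters the average with the same weight, regardless of how rare the label is, which is the behaviour the balancing methods modify.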
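By multiplying BCE by a modulating factor (with the tunable focusing parameter $\gamma \geq 0$), focal loss (FL) places a higher weight of loss on "hard-to-classify" instances predicted with low probability on ground truth (Lin et al., 2017). For the multi-label classification task, the focal loss can be defined as:

$$L_{FL} = \begin{cases} -(1 - p^k_i)^{\gamma} \log(p^k_i) & \text{if } y^k_i = 1 \\ -(p^k_i)^{\gamma} \log(1 - p^k_i) & \text{otherwise} \end{cases} \tag{2}$$

By estimating the effective number of samples, class-balanced focal loss (CB) (Cui et al., 2019) further reweights FL to capture the diminishing marginal benefits of data, and therefore reduces redundant information of head classes. For multi-label tasks, each label with overall frequency $n_i$ has its balancing term

$$r_{CB} = \frac{1 - \beta}{1 - \beta^{n_i}} \tag{3}$$

where $\beta \in [0, 1)$ controls how fast the effective number grows, and the loss function becomes

$$L_{CB} = \begin{cases} -r_{CB} (1 - p^k_i)^{\gamma} \log(p^k_i) & \text{if } y^k_i = 1 \\ -r_{CB} (p^k_i)^{\gamma} \log(1 - p^k_i) & \text{otherwise} \end{cases} \tag{4}$$

By integrating rebalanced weighting and negative-tolerant regularization (NTR), distribution-balanced loss (DB) first reduces redundant information of label co-occurrence, which is critical in the multi-label scenario, and then explicitly assigns lower weight on "easy-to-classify" negative instances (Wu et al., 2020).

First, to rebalance the weights: in the single-label scenario, an instance can be weighted by the resampling probability $P^C_i = \frac{1}{C} \frac{1}{n_i}$; in the multi-label scenario, if following the same strategy, one instance with multiple labels can be oversampled with a probability $P^I = \frac{1}{C} \sum_{y^k_i = 1} \frac{1}{n_i}$. Therefore, the rebalanced weight can be normalized with $r_{DB} = P^C_i / P^I$. With a smoothing function, $\hat{r}_{DB} = \alpha + \sigma(\beta \times (r_{DB} - \mu))$, mapping $r_{DB}$ to $[\alpha, \alpha + 1]$, the rebalanced-FL (R-FL) loss function is defined as:

$$L_{R\text{-}FL} = \begin{cases} -\hat{r}_{DB} (1 - p^k_i)^{\gamma} \log(p^k_i) & \text{if } y^k_i = 1 \\ -\hat{r}_{DB} (p^k_i)^{\gamma} \log(1 - p^k_i) & \text{otherwise} \end{cases} \tag{5}$$

Then, NTR treats the positive and negative instances of the same label differently. A scale factor $\lambda$ and an intrinsic class-specific bias $v_i$ are introduced to lower the threshold for tail classes and to avoid over-suppression:

$$L_{NTR\text{-}FL} = \begin{cases} -(1 - q^k_i)^{\gamma} \log(q^k_i) & \text{if } y^k_i = 1 \\ -\frac{1}{\lambda} (q^k_i)^{\gamma} \log(1 - q^k_i) & \text{otherwise} \end{cases} \tag{6}$$

where $q^k_i = \sigma(z^k_i - v_i)$ for positive instances and $q^k_i = \sigma(\lambda (z^k_i - v_i))$ for negative ones.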
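The reweighting terms described above can be sketched per instance-label pair as follows. This is a minimal pure-Python illustration and not the authors' implementation: all function names are hypothetical, the default hyperparameters are the Reuters-21578 values the paper lists in its appendix ($\gamma = 2$; $\beta = 0.9$ for CB; $\alpha = 0.1$, $\beta = 10$, $\mu = 0.9$ for the smoothing function; $\lambda = 2$), the NTR form follows the formulation of Wu et al. (2020), and the class-specific bias $v_i$ is simply passed in as a number. The sketch assumes every instance has at least one positive label and does not guard against extreme logits.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def focal_loss(z, y, gamma=2.0):
    """Focal loss for one instance-label pair with logit z and label y."""
    p = sigmoid(z)
    if y == 1:
        return -((1.0 - p) ** gamma) * math.log(p)
    return -(p ** gamma) * math.log(1.0 - p)


def class_balanced_term(n_i, beta=0.9):
    """CB balancing term (1 - beta) / (1 - beta**n_i) for a label seen n_i times."""
    return (1.0 - beta) / (1.0 - beta ** n_i)


def rebalanced_weight(label_index, positive_indices, label_counts,
                      alpha=0.1, beta=10.0, mu=0.9):
    """Smoothed rebalancing weight for one label of one instance.

    positive_indices: indices of the labels this instance carries (non-empty)
    label_counts:     overall frequency n_i of every label
    """
    num_labels = len(label_counts)
    p_class = (1.0 / num_labels) * (1.0 / label_counts[label_index])
    p_instance = (1.0 / num_labels) * sum(
        1.0 / label_counts[j] for j in positive_indices)
    r = p_class / p_instance
    return alpha + sigmoid(beta * (r - mu))  # lies in (alpha, alpha + 1)


def ntr_focal_loss(z, y, v_i, gamma=2.0, lam=2.0):
    """Focal loss with negative-tolerant regularization (assumed form)."""
    if y == 1:
        q = sigmoid(z - v_i)
        return -((1.0 - q) ** gamma) * math.log(q)
    q = sigmoid(lam * (z - v_i))
    return -(1.0 / lam) * (q ** gamma) * math.log(1.0 - q)


def db_loss(z, y, r_hat, v_i, gamma=2.0, lam=2.0):
    """Combine the rebalanced weight with the NTR focal term."""
    return r_hat * ntr_focal_loss(z, y, v_i, gamma=gamma, lam=lam)


# Hypothetical label frequencies: one head, one medium, one tail label.
counts = [1000, 50, 10]
# An instance carrying the head label (index 0) and the tail label (index 2).
w_head = rebalanced_weight(0, [0, 2], counts)
w_tail = rebalanced_weight(2, [0, 2], counts)
```

For this made-up instance the tail label receives a much larger weight than the co-occurring head label, which is the intended effect of normalizing by the instance-level sampling probability.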
The $v_i$ can be estimated by minimizing the loss function at the beginning of training with a scale factor $\kappa$ and class prior $p_i = n_i / N$:

$$\hat{b}_i = -\log\left(\frac{1}{p_i} - 1\right), \quad v_i = -\kappa \times \hat{b}_i \tag{7}$$

Finally, DB integrates rebalanced weighting and NTR as

$$L_{DB} = \begin{cases} -\hat{r}_{DB} (1 - q^k_i)^{\gamma} \log(q^k_i) & \text{if } y^k_i = 1 \\ -\hat{r}_{DB} \frac{1}{\lambda} (q^k_i)^{\gamma} \log(1 - q^k_i) & \text{otherwise} \end{cases} \tag{8}$$

where $q^k_i$ is the NTR probability computed from $z^k_i - v_i$.

3 Experiments

Two multi-label text classification datasets of different size, property and domain are used (Table 1).

The Reuters-21578 dataset (Distribution 1.0) contains documents that appeared on the Reuters newswire in 1987 and that were manually annotated with 90 labels (Hayes and Weinstein, 1990). Here, we follow the train-test split used by Yang and Liu (1999) to obtain 7769 training documents (of which 1000 are used for validation) and 3019 test documents. The labels are equally split into head (30 with ≥ 35 instances), medium (31 with between 8-35 instances) and tail (30 with ≤ 8 instances) subsets.

The PubMed dataset comes from the BioASQ Challenge (License Code: 8283NLM123), which provides PubMed articles with titles and abstracts that have been manually labelled with Medical Subject Headings (MeSH) (Tsatsaronis et al., 2015; Coordinators, 2017). 224,897 articles published during 2020 and 2021 are used, among which 10,000 are used for validation and testing purposes. The 18,211 labels are split by 3-quantiles into head (6018 with ≥ 50 instances), medium (5581 with between 15-50 instances) and tail (6612 with ≤ 15 instances) subsets.

Table 2: Results on Reuters-21578 (left) and PubMed (right) using the SVM model or different loss functions. The F1 scores are reported for the total set of labels as well as for the head, medium and tail label sets, with the number of instances given in parentheses. The experiments are performed with the SVM one-vs-rest model (SVM), the binary cross entropy (BCE), focal loss (FL), class balanced focal loss (CB), rebalanced focal loss (R-FL), negative-tolerant regularization FL (NTR-FL), distribution balance with no FL (DB-0FL), class balanced FL with negative regularization (CB-NTR) and distribution balanced loss (DB).
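The frequency-based split into head, medium and tail labels can be sketched as below. This is only an illustration under stated assumptions: the function name `split_labels`, the handling of labels that fall exactly on a boundary, and the label counts in the example are hypothetical; only the two thresholds (35 and 8) are the Reuters-21578 values quoted above.

```python
def split_labels(label_counts, tail_max, head_min):
    """Assign each label to 'head', 'medium' or 'tail' by training frequency.

    label_counts: dict mapping label name to its number of training instances
    tail_max:     labels with at most this many instances are 'tail'
    head_min:     labels with at least this many instances are 'head'
    """
    groups = {"head": [], "medium": [], "tail": []}
    for label, n in label_counts.items():
        if n >= head_min:
            groups["head"].append(label)
        elif n <= tail_max:
            groups["tail"].append(label)
        else:
            groups["medium"].append(label)
    return groups


# Made-up counts for a few Reuters label names, for illustration only.
example_counts = {"acq": 1500, "copper": 40, "yen": 20, "nickel": 8, "platinum": 5}
groups = split_labels(example_counts, tail_max=8, head_min=35)
```

Reporting F1 separately for each of the three groups then only requires restricting the metric computation to the labels in `groups["head"]`, `groups["medium"]` or `groups["tail"]`.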
We compare the use of different loss functions, with an SVM one-vs-rest model as a classical multi-label classification baseline. For each dataset and method, we evaluate its best micro-F1 and macro-F1 scores (Wu et al., 2019; Lipton et al., 2014) for the whole label set (total) as well as for different subsets of label frequency (head/medium/tail). The loss function parameters, the classification models used, and the implementation details are provided in Appendix A.

A summary of the results of different loss functions is listed in Table 2. There are about 10,000 documents and 90 labels in the Reuters dataset, with an average of 150 instances per label (Table 1). Figure 2 shows the long-tailed distribution, where only a few labels have a high number of articles and these head labels also have high co-occurrence with other labels. The impact of the skewed distribution can also be seen from the comparison between the micro-F1 (around 90 for different loss functions) and macro-F1 (around 50-60) scores (Table 2). Furthermore, among loss functions, BCE has the lowest performance for the Reuters dataset, with a total macro-F1 score of 47 and tail F1 scores of 0. The PubMed dataset contains around 225,000 documents with 18,000 labels (Table 1). The imbalance is even more pronounced for this large dataset (Figure 3 in Appendix), and the difference between the total micro-F1 score (60) and the total macro-F1 score (around 15) is very high. Overall, SVM underperforms the proposed distribution balanced loss functions in both datasets.

Experiments with Reuters-21578 dataset. The loss functions FL, CB, R-FL and NTR-FL perform similarly to BCE in head classes, yet outperform BCE in medium and tail classes, indicating the advantage of handling imbalance. DB provides the biggest improvement in tail class assignment; the tail micro-F1 score improves by 21.49 over FL and by 25.81 over CB.
DB also outperforms prior work on this commonly used dataset, including approaches based on Binary Relevance, EncDec, CNN, CNN-RNN, Optimal Completion Distillation or attention-based GNN, which achieved micro-F1 < 89.9 (Nam et al., 2017; Pal et al., 2020; Tsai and Lee, 2020).

Experiments with PubMed dataset. PubMed is a biomedical domain-specific, larger dataset with greater class imbalance. For this dataset, BCE does not work efficiently; therefore, we use FL as a strong baseline. With FL, the medium and tail micro-F1 scores are 26 and 9. All other loss functions outperform FL in medium and tail classes, indicating the advantage of balancing label distribution. DB again has the highest performance for all classes, but the most significant improvement is achieved for the medium (micro-F1: 41) and tail (micro-F1: 24) classes.

Ablation Study. We further investigate the contribution of the three layers of DB by comparing DB results with R-FL, NTR-FL and DB without the focal layer (DB-0FL). As shown in Table 2, for both datasets, removing the NTR layer (R-FL) or the focal layer (DB-0FL) reduces model performance for all subsets. Removing the rebalanced weighting layer (NTR-FL) yields similar total micro-F1 (Reuters: 90, PubMed: 60), but the macro-F1 as well as the medium and tail F1 scores are higher with DB, showing the value of adding the rebalancing weighting layer. We also test the contribution of NTR by integrating it with CB, yielding a novel loss function, CB-NTR, that has not been previously explored. For both datasets, CB-NTR has better performance than CB for all class sets (Table 2). The only difference between CB-NTR and DB is the use of the CB weight $r_{CB}$ instead of the rebalancing weight $\hat{r}_{DB}$. DB performs very close to or outperforms CB-NTR in the medium and tail classes, suggesting that the $\hat{r}_{DB}$ weight, which addresses the co-occurrence challenge, is useful.

Error Analysis.
We perform an error analysis and observe that, for all loss functions, the most common errors are due to incorrect classification to similar or linked labels. The three pairs of classes most commonly confused by all loss functions for the Reuters dataset are: platinum and gold, yen and money-fx, platinum and copper. For the PubMed dataset, the most common errors are: Pandemics and Betacoronavirus, Pandemics and SARS-CoV-2, Pneumonia, Viral and Betacoronavirus. BCE has significantly more errors for these classes than the other investigated loss functions.

We propose and compare the application of a series of balancing loss functions to address the class imbalance problem in multi-label text classification. We first introduce the loss function DB to NLP and design a novel loss function, CB-NTR. The experiments show that DB outperforms other approaches by considering long-tailed distribution and label co-occurrence, and that its performance is robust to different datasets such as Reuters (90 labels, general domain) and PubMed (18,211 labels, biomedical domain). This study demonstrates that addressing challenges such as class imbalance and label co-occurrence through loss functions is an effective approach for multi-label text classification. It does not require additional information and can be used with all types of neural network-based models. It may also be a powerful strategy for other NLP tasks, such as part-of-speech tagging, named entity recognition, machine reading comprehension, paraphrase identification and coreference resolution, all of which usually suffer from long-tailed distribution.
Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages
Representation learning: A review and new perspectives
Addressing imbalance in multilabel classification: Measures and random resampling algorithms
SPECTER: Document-level representation learning using citation-informed transformers
Class-balanced loss based on effective number of samples
BERT: Pre-training of deep bidirectional transformers for language understanding
Learning a deep convnet for multi-label classification with partial labels
A multiple resampling method for learning from imbalanced data sets
Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning
Construe/tis: A system for content-based indexing of a database of news stories
Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
A multilabel classification model for full slice brain computerised tomography image
Dice loss for data-imbalanced NLP tasks
Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection
Optimal thresholding of classifiers to maximize F1 measure
V-net: Fully convolutional neural networks for volumetric medical image segmentation
Maximizing subset accuracy with recurrent neural networks in multilabel classification
Magnet: Multi-label text classification using attention-based graph neural network
Scikit-learn: Machine learning in Python
Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising
Order-free learning alleviating exposure bias in multi-label classification
Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos, and Georgios Paliouras. 2015.
An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition
Transformers: State-of-the-art natural language processing
Learning to learn and predict: A meta-learning approach for multi-label classification
Distribution-balanced loss for multi-label classification in long-tailed datasets
HSCNN: A hybrid-Siamese convolutional neural network for extremely imbalanced multi-label text classification
A re-examination of text categorization methods

We thank Igor Kulev for the helpful discussions, and the anonymous reviewers for their constructive suggestions. TUBITAK-BIDEB 2211-A Scholarship Program (to A.K.) and TUBA-GEBIP Award of the Turkish Science Academy (to A.O.) are gratefully acknowledged.

A.1 Experimental Settings

Evaluation metrics. For each dataset and method, we select the threshold with the best micro-F1 score on the validation set as our final model and evaluate its performance on the test set with micro-F1 and macro-F1 scores.

Loss function parameters. We compare the performance of DB with different loss functions, where BCE or its modifications are used. The methods include:
(1) BCE with all instances and labels of the same weight.
(2) FL (Lin et al., 2017): we use γ=2.
(3) CB (Cui et al., 2019): we use β=0.9.
(4) R-FL (Wu et al., 2020): we use α=0.1, β=10, and µ=0.9 (Reuters-21578) or 0.05 (PubMed).
(5) NTR-FL (Wu et al., 2020): we use κ=0.05 and λ=2.
(6) DB (Wu et al., 2020): we use the same parameters as R-FL and NTR-FL when applicable.

Implementation Details. We use the BertForSequenceClassification backbone in the transformers library (Wolf et al., 2020) with the bert-base-cased pretrained model (Devlin et al., 2018) for the Reuters-21578 dataset and the biobert-base-cased-v1.1 pretrained model (Lee et al., 2019) for the PubMed dataset. bert-base-cased and biobert-base-cased-v1.1 are base BERT models with 110 million parameters.
The training data are truncated to a maximal length of 512 and grouped with a batch size of 32. We use AdamW with a weight decay of 0.01 as the optimizer, and determine the learning rate by hyperparameter search. The experiments are implemented in PyTorch. For the Reuters-21578 dataset, we run experiments on one GPU (V100), where one epoch takes 5 minutes. For the PubMed dataset, we run experiments on one GPU (A100), where one epoch takes 1 hour. For the SVM one-vs-rest model, we use the scikit-learn library (Pedregosa et al., 2011) with TF-IDF features. With hyperparameter search, we apply the linear kernel and hyper-plane shifting optimized on each validation set.

We further investigate the effectiveness of loss functions against the number of labels per instance (Table 3 in Appendix). For the Reuters dataset, we split the test instances into two groups: 2583 instances with only one label and 436 instances with multiple labels. On single-label instances, all functions from BCE to DB have similar performance, while on multi-label instances, the performance of BCE drops more than that of DB. DB outperforms other functions in micro-F1 of the multi-label instance group and macro-F1 of both groups. There are < 0.1% instances of the PubMed dataset with a single label, so we divide instances into 3-quantiles by their number of labels. In each quantile, the novel NTR-FL, CB-NTR and DB outperform the rest of the models in all metrics.

Figure 3: The long-tailed distribution and label co-occurrence for the PubMed dataset. The y-axis of the distribution curve is log-scale, and the co-occurrence matrix is color coded based on the quad root (for better visualization) of the conditional probability p(i|j) of the class in the i-th column on the class in the j-th row.
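The evaluation procedure described under "Evaluation metrics" in A.1 (choose the decision threshold with the best validation micro-F1, then report micro-F1 and macro-F1) can be sketched as follows. This is a minimal pure-Python illustration rather than the authors' code: the function names, the use of one global threshold shared by all labels, the candidate grid, the convention that a label with no true or predicted positives contributes an F1 of 0 to the macro average, and the toy data are all assumptions.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """y_true, y_pred: lists of per-instance 0/1 label lists of equal shape."""
    num_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(num_labels)]  # [tp, fp, fn] per label
    for truth, pred in zip(y_true, y_pred):
        for i, (t, p) in enumerate(zip(truth, pred)):
            if t and p:
                counts[i][0] += 1
            elif p:
                counts[i][1] += 1
            elif t:
                counts[i][2] += 1
    # Micro-F1 pools the counts over labels; macro-F1 averages per-label F1.
    micro = f1(*[sum(c[k] for c in counts) for k in range(3)])
    macro = sum(f1(*c) for c in counts) / num_labels
    return micro, macro


def binarize(probs, threshold):
    return [[1 if p >= threshold else 0 for p in row] for row in probs]


def select_threshold(y_true, probs, candidates):
    """Return the candidate threshold with the highest micro-F1 and that score."""
    best_t, best_micro = None, -1.0
    for t in candidates:
        micro, _ = micro_macro_f1(y_true, binarize(probs, t))
        if micro > best_micro:
            best_t, best_micro = t, micro
    return best_t, best_micro


# Toy validation set: label 0 is frequent, label 1 is rare.
val_true = [[1, 0], [1, 0], [1, 1], [0, 0]]
val_probs = [[0.9, 0.2], [0.8, 0.1], [0.7, 0.4], [0.3, 0.1]]
best_t, best_micro = select_threshold(val_true, val_probs, [0.3, 0.5])
micro_05, macro_05 = micro_macro_f1(val_true, binarize(val_probs, 0.5))
```

In the toy data, a threshold of 0.5 misses the only positive of the rare label, so micro-F1 stays high while macro-F1 drops to 0.5; this mirrors the gap between micro-F1 and macro-F1 that the paper attributes to the skewed label distribution.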