key: cord-0223228-iotga7e3
authors: Claveau, Vincent; Chaffin, Antoine; Kijak, Ewa
title: Generating artificial texts as substitution or complement of training data
date: 2021-10-25
sha: ac272191cb9a991b84353f275735988ff0a69a8b
doc_id: 223228
cord_uid: iotga7e3

The quality of artificially generated texts has considerably improved with the advent of transformers. The question of using these models to generate learning data for supervised learning tasks naturally arises. In this article, this question is explored under three aspects: (i) are artificial data an efficient complement? (ii) can they replace the original data when those are not available or cannot be distributed for confidentiality reasons? (iii) can they improve the explainability of classifiers? Different experiments are carried out on Web-related classification tasks -- namely sentiment analysis on product reviews and fake news detection -- using data artificially generated by fine-tuned GPT-2 models. The results show that such artificial data can be used to a certain extent but require pre-processing to significantly improve performance. We show that bag-of-words approaches benefit the most from such data augmentation.

Even if text generation is not a new technology, recent neural approaches based on transformers offer performance good enough to be used in various contexts [25]. In this paper, we explore the use of artificially generated texts for supervised machine learning tasks within two different scenarios: the artificial data are used as a complement of the original training dataset (for instance, to yield better performance), or they are used as a substitute of the original data (for instance, when the original data cannot be shared because they contain confidential information [1]). The generation of these artificial texts is performed with a neural language model trained on the original training texts. In this paper, we show the interest of these scenarios on two Web-related text classification tasks, handling real and noisy language: fake news detection and opinion mining.

More precisely, the main research questions addressed in this paper are the following: (1) what is the interest of text generation to improve text classification (complement)? (2) what is the interest of text generation to replace the original training data (substitution)? (3) what is the interest of text generation for explainable classifiers, based on bag-of-words representations?

In the remainder of the paper, after a presentation of related work in Section 2, we detail our classification approaches based on artificial text generation (Sec. 3). The tasks and experimental data are described in Section 4. The experiments and their results for each of our research questions are reported in Section 5 for the neural classifiers and in Section 6 for the bag-of-words based classifiers.

Data augmentation for Natural Language Processing (NLP) tasks has already been explored in several studies. Some researchers propose more or less complex automatic modifications of the original examples in order to create new examples that are worded differently but remain similar with respect to the NLP task (same class, same relation between words...). This is done, for instance, by simply replacing some words by synonyms [9, 10, 17, 26]. The synonyms can be found in external resources such as WordNet [15], in distributional thesauri, or from static word embeddings (such as GloVe [20] or word2vec [14]).
In a similar vein, since they only modify the original examples locally, some neural techniques exploit masked language models (such as BERT [7]), that is, context-sensitive word embeddings. These approaches work by masking a word in an original example with the [mask] token and conditioning its replacement on the expected class [28]. This makes it possible to generate a new example by replacing a word with another semantically close word (ideally a synonym). It is worth noting that, contrary to what we propose, the new example is not totally different from the original one (its syntactic structure, for instance, remains very similar). Other approaches exploit language models such as GPT-2 (Generative Pre-trained Transformer [23]) to produce a large quantity of data (texts) that follow the original data distribution. In Information Retrieval, this principle has been used to expand users' queries [6]. Even closer to our work, text generation has been used for relation extraction [18], sentiment analysis of reviews and questions [11], or for the prediction of hospital readmission and phenotype classification [1]. This paper is part of this line of work. Our interest here is to examine the gains and losses of our different scenarios of using artificial data, their preparation, and their effects on different families of classifiers.

Let us assume that we have a set of (original) texts T divided into classes c, from which we wish to generate artificial texts G_c for each class c. As explained in the introduction, we want to examine different scenarios of usage of these generated data: complement or substitution. These scenarios, as well as the usual text classification framework, are illustrated in Figure 1.

We use GPT models to generate the artificial texts. These models are built by stacking transformer blocks (more precisely decoders), trained on large corpora by auto-regression, i.e. on the task of predicting the next word (or token) given the previous ones. The second version, GPT-2 [23], contains 1.5 billion parameters for its largest model, trained on more than 8 million documents collected from Reddit (i.e. general-domain language such as discussions about news articles, mostly in English). A newer version, GPT-3, was released in July 2020; it is much larger (175 billion parameters) and outperforms GPT-2 on every tested task. Yet, the experiments reported below require fine-tuning, which is not feasible with such a large model, which rather relies on prompt engineering for task adaptation.

For this fine-tuning step, we start from the medium model (774M parameters) pre-trained for English and made available by OpenAI (https://github.com/openai/gpt-2). In the work presented in this paper, we fine-tune one language model per class with the original training data T. Another training procedure available in the literature is to adapt a single model but condition it with a special token indicating the expected class at the beginning of the text sequence (i.e. at the beginning of each original example). Due to the limited amount of data available per class (compared to the number of parameters of the GPT-2 model), it is important to control the fine-tuning to avoid overfitting. To do so, we limit the number of epochs to 2,000; the other fine-tuning parameters are the default ones of the OpenAI GPT-2 code used in our experiments. On a Tesla V100 GPU card, this fine-tuning step lasts about one hour for each dataset (see below).
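To make this per-class fine-tuning concrete, here is a minimal sketch using the gpt-2-simple wrapper (the library referenced in the replicability note at the end of the paper). The class identifiers, file names and step count are illustrative placeholders, not the authors' exact configuration.

```python
# Minimal sketch: fine-tune one GPT-2 language model per class.
# File names, class labels and the step count are illustrative only.
import gpt_2_simple as gpt2

MODEL_NAME = "774M"                 # medium English model released by OpenAI
CLASSES = ["5G", "other", "non"]    # hypothetical class identifiers

gpt2.download_gpt2(model_name=MODEL_NAME)   # fetch the pre-trained checkpoint once

for label in CLASSES:
    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess,
                  dataset=f"train_{label}.txt",  # one plain-text file per class
                  model_name=MODEL_NAME,
                  steps=2000,                    # kept small to limit overfitting
                  run_name=f"gpt2_{label}")
    gpt2.reset_session(sess)
```

Each fine-tuned checkpoint is then used to generate texts for its own class only, which is what allows the generated data to be labelled without further annotation.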
For each class c of the dataset T, we use the corresponding model to generate artificial texts G_c which hopefully fall into the desired class. We provide prompts for these texts in the form of a start-of-text token followed by a word randomly drawn from the set of original texts. Several parameters can influence the generation. We used the default values, given here for reproducibility purposes without detailing them (see the GPT-2 documentation): temperature = 0.7, top_p = 0.9, top_k = 40.

Generated texts containing a sequence of 5 consecutive words that appears identically in a text of T are removed. This serves two purposes: on the one hand, it limits the risk of revealing an original document in the case where the T data are confidential; on the other hand, it limits duplicates, which are harmful to the training of a classifier when the G data are used in addition to T. In practice, this concerns about 10% of the generated texts in our experiments. Note that in the scenario where the data are confidential, providing the generator itself is not possible, since it could be used to recover the data it was fine-tuned on. In the experiments reported below, 16,000 texts are thus generated for each class (this number was fixed arbitrarily).
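As an illustration of this 5-gram overlap filter, the following sketch removes any generated text that shares a sequence of 5 consecutive words with some original text. The whitespace tokenization and data structures are assumptions for the sake of the example, not necessarily the authors' implementation.

```python
# Sketch of the 5-gram overlap filter: drop any generated text that shares a
# sequence of 5 consecutive words with some original training text.
from typing import Iterable, List, Set, Tuple


def word_ngrams(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """All word n-grams of a text, using naive whitespace tokenization."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def filter_generated(generated: Iterable[str], originals: Iterable[str], n: int = 5) -> List[str]:
    # Build the set of all n-grams seen in the original corpus T.
    original_ngrams: Set[Tuple[str, ...]] = set()
    for text in originals:
        original_ngrams |= word_ngrams(text, n)
    # Keep only generated texts with no n-gram in common with T.
    return [g for g in generated if word_ngrams(g, n).isdisjoint(original_ngrams)]
```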
In the scenario where the original data cannot be distributed, notably for confidentiality reasons, it is appropriate to ask whether sensitive information can be recovered with the proposed approach. As said earlier, if the whole generative model is made available, this risk has been studied [2] and exists, at least from a theoretical point of view, under particular conditions. When only the generated data are made available, there is also a risk of finding confidential information in them. Without other safeguards, it is indeed possible that some of the generated texts are paraphrases of sentences of the training corpus. However, in practice, the risk is very limited:
• first, because there is no way for the user to distinguish these paraphrases among all the generated sentences;
• secondly, because additional measures can be taken upstream (for example, de-identification of the training corpus) and downstream (deletion of generated sentences containing specific or nominative information...);
• finally, more complex systems to remove paraphrases, such as those developed for Semantic Textual Similarity tasks [8, inter alia], can even be considered.
These measures make it highly unlikely that any truly usable information can be extracted from the generated data.

The experiments detailed in the next section are real-world Web classification tasks: fake news detection in tweets and sentiment analysis in product reviews. Both are classification tasks usually addressed with machine learning. We test different languages: the fake news dataset consists of tweets in English, while the sentiment analysis dataset is in French. They are presented hereafter.

The first dataset was developed for the detection of fake news within social networks as part of the MediaEval 2020 FakeNews challenge [21]. In this task, tweets about 5G or the coronavirus were manually annotated according to three classes c ∈ {'5G', 'other', 'non'} [24]. '5G' contains tweets propagating conspiracy theories associating 5G and the coronavirus, 'other' contains tweets propagating other conspiracy theories (which may be about 5G or covid but without associating them), and 'non' contains tweets not propagating any conspiracy theory. It is worth noting that the classes are imbalanced in the training dataset T. The data augmentation (i.e., text generation) is performed as explained in the previous section. Figure 2 presents three examples of texts generated from the MediaEval 2020 training data for the '5G' class.

The second dataset is taken from the FLUE evaluation suite for French [12]. It is the French part of the Cross-Lingual Sentiment (CLS-FR) dataset [22], which consists of product reviews (books, DVDs, music) from Amazon. The task is to predict whether a review is positive (rated more than 3 stars on the merchant site) or negative (less than 3 stars). The dataset is divided into balanced training and test sets. In our experiments, we do not distinguish between products: we have only two classes (positive, negative) with reviews of books, DVDs or music. As with the MediaEval data, a language model is fine-tuned for each class using the training data. Generation is then done as described in the previous section. Examples of generated negative reviews are given in Figure 3.

As can be seen from these examples (including the MediaEval examples in Figure 2), the generated texts seem to belong to the expected class (see Section 5.2 for a discussion of this point). However, they often have flaws that make it detectable that they were generated. This is particularly the case for the French texts, which can be explained by the fact that, at the time of the experiments, we did not have a pre-trained model for French; the model, as well as the tokenizer, are therefore based on the English GPT model.

Figure 2 (examples of generated tweets for the '5G' class):
- If the FBI ever has evidence that a virus or some other problem caused or contributed to the unprecedented 5G roll out in major metro areas, they need to release it to the public so we can see how much of a charade it is when you try to downplay the link.
- So let's think about this from the Start. Is it really true that 5G has been activated in Wuhan during Ramadan? Is this a cover up for the fact that this is the actual trigger for the coronavirus virus? Was there a link between 5G and the coronavirus in the first place? Hard to say.
- We don't know if it's the 5G or the O2 masks that are killing people. It's the COVID19 5G towers that are killing people. And it's the Chinese people that are being controlled by the NWO

Figure 3 (example of a generated negative review, in French; truncated):
- Déçue... J'ai eu je l'avoue du mal à lire ce livre arrivé au milieu de celui-ci. L'histoire ne paraît pas vraiment très réaliste. Le [...]

GPT-2 models for French released very recently (for example, the Pagnol model by LightOn: https://lair.lighton.ai/pagnol/) could improve this aspect.

In the experiments reported below, the performance is measured in terms of micro-F1 (equivalent to accuracy) and, to take into account the class imbalance (notably in the MediaEval dataset), in terms of macro-F1 and MCC (Matthews Correlation Coefficient, also called the Φ coefficient), as implemented in the scikit-learn library [19]. The performance is measured on the official test sets of the MediaEval [21] and CLS-FR [12] tasks, which are of course disjoint from the training sets T.
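These three measures can be computed with scikit-learn as in the short sketch below; the label values are illustrative placeholders, not actual predictions from the paper.

```python
# Evaluation measures used in the paper, computed with scikit-learn.
# y_true / y_pred are illustrative placeholders for gold and predicted labels.
from sklearn.metrics import f1_score, matthews_corrcoef

y_true = ["5G", "other", "non", "non", "5G"]
y_pred = ["5G", "non",   "non", "non", "other"]

micro_f1 = f1_score(y_true, y_pred, average="micro")  # equivalent to accuracy here
macro_f1 = f1_score(y_true, y_pred, average="macro")  # sensitive to minority classes
mcc = matthews_corrcoef(y_true, y_pred)               # Matthews Correlation Coefficient

print(f"micro-F1={micro_f1:.3f}  macro-F1={macro_f1:.3f}  MCC={mcc:.3f}")
```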
For our first experiments, we use state-of-the-art neural classification models based on transformers. For the MediaEval data, in English, we opt for a RoBERTa [13] model pre-trained for English (large model with a classification layer). This type of transformer-based model obtained the best results on these data during the MediaEval 2020 challenge [4, 5]. Among the variants of BERT [7], RoBERTa was preferred here because its tokenizer is better adapted to the very free form of writing found in tweets (mix of upper and lower case, absence or multiplication of punctuation, abbreviations...). For the CLS-FR data of FLUE, we use the large-cased FlauBERT model [12]. This allows us to compare with the results originally published on these data.

We evaluate the performance according to our different training scenarios: on the original data T (which serves as a baseline), on the artificial data G, and finally on both the artificial and original data. In this last case, we test two training strategies:
• the first, T + G, mixes the original and artificial examples;
• the second, G then T, trains on the artificial data for the first epochs, then on the original data for the last epoch. This amounts to a kind of fine-tuning on the original data after a first training on the artificial data.
The implementation that we use is based on HuggingFace's Transformers library [27], with the batch size set to 16 and the number of epochs set to 3 in all scenarios (the optimal number of epochs for the baseline), except the last one (3 epochs on G followed by 1 on T).

The results for the MediaEval and CLS-FR datasets are reported in Table 1. On the CLS-FR data, we observe very few differences between the scenarios and compared to the baseline (note that our baseline is similar to the published state-of-the-art results). For this relatively simple classification task, the generated data are evidently of as good quality as the original data, leading to comparable results. On this type of task, artificially generated data can therefore be used without loss of performance. The MediaEval task is more difficult, as can be seen from the results of the baseline (RoBERTa / T). On these data, in the substitution scenario (i.e. when the generated data are used alone as training data), the results are strongly degraded compared to a system trained on the original data. This is of course due to the fact that the data generated by each of the language models may not belong to the expected class, as the models do not fully capture the specificity of the fine-tuning data. In the complement scenario, the impact is less significant, especially if the artificial data are used only on the first few epochs.

As we have seen, the G examples generated by our fine-tuned GPT-2 models may contain texts that do not belong to the expected classes. Manually filtering or annotating these texts is of course possible but remains a costly task. To reduce the effect of these texts on the classification at a lower cost, we propose to exclude them using a first classifier trained on the original data T: any text of G_c that is not assigned to class c by this classifier is removed. In this way, we hope to eliminate, automatically, the most obvious cases of problematic artificial texts. In the following experiments, we use the RoBERTa classifier trained on T (evaluated in the first row of Tab. 1). In this way, 40% of the examples are deleted. The artificial examples kept after this step constitute the filtered set used below.
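The filtering step just described can be sketched as follows; `classify` is a hypothetical placeholder for any classifier trained on T (here the RoBERTa or FlauBERT model), and the dictionary layout of the generated data is an assumption for the example.

```python
# Sketch of the classifier-based filtering of generated texts: a classifier
# trained on the original data T discards any artificial text whose predicted
# class differs from the class it was generated for.
from typing import Callable, Dict, List


def filter_by_classifier(generated_by_class: Dict[str, List[str]],
                         classify: Callable[[str], str]) -> Dict[str, List[str]]:
    filtered: Dict[str, List[str]] = {}
    for label, texts in generated_by_class.items():
        # Keep a generated text only if the T-trained classifier agrees with its
        # intended label; otherwise it is considered a problematic example.
        filtered[label] = [t for t in texts if classify(t) == label]
    return filtered
```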
The results with these new filtered sets of artificial examples in the same training scenarios are presented in Table 2 for the MediaEval and CLS-FR data. It can be seen that this filtering strategy pays off, with improved performance on all metrics compared to no filtering. In the substitution scenario, the performance is now close to the baseline, and is even better on macro-F1; this is explained by the fact that the artificial set G is much more balanced than T and thus performs better on the minority classes of the test set. In the complement scenario, we observe a significant improvement over the baseline, especially with the sequential strategy.

Beyond the global performance measures, it can be interesting to check whether the classifier trained on the artificial data makes the same decisions as a classifier trained on T. To do so, we can look at the proportion of examples (from the test set) for which the decisions of BERT* / T and BERT* / G differ. For the CLS-FR data, the classifiers agree on a large majority of examples. Table 3 shows the confusion matrix of FlauBERT / T and FlauBERT / G on the CLS-FR data. From this confusion matrix, we can see that the classifiers do agree on the majority of examples. The cases of disagreement are proportionally more important for the false positives and false negatives, but even for these categories, we still find many common errors (42 and 77 examples respectively for the false positives and false negatives). The classifiers therefore have not only comparable performance, but also very similar behaviors in detail, since they give the same class on most examples.

We also test classifiers based on bag-of-words representations; we present only the results of the logistic regression (LR), which gave the best results. In general, these classifiers perform less well than the transformer-based approaches, but they allow for better explainability [3, 16, for a definition and characterization of explainable learning methods], for example by examining the regression weights associated with words. They are also far less expensive to train. The implementation used is scikit-learn [19]; the texts are vectorized with TF-IDF weighting and L2-normalized, and the LR parameters are the default ones except for the following: multiclass strategy one-vs-rest, number of iterations = 2,500.

Results for the same scenarios as above are presented for the MediaEval and CLS-FR tasks in Tables 4 and 5. For this type of classifier, the interest of the generated data appears in both scenarios and on both datasets. In the case of substitution, the classifiers are slightly better than those trained on the original data. This demonstrates the importance of having a larger amount of data to capture form variants in texts (synonyms, paraphrases...) that bag-of-words representations cannot otherwise capture as easily as the pre-trained embedding-based representations of the BERT models. In the scenario where the data are used as a complement, the performance increase is even more marked and becomes close to the neural baseline, while keeping the advantages of a classifier considered more interpretable.

It is interesting to examine the influence of the quality of the generated data (even filtered) on the results of the final classifier (see Section 5.2). To study this, we inject noise into the decisions of the filtering classifier so as to degrade its accuracy; the results are presented in Figure 4 (MediaEval data), with logistic regression as the final classifier.
As can be seen in this figure, the empirical results about the influence of filtering quality are unsurprising. In the substitution scenario, the final performance is strongly dependent on the quality of the filtering classifier; in this case, a performance level equivalent to that of the original dataset is achieved when the accuracy of the filter exceeds 70%. In the complement scenario, the gain is significant as soon as the filter has an accuracy higher than random.

In this work, we have explored the interest of text generation for two text classification tasks from the Web (fake news detection in tweets and sentiment analysis on product reviews). In a scenario where the original training data cannot be distributed, we have shown that it is possible to generate artificial data for supervised learning purposes. For state-of-the-art classifiers based on transformers, this degrades the performance (compared to that achieved with the original data), but to a limited extent (-4% accuracy). On the other hand, for classifiers exploiting bag-of-words representations, we observe an improvement due to the larger amount of training data available. In a scenario where artificial data are added to the original data, we have shown that classifiers, including neural networks, benefit from the additional data. This result is particularly positive for the bag-of-words approaches, which are more sensitive to reformulations and clearly benefit from the addition of these artificial examples. We thus obtain a good compromise: methods that are fast to train and more easily explainable, while having performance close to that of neural networks. As we have seen, these results are obtained provided that the generated data are filtered beforehand, which seems to contradict several studies cited in Sec. 2. In our experiments, this was done automatically; manual correction of the data (of their classes) is also possible and may allow better results, but with an additional annotation cost.

The use of these methods for other data and other NLP tasks than text classification remains a promising avenue. Among these NLP tasks, those based on word labeling (token classification) pose different problems and require adapted solutions. In the future, it would be interesting to verify the consistency of our results with other generation approaches [11]. It also seems interesting to study more deeply the impact of the quality of the classifier used to filter the artificial data. Moreover, the integration of this filtering step as a constraint during the generation of artificial examples is a promising avenue.

For replicability purposes, the training scenarios presented in this article are available online for the MediaEval and CLS-FR tasks. The generation of examples relies on https://github.com/minimaxir/gpt-2-simple. The data are available from their producers (see Section 4).

References
- Exploring Transformer Text Generation for Medical Dataset Augmentation
- Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2020. Extracting Training Data from Large Language Models. arXiv (2020)
- TIB's Visual Analytics Group at MediaEval '20: Detecting Fake News on Corona Virus and 5G Conspiracy
- Detecting fake news in tweets from text and propagation graph: IRISA's participation to the FakeNews task at MediaEval 2020
- Query expansion with artificially generated texts
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization
- Towards textual data augmentation for neural networks: synonyms and maximum loss
- Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations
- Data Augmentation using Pre-trained Transformer Models
- FlauBERT: Unsupervised Language Model Pre-training for French
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Distributed Representations of Words and Phrases and their Compositionality
- WordNet: A Lexical Database for English
- Explanation in Artificial Intelligence: Insights from the Social Sciences
- Siamese Recurrent Architectures for Learning Sentence Similarity
- DARE: Data Augmented Relation Extraction with GPT-2
- Scikit-learn: Machine Learning in Python
- GloVe: Global Vectors for Word Representation
- FakeNews: Corona Virus and 5G Conspiracy Task at MediaEval 2020
- Cross-Language Text Classification Using Structural Correspondence Learning
- Language Models are Unsupervised Multitask Learners
- FACT: a Framework for Analysis and Capture of Twitter Graphs
- Attention is All you Need
- EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
- Transformers: State-of-the-Art Natural Language Processing
- Conditional BERT Contextual Augmentation