RoBERTuito: a pre-trained language model for social media text in Spanish
Juan Manuel Pérez, Damián A. Furman, Laura Alonso Alemany, Franco Luque
2021-11-18

Since BERT appeared, Transformer language models and transfer learning have become state-of-the-art for Natural Language Understanding tasks. Recently, some works have focused on pre-training specially-crafted models for particular domains, such as scientific papers, medical documents, and user-generated text, among others. These domain-specific models have been shown to improve performance significantly in most tasks. However, for languages other than English, such models are not widely available. In this work, we present RoBERTuito, a pre-trained language model for user-generated text in Spanish, trained on over 500 million tweets. Experiments on a benchmark of tasks involving user-generated text showed that RoBERTuito outperformed other pre-trained language models in Spanish. In addition, our model achieves top results on some English-Spanish tasks of the Linguistic Code-Switching Evaluation benchmark (LinCE) and also performs competitively against monolingual models on English tasks. To facilitate further research, we make RoBERTuito publicly available on the HuggingFace model hub, together with the dataset used to pre-train it.

Pre-trained language models have become a basic building block in natural language processing. In recent years, since the introduction of the transformer architecture (Vaswani et al., 2017), they have been used in a wide range of natural language understanding tasks, outperforming previous models based on recurrent neural networks. BERT and RoBERTa are among the most well-known such models.

Most models are trained on large-scale corpora taken from news or Wikipedia, which are considered general enough to cover a large part of the language. Some domains, however, are very specific and have their own vocabulary, jargon, or complex expressions. The medical or scientific domains, for example, use terms and concepts which are not found in a general corpus or occur only a few times. In other cases, words have specific meanings within a particular domain. Colloquial language, as found on Twitter and other social networks, is more informal, with slang and other expressions which rarely occur in Wikipedia. For these reasons, a number of pre-trained models have been created to handle these domains. SciBERT (Beltagy et al., 2019) and MedBERT (Rasmy et al., 2021) are examples of domain-specific models. For user-generated text, many models have been trained on Twitter for different languages. However, Spanish lacks pre-trained models for user-generated text, or they are not easily available in the most popular model repositories, such as the HuggingFace model hub. This hinders the development of accurate applications for user-generated text in Spanish.

In this paper, we present RoBERTuito, a large-scale transformer model for user-generated text trained on Spanish tweets. We show that RoBERTuito outperforms other Spanish pre-trained models on a number of classification tasks on Twitter. In addition, due to the collection process of our pre-training data, RoBERTuito is a very competitive model in multilingual and code-switching settings involving Spanish and English.
Our contributions are the following:

• We publish the data used to train RoBERTuito (around 500M tweets in Spanish), to facilitate the development of other language models or embeddings, including from subsets of the corpus that model specific subdomains, such as regional or thematic variants.

• We make the weights of our model available through the HuggingFace model hub, thus sparing researchers without access to large computational resources the cost of pre-training, or simply sparing extra computation, while making the model transparent (albeit not interpretable).

• We set up a benchmark for classification tasks involving user-generated text in Spanish.

• We assess the performance of domain-specific models with respect to general-language models, showing that the former outperform the latter on the corresponding domain-specific tasks.

• We assess the impact of preprocessing strategies for our models (cased input vs. uncased input vs. uncased input without accents), showing that the uncased version of the corpus yields better performance.

Language models based on transformers (Vaswani et al., 2017) have become a key component in state-of-the-art NLP systems, from text classification to natural language generation. One of the most popular transformer-based tools, BERT (Devlin et al., 2019), is a neural bidirectional language model trained on the Masked-Language-Model (MLM) task and the Next-Sentence-Prediction (NSP) task. This language model can then be fine-tuned for a downstream task or used to compute contextualized word representations. RoBERTa (Liu et al., 2019) is an optimized pre-training approach that differs from BERT in four aspects: it trains the model with more data; it removes the NSP objective; it uses larger batches; and it dynamically changes the masking pattern applied to the data. These models, along with GPT (Radford et al., 2018), represented breakthroughs in performance on benchmarks such as GLUE (Wang et al., 2018). Nozza et al. (2020) provide a good overview of BERT-based language models.

After the explosion of language models based on transformers, some models have been trained on corpora that target a specific domain of interest instead of generic texts such as Wikipedia or news. For example, SciBERT (Beltagy et al., 2019) is a BERT model trained on scientific texts, and MedBERT (Rasmy et al., 2021) was trained on medical documents. AlBERTo (Polignano et al., 2019) is one of the first models trained on tweets, in this case in Italian. BERTweet (Nguyen et al., 2020) is a RoBERTa model trained on about 850M tweets in English, a part of them related to the COVID-19 pandemic.

Multilingual models have also been successful at many tasks comprising more than one language. Multilingual BERT (mBERT) (Devlin et al., 2019) was pre-trained on the concatenation of Wikipedia in the top 104 languages. In a parallel fashion, XLM-R (Conneau et al., 2020) was trained following RoBERTa guidelines on the concatenation of Common Crawl data covering more than 100 languages, obtaining a considerable performance boost on several multilingual tasks while remaining competitive with monolingual models. For Spanish, BETO (Cañete et al., 2020) is a BERT-based pre-trained model; a more recent RoBERTa-based model (… et al., 2022) was developed by exploring sampling strategies over the Spanish portion of the mC4 corpus (Raffel et al., 2020). To the best of our knowledge, TwilBERT is the only pre-trained model specialized in Twitter data for the Spanish language.
However, this model has some limitations. First, the training data is not available, making it not auditable. Second, it is not clear how long its pre-training was. Third, the authors used a variant of the NSP task adapted to Twitter (Reply Order Prediction), despite many works showing that RoBERTa-style training (MLM only) improves performance on downstream tasks. Finally, the model is not easily available (for instance, on the HuggingFace model hub), which makes it difficult to use.

In this section, we describe the tweet collection process used to build the corpus to train RoBERTuito. Twitter's free-access streaming API (also known as Spritzer) provides a supposedly random sample of around 1% of all published tweets, although some studies have raised concerns about possible manipulation of this sample (Pfeffer et al., 2018). Unrepresentative, biased samples may produce biased behaviours in the resulting model and systematic, possibly harmful errors in downstream tasks that use it. That is why we make the training dataset and the specifics of the model available, so that they can be fully inspected in case biases are suspected. In future releases of this tool, an extensive audit of the training corpus will be carried out.

First, we downloaded a Spritzer collection uploaded to Archive.org dating from May 2019. From this collection, we kept only tweets whose language metadata was Spanish, and marked the users who posted these messages. Then, the timeline of each of these marked users was downloaded. We decided to download data from users already represented in the initial collection to facilitate user-based studies on this dataset, and also because we believe the original sample of users to be representative, so we hope to maintain this representativeness by sampling from the same users. In total, the collection consists of around 622M tweets from about 432K users.

Finally, we filtered out tweets with fewer than six tokens, because the language contained in them is very different from the language in longer tweets. To count tokens we used the tokenizer trained for BETO (Cañete et al., 2020), without counting character repetitions and emojis. This leaves a training corpus of around 500M tweets, which we split into many files to facilitate reading in later processes. The code for the collection process can be found at https://github.com/finiteautomata/spritzer-tweets.

It is worth noting that this collection process allows the data to contain code-mixed text or even tweets in other languages, as we only required the posts in the original sample to be in Spanish. While other works such as Nguyen et al. (2020) required every tweet to be in English, we allowed other languages to be included in the pre-training data. A rough estimate of the language distribution using fastText's language-detection module (Joulin et al., 2016) suggests that 92% of the data is in Spanish, 4% in English, 3% in Portuguese, and the rest in other languages.
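As a concrete illustration, the following is a minimal Python sketch of the filtering and language-identification steps described above. The BETO checkpoint name, the repetition-collapsing regex, and the emoji handling are our assumptions; the authors' actual script (linked above) may differ.

```python
# Hedged sketch: drop tweets with fewer than six tokens, counting tokens with the
# BETO tokenizer while ignoring character repetitions and emojis, and estimate the
# language distribution with fastText's language-id model.
import re

import emoji                      # pip install emoji  (replace_emoji needs emoji>=2.0)
import fasttext                   # pip install fasttext
from transformers import AutoTokenizer

# Assumed hub id for BETO; the text only says "the tokenizer trained for BETO".
beto_tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")

def count_tokens(tweet: str) -> int:
    """Token count ignoring character repetitions and emojis."""
    text = emoji.replace_emoji(tweet, replace=" ")       # strip emojis
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)           # collapse long character repetitions
    return len(beto_tokenizer.tokenize(text))

def keep(tweet: str) -> bool:
    return count_tokens(tweet) >= 6

# Rough language-distribution estimate, as in the figures reported above.
lid = fasttext.load_model("lid.176.bin")                 # language-id model, downloaded separately

def detect_language(tweet: str) -> str:
    labels, _ = lid.predict(tweet.replace("\n", " "))
    return labels[0].replace("__label__", "")            # e.g. "es", "en", "pt"
```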
In this section, we describe the training process of RoBERTuito. Three versions of RoBERTuito were trained: a cased version, which preserves the case found in the original tweets; an uncased version; and a deacc version, which lower-cases tweets and removes accents. Normative Spanish prescribes accent marks for some words, but their usage is inconsistent in user-generated text, so we wanted to test whether removing them improves performance on downstream tasks.

For each of the three configurations, we trained SentencePiece tokenizers (Kudo and Richardson, 2018) on the collected tweets, limiting vocabularies to 30,000 tokens. We used the tokenizers library (Moi et al., 2019), which provides fast Rust implementations of many tokenization algorithms.

Preprocessing is key for models consuming Twitter data, which is quite noisy: it contains user handles (@username), hashtags, emojis, misspellings, and other non-canonical text. Nguyen et al. (2020) tried two normalization strategies: a soft one, in which only minor changes are applied to the tweet, such as replacing usernames and hashtags, and a more aggressive one based on the ideas of Han and Baldwin (2011). The authors found no significant improvement from the more aggressive normalization strategy. With this in mind, we followed an approach similar to the one used both in that work and in Polignano et al. (2019) (a code sketch of these rules is given below):

• Character repetitions are limited to a maximum of three.
• User handles are converted to a special token @usuario.
• Hashtags are replaced by a special token hashtag followed by the hashtag text, split into words when possible.
• Emojis are replaced by their text representation using the emoji library (https://github.com/carpedm20/emoji/), surrounded by a special token emoji.

A RoBERTa base architecture was used for RoBERTuito, with 12 self-attention layers, 12 attention heads, and a hidden size of 768, in the same fashion as BERTweet (Nguyen et al., 2020). We used a masked-language objective, disregarding the next-sentence prediction task used in BERT and tweet-order tasks such as those used in Gonzalez et al. (2021). Taking into account successful hyperparameters from RoBERTa and BERTweet, we decided to use a large batch size for our training. While an 8K batch size is recommended for RoBERTa, due to resource limitations we used a 4K batch size and balanced this against the number of updates. To check convergence, we first trained an uncased model for 200K steps; we then trained each of the three versions for 600K steps. This is roughly half the number of updates used to train BETO (and also BERTweet), but the difference is compensated for by the larger batch size used to train RoBERTuito. We trained our models for about three weeks on a v3-8 TPU and a preemptible e2-standard-16 machine on GCP. Our codebase uses HuggingFace's transformers library and its RoBERTa implementation (Wolf et al., 2020). Each sentence is tokenized and masked dynamically with a masking probability of 0.15. Further details on hyperparameters and training can be found in the Appendix.
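To make the normalization rules concrete, here is a minimal Python sketch of the preprocessing listed above. The exact special-token strings, the hashtag word-splitting heuristic, and the use of Spanish emoji descriptions are assumptions on our part rather than the authors' exact pipeline.

```python
# Hedged sketch of the four preprocessing rules: repetition capping, user handles,
# hashtags, and emojis.
import re

import emoji   # pip install emoji

def preprocess_tweet(text: str) -> str:
    # 1. Limit character repetitions to a maximum of three.
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)
    # 2. Replace user handles with a special token.
    text = re.sub(r"@\w+", "@usuario", text)
    # 3. Replace hashtags with a special token followed by the hashtag text,
    #    splitting camel-cased hashtags into words when possible.
    def split_hashtag(match: re.Match) -> str:
        words = re.sub(r"([a-z])([A-Z])", r"\1 \2", match.group(1))
        return f"hashtag {words}"
    text = re.sub(r"#(\w+)", split_hashtag, text)
    # 4. Replace emojis with their textual description, surrounded by a marker.
    #    language="es" gives Spanish descriptions if the installed version supports it.
    text = emoji.demojize(text, delimiters=(" emoji ", " emoji "), language="es")
    return text

print(preprocess_tweet("@juan Holaaaaaa!!! #BuenosDias 😂"))
```

Similarly, the pre-training setup can be sketched with HuggingFace's transformers, mirroring the numbers given above (12 layers, 12 heads, hidden size 768, a 30,000-token vocabulary, and dynamic masking at probability 0.15). The hub id used to load a tokenizer and the maximum sequence length are assumptions, not values stated in the text.

```python
# Hedged sketch of the RoBERTa-base MLM pre-training configuration.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=30_000,            # the 30k SentencePiece-style vocabulary
    num_hidden_layers=12,
    num_attention_heads=12,
    hidden_size=768,
    intermediate_size=3072,
    max_position_embeddings=130,  # tweets are short; assumed value
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")

# Dynamic masking: a fresh 15% mask is sampled every time a batch is collated.
tokenizer = AutoTokenizer.from_pretrained("pysentimiento/robertuito-base-uncased")  # assumed hub id
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
```

In the actual run described above, such a collator would feed batches of roughly 4K preprocessed tweets to the training loop for 600K steps.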
We evaluated RoBERTuito in two monolingual settings (Spanish and English) and in a code-mixed benchmark for tweets containing both Spanish and English. As the collection process allowed non-Spanish tweets to be included, we assess the performance of our model not only in Spanish but also in these other settings. Table 1 summarizes the evaluation tasks.

For the Spanish evaluation, we set up a benchmark for this model following TwilBERT and AlBERTo (Polignano et al., 2019). Hate speech detection in social media has gained much interest in the past years, driven by the need to act against the spread of hate messages, which grows in parallel with the increasing amount of user-generated content. It is a difficult task that requires a deep, contextualized understanding of the semantic meaning of a tweet as a whole. For this reason, we selected the hatEval Task A dataset, a binary classification task for misogyny and racism, to benchmark our model. The authors collected the dataset through three combined strategies: monitoring potential victims of hate accounts, downloading the history of identified producers of hateful content, and filtering Twitter streams with keywords. The dataset distinguishes between hate speech targeted at individuals and generic hate speech, and between aggressive and non-aggressive messages. In this work we do not consider these distinctions; we are interested in predicting only the binary label of whether the tweet is hateful or not. The Spanish subset of this dataset comprises 6,600 instances.

Irony detection has also recently gained popularity. Many works show that it has important implications for other natural language processing tasks that require semantic processing. Gupta and Yang (2017) showed that using features derived from sarcasm detection improves performance on sentiment analysis. In addition, user-generated content is a rich and vast source of irony, so being able to detect it is of particular importance in the domain of social networks.

As for English, we tested RoBERTuito on three tasks: emotion analysis, hate speech detection, and sentiment analysis. For emotion analysis and hate speech we used the English sections of the aforementioned datasets (EmoEvent and HatEval), while for sentiment analysis we used the SemEval 2017 Task 4 dataset (Rosenthal et al., 2017), which shares the same labels as its Spanish counterpart (negative, neutral, positive). In this case, we compare RoBERTuito's abilities in English with the monolingual models BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and BERTweet (Nguyen et al., 2020), and also against multilingual models such as XLM-R Base (Conneau et al., 2020) and mBERT. While all these models share a base architecture, the different vocabulary sizes and numbers of parameters (see the Appendix) make the comparison somewhat more difficult.

Finally, we assessed the code-switching abilities of our model on the Linguistic Code-Switching Evaluation benchmark (LinCE). LinCE comprises five tasks for code-switched data in several language pairs (Spanish-English, Hindi-English, Modern Standard Arabic-Egyptian Arabic, and Arabic-English, among others), many of which were part of previous shared tasks. We evaluated RoBERTuito on three tasks of the benchmark: part-of-speech (POS) tagging (AlGhamdi et al., 2016), named entity recognition (NER), and sentiment analysis (Patwa et al., 2020). As the collection process of the data centered on Spanish-speaking users, some of whom also use English and Spanglish, we test RoBERTuito on the Spanish-English subsection of the benchmark. This benchmark has a centralized evaluation system and does not release gold labels for the test splits of its tasks. We evaluated our models on the dev datasets and compared our results against those reported by Winata et al. (2021), which achieve the best performance for NER and POS tagging. As competitors to RoBERTuito for the Spanish-English evaluation, we consider mBERT, XLM-R (in both its base and large architectures), and the monolingual models BERT and BETO.

We followed fairly standard practices for fine-tuning, most of which are described in Devlin et al. (2019). For sentence classification tasks, we fine-tuned the pre-trained models for 5 epochs with a triangular learning rate schedule peaking at 5 × 10^-5 and a warm-up of 10% of the training steps. The best checkpoint at the end of each epoch was selected based on each task's metric. For sentence classification tasks, a linear classifier is placed on top of the representation of the [CLS] token; for token classification tasks (NER and POS tagging), we predict the tag of each word with a linear classifier on top of the representation of its first sub-word.
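To make the recipe concrete, here is a hedged sketch of sentence-classification fine-tuning with HuggingFace's Trainer: 5 epochs, a peak learning rate of 5e-5 with 10% linear warm-up (a triangular schedule), and selection of the best checkpoint per epoch according to the task metric. The hub id, batch size, and the toy dataset are placeholders, not the authors' exact setup.

```python
# Hedged sketch of the fine-tuning recipe described above.
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "pysentimiento/robertuito-base-uncased"   # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy stand-in for a real task dataset (e.g. hatEval), already preprocessed.
data = Dataset.from_dict({"text": ["odio esto", "me encanta"], "label": [1, 0]})
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128), batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="./finetuned",
    num_train_epochs=5,
    learning_rate=5e-5,                # peak of the triangular schedule (linear warm-up + decay)
    warmup_ratio=0.10,                 # 10% of the training steps
    per_device_train_batch_size=32,    # assumed; not stated in the text
    eval_strategy="epoch",             # "evaluation_strategy" in older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,       # keep the best checkpoint per the task metric
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data,
    eval_dataset=data,                 # the dev split in the real setup
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```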
Table 2 displays the evaluation results for the four proposed classification tasks in Spanish. Figures are expressed as the mean of 10 runs of each experiment, along with a score averaging the metrics of the tasks, in a similar way to the GLUE score. We can observe that, in most cases, all RoBERTuito configurations perform above the other models, in particular for hate speech detection and sentiment analysis. For most tasks, no big differences are observed between the uncased and deacc models, but both perform consistently above the cased model.

Figure 1: Distribution of the number of tokens per instance. Bars are grouped by task and display the mean number of tokens per instance with their 95% confidence interval. Lower is better.

For the English tasks, the performance of RoBERTuito is below that of RoBERTa and slightly above BERT. As expected, BERTweet obtains the best results.

As for the code-switching evaluation, Table 4 displays the results on the dev datasets of LinCE for NER, POS tagging, and sentiment analysis. We compare RoBERTuito against other multilingual models such as mBERT and XLM-R Base, also listing the dev results reported by Winata et al. (2021). Our results for XLM-R and mBERT are consistent with those in that work. A minor improvement is observed, but this could be an artifact of different hyperparameter choices or even of a slightly different preprocessing.

Finally, Table 5 displays the results from the leaderboard of the LinCE benchmark for the three selected tasks: Spanish-English sentiment analysis, NER, and POS tagging. For sentiment analysis, RoBERTuito obtains the best results in terms of Micro F1. For the other two tasks, it ranks second, with an XLM-R Large model (Winata et al., 2021) obtaining the top results. Among the compared models, RoBERTuito has 108 million parameters, while XLM-R Large has around five times as many, making our model the most efficient in terms of size on this subsection of the benchmark (see the Appendix for further details on the sizes of the different models).

Figure 1 shows the distribution of the number of tokens in the input text for the Spanish tasks. We can observe that the RoBERTuito models have more compact representations than BETO and RoBERTa-BNE for this domain and that, between them, the deacc version has a slightly lower mean length than the uncased version. RoBERTuito also has shorter representations for the code-mixed datasets, but longer ones for the English tasks. The Appendix lists the complete figures for the three evaluation settings.

The cased version is behind the others in terms of performance for most tasks, with the exception of POS tagging and NER, while the other two (uncased and deaccented) have comparable performance across all tasks.
We can read this in two ways: first, that a stronger normalization of the input text in Spanish brings no significant improvement in the performance of the model, and second, that keeping accent marks in the input text is neither beneficial nor harmful to performance. The reasons for the performance differences between the cased and uncased models need further investigation. We have some working hypotheses. First, we believe that accents and non-ASCII characters (with the exception of emojis) are used much more inconsistently in user-generated text than in more normative text, so no regularities can be inferred from the data concerning these marks. Second, a larger amount of data may be required to account for possible differences in meaning between upper-case and lower-case forms, or for the lack of such differences. Future experiments will delve into these two hypotheses.

Our data collection process for the pre-training stage was centered on Spanish, but it allowed other languages and other regional variants to be part of our dataset as well. This made our model develop some multilingual abilities, particularly visible in the code-switching LinCE benchmark. The results for this benchmark highlight that RoBERTuito is well suited for Spanish-English code-mixing tasks, obtaining better results than mBERT and matching those of XLM-R. This comparison, however, is not completely fair, because XLM-R and mBERT can handle over one hundred languages, which is not the case for our model. Lastly, the results for the English tasks show that RoBERTuito remains competitive against monolingual models in the social media domain.

In this work, we presented RoBERTuito, a large-scale language model trained on Spanish tweets. We set up a benchmark of classification tasks on social media text for Spanish, and we showed that RoBERTuito outperforms other available general-domain pre-trained language models. Moreover, our model features good code-switching performance on Spanish-English tasks and is competitive against monolingual English models in the social media domain. We proposed three versions of this new model: cased, uncased, and deaccented. We observed that the uncased model performs slightly better than the cased one, and similarly to the deaccented version. Further research is needed to systematize the reasons behind these results.

We have made our pre-trained language models public through the HuggingFace model hub, and our code can be found on GitHub. We will also make the training corpus available, thus facilitating the development of other models for user-generated Spanish, such as word embeddings or other language models. It is even feasible to extract subsets of the corpus representing subdomains of interest, such as regional variants of Spanish or specific topics, to develop even more specialized models. Future work includes enhancing our benchmark with an assessment of the performance of such models on open-ended tasks, and experiments on specific subdomains of Spanish.

The authors would like to thank the Google TPU Research Cloud program for providing free access to TPUs for our experiments. We also thank Juan Carlos Giudici for his help in setting up the GCP environment, and the HuggingFace team for their support in the development of RoBERTuito.

Aguilar, G., McCann, B., Niu, T., Rajani, N., Keskar, N. S., and Solorio, T. (2021). Char2Subword: Extending the subword embedding space using robust character compositionality.
In Findings of the Association for Computational Linguistics: EMNLP.
Basile et al. (2019). SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter.
Ekman, P. (1992). An argument for basic emotions.
Gupta and Yang (2017). CrystalNest at SemEval-2017 Task 4: Using sarcasm detection for enhancing sentiment classification and quantification.
Han and Baldwin (2011). Lexical normalisation of short text messages: Makn sens a #twitter.
Joulin et al. (2016). Bag of tricks for efficient text classification.
Kudo and Richardson (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing.
Moi et al. (2019). Hugging Face tokenizers library.
Nozza et al. (2020). What the [MASK]? Making sense of language-specific BERT models.
Overview of the task on irony detection in Spanish variants.
Pfeffer et al. (2018). Tampering with Twitter's sample API.
EmoEvent: A multilingual emotion corpus based on different events.
Polignano et al. (2019). AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets.
Raffel et al. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer.
Rasmy et al. (2021). Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction.
Rosenthal et al. (2017). Sentiment analysis in Twitter.
Vaswani et al. (2017). Attention is all you need.
Wang et al. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding.
Winata et al. (2021). Are multilingual models effective in code-switching?
Wolf et al. (2020). Transformers: State-of-the-art natural language processing.
Aguilar et al. (2020). LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation.
AlGhamdi et al. (2016). Part of Speech Tagging for Code Switched Data. Association for Computational Linguistics.
Beltagy et al. (2019). SciBERT: A Pretrained Language Model for Scientific Text. Association for Computational Linguistics.
Cañete et al. (2020). Spanish pre-trained BERT model and evaluation data.
Conneau et al. (2020). Unsupervised Cross-lingual Representation Learning at Scale. Association for Computational Linguistics.
Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Association for Computational Linguistics.
Gonzalez et al. (2021). TWilBert: Pre-trained deep bidirectional transformers for Spanish Twitter.
Nguyen et al. (2020). BERTweet: A pre-trained language model for English Tweets.
Patwa et al. (2020). Overview of Sentiment Analysis of Code-Mixed Tweets. Association for Computational Linguistics.

Vocabulary efficiency. Table 9 displays the mean sentence length for each considered model and group of tasks. For the Spanish and the code-mixed Spanish-English benchmarks, RoBERTuito achieves the most compact representations in terms of mean length in its uncased and deaccented forms. In the case of English, BERTweet achieves the shortest representations, with RoBERTuito producing longer token sequences than its monolingual counterparts.

Hyperparameters and training. Table 6 displays the hyperparameters of the RoBERTuito training for its three versions. For the first prototype of the model, a larger learning rate was tried, but training usually diverged near the peak value, so we lowered it for the definitive training. Table 7 displays the results of the training in terms of cross-entropy loss and perplexity for the three versions of RoBERTuito. A comparison of the models in terms of number of parameters and vocabulary size is presented in Table 8. RoBERTuito is the smallest model in terms of vocabulary size, and since most models share a base architecture (see Table 6), this accounts for the difference in the number of parameters.