title: Sentiment analysis in tweets: an assessment study from classical to modern text representation models
authors: Barreto, Sérgio; Moura, Ricardo; Carvalho, Jonnathan; Paes, Aline; Plastino, Alexandre
date: 2021-05-29

With the growth of social media platforms, such as Twitter, plenty of user-generated data emerge daily. The short texts published on Twitter -- the tweets -- have earned significant attention as a rich source of information to guide many decision-making processes. However, their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks, including sentiment analysis. Sentiment classification is tackled mainly by machine learning-based classifiers. The literature has adopted word representations of distinct natures to transform tweets into vector-based inputs to feed sentiment classifiers. The representations range from simple count-based methods, such as bag-of-words, to more sophisticated ones, such as BERTweet, built upon the trendy BERT architecture. Nevertheless, most studies mainly focus on evaluating those models using only a small number of datasets. Despite the progress made in recent years in language modelling, there is still a gap regarding a robust evaluation of induced embeddings applied to sentiment analysis on tweets. Furthermore, while fine-tuning the model from downstream tasks is prominent nowadays, less attention has been given to adjustments based on the specific linguistic style of the data. In this context, this study carries out an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets from distinct domains and five classification algorithms. The evaluation includes static and contextualized representations. Contexts are assembled from Transformer-based autoencoder models that are also fine-tuned based on the masked language model task, using a plethora of strategies.

Keywords sentiment analysis · text representations · language models · natural language processing · Twitter

In recent years, the use of social media networks, such as Twitter, has been growing exponentially. It is estimated that about 500 million tweets -- the short informal messages sent by Twitter users -- are published daily. Unlike other text styles, tweets have an informal linguistic style, misspelled words, careless use of grammar, URL links, user mentions, hashtags, and more. Due to these inherent characteristics, discovering patterns from tweets represents both a challenge and an opportunity for machine learning and natural language processing (NLP) tasks, such as sentiment analysis.
Sentiment analysis is the field of study that analyzes people's opinions, sentiments, appraisals, attitudes, and emotions toward entities and their attributes expressed in written text [30]. Usually, the sentiment analysis task is reduced to polarity classification, i.e., determining whether a piece of text carries a positive or negative connotation. One of the biggest challenges concerning the sentiment classification of tweets is that people often express their sentiments and opinions using a casual linguistic style, resulting in misspelled words and the careless use of grammar. Consequently, the automated analysis of tweets' content requires machines to build a deep understanding of natural text to deal effectively with its informal structure [3].

However, before discovering patterns from text, it is essential to address a more fundamental step: how automatic methods can numerically represent textual content. Vector space models (VSMs) [21] are one of the earliest and most common strategies adopted in the text classification literature to allow machines to deal with texts and their structures. A VSM represents each document in a corpus as a point in a vector space. Points that are close together in this space are semantically similar, and points that are far apart are semantically distant [55]. The first VSM approaches were count-based methods, such as Bag-of-Words (BoW) [33] and BoW with TF-IDF [33]. Although VSMs have been extensively used in the literature, they cannot deal with the curse of dimensionality. More precisely, considering the inherent characteristics of tweets, a corpus of tweets may contain different spellings for each unique word, leading to an extensive vocabulary and making the vector representation of those tweets very large and often sparse.

To tackle the curse of dimensionality inherent to BoW-based approaches, in recent years it has become standard practice to learn dense vectors to represent words and texts, the embeddings. Methods such as Word2Vec [53], FastText [54], and others [2, 20, 51, 58] have been used with relative success to address a plethora of NLP tasks. Nevertheless, in general, the performance of such techniques is still unsatisfactory for sentiment analysis of tweets, taking into account the dynamic vocabulary used by Twitter users to express themselves. Specifically, in tweets, the ironic and sarcastic content expressed in a limited space, regularly out of context and informal, makes it even more challenging to retrieve meaning from the words. Such attributes may degrade the performance of traditional word embedding methods if not handled properly.

In this context, contextualized word representations have recently emerged in the literature, aiming at allowing the vector representation of words to adapt to the context in which they appear. Contextual embedding techniques, including ELMo [42] and Transformer-based autoencoder methods, such as BERT [14], RoBERTa [31], and BERTweet [39], capture not only complex characteristics of word usage, such as syntax and semantics, but also how word usage varies across linguistic contexts. Those methods have achieved state-of-the-art results on various NLP tasks, including sentiment analysis [4, 12, 22, 1]. Much effort in recent language modeling research is focused on scalability issues of existing word embedding methods.
On this basis, inductive transfer learning strategies and pre-trained embedding models have gained importance in the literature, especially when the amount of labeled data to train a model is relatively small. As a consequence, models obtained from the aforementioned contextual embedding methods are rarely trained from scratch but are instead fine-tuned from models pre-trained on datasets with a huge amount of texts [27, 42, 45]. Pre-trained models reduce the use of computational resources and tend to increase the classification performance of several NLP tasks, sentiment analysis included.

Despite the successful achievements in developing efficient word representation methods in the NLP literature, there is still a gap regarding a robust evaluation of existing language models applied to the sentiment analysis task on tweets. Most studies are mainly focused on evaluating those models for different NLP tasks using only a small number of datasets [43, 28, 31, 26, 58]. In this study, our main goal is to identify appropriate embedding-based text representations for the sentiment analysis of English tweets. For this purpose, we evaluate embeddings of different natures, including: i) static embeddings learned from generic texts [2, 34, 35, 41]; ii) static embeddings learned from datasets of Twitter sentiment analysis [5, 8, 20, 41, 51, 58]; iii) contextualized embeddings learned from Transformer-based autoencoders with generic texts, with no adjustments [14, 31]; iv) contextualized embeddings learned from Transformer-based autoencoders with a dataset of tweets, with no adjustments [39]; v) contextualized embeddings fine-tuned to the language of tweets; and vi) contextualized embeddings fine-tuned to the sentiment language of tweets. In all assessments, we use a representative set of twenty-two sentiment datasets [11] as input to five classifiers to evaluate the predictive performance of the embeddings. To the best of our knowledge, no previous study has conducted such a robust evaluation covering language models of several flavors and a large number of datasets.

In order to identify the most appropriate text embeddings, we conduct this study to answer the following four research questions.

RQ1. Which static embeddings are the most effective in the sentiment classification of tweets? Our motivation to evaluate those models is that many state-of-the-art deep learning models can require a lot of computational power, such as memory and storage. Thus, running those models locally on some devices may be difficult for mass-market applications that depend on low-cost hardware. To overcome this limitation, embeddings generated by language models can be gathered by simply looking up the embedding table to achieve a static representation of textual content. We intend to assess how these static representations work and which are the most appropriate in this context. We answer this research question by evaluating a rich set of text representations from the literature [2, 5, 8, 14, 20, 34, 35, 39, 41, 51, 58, 59]. To achieve a good overview of the static representations, we conduct an experimental evaluation in the sentiment analysis task with five different classifiers and 22 datasets.

RQ2. Considering state-of-the-art Transformer-based autoencoder models, which are the most effective in the sentiment classification of tweets? Regarding recent advances in language modeling, Transformer-based architectures have achieved state-of-the-art performances in many NLP tasks.
Specifically, BERT [14] is the first method that successfully uses the encoder component of the Transformer architecture [57] to learn contextualized embeddings from texts. Shortly after that, RoBERTa [31] was introduced by Facebook as an extension of BERT that uses an optimized training methodology. Next, BERTweet [39] was proposed as an alternative to RoBERTa for NLP tasks focusing on tweets. While RoBERTa was trained on traditional English texts, such as Wikipedia, BERTweet was trained from scratch using a massive corpus of 850M English tweets. In this context, to answer this research question, we conduct an experimental evaluation of the BERT, RoBERTa, and BERTweet models in the sentiment analysis task with five different classifiers and 22 datasets to obtain a comprehensive analysis of their predictive performances. By evaluating these models, we may obtain a robust overview of the Transformer-based autoencoder representations that better fit the style of tweets.

RQ3. Can the fine-tuning of Transformer-based autoencoder models using a large set of English tweets improve the sentiment classification performance? One of the benefits of pre-trained language models, such as the Transformer-based models exploited in this study, is the possibility of adjusting the language model to a specific domain by applying a fine-tuning procedure. We aim at assessing whether the sentiment analysis of tweets can benefit from fine-tuning the BERT, RoBERTa, and BERTweet language models with a vast, generic, and unlabeled set of around 6.7M English tweets from distinct domains. To that end, we fine-tune the pre-trained language models using the intermediate masked language model task. Besides, considering that the fine-tuning procedure can be a very data-intensive task that may demand a lot of computational power, in addition to the large corpus of 6.7M tweets, we also use in the fine-tuning process nine other samples of different sizes, varying from 500 to 1.5M tweets. We conduct an experimental evaluation with all models in the sentiment analysis task with five different classifiers and 22 datasets, as in the previous questions.

RQ4. Can Transformer-based autoencoder models benefit from a fine-tuning procedure with tweets from sentiment analysis datasets? Although using unlabeled generic tweets to adjust a language model seems promising regarding the availability of data, we believe that the fine-tuning procedure may benefit from the sentiment information that tweets from labeled datasets contain. In this context, we aim at identifying whether fine-tuning models with positive and negative tweets can boost the sentiment classification of tweets. We perform this evaluation by assessing three distinct strategies, in order to simulate three real-world situations, as follows. In the first strategy, we use a specific sentiment dataset itself as the target domain dataset to fine-tune a language model. The second strategy simulates the case where a collection of general sentiment datasets is available to fine-tune a language model. In the third and last strategy, we combine the two previous situations. In short, we put together tweets from a target dataset and from a collection of sentiment datasets in the fine-tuning procedure. Finally, we present a comparison between the predictive performances achieved by these three strategies and the fine-tuned models evaluated in RQ3. As in the previous questions, we conduct the experiments with five different classifiers and 22 datasets.
In summary, given the large number of language models exploited in this study, our main contributions are: (i) a comparative study of a rich collection of publicly available static representations generated from distinct deep learning methods, with different dimensions and vocabulary sizes, and trained on various kinds of corpora; (ii) an assessment of state-of-the-art contextualized language models from the literature, that is, Transformer-based autoencoder models, including BERT, RoBERTa, and BERTweet; (iii) an evaluation of distinct strategies for fine-tuning Transformer-based autoencoder language models; and (iv) a general comparison over static, Transformer-based autoencoder, and fine-tuned language models, aiming at determining the most suitable ones for detecting the sentiment expressed in tweets.

In order to present our contributions, we organized this article as follows. Section 2 presents a literature review related to the language models examined in this study. In Section 3, we describe the experimental methodology we followed in the computational experiments, which are reported in Sections 4, 5, 6, and 7, addressing the four research questions, respectively. Finally, in Section 8, we present the conclusions and directions for future research.

Sentiment analysis is an automated process used to predict people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes [30]. Recently, sentiment analysis has been recognized as a suitcase research problem [9], which involves solving different NLP classification sub-tasks, including sarcasm, subjectivity, and polarity detection, the latter being the focus of this study. Pioneer works in the sentiment classification of tweets mainly focused on the polarity detection task, which aims at categorizing a piece of text as carrying a positive or negative connotation. For example, Go et al. [24] define sentiment as a personal positive or negative feeling. They used unigrams as features to train different machine learning classifiers, using tweets with emoticons as training data. The unigram model, or bag-of-words (BoW), is the most basic representation in text classification problems.

Over the years, different techniques have been developed in the NLP literature in an effort to make natural language easily tractable by computers. Vector Space Models (VSMs) [21] are one of the earliest strategies used to represent the knowledge extracted from a given corpus. Earlier approaches to building VSMs are grounded on count-based methods, such as BoW [56] with the TF-IDF (Term Frequency-Inverse Document Frequency) [49] representation, which measures how important a word is to a document, relying on its frequency of occurrence in a corpus. The BoW model, which assumes word order is not important, is based on the hypothesis that the frequencies of words in a document tend to indicate the relevance of the document to a query [21]. This hypothesis expresses the belief that a column vector in a term-document matrix captures an aspect of the meaning of the corresponding document or phrase. Precisely, let X be a term-document matrix. Suppose the document collection contains n documents and m unique terms. The matrix X will then have m rows (one row for each unique term in the vocabulary) and n columns (one column for each document). Let w_i be the i-th term in the vocabulary and let d_j be the j-th document in the collection.
The i-th row in X is the row vector x_i: and the j-th column in X is the column vector x_:j. The row vector x_i: contains n elements, one element for each document, and the column vector x_:j contains m elements, one element for each term. If X is a simple matrix of frequencies, then the element x_ij in X is the frequency of the i-th term w_i in the j-th document d_j [56]. Such a simple way of creating numeric representations from texts has motivated early studies in detecting the sentiment expressed in tweets [6, 24, 40]. However, though widely adopted, this kind of feature representation leads to the curse of dimensionality due to the large number of uncommon words tweets contain [47]. Thus, with the revival and success of neural-based learning techniques, several methods that learn dense, real-valued, low-dimensional vectors to represent words have been proposed, such as Word2Vec [53], FastText [54], and GloVe [41].

Word2Vec [53] is one of the pioneer models to become popular by taking advantage of the development of neural networks over the years. Word2Vec is actually a software package composed of two distinct implementations of language models, both based on a feed-forward neural architecture, namely Continuous Bag-Of-Words (CBOW) and Skip-gram. The CBOW model aims at predicting a word given its surrounding context words. Conversely, the Skip-gram model predicts the words in the surrounding context given a target word. Both architectures consist of an input layer, a hidden layer, and an output layer. The input layer has the size of the vocabulary and encodes the context by combining the one-hot vector representations of the surrounding words of a given target word. The output layer has the same size as the input layer and contains a one-hot vector of the target word obtained during training. However, one of the main disadvantages of those models is that they usually struggle to deal with out-of-vocabulary (OOV) words, i.e., words that have not been seen in the training data before. To address this weakness, more complex approaches have been proposed, such as FastText [54].

FastText [54] is based on the Skip-gram model [53], but it considers each word as a bag of character n-grams, which are contiguous sequences of n characters from a word, including the word itself. A dense vector is learned for each character n-gram, and the dense vector associated with a word is taken as the sum of those representations. Thus, FastText can deal with the different morphological structures of words, covering words not seen in the training phase, i.e., OOV words. For that reason, FastText is also able to deal with tweets, considering the huge number of uncommon and unique words in this kind of text.

Going in another direction, the GloVe model [41] attempts to make efficient use of the statistics of word occurrences in a corpus to learn better word representations. In [41], Pennington et al. present a model that relies on the insight that ratios of co-occurrences, rather than raw counts, encode semantic information about pairs of words. This relationship is used to derive a suitable loss function for a log-linear model, which is then trained to maximize the similarity of every word pair, as measured by the ratios of co-occurrences. Given a probe word, the ratio can be small, large, or equal to one, depending on the correlations. This ratio gives hints on the relations between three different words.
For example, given a probe word and two other words w_i and w_j, if the ratio is large, the probe word is related to w_i but not to w_j.

In general, methods for learning word embeddings deal well with the syntactic context of words but ignore the potential sentiment they carry. In the context of sentiment analysis, words with similar syntactic structure but opposite sentiment polarity, such as good and bad, are usually mapped to neighbouring word vectors despite their opposite meanings. To address this issue, Tang et al. [51] proposed the Sentiment-Specific Word Embedding model (SSWE), which encodes sentiment information in the embeddings. Specifically, they developed neural networks that incorporate the supervision from the sentiment polarity of texts in their loss function. To that end, they slide a window of n-grams across a sentence and then predict the sentiment polarity based on each n-gram with a shared neural network. In addition to SSWE, other methods have been proposed to improve the quality of word representations in sentiment analysis by leveraging sentiment information in the training phase, such as DeepMoji [20], Emo2Vec [58], and EWE [2].

The aforementioned word embedding models have been used as standard components in most sentiment analysis methods. However, they pre-compute the representation for each word independently of the context in which it appears. The static nature of these models results in two problems: (i) they ignore the diversity of meaning each word may have, and (ii) they struggle to learn long-term dependencies of meaning. Different from those static word embedding techniques, contextualized embeddings are not fixed, adapting the word representation to the context in which it appears. Precisely, at training time, for each word in a given input text, the learning model analyzes the context, usually using sequence-based models, such as recurrent neural networks (RNNs), and adjusts the representation of the target word by looking at the context. These context-aware embeddings are actually the internal states of a deep neural network trained in a self-supervised setting. Thus, the training phase is carried out independently of the primary task, on extensive unlabeled data. Depending on the sequence-based model adopted, these contextualized models can be divided into two main groups, namely RNN-based [43] and Transformer-based [29, 31, 39, 28, 57].

Transfer learning strategies have also been emerging to improve the quality of word representations, such as ULMFiT (Universal Language Model Fine-tuning) [27]. ULMFiT is an effective transfer learning method that can be applied to any NLP task and introduces key techniques for fine-tuning a language model, consisting of three stages, described as follows. First, the language model is trained on a general-domain corpus to capture generic features of the language in different layers. Next, the full language model is fine-tuned on the target task data using discriminative fine-tuning and slanted triangular learning rates (STLR) to learn task-specific features. Lastly, the model is fine-tuned on the target task using gradual unfreezing and STLR to preserve low-level representations and to adapt high-level ones. Fine-tuning techniques made possible the development and availability of pre-trained contextualized language models using massive amounts of data. For example, Peters et al. [42] introduced ELMo (Embeddings from Language Models), a deep contextualized model for word representation.
ELMo comprises a bidirectional Long Short-Term Memory recurrent neural network (BiLSTM) to combine a forward model, looking at the sequence in the traditional order, and a backward model, looking at the sequence in the reverse order. ELMo is composed of two BiLSTM sequence-encoder layers responsible for capturing the semantics of the context. Besides, some weights are shared between the two directions of the language modeling unit, and there is also a residual connection between the LSTM layers to accommodate the deep connections without the gradient vanishing issue. ELMo also makes use of a character-based technique for computing embeddings. Therefore, it benefits from the characteristics of character-based representations to avoid OOV words. Although ELMo is more effective compared to static pre-trained models, its performance may be degraded when dealing with long texts, exposing a trade-off between efficient learning by gradient descent and latching onto information for long periods [7].

Transformer-based language models, on the other hand, have been proposed to solve the gradient propagation problems described in [7]. Compared to RNNs, which process the input sequentially, Transformers work in parallel, which brings benefits when dealing with large corpora. Moreover, while RNNs by default process the input in one direction, Transformer-based models can attend to the context of a word from distant parts of a sentence and pay attention to the part of the text that really matters, using self-attention [57].

The OpenAI Generative Pre-Training Transformer model (GPT) [45] is one of the first attempts to learn representations using Transformers. It encompasses only the decoder component of the Transformer architecture, with some adjustments, discarding the encoder part. Therefore, instead of having a source and a target sentence for the sequence transduction model, a single sentence is given to the decoder. GPT's objective function targets predicting the next word given a sequence of words, as a standard language modeling goal. To comply with the standard language model task, while reading a token, GPT can only attend to previously seen tokens in the self-attention layers. This setting can be limiting for encoding sentences, since understanding a word might require processing the ones coming after it in the sentence.

Devlin et al. [14] addressed the unidirectional nature of GPT by presenting a strategy called BERT (Bidirectional Encoder Representations from Transformers) that, as the name says, encodes sentences by looking at them in both directions. BERT is also based on the Transformer architecture but, contrary to GPT, it is based on the encoder component of that architecture. The essential improvement over GPT is that BERT provides a solution for making Transformers bidirectional by applying masked language models, which randomly mask some percentage of the input tokens, with the objective of predicting those masked tokens based on their context. Also, in [14], the authors use a next sentence prediction task for predicting whether two text segments follow each other. All those improvements made BERT achieve state-of-the-art results in various NLP tasks when it was published. Later, Liu et al. [31] proposed RoBERTa (Robustly optimized BERT approach), achieving even better results than BERT.
RoBERTa is an extension of BERT with some modifications, such as: (i) training the model for a longer period of time, with bigger batches, over more data; (ii) removing the next sentence prediction objective; (iii) training on longer sequences; and (iv) dynamically changing the masking pattern applied to the training data. Recently, Nguyen et al. [39] introduced BERTweet, an extension of RoBERTa trained from scratch with tweets. BERTweet has the same architecture as BERT but is trained using the RoBERTa pre-training procedure instead. BERTweet consumes a corpus of 850M English tweets, which is a concatenation of two corpora: the first contains 845M English tweets from the Twitter Stream dataset, and the second contains 5M English tweets related to the COVID-19 pandemic. In [39], the proposed BERTweet model outperformed RoBERTa baselines in some tasks on tweets, including sentiment analysis.

As far as we know, most studies in language modeling focus on designing new effective models in order to improve the predictive performance of distinct NLP tasks. For example, Devlin et al. [14] and Liu et al. [31] respectively introduced BERT and RoBERTa, which achieved state-of-the-art results in many NLP tasks. Nevertheless, they did not evaluate the performance of such methods on the sentiment classification of tweets. Nguyen et al. [39], on the other hand, used only a single generic collection of tweets when evaluating their BERTweet strategy. In this context, we carry out a robust evaluation of existing language models of distinct natures, including static representations, Transformer-based autoencoder models, and fine-tuned models, by using a significant set of 22 datasets of tweets from different domains and sizes. In the following sections, we present the assessment of such models.

This section presents the experimental methodology we followed in this article. We begin by describing, in Section 3.1, the twenty-two benchmark datasets used to evaluate the different language models we investigate in this study. In Section 3.2, we present the experimental protocol we followed. Then, in Section 3.3, we describe the computational experiments reported in Sections 4, 5, 6, and 7.

We used a large set of twenty-two datasets [10] to assess the effectiveness of the distinct language models described in Section 2. Table 1 summarizes the main characteristics of these datasets, namely the abbreviation we use when reporting the experimental results to save space (Abbrev. column), the domain they belong to (Domain column), the number of positive tweets (#pos. column), the proportion of positive tweets (%pos. column), the number of negative tweets (#neg. column), the proportion of negative tweets (%neg. column), and the total number of tweets (Total column). Those datasets have been extensively used in the literature on Twitter sentiment analysis, and we believe they provide a diverse scenario for evaluating embeddings of tweets in the sentiment classification task, regarding a variety of domains, sizes, and class balances. For example, while the SemEval13, SemEval16, SemEval17, and SemEval18 datasets contain generic tweets, other datasets, such as iphone6, movie, and archeage, contain tweets of a particular domain. Also, the datasets vary a lot in size, with some of them containing only dozens of tweets, such as irony and sarcasm.
We believe that this diverse and large collection of datasets may help in drawing more precise and robust conclusions on the effectiveness of distinct language models in the sentiment analysis task.

To assess the effect of different kinds of language models in the polarity classification task, we follow the protocol of first extracting the features from the several vector-based language representation mechanisms (BoW, static embeddings, contextualized embeddings, fine-tuned embeddings). Next, those features compose the input attribute space for five distinct classifiers, namely Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), XGBoost (XGB), and Multi-layer Perceptron (MLP). We adopted scikit-learn's implementations of those machine learning algorithms. Although we have used the default parameters in most cases, it is important to mention that we set the class balance parameter for SVM, LR, and RF (class_weight = balanced). Also, for LR, we set the maximum number of iterations to 500 (max_iter = 500) and the solver parameter to liblinear. Moreover, for MLP, we set the hidden layer size to 100. Table 2 shows a summary of the classification algorithms used in this study, remarking their characteristics. We aim at determining which language models are the most effective ones in Twitter sentiment analysis by leveraging classifiers of distinct natures, thus examining how they deal with the peculiarities of each evaluated model. Furthermore, it is important to note that we do not aim at establishing the best classifier for the sentiment analysis task, which would require a specific study and additional computational experiments.

Preprocessing is the first step in many text classification problems, and the use of appropriate techniques can reduce noise, hence improving classification effectiveness [19]. As this manuscript's main goal is to evaluate the performance of different models of tweet representation, the preprocessing step is kept simple so that the focus is on the language models and classifiers. Thus, for each tweet in a given dataset, we only replace URLs with the token someurl and user mentions with the token someuser, and lowercase all tokens.

In the experimental evaluation, the predictive performance of the sentiment classification is measured in terms of accuracy and F1-macro. Precisely, for each evaluated dataset, the accuracy of the classification was computed as the ratio between the number of correctly classified tweets and the total number of tweets, following a stratified ten-fold cross-validation. F1-macro was computed as the unweighted average of the F1-score for the positive and negative classes. Moreover, all experiments were performed using a Tesla P100-SXM2 GPU under the Ubuntu operating system, on a machine with an Intel(R) Xeon(R) CPU E5-2698 v4 processor.

In the next sections, we evaluate a significant collection of vector-based textual representations, attempting to answer the research questions introduced in Section 1. Specifically, we conduct a comparative study of vector-based language representation models of distinct natures, including Bag-of-Words, as a classic baseline, static representations, and representations induced from Transformer-based autoencoder models, fine-tuned or not on the intermediate masked language model task, in order to assess their effectiveness in the polarity classification of English tweets. These language representation models are incrementally evaluated throughout Sections 4, 5, 6, and 7.
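To make the experimental protocol of Section 3.2 concrete, the sketch below puts together the preprocessing, the BoW with TF-IDF baseline, one of the classifiers with the reported parameters, and the stratified ten-fold cross-validation with accuracy and F1-macro. It uses scikit-learn; the toy tweets, the regular expressions, and the helper name preprocess are ours and only illustrate the described steps, not the exact implementation used in the experiments.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline


def preprocess(tweet: str) -> str:
    """Replace URLs with 'someurl', user mentions with 'someuser', and lowercase."""
    tweet = re.sub(r"https?://\S+|www\.\S+", "someurl", tweet)
    tweet = re.sub(r"@\w+", "someuser", tweet)
    return tweet.lower()


# Toy data standing in for one of the 22 benchmark datasets (label 1 = positive, 0 = negative).
tweets = ["I love this phone! http://t.co/abc"] * 10 + ["@user this movie was awful..."] * 10
labels = np.array([1] * 10 + [0] * 10)
texts = [preprocess(t) for t in tweets]

# BoW with TF-IDF as the representation here; static or contextualized embeddings
# would simply replace this first step with a precomputed feature matrix.
pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=500, solver="liblinear"),
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(pipeline, texts, labels, cv=cv, scoring=["accuracy", "f1_macro"])
print(scores["test_accuracy"].mean(), scores["test_f1_macro"].mean())
```

The same cross-validation loop is applied to every pair of representation and classifier evaluated in the following sections.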
In Section 4, we begin by analyzing the predictive performance of the static representations, which include 13 pretrained embeddings from the literature, as shown in Table 3, as well as the classical BoW with TF-IDF representation scheme. Regarding the static embeddings described in Table 3, we have selected representations trained on distinct kinds of texts (Corpus column) and built from different architectures (Architecture column), from feed-forward neural networks to Transformer-based ones. The |D| and |V| columns refer to the dimension and vocabulary size of each pretrained embedding, respectively. Although the most usual way of employing embeddings trained from Transformer-based architectures is running the text through the model to obtain contextualized representations, here we first investigate how these models behave when the experimental protocol is the same as for earlier embedding models: pretrained embeddings are collected from the embedding layer and are the input to the classifiers.

Next, in Section 5, we present an evaluation of state-of-the-art Transformer-based autoencoder models, including BERT [14], RoBERTa [31], and BERTweet [39]. In this evaluation, for each assessed dataset, we represent each tweet as the average, over its tokens, of the concatenation of the last four layers of the models. For the sake of simplicity, the Transformer-based autoencoder models assessed in this study are referred to hereafter as Transformer-based models.

Lastly, in Sections 6 and 7, we evaluate the effectiveness of fine-tuning the aforementioned Transformer-based models on the intermediate masked language model task in two different ways: (i) by using a huge collection of unlabeled, or non-sentiment, tweets, and (ii) by using tweets from sentiment datasets. In Section 6, regarding the non-sentiment fine-tuning approach, we adopted the general-purpose collection of unlabeled tweets from the Edinburgh corpus [44], which contains 97M tweets in multiple languages. Tweets written in languages other than English were discarded, resulting in a final corpus of 6.7M English tweets, which was then used to fine-tune BERT, RoBERTa, and BERTweet. In addition to the entire corpus of 6.7M tweets, we used nine other samples of different sizes, varying from 500 to 1.5M tweets. Specifically, we generated samples containing 500 (0.5K), 1K, 5K, 10K, 25K, 50K, 250K, 500K, and 1.5M non-sentiment tweets. Conversely, in Section 7, we evaluate the sentiment fine-tuning procedure using positive and negative tweets from the twenty-two benchmark datasets described in Table 1. For this purpose, we used each dataset once as the target dataset, while the others were used as the source datasets. More clearly, for each assessed dataset, referred to as the target dataset, we explored three distinct strategies to fine-tune the masked language model: (i) by using only the tweets from the target sentiment dataset itself, (ii) by using the tweets from the remaining 21 datasets, and (iii) by using the entire collection of tweets from the 22 datasets, including the tweets from the target dataset.

The computational experiments conducted in this section aim at answering research question RQ1, as follows: RQ1. Which static embeddings are the most effective in the sentiment classification of tweets? We answer this question by assessing the predictive power of the 13 pretrained embeddings described in Table 3.
These embeddings were generated from distinct neural network architectures, with different dimensions and vocabulary sizes, and trained on various kinds of corpora. Recall that by static embeddings we mean that the features are gathered from the embedding layer, which works as a look-up table of tokens. In addition to the pretrained embeddings, we evaluate the BoW model with the TF-IDF representation, which is the most basic text representation used in Twitter sentiment analysis and text classification tasks in general. For all tweet representations, we take the average of the representations of all tokens in the tweet.

We begin by evaluating the predictive performance of the static representations for each classification algorithm presented in Table 2. We report the computational results in detail for SVM as an example of this evaluation (refer to Online Resource 1 for the detailed evaluation for each classifier). Tables 4 and 5 show the results achieved by using each static representation to train an SVM classifier, in terms of classification accuracy and unweighted F1-macro, respectively. The boldfaced values indicate the best results, and the last three lines show the total number of wins for each static representation (#wins row), as well as a ranking of the results (rank sums and position rows). Precisely, for each dataset, we assign scores, from 1.0 to 14.0, to each assessed representation (each column), in decreasing order of accuracy (F1-macro), so that the score 1.0 is assigned to the representation with the highest accuracy (F1-macro). Thus, low score values indicate better results. When two assessed representations have the same performance, we take the average of their scores. For example, if two assessed representations achieve the best performance, they both receive a score of 1.5 ((1+2)/2). Finally, we sum up the scores obtained in each dataset for each assessed representation to calculate the rank sums. With the rank sum of each assessed representation, we rank the rank-sum results from the best (1) to the worst (14), yielding the rank position.

As we can see in Tables 4 and 5, RoBERTa (RoB-static column) achieved the best performance in nine out of the 22 datasets in terms of accuracy and in 11 out of the 22 datasets in terms of F1-macro, and was ranked first in the overall evaluation (position row). Regarding the number of wins (#wins row), we can note that Emo2Vec and SSWE achieved the second best results, reaching the best performance in four out of the 22 datasets for both accuracy and F1-macro. However, regarding the overall evaluation (position row), w2v-Edin and w2v-GN were ranked among the top three best static representations along with RoBERTa, in terms of accuracy. Regarding F1-macro, the top three best static representations were RoBERTa, w2v-Edin, and BERT (BERT-static column).

Table 6 Overview of the results (number of wins, rank sum, and rank position, respectively) achieved by evaluating each static representation on the 22 datasets, for each classification algorithm, in terms of accuracy.

Tables 6 and 7 show a summary of the results obtained by evaluating each static representation on the 22 datasets, for each classification algorithm. Each cell indicates the number of wins, the rank sum, and the rank position achieved by the related static representation (each line) used to train the corresponding classifier (each column).
The Total column indicates the total number of wins, the total rank sum, and the total rank position, i.e., the sum of the rank positions presented in each cell for each assessed model. Moreover, in the Total column, we underline the top three best overall results in terms of total rank position. Regarding the overall evaluation (Total column), from Tables 6 and 7, we can see that although Emo2Vec achieved the highest total number of wins (i.e., 27 wins in terms of accuracy and 29 wins in terms of F1-macro), w2v-Edin was ranked as the best overall model, achieving the lowest total rank position for both accuracy (22.0) and F1-macro (21.0). Nevertheless, considering each classifier (each column), we can note that RoBERTa achieved the best performance when used to train LR, SVM, and MLP, for both accuracy and F1-macro. Conversely, Emo2Vec achieved the best overall results when used to train the RF and XGB classifiers. Analyzing the overall results in terms of the total rank position (Total column), we observe that Emo2Vec and w2v-GN, along with w2v-Edin, are ranked as the top three best static representations. These results suggest that w2v-Edin, Emo2Vec, and w2v-GN are well-suited static representations for Twitter sentiment analysis.

In the previous evaluations, we analyzed the predictive performance achieved by each representation for one classification algorithm at a time, focusing on the individual contribution of the text representations to the performance on the final task. Next, we investigate the classification performance of the final sentiment analysis process, that is, the combination of text representation and classifier. Considering that the final classification is a combination of both representation and classifier, an appropriate choice of the classification algorithm may affect the performance of a text representation. For this purpose, we present an overall evaluation of all possible combinations of text representations and classification algorithms, examining them as pairs {text representation, classifier}. More clearly, we evaluate the classification effectiveness of the 70 possible combinations of text representations and classifiers (14 × 5) on the 22 datasets of tweets. Table 8 presents the top ten results in terms of the average rank position, and Table 9 presents the ten worst average rank positions. Specifically, for each dataset, we calculate a rank of the 70 combinations and then average the rank position of each combination over the 22 datasets. From Table 8, we can note that the best overall results were achieved by using RoBERTa to train an SVM classifier, for both accuracy and F1-macro. Also, w2v-Edin + SVM and RoBERTa + MLP appear in the top three results along with RoBERTa + SVM. From Table 9, we can notice the high frequency of RF among the worst pairs of representation and classifier.

Tables 10 and 11 show a summary of the results for each text representation and classifier, respectively, from best to worst, in terms of the average rank position. As we can observe, Emo2Vec, RoBERTa, and w2v-Edin appear in the top three, being the representations that achieved the best overall performances. Among the classifiers, we can note that SVM and MLP seem to be good choices in Twitter sentiment analysis regarding the usage of static text representations. Conversely, RF achieved the worst overall performance across all evaluations.
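For clarity, the rank-based summary used in Tables 4 to 11 (per-dataset scores with 1.0 for the best result and averaged scores for ties, rank sums over datasets, and the resulting rank position) can be reproduced with a few lines of pandas. The accuracy values below are invented solely to illustrate the computation; they are not results from our experiments.

```python
import pandas as pd

# Accuracy of each text representation (columns) on each dataset (rows).
# The values are made up for illustration only.
acc = pd.DataFrame(
    {
        "RoB-static": [0.81, 0.74, 0.79],
        "w2v-Edin": [0.80, 0.75, 0.79],
        "Emo2Vec": [0.78, 0.75, 0.77],
    },
    index=["SemEval13", "irony", "movie"],
)

# Per-dataset scores: 1.0 for the highest accuracy, ties receive the average of their scores.
scores = acc.rank(axis=1, ascending=False, method="average")

wins = acc.eq(acc.max(axis=1), axis=0).sum()   # "#wins" row (ties count as wins for all tied models)
rank_sums = scores.sum()                        # "rank sums" row
positions = rank_sums.rank(method="min")        # "position" row: 1 = lowest rank sum (best)

print(pd.DataFrame({"#wins": wins, "rank sum": rank_sums, "position": positions}))
```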
In addition to the individual assessment of text representations and classifiers presented in Tables 10 and 11, Table 12 shows the best results achieved for each dataset. We can see that RoBERTa achieved the highest accuracies in seven out of the 22 datasets and the highest F1-macro scores in nine out of the 22 datasets. Furthermore, as highlighted in Table 8, RoBERTa + SVM achieved the best performances in six out of the 22 datasets in terms of accuracy and in eight out of the 22 datasets in terms of F1-macro.

The top three static representations identified in the previous experiments, i.e., RoBERTa, w2v-Edin, and Emo2Vec, are very different from each other. While w2v-Edin and Emo2Vec were trained from scratch on tweets, RoBERTa was trained on traditional English texts. The better performance of Emo2Vec and w2v-Edin may be explained by the inclusion of sentiment-related tasks in their training processes. We also have other models built with this same strategy and trained from scratch with tweets, such as DeepMoji and SSWE, which occupy the seventh and eighth positions in Table 10, respectively. The better performance of Emo2Vec may be a result of its multi-task learning approach. Considering another model with the same architecture as w2v-Edin and also trained from scratch with tweets, the difference in performance between w2v-Edin and w2v-Araque (the fourteenth position in Table 10) may lie in the volume of training data (w2v-Araque: 1.28M tweets; w2v-Edin: 10M) and the vocabulary size (w2v-Araque: 57K; w2v-Edin: 259K). However, among these, RoBERTa is the only Transformer-based model, which holds state-of-the-art performance in capturing the context and semantics of terms from texts. Furthermore, regarding w2v-Edin, although it was trained with a more straightforward architecture (a feed-forward neural network) compared to the others, its training parameters were optimized for the emotion detection task on tweets [8], which may have helped in determining the sentiment expressed in tweets.

Surprisingly, as shown in Table 10, BERTweet achieved the worst overall performance among all assessed text representations, despite having been trained using the same state-of-the-art Transformer-based architecture as RoBERTa, and on tweets. One possible explanation for this behavior is that the BERTweet training procedure limits the representation of its training tweets to 60 tokens only, while RoBERTa uses a limit of 512 tokens. For that reason, we believe that the RoBERTa model is able to capture more semantic information in the tokens of its training vocabulary compared to BERTweet when one collects the token representations from the embedding layer.

Finally, regarding research question RQ1, we can highlight and suggest that: (i) disregarding the classification algorithms, Emo2Vec, w2v-Edin, and RoBERTa seem to be well-suited representations for determining the sentiment expressed in tweets, and (ii) considering the combination of text representations and classifiers, RoBERTa + SVM achieved the best overall performance, which may represent a good choice for Twitter sentiment analysis in hardware-restricted environments, since the cost here is mostly due to the classifier induction.

In this section, we address research question RQ2, as follows: RQ2. Considering state-of-the-art Transformer-based autoencoder models, which are the most effective in the sentiment classification of tweets?
To answer that question, we conduct a thorough evaluation of the widely used BERT and RoBERTa models and of the BERT-based Transformer trained from scratch with tweets, namely BERTweet. These models represent a set of the most recent Transformer-based autoencoder language modeling techniques that have achieved state-of-the-art performance in many NLP tasks. While BERT is the first Transformer-based autoencoder model to appear in the literature, RoBERTa is an evolution of BERT with an improved training methodology, due to the elimination of the next sentence prediction task, which may fit NLP tasks on tweets considering they are limited in size and self-contained in context. Moreover, by evaluating BERTweet we analyze the performance of a Transformer-based model trained from scratch on tweets.

In this set of experiments, we give each tweet as input to the Transformer model and concatenate its last four layers to obtain each token representation; the tweet representation is then the average of the token representations. Next, those representations, collected from the whole dataset, are given as input to the classification algorithm together with the labels of the tweets. Finally, the learned classifier is employed to perform the evaluation. In this way, we once again follow the feature extraction plus classification strategy, but now using the contextualized embedding of each tweet.

Table 13 presents the classification results when using the SVM classifier in terms of accuracy and F1-macro, and Table 14 shows a summary of the complete evaluation regarding all classifiers. As in the previous section, to limit the number of tables in the manuscript, we only report the computational results in detail for the LR classifier as an example of this evaluation (refer to Online Resource 1 for the detailed evaluation). From Table 13, we can note that BERTweet achieved the best results in 18 out of the 22 datasets for both accuracy and F1-macro. Similarly, regarding all classifiers, Table 14 shows that BERTweet outperformed BERT and RoBERTa by a significant difference in terms of the total number of wins for both accuracy and F1-macro.

Next, we present an overall analysis of using the BERT, RoBERTa, and BERTweet models to train each one of the five classification algorithms, examining them as pairs {language model, classifier}. Table 15 presents the average rank position across all 15 possible combinations (3 language models × 5 classification algorithms), from best to worst, as explained in Section 4. We can observe that BERTweet combined with the LR, MLP, and SVM classifiers achieved the best overall performances for both accuracy and F1-macro. Conversely, using RF on top of the Transformer-based embeddings seems to harm the performance of the models. Tables 16 and 17 show a summary of the results for each model and classifier, respectively, from best to worst, in terms of the average rank position. From Table 16, we can see that BERTweet achieved the best overall classification effectiveness and was ranked first. Also, RoBERTa and BERT achieved comparable overall performances for both accuracy and F1-macro. Regarding the classifiers, as shown in Table 17, MLP and LR achieved rather comparable performances and were ranked as the top two best classifiers regarding the Transformer-based models, followed by SVM, XGB, and RF.

Regarding the results achieved for each dataset, Table 18 presents the best results in terms of accuracy and F1-macro.
As we can notice, BERTweet outperformed BERT and RoBERTa in 17 out of the 22 datasets in terms of accuracy and in 18 out of the 22 datasets in terms of F1-macro. These results may confirm that Twitter sentiment classification benefits most from contextualized language models trained from scratch on Twitter data. Unlike BERT and RoBERTa, which were trained on traditional English texts, BERTweet was trained on a huge amount of 850M tweets. This fact may have helped BERTweet in learning the specificities of tweets, such as their morphological and semantic characteristics.

For a better understanding of the results, we present an analysis of the difference between the vocabularies embedded in the assessed models. For this purpose, Table 19 highlights the number of tokens shared between BERT, RoBERTa, and BERTweet. In other words, we show the amount of tokens (in %) embedded in the model presented in each row that are also included in the model presented in each column, i.e., the intersection between their vocabularies. The information below each model name in the columns refers to its vocabulary size (number of embedded tokens).

Table 19 Percentage of the vocabulary tokens of the language model in each row that are also in the vocabulary of the language model in each column.

For example, regarding BERT (first row), we can see that 61% of its tokens can be found in RoBERTa (second column). It is possible to note that only 32% of the 64K tokens from the BERTweet vocabulary (i.e., about 20K tokens) can be found in BERT. It means that, compared to BERT, BERTweet contains about 44K (64K − 20K) specific tokens extracted from tweets. Similarly, 55% of the tokens embedded in BERTweet (i.e., about 35K tokens) can be found in RoBERTa, meaning that BERTweet holds about 29K (64K − 35K) specific tokens from tweets that are not included in RoBERTa. As a matter of fact, analyzing the tokens embedded in BERTweet, we find some specific tokens, such as "KKK", "Awww", "hahaha", "broo", and other internet expressions and slang that social media users often use to express themselves. While creating representations for these tokens is straightforward in BERTweet, BERT and RoBERTa need to perform some extra steps. Specifically, when BERT and RoBERTa do not find a token in their vocabularies, they split the token into subtokens until all of them are found. For example, the token "KKK" would be split into "K", "K", and "K" to represent the original token. This analysis points out that this particular vocabulary, combined with a language model trained to learn the intrinsic structure of tweets, is responsible for the BERTweet language model's best performance on tweet sentiment classification.

In this context, regarding RQ2, we believe BERTweet is an effective language modeling technique for distinguishing the sentiment expressed in tweets. Also, regarding the classifiers, in general, MLP and LR seem to be good choices when using Transformer-based models.

Differently from the static representations, for which we used only the embedding layer of the 13 language models, in this section we use the whole language model: the tweet goes from the embedding layer up to the last layer to be transformed into a vector representation. Attempting to understand the benefits of using the whole language model (embedding layer and language model), we compare the predictive performance of the Transformer-based models evaluated in this section against all the static representations assessed in Section 4.
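The two feature-extraction protocols being compared can be contrasted with the following sketch: (i) the static look-up of the input embedding table, as in Section 4, and (ii) the contextualized representation of Section 5, in which each token vector is the concatenation of the last four hidden layers and the tweet vector is the average over tokens. The code relies on the Hugging Face transformers library and is our reconstruction of the procedure described in the text, with roberta-base used as a stand-in checkpoint.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "roberta-base"  # "bert-base-uncased" or "vinai/bertweet-base" can be used the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

tweet = "someuser this phone is amazing someurl"
enc = tokenizer(tweet, return_tensors="pt")

# (i) Static representation: look up the input embedding table only (Section 4 protocol).
with torch.no_grad():
    embedding_table = model.get_input_embeddings()          # nn.Embedding used as a look-up table
    static_tokens = embedding_table(enc["input_ids"])[0]    # (num_tokens, hidden_size)
static_tweet_vector = static_tokens.mean(dim=0)

# (ii) Contextualized representation: full forward pass, concatenate the last
# four hidden layers per token, then average over tokens (Section 5 protocol).
with torch.no_grad():
    hidden_states = model(**enc).hidden_states              # embedding output plus one tensor per layer
contextual_tokens = torch.cat(hidden_states[-4:], dim=-1)[0]  # (num_tokens, 4 * hidden_size)
contextual_tweet_vector = contextual_tokens.mean(dim=0)

print(static_tweet_vector.shape, contextual_tweet_vector.shape)  # (768,) and (3072,) for roberta-base
```

In both cases, the resulting vectors are then given, together with the tweet labels, to the classifiers of Table 2.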
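Relatedly, the vocabulary overlap reported in Table 19 can be approximated directly from the released tokenizers, as sketched below with the Hugging Face transformers library. The checkpoint names are the public ones we assume correspond to the assessed models, and the raw string intersection does not normalize the different subword markers (WordPiece "##" versus BPE markers), so the percentages only approximate the reported values.

```python
from transformers import AutoTokenizer

# Public checkpoints assumed to correspond to the three assessed models.
checkpoints = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BERTweet": "vinai/bertweet-base",
}
vocabs = {name: set(AutoTokenizer.from_pretrained(ckpt).get_vocab()) for name, ckpt in checkpoints.items()}

for row, row_vocab in vocabs.items():
    for col, col_vocab in vocabs.items():
        if row != col:
            shared = 100 * len(row_vocab & col_vocab) / len(row_vocab)
            print(f"{shared:.0f}% of {row}'s {len(row_vocab)} tokens also appear in {col}")
```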
Table 20 presents the top ten results across all 85 possible combinations of models and classifiers (17 models × 5 classification algorithms), and Table 21 shows an overall evaluation of the models, from best to worst, in terms of the average rank position. In addition, Table 22 shows the best results achieved for each dataset. From Tables 20 and 21, we can notice that the Transformer-based BERTweet model outperformed all other models and was ranked first in both evaluations. Also, Table 21 shows that the Transformer-based models achieved the best overall results against all static models and were ranked as the top three best representations. Furthermore, from Table 22, the Transformer-based BERTweet model achieved the best overall classification effectiveness in 16 out of the 22 datasets in terms of accuracy and in 17 out of the 22 datasets in terms of F1-macro. These results point out that running the text through the full set of learned language model parameters is essential in distinguishing the sentiment expressed in tweets. Static representations may lose a lot of relevant information, considering they ignore the diversity of meaning that words may have depending on the context in which they appear. In contrast, Transformer-based models benefit from learning how to encode the context information of a token in an embedding.

In this section, we aim at performing computational experiments in order to answer research question RQ3, stated as follows: RQ3. Can the fine-tuning of Transformer-based autoencoder models using a large set of English tweets improve the sentiment classification performance? To answer this research question, we evaluate the classification effectiveness of the BERT, RoBERTa, and BERTweet language models fine-tuned with tweets from a corpus of 6.7M generic unlabeled tweets, as described in Section 3.3. Precisely, we use this set of tweets to fine-tune the model weights using the intermediate masked language model task as the training objective, with a probability of 15% of (randomly) masking tokens in the input. We also compare the fine-tuning results of such models against those achieved by using the original weights of the Transformer-based models, as presented in Section 5, in order to analyze whether the adjustment of the models via fine-tuning improves the predictive performance of the sentiment classification. In general, the performance of fine-tuned models is very sensitive to different random seeds [17]. For that reason, all the results presented in this section are the average of three executions using different seeds (12, 34, 56), to account for the sensitivity of the fine-tuning process regarding different seeds [16].

The first part of the experiments reported in this section consists in determining whether the predictive performance of the Transformer-based models is affected by the fine-tuning procedure when using tweets from corpora of different sizes. For this purpose, in addition to the entire Edinburgh corpus of 6,657,700 tweets (around 6.7M tweets), we used nine other smaller samples of tweets of different sizes, varying from 500 to 1.5M tweets. Specifically, we generated samples containing 0.5K, 1K, 5K, 10K, 25K, 50K, 250K, 500K, and 1.5M generic unlabeled tweets. In the fine-tuning processes, we performed three training epochs, except for the models tuned with 6.7M tweets, for which we used one epoch, as there was a degradation of some models, such as BERTweet. In all fine-tuning processes, all layers are kept unfrozen.
Tables 23 and 24 present the average classification accuracies and F1-macro scores, respectively, when fine-tuning the Transformer-based models with the different samples of tweets generated from the Edinburgh corpus. These results were achieved by using the SVM classifier (refer to Online Resource 1 for the detailed evaluation of each classifier). Regarding the variance in performance across the different seeds, the mean and maximum standard deviations are 0.05% and 0.5%, respectively, for both accuracy and F1-macro. Note that BERT benefited most when fine-tuned with samples of 250K tweets (see the rank position row), for both accuracy and F1-macro. RoBERTa achieved the best overall results when fine-tuned with samples of 1.5M and 250K tweets, in terms of accuracy and F1-macro, respectively. On the other hand, BERTweet benefited from smaller samples, achieving higher overall predictive performance when fine-tuned with samples of 25K and 5K tweets in terms of accuracy and F1-macro, respectively. This is an expected result, as BERTweet is already trained from scratch on tweets. Since we are fine-tuning on the language model task, BERT and RoBERTa seem to require more samples to accommodate the Twitter-based vocabulary into the models' weights.

Next, we analyze the overall performance of the fine-tuned Transformer-based models for each classification algorithm. Table 25 summarizes the results, reporting the number of wins, rank sum, and rank position achieved by each classifier when fine-tuning the Transformer-based autoencoder models with the different samples of unlabeled tweets. Regarding the variance across the different seeds, the mean and maximum standard deviations are 0.2% and 0.7% in terms of accuracy, and 0.26% and 0.98% in terms of F1-macro. Interestingly, from Table 25, we can note that when fine-tuning a language model to fit a specific type of text, such as tweets, applying large corpora does not guarantee better predictive performance. Specifically, the best overall results (Total column) were achieved when fine-tuning the BERT, RoBERTa, and BERTweet models with samples of 250K, 50K, and 5K tweets, respectively, for both accuracy and F1-macro.

Regarding the results achieved for each dataset, Table 26 shows the best predictive performances in terms of accuracy and F1-macro. We can see that BERTweet achieved the best results for most datasets when fine-tuned with a smaller number of tweets. More specifically, BERTweet outperformed the other models when fine-tuned with samples varying from 1K to 25K tweets in 14 out of the 22 datasets for both accuracy and F1-macro.

As in previous sections, we also present an overall evaluation combining all fine-tuned models and classifiers across the 22 datasets, in terms of the average rank position. Table 27 shows the top ten results among all 150 possible combinations (3 models × 10 samples of tweets × 5 classification algorithms). As we can see in Table 27, fine-tuned BERTweet embeddings achieved the best overall performances when used to train LR, MLP, and SVM, dominating the top ten results. Also, note that by using LR, MLP, and SVM, BERTweet outperformed all other models when fine-tuned with samples containing 50K tweets or less. Tables 28 and 29 show the top ten results among all fine-tuned models and a summary of the results for each classifier, from best to worst, respectively, in terms of the average rank position.
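The average rank position used in these tables can be computed as in the following sketch: for each dataset, all competing model/classifier combinations are ranked by their score (rank 1 for the best), and the ranks are then averaged over the 22 datasets. The scores below are made up for illustration, and the snippet reflects our reading of the evaluation protocol rather than code taken from the paper.

```python
# Illustrative computation of average rank positions across datasets.
import numpy as np
from scipy.stats import rankdata

# rows = datasets, columns = model/classifier combinations (e.g., accuracy)
scores = np.array([
    [0.81, 0.84, 0.79],
    [0.76, 0.80, 0.77],
    [0.88, 0.90, 0.86],
])

# Negate the scores so that the highest score receives rank 1; ties get the
# average of the ranks they span.
ranks = np.vstack([rankdata(-row, method="average") for row in scores])
print(ranks.mean(axis=0))   # lower average rank = better overall combination
```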
From Table 28, we can notice that all ten BERTweet fine-tuned models (0.5K, 1K, 5K, 10K, 25K, 50K, 250K, 500K, 1.5M, and 6.7M) were ranked in the top ten results. Furthermore, neither BERT nor RoBERTa appears in the top results, even when fine-tuned with the entire corpus of 6.7M tweets. RoBERTa appears only at position 24 of the accuracy ranking, with an average rank of 37.02 when tuned with 50K tweets and combined with the MLP classifier, and at position 28 of the F1-macro ranking, with an average rank of 37.27 when tuned with 50K tweets and combined with the LR classifier. BERT appears only at position 56 of the accuracy ranking, with an average rank of 66.05 when tuned with 1.5M tweets and combined with the MLP classifier, and at position 51 of the F1-macro ranking, with an average rank of 60.77 when tuned with 6.7M tweets and combined with the LR classifier. Among the classifiers, as we can see in Table 29, MLP and LR achieved the best predictive performances and were ranked as the two best classifiers. Conversely, RF was ranked as the worst classifier.

From all previous evaluations, we can note that as the size of the samples increases, the fine-tuning procedure seems to become less effective. This may be due to the adjustment of the weights of the models' layers during the backpropagation process. Considering that the fine-tuning procedure consists in unfreezing the entire previously obtained model and adjusting its weights with the new data, the original model and the semantic and syntactic knowledge learned in its layers are changed. In that case, we believe that after some training iterations the adjustment of the weights starts to damage the original knowledge embedded in the models' layers. This may further explain why BERTweet achieved improved classification performance with smaller samples of tweets as compared to BERT and RoBERTa. Our hypothesis is that, considering that the weights in BERTweet's layers are already specifically adjusted to fit the tweets' language style, using more data to fine-tune the model amounts to merely continuing the initial training, and too much additional data may harm the learned weights of the model. Thus, we suggest that when fine-tuning Transformer-based models, such as BERT, RoBERTa, and BERTweet, samples of different sizes should be explored instead of simply adopting a dataset with a massive number of instances.

Additionally, we present a comparison of all fine-tuned Transformer-based models against their original versions. Tables 30, 31, and 32 report this comparison in terms of the average rank position for BERT, RoBERTa, and BERTweet, respectively. We can see that the fine-tuned versions achieved meaningful predictive performance gains compared to their original models, which indicates that fine-tuning strategies can boost classification performance in Twitter sentiment analysis. Moreover, from Tables 30 and 31, we note that the fine-tuned versions of BERT and RoBERTa benefited most from samples containing a large amount of tweets. Conversely, as pointed out before, BERTweet achieved better overall performance by using smaller samples, as shown in Table 32.

Addressing research question RQ3, we could see that fine-tuning Transformer-based models improves the classification effectiveness in Twitter sentiment analysis.
Nevertheless, using large sets of tweets does not guarantee better predictive performance, particularly for models trained from scratch on tweets, such as BERTweet. We could observe that BERTweet benefited most from samples containing 50K tweets or less. Furthermore, regarding the classifiers, in general, MLP and LR seem to be good choices to be employed after extracting the features from fine-tuned Transformer-based models.

7 Fine-tuning Transformer-based models using sentiment datasets

The experiments conducted in this section aim at answering research question RQ4, stated as follows: RQ4. Can Transformer-based autoencoder models benefit from a fine-tuning procedure with tweets from sentiment analysis datasets? We address this research question by evaluating whether the sentiment classification of tweets benefits from fine-tuning the language models with tweets from sentiment analysis datasets. For this purpose, we use the same collection of 22 benchmark datasets presented in Section 3.1 (Table 1). We perform this evaluation by assessing three distinct strategies that simulate three real-world scenarios. In addition, as done in Section 6, all experiments were performed three times using different seeds (12, 34, 56), with the same hyperparameters, and we report the average of the results.

The first fine-tuning strategy we investigate, referred to as InData, simulates the usage of the specific sentiment dataset itself as the new domain dataset to fine-tune a pre-trained language model. Precisely, each one of the 22 datasets is used once as the target dataset. For each of the 22 datasets, we use a 10-fold cross-validation procedure. In each of the ten executions, we use the tweets from nine folds as the source data (i.e., the training data) used to adjust the language model, which is then validated on the remaining fold (i.e., the test data).

The second strategy, referred to as LOO (Leave One dataset Out), simulates the situation in which a collection of general sentiment datasets is available to fine-tune the language model. We use each dataset once as the target dataset, while the tweets from the remaining 21 datasets are combined to tune the language model. Although the target dataset contains sentiment labels for each tweet, these labels are not used in the fine-tuning process, as we leverage the intermediate self-supervised masked language model task to fine-tune the network parameters.

The third and last strategy, referred to as AllData, is a combination of the other two. Specifically, as in strategy InData, for each assessed dataset (target dataset), and for each iteration of the 10-fold cross-validation procedure, we combine the tweets from the nine training folds (i.e., the training data of the target dataset) with the tweets from the remaining 21 datasets to fine-tune the language model. This last strategy evaluates the benefits of combining the tweets from a specific sentiment target dataset with a representative corpus of general sentiment datasets in the fine-tuning process.
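A minimal sketch of how the three fine-tuning corpora could be assembled for one target dataset and one cross-validation iteration is shown below; the function and variable names are hypothetical, introduced only to make the strategies concrete.

```python
# Hypothetical helper illustrating the InData, LOO, and AllData corpora.
from typing import List

def build_finetuning_corpus(strategy: str,
                            target_train_folds: List[str],
                            other_datasets: List[List[str]]) -> List[str]:
    """target_train_folds: tweets from the nine training folds of the target
    dataset; other_datasets: tweet lists from the remaining 21 datasets.
    Sentiment labels are never used; only the raw tweets feed the masked
    language model objective."""
    others = [tweet for dataset in other_datasets for tweet in dataset]
    if strategy == "InData":    # only the target dataset's training folds
        return list(target_train_folds)
    if strategy == "LOO":       # only the other 21 sentiment datasets
        return others
    if strategy == "AllData":   # union of the two previous corpora
        return list(target_train_folds) + others
    raise ValueError(f"unknown strategy: {strategy}")
```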
Table 33 presents the predictive performance achieved by fine-tuning each language model with strategies InData, LOO, and AllData, one at a time. As in previous sections, due to space constraints, we report only the detailed evaluation using the SVM classifier (refer to Online Resource 1 for the detailed assessment of each classifier). From Table 33, we can observe that BERT benefited most from strategy InData, which uses only the target dataset itself to adjust the language model. Conversely, fine-tuning the RoBERTa and BERTweet models using strategies that combine tweets from distinct sentiment analysis corpora achieved the best results for most datasets. More clearly, AllData, which combines the tweets from the target dataset with tweets from a collection of sentiment datasets, achieved the best overall results with both RoBERTa and BERTweet. Also, regarding BERTweet, note that strategy LOO achieved performance comparable to AllData. It is also noteworthy that smaller datasets seem to have benefited most from fine-tuning RoBERTa and BERTweet with strategy LOO. On the other hand, larger datasets achieved higher predictive performance when strategy AllData was used to fine-tune RoBERTa and BERTweet. Table 34 shows a summary of the complete evaluation regarding all classifiers.

Regarding the overall results achieved for each dataset, Table 35 presents the best results. We can note that when fine-tuning the Transformer-based models with tweets from sentiment datasets, BERTweet outperformed BERT and RoBERTa for all datasets, except for the sarcasm (sar) and hobbit (hob) datasets. Interestingly, as mentioned before, while strategy LOO achieved the best results for smaller datasets, larger datasets seem to benefit from strategy AllData. Precisely, strategy AllData achieved the best overall performance in ten out of the 22 datasets in terms of accuracy and in 11 out of the 22 datasets in terms of F1-macro. Strategy LOO achieved the best results in nine out of the 22 datasets for both accuracy and F1-macro. The better performance of the AllData strategy for larger target datasets indicates that the significant amount of information present in the target dataset is indispensable to the fine-tuning process, while the information present in smaller datasets seems not to contribute much, making the LOO strategy adequate for datasets with a limited number of tweets. Conversely, strategy InData did not achieve meaningful results. The inferior performance of the InData strategy on almost all datasets shows that, regardless of the size of the dataset, the use of external and more extensive data brings more information to the fine-tuning process, improving the final performance.

Next, we present an overall evaluation combining all fine-tuned models and classifiers across the 22 datasets, in terms of the average rank position. Table 36 reports the top ten results among all 45 possible combinations (3 language models × 3 fine-tuning strategies × 5 classification algorithms). We can observe that the LR classifier trained with BERTweet embeddings fine-tuned via strategy AllData achieved the best overall predictive performance. Also, note that the BERTweet embeddings fine-tuned with strategies AllData and LOO, combined with LR, MLP, and SVM, appear at the top of the ranking (top six results). Another point worth highlighting is that BERTweet dominates the top ten results, appearing in eight out of the ten positions in terms of accuracy and in nine out of the ten positions in terms of F1-macro. Tables 37 and 38 show the results among all fine-tuned models and a summary of the results for each classifier, from best to worst, respectively, in terms of the average rank position.
Once again, from Table 37, we can notice that all BERTweet fine-tuned models (InData, LOO, and AllData) were ranked in the top three results. Among the classifiers, as we can see in Table 38, MLP and LR achieved the best predictive performances and were ranked as the two best classifiers. Conversely, RF was ranked as the worst classifier.

To evaluate the effectiveness of fine-tuning the Transformer-based models with tweets from sentiment datasets, we present a comparison among all fine-tuning strategies assessed in this study for each language model. Specifically, we compare the fine-tuned models presented in this section, obtained with strategies InData, LOO, and AllData, against the best fine-tuned models identified in Section 6, i.e., BERT-250K, RoBERTa-50K, and BERTweet-5K. Tables 39, 40, and 41 report these results in terms of the average rank position for BERT, RoBERTa, and BERTweet, respectively. Regarding BERT, as shown in Table 39, note that all fine-tuning strategies using tweets from sentiment datasets achieved better overall results than using the sample of 250K generic tweets. Moreover, strategy InData appears at the top of the ranking as the best fine-tuning strategy. It is worth mentioning that strategy InData uses only the tweets from the target dataset itself to adjust the language model, which means that it used a number of tweets much smaller than the 250K tweets contained in the sample. On the other hand, as we can see in Tables 40 and 41, strategy InData did not achieve meaningful results for the RoBERTa and BERTweet models. Nevertheless, for these models, strategies AllData and LOO, which also use tweets from sentiment datasets, achieved rather comparable performances and were ranked as the two best fine-tuning strategies.

To acknowledge the effectiveness of fine-tuning the Transformer-based models with tweets from sentiment datasets, we present an overall comparison among all fine-tuning strategies and all 47 models previously assessed in this study. Tables 42 and 43 present, respectively, the ten best and the ten worst overall combinations of model and classifier, assessing the average rank position of all 280 combinations (56 models × 5 classifiers). We note that BERTweet tuned with tweets from sentiment datasets and combined with LR and MLP achieved the four best results in terms of accuracy and the two best results in terms of F1-macro. These combinations were followed by BERTweet tuned with generic tweets. More specifically, combinations with strategies AllData and LOO achieved the better overall results. Independently of the language model, LR and MLP were the most frequent classifiers in the top ten results. Conversely, all of the ten worst combinations are static representations combined with RF, which appears in every one of the worst model and classifier combinations.

Assessing only the different kinds of embeddings, Tables 44 and 45 present, respectively, the best and the worst average rank positions comparing all 56 representations (the nine models tuned with sentiment datasets and the 47 previous representations). This analysis confirms the good performance of fine-tuning the Transformer-based models with tweets from sentiment datasets. More specifically, strategies AllData and LOO obtained the two best results. It is also possible to notice that tuning BERTweet with generic tweets brings performance improvements as well.
Regarding the worst performers, presented in Table 45, it is possible to note that all ten are again static representations.

Lastly, regarding research question RQ4, we can highlight that fine-tuning Transformer-based models with tweets from sentiment datasets seems to boost classification performance in Twitter sentiment analysis. As a matter of fact, strategies AllData and LOO, exploited in this section, which use a collection of sentiment tweets to adjust a language model, achieved better overall results than using samples of generic unlabeled tweets. Although we do not use the labels of those tweets in the fine-tuning procedure, they may carry much more sentiment information than the tweets from the Edinburgh corpus, from which the samples of generic unlabeled tweets used in the experiments originated. Furthermore, BERTweet embeddings fine-tuned with strategy AllData seem to be very effective in determining the sentiment expressed in tweets, especially when used to train the LR, MLP, and SVM classifiers.

In this article, we presented an extensive assessment of modern and classical word representations when used for the task of Twitter sentiment analysis. Specifically, we assessed the classification performance of 14 static representations and of the most recent Transformer-based autoencoder models, including BERT, RoBERTa, and BERTweet, as well as different strategies for fine-tuning the language representation tasks in such models. All models were evaluated in the context of Twitter sentiment analysis using a rich set of 22 datasets and five classifiers of distinct natures. The main focus of this study was on identifying the most appropriate word representations for the sentiment analysis of English tweets. Based on the results of the experiments performed in this study, we can highlight the following conclusions:

- Considering the static representations in a limited-resource scenario, we could note that the Emo2Vec, w2v-Edin, and RoBERTa models seem to be well-suited representations for determining the sentiment expressed in tweets. The good performance achieved by Emo2Vec and w2v-Edin indicates that being trained from scratch with tweets can boost the classification performance of static representations when applied to Twitter sentiment analysis. Although RoBERTa was not trained from scratch with tweets, it is a Transformer-based autoencoder model, which holds state-of-the-art performance in several NLP tasks. Regarding the classifiers, we could see that SVM and MLP achieved the best overall performances, especially when used to train RoBERTa's static embeddings.

- Regarding the Transformer-based models, we could observe that BERTweet is the most appropriate language model to be used in the sentiment classification of tweets. Specifically, the particular vocabulary that tweets contain, combined with a language model trained with a focus on learning their intrinsic structure, can effectively improve the performance of the Twitter sentiment analysis task. Considering the combination of language models and classifiers, we can point out that BERTweet achieved the best overall results when combined with LR and MLP. Furthermore, by comparing the Transformer-based models with the static representations, we could notice that the adaptation of the tokens' embeddings to the context in which they appear, performed by the Transformer-based models, benefits the sentiment classification task.
- When fine-tuning the Transformer-based models with a large set of English unlabeled tweets, we could note that, although it improves the classification performance, using as many tweets as possible does not necessarily mean better results. In this context, we presented an extensive evaluation of sets of tweets of different sizes, varying from 0.5K to 1.5M. The results have shown that while BERT and RoBERTa achieved better predictive performances when tuned with sets of 250K and 50K tweets, respectively, BERTweet outperformed all fine-tuned models using only 5K tweets. This result indicates that models trained from scratch with tweets, such as BERTweet, need fewer tweets to have their performance improved. Moreover, by comparing all fine-tuned models taking the classifiers into account, BERTweet combined with MLP, LR, and SVM achieved the best overall performances.

- Analyzing the fine-tuning of the Transformer-based autoencoder language models with sentiment analysis datasets, i.e., with tweets that express polarity, we can see that the performance of the tuned models is better than when they are tuned with generic tweets. All fine-tuning strategies with sentiment analysis datasets performed better than the best models tuned with generic tweets. We conclude, then, that it is worth fine-tuning a Transformer-based autoencoder model using a set of sentiment tweets. Among the fine-tuning strategies using sentiment analysis tweets explored in this study, it was possible to perceive that each Transformer model performed better with a different adjustment method. The use of only the target dataset, for example, was a good option for BERT. For RoBERTa and BERTweet, the combination of the target dataset with a set of tweets from other datasets proved to be a good strategy for fine-tuning the language model. In a general comparison, we noticed that BERTweet tuned with the union of the target dataset and the set of sentiment analysis tweets (BERTweet 22Dt) performed better than the other adjusted models. Besides, we could observe that BERTweet 22Dt presented a good performance when combined with the LR and MLP classifiers.

- After answering our research questions, we can briefly state that: (i) Transformer-based autoencoder models perform better than static representations; (ii) Transformer-based autoencoder models fine-tuned with English tweets behave better than their respective original models; and, finally, (iii) it is worth fine-tuning a language model originally trained with generic English tweets with tweets from sentiment analysis datasets.

Considering all original and fine-tuned models, the best overall performance for the English tweet sentiment analysis task was achieved by the Transformer-based autoencoder model trained from scratch with generic tweets (BERTweet) when fine-tuned with tweets from a target sentiment dataset combined with tweets from a large set of other sentiment datasets. This strategy was called BERTweet 22Dt, which we consider a good suggestion for the sentiment classification of English tweets, mainly when combined with the MLP or LR classifiers.

For future work, we plan to investigate other methods for fine-tuning language models, mainly considering the polarity classification as the downstream tuning task. Transformer-based autoencoder pre-trained models, such as BERT, RoBERTa, and BERTweet, can have their weights adjusted to become more accurate in a specific task, such as sentiment analysis. This adjustment is made by adding an extra classification layer on top of the model and backpropagating the error of the final task through the language model's weights.
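A minimal sketch of this task-specific fine-tuning, under our own assumptions (the public vinai/bertweet-base checkpoint, a toy two-tweet dataset, and default Trainer hyperparameters), is shown below; it only illustrates the mechanism and does not reproduce the envisioned future experiments.

```python
# Illustrative task-specific fine-tuning: a classification head is added on top
# of the pre-trained encoder and the polarity-classification error is
# backpropagated through all of the language model's weights.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/bertweet-base", num_labels=2)   # randomly initialized polarity head

# Toy labelled tweets standing in for one of the 22 sentiment datasets.
data = Dataset.from_dict({"text": ["I love this!", "This is awful..."],
                          "label": [1, 0]})
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                        max_length=128), batched=True)

args = TrainingArguments(output_dir="task-finetuned", num_train_epochs=3,
                         per_device_train_batch_size=8, learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=data,
        tokenizer=tokenizer).train()
```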
We intend, then, to compare the best results obtained in this study with those achieved by this task-specific category of fine-tuning.

References

Docbert: BERT for document classification
Learning emotion-enriched word representations
Application of Deep Learning Approaches for Sentiment Analysis. Deep Learning-Based Approaches for Sentiment Analysis
Applying BERT to document retrieval with birch
Enhancing deep learning sentiment analysis with ensemble techniques in social applications
Robust sentiment detection on Twitter from biased and noisy data
Learning long-term dependencies with gradient descent is difficult
Determining word-emotion associations from tweets by multi-label classification
ACM International Conference on Web Intelligence (WI)
Sentiment analysis is a big suitcase
Exploiting Different Types of Features to Improve Classification Effectiveness in Twitter Sentiment Analysis
On the evaluation and combination of state-of-the-art features in Twitter sentiment analysis
Efficientqa: a roberta based phrase-indexed question-answering system
Extracting diverse sentiment expressions with target-dependent polarity from twitter
Bert: Pre-training of deep bidirectional transformers for language understanding
Characterizing debate performance via aggregated twitter sentiment
Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping
Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping
Adaptive recursive neural network for target-dependent twitter sentiment classification
Summary from the kdd-03 panel - data mining: The next 10 years
Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm
A Vector Space Model for Automatic Indexing
Target-dependent sentiment classification with bert
Vader: A parsimonious rule-based model for sentiment analysis of social media text
Anais do IV Brazilian Workshop on Social Network Analysis and Mining
Deep Learning
Universal language model fine-tuning for text classification
Albert: A lite bert for self-supervised learning of language representations
Flaubert: Unsupervised language model pre-training for french
Sentiment analysis: Mining opinions, sentiments, and emotions
A robustly optimized bert pretraining approach
Short text opinion detection using ensemble of classifiers and semantic indexing
Introduction to Information Retrieval
Advances in pre-training distributed word representations
Distributed representations of words and phrases and their compositionality
Proceedings of International Workshop on Semantic Evaluation (SemEval-2018)
SemEval-2016 task 4: Sentiment analysis in Twitter
Language-independent twitter sentiment analysis
Bertweet: A pre-trained language model for english tweets
Twitter as a corpus for sentiment analysis and opinion mining
GloVe: Global vectors for word representation
Deep contextualized word representations
Deep contextualized word representations
The Edinburgh twitter corpus
Improving language understanding by generative pre-training
Proceedings of the 11th International Workshop on Semantic Evaluation
Semantic sentiment analysis of microblogs
Evaluation datasets for twitter sentiment analysis: A survey and a new dataset, the sts-gold
A Statistical Interpretation of Term Specificity and Its Application in Retrieval
Twitter polarity classification with label propagation over lexical links and the follower graph
Learning sentiment-specific word embedding for Twitter sentiment classification
Sentiment strength detection for the social web
Distributed representations of words and phrases and their compositionality. NIPS
Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics
From frequency to meaning: Vector space models of semantics
From frequency to meaning: Vector space models of semantics
Emo2Vec: Learning generalized emotion representation by multi-task training
Aligning books and movies: Towards story-like visual explanations by watching movies and reading books

The authors would like to thank the Brazilian research agencies FAPERJ and CNPq for the financial support. The authors declare that they have no conflict of interest.