key: cord-0843540-vwwncucc
authors: Demotte, P.; Wijegunarathna, K.; Meedeniya, D.; Perera, I.
title: Enhanced sentiment extraction architecture for social media content analysis using capsule networks
date: 2021-09-16
journal: Multimed Tools Appl
DOI: 10.1007/s11042-021-11471-1
sha: e97ffcbd121d60d5d322de78bfcb684ff75520fc
doc_id: 843540
cord_uid: vwwncucc

Recent research has produced efficient algorithms based on deep learning for text-based analytics. Such architectures could be readily applied to text-based social media content analysis. Deep learning techniques, which require comparatively fewer resources for language modeling, can be effectively used to process social media content data that change regularly. Approaches based on convolutional neural networks and recurrent neural networks have reported prominent performance in this domain, yet their limitations make them sub-optimal. Capsule networks are a promising technique for language modelling tasks beyond their initial use in image classification. This study proposes an approach based on capsule networks for social media content analysis, especially for Twitter. We empirically show that our approach is optimal even without the use of any linguistic resources. The proposed architectures produced an accuracy of 86.87% for the Twitter Sentiment Gold dataset and an accuracy of 82.04% for the CrowdFlower US Airline dataset, indicating state-of-the-art performance. Hence, the research findings indicate a noteworthy accuracy enhancement for text processing within social media content analysis.

With the recent rapid growth in information and communication technologies, social media has become a major form of human interaction. This has created a large pool of data for researchers to analyze, infer trends from, and base suggestions on.
Sentiment analysis is the systematic computational analysis of opinions, sentiments, and expressions in text, and plays a vital role in analyzing user opinions [20]. Twitter is one of the growing social media networks, with over 330 million users. Tweets are uniquely suited for sentiment analysis to infer further knowledge because of their brevity and precise nature. Tweets are limited to a maximum length of 280 characters, though statistically only 1% of tweets reach this limit [26]. Recent research in text processing and sentiment analysis using deep learning has reported state-of-the-art results, considering many aspects of language modeling through neural language representation [40]. Such techniques include convolutional neural networks (CNNs) and recurrent neural networks (RNNs) such as Long Short-Term Memory (LSTM) networks and Bi-LSTMs [35]. However, the recurrent techniques have inherent limitations that degrade performance in social media content analysis [15]. CNNs, which have produced better performance, in turn suffer from information loss due to the max-pooling operation used between the convolutional layers [31]. In addition, recently introduced attention-based models such as Transformer-based strategies show better performance. However, these techniques require more language resources and computational power, which hinders their ability to be readily used in many language modeling tasks [2, 6, 9, 23, 43].

In this research, we have explored capsule networks. Capsule networks were originally invented for image analysis [28, 31] and were recently introduced for natural language processing tasks [37]. We use a capsule-network-based approach for social media analysis to extract the sentiments present in the textual content. Capsule-based architectures have produced competitive results [14, 37], especially compared to CNN-based approaches [39].
Capsules within capsule networks encode information about the objects within the data as a vector representation. This enhances the ability of capsules to capture the exact order or pose of the information, including background information, in many natural language processing (NLP) tasks. Moreover, the routing procedures introduced with capsule networks [31] mitigate the information loss seen in the pooling strategies of CNNs. Therefore, we empirically evaluate shallow capsules, capsules with static routing, and capsules with dynamic routing against CNN-based Twitter sentiment classification procedures to set a new benchmark for Twitter sentiment analysis using capsule networks. Furthermore, the capsule networks presented in this study show improved accuracies compared to the baseline models for all datasets used in the experiments. The capsule networks are also lightweight and easy to train. Shallow capsule networks with static routing produce the best performance, given the short sequential nature of Twitter text.

The rest of the paper is structured as follows. Section 2 discusses the related work and the usage of capsule networks in NLP. The capsule-network-based architectures are discussed in detail under the model architectures in Sect. 3, along with the static routing and dynamic routing algorithms. Sections 4 and 5 describe the implementation process and the result analysis, respectively. Section 6 discusses the lessons learned and the novel contribution of the proposed study, and Sect. 7 concludes the paper.

Apart from traditional statistical methods, many tools and solutions for social media content analysis, particularly for Twitter data, incorporate machine learning techniques. Advanced techniques for Twitter data classification utilize n-gram features as local spatial patterns, together with sequential information, to achieve sophisticated language modeling.
Such strategies include CNNs and RNNs such as LSTM [40], which model text classification tasks beyond the boundaries of the meaning of words. Therefore, low-dimensional features such as word embeddings [24, 25] are favorably used to extract complex syntactic and semantic features of text content from social media. Among the related studies, Liao et al. have used a basic form of CNN, with a single hidden layer, for sentiment analysis of Twitter data [19]. Jianqiang et al. have proposed the use of convolutional neural networks for Twitter sentiment analysis with their Global Vectors for word representation (GloVe) Deep Convolutional Neural Network (DCNN) model [11]. This GloVe-DCNN model has shown improvements over Bag of Words (BoW) with Support Vector Machines (SVM), BoW with Logistic Regression (LR), and GloVe with LR models. In addition, prior to Jianqiang et al., Johnson and Zhang had proposed the use of CNNs for high-dimensional text classification in the broader topic of sentiment analysis [12]. Although they achieved significant improvements, their models were complex and expensive to train. The study by Cai has compared the performance of very deep convolutional neural networks (VD-CNN) [5] with Google's pre-trained Bidirectional Encoder Representations from Transformers (BERT) architecture [8]. The results have placed the VD-CNNs below BERT's state-of-the-art performance; however, the VD-CNN architecture is comparatively simpler and cheaper to train than BERT [8, 29] and BERT variants such as RoBERTa (A Robustly Optimized BERT Pretraining Approach) [21], ALBERT (A Lite BERT) [17] and DistilBERT (a distilled version of BERT) [34]. Subsequently, the possibility of using ensemble architectures for Twitter sentiment analysis has been researched with a range of techniques.
Twitter sentiment analysis using an ensemble of several traditional approaches, such as Naive Bayes, Random Forest, SVM and Logistic Regression, has proven more accurate than the same models used individually [33]. Ensembles of traditional machine learning approaches and novel deep learning techniques have also been proposed by Araque et al. [3]. They have tried both an ensemble of several sentiment classifiers trained with different kinds of features and an ensemble of features, where the combination is made at the feature level.

The multimodality of Twitter presents a new dimension to the challenges in sentiment analysis. Twitter enables users to express themselves using images and GIF videos, which are combined with text. While most studies tackle sentiment analysis with one modality, Kumar and Garg have attempted to analyze tweets consisting of both infographic and typographic data [16]. Their study has addressed the multimodal sentiment analysis of tweets. For textual sentiment analysis, they have applied a hybrid approach of lexicon and machine learning techniques. Various other neural networks, such as Skip-grams and denoising autoencoders, have also been tested on multimodal Twitter data for sentiment analysis [4].

Sentiment analysis of Twitter data has been harnessed in several research studies showcasing its applications. For instance, recent research on COVID-19 related tweets and social media content, harvested by filtering #COVID related keywords, produced some intriguing insights into the reactions of the masses to pandemic-related restrictions and government interventions [10, 18]. The study by Imran et al. [10] has shown that countries from the same region are highly correlated with each other, except for Norway and Sweden, mainly due to the different approaches taken by their respective governments. The Stanford CoreNLP [22] tool has also been used in building a system that analyses tweets in real time to predict stock market fluctuations [7].
That system attempts to predict the stock market prices of several reputed companies from the sentiments of the tweets that mention the company names.

Signal or spike detection in Twitter data is another interesting area of research in social media analysis. Spikes in tweets, in the form of hashtags, frequently mentioned keywords, sentiments of tweets and volume of tweets, can be used to infer trends and make useful predictions. As a related study, Nazir et al. [27] have proposed the use of three viable algorithms to detect spikes in tweets. They have assessed the spikes in tweets and shown that integrating a Gaussian algorithm with a threshold algorithm provides better results on real-time data.

In the domain of text classification and language modeling, regardless of the advancements produced by LSTMs and Bi-LSTMs in neural language representation, their intrinsically sequential modeling strategy has led to several limitations. While the vanishing gradient problem hinders the encoding of longer sequences within the learning approach [13], LSTMs and Bi-LSTMs also suffer from a computational bottleneck caused by sequential information processing [41]. CNNs overcome this computational bottleneck by providing parallelization within convolutional filters. Although CNNs have produced better results than LSTMs in text classification [39], they still suffer from information loss due to the pooling strategy when building deep neural language representations [31].

Table 1 summarizes the techniques used by some of the related studies. Most of the studies have used techniques such as recurrent neural networks (RNNs) and deep LSTMs, and different embedding methods such as BoW, GloVe, Embeddings from Language Models (ELMo) and BERT. Among these, several studies have used GloVe word embeddings.
Capsule networks have produced state-of-the-art results with the dynamic routing procedure proposed by Sabour et al. [31]. The intention behind the capsule strategy was to represent the features of objects within the data as a vector representation, in order to identify the exact order or pose of the information. The dynamic routing procedure reduces the information loss that CNNs incur through max-pooling and strengthens the part-to-whole relationship between capsules for deeper capsule representations in classification tasks. Rajasegaran et al. [30] have proposed an optimized strategy to eliminate the high computation cost and vanishing gradient problem of deeper capsules by applying 3D convolution with the capsule strategy. This method reduced the number of parameters by 68% while producing state-of-the-art results in the domain of capsules. Inspired by the capsule network architecture, Wang et al. [36] have applied capsules to sentiment classification in combination with RNNs, which produced the best results at that time. In another study, Yang et al. [37] have conducted an empirical experiment on capsule networks with dynamic routing to validate the use of capsule networks for text classification. Their implementations of different capsule architectures, capsule-A and capsule-B, for binary and multi-class text categorization with a dynamic routing process have produced strong performance in text classification. Another dynamic-routing-based Siamese architecture, with a twin capsule network and a fully connected network, has been proposed by Abeysinghe et al. [1]. They have shown that the use of a capsule-layer-based Siamese network reduces the information loss of CNNs and allows the model to be trained with fewer parameters and smaller datasets, while achieving performance on par with CNNs. Going deeper, Kim et al. [14] have produced an approach based on static routing between capsules, demonstrating the use of capsules for text classification.
This method has addressed the limitations of capsule networks with dynamic routing that arise from the variations of text with background noise, as opposed to the corresponding image classification tasks. We explore the use of capsule networks with static and dynamic routing methods to obtain higher accuracies for sentiment extraction from social media text content, thereby setting a new benchmark for sentiment analysis of Twitter data using deep learning architectures in a low-resource setting.

Table 1
Analyze sentiments using contextual embedding [26]: BiLSTM, GloVe, ELMo, BERT
Twitter sentiment analysis without word embedding [19]: X
A sentiment analysis model with word embeddings and word sentiment polarity score [11]: X, GloVe
Binary classification approach to sentiment analysis of tweets [5]: X, GloVe, BERT
A weighted ensemble model to analyze tweet sentiments [33]: BoW, X
Multimodal sentiment analysis to identify sentiment polarity of tweets with text, image or infographics [16]: X
Analyze the sentiment polarity, reactions among cultures [10]: X, FastText, GloVe
Real-time sentiment analysis of emotional tweets to predict stocTwitter [7]: X
A model to capture tweets semantics and sentiments [38]: X, X

The proposed model in this study uses shallow capsule networks, deep capsule networks and ensemble deep capsule networks on top of CNNs, with the intention of enhancing the classification strategy. The scalar representation of CNN-based language modeling is replaced with the vector representation of capsules to identify the exact order or pose of the information. Going deeper with capsules, an additional routing mechanism is introduced to map the low-level capsules to the high-level capsules. This technique is used to improve upon the pooling strategy in CNNs [31], which results in information loss.
This section describes the baseline CNN structure, the main capsule-based layers on top of the CNN structure, the dynamic routing and static routing strategies between capsules, and the task of Twitter sentiment analysis with the proposed capsule architecture.

For sentiment analysis tasks using CNN-based techniques, the text representation of Twitter data content is fed into the CNN using pre-trained word vectors. Therefore, each word in a tweet is represented as a word vector. Let a tweet consist of $n$ words with $k$-dimensional word vectors. The feature map for a tweet is obtained through the concatenation of its word vectors, as governed by Eq. (1). Here, $x_i \in \mathbb{R}^k$ refers to the word vector of the $i$-th word of the input tweet and $\oplus$ refers to the concatenation operator on word vectors. Therefore, the concatenated word vectors form an $n \times k$ dimensional feature map, which is used as the input to the CNN. The convolution operations extract n-gram features from a context window, where a filter is applied on top of the context window. Let the context window be $x_{i:i+l-1} \in \mathbb{R}^{l \times k}$, where the context window consists of $l$ word vectors concatenated with each other and $i$ is the starting index of the context window. A filter $H \in \mathbb{R}^{l \times k}$ is applied on top of the corresponding context window to extract a feature $f_i \in \mathbb{R}$. This process is governed by Eq. (2), where $\bullet$ represents element-wise matrix multiplication, $b$ denotes a bias term and $g$ represents an activation function (ReLU or tanh) applied to the extracted features. The filter convolves with each possible context window, $CW \in \{x_{1:l}, x_{2:l+1}, \ldots, x_{n-l+1:n}\}$. This extracts the number of features governed by Eq. (3). Here, $d_{in}$ is the input dimension for the convolution operation (concatenated word vectors) and $d_{out}$ is the resulting number of features after the convolution.
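The n-gram feature extraction described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `conv_features` and `output_length`, the toy tweet matrix, and the filter values are all hypothetical, and `output_length` uses the standard convolution output-size formula, which we assume is what Eq. (3) expresses.

```python
import math

def conv_features(words, filt, bias=0.0, activation=math.tanh):
    """Slide one l x k filter over an n x k tweet matrix (stride 1, no padding).

    Returns the feature column [f_1, ..., f_{n-l+1}].
    """
    n, l = len(words), len(filt)
    features = []
    for i in range(n - l + 1):
        # Element-wise product of the context window and the filter,
        # summed to a single scalar feature.
        total = sum(
            words[i + r][c] * filt[r][c]
            for r in range(l)
            for c in range(len(filt[r]))
        )
        features.append(activation(total + bias))
    return features

def output_length(d_in, l, padding=0, stride=1):
    """Standard convolution output size (assumed form of Eq. (3))."""
    return (d_in - l + 2 * padding) // stride + 1

# Hypothetical toy input: n = 4 words, k = 2 dimensions, bigram filter (l = 2).
tweet = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]
relu = lambda z: max(0.0, z)

feats = conv_features(tweet, bigram_filter, activation=relu)
print(feats, output_length(len(tweet), len(bigram_filter)))
```

With zero padding and a stride of one, the feature column has n − l + 1 entries, one per context window.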
The padding is kept as 0 and the stride as 1 for the convolution process in our experiments. This procedure generates a feature column of size $(n - l + 1)$. We can use the max-pooling operation on top of the extracted features to highlight the most significant feature in the extracted feature set, as $f_{max} = \max\{f_i\}$. Consequently, $N$ features can be generated with $N$ filters. For Twitter sentiment analysis tasks, these extracted features can be combined with a fully connected neural network using softmax or sigmoid activated dense layers, based on the requirements of the task.

Vanilla capsule networks, built solely upon convolutions [37], mainly include three varieties of layers based on the task specificity, namely primary capsules, convolutional capsules and text capsules. We have evaluated combinations of different variations of these layers. Moreover, we used both dynamic and static routing between capsules. These routing procedures are used instead of the pooling operations in CNNs, to obtain better performance in feature extraction and computational processing. Compared to the pooling operations in CNNs, such as max-pooling and average pooling, the dynamic routing procedure does not discard the information of a specific region that describes the precise position of an entity within that region [1, 31]. By the intuition behind pooling, the most significant feature and the average feature of a given region represent that region in max-pooling and average pooling, respectively. Thus, pooling does not encode the exact order or pose of the information that explains the precise position of an entity within the data. The dynamic routing algorithm proposes a novel strategy to map low-level capsules to high-level capsules in a hierarchical manner based on a matrix multiplication operation, where the exact pose or order of information within the capsules is preserved.
We represent the objects within the data as the vector representation of capsules, instead of the scalar representation of CNNs, using the following process. Instead of applying pooling operations on the features extracted by the $N$ filters, the generated feature columns are concatenated to obtain a feature map as in Eq. (4). The feature map $M \in \mathbb{R}^{(n-l+1) \times N}$ includes the feature columns extracted by the $N$ filters, and $m_i \in \mathbb{R}^{(n-l+1)}$ represents the feature column extracted by the $i$-th filter. In order to obtain the primary capsules from the features extracted by the CNN, a matrix multiplication operation is carried out. We instantiate a capsule $c_i \in \mathbb{R}^d$ as a $d$-dimensional vector. A matrix filter $W_i \in \mathbb{R}^{N \times d}$ is multiplied with the concatenated feature columns $M$ given in Eq. (4). This procedure results in a column list of capsules $c \in \mathbb{R}^{(n-l+1) \times d}$, computed as given in Eq. (5), where $b_1$ represents the bias term and $f$ represents the squash function. Moreover, with $p$ matrix filters, a map of capsules $C \in \mathbb{R}^{(n-l+1) \times p \times d}$ is generated, with $(n - l + 1) \times p$ capsules. The squash function, stated in Eq. (11), scales each capsule's length to a value between 0 and 1. Therefore, the length of a capsule can be considered as the probability of the existence of an entity within the capsule, such as syntactic and semantic information of the text or the sentiment category of the given data.

In the convolutional capsule layer, the capsules are mapped to a local region of the layer below, which helps the capsules identify local spatial patterns effectively. We assume that a local region of size $(m \times p)$ in the layer below (the primary capsule layer) is mapped to the convolutional capsule layer. Therefore, capsules in that region compute matrix multiplication operations to learn child-to-parent relations between low-level capsules and high-level capsules.
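The primary capsule computation of Eq. (5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the squash function is written in the form introduced by Sabour et al. [31], which we assume matches Eq. (11), and the names `squash`, `primary_capsules`, and the toy matrices are hypothetical.

```python
import math

def squash(s):
    """Scale vector s so its length lies in [0, 1) while keeping its direction.

    Assumed form (Sabour et al.): v = (|s|^2 / (1 + |s|^2)) * s / |s|.
    """
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)  # avoid division by zero for an empty capsule
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]

def primary_capsules(M, W, b):
    """Map each row of the (n-l+1) x N feature map M to a d-dimensional capsule.

    W is an N x d matrix filter and b a length-d bias (one of the p filters).
    """
    capsules = []
    for row in M:
        s = [
            sum(row[t] * W[t][j] for t in range(len(row))) + b[j]
            for j in range(len(b))
        ]
        capsules.append(squash(s))
    return capsules

def length(v):
    return math.sqrt(sum(x * x for x in v))

# Hypothetical toy values: 2 positions, N = 2 filters, d = 2 capsule dimensions.
M = [[3.0, 4.0], [0.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]

caps = primary_capsules(M, W, b)
print([round(length(c), 4) for c in caps])
```

The first toy capsule has a pre-squash length of 5 and is scaled to a length of 25/26, while the all-zero row maps to a capsule of length 0, which is how capsule length can be read as an existence probability.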
For the matrix multiplication operation, a weight matrix $W^c \in \mathbb{R}^{E \times d \times d}$ is used, where $E$ denotes the number of capsules in a convolutional capsule layer. Given a child capsule, a parent capsule is generated according to Eq. (6). Here, $\hat{u}_{j|i}$ is the generated convolutional capsule, $u_i$ is the local region $(m \times p)$ for a given child capsule in the lower layer, $W^c_j$ is the $j$-th matrix in the matrix tensor $W^c$, and $b_{j|i}$ is the bias term used when generating the convolutional capsule $\hat{u}_{j|i}$ for a given lower-layer capsule $u_i$. Consequently, $(n - l - m + 2) \times E$ convolutional capsules of dimension $d$ are generated using this procedure.

The sentiment capsule layer is designed as the final layer of the capsule architectures. This layer consists of one capsule for each target sentiment category of the classification task. The capsules in this layer are generated by matrix multiplication to learn child-to-parent relationships. To obtain the sentiment capsules from the layer below, all capsules in that layer are flattened into a list of capsules and multiplied by the transformation matrix $W^d \in \mathbb{R}^{U \times d \times d}$ as in Eq. (6), where $U$ denotes the number of sentiment capsules for the corresponding task and $d$ is the instantiated parameter for the dimension of capsules. The length, or norm, of the vector representation of each capsule in the sentiment capsule layer denotes the probability of the existence of the target sentiment category. Thus, these probabilities are used to extract the sentiment of a given sequence of text.

Routing-by-agreement algorithms were initially designed as a strategy to learn the child-to-parent relationship between capsules incrementally, mitigating the issues of the pooling strategies used in CNNs to map low-level features to high-level features in deep CNNs [31]. Also, Kim et al.
[14] have suggested that static routing procedures are better at handling the variability of background information in text than the dynamic routing procedures proposed by Yang et al. [37]. In this study, we empirically evaluated these routing algorithms for capsule networks for Twitter sentiment analysis.

The main purpose of the dynamic routing algorithm is to iteratively establish a non-linear map from child capsules to parent capsules, sending each child capsule to its most relevant parent capsule and ensuring that the child-to-parent relationship is correctly established. Through this process, each child capsule incrementally learns its potential parent by varying the child-to-parent connection strength. This procedure alleviates the issues caused by the pooling strategy used in CNNs. Generally, pooling strategies result in information loss because they neglect the features surrounding the most significant features [1]. Dynamic routing further enhances the vector representation of capsules by considering essential background information, especially for text-based classification tasks [31].

Algorithm 1 describes the dynamic routing between two capsule layers. First, we initialize the log prior probabilities $b_{ij}$ between each capsule $i$ in the layer below and each capsule $j$ in the layer above, as stated in Eq. (7), which corresponds to line 3 of Algorithm 1. These log prior probabilities $b_{ij}$ represent the connection strength between a pair of child and parent capsules. Secondly, the log prior probabilities are learnt incrementally within the iterative learning procedure, as shown in line 4 of Algorithm 1. The connection strength of a child capsule to all parent capsules in the layer above is calculated with the softmax function, to indicate the probability of sending the information represented in the child capsule to each of the parent capsules, as shown in line 6 of Algorithm 1. This process is governed by Eq. (8).
Here, $c_{ij}$ represents the coupling coefficient between capsule $i$ in the layer below and capsule $j$ in the layer above, and $\exp$ denotes the exponential function. The softmax-based strategy calculates all coupling coefficients between a capsule in the layer below and every capsule in the layer above for routing purposes. Moreover, the routing procedure computes the capsules in the layer above using the coupling coefficients and the predicted capsules, which are retrieved during the matrix transformation process described in Sects. 3.2.2 and 3.2.3. This process is represented by line 8 of Algorithm 1 and governed by Eq. (9). Here, $s_j$ denotes the computed capsules for the layer above, and the connection strength $c_{ij}$ is the coupling coefficient between capsule $i$ in the layer below and capsule $j$ in the layer above. As mentioned in Eq. (6), $\hat{u}_{j|i}$ is the generated convolutional capsule. Also, as shown in line 10 of Algorithm 1 and Eq. (10), the log prior probabilities $b_{ij}$ between capsule $i$ in the layer below and capsule $j$ in the layer above are updated iteratively by considering the similarity between the predicted capsule $\hat{u}_{j|i}$ and the computed capsule $s_j$ within the routing procedure. In our proposed model, the squash non-linearity is applied to each computed capsule $s_j$ after the iterative updating process, which prevents the degradation of the instantiated parameters of capsules within the iterative process [31]. The squash function is applied to each computed capsule $s_j$ as in Eq. (11), which corresponds to line 12 of Algorithm 1. Here, $||s_j||$ denotes the standard norm of capsule $s_j$. The length of the vector $v_j$ represents the probability of the existence of objects within a capsule. Therefore, the final layers of the capsule architectures are designed to represent the probability that a tweet category exists through the length of the capsules.
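The routing steps described above can be summarized in a short pure-Python sketch. It follows the order given in the text (softmax coupling, weighted sum, agreement update against $s_j$, and a squash applied after the iterations), with several assumptions that are not confirmed by the source: the log priors are initialized to zero, the similarity is a dot product, three iterations are used, and the squash has the form of Sabour et al. [31]. The function names and toy prediction vectors are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def squash(s):
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]

def length(v):
    return math.sqrt(sum(x * x for x in v))

def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the capsule predicted by child i for parent j."""
    n_child, n_parent, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    # Log priors b_ij, assumed to start at zero.
    b = [[0.0] * n_parent for _ in range(n_child)]
    c, s = [], []
    for _ in range(iterations):
        # Coupling coefficients: softmax over the parents of each child.
        c = [softmax(row) for row in b]
        # Parent capsules: coupling-weighted sum of the predictions.
        s = [
            [sum(c[i][j] * u_hat[i][j][t] for i in range(n_child))
             for t in range(dim)]
            for j in range(n_parent)
        ]
        # Agreement update: dot product of prediction and computed capsule.
        for i in range(n_child):
            for j in range(n_parent):
                b[i][j] += sum(u_hat[i][j][t] * s[j][t] for t in range(dim))
    # Squash once, after the iterative updates, as the text describes.
    return [squash(sj) for sj in s], c

# Toy case: both children agree on parent 0 and disagree on parent 1.
u_hat = [
    [[1.0, 0.0], [0.0, 0.1]],
    [[1.0, 0.0], [0.0, -0.1]],
]
parents, couplings = dynamic_routing(u_hat)
print([round(length(v), 3) for v in parents], couplings[0])
```

In this toy case the children's predictions for parent 0 reinforce each other, so its coupling coefficients grow over the iterations, while the conflicting predictions for parent 1 cancel out. Note that the original formulation of Sabour et al. squashes inside the loop and measures agreement against the squashed output; the variant here mirrors the description in this section instead.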
Text-based classification tasks have higher variability of background information than image processing tasks [14]. As suggested by Kim et al. [14], text-based tasks are handled with a static routing process that eliminates the different variations of child-to-parent routing based on spatial patterns, without considering the whole context of the text. Thus, the capsules in the layer below are only mapped to their parent capsules in the layer above, using a matrix transformation governed by Eqs. (12) and (13). Here, $W_{ij} \in \mathbb{R}^{M \times N}$ is the transformation matrix that transforms the capsules $i$ in the layer below to the capsules $j$ in the layer above. $M$ is the dimension of the capsules to be generated in the layer above and $N$ is the number of capsules in $h_i$, which denotes the capsules in the layer below. Then, the squash function shown in Eq. (11) is applied to obtain the vectors $v_j$, whose length is the probability of the existence of an entity within a capsule.

We classify Twitter data using a separate margin loss function to identify the presence of a given category in each sentiment capsule. Here, we utilize the length of the capsule to represent the probability of the existence of a given sentiment category within a sentiment capsule. Equation (14) is used to derive the margin loss for the sentiment capsules [31]. If the tweet category exists within the text capsule, then $T_s = 1$; otherwise it is set to 0. The values $m^+$ and $m^-$ are set to 0.9 and 0.1, respectively. After several experiments, we have set the down-weighting coefficient $\lambda$ to 0.25, which gives the best performance. This down-weighting coefficient reduces the initial learning of sentiment capsules for tweet sentiment categories that are not present within those sentiment capsules. The total loss is simply taken as the sum of the losses for all sentiment capsules.

The proposed solution uses two types of shallow capsule networks as illustrated in Fig. 1.
These capsule networks include two capsule layers, namely primary capsules and sentiment capsules, following the word embedding layer and the convolutional layer. The convolutional layer is employed specifically to extract n-gram information from the text. Primary capsules are generated from the feature maps obtained through the CNN layer. The number of capsules in the final capsule layer, the sentiment capsule layer, is equal to the number of target sentiment categories. Thus, the sentiment capsules represent the sentiment features of the text that are utilized to classify the text into sentiment classes. As the routing procedure between capsules, both dynamic routing and static routing have been experimented with. From a neural network perspective, the word embedding layer consists of $n$ vectors of dimension $k$, so the input feature map has $n \times k$ dimensionality. For shallow capsule networks, this feature map is fed to a CNN layer.

The deep capsule network architecture combines all three capsule layers, namely the primary capsule layer, the convolutional capsule layer, and the sentiment capsule layer, following the word embedding layer and the convolutional layer, as shown in Fig. 2. The convolutional layer is employed to extract n-gram information from the text, as in the shallow capsule networks. Primary capsules are generated from the feature maps produced by the CNN layer. The convolutional capsules are generated using the dynamic routing procedure in Sect. 3.2.5. The significance of convolutional capsules lies in their ability to relate local features within the text, since the local regions of primary capsules are mapped to the convolutional capsules as indicated in Sect. 3.2.2. Moreover, as in the shallow capsule networks, the number of capsules in the final capsule layer, the sentiment capsule layer, is equal to the number of target sentiment categories.
These sentiment capsules are generated from the convolutional capsules using the dynamic routing procedure. From an architectural perspective, deep capsule networks have only one additional capsule layer, namely the convolutional capsule layer. The word embedding layer, which is the initial layer of the network, consists of a feature map with $n \times k$ dimensionality. This feature map is fed to a CNN layer where $N$ filters of size $l \times k$ are utilized to extract n-gram features from the text, as in the shallow capsule networks. The resultant $N \times (n - l + 1)$ feature map is utilized to generate primary capsules as indicated in Sect. 3.2.1. Ultimately, a map of capsules $C \in \mathbb{R}^{(n-l+1) \times p \times d}$ is generated as the primary capsules. With the dynamic routing procedure described in Sect. 3.2.5, these capsules generate convolutional capsules with $(n - l - m + 2) \times E \times d$ dimensionality. As the final capsule layer, sentiment capsules are generated for sentiment classification. The dynamic routing procedure is applied to the convolutional capsules to construct the sentiment capsules. These capsules are $u \times d$ in dimensionality, where $u$ is the number of target sentiment categories and $d$ represents the dimensionality of the sentiment capsules.

Generally, ensemble capsule networks have produced prominent performance in text classification tasks [37]. Therefore, we evaluated an ensemble capsule network for Twitter sentiment classification with the dynamic routing algorithm. As illustrated in Fig. 3, the ensemble capsule network consists of three layers, namely the primary capsule layer, the convolutional capsule layer, and the sentiment capsule layer. Three separate deep capsule networks consisting of these layers were utilized to extract different variations of n-gram features from Twitter data. In the final sentiment capsule layer, the capsules generated by the three capsule networks were average pooled for classification.
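The final classification step of the ensemble, together with the margin loss of Eq. (14), can be sketched in pure Python. The margin loss is written in the form given by Sabour et al. [31], which the text cites for Eq. (14), using the reported values of 0.9, 0.1 and 0.25. Averaging the capsules element-wise across the three networks is our assumption about how the average pooling is applied, and all function names and toy capsule values are hypothetical.

```python
import math

def length(v):
    return math.sqrt(sum(x * x for x in v))

def average_capsules(capsule_sets):
    """Element-wise average of sentiment capsules over several networks.

    capsule_sets[net][s] is the capsule for sentiment class s from one network.
    """
    n_nets = len(capsule_sets)
    n_classes = len(capsule_sets[0])
    dim = len(capsule_sets[0][0])
    return [
        [sum(net[s][t] for net in capsule_sets) / n_nets for t in range(dim)]
        for s in range(n_classes)
    ]

def margin_loss(capsules, target, m_plus=0.9, m_minus=0.1, lam=0.25):
    """Sum of per-capsule margin losses; T_s is 1 only for the target class."""
    total = 0.0
    for s, v in enumerate(capsules):
        t_s = 1.0 if s == target else 0.0
        norm = length(v)
        total += (t_s * max(0.0, m_plus - norm) ** 2
                  + lam * (1.0 - t_s) * max(0.0, norm - m_minus) ** 2)
    return total

def predict(capsules):
    """Pick the sentiment class whose capsule is longest."""
    return max(range(len(capsules)), key=lambda s: length(capsules[s]))

# Hypothetical outputs of three networks for a two-class task (d = 2).
net_a = [[0.9, 0.0], [0.1, 0.0]]
net_b = [[0.6, 0.0], [0.3, 0.0]]
net_c = [[0.9, 0.0], [0.2, 0.0]]

pooled = average_capsules([net_a, net_b, net_c])
label = predict(pooled)
loss = margin_loss(pooled, target=0)
print(label, round(loss, 4))
```

The loss penalizes the target capsule only when its length falls below 0.9 and penalizes the other capsules, down-weighted by 0.25, only when their length exceeds 0.1.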
We used two widely used and publicly available Twitter datasets:
• CrowdFlower US Airline dataset: released by CrowdFlower, this dataset has a total of 14,640 tweets related to six major US airlines: American, United, US Airways, Southwest, Delta, and Virgin. Each tweet is tagged as positive, negative, or neutral.
• The Stanford Twitter Sentiment Gold (STSGd) dataset: created by Saif et al. [32], this dataset contains 2034 tweets that were manually annotated as negative or positive on the agreement of three annotators.
The summary statistics of each dataset are shown in Table 2. For the experiment with the CrowdFlower US Airline dataset, the vocabulary |V| consisted of 26841 tokens and the maximum tweet length was 36. For the STSGd dataset, the vocabulary size |V| was set to 8470 and the maximum tweet length was set to 31.

In the deep learning era, models trained using pre-trained word embeddings have reported state-of-the-art performance even without using linguistic resources. This is because pre-trained word embeddings can capture syntactic and semantic information about a given token in a language-independent manner. We empirically evaluate two embeddings with respect to their task specificity for Twitter sentiment analysis. The proposed Twitter sentiment analysis task uses the following two word-embedding models:
• GloVe 300-dimensional word vectors trained on the Common Crawl corpus, with 840 billion tokens and a vocabulary of 2.2 million tokens.
• GloVe Twitter 200-dimensional word vectors pre-trained on 2 billion tweets, with 27 billion tokens and a Twitter-specific vocabulary of 1.2 million tokens.

The baseline model for the STSGd dataset is derived from the study presented by Jianqiang et al. [11]. They used a radial basis function (RBF) kernel SVM and an LR model with bag-of-words (BoW) unigram and bigram features.
In addition, they used the same models with additional word sentiment polarity features, Twitter-specific features, and word vector features with GloVe. The DCNN using GloVe word embeddings is considered as a basis for the proposed capsule network in our study. Further, we used 10-fold cross-validation as the evaluation protocol for our approach on the CrowdFlower US Airline dataset.

Generally, Twitter content is noisy due to non-dictionary terms, ill-formed language structure, and grammatical mistakes. Therefore, the following procedures are used to reduce the noise within the data:
• Removed the special characters within the tweets that do not carry any specific information about the tweet category.
• Removed the URLs and links within the tweets, as they do not carry any sentiment-specific information.

Generally, the activities of the neurons within capsules in a capsule network represent an entity within the data in its exact order or pose, along with certain other properties, using a vector representation. As the final layer of our deep learning architecture, we use sentiment capsules to represent the sentiment categories in a sentiment analysis task. Since the number of sentiment capsules corresponds to the number of sentiment classes in each dataset, we use three sentiment capsules for the CrowdFlower US Airline dataset and two for the Stanford Twitter Sentiment Gold dataset. Furthermore, the length of a sentiment capsule, that is, the norm of its vector representation, represents the probability that the corresponding sentiment category is present. Thus, these probabilities are used to extract the sentiment of a given sequence of text.

We evaluated the Twitter-based sentiment classification model for each dataset by varying the components of the models as follows.
This process makes it possible to measure model performance empirically and to show the effectiveness of capsule-based architectures for Twitter analysis.
• Four main model architectures are used: shallow capsule network with static routing, shallow capsule network with dynamic routing, deep capsule network with dynamic routing, and ensemble capsule network with dynamic routing.
• Each model is fed with both the Twitter-specific 200-dimensional word embeddings and the 300-dimensional Common Crawl GloVe word embeddings.

We used the Adam optimizer with exponential learning rate decay. The models were trained on Google Colab with TensorFlow as the implementation platform. The optimal hyperparameters for the models on the STSGd dataset are indicated in Table 3. For each model, the learning rate was set to 1e-3 and the learning rate decay was set to 0.95. The maximum tweet length is the sequence length fed to the models as input, and it is set according to the dataset. The evaluation is based on 10-fold cross-validation. As given in Table 3, each capsule architecture with dynamic routing uses three routing iterations to strengthen the child-to-parent relationship between capsules. The number of convolutional filters in the initial layer of each model is indicated in the number of filters column. The ensemble capsule network uses three filter sizes in the initial convolutional layers to form the ensemble architecture, as shown in the filter sizes column of Table 3. All other models use a filter size of three to extract n-gram features in the convolutional layer. Additionally, the dimension of capsules is indicated for each layer. Shallow capsule networks have two layers of capsules, while deep capsule networks have three.
Here, |C| indicates the number of capsules in each layer of the network.

We used accuracy, precision, recall, and F1 score as the evaluation metrics for the STSGd dataset. Since the CrowdFlower US Airline dataset involves multi-class classification, weighted evaluation metrics are used for its evaluation. The 10-fold cross-validation is used for each experiment. The classification results for the STSGd dataset and the CrowdFlower US Airline dataset are given in Tables 4 and 5, respectively. For the STSGd dataset, all the trials with capsule networks outperformed the existing baseline techniques. This could be explained by the ability of capsule networks to handle syntactic and semantic information effectively through the vector representation of the capsules. Capsule-based strategies further demonstrate the capability of handling background information in text, which supports the strong results in text-based sentiment analysis tasks. Since tweets are short texts, shallow capsule networks reported better performance than deep capsule-based architectures. Shallow capsule networks with static and dynamic routing produced competitive results, with the static routing based technique performing slightly better. As stated by Kim et al. [14], this could be attributed to the ability of static routing to identify the variability of background information in text. Unlike image classification, where the exact arrangement of objects within an image matters, text-based classification tasks do not depend as strongly on the exact order of words. Thus, static routing can be advantageous for the child-to-parent links between layers of the capsule network. Moreover, the experiments were conducted with two varieties of input embeddings obtained through the Stanford GloVe project.
The 300-dimensional GloVe embeddings trained on the large Common Crawl corpus reported better performance than the 200-dimensional GloVe embeddings trained on Twitter-specific data. This observation could be explained by the fact that the generic GloVe embeddings have learned a deeper semantic structure than the Twitter-specific GloVe embeddings, which carry information only for a specific domain. In particular, the deep capsule architectures perform slightly worse than the shallow capsule networks. This observation could be expected because tweets are short text sequences and hence contain less information. Since the number of learnable parameters in deep capsule architectures is much higher than in shallow capsule networks, short sequences of text hinder proper language modeling with deep capsule architectures more than with shallow ones.

To further validate our proposed architecture, the models were evaluated on the CrowdFlower US Airline dataset. As shown by the results in Table 5, the shallow capsule network with static routing yields the best performance. Therefore, shallow capsule networks could be effectively employed to capture Twitter-specific syntactic and semantic relations in Twitter-based sentiment analysis tasks. Since the existing approaches for the CrowdFlower US Airline dataset do not validate their performance with 10-fold cross-validation, they are not reported under this experiment. Shallow capsule networks could also serve as a lightweight alternative to BERT-like models, which are more demanding in both linguistic resources and computational power. Therefore, capsule-network-based models could be used as a replacement for BERT-like models, with competitive results for Twitter-based content analysis.
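The evaluation protocol described above (10-fold cross-validation with weighted precision, recall, and F1 for the three-class CrowdFlower data) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the text does not say how the metrics are weighted or how the folds are drawn, so the sketch assumes support-weighted averaging (each class weighted by its number of true instances) and shuffled, non-stratified folds. The function names and the toy labels are hypothetical.

```python
import numpy as np


def weighted_prf(y_true, y_pred, labels):
    """Support-weighted precision, recall and F1 over the given class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precisions, recalls, f1s, supports = [], [], [], []
    for c in labels:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
        supports.append(tp + fn)  # number of true instances of class c
    weights = np.asarray(supports, dtype=float)
    weights = weights / weights.sum()
    return (
        float(np.dot(weights, precisions)),
        float(np.dot(weights, recalls)),
        float(np.dot(weights, f1s)),
    )


def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


# Toy three-class example (0 = negative, 1 = neutral, 2 = positive; made up).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
precision, recall, f1 = weighted_prf(y_true, y_pred, labels=[0, 1, 2])
print(f"weighted precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")

# Fold sizes for a dataset with as many tweets as STSGd (2034).
splits = list(kfold_indices(2034, k=10))
print([len(test_idx) for _, test_idx in splits])
```

Under support weighting, the weighted recall coincides with plain accuracy, which is one reason the weighted F1 score is usually reported alongside it for imbalanced classes.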
Moreover, a separate experiment was carried out to evaluate model performance with respect to the number of training epochs for the STSGd dataset. The dataset was divided into train, validation, and test sets in an 8:1:1 ratio. The performance was evaluated using the accuracy metric, and the results are illustrated in Fig. 4. Here, four shallow capsule networks were tested, using the Twitter-specific GloVe embeddings and the GloVe embeddings trained on the Common Crawl corpus. The shallow capsule networks trained on the Common Crawl GloVe embeddings consistently outperformed the shallow capsule networks trained on the Twitter-specific GloVe embeddings. Interestingly, these shallow capsule networks reach their best accuracy within three or four iterations, indicating the effectiveness of the model architecture for low resource consumption in neural language modeling tasks.

Furthermore, to support the view of the capsule network as a lightweight model compared to existing highly resource-intensive BERT-like language models, the numbers of total parameters and trainable parameters for each model are shown in Tables 6 and 7, respectively. Compared to the BERT-Base model with 110M parameters and the BERT-Large model with 340M parameters [8], the largest model in this experiment had only 38M parameters in total, of which only 14M were trainable. Therefore, capsule networks could be effectively used for tasks with limited language resources, performing competitively with BERT-like models under less demanding computational setups. Furthermore, tests conducted with BERT pre-training for low-resource languages such as Romanian [9, 23], Arabic [2, 43], and Filipino [6] have shown sub-optimal results compared to BERT-like models pre-trained on the large English corpus.
Thus, the ideas behind capsule networks could increase their use in low-resource language domains, where the language resources are not sufficient for pre-training BERT-like models.

The research findings indicate that capsule networks can be effectively used for text classification tasks without using any linguistic resources. This would enable research on text processing and classification under a low-resource setup without compromising the accuracy or effectiveness of the task. Therefore, the capsule networks provide results equal to or better than the current state of the art on the evaluated datasets. The outcome of this research, based on capsule networks under limited resources with fewer parameters and less computational power, still demonstrates sufficiently competitive results against highly resource-intensive BERT-like models [8]. The increased training and inference costs associated with transformer models can limit the applicability of BERT-like models for a given text analysis task. Although BERT-like models provide more context, their benefits are compromised in situations where these models are difficult to apply. According to the obtained results, our proposed capsule-network-based approach is reasonably accurate and contextually rich at comparable levels, while requiring fewer resources due to its lightweight architecture with fewer parameters. This is useful for processing tasks that require model re-training with short lead times to release new inference, such as edge or real-time sentiment extraction. The pre-trained GloVe word vectors based on a Twitter-specific corpus, which contains 27 billion tokens and a vocabulary of 1.2 million tokens, were tested with our proposed architectures.
Although Twitter-specific word vectors could capture syntactic and semantic relationships in tweets by considering the context of the domain, the models trained with the more generic GloVe word vectors from the Common Crawl corpus outperformed the models with Twitter-specific pre-trained word embeddings. This could be attributed to the deeper feature extraction of the generic GloVe word vectors, which were trained on the larger Common Crawl corpus of 840 billion tokens with a vocabulary of 2.2 million tokens. Therefore, the GloVe word vectors trained on the Common Crawl corpus could be effectively used to identify sentiment categories within tweets. The variability of the tokens within Twitter data could be effectively managed with the generic GloVe embeddings, since the Common Crawl corpus covers most of the domains that appear in Twitter text. Moreover, numerous model variations are possible, such as shallow capsule networks with static or dynamic routing. Also, the use of deep capsule networks with dynamic routing and ensemble capsule networks could be recommended for better accuracy in Twitter data processing.

Shallow capsule networks with static routing produced promising results for the datasets used in this research. The effectiveness of shallow capsule networks could be attributed to their ability to capture the syntactic and semantic relationships of tweets as short sequences of text. The static routing algorithm strengthens child-to-parent relationships in a way that suits text-based classification tasks. It handles the background information of text effectively, preventing the drawbacks caused by background noise in the text. Table 8 shows a comparison of the proposed solution with the existing studies in terms of the datasets used, the techniques, and the obtained accuracies.
The existing studies are based on several Twitter datasets such as Obama-McCain Debate (OMD), Sentiment Strength Twitter Dataset (SS-Tweet), Stanford Twitter Sentiment Test (STSTd), SemEval2014 Task9 (SE2014), Stanford Twitter Sentiment Gold (STSGd), Sentiment Evaluation (SED), Sentiment Strength Twitter (SSTd), and the STS-Gold Twitter dataset. The proposed approach has shown the highest accuracy of 86.87% for the STSGd dataset using 300-dimensional Common Crawl GloVe word embeddings and a shallow capsule network with dynamic routing. For the CrowdFlower US Airline dataset, the highest accuracy of 82.04% was obtained using 300-dimensional Common Crawl GloVe word embeddings and a shallow capsule network with static routing.

The novel contribution and usefulness of the proposed approach compared to the existing studies based on capsule networks can be highlighted as follows. The capsule-based architectures could be effectively used as a replacement for CNN-based deep learning architectures because they represent features as vectors instead of the scalar feature representation of CNNs. The vector representation of features in capsules effectively handles the background information of the text. Moreover, highly resource-intensive models like BERT could be replaced with capsule-based techniques, since capsule architectures could produce competitive results in domains where the resources required for BERT models are scarce, as suggested in Sect. 5. It is challenging to process short sequences of social media text with varying context and background information. Static routing could be more effective than dynamic routing for short sequences when handling the variability of background information. Pre-processing of tweets can be applied to improve the model's performance by removing the noise of special characters and web URLs, which do not carry any sentiment information within a tweet.
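The two pre-processing steps used in this study (removing URLs and removing special characters) can be sketched with regular expressions. This is an illustrative sketch only: the text does not define which characters count as special, so the patterns below, which keep only ASCII letters, digits, and whitespace, are an assumption, and the names `URL_RE`, `SPECIAL_RE`, and `clean_tweet` are hypothetical.

```python
import re

# Assumed patterns: the paper does not specify them.
URL_RE = re.compile(r"(https?://\S+|www\.\S+)")   # web links
SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s]")        # anything but letters, digits, whitespace


def clean_tweet(text):
    """Remove URLs first, then special characters, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    return " ".join(text.split())


example = "@united thanks!! http://t.co/abc123 #bestflight"
cleaned = clean_tweet(example)
print(cleaned)
```

URLs are stripped before special characters so that a link is removed as a whole rather than being broken into leftover fragments. Under this assumed pattern, the symbols of mentions and hashtags are dropped while their words are kept.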
A possible future research extension would be to explore the use of attention-based capsule networks with dynamic routing for relation extraction as part of sentiment analysis and text content processing with social media data. Moreover, contextual embeddings could be integrated with capsule-based techniques, since most deep learning techniques have reported promising performance with this strategy.

This research explored the use of capsule networks in social media text content analysis with natural language processing. The proposed strategy aimed at sentiment analysis of Twitter-based data using a variety of capsule networks. Twitter-specific and generic GloVe embeddings were used in shallow and deep capsule networks together with static and dynamic routing for sentiment analysis of tweets. A notable achievement of this research is the higher level of accuracy over the existing sentiment analysis methods used for social media content, thereby setting a new benchmark for Twitter data analysis with capsule networks. The classification results support the use of shallow capsule networks with static routing for optimal performance. Moreover, the approach produced state-of-the-art results considering the relatively short sequences of text in tweets. For the CrowdFlower US Airline dataset, the shallow capsule network with static routing produced the best accuracy of 82.04%, while the highest accuracy of 86.87% for the Stanford Twitter Sentiment Gold dataset was reported by the shallow capsule network with dynamic routing. Furthermore, considering the lightweight nature of capsule networks, they are useful for low-resource languages where BERT-like models could not be used due to a lack of language resources for pre-training. Thus, we have demonstrated a novel methodology to analyze social media text content in resource-constrained setups such as edge processing, where the capsule networks of the analysis model can be deployed.
This will revolutionize social media content analysis as the proposed capsule network-based distributed processing architecture can easily rely upon portable devices and nodes, which can open the pathway to real-time sentiment analysis at the edge of the processing channel. This study concludes that the introduction of capsule networks into state-of-the-art text processing and natural language processing methods has shown impressive performance and potential in the research area of Twitter sentiment analysis.

Funding: The research is not funded.

References
1. Capsule networks for character recognition in low resource languages
2. Sarcasm and sentiment detection in Arabic tweets using BERT-based models and data augmentation
3. Enhancing deep learning sentiment analysis with ensemble techniques in social applications
4. A multimodal feature learning approach for sentiment analysis of social network multimedia
5. Sentiment analysis of tweets using deep neural architectures
6. Establishing baselines for text classification in low-resource languages
7. Real-time sentiment analysis of Twitter streaming data for stock prediction
8. BERT: Pre-training of deep bidirectional transformers for language understanding
9. The birth of Romanian BERT
10. Cross-cultural polarity and emotion detection using sentiment analysis and deep learning on COVID-19 related tweets
11. Deep convolution neural networks for Twitter sentiment analysis
12. Effective use of word order for text categorization with convolutional neural networks
13. RNNs incrementally evolving on an equilibrium manifold: A panacea for vanishing and exploding gradients
14. Text classification using capsules
15. Six challenges for neural machine translation
16. Sentiment analysis of multimodal Twitter data
17. ALBERT: A lite BERT for self-supervised learning of language representations
18. Exploratory analysis of a social media network in Sri Lanka during the COVID-19 virus outbreak
19. CNN for situations understanding based on sentiment analysis of Twitter data
20. Sentiment analysis and subjectivity
21. RoBERTa: A robustly optimized BERT pretraining approach
22. The Stanford CoreNLP natural language processing toolkit
23. Proceedings of the 28th International Conference on Computational Linguistics
24. Distributed representations of words and phrases and their compositionality
25. Advances in pre-training distributed word representations
26. Transformer based deep intelligent contextual embedding for Twitter sentiment analysis
27. Social media signal detection using tweets volume, hashtag, and sentiment analysis
28. Capsule networks-a survey
29. How multilingual is multilingual BERT
30. Deepcaps: Going deeper with capsule networks
31. Dynamic routing between capsules
32. Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold
33. An ensemble classification system for Twitter sentiment analysis
34. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
35. Combination of convolutional and recurrent neural network for sentiment analysis of short texts
36. Sentiment analysis by capsules
37. Investigating capsule networks with dynamic routing for text classification
38. A quantum-inspired sentiment representation model for Twitter sentiment analysis
39. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification
40. Deep learning for sentiment analysis: A survey
41. Sentence-state LSTM for text representation
42. Unsupervised sentiment analysis of Twitter posts using density matrix representation
43. BERT-based Arabic social media author profiling