key: cord-0853000-dxfw1nvc authors: Balakrishnan, Vimala; Shi, Zhongliang; Law, Chuan Liang; Lim, Regine; Teh, Lee Leng; Fan, Yue title: A deep learning approach in predicting products’ sentiment ratings: a comparative analysis date: 2021-11-05 journal: J Supercomput DOI: 10.1007/s11227-021-04169-6 sha: 38e9dc9310197ef2e1656e4b530ae1b8affd6fac doc_id: 853000 cord_uid: dxfw1nvc We present a benchmark comparison of several deep learning models including Convolutional Neural Networks, Recurrent Neural Network and Bi-directional Long Short Term Memory, assessed based on various word embedding approaches, including the Bi-directional Encoder Representations from Transformers (BERT) and its variants, FastText and Word2Vec. Data augmentation was administered using the Easy Data Augmentation approach resulting in two datasets (original versus augmented). All the models were assessed in two setups, namely 5-class versus 3-class (i.e., compressed version). Findings show the best prediction models were Neural Network-based using Word2Vec, with CNN-RNN-Bi-LSTM producing the highest accuracy (96%) and F-score (91.1%). Individually, RNN was the best model with an accuracy of 87.5% and F-score of 83.5%, while RoBERTa had the best F-score of 73.1%. The study shows that deep learning is better for analyzing the sentiments within the text compared to supervised machine learning and provides a direction for future work and research. Online shopping has grown tremendously, significantly more during the on-going COVID-19 pandemic, which resulted in many countries enforcing stay-at-home orders among their citizens. With the closure of most retail shops and fear of COVID-19 infections, online shopping has become the main means for customers to satisfy their consumption needs. It is common for online retailers to solicit customer reviews on products and services through textual reviews and/or ratings [1, 2] . 
These online reviews play a great role in influencing customers' purchasing decisions while providing sellers with more insights. As online platforms, including social media, contain voluminous data, sentiment analysis provides an easy and fast mechanism to categorize the reviews, hence providing useful insights to both customers and sellers on the feedback of the products and services [3, 4]. Sentiment analysis generally elicits a sentiment orientation (i.e., positive, neutral, negative) of textual information, which can improve decision-making processes in a multitude of domains, including finance and the stock market [5] [6] [7], digital payment services [4], retail [2, 8], and products [1, 3, 9], among others. Scholars investigating sentiment analysis based on textual communications have also examined or attempted to determine sentiment ratings, often using scales ranging from 1 to 5 or 10 (i.e., higher scores indicate more positive reviews) [10]. Though sentiment analysis is often performed using machine learning approaches, deep learning has gained momentum in recent years, showing promising results [6, 10]. Further, scholars have also explored various word embedding techniques, from the popular Word2Vec and its variants to the more advanced, state-of-the-art transformer-based pre-trained models such as Bi-directional Encoder Representations from Transformers (BERT) [10] [11] [12] [13], which have displayed much better results in text classification. Nevertheless, as shown later in Sect. 2.2, studies exploring deep learning algorithms, particularly those exploring and comparing various embedding techniques, are lacking, both for English and non-English datasets [10, 12, 13]. Moreover, recent reviews show that studies have explored data augmentation techniques in supervised deep learning algorithms to improve prediction performance [14].
The technique, which is generally a regularization technique that synthesizes new data from existing data, has been widely used in computer vision [14, 15]; however, work relating to textual data is limited, with a few exceptions, due to the difficulty of establishing standard rules for automatic transformations of textual data while conserving the quality of the annotations [14, 16, 17]. For example, the authors in [17] explored various data pre-processing and regularization techniques to analyze the sentiments of Vietnamese users on Twitter, with results indicating data augmentation to be a promising solution to boost the accuracy of classifiers. To address the gaps identified above, this study aims to predict customer review ratings using deep learning models based on an e-commerce dataset containing reviews for women's clothing. Specifically, this is achieved through data pre-processing and data augmentation to increase the variability of the dataset. Several word embedding techniques were examined, including Word2Vec, FastText, and the BERT model and its variants (i.e., RoBERTa and ALBERT), in order to identify the best embedding technique along with the deep learning algorithms. Several Neural Network (NN) classifiers were then used, namely Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), and Bi-directional Long Short Term Memory (Bi-LSTM), on two different setups, that is, 5-class versus 3-class. The models were evaluated using standard performance metrics. Further, we also validated our models against several machine learning algorithms including Naïve Bayes, Logistic Regression and Support Vector Machine (SVM), etc. The paper contributes an extensive analysis of various well-known deep learning models along with the more recent and advanced BERT variants in order to identify the best sentiment review prediction model using both the original and augmented datasets.
The remainder of the paper consists of background, methodology, results, discussion and conclusion. Sentiment refers to 'a feeling or an opinion, especially one based on emotions' [18], while sentiment analysis is the process of analyzing people's sentiment expressed toward services, products, mandates, organizations, etc. [19]. A sentiment rating, on the other hand, refers to the use of numerical values (or stars) to indicate the intensity of one's sentiment [10]. As mentioned previously, sentiment analysis has been studied and applied in various fields, primarily to gauge how people feel about something, and its popularity among research scholars can be attributed to the proliferation of social media. Sentiment analysis is often performed using machine learning, lexicon-based or hybrid approaches [20]. Machine learning remains the most widely used approach in sentiment analysis as the algorithms demonstrate high classification accuracy; however, the classifiers are very domain-dependent. On the other hand, the lexicon-based approach uses opinion lexicons to determine the semantic orientation of the words as negative or positive with the help of scores [20]. Although this approach does not require labeled data and learning procedures, powerful linguistic resources are usually required, which are not always available, especially for non-English datasets. The hybrid approach is a combination of the machine learning and lexicon-based approaches. Studies investigating sentiment analysis based on user reviews using machine learning approaches are many, the majority of which have used supervised algorithms such as Naïve Bayes, Decision Tree, SVM, etc. For instance, Haque and colleagues [21] applied a semi-supervised approach on an Amazon review dataset covering three different categories of products, using pool-based active learning for data labeling. Their experiments show Linear Support Vector Machine to produce the highest accuracy.
Scholars have also attempted to improve sentiment analysis based on specific features, such as Pang et al. [22], who analyzed the performance of Naïve Bayes, Maximum Entropy and SVM on movie reviews with ratings (i.e., a number of stars), whereas the authors in [23] examined the effect of word lengths for airline reviews. Despite the popularity of the machine learning approaches, researchers have noted the need for more advanced and robust sentiment analysis approaches to better understand customers and their needs [24]. Deep learning refers to 'neural networks with multiple layers of perceptrons inspired by the human brain' [20] and has been shown to bring benefits toward text generation, word representation estimation, sentence classification and feature presentation [25]. The approach has been successfully used to analyze sentiments for reviews [10, 11, 26], stock price prediction [5, 6] and also non-English datasets [27] [28] [29], with popular algorithms including RNN, CNN, Bi-LSTM and integrated versions of these algorithms. To further elaborate, [29] performed sentiment analysis using RNN based on Word2Vec embedding on reviews extracted from the Indonesian Traveloka website. Their proposed model reported an accuracy of 91.9%. Hameed and Garcia-Zapirain [30] used Bi-LSTM on three datasets, namely IMDB, Movie Review and Stanford Sentiment Treebank (SST2), with accuracy results of 85.8%, 80.5% and 90.6%, respectively. The authors also found Bi-LSTM to be computationally efficient and well-suited for sentiment analysis tasks. A similar approach was adopted by Xu and colleagues [31], who used Word2Vec along with Bi-LSTM, LSTM, RNN and CNN to extract sentiments of Chinese hotel reviews, with Bi-LSTM emerging as the best model with an F-score of 92%.
On the other hand, [32] compared CNN, RNN and deep NN (DNN) using Word2Vec and Term Frequency-Inverse Document Frequency (TF-IDF) on 13 different datasets, with results showing the models to have the best performance across all the metrics when Word2Vec was used. Also, RNN using Word2Vec emerged as the best model, although it was computationally expensive compared to the others. Others have merged several deep learning models to improve sentiment analysis. For example, [33] proposed an LSTM-CNN grid-search (GS) model to predict sentiment on two datasets, namely Amazon and IMDB movie reviews. The authors specifically implemented a grid-search approach in their proposed work and compared their model against several baseline algorithms such as CNN, LSTM, CNN-LSTM, etc., with results indicating their model to have outperformed the baselines with an overall accuracy of 96%. A similar work was accomplished by [26] using Amazon reviews, in which topic modeling was first administered with Fuzzy C-means prior to classifying sentiments using CNN. The authors reported their proposed model to have an enhanced accuracy of between 6 and 20% compared to the traditional systems. The literature also revealed studies exploring the more advanced embedding technique, BERT, and its variants in improving sentiment analysis for reviews. For instance, [34] improved sentiment analysis for commodity reviews using BERT-CNN, with F-score results indicating the combination of BERT-CNN (84.3%) to be the best compared to BERT (82%) and CNN (70.9%). Similarly, [12] developed SenBERT-CNN to analyze JD.com (mobile phone merchant) reviews by combining BERT and CNN, the latter of which was used to extract deep features of the text. The authors found BERT-CNN to have the highest accuracy (95.7%) compared to LSTM, BERT and CNN. On the other hand, [10] used Neural Network (NN) models to predict ratings of drug reviews using a dataset from Drugs.com.
The reviews had a score ranging from 0 to 9 indicating the satisfaction level of patients. The authors proposed several NN models including BERT-LSTM on two setups (i.e., 10-class and 3-class, which is the compact version of the dataset), with results showing BERT-LSTM to be the best for the 3-class setup with an average F-score of 82.37%, albeit with a very high training time. Others include the work of [11], who examined several NN models along with BERT for a movie review dataset, with results indicating BERT to produce the best accuracy, while [13] used BERT for Twitter sentiment analysis, transforming jargon into plain text for BERT training. A summary of the studies using deep learning algorithms to perform sentiment analysis based on user reviews is given in Table 1.

This section provides the methodology adopted in this study, outlining the datasets used, data pre-processing steps, feature extractions, sentiment review and rating classifications, experimental setups and evaluations. Figure 1 illustrates the overall methodology.

The dataset for this study comprised customer reviews on women's clothing, consisting of 23,486 observations, including clothing ID, age of the reviewers, title of the reviews, review text, rating, recommended indicators, positive feedback counts, division name, department name and class name. The review text is used to predict the rating given to the products (i.e., from 1: extremely negative to 5: extremely positive). The dataset is available at Kaggle [35]. A preliminary check revealed 845 missing reviews, hence these were removed, resulting in a final sample size of 22,641. Figure 2 illustrates the word cloud for the two extreme ratings in the dataset.
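The removal of observations with missing review text described above can be sketched as follows. This is a minimal illustration only, not the authors' code: the column names `Review Text` and `Rating` are assumed to match the Kaggle file, and the three sample rows are made up for the example.

```python
import csv
import io

# Hypothetical sample rows; the column names are assumptions about the Kaggle file.
SAMPLE_CSV = """Clothing ID,Review Text,Rating
767,Absolutely wonderful and comfortable,4
1080,,5
1077,I had such high hopes for this dress,3
"""


def load_reviews(handle, text_col="Review Text", rating_col="Rating"):
    """Read (review text, rating) pairs, skipping rows whose review text is missing."""
    rows = []
    for row in csv.DictReader(handle):
        text = (row.get(text_col) or "").strip()
        if not text:
            continue  # drop observations without a review, as in the preliminary check
        rows.append((text, int(row[rating_col])))
    return rows


reviews = load_reviews(io.StringIO(SAMPLE_CSV))
```

Here the second sample row is dropped because its review text is empty, leaving two usable observations.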
Data augmentation is commonly used to enrich the training dataset such that the trained deep learning models are robust and produce improved performance. The technique has been widely used in computer vision and speech processing [14, 15], with interest in textual data augmentation increasing over the last few years [14, 36]. As textual communications are inherently more complex (i.e., syntax and semantic constraints), several data augmentation techniques have been proposed [17, 37, 38], including question answering [39] and synonym replacement [16], among others. The present study adopted one of the recent methods introduced in [39], that is, Easy Data Augmentation (EDA), comprising four NLP operations, namely random deletion, random insertion, random swap, and synonym replacement (see Table 2 for explanation and examples). EDA is known for its simplicity and ease of use as it does not require any predefined datasets, and often yields promising results [17, 36, 40]. For instance, Xiang and peers [40] compared various data augmentation techniques on several datasets and found EDA to perform better than DICT (i.e., a synonym replacement thesaurus [41]), although it was outperformed by their proposed POS-based augmentation technique. The present study adopted the default setting recommended for EDA by [36], that is, up to four augmented sentences were generated for each original sentence using a learning rate of 0.1. We administered all four operations listed in Table 2 on each sentence, hence generating four different variations for each. As the augmentations were done according to the pre-evaluated and recommended parameters, the resulting augmented dataset closely represents the original sentences, hence maintaining the meaning of the original data and conserving the true labels [36].

Table 2 Types of data augmentation used in the present study [36]. Bold words refer to the changes made as per the operation listed.
Example: The quick sluggish umber fox jumps over the lazy dog
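The four EDA operations described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the study or in [36]: the tiny `SYNONYMS` table is hypothetical and stands in for a real thesaurus, the value 0.1 is interpreted here as the fraction `alpha` of words affected per sentence, and the function names are chosen for the example.

```python
import random

# Hypothetical synonym table standing in for a real thesaurus.
SYNONYMS = {"quick": ["fast", "speedy"], "lazy": ["idle", "sluggish"], "jumps": ["leaps"]}


def synonym_replacement(words, n, rng):
    """Replace up to n words that have an entry in the synonym table."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(words, n, rng):
    """Insert a synonym of a randomly chosen known word at a random position, n times."""
    out = list(words)
    for _ in range(n):
        known = [w for w in out if w in SYNONYMS]
        if not known:
            break
        synonym = rng.choice(SYNONYMS[rng.choice(known)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out


def random_swap(words, n, rng):
    """Swap the positions of two randomly chosen words, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]


def eda(sentence, alpha=0.1, seed=0):
    """Return four augmented variants of a sentence, one per operation."""
    rng = random.Random(seed)
    words = sentence.split()
    n = max(1, int(alpha * len(words)))  # number of words affected per operation
    return [
        " ".join(synonym_replacement(words, n, rng)),
        " ".join(random_insertion(words, n, rng)),
        " ".join(random_swap(words, n, rng)),
        " ".join(random_deletion(words, alpha, rng)),
    ]


variants = eda("The quick brown fox jumps over the lazy dog")
```

Each augmented variant would inherit the rating label of its source review, which is what allows the augmented dataset to be used for training alongside the original one.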
The EDA technique resulted in a single augmented dataset, which was used to train and evaluate the sentiment rating prediction models, along with the original dataset. Common natural language processing tasks were then incorporated, starting with canonicalization, which involves conversion of text into lowercase and removal of leading and trailing spaces, numbers, punctuation and stop words (i.e., common words in English that carry little information about the context of the text, such as 'a,' 'an,' 'the,' etc.). This was followed by tokenization (i.e., splitting sentences into individual words) and lemmatization, which reduces words to their root forms (e.g., 'silky' to 'silk,' 'happened' to 'happen'). Additionally, index encoding and zero padding were performed to ensure all the matrices were of the same size, accomplished using the Keras library. Table 3 illustrates a hypothetical case for the pre-processing steps. Features are individual measurable properties or dimensions for algorithms to process, whereas feature extraction is the process of translating the processed texts into an informative format. In general, the feature extraction techniques are dependent on the prediction models used in a sentiment analysis. In this study, word embeddings (i.e., vector representations of a particular word) were extracted as features through several techniques, including Word2Vec and FastText. FastText is an extension of Word2Vec that breaks words into n-grams (smaller parts), e.g., 'apple' to 'app', with the intention of learning the morphology of the words. The model also returns a bag of embedded vectors for each word in the text [43]. Word2Vec and FastText might not handle polysemous words (i.e., words with multiple meanings) as they are deemed to be context-free (i.e., they map the same word to the same embedding vector). For example, 'fire' would have the same representation in 'building on fire' and 'fire someone.'
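The two properties just described, sub-word n-grams and context-free lookup, can be illustrated with a short sketch. It is not the FastText or Word2Vec implementation: the boundary markers `<` and `>` and the 3 to 6 character range follow the commonly documented FastText defaults as we understand them, and the two-dimensional vectors in `STATIC_VECTORS` are hypothetical values chosen only for the example.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, using FastText-style boundary markers."""
    wrapped = f"<{word}>"  # '<' and '>' mark the start and end of the word
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


# Hypothetical 2-dimensional vectors, chosen only for illustration.
STATIC_VECTORS = {"building": [0.2, 0.7], "fire": [0.9, 0.1], "someone": [0.4, 0.4]}


def embed(tokens, table=STATIC_VECTORS, dim=2):
    """Look up one fixed vector per token; unknown tokens map to a zero vector."""
    return [table.get(tok, [0.0] * dim) for tok in tokens]


trigrams = char_ngrams("apple", 3, 3)
fire_in_building = embed("building on fire".split())[2]
fire_as_verb = embed("fire someone".split())[0]
```

Because the lookup ignores the neighbouring words, `fire_in_building` and `fire_as_verb` are the same vector, which is the limitation of context-free embeddings noted above.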
To mitigate this problem, scholars have begun to explore transformer-based embeddings, including BERT and its variants. BERT-variant models were pre-trained by incorporating the context of the word within the text in Wikipedia and BooksCorpus [44], and the embeddings are then used by a classifier for predictions. As they produce contextualized word embeddings, they achieve state-of-the-art results on Natural Language Processing tasks [12, 34]. The BERT-base model is a bi-directional (both left-to-right and right-to-left direction) transformer pre-trained over a large amount of unlabeled textual data to learn a language representation that can then be fine-tuned for specific classification tasks (see [44] for further details). One of its popular variants is RoBERTa (Robustly Optimized BERT approach), which was introduced by Facebook. It is basically an improved version of BERT, capable of handling more data with higher computing power. Compared to BERT, RoBERTa has been shown to have a higher prediction power. Finally, Google and Toyota developed a smaller/smarter BERT variant known as A Lite BERT (ALBERT), which is dramatically smaller in size compared to BERT. The present study examined the BERT-base model and two of its variants, that is, RoBERTa and ALBERT. Three well-known NN algorithms were identified from the literature, namely CNN, RNN and Bi-LSTM. NN models are basically made up of artificial neurons organized in layers, known as input (i.e., predictors), output (i.e., predictions) and hidden layers. In a feed-forward multilayer NN model (see Fig. 3), each layer receives inputs from the previous layers, and the inputs are combined using adaptive weights that are calibrated through a training process [45]. There is an activation function for each neuron, with popular ones including tangent sigmoid, logarithmic sigmoid and Softmax [45]. RNN belongs to a class of NN that is good at modeling and processing sequence data for predictions.
It operates on word-based vectors and deals with long-term dependencies among words in a text corpus. RNN processes sequential data using its internal memory, which allows the network to retain the information that has been processed before the current stage [46]. In the current study, we used an LSTM layer with 256 units, a dropout rate of 0.3 and a learning rate of 0.001. Softmax, which converts a vector of values to a probability distribution, was used as the activation function. CNN, on the other hand, is designed to adaptively learn spatial hierarchies of features, and is typically composed of three types of layers, that is, convolution, pooling, and fully connected layers. The first two perform feature extraction, whereas the fully connected layer maps the extracted features into a final output [46, 47]. The extracted features can hierarchically and progressively become more complex, hence parameters are often optimized through algorithms [47]. We used a convolution layer with 256 filters with window sizes of 3, 4 and 5 word vectors, along with a rectified linear unit (ReLU) as the activation function. Further, a kernel regularizer that applies an L1 regularization penalty with a value of 0.01 was also applied, along with a dropout rate of 0.3. Finally, the Bi-LSTM is an improved version of RNN that processes the input text while storing the semantics of both the previous and future context information. It is composed of LSTM units that operate in both directions, consisting of recurrently connected memory blocks with each memory cell containing three gates, namely the input gate (controls if the information is allowed in), forget gate (controls the length of time information remains in the memory) and output gate (controls the output of the memory cells) [48]. We used two types of dense layers, namely a layer with 64 units using ReLU as the activation function, and an output layer with 3 or 5 units (one per class, depending on the setup) using Softmax as the activation function.
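For reference, the three gates described above are usually written as follows. This is the common textbook formulation of an LSTM cell, given here as a sketch; the symbols (weight matrices W and U, biases b, input x_t, hidden state h_t, cell state c_t, logistic sigmoid σ, element-wise product ⊙) are notation introduced for this illustration, and the variant described in [48] may differ in detail.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)} \\
h_t^{\mathrm{Bi}} &= [\overrightarrow{h}_t \, ; \, \overleftarrow{h}_t] && \text{(Bi-LSTM: forward and backward states concatenated)}
\end{aligned}
```

In the bi-directional case, one LSTM reads the sequence from left to right and a second one from right to left, so each position carries both previous and future context.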
A similar dropout rate of 0.3 was used for the Bi-LSTM model as well. Note that we carried out additional analyses using conventional machine learning algorithms to compare their performance with the deep learning models. Specifically, five well-known machine learning algorithms used in sentiment analysis studies [7, [21] [22] [23] [24] were selected, namely SVM, Naïve Bayes, Random Forest, Logistic Regression and Decision Tree. Naïve Bayes is one of the simplest and most widely used probabilistic algorithms for classification problems, requiring only a small amount of training data. In other words, it returns a probability based on the class that has the 'maximum posterior' [3, 49]. SVM, on the other hand, attempts to find the best hyperplane for classification purposes, and is known to work well with high-dimensional datasets. However, it requires a substantial amount of time to determine the optimal kernel functions [50]. The Decision Tree is a powerful classification algorithm that describes the relationship of attributes and targets in the form of a tree using an 'if-then' rule-based structure [51]. It has the ability to deal with large datasets compared to other machine learning algorithms; however, it also suffers from an instability issue where a small change in the training samples tends to cause a large difference in the classification results [52]. An improvement to Decision Tree is Random Forest, which is one of the best known algorithms for classification, often yielding good accuracy results without any overfitting issues. Random Forest produces a number of individual trees and makes a final prediction by aggregating the decisions of the individual trees [53]. Finally, the boosting approach merges weak classifiers to improve classification performance, and studies have shown that the approach is superior to other machine learning algorithms such as SVM and Decision Tree [54].

Fig. 3 General Neural Network Architecture [45]
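To make the 'maximum posterior' idea behind Naïve Bayes concrete, the sketch below trains a small classifier on tokenized reviews and picks the class with the highest posterior score. The study does not state which Naïve Bayes variant or smoothing was used, so the multinomial form with add-one (Laplace) smoothing is an assumption, as are the function names and the toy training reviews.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect class frequencies, per-class word counts and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the maximum (log) posterior for the given tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen during training
            # add-one smoothing avoids zero probabilities for unseen class/word pairs
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training reviews, made up for the example.
train_docs = [["love", "dress"], ["great", "fit"], ["poor", "quality"], ["bad", "fit"]]
train_labels = ["positive", "positive", "negative", "negative"]
nb_model = train_nb(train_docs, train_labels)
```

With this toy data, `predict_nb(nb_model, ["love", "fit"])` selects the positive class because 'love' only appears in positive training reviews.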
The experiments were conducted in several setups and scenarios, as follows. There were two experiment setups based on the labels:
1. 5-class: refers to the original rating scale from 1 to 5 (i.e., 1 - extremely negative; 2 - negative; 3 - neutral; 4 - positive; 5 - extremely positive) [35];
2. 3-class: Ratings 1 and 2 were combined to reflect negative sentiment, 3 as neutral, and 4 and 5 combined as positive sentiment [10].
The scenarios of the experiments are as follows:
1. RNN, CNN and Bi-LSTM using Word2Vec and FastText on both the original and augmented datasets, tested in the 5- and 3-class setups;
2. BERT variants (i.e., BERT, RoBERTa and ALBERT) using both the original and augmented datasets, tested in the 5- and 3-class setups. This excludes the Word2Vec and FastText embeddings.
Upon the identification of the best setups from the experiments above, i.e., original versus augmented dataset, word embedding technique (Word2Vec, FastText and BERT variants) and class setup (5-class versus 3-class) (Sects. 4.1 and 4.2), we carried out further modeling using ensemble models, namely CNN-RNN, CNN-Bi-LSTM, RNN-Bi-LSTM and CNN-RNN-Bi-LSTM, using the majority voting technique to choose the best prediction (Sect. 4.3). Further, to validate the findings, the best class and word embedding setup was also used with several machine learning algorithms, that is, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, and SVM (Sect. 4.4). All the NLP tasks and model developments were accomplished using Python and Keras. We used AdamW as the optimizer and CrossEntropyLoss as the loss function. For validation purposes, we adopted k-fold cross-validation, in which the data are partitioned into k disjoint folds, with one of the folds used for testing while the remaining k-1 folds are used for training. We used k = 10, hence 10 different models were trained and tested over 10 iterations before the final value was averaged.
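The three mechanical pieces of this setup (compressing ratings into three classes, majority voting across ensemble members, and k-fold partitioning) can be sketched as follows. This is an illustration only, not the code used in the study: the function names are chosen for the example, shuffling before partitioning is an assumption, and so is the tie-breaking rule, since the text does not say how ties are resolved for the two-model ensembles.

```python
import random


def compress_rating(rating):
    """Map a 1-5 rating onto the 3-class setup."""
    if rating in (1, 2):
        return "negative"
    if rating == 3:
        return "neutral"
    if rating in (4, 5):
        return "positive"
    raise ValueError(f"unexpected rating: {rating}")


def majority_vote(predictions):
    """Return the most frequent label; on a tie, the earliest-listed model wins (assumption)."""
    return max(predictions, key=predictions.count)


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs over k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_splits(25, k=10))
```

In a full pipeline, each model would be trained on `train_idx` and scored on `test_idx` for every split, and the per-fold scores averaged to give the reported value.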
In classification approaches, it is common to use k = 5 or 10 [1, 4, 23].

The standard performance metrics for classification problems were used to assess all the models, namely:
1. Accuracy: the proportion of all instances that are correctly classified: $$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$
2. Precision: the proportion of predicted positives that are correct: $$\mathrm{Precision} = \frac{TP}{TP + FP}$$
3. Recall: the proportion of actual positives that are correctly identified: $$\mathrm{Recall} = \frac{TP}{TP + FN}$$ where TP - true positive; TN - true negative; FP - false positive; FN - false negative [55].
4. F-measure: the harmonic mean between precision and recall, with a range between 0 and 1. A greater value of F-measure indicates better performance of the model. The formula for determining F-measure is: $$F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
5. Area under the ROC (Receiver Operating Characteristic) curve (AUC-ROC): similar to the F-measure, AUC also ranges between 0 and 1. The higher the score for AUC, the better the performance. The ROC curve is a graph that plots sensitivity (true positive rate) against (1-specificity) (false positive rate).

Table 4 presents the results of the experiments involving all the NN models using the original dataset in both 5- and 3-class setups, along with the Word2Vec and FastText techniques. This was accomplished to assess the performance of the various NN models based on the two word embedding techniques and class setups. The results for the augmented dataset for the same embedding and class setups are provided in Table 5. It can be generally observed that Bi-LSTM based on Word2Vec consistently outperformed other NN models, regardless of the setups, followed very closely by RNN. The results for the augmented dataset produced a more consistent pattern where RNN emerged as the best model using Word2Vec, for both the setups (see Table 5). This is in accordance with other studies that found RNN using Word2Vec to be the best model in sentiment classification [29, 32], but in contrast to [31], who found Bi-LSTM with Word2Vec to be the best model. Studies using RNN have generally found the use of word embedding techniques to produce better prediction models compared to other techniques such as TF-IDF [29].
Further, it can also be observed that the performance of the models was better for the augmented dataset compared to the original, across all the metrics. This is probably because data augmentation, which is one of the most useful techniques for training NN models, is able to prevent overfitting by shuffling particular forms of language. Therefore, it keeps NN models from learning spurious correlations and memorizing high-frequency patterns that do not generalize [36]. Similar observations have been reported in other studies that have compared the use of EDA in textual communications both in English [40] and non-English languages [17]. Tables 6 and 7 show the performance results for the BERT variants, where a consistent pattern was noted for RoBERTa for both the datasets and setups. It can also be observed that prediction performance is better in the 3-class setup as opposed to the 5-class setup. In fact, the same pattern was found in the NN models (Tables 4 and 5), probably due to a more refined classification when the number of classes/categories is smaller. A similar result was reflected in [10], where the authors reported an improved F-score in their 3-class setup as opposed to the 10-class setup. The BERT variants were found to perform better on the augmented dataset as well, akin to the NN models, albeit with lower metric scores. Of all three variants, RoBERTa produced the best results, though only marginally ahead of BERT. This is in line with [56], who found RoBERTa to outperform the BERT model, achieving a 2 to 20% increase in model performance on the majority of NLP tasks. However, this result is in contrast with [57], who found BERT to perform better than RoBERTa for a sentiment analysis task, with the author attributing this to the quality of data and features extracted for their sentiment analysis task. In conclusion, the results in Sects. 4.1 and 4.2 revealed RNN-Word2Vec to be the best model using the 3-class setup and augmented dataset.
Therefore, the rest of the experiments were executed using Word2Vec and the 3-class setup on the augmented dataset. Table 8 depicts the results for the ensemble models based on the best setup (i.e., 3-class) using the augmented dataset and Word2Vec. This was done to assess the performance of the merged NN models in predicting the sentiment reviews as opposed to the individual models in Sects. 4.1 and 4.2. Our results indicate that all the ensemble models perform better than the individual NN models (see Table 5), with CNN-RNN-Bi-LSTM having the best accuracy (96%) and F-score (91.1%). This pattern has been reported in other studies as well, whereby multi-models were generally found to perform better than individual models [26, 33], regardless of the datasets used. Though the metric differences between the ensemble models are not large, our results provide evidence that the use of ensemble models (which aims to improve predictions) helps to improve the overall review prediction results compared to the traditional approach of using individual deep learning models. Finally, to validate our findings against the machine learning approach, the same setup as in Sect. 4.3 was used with several machine learning models, as shown in Table 9. All the models were found to have performed poorly as opposed to the deep learning models, with differences of at least 20% in terms of accuracy. Based on these results, the study concludes that the more robust deep learning models are better suited to perform sentiment rating predictions compared to the conventional machine learning approach. This study contributed to the research domain of online customer reviews using several deep learning algorithms based on various embedding techniques.
Our findings show that all the prediction models work better in a setup with fewer and more refined classes (3-class versus 5-class), and that using the augmented dataset improves the prediction compared to the original dataset. As for the context-free embeddings, Word2Vec was found to produce better results than FastText, though the differences were minimal. Similarly, RoBERTa produced the best results compared to BERT and ALBERT. Finally, our results also show the ensemble models to produce the best results compared to the individual models, and also against the machine learning models. We identify several limitations. The dataset used in this study was not checked for spam or fake reviews, hence this may have affected the predictions to a certain extent. Thus, an additional step that automatically detects fake reviews and spam could be included in the pre-processing stage [26]. The scope of the study is also limited to English reviews, thus the proposed models and findings may not be applicable in a multi-lingual setting. This is considered important as online customers are known to originate from all around the world, and there is a tendency to communicate in languages other than English, such as Chinese, Spanish, etc. In future studies, the current proposed framework could be enhanced to handle languages other than English. We experimented with well-known NN models, using various embedding techniques including the more advanced BERT and its variants. However, other approaches could be explored, such as the use of lexicons, which can be merged with NN and BERT-variant models (e.g., lexicon-enhanced BERT and lexicon-RNN). Moreover, the present study did not consider the proportion of polysemous words for BERT, in line with numerous other studies that have shown BERT-derived representations could reflect words' polysemy level and their partitionability into senses [58] [59] [60].
Nevertheless, it would be interesting to further investigate this notion by considering the proportion of polysemous words for BERT variants. Further, our results indicate that machine learning algorithms performed considerably worse than the NN models in the same setup. Although deep learning models are generally known to perform better than machine learning models, they are computationally expensive. Therefore, future studies could explore optimization techniques or other ensemble boosting approaches to improve the prediction performance of the machine learning models. In addition, predicting review ratings based on real-time data and applications would be an interesting and important direction, considering that online shopping has gained momentum during the COVID-19 pandemic, which has dramatically changed the shopping landscape globally.

1. Customer preferences extraction for air purifiers based on fine-grained sentiment analysis of online reviews
2. Exploring customer sentiment regarding online retail services: a topic-based approach
3. E-commerce product review sentiment classification based on a Naïve Bayes continuous learning framework. Inf Process Manage
4. A semi-supervised approach in detecting sentiment and emotion based on digital payment reviews
5. Investment strategies applied to the Brazilian stock market: a methodology based on sentiment analysis with deep learning
6. A hybrid model integrating deep learning with investor sentiment analysis for stock price prediction
7. Sentiment analysis of financial news using unsupervised approach
8. A social media analytic framework for improving operations and service management: a study of the retail pharmacy industry
9. Sentiment analysis: predicting product reviews' ratings using online customer reviews
10. Comparing deep learning architectures for sentiment analysis on drug reviews
11. Fine-grained sentiment classification using BERT. 2019 Artificial Intelligence for Transforming Business and Society (AITB)
12. Sentiment analysis of online product reviews based on SenBERT-CNN
13. An effective BERT-based pipeline for Twitter sentiment analysis: a case study in Italian
14. Text data augmentation for deep learning
15. ImageNet classification with deep convolutional neural networks
16. Contextual augmentation: data augmentation by words with paradigmatic relations
17. A review: preprocessing techniques and data augmentation for sentiment analysis
18. The determinants of the U.S. consumer sentiment: linear and nonlinear models
19. Topic-level sentiment analysis of social media data using deep learning
20. A comprehensive survey on sentiment analysis: approaches, challenges and trends
21. Sentiment analysis on large scale Amazon product reviews
22. Thumbs up?: Sentiment classification using machine learning techniques
23. Improving sentiment scoring mechanism: a case study on airline services
24. Developing a supervised learning-based social media business sentiment index
25. Sentiment analysis using deep learning techniques: a review
26. Predicting the customer's opinion on Amazon products using selective memory architecture-based convolutional neural network
27. Enhancing Arabic aspect-based sentiment analysis using deep learning models
28. Thai sentiment analysis with deep learning techniques: a comparative study based on word embedding POS-tag, sentic features
29. Sentiment analysis using recurrent neural network
30. Sentiment classification using a single-layered BiLSTM model
31. Sentiment analysis of comment texts based on BiLSTM
32. Sentiment analysis based on deep learning: a comparative study
33. A novel LSTM-CNN-grid search-based deep neural network for sentiment analysis
34. A commodity review sentiment analysis based on BERT-CNN model
35. Women's clothing reviews
36. EDA: easy data augmentation techniques for boosting performance on text classification tasks
37. Data augmentation for low-resource neural machine translation
38. Improving neural machine translation models with monolingual data
39. Paraphrase-driven learning for open question answering
40. Lexical data augmentation for sentiment analysis
41. Character-level convolutional networks for text classification
42. Efficient estimation of word representations in vector space
43. Enriching word vectors with subword information
44. BERT: pre-training of deep bidirectional transformers for language understanding
45. Prediction of wind pressure coefficients on building surfaces using artificial neural networks
46. Deep learning for sentiment analysis: a survey
47. Convolutional neural networks: an overview and application in radiology
48. Bi-LSTM model to increase accuracy in text classification: combining Word2Vec CNN and attention mechanism
49. The text mining handbook: advanced approaches in analyzing unstructured data
50. Estimate at completion for construction projects using evolutionary support vector machine inference model
51. Boosted decision trees as an alternative to artificial neural networks for particle identification
52. Application of J48 decision tree classifier in emotion recognition based on chaos characteristics
53. Comparative performance of six supervised learning methods for the development of models of hard rock pillar stability prediction
54. A formwork method selection model based on boosted decision trees in tall building construction
55. A primer on neural network models for natural language processing
56. RoBERTa: a robustly optimized BERT pretraining approach
57. Exploiting BERT and RoBERTa to improve performance for aspect based sentiment analysis. Dissertation, Technological University Dublin
58. Let's play mono-poly: BERT can reveal words' polysemy level and partitionability into senses
59. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence
60. Target-dependent sentiment classification with BERT

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.