key: cord-0220787-1mysciqb authors: Ahmad, Kashif; Alam, Firoj; Qadir, Junaid; Qolomany, Basheer; Khan, Imran; Khan, Talhat; Suleman, Muhammad; Said, Naina; Hassan, Syed Zohaib; Gul, Asma; Al-Fuqaha, Ala title: Sentiment Analysis of Users' Reviews on COVID-19 Contact Tracing Apps with a Benchmark Dataset date: 2021-03-01 journal: nan DOI: nan sha: 624de9fbadb3d98ce4816ee9fe12efdd6a07b6f4 doc_id: 220787 cord_uid: 1mysciqb Contact tracing has been globally adopted in the fight to control the infection rate of COVID-19. Thanks to digital technologies, such as smartphones and wearable devices, contacts of COVID-19 patients can be easily traced and informed about their potential exposure to the virus. To this aim, several interesting mobile applications have been developed. However, there are ever-growing concerns over the working mechanism and performance of these applications. The literature already provides some interesting exploratory studies on the community's response to the applications by analyzing information from different sources, such as news and users' reviews of the applications. However, to the best of our knowledge, there is no existing solution that automatically analyzes users' reviews and extracts the evoked sentiments. In this work, we propose a pipeline starting from manual annotation via a crowd-sourcing study and concluding with the development and training of AI models for automatic sentiment analysis of users' reviews. In total, we employ eight different methods, achieving an average F1-score of up to 94.8%, which indicates the feasibility of automatic sentiment analysis of users' reviews on COVID-19 contact tracing applications. We also highlight the key advantages, drawbacks, and users' concerns over the applications. Moreover, we collect and annotate a large-scale dataset composed of 34,534 manually annotated reviews of the contact tracing applications of 46 distinct countries.
The presented analysis and the dataset are expected to provide a baseline/benchmark for future research in the domain. Since the emergence of COVID-19, public authorities have been trying their best to slow down the infection rate of the virus globally.
As part of their efforts, several solutions, such as closing public places, imposing full or partial lock-downs, and limiting people's contacts, have been implemented. Contact tracing has been globally recognized as one of the effective methods to slow down the infection rate of the virus [1] . To this aim, most of the initial efforts were based on manually tracing the contacts of infected persons. Manual contact tracing works only when the infected person knows who has been in physical contact with him/her, which reduces the effectiveness of the method. Moreover, manual contact tracing is a very time-and resource-consuming process [2] , [3] . The potential of contact-tracing could be fully utilized if, ideally, the contact tracing mechanism can track the contact of an infected person on a very large scale. For instance, it would be ideally beneficial if the authorities are able to track where the infected person has been and identify and notify the potential contacts of the patient. The technology, such as proximity sensors in smartphones and wearable devices, can help in such situations allowing the authorities to automatically notify the potential cases more quickly and accurately [1] , [4] . To this aim, several mobile applications with a diversified set of features have been developed world-wide each aligned with COVID-19 related policies, social values, and local infrastructure. However, the success of such applications is largely constrained by the number of users. According to Hinch et al. [5] , the potential of these mobile applications could be fully utilized if used by at least 80% of the mobile users-which is 56% of the total population in the case of UK as reported by the authors but generally depends on the mobile penetration in the country. In order to increase the number of these applications' users, different strategies and policies have been devised [6] . 
For instance, public authorities in several countries have made it mandatory for the residents to install the contact tracing application to be able to access shopping malls, transportation, hospital, and other public places. However, there are several concerns over these applications in terms of both effectiveness and privacy. For instance, since the applications require tracking individuals' movement with GPS and other sensors to track their interactions, privacy concerns may arise [7] , [8] . Moreover, the literature also identifies lack of understanding and unavailability of the technology (e.g., smartphones) with a large portion of the population in third world countries, as one of the main reasons for less effectiveness of such contact tracing applications [9] . We believe an analysis of users' reviews on these applications will facilitate a better understanding of the concerns over these applications. There are already some efforts in this regard [3] , [9] , [10] . However, the majority of the methods rely on exploratory and manual analysis of the users' reviews, which is a resource and time-consuming process. Moreover, some of the works also rely on existing general sentiment analysis platforms/tools without training or fine-tuning the tools on COVID-19 application reviews. For instance, in [10] , a commercial tool namely AppBot 1 has been used for sentiment analysis of users' reviews on only nine mobile applications used in Europe. However, the tool relies on AI models trained for generic sentiment analysis, and returns four types of sentiments namely positive, negative, neutral, and mixed. As a result, the outcome is not reliable as the models are not trained on the task-specific data (i.e., App reviews). For instance, a vast majority of the reviews are highlighting some technical issues, such as difficulties with registration, etc., which also need to be analyzed. 
To address those limitations, we believe a task-specific model trained on a large-scale, manually annotated dataset of users' reviews will help to make better and context-specific classification of the reviews. Moreover, the existing literature relies on the users' reviews of fewer applications used in a specific region, which cover only a portion of the population of the world. In this paper, we analyze how AI models can help to automatically extract and classify the polarity of users' sentiments, and propose a sentiment analysis framework to automatically analyze users' reviews on COVID-19 contact tracing mobile applications. In detail, we collected and annotated a large-scale dataset of users' reviews of Android and iOS COVID-19 contact tracing applications. After manually analyzing and annotating users' reviews, we employed both classical (i.e., MNB, SVM, Random Forest) and deep learning (i.e., Convolutional Neural Network [11], fastText [12], and different transformers [13]) methods for classification experiments. This resulted in eight different classification models. Moreover, to the best of our knowledge, this is the first attempt to develop a large-scale benchmark dataset for sentiment analysis of users' reviews on COVID-19 contact tracing applications, covering applications from 46 distinct countries available on Google Play and the App Store.
1 https://appbot.co
The main contributions of the work can be summarized as follows:
• We provide 34,534 manually labeled reviews based on the analysis of 40,000 reviews from 46 different COVID-19 contact tracing applications. The labels consist of sentiment polarities (i.e., positive, neutral, and negative) and an additional label (technical issue).
• We provide an in-depth analysis of the dataset that demonstrates different characteristics and insights.
• We share the dataset and data splits with the research community for both reproducibility and further enhancements.
• We report benchmark results using eight different classification experiments, which can serve as a baseline for future studies. The rest of the paper is organized as follows: Section II provides an overview of the related work. Section IV discusses the development of the dataset. Section V details the classification experiments. The detailed observations of our study are discussed in Section VII. Section VIII concludes this study and provides directions for future research. To fight against the COVID-19 pandemic, almost all research communities, such as health, NLP, and Computer Vision have been playing a significant role. As a result, several interesting solutions, aiming at different aspects of the pandemic, have been proposed over the last year [14] . For instance, there have been efforts for an early COVID-19 outbreak detection to help in an emergency response preparedness [15] . Similarly, a large portion of the efforts aimed at an automatic diagnosis, prognosis, and treatment [15] , [16] . Fake news detection, risk assessment, logistic planning, and understanding of social interventions, such as monitoring social distancing, are the other key aspects of the pandemic that received the attention of the community [14] , [17] , [18] . Contact tracing is also one of the aspects of the pandemic that has been widely explored in the literature. For instance, the study by Lash et al. [19] analyzed the mechanism and results of contact tracing in two different countries. The authors report that an accurate and efficient mechanism of contact tracing can significantly reduce the infection rate of the virus. However, several challenges are associated with a timely and accurate contact tracing of a COVID-19 patient. 
In this regard, a joint effort from the community, and the use of more advanced methods relying on different technologies, such as global positioning system (GPS), Wireless Fidelity (Wi-Fi), Bluetooth, Social graph, network-based APIs, and mobile tracking data will help to a great extent [20] , [21] . Handheld devices, such as mobile phones, which are already embedded with such technologies, are ideal platforms for deploying contact tracing solutions. Being a feasible solution, several mobile applications have been already developed in different parts of the world. In addition to basic contact tracing capabilities different features are also implemented based on the domestic COVID-19 policies [22] . For instance, in different countries, such as Qatar, Australia, the applications are used to access different facilities. Similarly, in Saudi Arabia, the application is used to seek permission for going out during lockdown. In Table I , we provide a list of some of the prominent contact tracing applications used in different parts of the world. Despite being a feasible solution for slowing down the infection rate, these applications are subject to criticism due to risks associated with them. In the literature, several issues, such as privacy, power consumption, and annoying alerts, have been reported. For instance, Bengio et al. [23] analyze and reported the privacy issues associated with COVID-19 contact tracing applications. Besides, some recommendations on how to ensure users' privacy, the authors also proposed a decentralized design for contact tracing by optimizing the privacy and utility trade-offs. Reichert et al. [24] also analyzed the privacy concerns over the applications, and proposed a privacypreserving contact tracing mechanism relying on a privacypreserving protocol namely secure multi-party computation (MPC) [25] to ensure individuals' privacy. Power consumption is another key challenge to contact tracing applications. 
The literature also provides several interesting works in which the feasibility of such mobile applications is assessed by analyzing people's response/feedback on these applications [22], [26], [27]. For instance, in [28] an online survey was conducted to analyze citizens' response to HSE, a contact tracing application used in Ireland. During the survey, a reasonable percentage of the participants showed their intention of using the application. However, the survey mainly aimed to analyze and identify different barriers to the use of such an application without analyzing the experience of the users with the application. In order to better analyze, understand, and evaluate users' experience and feedback on the COVID-19 contact tracing applications, a detailed analysis of the public reviews, which are available on Apple's App Store and the Google Play Store, is required. There are already some efforts in this direction. For instance, Rekanar et al. [3] provide a detailed analysis of users' feedback on HSE in terms of usability, functional effectiveness, and performance. However, the authors rely on manual analysis only, which is a time-consuming process. Another relevant work is reported in [10], where an exploratory analysis of users' feedback on nine COVID-19 contact tracing applications used in Europe is provided. To this aim, the authors rely on a commercial app-review analytics tool, namely Appbot, to extract and mine the users' reviews on the applications. To the best of our knowledge, the literature still lacks a benchmark dataset to train and evaluate ML models for automatic analysis of users' feedback on COVID-19 contact tracing applications. Moreover, the existing literature relies on the users' reviews of fewer applications used in a specific region, which cover only a portion of the population of the world.
Hence, our work differs in a number of ways: (i) we analyze reviews of a large number of applications used in different parts of the world, (ii) we manually annotate the reviews and share the resulting dataset with the community, and (iii) we provide detailed experimental results. In this section, we provide an overview of the methodology adopted in this paper. The complete pipeline of the proposed work is depicted in Figure 1. The work is carried out in two phases: (i) the dataset development phase, and (ii) the experimental phase. In the development phase, as a first step, we scraped Google Play and the App Store to obtain users' reviews on the COVID-19 contact tracing applications used in different parts of the world. After obtaining users' reviews, a crowd-sourcing study was conducted to annotate the reviews for the training and evaluation of ML models for automatic sentiment analysis of the reviews. The annotations and other details obtained during the crowd-sourcing study are then analyzed, as detailed later. In the experimental phase, before conducting experiments, the data is pre-processed to make it more meaningful for the AI models deployed in the experiments. During the experiments, we employed several AI models, as detailed later. In the remainder of this section, we provide details of our dataset development process, in which we describe our methodology for collecting, annotating, and analyzing the dataset. In order to obtain real-world users' reviews for our analysis, we crawled reviews from 46 COVID-19 contact tracing applications used in different parts of the world and hosted on Google Play and Apple's App Store. These applications are listed in Table I. We note that in this work we consider reviews in the English language only, and we made sure to analyze and annotate at least 50% of the available reviews.
However, to make sure the dataset is balanced in terms of reviews from different applications, a smaller portion of the available reviews is analyzed for some applications, such as Aarogya Setu. Besides users' reviews, we also obtained the replies to the reviews, where available, as well as the ratings. However, for this study, we only used the reviews for the analysis and experiments. We note that the reviews were obtained from December 20th to December 25th, 2020. For sentiment annotation, three sentiment polarities are typically used: positive, negative, and neutral. From our initial analysis, we realized that the applications can have technical problems; hence, we used another label, technical issues, for the annotation. Our annotation therefore consists of four labels: (i) positive, (ii) negative, (iii) neutral, and (iv) technical issues. To facilitate the annotation process, we developed a web application where the users' reviews on the applications were presented to the annotators to be manually labeled. In Figure 2, we present a screenshot of the annotation platform, which shows the review and the labels to choose from (Q.1). In addition, we asked the annotators to briefly provide the reason behind their response to Q.1. This second question (Q.2) is used to evaluate the quality of the annotation (i.e., whether the annotator carefully read the review or not). Moreover, we believe this question provides useful information for the manual analysis of the users' feedback. In total, 40,000 reviews were analyzed. To assure the quality of the annotations, each review is analyzed by at least two annotators. During the annotation process, we removed some reviews for reasons such as (i) not being in English, (ii) containing a large number of emoticons/signs, and (iii) being irrelevant. This process resulted in a total of 34,534 annotated reviews. Overall, the dataset provides a sufficient number of samples per class.
However, fewer samples are obtained in the neutral category, which contains reviews that are not directly linked to the usage of the applications. In total, we have 15,587 samples in the positive class, while the negative, neutral, and technical issues classes are composed of 8,178, 1,271, and 9,496 samples, respectively. From the analysis of the second question (Q.2), we identified the reasons/information that influenced the participants' decisions. In this section, we provide the statistics of the second question (Q.2). In Figures 3a, 3b, and 3c, we provide the distribution of the most common reasons/causes associated with the positive, negative, and technical issues classes, respectively. As can be seen in Figure 3a, in the majority of the positive reviews, users found the applications useful, informative, and helpful in the battle against COVID-19. Some sample positive reviews are "Thank you very much. it's very helpful and informative. it helps keep people away from suspicious areas."; "A very good app for tracing and stopping coronavirus"; "Always getting updated information about the virus"; "Very useful and informative app." A significant portion of the positive reviews is also based on the ease of installation, while some reviews mention that the application is working fine without giving further details. However, the most encouraging aspect is that a significant percentage of users appreciated the idea, concept, and efforts made by the authorities for contact tracing to slow down the infection rate. Some sample reviews include "A good initiative by the government"; "Good initiative to prevent the spread of corona virus, I appreciate who work behind this effort." There were also a large number of short reviews where the users simply showed their positive response without mentioning any particular reason.
Besides these, other common reasons for positive reviews highlighted by the users include some specific features of different applications in different parts of the world. For example, Takkawalna from Saudi Government, which was used to seek permission for going out during lockdown, was praised as a means of obtaining such permissions. Figure 3b reflects power consumption, uselessness, and privacy as the most common issues with these applications. A significant number of reviews also highlight that the majority of the applications are not user-friendly. There were also reviews depicting other issues, such as annoying notifications, unnecessary access to the gallery, slow response from the helpline, and unavailability of some key features, which could further improve the effectiveness of the applications. Some sample negative reviews include "Too much personal information collected. Privacy risk. Non compliant to international standards."; "Allow too many permission please ban this application. A total waste."; "I have concerns with their data privacy."; "This app is a battery hog."; and "Sadly, this app does not solve any problem. The government thought that the privacy is far more important than people's lives. Even in the app You see that the government is trying to protect your privacy, but they don't have any announcement about how to protect people's lives in this app. It's a shame." On the other hand, as can be noticed in Figure 3c, the key technical issues with these applications include registration and update issues. Moreover, a large number of reviews also highlight that the majority of the applications crash or frequently stop working. Besides these common issues, the reviews also hint at certain technical issues, such as device compatibility and connectivity issues, lack of support for some languages, such as English, and not correcting QR codes by different applications.
Some sample reviews highlighting technical issues in the applications include "The app continually crashes."; "I have business visa i am unable to register please give a solution on this."; "I installed..but i cant register yet."'; "I cant update?"; and "Install the apps, but keep showing connection error. Even restart the phone, also the same." We also provide country-wise statistics in Figure 4 , where we summarize the number/percentage of samples/reviews on the applications used in different countries belonging to each category. An important observation from the figure is the variation in the distribution of number of negative, positive, neutral, and reviews highlighting technical issues in different parts of the world. The variations in the number of reviews in each class depict how different response to the use of the applications has been observed in different parts of the world. As can be seen, in certain countries, such as Japan, Israel, Canada, and Ireland, the ratio of negative reviews is high. The ratio of the positive reviews is sufficient in most of the countries, which shows the trust of users in the applications. On the other hand, as expected, fewer neutral reviews from the majority of the countries are obtained for the dataset. The dataset also covers a significant ratio for the technical issues class in the majority of the countries. For instance, the ratio of the reviews highlighting technical problems in the applications is significantly high in Denmark, Tunisia and Cyprus. In order to analyze the changes in the polarity of users' sentiments over time, in Figure 5 , we provide some preliminary temporal analysis to analyze the variation in the distribution of negative, positive, neutral, and reviews highlighting the technical issues over time. 
We note that the current work provides only a preliminary temporal analysis, which will be explored further in the future. To this aim, we manually analyzed the initial 200 reviews and the 200 most recent reviews of applications with a reasonable time gap between their initial and most recent reviews. As can be seen in the figure, overall higher variation has been observed in the positive, negative, and neutral categories. As far as the individual applications are concerned, higher variation in the polarity of sentiments is observed for the applications used in Australia, Singapore, UAE, and Canada. During the data analysis, we observed that, though there were some doubts about privacy, the initiative/idea of contact tracing was largely appreciated at the beginning by the users in different parts of the world. Moreover, we also observed that the users of these applications faced device compatibility and registration issues over time. Interestingly, in the case of most of the applications, the number of negative reviews increased with time. One of the possible reasons for the increase is the applications' failure to achieve what they promised.

To understand the lexical content, we conducted an analysis of the number of tokens in each review, which helps to understand the characteristics of the dataset. For example, for CNN and LSTM based architectures, it is necessary to define the maximum sequence length. The minimum, maximum, and average number of tokens in the dataset are 3, 198, and ∼18, respectively. Figure 6 provides the statistics of the length of the reviews in the dataset. We also analyzed the lexical content of each category, in terms of top n-grams, to understand whether the categories are lexically distinctive. This analysis also demonstrates the quality of the labeled data. We compared the vocabularies of all categories using the valence score ϑ [29], [30], computed for every token x using Equation 1:

$$\vartheta(x, L_i) = 2 \cdot \frac{C(x \mid L_i) / T_{L_i}}{\sum_{l} C(x \mid L_l) / T_{L_l}} - 1 \qquad (1)$$

where C(·) is the frequency of the token x for a given class L_i, and T_{L_i} is the total number of tokens present in the class. Since ϑ(x) ∈ [−1, +1], the value +1 indicates that the use of the token is significantly higher in the target class than in the other classes. In Table II, we present the top frequent bi- and tri-grams with ϑ = 1.0 for each category. From the table, we observe that these n-grams clearly represent the class-wise information of the data.

V. EXPERIMENTS

As discussed earlier, we obtained a large number of samples for the positive, negative, and technical issues classes, while fewer samples are obtained in the neutral class. Moreover, the reviews highlighting technical problems in the applications could also be treated as negative reviews. Thus, in order to cover different aspects of the problem, we divide it into three different tasks.

Task 1: Ternary classification (PNT): we treat the problem as a ternary classification problem, where the positive, negative, and technical issues classes are considered. The models trained for this task are expected to help in identifying the reviews highlighting technical problems in the applications along with the positive and negative reviews.

Task 2: Binary classification (PN): the negative and technical issues classes are merged into a single negative class, which, together with the positive class, forms a binary classification problem. One of the main reasons for treating the task as a binary classification is the availability of fewer samples in the neutral class.

Task 3: Ternary classification (PNN): we have three classes, namely positive, negative, and neutral. We note that in this task, the negative class is the combination of the original negative and technical issues classes.

All these tasks will help in analyzing how the performance of the proposed sentiment analyzer varies with different sets of annotations. For the classification experiments, we divided the dataset into training, validation, and test sets with a proportion of 60.3%, 6.7%, and 30%, respectively.
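The three task definitions and the per-class split can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the label strings, the `TASK_MAPS` table, and the `build_task` and `stratified_split` helpers are hypothetical names, and the split fractions (30% for testing, then 10% of the remainder for validation) are an assumption chosen to roughly follow the proportions given in the text.

```python
import random
from collections import defaultdict

# Original annotation labels (the string values are illustrative).
POS, NEG, NEU, TECH = "positive", "negative", "neutral", "technical_issue"

# Label mapping per task; None means the review is not used in that task.
TASK_MAPS = {
    "PNT": {POS: POS, NEG: NEG, TECH: TECH, NEU: None},  # Task 1
    "PN": {POS: POS, NEG: NEG, TECH: NEG, NEU: None},    # Task 2
    "PNN": {POS: POS, NEG: NEG, TECH: NEG, NEU: NEU},    # Task 3
}


def build_task(samples, task):
    """Relabel (text, label) pairs for one task, dropping unused classes."""
    mapping = TASK_MAPS[task]
    return [(text, mapping[label]) for text, label in samples
            if mapping[label] is not None]


def stratified_split(samples, test_frac=0.30, dev_frac_of_rest=0.10, seed=0):
    """Split class by class so every subset keeps similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item in samples:
        by_class[item[1]].append(item)
    train, dev, test = [], [], []
    for label in sorted(by_class):
        items = by_class[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_frac)
        n_dev = round((len(items) - n_test) * dev_frac_of_rest)
        test += items[:n_test]
        dev += items[n_test:n_test + n_dev]
        train += items[n_test + n_dev:]
    return train, dev, test
```

Because the split is computed after relabeling, each task gets its own training, validation, and test sizes, which matches the statement that the split is performed for each task separately.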
While dividing the dataset, we used stratified sampling to maintain the class distribution across the different sets. The data split is performed for each task separately, which results in a different number of samples in the training, validation, and test sets of each task. The data split for each task will be made publicly available, separately, to ensure a fair comparison in future work. Table III summarizes the distribution of the data into training, validation, and test sets for each task.
[Table III: columns are Class labels, Train, Dev, Test, and Total.]
[Figure 6 note: 60 reviews have more than 120 tokens and are not shown in the figure.]
Before proceeding with the experiments, the data is also cleaned by removing unnecessary tokens, such as non-ASCII characters, punctuation (replaced with whitespace), and other signs. For this study, our classification experiments consist of multiclass classification using both classical and deep learning algorithms, as detailed below. 1) Classical Algorithms: For this study, we used several classical algorithms, namely Multinomial Naive Bayes (MNB) [31], SVM [32], and Random Forest (RF) [33]. As a feature representation for these algorithms, we used bag-of-n-grams, which is one of the most commonly used representations for text classification and retrieval with classical algorithms. It has been widely used as a simple, yet effective and computationally efficient method. Motivated by its good performance in similar text classification applications, such as fake news and flood detection in Twitter text [17], [34], [35], we experimented with this representation using the mentioned classical algorithms. 2) fastText: fastText is an NLP library aiming at efficient word embedding and text classification at a higher speed than traditional deep learning solutions [36].
For word embedding, the model relies on Continuous Bag of Words (CBOW), which is based on a shallow Neural Network (NN), strategy by predicting a word via its neighbors. To ensure training at a higher speed, the model relies on a hierarchical classification mechanism by replacing the traditional soft-max function with a hierarchical one resulting in a reduced number of parameters. 3) Transformers: BERT [13] is a state-of-the-art pre-trained model, which has shown its success in many downstream NLP tasks. It is typically used for downstream classification problems either by using embedding representations as features or fine-tuning the model. The main strength of the model comes from pre-training on a very large text dataset that allows the model to understand and interpret text easily in different NLP applications. Moreover, the model also possesses the ability to learn from context. For this study, we use different transformer models, which include BERT [13] , RoBERTa [37] , XLM-RoBERTa [38] and DistilBERT [39] . To measure the performance of each classifier, we use weighted average precision (P), recall (R), F1-measure (F1). We used weighted metric as it has the capability to take into account the class imbalance distribution. To train the classifiers using the MNB, SVM, and RF we converted the text into bag-of-n-gram vectors weighted with logarithmic term frequencies (tf) multiplied with inverse document frequencies (idf). To utilize contextual information, such as n-grams which are useful for classification, we extracted unigram, bigram, and tri-gram features. We used grid-search to optimize the parameters for MNB, SVM, and RF. For the MNB, we optimize laplace smoothing α parameter with 20 values between 0 and 1. For the SVM, we optimize linear kernel with C parameters with 30 values ranges from 0.00001 to 10, and radial-basis-function kernel with C and γ parameters (for γ we use 10 values from 1e-5 to 1e-1). 
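The n-gram weighting and the weighted evaluation metrics described above can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: whitespace tokenization, the exact weighting formula `(1 + ln tf) * ln(N / df)`, and the helper names `ngrams`, `tfidf_vectors`, and `weighted_prf` are assumptions made for illustration, and toolkits typically apply additional smoothing and normalization.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=3):
    """Return all uni-, bi-, and tri-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n_max=3):
    """Map each document to a sparse {ngram: weight} dictionary."""
    counts = [Counter(ngrams(doc.lower().split(), n_max)) for doc in docs]
    df = Counter(term for c in counts for term in c)  # document frequency
    n_docs = len(docs)
    # Logarithmic term frequency multiplied by inverse document frequency.
    return [{t: (1.0 + math.log(tf)) * math.log(n_docs / df[t])
             for t, tf in c.items()} for c in counts]


def weighted_prf(y_true, y_pred):
    """Support-weighted average of per-class precision, recall, and F1."""
    support = Counter(y_true)
    total = len(y_true)
    p_w = r_w = f_w = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        prec = tp / n_pred if n_pred else 0.0
        rec = tp / n
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = n / total  # class share in the gold labels
        p_w += weight * prec
        r_w += weight * rec
        f_w += weight * f1
    return p_w, r_w, f_w
```

Weighting each class by its share of the gold labels means that a frequent class, such as the positive class, contributes more to the reported score than a rare one, which is why class-wise results are still worth inspecting alongside the weighted averages.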
For RF, we optimize the number of trees (10 values from 200 to 2000) and the depth of the trees (11 values from 10 to 110). The choice of these ranges depends on the available computational resources, as the searches are computationally expensive. For fastText, we use pre-trained embeddings trained on Common Crawl and the default hyperparameter settings available with the fastText toolkit. For transformer-based models, we use the Transformer Toolkit [40]. We fine-tune each model using the below hyperparameter settings with a task-specific layer on top of the model. As reported in [13], training with the pre-trained transformer models shows instability; hence, we do 10 runs of each experiment using different random seeds and choose the model that performs best on the development set. To train the transformer-based models for each task, we fine-tune the model for 10 epochs with categorical cross-entropy as the loss function and use the following hyperparameter settings.

Table IV provides the experimental results on task 1 in terms of weighted accuracy, precision, recall, and F1-Score. Overall, better results are obtained with the transformers compared to the classical and deep learning based methods. One of the main reasons for the better performance of the transformers is their text interpretation capabilities. Though no significant differences have been observed among the performances of the different transformers, a slight improvement is observed for RoBERTa over the rest of the transformers. To better analyze the performance of the proposed methods, we also provide class-wise performance. Overall, reasonable results are obtained on all three classes; however, the performance of all the methods is higher on the positive class. One of the possible reasons for the comparatively lower performance on the other two classes is the lower inter-class variation.
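The class-wise and weighted-average scores used in the result tables can be computed as in the following minimal sketch. The function names and the toy gold/predicted labels are assumptions for illustration and do not reproduce any figure from the tables.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each gold label."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        support = sum(t == label for t in y_true)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1, support)
    return scores


def weighted_average(scores):
    """Average precision, recall, and F1, weighting each class by its support."""
    total = sum(s[3] for s in scores.values())
    return tuple(sum(s[i] * s[3] for s in scores.values()) / total
                 for i in range(3))


# Invented toy labels, not results from the paper.
y_true = ["pos", "pos", "pos", "pos", "neg", "neg", "tech", "tech"]
y_pred = ["pos", "pos", "pos", "neg", "neg", "tech", "tech", "tech"]
scores = per_class_scores(y_true, y_pred)
weighted_p, weighted_r, weighted_f1 = weighted_average(scores)
```

Weighting each class by its support means that frequent classes contribute more to the average, which is why the weighted scores account for the class imbalance, while the per-class tuples expose the differences between classes that the average hides.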
As detailed earlier, reviews in the negative and technical issues classes contain similar types of words, so there is a higher chance of confusion between these classes. The experimental results of task 1 provide the basis for task 2, where the negative and technical issues classes are merged.

Table V provides the experimental results on task 2, where the models have to differentiate between positive and negative reviews. As expected, the performance improves significantly on task 2, which supports our hypothesis that the negative and technical issues classes have similar content. Moreover, similar to task 1, the transformers outperform the rest of the methods. As can be seen in the table, in contrast to task 1, no significant differences can be observed in the performance of the methods on the different classes, which indicates that reviews highlighting technical problems in the applications evoke negative emotions/sentiments. Moreover, no significant variation in the performance of the methods on a particular class has been observed.

Table VI provides the experimental results on task 3, where the models have to differentiate among positive, negative, and neutral reviews. Similar to the previous two tasks, the transformers produce better results compared to the classical and deep learning based methods. As can be seen in the table, better results are reported for all the methods on the positive and negative classes. However, the performance of the proposed methods is significantly lower on the neutral class, especially for the bag-of-words and n-gram features with the Naive Bayes classifier. One of the main reasons for the lower performance on the neutral class is the small number of samples in the class, as described in Section IV. We note that task 2 and task 3 are performed separately to analyze the impact of the fewer samples in the neutral class.

Contact tracing of COVID-19 patients has been globally recognized as one of the most effective ways of controlling the infection rate.
However, there are several limitations of the existing mechanisms. Manual contact tracing is a tedious and time-consuming process. Moreover, it is difficult to keep track of all potential contacts of a patient. Digital solutions, such as mobile applications, have been considered a promising alternative in which a patient's contacts can be traced and informed quickly. However, there are several concerns over the working mechanism and performance of the applications. This work has revealed different facets of the COVID-19 contact tracing applications, including their advantages, drawbacks, and users' concerns over these applications. We summarize the main points below.
• The idea/initiative of contact tracing via a mobile application is highly appreciated by people worldwide. Besides contact tracing, the applications have also proved useful in implementing and ensuring public policies on COVID-19. However, there are also some concerns over the working mechanism and the effectiveness of the applications.
• Analysis of users' reviews on these applications helps to better understand and rectify the concerns over the applications.
• The majority of the reviews lie in three categories, namely positive, negative, and technical issue. On the other hand, very few neutral reviews are observed.
• Privacy, in terms of tracking via GPS and access to the gallery and other information by the applications, has been the main concern. Moreover, a vast majority of the users of these applications in different parts of the world are unhappy with the high power consumption of the applications.
• The majority of the users also faced some technical problems while using the applications. Some key technical issues include device compatibility, registration, slow updates, connectivity issues, and lack of support for some languages, e.g., English.
• The distribution of negative, positive, neutral, and technical issues may vary over time.
• Overall, the AI models perform well in sentiment analysis of users' reviews, allowing users' responses to the applications to be analyzed efficiently and quickly.
• The transformers have proved more effective than the other models deployed for sentiment analysis in this work.
• The experimental results indicate that reviews highlighting technical problems in the applications evoke negative emotions/sentiments.

In this paper, we focused on the sentiment analysis of users' reviews on the COVID-19 contact tracing mobile applications and analyzed how users react to these applications. To this aim, we proposed a pipeline composed of multiple phases, such as data collection, annotation via a crowd-sourcing activity, and the development, training, and evaluation of AI models for sentiment analysis. The existing literature mostly relies on manual/exploratory analysis of users' reviews on the applications, which is a tedious and time-consuming process. Moreover, the existing studies generally analyze data from fewer applications. In this work, we showed how automatic sentiment analysis can help in analyzing users' responses to the applications more quickly. Moreover, we also provided a large-scale benchmark dataset composed of 34,534 reviews from 47 different applications. We believe the presented analysis and the dataset will support future research on the topic. We also believe that many interesting applications and analyses can be built using the dataset as a baseline. Temporal and topical analyses are key aspects to be explored in the future.

[1] Effectiveness of isolation, testing, contact tracing, and physical distancing on reducing transmission of SARS-CoV-2 in different settings: a mathematical modelling study. The Lancet Infectious Diseases
[2] Contact tracing apps promised big and didn't deliver
[3] Sentiment analysis of user feedback on the HSE contact tracing app
[4] Contact tracing: Can 'big tech' come to the rescue, and if so
[5] Effective configurations of a digital contact tracing app: A report to NHSX
[6] COVID-19 digital contact tracing applications and techniques: A review post initial deployments
[7] Ulf Buermeyer, and Hannah Zillessen. COVID-19 contact tracing and data protection can go together
[8] Wetrace-a privacy-preserving mobile COVID-19 tracing approach and application
[9] Acceptability of app-based contact tracing for COVID-19: Cross-country survey study
[10] Mining user reviews of COVID contact-tracing apps: An exploratory analysis of nine European apps
[11] Imagenet classification with deep convolutional neural networks
[12] Bag of tricks for efficient text classification
[13] Pre-training of deep bidirectional transformers for language understanding
[14] Leveraging data science to combat COVID-19: A comprehensive review
[15] Early outbreak detection for proactive crisis management using twitter data: COVID-19 a case study in the US
[16] Collaborative federated learning for healthcare: Multi-modal COVID-19 diagnosis at the edge
[17] Fake news detection in social media using graph neural networks and NLP Techniques: A COVID-19 use-case
[18] Model generalization on COVID-19 fake news detection
[19] COVID-19 contact tracing in two counties-North Carolina
[20] Integrating emerging technologies into COVID-19 contact tracing: Opportunities, challenges and pitfalls
[21] Applications of machine learning and artificial intelligence for COVID-19 (SARS-CoV-2) pandemic: A review
[22] A survey of COVID-19 contact tracing apps
[23] Inherent privacy limitations of decentralized contact tracing apps
[24] Privacy-preserving contact tracing of COVID-19 patients
[25] Cryptography made simple
[26] COVID-19 contact-tracing apps: A survey on the global deployment and challenges
[27] Contact tracing mobile apps for COVID-19: Privacy considerations and related tradeoffs
[28] A national survey of attitudes to COVID-19 digital contact tracing in the Republic of Ireland
[29] Political polarization on twitter
[30] A multi-platform Arabic news comment dataset for offensive language detection
[31] The optimality of naive bayes
[32] Support vector machines. IEEE Intelligent Systems and their applications
[33] Random forests. Machine learning
[34] Natural disasters detection in social media and satellite imagery: a survey
[35] Fake news detection in multiple platforms and languages
[36] Bag of tricks for efficient text classification
[37] A robustly optimized bert pretraining approach
[38] Unsupervised cross-lingual representation learning at scale
[39] DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter
[40] Huggingface's transformers: State-of-the-art natural language processing