key: cord-0155062-qzq84cfw
title: A Survey of COVID-19 Misinformation: Datasets, Detection Techniques and Open Issues
authors: Ullah, A. R. Sana; Das, Anupam; Das, Anik; Kabir, Muhammad Ashad; Shu, Kai
date: 2021-10-02
journal: nan
DOI: nan
sha: 36782acf14884d64c8035465c9df5ee330d01c68
doc_id: 155062
cord_uid: qzq84cfw

Misinformation during pandemic situations like COVID-19 grows rapidly on social media and other platforms. This expeditious growth of misinformation creates adverse effects on people living in society. Researchers are trying their best to mitigate this problem using different approaches based on Machine Learning (ML), Deep Learning (DL), and Natural Language Processing (NLP). This survey studies the different approaches to COVID-19 misinformation detection in recent literature to help researchers in this domain. More specifically, we review the methods used for COVID-19 misinformation detection, with an overview of data pre-processing and feature extraction methods, to provide a better understanding of this body of work. We also summarize the existing datasets that can be used for further research. Finally, we discuss the limitations of the existing methods and highlight some potential future research directions along this dimension to combat the spread of misinformation during a pandemic.

Coronavirus Disease 2019 (COVID-19) is an infectious disease caused by a newly discovered virus, SARS-Coronavirus-2 (SARS-CoV-2), which is closely related to the SARS virus [1]. The disease was first identified in Wuhan, China, and the first case of COVID-19 was reported on December 31, 2019 [2]. In the beginning, it was recognized as a global health issue; it was later declared a pandemic by the World Health Organization (WHO). By October 23, 2021, the total case count had crossed 243.8 million, and over 4.9 million people had already lost their lives worldwide [3]. Owing to the severity of the virus, the sharp increase in infection and mortality has had a massive impact on several sectors, such as national economies, public and private sectors, and government bodies, and above all has affected the mental and physical health of people by upending their everyday lives. During this devastating pandemic, people worldwide are going through an unprecedented set of challenges and fears. They frequently seek information about COVID-19 solutions, e.g., medicines, vaccines, and mask usage, or about COVID-19 dangers, on various online platforms and in different languages. Along with factual information, a large amount of misinformation related to COVID-19 is circulating through these platforms. Consequently, an 'infodemic' of rumors and misinformation related to the virus came to the surface. The term 'infodemic' was first coined by the World Health Organization (WHO) to describe an overabundance of both inaccurate and accurate information, which makes it harder for people to find trustworthy and reliable sources for any claim made on any online platform during the pandemic [4, 5]. Misinformation is false or inaccurate information that is often intentionally created to attract attention. There are numerous terms related to misinformation, including fake news, misleading news, rumors, and disinformation, all of which usually contain information that misguides people.
During the COVID-19 situation, there has been expeditious growth in the usage of social media platforms and blogging websites, which has passed the 3.8 billion mark of active users [6]. People are getting more involved in these platforms, especially Facebook, Twitter, and Instagram, and expressing their thoughts, news, opinions, and information related to COVID-19. They gather information about COVID-19 from news media or social media platforms and share it with others without fact-checking it. As a result, it is causing panic among users of these platforms and affecting people's mental health, daily lives, and behaviors. As people's activity on social media and other online platforms has increased significantly in this pandemic situation, the misinformation provided on these platforms easily misleads people. It is therefore now a global concern to mitigate the spread of misinformation related to COVID-19 on these platforms. The problem has already gained a great deal of attention from researchers all around the world, and a significant number of research works have been done. So, it is necessary to investigate the existing studies, their findings, and their significance. Here, we present a systematic review of various misinformation detection approaches related to COVID-19 and discuss promising research directions. In particular, we make the following contributions in this survey paper.

• We have conducted a systematic review of existing studies on COVID-19 misinformation detection using ML techniques.
• We have performed qualitative and quantitative analysis of the selected papers by considering the datasets, pre-processing techniques, feature extraction, and classification methods.
• Finally, we discuss the open issues and the future research directions that can help researchers who are willing to work in this domain.

The rest of the paper is organized as follows. Section 2 provides an overview of different traditional ML and DL methods. Section 3 presents our methodology for searching databases along with the selection criteria for the articles. Section 4 outlines different datasets for COVID-19 misinformation and presents an analysis of the various pre-processing, feature extraction, and classification methods used in state-of-the-art research. Section 5 discusses open issues and future research directions. Finally, Section 6 concludes the paper.

ML is a subset of artificial intelligence whose main aim is to train machines, using algorithms grounded in statistical patterns, to make human-like decisions. An ML model identifies patterns in data points based on mathematical relations and predicts new data points in a similar way. In this section, we discuss the theoretical concepts of the ML methods that are related to our study. The ML methods are discussed in two separate subsections, Traditional ML methods and DL methods, in the following.

Traditional ML methods are the techniques that have been in use for years and form the base for cutting-edge ML methods. These algorithms learn from data in which the input features fed to the chosen algorithm are crafted by subject-matter experts, and they expect all inputs to be in a structured format, such as numbers. Traditional ML models can be used to solve classification [7], regression [8], clustering [9], and other problems. Here we describe the traditional ML methods that we have explored in our study.
Logistic regression (LR) [10] is a statistical model based on the sigmoid or logistic function. It is a probability-based predictive algorithm: its S-shaped curve takes any real-valued number as input and maps the output to a value between 0 and 1. The logistic regression hypothesis function is defined in [11], where $\beta_0$ is the bias or intercept term and $\beta_1$ is the coefficient for the single input value $X$. This can be written as Equation 1:

$P(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$ (1)

Support vector machine (SVM) [12] is one of the most popular and widely used algorithms for classification problems in many research areas [13]. It creates a hyperplane or a set of hyperplanes in a high-dimensional space to classify the data points based on the feature set [12]. The dimension of the hyperplane depends on the number of features: when the number of input features is two, the hyperplane is a single line, whereas it becomes a two-dimensional plane when the number of input features is three. Since many hyperplanes could exist in an N-dimensional space, the main objective is to identify the optimal hyperplane that separates the data points with the maximum margin (the maximum distance between the data points of both classes). The hyperplanes are decision boundaries that help to classify the data points. The cost function for the soft-margin SVM model [14] is written as Equation 2:

$J(w) = \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \max\left(0,\, 1 - y_i (w \cdot x_i + b)\right)$ (2)

Naive Bayes (NB) [15] is a simple probabilistic model based on Bayes' theorem with a strong independence assumption. It is the simplest form of Bayesian network, involving a conditional independence assumption. Bayes' theorem can be formulated as Equation 3:

$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$ (3)

where A and B are events and P(A) and P(B) are the probabilities of those events.

The k-nearest neighbors (kNN) algorithm [16] is one of the simplest algorithms, in which a data point is classified based on its nearest data points. It is a non-parametric method used for classification and regression [17]. In this method, the Euclidean distance between each test datum and all the training data is calculated, and the test datum is assigned to the class held by the majority of its k nearest training data. For instance, if k = 1, the new data point is assigned to the class of the single closest training datum; if k > 1, the data point is assigned to the majority class among its k neighbors. The Euclidean distance between two points can be calculated according to Equation 4:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ (4)

Decision tree (DT) [18] is a supervised learning algorithm that can be used for both classification and regression problems. A tree is composed of nodes, each of which represents a data feature, and the relations between them represent the decision rules. The leaves of the tree represent outcomes, and the values can be categorical or continuous. Designing a decision tree requires the choice of an attribute selection measure and a pruning method. There are numerous strategies for selecting attributes, and the majority of them directly assign a quality measure to the attribute. The Information Gain Ratio criterion [19] and the Gini index [20] are the most frequently used attribute selection measures in decision trees.

Random forest (RF) [21] is an ensemble learning method that uses a combination of tree classifiers to perform classification, regression, and other tasks. A random vector, sampled independently from the input vector, is generated for each classifier, and each tree casts a vote for the most common class; this supports the classification of an input vector [22]. A tree is therefore built from a combination of features, or randomly selected features, at each node. The bagging method, which generates a training data set by randomly drawing, with replacement, N examples, where N is the size of the original training set [23], is used for each feature or feature combination selected. As an attribute selection measure, the random forest classifier uses the Gini index, which measures the impurity of an attribute with respect to the classes.
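To make the preceding descriptions concrete, the sketch below trains the six traditional classifiers discussed above on TF-IDF features using scikit-learn. It is a minimal illustration: the four-sentence dataset, labels, and parameter choices are our own assumptions and are not taken from any surveyed paper.

```python
# Illustrative comparison of the six traditional ML classifiers described
# above on a tiny, made-up misinformation-style text dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

texts = [
    "drinking bleach cures covid",          # misinformation
    "garlic protects you from the virus",   # misinformation
    "vaccines reduce severe illness",       # factual
    "masks lower transmission indoors",     # factual
]
labels = [1, 1, 0, 0]  # 1 = misinformation, 0 = factual

# TF-IDF features, as commonly used with these classifiers.
X = TfidfVectorizer().fit_transform(texts)

classifiers = {
    "LR": LogisticRegression(),
    "SVM": SVC(kernel="linear"),
    "NB": MultinomialNB(),
    "kNN": KNeighborsClassifier(n_neighbors=1),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(n_estimators=100),
}
for name, clf in classifiers.items():
    clf.fit(X, labels)
    print(name, clf.predict(X))  # training-set predictions, for illustration only
```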
eXtreme gradient boosting (XGBoost) [24] is an implementation of optimized gradient-boosted decision trees. It offers relatively fast computation across computing environments, and it dominates on tabular and structured datasets in classification and regression predictive modeling problems. The algorithm is widely used for its performance in modeling new attributes and classifying labels. The evolution of XGBoost started with an approach centered on the decision tree, where a decision is computed based on certain conditions. Sometimes a single model is not enough to obtain the correct output. In such cases, a systematic solution from ensemble learning can combine the predictive power of multiple learners and produce an aggregated output from several models. Bagging is one of the widely used ensemble learners: it randomly selects features and constructs a forest, an aggregation of decision trees. Model efficiency was further enhanced by reducing errors through sequential model building, with the gradient descent algorithm employed to reduce the errors of the sequential model. Finally, the XGBoost algorithm came to be recognized as a convenient approach that handles missing values and minimizes overfitting while exploiting parallel processing. It helps prevent overfitting by supporting both L1 and L2 regularization [25].
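The following is a minimal XGBoost sketch in the same spirit, assuming the xgboost Python package; the synthetic data and all hyperparameter values are illustrative, with reg_alpha and reg_lambda corresponding to the L1 and L2 regularization terms mentioned above.

```python
# Minimal XGBoost sketch on synthetic data; reg_alpha / reg_lambda are the
# L1 and L2 regularization terms mentioned above. All values are illustrative.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # 200 synthetic samples, 10 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary label

model = XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    reg_alpha=0.1,    # L1 regularization
    reg_lambda=1.0,   # L2 regularization
)
model.fit(X, y)
print(model.predict(X[:5]))
```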
Deep learning (DL) is one of the most widely explored research topics in ML; the term was first introduced by Rina Dechter in 1986 [26]. DL is an emerging technology that is being used in numerous applications such as Computer Vision [27, 28]. DL models are built on neural networks (NNs), which are composed of layers of interconnected neurons. An activation function simply decides whether a neuron will be activated or not. A hidden layer of a neural network performs the necessary transformation of the data so that it can be used by the output layer. Finally, there is an output layer that is responsible for generating the final result. Figure 1 represents the architecture of a simple NN.

Convolutional neural network (CNN) is a class of deep neural networks (DNNs) designed to process data that has a grid-like structure, such as an image. It was first introduced in the 1980s for document recognition tasks [34]. A CNN architecture consists of several layers, such as an input layer, a convolutional layer with multiple filters/kernels, a pooling layer, and a fully connected layer. A basic CNN architecture is illustrated in Figure 2 [35]. The different layers of a CNN model are outlined as follows:

i) Input layer: An input layer takes the text inputs and transforms them into a matrix form known as a word embedding vector. In this layer, each word of a text sequence is transformed into a dense vector of fixed size.

ii) Convolutional layer: A convolutional layer comprises several filters that perform the convolution operation on their input. A convolution is a mathematical operation that takes two inputs, such as the input layer matrix and a convolution filter or kernel. A dot product is taken between the filter and parts of the input matrix matching the filter size by sliding the convolution filter over the input layer matrix. After the input layer matrix passes through this layer, a feature map with a single column is obtained as output. Once the convolution operation is complete, an output matrix of features is obtained through an activation function (e.g., tanh) after the addition of a bias value.

iii) Pooling layer: A pooling layer performs dimensionality reduction of its input feature vectors. It applies sub-sampling to the output vectors of the convolutional layer, combining neighboring elements. Different types of pooling are used in this layer, such as max pooling and average pooling; max pooling is the most common approach. In max pooling, the largest value is taken from the feature map produced by the convolutional layer.

iv) Fully connected (FC) layer: FC layers are the last layers of a CNN architecture. They can consist of one or more layers and are placed after the pooling layers. The output of the pooling layer is flattened before being fed into the FC layers. Different activation functions, e.g., softmax or sigmoid, are used to decide the final output of this layer. In particular, the sigmoid activation function is used in binary classification tasks, whereas the softmax activation function is generally used for multi-class classification problems.

Recurrent neural network (RNN) is a class of artificial neural networks (ANNs) that employs the sequential information in the network, which is important in applications where the embedded structure of the data sequence conveys useful knowledge [36]. In an RNN architecture, the output from the previous step is fed as input to the current step. In traditional neural networks, we presume that all inputs are independent of each other, but there are many cases where this assumption is quite impractical. If one wants to predict the next word of a given sentence, it is necessary to know its previous words. Consequently, RNNs came into existence. An RNN performs the same operation for each word of a given sentence, one after another, taking into account the previous information, and it can remember information about a sequence over time. When working with an RNN, a sentence is considered a sequence of words. At each timestamp, only one word is fed as input to the RNN, and this continues until the whole sequence is finished. After the whole sequence has been fed in, a corresponding output is produced at the end of the RNN model. Figure 3 illustrates a basic RNN architecture. At time t, if $x_t$ is given as input, we can compute the hidden state $h_t$ as in Equation 5:

$h_t = \tanh(U_h x_t + V_h h_{t-1} + b_h)$ (5)

where tanh is an activation function, $U_h$ and $V_h$ are the weight matrices applied to the current input $x_t$ and the previous hidden state, $h_t$ and $h_{t-1}$ stand for the current and previous hidden states respectively, and $b_h$ is the corresponding bias value. The output $y_t$ is finally obtained by Equation 6:

$y_t = \mathrm{softmax}(W_y h_t)$ (6)

where softmax is an activation function and $W_y$ represents the weight matrix of the output layer.
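To ground Equations 5 and 6, the sketch below implements a single forward pass of a basic RNN cell in NumPy. The dimensions, random weights, and input sequence are illustrative assumptions rather than settings from any surveyed model.

```python
# Forward pass of a basic RNN cell implementing Equations 5 and 6.
# Dimensions and random weights are illustrative, not from any surveyed model.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 8, 16, 2          # embedding, hidden, and output sizes
U_h = rng.normal(size=(d_hid, d_in))   # weights for the current input x_t
V_h = rng.normal(size=(d_hid, d_hid))  # weights for the previous hidden state
b_h = np.zeros(d_hid)
W_y = rng.normal(size=(d_out, d_hid))  # output-layer weights

sequence = [rng.normal(size=d_in) for _ in range(5)]  # 5 toy word vectors
h_t = np.zeros(d_hid)                                 # initial hidden state
for x_t in sequence:
    h_t = np.tanh(U_h @ x_t + V_h @ h_t + b_h)        # Equation 5
y_t = softmax(W_y @ h_t)                              # Equation 6
print(y_t)  # class probabilities after the whole sequence
```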
Problems that require learning long-term dependencies can be difficult to solve using basic RNNs, because a basic RNN generally suffers from the vanishing gradient problem. Two variants of RNN, LSTM [37] and GRU [38], are capable of learning long-term dependencies and can mitigate this problem.

Bidirectional encoder representations from transformers (BERT) is a transformer-based DL model that is widely used for NLP tasks. BERT is a recent technique designed for unsupervised pre-training of deep bidirectional representations from texts [39]. BERT is an encoder-only transformer that can represent any token based on its bidirectional property: it can capture both left and right contexts in all layers, as it is deeply bidirectional. BERT has been pre-trained on a large corpus of unlabeled data, including Wikipedia and the Book Corpus. Its pre-trained models are available in two sizes, BERT_BASE and BERT_LARGE, where BERT_LARGE uses a larger number of parameters than BERT_BASE. Figure 4 shows a BERT architecture for a text classification task. BERT uses an input representation by which it can represent both a single sentence and a pair of sentences in one token sequence. The first token of every input sequence is a special classification token ([CLS]). BERT has been pre-trained in an unsupervised way using two tasks, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In MLM, some of the tokens in a given sequence are masked, and the objective is to predict only those masked tokens. In the NSP task, for a given pair of sentences, the model predicts whether the second sentence logically follows the first one. BERT has inspired many variants, such as XLNet [40], RoBERTa [41], and ALBERT [42]. The XLNet architecture uses the bidirectional context of BERT but does not perform masking; it combines the bidirectional property of BERT with the autoregressive language modeling of Transformer-XL. RoBERTa is trained on more data and for longer than BERT; it dynamically changes the masking pattern, and the NSP procedure of BERT is removed. ALBERT, on the other hand, uses parameter reduction techniques to increase training speed, and to resolve BERT's limitation regarding inter-sentence coherence, ALBERT uses sentence order prediction instead of NSP.
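As an illustration of how a pre-trained BERT model is typically fine-tuned for misinformation classification, consider the following minimal sketch using the Hugging Face Transformers library. The model name, toy data, label convention, and hyperparameters are all assumptions made for demonstration; none of them are taken from the surveyed studies.

```python
# Minimal sketch of fine-tuning a pre-trained BERT model for binary
# misinformation classification with Hugging Face Transformers.
# The model name, data, and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

texts = ["5g towers spread the virus", "washing hands reduces infection risk"]
labels = torch.tensor([1, 0])  # 1 = misinformation, 0 = factual

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few toy training steps
    outputs = model(**batch, labels=labels)  # loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    probs = torch.softmax(model(**batch).logits, dim=-1)
print(probs)  # per-class probabilities for the two toy texts
```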
We have searched for articles in several prominent databases, such as Scopus, Web of Science, and others (e.g., Google Scholar, ResearchGate, arXiv). Scopus and Web of Science are popular, authoritative databases that index papers published by IEEE, ACM, Elsevier, etc. Google Scholar and ResearchGate also provide a simple way to search broadly for scholarly literature. Furthermore, we have searched the arXiv repository to find preprints of papers that have not been published yet. We have used a query string/keyword-based searching method in our study. Our query strings/keywords target studies related to COVID-19 misinformation, fake news, rumours, and misleading information that have used detection, classification, and clustering techniques based on ML. The search keywords and query strings are shown in Table 1. For the selection of papers for our systematic review, we have developed eligibility criteria, which include:

• The research articles must be focused on the detection or classification of COVID-19 misinformation.
• The subject matter of this study must appear in the title, abstract, or keywords of the article.
• The articles containing classification models must include a performance evaluation of the adopted methods in terms of evaluation metrics, e.g., accuracy, precision, recall, F1 score, etc.
• The research article must be written in English.

Figure 5: Diagram of the systematic selection, evaluation, and quality control of the articles using the PRISMA model

The systematic selection process of the articles for our research is illustrated in Figure 5. A total of 260 papers were found in the "Identification" phase of our study by searching the databases using query strings/keywords. After removing 38 duplicate articles, the remaining 222 articles were screened by title and abstract in the "Screening" phase. In this phase, the articles were further filtered using the eligibility criteria, and 133 articles were removed. In the "Eligibility" phase, the full texts of the remaining 89 articles were studied for final selection; 42 articles were excluded in this phase for lacking relevant outcomes, results, or evaluation metrics. Finally, in the "Included" phase, 47 papers were retained for our survey.

In this section, we discuss the datasets and the different pre-processing, feature extraction, and classification methods used in the existing literature on COVID-19 misinformation detection, along with their evaluation results. Relevant and sufficient training data is the basis for achieving precise results from any ML-based misinformation detection system. To perform the misinformation classification task, data from various platforms such as social media, news websites, fact-checking sites, and government or other well-recognized authentic websites are frequently used. However, manually determining the veracity of news is a challenging task, usually requiring annotators with domain expertise who perform careful analysis of claims and of additional evidence, context, and reports from authoritative sources. Therefore, to facilitate future research related to the COVID-19 misinformation task, the datasets existing in the literature are discussed next.

The datasets used in the papers [43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60] are labeled in two or more classes to classify COVID-19 misinformation using ML algorithms. The papers [43, 45, 46, 48, 50, 51, 53, 54, 55, 56, 58, 59, 60] use data mainly collected from the Twitter platform via the Twitter APIs. On the other hand, paper [44] utilizes data collected from various Chinese rumor-refuting platforms, paper [47] uses data collected from different fact-checking websites after obtaining references from Poynter and Snopes, paper [49] makes use of data scraped via the Webhose.io API from various news and blog sites around the world, paper [52] uses data collected from the IFCN Poynter website, and paper [57] uses its own annotated benchmark dataset. The datasets [66, 58] contain tweets only in the Arabic language, while the datasets [43, 45] contain data in two different languages (e.g., English or Arabic) separately. The dataset [68] contains microblogs related to COVID-19 in the Chinese language. Some studies [46, 47, 59] also introduced multilingual datasets containing data in multiple languages. Among the datasets that have already been used in existing works, the datasets [44, 49, 52, 54, 55, 56] have not been made publicly available. To encourage future research on COVID-19 misinformation detection, we have collected some more datasets that are vast in size. All of these datasets [71, 72, 73, 74, 75, 76] are publicly available for use, and their data are collected from the Twitter platform.
The datasets [72, 73, 75, 76] are multilingual, while the other two datasets [71, 74] are monolingual, containing data in English and Arabic respectively. After some modifications and proper annotation, future research in this domain can be conducted by utilizing these datasets.

Data pre-processing is one of the significant steps before feeding data into any ML algorithm. Data pre-processing includes data cleaning, normalization, transformation, feature extraction & selection, etc. This step aims to facilitate data manipulation, reduce the memory space needed, and shorten the processing of huge amounts of data. Some commonly used pre-processing techniques are described in the following.

Tokenization splits the text data into smaller parts known as tokens and removes all punctuation from the textual data [77]. The tokenization process can also convert text into lowercase or uppercase.

Stop words are the most common words in a language; they do not provide much context and hold little useful information, serving mainly to give sentences their structure. Stop words are mainly articles, prepositions, conjunctions, and some pronouns, for example: an, are, as, of, on, or, that, the, these, this, too, was, what, when, where, who, will, etc.

Stemming is a technique used to convert words to their grammatical roots so that they can be represented by a single term [78]. For example, the words "Logically", "Logic", and "Logicality" can all be reduced to the word "Logic". Stemming makes classification faster and more efficient, as it reduces the input dimensionality, which in turn improves the chances of achieving better accuracy.

Feature extraction is the process of selecting relevant features without losing any important information. In text categorization, a document generally consists of a large number of words and phrases, which creates a high computational burden in the learning process, and it is difficult to learn from high-dimensional data. Moreover, a classifier's accuracy can decrease when irrelevant features are included, while selecting relevant and important features can help speed up the learning process. We have found different feature extraction methods in our study, which are listed in Table 2 and described in the following subsections.

PCA is a dimensionality reduction technique whose goal is to produce lower-dimensional feature sets from the original dataset. In PCA, it is very important to determine the number of principal components: if p is the number of principal components to be chosen among all the components, those p components should represent the data as faithfully as possible.

ICA is a linear transformation method in which the desired representation is the one that minimizes the statistical dependence of the components of the representation. It does not focus on the mutual orthogonality of the components or on the variance among the data points.

The BoW model is a simplifying representation used in NLP and information retrieval (IR), where a text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The occurrence of each word is taken as a feature.

The main idea of TF-IDF comes from the theory of language modeling, where the terms in a given document can be divided into two categories: those words with eliteness and those without [79]. TF-IDF is measured by multiplying two metrics, where one represents how many times a word appears in a document and the other represents the inverse document frequency of the word across a set of documents.
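The following sketch strings the pre-processing steps above (case-folding, tokenization, stop-word removal, and stemming) together with TF-IDF feature extraction, using NLTK and scikit-learn. The sample documents are made up, and the simple regex tokenizer stands in for a full tokenizer.

```python
# Illustrative pre-processing (case-folding, tokenization, stop-word removal,
# stemming) followed by TF-IDF feature extraction; the sample texts are made up.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenize + case-fold,
                                                          # dropping punctuation
    tokens = [t for t in tokens if t not in stop_words]   # remove stop words
    return " ".join(stemmer.stem(t) for t in tokens)      # stem to roots

docs = [
    "Drinking hot water kills the coronavirus!",
    "Health authorities recommend washing hands frequently.",
]
cleaned = [preprocess(d) for d in docs]
X = TfidfVectorizer().fit_transform(cleaned)   # TF-IDF feature matrix
print(X.shape)
```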
TF-IDF is measured by multiplying two metrics where one represents how many times a word appears in a document and the other represents the inverse document frequency of the word across a set of documents. and organizes a piece of content as a tree. In the study [64] ,the authors used a pretrained RST parser [83] to obtain the tree for each news article. It counted each rhetorical relation within a tree and classified in a traditional statistical learning framework. ELMo stands for Embeddings from Language Models. It is a deep contextualized word representation instead of using a fixed embedding for each word developed in 2018 by AllenNLP [84] . It uses a deep, bidirectional LSTM model to create word representations. Unlike other traditional word embeddings such as xviii word2vec and GLoVe, ELMo analyses words within the context that they are used rather than a dictionary of words or their corresponding vectors. Therefore, the same word can have different word vectors under different contexts. Word2vec is a word embedding technique developed by a team of researchers led by Tomas Mikolov at Google which uses shallow neural network [85] . There are two types of Word2Vec, Skip-gram and Continuous Bag of Words (CBOW). CBOW method takes the context of each word as the input and tries to predict the word related to the context. It has better representations for more frequent words. On the other hand, the distributed representation of the input word is used to predict the context in the Skip-gram model which works well with small amount of data and is found to represent rare words well. GloVe stands for global vectors for word representation developed by Stanford as an open source project [86] . It is an unsupervised learning algorithm for generating word embeddings. Here, all the words xix are mapped into a meaningful space where the distance between words is related to semantic similarity. Training is performed on aggregated global word-word co-occurrence matrix from a corpus, and the resulting representations show interesting linear substructures of the word in vector space. In case of classification task, two types of classification strategies are commonly used by researchers i.e., Binary classification and Multi-class classification. As can be seen from Table 3 , binary classification is the mostly used classification strategy for classifying COVID-19 misinformation, rather than multi-class classification. The classification methods used in the existing works on COVID-19 misinformation detection are rexx In another study, the author used kNN based classifier method to find the truthfulness of the news shared on social media using their own collected dataset during four months of lockdown [55] . Before fitting into the classifier, they preprocessed the dataset based on the similarity news in social medias. They got a decent accuracy using this classifier. In another study, this kNN classifier was used as candidate weak-learners during the experimental phase of ensemble learning where this algorithm obtained an accuracy of 94.39% for 10 fold cross validation [54] . Cui et al. [61] performed different kinds of simple methods on their own created dataset as baselines for the comparative analysis of misinformation detection task. They used BOW features and fed the representations to a linear kernel SVM and RF classifier. For feeding into the LR model, they concatenate all the word embeddings together. 
For classification tasks, two strategies are commonly used by researchers: binary classification and multi-class classification. As can be seen from Table 3, binary classification is used far more often than multi-class classification for classifying COVID-19 misinformation. The classification methods used in the existing works on COVID-19 misinformation detection are reviewed in the following.

In one study, the author used a kNN-based classifier to determine the truthfulness of news shared on social media, using their own dataset collected during four months of lockdown [55]. Before fitting the classifier, they pre-processed the dataset based on the similarity of news across social media, and they obtained decent accuracy with this classifier. In another study, a kNN classifier was used as a candidate weak learner during the experimental phase of ensemble learning, where the algorithm obtained an accuracy of 94.39% under 10-fold cross-validation [54].

Cui et al. [61] applied different simple methods to their own created dataset as baselines for a comparative analysis of the misinformation detection task. They used BoW features and fed the representations to a linear-kernel SVM and an RF classifier; for the LR model, they concatenated all the word embeddings together. Although these models did not achieve good scores on this dataset, the comparative analysis helped establish overall model performance. In another study [64], extensive experiments were conducted on the ReCOVery dataset, which included baseline performances using either single-modal or multi-modal information from news articles for predicting news credibility, allowing future methods to be compared against them. Different methods, such as LR, NB, kNN, RF, DT, and SVM, were adopted in their experiments using LIWC and RST features. Dharawat et al. [60] performed experiments with several multi-class classification models on their own newly created benchmark dataset, "Covid-HeRA". They used RF, SVM, and LR models with BoW and 100-dimensional pre-trained GloVe embeddings and achieved very good accuracy, above 95%. In the study [57], the author experimented with their annotated benchmark dataset using four ML baselines, DT, LR, GB, and SVM, and obtained the best performance, an F1 score of 93.46%, with SVM using TF-IDF features.

Over the last few years, DL has played a vital role in misinformation detection tasks. Various DL techniques were used for misinformation classification before COVID-19, and during the COVID-19 situation DL has emerged as one of the significant technologies for building efficient systems that can detect and classify misinformation related to COVID-19. Several DL methods have been adopted in the existing research on COVID-19 misinformation detection and classification. These methods are reviewed here in detail according to the categories described in Section 2.

CNN: One study applied a CNN model to classify COVID-19 healthcare misinformation [61]. The authors used word embeddings initialized with GloVe and fed them into the CNN model. In another study, the authors deployed a CNN model using pre-trained GloVe embeddings to build a system for detecting misleading information related to COVID-19 [45]. They utilized word-level feature representations to preserve word order and were able to obtain highly accurate results. Alkhalifa et al. introduced a CNN-based classification system with different pre-processing approaches and embedding methods to classify COVID-19 rumors [48]; in this work, the best-performing model comprised a CNN with COVID-Twitter-BERT (CT-BERT) embeddings, which are pre-trained on COVID-19 Twitter data. Another study applied a CNN model with an embedding layer in front of it for the classification of fake news related to COVID-19 [49]. This study reported that low weights on the minority class cause overfitting problems; by increasing the weights of the minority class, the author was able to reduce the overfitting problem significantly and increase the test accuracy as well. Dharawat et al. introduced a dataset for health risk assessment of COVID-19 misinformation [60]. The authors also experimented with CNNs to classify the misinformation categories using both binary and multi-class classification; they implemented a CNN with multiple kernels and used pre-trained GloVe embeddings to initialize the word embeddings. Among all the studies that experimented with CNN, the study [45] reported the best performance. The TextCNN variant has been used to classify rumors [44], news credibility [64], and misinformation [56] in COVID-19 tweets. The TextCNN model uses a one-dimensional convolution layer and a max-over-time pooling layer to capture the associations between neighboring words in texts. The study [44] obtained the highest performance, with an accuracy of 98.40% and an F1 score of 97.24%, among the studies that adopted the TextCNN model.
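To illustrate the TextCNN idea (a one-dimensional convolution followed by max-over-time pooling), here is a minimal Keras sketch; the vocabulary size, sequence length, filter settings, and random toy batch are illustrative assumptions, not the configuration of any surveyed model.

```python
# Minimal TextCNN-style model: embedding -> 1-D convolution ->
# max-over-time pooling -> dense classifier. All sizes are illustrative.
import numpy as np
from tensorflow.keras import layers, models

vocab_size, seq_len = 5000, 50  # assumed settings

model = models.Sequential([
    layers.Embedding(vocab_size, 100),              # 100-dim word embeddings
    layers.Conv1D(filters=128, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),                    # max-over-time pooling
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),          # binary: misinformation or not
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy batch of already-tokenized, padded integer sequences.
X = np.random.randint(0, vocab_size, size=(8, seq_len))
y = np.random.randint(0, 2, size=(8,))
model.fit(X, y, epochs=1, verbose=0)
print(model.predict(X[:2], verbose=0))
```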
RNN: RNNs have the ability to capture richer contextual information from texts; therefore, various studies utilized RNN and its variants for the classification of COVID-19 misinformation. In particular, Chen used the TextRNN model to classify COVID-19 rumors [44]. The author used LSTM layers to implement this model, and higher accuracy was obtained in the classification results because TextRNN was able to strongly capture the relationship between semantics and context. As LSTM has the advantage over basic RNNs of learning long-term dependencies, some studies implemented the LSTM model for better classification of misinformation related to COVID-19 [49, 53, 56]. Among them, the study [49] reported the best performance, with an accuracy of 75% using LSTM.

BERT: BERT is a newer DL method that has been extensively used for NLP tasks, and several existing studies focused on BERT and its variants for classification purposes. For instance, Chen proposed a fine-grained classification method based on the pre-trained BERT model to classify COVID-19 rumors [44]. The author fine-tuned the pre-trained BERT model for classification. This study demonstrates that the multi-headed attention mechanism used in BERT is capable of producing outstanding results, and it reported an accuracy of 99.20% in the classification results using the BERT model. Another study fine-tuned several pre-trained transformer models, including RoBERTa variants (e.g., Distil-RoBERTa, RoBERTa-base, RoBERTa-large) and two variants of the ALBERT model (e.g., Albert-base-V2, Albert-large-V2), to perform a systematic analysis. The authors fine-tuned these pre-trained models to prepare them for their classification task. Among all the adopted models, RoBERTa-large appeared to be the best-performing model, with an F1 score of 76%, as it was trained on a larger corpus than the other models. In another study, the authors fine-tuned three transformer models, XLNet base, BERT base, and RoBERTa base, for the classification of user comments associated with COVID-19 misinformation videos [50]; among these models, RoBERTa showed the best performance on the test data. One study adopted BERT_BASE and RoBERTa_BASE as baseline models [62]. The authors trained the BERT model with raw training data and obtained good performance on the validation data, but the RoBERTa model achieved higher performance than BERT in the baseline results. Because RoBERTa performed better, the authors proposed some ensemble models built on RoBERTa to increase the performance over the baseline models. In topic modeling, SCHOLAR achieved a higher coherence score than NVDM, but in terms of perplexity, NVDM showed higher performance than SCHOLAR.

Some studies employed attention-based models for the classification of COVID-19 misinformation [61, 60]. The authors used two models based on the attention mechanism, namely HAN [92] and dEFEND [93]. HAN uses two levels of attention mechanisms, applied at the word and sentence levels, to learn the hierarchical structure of documents. It uses a bidirectional GRU network for the word- and sentence-level encoding procedures. An attention mechanism is applied after the word encoder to extract the contextually important words and form a sentence vector by aggregating the representations of the informative words. A sentence encoder then works on the derived sentence vectors and generates a document vector. Another attention mechanism is applied after the sentence encoder to measure the importance of each sentence for the classification of a document.
The dEFEND framework utilizes HAN on the article content and a co-attention mechanism between article content and user comments to classify misinformation. In the studies [61, 60], dEFEND showed higher performance scores than HAN owing to its robustness and explainability.

In one study, multi-modal information (e.g., textual and visual) from news articles on coronavirus was used for the detection of fake news [64]. The authors adopted the SAFE [94] model, which can jointly learn textual and visual information along with their relationships to detect fake news. In the SAFE architecture, a Text-CNN model is used to extract the textual features from the news articles; the visual features (e.g., images) are also extracted by a Text-CNN model, after the visual information within the articles is first processed using a pre-trained image2sentence model. The authors achieved the best performance using the SAFE model among all the baseline methods employed. Another study employed a model called SAME [95] for the classification of healthcare misinformation on COVID-19 [61]. SAME is a multi-modal system that uses news images, content, user profile information, and users' sentiments to detect fake news. In this study, the authors skipped the visual part of the SAME model for their classification purposes, as the majority of the news articles do not contain cover images. They were not able to obtain satisfactory results with this model, as their dataset was quite imbalanced.

Some other methods, such as XLM-R and FastText, were used to perform fine-grained disinformation analysis on Arabic tweets [59]. In this study, the authors used these two models in both binary and multi-class classification settings. They achieved consistently good results using FastText, while XLM-R did not perform well because the amount of data was small and the model was likely to overfit.

Some research works also used different combinations of traditional ML and DL methods to increase the overall performance of classification. One study proposed an ensemble-learning-based framework for judging the credibility of a vast number of tweets based on tweet-level and user-level features [54]. For this, the authors integrated six traditional ML algorithms using stacking-based ensemble learning, which resulted in higher accuracy and a more generalized model. Another study employed an RCNN model for detecting COVID-19 misleading information [45]. In the RCNN architecture, a recurrent structure captures the contextual information, and the max-pooling layer automatically judges which words play key roles in capturing the key components of texts [97]. Another study proposed a classification-aware neural topic model (CANTM) for topic generation that takes into account the classification information regarding disinformation [52]. The authors combined the properties of BERT with a VAE model to build a robust classification system. CANTM outperformed the other baseline models in terms of accuracy and F1 score in the classification task, and it also achieved the best perplexity score in the topic modeling task among all the models.

It is essential to compare the performance of the algorithms systematically. To evaluate the performance of the algorithms, different metrics are used, including accuracy, precision, recall, and F1 score; many of these metrics have more than one name. These evaluation metrics are defined as follows.

Accuracy is defined as the ratio of correctly predicted instances over the total number of evaluated instances. It is formally defined in Equation 7:

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (7)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives respectively.

Precision is defined as the fraction of correctly predicted positive instances among all instances predicted as positive. It is also known as the positive predictive value (PPV). It is formally defined in Equation 8:

$\text{Precision} = \frac{TP}{TP + FP}$ (8)

Recall measures the fraction of positive instances that are correctly classified. It is also known as the true positive rate (TPR) or sensitivity. It is formally defined in Equation 9:

$\text{Recall} = \frac{TP}{TP + FN}$ (9)

The F1 score is the harmonic mean of precision and recall, $F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$.
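These metrics can be computed directly with scikit-learn, as in the short sketch below; the example labels and predictions are made up for illustration.

```python
# Computing the evaluation metrics in Equations 7-9 (plus F1) with
# scikit-learn on illustrative predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = misinformation, 0 = factual
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # Equation 7
print("Precision:", precision_score(y_true, y_pred))  # Equation 8 (PPV)
print("Recall   :", recall_score(y_true, y_pred))     # Equation 9 (TPR)
print("F1 score :", f1_score(y_true, y_pred))
```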
In the existing research on COVID-19 misinformation classification, several traditional ML and DL methods have been employed. Among them, some are highly efficient at classifying COVID-19 misinformation and show high performance scores. Table 4 presents the best-performing models used in the existing studies on COVID-19 misinformation detection in terms of accuracy and F1 score.

It is observed that there is still a lack of benchmark datasets that include the resources needed to extract all relevant features related to COVID-19 misinformation. Besides, most of the studies have utilized data collected mainly from social media platforms (e.g., Twitter, Facebook, etc.) and a few other reliable sources; the majority of the datasets do not contain data from diverse sources. Moreover, the class distribution in some datasets was observed to be imbalanced, which affected the overall classification performance. Koirala showed that increasing the weight of the minority class can handle this problem [49]. A promising direction is to create a comprehensive, well-annotated, and large-scale benchmark dataset on COVID-19 misinformation that scholars can use to conduct further research in this domain. Furthermore, future researchers may employ and investigate different sampling techniques to handle the class imbalance problem and demonstrate their effect on classification performance.

In misinformation classification, data pre-processing is an underrated step. It was observed that most researchers focus on the method and often neglect the data pre-processing phase. Elhadad et al. showed that, with proper data pre-processing approaches, the performance of a classification model can be significantly improved [45]. Usually, in the pre-processing step, special characters, punctuation marks, tags, URLs, and stop words are removed, and Part-of-Speech (PoS) tagging, word stemming, case-folding, etc. are performed. In the future, researchers may work on dataset-specific pre-processing tasks.

As the number of studies working with large volumes of COVID-19 misinformation data is relatively small, it was noticed that there are no efficient techniques for selecting the important features in large-scale data. Future researchers may contribute to this area by proposing methods to extract the most significant features from large-volume data while effectively minimizing the feature vector size.

There is no study on COVID-19 misinformation detection that has used multiple modalities such as texts, images, and videos altogether. Although each individual modality is very important, it is not sufficient alone. Different modalities can capture different aspects of the content, and information derived from different modalities can complement each other in detecting misinformation. The similarity between an image and the accompanying text is particularly important and can serve as additional information for a comprehensive outcome. Thus, a study could incorporate multimodal features to build a robust misinformation detection system.
Though such multimodal systems can perform well in detecting misinformation, they can increase training and model-size overhead as well as training cost and complexity, since one classifier must always be trained alongside another. In today's competitive age, it is worthwhile to research these open issues, and researchers can make contributions to solve these problems.

All the existing works on COVID-19 misinformation detection are supervised, which requires an extensive amount of time and a pre-annotated misinformation dataset to train a model. Obtaining a benchmark misinformation dataset on COVID-19 is also time-consuming and labor-intensive, as the process requires careful checking of the contents, along with additional evidence such as authoritative reports, fact-checking websites, and news reports. Leveraging a crowdsourcing approach to obtain annotations could relieve the burden of expert checking, but the annotation quality may suffer [99]. As misinformation is intentionally spread to mislead people, individual human workers alone may not have the domain expertise to differentiate real information from misinformation [100]. So it is time to consider semi-supervised or unsupervised models that work with limited or unlabeled data. Moreover, unsupervised models can be more practical because unlabeled datasets are easier to obtain.

A few works implemented ensemble methods to build more complex and effective models that better utilize the extracted features. Ensemble methods combine several weak classifiers to learn a stronger one that is more robust than any individual classifier alone. In misinformation detection systems, different variants of ensemble methods can significantly boost the overall performance. Again, hybrid classifiers (ML+ML, DL+DL) have been used to improve the predictions of the classification task in some existing literature [54, 52, 45, 51, 56]. Other combinations of hybrid classifiers (ML+DL, DL+ML) could be used to build a robust classification system for COVID-19 misinformation, as sketched below.
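As a minimal sketch of such a stacked hybrid, the following combines an SVM and a random forest under a logistic regression meta-learner with scikit-learn's StackingClassifier; the synthetic data and the choice of base learners are illustrative assumptions, not the exact configuration used in [54].

```python
# Illustrative stacking ensemble in the spirit of the hybrid/ensemble
# approaches discussed above; base learners and data are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),
        ("rf", RandomForestClassifier(n_estimators=100)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner combines base outputs
)
stack.fit(X_tr, y_tr)
print("held-out accuracy:", stack.score(X_te, y_te))
```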
In the course of the COVID-19 pandemic, people are spending more time on the internet to gather necessary information. Hence, when a piece of information is falsely represented, it spreads very fast and misguides users, creating a strong negative impact on individuals and broader society. If we cannot halt the spread of COVID-19 misinformation, the overabundance of false information may lead people to further panic. To ease the detection of misinformation, traditional ML and DL methods are widely used to build systems that can classify misinformation precisely. In this survey, we outline the existing research works on COVID-19 misinformation classification and detection. In particular, we have provided a comprehensive view of different misinformation types and discussed existing methodologies for detecting COVID-19 misinformation, focusing on feature extraction methods, classification, detection performance, etc. Among the adopted techniques, DL appeared to be one of the most efficient and effective for classifying misinformation accurately; although their performance sometimes degrades, traditional ML methods also perform very well in the misinformation classification task. We also revealed the limitations of the existing studies and pointed out several research directions for further investigation. We believe that our survey can provide important insights for building robust classification systems for detecting misinformation related to COVID-19 and help researchers around the world come up with new strategies to fight the spread of misinformation during this pandemic.

References

A first case of meningitis/encephalitis associated with SARS-Coronavirus-2
Coronavirus disease (COVID-19) update
COVID Live Update from Worldometer
How to fight an infodemic
Disinformation and Misinformation on Twitter during the Novel Coronavirus Outbreak, arXiv (2020)
Comparative study between traditional machine learning and deep learning approaches for text classification
Large-scale bayesian logistic regression for text categorization
Supervised clustering with support vector machines
Ridge estimators in logistic regression
The discipline of machine learning
Support-vector networks
Support Vector Machines
Support Vector Machines - An Introduction
Estimating continuous distributions in bayesian classifiers
Instance-based learning algorithms
An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression
C4.5: Programs for Machine Learning
C4.5: Programs for Machine Learning by J. Ross Quinlan
Classification and Regression Trees
Random Forests
Random Forests
Bagging predictors
Xgboost: extreme gradient boosting
A Novel PCA-Firefly Based XGBoost Classification Model for Intrusion Detection in Networks Using GPU
Learning While Searching in Constraint-Satisfaction-Problems
Weakly supervised cascaded convolutional networks
DeepID-Net: Object Detection with Deformable Part Based Convolutional Neural Networks
Deep learning for natural language processing and language modelling
Deep Neural Networks for Acoustic Modeling in the Presence of Noise
Using deep learning for community discovery in social networks
DeepLog: Anomaly detection and diagnosis from system logs through deep learning
DeTrAs: deep learning-based healthcare framework for IoT-based assistance of Alzheimer patients
Proceedings of the IEEE
A High-accuracy model average ensemble of convolutional neural networks for classification of cloud image patches on small datasets
Detecting breaking news rumors of emerging topics in social media
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
BERT: Pre-training of deep bidirectional transformers for language understanding
Generalized autoregressive pretraining for language understanding, arXiv
A robustly optimized BERT pretraining approach, arXiv
Albert: A lite bert for self-supervised learning of language representations
Detecting Misleading Information on COVID-19
Research on Fine-Grained Classification of Rumors in Public Crisis -- Take the COVID-19 incident as an example, E3S Web of Conferences
An Ensemble Deep Learning Technique to Detect COVID-19 Misleading Information, volume 1264 AISC
No Rumours Please! A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection
FakeCovid - A Multilingual Cross-domain Fact Check News Dataset for COVID-19, arXiv
Tweet Check-Worthiness Using an Enhanced CT-BERT with Numeric Expressions
COVID-19 Fake News Classification using Deep Learning, Master's thesis
NLP-based Feature Extraction for the Detection of Misinformation Videos on YouTube, ACL 2020 Workshop NLP-COVID
COVIDLies: Detecting COVID-19 Misinformation on Social Media
Classification Aware Neural Topic Model and its Application on a New COVID-19 Disinformation Corpus
Independent Component Analysis for Trustworthy Cyberspace during High Impact Events: An Application to Covid-19
Lies Kill, Facts Save: Detecting COVID-19 Misinformation in Twitter
Analysis of Fake News in Social Medias for Four Months during Lockdown in COVID-19 - A Study
Fine-Grained Analysis of Misinformation in COVID-19 Tweets
Fighting an Infodemic: COVID-19 Fake News Dataset
COVID-19 and Arabic Twitter: How can Arab World Governments and Public Health Organizations Learn from Social Media?
Fighting the COVID-19 Infodemic in Social Media: A Holistic Perspective and a Call to Arms
Drink bleach or do what now? Covid-HeRA: A dataset for risk-informed health decision making in the presence of COVID19 misinformation
Covid-19 healthcare misinformation dataset, arXiv (2020) 1-11
CXP949 at WNUT-2020 Task 2: Extracting Informative COVID-19 Tweets - RoBERTa Ensembles and The Continued Relevance of Handcrafted Features
The Role of the Crowd in Countering Misinformation: A Case Study of the COVID-19 Infodemic
ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research
Characterizing COVID-19 misinformation communities using a novel twitter dataset, arXiv
ArCOV19-Rumors: Arabic COVID-19 Twitter Dataset for Misinformation Detection
Misinformation Has High Perplexity
CHECKED: Chinese COVID-19 fake news dataset
An exploratory study of covid-19 misinformation on twitter
TweetsCOV19 - A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic, International Conference on Information and Knowledge Management
Design and analysis of a large-scale COVID-19 tweets dataset
GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information
Twitter chatter dataset for open scientific research - an international collaboration
Large Arabic Twitter Dataset on COVID-19, arXiv (2020) 2-4
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set
An Augmented Multilingual Twitter Dataset for Studying the COVID-19 Infodemic
Fast, consistent tokenization of natural language text
Tokenising, stemming and stopword removal on anti-spam filtering domain
Understanding inverse document frequency: On theoretical arguments for idf
Linguistic inquiry and word count (LIWC)
Fake news early detection: A theory-driven model
Rhetorical structure theory: Toward a functional theory of text organization
Representation learning for text-level discourse parsing
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Distributed representations of words and phrases and their compositionality
GloVe: Global vectors for word representation
Conference on Empirical Methods in Natural Language Processing (EMNLP)
The psychological meaning of words: LIWC and computerized text analysis methods
Sentence embeddings using siamese BERT-networks, EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference
AraBERT: Transformer-based Model for Arabic Language Understanding, arXiv (2020)
Neural models for documents with metadata
Neural variational inference for text processing
Hierarchical Attention Networks
Defend: Explainable fake news detection
SAFE: Similarity-Aware Multi-modal Fake News Detection
Same: Sentiment-aware multi-modal embedding for detecting fake news
CSI: A hybrid deep model for fake news detection
Recurrent convolutional neural networks for text classification
Bertscore: Evaluating text generation with bert, arXiv
Leveraging the Crowd to Detect and Reduce the Spread of Fake News and Misinformation
Accuracy of Deception Judgments