Automatic Hate Speech Detection using Machine Learning: A Comparative Study

Sindhu Abro1, Sarang Shaikh2, Zafar Ali4, Sajid Khan5, Ghulam Mujtaba6
Center for Excellence for Robotics, Artificial Intelligence and Blockchain, Department of Computer Science, Sukkur IBA University, Sukkur, Pakistan

Zahid Hussain Khand3
Department of Computer Science, Sukkur IBA University, Sukkur, Pakistan

Abstract—The increasing use of social media and information sharing has brought major benefits to humanity. However, it has also given rise to a variety of challenges, including the spread of hate speech messages. To address this emerging issue on social media sites, recent studies have employed a variety of feature engineering techniques and machine learning algorithms to automatically detect hate speech messages on different datasets. However, to the best of our knowledge, no study has compared a range of feature engineering techniques and machine learning algorithms to evaluate which combination performs best on a standard, publicly available dataset. Hence, the aim of this paper is to compare the performance of three feature engineering techniques and eight machine learning algorithms on a publicly available dataset with three distinct classes. The experimental results showed that bigram features, when used with the support vector machine algorithm, performed best, with 79% overall accuracy. Our study has practical implications and can be used as a baseline study in the area of automatic hate speech detection. Moreover, the results of the different comparisons can serve as a state-of-the-art reference against which future research on automated text classification techniques can be compared.

Keywords—Hate speech; online social networks; natural language processing; text classification; machine learning

I. INTRODUCTION

In recent years, hate speech has been increasing in both in-person and online communication. Social media and other online platforms play an extensive role in the breeding and spread of hateful content, which eventually leads to hate crime. For example, according to recent surveys, the rise in online hate speech content has been linked to hate crimes surrounding events such as Trump's election in the US [2], the Manchester and London Bridge attacks in the UK [3], and the terror attacks in New Zealand [4]. To tackle the harmful consequences of hate speech, different steps, including legislation, have been taken by the European Union Commission. Recently, the European Union Commission also pressed social media networks to sign an EU hate speech code requiring the removal of hate speech content within 24 hours [1]. However, manually identifying and removing hate speech content is labor-intensive and time-consuming. Due to these concerns and the widespread presence of hate speech on the internet, there is strong motivation for automatic hate speech detection. Automatic detection of hate speech is a challenging task due to disagreements over the definition of hate speech itself: content may be hateful to some individuals and not to others, depending on the definition applied.
According to [5], hate speech is: "the content that promotes violence against individuals or groups based on race or ethnic origin, religion, disability, gender, age, veteran status, and sexual orientation/gender identity". Despite these differing definitions, some recent studies have reported favorable results for automatic hate speech detection in text [21-32]. The proposed solutions employed different feature engineering techniques and ML algorithms to classify content as hate speech. Despite this extensive body of work, it remains difficult to compare the performance of these approaches. To the best of our knowledge, existing studies lack a comparative analysis of different feature engineering techniques and ML algorithms. Therefore, this study contributes to solving this problem by comparing three feature engineering techniques and eight ML classifiers on a standard hate speech dataset. Table I shows the major concepts related to automatic text classification, along with their explanations and references. This study holds practical importance and can serve as a reference for new researchers in the domain of automatic hate speech detection.

The rest of the paper is organized as follows: Section II highlights the related works. Section III discusses the methodology. Sections IV, V, and VI explain the experimental settings, results, and discussion. Finally, Section VII discusses the limitations, future work, and conclusion.

TABLE I. TEXT CLASSIFICATION (KEY CONCEPTS)

1. Feature Extraction (FE): A mapping from text data to real-valued vectors. [6]
2. Bigram: A feature engineering technique that represents two adjacent words as a single numeric feature when creating the master feature vector. [7]
3. Term Frequency-Inverse Document Frequency (TFIDF): A feature representation technique that captures how important a word is to a document in a document set. It combines the frequency of a word's appearance in a document with the number of documents containing that word. [8]
4. Word2vec: A technique for learning vector representations of words, which can then be used to train machine learning models. [9]
5. Doc2vec: An unsupervised technique for learning fixed-length vector representations of documents. It is analogous to word2vec, except that it learns a vector that is unique to each document. [10]
6. Machine Learning (ML) Classifiers: Algorithms applied to numeric feature vectors to build a predictive model that can predict class labels. [11]
7. Naïve Bayes (NB): A probabilistic classification algorithm that uses Bayes' theorem to predict the class, assuming conditional independence among features. [12]
8. Random Forest (RF): An ensemble classifier consisting of many decision trees. It classifies an instance by voting over the class predictions of the individual trees. [13]
9. Support Vector Machines (SVM): A supervised classification algorithm that learns an optimal hyperplane from training data to separate the categories when classifying new data. [14]
10. K Nearest Neighbor (KNN): A simple text classification algorithm that categorizes new data by comparing it with all available data using a similarity measure. [15]
11. Decision Tree (DT): A supervised algorithm that generates classification rules in tree form, where each internal node denotes a condition on an attribute, each branch denotes an outcome of the condition, and each leaf node represents a class label. [16]
12. Adaptive Boosting (AdaBoost): One of the best-known boosting algorithms, which strengthens weak learners. [17]
13. Multilayer Perceptron (MLP): A feedforward artificial neural network that produces a set of outputs from a set of inputs. [18]
14. Logistic Regression (LR): A predictive analysis technique that uses a sigmoid function to model the relationship between one dependent variable and one or more independent variables. [19]

II. RELATED WORKS

These days, hate speech is very common on social media. Consequently, in recent years several researchers have applied supervised ML-based text classification approaches to classify hate speech content, employing a variety of feature representation techniques: dictionary-based [21-23], bag-of-words (BOW)-based [24-26], n-gram-based [27-29], TFIDF-based [30, 31], and deep-learning-based [31].

Peter Burnap et al. [20] employed a dictionary-based approach to identify cyber hate on Twitter. They used an n-gram feature engineering technique to generate numeric vectors from a predefined dictionary of hateful words. The authors fed the generated numeric vectors to an SVM classifier and obtained a maximum F-score of 67%. Stéphan Tulkens et al. [22] also used a dictionary-based approach for the automatic detection of racism in Dutch social media. The authors used the distribution of words over three dictionaries as features, which they fed to an SVM classifier, obtaining an F-score of 0.46. Njagi Dennis et al. [21] used an ML-based classifier to classify hate speech in web forums and blogs. The authors employed a dictionary-based approach to generate a master feature vector. The features were based on sentiment expressions, using semantic and subjectivity features oriented toward hate speech. The authors then fed the master feature vector to a rule-based classifier, evaluated it using the precision metric, and obtained 73% precision. The combination of dictionary-based and ML approaches thus showed good results. However, the major disadvantage of this type of approach is that it requires a dictionary built from a large corpus of domain words. To overcome this drawback, many researchers have used a BOW-based approach, which is similar to a dictionary-based approach except that the word features are obtained from the training data rather than from predefined dictionaries.

Edel Greevy et al. [23] used a supervised ML approach to classify racist text. To convert the raw text into numeric vectors, the authors employed a bigram feature extraction technique with the BOW feature representation. They used an SVM classifier in their experiments and achieved 87% accuracy. Irene Kwok et al. [24] employed an ML-based approach for the automatic detection of racism against blacks in the Twitter community. They employed unigrams with the BOW-based technique to generate numeric vectors.
The authors fed the generated numeric vectors to a Naïve Bayes classifier and obtained a maximum accuracy of 76%. Sanjana Sharma et al. [25] classified hate speech on Twitter using BOW features fed to a Naïve Bayes classifier, reporting a maximum accuracy of 73%. BOW thus showed reasonable accuracy in social network text classification. However, its major disadvantage is that word order is ignored, which causes misclassification when the same words are used in different contexts. To overcome this limitation, researchers have proposed n-gram-based approaches [7].

Zeerak Waseem et al. [28] classified hate speech on Twitter. They employed character n-gram feature engineering to generate numeric vectors, fed them to an LR classifier, and obtained an overall F-score of 73%. Chikashi Nobata et al. [27] used an ML-based approach to detect abusive language in online user content. The authors employed character n-gram features, fed them to an SVM classifier, and obtained an overall F-score of 77%. Shervin Malmasi et al. [26] used an ML-based approach to classify hate speech in social media. The authors employed character 4-gram features, fed them to an SVM classifier, and reported a maximum accuracy of 78%.

In recent years, a few researchers have combined n-gram features with TFIDF weighting to detect hate speech automatically. For example, Karthik Dinakar et al. [29] classified sensitive topics in social media comments and posts. They employed unigrams with the TFIDF feature representation to generate numeric feature vectors, which they fed to four ML classifiers: Naïve Bayes, a rule-based classifier, J48, and SVM. Their experimental results showed that the rule-based classifier outperformed the NB, J48, and SVM classifiers, obtaining 73% accuracy. Shuhua Liu et al. [30] classified web content pages into hatred or violence categories. They used trigram features represented with TFIDF and a Naïve Bayes classifier, which obtained a highest accuracy of 68%. The n-gram-based approach gives better results than the BOW-based approach, but it has two major limitations: first, related words may be far apart in a sentence; second, increasing the value of N slows processing [32].

In recent years, authors have also employed deep-learning-based NLP techniques to classify hate speech messages. Sebastian Köffer et al. [31] employed word2vec features and SVM classifiers to classify hate speech messages in German texts and obtained a 67% F-score. Word2vec showed the lowest results because such approaches need a large amount of data to learn complex word semantics. Recently, there have also been attempts to construct datasets for, and detect, hate speech and offensive language in other languages (e.g., Danish). An important 2019 study [45] constructed a Danish dataset for hate speech and offensive language detection, containing comments from Reddit and Facebook.
The dataset also annotated the various types and targets of offensive language. The authors achieved a highest F1 score of 0.74 using deep learning models with different feature sets. Schmidt et al. [46] conducted a 2017 survey on hate speech detection using natural language processing. The authors discussed in detail the various feature engineering techniques used for supervised classification of hate speech messages. The major drawback of this survey is that it provided no experimental results for the techniques discussed.

Previous studies show that researchers across the globe are working on hate speech recognition in different languages such as German, Dutch, and English. However, to the best of our knowledge, no study provides a comparison of various features and ML algorithms on a standard dataset that can serve as a baseline for future researchers in the field of hate speech recognition. Hence, in this study, we compared three feature engineering techniques and eight ML classifiers to evaluate which works best on a hate speech dataset (discussed in Section III).

III. METHODOLOGY

This section explains the proposed system, which we employed to classify tweets into three classes: "hate speech", "offensive but not hate speech", and "neither hate speech nor offensive speech". Fig. 1 shows the complete research methodology. As shown in the figure, the methodology comprises six key steps: data collection, data preprocessing, feature engineering, data splitting, classification model construction, and classification model evaluation. Each step is discussed in detail in the subsequent sections.

Fig. 1. System Overview.

A. Data Collection

In this research study, we used a publicly available dataset of hate speech tweets, compiled and labeled by CrowdFlower. In this dataset, the tweets are labeled with three distinct classes: hate speech, not offensive, and offensive but not hate speech. The dataset contains 14,509 tweets. Of these, 16% belong to the hate speech class, 50% to the not offensive class, and the remaining 33% to the offensive but not hate speech class. The details of this distribution are shown in Fig. 2.

B. Text Preprocessing

Several research studies have shown that text preprocessing improves classification results [33]. We therefore applied different preprocessing techniques to filter noisy and non-informative features from the tweets. In preprocessing, we converted the tweets to lower case and removed all URLs, usernames, white spaces, hashtags, punctuation, and stop-words from the collected tweets using pattern matching. We also performed tokenization and stemming on the preprocessed tweets: tokenization splits each tweet into tokens (words), and the Porter stemmer then reduces each word to its root form (e.g., "offended" to "offend").
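The paper does not provide code for this step; the following is a minimal sketch of such a preprocessing pipeline in Python, assuming NLTK's Porter stemmer and English stop-word list. The function name, regular expressions, and the choice to drop hashtag tokens entirely are our own assumptions, not taken from the paper.

```python
import re
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Hypothetical implementation: the paper describes these steps but not the code.
STOPWORDS = set(stopwords.words("english"))  # requires nltk.download("stopwords")
stemmer = PorterStemmer()

def preprocess_tweet(tweet):
    text = tweet.lower()                              # lower-casing
    text = re.sub(r"https?://\S+", " ", text)         # strip URLs
    text = re.sub(r"[@#]\w+", " ", text)              # strip usernames and hashtags
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = text.split()                             # tokenize; also collapses whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]  # stop-word removal
    return [stemmer.stem(t) for t in tokens]          # Porter stemming, e.g. offended -> offend

print(preprocess_tweet("@user I was OFFENDED by this! http://t.co/abc #tag"))
# -> ['offend']
```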
C. Feature Engineering

ML algorithms cannot learn classification rules from raw text; they require numerical features. Feature engineering is therefore one of the key steps in text classification: it extracts the key features from raw text and represents them in numerical form. In this study, we applied three feature engineering techniques: n-grams with TFIDF [8], Word2vec [9], and Doc2vec [10].

D. Data Splitting

Table II shows the class-wise distribution of the overall dataset as well as of the data after splitting into training and test sets. We used an 80-20 ratio to split the preprocessed data (80% training data, 20% test data). The training data is used to train the classification model to learn classification rules; the test data is then used to evaluate the model.

Fig. 2. Class wise Data Distribution.

TABLE II. DETAILS OF DATA SPLIT

Class                              Total instances   Training instances   Testing instances
0 Hate Speech                      2399              1909                 490
1 Not offensive                    7274              5815                 1459
2 Offensive but not Hate Speech    4836              3883                 953
Total                              14509             11607                2902

E. Machine Learning Models

According to the "no free lunch" theorem [34], no single classifier performs best on all kinds of datasets. It is therefore recommended to apply several different classifiers to a master feature vector and observe which one achieves the best results. Hence, we selected eight classifiers: NB [12], SVM [14], KNN [15], DT [16], RF [13], AdaBoost [17], MLP [18], and LR [19].

F. Classifier Evaluation

In this step, the constructed classifier predicts the class of unlabeled text (i.e., "hate speech", "offensive but not hate speech", or "neither hate speech nor offensive speech") using the test set. Classifier performance is evaluated by counting true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP). These four numbers constitute a confusion matrix, as in Fig. 3. Different performance metrics are used to assess the constructed classifier; the most common measures in text categorization are discussed briefly below, and further details can be found in [35].

1) Precision: Precision is also known as the positive predictive value. It is the proportion of predicted positives that are actually positive:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (1)$$

2) Recall: It is the proportion of actual positives that are predicted positive:

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$

3) F-Measure: It is the harmonic mean of precision and recall. The standard F-measure (F1) gives equal importance to precision and recall:

$$F\text{-measure} = \frac{2 \times (\text{precision} \times \text{recall})}{\text{precision} + \text{recall}} \qquad (3)$$

4) Accuracy: It is the proportion of correctly classified instances (true positives and true negatives) among all instances:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (4)$$

Fig. 3. Confusion Matrix.

IV. EXPERIMENTAL SETTINGS

As mentioned in Section III.C, we used three types of features: n-grams (bigrams) with TFIDF, Word2vec, and Doc2vec. Hence, we have three different master feature representations in total. In addition, eight different ML algorithms were applied to each of the three master feature vectors. Overall, therefore, 24 analyses (3 master feature vectors × 8 ML algorithms) were evaluated to check the effectiveness of the classification models.
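As a concrete illustration, the following is a minimal sketch of one slice of this grid (the bigram TFIDF master feature vector fed to several of the classifiers) using scikit-learn. The specific estimator classes, hyperparameters, random seed, and the use of weighted metric averaging are our assumptions; the paper does not specify its implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, accuracy_score)

# tweets: list of preprocessed tweet strings; labels: 0 = hate speech,
# 1 = not offensive, 2 = offensive but not hate speech (see Table II).
def run_bigram_tfidf_analyses(tweets, labels):
    # 80-20 split; stratification and the seed are our choices.
    X_train, X_test, y_train, y_test = train_test_split(
        tweets, labels, test_size=0.20, stratify=labels, random_state=42)

    # Master feature vector 1 of 3: bigram counts weighted by TFIDF.
    vectorizer = TfidfVectorizer(ngram_range=(2, 2))
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # A subset of the eight classifiers; KNN, DT, and MLP plug in the same way.
    classifiers = {
        "SVM": LinearSVC(),
        "LR": LogisticRegression(max_iter=1000),
        "NB": MultinomialNB(),
        "RF": RandomForestClassifier(),
        "AdaBoost": AdaBoostClassifier(),
    }
    for name, clf in classifiers.items():
        clf.fit(X_train_vec, y_train)
        y_pred = clf.predict(X_test_vec)
        # Weighted averages of Eqs. (1)-(3) over the three classes, plus Eq. (4).
        print(name,
              precision_score(y_test, y_pred, average="weighted"),
              recall_score(y_test, y_pred, average="weighted"),
              f1_score(y_test, y_pred, average="weighted"),
              accuracy_score(y_test, y_pred))
```

Repeating the same loop with the Word2vec and Doc2vec feature matrices yields the remaining analyses reported in Tables III to VI.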
V. RESULTS

This section presents the overall results of the 24 analyses. Tables III to VI show the precision, recall, F-measure, and accuracy of all 24 analyses, respectively, for the different feature representations and classification techniques applied in the experimental settings. In each table, bold values mark the maximum and minimum results.

Across all 24 analyses, the lowest precision (0.58), recall (0.57), accuracy (57%), and F-measure (0.47) were found with the MLP and KNN classifiers using bigram features with the TFIDF representation. The highest recall (0.79), precision (0.77), accuracy (79%), and F-measure (0.77) were obtained by SVM using bigram features with the TFIDF representation. Among the feature representations, bigram features with TFIDF performed best compared to Word2vec and Doc2vec, although the difference between the bigram and Doc2vec results was marginal. Among the classification models, the SVM classifier performed best of all eight classifiers. The AdaBoost and RF results were lower than SVM's but better than those of LR, DT, NB, KNN, and MLP.

Furthermore, Fig. 4 and Fig. 5 show the confusion matrices of the best-performing analyses. Fig. 4 shows the SVM classifier's confusion matrix using bigrams with TFIDF features. As shown there, of the 490 tweets belonging to the hate speech class, only 155 were correctly classified; the remaining 335 instances were misclassified, 54 of them falsely classified as not offensive and 281 as offensive but not hate speech. Of the 1459 instances belonging to the second class, 1427 tweets were correctly classified as not offensive; of the remaining 32 misclassified instances, 5 were incorrectly classified as hate speech and 27 as offensive but not hate speech. The remaining 953 of the 2902 test instances belong to the offensive but not hate speech class. Here, the SVM classifier correctly classified 698 tweets; 122 and 133 instances were misclassified as hate speech and not offensive, respectively. Fig. 5 shows the confusion matrix of the AdaBoost classifier using bigrams with TFIDF features. Its overall performance is lower than that of the SVM classifier with the same features; AdaBoost performed well only on the offensive but not hate speech class.

Fig. 4. Confusion Matrix (Features: Bigram (TFIDF), Classifier: SVM).

Fig. 5. Confusion Matrix (Features: Bigram (TFIDF), Classifier: AdaBoost).

TABLE III. PRECISION OF ALL 24 ANALYSES

Features   LR    NB    RF    SVM   KNN   DT    AdaBoost  MLP
Bigram     0.72  0.71  0.73  0.77  0.61  0.71  0.75      0.58
Word2vec   0.69  0.66  0.66  0.70  0.64  0.62  0.65      0.69
Doc2vec    0.70  0.65  0.65  0.70  0.69  0.61  0.66      0.71

(Bold marks the highest and lowest values.)

TABLE IV. RECALL OF ALL 24 ANALYSES

Features   LR    NB    RF    SVM   KNN   DT    AdaBoost  MLP
Bigram     0.75  0.73  0.75  0.79  0.57  0.73  0.78      0.70
Word2vec   0.72  0.67  0.68  0.73  0.61  0.63  0.68      0.71
Doc2vec    0.72  0.62  0.67  0.72  0.65  0.63  0.67      0.71

(Bold marks the highest and lowest values.)
TABLE V. F-MEASURE OF ALL 24 ANALYSES

Features   LR    NB    RF    SVM   KNN   DT    AdaBoost  MLP
Bigram     0.72  0.68  0.74  0.77  0.47  0.71  0.73      0.63
Word2vec   0.69  0.66  0.66  0.70  0.61  0.60  0.65      0.65
Doc2vec    0.70  0.63  0.66  0.72  0.65  0.61  0.66      0.66

(Bold marks the highest and lowest values.)

TABLE VI. ACCURACY OF ALL 24 ANALYSES

Features   LR    NB    RF    SVM   KNN   DT    AdaBoost  MLP
Bigram     0.75  0.73  0.75  0.79  0.57  0.73  0.78      0.70
Word2vec   0.72  0.67  0.68  0.73  0.61  0.63  0.68      0.71
Doc2vec    0.72  0.62  0.67  0.72  0.65  0.63  0.67      0.71

(Bold marks the highest and lowest values.)

VI. DISCUSSION

In the experimental work, we evaluated eight classifiers over three different feature engineering techniques, giving 24 analyses on a hate speech dataset containing three classes. Our experimental results showed that the SVM algorithm combined with bigram TFIDF features produced the best results. A theoretical analysis is given in the subsequent sections.

A. Feature Engineering

The choice of feature engineering technique is important in text classification. In this study, we compared three distinct techniques: bigrams with TFIDF, Word2vec, and Doc2vec. The experimental results showed that, of these three, bigrams with TFIDF performed best, while Word2vec and Doc2vec gave lower results. A possible reason is that bigrams preserve local word order, unlike Word2vec and Doc2vec [36]. Moreover, several studies have shown that the TFIDF representation is better than binary and raw term-frequency representations [6]. A possible reason for the lower performance of Word2vec is that it cannot handle out-of-vocabulary (OOV) words, which are especially common in Twitter data. Moreover, Word2vec requires a large training set to learn the complex relationships between words [37]; as shown in Table II, our dataset has approximately 15,000 tweets, which may not be enough to train Word2vec effectively to capture complex word relationships. Doc2vec also showed lower performance in our experiments, possibly because it performs poorly on very short documents [38], and the tweets in our dataset are limited to 280 characters.
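The paper does not state how fixed-length tweet vectors were derived from the Word2vec word embeddings. A common construction, sketched below with gensim under our own assumptions (function name, hyperparameters, and the averaging scheme are ours), averages the in-vocabulary word vectors of each tweet. It makes the OOV limitation discussed above concrete: tokens outside the learned vocabulary simply drop out of the representation.

```python
import numpy as np
from gensim.models import Word2Vec

# tokenized_tweets: list of token lists from the preprocessing step.
def tweet_vectors(tokenized_tweets, vector_size=100):
    # Train Word2vec on the ~15,000 tweets themselves: a small corpus,
    # which is exactly the limitation discussed above.
    model = Word2Vec(tokenized_tweets, vector_size=vector_size,
                     window=5, min_count=2, workers=4)
    vectors = []
    for tokens in tokenized_tweets:
        # Tokens below min_count, or otherwise unseen, are OOV and are
        # skipped, so any signal they carry is lost.
        known = [model.wv[t] for t in tokens if t in model.wv]
        vectors.append(np.mean(known, axis=0) if known
                       else np.zeros(vector_size))
    return np.vstack(vectors)
```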
B. Machine Learning Classifier

Several studies have shown that no single ML algorithm performs best on all kinds of data; comparing various ML algorithms is therefore necessary to discover which performs best on a given dataset. Hence, we applied to our dataset the eight ML algorithms discussed in Section III.E. The experimental results showed that the SVM and AdaBoost classifiers achieved the best performance. A likely reason is that SVM separates the classes with a maximum-margin hyperplane, so its performance does not depend directly on the number of features, making it well suited to high-dimensional feature spaces [7, 15]. In addition, SVM can handle non-linear as well as linear data through its kernel functions. Possible reasons for the strong performance of AdaBoost are that it learns classification rules iteratively through an adaptive algorithm [39] and that it focuses on reducing the training error. The results obtained with the RF and LR classifiers were a little lower than those of SVM and AdaBoost but somewhat higher than those of NB, DT, KNN, and MLP. The lower performance of RF might be due to a lack of informative features, which leads to incorrect predictions [40]. The performance of LR might be lower because its decision surface is linear and cannot handle non-linear data adequately [41]. The lowest performance was observed among the NB, DT, MLP, and KNN classifiers. The NB classifier assumes conditional independence among features; its performance is therefore negatively affected as the conditional dependence grows more complicated with an increasing number of features [12]. DT showed lower performance in predicting hate speech because the features in the master feature vector are continuous values, which makes it difficult to find the ideal threshold values required to build a decision tree [42]. The MLP classifier likely performed poorly because there was not enough training data; it is, moreover, a complex "black box" model [43]. KNN had the worst performance: it is a lazy learning algorithm and does not work adequately on noisy data [44]. Hence, KNN is not suitable for detecting hate speech tweets.

C. Classwise Performance

As discussed in Section III.A, we have three classes, named "hate speech", "offensive but not hate speech", and "neither hate speech nor offensive speech". The results show that all features and classifiers performed well for two of the classes (offensive but not hate speech, and neither hate speech nor offensive speech), while all 24 combinations performed worst for the hate speech class. According to Table II, the "hate speech" class has the fewest training instances, but the major reason for its misclassification (as shown in Fig. 4 and Fig. 5) might be that certain bigrams appear with higher frequency in other classes than in the hate speech class. For example, bigrams like "lame nigga, white trash, bitch made" appear more frequently in the "offensive but not hate speech" class than in the "hate speech" class. Hence, the classifiers may have learned weak classification rules for this class.

VII. CONCLUSION

This study employed automated text classification techniques to detect hate speech messages, comparing three feature engineering techniques and eight ML algorithms. The experimental results showed that bigram features, when represented through TFIDF, performed better than the Word2vec and Doc2vec feature engineering techniques. Moreover, the SVM and AdaBoost algorithms showed better results than LR, NB, KNN, DT, RF, and MLP, with the lowest performance observed for KNN. The outcomes of this research hold practical importance because they can serve as a baseline against which future research on automatic text classification methods for hate speech detection can be compared. Furthermore, the study holds scientific value because it reports experimental results using multiple standard performance measures for automatic text classification.

Our work has two important limitations. First, the proposed ML model is not efficient enough for accurate real-time prediction.
Second, it only classifies hate speech messages into three classes and cannot identify the severity of a message. Hence, a future objective is to improve the proposed ML model so that it can also predict the severity of hate speech messages. Moreover, two approaches will be used to improve the proposed model's classification performance. First, lexicon-based techniques will be explored and assessed by comparison with current state-of-the-art results. Second, more data instances will be collected so that the classification rules can be learned more effectively.

REFERENCES

[1] Hern, A., Facebook, YouTube, Twitter, and Microsoft sign the EU hate speech code. The Guardian, 2016.
[2] Rosa, J. and Y. Bonilla, Deprovincializing Trump, decolonizing diversity, and unsettling anthropology. American Ethnologist, 2017. 44(2): p. 201-208.
[3] Travis, A., Anti-Muslim hate crime surges after Manchester and London Bridge attacks. The Guardian, 2017.
[4] MacAvaney, S., et al., Hate speech detection: Challenges and solutions. PloS one, 2019. 14(8): p. e0221152.
[5] Fortuna, P. and S. Nunes, A survey on automatic detection of hate speech in text. ACM Computing Surveys (CSUR), 2018. 51(4): p. 85.
[6] Mujtaba, G., et al., Prediction of cause of death from forensic autopsy reports using text classification techniques: A comparative study. Journal of Forensic and Legal Medicine, 2018. 57: p. 41-50.
[7] Cavnar, W.B. and J.M. Trenkle, N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval. 1994.
[8] Ramos, J., Using tf-idf to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning. 2003. Piscataway, NJ.
[9] Mikolov, T., et al., Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 2013.
[10] Le, Q. and T. Mikolov, Distributed representations of sentences and documents. In International Conference on Machine Learning. 2014.
[11] Kotsiantis, S.B., I.D. Zaharakis, and P.E. Pintelas, Machine learning: a review of classification and combining techniques. Artificial Intelligence Review, 2006. 26(3): p. 159-190.
[12] Lewis, D.D., Naive (Bayes) at forty: The independence assumption in information retrieval. In European Conference on Machine Learning. 1998. Springer.
[13] Xu, B., et al., An improved random forest classifier for text categorization. JCP, 2012. 7(12): p. 2913-2920.
[14] Joachims, T., Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning. 1998. Springer.
[15] Zhang, M.-L. and Z.-H. Zhou, A k-nearest neighbor based algorithm for multi-label classification. GrC, 2005. 5: p. 718-721.
[16] Abacha, A.B., et al., Text mining for pharmacovigilance: Using machine learning for drug name recognition and drug-drug interaction extraction and classification. Journal of Biomedical Informatics, 2015. 58: p. 122-132.
[17] Ying, C., et al., Advance and prospects of AdaBoost algorithm. Acta Automatica Sinica, 2013. 39(6): p. 745-758.
[18] Gardner, M.W. and S. Dorling, Artificial neural networks (the multilayer perceptron): a review of applications in the atmospheric sciences. Atmospheric Environment, 1998. 32(14-15): p. 2627-2636.
[19] Wenando, F.A., T.B. Adji, and I. Ardiyanto, Text classification to detect student level of understanding in prior knowledge activation process. Advanced Science Letters, 2017. 23(3): p. 2285-2287.
[20] Burnap, P. and M.L. Williams, Us and them: identifying cyber hate on Twitter across multiple protected characteristics. EPJ Data Science, 2016. 5(1): p. 11.
[21] Gitari, N.D., et al., A lexicon-based approach for hate speech detection. International Journal of Multimedia and Ubiquitous Engineering, 2015. 10(4): p. 215-230.
[22] Tulkens, S., et al., A dictionary-based approach to racism detection in Dutch social media. arXiv preprint arXiv:1608.08738, 2016.
[23] Greevy, E. and A.F. Smeaton, Classifying racist texts using a support vector machine. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2004. ACM.
[24] Kwok, I. and Y. Wang, Locate the hate: Detecting tweets against blacks. In Twenty-Seventh AAAI Conference on Artificial Intelligence. 2013.
[25] Sharma, S., S. Agrawal, and M. Shrivastava, Degree based classification of harmful speech using twitter data. arXiv preprint arXiv:1806.04197, 2018.
[26] Malmasi, S. and M. Zampieri, Detecting hate speech in social media. arXiv preprint arXiv:1712.06427, 2017.
[27] Nobata, C., et al., Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web. 2016. International World Wide Web Conferences Steering Committee.
[28] Waseem, Z. and D. Hovy, Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop. 2016.
[29] Dinakar, K., R. Reichart, and H. Lieberman, Modeling the detection of textual cyberbullying. In Fifth International AAAI Conference on Weblogs and Social Media. 2011.
[30] Liu, S. and T. Forss, Combining n-gram based similarity analysis with sentiment analysis in web content classification. In KDIR. 2014.
[31] Köffer, S., et al., Discussing the value of automatic hate speech detection in online debates. Multikonferenz Wirtschaftsinformatik (MKWI 2018): Data Driven X-Turning Data in Value, Leuphana, Germany, 2018.
[32] Chen, Y., Detecting offensive language in social medias for protection of adolescent online safety. 2011.
[33] Shaikh, S. and S.M. Doudpotta, Aspects based opinion mining for teacher and course evaluation. Sukkur IBA Journal of Computing and Mathematical Sciences, 2019. 3(1): p. 34-43.
[34] Ho, Y.-C. and D.L. Pepyne, Simple explanation of the no-free-lunch theorem and its implications. Journal of Optimization Theory and Applications, 2002. 115(3): p. 549-570.
[35] Seliya, N., T.M. Khoshgoftaar, and J. Van Hulse, A study on the relationships of classifier performance metrics. In 2009 21st IEEE International Conference on Tools with Artificial Intelligence. 2009. IEEE.
[36] Chaudhari, U.V. and M. Picheny, Matching criteria for vocabulary-independent search. IEEE Transactions on Audio, Speech, and Language Processing, 2012. 20(5): p. 1633-1643.
[37] Li, Y. and T. Yang, Word embedding for understanding natural language: a survey. In Guide to Big Data Applications. 2018, Springer. p. 83-104.
[38] Wang, Y., et al., Comparisons and selections of features and classifiers for short text classification. In IOP Conference Series: Materials Science and Engineering. 2017. IOP Publishing.
[39] Schapire, R.E., The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification. 2003, Springer. p. 149-171.
[40] Xu, B., Y. Ye, and L. Nie, An improved random forest classifier for image classification. In 2012 IEEE International Conference on Information and Automation. 2012. IEEE.
[41] Eftekhar, B., et al., Comparison of artificial neural network and logistic regression models for prediction of mortality in head trauma based on initial clinical data. BMC Medical Informatics and Decision Making, 2005. 5(1): p. 3.
[42] Dreiseitl, S., et al., A comparison of machine learning methods for the diagnosis of pigmented skin lesions. Journal of Biomedical Informatics, 2001. 34(1): p. 28-36.
[43] Singh, P.K. and M.S. Husain, Methodological study of opinion mining and sentiment analysis techniques. International Journal on Soft Computing, 2014. 5(1): p. 11.
[44] Bhatia, N., Survey of nearest neighbor techniques. arXiv preprint arXiv:1007.0085, 2010.
[45] Sigurbergsson, G.I. and L. Derczynski, Offensive language and hate speech detection for Danish. arXiv preprint arXiv:1908.04531, 2019.
[46] Schmidt, A. and M. Wiegand, A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media. 2017. p. 1-10.